GRU and LDA Based Group Chat Topic Mining

doi:10.3969/j.issn.1006-2475.2018.12.014

Computer and Modernization, 2018, Vol. 0, Issue (12): 72-.



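About the authors: TANG Kun (b. 1979), male, M.S., senior engineer at Wuhan Research Institute of Posts and Telecommunications and Fiber Home Starry Sky Co. Ltd.; research interest: big data analysis. CHEN Si-si (b. 1993), female, from Xiaogan, Hubei, master's student at Wuhan Research Institute of Posts and Telecommunications and Fiber Home World Communication Technology Co. Ltd.; research interests: machine learning and data mining.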

GRU and LDA Based Group Chat Topic Mining

(1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430074, China;
2. Fiber Home Starry Sky Co. Ltd., Nanjing 210000, China;
3. Fiber Home World Communication Technology Co. Ltd., Nanjing 210019, China)

Received: 2018-07-11 Online: 2019-01-03 Published: 2019-01-04

Abstract

Abstract: With the rapid development of social networks, instant messaging systems have become an essential communication tool in daily life. Online group chat lets people quickly exchange information about life, technology, and work, but because group chat messages are updated so quickly, the sheer volume of messages makes it difficult to keep up with the topics under discussion. Traditional topic mining models are not well suited to group chat text. By analyzing the characteristics of group chat text, a GRU and LDA Based Group Chat Topic Mining (GLB-GCTM) model is proposed, which addresses the word-order problem that traditional topic models cannot handle. First, each document is assumed to have a topic vector drawn from a Gaussian distribution. The hidden state of each word is then generated by a GRU, and a Bernoulli distribution over the current word's hidden state determines whether that word is a stop word, which in turn decides which language model is used. The method is validated on a dataset of the most recent three months of chat messages from ten QQ groups that the authors had joined. Judged by the evaluation criteria of the comparative experiments, the model effectively identifies the topics in group chat text.

Key words: topic mining, group chat text, deep learning, GRU, LDA
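The generative process outlined in the abstract can be illustrated with a short NumPy sketch. This is only one plausible reading, not the authors' implementation: the abstract does not give the parameterization, so the sigmoid link for the stop-word indicator, the additive topic bias on the word logits, and every name and shape below (`init_params`, `gamma`, `V`, `B`, and so on) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(x):
    e = np.exp(x - np.max(x))  # shift for numerical stability
    return e / e.sum()


def gru_step(x, h, p):
    """One standard GRU update: update gate z, reset gate r, candidate state."""
    z = sigmoid(p["Wz"] @ x + p["Uz"] @ h)
    r = sigmoid(p["Wr"] @ x + p["Ur"] @ h)
    h_tilde = np.tanh(p["Wh"] @ x + p["Uh"] @ (r * h))
    return (1.0 - z) * h + z * h_tilde


def init_params(vocab, embed, hidden, topics, rng):
    """Random toy parameters; all names and shapes here are hypothetical."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    return {
        "E": mat(vocab, embed),                       # word embeddings
        "Wz": mat(hidden, embed), "Uz": mat(hidden, hidden),
        "Wr": mat(hidden, embed), "Ur": mat(hidden, hidden),
        "Wh": mat(hidden, embed), "Uh": mat(hidden, hidden),
        "gamma": rng.normal(scale=0.1, size=hidden),  # stop-word classifier (assumed)
        "V": mat(vocab, hidden),                      # sequence language model weights
        "B": mat(vocab, topics),                      # topic-to-word weights (assumed)
    }


def generate(p, length, rng, start_word=0):
    """Sample one toy document: Gaussian topic vector, GRU states, Bernoulli switch."""
    topics = p["B"].shape[1]
    hidden = p["Uz"].shape[0]
    theta = rng.normal(size=topics)       # document-level Gaussian topic vector
    h = np.zeros(hidden)
    word, words, stop_flags = start_word, [], []
    for _ in range(length):
        h = gru_step(p["E"][word], h, p)  # hidden state after the current word
        # Bernoulli draw from the hidden state: is the emitted word a stop word?
        is_stop = bool(rng.random() < sigmoid(p["gamma"] @ h))
        logits = p["V"] @ h               # word-order (GRU) language model
        if not is_stop:
            logits = logits + p["B"] @ theta  # topic bias for content words only
        word = int(rng.choice(len(logits), p=softmax(logits)))
        words.append(word)
        stop_flags.append(is_stop)
    return theta, words, stop_flags


rng = np.random.default_rng(0)
params = init_params(vocab=20, embed=8, hidden=16, topics=5, rng=rng)
theta, words, stop_flags = generate(params, length=12, rng=rng)
print(words)
print(stop_flags)
```

In this sketch the stop-word switch is what separates the two language models: stop words are predicted from the GRU state alone, which carries word order, while content words also receive a contribution from the document's topic vector. Training such a model (for example by variational inference over the topic vector) is outside the scope of the sketch.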

CLC number: TP274

汤鲲1,2，陈思思1,3. 基于GRU+LDA的群聊主题挖掘[J]. 计算机与现代化, 2018, 0(12): 72-.

TANG Kun1,2, CHEN Si-si1,3. GRU and LDA Based Group Chat Topic Mining[J]. Computer and Modernization, 2018, 0(12): 72-.

