An Improved LDA Algorithm Based on Graph Mining

Computer and Modernization, 2022, Vol. 0, Issue (07): 61-66.


(College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China)

Published: 2022-07-25  Online: 2022-07-25
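About the authors: LI Shan (197-), female, from Nanjing, Jiangsu, associate professor, Ph.D.; research interests: business intelligence, big data analytics, service science, enterprise informatization; E-mail: lishan@nuaa.edu.cn. CHEN Miao-miao (1996-), female, from Hangzhou, Zhejiang, master's student; research interests: text mining, data analysis; E-mail: katiechan@nuaa.edu.cn. ZHENG Chen (1997-), male, from Huaihua, Hunan, master's student; research interests: Internet finance, business data analysis; E-mail: 916889591@qq.com.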
Supported by:
Fundamental Research Funds for the Central Universities (NJ2019023)


Abstract

Abstract: LDA is one of the most widely used models for text topic recognition, but its bag-of-words assumption simply assigns every word the same weight. As a result, the topic distribution tends to skew toward high-frequency words, which harms the semantic coherence of the recognized topics. To address this problem, this paper proposes GoW-LDA, an improved LDA algorithm based on graph mining. It first builds a semantic graph model from the order in which feature word pairs co-occur in the text, then uses the weighted degree of nodes, a network statistical feature, to incorporate the semantic structure and relatedness of the text into LDA topic modeling in the form of a weight correction. Experimental results show that, compared with traditional LDA and TF-IDF-based LDA, GoW-LDA greatly reduces the perplexity of the topic model, improves the pointwise mutual information (PMI) score of the recognized topics, and effectively shortens the model's training time, offering a new approach to text topic recognition.

Key words: text topic recognition; graph mining; latent Dirichlet allocation (LDA)
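
The graph-building and weighting steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives neither the co-occurrence window, the edge-weight definition, nor the exact form of the weight correction, so the `span` parameter, the mean-normalization in `term_weights`, the default weight of 1.0 for unseen words, and all function names below are assumptions made for illustration only.

```python
from collections import Counter, defaultdict


def build_graph_of_words(docs, span=1):
    """Build a directed co-occurrence graph from tokenized documents.

    An edge (u, v) counts how often v appears within `span` tokens
    after u, so the edge direction records which word came first.
    The window size `span` is an assumption, not taken from the paper.
    """
    edges = defaultdict(int)
    for tokens in docs:
        for i, u in enumerate(tokens):
            for v in tokens[i + 1 : i + 1 + span]:
                if u != v:  # skip self-loops
                    edges[(u, v)] += 1
    return dict(edges)


def weighted_degree(edges):
    """Weighted degree of each node: total weight of its in- and out-edges."""
    strength = defaultdict(int)
    for (u, v), w in edges.items():
        strength[u] += w
        strength[v] += w
    return dict(strength)


def term_weights(strength):
    """Hypothetical normalization: rescale so the average weight is 1.0."""
    mean = sum(strength.values()) / len(strength)
    return {term: s / mean for term, s in strength.items()}


def weighted_counts(tokens, weights):
    """Replace raw term counts with weight-corrected counts.

    Words missing from the graph keep a neutral weight of 1.0 (assumption).
    """
    counts = Counter(tokens)
    return {t: c * weights.get(t, 1.0) for t, c in counts.items()}


if __name__ == "__main__":
    docs = [
        ["graph", "mining", "topic", "model"],
        ["topic", "model", "graph"],
    ]
    edges = build_graph_of_words(docs, span=1)
    weights = term_weights(weighted_degree(edges))
    for doc in docs:
        print(weighted_counts(doc, weights))
```

Under these assumptions, the weight-corrected counts could stand in for raw term counts wherever an LDA trainer accumulates word statistics. How GoW-LDA actually applies the correction is not stated in the abstract.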

LI Shan, CHEN Miao-miao, ZHENG Chen. An Improved LDA Algorithm Based on Graph Mining[J]. Computer and Modernization, 2022, 0(07): 61-66.
