Computer and Modernization, 2022, Vol. 0, Issue (07): 61-66.

• Algorithm Design and Analysis •

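An Improved LDA Algorithm Based on Graph Mining

  1. College of Economics and Management, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, Jiangsu, China
  • Online: 2022-07-25 Published: 2022-07-25
  • About the authors: LI Shan (李珊, 197-), female, from Nanjing, Jiangsu, associate professor, Ph.D.; research interests: business intelligence, big data analysis, service science, enterprise informatization; E-mail: lishan@nuaa.edu.cn. CHEN Miaomiao (陈妙苗, 1996-), female, from Hangzhou, Zhejiang, master's student; research interests: text mining, data analysis; E-mail: katiechan@nuaa.edu.cn. ZHENG Chen (郑晨, 1997-), male, from Huaihua, Hunan, master's student; research interests: Internet finance, business data analysis; E-mail: 916889591@qq.com.
  • Supported by:
    Fundamental Research Funds for the Central Universities (NJ2019023)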

Abstract: LDA is one of the most widely used models for text topic recognition, but its bag-of-words assumption simplistically assigns every word the same weight. As a result, topic distributions tend to skew toward high-frequency words, which harms the semantic coherence of the recognized topics. To address this problem, this paper proposes GoW-LDA, an improved LDA algorithm based on graph mining. It first builds a semantic graph model from the order in which pairs of feature words co-occur in the text. It then uses the weighted degree of each node, a network statistical feature, to integrate the semantic structure and relatedness of the text into LDA topic modeling in the form of a weight correction. Experimental results show that, compared with traditional LDA and TF-IDF-based LDA, GoW-LDA greatly reduces the perplexity of the topic model, improves the mutual information index (PMI) of the recognized topics, and effectively shortens training time, offering a new approach to text topic recognition.

Key words: text topic recognition, graph mining, LDA (Latent Dirichlet Allocation)
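
The graph-based weighting described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the co-occurrence window size, the edge-weighting rule, the normalisation, or how the weights enter LDA, so the `window` parameter, the count-based edge weights, the mean-scaling step, and all function names below are assumptions made for illustration only.

```python
from collections import defaultdict


def build_graph_of_words(tokens, window=2):
    """Build a directed co-occurrence graph from an ordered token list.

    Edge (u, v) counts how often v appears within `window` positions
    after u. The window size is an assumed parameter.
    """
    edges = defaultdict(int)
    for i, u in enumerate(tokens):
        for v in tokens[i + 1 : i + 1 + window]:
            if u != v:  # skip self-loops
                edges[(u, v)] += 1
    return dict(edges)


def weighted_degree(edges):
    """Sum the weights of all incoming and outgoing edges for each node."""
    degree = defaultdict(int)
    for (u, v), w in edges.items():
        degree[u] += w
        degree[v] += w
    return dict(degree)


def term_weights(degree):
    """Scale weighted degrees so the average term weight is 1.0.

    This normalisation is an assumption; it only makes the weights
    comparable to the uniform weight used by plain bag-of-words LDA.
    """
    mean = sum(degree.values()) / len(degree)
    return {word: d / mean for word, d in degree.items()}


# Toy document, already tokenised and filtered to feature words.
tokens = ["graph", "mining", "improves", "topic", "model", "graph", "mining"]
edges = build_graph_of_words(tokens, window=2)
degree = weighted_degree(edges)
weights = term_weights(degree)

for word, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{word}: degree={degree[word]}, weight={weight:.3f}")
```

One plausible use of such weights, again an assumption rather than the paper's stated procedure, is to rescale each term's count in the document-term matrix before LDA training, so that structurally central words count for more than words that are merely frequent.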