Computer and Modernization

• Artificial Intelligence

CNN Text Classification Based on Topic Model Word Vectors

  

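  1. (College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, Shanxi, China)
  • Received: 2019-05-09 Online: 2019-10-28 Published: 2019-10-29
  • About the author: NIU Xueying (牛雪莹, born 1989), female, from Lingshi, Shanxi; teaching assistant, CCF member, M.S.; research interests: text classification, deep learning, and Chinese information processing. E-mail: 943863079@qq.com.
  • Funding:
    Shanxi Province Applied Basic Research Project (201801D221179); Ministry of Education Industry-University Cooperation Collaborative Education Project (201801128011)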



Abstract: Mining information from Weibo text is important for applications such as automatic question answering and public opinion analysis, and text classification is the foundation of text mining. This paper proposes feeding the Word2vec and LDA (Latent Dirichlet Allocation) representations of a text into a convolutional neural network simultaneously for high-level semantic feature abstraction and classification learning, so that the input word vectors capture both the semantic relations between words and the topic of the text. First, word vectors are learned from the corpus with the Word2vec model and the LDA model separately. Next, the word vectors from each model are concatenated to form that model's text matrix. Finally, the two text matrices are fed into the convolutional neural network as two channels for classification training. Experiments on Weibo data verify the effectiveness of the method.
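
The three steps in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the lookup tables are random stand-ins for vectors that would be trained with Word2vec and LDA, both channels are assumed to share the same vector dimension, and every function name, the single filter width, and the toy sizes are hypothetical.

```python
import numpy as np

def text_matrix(tokens, table, max_len):
    """Stack one vector per token into a (max_len, dim) matrix.

    Texts are truncated or zero-padded to max_len; tokens missing
    from the table are left as zero rows (an assumption of this sketch).
    """
    dim = len(next(iter(table.values())))
    mat = np.zeros((max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        if tok in table:
            mat[i] = table[tok]
    return mat

def two_channel_input(tokens, w2v_table, lda_table, max_len):
    """Build the two-channel input of shape (2, max_len, dim)."""
    return np.stack([
        text_matrix(tokens, w2v_table, max_len),  # channel 0: Word2vec-style vectors
        text_matrix(tokens, lda_table, max_len),  # channel 1: LDA-style vectors
    ])

def conv_max_pool(x, filters, bias):
    """Convolve over both channels, apply ReLU, then max-pool over positions.

    x: (2, length, dim); filters: (n_filters, 2, height, dim).
    Returns one pooled feature per filter.
    """
    height = filters.shape[2]
    length = x.shape[1]
    # All sliding windows: (n_windows, 2, height, dim)
    windows = np.stack([x[:, i:i + height, :] for i in range(length - height + 1)])
    feature_maps = np.einsum("wchd,fchd->fw", windows, filters) + bias[:, None]
    return np.maximum(feature_maps, 0.0).max(axis=1)

def classify(x, filters, bias, weights, offsets):
    """Pooled features -> linear layer -> softmax class probabilities."""
    logits = weights @ conv_max_pool(x, filters, bias) + offsets
    exp = np.exp(logits - logits.max())  # subtract max for numerical stability
    return exp / exp.sum()

# Toy data: random stand-ins for trained vectors and untrained CNN weights.
rng = np.random.default_rng(0)
vocab = ["weibo", "text", "topic", "model"]
dim = 4
w2v_table = {w: rng.normal(size=dim) for w in vocab}
lda_table = {w: rng.dirichlet(np.ones(dim)) for w in vocab}

x = two_channel_input(["weibo", "topic", "unknown"], w2v_table, lda_table, max_len=5)
filters = rng.normal(size=(3, 2, 2, dim))  # 3 filters spanning 2 words, 2 channels
probs = classify(x, filters, np.zeros(3), rng.normal(size=(2, 3)), np.zeros(2))
```

Here `x` has shape `(2, 5, 4)` and `probs` is a probability vector over two hypothetical classes. In practice the filters and the output layer would be trained by backpropagation, typically with several filter widths; the sketch only shows how the two text matrices enter the network as separate channels.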

Key words: Word2vec, LDA, text classification, convolutional neural network

CLC number: