CNN Text Classification Based on Topic Model Word Vectors

doi:10.3969/j.issn.1006-2475.2019.10.002

Computer and Modernization ›› 2019, Vol. 0 ›› Issue (10): 7-. doi: 10.3969/j.issn.1006-2475.2019.10.002

(College of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)

Received: 2019-05-09 Online: 2019-10-28 Published: 2019-10-29
About the author: NIU Xue-ying (b. 1989), female, from Lingshi, Shanxi; teaching assistant, CCF member, master's degree. Research interests: text classification, deep learning, Chinese information processing. E-mail: 943863079@qq.com.
Funding:
Shanxi Province Applied Basic Research Project (201801D221179); Ministry of Education Industry-University Cooperative Education Project (201801128011)

Abstract

Mining information from Weibo text is important for applications such as automatic question answering and public opinion analysis, and text classification is the foundation of text data mining. This paper proposes feeding the text representations produced by Word2vec and LDA (Latent Dirichlet Allocation) into a convolutional neural network at the same time for high-level semantic feature abstraction and classification learning, so that the input word vectors capture both the semantic relations between words and the topic of the text. First, word vectors are learned from the corpus with the Word2vec model and the LDA model separately. Next, the word vectors from each model are concatenated to form that model's text matrix representation. Finally, the two text matrices are fed into the convolutional neural network as two channels for classification training. Experiments on Weibo data verify the effectiveness of the method.

Keywords: Word2vec, LDA, text classification, convolutional neural network
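
The pipeline summarized in the abstract (two word-vector sources, one text matrix per source, both matrices stacked as CNN input channels) can be illustrated with a small NumPy sketch. This is not the author's implementation. Several things here are assumptions made only for illustration: the function names `text_to_channels` and `cnn_forward`, the toy vocabulary, the random untrained weights, the choice of a word's topic distribution as its LDA vector, and the requirement that both vector types share the same dimension. The abstract specifies none of these details.

```python
import numpy as np

def text_to_channels(tokens, w2v, topic_vecs, max_len):
    """Stack per-word vectors into a (2, max_len, dim) array, zero-padded.

    Assumes both lookup tables map a word to a vector of the same length.
    Unknown words and padding positions stay all-zero.
    """
    dim = next(iter(w2v.values())).shape[0]
    x = np.zeros((2, max_len, dim))
    for i, tok in enumerate(tokens[:max_len]):
        x[0, i] = w2v.get(tok, 0.0)         # channel 0: Word2vec-style vector
        x[1, i] = topic_vecs.get(tok, 0.0)  # channel 1: LDA-style topic vector
    return x

def cnn_forward(x, filters, bias, w_out, b_out):
    """One convolution layer, max-over-time pooling, and a softmax classifier.

    x: (2, L, d) two-channel text matrix
    filters: (F, 2, h, d), each filter spans h consecutive words in both channels
    """
    n_filters, _, h, _ = filters.shape
    length = x.shape[1]
    feats = np.empty(n_filters)
    for f in range(n_filters):
        acts = [np.sum(x[:, t:t + h, :] * filters[f]) + bias[f]
                for t in range(length - h + 1)]
        # ReLU is monotonic, so ReLU followed by max pooling equals max(0, max(acts)).
        feats[f] = max(0.0, max(acts))
    logits = w_out @ feats + b_out
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

# Toy data: random stand-ins for trained Word2vec and LDA outputs (hypothetical).
rng = np.random.default_rng(0)
vocab = ["weibo", "topic", "model", "text"]
dim, max_len, n_filters, h, n_classes = 4, 5, 3, 2, 2
w2v = {w: rng.normal(size=dim) for w in vocab}
topic_vecs = {w: rng.dirichlet(np.ones(dim)) for w in vocab}

x = text_to_channels(["weibo", "text", "unknown"], w2v, topic_vecs, max_len)
probs = cnn_forward(
    x,
    rng.normal(size=(n_filters, 2, h, dim)),
    np.zeros(n_filters),
    rng.normal(size=(n_classes, n_filters)),
    np.zeros(n_classes),
)
print(x.shape, probs)
```

In a real setting the two lookup tables would come from models trained on the corpus, and the filter and classifier weights would be learned by backpropagation rather than drawn at random.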

CLC number: TP391

Cite this article: NIU Xue-ying. CNN Text Classification Based on Topic Model Word Vectors[J]. Computer and Modernization, 2019, 0(10): 7-.
