Chinese Short Text Classification Based on Sentence-LDA Topic Model

doi:10.3969/j.issn.1006-2475.2019.03.019

Computer and Modernization, 2019, Vol. 0, Issue (03): 102-. doi: 10.3969/j.issn.1006-2475.2019.03.019

(1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430000, China; 2. Nanjing Fiberhome World Communication Technology Co. Ltd., Nanjing 210000, China)

Received: 2018-09-05 Online: 2019-04-08 Published: 2019-04-10
About the authors: ZHANG Hao (1995-), male, born in Xi'an, Shaanxi, master's student; research interests: natural language processing and data mining; E-mail: 970225985@qq.com. ZHONG Min (1993-), female, born in Xiaogan, Hubei, master's student; research interests: data mining and natural language processing.

Abstract

Short texts have sparse features and depend strongly on context, so traditional long-text classification techniques cannot be applied to them directly and effectively. To address the feature sparsity of short texts, this paper proposes a short text classification method that expands features using the Sentence-LDA topic model. This topic model is an extension of Latent Dirichlet Allocation (LDA) that assumes each sentence produces only one topic distribution. A trained Sentence-LDA model predicts the topic distribution of each original short text, and the resulting topic words are added to the text's original features, completing the feature expansion. A Support Vector Machine (SVM) then classifies the expanded short texts. Experiments show that, compared with the traditional approach of representing short texts directly with the Vector Space Model (VSM), the proposed method effectively improves the accuracy of short text classification.

Key words: short text classification, Sentence-LDA, topic model, feature extension, SVM
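The feature expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-word table `PHI`, the smoothing constant, the `top_k` and `min_prob` parameters, and both function names are hypothetical, and the sketch reads the model's assumption as every word of a short text sharing a single topic under a uniform topic prior. Training the Sentence-LDA model and the final SVM classification are left out.

```python
import math

# Hypothetical toy topic-word distributions standing in for a trained
# Sentence-LDA model; real values would be estimated from a corpus.
PHI = {
    0: {"match": 0.30, "team": 0.25, "goal": 0.20, "coach": 0.15, "league": 0.10},
    1: {"stock": 0.30, "market": 0.25, "fund": 0.20, "bank": 0.15, "rate": 0.10},
}
SMOOTHING = 1e-6  # assumed floor probability for words a topic has not seen


def infer_topic_distribution(tokens, phi=PHI):
    """Posterior over topics for one short text treated as a single sentence.

    With one topic per sentence and a uniform topic prior, p(z | tokens) is
    proportional to the product of phi[z][w] over the tokens.
    """
    log_scores = {
        z: sum(math.log(words.get(w, SMOOTHING)) for w in tokens)
        for z, words in phi.items()
    }
    peak = max(log_scores.values())  # subtract the max for numerical stability
    weights = {z: math.exp(s - peak) for z, s in log_scores.items()}
    total = sum(weights.values())
    return {z: w / total for z, w in weights.items()}


def expand_features(tokens, phi=PHI, top_k=3, min_prob=0.6):
    """Append the top_k words of the most probable topic to the token list."""
    theta = infer_topic_distribution(tokens, phi)
    best_topic, best_prob = max(theta.items(), key=lambda item: item[1])
    if best_prob < min_prob:  # no confident topic: leave the text unchanged
        return list(tokens)
    ranked = sorted(phi[best_topic].items(), key=lambda item: -item[1])
    extra = [w for w, _ in ranked if w not in tokens][:top_k]
    return list(tokens) + extra


expanded = expand_features(["team", "goal"])
```

In a full pipeline, the expanded token lists would be converted to weighted term vectors and passed to an SVM classifier; the `min_prob` threshold is one possible way to avoid adding topic words when no topic clearly dominates.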

CLC number: TP391.1

ZHANG Hao, ZHONG Min. Chinese Short Text Classification Based on Sentence-LDA Topic Model[J]. Computer and Modernization, 2019, 0(03): 102-.


