Chinese Short Text Classification Based on Sentence-LDA Topic Model

doi:10.3969/j.issn.1006-2475.2019.03.019

Abstract

Short texts have sparse features and depend strongly on context, so classification techniques designed for long texts cannot be applied to them directly. To address the feature sparseness of short texts, a short text classification method based on the Sentence-LDA topic model is proposed. Sentence-LDA is an extension of the LDA (Latent Dirichlet Allocation) model that assumes a sentence produces only one topic distribution. The trained Sentence-LDA model is used to predict the topic distribution of each original short text, and the resulting topic words are added to the original short text features, completing the feature expansion. An SVM (Support Vector Machine) is then used to classify the expanded short texts. Experiments show that, compared with the traditional method of representing short texts directly with the VSM (Vector Space Model), the proposed method effectively improves the accuracy of short text classification.
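The feature expansion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the topic-word table `PHI`, the smoothing constant `SMOOTH`, the uniform topic prior, the `top_k` cutoff, and the function names are all assumptions made for illustration, and a real system would take the topic-word probabilities from a trained Sentence-LDA model. The sketch treats a short text as one sentence whose words all share a single topic, picks the most probable topic, and appends that topic's highest-probability words to the token list.

```python
import math

# Hypothetical topic-word probabilities phi[topic][word]. In practice these
# would come from a trained Sentence-LDA model, not be written by hand.
PHI = {
    "sports": {"match": 0.30, "team": 0.30, "goal": 0.25, "coach": 0.10, "price": 0.05},
    "finance": {"stock": 0.30, "price": 0.30, "market": 0.25, "bank": 0.10, "team": 0.05},
}
SMOOTH = 1e-6  # assumed floor probability for words a topic has never seen


def topic_posterior(tokens, phi=PHI):
    """Posterior over topics when all tokens share one topic (uniform prior assumed)."""
    log_scores = {
        topic: sum(math.log(words.get(tok, SMOOTH)) for tok in tokens)
        for topic, words in phi.items()
    }
    peak = max(log_scores.values())  # subtract the peak for numerical stability
    unnorm = {t: math.exp(s - peak) for t, s in log_scores.items()}
    total = sum(unnorm.values())
    return {t: v / total for t, v in unnorm.items()}


def expand_features(tokens, top_k=2, phi=PHI):
    """Append the top_k most probable words of the best topic not already in tokens."""
    posterior = topic_posterior(tokens, phi)
    best = max(posterior, key=posterior.get)
    ranked = sorted(phi[best], key=phi[best].get, reverse=True)
    extra = [w for w in ranked if w not in tokens][:top_k]
    return tokens + extra


expanded = expand_features(["stock", "price"])
print(expanded)
```

The expanded token list would then be vectorized (for example with TF-IDF weights in a vector space model) and passed to an SVM classifier; that step is omitted here.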

Key words: short text classification, Sentence-LDA, topic model, feature extension, SVM

CLC Number: TP391.1

ZHANG Hao, ZHONG Min. Chinese Short Text Classification Based on Sentence-LDA Topic Model[J]. Computer and Modernization, 2019, 0(03): 102-.

