Computer and Modernization

• Chinese Information Processing Technology

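  • About the authors: ZHANG Hao (张浩, born 1995), male, from Xi'an, Shaanxi, master's student; research interests: natural language processing and data mining; E-mail: 970225985@qq.com. ZHONG Min (钟敏, born 1993), female, from Xiaogan, Hubei, master's student; research interests: data mining and natural language processing.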

Chinese Short Text Classification Based on Sentence-LDA Topic Model

  (1. Wuhan Research Institute of Posts and Telecommunications, Wuhan 430000, China; 2. Nanjing Fiberhome World Communication Technology Co. Ltd., Nanjing 210000, China)
  • Received: 2018-09-05 Online: 2019-04-08 Published: 2019-04-10

Abstract: Short texts have sparse features and depend strongly on context, so traditional long-text classification techniques cannot be applied to them directly and effectively. To address the feature sparseness of short texts, this paper proposes a short text classification method that expands features using the Sentence-LDA topic model. This topic model is an extension of Latent Dirichlet Allocation (LDA) that assumes each sentence produces only one topic distribution. A trained Sentence-LDA model is used to predict the topic distribution of each original short text, and the resulting topic words are added to the original features of that text, which completes the feature expansion. A Support Vector Machine (SVM) then classifies the expanded short texts. Experiments show that, compared with the traditional approach of representing short texts directly with the Vector Space Model (VSM), the proposed method can effectively improve the accuracy of short text classification.
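The feature-expansion step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The topic-word table `TOPIC_WORDS`, the prior `TOPIC_PRIOR`, the constant `SMOOTHING`, and the function names are all hypothetical stand-ins for a trained Sentence-LDA model. The sketch only assumes, as the abstract states, that every word in one short text is drawn from a single topic, so the topic posterior is proportional to the topic prior times the product of the word probabilities under that topic.

```python
import math

# Hypothetical toy topic-word probabilities standing in for a trained
# Sentence-LDA model; a real model would be estimated from a large corpus.
TOPIC_WORDS = {
    "sports": {"match": 0.30, "team": 0.30, "goal": 0.25, "coach": 0.15},
    "finance": {"stock": 0.35, "market": 0.30, "bank": 0.20, "fund": 0.15},
}
TOPIC_PRIOR = {"sports": 0.5, "finance": 0.5}
SMOOTHING = 1e-6  # assumed probability for words a topic has never seen


def topic_posterior(tokens):
    """Return p(topic | tokens), assuming all tokens share one topic."""
    log_scores = {}
    for topic, word_probs in TOPIC_WORDS.items():
        score = math.log(TOPIC_PRIOR[topic])
        for tok in tokens:
            score += math.log(word_probs.get(tok, SMOOTHING))
        log_scores[topic] = score
    # Normalize in log space for numerical stability.
    max_score = max(log_scores.values())
    exp_scores = {t: math.exp(s - max_score) for t, s in log_scores.items()}
    total = sum(exp_scores.values())
    return {t: v / total for t, v in exp_scores.items()}


def expand_features(tokens, top_k=2):
    """Append the top_k words of the most probable topic to the tokens."""
    posterior = topic_posterior(tokens)
    best_topic = max(posterior, key=posterior.get)
    ranked = sorted(TOPIC_WORDS[best_topic].items(), key=lambda kv: -kv[1])
    extra = [w for w, _ in ranked if w not in tokens][:top_k]
    return list(tokens) + extra


expanded = expand_features(["stock", "rises"])
print(expanded)
```

In a full pipeline, the expanded token lists would then be turned into VSM vectors (for example, term-frequency weights) and passed to an SVM classifier. That final step is omitted here to keep the sketch dependency-free.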

Key words: short text classification, Sentence-LDA, topic model, feature extension, SVM

CLC number: