Computer and Modernization ›› 2021, Vol. 0 ›› Issue (06): 6-11.

• Algorithm Design and Analysis •

Short Text Classification Based on Improved CHI and TF-IDF

  1. (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
  • Online: 2021-07-05 Published: 2021-07-05
  • About the authors: DAI Jipeng (b. 1995), male, from Tai'an, Shandong, master's student; research interests: data mining, complex networks; E-mail: djp4992@163.com. Corresponding author: SHAO Fengjing (b. 1955), female, from Qingdao, Shandong, professor, Ph.D.; research interests: complex network theory and applications, data mining; E-mail: sfj@qdu.edu.cn. SUN Rencheng (b. 1977), male, from Qingdao, Shandong, associate professor, Ph.D.; research interests: big data analysis and processing, complex networks; E-mail: qdsunstar@163.com.
  • Funding:
    Supported by the National Natural Science Foundation of China Youth Program (41706198)



Abstract: To improve the classification of short texts with limited data and to effectively reduce the dimensionality of the feature space, this paper addresses the defects of the traditional CHI statistical method and the TF-IDF weight calculation method by proposing a new class-word factor that improves feature selection and, in turn, classification accuracy. The traditional CHI statistic is overly sensitive to low-frequency words, and TF-IDF weighting ignores how feature terms are distributed between and within classes. The class-word factor is therefore introduced into both methods, which are then used in combination to reduce the interference caused by low-frequency words while accounting for the intra-class and inter-class distribution of feature words. The proposed method is applied, with the XGBoost classification algorithm, to topic-classification experiments on small collections of short texts. The experimental results show that, compared with the traditional CHI and TF-IDF methods, feature selection with the class-word factor improves classification accuracy on both balanced and unbalanced corpora and greatly reduces memory usage.

Key words: text classification, feature selection, XGBoost, chi-square statistics, TF-IDF
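The abstract describes the traditional CHI statistic and TF-IDF weighting and a class-word factor that scales them, but does not give the factor's exact formula. The sketch below is a minimal illustration of the general idea only: the standard chi-square feature-selection statistic and classic TF-IDF, scaled by a stand-in factor (the share of a term's document frequency that falls inside the target class) that plays the role the paper assigns to its class-word factor. The function names, the toy corpus, and the specific factor definition are all assumptions for illustration, not the paper's actual method.

```python
import math

def chi_square(A, B, C, D):
    """Traditional CHI statistic for term t and class c, from the
    2x2 contingency table: A = docs in c containing t, B = docs
    outside c containing t, C = docs in c lacking t, D = docs
    outside c lacking t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def class_word_factor(A, B):
    """Illustrative class-word factor: the fraction of the term's
    document frequency that falls inside class c, so terms spread
    evenly across classes (or absent from c) are down-weighted.
    The paper's actual factor may be defined differently."""
    return A / (A + B) if (A + B) else 0.0

def improved_chi(A, B, C, D):
    """CHI statistic scaled by the class-word factor."""
    return chi_square(A, B, C, D) * class_word_factor(A, B)

def tf_idf(tf, df, n_docs):
    """Classic TF-IDF weight; the same factor could scale it too."""
    return tf * math.log(n_docs / df)

# Toy corpus: two classes, four short documents.
docs = [("sports", "ball game win"), ("sports", "ball team"),
        ("tech", "chip code"), ("tech", "code data ball")]

def table(term, cls):
    """Build the (A, B, C, D) contingency counts for term/class."""
    A = sum(1 for c, d in docs if c == cls and term in d.split())
    B = sum(1 for c, d in docs if c != cls and term in d.split())
    C = sum(1 for c, d in docs if c == cls and term not in d.split())
    D = sum(1 for c, d in docs if c != cls and term not in d.split())
    return A, B, C, D

print(improved_chi(*table("ball", "sports")))  # ≈ 0.889
print(improved_chi(*table("code", "sports")))  # 0.0: never occurs in "sports"
```

Note how "code" gets a nonzero raw CHI score for the "sports" class (chi-square rewards any strong association, including strong absence), while the factor zeroes it out because the term never occurs inside that class; this is the kind of low-frequency/distribution correction the abstract attributes to the class-word factor.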