Computer and Modernization ›› 2021, Vol. 0 ›› Issue (06): 6-11.


Short Text Classification Based on Improved CHI and TF-IDF

  

  1. (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
  • Online: 2021-07-05 Published: 2021-07-05

Abstract: To improve the classification of short texts when only a small amount of data is available, and to reduce the dimension of the feature space, this paper addresses the defects of the traditional CHI statistical method and the TF-IDF weight calculation method by proposing a new word class and frequency factor for feature selection, with the aim of raising classification accuracy. The traditional CHI statistical method is sensitive to low-frequency words, and the TF-IDF weight calculation method ignores how feature items are distributed between and within classes. The paper therefore introduces the word class and frequency factor into both methods and uses them in combination, which reduces the interference caused by low-frequency words and takes into account special cases in the distribution of feature words within and between classes. The proposed method is evaluated with the XGBoost classification algorithm in topic classification experiments on short texts with a small amount of data. The experimental results show that, compared with the traditional CHI and TF-IDF methods, feature selection with the word class and frequency factor improves classification accuracy on both balanced and unbalanced corpora and greatly reduces memory usage.

Key words: text classification, feature selection, XGBoost, chi-square statistics, TF-IDF
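
For orientation, the two traditional baselines named in the abstract can be sketched as follows. This is a minimal illustration of the textbook CHI statistic and TF-IDF weight only; it does not implement the paper's word class and frequency factor, whose definition is not given in the abstract. The function names, the toy corpus, and the choice of length-normalized term frequency with an unsmoothed logarithmic IDF are all assumptions made for the example.

```python
import math
from collections import Counter


def chi_square(docs, labels, term, cls):
    """Textbook CHI statistic of `term` for class `cls`.

    Uses document-presence counts only, so how often a term occurs
    inside a document has no effect on the score.
    """
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        present = term in tokens
        if label == cls:
            if present:
                a += 1  # in class, contains term
            else:
                c += 1  # in class, lacks term
        else:
            if present:
                b += 1  # other class, contains term
            else:
                d += 1  # other class, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency).

    The weight is computed over the whole corpus and never looks at
    class labels.
    """
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (cnt / total) * math.log(n / df[t]) for t, cnt in tf.items()}
        )
    return weights


# Hypothetical toy corpus of tokenized short texts (not from the paper).
docs = [
    ["goal", "match", "team"],
    ["team", "win", "goal"],
    ["stock", "market", "rise"],
    ["market", "fall", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

if __name__ == "__main__":
    print(chi_square(docs, labels, "goal", "sport"))
    print(tf_idf(docs)[0])
```

In this sketch the CHI score depends only on whether a term appears in a document, and the TF-IDF weight is computed without reference to class labels. These are the two properties the abstract identifies as weaknesses of the traditional methods.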