Computer and Modernization (计算机与现代化)

• Algorithm Analysis and Design

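Research on Improvement of Feature Weights in Text Classification

  

  1. (School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, Shaanxi, China)
  • Received: 2017-09-29 Online: 2018-03-08 Published: 2018-03-09
  • About the authors: LI Pengpeng (李鹏鹏, b. 1993), male, from Linfen, Shanxi; master's student at the School of Computer Science and Engineering, Xi'an Technological University; research interests: artificial intelligence and its applications. FAN Huimin (范会敏, b. 1960), female, from Fuping, Shaanxi; professor, master's degree; research interests: artificial intelligence, expert systems.
  • Funding:
    Industrial Key Project of the Shaanxi Provincial Department of Science and Technology (2017GY-070)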

Abstract: To overcome the low classification F1 value of the traditional TF-IDF (Term Frequency-Inverse Document Frequency) algorithm, an improved algorithm, TF-IDF-dist, is proposed that uses the distribution of feature words within and between classes. Experimental results show that the improved algorithm raises the F1 value by an average of 3.2% across different feature dimensions and by an average of 2.75% when combined with different feature selection algorithms. It also adapts better to skewed (imbalanced) datasets. These results indicate that the algorithm is effective for text classification.

Key words: machine learning, text classification, feature weights, TF-IDF

CLC number:
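
For readers unfamiliar with the baseline that the abstract refers to, the following is a minimal sketch of one common TF-IDF variant (term frequency normalized by document length, IDF as log(N/df)). The function name `tf_idf` and the toy documents are illustrative assumptions, not taken from the paper. The weights depend only on document counts and ignore class labels, which is the kind of information the abstract says TF-IDF-dist adds. The TF-IDF-dist formula itself is not given on this page and is not reproduced here.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    Other TF-IDF variants (smoothing, sublinear tf) exist.
    """
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical toy corpus: two finance-like documents, one sports-like document.
docs = [
    ["stock", "market", "rises"],
    ["stock", "market", "falls"],
    ["team", "wins", "match"],
]
weights = tf_idf(docs)
```

In this toy corpus, "stock" appears in two of three documents and therefore receives a lower weight than "rises", which appears in only one, even though "stock" is the more useful term for separating the two topics. Class-blind weighting of this kind is the limitation the abstract sets out to address.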