Computer and Modernization (计算机与现代化)

• Artificial Intelligence

An Improved CHI Text Feature Selection Algorithm

  1. College of Electronic Information and Control Engineering, Beijing University of Technology, Beijing 100124, China
  • Received: 2016-04-12 Online: 2016-11-15 Published: 2016-11-23
  • About the authors: FAN Cunjia (b. 1990), female (Mongolian ethnicity), from Ulanqab, Inner Mongolia, master's student at the College of Electronic Information and Control Engineering, Beijing University of Technology, research interest: data mining; WANG Yousheng (b. 1965), male, associate professor, Ph.D., research interests: image processing and data mining; WANG Yuting (b. 1992), female, master's student.

Abstract: Feature selection is an important step in text classification. The CHI statistic is a classical feature selection method, but it has shortcomings, and this paper addresses two of them. First, so that both the document frequency and the word frequency of a feature term are taken into account, a word frequency factor and the variance among classes are introduced into CHI. Second, to exclude feature terms that rarely appear in the specified class but are common in other classes, and to reduce the error caused by choosing a scaling factor manually, an adaptive scaling factor is introduced into CHI. Experimental results show that, compared with the CHI statistic, the improved CHI feature selection method achieves higher classification accuracy on an unbalanced corpus.

Key words: CHI statistics, word frequency factor, variance among classes, adaptive scaling factor
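
For orientation, the baseline that the abstract starts from can be sketched in a few lines of Python. This is only the textbook CHI statistic for a (term, class) pair, computed from a 2x2 table of document counts; it is not the improved method, because the abstract gives no formulas for the word frequency factor, the variance among classes, or the adaptive scaling factor. The function names `chi_square` and `chi_scores` and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Textbook CHI statistic for one (term, class) pair.

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # degenerate table: the term or class carries no information
    return n * (a * d - b * c) ** 2 / denom


def chi_scores(docs, labels, target):
    """Score every term against `target` using document frequencies only."""
    n_pos = sum(1 for y in labels if y == target)
    n_neg = len(labels) - n_pos
    df_pos, df_neg = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # set(tokens) discards how often a term occurs inside a document
        (df_pos if y == target else df_neg).update(set(tokens))
    scores = {}
    for term in set(df_pos) | set(df_neg):
        a, b = df_pos[term], df_neg[term]
        scores[term] = chi_square(a, b, n_pos - a, n_neg - b)
    return scores


# Hypothetical toy corpus: two "sport" documents and two "finance" documents.
docs = [["ball", "goal"], ["ball", "team"], ["stock", "market"], ["stock", "team"]]
labels = ["sport", "sport", "finance", "finance"]
scores = chi_scores(docs, labels, "sport")
```

On this toy corpus, "stock" never occurs in a "sport" document yet receives the same score as "ball", because the squared term (ad - bc)^2 discards the sign of the association. The use of `set(tokens)` likewise shows that in-document word frequency plays no role. These appear to be the two shortcomings the abstract sets out to correct.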

CLC Number: