Computer and Modernization (计算机与现代化)

• Algorithm Design and Analysis

一种改进的类别区分词特征选择算法

  

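  1. (School of Computer and Electronic Information, Guangxi University, Nanning 530004, Guangxi, China)
  • Received: 2018-09-06 Online: 2019-04-08 Published: 2019-04-10
  • About the authors: LI Fuxing (李富星, born 1993), male, from Shiyan, Hubei, master's student, research interest: data mining, E-mail: xingf_lee93@126.com; MENG Zuqiang (蒙祖强, born 1974), male, from Luocheng, Guangxi, professor, Ph.D., research interests: big data intelligence, cross-media mining, knowledge discovery.
  • Funding:
    Supported by the Natural Science Foundation of Guangxi (2015GXNSFAA139292)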

An Improved Feature Selection Algorithm Based on Category Distinguished Words

  1. (School of Computer and Electronic Information, Guangxi University, Nanning 530004, China)
  • Received: 2018-09-06 Online: 2019-04-08 Published: 2019-04-10

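Abstract: The traditional category distinguished words feature selection algorithm uses inter-class dispersion and intra-class importance as its metrics, but it ignores the fact that the two indicators often contribute with different weights to the feature scoring function, which degrades feature selection to some extent. Building on the category distinguished words feature selection algorithm, a balance factor is introduced; tuning it adjusts the contribution weights of the two indicators to the feature evaluation function, yielding more effective feature selection and, in turn, better text classification. With Naive Bayes used for text classification, the improved algorithm achieves considerable gains in classification accuracy, precision, recall, and F1 over mainstream feature selection algorithms.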

 

Key words: text categorization; feature selection; balance factor; category distinguished words
 

Abstract: The traditional category distinguished words (CDW) feature selection algorithm takes inter-class dispersion and intra-class importance as its metrics, but it ignores the fact that the two indicators often contribute with different weights to the feature scoring function, which affects feature selection to some extent. An improved CDW feature selection algorithm that incorporates a balance factor (ICDW) is proposed. During feature selection, the contribution weights of the two indicators to the feature scoring function are adjusted by tuning the value of the balance factor, giving more effective feature selection. With the Naive Bayes algorithm used for text categorization, experiments show that ICDW not only outperforms CDW but also exceeds expected cross entropy (ECE), information gain (IG) and the chi-square statistic (CHI), which are commonly used for feature selection.

Key words: text categorization, feature selection, balance factor, category distinguished words
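
The abstract describes the idea of weighting two indicators with a balance factor but does not give the scoring formula. The sketch below illustrates only that general idea under loudly stated assumptions: the function names `balanced_term_scores` and `select_top_k`, the parameter `balance`, the per-class document-frequency definitions of the two indicators, and the linear combination are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict
from statistics import pstdev


def balanced_term_scores(docs, labels, balance=0.5):
    """Score terms by a weighted mix of two class-based indicators.

    Assumed (hypothetical) definitions, not the paper's formulas:
      freq[c]    = share of class-c documents that contain the term
      importance = max of freq[c] over classes (intra-class importance)
      dispersion = population std. dev. of freq[c] (inter-class dispersion)
      score      = balance * dispersion + (1 - balance) * importance
    """
    if not 0.0 <= balance <= 1.0:
        raise ValueError("balance must lie in [0, 1]")

    class_sizes = defaultdict(int)
    doc_counts = defaultdict(lambda: defaultdict(int))  # class -> term -> docs
    for tokens, label in zip(docs, labels):
        class_sizes[label] += 1
        for term in set(tokens):  # count each term once per document
            doc_counts[label][term] += 1

    vocabulary = {t for counts in doc_counts.values() for t in counts}
    scores = {}
    for term in vocabulary:
        freqs = [doc_counts[c].get(term, 0) / n for c, n in class_sizes.items()]
        importance = max(freqs)
        dispersion = pstdev(freqs)
        scores[term] = balance * dispersion + (1.0 - balance) * importance
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, breaking ties alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Toy corpus: two classes, two documents each.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = balanced_term_scores(docs, labels, balance=0.5)
print(select_top_k(scores, 3))
```

In this toy setup, moving `balance` toward 1 favors terms concentrated in a single class, while moving it toward 0 favors terms that are frequent within some class even if they also occur elsewhere (such as `team` here). In practice the value would presumably be tuned on held-out data.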

CLC Number: