Computer and Modernization

• Chinese Information Processing Technology •

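Text Feature Selection and Text Representation for Short Texts

  1. (College of Computer Science and Software, Hebei University of Technology, Tianjin 300401, China)
  • Received: 2018-08-15 Online: 2019-04-08 Published: 2019-04-10
  • About the authors: MA Jianhong (1965-), female, born in Baoding, Hebei, professor, Ph.D.; research interests: computer-aided innovation design processes and methods, TRIZ, software engineering, CAI software technology; E-mail: m_zh2002@126.com. LIU Guangsen (1993-), male, born in Shijiazhuang, Hebei, master's student; research interests: natural language processing, text feature selection, text representation; E-mail: liuguangsen0409@qq.com.
  • Supported by:
    Computer-Aided Innovation Design Public Service Platform Construction Service Procurement Project of the China Science and Technology Consulting Service Center (HSZT2015FD/254)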

Abstract:

Short texts are sparse, real-time, and non-standard, which causes problems in text feature selection and text representation and in turn lowers text classification accuracy. To address the high feature dimensionality (the curse of dimensionality) in text feature selection, a two-stage text feature selection algorithm is proposed. First, building on the mutual information algorithm, five indicators (a balance factor, frequency, concentration, part of speech, and the position of a word in the text) are introduced into the calculation of the mutual information value. The top-ranked features in the resulting ordering are then used to initialize a genetic algorithm, which is trained to obtain the optimal feature set. Because TFIDF is computed over the whole corpus and does not account for uneven distribution among classes, variance is introduced into the IDF formula, and the improved TFIDF formula is used to weight Word2Vec word vectors to represent the text. The improved algorithms are applied to a manually constructed corpus of short encyclopedia texts about usage. Experiments show that the improved text feature selection algorithm and text representation algorithm improve classification performance by 2% to 5%.

Key words: text feature selection, text representation, genetic algorithm, text classification
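
As an illustration of the text representation step described in the abstract, the following is a minimal sketch of TF-IDF-weighted averaging of word vectors. The abstract does not give the improved IDF formula, so the variance term here is an assumption: the standard IDF is multiplied by one plus the variance, across classes, of the fraction of each class's documents that contain the term. The function names, the toy corpus, and the random 3-dimensional vectors (standing in for trained Word2Vec vectors) are all hypothetical.

```python
import math
from collections import Counter

import numpy as np


def variance_adjusted_idf(docs, labels):
    """Return {term: idf * (1 + var)} for every term in the corpus.

    ASSUMPTION: this is not the paper's formula, which the abstract does
    not state. `var` is the variance, across classes, of the fraction of
    each class's documents that contain the term.
    """
    n_docs = len(docs)
    classes = sorted(set(labels))
    class_sizes = Counter(labels)
    df = Counter()                              # document frequency over the corpus
    class_df = {c: Counter() for c in classes}  # document frequency per class
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            class_df[label][term] += 1

    weights = {}
    for term, count in df.items():
        # +1 keeps a nonzero weight for terms that occur in every document
        idf = math.log(n_docs / count) + 1.0
        fractions = [class_df[c][term] / class_sizes[c] for c in classes]
        weights[term] = idf * (1.0 + float(np.var(fractions)))
    return weights


def weighted_doc_vector(tokens, idf, vectors):
    """TF-IDF-weighted average of the word vectors of the known tokens."""
    tf = Counter(t for t in tokens if t in vectors and t in idf)
    dim = len(next(iter(vectors.values())))
    if not tf:
        return np.zeros(dim)  # no known token: fall back to the zero vector
    total = sum(tf.values())
    weights = np.array([tf[t] / total * idf[t] for t in tf])
    matrix = np.stack([vectors[t] for t in tf])
    return weights @ matrix / weights.sum()


# Toy, hypothetical corpus: two classes with two short texts each.
docs = [
    ["use", "battery", "power"],
    ["use", "battery", "store"],
    ["use", "filter", "dust"],
    ["use", "filter", "water"],
]
labels = ["energy", "energy", "filter", "filter"]

# Random vectors stand in for trained Word2Vec embeddings.
rng = np.random.default_rng(0)
vocab = sorted({t for d in docs for t in d})
vectors = {w: rng.normal(size=3) for w in vocab}

idf = variance_adjusted_idf(docs, labels)
doc_vec = weighted_doc_vector(docs[0], idf, vectors)
print(doc_vec.shape)
```

Under this assumed weighting, a term concentrated in one class (such as `battery`) receives a larger weight than a term spread evenly over all classes (such as `use`), which is the qualitative effect the abstract attributes to introducing variance into IDF.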

CLC Number: