Computer and Modernization


Text Feature Selection and Text Representation for Short Essays

  

  1. (College of Computer Science and Software, Hebei University of Technology, Tianjin 300401, China)
  • Received: 2018-08-15 Online: 2019-04-08 Published: 2019-04-10

Abstract: Because short essays are sparse, time-sensitive, and non-standard in form, they pose many problems for text feature selection and text representation, which in turn reduce text classification accuracy. To address the curse of dimensionality in text feature selection, a two-stage feature selection algorithm is proposed. First, a balance parameter, frequency, concentration, part of speech, and position are introduced into the mutual information algorithm; then the top-ranked features in the resulting ordering are used to initialize a genetic algorithm, which is trained to obtain the optimal text feature set. Because TFIDF is computed over the whole corpus without considering the uneven distribution between classes, variance is introduced into the IDF formula. The improved TFIDF formula is then used to weight the Word2Vec word vectors to represent the text vector. The improved algorithms are evaluated on a manually constructed corpus of encyclopedic short essays. Experiments show that the improved text feature selection and text representation algorithms improve classification performance by 2%-5%.
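
The text representation step described in the abstract, weighting word vectors by TFIDF and combining them into one text vector, can be illustrated with a minimal sketch. It uses a standard smoothed IDF, not the variance-adjusted IDF of the paper, whose formula is not given here. The function names `idf_table` and `text_vector`, the toy corpus, and the two-dimensional vectors standing in for Word2Vec embeddings are all assumptions made for illustration.

```python
import math
from collections import Counter


def idf_table(docs):
    """Standard smoothed IDF: log((1 + N) / (1 + df)) + 1.

    Assumption: this is the textbook formula, not the paper's
    variance-adjusted variant.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for doc in docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + c)) + 1.0 for t, c in df.items()}


def text_vector(tokens, idf, embeddings, dim):
    """TF-IDF-weighted average of the word vectors of one tokenized text."""
    tf = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in tf.items():
        # Skip terms with no embedding or no IDF entry.
        if term not in embeddings or term not in idf:
            continue
        weight = (count / len(tokens)) * idf[term]
        vec = embeddings[term]
        for i in range(dim):
            total[i] += weight * vec[i]
        weight_sum += weight
    if weight_sum == 0.0:
        return total  # all-zero vector for empty or fully unknown texts
    return [x / weight_sum for x in total]


# Toy data (hypothetical): two short texts and hand-made 2-d "embeddings".
docs = [["cat", "sat"], ["cat", "ran"]]
embeddings = {"cat": [1.0, 0.0], "sat": [0.0, 1.0], "ran": [1.0, 1.0]}

idf = idf_table(docs)
vec = text_vector(docs[0], idf, embeddings, dim=2)
```

In this toy corpus "cat" occurs in every text and so receives a lower IDF than "sat", which pulls the resulting text vector toward the embedding of the rarer word. In practice the embeddings would come from a trained Word2Vec model rather than a hand-written dictionary.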

Key words: text feature selection, text representation, genetic algorithm, text classification
