Text Feature Selection and Text Representation for Short Essays

doi:10.3969/j.issn.1006-2475.2019.03.018

Abstract

Short essays are sparse, time-sensitive, and non-standard, which causes problems in text feature selection and text representation and lowers text classification accuracy. To address the curse of dimensionality in text feature selection, a two-stage text feature selection algorithm is proposed. First, a balance parameter, frequency, concentration, part of speech, and location are introduced into the mutual information algorithm; then the top-ranked features in the sorted result are used to initialize a genetic algorithm, which is trained to obtain the optimal text feature set. Because TFIDF is computed over the whole corpus without considering the uneven distribution of terms between classes, variance is introduced into the IDF formula. The improved TFIDF formula is then used to weight Word2Vec word vectors to represent the text vector. The improved algorithms are applied to a manually constructed corpus of encyclopedic short essays. Experiments show that the improved text feature selection and text representation algorithms improve classification performance by 2%-5%.
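The text representation step can be illustrated with a minimal Python sketch. The abstract does not give the improved IDF formula, so the form used here is an assumption: a common smoothed IDF multiplied by one plus the variance of the term's per-class document rates. The function names `variance_adjusted_idf` and `text_vector` are hypothetical, and plain dictionaries stand in for a trained Word2Vec model.

```python
import math
from collections import Counter
from statistics import pvariance


def variance_adjusted_idf(docs, labels):
    """Return an IDF-style weight per term, boosted by between-class variance.

    ASSUMPTION: the multiplicative form idf * (1 + variance) is illustrative
    only; the abstract does not state the exact improved formula.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    doc_freq = Counter()
    class_doc_freq = {c: Counter() for c in class_sizes}
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            doc_freq[term] += 1
            class_doc_freq[label][term] += 1

    weights = {}
    for term, df in doc_freq.items():
        # Smoothed IDF over the whole corpus.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        # Fraction of documents in each class that contain the term.
        rates = [class_doc_freq[c][term] / class_sizes[c] for c in class_sizes]
        # Terms concentrated in few classes have higher variance.
        weights[term] = idf * (1.0 + pvariance(rates))
    return weights


def text_vector(tokens, idf, word_vectors, dim):
    """Weighted average of word vectors, with weight = TF * adjusted IDF."""
    tf = Counter(tokens)
    total = [0.0] * dim
    weight_sum = 0.0
    for term, count in tf.items():
        if term not in word_vectors or term not in idf:
            continue  # skip terms without a vector or a weight
        w = (count / len(tokens)) * idf[term]
        for i, x in enumerate(word_vectors[term]):
            total[i] += w * x
        weight_sum += w
    if weight_sum == 0.0:
        return total  # no known terms: zero vector
    return [x / weight_sum for x in total]
```

Under this assumed form, a term that occurs in only one class receives a larger weight than a term with the same document frequency spread evenly across classes, so it contributes more to the averaged text vector.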

Key words: text feature selection, text representation, genetic algorithm, text classification

CLC Number: TP391

MA Jian-hong, LIU Guang-sen, YAO Shuang, YANG Zhi. Text Feature Selection and Text Representation for Short Essays[J]. Computer and Modernization, doi: 10.3969/j.issn.1006-2475.2019.03.018.

