Computer and Modernization ›› 2021, Vol. 0 ›› Issue (06): 6-11.

• Algorithm Design and Analysis •

Short Text Classification Based on Improved CHI and TF-IDF

  1. (College of Computer Science and Technology, Qingdao University, Qingdao 266071, China)
  • Online: 2021-07-05 Published: 2021-07-05
  • About the authors: DAI Jipeng (b. 1995), male, from Tai'an, Shandong, master's student; research interests: data mining, complex networks; E-mail: djp4992@163.com. Corresponding author: SHAO Fengjing (b. 1955), female, from Qingdao, Shandong, professor, Ph.D.; research interests: complex network theory and applications, data mining; E-mail: sfj@qdu.edu.cn. SUN Rencheng (b. 1977), male, from Qingdao, Shandong, associate professor, Ph.D.; research interests: big data analysis and processing, complex networks; E-mail: qdsunstar@163.com.
  • Funding:
    Supported by the National Natural Science Foundation of China Youth Program (41706198)



Abstract: To improve the classification of short texts with limited data and to effectively reduce the dimensionality of the feature space, this paper addresses the defects of the traditional CHI statistical method and the TF-IDF weight calculation method by proposing a new class-word factor that improves feature selection and, in turn, classification accuracy. The traditional CHI statistic is overly sensitive to low-frequency words, and TF-IDF weighting ignores how feature terms are distributed between and within classes. The class-word factor is therefore introduced into both methods, which are then used in combination to reduce the interference caused by low-frequency words while accounting for the intra-class and inter-class distribution of feature words. The proposed method is applied, with the XGBoost classification algorithm, to topic-classification experiments on small collections of short texts. The experimental results show that, compared with the traditional CHI and TF-IDF methods, feature selection with the class-word factor improves classification accuracy on both balanced and unbalanced corpora and greatly reduces memory usage.

Key words: text classification, feature selection, XGBoost, chi-square statistics, TF-IDF
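The abstract describes the traditional CHI statistic and TF-IDF weighting and a class-word factor that scales them, but does not give the factor's exact formula. The sketch below is a minimal illustration of the general idea only: the standard chi-square feature-selection statistic and classic TF-IDF, scaled by a stand-in factor (the share of a term's document frequency that falls inside the target class) that plays the role the paper assigns to its class-word factor. The function names, the toy corpus, and the specific factor definition are all assumptions for illustration, not the paper's actual method.

```python
import math

def chi_square(A, B, C, D):
    """Traditional CHI statistic for term t and class c, from the
    2x2 contingency table: A = docs in c containing t, B = docs
    outside c containing t, C = docs in c lacking t, D = docs
    outside c lacking t."""
    N = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return N * (A * D - C * B) ** 2 / denom if denom else 0.0

def class_word_factor(A, B):
    """Illustrative class-word factor: the fraction of the term's
    document frequency that falls inside class c, so terms spread
    evenly across classes (or absent from c) are down-weighted.
    The paper's actual factor may be defined differently."""
    return A / (A + B) if (A + B) else 0.0

def improved_chi(A, B, C, D):
    """CHI statistic scaled by the class-word factor."""
    return chi_square(A, B, C, D) * class_word_factor(A, B)

def tf_idf(tf, df, n_docs):
    """Classic TF-IDF weight; the same factor could scale it too."""
    return tf * math.log(n_docs / df)

# Toy corpus: two classes, four short documents.
docs = [("sports", "ball game win"), ("sports", "ball team"),
        ("tech", "chip code"), ("tech", "code data ball")]

def table(term, cls):
    """Build the (A, B, C, D) contingency counts for term/class."""
    A = sum(1 for c, d in docs if c == cls and term in d.split())
    B = sum(1 for c, d in docs if c != cls and term in d.split())
    C = sum(1 for c, d in docs if c == cls and term not in d.split())
    D = sum(1 for c, d in docs if c != cls and term not in d.split())
    return A, B, C, D

print(improved_chi(*table("ball", "sports")))  # ≈ 0.889
print(improved_chi(*table("code", "sports")))  # 0.0: never occurs in "sports"
```

Note how "code" gets a nonzero raw CHI score for the "sports" class (chi-square rewards any strong association, including strong absence), while the factor zeroes it out because the term never occurs inside that class; this is the kind of low-frequency/distribution correction the abstract attributes to the class-word factor.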