Research on Text Categorization Based on Improved TFIDF Algorithm

doi: 10.3969/j.issn.10062475.2014.09.002

Computer and Modernization, 2014, Vol. 0, Issue (9): 6-9.

School of Economics and Management, Tongji University, Shanghai 200092, China

Received: 2014-06-25 Online: 2014-10-10 Published: 2014-11-04
About the authors: ZHENG Lin (b. 1989), male, from Quzhou, Zhejiang, master's student at the School of Economics and Management, Tongji University; research interest: information management and information systems. XU Dehua, male, associate professor and master's supervisor; research interests: management information systems and e-commerce.

Abstract: Text categorization is widely applied in information retrieval, email filtering, Web page classification, personalized recommendation, and other fields, and it has attracted extensive attention from researchers since the concept was first proposed. Many methods have been used in text categorization research, and TFIDF is one of the most commonly used algorithms for computing the weights of feature items. The traditional TFIDF algorithm, however, ignores how feature items are distributed within and among classes, so many items with little discriminative power receive large weights. To address this shortcoming, this paper modifies the IDF calculation, using the proportion of documents that contain a term within a class and among classes to account for the term's distribution within and among classes. In the experiments, text vectors are represented with the improved weighting algorithm, and the classification results verify that the improved algorithm is effective.

Key words: TFIDF algorithm, feature selection, text categorization
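The abstract does not state the modified IDF formula, so the following Python sketch is only an illustration of the general idea, not the authors' method. It compares the classic IDF, `log(N / df)`, with a hypothetical class-aware factor built from the within-class and out-of-class document proportions of a term. The function names, the `smooth` parameter, the `log(1 + p_in / (p_out + smooth))` form, and the toy corpus are all assumptions made for this example.

```python
import math

def classic_idf(term, docs):
    """Classic IDF: log(N / df), where df is the number of documents containing the term."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0

def class_aware_idf(term, docs, label, smooth=0.01):
    """Hypothetical class-aware factor (an assumption, not the paper's formula).

    p_in  : share of documents in class `label` that contain the term
    p_out : share of documents in all other classes that contain the term
    The factor grows when the term is common inside the class and rare outside it.
    """
    in_class = [tokens for tokens, lab in docs if lab == label]
    out_class = [tokens for tokens, lab in docs if lab != label]
    p_in = sum(term in t for t in in_class) / len(in_class) if in_class else 0.0
    p_out = sum(term in t for t in out_class) / len(out_class) if out_class else 0.0
    return math.log(1.0 + p_in / (p_out + smooth))

# Toy corpus (invented for illustration): (tokens, class label)
docs = [
    (["goal", "team", "match"], "sports"),
    (["team", "coach", "rate"], "sports"),
    (["stock", "bank", "rate"], "finance"),
    (["stock", "market", "bank"], "finance"),
]

# "team" and "rate" each occur in two documents, so classic IDF treats them alike,
# although "team" is concentrated in one class and "rate" is spread across both.
for term in ("team", "rate"):
    print(term, classic_idf(term, docs), class_aware_idf(term, docs, "sports"))
```

In this toy corpus both terms receive the same classic IDF, while the class-aware factor ranks "team" above "rate" for the sports class. This is the kind of within-class versus between-class distinction the abstract says traditional TFIDF ignores.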

CLC number: TP301.6

ZHENG Lin, XU Dehua. Research on Text Categorization Based on Improved TFIDF Algorithm[J]. Computer and Modernization, 2014, 0(9): 6-9.

