Computer and Modernization

• Database and Data Mining

Research on Text Classification Based on an Improved TFIDF Algorithm

  

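  1. School of Economics and Management, Tongji University, Shanghai 200092
  • Received: 2014-06-25 Publication date: 2014-10-10 Release date: 2014-11-04
  • About the authors: ZHENG Lin (b. 1989), male, from Quzhou, Zhejiang, master's student at the School of Economics and Management, Tongji University; research interests: information management and information systems. XU Dehua, male, associate professor and master's supervisor; research interests: management information systems and e-commerce.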

Research on Text Categorization Based on Improved TFIDF Algorithm

  1. School of Economics and Management, Tongji University, Shanghai 200092, China
  • Received: 2014-06-25 Online: 2014-10-10 Published: 2014-11-04

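Abstract: Because text classification has broad application value in information retrieval, email filtering, web page classification, personalized recommendation, and other fields, it has received wide attention from researchers since the concept was proposed. Researchers have applied many methods to text classification; among them, TFIDF is one of the most commonly used algorithms for computing document feature weights. However, the traditional TFIDF algorithm ignores the distribution of feature terms within and between classes, so many terms with little discriminative power are assigned large weights. To address this shortcoming, this paper accounts for the within-class and between-class distribution of a term in the IDF calculation by using the proportion of documents that contain the term within a class and across classes. In the experiments, text vectors are represented with the improved weighting algorithm, and the classification results verify the effectiveness of the improved algorithm.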

Keywords: TFIDF algorithm, feature selection, text classification

Abstract: Text categorization is widely applied in information retrieval, email filtering, Web page classification, personalized recommendation, and other fields, and it has attracted extensive attention from scholars since the concept was first proposed. Many methods have been adopted in text categorization research, and TFIDF is one of the most commonly used algorithms for calculating the weights of feature items. However, the traditional TFIDF algorithm ignores the distribution of feature items within and among classes, so many items with little discriminative power receive high weights. To improve on the traditional TFIDF algorithm, this paper modifies the calculation of IDF by adding factors that reflect the distribution of feature items within and among classes. In the experiments, the improved TFIDF algorithm is applied to text categorization, and the classification results verify that the improved algorithm is effective.

 

Keywords: TFIDF algorithm, feature selection, text categorization
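
The abstract describes the idea of the improvement but does not give the formula, so the following is only a minimal sketch of one way a class-aware IDF could look, not the authors' method. It assumes that the IDF factor is driven by the ratio between the share of same-class documents containing a term and the share of other-class documents containing it. The function name `class_aware_tfidf`, the logarithmic form, and the `1 / n_docs` smoothing term are all hypothetical choices made for illustration.

```python
import math
from collections import Counter


def class_aware_tfidf(docs, labels):
    """Weight terms with a hypothetical class-aware IDF (illustrative only).

    docs:   list of token lists, one per document
    labels: class label for each document
    Returns one {term: weight} dict per document.
    """
    n_docs = len(docs)
    class_sizes = Counter(labels)
    df = Counter()           # term -> number of documents containing it
    df_in_class = Counter()  # (term, label) -> documents of that class containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[(term, label)] += 1

    weights = []
    for tokens, label in zip(docs, labels):
        tf = Counter(tokens)
        n_in = class_sizes[label]
        n_out = n_docs - n_in
        doc_weights = {}
        for term, count in tf.items():
            # Share of same-class documents that contain the term.
            in_share = df_in_class[(term, label)] / n_in
            # Share of other-class documents that contain the term.
            out_df = df[term] - df_in_class[(term, label)]
            out_share = out_df / n_out if n_out else 0.0
            # Assumed form: large when the term is common inside the class
            # and rare outside it; 1 / n_docs avoids division by zero.
            idf = math.log(1.0 + in_share / (out_share + 1.0 / n_docs))
            doc_weights[term] = (count / len(tokens)) * idf
        weights.append(doc_weights)
    return weights


docs = [
    ["ball", "game", "the"],
    ["ball", "team", "the"],
    ["stock", "market", "the"],
    ["stock", "price", "the"],
]
labels = ["sport", "sport", "finance", "finance"]
weights = class_aware_tfidf(docs, labels)
print(sorted(weights[0], key=weights[0].get, reverse=True))
```

In this toy corpus, a term concentrated in one class ("ball") is ranked above a term spread evenly over all classes ("the"), which is the behavior the abstract asks of the improved weighting. Because the weights depend on class labels, a sketch like this applies directly only to labeled training documents; how unlabeled test documents are weighted is a separate design decision that the abstract does not cover.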
