Computer and Modernization (计算机与现代化)

• Algorithm Design and Analysis

An Improved K-means Optimization Approach for Text Clustering

  1. School of Foreign Language, Changshu Institute of Technology, Changshu 215500, Jiangsu, China
  • Received: 2014-12-09 Online: 2015-03-23 Published: 2015-03-26
  • About the author: Wang Qiong (王琼, born 1982), female, from Changshu, Jiangsu; laboratory technician at the School of Foreign Language, Changshu Institute of Technology; master's degree; research interests: data mining and intelligent information processing.
  • Funding:
    Natural Science Foundation of Jiangsu Province (BK2012209); Suzhou Science and Technology Development Program (SYG201409)

Abstract: This paper proposes k-meansSC, an improved k-means optimization method for text clustering. The documents to be clustered are first segmented into words, and a set of main terms is extracted from them. Document feature vectors are then built in two ways, with a Boolean function and with a TFIDF function, and the strengths and weaknesses of the two representations are compared experimentally. A support matrix and a confidence matrix are constructed from the same term set and used to define a similarity formula. For different numbers of clusters, the iteration count and error function obtained with this formula are analyzed in detail against those of other distance formulas. The experimental results show that, under certain conditions, TFIDF feature vectors effectively improve both running efficiency and clustering validity.

Key words: k-means, similarity, text clustering, support, confidence
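The building blocks named in the abstract can be illustrated with a minimal Python sketch. It rests on assumptions: the documents are already segmented into words, TFIDF is taken as the common tf × log(N/df) variant, and support and confidence follow the standard association-rule definitions over term co-occurrence. The function names and sample documents are hypothetical, and the k-meansSC similarity formula itself is not given in the abstract, so it is not reproduced here.

```python
import math
from collections import Counter


def boolean_vectors(docs, vocab):
    """Boolean feature vectors: 1 if the term occurs in the document, else 0."""
    vectors = []
    for doc in docs:
        present = set(doc)
        vectors.append([1 if term in present else 0 for term in vocab])
    return vectors


def tfidf_vectors(docs, vocab):
    """TFIDF feature vectors using tf * log(N / df) (one common variant, assumed)."""
    n_docs = len(docs)
    doc_freq = {term: sum(1 for doc in docs if term in doc) for term in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append([
            (counts[term] / len(doc)) * math.log(n_docs / doc_freq[term])
            if doc_freq[term] else 0.0
            for term in vocab
        ])
    return vectors


def support_confidence(bool_vecs):
    """Term-by-term support and confidence matrices (assumed definitions).

    support[i][j]    = share of documents containing both term i and term j
    confidence[i][j] = support[i][j] / support[i][i], i.e. the rule i -> j
    """
    n_docs = len(bool_vecs)
    n_terms = len(bool_vecs[0])
    support = [
        [sum(v[i] * v[j] for v in bool_vecs) / n_docs for j in range(n_terms)]
        for i in range(n_terms)
    ]
    confidence = [
        [support[i][j] / support[i][i] if support[i][i] else 0.0
         for j in range(n_terms)]
        for i in range(n_terms)
    ]
    return support, confidence


# Hypothetical, already-segmented documents used only for illustration.
docs = [
    ["text", "cluster", "kmeans"],
    ["text", "mining"],
    ["cluster", "kmeans", "kmeans"],
]
vocab = sorted({term for doc in docs for term in doc})

bool_vecs = boolean_vectors(docs, vocab)
tfidf = tfidf_vectors(docs, vocab)
support, confidence = support_confidence(bool_vecs)
```

In this toy corpus "cluster" and "kmeans" always occur together, so the confidence of the rule cluster -> kmeans is 1, while text -> cluster holds in only half of the documents that contain "text". A similarity measure built on such matrices could treat two documents with different but strongly associated terms as close, which plain Boolean overlap cannot do.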

CLC number: