Computer and Modernization (计算机与现代化)

• Database and Data Mining

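Text Clustering Based on an Improved k-means Algorithm

  1. (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
  • Online: 2018-04-28 Published: 2018-05-02
  • About the authors: JIANG Li (蒋丽, born 1991), female, from Yixing, Jiangsu; master's student at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; CCF member; research interest: data mining. XUE Shanliang (薛善良, born 1972), male, from Dingyuan, Anhui; associate professor, master's supervisor, CCF member, postdoctoral researcher; research interest: Internet of Things.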

Abstract: The original k-means algorithm is sensitive to the number of clusters k. To address this, an improved k-means algorithm is proposed. The algorithm first computes the similarity between word vectors based on word co-occurrence, then partitions the data into k+x clusters according to a similarity threshold, and finally applies k-means to these k+x clusters. Applied to text clustering, the improved algorithm achieves higher clustering accuracy than the original algorithm in the reported experiments.

Key words: k-means algorithm, co-occurrence word, word vector, similarity
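
The two-stage idea in the abstract (a threshold-based pre-partition followed by k-means) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the similarity formula, the threshold value, or how the k+x clusters are handed to k-means. The sketch assumes cosine similarity over term-count vectors as a stand-in for the co-occurrence-based similarity (only terms present in both vectors contribute to the dot product), a greedy single pass for the pre-partition, and the resulting group centroids as initial centers for standard Euclidean k-means. All function names and the threshold 0.8 are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity; only terms present in both vectors add to the dot product."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def threshold_partition(vectors, threshold):
    """Assumed pre-partition: a vector joins the first group whose centroid
    is at least `threshold` similar, otherwise it starts a new group.
    The number of groups is decided by the data, not fixed in advance."""
    groups = []
    for i, v in enumerate(vectors):
        for g in groups:
            if cosine(v, mean([vectors[j] for j in g])) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups


def kmeans(vectors, init_centers, max_iter=100):
    """Standard Lloyd iterations starting from the given centers."""
    centers = [list(c) for c in init_centers]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centers)), key=lambda c: sq_dist(v, centers[c]))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = mean(members)
    return labels, centers


# Toy term-count vectors over a 4-word vocabulary (made-up data).
docs = [[2, 1, 0, 0], [3, 2, 0, 0], [0, 0, 1, 2], [0, 0, 2, 3]]
groups = threshold_partition(docs, threshold=0.8)
labels, centers = kmeans(docs, [mean([docs[j] for j in g]) for g in groups])
```

In this sketch the number of clusters passed to k-means comes from the threshold pass rather than from a user-supplied k, which is one way to read the abstract's claim of reduced sensitivity to k. The threshold then becomes the parameter to tune.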

CLC Number: