Computer and Modernization


 Text Clustering Based on an Improved K-means Algorithm

  

  1.  (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)
  • Online: 2018-04-28   Published: 2018-05-02

Abstract: To address the sensitivity of the original k-means algorithm to the number of clusters k, an improved k-means algorithm is proposed. The algorithm first calculates the similarity between word vectors based on the principle of co-occurring words and divides the data into k+x clusters according to a similarity threshold; it then runs the k-means algorithm on these k+x clusters. The proposed algorithm is applied to text clustering. Experimental results show that it is more accurate than the original algorithm.
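
The two-stage idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the similarity measure, the grouping rule, or the threshold, so the term-count vectors, the cosine similarity, the leader-style grouping pass, and the names `threshold_groups`, `cluster_texts`, and `threshold=0.3` are all assumptions made for illustration. The number of groups found by the threshold pass plays the role of k+x, and the group means seed an ordinary k-means run.

```python
import math
from collections import Counter

def vectorize(docs):
    """Turn tokenized documents into term-count vectors over a shared vocabulary."""
    vocab = sorted({w for d in docs for w in d})
    index = {w: i for i, w in enumerate(vocab)}
    vectors = []
    for d in docs:
        v = [0.0] * len(vocab)
        for w, c in Counter(d).items():
            v[index[w]] = float(c)
        vectors.append(v)
    return vectors

def cosine(a, b):
    """Cosine similarity; it is nonzero only when two documents share words."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def threshold_groups(vectors, threshold):
    """Assumed grouping rule: join the first group whose leader is similar
    enough, otherwise start a new group. The group count is data-driven."""
    leaders, groups = [], []
    for i, v in enumerate(vectors):
        for g, leader in enumerate(leaders):
            if cosine(v, leader) >= threshold:
                groups[g].append(i)
                break
        else:
            leaders.append(v)
            groups.append([i])
    return groups

def kmeans(vectors, centroids, max_iter=100):
    """Plain Lloyd iterations using cosine similarity, from the given seeds."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            max(range(len(centroids)), key=lambda c: cosine(v, centroids[c]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(len(centroids)):
            members = [v for v, l in zip(vectors, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels

def cluster_texts(docs, threshold=0.3):
    """Stage 1: threshold grouping fixes the cluster count and the seeds.
    Stage 2: k-means refines the assignment starting from those seeds."""
    vectors = vectorize(docs)
    groups = threshold_groups(vectors, threshold)
    seeds = [mean([vectors[i] for i in g]) for g in groups]
    return kmeans(vectors, seeds)

# Toy corpus (hypothetical): two pet-related and two market-related documents.
docs = [
    ["cat", "dog", "pet"],
    ["dog", "pet", "food"],
    ["stock", "market", "trade"],
    ["market", "trade", "price"],
]
labels = cluster_texts(docs)
```

On this toy corpus the threshold pass finds two groups, so k-means runs with two seeded centroids and keeps the pet documents and the market documents in separate clusters. Seeding from the threshold groups, rather than from random points, is the design choice that removes the need to fix the cluster count in advance.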

Key words: k-means algorithm, co-occurrence word, word vector, similarity
