Text Clustering Based on Improved K-means Algorithm

doi:10.3969/j.issn.1006-2475.2018.04.004

Abstract

To address the sensitivity of the original K-means algorithm to the number of clusters k, an improved K-means algorithm is proposed. The algorithm first computes the similarity between word vectors based on the principle of co-occurring words, divides the data into k+x clusters according to a similarity threshold, and then applies K-means to those k+x clusters. The proposed algorithm is applied to text clustering. Experimental results show that it is more accurate than the original algorithm.
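The two-stage idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, since the abstract does not give the details. The sketch assumes that cosine similarity stands in for the co-occurrence-based similarity, that the threshold step is a simple "leader" grouping, and that the resulting group means are used as the initial centroids for standard K-means. All function names and the `threshold` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def threshold_groups(vectors, threshold):
    """Assumed pre-grouping step: a vector joins the first group whose leader
    (its first member) is at least `threshold` similar; otherwise it starts
    a new group. The number of groups is therefore data-driven."""
    groups = []
    for v in vectors:
        for g in groups:
            if cosine(v, g[0]) >= threshold:
                g.append(v)
                break
        else:
            groups.append([v])
    return groups


def kmeans(vectors, centroids, max_iter=100):
    """Standard Lloyd iterations (squared Euclidean distance) from given centroids."""
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(len(centroids)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(v, centroids[j])))
            for v in vectors
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(len(centroids)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = mean(members)
    return labels, centroids


def threshold_seeded_kmeans(vectors, threshold):
    """Pre-group by similarity threshold, then refine the groups with K-means."""
    seeds = [mean(g) for g in threshold_groups(vectors, threshold)]
    return kmeans(vectors, seeds)
```

Under these assumptions, the number of clusters falls out of the threshold rather than being fixed in advance, which is one way to read the abstract's claim about reducing sensitivity to k. Calling `threshold_seeded_kmeans(vectors, 0.8)` on a list of document or word vectors returns one label per vector together with the final centroids.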

Key words: K-means algorithm, co-occurrence word, word vector, similarity

CLC Number: TP391

JIANG Li, XUE Shanliang. Text Clustering Based on Improved K-means Algorithm[J]. Computer and Modernization, 2018, 0(04): 17-.

