Text Clustering Based on Improved k-means Algorithm

doi:10.3969/j.issn.10062475.2018.04.004

Computer and Modernization, 2018, Vol. 0, Issue (04): 17-.

(College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)

Online: 2018-04-28 Published: 2018-05-02
About the authors: JIANG Li (b. 1991), female, from Yixing, Jiangsu; master's student at the College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics; CCF member; research interest: data mining. XUE Shanliang (b. 1972), male, from Dingyuan, Anhui; associate professor, master's supervisor, CCF member, postdoctoral researcher; research interest: Internet of Things.

Abstract

To address the sensitivity of the original k-means algorithm to the number of clusters k, an improved k-means algorithm is proposed. The algorithm first computes the similarity between word vectors based on the principle of co-occurring words and partitions the data into k+x clusters according to a similarity threshold; it then applies k-means to these k+x clusters. The improved algorithm is applied to text clustering, and the experimental results show that it achieves higher clustering accuracy than the original algorithm.

Key words: k-means algorithm, co-occurrence word, word vector, similarity
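
The two-stage pipeline described in the abstract can be sketched roughly as follows. This is only an illustration under several assumptions that the abstract does not confirm: cosine similarity stands in for the co-occurrence-based similarity, a single leader-style pass stands in for the threshold partitioning, the number of groups that pass produces plays the role of k+x, and the group centroids are used as the initial centers for a standard Euclidean k-means. The function names (`threshold_groups`, `improved_kmeans`) and the `threshold` parameter are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def sq_dist(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def threshold_groups(vectors, threshold):
    """Assumed pre-partitioning step: a vector joins the first group whose
    centroid is at least `threshold` similar, otherwise it opens a new group."""
    groups = []
    for v in vectors:
        for g in groups:
            if cosine(v, mean(g)) >= threshold:
                g.append(v)
                break
        else:
            groups.append([v])
    return groups


def nearest(v, centers):
    """Index of the center closest to v."""
    return min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))


def kmeans(vectors, centers, iterations=20):
    """Standard Lloyd iterations starting from the given centers."""
    labels = []
    for _ in range(iterations):
        labels = [nearest(v, centers) for v in vectors]
        new_centers = []
        for j, old in enumerate(centers):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            # Keep the old center if a cluster becomes empty.
            new_centers.append(mean(members) if members else old)
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


def improved_kmeans(vectors, threshold):
    """Threshold pass fixes the number of clusters, then k-means refines them."""
    groups = threshold_groups(vectors, threshold)
    centers = [mean(g) for g in groups]
    return kmeans(vectors, centers)


if __name__ == "__main__":
    data = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9], [-1, 0], [-0.9, -0.1]]
    labels, centers = improved_kmeans(data, threshold=0.8)
    print(len(centers), labels)
```

In this sketch the similarity threshold, rather than a user-supplied k, determines how many clusters k-means starts from, which is one plausible reading of how the method reduces sensitivity to k.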

CLC number: TP391

JIANG Li, XUE Shanliang. Text Clustering Based on Improved k-means Algorithm[J]. Computer and Modernization, 2018, 0(04): 17-.
