Improved K-means Clustering Algorithm Based on MapReduce Framework

doi:10.3969/j.issn.1006-2475.2019.08.006

Computer and Modernization, 2019, Vol. 0, Issue (08): 28-. doi: 10.3969/j.issn.1006-2475.2019.08.006


(School of Science, Shenyang University of Technology, Shenyang 110870, Liaoning, China)

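Received: 2018-12-02  Online: 2019-08-15  Published: 2019-08-16
About the authors: SONG Yang (1992-), female, born in Changtu, Liaoning, master's student; research interests: intelligent optimization algorithms and reliability theory; E-mail: syy3415@163.com. SHI Hong-yan (1963-), female, professor, Ph.D.; research interests: intelligent optimization algorithms and reliability theory.
Supported by:
National Natural Science Foundation of China (61074005); Liaoning Province Program for Excellent Scientific and Technological Talents in Higher Education Institutions (LR2012005)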


Abstract

Abstract: To improve the clustering quality and speed of the K-means algorithm on massive data, a distributed, parallel K-means programming model based on the MapReduce framework is proposed. First, to address the sensitivity of K-means to initialization, a new dissimilarity function is introduced: the value of k is determined from the degree of dissimilarity between data points, and points with small dissimilarity are selected as the initial cluster centers. The K-means algorithm is then deployed on the MapReduce programming model, and the model itself is improved to speed up K-means on massive data. Experiments show that, compared with the traditional K-means algorithm, the improved K-means algorithm under MapReduce achieves better accuracy and convergence time, and that the parallel clustering model scales well across different data sizes and numbers of compute nodes.

Key words: K-means algorithm, dissimilarity function, MapReduce model
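
As a rough illustration of how a K-means iteration can be split into map and reduce steps, the following is a minimal single-process sketch, not the authors' implementation. It assumes Euclidean distance and takes the initial centers as an input; the paper's dissimilarity function for choosing k and the initial centers is not given on this page, so it is not reproduced here. The names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centers):
    """Map: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce: average the points that share the same center index."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    new_centers = list(centers)  # a center with no points keeps its position
    for idx, members in groups.items():
        new_centers[idx] = tuple(sum(c) / len(members) for c in zip(*members))
    return new_centers


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat map + reduce until no center moves by more than tol."""
    for _ in range(max_iter):
        updated = reduce_phase(map_phase(points, centers), centers)
        if all(dist(a, b) <= tol for a, b in zip(centers, updated)):
            return updated
        centers = updated
    return centers
```

For example, `kmeans([(0, 0), (0, 2), (10, 0), (10, 2)], [(0, 0), (10, 0)])` groups the two left-hand and the two right-hand points separately. In a real MapReduce job the map step would run on each data split in parallel and the framework would shuffle the emitted pairs by center index before the reduce step; here both steps run in one process for clarity.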


SONG Yang, SHI Hong-yan. Improved K-means Clustering Algorithm Based on MapReduce Framework[J]. Computer and Modernization, 2019, 0(08): 28-.

