Improved K-means Clustering Algorithm Based on MapReduce Framework

doi:10.3969/j.issn.1006-2475.2019.08.006

Abstract

To improve both the clustering quality and the speed of the K-means algorithm on massive data, a distributed parallel K-means clustering model based on the MapReduce framework is proposed. First, to address the sensitivity of K-means to initialization, a new dissimilarity function is introduced: the value of k is determined from the degree of dissimilarity between data points, and points with smaller dissimilarity are selected as the initial cluster centers. The K-means algorithm is then deployed on the MapReduce programming model, and the processing of massive data is accelerated by improving that model. Experiments show that the improved K-means algorithm under MapReduce achieves better accuracy and shorter convergence time than the traditional K-means algorithm, and that the parallel clustering model scales well across different data sizes and numbers of compute nodes.
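As a rough illustration of how a K-means iteration maps onto the MapReduce pattern, the following single-process sketch simulates the map, shuffle, and reduce phases. It is not the authors' implementation. The abstract does not define the proposed dissimilarity function, so plain Euclidean distance is used as a stand-in and the initial centers are supplied by the caller. The function names (`map_assign`, `reduce_update`, `kmeans_iteration`, `kmeans`) are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance; an assumed stand-in for the paper's dissimilarity function


def map_assign(point, centers):
    """Map step: emit (index of nearest center, (point, 1))."""
    nearest = min(range(len(centers)), key=lambda i: dist(point, centers[i]))
    return nearest, (point, 1)


def reduce_update(values):
    """Reduce step: average all points that were assigned to one center."""
    count = sum(n for _, n in values)
    dims = len(values[0][0])
    return tuple(sum(p[d] for p, _ in values) / count for d in range(dims))


def kmeans_iteration(points, centers):
    """One map -> shuffle -> reduce round; returns the updated centers."""
    groups = defaultdict(list)
    for point in points:
        key, value = map_assign(point, centers)
        groups[key].append(value)  # shuffle: group emitted values by center index
    # A center with no assigned points keeps its previous position.
    return [reduce_update(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))]


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat rounds until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = kmeans_iteration(points, centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(kmeans(data, [(0, 0), (10, 10)]))
```

In a real MapReduce deployment, each round would typically run as a separate job, with the current centers distributed to every mapper and the reducers writing the updated centers for the next round. The sketch keeps everything in memory only to show the data flow.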

Key words: K-means algorithm, dissimilarity function, MapReduce model


SONG Yang, SHI Hong-yan. Improved K-means Clustering Algorithm Based on MapReduce Framework[J]. Computer and Modernization, doi: 10.3969/j.issn.1006-2475.2019.08.006.

