Computer and Modernization

• Algorithm Design and Analysis

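Improved K-means Clustering Algorithm Based on MapReduce Framework

  1. (School of Science, Shenyang University of Technology, Shenyang 110870, China)
  • Received: 2018-12-02 Online: 2019-08-15 Published: 2019-08-16
  • About the authors: SONG Yang (1992-), female, born in Changtu, Liaoning, master's student, research interests: intelligent optimization algorithms and reliability theory, E-mail: syy3415@163.com; SHI Hongyan (1963-), female, professor, Ph.D., research interests: intelligent optimization algorithms and reliability theory.
  • Supported by:
    National Natural Science Foundation of China (61074005); Support Program for Excellent Scientific and Technological Talents in Higher Education Institutions of Liaoning Province (LR2012005)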

Abstract: To improve the clustering quality and speed of the K-means algorithm on massive data, a distributed, parallel K-means programming model based on the MapReduce framework is proposed. First, to address the sensitivity of K-means to initialization, a new dissimilarity function is introduced: the value of k is determined from the degree of dissimilarity among the data, and points with small dissimilarity are selected as the initial cluster centers. The K-means algorithm is then deployed on the MapReduce programming model, and the MapReduce model is improved to speed up the processing of massive data. Experiments show that, compared with the traditional K-means algorithm, the improved K-means algorithm under the MapReduce framework improves in both accuracy and convergence time, and that the parallel clustering model scales well across different data sizes and numbers of compute nodes.

Key words: K-means algorithm, dissimilarity function, MapReduce model
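
As a rough illustration of how a K-means iteration can be split into map and reduce steps, the following single-process Python sketch may help. It is not the authors' implementation: the abstract does not give the dissimilarity function or the MapReduce modifications, so this sketch assumes plain Euclidean distance and initial centers supplied by the caller, and the names `map_phase`, `reduce_phase`, and `kmeans` are hypothetical.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def map_phase(points, centers):
    """Map step: emit (index of nearest center, point) for every point."""
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        yield nearest, p


def reduce_phase(pairs, centers):
    """Reduce step: average the points assigned to each center."""
    sums = {}
    counts = defaultdict(int)
    for idx, p in pairs:
        if idx not in sums:
            sums[idx] = list(p)
        else:
            sums[idx] = [s + x for s, x in zip(sums[idx], p)]
        counts[idx] += 1
    # Assumption: a center that received no points keeps its old position.
    new_centers = list(centers)
    for idx, total in sums.items():
        new_centers[idx] = tuple(s / counts[idx] for s in total)
    return new_centers


def kmeans(points, centers, max_iter=100, tol=1e-9):
    """Repeat map and reduce until no center moves by more than tol."""
    for _ in range(max_iter):
        new_centers = reduce_phase(map_phase(points, centers), centers)
        shift = max(dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol:
            break
    return centers
```

For example, `kmeans([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` returns `[(0.0, 1.0), (10.0, 11.0)]`. In a real MapReduce deployment the map step would run in parallel over data splits and the reduce step would receive the pairs grouped by center index; here both run sequentially in one process.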

CLC Number: