Computer and Modernization

• Software Engineering •

Distributed Stage Adaptive Association Rules Mining Algorithm Based on Spark

  1. Department of Information Engineering, Shanwei Polytechnic, Shanwei 516600, China;
  2. Huawei Technologies Co., Ltd., Shenzhen 518129, China
  • Received: 2019-05-08 Online: 2019-12-11 Published: 2019-12-11
  • About the authors: SHI Hui (b. 1983), female, from Xianning, Hubei; lecturer, master's degree; research interests: cloud computing and big data, intelligent algorithms; E-mail: 455987511@qq.com. CHEN En (b. 1984), male, senior engineer; research interests: system architecture design and analysis, artificial intelligence; E-mail: 363196200@qq.com.
  • Funding:
    2016 Vertical Research Planning Project of the National Research Association of Computer Basic Education in Colleges and Universities (2016GHB02005); 2019 Education Research Project of the Guangdong Higher Vocational Cloud Computing and Big Data Professional Committee (GDYJSKT19-02)

Abstract: Meeting the growing demand for mining massive data sets requires a distributed association rule mining algorithm that can run across multiple machines. Apriori is a highly iterative algorithm, and on the Hadoop platform each iteration performs a large number of disk I/O operations, which severely limits its efficiency. This paper exploits Spark's built-in support for distributed in-memory computing to design and implement a distributed association rule mining algorithm on the Spark platform, called Staged Adaptive Apriori. The algorithm mines frequent itemsets efficiently using an adaptive strategy of partially processing the data set. Before each iteration it makes a preliminary estimate of the execution time and selects the more suitable method to reduce time and space complexity, so that its behavior adapts to the properties of the data set. Experimental results demonstrate the effectiveness of the algorithm.

Key words: association rule mining, Apriori algorithm, MapReduce, Spark
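
For readers unfamiliar with why Apriori is described as highly iterative, the following is a minimal single-machine sketch of the classic level-wise Apriori procedure. It is not the Staged Adaptive Apriori algorithm of this paper, and it does not use Spark; the function name `apriori` and the parameter `min_support` (an absolute count) are illustrative assumptions. It only shows that every iteration k requires another full scan of the transactions, which is the step that becomes costly when each scan goes through disk.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support):
    """Classic level-wise Apriori (illustrative sketch, not the paper's method).

    Returns a dict mapping each frequent itemset (frozenset) to its
    support count, keeping itemsets with count >= min_support.
    """
    transactions = [frozenset(t) for t in transactions]

    # Iteration 1: count single items with one scan of the data.
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]): c for i, c in item_counts.items()
                if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets and prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Support counting: one more full scan of the data per iteration.
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1

    return result
```

A call such as `apriori([["a", "b"], ["a", "c"], ["a", "b", "c"]], 2)` returns the support counts of all itemsets that appear in at least two transactions. An adaptive variant in the spirit of the abstract would decide, before each pass of the `while` loop, which counting strategy to use; how the paper makes that decision is not specified here.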

CLC Number: