Distributed Stage Adaptive Association Rules Mining Algorithm Based on Spark

doi:10.3969/j.issn.1006-2475.2019.12.007

Computer and Modernization, 2019, Vol. 0, Issue (12): 31-. doi: 10.3969/j.issn.1006-2475.2019.12.007


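(1. Department of Information Engineering, Shanwei Polytechnic, Shanwei 516600, Guangdong, China; 2. Huawei Technologies Co., Ltd., Shenzhen 518129, Guangdong, China)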

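Received: 2019-05-08 Online: 2019-12-11 Published: 2019-12-11
About the authors: SHI Hui (b. 1983), female, from Xianning, Hubei, lecturer, master's degree; research interests: cloud computing and big data, intelligent algorithms; E-mail: 455987511@qq.com. CHEN En (b. 1984), male, senior engineer; research interests: system architecture design and analysis, artificial intelligence; E-mail: 363196200@qq.com.
Supported by:
2016 Vertical Research Planning Project of the National Research Association for Computer Basic Education in Colleges and Universities (2016GHB02005); 2019 Education and Research Project of the Guangdong Higher Vocational Cloud Computing and Big Data Professional Committee (GDYJSKT19-02)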


Abstract

To meet the growing demand for mining massive data sets, a distributed association rule mining algorithm that can run across multiple machines is needed. Apriori is a highly iterative algorithm; when it runs on the Hadoop platform, every iteration performs a large amount of disk I/O, which severely limits its efficiency. This paper exploits Spark's built-in support for distributed computing to design and implement a distributed association rule mining algorithm on the Spark platform, called Staged Adaptive Apriori. The algorithm mines frequent itemsets efficiently using an adaptive strategy of partially processing the data set. Before each iteration it makes a preliminary estimate of the execution time and adopts the more suitable method, reducing time and space complexity. It is thus an adaptive association rule mining algorithm driven by the properties of the data set. Experimental results demonstrate the effectiveness of the algorithm.

Key words: association rule mining, Apriori, MapReduce, Spark
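
For readers unfamiliar with why Apriori is called highly iterative, the sketch below shows the classic level-wise loop in plain Python. It is an illustration only, not the authors' Staged Adaptive Apriori: the abstract does not specify how execution time is estimated or which methods are chosen between, so that adaptive step is not reproduced here. The names `apriori`, `transactions`, and `min_support` are assumptions made for this example. Each pass of the `while` loop scans every transaction once; on Hadoop each such scan would be a separate job reading from disk, which is the cost the abstract refers to.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: count} for every itemset whose count >= min_support.

    Plain single-machine Apriori, used here only to show the iteration
    structure; min_support is an absolute transaction count.
    """
    transactions = [frozenset(t) for t in transactions]

    # Iteration 1: count single items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in t)
    current = {frozenset([i]): c for i, c in item_counts.items()
               if c >= min_support}
    frequent = dict(current)

    k = 2
    while current:
        prev = set(current)
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev
                             for s in combinations(c, k - 1))}
        # Support counting: one full scan of the data per iteration.
        counts = Counter(c for t in transactions
                         for c in candidates if c <= t)
        current = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent


if __name__ == "__main__":
    baskets = [
        ["bread", "milk"],
        ["bread", "diaper", "beer", "eggs"],
        ["milk", "diaper", "beer", "cola"],
        ["bread", "milk", "diaper", "beer"],
        ["bread", "milk", "diaper", "cola"],
    ]
    for itemset, count in sorted(apriori(baskets, 3).items(),
                                 key=lambda kv: (len(kv[0]), sorted(kv[0]))):
        print(sorted(itemset), count)
```

In a Spark setting the transactions would typically be held in a cached distributed collection so that the per-iteration scan reads from memory rather than disk; the loop structure itself stays the same.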

CLC number: TP301.6

Cite this article: SHI Hui¹, CHEN En². Distributed Stage Adaptive Association Rules Mining Algorithm Based on Spark[J]. Computer and Modernization, 2019, 0(12): 31-.


