Distributed Stage Adaptive Association Rules Mining Algorithm Based on Spark

doi:10.3969/j.issn.1006-2475.2019.12.007

Abstract

To meet the growing demand for mining massive data sets, a distributed association rule mining algorithm that can run across multiple machines is needed. Apriori is a highly iterative algorithm; when it runs on the Hadoop platform, each iteration performs a large number of disk I/O operations, which severely limits its efficiency. This paper exploits Spark's support for distributed in-memory computation to design and implement a distributed association rule mining algorithm, Staged Adaptive Apriori, on the Spark platform. The algorithm uses an adaptive strategy of partially processing the data set to mine frequent itemsets efficiently. Before each iteration it estimates the execution time and adopts the more appropriate method, reducing time and space complexity; in this sense the algorithm adapts to the properties of the data set. Experimental results demonstrate the effectiveness of the algorithm.
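To illustrate why Apriori is described as highly iterative, the following is a minimal single-machine sketch of the classic level-wise Apriori loop in plain Python. It is not the Staged Adaptive Apriori algorithm, whose details are not given in this abstract, and it does not use Spark; the function name `apriori` and the parameter `min_support` (an absolute count) are assumptions made for illustration. Each pass of the `while` loop rescans all transactions, which is the step that costs a round of disk I/O per iteration on Hadoop.

```python
from collections import Counter
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for every frequent itemset.

    Illustrative sketch of classic Apriori only; `min_support` is an
    absolute transaction count (an assumption of this example).
    """
    transactions = [frozenset(t) for t in transactions]

    # Pass 1: count single items and keep the frequent ones.
    item_counts = Counter(item for t in transactions for item in t)
    frequent = {frozenset([i]): c for i, c in item_counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets and prune any
        # candidate that has an infrequent (k-1)-subset.
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(s) in frequent for s in combinations(union, k - 1)
            ):
                candidates.add(union)

        # Support counting: one full scan of the data per iteration.
        counts = Counter()
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1

        frequent = {c: n for c, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1

    return result
```

For example, calling `apriori([["a", "b", "c"], ["a", "b"], ["a", "c"], ["b", "c"], ["a", "b", "c"]], 3)` finds the three single items and the three pairs as frequent, while the triple `{a, b, c}` is generated as a candidate but rejected because it appears in only two transactions.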

Key words: association rule mining, Apriori, MapReduce, Spark

CLC Number: TP301.6

SHI Hui, CHEN En. Distributed Stage Adaptive Association Rules Mining Algorithm Based on Spark[J]. Computer and Modernization, 2019, 0(12): 31-.

