Distributed Big Data Machine Learning Algorithms Based on Spark

doi:10.3969/j.issn.1006-2475.2018.11.021

Abstract

Machine learning is an indispensable tool for analyzing big data. More training data can improve model accuracy, but complex machine learning algorithms then face time and performance constraints that call for key technologies such as distributed in-memory computing. Spark's distributed in-memory computing allows an algorithm to run in parallel, which helps machine learning algorithms process large data sets. This paper therefore presents nonlinear machine learning algorithms implemented in a Spark distributed in-memory environment: a multi-layer variable neural network, BPPGD SVM, and K-means. Building on these implementations, we apply optimizations in data compression, biased data sampling, or data loading. A SparkML scheduling framework is also implemented to dispatch the optimized algorithms. The experimental results show that, across the three optimized algorithms, the average error is reduced by 40% and the average time by 90%.

Key words: data compression, biased sampling, stochastic gradient descent, neural network, support vector machine

CLC Number: TP183

WANG Rui1, HAN Rui2, JIA Yu-xiang1. Distributed Big Data Machine Learning Algorithms Based on Spark[J]. Computer and Modernization, 2018, 0(11): 119-.
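
The abstract states that Spark can run an algorithm's operations in parallel over a large data set. As a rough illustration of that data-parallel pattern only, the sketch below performs one K-means iteration in plain Python. It is not the paper's implementation and calls no Spark API: the partitions are simulated as ordinary lists, and the names `assign_partition`, `merge`, and `kmeans_step` are hypothetical. Each partition computes per-cluster partial sums independently (the map step), the partial results are merged (the reduce step), and the centroids are then recomputed.

```python
from collections import defaultdict
from math import dist


def assign_partition(points, centroids):
    """Map step: per-cluster (sum vector, count) for one data partition."""
    dim = len(centroids[0])
    partial = defaultdict(lambda: ([0.0] * dim, 0))
    for p in points:
        # Index of the nearest centroid by Euclidean distance.
        k = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
        s, n = partial[k]
        partial[k] = ([a + b for a, b in zip(s, p)], n + 1)
    return dict(partial)


def merge(a, b):
    """Reduce step: combine two partial results cluster by cluster."""
    out = dict(a)
    for k, (s, n) in b.items():
        if k in out:
            s0, n0 = out[k]
            out[k] = ([x + y for x, y in zip(s0, s)], n0 + n)
        else:
            out[k] = (s, n)
    return out


def kmeans_step(partitions, centroids):
    """One iteration: map over partitions, reduce, recompute centroids."""
    total = {}
    for part in partitions:  # in a cluster, each partition runs on a worker
        total = merge(total, assign_partition(part, centroids))
    new = list(centroids)  # clusters with no points keep their old centroid
    for k, (s, n) in total.items():
        new[k] = tuple(x / n for x in s)
    return new


# Toy data (assumed for illustration): two partitions, two initial centroids.
partitions = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
centroids = [(1.0, 1.0), (9.0, 9.0)]
new_centroids = kmeans_step(partitions, centroids)
```

Because only the small per-cluster sums and counts cross partition boundaries, the raw points never have to be gathered in one place, which is what makes this style of computation suitable for a cluster.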

