Distributed Big Data Machine Learning Algorithms Based on Spark

doi:10.3969/j.issn.1006-2475.2018.11.021

Computer and Modernization ›› 2018, Vol. 0 ›› Issue (11): 119-.

• Data Mining •

(1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, Henan, China; 2. Research Center for Advanced Computer Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)

Received: 2018-05-03 Online: 2018-11-22 Published: 2018-11-23
About the authors: WANG Rui (1996-), female, from Panjin, Liaoning; undergraduate student at the School of Information Engineering, Zhengzhou University; research interest: machine learning. HAN Rui (1985-), male, from Wuhan, Hubei; assistant researcher at the Research Center for Advanced Computer Systems, Institute of Computing Technology, Chinese Academy of Sciences; Ph.D.; research interests: cloud computing and data center software systems.

Abstract

Machine learning is indispensable for analyzing big data, and large-scale data can in turn improve model accuracy. However, complex machine learning algorithms urgently need distributed in-memory computing as a key technology, in terms of both running time and performance. Spark's distributed in-memory computing can run algorithms in parallel, which helps machine learning algorithms process large data sets. This paper therefore implements nonlinear machine learning algorithms in the Spark distributed in-memory environment, including a multi-layer variable neural network, BPPGD SVM, and K-means, and on top of these implementations applies optimizations in data compression, biased data sampling, or data loading. To make full use of allocated resources when running scripts in batches, the paper also implements a SparkML scheduling framework to schedule the optimized algorithms. Experimental results show that, for the three optimized algorithms, the average error is reduced by 40% and the average running time is reduced by 90%.

Key words: data compression, biased sampling, stochastic gradient descent, neural network, support vector machine

CLC Number: TP183

WANG Rui1, HAN Rui2, JIA Yu-xiang1. Distributed Big Data Machine Learning Algorithms Based on Spark[J]. Computer and Modernization, 2018, 0(11): 119-.

