Computer and Modernization


Distributed Big Data Machine Learning Algorithms Based on Spark

  

  (1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China;
    2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China)
  • Received: 2018-05-03  Online: 2018-11-22  Published: 2018-11-23

Abstract: Machine learning is an indispensable tool for analyzing big data. More data can improve model accuracy, but complex machine learning algorithms then need key technologies such as distributed in-memory computing to keep time and performance acceptable. Spark's distributed in-memory computing can run an algorithm in parallel, which helps machine learning algorithms process large data sets. This paper therefore presents nonlinear machine learning algorithms implemented in the Spark distributed in-memory environment, including a multi-layer variable neural network, BPPGD SVM, and K-means. Building on these implementations, we optimize data compression, biased data sampling, or data loading. We also implement the SparkML scheduling framework to dispatch the optimized algorithms. Experimental results show that, across the three optimized algorithms, the average error is reduced by 40% and the average time by 90%.

Key words: data compression, biased sampling, stochastic gradient descent, neural network, support vector machine

CLC Number: