Computer and Modernization


Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop

  

  1. Information Centre, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China
  • Received: 2015-10-28 Online: 2016-01-22 Published: 2016-01-26

Abstract: In MapReduce processing, a Reduce task must pull the data produced in the Map stage when it executes, which consumes a large amount of network bandwidth and can even cause network bottlenecks. We therefore propose a task allocation strategy based on data localization and load balance. First, the user sets the sampling variable M; the Map function is executed in the Map stage, and the first M data blocks are selected for sampling. Next, the system assigns tasks by considering data localization and the sampling results. The system then assigns tasks by considering load balance. After these two steps, the system generates a task allocation table. Finally, the system executes the Reduce tasks and begins to pull data; subsequent tasks are assigned according to the task allocation table. Experiments show that assigning tasks based on data localization and load balance not only reduces data transmission and network bandwidth consumption in the Shuffle stage, but also avoids the situation in which some nodes hold many tasks while others hold none. The strategy thus improves the overall data processing capacity.
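
The allocation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use any Hadoop API: the function name `build_allocation_table`, the input layout (one `(node, {partition: bytes})` pair per sampled Map output), the largest-partition-first order, and the per-node task cap used for load balance are all assumptions made for illustration.

```python
from collections import defaultdict


def build_allocation_table(sampled_blocks, nodes, max_per_node=None):
    """Map each Reduce partition to a node (hypothetical helper).

    sampled_blocks: list of (node, {partition: bytes}) pairs taken from the
                    first M Map outputs (the sample).
    nodes:          list of node names available for Reduce tasks.
    max_per_node:   assumed load-balance cap on Reduce tasks per node;
                    defaults to ceil(partitions / nodes).
    """
    # Sampling result: bytes of each partition that already sit on each node.
    volume = defaultdict(lambda: defaultdict(int))
    for node, partition_sizes in sampled_blocks:
        for partition, size in partition_sizes.items():
            volume[partition][node] += size

    if max_per_node is None:
        max_per_node = -(-len(volume) // len(nodes))  # ceiling division

    load = {node: 0 for node in nodes}
    table = {}

    # Place the largest partitions first, since they save the most traffic.
    for partition in sorted(volume, key=lambda p: -sum(volume[p].values())):
        # Localization: prefer the node holding the most data for this
        # partition. Load balance: break ties by current load and skip
        # nodes that have reached the cap.
        ranked = sorted(
            nodes, key=lambda n: (-volume[partition].get(n, 0), load[n])
        )
        for node in ranked:
            if load[node] < max_per_node:
                table[partition] = node
                load[node] += 1
                break
    return table
```

With the default cap, every partition receives a node because the total capacity is at least the number of partitions; a smaller, user-supplied cap may leave some partitions unassigned. The returned dictionary plays the role of the task allocation table that later Reduce task assignments consult.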

Key words:  sampling, MapReduce, localization, task allocation, load balance
