Computer and Modernization

• Network and Communication •

Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop

  1. Information Centre, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China
  • Received: 2015-10-28 Online: 2016-01-22 Published: 2016-01-26
  • About the author: WANG Hao (b. 1987), male, born in Taiyuan, Shanxi; assistant engineer at the Information Centre, The Second Affiliated Hospital of Chongqing Medical University; master's degree; research interest: big data.
  • Supported by:
    Chongqing Science and Technology Program project (cstc2013jcsf10034)

Abstract: In MapReduce processing, when Reduce tasks start and pull the Map-stage output from remote nodes, they consume a large amount of network bandwidth and can even cause a network bottleneck. This paper proposes a task allocation strategy based on data localization and load balancing. First, the user sets a sample size M, and the first M data blocks are sampled during the Map stage. Second, Reduce tasks are assigned according to the sampling results, taking data localization into account. Third, the Reduce tasks are reassigned on the basis of load balancing; together, these two assignment steps produce a task allocation table. Finally, the Reduce tasks are started and the system begins pulling data, with the data that was not sampled assigned according to the task allocation table. Extensive experiments show that the strategy reduces the volume of data transferred in the Shuffle stage and lowers network bandwidth consumption, while avoiding the situation in which some nodes sit idle and others are overloaded or even unable to finish their tasks, thereby improving the overall data-processing capacity of the cluster.

Key words: sampling, MapReduce, localization, task allocation, load balance
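
The allocation steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `build_allocation_table`, the `slack` threshold, the input layout (sampled bytes per node and partition), and the rule for choosing which task to move are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import defaultdict


def build_allocation_table(sampled, nodes, slack=1.5):
    """Return {partition: node} from sampled Map-output sizes.

    sampled: {node: {partition: bytes}} measured on the first M blocks.
    slack:   hypothetical tolerance; a node is treated as overloaded when
             its load exceeds slack * (total bytes / number of nodes).
    """
    # Total sampled bytes per partition, summed over all nodes.
    totals = defaultdict(int)
    for per_node in sampled.values():
        for part, size in per_node.items():
            totals[part] += size

    # Step 1 (locality): place each Reduce task on the node that already
    # holds the largest share of its input, so less data is pulled remotely.
    table = {}
    load = {n: 0 for n in nodes}
    for part in sorted(totals):
        best = max(nodes, key=lambda n: sampled.get(n, {}).get(part, 0))
        table[part] = best
        load[best] += totals[part]

    # Step 2 (load balance): move tasks from the busiest node to the
    # idlest one until no node exceeds the limit or a move would not help.
    limit = slack * sum(totals.values()) / len(nodes)
    for _ in range(len(table)):  # bounded number of moves
        busiest = max(nodes, key=lambda n: load[n])
        idlest = min(nodes, key=lambda n: load[n])
        if load[busiest] <= limit:
            break
        # Give up the task with the least local data on the busiest node,
        # which sacrifices the least locality.
        movable = [p for p, n in table.items() if n == busiest]
        part = min(movable, key=lambda p: sampled.get(busiest, {}).get(p, 0))
        if load[idlest] + totals[part] >= load[busiest]:
            break
        table[part] = idlest
        load[busiest] -= totals[part]
        load[idlest] += totals[part]
    return table


# Hypothetical sample: bytes of Map output per node and partition.
sampled = {
    "n1": {0: 80, 1: 70, 2: 10},
    "n2": {0: 10, 1: 20, 2: 15},
    "n3": {0: 10, 1: 10, 2: 5},
}
table = build_allocation_table(sampled, ["n1", "n2", "n3"])
```

Once the table exists, Map output from blocks that were not sampled is routed by a plain lookup such as `table[partition]`, which corresponds to the final step in the abstract.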

CLC Number: