Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop

doi:10.3969/j.issn.1006-2475.2016.01.023

Abstract

In MapReduce processing, when Reduce tasks start and pull the Map-stage output from remote nodes, they consume a large amount of network bandwidth and can even cause a network bottleneck. This paper proposes a task allocation strategy based on data localization and load balancing. First, the user sets the sampling size M, and the first M data blocks are sampled during the Map stage. Second, Reduce tasks are assigned according to the sampling results, taking data localization into account. Third, the Reduce tasks are reassigned for load balancing, and the system generates a task allocation table from these two assignment steps. Finally, the Reduce tasks are started and the system begins pulling data; data that were not sampled are assigned according to the task allocation table. Extensive experiments show that the strategy reduces both the amount of data transferred in the Shuffle stage and the network bandwidth consumed, and avoids the situation in which some nodes sit idle while others are overloaded or even unable to finish their tasks, thereby improving the overall data processing capacity of the cluster.

Key words: sampling, MapReduce, localization, task allocation, load balancing
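
The abstract does not spell out the exact assignment rules, so the following is only a minimal Python sketch of how a task allocation table could be built from sampled data. Everything in it is an assumption made for illustration: the function name `build_allocation_table`, the input format (sampled bytes per node and Reduce partition), the locality rule (place each Reduce task on the node that holds most of its sampled data), and the balancing rule (cap each node at the ceiling of tasks divided by nodes, moving the task that loses the least local data). It is not the paper's implementation.

```python
from collections import defaultdict
from math import ceil


def build_allocation_table(sampled_bytes, nodes):
    """Return {partition: node} built from sampled Map output sizes.

    sampled_bytes maps (node, partition) -> bytes of sampled Map output
    for that Reduce partition stored on that node (hypothetical format).
    """
    # Per-partition view: sampled bytes each node already holds locally.
    local = defaultdict(lambda: defaultdict(int))
    for (node, partition), size in sampled_bytes.items():
        local[partition][node] += size

    # Step 1 (locality): put each Reduce task on the node with most of its data.
    # Ties go to the node listed first in `nodes`.
    table = {}
    for p in sorted(local):
        table[p] = max(nodes, key=lambda n: local[p][n])

    # Step 2 (load balance): no node may hold more than `cap` Reduce tasks.
    cap = ceil(len(table) / len(nodes))
    load = {n: 0 for n in nodes}
    for n in table.values():
        load[n] += 1

    for node in nodes:
        while load[node] > cap:
            target = min(nodes, key=lambda n: load[n])  # least-loaded node
            mine = [p for p in sorted(table) if table[p] == node]
            # Move the task whose relocation gives up the least local data.
            victim = min(mine, key=lambda p: local[p][node] - local[p][target])
            table[victim] = target
            load[node] -= 1
            load[target] += 1
    return table


if __name__ == "__main__":
    nodes = ["n1", "n2"]
    sampled = {
        ("n1", 0): 100, ("n2", 0): 10,
        ("n1", 1): 80,  ("n2", 1): 70,
        ("n1", 2): 5,   ("n2", 2): 90,
        ("n1", 3): 60,  ("n2", 3): 20,
    }
    print(build_allocation_table(sampled, nodes))
```

In this toy input, locality alone would place three of the four Reduce tasks on `n1`; the balancing step moves partition 1 to `n2` because it has the smallest gap between local and remote data. Output from blocks that were not sampled would then be routed by looking up its partition in the returned table.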

CLC number:

TP311

WANG Hao. Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop[J]. Computer and Modernization, doi: 10.3969/j.issn.1006-2475.2016.01.023.

