Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop

doi:10.3969/j.issn.1006-2475.2016.01.023

Abstract

In MapReduce processing, a Reduce task must pull its input from the output of the Map stage. This transfer consumes a large amount of network bandwidth and can even cause a network bottleneck. We therefore propose a task allocation strategy based on data localization and load balance. First, the user sets a sampling variable M. The Map function runs in the Map stage, and the first M data blocks are selected for sampling. Next, the system assigns tasks according to data localization and the sampling results. The system then assigns tasks again, this time considering load balance. Together, these two steps produce a task allocation table. Finally, the system executes the Reduce tasks and begins to pull data; subsequent tasks are assigned according to the task allocation table. Experiments show that assigning tasks by data localization and load balance reduces data transmission and network bandwidth consumption in the Shuffle stage, and also avoids the situation in which some nodes hold many tasks while others hold none. The strategy therefore improves overall data processing capacity.
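The allocation steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not give the scoring rule or the balancing rule, so the function name `build_allocation_table`, the input layout (sampled Map output bytes per partition and node), and the per-node task cap used for load balance are all assumptions made for illustration.

```python
import math


def build_allocation_table(sample_sizes, nodes, max_per_node=None):
    """Assign each Reduce partition to a node (hypothetical sketch).

    sample_sizes: {partition: {node: sampled_bytes}}, assumed to come
                  from sampling the first M data blocks in the Map stage.
    nodes:        list of node names in the cluster.
    max_per_node: assumed load-balance rule: a cap on tasks per node.
    """
    if max_per_node is None:
        # Assumption: spread tasks as evenly as possible across nodes.
        max_per_node = math.ceil(len(sample_sizes) / len(nodes))

    load = {n: 0 for n in nodes}
    table = {}

    # Place the partitions with the most sampled data first, since a
    # bad placement for them would move the most bytes over the network.
    order = sorted(sample_sizes,
                   key=lambda p: (-sum(sample_sizes[p].values()), p))

    for p in order:
        local = sample_sizes[p]
        # Load balance: only nodes under the cap are candidates; fall
        # back to all nodes if the cap is too small for the workload.
        candidates = [n for n in nodes if load[n] < max_per_node] or list(nodes)
        # Localization: prefer the node already holding the most data
        # for this partition, then the node with the lightest load.
        best = max(candidates, key=lambda n: (local.get(n, 0), -load[n]))
        table[p] = best
        load[best] += 1

    return table
```

With three nodes and three partitions whose sampled data sits mostly on one node, the cap keeps that node from receiving every task, while the largest partition still stays local to it.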


Key words: sampling, MapReduce, localization, task allocation, load balance

CLC Number: TP311

WANG Hao. Scheduling Strategy of Reduce Task Based on Data Localization in Hadoop[J]. Computer and Modernization, doi: 10.3969/j.issn.1006-2475.2016.01.023.
