Hybrid Join Algorithm Based on MapReduce

Computer and Modernization, 2015, Vol. 0, Issue (6): 86-91. doi: 10.3969/j.issn.1006-2475.2015.06.019

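(College of Computer Science, Chongqing University, Chongqing 400044, China)

Received: 2015-01-27 Online: 2015-06-16 Published: 2015-06-18

About the authors: HU Long (b. 1988), male, from Bozhou, Anhui, is a master's student at the College of Computer Science, Chongqing University; his research interest is database application systems. LUO Jun (b. 1962), male, from Chongqing, is an associate professor and master's supervisor; his research interests include large-scale networks and databases, modeling and design of large MIS systems, database application systems, and office automation.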

Abstract

Abstract: Hive, the data warehouse that runs on Hadoop, lets more users process Hadoop data through an SQL-like interface. However, Hive does not provide an efficient approach to joins, which are common but very time-consuming operations in Hadoop. To address join performance in Hadoop, this paper proposes HJA, a hybrid-strategy join algorithm that selects the relatively more suitable algorithm from among several join algorithms according to the current application scenario. Experimental results show that HJA performs well in most Hadoop scenarios.

Key words: MapReduce, Hadoop, partition join, autotuning, Hive

CLC number: TP391

HU Long, LUO Jun. Hybrid Join Algorithm Based on MapReduce[J]. Computer and Modernization, 2015, 0(6): 86-91.

