Optimization Method of Hadoop File Archiving Based on LZO

Computer and Modernization, 2023, Vol. 0, Issue (06): 1-6. doi: 10.3969/j.issn.1006-2475.2023.06.001

• Algorithm Design and Analysis •

ZHANG Jun¹, SU Wen-hao²

1. School of Software, East China University of Technology, Nanchang 330013, China;
2. School of Information Engineering, East China University of Technology, Nanchang 330013, China

Received: 2022-11-09 Revised: 2023-01-19 Online: 2023-06-28 Published: 2023-06-28
About the authors: ZHANG Jun (born 1978), male (Tujia ethnicity), from Changde, Hunan; associate professor, master's supervisor, Ph.D.; research interests: processor/memory performance and power optimization, natural language processing; E-mail: zhangjun_whu@whu.edu.cn. SU Wen-hao (born 1997), male, from Changyi, Shandong; master's student; research interests: big data, big data storage; E-mail: 35165003@qq.com.
Funding:
National Natural Science Foundation of China (62162002, 61662002, 61902189); Natural Science Foundation of Jiangxi Province (20212BAB202002); Open Fund of the Jiangxi Engineering Laboratory on Radioactive Geoscience and Big Data Technology (JELRGBDT201905); Basic Science (Natural Science) Research Project of Higher Education Institutions of Jiangsu Province (22KJA520004)

Abstract

Abstract: The Hadoop distributed framework is widely used for big data processing across many industries. However, storing a large number of small files generates a correspondingly large amount of metadata, which leads to excessive memory usage in the NameNode and impairs its ability to provide high-performance, highly concurrent access. Archiving small files is a common and effective solution to this problem. In addition, since data compression can effectively reduce both storage space and network transmission load, this paper proposes LA (LZO-Archive), a Hadoop file archiving optimization method based on the real-time lossless compression algorithm LZO. LA first archives and merges small files, and incorporates LZO compression into the index file generation stage to reduce index file generation time. Moreover, a file compression storage algorithm is designed in LA to compress and store the data files and index files, which effectively reduces the disk space occupied in the DataNode and the memory space occupied in the NameNode. This paper also describes the design and implementation of the experimental method for LA. Experimental results show that, compared with the original HDFS data storage method, the baseline file archiving method HAR, and the comparison method LHF, the proposed LA method performs better in terms of file archiving time, NameNode memory usage, DataNode disk space usage, and file access time.

Key words: Hadoop, HDFS, small files, archiving storage, LZO algorithm
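
The archive-then-compress idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' LA implementation, whose details the abstract does not give. It assumes a toy in-memory layout (one merged data blob plus a name-to-offset index), and it uses Python's standard `zlib` as a stand-in for LZO, since LZO is not in the standard library. The names `build_archive` and `read_file` are hypothetical.

```python
import json
import zlib


def build_archive(files):
    """Merge small files into one data blob plus an index, then compress both.

    `files` maps a file name to its bytes. zlib is a stand-in for LZO here.
    """
    index = {}
    parts = []
    offset = 0
    for name in sorted(files):
        payload = files[name]
        # The index records where each small file sits inside the merged blob.
        index[name] = (offset, len(payload))
        parts.append(payload)
        offset += len(payload)
    data = b"".join(parts)
    index_bytes = json.dumps(index).encode("utf-8")
    # Compress the data file and the index file separately.
    return zlib.compress(data), zlib.compress(index_bytes)


def read_file(archive, name):
    """Look up one small file through the compressed index."""
    data_z, index_z = archive
    index = json.loads(zlib.decompress(index_z).decode("utf-8"))
    offset, length = index[name]
    # Simplification: the whole blob is decompressed for a single read.
    data = zlib.decompress(data_z)
    return data[offset:offset + length]


# Hypothetical workload: 100 small, repetitive text files.
small_files = {
    f"log_{i:03d}.txt": (f"record {i}\n" * 50).encode("utf-8")
    for i in range(100)
}
archive = build_archive(small_files)
raw_size = sum(len(v) for v in small_files.values())
packed_size = len(archive[0]) + len(archive[1])
print(raw_size, packed_size)
```

In this sketch the 100 files become two objects, which mirrors how archiving reduces the number of entries the NameNode must track, while compression reduces the bytes stored on the DataNode. Decompressing the entire blob on every read is a simplification made for brevity; a practical design would more likely compress in independent blocks so that a single file can be fetched without touching the rest.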

CLC Number:

TP311
TP333

ZHANG Jun, SU Wen-hao. Optimization Method of Hadoop File Archiving Based on LZO[J]. Computer and Modernization, 2023, 0(06): 1-6.
