计算机与现代化 (Computer and Modernization) ›› 2023, Vol. 0 ›› Issue (06): 1-6. doi: 10.3969/j.issn.1006-2475.2023.06.001

• Algorithm Design and Analysis •

Optimization Method of Hadoop File Archiving Based on LZO

ZHANG Jun1, SU Wen-hao2

  1. School of Software, East China University of Technology, Nanchang 330013, China;
  2. School of Information Engineering, East China University of Technology, Nanchang 330013, China
  • Received: 2022-11-09 Revised: 2023-01-19 Online: 2023-06-28 Published: 2023-06-28
  • About the authors: ZHANG Jun (born 1978), male (Tujia ethnicity), from Changde, Hunan; associate professor, master's supervisor, Ph.D.; research interests: processor/memory performance and power optimization, natural language processing; E-mail: zhangjun_whu@whu.edu.cn. SU Wen-hao (born 1997), male, from Changyi, Shandong; master's student; research interests: big data, big data storage; E-mail: 35165003@qq.com.
  • Funding:
    National Natural Science Foundation of China (62162002, 61662002, 61902189); Natural Science Foundation of Jiangxi Province (20212BAB202002); Open Fund of the Jiangxi Engineering Laboratory on Radioactive Geoscience and Big Data Technology (JELRGBDT201905); Basic Science (Natural Science) Research Project of Higher Education Institutions of Jiangsu Province (22KJA520004)

Abstract: The Hadoop distributed framework is widely used for big data processing across many industries. However, storing a large number of small files in Hadoop generates a correspondingly large amount of metadata, which leads to excessive memory usage in the NameNode and impairs its ability to provide high-performance, highly concurrent access. Archiving small files is a common and effective solution to this problem. Because data compression can also reduce both storage space and network transmission load, this paper proposes LA (LZO-Archive), a Hadoop file archiving optimization method based on the real-time lossless compression algorithm LZO. LA first archives and merges the small files, and integrates LZO compression into the index file generation stage to reduce the time needed to generate index files. In addition, LA includes a file compression storage algorithm that compresses and stores both data files and index files, reducing the disk space occupied on DataNodes and the memory occupied on the NameNode. This paper also describes the design and implementation of the experimental method for LA. Experimental results show that, compared with the original HDFS data storage method, the baseline file archiving method HAR, and the comparison method LHF, LA performs better in file archiving time, NameNode memory usage, DataNode disk space usage, and file access time.

Key words: Hadoop, HDFS, small files, archiving storage, LZO algorithm
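
The general idea summarized in the abstract (merge many small files into one data file, keep an index for lookup, and compress both) can be illustrated with a minimal sketch. This is not the paper's LA implementation: the function names `build_archive` and `read_file`, the JSON index layout, and the per-file compression granularity are all assumptions made for illustration, and `zlib` is used only as a stand-in codec because LZO is not part of the Python standard library.

```python
import json
import zlib  # stand-in codec (assumption); the paper uses LZO, which is not in the stdlib


def build_archive(files: dict[str, bytes]) -> tuple[bytes, bytes]:
    """Merge small files into one data blob plus a compressed index (hypothetical layout)."""
    parts = []
    index = {}
    offset = 0
    for name, payload in sorted(files.items()):
        packed = zlib.compress(payload)        # compress each small file on its own
        index[name] = [offset, len(packed)]    # where its compressed bytes live in the blob
        parts.append(packed)
        offset += len(packed)
    data_blob = b"".join(parts)
    # The index itself is also stored compressed.
    index_blob = zlib.compress(json.dumps(index).encode("utf-8"))
    return data_blob, index_blob


def read_file(data_blob: bytes, index_blob: bytes, name: str) -> bytes:
    """Look up one file through the index and decompress only its slice."""
    index = json.loads(zlib.decompress(index_blob).decode("utf-8"))
    offset, length = index[name]
    return zlib.decompress(data_blob[offset:offset + length])
```

With this layout, a store holding many small files would track one data blob and one index blob instead of one entry per file, which is the kind of metadata reduction that motivates archiving; compressing each file separately keeps single-file reads from having to decompress the whole archive.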

CLC number: