计算机与现代化 ›› 2024, Vol. 0 ›› Issue (02): 36-42.doi: 10.3969/j.issn.1006-2475.2024.02.006

• 算法分析与设计 • 上一篇    下一篇

基于CRF的分区倒排索引压缩算法

  

  1. 北京交通大学计算机与信息技术学院,北京 100044)
  • 出版日期:2024-02-19 发布日期:2024-03-19
  • 作者简介:作者简介:王子琛(1998—),男,北京人,硕士研究生,研究方向:信息检索,E-mail: 20120425@bjtu.edu.cn; 通信作者:瞿有利(1974—),男,副教授,博士,研究方向:信息检索,E-mail: ylqu@bjtu.edu.cn。
  • 基金资助:
    国家自然科学基金资助项目(61976015)

A Partition Inverted Index Compression Algorithm Based on CRF

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Online:2024-02-19 Published:2024-03-19

摘要: 摘要:倒排索引是大型搜索引擎的核心数据结构,本质是倒排列表中整数序列的集合。倒排索引压缩可以有效减少倒排索引所占空间,提高对关键词的检索效率。本文提出的基于条件随机场(CRF)的分区倒排索引压缩算法主要关注域值分区的分区方式。该算法对序列进行预分区,并且使用条件随机场对预分区进行标注并重组,有效减少了压缩时间。根据分区类型,该算法使用相应的编码方式,进一步减少了压缩后的空间占用。与其他倒排索引压缩算法进行对比实验分析,结果表明本文算法在压缩率上超过目前一些域值分区的算法,并且在解压时间上与其他域值分区算法相当。该算法在时间和空间上取得了较好的平衡。





关键词: 关键词:倒排索引, 数据压缩, 域值分区, 条件随机场, 搜索引擎

Abstract: Abstract: The inverted index is the core data structure of a large search engine, which is essentially a collection of integer sequences in an inverted table. Inverted index compression can effectively reduce the space occupied by inverted indexes and improve retrieval efficiency of keywords. The partition inverted index compression algorithm based on conditional random field (CRF) mainly focuses on the partition mode of universe partition. The algorithm pre-partitions the sequence and uses conditional random fields to label and reorganize the pre-partitions, which effectively reduces the compression time. According to the partition type, the algorithm uses the corresponding encoding method to further reduce the space occupation after compression. Compared with other inverted index compression algorithms, this algorithm outperforms some current universe partition algorithms in compression ratio, and is equivalent to other universes partition algorithms in decompression time. The algorithm achieves a good balance between time and space.

Key words: Key words: inverted index, data compression, universe partition, conditional random field, search engines

中图分类号: