Computer and Modernization ›› 2024, Vol. 0 ›› Issue (02): 36-42.doi: 10.3969/j.issn.1006-2475.2024.02.006

Previous Articles     Next Articles

A Partition Inverted Index Compression Algorithm Based on CRF

  

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Online:2024-02-19 Published:2024-03-19

Abstract: Abstract: The inverted index is the core data structure of a large search engine, which is essentially a collection of integer sequences in an inverted table. Inverted index compression can effectively reduce the space occupied by inverted indexes and improve retrieval efficiency of keywords. The partition inverted index compression algorithm based on conditional random field (CRF) mainly focuses on the partition mode of universe partition. The algorithm pre-partitions the sequence and uses conditional random fields to label and reorganize the pre-partitions, which effectively reduces the compression time. According to the partition type, the algorithm uses the corresponding encoding method to further reduce the space occupation after compression. Compared with other inverted index compression algorithms, this algorithm outperforms some current universe partition algorithms in compression ratio, and is equivalent to other universes partition algorithms in decompression time. The algorithm achieves a good balance between time and space.

Key words: Key words: inverted index, data compression, universe partition, conditional random field, search engines

CLC Number: