Computer and Modernization ›› 2013, Vol. 1 ›› Issue (1): 5-7.

• Algorithm Design and Analysis •

Research on a Word Segmentation Dictionary Mechanism Based on Binary-String Trie Index Trees

HAN Ying1, FENG Jilin1, YUAN Jing2, WANG Maofa1, ZHANG Yanxia1, CHEN Fuming1, CHEN Xinfang1, PAN Zhian1

  1. Department of Disaster Information Engineering, Institute of Disaster Prevention, Beijing 101601, China; 2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2012-08-03 Revised: 1900-01-01 Online: 2013-02-06 Published: 2013-02-06

Abstract:

The segmentation dictionary is a basic component of a Chinese word segmentation system, and its query efficiency directly affects the performance of the information processing system built on it. Because all information in a computer is stored as binary code, this paper converts the processing of character strings (in any language) into the processing of binary strings and builds a segmentation dictionary mechanism on a Trie index tree. The length of the binary string can be adjusted automatically to suit the requirements of different applications, yielding different Trie structures and making it easier to find a suitable balance between storage space and query efficiency. The query speed of this index-based lookup is independent of the number of words in the dictionary and depends only on the length of the word being looked up. In addition, because words share common prefix index values, the memory saved grows as the vocabulary grows.

Key words: Chinese information processing, Chinese word segmentation, dictionary mechanism, Trie
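
The idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes words are encoded as UTF-8, and it interprets the adjustable binary-string length as the number of bits consumed per Trie level. The names `BitTrie`, `bits_per_level`, and `node_count` are hypothetical and chosen only for this illustration.

```python
_END = "$"  # end-of-word marker; chunk keys are ints, so this cannot collide


class BitTrie:
    """Trie keyed on fixed-width chunks of a word's UTF-8 bit string."""

    def __init__(self, bits_per_level=4):
        # Assumption: widths that divide 8 keep every byte chunk-aligned.
        if bits_per_level not in (1, 2, 4, 8):
            raise ValueError("bits_per_level must be 1, 2, 4 or 8")
        self.bits = bits_per_level
        self.root = {}        # chunk value -> child node (a dict)
        self.node_count = 1   # counts the root

    def _chunks(self, word):
        """Yield the word's bytes as integers of `self.bits` bits each."""
        mask = (1 << self.bits) - 1
        for byte in word.encode("utf-8"):
            # Walk each byte from its most significant chunk to its least.
            for shift in range(8 - self.bits, -1, -self.bits):
                yield (byte >> shift) & mask

    def insert(self, word):
        node = self.root
        for chunk in self._chunks(word):
            if chunk not in node:
                node[chunk] = {}
                self.node_count += 1
            node = node[chunk]
        node[_END] = True

    def __contains__(self, word):
        # The loop runs 8 * len(bytes) / bits times, whatever the dictionary size.
        node = self.root
        for chunk in self._chunks(word):
            node = node.get(chunk)
            if node is None:
                return False
        return _END in node


words = ["分词", "分词词典", "trie", "tree"]
tries = {}
for width in (1, 4, 8):
    tries[width] = BitTrie(width)
    for w in words:
        tries[width].insert(w)

print("分词" in tries[4], "分" in tries[4])
print({width: t.node_count for width, t in tries.items()})
```

In this sketch a narrower chunk width gives a deeper tree with more, smaller nodes, while a wider one gives a shallower tree whose nodes may each hold up to 2^width children. This is one way to read the space/time trade-off the abstract describes. Words with a common prefix, such as "分词" and "分词词典", share the nodes along that prefix.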