Research on Dictionary Mechanism Based on Binary of Trie Index Trees

Computer and Modernization ›› 2013, Vol. 1 ›› Issue (1): 5-7.

HAN Ying1, FENG Jilin1, YUAN Jing2, WANG Maofa1, ZHANG Yanxia1, CHEN Fuming1, CHEN Xinfang1, PAN Zhian1

1. Department of Disaster Information Engineering, Institute of Disaster Prevention, Beijing 101601, China; 2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Received: 2012-08-03 Revised: 1900-01-01 Online: 2013-02-06 Published: 2013-02-06

Abstract:

The word segmentation dictionary is a basic component of an information processing system, and its query efficiency directly affects the performance of that system. Because all information in a computer is stored as binary code, this paper converts string processing into binary-string processing, which supports strings in any language, and builds a word segmentation dictionary mechanism based on a Trie index tree. The length of the binary string can be adjusted automatically to suit the requirements of different applications, producing different Trie structures and making it easier to find a suitable balance between storage space and query efficiency. The query speed of this index-based mechanism is independent of the number of words in the dictionary and depends only on the length of the word being looked up. In addition, because words share common prefix index values, the memory saved grows as the vocabulary grows.

Key words: Chinese information processing, Chinese word segmentation, dictionary mechanism, Trie index tree
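
The mechanism summarized in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes UTF-8 as the binary encoding of each word, restricts the adjustable binary-string length to 1, 2, 4, or 8 bits so that chunks align with byte boundaries, and uses hypothetical names (`BitTrie`, `bits`, `insert`, `contains`, `node_count`) chosen only for illustration.

```python
class BitTrie:
    """Trie whose edges are fixed-width bit chunks of a word's UTF-8 bytes.

    Illustrative sketch only; the chunk width `bits` stands in for the
    adjustable binary-string length mentioned in the abstract.
    """

    def __init__(self, bits=4):
        if bits not in (1, 2, 4, 8):
            raise ValueError("bits must be 1, 2, 4, or 8 in this sketch")
        self.bits = bits
        self.fanout = 1 << bits          # number of child slots per node
        self.root = self._new_node()
        self.node_count = 1              # rough proxy for storage cost

    def _new_node(self):
        # A node is [children indexed by chunk value, end-of-word flag].
        return [[None] * self.fanout, False]

    def _chunks(self, word):
        """Yield the word's UTF-8 bytes as a sequence of `bits`-wide values."""
        mask = self.fanout - 1
        for byte in word.encode("utf-8"):
            # Walk each byte from its most significant chunk to its least.
            for shift in range(8 - self.bits, -1, -self.bits):
                yield (byte >> shift) & mask

    def insert(self, word):
        node = self.root
        for chunk in self._chunks(word):
            if node[0][chunk] is None:
                node[0][chunk] = self._new_node()
                self.node_count += 1
            node = node[0][chunk]
        node[1] = True

    def contains(self, word):
        # The loop runs once per chunk of `word`, regardless of how many
        # words are stored in the trie.
        node = self.root
        for chunk in self._chunks(word):
            node = node[0][chunk]
            if node is None:
                return False
        return node[1]


if __name__ == "__main__":
    words = ["分词", "分词词典", "词典", "trie"]
    for bits in (2, 4, 8):
        trie = BitTrie(bits=bits)
        for w in words:
            trie.insert(w)
        steps = len(list(trie._chunks("分词")))
        print(bits, trie.node_count, steps, trie.contains("分词"))
```

In this sketch a wider chunk means fewer lookup steps per word but a larger child array in every node, while a narrower chunk means smaller nodes and a deeper tree. That is one way to picture the trade-off between storage space and query efficiency that the abstract describes. Words with a common prefix reuse the same nodes, so each additional word adds only the nodes for its unshared suffix.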

HAN Ying; FENG Jilin; YUAN Jing; WANG Maofa; ZHANG Yanxia; CHEN Fuming; CHEN Xinfang; PAN Zhian. Research on Dictionary Mechanism Based on Binary of Trie Index Trees[J]. Computer and Modernization, 2013, 1(1): 5-7.
