计算机与现代化 (Computer and Modernization) ›› 2010, Vol. 1 ›› Issue (6): 187-190. doi: 10.3969/j.issn.1006-2475.2010.06.053

• Applications and Development •

Nutch中庖丁解牛中文分词的实现与评测

孙殿哲1,魏海平2,陈 岩1   

  1. 辽宁石油化工大学研究生学院,辽宁 抚顺 113001;2. 辽宁石油化工大学计算机与通讯工程学院,辽宁 抚顺 113001
  • 收稿日期:2010-02-22 出版日期:2010-07-01 发布日期:2010-07-01

Realization and Evaluation of Paodingjieniu Chinese Segmentation in Nutch

SUN Dian-zhe1, WEI Hai-ping2, CHEN Yan1   

  1. Graduate School, Liaoning Shihua University, Fushun 113001, China; 2. School of Computer and Communication Engineering, Liaoning Shihua University, Fushun 113001, China
  • Received: 2010-02-22 Online: 2010-07-01 Published: 2010-07-01

摘要: 中文分词是搜索引擎面临的主要挑战之一。本文通过分析Nutch文档的评分机制,针对Nutch中文分词模块的分词不符合汉语习惯的情况,提出采用以词典分词法为基础的庖丁解牛分词模块对Nutch要采集的数据进行切分,描述在Nutch上实现庖丁解牛分词模块的方法,并对该分词模块进行测试。实验表明,庖丁解牛分词模块的分词结果更符合汉语习惯,并且在词项对文档的覆盖方面更加均衡,另外索引文件所占的存储空间节省20%~65%。

关键词: 中文分词, 评分机制, 庖丁解牛

Abstract: Chinese word segmentation is one of the main challenges facing search engines. Based on an analysis of Nutch's document scoring mechanism, and because the output of Nutch's Chinese word segmentation module does not conform to Chinese language usage, this paper proposes using the dictionary-based Paodingjieniu Chinese word segmentation module to segment the data collected by Nutch. It describes how to implement the Paodingjieniu module in Nutch and then tests the module. Experiments show that the Paodingjieniu module's segmentation results conform better to Chinese language usage, that its terms cover documents more evenly, and that the index files occupy 20%~65% less storage space.

Key words: Chinese word segmentation, scoring mechanism, Paodingjieniu
