Computer and Modernization (计算机与现代化)

• Data Mining

Research on Chinese Word Segmentation Technology for Military Field
(面向军事领域的中文分词技术研究)

  1. (Simulation Training Center, Shijiazhuang Campus, Army Engineering University, Shijiazhuang, Hebei 050000, China)
  • Received: 2018-04-03 Online: 2018-11-22 Published: 2018-11-23
  • About the authors: LI Jianlong (1993-), male, native of Yushu, Jilin, master's student at the Simulation Training Center, Shijiazhuang Campus, Army Engineering University; research interest: natural language processing. WANG Panqing (1965-), male, native of Shijiazhuang, Hebei, professor; research interest: natural language processing.

Abstract: The performance of a word segmentation model drops markedly when it is applied across domains. Because annotating a corpus of development documents for the army's legacy systems is laborious, this paper proposes a domain adaptation method for Chinese word segmentation that combines n-gram features with a domain dictionary. The method extracts n-gram features from the target corpus to train a segmentation model adapted to the characteristics of the target domain, and then uses the domain dictionary to correct the segmentation output by reverse maximum matching. Experimental results show that, on a corpus of documents related to the army's legacy systems, the segmentation model trained with this method improves the F-measure by 12.4%.

Key words: conditional random field; n-gram features; domain dictionary
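
The dictionary correction step named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `correct_with_dictionary`, the rule that a dictionary match overrides the model's boundaries, and the sample tokens and dictionary entries are all assumptions made for illustration. The sketch scans the sentence from right to left, takes the longest dictionary word ending at each position (reverse maximum matching), and keeps the model's boundaries everywhere else.

```python
from typing import Iterable, List, Set


def correct_with_dictionary(tokens: Iterable[str], domain_dict: Set[str]) -> List[str]:
    """Correct a model's segmentation with a domain dictionary.

    Hypothetical helper: dictionary words found by reverse maximum matching
    become single tokens; the model's boundaries are kept elsewhere.
    """
    tokens = list(tokens)
    text = "".join(tokens)
    if not text:
        return []

    # Longest dictionary entry bounds the matching window.
    max_len = max(map(len, domain_dict), default=1)

    # Character offsets at which the model placed a token boundary.
    cuts = set()
    pos = 0
    for tok in tokens:
        pos += len(tok)
        cuts.add(pos)

    # Reverse maximum matching: scan right to left, longest match first.
    # Only multi-character dictionary words are treated as corrections.
    spans = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 1, -1):
            start = end - size
            if text[start:end] in domain_dict:
                spans.append((start, end))
                end = start
                break
        else:
            end -= 1  # no dictionary word ends here; move one character left

    # A dictionary span removes any boundary inside it and forces
    # a boundary at each of its two edges.
    for start, stop in spans:
        cuts -= set(range(start + 1, stop))
        cuts.add(stop)
        if start > 0:
            cuts.add(start)

    # Rebuild the token list from the final boundary offsets.
    result = []
    prev = 0
    for cut in sorted(cuts):
        if cut > prev:
            result.append(text[prev:cut])
            prev = cut
    return result


if __name__ == "__main__":
    # Assumed example: the model over-splits two domain terms.
    model_output = ["模拟", "训练", "中心", "遗留", "系", "统"]
    dictionary = {"模拟训练", "遗留系统"}
    print(correct_with_dictionary(model_output, dictionary))
    # prints ['模拟训练', '中心', '遗留系统']
```

Scanning from the right means that when two dictionary words overlap, the one ending further right wins. Forward matching would resolve the same overlap in favor of the word starting further left.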

CLC number: