Computer and Modernization ›› 2022, Vol. 0 ›› Issue (08): 7-12.

• Chinese Information Processing Technology •

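Chinese Word Segmentation Model Based on Attention-BIGRU-CRF

  1. (College of Communication and Information Engineering, Nanjing University of Posts and Telecommunications, Nanjing 210003, Jiangsu, China)
  • Online: 2022-08-22 Published: 2022-08-22
  • About the authors: ZHOU Hui (b. 1996), female, from Tongling, Anhui, master's student; research interests: natural language processing, autonomous driving; E-mail: 1135146521@qq.com. XU Minghai (b. 1976), male, associate professor, Ph.D.; research interests: wireless heterogeneous network convergence, environment perception, intelligent transportation; E-mail: d0207@njupt.edu.cn. XU Xiaodong (b. 1997), male, master's student; research interests: complex network modeling; E-mail: 645061332@qq.com.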


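Abstract: Natural language processing (NLP) is an important branch of artificial intelligence, and Chinese word segmentation is its first step, so improving the efficiency of Chinese word segmentation improves the accuracy of NLP results. This paper therefore proposes an Attention-BIGRU-CRF model. The Chinese text is first converted into vectors through word embeddings, and a bidirectional gated recurrent unit (BIGRU) then learns over the resulting sequence. Next, an attention mechanism computes the correlation between the BIGRU input and output to obtain more precise vector values. Finally, these values are concatenated with the vectors produced by the BIGRU and fed to a conditional random field (CRF) layer, which outputs the predicted labels. Simulation results show that the Attention-BIGRU-CRF model achieves F1 scores of 97.34% and 98.25% on the People's Daily 2014 and MSRA corpora, respectively, and segments text at a rate of 248.1 KB/s. A model that combines the attention mechanism with a BIGRU-CRF network can therefore improve both the accuracy and the time efficiency of word segmentation.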

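Key words: natural language processing, bidirectional gated recurrent unit (BIGRU), conditional random field (CRF), attention mechanism, Chinese word segmentation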
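
The attention step summarized in the abstract (correlating the BIGRU input with its output, then concatenating the result with the BIGRU output before the CRF layer) can be illustrated with a minimal sketch. This is not the authors' implementation: the dot-product scoring, the function name `attention_fuse`, and the assumption that the inputs have already been projected to the same width as the outputs are all hypothetical choices made for illustration, and the toy vectors stand in for real embeddings and GRU states.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(p * q for p, q in zip(a, b))

def attention_fuse(inputs, outputs):
    """Concatenate each BIGRU output with an attention context vector.

    inputs  -- T vectors fed into the BIGRU (assumed here to be already
               projected to the same width as the outputs)
    outputs -- T vectors produced by the BIGRU
    """
    dim = len(outputs[0])
    fused = []
    for x_t, h_t in zip(inputs, outputs):
        # Correlation of this input with every output position (assumed dot product).
        weights = softmax([dot(x_t, h_s) for h_s in outputs])
        # Weighted sum of all outputs gives the context vector for position t.
        context = [sum(w * h_s[k] for w, h_s in zip(weights, outputs))
                   for k in range(dim)]
        # List concatenation: [BIGRU output ; attention context].
        fused.append(h_t + context)
    return fused

# Toy stand-ins; a real model would use learned embeddings and GRU states.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = [[0.5, -0.5], [0.2, 0.8], [-0.3, 0.1]]
fused = attention_fuse(inputs, outputs)
print(len(fused), len(fused[0]))  # 3 4
```

Each fused vector is twice as wide as a BIGRU output; in the pipeline the abstract describes, vectors of this kind would serve as the input to the CRF layer.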

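
For readers unfamiliar with how a segmentation F1 score such as those reported in the abstract is typically computed, the sketch below shows one common word-level definition: a predicted word counts as correct only if its character span matches a gold word exactly. The source does not specify the scoring procedure, so this definition, the function names, and the toy sentences are assumptions for illustration only.

```python
def word_spans(words):
    """Map a segmented sentence to the set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans

def segmentation_f1(gold_sentences, pred_sentences):
    """Return (precision, recall, F1) over exact word-span matches."""
    correct = gold_total = pred_total = 0
    for gold, pred in zip(gold_sentences, pred_sentences):
        gold_spans, pred_spans = word_spans(gold), word_spans(pred)
        correct += len(gold_spans & pred_spans)
        gold_total += len(gold_spans)
        pred_total += len(pred_spans)
    precision = correct / pred_total if pred_total else 0.0
    recall = correct / gold_total if gold_total else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Hypothetical toy data: the first prediction merges two gold words.
gold = [["自然", "语言", "处理"], ["中文", "分词"]]
pred = [["自然", "语言处理"], ["中文", "分词"]]
precision, recall, f1 = segmentation_f1(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In the toy data, three of the four predicted words match gold spans and three of the five gold words are recovered, so precision is 0.75, recall is 0.6, and F1 is about 0.667.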