Computer and Modernization, 2021, Vol. 0, Issue (11): 100-105.

• Pattern Recognition •

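Named Entity Recognition of Medicinal Plant Texts Integrated with Attention Mechanism

  1. (1. College of Computer Science and Technology, Guizhou University, Guiyang 550025, China;
    2. State Key Laboratory of Public Big Data (Guizhou University), Guiyang 550025, China)
  • Online: 2021-12-13 Published: 2021-12-13
  • About the authors: Wang Yunqian (王运乾, b. 1991), male, from Liupanshui, Guizhou, master's student; research interests: knowledge graphs, natural language processing; E-mail: 1406436535@qq.com. Wang Yisong (王以松, b. 1975), male, from Tongren, Guizhou, professor, Ph.D.; research interests: knowledge representation and reasoning, machine learning, artificial intelligence. Chen Panfeng (陈攀峰, b. 1982), male, from Huanggang, Hubei, doctoral student; research interests: knowledge graphs, knowledge representation and reasoning. Zou Long (邹龙, b. 1996), male, from Neijiang, Sichuan, master's student; research interests: deep learning, knowledge representation.
  • Supported by:
    National Natural Science Foundation of China (61976065, U1836205)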

Abstract: Named entity recognition (NER) on medicinal plant texts plays an important role in information extraction and knowledge graph construction for traditional Chinese medicine. To address the semantic sparsity of long sequences in medicinal plant attribute texts, this paper proposes BAC (BiLSTM+ATT-CRF), a disease entity recognition method that combines an attention-based bidirectional long short-term memory network (BiLSTM) with a conditional random field (CRF). First, the attribute texts are preprocessed and semi-automatically annotated to build a dataset, and low-dimensional word vectors are obtained by pre-training. These vectors are then fed into the BiLSTM network to produce feature vectors that capture bidirectional semantic dependencies, and an attention layer focuses on the information most relevant to the current output feature. Finally, the CRF decodes the optimal tag sequence. Experiments show that BAC reaches an accuracy of 93.78%, which is 4.46% higher than the BiLSTM-CRF model, and that it improves disease named entity recognition on medicinal plant attribute texts. Applied to 1680 text sentences, the trained BAC model identified 1422 disease entities; matching these against medicinal plant names yielded 4316 triples relating medicinal plants to the diseases they treat.
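The last two steps of the pipeline described in the abstract, CRF decoding and pairing the recognized diseases with plant names, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the BiLSTM and attention layers are not reproduced, so the emission scores below are hand-written toy values standing in for their output; the BIO tag set (`O`, `B-DIS`, `I-DIS`), the relation name `treats`, the placeholder `example_plant`, and all function names are hypothetical. The decoding itself is standard Viterbi search over emission and transition scores.

```python
from typing import List, Tuple

TAGS = ["O", "B-DIS", "I-DIS"]  # assumed BIO tag set for disease entities


def viterbi_decode(emissions: List[List[float]],
                   transitions: List[List[float]]) -> List[int]:
    """Return the highest-scoring tag-index path (standard Viterbi search).

    emissions[t][j]   : score of tag j at position t (stand-in for BiLSTM+attention output)
    transitions[i][j] : score of moving from tag i to tag j (CRF parameters)
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag at t = 0
    backptr: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_score, ptr = [], []
        for j in range(n_tags):
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            ptr.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emissions[t][j])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final tag to recover the full path.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])
    return path[::-1]


def extract_entities(tokens: List[str], tags: List[str]) -> List[str]:
    """Collect contiguous B-DIS/I-DIS spans as entity strings."""
    entities, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B-DIS":
            if current:
                entities.append(" ".join(current))
            current = [tok]
        elif tag == "I-DIS" and current:
            current.append(tok)
        else:
            if current:
                entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    return entities


def build_triples(plant: str, diseases: List[str]) -> List[Tuple[str, str, str]]:
    """Pair one plant name with each recognized disease (relation name is assumed)."""
    return [(plant, "treats", d) for d in diseases]


# Toy input: scores are invented for illustration, not taken from the paper.
tokens = ["relieves", "chronic", "cough", "and", "fever"]
emissions = [
    [2.0, 0.1, 0.0],
    [0.5, 1.5, 0.2],
    [0.3, 1.0, 1.2],
    [2.0, 0.0, 0.1],
    [0.2, 1.8, 0.4],
]
transitions = [
    [0.5, 0.2, -5.0],   # from O: starting with I-DIS is heavily penalized
    [0.0, -1.0, 1.0],   # from B-DIS
    [0.3, -1.0, 0.5],   # from I-DIS
]

tags = [TAGS[i] for i in viterbi_decode(emissions, transitions)]
diseases = extract_entities(tokens, tags)
print(tags)      # ['O', 'B-DIS', 'I-DIS', 'O', 'B-DIS']
print(build_triples("example_plant", diseases))
```

The strongly negative `O` to `I-DIS` transition shows what the CRF layer adds over per-token classification: tag sequences that violate the labeling scheme are scored down as a whole, so the decoder prefers well-formed entity spans.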

Key words: knowledge graph, attention mechanism, bidirectional long short-term memory network (BiLSTM), conditional random field (CRF), disease named entity recognition