计算机与现代化 (Computer and Modernization) ›› 2021, Vol. 0 ›› Issue (02): 51-55.

• Database and Data Mining •

  • About the authors: ZHU Yan (b. 1992), female, from Huainan, Anhui; master's student; research interests: medical informatics and nursing services; E-mail: 1148905687@qq.com. Corresponding author: ZHANG Li (b. 1971), female, from Bengbu, Anhui; associate professor, master's degree; research interests: medical informatics and nursing services; E-mail: 390026586@qq.com. WANG Yu (b. 1991), male; Ph.D. student; research interests: medical informatics and natural language processing; E-mail: briskyu@mail.ustc.edu.cn.
  • Funding:
    Major Special Project of the Anhui Provincial Department of Science and Technology (18030801133); Graduate Science and Technology Innovation Program of Bengbu Medical College (Byycx1940)

Named Entity Recognition on Chinese Electronic Medical Records Based on RoBERTa-WWM

  1. (1. School of Nursing, Bengbu Medical College, Bengbu 233030, China; 
    2. Science Island Branch, University of Science and Technology of China, Hefei 230001, China)
  • Online: 2021-03-01  Published: 2021-03-01


Abstract: Electronic Medical Records (EMRs) contain abundant information, such as clinical symptoms, diagnosis results, and drug efficacy. Named Entity Recognition (NER) aims to extract named entities from unstructured text, and it is also the initial step in extracting valuable information from EMRs. This paper proposes a named entity recognition method based on the pre-trained model RoBERTa-WWM (A Robustly Optimized BERT Pre-training Approach with Whole Word Masking), which is used to generate semantic representations that carry prior knowledge. Compared with BERT (Bidirectional Encoder Representations from Transformers), the semantic representations generated by RoBERTa-WWM are better suited to Chinese NER tasks, because whole words, rather than individual characters, are masked during pre-training. These semantic representations are then fed sequentially into Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF) models. Experimental results show that the method effectively improves the F1-score on the China Conference on Knowledge Graph and Semantic Computing 2019 (CCKS 2019) dataset, improving the recognition of named entities in Chinese EMRs.
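The abstract's distinction between character-level masking and whole-word masking can be illustrated with a minimal sketch. The segmented sentence, the mask positions, and both helper functions below are hypothetical toy examples, not the paper's implementation; in practice the word boundaries come from a Chinese word segmenter and masking is applied inside the pre-training pipeline.

```python
# Toy illustration of character-level masking (BERT-style) versus
# whole-word masking (WWM) for Chinese text. All names and data here
# are illustrative assumptions, not from the paper.
MASK = "[MASK]"

def char_mask(words, masked_char_index):
    # Character-level masking replaces a single character, so part of
    # a multi-character word can remain visible to the model.
    chars = [c for w in words for c in w]
    chars[masked_char_index] = MASK
    return chars

def whole_word_mask(words, masked_word_index):
    # WWM masks every character of the chosen word at once, so the
    # model must predict the entire word from its context.
    out = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i == masked_word_index else list(w))
    return out

words = ["患者", "头痛", "明显"]  # hypothetical segmented EMR fragment
print(char_mask(words, 2))        # masks only "头", leaving "痛" visible
print(whole_word_mask(words, 1))  # masks both characters of "头痛"
```

With character-level masking the model can often guess the masked character from the rest of the word; masking the whole word forces it to rely on sentence-level context, which is the property the abstract credits for better Chinese NER representations.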

Key words: electronic medical records, named entity recognition, RoBERTa-WWM, information extraction
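In the pipeline the abstract describes, the CRF layer picks the highest-scoring tag sequence given the BiLSTM's per-token scores and learned tag-transition scores. The sketch below shows only that decoding step (Viterbi) with hypothetical start, transition, and emission scores; in the actual model these scores are produced by the trained BiLSTM-CRF, not hand-written.

```python
def viterbi(labels, start, trans, emissions):
    """Return the highest-scoring tag sequence for one sentence.

    start[y]        : score of starting with tag y
    trans[p][y]     : score of moving from tag p to tag y
    emissions[t][y] : per-token score of tag y at position t
    (all scores here are toy numbers, not learned parameters)
    """
    # Initialise with the start scores plus the first token's emissions.
    score = {y: start[y] + emissions[0][y] for y in labels}
    back = []
    for emit in emissions[1:]:
        ptr, nxt = {}, {}
        for y in labels:
            # Best previous tag leading into the current tag y.
            p = max(labels, key=lambda p: score[p] + trans[p][y])
            nxt[y] = score[p] + trans[p][y] + emit[y]
            ptr[y] = p
        score, back = nxt, back + [ptr]
    # Backtrack from the best final tag.
    best = max(labels, key=score.get)
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

labels = ["B", "I", "O"]
start = {"B": 0.0, "I": -10.0, "O": 0.0}   # a sequence should not start with I
trans = {"B": {"B": -1, "I": 1, "O": 0},
         "I": {"B": -1, "I": 1, "O": 0},
         "O": {"B": 0, "I": -10, "O": 1}}  # O -> I is heavily penalised
emissions = [{"B": 2, "I": 0, "O": 0},     # token 1: likely entity start
             {"B": 0, "I": 2, "O": 0},     # token 2: likely entity continuation
             {"B": 0, "I": 0, "O": 2}]     # token 3: likely outside
print(viterbi(labels, start, trans, emissions))  # ['B', 'I', 'O']
```

The transition scores are what distinguish a CRF from per-token softmax classification: an implausible sequence such as O followed by I is penalised jointly, which is why the CRF layer is placed after the BiLSTM in the method described above.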