计算机与现代化 (Computer and Modernization) ›› 2021, Vol. 0 ›› Issue (02): 51-55.

• Database and Data Mining •

  • About the authors: ZHU Yan (b. 1992), female, from Huainan, Anhui; master's student; research interests: medical informatics and nursing services; E-mail: 1148905687@qq.com. Corresponding author: ZHANG Li (b. 1971), female, from Bengbu, Anhui; associate professor, master's degree; research interests: medical informatics and nursing services; E-mail: 390026586@qq.com. WANG Yu (b. 1991), male; Ph.D. student; research interests: medical informatics and natural language processing; E-mail: briskyu@mail.ustc.edu.cn.
  • Funding:
    Major Special Project of the Anhui Provincial Department of Science and Technology (18030801133); Graduate Science and Technology Innovation Program of Bengbu Medical College (Byycx1940)

Named Entity Recognition on Chinese Electronic Medical Records Based on RoBERTa-WWM

  1. (1. School of Nursing, Bengbu Medical College, Bengbu 233030, China; 
    2. Science Island Branch, University of Science and Technology of China, Hefei 230001, China)
  • Online: 2021-03-01  Published: 2021-03-01


Abstract: Electronic Medical Records (EMRs) contain abundant information, such as clinical symptoms, diagnosis results, and drug efficacy. Named Entity Recognition (NER) aims to extract named entities from unstructured text, and it is also the initial step in extracting valuable information from EMRs. This paper proposes a named entity recognition method based on the pre-trained model RoBERTa-WWM (A Robustly Optimized BERT Pre-training Approach with Whole Word Masking), which is used to generate semantic representations that carry prior knowledge. Compared with BERT (Bidirectional Encoder Representations from Transformers), the semantic representations generated by RoBERTa-WWM are better suited to Chinese NER tasks, because whole words, rather than individual characters, are masked during pre-training. These semantic representations are then fed sequentially into Bidirectional Long Short-Term Memory (BiLSTM) and Conditional Random Field (CRF) models. Experimental results show that the method effectively improves the F1-score on the China Conference on Knowledge Graph and Semantic Computing 2019 (CCKS 2019) dataset, improving the recognition of named entities in Chinese EMRs.
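The abstract's distinction between character-level masking and whole-word masking can be illustrated with a minimal sketch. The segmented sentence, the mask positions, and both helper functions below are hypothetical toy examples, not the paper's implementation; in practice the word boundaries come from a Chinese word segmenter and masking is applied inside the pre-training pipeline.

```python
# Toy illustration of character-level masking (BERT-style) versus
# whole-word masking (WWM) for Chinese text. All names and data here
# are illustrative assumptions, not from the paper.
MASK = "[MASK]"

def char_mask(words, masked_char_index):
    # Character-level masking replaces a single character, so part of
    # a multi-character word can remain visible to the model.
    chars = [c for w in words for c in w]
    chars[masked_char_index] = MASK
    return chars

def whole_word_mask(words, masked_word_index):
    # WWM masks every character of the chosen word at once, so the
    # model must predict the entire word from its context.
    out = []
    for i, w in enumerate(words):
        out.extend([MASK] * len(w) if i == masked_word_index else list(w))
    return out

words = ["患者", "头痛", "明显"]  # hypothetical segmented EMR fragment
print(char_mask(words, 2))        # masks only "头", leaving "痛" visible
print(whole_word_mask(words, 1))  # masks both characters of "头痛"
```

With character-level masking the model can often guess the masked character from the rest of the word; masking the whole word forces it to rely on sentence-level context, which is the property the abstract credits for better Chinese NER representations.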

Key words: electronic medical records, named entity recognition, RoBERTa-WWM, information extraction
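In the pipeline the abstract describes, the CRF layer picks the highest-scoring tag sequence given the BiLSTM's per-token scores and learned tag-transition scores. The sketch below shows only that decoding step (Viterbi) with hypothetical start, transition, and emission scores; in the actual model these scores are produced by the trained BiLSTM-CRF, not hand-written.

```python
def viterbi(labels, start, trans, emissions):
    """Return the highest-scoring tag sequence for one sentence.

    start[y]        : score of starting with tag y
    trans[p][y]     : score of moving from tag p to tag y
    emissions[t][y] : per-token score of tag y at position t
    (all scores here are toy numbers, not learned parameters)
    """
    # Initialise with the start scores plus the first token's emissions.
    score = {y: start[y] + emissions[0][y] for y in labels}
    back = []
    for emit in emissions[1:]:
        ptr, nxt = {}, {}
        for y in labels:
            # Best previous tag leading into the current tag y.
            p = max(labels, key=lambda p: score[p] + trans[p][y])
            nxt[y] = score[p] + trans[p][y] + emit[y]
            ptr[y] = p
        score, back = nxt, back + [ptr]
    # Backtrack from the best final tag.
    best = max(labels, key=score.get)
    path = [best]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

labels = ["B", "I", "O"]
start = {"B": 0.0, "I": -10.0, "O": 0.0}   # a sequence should not start with I
trans = {"B": {"B": -1, "I": 1, "O": 0},
         "I": {"B": -1, "I": 1, "O": 0},
         "O": {"B": 0, "I": -10, "O": 1}}  # O -> I is heavily penalised
emissions = [{"B": 2, "I": 0, "O": 0},     # token 1: likely entity start
             {"B": 0, "I": 2, "O": 0},     # token 2: likely entity continuation
             {"B": 0, "I": 0, "O": 2}]     # token 3: likely outside
print(viterbi(labels, start, trans, emissions))  # ['B', 'I', 'O']
```

The transition scores are what distinguish a CRF from per-token softmax classification: an implausible sequence such as O followed by I is penalised jointly, which is why the CRF layer is placed after the BiLSTM in the method described above.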