Computer and Modernization ›› 2022, Vol. 0 ›› Issue (09): 40-50.

• Artificial Intelligence •

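Entity Recognition Method on EMR Based on Multi-task Learning

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Online: 2022-09-22 Published: 2022-09-22
  • About the authors:
    Yu Peng (于鹏, born 1994), male, from Weihai, Shandong; master's student; research interests: natural language processing, entity extraction; E-mail: 18811629029@163.com
    Chen Yufeng (陈钰枫, born 1981), corresponding author; female, from Nanping, Fujian; associate professor, Ph.D.; research interests: natural language processing, artificial intelligence; E-mail: chenyf@bjtu.edu.cn
    Xu Jin'an (徐金安, born 1970), male, from Kaifeng, Henan; associate professor, Ph.D.; research interests: natural language processing, machine translation; E-mail: jaxu@bjtu.edu.cn
    Zhang Yujie (张玉洁, born 1961), female, from Anyang, Henan; professor, Ph.D.; research interests: natural language processing, machine translation; E-mail: yjzhang@bjtu.edu.cn
  • Supported by:
    General Program of the National Natural Science Foundation of China (61976016, 61976015, 61876198)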

Abstract: Named entity recognition (NER) in Chinese electronic medical records (EMR) is a difficult problem in medical information extraction. This paper proposes a multi-task learning method for entity recognition that trains the model jointly on entity recognition and word segmentation. Bi-LSTM-based private layers extract task-specific features, while an attention network serves as the shared layer and is supplemented by a general feature enhancement mechanism that filters global information; this reduces the risk of overfitting and improves the model's generalization ability. In addition, a balanced oversampling method is proposed to augment the EMR dataset, which effectively addresses the problems caused by imbalance among entity types. The CCKS2017/CCKS2020 EMR entity recognition corpora and the Medicine word segmentation corpus are used for joint training. The experimental results show that overall performance on EMR entity recognition improves markedly and that word segmentation on the Medicine corpus also improves significantly, with the F1 value rising by 3 percentage points over the baseline. Detailed analysis shows that the proposed model effectively reduces the entity segmentation errors caused by irregular writing, unstructured text, and specialized terminology in EMR data.
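The abstract does not spell out how the balanced oversampling is carried out. As a rough illustration only, the sketch below assumes one plausible reading: sentences that contain under-represented entity types are duplicated until every type occurs in at least as many sentences as the most frequent type did originally. The function names `entity_types` and `balanced_oversample`, the BIO tag format, and the labels `SYMPTOM` and `DRUG` are hypothetical and are not taken from the paper.

```python
import random
from collections import Counter


def entity_types(tags):
    """Return the set of entity types in one BIO-tagged sentence."""
    return {tag.split("-", 1)[1] for tag in tags if tag != "O"}


def balanced_oversample(sentences, seed=0):
    """Duplicate sentences holding rare entity types (assumed strategy).

    sentences: list of (tokens, tags) pairs, tags in BIO format ("B-DRUG").
    Returns a new list in which every entity type occurs in at least as
    many sentences as the most frequent type did in the input.
    """
    rng = random.Random(seed)
    # Number of sentences in which each entity type appears.
    counts = Counter(t for _, tags in sentences for t in entity_types(tags))
    if not counts:
        return list(sentences)
    target = max(counts.values())
    augmented = list(sentences)
    # Handle the rarest types first; copies may also raise other counts.
    for etype in sorted(counts, key=counts.get):
        pool = [s for s in sentences if etype in entity_types(s[1])]
        while counts[etype] < target:
            sample = rng.choice(pool)
            augmented.append(sample)
            counts.update(entity_types(sample[1]))
    return augmented


# Toy character-level corpus: three symptom sentences, one drug sentence.
toy = [
    (["头", "痛"], ["B-SYMPTOM", "I-SYMPTOM"]),   # headache
    (["咳", "嗽"], ["B-SYMPTOM", "I-SYMPTOM"]),   # cough
    (["发", "热"], ["B-SYMPTOM", "I-SYMPTOM"]),   # fever
    (["阿", "司", "匹", "林"], ["B-DRUG", "I-DRUG", "I-DRUG", "I-DRUG"]),  # aspirin
]
balanced = balanced_oversample(toy)
drug_sentences = sum("DRUG" in entity_types(tags) for _, tags in balanced)
symptom_sentences = sum("SYMPTOM" in entity_types(tags) for _, tags in balanced)
```

Plain duplication keeps the sketch short; a real pipeline might cap the number of copies per sentence or combine oversampling with other augmentation, choices that the abstract leaves open.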

Key words: deep learning, named entity recognition, multi-task learning, neural network, attention mechanism