Computer and Modernization (计算机与现代化) ›› 2023, Vol. 0 ›› Issue (01): 43-48.

• Artificial Intelligence •

Tibetan Medical Entity Recognition Based on Tibetan BERT

  1. (1. School of Information Science and Technology, Tibet University, Lhasa 850000, China; 2. Key Laboratory of Artificial Intelligence for Tibetan Information Technology of Tibet Autonomous Region, Lhasa 850000, China; 3. Ministry of Education Engineering Research Center for Tibetan Information Technology, Lhasa 850000, China)
  • Online: 2023-03-02 Published: 2023-03-02
  • About the authors: 朱亚军 (Zhu Yajun, born 1994), male, from Zhoukou, Henan; master's student; research interest: natural language processing; E-mail: zhuyajun2022@163.com. Corresponding author: 拥措 (Yong Cuo, born 1974), female (Tibetan), from Yushu, Qinghai; professor, doctoral supervisor, Ph.D.; research interests: natural language processing, artificial intelligence; E-mail: yongtso@163.com. 尼玛扎西 (Nima Zhaxi, born 1964), male (Tibetan), from Lhasa, Tibet; professor, doctoral supervisor, Ph.D.; research interest: Tibetan computational linguistics; E-mail: nmzx@utibet.edu.cn.
  • Supported by:
    Key Special Project of the Key R&D Program of the Ministry of Science and Technology (2017YFB1402200); Independent Research Project of the Science and Technology Innovation Base of Tibet Autonomous Region (XZ2021JR002G); Tibet University Graduate "High-Level Talent Training Program" Project (2019-GSP-S118)

Abstract: Character embeddings of Tibetan medicine text are important for Tibetan medical entity recognition, but high-quality Tibetan language models are currently lacking. Taking the structural characteristics of Tibetan script into account, this paper trains a syllable-based Tibetan BERT model on general Tibetan news text and builds a BERT-BiLSTM-CRF model on top of it. The model first uses the Tibetan BERT model to learn character embeddings for Tibetan medicine text, strengthening their ability to represent Tibetan characters and their context. A BiLSTM layer then further extracts the dependencies between characters in the text, and a CRF layer finally strengthens the validity of the predicted label sequence. Experimental results show that initializing the character embeddings with the Tibetan BERT model improves Tibetan medical entity recognition, with an F1 score of 96.18%.
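The abstract does not say how text is segmented for the syllable-based Tibetan BERT model. As a minimal sketch of what syllable-level segmentation could look like, the snippet below splits on the tsheg (་, U+0F0B), the standard Tibetan syllable delimiter. The function name `split_syllables` and the choice to treat the shad (།, U+0F0D) and whitespace as boundaries are assumptions for illustration, not the authors' tokenizer.

```python
import re

# Boundary characters (an illustrative assumption, not the paper's rule set):
# U+0F0B tsheg separates syllables, U+0F0D shad ends a clause or sentence.
_BOUNDARY = re.compile(r"[\u0F0B\u0F0D\s]+")


def split_syllables(text: str) -> list[str]:
    """Split Tibetan text into syllables, dropping the delimiters."""
    return [part for part in _BOUNDARY.split(text) if part]


# A short phrase with five tsheg marks and a closing shad.
sample = "བོད་ཀྱི་གསོ་བ་རིག་པ།"
syllables = split_syllables(sample)
print(len(syllables), syllables)
```

Each resulting syllable would then play the role that a word piece plays in a standard BERT vocabulary; how the paper builds its actual vocabulary is not stated in this section.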

Key words: Tibetan, Tibetan medicine, named entity recognition (NER), BERT, bidirectional long short-term memory (BiLSTM)
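The abstract states that the CRF layer strengthens the validity of the label sequence. The following is a simplified sketch of that idea, not the paper's implementation: a Viterbi decoder over per-syllable scores in which illegal BIO transitions are forbidden outright. A trained CRF learns real-valued transition scores instead of hard rules, and the label set (`O`, `B-DRUG`, `I-DRUG`) and the emission scores are hypothetical.

```python
NEG_INF = float("-inf")


def is_legal(prev: str, curr: str) -> bool:
    """BIO rule: I-X may only follow B-X or I-X of the same type."""
    if curr.startswith("I-"):
        return prev in ("B-" + curr[2:], "I-" + curr[2:])
    return True


def viterbi(emissions: list[dict[str, float]], labels: list[str]) -> list[str]:
    """Return the highest-scoring label path that obeys the BIO rule."""
    # A sequence may not start inside an entity.
    score = {l: NEG_INF if l.startswith("I-") else emissions[0][l] for l in labels}
    backpointers = []
    for emission in emissions[1:]:
        new_score, pointer = {}, {}
        for curr in labels:
            best_prev, best = None, NEG_INF
            for prev in labels:
                if is_legal(prev, curr) and score[prev] > best:
                    best_prev, best = prev, score[prev]
            new_score[curr] = best + emission[curr] if best_prev else NEG_INF
            pointer[curr] = best_prev
        score = new_score
        backpointers.append(pointer)
    # Trace back from the best final label.
    path = [max(labels, key=lambda l: score[l])]
    for pointer in reversed(backpointers):
        path.append(pointer[path[-1]])
    path.reverse()
    return path


LABELS = ["O", "B-DRUG", "I-DRUG"]  # hypothetical label set
# Made-up scores for three syllables, standing in for BiLSTM outputs.
emissions = [
    {"O": 1.0, "B-DRUG": 0.8, "I-DRUG": 0.1},
    {"O": 0.2, "B-DRUG": 0.3, "I-DRUG": 2.0},
    {"O": 0.1, "B-DRUG": 0.2, "I-DRUG": 1.5},
]
greedy = [max(LABELS, key=lambda l: e[l]) for e in emissions]
decoded = viterbi(emissions, LABELS)
print("greedy: ", greedy)   # starts with O followed by I-DRUG, which is illegal
print("decoded:", decoded)  # B-DRUG, I-DRUG, I-DRUG
```

Picking the top label for each syllable independently yields an entity that begins with `I-DRUG`, while sequence-level decoding trades a slightly lower first-step score for a well-formed entity span.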