Computer and Modernization ›› 2022, Vol. 0 ›› Issue (04): 1-6.

• Artificial Intelligence •

Data Augmentation for Chinese Named Entity Recognition Task

  

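  1. (PLA Strategic Support Force Information Engineering University, Luoyang 471003, Henan, China)
  • Published: 2022-05-07  Online: 2022-05-07
  • About the authors:
    Li Jian (李健, born 1985), male, from Chengwu, Shandong; associate professor and Ph.D. candidate; research interest: language information processing; E-mail: maomaotfntfn@163.com
    Zhang Keliang (张克亮, born 1964), male, from Luoyang, Henan; professor, Ph.D.; research interests: computational linguistics, artificial intelligence; E-mail: kliang99@sina.com
    Tang Liang (唐亮, born 1976), male, from Luoyang, Henan; associate professor, Ph.D.; research interests: knowledge graphs, intelligent retrieval; E-mail: tl_wy@163.com
    Xia Rongjing (夏榕璟, born 1998), female, from Kunming, Yunnan; master's student; research interest: language information processing; E-mail: cczxxrj@163.com
    Ren Jingjing (任静静, born 1985), female, from Wucheng, Shandong; lecturer, master's degree; research interest: natural language processing; E-mail: renjingjing1029@163.com
  • Funding:
    National Natural Science Foundation of China (11590771)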

Abstract: In low-resource natural language processing (NLP) tasks, the available data are insufficient to train a satisfactory deep learning model, and text data augmentation is an effective way to improve training in such settings. This paper proposes a set of data augmentation methods based on instance substitution for Chinese named entity recognition: a named entity in a training sample is replaced by another entity of the same type while its label is kept unchanged. The specific algorithms are: 1) crossover substitution between existing entities; 2) synonym replacement within an entity; 3) automatic generation of Chinese person names. The methods are applied to the PeopleDailyNER and CLUENER2020 datasets, and the augmented data are used to train a BERT+CRF model. Experimental results show that, in the small-sample setting, adding augmented data equal in amount to the original data raises the model's F1 score by about 10% and 7% on the two datasets, respectively, and that training results still improve markedly as the number of training samples grows.

Key words: deep learning, text data augmentation, named entity recognition, instance crossover, Chinese name generator
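
The first method named in the abstract, crossover substitution between existing entities, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes character-level samples with BIO labels, and the helper names `extract_entities`, `build_entity_pool`, and `crossover` as well as the two toy sentences are hypothetical. Because a substitute may differ in length from the entity it replaces, the sketch regenerates the B-/I- tags for the substitute while keeping the entity type unchanged.

```python
import random
from typing import Dict, List, Tuple

Sample = Tuple[List[str], List[str]]  # (characters, BIO labels); assumed format


def extract_entities(chars: List[str], labels: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans, end exclusive, read from BIO labels."""
    spans = []
    start, etype = None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        if lab.startswith("I-") and start is not None and lab[2:] == etype:
            continue  # still inside the current entity
        if start is not None:
            spans.append((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return spans


def build_entity_pool(samples: List[Sample]) -> Dict[str, List[str]]:
    """Collect the surface strings of all entities, grouped by entity type."""
    pool: Dict[str, List[str]] = {}
    for chars, labels in samples:
        for start, end, etype in extract_entities(chars, labels):
            pool.setdefault(etype, []).append("".join(chars[start:end]))
    return pool


def crossover(sample: Sample, pool: Dict[str, List[str]], rng: random.Random) -> Sample:
    """Replace every entity with a randomly drawn entity of the same type."""
    chars, labels = sample
    new_chars: List[str] = []
    new_labels: List[str] = []
    cursor = 0
    for start, end, etype in extract_entities(chars, labels):
        # Copy the non-entity context unchanged.
        new_chars.extend(chars[cursor:start])
        new_labels.extend(labels[cursor:start])
        # Insert the substitute and regenerate its B-/I- tags.
        substitute = rng.choice(pool[etype])
        new_chars.extend(substitute)
        new_labels.extend([f"B-{etype}"] + [f"I-{etype}"] * (len(substitute) - 1))
        cursor = end
    new_chars.extend(chars[cursor:])
    new_labels.extend(labels[cursor:])
    return new_chars, new_labels


# Toy training data (hypothetical, not taken from either dataset).
train: List[Sample] = [
    (list("小明在北京工作"), ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]),
    (list("小红去了上海"), ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]),
]
pool = build_entity_pool(train)
rng = random.Random(0)  # fixed seed so a run is repeatable
augmented = [crossover(sample, pool, rng) for sample in train]
```

Appending `augmented` to `train` yields the setting the abstract describes, in which the augmented data equal the original data in amount. The sketch draws substitutes uniformly from all same-type entities in the training set; the paper's actual sampling strategy may differ.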