Computer and Modernization, 2022, Vol. 0, Issue (04): 1-6.

Data Augmentation for Chinese Named Entity Recognition Task

  

  1. (PLA Strategic Support Force Information Engineering University, Luoyang 471003, China)
  • Online: 2022-05-07 Published: 2022-05-07

Abstract: In low-resource natural language processing (NLP) tasks, the available data is insufficient to train a deep learning model well, and text data augmentation is an effective way to improve training in such settings. This paper proposes a group of data augmentation methods based on instance substitution for Chinese named entity recognition: a named entity in a training sample is replaced by another entity of the same type, without changing its label. The specific algorithms are: 1) crossover substitution between existing entities; 2) synonym replacement of entity components; 3) automatic generation of Chinese names. The methods are applied to the PeopleDailyNER and CLUENER2020 datasets, and the augmented data is used to train a BERT+CRF model. Experimental results show that, in the small-sample setting, adding augmented data equal in amount to the original data improves the model's F1 score by about 10% and 7% on the two datasets, respectively; a significant improvement is also observed as the number of training samples increases.
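The first algorithm, crossover substitution between existing entities, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes character-level tokens with BIO labels (e.g. `B-PER`, `I-PER`, `O`), and the function names `extract_entities`, `build_entity_pool`, and `crossover_substitute`, as well as the two toy sentences, are hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def extract_entities(tokens, labels):
    """Return (start, end, type) spans from BIO labels; end is exclusive."""
    spans, start, etype = [], None, None
    for i, lab in enumerate(labels + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not lab.startswith("I-"):
            spans.append((start, i, etype))
            start, etype = None, None
        if lab.startswith("B-"):
            start, etype = i, lab[2:]
    return spans


def build_entity_pool(dataset):
    """Collect every entity mention in the training set, grouped by type."""
    pool = defaultdict(set)
    for tokens, labels in dataset:
        for s, e, t in extract_entities(tokens, labels):
            pool[t].add(tuple(tokens[s:e]))
    return {t: sorted(mentions) for t, mentions in pool.items()}


def crossover_substitute(tokens, labels, pool, rng):
    """Swap each entity for a different same-type entity and re-emit BIO labels."""
    new_tokens, new_labels, cursor = [], [], 0
    for s, e, t in extract_entities(tokens, labels):
        new_tokens += tokens[cursor:s]
        new_labels += labels[cursor:s]
        candidates = [c for c in pool.get(t, []) if c != tuple(tokens[s:e])]
        # Keep the original mention when no other entity of this type exists.
        repl = list(rng.choice(candidates)) if candidates else tokens[s:e]
        new_tokens += repl
        new_labels += ["B-" + t] + ["I-" + t] * (len(repl) - 1)
        cursor = e
    new_tokens += tokens[cursor:]
    new_labels += labels[cursor:]
    return new_tokens, new_labels


# Toy training set (hypothetical sentences, character-level BIO labels).
dataset = [
    (list("张伟在北京工作"),
     ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]),
    (list("欧阳明去了上海"),
     ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]),
]
pool = build_entity_pool(dataset)
aug_tokens, aug_labels = crossover_substitute(*dataset[0], pool, random.Random(0))
print("".join(aug_tokens), aug_labels)
```

Because the replacement may differ in length from the original mention, the sketch regenerates the label span instead of reusing the old one, which keeps tokens and labels aligned. The entity type is preserved, which corresponds to substitution "without changing the label" in the abstract.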

Key words: deep learning, text data augmentation, named entity recognition, instance crossover, Chinese name creator