Data Augmentation for Chinese Named Entity Recognition Task

Abstract

In low-resource natural language processing (NLP) tasks, the available data is insufficient to train a satisfactory deep learning model. Text data augmentation is an effective way to improve training in such settings. This paper proposes a group of data augmentation methods based on instance substitution for Chinese named entity recognition: a named entity in a training sample can be replaced by another entity of the same type without changing the label. The specific algorithms are: 1) crossover substitution between existing entities; 2) synonym replacement of entity components; 3) automatic generation of Chinese names. These methods are applied to the PeopleDailyNER and CLUENER2020 datasets, and the augmented data is used to train a BERT+CRF model. The experimental results show that, with small training sets, adding augmented data equal in amount to the original data improves the model's F1 score by about 10% and 7% on the two datasets respectively; a significant improvement also remains as the number of training samples increases.
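The first algorithm, crossover substitution between existing entities, might look roughly like the Python sketch below. This is an illustration under stated assumptions, not the paper's implementation: it assumes character-level samples with BIO tags, and the helper names (`extract_entities`, `build_entity_pool`, `crossover_substitute`) and the toy sentences are hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Assumed sample format: (characters, BIO tags), one tag per character.
Sample = Tuple[List[str], List[str]]


def extract_entities(tags: List[str]) -> List[Tuple[int, int, str]]:
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans = []
    start, etype = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        if start is not None and not tag.startswith("I-"):
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_entity_pool(samples: List[Sample]) -> Dict[str, List[str]]:
    """Collect the distinct entity strings of each type seen in the training set."""
    pool: Dict[str, set] = {}
    for chars, tags in samples:
        for s, e, t in extract_entities(tags):
            pool.setdefault(t, set()).add("".join(chars[s:e]))
    return {t: sorted(v) for t, v in pool.items()}


def crossover_substitute(sample: Sample, pool: Dict[str, List[str]],
                         rng: random.Random) -> Sample:
    """Replace each entity with a different entity of the same type and re-tag it."""
    chars, tags = sample
    new_chars: List[str] = []
    new_tags: List[str] = []
    pos = 0
    for s, e, t in extract_entities(tags):
        # Copy the non-entity text between the previous entity and this one.
        new_chars.extend(chars[pos:s])
        new_tags.extend(tags[pos:s])
        original = "".join(chars[s:e])
        candidates = [x for x in pool.get(t, []) if x != original]
        replacement = rng.choice(candidates) if candidates else original
        new_chars.extend(replacement)  # a str extends character by character
        new_tags.extend(["B-" + t] + ["I-" + t] * (len(replacement) - 1))
        pos = e
    new_chars.extend(chars[pos:])
    new_tags.extend(tags[pos:])
    return new_chars, new_tags


# Toy data, invented for illustration only.
samples = [
    (list("张三在北京工作"), ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]),
    (list("李小明去了上海"), ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]),
]
pool = build_entity_pool(samples)
aug_chars, aug_tags = crossover_substitute(samples[0], pool, random.Random(0))
print("".join(aug_chars))  # 李小明在上海工作
print(aug_tags)
```

Because the replacement is re-tagged from its own length, the label sequence stays aligned even when the substituted entity is longer or shorter than the original. Generating one augmented copy per original sample would double the training set, which is the ratio the abstract reports results for.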

Key words: deep learning, text data augmentation, named entity recognition, instance crossover, Chinese name generator

LI Jian, ZHANG Ke-liang, TANG Liang, XIA Rong-jing, REN Jing-jing. Data Augmentation for Chinese Named Entity Recognition Task[J]. Computer and Modernization, 2022, 0(04): 1-6.

