Data Augmentation for Chinese Named Entity Recognition Task

Computer and Modernization ›› 2022, Vol. 0 ›› Issue (04): 1-6.

• Artificial Intelligence •

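LI Jian, ZHANG Ke-liang, TANG Liang, XIA Rong-jing, REN Jing-jing

(PLA Strategic Support Force Information Engineering University, Luoyang 471003, China)

Online: 2022-05-07 Published: 2022-05-07
About the authors: LI Jian (b. 1985), male, from Chengwu, Shandong, associate professor, PhD candidate, research interest: language information processing, E-mail: maomaotfntfn@163.com; ZHANG Ke-liang (b. 1964), male, from Luoyang, Henan, professor, PhD, research interests: computational linguistics, artificial intelligence, E-mail: kliang99@sina.com; TANG Liang (b. 1976), male, from Luoyang, Henan, associate professor, PhD, research interests: knowledge graphs, intelligent retrieval, E-mail: tl_wy@163.com; XIA Rong-jing (b. 1998), female, from Kunming, Yunnan, master's student, research interest: language information processing, E-mail: cczxxrj@163.com; REN Jing-jing (b. 1985), female, from Wucheng, Shandong, lecturer, master's degree, research interest: natural language processing, E-mail: renjingjing1029@163.com.
Funding:
National Natural Science Foundation of China (11590771)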

Abstract: In low-resource natural language processing (NLP) tasks, the available data are insufficient to train an ideal deep learning model, and text data augmentation is an effective way to improve training on such tasks. For Chinese named entity recognition, this paper proposes a group of data augmentation methods based on instance substitution: a named entity in a training sample is replaced by another entity of the same type while the label is kept unchanged. The specific algorithms are: 1) crossover substitution between existing entities; 2) synonym replacement within an entity; 3) automatic generation of Chinese personal names. The methods are applied to the PeopleDailyNER and CLUENER2020 datasets, and the augmented data are used to train a BERT+CRF model. The experimental results show that, in the small-sample setting, adding augmented data equal in amount to the original data improves the model's F1 score by about 10% and 7% on the two datasets, respectively, and that training still improves markedly as the amount of sample data increases.

Key words: deep learning, text data augmentation, named entity recognition, instance crossover, Chinese name generator
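
The abstract describes the substitution methods only at a high level, so the following is a minimal sketch of how methods 1) and 3) might look, not the authors' implementation. It assumes character-level tokens with BIO tags; the function names (`extract_entities`, `build_entity_pool`, `crossover`, `generate_name`), the toy sentences, and the short surname and given-name lists are all illustrative assumptions. Because a replacement entity may differ in length from the original, the sketch regenerates the B-/I- tags for the new span while keeping the entity type unchanged.

```python
import random
from collections import defaultdict


def extract_entities(tokens, tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def build_entity_pool(dataset):
    """Group every entity mention in the training data by entity type."""
    pool = defaultdict(list)
    for tokens, tags in dataset:
        for start, end, etype in extract_entities(tokens, tags):
            pool[etype].append(tokens[start:end])
    return pool


def crossover(tokens, tags, pool, rng):
    """Replace each entity with a random same-type entity from the pool."""
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_entities(tokens, tags):
        # Copy the non-entity context unchanged.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        # Substitute the entity and rebuild its tags for the new length.
        replacement = rng.choice(pool[etype])
        new_tokens += replacement
        new_tags += ["B-" + etype] + ["I-" + etype] * (len(replacement) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Illustrative placeholder lists, not the name statistics used in the paper.
SURNAMES = ["王", "李", "张", "刘", "陈"]
GIVEN_CHARS = ["伟", "芳", "娜", "敏", "静", "强"]


def generate_name(rng):
    """Compose a surname with a one- or two-character given name."""
    given_len = rng.choice([1, 2])
    return [rng.choice(SURNAMES)] + [rng.choice(GIVEN_CHARS) for _ in range(given_len)]


rng = random.Random(0)
train = [
    (list("李明在北京工作"), ["B-PER", "I-PER", "O", "B-LOC", "I-LOC", "O", "O"]),
    (list("王小红去了上海"), ["B-PER", "I-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]),
]
pool = build_entity_pool(train)
pool["PER"].append(generate_name(rng))  # generated names join the PER pool
aug_tokens, aug_tags = crossover(*train[0], pool, rng)
print("".join(aug_tokens), aug_tags)
```

In this sketch a sampled replacement can coincide with the original entity; a fuller version would presumably filter such cases and deduplicate the augmented samples before training.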

Cite this article: LI Jian, ZHANG Ke-liang, TANG Liang, XIA Rong-jing, REN Jing-jing. Data Augmentation for Chinese Named Entity Recognition Task[J]. Computer and Modernization, 2022, 0(04): 1-6.
