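Hierarchical Representation of Power Text Named Entity Recognition and Project-expert Matching

Computer and Modernization, 2022, Vol. 0, Issue (05): 75-81.

(1. Electric Power Research Institute of Yunnan Power Grid Co., Ltd., Kunming 650217, China;
2. School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China)

Online: 2022-06-08 Published: 2022-06-08
About the authors: YANG Zheng (born 1987), male, from Baoshan, Yunnan, senior engineer, M.S.; research interests: power text analysis and applications, power grid digitalization, and network security; E-mail: 1351876072@163.com. CAI Di (born 1998), male, from Shangluo, Shaanxi, master's student; research interests: named entity recognition, new word discovery, text duplicate checking, and intelligent question answering; E-mail: cd1998@xjtu.edu.cn. Corresponding author: LI Hui-bin (born 1984), male, from Xi’an, Shaanxi, associate professor, Ph.D.; research interests: natural language processing, deep learning, image processing, and pattern recognition; E-mail: huibinli@mail.xjtu.edu.cn.
Funding:
National Natural Science Foundation of China, General Program (61976173); Ministry of Education-China Mobile Artificial Intelligence Construction Project (MCM20190701)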

Abstract

Abstract: To address the difficulty of accurately matching projects with experts when reviewing science and technology project applications in the power field, this paper proposes a power text named entity recognition model based on hierarchical semantic representation (Attention-RoBerta-BiLSTM-CRF, ARBC), together with a strategy for matching power projects to power experts based on semantic-pictorial dual feature space mapping. The ARBC model consists of a word embedding module, a bidirectional long short-term memory (BiLSTM) module, and a conditional random field (CRF) module. The word embedding module draws on three levels of information in the power text: words, sentences, and documents. Specifically, word embedding vectors are first extracted with the pre-trained RoBerta model. The contextual representation of each sentence is then enhanced by a document-level attention mechanism based on term frequency-inverse document frequency (TF-IDF) values. Finally, the word embedding and the sentence embedding are fused by linear weighting to form a hierarchical representation vector for each word. Building on the named entities output by the ARBC model, an entity matching strategy between project texts and domain experts is proposed, based on semantic-pictorial dual feature space mapping, to match projects with experts effectively and accurately. Experimental results show that the ARBC model achieves an F1 value of 83% on a named entity recognition test set of 2000 power project abstract texts, significantly higher than text representation methods based on Bert and RoBerta. In addition, the entity matching strategy based on dual feature space mapping achieves 85% accuracy on the task of matching power texts to power experts.

Key words: hierarchical representation, named entity recognition, expert matching, power text
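
As a rough illustration of the hierarchical word representation summarized in the abstract, the sketch below builds a sentence vector as a TF-IDF-weighted sum of word vectors and then fuses it linearly with each word vector. It is not the authors' implementation. Random vectors stand in for RoBerta embeddings, and the softmax over TF-IDF values, the smoothed IDF formula, the fusion weight `alpha`, the toy corpus, and the function names `tfidf_scores` and `hierarchical_embeddings` are all assumptions made for illustration.

```python
import math
from collections import Counter

import numpy as np


def tfidf_scores(doc_tokens, corpus):
    """TF-IDF of every distinct token in doc_tokens (smoothed IDF, an assumption)."""
    tf = Counter(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for tok, count in tf.items():
        df = sum(1 for d in corpus if tok in d)          # document frequency
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0    # smoothed IDF
        scores[tok] = (count / len(doc_tokens)) * idf
    return scores


def hierarchical_embeddings(sentence, word_vecs, scores, alpha=0.7):
    """Fuse each word vector with a TF-IDF-attended sentence vector.

    sentence  : list of n tokens
    word_vecs : array of shape (n, d), a stand-in for RoBerta outputs
    scores    : token -> document-level TF-IDF value
    alpha     : hypothetical fusion weight given to the word-level vector
    """
    raw = np.array([scores.get(tok, 0.0) for tok in sentence])
    exp = np.exp(raw - raw.max())        # numerically stable softmax
    attn = exp / exp.sum()               # attention weights, sum to 1
    sent_vec = attn @ word_vecs          # (d,) sentence embedding
    # Linear weighted fusion; sent_vec is broadcast across all n rows.
    fused = alpha * word_vecs + (1.0 - alpha) * sent_vec
    return fused, attn


# Toy corpus of tokenized "documents" (made-up example data).
corpus = [
    ["transformer", "fault", "diagnosis", "method"],
    ["power", "grid", "fault", "location"],
    ["transformer", "oil", "temperature", "monitoring"],
]
doc = corpus[0]
scores = tfidf_scores(doc, corpus)

rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(len(doc), 8))   # placeholder embeddings, d = 8

fused, attn = hierarchical_embeddings(doc, word_vecs, scores)
print(fused.shape)  # (4, 8)
```

In the ARBC pipeline described above, vectors like `fused` would be the input to the BiLSTM and CRF layers. Rarer tokens such as "diagnosis" receive more attention than tokens shared across documents, which is the intended effect of TF-IDF weighting.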

YANG Zheng, CAI Di, LI Hui-bin. Hierarchical Representation of Power Text Named Entity Recognition and Project-expert Matching[J]. Computer and Modernization, 2022, 0(05): 75-81.

