计算机与现代化 (Computer and Modernization), 2022, Vol. 0, Issue (05): 75-81.

• Database and Data Mining •

Named Entity Recognition and Matching Algorithm for Electric Power Text Based on Hierarchical Representation

  

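  1. (1. Electric Power Research Institute of Yunnan Power Grid Co., Ltd., Kunming 650217, Yunnan, China; 2. School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, Shaanxi, China)
  • Publication date: 2022-06-08  Release date: 2022-06-08
  • About the authors: Yang Zheng (杨政, born 1987), male, from Baoshan, Yunnan, senior engineer, master's degree; research interests: power text analysis and applications, power grid digitalization, and network security; E-mail: 1351876072@163.com. Cai Di (蔡迪, born 1998), male, from Shangluo, Shaanxi, master's student; research interests: named entity recognition, new word discovery, text duplicate checking, and intelligent question answering; E-mail: cd1998@xjtu.edu.cn. Corresponding author: Li Huibin (李慧斌, born 1984), male, from Xi’an, Shaanxi, associate professor, Ph.D.; research interests: natural language processing, deep learning, image processing, and pattern recognition; E-mail: huibinli@mail.xjtu.edu.cn.
  • Supported by:
    General Program of the National Natural Science Foundation of China (61976173); Ministry of Education-China Mobile Artificial Intelligence Construction Project (MCM20190701)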

Hierarchical Representation of Power Text Named Entity Recognition and Project-expert Matching

  1. (1. Electric Power Research Institute of Yunnan Power Grid Co., Ltd., Kunming 650217, China;
    2. School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China)
  • Online: 2022-06-08  Published: 2022-06-08

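Abstract: To address the difficulty of precisely matching projects with experts when reviewing science and technology project applications in the electric power field, this paper proposes a power text named entity recognition model based on hierarchical semantic representation (Attention-RoBerta-BiLSTM-CRF, ARBC), together with a strategy for matching power projects to power experts based on semantic and pictorial double feature space mapping. The ARBC model comprises a word embedding module, a bidirectional long short-term memory (BiLSTM) module, and a conditional random field (CRF) module. The word embedding module draws on information from three levels of the power text at once: words, sentences, and documents. Specifically, word embedding vectors are first extracted with the pre-trained RoBerta model; a document-level attention mechanism based on term frequency-inverse document frequency (TF-IDF) values is then introduced to strengthen the contextual representation of each sentence; finally, the word embedding and the sentence embedding are fused by linear weighting to form a hierarchical representation vector for each word. Building on the named entities output by the ARBC model, an entity matching strategy between project texts and domain experts, based on semantic and pictorial double feature space mapping, is further proposed to achieve effective and precise project-expert matching. Experimental results show that the ARBC model attains an F1 score of 83% on a named entity recognition test set of 2000 power project abstracts, significantly higher than text representation methods based on Bert and RoBerta. In addition, the entity matching strategy based on double feature space mapping reaches 85% accuracy on the task of matching power texts to power experts.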

Keywords: hierarchical representation, named entity recognition, expert matching, power text

Abstract: To address the project-expert matching problem in the review of science and technology project applications in the power field, this paper proposes a novel hierarchical word representation model (Attention-RoBerta-BiLSTM-CRF, ARBC) for power text named entity recognition. A project-expert matching algorithm based on a semantic and pictorial double feature space mapping strategy is also presented. The ARBC model consists of a word embedding module, a Bi-directional Long Short-Term Memory (BiLSTM) module, and a Conditional Random Field (CRF) module. The hierarchical word embedding module uses word-level, sentence-level, and document-level information from the power text. Specifically, the word embedding vector is first extracted from the pre-trained RoBerta model. Then, the contextual representation of each sentence is enhanced by introducing a document-level attention mechanism based on term frequency-inverse document frequency (TF-IDF) values. Finally, the word embedding and sentence embedding are fused by linear weighting to form a hierarchical representation vector for a given word. Once the named entities of power texts are recognized by the ARBC model, effective and accurate entity matching between power projects and experts is achieved through the semantic and pictorial double feature space mapping strategy. Experiments on a set of 2000 power project abstract texts for named entity recognition show that the ARBC model achieves an F1 value of 83%, which is significantly higher than that of widely used pre-trained models such as Bert and RoBerta. In addition, the entity matching strategy based on double feature space mapping achieves 85% accuracy on the power text-expert matching task.

Key words: hierarchical representation, named entity recognition, expert matching, power text
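
The hierarchical word representation described in the abstract (word embeddings, a TF-IDF-based attention over the sentence, then linear weighted fusion) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the function names, the smoothed IDF formula, the softmax normalization of TF-IDF scores, the fusion weight `alpha`, and the random vectors that stand in for RoBerta outputs.

```python
import math
from collections import Counter

import numpy as np


def tfidf_scores(sentence, document, corpus):
    """TF-IDF of each token in `sentence`.

    TF is counted over `document`; IDF is computed over `corpus`,
    a list of tokenized documents (hypothetical toy setup).
    """
    tf = Counter(document)
    n_docs = len(corpus)
    scores = []
    for token in sentence:
        df = sum(1 for doc in corpus if token in doc)
        idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumed)
        scores.append(tf[token] / len(document) * idf)
    return np.array(scores, dtype=float)


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def hierarchical_embeddings(word_vecs, scores, alpha=0.5):
    """Fuse each word vector with a TF-IDF-attention sentence vector."""
    attn = softmax(scores)            # one attention weight per token
    sentence_vec = attn @ word_vecs   # weighted sum, shape (dim,)
    # Linear weighted fusion; sentence_vec broadcasts across all tokens.
    return alpha * word_vecs + (1.0 - alpha) * sentence_vec


# Toy usage: random vectors stand in for pre-trained word embeddings.
corpus = [
    ["transformer", "fault", "diagnosis", "method"],
    ["power", "grid", "fault", "location"],
    ["transformer", "oil", "temperature", "monitoring"],
]
document = corpus[0]
sentence = ["transformer", "fault", "diagnosis"]
rng = np.random.default_rng(0)
word_vecs = rng.normal(size=(len(sentence), 8))
fused = hierarchical_embeddings(word_vecs, tfidf_scores(sentence, document, corpus))
print(fused.shape)  # (3, 8)
```

In this sketch, `alpha=1.0` reduces to the plain word embeddings and `alpha=0.0` assigns every token the same sentence vector; in the ARBC pipeline the fused vectors would then be passed to the BiLSTM and CRF modules.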