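E-government Text Similarity Evaluation Model Based on Do-Bi-LSTM Model

Computer and Modernization, 2020, Vol. 0, Issue (07): 71-75. doi: 10.3969/j.issn.1006-2475.2020.07.014

(School of Computer Science and Technology, Taiyuan University of Science and Technology, Taiyuan 030024, China)

Online: 2020-07-06 Published: 2020-07-15
About the authors: LI Fan (1994-), female, from Lüliang, Shanxi, master's student, research interest: natural language processing, E-mail: 2659238942@qq.com; BAI Shang-wang (1964-), male, professor, Ph.D., research interests: database and software engineering technology, information management and decision support; DANG Wei-chao (1974-), male, associate professor, Ph.D., research interests: database and software engineering, information management and information systems, distributed computing; PAN Li-hu (1974-), male, associate professor, Ph.D., research interest: artificial intelligence.
Supported by:
Shanxi Science and Technology Major Special Project of the 12th Five-Year Plan (20121101001); Shanxi Province and Chinese Academy of Sciences Science and Technology Cooperation Project (20141101001); Shanxi Science and Technology Research Project (20141039)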

Abstract

Abstract: To address the low efficiency of manually reviewing and approving texts in current government systems, this paper introduces text similarity into e-government. Existing text-similarity network models generate very large word-vector matrices that take a long time to train, and they build word vectors from the surrounding context alone, ignoring the word order and semantic relationships of the document. To improve efficiency and reduce training cost, this paper proposes a text similarity calculation method based on Do-Bi-LSTM. The model first converts the texts in the training data set into vectors with the Doc2vec language model, which adds a text vector on top of the word vectors and thereby captures the relationships between sentences and between paragraphs. The resulting vectors are then used as the input of a Bi-LSTM network for model training. Finally, in comparison with an LSTM network model and a traditional deep network model, experiments show that the proposed method achieves considerably higher accuracy and is feasible.

Key words: text similarity, Doc2vec, bi-directional long short-term memory
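The pipeline described in the abstract (document vectors fed into a bidirectional LSTM) can be illustrated with a minimal sketch. This is not the authors' implementation: the weights are random and untrained, the toy vectors merely stand in for Doc2vec output, feeding one vector per sentence is an assumption, and the cosine score at the end is an assumed way of comparing two encodings that the abstract does not specify. All function names are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def init_lstm(input_dim, hidden_dim, rng):
    """Random, untrained weights for one LSTM direction (illustrative only)."""
    def matrix(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    # One weight matrix and bias per gate: input, forget, candidate, output.
    return {
        name: (matrix(hidden_dim, input_dim + hidden_dim), [0.0] * hidden_dim)
        for name in ("i", "f", "g", "o")
    }


def gate(params, name, z, activation):
    """Apply one gate: activation(W @ z + b)."""
    weights, bias = params[name]
    return [activation(sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(weights, bias)]


def lstm_final_state(params, sequence, hidden_dim):
    """Run one LSTM direction over the sequence; return the last hidden state."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    for x in sequence:
        z = list(x) + h  # current input concatenated with previous hidden state
        i = gate(params, "i", z, sigmoid)
        f = gate(params, "f", z, sigmoid)
        g = gate(params, "g", z, math.tanh)
        o = gate(params, "o", z, sigmoid)
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
    return h


def bilstm_encode(fwd, bwd, sequence, hidden_dim):
    """Concatenate the final states of a forward pass and a backward pass."""
    forward = lstm_final_state(fwd, sequence, hidden_dim)
    backward = lstm_final_state(bwd, list(reversed(sequence)), hidden_dim)
    return forward + backward


def cosine(u, v):
    """Cosine similarity (assumed scoring function, not taken from the paper)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


rng = random.Random(0)
INPUT_DIM, HIDDEN_DIM = 4, 3
fwd = init_lstm(INPUT_DIM, HIDDEN_DIM, rng)
bwd = init_lstm(INPUT_DIM, HIDDEN_DIM, rng)

# Toy stand-ins for the Doc2vec vectors of the sentences in two documents.
doc_a = [[0.2, -0.1, 0.4, 0.0], [0.1, 0.3, -0.2, 0.5]]
doc_b = [[0.2, -0.1, 0.4, 0.0], [0.0, 0.3, -0.1, 0.5]]

enc_a = bilstm_encode(fwd, bwd, doc_a, HIDDEN_DIM)
enc_b = bilstm_encode(fwd, bwd, doc_b, HIDDEN_DIM)
print("encoding size:", len(enc_a))
print("similarity:", cosine(enc_a, enc_b))
```

In a real system the vectors would come from a trained Doc2vec model and the LSTM weights would be learned from labeled text pairs; the sketch only shows how the two directions are combined into one representation.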

CLC Number: TP311

LI Fan, BAI Shang-wang, DANG Wei-chao, PAN Li-hu. E-government Text Similarity Evaluation Model Based on Do-Bi-LSTM Model[J]. Computer and Modernization, 2020, 0(07): 71-75.
