A Keyword Extraction Algorithm Based on Adaptive Related Entropy

doi:10.3969/j.issn.1006-2475.2020.04.012

Computer and Modernization, 2020, Vol. 0, Issue (04): 67-.


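(1. State Grid Hunan Power Supply Service Center (Metrology Center), Changsha 410004, Hunan, China; 2. Changde Power Supply Branch Company, State Grid Hunan Power Co. Ltd., Changde 415000, Hunan, China)

Received: 2019-06-16 Online: 2020-04-22 Published: 2020-04-24
About the authors: LUO You-zhi (1989-), male, from Loudi, Hunan, master's postgraduate; research interests: data mining, artificial intelligence, big data; E-mail: hn_100861@aliyun.com. CHEN Zheng-ming (1971-), male, from Changde, Hunan, engineer, bachelor's degree. CHEN Ming (1972-), from Changde, Hunan, bachelor's degree.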


Abstract


Abstract: Compared with traditional keyword extraction techniques that rely on word frequency, the TextRank algorithm takes the similarity between word nodes into account, but it still ignores the contextual information of words and the semantic structure of the article. Building on the iterative weighting of the node graph, this paper uses association-rule information from the words in the text context and introduces the concept of association entropy to adaptively adjust the damping coefficient and the sliding-window size. The resulting weights are closer to the actual semantics of the words in the text, and the method handles low-frequency words and new words better. Experimental results show that, compared with algorithms such as TFIDF and TextRank (TR), the proposed algorithm extracts keywords more accurately.
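
The two quantities the abstract says are adapted, the damping coefficient and the sliding-window size, can be located in a minimal TextRank sketch such as the one below. This is an illustration under stated assumptions, not the paper's algorithm: the abstract does not give the formula for association entropy, so `neighbor_entropy` (Shannon entropy of a word's co-occurrence distribution) and the per-node damping rule inside `textrank` are hypothetical stand-ins, and all function and parameter names are invented for this example. The window size is exposed as a plain parameter and is not adapted here.

```python
import math
from collections import defaultdict


def cooccurrence_graph(tokens, window):
    """Undirected graph; edge weight = co-occurrence count inside the window."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        # A window of size `window` covers the word and the next window-1 tokens.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def neighbor_entropy(graph, node):
    """Shannon entropy (bits) of a node's co-occurrence distribution.

    Hypothetical placeholder for the paper's association entropy.
    """
    total = sum(graph[node].values())
    return -sum((c / total) * math.log2(c / total) for c in graph[node].values())


def textrank(tokens, window=3, damping=0.85, adaptive=False, iterations=50):
    """Return {word: score}; classic TextRank unless adaptive=True."""
    graph = cooccurrence_graph(tokens, window)
    strength = {n: sum(graph[n].values()) for n in graph}
    d = {n: damping for n in graph}  # classic: one damping value for all nodes
    if adaptive:
        entropy = {n: neighbor_entropy(graph, n) for n in graph}
        h_max = max(entropy.values(), default=0.0) or 1.0
        # Assumed rule (not from the paper): scale damping by normalized entropy.
        d = {n: damping * (0.5 + 0.5 * entropy[n] / h_max) for n in graph}
    scores = {n: 1.0 for n in graph}
    for _ in range(iterations):
        # Synchronous update: each node collects weight from its neighbors.
        scores = {
            v: (1 - d[v]) + d[v] * sum(w / strength[u] * scores[u]
                                        for u, w in graph[v].items())
            for v in graph
        }
    return scores


tokens = "keyword extraction graph keyword entropy graph keyword window".split()
classic = textrank(tokens)
adaptive = textrank(tokens, adaptive=True)
print(sorted(classic, key=classic.get, reverse=True))
print(sorted(adaptive, key=adaptive.get, reverse=True))
```

With `adaptive=False` the function is ordinary weighted TextRank, which gives a baseline against which any entropy-driven adjustment can be compared on the same token sequence.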

Key words: text mining, keyword extraction, association rules, TextRank, node

CLC number: TP301


LUO You-zhi 1,2, CHEN Zheng-ming2, CHEN Ming2, MEI Wen-tao2. A Keyword Extraction Algorithm Based on Adaptive Related Entropy[J]. Computer and Modernization, 2020, 0(04): 67-.


