Computer and Modernization (计算机与现代化)

• Database and Data Mining

A Keyword Extraction Algorithm Based on Adaptive Related Entropy

  1. (1. State Grid Hunan Power Supply Service Center (Metrology Center), Changsha 410004, China;
    2. Changde Power Supply Branch Company, State Grid Hunan Power Co. Ltd., Changde 415000, China)
  • Received: 2019-06-16 Online: 2020-04-22 Published: 2020-04-24
  • About the authors: LUO Youzhi (罗有志, b. 1989), male, from Loudi, Hunan, master's postgraduate; research interests: data mining, artificial intelligence, big data; E-mail: hn_100861@aliyun.com. CHEN Zhengming (陈征明, b. 1971), male, from Changde, Hunan, engineer, bachelor's degree. CHEN Ming (陈明, b. 1972), from Changde, Hunan, bachelor's degree.

Abstract: Compared with traditional keyword extraction techniques that rely on word frequency, the TextRank algorithm takes the similarity between word nodes into account, yet it still ignores the context of each word and the semantic structure of the article. Building on the iterative weighting of the node graph, this paper uses association rule information from the words in the surrounding context, introduces the concept of association entropy, and adaptively adjusts the damping coefficient and the sliding window size. The resulting method follows the actual semantics of the text more closely and handles low-frequency words and new words better. Experiments show that the proposed algorithm extracts keywords more accurately than TF-IDF, TextRank (TR), and other algorithms.

Key words: text mining, keyword extraction, association rules, TextRank, node
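For orientation, the baseline that the abstract builds on can be sketched as follows. This is a minimal, illustrative version of standard TextRank over a word co-occurrence graph, not the paper's adaptive method: the function name `textrank_keywords` and its parameters are assumptions made for this sketch, and `window` and `damping` are held fixed here, whereas the paper adjusts both using association entropy by a rule the abstract does not specify.

```python
from collections import defaultdict


def textrank_keywords(tokens, window=3, damping=0.85, iterations=50, top_k=3):
    """Baseline TextRank on an undirected co-occurrence graph (hypothetical helper).

    `window` and `damping` are fixed constants in this sketch; the paper
    adapts both, by a rule that the abstract does not spell out.
    """
    # Link two distinct words whenever they fall inside the same sliding
    # window of `window` consecutive tokens; repeated co-occurrence adds weight.
    weights = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window, len(tokens))):
            other = tokens[j]
            if other != word:
                weights[word][other] += 1.0
                weights[other][word] += 1.0

    # Iterate the weighted TextRank update:
    # score(v) = (1 - d) + d * sum_u [ w(u, v) / sum_k w(u, k) ] * score(u)
    scores = {word: 1.0 for word in weights}
    for _ in range(iterations):
        new_scores = {}
        for word in weights:
            rank = 0.0
            for neighbor, w in weights[word].items():
                out_sum = sum(weights[neighbor].values())
                rank += w / out_sum * scores[neighbor]
            new_scores[word] = (1.0 - damping) + damping * rank
        scores = new_scores

    # Highest score first; ties are broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]


if __name__ == "__main__":
    sample = "graph rank graph word graph node graph edge".split()
    print(textrank_keywords(sample, window=2, top_k=3))
```

In this sketch the two quantities the abstract singles out, the damping coefficient and the sliding window size, appear as plain arguments, which marks where an adaptive scheme would plug in.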

CLC number: