Computer and Modernization

• Database and Data Mining •

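Topic Event Extraction Technology Based on LDA Model and AP Clustering Method

  1. (North China Institute of Computing Technology, Beijing 100083, China)
  • Received: 2017-05-17 Online: 2017-12-25 Published: 2017-12-26
  • About the authors: Zhang Jianheng (张建恒, 1991-), male, born in Beijing, master's student at the North China Institute of Computing Technology; research interests: big data processing and natural language processing. Huang Wei (黄蔚, 1972-), female, research fellow, master's degree; research interests: big data processing, integration, and mining analysis. Hu Guochao (胡国超, 1982-), male, engineer, master's degree; research interests: big data processing, integration, and mining analysis.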

Abstract: Current event extraction techniques generally extract event information directly from text, ignoring the text's information structure, and their results are easily affected by the distribution of words in the text. This paper analyzes the hierarchical concept structure of text and proposes a method for extracting topic event information from news, based on two-stage clustering followed by re-partitioning. The method extracts hierarchical topic-event information and reduces interference from related events through a second extraction of information words, which improves the extraction results. It also uses an event timeline to present how each event develops. Experimental results show that the method can effectively extract topic event information from text.

Key words: topic event extraction, LDA topic model, AP clustering, hierarchical information, two-stage extraction
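The abstract names AP (affinity propagation) clustering as one building block but gives no detail on how it is applied. As a minimal sketch of that single step only, the following pure-Python code runs the standard responsibility/availability message passing on a similarity matrix. It is not the authors' implementation, and it does not cover the LDA stage or the two-stage re-partitioning. The function name `affinity_propagation`, the toy one-dimensional `features`, and the `preference` value are all assumptions made for illustration.

```python
def affinity_propagation(S, damping=0.5, iterations=200):
    """Cluster items given a similarity matrix S (diagonal = preferences).

    Returns (exemplars, labels), where labels[i] is the index of the
    exemplar that item i is assigned to.
    """
    n = len(S)
    R = [[0.0] * n for _ in range(n)]  # responsibilities r(i, k)
    A = [[0.0] * n for _ in range(n)]  # availabilities a(i, k)
    for _ in range(iterations):
        # Responsibility: how well k suits i compared with the best rival.
        for i in range(n):
            for k in range(n):
                rival = max(A[i][j] + S[i][j] for j in range(n) if j != k)
                R[i][k] = damping * R[i][k] + (1 - damping) * (S[i][k] - rival)
        # Availability: how much support k has gathered from other items.
        for k in range(n):
            support = [max(0.0, R[j][k]) for j in range(n)]
            total = sum(support) - support[k]  # positive support from j != k
            for i in range(n):
                if i == k:
                    new = total
                else:
                    new = min(0.0, R[k][k] + total - support[i])
                A[i][k] = damping * A[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if R[k][k] + A[k][k] > 0]
    if not exemplars:
        return [], [-1] * n
    labels = [i if i in exemplars else max(exemplars, key=lambda k: S[i][k])
              for i in range(n)]
    return exemplars, labels


# Hypothetical toy input: one-dimensional features standing in for
# document vectors (for example, topic proportions produced by LDA).
features = [-1.0, 0.0, 1.0, 9.0, 10.0, 11.0]
preference = -10.0  # assumed value; lower preferences yield fewer clusters
S = [[preference if i == j else -(x - y) ** 2
      for j, y in enumerate(features)]
     for i, x in enumerate(features)]

exemplars, labels = affinity_propagation(S)
print(exemplars, labels)
```

One property that makes AP a natural fit for event grouping is that the number of clusters is not fixed in advance. It emerges from the preference values on the diagonal of the similarity matrix, so the number of events under a topic does not have to be known beforehand.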
