Computer and Modernization (计算机与现代化)

• Artificial Intelligence

A Web News Extraction Method Based on Filtering Noise Wrapper

  

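  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Received: 2016-06-13 Online: 2017-01-12 Published: 2017-01-11
  • About the authors: Sun Meng (孙梦, b. 1992), female, from Qingdao, Shandong, master's student at the School of Computer and Information Technology, Beijing Jiaotong University, research interest: information processing; Qu Youli (瞿有利, b. 1974), male, senior engineer, Ph.D., research interest: information retrieval.
  • Funding:
    Fundamental Research Funds for the Central Universities (2015JBM035)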

A Web News Extraction Method Based on Filtering Noise Wrapper

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Received: 2016-06-13 Online: 2017-01-12 Published: 2017-01-11

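Abstract: Extracting high-purity news from massive numbers of Web news pages and storing it in structured form is the foundation of research such as public opinion monitoring and topic updating. This paper proposes a method for extracting Web news based on a noise-filtering wrapper. During wrapper induction, if the strings of two pages mismatch when compared, their string tag path ratio is computed and, according to a threshold α, the strings are marked with different symbols to distinguish news content from noise. A naive Bayes Web news title classifier and a time classifier are also proposed to extract the news title and time. Experimental results show that, compared with other extraction techniques, the method significantly improves accuracy and robustness and has important practical value.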
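To make the thresholding step concrete, here is a minimal Python sketch. The abstract does not define the string tag path ratio, so the definition used here (text length divided by the number of tags on the node's path) is an assumption, as are the names `PathTextCollector`, `tag_path_ratio`, `label_nodes`, and the sample page. The sketch also labels every text node of a single page, whereas the paper applies the ratio only to strings that mismatch during wrapper induction over two pages.

```python
from html.parser import HTMLParser


class PathTextCollector(HTMLParser):
    """Collect (tag_path, text) pairs for every non-empty text node."""

    VOID = {"br", "img", "hr", "meta", "link", "input"}  # tags with no end tag

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag, tolerating sloppy HTML.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append(("/".join(self.stack), text))


def tag_path_ratio(path, text):
    """Assumed definition: characters of text per tag on the path."""
    depth = path.count("/") + 1 if path else 1
    return len(text) / depth


def label_nodes(html, alpha):
    """Label each text node 'news' if its ratio reaches alpha, else 'noise'."""
    parser = PathTextCollector()
    parser.feed(html)
    parser.close()
    return [
        ("news" if tag_path_ratio(path, text) >= alpha else "noise", text)
        for path, text in parser.nodes
    ]


# Hypothetical page: a navigation link (deep path, short text) and a paragraph.
SAMPLE_HTML = (
    "<html><body><div><ul><li><a>Home</a></li></ul></div>"
    "<div><p>The city council approved the new transit budget on Tuesday.</p></div>"
    "</body></html>"
)

if __name__ == "__main__":
    for label, text in label_nodes(SAMPLE_HTML, alpha=5.0):
        print(label, text)
```

With this assumed definition, short strings buried under many tags (menus, link lists) score low and long strings under shallow paths score high, so a single threshold separates them. The value of α here is illustrative only.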

Key words: information extraction, wrapper, news Web pages, ACME algorithm, string tag path ratio, naive Bayes classifier

Abstract: Extracting high-purity news from massive numbers of Web news pages and storing it in a structured form is the foundation of research on public opinion monitoring and topic updating. This paper proposes a Web news extraction method based on a noise-filtering wrapper. During wrapper induction, if the strings from two pages do not match, we calculate their string tag path ratio and, according to a threshold, mark them to distinguish news content from noise. We also propose two naive Bayes classifiers to extract the title and time of the news. Experimental results show that, compared with other extraction techniques, the proposed method significantly improves accuracy and robustness, and therefore has substantial practical value.
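The title classifier could be sketched as a small categorical naive Bayes model with add-one smoothing, written here in plain Python. The abstract does not list the features or training data the authors use, so the features (`tag`, `short`, `ends_with_period`), the class `NaiveBayes`, and the toy training rows below are all hypothetical. A time classifier could follow the same pattern with different features.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, rows, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.values = defaultdict(set)              # feature -> values seen in training
        for row, label in zip(rows, labels):
            for feature, value in row.items():
                self.feature_counts[(label, feature)][value] += 1
                self.values[feature].add(value)
        return self

    def log_posterior(self, row, label):
        """Unnormalized log P(label | row) under the independence assumption."""
        total = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total)
        for feature, value in row.items():
            counts = self.feature_counts[(label, feature)]
            numerator = counts[value] + 1
            denominator = self.class_counts[label] + len(self.values[feature])
            score += math.log(numerator / denominator)
        return score

    def predict(self, row):
        return max(self.class_counts, key=lambda label: self.log_posterior(row, label))


# Hypothetical training data: one feature dict per candidate text node.
TRAIN_ROWS = [
    {"tag": "h1", "short": True, "ends_with_period": False},
    {"tag": "h2", "short": True, "ends_with_period": False},
    {"tag": "h1", "short": True, "ends_with_period": False},
    {"tag": "p", "short": False, "ends_with_period": True},
    {"tag": "p", "short": True, "ends_with_period": True},
    {"tag": "div", "short": False, "ends_with_period": False},
]
TRAIN_LABELS = ["title", "title", "title", "other", "other", "other"]

if __name__ == "__main__":
    model = NaiveBayes().fit(TRAIN_ROWS, TRAIN_LABELS)
    candidate = {"tag": "h1", "short": True, "ends_with_period": False}
    print(model.predict(candidate))
```

Add-one smoothing keeps a feature value that never co-occurred with a class during training from driving that class's probability to zero, which matters with training sets as small as this toy one.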

Key words: information extraction, wrapper, Web news pages, ACME algorithm, string tag path ratio, naive Bayes classifier

CLC number: