A Web News Extraction Method Based on Filtering Noise Wrapper

doi:10.3969/j.issn.1006-2475.2017.01.002

Computer and Modernization (计算机与现代化), 2017, Vol. 0, Issue (1): 5-12. doi: 10.3969/j.issn.1006-2475.2017.01.002

(School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)

Received: 2016-06-13 Online: 2017-01-12 Published: 2017-01-11
About the authors: SUN Meng (born 1992), female, from Qingdao, Shandong, master's student at the School of Computer and Information Technology, Beijing Jiaotong University; research interest: information processing. QU You-li (born 1974), male, senior engineer, Ph.D.; research interest: information retrieval.
Funding:
Fundamental Research Funds for the Central Universities (2015JBM035)

Abstract

Abstract: Extracting high-purity news from massive numbers of Web news pages and storing it in a structured form is the foundation for research such as public opinion monitoring and topic updating. This paper proposes a Web news extraction method based on a noise-filtering wrapper. During wrapper induction, when the strings from the two pages being compared fail to match, the method computes their string tag path ratio and, according to a threshold α, marks them with different symbols to distinguish news content from noise. The paper also proposes a naive Bayes Web news title classifier and a time classifier to extract the title and time of the news. Experimental results show that, compared with other extraction techniques, the proposed method significantly improves accuracy and robustness and has considerable practical value.

Key words: information extraction, wrapper, Web news pages, ACME algorithm, string tag path ratio, naive Bayes classifier
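The threshold step described in the abstract can be illustrated with a rough sketch. This page does not give the paper's definition of the string tag path ratio, so the code below assumes a simple stand-in (characters of text divided by the number of tags on the node's tag path). It also replaces the wrapper-induction alignment with a plain set comparison between two pages. The names `TextPathCollector`, `path_ratio`, and `label_mismatches`, and the value of `alpha`, are hypothetical and chosen only for illustration.

```python
from html.parser import HTMLParser

# Tags that never enclose text, so they are kept out of the tag path.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class TextPathCollector(HTMLParser):
    """Collects (tag_path, text) pairs for every non-empty text node."""

    def __init__(self):
        super().__init__()
        self.stack = []   # currently open tags, outermost first
        self.nodes = []   # collected (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.nodes.append((tuple(self.stack), text))


def text_nodes(html):
    """Return all (tag_path, text) pairs found in an HTML string."""
    collector = TextPathCollector()
    collector.feed(html)
    collector.close()
    return collector.nodes


def path_ratio(path, text):
    # ASSUMED definition, not taken from the paper:
    # characters of text per tag on the node's tag path.
    return len(text) / max(len(path), 1)


def label_mismatches(page_a, page_b, alpha):
    """Label strings of page_a that have no identical counterpart in page_b.

    A mismatched string is marked "content" when its ratio reaches the
    threshold alpha, and "noise" otherwise. Matched strings are treated
    as shared template text and skipped.
    """
    seen_in_b = set(text_nodes(page_b))
    labels = []
    for path, text in text_nodes(page_a):
        if (path, text) in seen_in_b:
            continue
        label = "content" if path_ratio(path, text) >= alpha else "noise"
        labels.append((text, label))
    return labels


page_a = ("<html><body><div><p>Local officials announced a new transit plan "
          "on Monday.</p></div><div><ul><li><a>Hot</a></li></ul>"
          "<span>Site map</span></div></body></html>")
page_b = ("<html><body><div><p>A different story about the weather this "
          "week.</p></div><div><ul><li><a>Top</a></li></ul>"
          "<span>Site map</span></div></body></html>")
print(label_mismatches(page_a, page_b, alpha=5.0))
```

In this toy input, the long paragraph sits on a short tag path and the navigation link sits on a deep one, which is the kind of contrast a ratio threshold is meant to exploit. A real wrapper would align the two pages token by token rather than compare sets of strings.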

CLC number: TP391.1

SUN Meng, QU You-li. A Web News Extraction Method Based on Filtering Noise Wrapper[J]. Computer and Modernization, 2017, 0(1): 5-12.
