Web Information Segmentation Method Based on DOM Structure Tree

doi:10.3969/j.issn.1006-2475.2013.10.056

Computer and Modernization, 2013, Vol. 218, Issue (10): 229-232.

Web Information Segmentation Method Based on DOM Structure Tree

ZHOU Jian¹, TANG Jin¹,², LUO Bin¹,²

1. School of Computer Science and Technology, Anhui University, Hefei 230601, China; 2. Key Lab of Industrial Image Processing & Analysis of Anhui Province, Hefei 230039, China

Received: 2013-05-23  Revised: 1900-01-01  Online: 2013-10-26  Published: 2013-10-26

Abstract

Correctly extracting and segmenting the main text of a Web page is important for text mining and related tasks. This paper proposes and implements a method that extracts the main text from a Web page and preserves the paragraph segmentation of the original text. The method first builds a DOM structure tree from the page layout tags <table> and <div>. It then uses the nesting relations among layout tags, as reflected in the DOM structure tree, to keep or discard content blocks and so extract the correct main text. Finally, it segments the main text into paragraphs by handling certain special tags. Experiments show that the method is easy to implement and efficient, and that it extracts and segments the main text automatically and accurately.

Key words: semantic markup, layout tag, segmentation, noise
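The three steps named in the abstract (build a tree from <table> and <div>, choose content blocks by their nesting, split the result into paragraphs) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the block selection rule or which "special tags" drive segmentation, so the scoring heuristic (own text length minus a penalty for link text) and the `BREAK_TAGS` set below are assumptions, and all class and function names are hypothetical.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"div", "table"}                    # layout tags named in the abstract
BREAK_TAGS = {"p", "br", "li", "h1", "h2", "h3"}  # assumed paragraph-break tags
SKIP_TAGS = {"script", "style"}                   # never part of the main text


class Block:
    """One node of the layout tree: a <div> or <table> and the text it owns."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.pieces = []     # text chunks and "\n" break markers, in document order
        self.link_chars = 0  # characters that sit inside <a> elements

    def score(self):
        # Assumed heuristic: reward plain text, penalise link text (navigation noise).
        text_chars = sum(len(p.strip()) for p in self.pieces)
        return text_chars - 2 * self.link_chars


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree of layout blocks; text goes to the innermost open block."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.in_link = 0
        self.skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip += 1
        elif tag in LAYOUT_TAGS:
            child = Block(tag, self.current)
            self.current.children.append(child)
            self.current = child
        elif tag == "a":
            self.in_link += 1
        elif tag in BREAK_TAGS:
            self.current.pieces.append("\n")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip = max(0, self.skip - 1)
        elif tag in LAYOUT_TAGS:
            # Only well-nested closing tags are handled in this sketch.
            if self.current.parent is not None and self.current.tag == tag:
                self.current = self.current.parent
        elif tag == "a":
            self.in_link = max(0, self.in_link - 1)
        elif tag in BREAK_TAGS:
            self.current.pieces.append("\n")

    def handle_data(self, data):
        if self.skip:
            return
        self.current.pieces.append(data)
        if self.in_link:
            self.current.link_chars += len(data.strip())


def iter_blocks(block):
    """Yield a block and all of its descendants, depth first."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def extract_paragraphs(html):
    """Return the paragraphs of the highest-scoring layout block."""
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_blocks(builder.root), key=Block.score)
    raw = "".join(best.pieces)
    return [" ".join(line.split()) for line in raw.split("\n") if line.strip()]


SAMPLE = """
<div id="nav"><a href="/">Home</a> <a href="/news">News</a></div>
<div id="main">
  <table><tr><td>
    <p>First paragraph of the article body.</p>
    <p>Second paragraph, with a <a href="#">link</a> inside.</p>
  </td></tr></table>
</div>
<div id="footer"><a href="/about">About us</a></div>
"""

for paragraph in extract_paragraphs(SAMPLE):
    print(paragraph)
```

On this sample the navigation and footer blocks score below zero because all of their text is link text, so the nested <table> block is selected and its two <p> elements become two paragraphs. A real page whose article text is spread over several sibling blocks would need a rule that merges blocks, which this sketch does not attempt.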

CLC number: TP393

ZHOU Jian, TANG Jin, LUO Bin. Web Information Segmentation Method Based on DOM Structure Tree[J]. Computer and Modernization, 2013, 218(10): 229-232.
