Computer and Modernization (计算机与现代化), 2013, Vol. 218, Issue (10): 229-232. doi: 10.3969/j.issn.1006-2475.2013.10.056

• Network and Communication •

Web Information Segmentation Method Based on DOM Structure Tree

ZHOU Jian (1), TANG Jin (1,2), LUO Bin (1,2)

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China; 2. Key Lab of Industrial Image Processing & Analysis of Anhui Province, Hefei 230039, China
  • Received: 2013-05-23 Online: 2013-10-26 Published: 2013-10-26

Abstract:

Correctly extracting and segmenting the main text of Web pages is important for text mining. This paper proposes and implements a method that extracts the main text from a Web page and preserves the paragraph segmentation of the original text. The method first builds a DOM structure tree from the page layout tags <table> and <div>. It then uses the nesting relations among the layout tags, as reflected in the DOM structure tree, to keep or discard content blocks and extract the correct main text. Finally, it segments the main text into paragraphs by handling certain special tags. Experiments show that the method is easy to implement and efficient, and that it extracts and segments the main text automatically and accurately.

Key words: semantic markup, layout tag, segmentation, noise
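
The three stages named in the abstract (build a tree from layout tags, select content blocks, segment on special tags) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the block-selection rule or which tags count as "special", so the sketch assumes that the content block is the layout block with the most non-link text, and that <p> and <br> mark paragraph boundaries. All names (Block, LayoutTreeBuilder, extract_paragraphs) are hypothetical.

```python
from html.parser import HTMLParser

LAYOUT_TAGS = {"table", "div"}   # layout tags named in the abstract
BREAK_TAGS = {"p", "br"}         # ASSUMPTION: "special tags" that start a new paragraph
SKIP_TAGS = {"script", "style"}  # ASSUMPTION: containers treated as noise


class Block:
    """One node of the layout tree: a <table> or <div> plus the text it owns directly."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.paragraphs = [""]   # text of this block, split at BREAK_TAGS
        self.link_chars = 0      # characters that sit inside <a> elements

    def text_len(self):
        return sum(len(p.strip()) for p in self.paragraphs)


class LayoutTreeBuilder(HTMLParser):
    """Builds a tree whose nodes are layout tags only; other tags are flattened."""

    def __init__(self):
        super().__init__()
        self.root = Block("root")
        self.current = self.root
        self.skip_depth = 0
        self.link_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in LAYOUT_TAGS:
            child = Block(tag, self.current)
            self.current.children.append(child)
            self.current = child
        elif tag in BREAK_TAGS:
            self.current.paragraphs.append("")
        elif tag == "a":
            self.link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif tag in LAYOUT_TAGS and self.current.tag == tag and self.current.parent:
            self.current = self.current.parent
        elif tag == "p":
            self.current.paragraphs.append("")
        elif tag == "a":
            self.link_depth = max(0, self.link_depth - 1)

    def handle_data(self, data):
        if self.skip_depth:
            return
        self.current.paragraphs[-1] += data
        if self.link_depth:
            self.current.link_chars += len(data.strip())


def iter_blocks(block):
    """Yield a block and all of its nested layout blocks, depth first."""
    yield block
    for child in block.children:
        yield from iter_blocks(child)


def score(block):
    # ASSUMPTION: link-heavy blocks (menus, footers) are noise, so score by non-link text.
    return block.text_len() - block.link_chars


def extract_paragraphs(html):
    """Return the paragraphs of the highest-scoring layout block."""
    builder = LayoutTreeBuilder()
    builder.feed(html)
    builder.close()
    best = max(iter_blocks(builder.root), key=score)
    return [" ".join(p.split()) for p in best.paragraphs if p.strip()]
```

Calling extract_paragraphs on a page string returns a list of paragraph strings taken from the selected block. Because text is assigned to the innermost enclosing <table> or <div>, navigation bars and footers nested in their own layout blocks compete separately with the article body and tend to lose on the non-link-text score. A real page would likely need a more careful selection rule than this single heuristic.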
