Computer and Modernization


Web Page Title Real-time Extraction Method Based on Hyperlink and DOM Tree

  

  1. School of Computer Science and Technology, Anhui University, Hefei 230601, China;
    2. Key Lab of Industrial Image Processing & Analysis of Anhui Province, Hefei 230039, China
  • Received: 2015-03-16 Online: 2015-08-08 Published: 2015-08-19

Abstract: Correct extraction of Web page titles is important for Web text information mining. This paper proposes a method for extracting Web page titles in real time. The method first applies a real-time analysis model to the catalog page, then uses a hyperlink-based traversal approach, together with the correspondence between each title and its release time, to obtain the URL of each text page and the corresponding anchor text. If the anchor text is not the title of the text page, the method fetches the page's HTML source code and builds a DOM tree for the corresponding theme-based Web page. The DOM tree is then traversed in depth-first order, and the title is located from the visual characteristics of Web page titles. Experimental results show that the method achieves high accuracy and is simple to implement.
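
The two-stage idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `TAG_SCORE`, `looks_truncated`, `extract_links`, `extract_title`, and `page_title` are hypothetical. The rule that an anchor ending in an ellipsis is not the full title, and the use of heading-tag weights as a stand-in for "visual characteristics", are assumptions made only for illustration. Release-time matching is omitted.

```python
from html.parser import HTMLParser

# Assumed visual weights: more prominent tags score higher (hypothetical values).
TAG_SCORE = {"h1": 3, "h2": 2, "h3": 1, "strong": 1, "b": 1}
SKIP_TAGS = {"script", "style"}
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input", "area", "base",
             "col", "embed", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree from HTML source."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never contain children
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest matching open element; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def build_dom(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def node_text(node):
    """Whitespace-normalized text of a node and its descendants."""
    if node.tag in SKIP_TAGS:
        return ""
    parts = [node.text] + [node_text(child) for child in node.children]
    return " ".join(" ".join(parts).split())


def extract_links(root):
    """Depth-first walk of a catalog page, collecting (URL, anchor text) pairs."""
    links, stack = [], [root]
    while stack:
        node = stack.pop()
        if node.tag == "a" and "href" in node.attrs:
            links.append((node.attrs["href"], node_text(node)))
        stack.extend(reversed(node.children))
    return links


def extract_title(root):
    """Depth-first walk of a text page; the first highest-scoring node wins."""
    best, best_score, stack = None, 0, [root]
    while stack:
        node = stack.pop()
        if node.tag in SKIP_TAGS:
            continue
        score = TAG_SCORE.get(node.tag, 0)
        text = node_text(node)
        if text and score > best_score:
            best, best_score = text, score
        stack.extend(reversed(node.children))
    return best


def looks_truncated(anchor_text):
    # Assumption: catalog pages shorten long titles with a trailing ellipsis.
    return anchor_text.rstrip().endswith(("...", "\u2026"))


def page_title(anchor_text, page_html):
    """Use the anchor text when it looks complete, else fall back to the DOM."""
    if anchor_text and not looks_truncated(anchor_text):
        return anchor_text
    return extract_title(build_dom(page_html)) or anchor_text
```

In this sketch, `extract_links(build_dom(catalog_html))` would supply the candidate anchor texts, and `page_title` would parse the text page only when an anchor looks incomplete, so the DOM tree is built just for pages that need it.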

Key words: hyperlink, Web page title, publication time, real-time, information extraction

CLC Number: