Computer and Modernization


Information Extraction of Web Pages Based on Support Vector Machine

  

  1. LIANG Dong, YANG Yong-quan, WEI Zhi-qiang
  • Received: 2018-02-03 Online: 2018-09-29 Published: 2018-09-30

Abstract:  To extract text information from Web pages, this paper presents a method based on support vector machines (SVM). The method adopts an “easy in, strict out” policy: candidates are admitted permissively in the first step and filtered strictly in the second. The first step traverses the Web page's DOM tree according to rules derived from the page structure and locates an HTML tag that contains both useful information and noise. The second step selects five important features of that HTML tag and trains an SVM on the sample data. The resulting model effectively removes navigation, promotion, copyright and other noise information while preserving the useful information of the Web page. Experiments on several commonly used websites show that the method extracts text and reduces noise effectively, and that it preserves short texts, such as hyperlinks related to the main text, that traditional methods often delete by mistake.

Key words: support vector machine, information extraction, HTML tag, noise reduction, machine learning
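
The two-step method described in the abstract can be illustrated with a minimal sketch. The abstract does not name the five features or the SVM configuration, so everything below is an assumption made for illustration: each top-level `<div>` stands in for a candidate HTML tag, the five features (text length, link density, punctuation density, link count, tag count) are hypothetical choices, and the classifier is a plain linear SVM trained by subgradient descent on the hinge loss rather than whatever implementation the authors used. The names `BlockParser`, `block_features`, `train_linear_svm` and `predict` are likewise hypothetical.

```python
from html.parser import HTMLParser

PUNCTUATION = set(".,;:!?")


class BlockParser(HTMLParser):
    """Collect raw counts for every top-level <div> (assumed candidate block)."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # one dict of counts per top-level <div>
        self.depth = 0        # current <div> nesting depth
        self.in_link = False  # True while inside an <a> element

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.depth += 1
            if self.depth == 1:
                # A new candidate block starts; the <div> itself is not counted.
                self.blocks.append(
                    {"text": 0, "link_text": 0, "punct": 0, "links": 0, "tags": 0}
                )
                return
        if self.depth:
            block = self.blocks[-1]
            block["tags"] += 1
            if tag == "a":
                block["links"] += 1
                self.in_link = True

    def handle_endtag(self, tag):
        if tag == "a":
            self.in_link = False
        elif tag == "div" and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            block = self.blocks[-1]
            block["text"] += len(text)
            block["punct"] += sum(ch in PUNCTUATION for ch in text)
            if self.in_link:
                block["link_text"] += len(text)


def block_features(html):
    """Return one five-element feature vector (hypothetical features) per block."""
    parser = BlockParser()
    parser.feed(html)
    parser.close()
    vectors = []
    for b in parser.blocks:
        n = b["text"] or 1  # avoid division by zero for empty blocks
        vectors.append([
            min(b["text"], 200) / 200,   # text length, capped and scaled to [0, 1]
            b["link_text"] / n,          # share of text that sits inside links
            b["punct"] / n,              # punctuation density
            min(b["links"], 10) / 10,    # number of links, capped and scaled
            min(b["tags"], 20) / 20,     # number of nested tags, capped and scaled
        ])
    return vectors


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM via subgradient descent on the L2-regularised hinge loss.

    Labels must be +1 (noise) or -1 (useful text).
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            w = [wj - lr * lam * wj for wj in w]  # regularisation step
            if margin < 1:                        # hinge loss is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def predict(w, b, x):
    """Return +1 (noise) or -1 (useful text) for a feature vector."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if score >= 0 else -1
```

In use, `block_features` would be run over labelled sample pages, the resulting vectors and their +1/-1 labels passed to `train_linear_svm`, and `predict` applied to the blocks of unseen pages so that blocks classified as noise can be dropped. A real system would need far more training data than a sketch suggests and would likely use an established SVM library.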
