Computer and Modernization


Method of Web News Text Extraction Based on Naïve Bayes

  

  1. School of Information Technology & Engineering, Jinzhong University, Jinzhong 030619, China
  • Received:2015-04-08 Online:2016-01-22 Published:2016-01-26

Abstract:  To address the problem of automatically extracting information from Web news, a method for extracting news text from webpage tag blocks based on Naïve Bayes is proposed. The tag information, text similarity, and text length of each tag block are used as the machine-learning attributes. To improve the representativeness of the tag attributes and reduce interference between related tags, the algorithm reduces the number of attributes by using the χ² test to examine the correlation between tag attributes and categories, and among the tag attributes themselves. To improve extraction accuracy, the probabilities of a block being news text and being non-news text are both considered. Experimental results show that, with appropriate parameter values, the accuracy of news text extraction reaches 85%.

Key words: Naïve Bayes, webpage news, webpage tag block, text extraction, correlation test
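
The method summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on several assumptions: each tag block is reduced to the set of HTML tag names it contains, only the tag-versus-category χ² test is shown (the tag-versus-tag test, text similarity, and text length are omitted), the threshold 3.84 is the standard critical value for one degree of freedom at the 0.05 level rather than a parameter from the paper, and all function names and the toy data are hypothetical.

```python
import math

def chi_square(blocks, labels, tag):
    """Chi-square statistic of the 2x2 table: tag present/absent vs. news/non-news."""
    n = len(blocks)
    a = sum(1 for blk, y in zip(blocks, labels) if tag in blk and y)          # tag, news
    b = sum(1 for blk, y in zip(blocks, labels) if tag in blk and not y)      # tag, non-news
    c = sum(1 for blk, y in zip(blocks, labels) if tag not in blk and y)      # no tag, news
    d = n - a - b - c                                                         # no tag, non-news
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def select_tags(blocks, labels, threshold=3.84):
    """Keep only tags whose presence is correlated with the category (assumed threshold)."""
    tags = set().union(*blocks)
    return {t for t in tags if chi_square(blocks, labels, t) >= threshold}

def train(blocks, labels, tags):
    """Estimate a prior and per-tag presence probabilities for each class."""
    model = {}
    for cls in (True, False):
        rows = [blk for blk, y in zip(blocks, labels) if y == cls]
        prior = len(rows) / len(blocks)
        # Laplace smoothing keeps every probability strictly between 0 and 1.
        cond = {t: (sum(t in blk for blk in rows) + 1) / (len(rows) + 2) for t in tags}
        model[cls] = (prior, cond)
    return model

def log_score(model, cls, block):
    """Log of prior times the product of per-tag likelihoods for one class."""
    prior, cond = model[cls]
    score = math.log(prior)
    for t, p in cond.items():
        score += math.log(p if t in block else 1.0 - p)
    return score

def is_news(model, block):
    """Compare the news score against the non-news score, using both probabilities."""
    return log_score(model, True, block) > log_score(model, False, block)

# Toy training data (hypothetical): tag sets of blocks, True = news text.
blocks = [{"p", "div"}, {"p", "div"}, {"p"}, {"p", "div"},
          {"a", "div"}, {"a", "li"}, {"a", "div"}, {"a", "li", "div"}]
labels = [True, True, True, True, False, False, False, False]

selected = select_tags(blocks, labels)
model = train(blocks, labels, selected)
```

In this toy data the `div` tag appears equally often in both classes, so the χ² filter drops it before training; a block is then classified by comparing its news and non-news scores rather than by thresholding the news score alone.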
