Computer and Modernization

A Web News Extraction Method Based on Filtering Noise Wrapper

  

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Received: 2016-06-13 Online: 2017-01-12 Published: 2017-01-11

Abstract: Extracting high-purity news content from large volumes of Web pages and storing it in a structured form is the foundation for research on public opinion monitoring and topic updating. This paper proposes a Web news extraction method based on a noise-filtering wrapper. During wrapper induction, when two strings fail to match, we compute their string tag path ratio and compare it against a threshold to distinguish news content from noise. We also propose two naive Bayes classifiers that extract the title and the time of the news. Experimental results show that, compared with other extraction techniques, the proposed method significantly improves accuracy and robustness, which gives it considerable practical value.
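
The abstract does not define the string tag path ratio, so the following is only an illustrative sketch of threshold-based noise filtering under an assumed definition: for each tag path (the chain of open tags above a text node), the ratio is the total number of text characters found under that path divided by the number of text nodes sharing it. The names `TagPathCollector`, `tag_path_ratios`, and `split_content_and_noise`, as well as the threshold value, are hypothetical and are not taken from the paper. The sketch also does not reproduce the wrapper induction step or the ACME algorithm.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}
# Tags whose text is never visible page content.
SKIPPED_TAGS = {"script", "style"}


class TagPathCollector(HTMLParser):
    """Records every non-empty text node together with its tag path."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open tags, outermost first
        self.text_nodes = []   # list of (tag_path, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag (tolerates sloppy HTML).
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.stack and self.stack[-1] in SKIPPED_TAGS:
            return
        self.text_nodes.append(("/".join(self.stack), text))


def tag_path_ratios(text_nodes):
    """Assumed definition: characters under a path / text nodes on that path."""
    chars = defaultdict(int)
    count = defaultdict(int)
    for path, text in text_nodes:
        chars[path] += len(text)
        count[path] += 1
    return {path: chars[path] / count[path] for path in chars}


def split_content_and_noise(html, threshold):
    """Splits text nodes into (content, noise) by comparing ratios to a threshold."""
    collector = TagPathCollector()
    collector.feed(html)
    collector.close()
    ratios = tag_path_ratios(collector.text_nodes)
    content = [t for p, t in collector.text_nodes if ratios[p] >= threshold]
    noise = [t for p, t in collector.text_nodes if ratios[p] < threshold]
    return content, noise


sample = (
    "<html><body>"
    "<ul><li><a>Home</a></li><li><a>About</a></li></ul>"
    "<div><p>The city council approved the new transit plan on Monday.</p>"
    "<p>Construction is expected to begin early next year.</p></div>"
    "</body></html>"
)
content, noise = split_content_and_noise(sample, threshold=20)
```

In this toy page the navigation links share one short-text path while the article paragraphs share a long-text path, so a single threshold separates them. A real page would need a threshold tuned on data, which the abstract mentions but does not specify.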

Key words: information extraction, wrapper, Web news pages, ACME algorithm, string tag path ratio, naive Bayes classifier

CLC Number: