Computer and Modernization

• Algorithm Design and Analysis

A Method for Extracting Web News Content Based on Naive Bayes

  

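  1. School of Information Technology and Engineering, Jinzhong University, Jinzhong 030619, Shanxi, China
  • Received: 2015-04-08 Online: 2016-01-22 Published: 2016-01-26
  • Funding:
    Shanxi Provincial Higher Education Teaching Reform Project (J2014108); Shanxi Provincial Education Science "Eleventh Five-Year Plan" Project (GH-08072)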

Method of Web News Text Extraction Based on Naive Bayes

  1. School of Information Technology & Engineering, Jinzhong University, Jinzhong 030619, China
  • Received: 2015-04-08 Online: 2016-01-22 Published: 2016-01-26

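Abstract: To address the problem of automatically extracting information from web pages, this paper proposes a method that partitions a web page into blocks by tag and identifies the news text among those blocks using Naive Bayes theory. The method uses each block's tag information, text similarity, and text length as the feature attributes for machine learning. To make the tag attributes more representative and to reduce interference among related tags, the algorithm applies the χ² test to examine the correlation among tag attributes, and between tag attributes and classes, and reduces the attribute set accordingly. During extraction, the posterior probabilities of both news-text and non-news-text blocks are taken into account in order to improve accuracy. Experimental results show that, with appropriately chosen parameter values, the accuracy of news text extraction reaches 85%.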

Keywords: Naive Bayes, news web pages, web page segmentation, text extraction, correlation test
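
The attribute-reduction step mentioned in the abstract relies on the χ² test. The following is a minimal sketch of such a test, not the authors' implementation. It assumes that each tag attribute is binary (the tag is present in a block or not) and that blocks carry a binary class label. The function names, the field names `p` and `is_text`, the toy samples, and the 0.05 significance level are all assumptions made for illustration.

```python
# Critical value of the chi-square distribution with 1 degree of freedom
# at significance level 0.05 (assumed level; the abstract does not state one).
CRITICAL_95 = 3.841


def chi_square_2x2(a, b, c, d):
    """Chi-square statistic for the 2x2 contingency table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        # A row or column is empty, so the test carries no information.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def contingency(samples, attr, other):
    """Count co-occurrences of two boolean fields across the samples."""
    a = sum(1 for s in samples if s[attr] and s[other])
    b = sum(1 for s in samples if s[attr] and not s[other])
    c = sum(1 for s in samples if not s[attr] and s[other])
    d = sum(1 for s in samples if not s[attr] and not s[other])
    return a, b, c, d


def is_correlated(samples, attr, other, critical=CRITICAL_95):
    """True if independence of the two fields is rejected at the given level."""
    return chi_square_2x2(*contingency(samples, attr, other)) > critical


# Hypothetical toy blocks: does the presence of a <p> tag relate to the class?
samples = (
    [{"p": True, "is_text": True}] * 8
    + [{"p": False, "is_text": False}] * 8
)
keep_p = is_correlated(samples, "p", "is_text")
```

Under this sketch, an attribute that is correlated with the class would be kept, while one of two strongly correlated tag attributes could be dropped. How the paper combines these two checks is not specified in the abstract.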

Abstract: To address the problem of automatic information extraction from Web news, a method is proposed that extracts Web news text from webpage tag blocks based on Naive Bayes. Tag information, text similarity, and text length of the tag blocks are taken as the attributes for machine learning. To improve the representativeness of tag attributes and reduce interference between related tags, the algorithm reduces the number of attributes by using the χ² test to examine the correlation among tag attributes and between tag attributes and categories. To improve extraction accuracy, the posterior probabilities of both news text and non-news text are considered. The experimental results show that the accuracy of news text extraction reaches 85% with appropriate parameter values.

Key words: Naive Bayes, news webpage, webpage tag block, text extraction, correlation test
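
The classification step summarized in the abstract can be illustrated with a small Naive Bayes sketch that computes the posterior for both classes and compares them. This is a sketch under stated assumptions, not the paper's algorithm: the features are binarized (`p_tag`, `long`, `title_like` are hypothetical stand-ins for the tag, text-length, and text-similarity attributes), the class labels `text` and `other` are invented names, and the `margin` parameter is only one plausible way to use both posteriors, since the abstract does not say how they are combined.

```python
import math


class BernoulliNaiveBayes:
    """Naive Bayes over boolean features with Laplace smoothing."""

    def fit(self, rows, labels):
        self.features = sorted(rows[0])
        self.classes = sorted(set(labels))
        self.log_prior = {}
        self.p_true = {}
        for c in self.classes:
            members = [r for r, y in zip(rows, labels) if y == c]
            self.log_prior[c] = math.log(len(members) / len(labels))
            # Smoothed estimate of P(feature is True | class).
            self.p_true[c] = {
                f: (sum(1 for r in members if r[f]) + 1) / (len(members) + 2)
                for f in self.features
            }
        return self

    def log_joint(self, row, c):
        """log P(class) + sum of log P(feature value | class)."""
        total = self.log_prior[c]
        for f in self.features:
            p = self.p_true[c][f]
            total += math.log(p if row[f] else 1.0 - p)
        return total

    def posterior(self, row):
        """Normalized posterior probability of every class for one block."""
        joint = {c: self.log_joint(row, c) for c in self.classes}
        top = max(joint.values())
        z = sum(math.exp(v - top) for v in joint.values())
        return {c: math.exp(v - top) / z for c, v in joint.items()}


def is_news_text(model, row, margin=0.0):
    """Accept a block only if P(text) exceeds P(other) by more than margin."""
    post = model.posterior(row)
    return post["text"] - post["other"] > margin


# Hypothetical training blocks, for illustration only.
rows = [
    {"p_tag": True, "long": True, "title_like": True},
    {"p_tag": True, "long": True, "title_like": False},
    {"p_tag": True, "long": False, "title_like": True},
    {"p_tag": False, "long": False, "title_like": False},
    {"p_tag": False, "long": False, "title_like": False},
    {"p_tag": False, "long": True, "title_like": False},
]
labels = ["text", "text", "text", "other", "other", "other"]
model = BernoulliNaiveBayes().fit(rows, labels)
```

Raising `margin` makes the sketch stricter about what it accepts as news text, which trades recall for precision. Whether the paper's tunable parameters play this role is an assumption.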

CLC number: