Computer and Modernization

• Algorithm Design and Analysis

Research and Design on Topical Crawler Based on Analysis of Content and Link

  1. School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, Anhui, China
  • Received: 2013-12-10 Online: 2014-04-17 Published: 2014-04-23
  • About the authors: Shu Ben (舒奔, born 1988), male, from Wuhan, Hubei; master's student at the School of Computer Science and Engineering, Anhui University of Science and Technology; research interests: software engineering, information management technology. Yin Ke (尹珂, born 1990), male, from Suzhou, Anhui; master's student; research interests: embedded systems development, information security engineering.

Abstract: Existing topical crawler algorithms are not very accurate at fetching topic-relevant webpages. This paper proposes a topical webpage crawling method based on evaluating both text content and webpage links. The method first computes the relevance of the current webpage to the topic, then compares that relevance value with a given threshold to decide whether the page is discarded or stored. The relevance value also determines the priority of URLs in the queue of links waiting to be crawled. The model thus takes into account the balance between accuracy and coverage of topical webpages. The newly designed topical crawler algorithm improves, to some extent, the accuracy of fetching topical webpages.

Key words: topical crawler, topic relevance, topical webpage
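
The procedure described in the abstract (score a page, compare against a threshold, and use the score to order the queue of pending URLs) can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the relevance measure (a cosine similarity over term counts), the threshold value, the rule that outgoing links inherit the score of the page they were found on, and the names `relevance`, `crawl`, and `pages`. The paper's actual content and link evaluation formulas are not given in the abstract. An in-memory dictionary stands in for real page fetching so the example runs without network access.

```python
import heapq
import math
import re
from collections import Counter


def relevance(text, topic_terms):
    """Cosine similarity between term counts of the page text and the
    topic terms. This measure is an assumption, not the paper's formula."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    topic = Counter(topic_terms)
    dot = sum(words[t] * topic[t] for t in topic)
    norm = (math.sqrt(sum(c * c for c in words.values()))
            * math.sqrt(sum(c * c for c in topic.values())))
    return dot / norm if norm else 0.0


def crawl(pages, seeds, topic_terms, threshold=0.3, limit=100):
    """Best-first crawl over `pages`, a hypothetical dict mapping
    url -> (text, outlinks) that replaces real HTTP fetching."""
    # heapq is a min-heap, so priorities are negated to pop the
    # highest-priority URL first. Seeds get the maximum priority.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    stored = []
    while frontier and len(stored) < limit:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue  # unreachable page in this toy model
        text, outlinks = pages[url]
        score = relevance(text, topic_terms)
        if score >= threshold:
            stored.append(url)  # relevant enough: keep the page
        # Assumed rule: links inherit the score of the page they came
        # from, so links found on relevant pages are crawled earlier.
        for link in outlinks:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return stored
```

In this sketch, links from below-threshold pages are still queued, only at low priority. That is one simple way to trade accuracy against coverage; a stricter crawler could drop such links entirely.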