Computer and Modernization (计算机与现代化)

• Database and Data Mining •

An Adaptive Focused Crawling Algorithm Based on Link and Content Analysis

  

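  1. Chongqing Key Laboratory of Software Theory & Technology, College of Computer Science, Chongqing University, Chongqing 400044, China
  • Received: 2015-03-30 Online: 2015-09-21 Published: 2015-09-24
  • About the authors: ZHU Qingsheng (b. 1956), male, from Dangtu, Anhui; doctoral supervisor at the Chongqing Key Laboratory of Software Theory & Technology, College of Computer Science, Chongqing University; Ph.D.; research interests: data mining, software engineering. XU Ning (b. 1990), male, from Gao'an, Jiangxi; master's student; research interests: data mining, web crawlers.
  • Supported by:
     National Natural Science Foundation of China (61272914)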

 An Adaptive Focused Crawling Algorithm Based on Link and Content Analysis

  1. College of Computer Science, Chongqing University, Chongqing Key Laboratory of 
     Software Theory & Technology, Chongqing 400044, China
  • Received: 2015-03-30 Online: 2015-09-21 Published: 2015-09-24

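Abstract:

Focused web crawling is a key technique in topic-specific search engines. To address the incomplete set of parameters considered by the OTIE algorithm, this paper proposes an adaptive algorithm that combines link analysis with page content analysis: a link-importance score and a content-relevance score are combined into a composite score that sets the download priority of topical pages, and the tunneling problem encountered while crawling topical pages is also taken into account. Topics and seed pages are selected from the ODP, and the proposed algorithm is compared with the Best-First, Shark-Search, and OTIE algorithms. The experimental results show that the proposed algorithm achieves clearly better recall while also maintaining good precision.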

Keywords: focused crawler, OTIE algorithm, Shark-Search algorithm, tunneling

Abstract:

Focused crawling is a key technique in topic-specific search engines. To address the incomplete set of parameters considered by the On-line Topical Importance
Estimation (OTIE) algorithm, this paper proposes an adaptive algorithm that combines link analysis with content analysis to estimate the priority of unvisited URLs in the frontier.
We also consider the tunneling problem that arises during topical crawling. We select topics and seed pages from the Open Directory Project (ODP) and run comparative
experiments with four crawling algorithms: Best-First, Shark-Search, OTIE, and the proposed algorithm. The experimental results indicate that the proposed method
significantly outperforms the other three algorithms on average target recall while maintaining an acceptable harvest rate.
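
The abstract does not give the scoring formula, so the following Python sketch only illustrates the general idea of ranking unvisited URLs by a blend of a link score and a content score. The linear blend, the weight `alpha`, the bag-of-words cosine similarity, and the names `combined_priority` and `Frontier` are all assumptions made for illustration; they are not the paper's method, and tunneling is not modeled here.

```python
import heapq
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between bag-of-words term-frequency vectors."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def combined_priority(link_score: float, content_score: float,
                      alpha: float = 0.5) -> float:
    """Hypothetical linear blend; alpha is an assumed weight, not from the paper."""
    return alpha * link_score + (1.0 - alpha) * content_score


class Frontier:
    """Max-priority queue of unvisited URLs.

    heapq is a min-heap, so priorities are stored negated.
    A URL that was already pushed once is ignored on later pushes.
    """

    def __init__(self) -> None:
        self._heap = []
        self._seen = set()

    def push(self, url: str, priority: float) -> None:
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-priority, url))

    def pop(self):
        neg_priority, url = heapq.heappop(self._heap)
        return url, -neg_priority

    def __len__(self) -> int:
        return len(self._heap)
```

In a crawler loop, each extracted link would be pushed with `combined_priority(link_score, cosine_similarity(topic, context_text))`, where `link_score` and `context_text` come from whatever link-analysis and text-extraction steps are in use, and the crawler would repeatedly `pop()` the highest-scoring URL to download next.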

Key words:  focused crawler, OTIE algorithm, Shark-Search algorithm, tunneling