Computer and Modernization (计算机与现代化), 2011, Vol. 193, Issue (9): 1-4. doi: 10.3969/j.issn.1006-2475.2011.09.001

• Algorithm Design and Analysis •

Focused Crawler Based on Improved Algorithm of Web Content Similarity

WEI Jing-jing (魏晶晶)1, YANG Ding-da (杨定达)2, LIAO Xiang-wen (廖祥文)2

  1. Department of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China; 2. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China
  • Received: 2011-05-06 Online: 2011-09-22 Published: 2011-09-22

Abstract: Focused crawlers are an important component of vertical search engines. The web content similarity algorithms used by traditional focused crawlers consider only term frequency and ignore the positional information of keywords. Building on an analysis of focused crawlers based on web content similarity, this paper proposes an improved similarity calculation that exploits the characteristics of HTML tags. Experimental results show that the improved algorithm achieves an average crawling accuracy of 64.99%, which is 15.37% higher than the original method.
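The abstract gives neither the similarity formula nor the tag weights, so the following is only a minimal sketch of the general idea: a vector space model in which a term's contribution is scaled by the HTML tag it appears in, compared against plain term frequency. The weights in `TAG_WEIGHTS`, the class `WeightedTermCounter`, and the helper functions are hypothetical names and values chosen for illustration, not the authors' method.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights for illustration only; the paper's values are not given here.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "a": 1.5}
DEFAULT_WEIGHT = 1.0


class WeightedTermCounter(HTMLParser):
    """Accumulate term weights, scaling each term by the tags that enclose it."""

    def __init__(self, use_tag_weights=True):
        super().__init__()
        self.use_tag_weights = use_tag_weights
        self.open_tags = []        # stack of currently open tags
        self.weights = Counter()   # term -> accumulated weight

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the most recent matching open tag, if there is one.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        weight = DEFAULT_WEIGHT
        if self.use_tag_weights:
            # Use the largest weight among the enclosing tags.
            weight = max(
                [TAG_WEIGHTS.get(t, DEFAULT_WEIGHT) for t in self.open_tags],
                default=DEFAULT_WEIGHT,
            )
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weights[term] += weight


def page_vector(html, use_tag_weights=True):
    """Build a term vector for a page, with or without tag weighting."""
    parser = WeightedTermCounter(use_tag_weights)
    parser.feed(html)
    parser.close()
    return parser.weights


def cosine_similarity(vec_a, vec_b):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy topic vector and page, both made up for this example.
topic = Counter({"crawler": 1.0, "search": 1.0})
page = (
    "<html><head><title>Focused crawler</title></head>"
    "<body><p>Cooking recipes and search tips for cooking</p></body></html>"
)

plain = cosine_similarity(topic, page_vector(page, use_tag_weights=False))
weighted = cosine_similarity(topic, page_vector(page, use_tag_weights=True))
```

In this toy page the topic term "crawler" occurs only in the title, so the tag-weighted score comes out higher than the plain term-frequency score. A real crawler would also need to skip script and style content and to choose weights empirically.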

Key words: search engine, focused crawler, similarity, vector space model, HTML tags
