Focused Crawler Based on Improved Algorithm of Web Content Similarity

doi:10.3969/j.issn.1006-2475.2011.09.001

Computer and Modernization, 2011, Vol. 193, Issue (9): 1-4.

• Algorithm Design and Analysis •

WEI Jing-jing¹, YANG Ding-da², LIAO Xiang-wen²

1. Department of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China; 2. College of Mathematics and Computer Science, Fuzhou University, Fuzhou 350108, China

Received: 2011-05-06  Online: 2011-09-22  Published: 2011-09-22

Abstract

Abstract: A focused crawler is an important component of a vertical search engine. The web content similarity algorithms used by traditional focused crawlers consider only term frequency and ignore the position of keywords within the page. Building on an analysis of focused crawlers based on web content similarity, this paper proposes an improved similarity calculation that exploits the characteristics of HTML tags. Experimental results show that the improved algorithm achieves an average crawling accuracy of 64.99%, which is 15.37% higher than the original method.

Key words: search engine, focused crawler, similarity, vector space model, HTML tags
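The abstract gives neither the tag weights nor the exact formula, so the following Python sketch only illustrates the general idea under stated assumptions: terms are counted with a weight that depends on the enclosing HTML tag, and the resulting vector is compared with a topic vector by cosine similarity in the vector space model. The names `TAG_WEIGHTS`, `WeightedTermCounter`, `page_vector`, and `cosine_similarity`, the weight values, and the sample page are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical tag weights, chosen for illustration only (not the paper's values).
TAG_WEIGHTS = {"title": 5.0, "h1": 3.0, "h2": 2.0, "b": 1.5, "strong": 1.5, "a": 1.2}
DEFAULT_WEIGHT = 1.0
SKIPPED_TAGS = {"script", "style"}                        # text here is never counted
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}  # tags without a closing tag


class WeightedTermCounter(HTMLParser):
    """Counts terms, scaling each occurrence by the weight of its enclosing tags."""

    def __init__(self, tag_weights):
        super().__init__()
        self.tag_weights = tag_weights
        self.stack = []            # currently open tags
        self.weights = Counter()   # term -> weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed inner tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.stack):
            return
        # Use the largest weight among all enclosing tags.
        weight = max(
            (self.tag_weights.get(t, DEFAULT_WEIGHT) for t in self.stack),
            default=DEFAULT_WEIGHT,
        )
        for term in re.findall(r"\w+", data.lower()):
            self.weights[term] += weight


def page_vector(html, tag_weights):
    """Builds a term vector; an empty tag_weights dict gives plain term frequency."""
    parser = WeightedTermCounter(tag_weights)
    parser.feed(html)
    parser.close()
    return dict(parser.weights)


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse term vectors (dicts)."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical topic vector and page, used only to exercise the functions.
TOPIC = {"crawler": 1.0, "search": 1.0}
PAGE = (
    "<html><head><title>Focused crawler</title></head>"
    "<body><p>A page about cooking and search and recipes.</p></body></html>"
)

if __name__ == "__main__":
    plain = cosine_similarity(page_vector(PAGE, {}), TOPIC)
    weighted = cosine_similarity(page_vector(PAGE, TAG_WEIGHTS), TOPIC)
    print(f"term frequency only: {plain:.3f}")
    print(f"tag-weighted:        {weighted:.3f}")
```

In this toy page the topic term "crawler" appears in the title, so the tag-weighted score comes out higher than the plain term-frequency score. A crawler following this idea could use such a score to decide whether a fetched page is on topic; how the paper actually sets the weights and thresholds is not stated in the abstract.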

CLC Number: TP301.6

Cite this article: WEI Jing-jing, YANG Ding-da, LIAO Xiang-wen. Focused Crawler Based on Improved Algorithm of Web Content Similarity[J]. Computer and Modernization, 2011, 193(9): 1-4.
