Research and Design on Topical Crawler Based on Analysis of Content and Link

Computer and Modernization, 2014, Vol. 0, Issue (4): 77-80.

School of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, Anhui, China

Received: 2013-12-10  Online: 2014-04-17  Published: 2014-04-23
About the authors: SHU Ben (b. 1988), male, from Wuhan, Hubei, is a master's student at the School of Computer Science and Engineering, Anhui University of Science and Technology; his research interests are software engineering and information management technology. YIN Ke (b. 1990), male, from Suzhou, Anhui, is a master's student; his research interests are embedded systems development and information security engineering.

Abstract

Existing topical crawler algorithms are not very accurate at fetching on-topic web pages. This paper proposes a method for fetching topical web pages that evaluates both the text content of a page and its links. The crawler first computes the correlation (relevance) between the current page and the topic, then compares that value with a given threshold to decide whether the page is stored or discarded. The same correlation value also determines the priority of URLs in the queue of links waiting to be crawled, so the model balances the precision of topical page retrieval against its coverage. The newly designed topical crawler algorithm improves the accuracy of fetching topical web pages to some extent.
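The crawl loop described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the relevance formula, so cosine similarity over term-frequency vectors is used here as a stand-in, and the names `relevance`, `crawl`, `pages`, `threshold`, and `max_fetches` are all hypothetical. The web is simulated with an in-memory dictionary rather than real HTTP requests, and the tokenizer only handles lowercase ASCII words. Links found on a discarded page are still queued, at that page's lower priority; that is one assumed way to trade precision against coverage.

```python
import heapq
import math
import re
from collections import Counter


def relevance(text, topic_terms):
    """Cosine similarity between term-frequency vectors (an assumed measure)."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    topic = Counter(topic_terms)
    dot = sum(words[t] * topic[t] for t in topic)
    norm = math.sqrt(sum(v * v for v in words.values())) * math.sqrt(
        sum(v * v for v in topic.values())
    )
    return dot / norm if norm else 0.0


def crawl(pages, seed, topic_terms, threshold=0.3, max_fetches=100):
    """Best-first crawl over `pages`, a hypothetical map url -> (text, links)."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    stored = []
    fetches = 0
    while frontier and fetches < max_fetches:
        _, url = heapq.heappop(frontier)
        if url not in pages:
            continue  # simulated fetch failure
        fetches += 1
        text, links = pages[url]
        score = relevance(text, topic_terms)
        # Threshold test: store the page or discard it.
        if score >= threshold:
            stored.append((url, score))
        # Outgoing links inherit the page's score as their queue priority,
        # even when the page itself was discarded (assumption, for coverage).
        for link in links:
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return stored


# Toy web graph, invented purely for illustration.
pages = {
    "a": ("web crawler crawler search topic", ["b", "c"]),
    "b": ("cooking recipes pasta", ["d"]),
    "c": ("focused crawler fetches web pages", ["d"]),
    "d": ("crawler", []),
}
stored = crawl(pages, "a", ["crawler", "web"])
for url, score in stored:
    print(url, round(score, 3))
```

Raising `threshold` stores fewer but more relevant pages, while lowering it widens coverage; the priority queue only changes the order in which links are visited, not which pages pass the threshold.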

Key words: topical crawler, topical correlation, topical webpage

CLC number: TP393

Cite this article: SHU Ben, YIN Ke. Research and Design on Topical Crawler Based on Analysis of Content and Link[J]. Computer and Modernization, 2014, 0(4): 77-80.

