Method of Web Information Extraction Based on Ontology Theory

Computer and Modernization, 2015, Vol. 0, Issue 9: 90-94.

Shanghai Branch, National Computer Network Emergency Response Technical Team/Coordination Center of China, Shanghai 201315, China

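Received: 2015-03-02; Online: 2015-09-21; Published: 2015-09-24
About the authors: LIU Li-juan (b. 1989), female, from Nanchang, Jiangxi; engineer at the Shanghai Branch of the National Computer Network Emergency Response Technical Team/Coordination Center of China; master's degree; research interests: information processing, Web technology. ZHANG Yin (b. 1990), male, from Ji'an, Jiangxi; assistant engineer; bachelor's degree; research interests: information processing, semantic analysis. YANG Yi (b. 1986), male, from Baoding, Hebei; master's degree; research interests: information security, network engineering.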

Abstract

To extract Web page information on a specific topic, this paper proposes a method that uses ontology concepts to measure topic correlation and thereby improve the quality of Web information extraction. The method adopts the Vector Space Model (VSM), incorporates feature-word weights, and applies ontology concepts to analyze and compute topic correlation. It both simplifies the dimensional computation of the VSM and extends the semantic range. A practical application system with a layered architecture is used to demonstrate how the method is implemented. Results from this application show that the method extracts topic-specific Web information fairly accurately while lowering the system's computational complexity, improves the recall and precision of Web information extraction, and reduces the amount of Web information that is missed.

Key words: ontology, Web information extraction, correlation, crawler, Web
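
The abstract does not give the formulas behind the topic correlation measure, so the following is only a minimal sketch of one way such a measure could be computed, assuming cosine similarity in the VSM. The mapping `CONCEPT_OF`, the weights `TOPIC_WEIGHTS`, and both function names are hypothetical and are not taken from the paper. Folding synonyms into a single concept illustrates how an ontology can reduce the number of vector dimensions while widening the set of terms that count as on-topic.

```python
import math
from collections import Counter

# Hypothetical mini-ontology: surface terms mapped to concept labels.
# Synonyms share one concept, so they share one vector dimension.
CONCEPT_OF = {
    "crawler": "crawler",
    "spider": "crawler",
    "ontology": "ontology",
    "taxonomy": "ontology",
    "extraction": "extraction",
    "scraping": "extraction",
}

# Hypothetical topic description: concept -> feature weight.
TOPIC_WEIGHTS = {"crawler": 1.0, "ontology": 2.0, "extraction": 1.5}


def concept_vector(tokens):
    """Count concept occurrences, folding synonyms into one dimension."""
    counts = Counter()
    for token in tokens:
        concept = CONCEPT_OF.get(token.lower())
        if concept is not None:
            counts[concept] += 1
    return counts


def topic_correlation(tokens, topic_weights=TOPIC_WEIGHTS):
    """Cosine similarity between the weighted page vector and the topic vector."""
    counts = concept_vector(tokens)
    # Page vector: concept frequency scaled by the concept's feature weight.
    page = {c: counts[c] * w for c, w in topic_weights.items()}
    dot = sum(page[c] * topic_weights[c] for c in topic_weights)
    page_norm = math.sqrt(sum(v * v for v in page.values()))
    topic_norm = math.sqrt(sum(w * w for w in topic_weights.values()))
    if page_norm == 0 or topic_norm == 0:
        return 0.0  # no topic concept appears on the page
    return dot / (page_norm * topic_norm)
```

In a crawler, a page whose score exceeds some threshold would be kept for extraction and the rest discarded; the threshold value is likewise an assumption that would need tuning against recall and precision.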

LIU Li-juan, ZHANG Yin, YANG Yi. Method of Web Information Extraction Based on Ontology Theory[J]. Computer and Modernization, 2015, 0(9): 90-94.
