Computer and Modernization (计算机与现代化)

• Network and Communication

基于本体思想的网页信息抽取方法

  

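  1. Shanghai Branch, National Computer Network Emergency Response Technical Team/Coordination Center of China, Shanghai 201315
  • Received: 2015-03-02 Online: 2015-09-21 Published: 2015-09-24
  • About the authors: Liu Lijuan (刘丽娟, b. 1989), female, from Nanchang, Jiangxi; engineer at the Shanghai Branch of the National Computer Network Emergency Response Technical Team/Coordination Center of China; master's degree; research interests: information processing, Web technology. Zhang Yin (张胤, b. 1990), male, from Ji'an, Jiangxi; assistant engineer; bachelor's degree; research interests: information processing, semantic analysis. Yang Yi (杨一, b. 1986), male, from Baoding, Hebei; master's degree; research interests: information security, network engineering.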

 Method of Web Information Extraction Based on Ontology Theory

  1. Shanghai Branch, Computer Network Emergency Response Technical Team/Coordination Center of China, Shanghai 201315, China
  • Received: 2015-03-02 Online: 2015-09-21 Published: 2015-09-24

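摘要 (Abstract):

To extract Web page information on a specific topic, this paper proposes a method that uses ontology concepts to measure topic correlation and so improve the quality of Web page information extraction. The method adopts the vector space model, combines it with feature-word weights, and uses ontology concepts to analyze and compute topic correlation. It not only simplifies the dimension computation of the vector space model but also extends the semantic range. Finally, a real application system with a layered architecture is used to demonstrate how the method is implemented. Results from this application show that the method extracts Web page information on a specific topic fairly accurately while lowering the system's computational complexity, raising the recall and precision of extraction, and reducing omitted Web page information, thereby improving extraction quality.

关键词 (Keywords): ontology, Web page information extraction, correlation, crawler, Web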

Abstract:

To extract Web information on a specific topic, this paper proposes a method that uses ontology concepts to measure topic correlation and thereby improve the quality of Web information extraction. The method adopts the Vector Space Model (VSM), combines it with feature-word weights, and uses ontology concepts to analyze and calculate topic correlation. It not only simplifies the dimension computation in the VSM but also extends the semantic range. A practical application system with a layered architecture is used to demonstrate how the method is implemented. Results from this application show that the proposed method extracts Web information on a specific topic more accurately, reduces the computational complexity of the system, and improves the recall and precision of Web information extraction, so that less Web information is missed and extraction quality improves.

Key words:  ontology, Web information extraction, correlation, crawler, Web
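
The abstract describes the approach only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes that "ontology concepts" can be approximated by mapping surface terms to shared concept labels (which shrinks the vector dimensions and lets synonyms match), that the page vector uses raw concept counts, and that topic correlation is the cosine similarity between the page vector and a weighted topic vector. The names `CONCEPT_OF`, `TOPIC_WEIGHTS`, `page_vector`, `topic_correlation`, and `is_relevant`, the sample vocabulary, and the 0.5 threshold are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical ontology fragment: several surface terms collapse into one
# concept, so the vector has one dimension per concept instead of per word.
CONCEPT_OF = {
    "virus": "malware",
    "trojan": "malware",
    "worm": "malware",
    "malware": "malware",
    "exploit": "vulnerability",
    "vulnerability": "vulnerability",
    "patch": "patch",
}

# Hypothetical topic description: concept -> feature weight.
TOPIC_WEIGHTS = {"malware": 0.5, "vulnerability": 0.3, "patch": 0.2}


def page_vector(text):
    """Count concept occurrences in the text; unknown words are ignored."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return dict(Counter(CONCEPT_OF[t] for t in tokens if t in CONCEPT_OF))


def topic_correlation(page_vec, topic_vec):
    """Cosine similarity between a page vector and the topic vector."""
    dot = sum(w * page_vec.get(c, 0) for c, w in topic_vec.items())
    norm_page = math.sqrt(sum(v * v for v in page_vec.values()))
    norm_topic = math.sqrt(sum(w * w for w in topic_vec.values()))
    if norm_page == 0 or norm_topic == 0:
        return 0.0  # no shared concepts at all
    return dot / (norm_page * norm_topic)


def is_relevant(text, threshold=0.5):
    """Keep a page only if its topic correlation reaches the threshold."""
    return topic_correlation(page_vector(text), TOPIC_WEIGHTS) >= threshold


on_topic = "A new trojan and worm exploit an unpatched vulnerability."
off_topic = "The football match ended in a draw."
print(is_relevant(on_topic), is_relevant(off_topic))
```

In a crawler, a check like `is_relevant` would decide whether a fetched page is passed on for extraction; the threshold trades recall against precision and would need tuning on real data.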