Computer and Modernization

• Algorithm Design and Analysis •

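Semantic Retrieval Method for Water Conservancy Metadata Based on Hadoop

  

  1. College of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China
  • Received: 2015-08-11 Online: 2015-12-23 Published: 2015-12-30
  • About the authors: Feng Jun (born 1969), female, from Changzhou, Jiangsu; professor at the College of Computer and Information, Hohai University; Ph.D.; CCF member; research interests: spatio-temporal data management, intelligent data processing and data mining, water conservancy informatization. Li Zongxiang (born 1989), male, master's student; research interest: information retrieval. Tang Zhixian (born 1983), male, Ph.D. candidate; research interest: spatio-temporal data management. Jiang Kang (born 1990), male, master's student; research interest: information retrieval.
  • Funding:
    National Natural Science Foundation of China (61370091, 61170200)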

Abstract: Metadata search engines in the water conservancy domain lack semantic understanding, and indexing water conservancy metadata in a centralized environment is inefficient. To address both problems, this paper proposes a Hadoop-based semantic search method for water conservancy metadata. First, building on a semantic search approach that combines ontology with query expansion, it designs ontology reasoning rules, a semantic similarity calculation method, an expansion-term selection method, and a semantic relevance ranking method for the water conservancy domain, with the aim of improving the recall and precision of search results. Second, to speed up index building for water conservancy metadata stored as XML, it introduces the MapReduce parallel processing model of the Hadoop platform to parallelize both the parsing and extraction of metadata and the construction of the index. It also modifies the file structure of SequenceFile to cope with the small-files problem of water conservancy metadata, removing the performance bottleneck of index building in a centralized environment. Finally, it uses the parallel computing capability of Hadoop to design a semantic query expansion method for the distributed environment, improving query efficiency for water conservancy metadata.

Key words: domain ontology, similarity calculation, semantic search, Hadoop, SequenceFile
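The pipeline outlined in the abstract can be illustrated with a small single-process sketch. It is not the authors' implementation and does not use Hadoop: it only mimics the shape of a map step (parse one XML record, emit term/document pairs), a reduce step (group pairs into postings), and an ontology-based query expansion with a similarity threshold. The record names, the XML layout, the `ONTOLOGY` table, its similarity scores, and the threshold of 0.5 are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical metadata: many small XML files packed as (name, content) pairs,
# loosely analogous to key/value records in a container file.
RECORDS = [
    ("m1.xml", "<metadata><title>reservoir water level</title>"
               "<abstract>daily water level of reservoir</abstract></metadata>"),
    ("m2.xml", "<metadata><title>river discharge</title>"
               "<abstract>hourly discharge at gauging station</abstract></metadata>"),
    ("m3.xml", "<metadata><title>dam safety</title>"
               "<abstract>inspection report for dam</abstract></metadata>"),
]

# Toy ontology: related concepts with made-up similarity scores (assumption).
ONTOLOGY = {
    "reservoir": {"dam": 0.8, "lake": 0.6, "river": 0.3},
}


def map_record(name, xml_text):
    """Map step: parse one XML record and emit (term, record name) pairs."""
    root = ET.fromstring(xml_text)
    for elem in root.iter():
        if elem.text:
            for term in elem.text.lower().split():
                yield term, name


def reduce_postings(pairs):
    """Reduce step: group emitted pairs into term -> set of record names."""
    index = defaultdict(set)
    for term, name in pairs:
        index[term].add(name)
    return index


def build_index(records):
    """Run the map step over every record, then reduce into an inverted index."""
    pairs = (pair for name, text in records for pair in map_record(name, text))
    return reduce_postings(pairs)


def expand_query(term, threshold=0.5):
    """Keep the query term plus related concepts at or above the threshold."""
    weights = {term: 1.0}
    for related, similarity in ONTOLOGY.get(term, {}).items():
        if similarity >= threshold:
            weights[related] = similarity
    return weights


def search(index, term, threshold=0.5):
    """Score records by the summed weights of the expanded terms they contain."""
    scores = defaultdict(float)
    for expanded_term, weight in expand_query(term, threshold).items():
        for name in index.get(expanded_term, ()):
            scores[name] += weight
    # Highest score first; ties broken by record name for stable output.
    return sorted(scores, key=lambda n: (-scores[n], n))


if __name__ == "__main__":
    index = build_index(RECORDS)
    print(search(index, "reservoir"))
```

In this sketch a query for `reservoir` also retrieves the record that mentions only `dam`, because the toy ontology rates the two concepts as similar. Lowering the threshold admits more loosely related terms, trading precision for recall.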

CLC number: