Computer and Modernization (计算机与现代化)

• Database and Data Mining •

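Research on Index Method of Massive Hydrology Data

  1. College of Computer and Information, Hohai University, Nanjing 211100, Jiangsu, China
  • Received: 2017-02-28 Online: 2017-10-30 Published: 2017-10-31
  • About the authors: Feng Jun (冯钧, b. 1969), female, from Changzhou, Jiangsu; professor and doctoral supervisor at the College of Computer and Information, Hohai University; Ph.D.; CCF member; research interests: data management and knowledge engineering, spatial database theory and technology, and water resources informatization. Xu Weigang (徐维纲, b. 1993), male, from Jinhua, Zhejiang; master's student; research interests: spatio-temporal data indexing and distributed storage. Feng Duqing (冯读庆, b. 1989), male, from Ganyu, Jiangsu; master's student; research interests: cloud storage and cloud computing. Lu Jiamin (陆佳民, b. 1983), male; lecturer; research interests: moving-object data management and distributed data processing.
  • Funding:
    National Natural Science Foundation of China (61370091, 61602151)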

Abstract: Hydrology data are stored in diverse forms, are very large in volume, and cover many classes of hydrology entities. For each class of entity there is both basic descriptive information and a series of measurement (business) records, and these two kinds of data are stored and updated at different frequencies. Hydrology business retrieval requires not only fast lookup of an entity's basic information but also combined queries that follow the dependency between the basic descriptive information and the business information. In current cloud environments, however, there is no efficient index method that accounts for the dependencies among several types of data. In addition, the rapid growth of hydrology data poses a major challenge to retrieval performance. This paper therefore proposes HRB, a distributed two-level index structure based on Hadoop that builds a different index for each data type. Experiments show that HRB creates indexes more efficiently than conventional distributed indexes and that it retrieves data faster once the data volume reaches the ten-million scale, which indicates that HRB has practical value.

Key words: hydrology entities, two-level index structure, distributed index, Hadoop
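
The abstract does not describe the internals of HRB, so the following is only a minimal single-machine sketch of the general idea of a two-level index: an upper level over entities' basic descriptive attributes and a lower level over each entity's time-ordered measurements, joined in a combined query. The class name `TwoLevelIndex`, its methods, the attribute names, and the integer date stamps are all hypothetical, and nothing here is distributed or tied to Hadoop.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TwoLevelIndex:
    """Toy in-memory two-level index (illustrative only; not the HRB design)."""

    def __init__(self):
        # Upper level: (attribute, value) -> ids of entities with that basic attribute.
        self.by_attribute = defaultdict(set)
        # Lower level: entity id -> parallel, time-sorted lists of timestamps and readings.
        self.times = defaultdict(list)
        self.values = defaultdict(list)

    def add_entity(self, entity_id, **attributes):
        for key, value in attributes.items():
            self.by_attribute[(key, value)].add(entity_id)

    def add_measurement(self, entity_id, timestamp, value):
        # Insert at the position that keeps this entity's measurements ordered by time.
        pos = bisect_right(self.times[entity_id], timestamp)
        self.times[entity_id].insert(pos, timestamp)
        self.values[entity_id].insert(pos, value)

    def find_entities(self, **attributes):
        """Basic-information lookup: entities matching every given attribute."""
        sets = [self.by_attribute.get(item, set()) for item in attributes.items()]
        return set.intersection(*sets) if sets else set()

    def combined_query(self, start, end, **attributes):
        """Measurements in [start, end] for the entities matching the attributes."""
        result = {}
        for entity_id in sorted(self.find_entities(**attributes)):
            ts = self.times.get(entity_id, [])
            lo, hi = bisect_left(ts, start), bisect_right(ts, end)
            if lo < hi:
                result[entity_id] = list(zip(ts[lo:hi], self.values[entity_id][lo:hi]))
        return result


# Hypothetical sample data: two stations in the same basin, dates as YYYYMMDD integers.
index = TwoLevelIndex()
index.add_entity("station-1", kind="reservoir", basin="A")
index.add_entity("station-2", kind="river", basin="A")
index.add_measurement("station-1", 20170101, 12.5)
index.add_measurement("station-1", 20170103, 12.9)
index.add_measurement("station-2", 20170102, 3.1)
hits = index.combined_query(20170101, 20170102, basin="A")
```

Keeping the two levels separate means that frequently appended measurements touch only the lower level, while the rarely changing descriptive attributes stay in the upper level; this mirrors the different update frequencies the abstract mentions, although the real system may organize both levels quite differently.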