计算机与现代化

• 算法设计与分析 • 上一篇    下一篇

一种支持通配符查询的XML模式匹配算法

  

  1.   (南京航空航天大学计算机科学与技术学院,江苏南京211106)
     
  • 收稿日期:2015-11-20 出版日期:2016-04-14 发布日期:2018-09-30
  • 作者简介:陈冲(1990-),男,山东菏泽人,南京航空航天大学计算机科学与技术学院硕士研究生,研究方向:XML数据库; 蒋夏军(1973-),男,江苏南京人,讲师,博士,研究方向:时空数据库,计算机仿真。
  • 基金资助:

    江苏省自然科学基金资助项目(BK20140826)

An Efficient XML Pattern Matching Algorithm for Supporting Wildcard Query

  1. (College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China)

  • Received:2015-11-20 Online:2016-04-14 Published:2018-09-30

摘要:

XML查询语言当中,包含通配符*的查询能够方便有效地满足一些特殊查询要求,但在大数据时代下XML文件容量与结构复杂性不断增加,现有支持通配符查询的算法需消耗巨量内存来解析XML,并且在对嵌套通配符处理时需要大量的单路径匹配操作和局部结果的缓存。针对此现状,结合现有经典算法,提出一种新的、能够高效解决小枝模式当中含有通配符*的查询算法—WTwigList。该算法首先对查询模式进行通配符的层次关系处理,减少不必要的通配符匹配,以数据流形式解析XML文件并执行局部的扩展Dewey编码,经过滤操作后得到有序的叶子节点编码列表,在列表中执行匹配操作得到结果;其次在真实和合成数据集上做大量实验,结果表明WTwigList算法与现有算法相比,能够有效提高查询效率,在空间效率上具有一定优势,且能够快速准确地处理查询模式中PC关系。

 

关键词: 通配符查询, 流数据处理, 扩展Dewey编码, XML模式匹配

Abstract:

In XML query language, the wildcard query which includes “*” can effectively meet some special query requirements. But in the big data era, with the increasing of the XML file size and structural complexity, the existing algorithms which support wildcard query need huge amounts of memory to parse XML file and also need many single path matching operations and local result caching. Aiming at this situation, we propose a new XML pattern matching algorithm named WTwigList to solve the twig pattern containing the wildcard effectively. First, the hierarchical relationship of wildcard in the query pattern is processed to reduce unnecessary wildcard matching. Then the XML file is parsed as data stream pattern and the local Extended Dewey encoding is executed. After filtering operation, the ordered list of leaf node encoding is gotten, and the matching results can get from the list matching operations. A set of experimental result on both reallife and synthetic dataset demonstrates that WTwigList improves query efficiency andis of advantages in space efficiency, and it can deal with the P-C relationship quickly and accurately.

Key words: wildcard query, stream data processing, Extended Dewey Encoding, XML pattern matching

中图分类号: