Computer and Modernization ›› 2020, Vol. 0 ›› Issue (11): 47-55.

• Database and Data Mining •

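RDF Stream Data Query Based on Apache Flink

  1. (School of Computer Science, South China Normal University, Guangzhou 510631, China)
  • Online: 2020-12-03 Published: 2020-12-03
  • About the authors: Zheng Tao (b. 1996), male, from Heyuan, Guangdong, master's student, research interests: stream data processing, database theory and systems, E-mail: zhengtao649557301@qq.com; Liu Mengchi (b. 1962), male, from Wuhan, Hubei, professor, Ph.D., research interests: database theory and systems, intelligent education systems based on big data, E-mail: mengchiliu1@gmail.com; Feng Jiamei (b. 1992), female, Ph.D. candidate, research interests: big data applications, database theory and systems, E-mail: 739719252@qq.com.
  • Supported by:
    National Natural Science Foundation of China (61672389); Guangzhou Key Laboratory of Big Data and Intelligent Education (201905010009)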

Abstract: Mature RDF Stream Processing (RSP) systems are centralized by design and therefore lack parallel processing, so they cannot achieve high throughput and low latency when querying large volumes of incoming RDF stream data. To improve query performance, this paper studies the RSP query process and the Flink stream computation architecture, designs four logical operators (source, filter, multi-way partition join, and project), and proposes a Multi-Stream Join (MSJ) algorithm that generates a parallel logical query plan in the form of a directed acyclic graph. The logical operators and the logical query plan are then implemented on top of the big data stream processing platform Apache Flink. Experiments use the real-world data set SRBench and the synthetic data set LUBMs. Compared with the most mature systems, C-SPARQL and CQELS, throughput is up to 10 times higher on a single machine and up to 28 times higher on a 5-machine cluster, while latency reaches the millisecond level. The approach thus raises throughput and lowers latency when processing large volumes of RDF stream data.

Key words: RDF stream, parallel processing, logical operators, multi-stream join, Apache Flink
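
The four logical operators named in the abstract can be illustrated with a small sketch. This is plain Python over an in-memory window of triples, not the paper's implementation and not the Apache Flink API. The names `Triple`, `source`, `filter_op`, `partition_join`, `project`, and `run_query`, the sample data, and the choice of the subject as join key are all assumptions made for illustration. The join shown is a binary hash-partitioned join, a simplification of the multi-way partition join and MSJ algorithm the paper describes.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple


class Triple(NamedTuple):
    """One RDF statement: subject, predicate, object (hypothetical type)."""
    s: str
    p: str
    o: str


def source(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Source operator: emit the triples of one window, one at a time."""
    yield from triples


def filter_op(stream: Iterable[Triple], predicate: str) -> Iterator[Triple]:
    """Filter operator: keep triples that match a single triple pattern."""
    return (t for t in stream if t.p == predicate)


def partition_join(left: Iterable[Triple], right: Iterable[Triple],
                   num_partitions: int = 4) -> Iterator[Dict[str, str]]:
    """Binary hash-partitioned join on the shared subject.

    Triples with the same subject always land in the same partition, so
    each partition could be joined by an independent parallel worker.
    Here the partitions are processed sequentially for simplicity.
    """
    parts = [defaultdict(list) for _ in range(num_partitions)]
    for t in left:
        parts[hash(t.s) % num_partitions][t.s].append(t)
    for r in right:
        for l in parts[hash(r.s) % num_partitions].get(r.s, []):
            yield {"sensor": l.s, "temp": l.o, "loc": r.o}


def project(bindings: Iterable[Dict[str, str]],
            variables: List[str]) -> List[Tuple[str, ...]]:
    """Project operator: keep only the requested variables, in order."""
    return [tuple(b[v] for v in variables) for b in bindings]


def run_query(window: List[Triple]) -> List[Tuple[str, ...]]:
    """Wire the operators into a small DAG: two filters feed one join."""
    temps = filter_op(source(window), "hasTemp")
    locs = filter_op(source(window), "locatedIn")
    return project(partition_join(temps, locs), ["sensor", "temp", "loc"])


# Hypothetical window of sensor observations.
window = [
    Triple("s1", "hasTemp", "21"),
    Triple("s1", "locatedIn", "roomA"),
    Triple("s2", "hasTemp", "25"),
    Triple("s3", "locatedIn", "roomC"),
    Triple("s2", "locatedIn", "roomB"),
]
rows = run_query(window)
print(rows)
```

In a Flink-based design, the partitioning step would typically be delegated to the platform's keyed streams, so that each partition runs on its own parallel task. The sketch only shows why partitioning by join key makes the join safe to parallelize.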