Computer and Modernization ›› 2020, Vol. 0 ›› Issue (11): 47-55.


RDF Stream Data Query Based on Apache Flink

  

  1. (School of Computer Science, South China Normal University, Guangzhou 510631, China)
  • Online: 2020-12-03 Published: 2020-12-03

Abstract: Mature RDF Stream Processing (RSP) systems are centralized by design and therefore lack parallel processing, so they cannot achieve high throughput and low latency when querying large volumes of incoming RDF stream data. To improve query performance, this paper studies the RSP query process and the stream computing architecture of Flink, designs four logical operators (source, filter, multi-way partition join, and project), and proposes a Multi-Stream Join (MSJ) algorithm that generates a parallel logical query plan in the form of a directed acyclic graph. The logical operators and the logical query plan are then implemented on Apache Flink, a big data stream processing platform. Experiments are conducted on the real data set SRBench and the simulated data set LUBMs. The results show that, compared with the mature systems C-SPARQL and CQELS, throughput increases by a factor of 10 on a single machine and by a factor of 28 on a cluster of 5 machines, and latency reaches the millisecond level. The approach thus improves throughput and reduces latency when processing large volumes of RDF stream data.

Key words: RDF stream, parallel processing, logical operators, multi-stream join, Apache Flink
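
The four logical operators named in the abstract can be illustrated with a minimal, single-process sketch. This is not the paper's implementation and does not use the Flink API: the function names, the predicates "observedBy" and "hasValue", and the variable names are hypothetical, the join is binary rather than multi-way, and windowing is omitted. It only shows the general idea of hash-partitioning two filtered streams on a shared variable so that each partition could be joined independently.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

Triple = Tuple[str, str, str]   # (subject, predicate, object)
Binding = Dict[str, str]        # query variable -> RDF term


def source(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Source operator: emit triples one at a time, as a stream would."""
    yield from triples


def filter_pattern(stream: Iterable[Triple], predicate: str,
                   s_var: str, o_var: str) -> Iterator[Binding]:
    """Filter operator: keep triples matching one triple pattern
    and bind the subject and object to query variables."""
    for s, p, o in stream:
        if p == predicate:
            yield {s_var: s, o_var: o}


def partition_join(left: Iterable[Binding], right: Iterable[Binding],
                   key: str, parallelism: int = 2) -> Iterator[Binding]:
    """Partition join operator (binary here, for brevity): route both inputs
    to partitions by hash of the join key, then hash-join each partition.
    Partitions share no state, so they could run in parallel."""
    build = [defaultdict(list) for _ in range(parallelism)]
    probe: List[List[Binding]] = [[] for _ in range(parallelism)]
    for b in left:
        build[hash(b[key]) % parallelism][b[key]].append(b)
    for b in right:
        probe[hash(b[key]) % parallelism].append(b)
    for table, rows in zip(build, probe):
        for r in rows:
            for l in table.get(r[key], []):
                yield {**l, **r}


def project(stream: Iterable[Binding], variables: List[str]) -> Iterator[tuple]:
    """Project operator: keep only the requested variables."""
    for b in stream:
        yield tuple(b[v] for v in variables)


def run_query(triples: Iterable[Triple]) -> List[tuple]:
    """Wire the operators into a small DAG: two source/filter branches
    feeding one join, followed by a projection."""
    triples = list(triples)
    obs = filter_pattern(source(triples), "observedBy", "?obs", "?sensor")
    val = filter_pattern(source(triples), "hasValue", "?obs", "?value")
    joined = partition_join(obs, val, "?obs")
    return sorted(project(joined, ["?sensor", "?value"]))
```

In a real deployment the partitioning step would be handled by the stream engine's keyed shuffle, and the join state would be bounded by a window; both are left out here to keep the sketch short.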