RDF Stream Data Query Based on Apache Flink

Abstract

Mature RDF Stream Processing (RSP) systems currently lack parallel processing because of their centralized design, so none of them can achieve both high throughput and low latency when querying large volumes of incoming RDF stream data. To improve query performance, this paper studies the RSP query process and the Flink stream computing architecture, designs four logical operators (source, filter, multi-way partition join, and project), and proposes a Multi-Stream Join (MSJ) algorithm that generates a parallel logical query plan in the form of a directed acyclic graph. The logical operators and the logical query plan are then implemented on top of Apache Flink, a big data stream processing platform. Experiments use the real-world data set SRBench and the synthetic data set LUBMs. The results show that, compared with the most mature systems, C-SPARQL and CQELS, throughput increases by up to 10 times on a single machine and by up to 28 times on a 5-machine cluster, while latency reaches the millisecond level. The approach thus raises throughput and lowers latency when processing large volumes of RDF stream data.

Keywords: RDF stream, parallel processing, logical operators, multi-stream join, Apache Flink
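As a rough illustration of how the four logical operators named in the abstract could fit together, the following minimal sketch chains a source, two filters, a hash-partitioned join, and a projection in plain Python. It is not the paper's MSJ algorithm and does not use the Flink API: the sample triples, the function names, the `parallelism` parameter, and the restriction to a two-stream join on the subject are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical sample stream of RDF triples (subject, predicate, object).
TRIPLES = [
    ("sensor1", "observes", "temperature"),
    ("sensor1", "locatedIn", "roomA"),
    ("sensor2", "observes", "humidity"),
    ("sensor2", "locatedIn", "roomB"),
    ("sensor3", "observes", "temperature"),
]


def source(triples):
    """Source operator: emit triples one at a time."""
    yield from triples


def filter_op(stream, predicate):
    """Filter operator: keep triples whose predicate matches a triple pattern."""
    return (t for t in stream if t[1] == predicate)


def partition_join(left, right, parallelism=2):
    """Partitioned join on the shared subject (simplified to two streams).

    Triples are hash-partitioned by subject, so each partition could be
    joined by an independent parallel task. Here the partitions are simply
    processed one after another.
    """
    parts = [(defaultdict(list), []) for _ in range(parallelism)]
    for s, _, o in left:
        parts[hash(s) % parallelism][0][s].append(o)   # build side
    for s, _, o in right:
        parts[hash(s) % parallelism][1].append((s, o))  # probe side
    for table, probes in parts:
        for s, o_right in probes:
            for o_left in table.get(s, []):
                yield {"s": s, "left": o_left, "right": o_right}


def project(stream, variables):
    """Project operator: keep only the requested variables of each binding."""
    return (tuple(binding[v] for v in variables) for binding in stream)


def run_query(triples):
    """Which property is observed in which room?"""
    observes = filter_op(source(triples), "observes")
    located = filter_op(source(triples), "locatedIn")
    joined = partition_join(observes, located)
    return sorted(project(joined, ["left", "right"]))


if __name__ == "__main__":
    print(run_query(TRIPLES))
```

Because both inputs are partitioned with the same hash function on the join key, matching triples always land in the same partition, which is the property that lets a partitioned join run in parallel without cross-partition communication.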

ZHENG Tao, LIU Meng-chi, FENG Jia-mei. RDF Stream Data Query Based on Apache Flink[J]. Computer and Modernization, 2020(11): 47-55.

