RDF Stream Data Query Based on Apache Flink

Abstract

Mature RDF Stream Processing (RSP) systems currently lack parallel processing capabilities because of their centralized design, so they cannot achieve high throughput and low latency when querying large volumes of incoming RDF stream data. To improve query performance, this paper studies the RSP query process and the stream computing architecture of Flink, designs four logical operators (source, filter, multi-way partition join, and project), and proposes a Multi-Stream Join (MSJ) algorithm that generates a parallel logical query plan in the form of a directed acyclic graph. The logical operators and the logical query plan are then implemented on Apache Flink, a big data stream processing platform. The real-world data set SRBench and the synthetic data set LUBMs are used for experimental verification. The results show that, compared with the most mature systems, C-SPARQL and CQELS, throughput increases by 10 times on a single machine and by 28 times on a cluster of 5 machines, and latency reaches the millisecond level. In other words, when processing large volumes of RDF stream data, query throughput improves and latency falls.
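
To make the four logical operators concrete, the following is a minimal sketch in plain Python of how a source, a filter, a hash-partitioned multi-way join, and a project step might chain together for a two-pattern query. It is not the paper's MSJ algorithm and does not use the Flink API; the function names, the sample triples, the choice to join on the shared subject, and the partition count are all assumptions made for illustration.

```python
import itertools
from collections import defaultdict
from typing import Iterable, Iterator, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)

# Hypothetical sample data standing in for one window of an RDF stream.
TRIPLES: List[Triple] = [
    ("sensor1", "hasTemperature", "21.5"),
    ("sensor1", "locatedIn", "Berlin"),
    ("sensor2", "hasTemperature", "19.0"),
    ("sensor3", "locatedIn", "Paris"),
    ("sensor2", "locatedIn", "Rome"),
]


def source(triples: Iterable[Triple]) -> Iterator[Triple]:
    """Source operator: emit triples one at a time."""
    yield from triples


def filter_by_predicate(stream: Iterable[Triple], predicate: str) -> Iterator[Triple]:
    """Filter operator: keep triples that match one triple pattern's predicate."""
    return (t for t in stream if t[1] == predicate)


def partition_join(streams: Sequence[Iterable[Triple]], partitions: int = 4) -> List[tuple]:
    """Multi-way join on the shared subject, hash-partitioned by the join key.

    Each bucket holds every triple for its keys, so buckets are independent
    and could be handled by separate parallel tasks (assumed design).
    """
    n = len(streams)
    buckets = [defaultdict(lambda: [[] for _ in range(n)]) for _ in range(partitions)]
    for i, stream in enumerate(streams):
        for subj, _, obj in stream:
            buckets[hash(subj) % partitions][subj][i].append(obj)

    rows = []
    for bucket in buckets:
        for subj, per_stream in bucket.items():
            if all(per_stream):  # the subject matched every input stream
                for combo in itertools.product(*per_stream):
                    rows.append((subj, *combo))
    return rows


def project(rows: Iterable[tuple], columns: Sequence[int]) -> List[tuple]:
    """Project operator: keep only the requested columns, in the given order."""
    return [tuple(row[c] for c in columns) for row in rows]


def run_query(triples: Iterable[Triple]) -> List[tuple]:
    """SELECT ?loc ?temp WHERE { ?s hasTemperature ?temp . ?s locatedIn ?loc }"""
    triples = list(triples)
    temps = filter_by_predicate(source(triples), "hasTemperature")
    locs = filter_by_predicate(source(triples), "locatedIn")
    joined = partition_join([temps, locs])  # rows: (subject, temp, loc)
    return sorted(project(joined, columns=(2, 1)))


if __name__ == "__main__":
    print(run_query(TRIPLES))
```

Results are sorted before being returned because the bucket a subject lands in depends on Python's string hashing, which varies between runs. In a streaming engine the same chain would run once per window rather than over a fixed list.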

Key words: RDF stream, parallel processing, logical operators, multi-stream join, Apache Flink

ZHENG Tao, LIU Meng-chi, FENG Jia-mei. RDF Stream Data Query Based on Apache Flink[J]. Computer and Modernization, 2020, 0(11): 47-55.

