计算机与现代化 ›› 2022, Vol. 0 ›› Issue (10): 100-105.

• 网络与通信 • 上一篇    下一篇

面向信息网模型的动态数据划分算法

  

  1. (华南师范大学计算机学院,广东广州510631)
  • 出版日期:2022-10-20 发布日期:2022-10-24
  • 作者简介:袁嘉立(1996—),男,浙江宁波人,硕士研究生,研究方向:数据库技术,大数据,E-mail: 1052019421@qq.com; 刘梦赤(1962—),男,湖北武汉人,教授,博士生导师,博士,研究方向:大数据系统,数据库系统,智能信息系统,E-mail: liumengchi@scnu.edu.cn。
  • 基金资助:
    国家自然科学基金资助项目(61672389); 广州市大数据智能教育重点实验室项目(201905010009)

Dynamic Data Partition Algorithm for Information Network Model

  1. (School of Computer Science, South China Normal University, Guangzhou 510631, China)
  • Online:2022-10-20 Published:2022-10-24

摘要: 针对分布式信息网数据库管理系统中因跨节点的复杂查询带来的昂贵通信开销,提出一种基于信息网模型和查询的数据动态划分算法。该算法根据信息网模型的关系特性和历史关系信息得到数据之间的初始关联,并结合历史查询信息挖掘数据之间的潜在关联,将关联性较强的数据动态调整到同一个处理节点上,使复杂查询跨节点的数量减少。最后,在标准合成数据集WatDiv上进行大量的实验评估。实验结果表明:在保证节点之间的对象个数和关系对占比负载均衡的情况下,该算法在周期内的查询时间与一致性哈希算法相比缩短了35%~55%,并将多个周期相同查询的时间波动控制在5%~10%,保证了复杂查询的稳定性。

关键词: 信息网模型, 动态数据划分, 关联性, 负载均衡, 分布式系统

Abstract: Due to the high communication overhead caused by the complex query across nodes in the distributed information network model (INM) database management system, a dynamic data partition and query processing algorithm is proposed. Based on the relationship characteristics of INM model and the historical relationship information, it obtains the initial relevance between data, then mines the potential relevance between data based on the historical query information and dynamically adjusts the data with strong correlation to the same processing node, so as to reduce the number of cross-nodes traversals in complex query. The extensive experiments on synthetic dataset WatDiv are carried out. The experimental results show that the query time of this algorithm is reduced by 35%~55% compared with the consistent hash algorithm in the period by ensuring the load balance of the number of objects and the proportion of relationship pairs between nodes, and the time fluctuation of the same query in multiple periods is controlled within 5%~10%, which ensures the stability of complex queries.

Key words: information network model, dynamic data partition, relevance, load balancing, distributed system