Computer and Modernization ›› 2025, Vol. 0 ›› Issue (12): 38-45. doi: 10.3969/j.issn.1006-2475.2025.12.006

• Artificial Intelligence •

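MPI-based Heterogeneous Computing Resource Integration and Scheduling Platform

  1. (Jiangxi Science and Technology Infrastructure Center, Nanchang 330003, China)
  • Online: 2025-12-18 Published: 2025-12-18
  • About the authors: Ye Ning (b. 1991), male, from Ji'an, Jiangxi, engineer, master's degree, research interests: high-performance computing, network optimization, E-mail: yyycsu@163.com; Fu Kang (b. 1974), male, from Jinxian, Jiangxi, professor-level senior engineer, master's degree, research interests: high-performance computing, information security, E-mail: 42713269@qq.com; Hu Shaowen (b. 1987), male, from Nanchang, Jiangxi, master's degree, research interests: high-performance computing, information security, E-mail: 806814726@163.com; Gong Yifeng (b. 1998), male, from Fuzhou, Jiangxi, master's degree, research interests: high-performance computing, artificial intelligence, E-mail: 1099266115@qq.com; Wang Kang (b. 1989), male, from Nanchang, Jiangxi, master's degree, research interests: information systems, high-performance computing, E-mail: kkfish8@163.com; Yang Yuxian (b. 1973), female, from Xingguo, Jiangxi, senior engineer, master's degree, research interests: computer applications, E-mail: 56638293@qq.com.
  • Supported by:
    Key Research and Development Program of Jiangxi Province (20224BBC31002)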

Abstract: High-performance computing centers, especially small and medium-sized ones, often cannot take on large-scale computing jobs because their heterogeneous computing resources are scattered. To address this problem, this paper designs and implements a heterogeneous computing resource integration and scheduling platform that provides unified management of, and collaborative computing across, heterogeneous resources such as X86 and ARM. The platform adopts a layered fusion scheduling architecture: a cluster manager server (CMS) and job manager clients (JMC) dynamically monitor resource status, and a job scheduler (JS) coordinates the parallel execution of computing tasks across heterogeneous compute nodes. Master-slave JMC process collaboration, combined with the reduction mechanism of the Message Passing Interface (MPI), achieves cross-architecture data synchronization at the physical machine level and, for the first time, parallel execution of a single job across heterogeneous compute nodes at the physical machine level. To address the long-tail latency and synchronization overhead caused by performance imbalance in heterogeneous clusters, this paper proposes a deadline-constrained minimal resource (DCMR) algorithm, which minimizes the computing resources committed to a job while guaranteeing that the job completes within its deadline. Test results show that the platform incurs almost no loss of computing performance in heterogeneous environments, and that the DCMR algorithm effectively improves the utilization of heterogeneous computing resources, providing a reliable system-level solution for heterogeneous computing environments.

Key words: heterogeneous computing resource, resource scheduling, MPI, physical machine, small and medium-sized computing centers
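
The abstract states the goal of DCMR (meet the job deadline with as few computing resources as possible) but not its details. The following is therefore only a minimal sketch of that general idea under a simplified model assumed here, not the paper's algorithm: work is split equally across the selected nodes, so the slowest node determines the compute time (the long-tail effect), and each added node contributes a fixed synchronization cost. The names `Node`, `estimate_runtime`, `min_nodes_for_deadline`, `sync_cost`, and all numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Node:
    name: str
    arch: str     # e.g. "x86" or "arm"
    speed: float  # work units per second (hypothetical benchmark score)


def estimate_runtime(work: float, nodes: List[Node], sync_cost: float) -> float:
    """Estimated job time under the assumed model.

    Work is partitioned equally, so every node waits for the slowest one;
    sync_cost is an assumed fixed overhead added per participating node.
    """
    slowest = min(n.speed for n in nodes)
    return work / (len(nodes) * slowest) + sync_cost * len(nodes)


def min_nodes_for_deadline(
    work: float, deadline: float, pool: List[Node], sync_cost: float = 0.0
) -> Optional[List[Node]]:
    """Return the smallest node set whose estimated runtime meets the deadline.

    For a fixed node count k, the k fastest nodes maximize the slowest speed,
    so scanning k = 1, 2, ... over the speed-ranked pool finds the minimum
    count under this model. Returns None if no subset meets the deadline.
    """
    ranked = sorted(pool, key=lambda n: n.speed, reverse=True)
    for k in range(1, len(ranked) + 1):
        chosen = ranked[:k]
        if estimate_runtime(work, chosen, sync_cost) <= deadline:
            return chosen
    return None


# Hypothetical mixed X86/ARM pool used only for illustration.
pool = [
    Node("x86-a", "x86", 10.0),
    Node("x86-b", "x86", 10.0),
    Node("arm-a", "arm", 6.0),
    Node("arm-b", "arm", 6.0),
]

if __name__ == "__main__":
    chosen = min_nodes_for_deadline(work=600, deadline=35, pool=pool, sync_cost=1.0)
    print([n.name for n in chosen] if chosen else "deadline cannot be met")
```

In this toy pool, adding one slower ARM node to the two X86 nodes raises the estimated runtime instead of lowering it, because the equal split makes every node wait for the slowest one. That is the kind of imbalance a deadline-aware allocation has to account for; a real scheduler would also need measured speeds and a measured synchronization cost.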

CLC Number: