Computer and Modernization

• Database and Data Mining •

Realization of Accelerating Gene Big Data Analysis by Grid Computing

  (1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China;
   2. Beijing Chigene Translational Medical Research Center Co. Ltd., Beijing 100176, China)
  • Received: 2019-04-03 Online: 2019-08-15 Published: 2019-08-16
  • About the author: Yang Shuanghao (杨双浩) (1987-), male, from Xuchang, Henan, master's student. Research interests: data mining, grid computing, bioinformatics. E-mail: yshbioinfo@126.com.
  • Supported by:
    National Key Research and Development Program of China (2018YFB1004100)

Abstract: Gene sequencing produces large volumes of data that take a long time to analyze, while FPGA- and GPU-based computing platforms are costly to build and poorly compatible with existing analysis software. To address these problems, this paper applies distributed computing to design Sequence Grid (SeqGrid), an architecture for high-throughput sequencing data analysis. SeqGrid runs on the open-source CentOS operating system and uses the grid engine Sun Grid Engine (SGE) together with ordinary CPUs, mechanical hard disks, and SSDs, scheduling bioinformatics software such as bwa and GATK concurrently to carry out the analysis. The results show that the analysis time for 30 GB of whole-exome sequencing data from a single individual is reduced from 15 h to 1 h, a 15-fold speedup over the serial pipeline, which effectively improves the efficiency of data analysis.

Key words: grid computing, high-performance computing cluster, bioinformatics, high-throughput sequencing
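
The concurrent scheduling described in the abstract can be pictured as a scatter-gather job graph submitted to SGE. The abstract does not give the actual SeqGrid scripts, chunking strategy, or job names, so the following Python sketch is only an illustration under stated assumptions: the wrapper scripts `align_chunk.sh`, `merge_chunks.sh`, and `call_variants.sh` are hypothetical stand-ins for the bwa and GATK steps, the parallel environment name `smp` is site-specific, and the function only builds `qsub` command lines without submitting them.

```python
import shlex
from typing import List, Sequence


def qsub(name: str, script: str, args: Sequence[str] = (),
         hold: Sequence[str] = (), slots: int = 1) -> List[str]:
    """Build (but do not run) one SGE qsub command line."""
    cmd = ["qsub", "-N", name, "-cwd"]
    if slots > 1:
        # "smp" is an assumed parallel environment name; it varies by site.
        cmd += ["-pe", "smp", str(slots)]
    if hold:
        # -hold_jid keeps this job queued until the listed jobs have finished.
        cmd += ["-hold_jid", ",".join(hold)]
    return cmd + [script, *args]


def plan_sample(sample: str, n_chunks: int, threads: int = 4) -> List[List[str]]:
    """Plan a scatter-gather pipeline for one sample.

    Scatter: one alignment job per input chunk, all free to run concurrently.
    Gather:  a merge job that waits for every alignment job.
    Final:   a variant-calling job that waits for the merge.
    The three wrapper scripts are hypothetical placeholders.
    """
    jobs: List[List[str]] = []
    align_names = []
    for i in range(n_chunks):
        name = f"align_{sample}_{i}"
        align_names.append(name)
        jobs.append(qsub(name, "align_chunk.sh", [sample, str(i)], slots=threads))

    merge_name = f"merge_{sample}"
    jobs.append(qsub(merge_name, "merge_chunks.sh", [sample], hold=align_names))
    jobs.append(qsub(f"call_{sample}", "call_variants.sh", [sample], hold=[merge_name]))
    return jobs


if __name__ == "__main__":
    # Print the planned submissions for inspection instead of running them.
    for job in plan_sample("S1", 3):
        print(shlex.join(job))
```

Because dependencies are expressed by job name through `-hold_jid`, all jobs can be submitted at once and the grid engine decides when each one may start. The chunk count and thread count here are arbitrary illustration values, not figures from the paper.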

CLC Number: