Realization of Accelerating Gene Big Data Analysis by Grid Computing

Computer and Modernization, 2019, Vol. 0, Issue (08): 85-. doi: 10.3969/j.issn.1006-2475.2019.08.016

(1. School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China; 2. Beijing Chigene Translational Medical Research Center Co. Ltd., Beijing 100176, China)

Received: 2019-04-03 Online: 2019-08-15 Published: 2019-08-16
About the author: YANG Shuang-hao (b. 1987), male, from Xuchang, Henan; master's student; research interests: data mining, grid computing, and bioinformatics. E-mail: yshbioinfo@126.com.
Supported by: National Key R&D Program of China (2018YFB1004100)

Abstract

Abstract: To address the large volume of gene sequencing data, long analysis times, the high cost of building FPGA and GPU computing platforms, and the limited software compatibility of such platforms, this paper applies distributed computing to design Sequence Grid (SeqGrid), an architecture for high-throughput sequencing data analysis. The architecture runs the open-source CentOS operating system and uses the grid engine Sun Grid Engine (SGE) with ordinary CPUs, mechanical hard disks, and SSDs, performing the analysis by concurrently scheduling bioinformatics software such as bwa and GATK. The results show that the analysis time for 30 GB of whole-exome data from a single individual is reduced from 15 h to 1 h, a 15-fold speedup over the serial pipeline, which effectively improves the efficiency of data analysis.

Key words: grid computing, high performance computing cluster, bioinformatics, high-throughput sequencing
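The abstract does not give SeqGrid's scheduling details, so the following is only a minimal sketch of how a scatter-gather workflow might be expressed for SGE, assuming the analysis is split by genomic region. The wrapper scripts `align.sh`, `call_variants.sh`, and `merge.sh`, the job-naming scheme, and the function names are all hypothetical; only the `qsub` options `-cwd`, `-N`, and `-hold_jid` are standard SGE options. The code builds the command lines without submitting them.

```python
from typing import List, Optional


def qsub_command(name: str, script: str, script_args: List[str],
                 hold_for: Optional[List[str]] = None) -> List[str]:
    """Build (but do not run) an SGE qsub argument list for one job."""
    args = ["qsub", "-cwd", "-N", name]
    if hold_for:
        # -hold_jid delays this job until the listed jobs have finished.
        args += ["-hold_jid", ",".join(hold_for)]
    return args + [script] + script_args


def plan_pipeline(sample: str, regions: List[str]) -> List[List[str]]:
    """Plan one alignment job, one variant-calling job per region, one merge job.

    The script names are hypothetical placeholders for wrappers around
    tools such as bwa and GATK; they are not taken from the paper.
    """
    align_job = f"align_{sample}"
    commands = [qsub_command(align_job, "align.sh", [sample])]

    call_jobs = []
    for region in regions:
        job = f"call_{sample}_{region}"
        call_jobs.append(job)
        # Each per-region job waits only for the alignment, so the
        # region jobs can run concurrently on different nodes.
        commands.append(qsub_command(job, "call_variants.sh",
                                     [sample, region], hold_for=[align_job]))

    # The merge step waits for every per-region job.
    commands.append(qsub_command(f"merge_{sample}", "merge.sh",
                                 [sample], hold_for=call_jobs))
    return commands


if __name__ == "__main__":
    for cmd in plan_pipeline("S1", ["chr1", "chr2", "chr3"]):
        print(" ".join(cmd))
```

Under this assumed layout, the per-region jobs are the part that runs concurrently; the alignment and merge steps remain serial points, which is one reason a measured speedup depends on how the work is partitioned.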

CLC Number: TP338.8

YANG Shuang-hao. Realization of Accelerating Gene Big Data Analysis by Grid Computing[J]. Computer and Modernization, 2019, 0(08): 85-.

