计算机与现代化 ›› 2025, Vol. 0 ›› Issue (03): 71-77.doi: 10.3969/j.issn.1006-2475.2025.03.011

• Algorithm Design and Analysis •

• Supported by: Youth Innovation Team Development Plan of Shandong Provincial Higher Education Institutions (2019KJN048)

sORF-BERT: A Method for Identifying Coding sORFs Based on Pre-trained Models

  1. (1. School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China;
    2. Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China)
  • Online: 2025-03-28  Published: 2025-03-28


Abstract: Small open reading frames (sORFs) are open reading frames in the genome no longer than 300 bases; they play an important role in maintaining cellular metabolic balance and the basic physiological functions of organisms. To mine the deep features of sORF sequences and to further improve the accuracy of cross-species prediction of coding and non-coding sORFs, this paper proposes sORF-BERT, a neural network model that integrates DNABERT pre-training with a hybrid data-encoding strategy and introduces a CAL module to learn multi-scale features of sORFs. Analyses are conducted on prokaryotic genomes as well as human, mouse, Arabidopsis, and Escherichia coli datasets. After pre-training and fine-tuning, sORF-BERT effectively captures the rich biological features of sORF sequences, and the CAL module helps it learn sORF features at different scales. Cross-species comparisons of sORF-BERT with six published state-of-the-art methods (CPPred, DeepCPP, CNCI, CPPred-sORF, MiPiped, and PsORFs) show that sORF-BERT improves performance on all five independent test sets. Compared with the second-ranked PsORFs, it raises ACC by 0.42 to 18.72 percentage points and MCC by 1.08 to 11.75 percentage points, demonstrating the method's superiority in predicting coding sORFs and its potential to advance basic biological research.
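The abstract defines an sORF as an ATG-initiated open reading frame of at most 300 bases (start codon through stop codon). As a minimal illustration of that definition only — this is not the paper's data pipeline, and the function name is ours — a forward-strand scan might look like:

```python
# Illustrative sketch of the sORF definition from the abstract:
# an ATG-initiated ORF whose total length (start through stop codon)
# does not exceed 300 bases. Forward strand only; names are ours.

STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_sorfs(seq, max_len=300):
    """Return (start, end, orf) tuples for ATG-initiated ORFs
    whose total length, including the stop codon, is <= max_len."""
    seq = seq.upper()
    sorfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # walk codon by codon until a stop codon or the length cap
        pos = start + 3
        while pos + 3 <= len(seq) and pos + 3 - start <= max_len:
            if seq[pos:pos + 3] in STOP_CODONS:
                sorfs.append((start, pos + 3, seq[start:pos + 3]))
                break
            pos += 3
    return sorfs
```

For example, `find_sorfs("ATGAAATAA")` yields the single 9-base ORF `ATGAAATAA`, while an ORF whose stop codon lies beyond `max_len` is discarded rather than truncated.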

Key words: small open reading frame, pre-training, hybrid encoding, feature fusion, cross-species prediction
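The "pre-training" keyword refers to DNABERT, which is known to tokenize DNA into overlapping k-mers before BERT-style masked-language pre-training. A minimal sketch of that tokenization step (the function name is ours, not from the sORF-BERT paper):

```python
# Overlapping k-mer tokenization, the token unit used by
# DNABERT-style models (k is typically 3 to 6). Illustrative only.

def kmer_tokens(seq, k=3):
    """Split a DNA sequence into overlapping k-mers with stride 1."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]
```

For instance, `kmer_tokens("ATGCAT")` produces the four 3-mers `ATG`, `TGC`, `GCA`, `CAT`, which would then be mapped to vocabulary IDs for the transformer.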

CLC number: