Computer and Modernization ›› 2025, Vol. 0 ›› Issue (03): 71-77.doi: 10.3969/j.issn.1006-2475.2025.03.011


sORF-BERT: A Method for Identifying Coding sORFs Based on Pre-trained Models

  

  (1. School of Computer Science and Technology, Shandong University of Technology, Zibo 255049, China;
  2. Shandong Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou 253023, China)
  • Online: 2025-03-28  Published: 2025-03-28

Abstract: Small open reading frames (sORFs) are open reading frames in the genome no longer than 300 bases; they are crucial for maintaining cellular metabolic balance and the fundamental physiological functions of organisms. To mine the deep characteristics of sORF sequences and to improve the accuracy of cross-species prediction of coding and non-coding sORFs, an sORF-BERT neural network model is proposed. The model combines DNABERT pre-training with a hybrid encoding strategy and introduces a CAL module to learn multi-scale features of sORFs. Experiments are conducted on prokaryotic genomes as well as human, mouse, Arabidopsis, and Escherichia coli datasets. After pre-training and fine-tuning, sORF-BERT effectively captures the rich biological features of sORF sequences and uses the CAL module to better learn sORF features at different scales. Cross-species comparisons of sORF-BERT with six published state-of-the-art methods (CPPred, DeepCPP, CNCI, CPPred-sORF, MiPiped, and PsORFs) show that sORF-BERT improves performance on all five independent test sets. Compared with the second-ranked PsORFs, sORF-BERT gains 0.42 to 18.72 percentage points in ACC and 1.08 to 11.75 percentage points in MCC, demonstrating the superiority of the method for predicting coding sORFs and its potential to advance basic biological research.
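To make the problem setting concrete, the sketch below illustrates the two preprocessing notions the abstract relies on: extracting candidate sORFs (ATG-initiated open reading frames of at most 300 nt, per the definition above) and converting a sequence into overlapping k-mer tokens, the input format used by DNABERT-style encoders. This is a minimal illustrative sketch, not the paper's actual sORF-BERT pipeline; all function names here are hypothetical.

```python
# Illustrative sketch (hypothetical helper names, not the paper's code):
# scan the forward strand for ATG-initiated ORFs of <= 300 nt, then
# tokenize a candidate into overlapping k-mers for a DNABERT-style model.

STOP_CODONS = {"TAA", "TAG", "TGA"}
MAX_SORF_LEN = 300  # sORF length bound from the definition in the abstract


def find_sorfs(seq: str):
    """Yield (start, end, orf) for ATG-initiated ORFs <= 300 nt, all three frames."""
    seq = seq.upper()
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # walk codon-by-codon until the first in-frame stop codon
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in STOP_CODONS:
                        orf = seq[i:j + 3]
                        if len(orf) <= MAX_SORF_LEN:
                            yield i, j + 3, orf
                        break
            i += 3


def kmer_tokens(orf: str, k: int = 3):
    """Overlapping k-mer tokens (stride 1), as DNABERT-style encoders expect."""
    return [orf[p:p + k] for p in range(len(orf) - k + 1)]


seq = "CCATGAAATTTGGGTAACC"
sorfs = list(find_sorfs(seq))
print(sorfs)                          # [(2, 17, 'ATGAAATTTGGGTAA')]
print(kmer_tokens(sorfs[0][2])[:4])   # ['ATG', 'TGA', 'GAA', 'AAA']
```

In a real pipeline the candidate sORFs would also be scanned on the reverse complement strand, and the k-mer tokens would be mapped to vocabulary IDs before being fed to the pre-trained encoder for fine-tuning.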

Key words: small open reading frame, pre-training, hybrid coding, feature fusion, cross-species prediction
