Journal of Biology ›› 2025, Vol. 42 ›› Issue (5): 67-. doi: 10.3969/j.issn.2095-1736.2025.05.067

• Research Report •

Machine learning-based prediction of secretory efficiency of signal peptides in Bacillus subtilis

MENG Xiangbo, LI Cen, YUAN Chengwu, LIU Fufeng, LU Fuping, PENG Chong

  1. College of Biotechnology, Tianjin University of Science and Technology, Tianjin 300457, China
  • Online: 2025-10-18 Published: 2025-10-14
  • Corresponding author: PENG Chong, Ph.D., lecturer; research interests: bioinformatics and microbial genomics. E-mail: cpeng@tust.edu.cn
  • About the author: MENG Xiangbo, master's degree; research interest: bioinformatics. E-mail: 22815007@mail.tust.edu.cn
  • Funding:
    National Natural Science Foundation of China (32001657); National Key Research and Development Program of China (2021YFC2100400)

Abstract: The efficiency with which a signal peptide directs the secretion of a heterologous protein is difficult to predict. To address this, eight datasets were constructed from data on heterologous protein secretion directed by signal peptides from Bacillus subtilis, and models for predicting signal peptide secretion efficiency were built with the support vector machine (SVM) and random forest (RF) algorithms. Combining datasets, sequence features, and algorithms in different ways yielded 458 classification models and 228 regression models. The best classification model, an RF model on the α-amylase dataset, reached an accuracy of 83.21%. The best regression model, also an RF model on the α-amylase dataset, had a coefficient of determination of 0.43. The amino acid composition and GC3 content (the frequency of G and C at the third codon position) of high- and low-efficiency signal peptides were also compared: high-efficiency signal peptides contained more unfolded amino acids and had a higher GC3 content. The study thus provides a means of predicting signal peptide secretion efficiency and explores the factors that affect it.

Key words: signal peptides, secretion efficiency, support vector machine, random forest, Bacillus subtilis
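
The abstract defines GC3 content as the frequency of G and C at the third position of codons. A minimal Python sketch of that definition is given below. The function name `gc3_content`, the requirement that the input be an in-frame coding sequence, and the toy sequence are assumptions made for illustration; the abstract does not say how the authors handled start codons, ambiguous bases, or normalization.

```python
def gc3_content(cds: str) -> float:
    """Return the fraction of codons whose third base is G or C.

    Assumes `cds` is an in-frame nucleotide coding sequence (hypothetical
    input convention; the source abstract does not specify one).
    """
    seq = cds.upper()
    if len(seq) == 0 or len(seq) % 3 != 0:
        raise ValueError("expected a non-empty, in-frame coding sequence")
    # Every third base, starting at index 2, is a codon's third position.
    third_bases = seq[2::3]
    gc_count = sum(base in "GC" for base in third_bases)
    return gc_count / len(third_bases)


if __name__ == "__main__":
    # Toy sequence for illustration only, not taken from the study's data.
    toy_sequence = "ATGAAACGCGTT"  # codons: ATG, AAA, CGC, GTT
    print(gc3_content(toy_sequence))
```

For the toy sequence, two of the four codons (ATG and CGC) end in G or C, so the function returns 0.5.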

CLC number: