A New Feature Extraction Approach in Classification for Improving the Accuracy of Proteins

Damayanti - Universitas Teknokrat Indonesia, Lampung, 35132, Indonesia
Favorisen Rosyking Lumbanraja - University of Lampung, Lampung, 35141, Indonesia
Akmal Junaidi - University of Lampung, Lampung, 35141, Indonesia
Sutyarso - University of Lampung, Lampung, 35141, Indonesia
Gregorius Nugroho Susanto - University of Lampung, Lampung, 35141, Indonesia
Dyah Ayu Megawaty - Universitas Teknokrat Indonesia, Lampung, 35132, Indonesia


DOI: http://dx.doi.org/10.62527/joiv.9.1.2589

Abstract


Proteins are essential macromolecules: linear heteromeric biopolymers of amino acids joined covalently by peptide bonds. They contribute to cell development and support the body's defense mechanisms. To function optimally, proteins require post-translational modifications such as glycosylation. Glycosylation attaches sugar groups to proteins and plays a critical role in protein folding. Dysregulated protein glycosylation can lead to diseases such as Alzheimer's and cancer. Classifying glycosylated proteins manually is time-consuming, so a faster approach is needed. This study aims to speed up and improve the accuracy of glycosylated protein classification using five feature extraction methods (AAindex, CTD, SABLE, hydrophobicity, and PseAAC) and compares the results with existing approaches. The dataset consists of protein sequences taken from the openly accessible UniProt database. The results show that glycosylated protein prediction achieved 100% accuracy, surpassing previous approaches. Of the five extraction methods developed, PseAAC made the largest contribution to this improvement at 40%, followed by hydrophobicity at 24%.
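As a rough illustration of what sequence-based feature extraction looks like, the sketch below turns a protein sequence into a fixed-length numeric vector. It is not the authors' pipeline: it computes only a plain amino acid composition and a mean hydrophobicity value, and it assumes the Kyte-Doolittle hydropathy scale, which the abstract does not specify. The function names `amino_acid_composition`, `mean_hydrophobicity`, and `extract_features` are hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order so every vector has the same layout.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy scale.
# Assumption: the abstract does not say which hydrophobicity scale was used.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def amino_acid_composition(sequence):
    """Return the fraction of each standard amino acid (20 values)."""
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [counts[aa] / total for aa in AMINO_ACIDS]


def mean_hydrophobicity(sequence):
    """Return the average hydropathy over the standard residues."""
    values = [KYTE_DOOLITTLE[aa] for aa in sequence.upper() if aa in KYTE_DOOLITTLE]
    if not values:
        raise ValueError("sequence contains no standard amino acids")
    return sum(values) / len(values)


def extract_features(sequence):
    """Concatenate composition (20 values) and mean hydrophobicity (1 value)."""
    return amino_acid_composition(sequence) + [mean_hydrophobicity(sequence)]
```

Each sequence maps to a 21-element vector, which could then be passed to a classifier such as XGBoost (named in the keywords). Non-standard residue codes such as `X` are ignored here, which is a simplification.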

Keywords


proteins; classification; XGBoost; new approach; glycosylation; machine learning; PseAAC; AAindex.
