Comparative study of machine learning algorithms for predicting drug induced autoimmunity using molecular descriptors
DOI:
https://doi.org/10.35335/mandiri.v14i1.436
Keywords:
Drug Induced Autoimmunity, Machine Learning, Model Comparison, Molecular Descriptors, SHAP Analysis
Abstract
Drug-induced autoimmunity (DIA) poses significant challenges in pharmaceutical development because of its complex immunological mechanisms and delayed clinical manifestations. This study presents a comparative evaluation of three ensemble machine learning models (CatBoost, XGBoost, and Gradient Boosting) for predicting DIA from molecular descriptors. A curated dataset of drug compounds with known autoimmune outcomes was analyzed through a systematic workflow comprising preprocessing, stratified sampling, and model evaluation using accuracy, F1 score, and ROC AUC. CatBoost achieved the highest ROC AUC, while XGBoost showed the best balance between precision and recall, as reflected by its F1 score. Feature importance analysis using SHAP identified molecular descriptors such as SlogP_VSA10 and fr_NH2 as major contributors to the predictions. The study provides a reproducible and interpretable framework for early toxicity screening and supports data-driven decision making in drug safety assessment.
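The evaluation steps named in the abstract (stratified sampling, then accuracy, F1 score, and ROC AUC) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' code: no CatBoost, XGBoost, or Gradient Boosting model is trained here, the labels and scores are made up for illustration, and the function names (`stratified_split`, `accuracy`, `f1_score`, `roc_auc`) are hypothetical helpers rather than any library's API.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Split indices so each class keeps roughly the same share in both parts."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_score(y_true, y_pred):
    """Harmonic mean of precision and recall for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def roc_auc(y_true, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels (1 = DIA-positive compound) and model scores, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.7, 0.3, 0.2, 0.1, 0.45]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 decision threshold

train_idx, test_idx = stratified_split(y_true, test_fraction=0.25, seed=0)

print("accuracy:", accuracy(y_true, y_pred))            # 0.75
print("F1 score:", round(f1_score(y_true, y_pred), 3))  # 0.667
print("ROC AUC:", roc_auc(y_true, scores))              # 0.8
```

In this toy example the two metrics disagree in the same way the abstract describes: ROC AUC depends only on how the scores rank positives against negatives, whereas F1 depends on the chosen decision threshold, so one model can lead on ROC AUC while another leads on F1.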
Copyright (c) 2025 Yusuf Rio Delfiero, Ajeng Hidayati, Bagus Hendra Saputra

This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.




