open access publication

Preprint, 2024

Novel machine learning approach toward classification model of HIV-1 integrase inhibitors

ChemRxiv, ISSN 2573-2293, 10.26434/chemrxiv-2024-hgrzx

Contributors

Phan, Tieu-Long 0000-0002-3532-2064 [1] [2]
Le, Hoang-Son Lai [3]
Truong, Gia-Bao 0009-0001-5379-1593 [3]
Trinh, The-Chuong 0000-0002-8693-030X [4]
To, Van-Thinh 0000-0002-7640-0807 [3]
Nguyen, Phuoc-Chung Van [3]
Pham, Thanh-An 0000-0003-0271-2696 [3]
Truong, Tuyen Ngoc 0000-0002-0952-1633 [3]

Affiliations

  [1] Leipzig University [NORA names: Germany; Europe, EU; OECD];
  [2] University of Southern Denmark [NORA names: SDU University of Southern Denmark; University; Denmark; Europe, EU; Nordic; OECD];
  [3] Ho Chi Minh City Medicine and Pharmacy University [NORA names: Vietnam; Asia, South];
  [4] Grenoble Alpes University [NORA names: France; Europe, EU; OECD]

Abstract

HIV-1 (human immunodeficiency virus type 1) has caused a severe pandemic by attacking the immune system of its host. Left untreated, infection can lead to AIDS (acquired immunodeficiency syndrome), in which death from opportunistic diseases is inevitable. Discovering new antiviral drugs against HIV-1 is therefore crucial. This study aimed to explore a novel machine learning approach to classify compounds that inhibit HIV-1 integrase and to screen a dataset of drug-repurposing compounds. The study had two main stages: selecting the best type of fingerprint or molecular descriptor using the Wilcoxon signed-rank test, and building a computational model based on machine learning. In the first stage, we calculated 16 types of fingerprints or molecular descriptors from the dataset and used each of them as input features for 10 machine learning models, which were evaluated by cross-validation. A meta-analysis with the Wilcoxon signed-rank test was then performed to select the optimal fingerprint or molecular descriptor type. In the second stage, we constructed a model based on the optimal type. The data went through a machine learning procedure comprising data preprocessing, outlier handling, normalization, feature selection, model selection, external validation, and model optimization. In the end, an XGBoost model with the RDK7 fingerprint was identified as the most suitable. The model achieved promising results, with an average precision of 0.928 ± 0.027 and an F1-score of 0.848 ± 0.041 in cross-validation, and an average precision of 0.921 and an F1-score of 0.889 in external validation. Molecular docking was performed and validated by redocking for docking power and by a retrospective control for screening power, giving an AUC of 0.876 and a docking score threshold of -9.71 kcal/mol. Finally, the QSAR model selected 44 compounds from the DrugBank repurposing data, molecular docking identified three of them as potential candidates, and PSI-697 emerged as the most promising molecule (docking score: -17.14 kcal/mol; HIV integrase inhibitory probability: 69.81%). In vitro experiments were not performed.
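
The first-stage comparison described in the abstract, in which candidate fingerprint or descriptor types are compared on paired cross-validation scores with the Wilcoxon signed-rank test, can be illustrated with a minimal sketch. This is not the authors' code, which this record does not include. The function names and fold scores below are hypothetical, and the exact enumeration of sign patterns is an assumption that is practical only for small numbers of folds. In practice a library routine such as `scipy.stats.wilcoxon` would normally be used instead.

```python
from itertools import product


def signed_ranks(diffs):
    """Drop zero differences, then rank |d| ascending (ties get average ranks)."""
    nonzero = [d for d in diffs if d != 0]
    order = sorted(range(len(nonzero)), key=lambda i: abs(nonzero[i]))
    ranks = [0.0] * len(nonzero)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of equal absolute differences.
        while j + 1 < len(order) and abs(nonzero[order[j + 1]]) == abs(nonzero[order[i]]):
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return nonzero, ranks


def wilcoxon_signed_rank(a, b):
    """Two-sided exact Wilcoxon signed-rank test on paired samples.

    Returns (W, p) with W = min(W+, W-). All 2**n sign patterns are
    enumerated, so this is intended for small n such as 5 or 10 folds.
    """
    # Rounding guards against spurious non-ties from floating-point subtraction.
    diffs = [round(x - y, 10) for x, y in zip(a, b)]
    nonzero, ranks = signed_ranks(diffs)
    n = len(nonzero)
    if n == 0:
        return 0.0, 1.0
    total = sum(ranks)
    w_plus = sum(r for d, r in zip(nonzero, ranks) if d > 0)
    w = min(w_plus, total - w_plus)
    # Under the null hypothesis every sign pattern is equally likely;
    # count the patterns that are at least as extreme as the observed one.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        s = sum(r for bit, r in zip(signs, ranks) if bit)
        if min(s, total - s) <= w:
            extreme += 1
    return w, extreme / 2 ** n


# Hypothetical per-fold scores for two feature types (invented numbers,
# not results from the study).
scores_a = [0.93, 0.91, 0.95, 0.90, 0.94, 0.92, 0.96, 0.89, 0.93, 0.94]
scores_b = [0.90, 0.89, 0.91, 0.90, 0.89, 0.93, 0.90, 0.88, 0.89, 0.87]

w_stat, p_value = wilcoxon_signed_rank(scores_a, scores_b)
print(f"W = {w_stat}, two-sided p = {p_value:.4f}")
```

A small p-value here would indicate that one feature type scores consistently higher across folds than the other. Repeating the comparison over all pairs of feature types would call for a multiple-comparison correction, which this sketch omits.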

Keywords

AIDS, AUC, AUC metric, DrugBank, HIV-1, HIV-1 integrase, HIV-1 integrase inhibitors, PSI-697, QSAR, QSAR models, Wilcoxon, Wilcoxon signed-rank test, XGBoost, XGBoost model, antiviral drugs, approach, average precision, classification, classification model, compounds, computational model, control, cross-validation, data, data preprocessing, dataset, death, descriptor types, descriptors, disease, docking, docking power, drug, experiments, external validation, feature selection, features, fingerprint, handling, host, immune system, in vitro experiments, inhibit HIV-1 integrase, inhibitors, integrase, integrase inhibitors, learning, learning approach, learning procedure, machine, machine learning, machine learning approach, machine learning procedures, machine-learning models, meta-analysis, metrics, model, model optimization, model selection, molecular descriptors, molecular docking, molecules, normalization, novel machine learning approaches, opportunistic diseases, optimal fingerprinting, optimization, pandemic, power, precision, preprocessing, procedure, repurposed data, results, retrospective controls, screening power, selection, severe pandemic, signed-rank test, stage, study, system, test, threshold, type, validity

Data Provider: Digital Science