Drug Discovery on entamoeba histolytica
🦥
Data Preprocessing
The data was required from ChEMBL Database. A chemical database of bioactive molecule with drug like properties.
| assay_chembl_id | assay_type | target_organism | type | units | value |
|---|---|---|---|---|---|
| CHEMBL676675 | F | Entamoeba histolytica | IC50 | uM | 0.069 |
| CHEMBL676675 | F | Entamoeba histolytica | IC50 | uM | 0.022 |
| CHEMBL676675 | F | Entamoeba histolytica | IC50 | uM | 0.35 |
| CHEMBL676675 | F | Entamoeba histolytica | IC50 | uM | 0.046 |
after requiring the necessary data, we could explore the data using Lipinski Descriptor. Christopher Lipinski, a scientist at Pfizer, came up with a set of rule-of-thumb for evaluating the druglikeness of compounds. Such druglikeness is based on the Absorption, Distribution, Metabolism and Excretion (ADME) that is also known as the pharmacokinetic profile. Lipinski analyzed all orally active FDA-approved drugs in the formulation of what is to be known as the Rule-of-Five or Lipinski’s Rule.
The Lipinski’s Rule stated the following:
- Molecular weight < 500 Dalton
- Octanol-water partition coefficient (LogP) < 5
- Hydrogen bond donors < 5
- Hydrogen bond acceptors < 10
we also need to convert the $IC^{50}$ to $pIC{50}$, to allow $IC^{50}$ data to be more uniformly distributed, we will convert $IC^{50}$ to the negative logarithmic scale which is essentially $-log10(IC^{50})$.
The conversion process is the following:
- Take the $IC^{50}$ values from the standard_value column and converts it from nM to M by multiplying the value by $10^{-9}$
- Take the molar value and apply $-log10$
- Delete the standard_value column and create a new $pIC{50}$ column
Lastly, we need to clean the salt from the Canonical SMILES.
Canonical SMILES represents chemical information that partain to the chemical structure
The Canonical SMILES were pre-processed by applying sequential filters to remove stereochemistry, salts, and molecules with undesirable atoms or groups. SMILES strings $>100$ symbols in length were removed, as $∼97%$ of the dataset consists of SMILES strings with $<100$ symbols. The RDKit library in Python was used for this pre-processing.
From the preprocessing, we get this data
| molecule_chembl_id | canonical_smiles | bioactivity_class | MW | LogP | NumHDonors | NumHAcceptors | pIC50 |
|---|---|---|---|---|---|---|---|
CHEMBL55641 |
FC(F)(F)c1nc2ccccc2[nH]1 |
active | 186.136 | 2.58170 | 1.0 | 1.0 | 7.161151 |
CHEMBL53788 |
FC(F)(F)c1nc2cc(Cl)ccc2[nH]1 |
active | 220.581 | 3.23510 | 1.0 | 1.0 | 7.657577 |
CHEMBL137 |
Cc1ncc([N+](=O)[O-])n1CCO |
active | 171.156 | 0.09202 | 1.0 | 5.0 | 6.455932 |
CHEMBL56473 |
Cn1c(C(F)(F)F)nc2cc(Cl)ccc21 |
active | 234.608 | 3.24550 | 0.0 | 2.0 | 7.337242 |
CHEMBL293520 |
Cn1c(C(F)(F)F)nc2ccccc21 |
active | 200.163 | 2.59210 | 0.0 | 2.0 | 7.397940 |
Model
The pre-processed Canonical SMILES are processed using PaDEL Descriptor to get the fingerprints which are later used for the modeling. The fingerprints acquired has 882 columns so to maximize the result, the low variance are removed using the Variance Threshold from the sklearn feature selection.
| 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | … | 165 | 166 | 167 | 168 | 169 | 170 | 171 | 172 | 173 | 174 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 1 | … | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | … | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 1 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | … | 1 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 1 | 1 | 1 | 0 | 1 | 0 | 1 | 1 | 1 | 0 | … | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 4 | 1 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 1 | … | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
After applying 3 different algorithm (CatBoost, Forest Regressor, Gradient Boost) and comparing the results, it is concluded that Gradient Boost with Grid CV is the best algorithm to use in this case with the $R^2 = 8.38$.
Leave a comment