In Silico Methods for ADMET and Solubility Prediction


In Silico Methods for ADMET and Solubility Prediction Dr John Mitchell University of St Andrews

Outline Part 1: Computational Toxicology Part 2: Aqueous Solubility

1. Toxicological Relationships Between Proteins Obtained From a Molecular Spam Filter Florian Nigsch & John Mitchell Now at Novartis Institutes, Boston

Spam Unsolicited (commercial) email Approx. 90% of all email traffic is spam Where are the legitimate messages? Filtering

Analogy to Drug Discovery Huge number of possible candidates Virtual screening to help in selection process

Properties of Drugs High affinity to protein target Soluble Permeable Absorbable High bioavailability Specific rate of metabolism Renal/hepatic clearance? Volume of distribution? Low toxicity Plasma protein binding? Blood-Brain-Barrier penetration? Dosage (once/twice daily?) Synthetic accessibility Formulation (important in development)

Multiobjective Optimisation Synthetic accessibility Bioactivity Solubility Toxicity Permeability Metabolism Huge number of candidates …

Multiobjective Optimisation Synthetic accessibility Bioactivity Solubility Toxicity Permeability Metabolism Huge number of candidates … most of which are useless! [Figure: a small "Drug" region among the candidates, with the remainder labelled "USELESS"]

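Feature Space - Chemical Space Each molecule is a feature vector m = (f1, f2, …, fn). Feature spaces of high dimensionality. [Figure: molecules plotted on feature axes f1, f2, f3, with regions labelled COX2, CDK2, CDK1 and DHFR]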

Features of Molecules Based on circular fingerprints

Combinations of Features Combinations of molecular features to account for synergies.

Winnow Algorithm
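The slide gives only the name of the algorithm, so the following is a minimal sketch of textbook Winnow (multiplicative weight updates over binary features), not the authors' implementation. It assumes each molecule is represented by the indices of the fingerprint features it contains; the function names, the promotion factor alpha = 2 and the threshold equal to the number of features are illustrative choices.

```python
def winnow_train(examples, n_features, alpha=2.0, epochs=5):
    """Train a binary Winnow classifier.

    examples: list of (active_feature_indices, label) pairs, label is True/False.
    All names and defaults here are illustrative assumptions.
    """
    weights = [1.0] * n_features
    threshold = float(n_features)
    for _ in range(epochs):
        for active, label in examples:
            score = sum(weights[i] for i in active)
            predicted = score >= threshold
            if predicted and not label:
                # False positive: demote the weights of the active features.
                for i in active:
                    weights[i] /= alpha
            elif label and not predicted:
                # False negative: promote the weights of the active features.
                for i in active:
                    weights[i] *= alpha
    return weights, threshold


def winnow_predict(weights, threshold, active):
    """Return True if the summed weights of the active features reach the threshold."""
    return sum(weights[i] for i in active) >= threshold


# Toy data: the class is "feature 0 is present".
toy = [([0, 1], True), ([1, 2], False), ([0, 3], True),
       ([2, 3], False), ([0], True), ([1, 2, 3], False)]
weights, threshold = winnow_train(toy, n_features=4)
```

For the multi-class target prediction described in the following slides, one such classifier per protein class, ranked by score, would be one plausible arrangement; the transcript does not say how the classes were combined.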

Protein Target Prediction Which protein does a given molecule bind to? Virtual Screening Multiple endpoint drugs - polypharmacology New targets for existing drugs Prediction of adverse drug reactions (ADR) Computational toxicology

Predicted Protein Targets Selection of 233 classes from the MDL Drug Data Report; ~90,000 molecules; 15 independent 50%/50% splits into training/test set

Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
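As a rough illustration of the "correct within the three top-ranking predictions" figure, the sketch below computes a top-k hit rate for one split and then summarises several splits as mean and standard deviation. The function names and the toy rankings are assumptions for illustration; the transcript does not say how the ± value was computed.

```python
from statistics import mean, stdev


def top_k_accuracy(ranked_predictions, true_labels, k=3):
    """Fraction of molecules whose true class is among the k top-ranked predictions."""
    hits = sum(1 for ranked, truth in zip(ranked_predictions, true_labels)
               if truth in ranked[:k])
    return hits / len(true_labels)


def summarise_splits(accuracies):
    """Mean and sample standard deviation over independent train/test splits."""
    return mean(accuracies), stdev(accuracies)


# Toy example with four hypothetical classes and three molecules.
ranked = [["COX2", "CDK2", "DHFR", "CDK1"],
          ["CDK1", "DHFR", "COX2", "CDK2"],
          ["DHFR", "CDK1", "CDK2", "COX2"]]
truth = ["CDK2", "CDK2", "DHFR"]
accuracy = top_k_accuracy(ranked, truth, k=3)
```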

Computational Toxicology Model for target prediction. Annotated library of toxic molecules: MDL Toxicity database, ~150,000 molecules. For each molecule we predict the likely target. Correlations between predicted protein targets and known toxicity codes: Canonical (23), Full (490).

Toxicological Relationships Outline (1) Protein target prediction allows us to link (predictively) 150,000 toxic organic molecules to 233 specific protein targets. Each target is treated as a single protein, although it may be a set of related proteins. Toxicological databases link (experimentally) these 150,000 molecules to 23 toxicity classes. Combining these two sources of data matches the 233 proteins with the 23 toxicity classes.

Toxicity Annotations FULL TOXICITY CODES (490) Y41 : Glycolytic < Metabolism (intermediary) < Biochemical CANONICAL TOXICITY CODES (23)

Toxicological Relationships Outline (2) For each protein target, we have a profile of association with the 23 toxicity classes. Proteins with similar profiles are clustered together. We demonstrate that these clusters of proteins can be physiologically meaningful.

Predictions Obtained Target prediction: the highest-ranking prediction is taken as the predicted protein target (protein code j). Toxicity codes i, for example: L70 - Changes in liver weight<Liver; Y07 - Hepatic microsomal oxidase<Enzyme inhibition; M30 - Other changes<Kidney, Ureter, and Bladder; L30 - Other changes<Liver. Result matrix R = (r_ij), with toxicity codes as rows i and protein targets as columns j; r_ij is incremented for each prediction.
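The accumulation of the result matrix can be sketched as follows. This is an assumption-laden illustration: `predict_target` stands in for the trained target-prediction model, molecules are given as (structure, toxicity codes) pairs, and the matrix is stored sparsely as a dictionary keyed by (toxicity code, protein).

```python
from collections import defaultdict


def build_result_matrix(molecules, predict_target):
    """Count co-occurrences of toxicity codes and predicted protein targets.

    molecules: iterable of (structure, toxicity_codes) pairs.
    predict_target: hypothetical callable returning the top-ranked protein
    for a structure. Returns a dict mapping (tox_code, protein) to a count.
    """
    r = defaultdict(int)
    for structure, tox_codes in molecules:
        protein = predict_target(structure)
        for code in tox_codes:
            r[(code, protein)] += 1  # r_ij incremented for each prediction
    return r


# Toy data: structure identifiers and a lookup table standing in for the model.
toy_molecules = [("mol_a", ["L70", "Y07"]), ("mol_b", ["L70"]), ("mol_c", ["M30"])]
toy_model = {"mol_a": "P1", "mol_b": "P1", "mol_c": "P2"}
R = build_result_matrix(toy_molecules, toy_model.get)
```

Reading one column of this matrix (all toxicity codes for a fixed protein) gives that protein's toxicity profile.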

Proteins by Toxicity
Cardiac - G: Kainic acid receptor; Adrenergic alpha2; Phosphodiesterase III; cAMP Phosphodiesterase; O6-Alkylguanine-DNA alkyltransferase
Vascular - H: Angiotensin II AT2; Dopamine (D2); Bombesin; Adrenergic alpha2; 5-HT antagonist

Top 5 Proteins by Toxicity 68 distinct proteins for 23 toxicity classes, i.e., 3 proteins per canonical toxicity code.
Proteins and their connectivities:
Lanosterol 14alpha-Methyl Demethylase   5
Glucose-6-phosphate Translocase         4
IL-6                                    4
Benzodiazepine Antagonist               3
Kainic Acid Receptor                    3

Correlation Between Proteins Correlations between proteins: 233 by 233 correlation matrix Cluster 1 (proteins 6-11)
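A protein-by-protein correlation matrix of this kind can be sketched with plain Pearson correlation between toxicity profiles. The slides do not state which correlation measure was used, so Pearson is an assumption here, as are the function names and the toy profiles.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (undefined for constant profiles)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_matrix(profiles):
    """profiles: dict protein -> list of counts, one per toxicity class."""
    names = list(profiles)
    return {(p, q): pearson(profiles[p], profiles[q]) for p in names for q in names}


# Toy profiles over four toxicity classes (illustrative numbers only).
toy_profiles = {"A": [1, 2, 3, 4], "B": [2, 4, 6, 8], "C": [4, 3, 2, 1]}
corr = correlation_matrix(toy_profiles)
```

With 233 proteins this yields the 233 by 233 matrix; groups of proteins with high mutual correlation would then be candidates for clusters.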

Cluster 1: Carbonic Anhydrase Inhibitor, Estrogen Receptor Modulator, LHRH Agonist, Aromatase Inhibitor, Cysteine Protease Inhibitor, DHFR Inhibitor. Within-cluster correlation (without auto-correlation) r = 0.95. Proteins involved in breast cancer.

Cluster 1 Proteins involved in breast cancer

Literature-based links between these proteins (ER, CA, LHRH, Aromatase, Cysteine Prot., DHFR)
● Tissue-specific transcripts of human steroid sulfatase are under control of estrogen signaling pathways in breast carcinoma, Zaichuk 2007
● “aim of this study was to characterize carbonic anhydrase II (CA2), as novel estrogen responsive gene”, Caldarelli 2005
● This led to premature expression of CAII, a possible explanation for the toxic effects of overexpressed ER. The Transactivation Domain AF-2 but not the DNA-Binding Domain of the Estrogen Receptor Is Required to Inhibit Differentiation of Avian Erythroid Progenitors, Marieke von Lindern 1998
● Controversies of adjuvant endocrine treatment for breast cancer and recommendations of the 2007 St Gallen conference, Rabaglio 2007
● Cathepsin L Gene Expression and Promoter Activation in Rodent Granulosa Cells, Sriraman 2004: showed that cathepsin L expression in granulosa cells of small, growing follicles increased in periovulatory follicles after human chorionic gonadotropin stimulation.
● Merchenthaler 2005
● Summary of aromatase inhibitor trials: The past and future, Goss 2007
● Regulation of collagenolytic cysteine protease synthesis by estrogen in osteoclasts, Furuyama 2000
● Induction by estrogens of methotrexate resistance in MCF-7 breast cancer cells, Thibodeau 1998
● DHFR: Antimalarials?

Breast Cancer Proteins

Cluster 4

This cluster links treatment of stomach ulcers to loss of bone mass!

Proton Pump Inhibitors etc. Correlation above 0.98

Proton Pump Inhibitors etc. Correlation above 0.98 Correlation above 0.99

Proton Pump Inhibitors etc. PTH = Parathyroid hormone (84 aa mini-protein). Proton pump inhibitors are used to limit production of gastric acid. PTH is important in the development/regulation of osteoclasts (cells for bone resorption). PTH controls levels of Ca2+ in the blood; increased PTH levels are associated with age-related decrease of bone mass. Recent clinical studies showed increased risk of hip fractures resulting from long-term use of proton pump inhibitors. Hence link between PTH and proton pump inhibitors.

Conclusions from Part 1 Successful adaptation of algorithm formerly not used in chemoinformatics. Can find correct protein targets for molecules. Hence link proteins together via ligand-binding properties and associations of ligands with toxicities. Identify clinically relevant toxicological relationships between proteins.

2. In silico calculation of aqueous solubility Dr John Mitchell University of St Andrews

Our Methods … (a) Random Forest (informatics)

References

Our Random Forest Model … We want to construct a model that will predict solubility for druglike molecules … We don’t expect our model either to use real physics and chemistry or to be easily interpretable … We do expect it to be fast and reasonably accurate …

Random Forest Machine Learning Method

Random Forest for Predicting Solubility A Forest of Regression Trees. Dataset is partitioned into consecutively smaller subsets (of similar solubility). Each partition is based upon the value of one descriptor. The descriptor used at each split is selected so as to minimise the MSE. High predictive accuracy. Includes descriptor selection. No training problems – largely immune from overfitting. “Out-of-bag” validation – using those molecules not in the bootstrap samples. Leo Breiman, “Random Forests”, Machine Learning 45, 5-32 (2001).
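The two ideas on this slide, choosing the split that minimises the MSE and validating on out-of-bag molecules, can be sketched in a few lines. This is a simplified illustration rather than the model used in the study: it searches all descriptors exhaustively (a real Random Forest considers a random subset of descriptors at each split), and all function names and toy data are assumptions.

```python
import random


def mse(values):
    """Mean squared deviation of the values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets):
    """Return (descriptor_index, threshold, weighted_mse) of the best single split."""
    best = None
    n = len(rows)
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        # Candidate thresholds lie midway between consecutive descriptor values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best


def bootstrap_with_oob(n, rng):
    """Draw n indices with replacement; the indices never drawn are out-of-bag."""
    in_bag = [rng.randrange(n) for _ in range(n)]
    oob = sorted(set(range(n)) - set(in_bag))
    return in_bag, oob


# Toy data: two descriptors per molecule, logS as the target.
rows = [[1.0, 5.0], [2.0, 1.0], [3.0, 4.0], [4.0, 2.0]]
log_s = [-1.0, -1.1, -4.0, -4.2]
split = best_split(rows, log_s)
in_bag, oob = bootstrap_with_oob(10, random.Random(0))
```

Each tree is grown on its own bootstrap sample, and its out-of-bag molecules provide predictions that were not used to fit that tree.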

Dataset Literature Data ● Compiled from Huuskonen dataset and AquaSol database – pharmaceutically relevant molecules ● All molecules solid at room temperature ● n = 988 molecules ● Training = 658 molecules ● Test = 330 molecules ● MOE descriptors 2D/3D ● Intrinsic aqueous solubility – the thermodynamic solubility of the neutral form in unbuffered water at 25 °C ● Datasets compiled from diverse literature data may have significant random and systematic errors.

Random Forest: Solubility Results These results are competitive with any other informatics or QSPR solubility prediction method.
Set                RMSE   r2     Bias
Training (tr)      0.27   0.98   0.005
Out-of-bag (oob)   0.68   0.90   0.01
Test (te)          0.69   0.89   -0.04
DS Palmer et al., J. Chem. Inf. Model., 47, 150-158 (2007)
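For reference, the three statistics reported above could be computed as in the sketch below. The definitions are assumptions, since the slide does not spell them out: r2 is taken as the squared Pearson correlation between observed and predicted logS, and bias as the mean of predicted minus observed.

```python
from math import sqrt


def regression_metrics(observed, predicted):
    """Return (rmse, r2, bias) for paired observed and predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n  # assumed sign convention: predicted minus observed
    mean_o, mean_p = sum(observed) / n, sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)  # squared Pearson correlation
    return rmse, r2, bias


# Toy values: every prediction is 0.5 logS units too low.
rmse, r2, bias = regression_metrics([-1.0, -2.0, -3.0, -4.0],
                                    [-1.5, -2.5, -3.5, -4.5])
```

The toy case shows why bias is reported alongside r2: a constant offset leaves the correlation perfect while the RMSE and bias reveal the systematic error.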

Part 2a, Solubility by Random Forest: Conclusions ● Random Forest gives an RMS error of 0.69 logS units. ● These results are competitive with any other informatics or QSPR solubility prediction method. ● The nature of the model is predictive, without offering much insight.

Our Methods … (b) Thermodynamic Cycle (A hybrid of theoretical chemistry & informatics)

Reference

Our Thermodynamic Cycle method … We want to construct a theoretical model that will predict solubility for druglike molecules … We expect our model to use real physics and chemistry and to give some insight … We may need to include some empirical parameters… We don’t expect it to be fast by informatics or QSPR standards, but it should be reasonably accurate …

For this study Toni Llinàs measured 30 solubilities using the CheqSol method and took another 30 from other high-quality studies (Bergstrom & Rytting). We use a Sirius glpKa instrument.

Can we use theoretical chemistry to calculate solubility via a thermodynamic cycle?

Gsub comes mostly from lattice energy minimisation based on the experimental crystal structure.

Gsolv comes from a semi-empirical solvation model (SCRF B3LYP/6-31G* in Jaguar). This is likely to be the least accurate term in our equation. We also tried SM5.4 with AM1 & PM3 in Spartan, with similar results.

Gtr comes from ClogP ClogP is a fragment-based (informatics) method of estimating the octanol-water partition coefficient.

What Error is Acceptable? For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. An RMSE > 1.0 logS unit is probably unacceptable. These two RMSE values correspond to errors of 4.0 and 5.7 kJ/mol in Gsol, respectively.
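The conversion between logS units and kJ/mol follows from the standard relation between a free energy and the base-10 logarithm of an equilibrium quantity: one log unit corresponds to RT ln 10. The sketch below assumes T = 298.15 K (the dataset is defined at 25 °C) and treats Gsol as the free energy of solution; the function name is illustrative.

```python
from math import log

GAS_CONSTANT = 8.314462618e-3  # kJ/(mol K)


def logs_error_to_kj_per_mol(rmse_logs, temperature_k=298.15):
    """Convert an error in logS units to a free-energy error in kJ/mol via RT ln(10)."""
    return GAS_CONSTANT * temperature_k * log(10) * rmse_logs


good_qspr = logs_error_to_kj_per_mol(0.7)     # roughly 4.0 kJ/mol
unacceptable = logs_error_to_kj_per_mol(1.0)  # roughly 5.7 kJ/mol
```

At this temperature one logS unit is about 5.7 kJ/mol, which reproduces both ends of the range quoted on the slide.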

What Error is Acceptable? A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
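The "useless model" benchmark can be checked numerically: a model that always predicts the mean of the test set has an RMSE equal to the (population) standard deviation of the test-set values. The sketch below uses made-up logS values purely for illustration.

```python
from math import sqrt
from statistics import mean, pstdev


def rmse(observed, predicted):
    """Root-mean-square error between paired observed and predicted values."""
    return sqrt(sum((p - o) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Illustrative logS values, not taken from the study.
test_log_s = [-1.0, -2.0, -3.0, -4.0, -5.0]
constant_prediction = [mean(test_log_s)] * len(test_log_s)
baseline_rmse = rmse(test_log_s, constant_prediction)
spread = pstdev(test_log_s)
```

Any model whose test RMSE approaches this value has learned nothing beyond the average solubility.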

Results from Theoretical Calculations ● Direct calculation was a nice idea, but didn’t quite work – errors larger than QSPR ● “Why not add a correction factor to account for the difference between the theoretical methods?” ● This was originally intended to calibrate the different theoretical approaches, but …

… ● Within a week this had become a hybrid method, essentially a QSPR with the theoretical energies as descriptors

Results from Hybrid Model

This regression equation gives r2=0.77 and RMSE=0.71

How Well Did We Do? For a training-test split of 34:26, we obtain an RMSE of 0.71 logS units for the test set. This is comparable with the performance of “pure” QSPR models. This corresponds to an error of about 4.0 kJ/mol in Gsol.

[Figure: descriptors used in the hybrid model: Gsolv & ClogP; Ssub & b_rotR; Ulatt] Drug Disc. Today, 10 (4), 289 (2005)

Part 2b, Solubility by TD Cycle: Conclusions ● We have a hybrid part-theoretical, part-empirical method. ● An interesting idea, but relatively low throughput - and an experimental (or possibly predicted?) crystal structure is needed. ● Similarly accurate to pure QSPR for a druglike set. ● Instructive to compare with literature of theoretical solubility studies.

Thanks Unilever Dr Florian Nigsch Pfizer & PIPMS Dr Dave Palmer Pfizer (Dr Iñaki Morao, Dr Nick Terrett & Dr Hua Gao)