Quantum Chemical and Machine Learning Approaches to Property Prediction for Druglike Molecules Dr John Mitchell University of St Andrews.



1. Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the pharma industry. A good model for predicting the solubility of druglike molecules would be very valuable.

How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

Theoretical Chemistry: Calculations and simulations based on real physics. Calculations are either quantum mechanical or use parameters derived from quantum mechanics. Attempt to model or simulate reality. Usually low throughput.

Dataset: The thermodynamically most stable polymorph was selected where possible. All have experimental crystal structures. All have experimental logS. 10 have experimental ΔG_sub and ΔG_hyd (circled in red).

CheqSol Method
● We continue “Chasing equilibrium” until a specified number of crossing points has been reached
● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating: SOLUTION IS IN EQUILIBRIUM
● Repeatability better than 0.05 log units
● First precipitation: Kinetic Solubility (not in equilibrium)
● Thermodynamic Solubility through “Chasing Equilibrium”: Intrinsic Solubility (in equilibrium)
● Supersaturation Factor: SSF = S_kin - S_0
Figure labels: In Solution; Powder; Supersaturated Solution; Subsaturated Solution; 8 intrinsic solubility values; “CheqSol”
* A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5),

Thermodynamic Cycle

Crystal → Gas → Solution

Sublimation Free Energy: Crystal → Gas

Sublimation Free Energy: Crystal → Gas (rigid molecule approximation)

Sublimation Free Energy: Crystal → Gas. Calculating ΔG_sub is a standard procedure in crystal structure prediction.

Theoretical method for crystal: ΔG_sub from lattice energy and a phonon entropy term; DMACRYS using B3LYP/6-31G** multipoles and the FIT repulsion-dispersion potential.
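The slide does not spell out how the lattice energy and the entropy term are combined. A minimal sketch, assuming the usual rigid-molecule, ideal-gas treatment (the -2RT enthalpy correction and the symbols E_latt and S below are assumptions, not taken from the talk), might read:

```latex
\Delta G_{\mathrm{sub}} = \Delta H_{\mathrm{sub}} - T\,\Delta S_{\mathrm{sub}},
\qquad
\Delta H_{\mathrm{sub}} \approx -E_{\mathrm{latt}} - 2RT,
\qquad
\Delta S_{\mathrm{sub}} \approx S^{\mathrm{gas}}_{\mathrm{rot}} + S^{\mathrm{gas}}_{\mathrm{trans}} - S^{\mathrm{crystal}}_{\mathrm{phonon}}
```

Here E_latt is the (negative) lattice energy, so a more stable crystal gives a larger positive ΔG_sub.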

Results for ΔG_sub: Lattice energies from DMACRYS with the FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.

A 46 compound set has a larger error, mostly due to some large outliers. Error statistics vary with dataset.

Thermodynamic Cycle: Crystal → Gas → Solution

Hydration Free Energy: We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, both solute and solvent are important, etc.

Theoretical method for aqueous solution: ΔG_hyd from the Reference Interaction Site Model with Universal Correction (3DRISM-KH/UC).

Reference Interaction Site Model (RISM): Combines features of explicit and implicit solvent models. Solvent density is modelled, but there are no explicit molecular coordinates or dynamics. About 45 CPU minutes per compound.

Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the Reference Interaction Site Model. Journal of Chemical Physics, (4): p

Results for ΔG_hyd: Perhaps surprisingly, the error in ΔG_hyd is smaller than in ΔG_sub.

logS from Thermodynamic Cycle: Crystal → Gas → Solution. Add the two terms (ΔG_sub and ΔG_hyd) to get ΔG_sol and hence logS.
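The step from the two free energies to logS can be sketched as follows. This is an illustrative sketch only: it assumes both free energies are in kJ/mol and refer to a 1 mol/L standard state, so that ΔG_sol = -RT ln S; the function name and the input values are hypothetical and not from the talk.

```python
import math

R_KJ_PER_MOL_K = 8.314462618e-3  # gas constant in kJ/(mol K)


def log_s_from_cycle(dg_sub_kj_mol, dg_hyd_kj_mol, temperature_k=298.15):
    """Return (dG_sol, logS) from sublimation and hydration free energies.

    Assumption: 1 mol/L standard state, so dG_sol = -RT ln(S), S in mol/L.
    """
    dg_sol = dg_sub_kj_mol + dg_hyd_kj_mol  # close the thermodynamic cycle
    log_s = -dg_sol / (R_KJ_PER_MOL_K * temperature_k * math.log(10))
    return dg_sol, log_s


# Hypothetical, illustrative values (not from the talk's dataset).
dg_sol, log_s = log_s_from_cycle(dg_sub_kj_mol=60.0, dg_hyd_kj_mol=-45.0)
print(f"dG_sol = {dg_sol:.1f} kJ/mol, logS = {log_s:.2f}")
```

A different standard-state convention would add a constant correction term, which is why the convention is stated as an assumption here.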

Results for ΔG_sol

Conclusions: Solubility from Theory. Must calculate ΔG_sub & ΔG_hyd separately; RISM is efficient & fairly accurate for ΔG_hyd; experimental data for ΔG_sub & ΔG_hyd are sparse and errors may be large; dataset size and composition make comparisons of methods hard; theory has not yet matched the accuracy of informatics.

Informatics and Empirical Models: In general, informatics methods represent phenomena mathematically, but not in a physics-based way. Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model. Do not attempt to simulate reality. Usually high throughput.

What Error is Acceptable? For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. An RMSE > 1.0 logS unit is probably unacceptable. This range of 0.7 to 1.0 logS units corresponds to an error range of 4.0 to 5.7 kJ/mol in ΔG_sol.
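The conversion between logS units and kJ/mol can be checked with a short worked equation. The temperature of 298.15 K is an assumption; the slide does not state it.

```latex
\delta(\Delta G_{\mathrm{sol}}) = RT \ln 10 \;\delta(\log S),
\qquad
RT \ln 10 \approx 8.314\ \mathrm{J\,mol^{-1}\,K^{-1}} \times 298.15\ \mathrm{K} \times 2.303 \approx 5.71\ \mathrm{kJ\,mol^{-1}}
```

```latex
0.7 \times 5.71 \approx 4.0\ \mathrm{kJ\,mol^{-1}},
\qquad
1.0 \times 5.71 \approx 5.7\ \mathrm{kJ\,mol^{-1}}
```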

What Error is Acceptable? A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
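The "useless model" baseline can be illustrated with a small sketch: a model that always predicts the mean logS has an RMSE equal to the standard deviation of the values it is tested on. The logS values below are hypothetical, and for simplicity the sketch predicts the test-set mean itself (in practice the training-set mean would be used, giving a similar number).

```python
import math
import statistics


def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical test-set logS values, for illustration only.
log_s_test = [-1.2, -3.5, -2.1, -4.8, -0.6, -2.9]

# A "useless" model: predict the mean for every compound.
mean_only = [statistics.fmean(log_s_test)] * len(log_s_test)
baseline_rmse = rmse(log_s_test, mean_only)

print(baseline_rmse, statistics.pstdev(log_s_test))
```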

Machine Learning Method Random Forest

Random Forest: Solubility Results
Test set (te): RMSE = 0.69; r² = 0.89; Bias = -0.04
Out-of-bag (oob): RMSE = 0.68; r² = 0.90; Bias = 0.01
N train = 658; N test = 300
D. S. Palmer et al., J. Chem. Inf. Model., 47, (2007)
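The three statistics quoted for the Random Forest model can be computed as in the sketch below. Two points are assumptions rather than anything stated on the slide: r² is taken to be the squared Pearson correlation, and bias is taken to be the mean of (predicted - observed). The data are hypothetical.

```python
import math


def rmse(obs, pred):
    """Root-mean-square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))


def bias(obs, pred):
    """Mean signed error, assumed here to be predicted minus observed."""
    return sum(p - o for o, p in zip(obs, pred)) / len(obs)


def r_squared(obs, pred):
    """Squared Pearson correlation between observed and predicted values."""
    mean_o = sum(obs) / len(obs)
    mean_p = sum(pred) / len(pred)
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(obs, pred))
    var_o = sum((o - mean_o) ** 2 for o in obs)
    var_p = sum((p - mean_p) ** 2 for p in pred)
    return cov ** 2 / (var_o * var_p)


# Hypothetical logS values: every prediction is 0.5 log units too low.
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-1.5, -2.5, -3.5, -4.5]
print(rmse(observed, predicted), bias(observed, predicted), r_squared(observed, predicted))
```

The example shows why all three numbers are reported together: a systematically shifted model can have a perfect r² and still a non-zero RMSE and bias.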

Support Vector Machine

SVM: Solubility Results et al., N train = ; N test = 87. RMSE(te) = 0.94; r²(te) = 0.79

100-Compound Cross-Validation: Theoretical energies don’t seem to improve descriptor models.

Replicating Solubility Challenge (post hoc)
CDK descriptors: RF, RF, PLS, SVM
N train ≈ 94; N test ≈ 28
RMSE(te) = 1.09; 1.00; 0.89; 1.08
r²(te) = 0.39; 0.49; 0.58; [missing]
Correct within 0.5 logS units: [missing]; 12; 12; 13 of 28

Replicating Solubility Challenge (post hoc): N train ≈ 94; N test ≈ 28. CDK descriptors: RF, RF, PLS, SVM. Although the test dataset is small, it is a standard set.

Conclusions: Solubility from Informatics. Experimental data: errors are unknown, but they limit the possible accuracy of models; CheqSol is a step in the right direction; dataset size and composition hinder comparisons of methods; the Solubility Challenge is a step in the right direction.

2. Protein Target Prediction
Which protein does a given molecule bind to?
● Virtual screening
● Multiple endpoint drugs: polypharmacology
● New targets for existing drugs
● Prediction of adverse drug reactions (ADR): computational toxicology

Predicted Protein Targets: Selection of 233 classes from the MDL Drug Data Report (MDDR); ~90,000 molecules; 15 independent 50%/50% splits into training and test sets. Strictly, we are predicting closely target-related MDDR classes.

Predicted Protein Targets Cumulative probability of correct prediction within the three top-ranking predictions: 82.1% (±0.5%)
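One way to read "correct prediction within the three top-ranking predictions" is as a top-k hit rate: the fraction of test molecules whose true class appears among the k highest-ranked predicted classes. The sketch below illustrates that reading; the function name, class labels and rankings are hypothetical, and the talk's actual scoring code is not shown in the slides.

```python
def top_k_hit_rate(ranked_predictions, true_classes, k=3):
    """Fraction of queries whose true class is among the top k predictions.

    ranked_predictions: one list of class labels per query, best first.
    true_classes: the known class label for each query.
    """
    hits = sum(
        true in ranked[:k]
        for ranked, true in zip(ranked_predictions, true_classes)
    )
    return hits / len(true_classes)


# Hypothetical rankings for three query molecules.
ranked = [
    ["kinase", "protease", "GPCR", "ion channel"],   # true class ranked 1st
    ["GPCR", "kinase", "protease", "ion channel"],   # true class ranked 3rd
    ["kinase", "GPCR", "protease", "ion channel"],   # true class ranked 4th
]
truth = ["kinase", "protease", "ion channel"]

rate = top_k_hit_rate(ranked, truth, k=3)
print(rate)
```

Repeating this over each independent training/test split and averaging would give a mean and spread of the kind quoted on the slide.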

Protein Target Prediction
Given a specific compound, is it possible to predict computationally its biological interactions with protein targets?
Very important for:
● In silico screening (time and money efficient)
● Off-target prediction (side effects)
Can be used for identifying substances with performance-enhancing potential.
Drug discovery: Predicting promiscuity, Andrew L. Hopkins, Nature 462, (12 November 2009), doi: /462167a

Substances Prohibited in Sports
WADA publishes and maintains a prohibited list of banned compounds, updated every 6 months. Substances are split into three main categories:
● Substances prohibited at all times (in and out of competition): S0. Non-Approved Substances; S1. Anabolic Agents; S2. Peptide Hormones, Growth Factors and Related Substances; S3. Beta-2 Agonists; S4. Hormone Antagonists and Modulators; S5. Diuretics and Other Masking Agents
● Substances prohibited in competition: S6. Stimulants; S7. Narcotics; S8. Cannabinoids; S9. Glucocorticosteroids
● Substances prohibited in particular sports: P1. Alcohol, with a violation threshold of 0.10 g/L (Archery, Karate etc.); P2. Beta-Blockers, prohibited In-Competition only (Bridge, Curling, Darts, Wrestling, Archery etc.)

Methodology (figure labels: Stanozolol; Database; Rank; Class: Anabolic Agents, Vitamin D, Glucocorticoids)

ChEMBL Activities: Each compound has experimental data for a number of targets. Activity data are based on IC50, EC50, Ki, Kd etc. Some activities are just labelled “inactive” or “active”. Each compound can have more than one record for a given target.

Filtering the ChEMBL Families
Each of the 8,845 targets has a number of compounds assigned. Not all compounds have actual data on the target or are active. We filtered each of the families according to rules defining “active” and “inactive”. The rules were decided by visual inspection of distributions.
Rules:
● IC50: ≤ 50000 nM active; > 50000 nM inactive
● Ki: < 20000 nM active; ≥ 20000 nM inactive
● Kd: ≤ 10000 nM active; > 10000 nM inactive
● EC50: ≤ 40000 nM active; > 40000 nM inactive
● ED50: ≤ 10000 nM active; > 10000 nM inactive
● Potency: ≤ 10000 nM active; > 10000 nM inactive
● Activity: ≥ 40% active; < 40% inactive
● Inhibition: ≥ 45% active; < 45% inactive
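The filtering rules above can be sketched as a small lookup table. The thresholds and comparison directions are taken from the slide; the names `ACTIVE_RULES` and `is_active` are hypothetical, and the sketch does not cover records that carry only an “active”/“inactive” label, since the slide does not say how those were handled.

```python
import operator

# Measurement type -> (comparison, threshold) that marks a record as active.
# Concentrations are in nM; Activity and Inhibition are percentages.
ACTIVE_RULES = {
    "IC50": (operator.le, 50000.0),
    "Ki": (operator.lt, 20000.0),
    "Kd": (operator.le, 10000.0),
    "EC50": (operator.le, 40000.0),
    "ED50": (operator.le, 10000.0),
    "Potency": (operator.le, 10000.0),
    "Activity": (operator.ge, 40.0),
    "Inhibition": (operator.ge, 45.0),
}


def is_active(measurement_type, value):
    """Classify one activity record as active (True) or inactive (False).

    Raises KeyError for measurement types the rules do not cover.
    """
    compare, threshold = ACTIVE_RULES[measurement_type]
    return compare(value, threshold)


print(is_active("Ki", 0.81))        # a nanomolar Ki is well inside the active range
print(is_active("IC50", 60000.0))   # above the IC50 cut-off, so inactive
```

Note that the Ki rule uses a strict inequality while the other concentration rules include the threshold itself, as written on the slide.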

Example: Distributions of Ki (figure; axis label: Index)

Refined Families: Filtered families consist of compounds with significant experimental activities against the relevant targets. Many targets have distinct groups of ligands with different scaffolds. This may be because there is more than one binding site, or because different scaffolds can fit the same site. Splitting such a family into smaller groups based on ligand structure will allow us to identify the different sets of ligands.

Refined Families - PFClust: We selected the PFClust algorithm because it is a parameter-free clustering algorithm and so requires no parameter tuning. PFClust: a novel parameter free clustering algorithm. Mavridis L, Nath N, Mitchell JBO. BMC Bioinformatics 2013, 14:213.

Database Refinement (figure labels: Database; Original Families; Rule Filtering; Clustering; Refined Families; Compounds; 5443 Families; Compounds). Predicting the protein targets for athletic performance-enhancing substances. Mavridis L, Mitchell JBO. J Cheminformatics 2013, 5:31.

Database Refinement - Validation
Monte Carlo Cross-Validation
● The three versions of the database were examined (Original, Filtered and Refined)
● 10% of each family were randomly removed and used as queries
● If the top prediction was the family that the query was a member of, a TP was counted; if not, an FP
Average Matthews Correlation Coefficient (MCC): Original: 0.02; Filtered: 0.03; Refined: [missing]
Top Hit (Top four): [missing]% (6.61%); 66.98% (87.25%); 3.18% (7.21%)
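The validation procedure can be sketched in two parts: holding out a random 10% of each family as queries, and scoring with the standard Matthews Correlation Coefficient. This is a sketch under assumptions: the function names, the fixed random seed and the rule of holding out at least one member per family are all hypothetical, and the slide does not say how true and false negatives were counted, so the MCC function simply takes the four confusion-matrix counts as inputs.

```python
import math
import random


def split_queries(families, fraction=0.10, seed=0):
    """Randomly hold out a fraction of each family's members as queries.

    families: dict mapping family name -> list of unique compound ids.
    Returns (queries, reference), two dicts with the same keys.
    """
    rng = random.Random(seed)
    queries, reference = {}, {}
    for name, members in families.items():
        n_out = max(1, round(fraction * len(members)))  # assumption: at least one
        held_out = set(rng.sample(members, n_out))
        queries[name] = [m for m in members if m in held_out]
        reference[name] = [m for m in members if m not in held_out]
    return queries, reference


def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical families with integer compound ids.
families = {"family_A": list(range(20)), "family_B": list(range(100, 110))}
queries, reference = split_queries(families)
print({name: len(ids) for name, ids in queries.items()})
```

Repeating the split with different seeds and averaging the resulting MCC values would correspond to the Monte Carlo part of the procedure.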

P2 - Beta Blockers
● 20 explicitly prohibited compounds
● Every compound, except timolol and levobunolol, gave a strong prediction (PR-Score) for at least one family
● Good experimental validation
● The majority of the families are β-1, β-2 and β-3 adrenergic receptor ligands, as expected
● Other families also generate some interesting results, such as the serotonin 1a receptor, indicated to make off-target interactions with pindolol
Table (columns: Compound, Target, PR-Score, E-Value; cell order as extracted): Alprenolol (266195) Cavia porcellus (369) 0.039 LogB/F = −0.158 Carvedilol (723) β-1 adrenergic receptor (3252) Ki = 0.81 nM β-2 adrenergic receptor (210) Ki = nM Pindolol (500) β-2 adrenergic receptor (3754) β-3 adrenergic receptor (4031) β-1 adrenergic receptor (3252) Prediction Ki = 1 nM β-2 adrenergic receptor (210) Ki = 0.4 nM β-2 adrenergic receptor (3754) β-3 adrenergic receptor (4031) Inhibition = 84% Ki = 1 nM Serotonin 1a (5-HT1a) (214) 0.026 Ki = 24 nM Propranolol (27) Sotalol (471) β-2 adrenergic receptor (210) β-3 adrenergic receptor (246) IC50 = 12 nM IC50 = 7200 nM

WADA - P2 Beta Blockers (figure labels: Carvedilol; Metoprolol)

S8 - Cannabinoids
● 10 explicitly prohibited compounds
● 17 refined families, of which 13 are cannabinoid CB1/2 receptors
● All compounds show strong predicted affinity to at least one cannabinoid receptor, except cannabivarol
● Excellent agreement between PR-scores and experimental results
Table (columns: Compound, Target, PR-Score, E-Value; cell order as extracted): Cannabidivarin (−) Cannabinoid CB1 receptor (218) 0.037 Prediction Cannabigerol (497318) Cannabinoid CB2 receptor (253) HL-60 (383) Prediction HU-210 (70625) JWH-018 (561013) Cannabinoid CB1 receptor (3571) Cannabinoid CB2 receptor (5373) Cannabinoid CB1 receptor (218) Cannabinoid CB1 receptor (3571) Cannabinoid CB2 receptor (253) Ki = 0.82 nM a Prediction pKi = 8.7 pKi = pKi = 8.2 JWH-073 (−) Isoprenylcysteine carboxyl methyltransferase (4699) MDA-MB-231 (400) Cannabinoid CB1 receptor (218) Prediction Cannabinoid CB1 receptor (3571) 0.025 Prediction Tetrahydrocannabinol (465) Cannabinoid CB1 receptor (218) Ki = 2.9 nM Cannabinoid CB1 receptor (3571) Cannabinoid CB2 receptor (2470) Cannabinoid CB2 receptor (253) Cannabinoid CB2 receptor (5373) Ki = 37 nM Ki = 20 nM Ki = 3.3 nM Ki = 9.2 nM

WADA - S8 Cannabinoids (figure labels: Tetrahydrocannabinol; HU-210; JWH-018/073)

Discussion: As for any method, the success of our approach depends on the quality of the underlying data. Our methodology addresses the problem that only a small fraction of possible activities of different molecules against different targets have been assayed. For ChEMBL families that are not well populated, or for protein targets against which too few compounds have been assayed, we cannot make predictions.

Conclusions: Target Prediction. Automated data curation of the ChEMBL families greatly increases the precision of our protein target predictions. Our validations show good correspondence with experiment, 87% having the correct refined family among the top four hits. Across the seven WADA classes considered, we find a combination of expected and unexpected protein targets. Many of the non-obvious predicted targets have biochemically or clinically validated connections with the expected bioactivities.

Thanks: SULSA, WADA, BBSRC, SFC. Dr Lazaros Mavridis, James McDonagh, Dr Tanja van Mourik, Dr Luna De Ferrari, Neetika Nath (St Andrews); Prof. Maxim Fedorov, Dr David Palmer (Strathclyde); Laura Hughes, Dr Toni Llinas (ex-Cambridge); James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk, William Walton (U/G project).