Quantum Chemical and Machine Learning Calculations of the Intrinsic Aqueous Solubility of Druglike Molecules. Dr John Mitchell, University of St Andrews.

How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

The Two Faces of Computational Chemistry: Theoretical Chemistry and Informatics

Theoretical Chemistry “The problem is difficult, but by making suitable approximations we can solve it at reasonable cost based on our understanding of physics and chemistry”

Theoretical Chemistry: Calculations and simulations based on real physics. Calculations are either quantum mechanical or use parameters derived from quantum mechanics. Attempt to model or simulate reality. Usually low throughput.

Drug Disc. Today, 10 (4), 289 (2005)

Existing Theoretical Approaches: Thus far, although theoretical methods have shown promise, they have not matched the accuracy of QSPR. There is no theoretical method that deals directly with solubility, so the problem has to be broken down into parts. There are several different ways of doing this.

Our First Principles Method: We present one such approach and believe this to be the world’s most cost-effective first principles solubility method.

Thermodynamic Cycle

ΔG sub from lattice energy minimisation. ΔG hydr from Reference Interaction Site Model (RISM). Different kinds of theoretical method are used for each part.

ΔG sub from lattice energy & a phonon entropy term; DMACRYS using B3LYP/6-31G(d,p) multipoles and FIT repulsion-dispersion potential. ΔG hydr from Reference Interaction Site Model with Universal Correction (RISM/UC). Different kinds of theoretical method are used for each part.

OUR DATASET (25 molecules)

We have experimental logS for all 25 molecules, but can only subdivide into ΔG sub and ΔG hydr for 10 of them.

Thermodynamic Cycle Crystal Gas Solution

Sublimation Free Energy: Crystal → Gas

Sublimation Free Energy (Crystal → Gas): Calculating ΔG sub is a standard procedure in crystal structure prediction.

Crystal Structure Prediction: Given the structural diagram of an organic molecule, predict the 3D crystal structure. Slide after SL Price, Int. Sch. Crystallography, Erice, 2004.

CSP Methodology: Based around minimising lattice energy of trial structures. Enthalpy comes from lattice energy and intramolecular energy (DFT), which need to be well calibrated against each other: trade-off of lattice vs conformational energy. Entropy comes from phonon modes (crystal vibrations); can get Free Energy.
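The phonon entropy term can be illustrated with the textbook harmonic-oscillator formula, in which each vibrational mode contributes independently. This is a minimal sketch only: the function name and the input list of wavenumbers are hypothetical, and it is not a description of how DMACRYS itself evaluates the term (a real lattice calculation would also need to sample phonons across the Brillouin zone).

```python
import math

H = 6.62607015e-34      # Planck constant, J s
KB = 1.380649e-23       # Boltzmann constant, J/K
C_CM = 2.99792458e10    # speed of light, cm/s
NA = 6.02214076e23      # Avogadro constant, 1/mol

def harmonic_entropy(wavenumbers_cm, temperature=298.15):
    """Vibrational entropy in J/(mol K) of independent harmonic modes.

    wavenumbers_cm: positive mode wavenumbers in cm^-1 (hypothetical input).
    """
    total = 0.0
    for nu in wavenumbers_cm:
        x = H * C_CM * nu / (KB * temperature)   # h*nu / kT, dimensionless
        # Per-mode entropy in units of k: x/(e^x - 1) - ln(1 - e^-x)
        total += x / math.expm1(x) - math.log1p(-math.exp(-x))
    return NA * KB * total

# Soft lattice modes carry far more entropy than stiff intramolecular ones.
soft = harmonic_entropy([50.0])
stiff = harmonic_entropy([500.0])
```

Multiplying the result by T gives the TΔS-type contribution that is combined with the lattice enthalpy to obtain a free energy.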

These methods can get relative lattice energies of different structures correct, probably to within a few kJ/mol. Absolute energies are harder.

Additional possible benefit for solubility: if we don’t know the crystal structure, we could reasonably use best structure from crystal structure prediction.

Results for ΔG sub: Lattice energies from DMACRYS with FIT atom-atom model potential and B3LYP/6-31G(d,p) distributed multipoles.

Results for ΔG sub: Reasonable prediction of ΔG sub, but small number of molecules.

To see the trends in errors, we need to look at more molecules. RMSE = 20.4 kJ/mol (46 molecules)

The 46 compound set shown here has a larger error, mostly due to some large outliers. Error statistics vary with dataset. RMSE = 20.4 kJ/mol (46 molecules)

RMSE = 22.4 kJ/mol (46 molecules)

The predicted ΔH sub is much better correlated with experiment than is TΔS sub. However, ΔH sub has a much larger range of values and contributes more to the RMS error.

Thermodynamic Cycle Crystal Gas Solution

Hydration Free Energy: We expected that hydration would be harder to model than sublimation, because the solution has an inexactly known, dynamic structure and both solute and solvent matter.

Reference Interaction Site Model (RISM): Combines features of explicit and implicit solvent models. Solvent density is modelled, but no explicit molecular coordinates or dynamics. ~45 CPU mins per compound.

RISM

Reference Interaction Site Model (RISM) Palmer, D.S., et al., Accurate calculations of the hydration free energies of druglike molecules using the reference interaction site model. The Journal of Chemical Physics, (4): p

Results for ΔG hyd: Perhaps surprisingly, the error in ΔG hyd is smaller than in ΔG sub.

Other Hydration Energy Approaches: An alternative is simply to try the various continuum solvent models available in Gaussian.

logS from Thermodynamic Cycle (Crystal, Gas, Solution): Add the two terms, ΔG sub and ΔG hyd, to get ΔG sol and hence logS.
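The step from the two legs of the cycle to logS can be sketched as follows. This assumes both free energies are in kJ/mol and refer to the same 1 mol/L standard state, so that S comes out in mol/L; the function name, the temperature default and the input numbers are illustrative assumptions, not values from the talk.

```python
import math

R = 8.314462618e-3   # gas constant, kJ/(mol K)

def log_s_from_cycle(dg_sub, dg_hyd, temperature=298.15):
    """Return (dG_sol in kJ/mol, log10 S) from the two legs of the cycle.

    Assumes consistent standard states for dg_sub and dg_hyd.
    """
    dg_sol = dg_sub + dg_hyd                      # crystal -> gas -> solution
    log_s = -dg_sol / (R * temperature * math.log(10))
    return dg_sol, log_s

# Made-up illustrative inputs, not results from the slides.
dg_sol, log_s = log_s_from_cycle(dg_sub=60.0, dg_hyd=-45.0)
```

Because the two terms are large and of opposite sign, modest errors in either leg translate directly into the error in ΔG sol.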

Results for ΔG sol

Conclusions: Theory. Must calculate ΔG sub & ΔG hyd separately; experimental data sparse and errors may be large; RISM is efficient & fairly accurate for ΔG hyd; dataset size and composition make comparisons of methods hard; not yet matched accuracy of informatics.

Informatics Approaches “The problem is too difficult to solve using physics and chemistry, so we will design a black box to link structure and solubility”

Informatics and Empirical Models: In general, informatics methods represent phenomena mathematically, but not in a physics-based way. The model linking inputs to output is an empirically parameterised equation or a more elaborate mathematical model. Do not attempt to simulate reality. Usually high throughput.

What Error is Acceptable? For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. An RMSE > 1.0 logS unit is probably unacceptable. These two RMSE values (0.7 and 1.0 logS units) correspond to errors of 4.0 and 5.7 kJ/mol in ΔG sol.
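The conversion between an error in logS and an error in ΔG sol follows from the relation between solution free energy and solubility. A short worked version, assuming T = 298.15 K (the talk does not state the temperature) and R = 8.314 J/(mol K):

```latex
\Delta G_{\mathrm{sol}} = -RT \ln S = -RT \ln(10)\,\log_{10} S
\quad\Rightarrow\quad
\delta(\Delta G_{\mathrm{sol}}) = RT \ln(10)\,\delta(\log S)

RT \ln 10 = 8.314\ \mathrm{J\,mol^{-1}\,K^{-1}} \times 298.15\ \mathrm{K} \times 2.303
\approx 5.71\ \mathrm{kJ\,mol^{-1}}

0.7 \times 5.71 \approx 4.0\ \mathrm{kJ\,mol^{-1}}, \qquad
1.0 \times 5.71 \approx 5.7\ \mathrm{kJ\,mol^{-1}}
```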

What Error is Acceptable? A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
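The "useless model" benchmark can be checked numerically: a model that predicts the test-set mean for every compound has an RMSE exactly equal to the population SD of the test values. A minimal sketch with made-up logS values (the data and variable names are illustrative only):

```python
import math
import statistics

def rmse(observed, predicted):
    """Root-mean-square error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Made-up logS values for illustration only.
test_log_s = [-1.2, -2.5, -3.1, -4.0, -5.3, -2.2]

# "Model" that ignores structure and always predicts the mean.
mean_log_s = statistics.mean(test_log_s)
baseline_rmse = rmse(test_log_s, [mean_log_s] * len(test_log_s))

# Population standard deviation of the same values.
test_sd = statistics.pstdev(test_log_s)
```

In practice a null model would predict the training-set mean, so its test RMSE is close to, rather than identical to, the test-set SD.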

Machine Learning Method: Random Forest

Random Forest: Solubility Results. N train = 658; N test = 300. Test set: RMSE(te) = 0.69; r²(te) = 0.89; Bias(te) = -0.04. Out-of-bag: RMSE(oob) = 0.68; r²(oob) = 0.90; Bias(oob) = 0.01. DS Palmer et al., J. Chem. Inf. Model., 47, (2007)
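The three statistics quoted above can be computed as in the sketch below. The definitions are assumptions, since the slide does not spell them out: r² is taken as the squared Pearson correlation and bias as the mean of (predicted minus observed). The function name and data are hypothetical.

```python
import math

def regression_stats(observed, predicted):
    """Return (RMSE, r2, bias) for paired observed and predicted values."""
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    rmse = math.sqrt(sum(e * e for e in errors) / n)
    bias = sum(errors) / n                       # assumed sign: predicted - observed
    mo = sum(observed) / n
    mp = sum(predicted) / n
    cov = sum((o - mo) * (p - mp) for o, p in zip(observed, predicted))
    var_o = sum((o - mo) ** 2 for o in observed)
    var_p = sum((p - mp) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)             # squared Pearson correlation
    return rmse, r2, bias

# Made-up values: every prediction is 0.5 logS units too low.
obs = [-1.0, -2.0, -3.0, -4.0]
pred = [-1.5, -2.5, -3.5, -4.5]
rmse, r2, bias = regression_stats(obs, pred)
```

The example shows why all three numbers are reported together: a constant offset leaves r² at 1 while RMSE and bias both reveal the error.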

Support Vector Machine

SVM: Solubility Results. et al., N train = ; N test = 87. RMSE(te) = 0.94; r²(te) = 0.79

100 Compound Cross-Validation: Theoretical energies don’t seem to improve descriptor models.

100 Compound Cross-Validation: McDonagh et al., J Chem Inf Model, 54, 844 (2014)
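A cross-validation over 100 compounds can be set up so that every compound is predicted exactly once by a model that never saw it. The sketch below only generates the index splits; the choice of k = 10 folds, the seed and the function name are assumptions for illustration, not details taken from the slides.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; every item is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)          # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]     # k near-equal folds
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(100, k=10))
```

Each model (descriptor-based, theory-based or combined) would be trained on `train` and scored on `test`, with the errors pooled over all folds.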

Replicating Solubility Challenge (post hoc): McDonagh et al., J Chem Inf Model, 54, 844 (2014)

Replicating Solubility Challenge (post hoc): CDK descriptors: RF, PLS, SVM. N train = 94; N test = 28. RMSE(te) = 1.00; 0.89; 1.08. r²(te) = 0.49; 0.58; ; 12; 13/28 correct within 0.5 logS units.

Replicating Solubility Challenge (post hoc): CDK descriptors: RF, PLS, SVM. N train = 94; N test = 28. Although the test dataset is small, it is a standard set.

Conclusions: Informatics. Experimental data: errors unknown, but they limit the possible accuracy of models; CheqSol is a step in the right direction; dataset size and composition hinder comparisons of methods; the Solubility Challenge is a step in the right direction.

Thanks: SULSA; James McDonagh, Dr Tanja van Mourik, Neetika Nath (St Andrews); Prof. Maxim Fedorov, Dr Dave Palmer (Strathclyde); Laura Hughes, Dr Toni Llinas; James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk, William Walton (U/G project)