In silico prediction of solubility: Solid progress but no solution? Dr John Mitchell University of St Andrews.

Presentation transcript:

Given accurately measured solubilities of 100 molecules, can you predict the solubilities of 32 similar ones?

For this study Toni Llinàs measured 132 solubilities using the CheqSol method. He used a Sirius glpKa instrument.

Intrinsic solubility: the intrinsic solubility of an ionisable compound is the thermodynamic solubility of the free acid or base form (Horter, D., Dressman, J. B., Adv. Drug Deliv. Rev., 1997, 25, 3-14). [Diagram: NaA, A−, Na+ and AH, with equilibrium constants K0 and Ka.] S0 is essentially the solubility of the neutral form only.
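
To make the role of S0 concrete, here is a minimal sketch of how the total solubility of an ionisable compound could be related to its intrinsic solubility. It assumes the textbook Henderson-Hasselbalch relation for a monoprotic acid or base and ignores any limit set by the solubility of the salt; the function names and the example numbers are illustrative and are not taken from the slides.

```python
def total_solubility_acid(s0, pka, ph):
    """Total solubility of a monoprotic acid at a given pH.

    Assumption: ideal Henderson-Hasselbalch behaviour, so the neutral
    form stays at its intrinsic solubility s0 and the ionised form
    adds s0 * 10**(ph - pka). Units follow those of s0.
    """
    return s0 * (1.0 + 10.0 ** (ph - pka))


def total_solubility_base(s0, pka, ph):
    """Same idea for a monoprotic base (pka of the conjugate acid)."""
    return s0 * (1.0 + 10.0 ** (pka - ph))


# Hypothetical acid with S0 = 1e-5 mol/L and pKa = 4.0
for ph in (1.0, 4.0, 7.0):
    print(ph, total_solubility_acid(1e-5, 4.0, ph))
```

Well below the pKa the acid is almost entirely neutral, so the total solubility approaches S0, which is why the intrinsic value is the natural quantity to predict.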

“CheqSol”: Diclofenac. [Figure: powder and solution; supersaturated and subsaturated solution; 8 intrinsic solubility values.]
● First precipitation: Kinetic Solubility (not in equilibrium).
● Thermodynamic solubility through “Chasing Equilibrium”: Intrinsic Solubility (in equilibrium).
● We continue “Chasing Equilibrium” until a specified number of crossing points have been reached.
● A crossing point represents the moment when the solution switches from a saturated solution to a subsaturated solution; no change in pH, gradient zero, no re-dissolving nor precipitating: SOLUTION IS IN EQUILIBRIUM.
Random error less than 0.05 log units.
Supersaturation Factor: SSF = S_kin − S_0.
* A. Llinàs, J. C. Burley, K. J. Box, R. C. Glen and J. M. Goodman. Diclofenac solubility: independent determination of the intrinsic solubility of three crystal forms. J. Med. Chem. 2007, 50(5),

Caveat: the official results are used in the following slides, but most of the interpretation is my own.

A prediction was considered correct if it was within 0.5 log units

Not a very generous margin of error!

A “null prediction” based on predicting everything to have the mean training set solubility would have got 9/32 correct
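
The scoring rule and the null baseline can be sketched in a few lines. This is only an illustration: the numbers are made up, and treating a prediction exactly 0.5 log units away as correct is an assumption, since the slides do not say whether the boundary is inclusive.

```python
from statistics import mean


def count_correct(predicted, observed, tolerance=0.5):
    """Number of predictions within `tolerance` log units of experiment.

    Assumption: the boundary is inclusive (<=).
    """
    return sum(abs(p - o) <= tolerance for p, o in zip(predicted, observed))


def null_predictions(train_logs, n_test):
    """Predict the mean training-set logS for every test molecule."""
    return [mean(train_logs)] * n_test


# Hypothetical logS values, not the challenge data
train = [-1.0, -2.0, -3.0]
test_observed = [-2.4, -2.6, -1.5, -4.0]
baseline = null_predictions(train, len(test_observed))
print(count_correct(baseline, test_observed), "of", len(test_observed))
```

Any real entry should be judged against this count, because the null model needs no chemistry at all.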

Using an R² threshold of 0.500, only 18/99 entries were good.

[Figure: entries classed as GOOD or BAD]

3 “WINNERS”: three Pareto-optimal entries, which I think of as “winners”. These combine the best R² with the most correct predictions.
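
For readers unfamiliar with the term, an entry is Pareto optimal here if no other entry is at least as good on both criteria (R² and number of correct predictions) and strictly better on one. A minimal sketch of that selection follows; the entry names and scores are hypothetical, not the challenge results.

```python
def pareto_optimal(entries):
    """Return names of entries not dominated on (r2, n_correct).

    entries: list of (name, r2, n_correct) tuples; higher is better
    for both criteria.
    """
    front = []
    for name, r2, n_ok in entries:
        dominated = any(
            other_r2 >= r2 and other_ok >= n_ok
            and (other_r2 > r2 or other_ok > n_ok)
            for _, other_r2, other_ok in entries
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical entries: (name, R^2, correct predictions out of 32)
entries = [("A", 0.60, 15), ("B", 0.50, 18), ("C", 0.55, 14), ("D", 0.40, 10)]
print(pareto_optimal(entries))
```

Several entries can be Pareto optimal at once because improving one criterion may cost something on the other.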

Some molecules proved much harder to predict than others – the most insoluble were amongst the most difficult.

Conclusions from Solubility Challenge: My opinion is that the overall standard was rather poor. It’s obvious that some entries were much better than others, but entries were anonymous, so we can judge neither specific researchers nor their methods. We can only rely on the “official” summary …

Conclusions from Solubility Challenge: We can only rely on the “official” summary: “a variety of methods and combinations of methods all perform about equally well.”

How should we approach the prediction/estimation/calculation of the aqueous solubility of druglike molecules? Two (apparently) fundamentally different approaches: theoretical chemistry & informatics.

What Error is Acceptable? For typically diverse sets of druglike molecules, a “good” QSPR will have an RMSE ≈ 0.7 logS units. An RMSE > 1.0 logS unit is probably unacceptable. This range of 0.7 to 1.0 logS units corresponds to an error of 4.0 to 5.7 kJ/mol in ΔG_sol.
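
The conversion between logS units and free energy is a factor of RT ln 10. The sketch below assumes a temperature of 298.15 K, which the slide does not state, and the function name is illustrative.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def logs_error_to_kj_per_mol(error_logs, temperature_k=298.15):
    """Convert an error in logS units to kJ/mol via RT ln(10).

    Assumption: T = 298.15 K unless another temperature is given.
    """
    return error_logs * R_KJ * temperature_k * math.log(10)


for err in (0.7, 1.0):
    print(err, "logS units ->", round(logs_error_to_kj_per_mol(err), 1), "kJ/mol")
```

At this temperature one logS unit is worth about 5.7 kJ/mol, which reproduces the 4.0 to 5.7 kJ/mol range quoted above.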

What Error is Acceptable? A useless model would have an RMSE close to the SD of the test set logS values: ~ 1.4 logS units; The best possible model would have an RMSE close to the SD resulting from the experimental error in the underlying data: ~ 0.5 logS units?
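
The link between a useless model and the standard deviation can be checked directly: a model that predicts the mean of the test set for every molecule has an RMSE exactly equal to the population SD of the test values. The data below are made up for illustration.

```python
import math
from statistics import mean, pstdev


def rmse(predicted, observed):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(mean([(p - o) ** 2 for p, o in zip(predicted, observed)]))


# Hypothetical test-set logS values
observed = [-1.0, -2.5, -4.0, -3.0]
useless = [mean(observed)] * len(observed)

print(rmse(useless, observed), pstdev(observed))  # the two numbers agree
```

Predicting the training-set mean instead gives a slightly larger RMSE, so the test-set SD is the natural yardstick for "no better than guessing".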

Theoretical Approaches

Theoretical Chemistry “The problem is difficult, but by making suitable approximations we can solve it at reasonable cost based on our understanding of physics and chemistry”

Theoretical Chemistry Calculations and simulations based on real physics. Calculations are either quantum mechanical or use parameters derived from quantum mechanics. Attempt to model or simulate reality. Usually Low Throughput.

Drug Disc.Today, 10 (4), 289 (2005)

Thermodynamic Cycle

Crystal Gas Solution
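
As a rough sketch of how the cycle is used, the solution free energy is the sum of the sublimation (crystal to gas) and hydration (gas to solution) legs, and logS follows from it. The standard-state convention is an assumption here: both legs are taken to refer to the same 1 mol/L reference, so that S0 in mol/L is exp(−ΔG_sol/RT); other conventions add a constant. The energies used are hypothetical.

```python
import math

R_KJ = 8.314462618e-3  # gas constant in kJ/(mol K)


def solution_free_energy(dg_sub, dg_hyd):
    """Crystal -> gas -> solution: add the two legs of the cycle (kJ/mol)."""
    return dg_sub + dg_hyd


def log_s_from_dg_sol(dg_sol, temperature_k=298.15):
    """log10 of solubility in mol/L.

    Assumption: dg_sol is referred to a 1 mol/L standard state.
    """
    return -dg_sol / (R_KJ * temperature_k * math.log(10))


# Hypothetical values: sublimation costs 60 kJ/mol, hydration returns 40 kJ/mol
dg_sol = solution_free_energy(60.0, -40.0)
print(dg_sol, round(log_s_from_dg_sol(dg_sol), 2))
```

Because the result is a small difference between two large terms, errors of a few kJ/mol in either leg translate into errors of the order of one logS unit.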

Sublimation Free Energy Crystal Gas

Sublimation Free Energy Crystal Gas. Calculating ΔG_sub is a standard procedure in crystal structure prediction.

Crystal Structure Prediction Given the structural diagram of an organic molecule, predict the 3D crystal structure. Slide after SL Price, Int. Sch. Crystallography, Erice, 2004

CSP Methodology Based around minimising lattice energy of trial structures. Enthalpy comes from lattice energy and intramolecular energy (DFT), which need to be well calibrated against each other: trade-off of lattice vs conformational energy. Entropy comes from phonon modes (crystal vibrations); can get Free Energy.
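
To illustrate how phonon modes give access to a free energy, the sketch below evaluates the standard harmonic-oscillator expression, F_vib = Σ [hν/2 + kT ln(1 − exp(−hν/kT))], over a list of mode frequencies. It is not the CSP code's own procedure: real calculations sample modes across the Brillouin zone, and the wavenumbers used here are hypothetical.

```python
import math

H = 6.62607015e-34      # Planck constant, J s
C_CM = 2.99792458e10    # speed of light, cm/s
KB = 1.380649e-23       # Boltzmann constant, J/K
NA = 6.02214076e23      # Avogadro constant, 1/mol


def vibrational_free_energy(wavenumbers_cm, temperature_k=298.15):
    """Harmonic vibrational free energy in kJ/mol.

    wavenumbers_cm: mode frequencies in cm^-1 (all must be positive).
    Each mode contributes its zero-point energy plus a thermal term.
    """
    total = 0.0
    for w in wavenumbers_cm:
        energy = H * C_CM * w  # quantum of the mode, in J
        thermal = KB * temperature_k * math.log(1.0 - math.exp(-energy / (KB * temperature_k)))
        total += 0.5 * energy + thermal
    return total * NA / 1000.0


# Hypothetical low-frequency lattice modes
print(vibrational_free_energy([50.0, 100.0, 150.0]))
```

Low-frequency lattice modes contribute negative free energy at room temperature, which is how the entropy term enters the comparison between crystal structures.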

CSP Methodology DFT calculation on monomer to obtain DMA electrostatics. Generate many plausible crystal structures using different space groups. Minimise lattice energy using DMA + repulsion-dispersion potential. Many structures may have similar energies.

These methods can get relative lattice energies of different structures correct, probably to within a few kJ/mol. Absolute energies are harder.

Additional possible benefit for solubility: if we don’t know the crystal structure, we could reasonably use the best structure from CSP.

Other approaches to Lattice Energy Periodic DFT calculations on a lattice are an alternative to the model potential approach. Advantageous to optimise intra- and intermolecular energies simultaneously using the same method. Disadvantage: it’s hard to get accurate dispersion.

Empirical routes to ΔG_sub: alternatively, one could estimate the sublimation energy from QSPR (no crystal structure needed, but no obvious benefit over a direct informatics approach to solubility).

Thermodynamic Cycle Crystal Gas Solution

Hydration Free Energy

We expect that hydration will be harder to model than sublimation, because the solution has an inexactly known and dynamic structure, and because both solute and solvent are important.

Simulation: MD/FEP Parameterised continuum models

… and of course the parameterised RISM work of our hosts. Quoted RMS error ~5kJ/mol or 0.9 log units.

… and this one both calculates solubility directly and is simulation based: FEP or Monte Carlo.

Luder et al.’s results correspond to an RMS error of about 6kJ/mol, or 1 logS unit, but only when an empirical “correction” is applied ….

… their uncorrected results are less impressive.

Hydration Energy: our current methodology here is just to try the various PCM continuum models available in Gaussian.

We observe that our TD cycle method, based on lattice energy minimisation for sublimation and a PCM continuum model of hydration, correlates reasonably with experiment but is not quantitatively predictive (at least without arbitrary correction). Caveat: currently only a small sample of molecules.

An alternative route is via octanol, then using logP.

So really these are “Theoretical-ish” Approaches

Using a training-test set split to optimise parameters & validate: RMSE(te) = 0.71; r²(te) = 0.77; N_train = 34; N_test = 26.

Informatics Approaches

Informatics “The problem is too difficult to solve using physics and chemistry, so we will design a black box to link structure and solubility”

Informatics and Empirical Models: In general, informatics methods represent phenomena mathematically, but not in a physics-based way. The model linking inputs and output is based on an empirically parameterised equation or a more elaborate mathematical model. Do not attempt to simulate reality. Usually High Throughput.

Machine Learning Method Random Forest

Random Forest: Solubility Results. Test set: RMSE(te) = 0.69, r²(te) = 0.89, Bias(te) = −0.04. Training set: RMSE(tr) = 0.27, r²(tr) = 0.98, Bias(tr) = 0.005. Out-of-bag: RMSE(oob) = 0.68, r²(oob) = 0.90, Bias(oob) = 0.01. N_train = 658; N_test = 300. DS Palmer et al., J. Chem. Inf. Model., 47, (2007)
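
The three statistics quoted for each data set can be computed as in the sketch below. Two conventions are assumptions, since the slide does not define them: r² is taken as the squared Pearson correlation between predicted and observed logS, and bias as the mean of (predicted − observed). The data are made up.

```python
import math
from statistics import mean


def regression_metrics(predicted, observed):
    """Return RMSE, squared Pearson r, and bias for a set of predictions.

    Assumptions: bias = mean(predicted - observed); r2 = Pearson r squared.
    """
    residuals = [p - o for p, o in zip(predicted, observed)]
    rmse = math.sqrt(mean([r * r for r in residuals]))
    bias = mean(residuals)
    mp, mo = mean(predicted), mean(observed)
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    r2 = cov * cov / (var_p * var_o)
    return {"rmse": rmse, "r2": r2, "bias": bias}


# Hypothetical predictions that are uniformly 0.1 log units too high
observed = [-1.0, -2.0, -3.0, -4.0]
predicted = [-0.9, -1.9, -2.9, -3.9]
print(regression_metrics(predicted, observed))
```

Computing the same statistics on training, out-of-bag and test data is what exposes the gap between the fitted RMSE of 0.27 and the honest estimates near 0.7.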

Random Forest: Replicating Solubility Challenge (post hoc) RMSE(te)=1.09 r 2 (te)= /32 correct within 0.5 logS units N train  100; N test  32 CDK descriptors

Support Vector Machine

SVM: Solubility Results et al., N train = ; N test = 87 RMSE(te)=0.94 r 2 (te)=0.79

What can we Learn from Informatics?

What Descriptors Correlate with logS? … amongst the solubility challenge training set, once intercorrelated descriptors with R² > 0.64 are removed?
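
One simple way to remove intercorrelated descriptors is a greedy filter: walk through the descriptors in order and keep one only if its R² with every descriptor already kept is at or below the threshold. This is a sketch of that idea, not necessarily the procedure used for the slide; the result depends on descriptor order, and the descriptor names and values are hypothetical.

```python
from statistics import mean


def pearson_r2(x, y):
    """Squared Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov * cov / (var_x * var_y)


def drop_intercorrelated(descriptors, threshold=0.64):
    """Greedy filter over a dict of {name: values}, in insertion order."""
    kept = []
    for name, values in descriptors.items():
        if all(pearson_r2(values, descriptors[k]) <= threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table for five molecules
descriptors = {
    "logP": [1.0, 2.0, 3.0, 4.0, 5.0],
    "mw": [2.0, 4.0, 6.0, 8.0, 10.5],   # nearly proportional to logP here
    "hbd": [3.0, 1.0, 4.0, 1.0, 5.0],
}
print(drop_intercorrelated(descriptors))
```

An R² threshold of 0.64 corresponds to a correlation coefficient of magnitude 0.8.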

The first 21 are all negatively correlated with logS … … things that reduce solubility. Some of this is meaningful: aromatic groups reduce solubility. Some is accidental: logP happens to be defined as octanol:water, rather than water:octanol.

Future Work Explore different models of hydration: PCM, simulation (MD/FEP), RISM … Route: Direct to water or via octanol? Machine Learning (Random Forest, SVM etc.) for hybrid experimental/parameterised models. Consistent training and validation sets and methodologies to compare methods: e.g., solubility challenge {100+32}.

Conclusions thus far…

Solubility has proved a difficult property to calculate. It involves different phases (solid & solution) and different substances (solute and solvent), and both enthalpy & entropy are important. The theoretical approaches are generally based around thermodynamic cycles and involve some empirical element.

Thanks: Pfizer & PIPMS; Gates Cambridge Trust; SULSA. Dr Dave Palmer, Laura Hughes, Dr Toni Llinàs, James McDonagh, Dr Tanja van Mourik. James Taylor, Simon Hogan, Gregor McInnes, Callum Kirk, William Walton (U/G project).