Published by Brendan Snow. Modified over 9 years ago.
1
1 Can we Predict Anything Useful from 2-D Molecular Structure? Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.
7
7 We look at data, analyse data, use data to find correlations... to develop models... and to make (hopefully) useful predictions. Let’s look at some data...
8
8 New York Times, 4th October 2005.
9
9 Happiness ≈ (GNP/$5000) − 1. Poor fit to linear model.
10
10 Outliers? [Figure: Happiness plotted against (GNP/$5000) − 2.]
11
11 Fitting with a curve: reduce RMSE
12
12 Outliers? Different linear models for different regimes
13
13 Only one obvious (to me) conclusion. This area is empty: no country is both rich and unhappy. All other combinations are observed. [Figure: Happiness plotted against (GNP/$5000) − 2.]
14
14 ... but this has nothing to do with 2-D molecular structure
15
15 QSPR: Quantitative Structure Property Relationship
Physical property related to more than one other variable.
First example from Hansch et al., 1960s.
General form (for non-linear relationships): $y = f(\mathrm{descriptors})$
16
16 QSPR
$Y = f(X_1, X_2, \ldots, X_N)$
Optimisation of $Y = f(X_1, X_2, \ldots, X_N)$ is called regression.
Model is optimised upon N “training molecules” and then tested upon M “test” molecules.
17
17 QSPR Quality of the model is judged by three parameters:
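The three parameters themselves are not preserved on this slide. The later results slides report RMSE, r² and Bias, so a minimal sketch, assuming those are the three, could look like the following. The function names are illustrative, r² is taken here as the squared Pearson correlation, and the sign convention for bias (predicted minus observed) is an assumption.

```python
import math


def rmse(observed, predicted):
    """Root-mean-square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def r_squared(observed, predicted):
    """Squared Pearson correlation between observed and predicted values."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov ** 2 / (var_o * var_p)


def bias(observed, predicted):
    """Mean signed error, assumed here to be predicted minus observed."""
    n = len(observed)
    return sum(p - o for o, p in zip(observed, predicted)) / n
```

A model can have a perfect r² and still be systematically off, which is why bias and RMSE are reported alongside it.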
18
18 QSPR
Different methods for carrying out regression:
– LINEAR: Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc.
– NON-LINEAR: Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
19
19 QSPR However, this does not guarantee a good predictive model...
20
20 QSPR Problems with experimental error. A QSPR equation is only as accurate as the data it is trained upon. Therefore, we are making experimental measurements of solubility (Dr Antonio Llinàs).
21
21 QSPR Problems with “chemical space”. “Sample” molecules must be representative of “Population”. Prediction results will be most accurate for molecules similar to training set. Global or Local models?
22
22 Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.
23
23 Drug Disc. Today, 10 (4), 289 (2005). Cohesive interactions in the lattice reduce solubility. Predicting lattice (or almost equivalently sublimation) energy should help predict solubility.
24
24 Relationship of Chemical Structure With Lattice Energy Can we predict lattice energy from molecular structure? Dr Carole Ouvrard & Dr John Mitchell Unilever Centre for Molecular Informatics University of Cambridge C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)
25
25 Why Do We Need a Predictive Model?
A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials:
– From 2-D molecular structure only
– Without knowing the crystal packing
– Without expensive theoretical calculations
Should help predict solubility.
26
26 Why Do We Think it Will Work? Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule. Many molecules have a plurality of different experimentally observable polymorphs. We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.
27
27 [Figure: lattice energy (kJ/mol, axis from -92.0 to -98.0) against density (g/cc, axis from 1.40 to 1.60) for possible crystal packings in space groups P-1, P21/c, P212121 and P21, with the calculated lowest energy structure and the experimental crystal structure marked.]
28
28 Expression for the Lattice Energy
$U_{\mathrm{crystal}} = U_{\mathrm{molecule}} + U_{\mathrm{lattice}}$
Theoretical lattice energy
– Crystal binding = cohesive energy
Experimental lattice energy is related to $-\Delta H_{\mathrm{sublimation}}$:
$\Delta H_{\mathrm{sublimation}} = -U_{\mathrm{lattice}} - 2RT$ (Gavezzotti & Filippini)
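As a worked example of the relation between sublimation enthalpy and lattice energy on this slide, the sketch below converts in both directions. The function names, the default temperature of 298.15 K and the example lattice energy of -96 kJ/mol are illustrative assumptions, not values taken from the talk.

```python
# Gas constant in kJ mol^-1 K^-1
R_KJ_PER_MOL_K = 8.314462618e-3


def sublimation_enthalpy(u_lattice_kj_mol, temperature_k=298.15):
    """Delta H_sublimation = -U_lattice - 2RT, all energies in kJ/mol."""
    return -u_lattice_kj_mol - 2.0 * R_KJ_PER_MOL_K * temperature_k


def lattice_energy(dh_sublimation_kj_mol, temperature_k=298.15):
    """Inverse of the relation above: U_lattice = -Delta H_sublimation - 2RT."""
    return -dh_sublimation_kj_mol - 2.0 * R_KJ_PER_MOL_K * temperature_k


# Illustrative value only: a lattice energy of -96 kJ/mol at 298.15 K
example = sublimation_enthalpy(-96.0)
```

At room temperature the 2RT term is about 5 kJ/mol, so the sublimation enthalpy is slightly smaller in magnitude than the lattice energy.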
29
29 Partitioning of the Lattice Energy
$U_{\mathrm{crystal}} = U_{\mathrm{molecule}} + U_{\mathrm{lattice}}$
$\Delta H_{\mathrm{sublimation}} = -U_{\mathrm{lattice}} - 2RT$
Partitioning the lattice energy in terms of structural contributions.
Choice of the significant parameters:
– Number of atoms of each type?
– Number of rings, aromatics?
– Number of bonds of each type?
– Symmetry?
– Hydrogen bond donors and acceptors? Intramolecular?
We choose counts of atom type occurrences.
30
30 Analysis of the Sublimation Energy Data
Experimental data: $\Delta H_{\mathrm{sublimation}}$
– NIST (National Institute of Standards and Technology, USA)
– Scientific literature
Atom types
– SATIS codes: 10-digit connectivity code + bond types
– Each 2-digit code = atomic number
  HN    01 07 99 99 99
  HO    01 08 99 99 99
  O=C   08 06 99 99 99
  -O-   08 06 06 99 99
Typically, several similar SATIS codes are grouped to define an atom type.
Statistical analysis
– Multi-Linear Regression Analysis: $\Delta H_{\mathrm{sub}}$ against number of atoms of each type
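The four example codes on this slide suggest how a SATIS connectivity code can be read. This is a minimal sketch, assuming that the first 2-digit field is the central atom, the remaining four fields are its neighbours, and 99 marks an unused neighbour slot; these conventions are inferred from the examples only. The bond-type part of the code is not handled, and `parse_satis` is a hypothetical helper name.

```python
# Element symbols for the atomic numbers listed in the talk's datasets
ELEMENT_SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O", 9: "F",
                   16: "S", 17: "Cl", 35: "Br", 53: "I"}
NO_NEIGHBOUR = 99  # assumption: 99 fills an empty neighbour slot


def parse_satis(code):
    """Split a 10-digit connectivity code into (central atom, neighbours)."""
    digits = code.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected five 2-digit fields, got %r" % code)
    numbers = [int(digits[i:i + 2]) for i in range(0, 10, 2)]
    central = ELEMENT_SYMBOLS.get(numbers[0], str(numbers[0]))
    neighbours = [ELEMENT_SYMBOLS.get(n, str(n))
                  for n in numbers[1:] if n != NO_NEIGHBOUR]
    return central, neighbours
```

Read this way, `01 07 99 99 99` is a hydrogen bonded to one nitrogen (the HN atom type) and `08 06 06 99 99` is an oxygen bonded to two carbons (the ether -O- type).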
31
31 Training Dataset of Model Molecules
226 organic compounds (running totals in brackets):
– 19 linear alkanes (19)
– 14 branched alkanes (33)
– 17 aromatics (50)
– 106 other non-H-bonders (156)
– 70 H-bond formers (226)
Non-specific interacting
– Hydrocarbons
– Nitrogen compounds
– Nitro-, C≡N, halogens, S, Se substituents
– Pyridine
Potential hydrogen bonding interactions
– Amides
– Carboxylic acids
– Amino acids…
32
32 Study of Non-specific Interactions: Linear Alkanes
19 compounds: CH4 to C20H42
Limit for van der Waals interactions
$\Delta H_{\mathrm{sub}} = 7.955\,C - 2.714$, where $C$ is the number of carbon atoms; $r^2 = 0.977$, $s = 7.096$ kJ/mol
[Figure: BPt and $\Delta H_{\mathrm{sub}}$ plotted for the series.]
Note odd-even variation in $\Delta H_{\mathrm{sub}}$ for this series. Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.
33
33 Include Branched Alkanes
Add 14 branched alkanes to dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents.
33 compounds: CH4 to C20H42
$\Delta H_{\mathrm{sub}} = 7.724\,C_{\mathrm{nonbranched}} + 3.703$; $r^2 = 0.959$, $s = 8.117$ kJ/mol
If we also include the parameters for branched carbons, $C_3$ and $C_4$, the model doesn’t improve.
34
34 All Hydrocarbons: Include Aromatics
Add 17 aromatics to the dataset (note: we have no alkenes or alkynes).
50 compounds
$\Delta H_{\mathrm{sub}} = 7.680\,C_{\mathrm{nonbranched}} + 6.185\,C_{\mathrm{aromatic}} + 4.162$; $r^2 = 0.958$, $s = 7.478$ kJ/mol
($C_{\mathrm{nonbranched}}$ is annotated “aliphatic” on the slide.)
As before, if we also include the parameters for branched carbons, $C_3$ and $C_4$, the model doesn’t improve.
35
35 All Non-Hydrogen-Bonded Molecules
Add 106 non-hydrocarbons to the dataset. Include elements H, C, N, O, F, S, Cl, Br & I.
156 compounds
$\Delta H_{\mathrm{sub}}$ predicted by 16 parameter model: $r^2 = 0.896$, $s = 9.976$ kJ/mol
Parameters in model are counts of atom type occurrences.
36
36 General Predictive Model
Add 70 hydrogen bond forming molecules to the dataset.
226 compounds
$\Delta H_{\mathrm{sub}}$ predicted by 19 parameter model: $r^2 = 0.925$, $s = 9.579$ kJ/mol
Parameters in model are counts of atom type occurrences.
37
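37 Predictive Model Determined by MLRA
$$
\begin{aligned}
\Delta H_{\mathrm{sublimation}}\ (\mathrm{kJ\,mol^{-1}}) = {} & 6.942 + 20.141\,\mathrm{HN} + 30.172\,\mathrm{HO} + 3.127\,\mathrm{F} + 10.456\,\mathrm{Cl} \\
& + 12.926\,\mathrm{Br} + 19.763\,\mathrm{I} + 3.297\,\mathrm{C_3} - 3.305\,\mathrm{C_4} \\
& + 5.970\,\mathrm{C_{aromatic}} + 7.631\,\mathrm{C_{nonbranched}} + 7.341\,\mathrm{CO} + 19.676\,\mathrm{CS} \\
& + 11.415\,\mathrm{N_{nitrile}} + 8.953\,\mathrm{N_{nonnitrile}} + 8.466\,\mathrm{NO} \\
& + 18.249\,\mathrm{O_{ether}} + 20.585\,\mathrm{SO} + 12.840\,\mathrm{S_{thioether}}
\end{aligned}
$$
($\mathrm{C_{nonbranched}}$ is annotated “aliphatic” on the slide.)
All these parameters are significantly larger than their standard errors.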
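The 19-parameter equation on this slide can be evaluated directly once the atom type counts are known. The sketch below is one way to do that; the dictionary keys and the function name are illustrative, and assigning atoms to types (the SATIS-based grouping) is assumed to have been done elsewhere.

```python
# Intercept and coefficients (kJ/mol per atom) copied from the MLRA equation
INTERCEPT = 6.942
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172,
    "F": 3.127, "Cl": 10.456, "Br": 12.926, "I": 19.763,
    "C3": 3.297, "C4": -3.305,
    "C_aromatic": 5.970, "C_nonbranched": 7.631,
    "CO": 7.341, "CS": 19.676,
    "N_nitrile": 11.415, "N_nonnitrile": 8.953, "NO": 8.466,
    "O_ether": 18.249, "SO": 20.585, "S_thioether": 12.840,
}


def predict_sublimation_enthalpy(atom_type_counts):
    """Return the predicted sublimation enthalpy in kJ/mol.

    atom_type_counts maps atom type names (keys of COEFFICIENTS) to the
    number of occurrences in the molecule; missing types count as zero.
    """
    unknown = set(atom_type_counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError("unknown atom types: %s" % sorted(unknown))
    return INTERCEPT + sum(COEFFICIENTS[name] * count
                           for name, count in atom_type_counts.items())
```

For instance, a hypothetical count vector with two non-branched aliphatic carbons and nothing else gives 6.942 + 2 × 7.631 = 22.204 kJ/mol.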
38
38 Distribution of Residuals The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.
39
39 Validation on an Independent Test Set
35 diverse compounds: $r^2 = 0.928$, $s = 7.420$ kJ/mol
Nitro-compounds are often outliers.
Very encouraging result: accurate prediction possible.
40
40 Conclusions We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol. A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies. Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing. Avoids need for expensive calculations. May help predict solubility. Model gives good chemical insight.
41
41 A Chemoinformatics Approach To Predicting the Aqueous Solubility of Pharmaceutical Molecules David Palmer & Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.
42
42 Pfizer Project: P13 Novel Methods for Predicting Solubility David Palmer Dr Antonio Llinàs Pfizer Institute for Pharmaceutical Materials Science http://www.msm.cam.ac.uk/pfizer
43
43 Datasets
Compiled from Huuskonen dataset and AquaSol database.
All molecules solid at R.T.
n = 1000 molecules
Aqueous solubility – the thermodynamic solubility in unbuffered water (at 25 °C)
44
44 Diversity-Conserving Partitioning MACCS Structural Key fingerprints Tanimoto coefficient MaxMin Algorithm Full dataset n = 1000 molecules Training n = 670 molecules Test n = 330 molecules
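The MaxMin idea named on this slide can be sketched as follows: repeatedly pick the molecule whose nearest already-selected neighbour is furthest away, using Tanimoto distance on fingerprints. This is a sketch under stated assumptions rather than the procedure actually used for the 670/330 split: fingerprints are represented as plain Python sets of on-bit indices rather than real MACCS keys, the seed molecule is chosen arbitrarily, and the slide does not say which subset MaxMin selected.

```python
def tanimoto(a, b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def maxmin_select(fingerprints, n_select, seed_index=0):
    """Select n_select diverse items: each pick maximises its minimum
    Tanimoto distance (1 - similarity) to everything already selected."""
    n_select = min(n_select, len(fingerprints))
    selected = [seed_index]
    chosen = {seed_index}
    # Distance from every item to its nearest selected item so far
    min_dist = [1.0 - tanimoto(fp, fingerprints[seed_index])
                for fp in fingerprints]
    while len(selected) < n_select:
        candidates = [i for i in range(len(fingerprints)) if i not in chosen]
        best = max(candidates, key=lambda i: min_dist[i])
        selected.append(best)
        chosen.add(best)
        for i, fp in enumerate(fingerprints):
            d = 1.0 - tanimoto(fp, fingerprints[best])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected
```

With four toy fingerprints `[{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {1, 2, 3, 4}]` and the first as seed, the completely dissimilar third item is picked next, then the second.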
45
45 Structures & Descriptors 3D structures from Concord Minimised with MMFF94 MOE descriptors 2D/ 3D Separate analysis of 2D and 3D descriptors QuaSAR Contingency Module (MOE) 52 descriptors selected
46
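46 Multi-Linear Regression
$$
\begin{aligned}
\log S = {} & 0.07\,\mathrm{nHDon}\ (\pm 0.018) - 0.21\,\mathrm{TPSA}\ (\pm 0.033) + 0.11\,\mathrm{MAXDP}\ (\pm 0.022) \\
& - 0.22\,\mathrm{n.Ct}\ (\pm 0.019) - 0.29\,\mathrm{KierFlex}\ (\pm 0.032) - 0.59\,\mathrm{SLOGP}\ (\pm 0.036) \\
& - 0.26\,\mathrm{ATS2m}\ (\pm 0.026) + 0.25\,\mathrm{RBN}\ (\pm 0.033)
\end{aligned}
$$
We can do better than this with other methods...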
47
47 Two More Methods of Prediction (1) Random Forest handles both selection and regression. (2a) Ant Colony Optimisation algorithm selection was used for Support Vector Machine regression. (2b) Support Vector Machine regression was repeated with “Intelligent trial and error” selection.
48
48 Random Forest: Introduction
Introduced by Breiman and Cutler (2001).
Development of Decision Trees (Recursive Partitioning):
– Dataset is partitioned into consecutively smaller subsets (of similar solubility)
– Each partition is based upon the value of one descriptor
– The descriptor used at each split is selected so as to minimise the MSE
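The splitting rule described on this slide can be illustrated with a single regression-tree split: try each descriptor and each candidate threshold, and keep the pair that gives the lowest size-weighted MSE of the two resulting subsets. This is a simplified sketch with illustrative names; it uses midpoints between sorted unique values as thresholds, which is one common choice and not necessarily what the software used in the talk does.

```python
def mse(values):
    """Mean squared deviation of values from their own mean."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def best_split(rows, targets):
    """Return (descriptor_index, threshold, weighted_mse) of the best split.

    rows is a list of descriptor vectors; targets holds the property
    values (e.g. solubility) for the same molecules.
    """
    n = len(rows)
    best = None
    for j in range(len(rows[0])):
        values = sorted(set(row[j] for row in rows))
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2.0
            left = [t for row, t in zip(rows, targets) if row[j] <= threshold]
            right = [t for row, t in zip(rows, targets) if row[j] > threshold]
            score = (len(left) * mse(left) + len(right) * mse(right)) / n
            if best is None or score < best[2]:
                best = (j, threshold, score)
    return best
```

Applying `best_split` recursively to the left and right subsets grows a full tree; a Random Forest additionally restricts each split to a random subset of descriptors.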
49
49 Random Forest: Method Random Forest is a collection of Decision Trees grown with the CART algorithm. Standard Parameters: 500 decision trees No pruning back: Minimum node size > 5 “mtry” descriptors tried at each split Important features: Incorporates descriptor selection Incorporates “Out-of-bag” validation
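The “out-of-bag” validation mentioned on this slide relies on each tree being trained on a bootstrap sample, so that the molecules left out of that sample can serve as a test set for that tree. The sketch below shows only the bookkeeping, assuming standard sampling with replacement; the function names and the seed are illustrative, and no trees are actually grown here.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement (one bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def out_of_bag_sets(n_samples, n_trees, seed=0):
    """For each tree, return (in_bag_indices, out_of_bag_indices)."""
    rng = random.Random(seed)
    result = []
    for _ in range(n_trees):
        in_bag = bootstrap_indices(n_samples, rng)
        oob = sorted(set(range(n_samples)) - set(in_bag))
        result.append((in_bag, oob))
    return result
```

An out-of-bag prediction for a molecule is then the average over only those trees whose bootstrap sample did not contain it. On average roughly a third of the data is out-of-bag for any one tree.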
50
50 Random Forest: Results
Test set: RMSE(te) = 0.69, r²(te) = 0.89, Bias(te) = -0.04
Training set: RMSE(tr) = 0.27, r²(tr) = 0.98, Bias(tr) = 0.005
Out-of-bag: RMSE(oob) = 0.68, r²(oob) = 0.90, Bias(oob) = 0.01
51
51 Support Vector Machines
[1] V. Vapnik, Estimation of Dependences Based on Empirical Data, Nauka, 1979 [in Russian].
[2] V. Vapnik, The Nature of Statistical Learning Theory, Springer-Verlag, 1995.
"In SVM regression, the input is first mapped onto a m-dimensional feature space using a fixed (non-linear) mapping, and then a linear model is constructed in this feature space. The linear model (in the feature space) is given by:" [equation not preserved in this transcript]
Parameters:
– Kernel function
– ε: "Over-fitting", "Support Vectors"
– C: cost, "Outliers"
– γ: kernel parameter
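The equation that followed the quotation is not available here. For orientation, the standard textbook form of SVM regression is given below; it is offered as an assumption about what the slide showed, not as a reconstruction of it. The symbols $g_j$ (the fixed non-linear mappings), $\omega_j$ (weights) and $b$ (offset) are the usual notation, and the radial basis function kernel in the last line is assumed only because the slide lists a single kernel parameter γ.

```latex
% Linear model in the m-dimensional feature space
f(\mathbf{x}, \boldsymbol{\omega}) = \sum_{j=1}^{m} \omega_j \, g_j(\mathbf{x}) + b

% Epsilon-insensitive loss: errors smaller than epsilon are ignored
L_{\varepsilon}\bigl(y, f(\mathbf{x}, \boldsymbol{\omega})\bigr)
  = \max\bigl(0,\; |y - f(\mathbf{x}, \boldsymbol{\omega})| - \varepsilon\bigr)

% Radial basis function kernel with width parameter gamma (assumed kernel)
K(\mathbf{x}, \mathbf{x}') = \exp\bigl(-\gamma \, \lVert \mathbf{x} - \mathbf{x}' \rVert^{2}\bigr)
```

In this formulation ε sets the width of the tube inside which errors cost nothing (and so influences how many training points become support vectors), while C sets how heavily points outside the tube are penalised.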
52
52 SVM: Descriptor Selection Stepwise selection of descriptors: “intelligent trial & error” Ant colony descriptor selection algorithm gives 20 descriptors and RMSE (test set) = 0.70 Gives five descriptor model with RMSE (test set) = 0.71
53
53 Support Vector Machines: Results
Cross-validation: RMSE(CV) = 0.71, r²(CV) = 0.88, Bias(CV) = -0.001
Test set: RMSE(test) = 0.71, r²(test) = 0.88, Bias(test) = 0.02
54
54 2D or 3D Molecular Descriptors?
No improvement from models containing 3D descriptors.
[Figure: correlation plots labelled R = 0.88, R = 1.00 (2 d.p.), R = 0.95 and R = 0.88.]
55
55 Conclusions
Two methods so far have produced good models:
a. Random Forest
b. Support Vector Machines
Accurate experimental data necessary to improve models.
Random Forest valuable for QSPR modelling.
56
56 Other work Linking Enthalpy of Sublimation (Carole) and Solubility (David) studies. Prediction of Melting Point. Chemoinformatics of prohibited substances in sport. Scoring functions for virtual screening. Repertoire of enzyme-catalysed reactions (MACiE).
58
58 Acknowledgements
People
– Pfizer: Dr Hua Gao, Dr Tony Auffret
– University of Cambridge: Prof. Robert Glen, Dr Jonathan Goodman, Dr Antonio Llinàs, Dr Noel O’Boyle
Funding
– Centre: Unilever
– David Palmer: Pfizer
– Carole Ouvrard: University of Nantes, France
59
59 Ant Colony Optimisation Algorithm (Extra slide 1)
Variable selection based on probability: [equation not preserved]
– Level of Activator Pheromone
– Level of Inhibitory Pheromone
Updating rules: [equations not preserved], where [symbol not preserved] is the increment of pheromone left on each descriptor in a given cycle.
60
60 Ant Colony Optimisation Algorithm (Extra slide 2)
Conditions attached to the equations on this slide (the equations themselves are not preserved):
– if the k-th ant selected variable i both in the current iteration and in the global best solution
– if the k-th ant selected variable i only in the current iteration
– if variable i was not selected in either the current iteration or the global best solution
– if the k-th ant did not select variable i in either the current iteration or its global best solution
– if the k-th ant did not select variable i in the current iteration
– if the k-th ant did not select variable i in its global best solution
61
61 Correlation diagram Extra slide 3
62
62 Distributions in dataset Extra slide 4
63
63 MLR Extra slide 5