Published by Cecil Gordon. Modified over 9 years ago.
1
1 Modelling in Chemistry: High and Low-Throughput Regimes Dr John Mitchell Unilever Centre for Molecular Science Informatics Department of Chemistry University of Cambridge, U.K.
7
7 We look at data, analyse data, use data to find correlations ... to develop models ... and to make (hopefully) useful predictions. Let’s look at some data ...
8
8 New York Times, 4th October 2005.
9
9 Happiness ≈ (GNP/$5000) − 1. Poor fit to linear model.
10
10 [Plot: Happiness against (GNP/$5000) − 2.] Outliers?
11
11 Fitting with a curve: reduce RMSE
12
12 Outliers? Different linear models for different regimes
13
13 Only one obvious (to me) conclusion. This area is empty: no country is both rich and unhappy. All other combinations are observed. [Plot: Happiness against (GNP/$5000) − 2.]
14
14 ... but what is the connection with chemistry?
15
15 Modelling in Chemistry. [Diagram: methods arranged on two axes, PHYSICS-BASED to EMPIRICAL and ATOMISTIC to NON-ATOMISTIC.] Methods shown: Density Functional Theory, ab initio, Molecular Dynamics, Monte Carlo, Docking, Car-Parrinello, DPD, CoMFA, 2-D QSAR/QSPR, Machine Learning, AM1, PM3 etc., Fluid Dynamics.
16
16 [Same diagram and methods, now labelled HIGH THROUGHPUT and LOW THROUGHPUT.]
17
17 [Same diagram and methods, now labelled INFORMATICS and THEORETICAL CHEMISTRY.] NO FIRM BOUNDARIES!
18
18 [Same diagram and methods, without labels.]
19
19 Theoretical Chemistry Calculations and simulations based on real physics. Calculations are either quantum mechanical or use parameters derived from quantum mechanics. Attempt to model or simulate reality. Usually Low Throughput.
20
20 Informatics and Empirical Models. In general, Informatics methods represent phenomena mathematically, but not in a physics-based way. Inputs and outputs are related by an empirically parameterised equation or a more elaborate mathematical model. They do not attempt to simulate reality. Usually High Throughput.
21
21 QSPR: Quantitative Structure Property Relationship. Physical property related to more than one other variable. Hansch et al. developed QSPR in the 1960s, building on Hammett (1930s). Property-property relationships date from the 1860s. General form (for non-linear relationships): y = f(descriptors).
22
22 QSPR: $Y = f(X_1, X_2, \ldots, X_N)$. Optimisation of $Y = f(X_1, X_2, \ldots, X_N)$ is called regression. The model is optimised upon N “training molecules” and then tested upon M “test” molecules.
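The train-then-test procedure on this slide can be sketched for the simplest case of a single descriptor fitted by ordinary least squares. This is only an illustration: the data values are made up, and the names `fit_line` and `rmse` are hypothetical helpers, not anything from the talk.

```python
from math import sqrt

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def rmse(xs, ys, slope, intercept):
    """Root-mean-square error of the fitted line on the given points."""
    squared = [(slope * x + intercept - y) ** 2 for x, y in zip(xs, ys)]
    return sqrt(sum(squared) / len(squared))

# Synthetic, made-up data (not from the talk): one descriptor per molecule.
train_x = [1.0, 2.0, 3.0, 4.0, 5.0]   # "training molecules"
train_y = [3.1, 4.9, 7.2, 8.8, 11.0]
test_x = [6.0, 7.0]                   # "test molecules", never used in the fit
test_y = [13.1, 14.8]

slope, intercept = fit_line(train_x, train_y)
test_rmse = rmse(test_x, test_y, slope, intercept)
print(f"slope={slope:.3f} intercept={intercept:.3f} test RMSE={test_rmse:.3f}")
```

The key design point is that the test molecules play no part in the optimisation, so the test RMSE estimates predictive rather than fitted accuracy.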
23
23 QSPR Quality of the model is judged by three parameters:
24
24 QSPR Different methods for carrying out regression: LINEAR - Multi-linear Regression (MLR), Partial Least Squares (PLS), Principal Component Regression (PCR), etc. NON-LINEAR - Random Forest, Support Vector Machines (SVM), Artificial Neural Networks (ANN), etc.
25
25 QSPR: However, this does not guarantee a good predictive model ...
26
26 QSPR: Problems with experimental error. A QSPR is only as accurate as the data it is trained upon. Therefore, we need accurate experimental data.
27
27 QSPR Problems with “chemical space”. “Sample” molecules must be representative of “Population”. Prediction results will be most accurate for molecules similar to training set. Global or Local models?
28
28 Relationship of Chemical Structure With Lattice Energy Can we predict lattice energy from 2D molecular structure? Dr Carole Ouvrard & Dr John Mitchell Unilever Centre for Molecular Informatics University of Cambridge C Ouvrard & JBO Mitchell, Acta Cryst. B 59, 676-685 (2003)
29
29 Why Do We Need a Predictive Model? Existing techniques from Theoretical Chemistry can give us accurate sublimation and lattice energies ... but only in very low throughput.
30
30 Why Do We Need a Predictive Model? A predictive model for sublimation energies will allow us to estimate accurately the cohesive energies of crystalline materials: from 2-D molecular structure only; without knowing the crystal packing; without expensive theoretical calculations. Should help predict solubility.
31
31 Why Do We Think it Will Work? Accurately calculated lattice energies are usually very similar for many different possible crystal packings of a molecule. Many molecules have a plurality of different experimentally observable polymorphs. We hypothesise that, to a good approximation, cohesive energy depends only on 2-D structure.
32
32 [Plot: Lattice Energy (kJ/mol) against Density (g/cc) for candidate crystal packings, with symbols distinguishing the space groups P-1, P21/c, P212121 and P21. The Calculated Lowest Energy Structure and the Experimental Crystal Structure are marked.]
33
33 Expression for the Lattice Energy: $U_{\text{crystal}} = U_{\text{molecule}} + U_{\text{lattice}}$. Theoretical lattice energy: crystal binding = cohesive energy. Experimental lattice energy is related to $-\Delta H_{\text{sublimation}}$: $\Delta H_{\text{sublimation}} = -U_{\text{lattice}} - 2RT$ (Gavezzotti & Filippini).
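As a worked instance of the relation ΔH_sublimation = −U_lattice − 2RT, the short sketch below evaluates it numerically. The lattice energy of −100 kJ/mol and the temperature of 298.15 K are assumed illustrative values, not data from the talk, and `sublimation_enthalpy` is a hypothetical helper name.

```python
R_KJ = 8.314462618e-3  # gas constant in kJ mol^-1 K^-1

def sublimation_enthalpy(u_lattice_kj, temperature_k=298.15):
    """Return dH_sub (kJ/mol) from U_lattice (kJ/mol): dH_sub = -U_lattice - 2RT."""
    return -u_lattice_kj - 2.0 * R_KJ * temperature_k

# Assumed example: a lattice energy of -100 kJ/mol at room temperature.
dh_sub = sublimation_enthalpy(-100.0)
print(f"2RT = {2.0 * R_KJ * 298.15:.3f} kJ/mol, dH_sub = {dh_sub:.3f} kJ/mol")
```

At room temperature the 2RT term is about 5 kJ/mol, so the sublimation enthalpy and the magnitude of the lattice energy differ only slightly.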
34
34 Partitioning of the Lattice Energy: $U_{\text{crystal}} = U_{\text{molecule}} + U_{\text{lattice}}$; $\Delta H_{\text{sublimation}} = -U_{\text{lattice}} - 2RT$. Partitioning the lattice energy in terms of structural contributions. Choice of the significant parameters:
– Number of atoms of each type?
– Number of rings, aromatics?
– Number of bonds of each type?
– Symmetry?
– Hydrogen bond donors and acceptors? Intramolecular?
We choose counts of atom type occurrences.
35
35 Analysis of the Sublimation Energy Data
Experimental data: $\Delta H_{\text{sublimation}}$, taken from NIST (National Institute of Standards and Technology, USA) and the scientific literature.
Atom types:
– SATIS codes: 10-digit connectivity code + bond types
– Each 2-digit code = atomic number
Atom type    SATIS code
HN           01 07 99 99 99
HO           01 08 99 99 99
O=C          08 06 99 99 99
-O-          08 06 06 99 99
Typically, several similar SATIS codes are grouped to define an atom type.
Statistical analysis: Multi-Linear Regression Analysis relating $\Delta H_{\text{sub}}$ to the number of atoms of each type.
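A minimal sketch of reading the example codes above: it splits a 10-digit code into five 2-digit atomic numbers. It assumes, from the slide's examples only, that the first pair is the atom itself, the remaining pairs are its neighbours, and 99 marks an unused slot. The function name `decode_satis` and the small symbol table are hypothetical.

```python
# Only the elements needed for the slide's examples.
SYMBOLS = {1: "H", 6: "C", 7: "N", 8: "O"}
PAD = 99  # assumption: 99 means "no neighbour in this slot"

def decode_satis(code):
    """Split a 10-digit code into (central atom, list of neighbour atoms)."""
    digits = code.replace(" ", "")
    if len(digits) != 10 or not digits.isdigit():
        raise ValueError("expected a 10-digit code")
    numbers = [int(digits[i:i + 2]) for i in range(0, 10, 2)]
    centre = SYMBOLS.get(numbers[0], str(numbers[0]))
    neighbours = [SYMBOLS.get(z, str(z)) for z in numbers[1:] if z != PAD]
    return centre, neighbours

for label, code in [("HN", "01 07 99 99 99"), ("HO", "01 08 99 99 99"),
                    ("O=C", "08 06 99 99 99"), ("-O-", "08 06 06 99 99")]:
    print(label, decode_satis(code))
```

The bond-type part of the code mentioned on the slide is not modelled here.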
36
36 Training Dataset of Model Molecules: 226 organic compounds
Class                      Count   Cumulative total
Linear alkanes             19      19
Branched alkanes           14      33
Aromatics                  17      50
Other non-H-bonders        106     156
H-bond formers             70      226
Non-specific interacting:
– Hydrocarbons
– Nitrogen compounds
– Nitro-, C≡N, halogens
– S, Se substituents
– Pyridine
Potential hydrogen bonding interactions:
– Amides
– Carboxylic acids
– Amino acids ...
37
37 Study of Non-specific Interactions: Linear Alkanes. 19 compounds: $\mathrm{CH_4}$ to $\mathrm{C_{20}H_{42}}$. Limit for van der Waals interactions. $\Delta H_{\text{sub}} = 7.955\,C - 2.714$; $r^2 = 0.977$, $s = 7.096$ kJ/mol. [Plot with series labelled BPt and $\Delta H_{\text{sub}}$.] Note odd-even variation in $\Delta H_{\text{sub}}$ for this series. Enthalpy of sublimation correlates with molecular size. Since linear alkanes interact non-specifically and without significant steric effects, this establishes a baseline for the analysis of more complex systems.
38
38 Include Branched Alkanes. Add 14 branched alkanes to the dataset. The graph highlights the reduction of sublimation enthalpy due to bulky substituents. 33 compounds: $\mathrm{CH_4}$ to $\mathrm{C_{20}H_{42}}$. $\Delta H_{\text{sub}} = 7.724\,C_{\text{nonbranched}} + 3.703$; $r^2 = 0.959$, $s = 8.117$ kJ/mol. If we also include the parameters for branched carbons, $C_3$ and $C_4$, the model doesn’t improve.
39
39 All Hydrocarbons: Include Aromatics. Add 17 aromatics to the dataset (note: we have no alkenes or alkynes). 50 compounds. $\Delta H_{\text{sub}} = 7.680\,C_{\text{nonbranched}} + 6.185\,C_{\text{aromatic}} + 4.162$; $r^2 = 0.958$, $s = 7.478$ kJ/mol. $C_{\text{nonbranched}}$ is labelled aliphatic. As before, if we also include the parameters for branched carbons, $C_3$ and $C_4$, the model doesn’t improve.
40
40 All Non-Hydrogen-Bonded Molecules: Add 106 non-hydrocarbons to the dataset. Include elements H, C, N, O, F, S, Cl, Br & I. 156 compounds. $\Delta H_{\text{sub}}$ predicted by 16-parameter model; $r^2 = 0.896$, $s = 9.976$ kJ/mol. Parameters in model are counts of atom type occurrences.
41
41 General Predictive Model. Add 70 hydrogen bond forming molecules to the dataset. 226 compounds. $\Delta H_{\text{sub}}$ predicted by 19-parameter model; $r^2 = 0.925$, $s = 9.579$ kJ/mol. Parameters in model are counts of atom type occurrences.
42
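42 Predictive Model Determined by MLRA
$$\Delta H_{\text{sublimation}}\ (\text{kJ mol}^{-1}) = 6.942 + 20.141\,\mathrm{HN} + 30.172\,\mathrm{HO} + 3.127\,\mathrm{F} + 10.456\,\mathrm{Cl} + 12.926\,\mathrm{Br} + 19.763\,\mathrm{I} + 3.297\,C_3 - 3.305\,C_4 + 5.970\,C_{\text{aromatic}} + 7.631\,C_{\text{nonbranched}} + 7.341\,\mathrm{CO} + 19.676\,\mathrm{CS} + 11.415\,N_{\text{nitrile}} + 8.953\,N_{\text{nonnitrile}} + 8.466\,\mathrm{NO} + 18.249\,O_{\text{ether}} + 20.585\,\mathrm{SO} + 12.840\,S_{\text{thioether}}$$
$C_{\text{nonbranched}}$ is labelled aliphatic. All these parameters are significantly larger than their standard errors.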
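The equation above is a weighted sum of atom-type counts, which the following sketch evaluates. The coefficients are copied from the slide; the dictionary keys, the function name `predict_sublimation_enthalpy`, and the example counts are illustrative assumptions, since the slide does not say how a given molecule's atoms are assigned to types.

```python
INTERCEPT = 6.942  # kJ/mol

# Coefficients (kJ/mol per atom-type occurrence) as listed on the slide.
COEFFICIENTS = {
    "HN": 20.141, "HO": 30.172,
    "F": 3.127, "Cl": 10.456, "Br": 12.926, "I": 19.763,
    "C3": 3.297, "C4": -3.305,
    "C_aromatic": 5.970, "C_nonbranched": 7.631,
    "CO": 7.341, "CS": 19.676,
    "N_nitrile": 11.415, "N_nonnitrile": 8.953, "NO": 8.466,
    "O_ether": 18.249, "SO": 20.585, "S_thioether": 12.840,
}

def predict_sublimation_enthalpy(counts):
    """Return the predicted dH_sub (kJ/mol) for a dict of atom-type counts."""
    unknown = set(counts) - set(COEFFICIENTS)
    if unknown:
        raise KeyError(f"unknown atom types: {sorted(unknown)}")
    return INTERCEPT + sum(COEFFICIENTS[t] * n for t, n in counts.items())

# Hypothetical counts: six aromatic carbons and no other counted atom types.
example = predict_sublimation_enthalpy({"C_aromatic": 6})
print(f"{example:.3f} kJ/mol")
```

Rejecting unknown atom types, rather than silently ignoring them, keeps a misspelled key from producing a plausible but wrong prediction.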
43
43 Distribution of Residuals The distribution of the residuals between calculated and experimental data follows an approximately normal distribution, as expected.
44
44 Validation on an Independent Test Set. 35 diverse compounds: $r^2 = 0.928$, $s = 7.420$ kJ/mol. Nitro-compounds are often outliers. Very encouraging result: accurate prediction possible.
45
45 Major Conclusion Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing!
46
46 Conclusions We have determined a general equation allowing us to estimate the sublimation enthalpy for a large range of organic compounds with an estimated error of 9 kJ/mol. A very simple model (counts of atom types) gives a good prediction of lattice & sublimation energies. Lattice energy can be predicted from 2D structure, without knowing the details of the crystal packing. Avoids need for expensive calculations. May help predict solubility. Model gives good chemical insight.
47
47 Solubility is an important issue in drug discovery and a major source of attrition. This is expensive for the industry. A good model for predicting the solubility of druglike molecules would be very valuable.
48
48 Drug Disc. Today, 10 (4), 289 (2005). Cohesive interactions in the lattice reduce solubility. Predicting lattice (or almost equivalently sublimation) energy should help predict solubility.
49
49 Classifying the WADA 2005 Prohibited List Using CDK & Unity Fingerprints www-mitchell.ch.cam.ac.uk/ jbom1@cam.ac.uk Ed Cannon, Andreas Bender, David Palmer & John Mitchell, J. Chem. Inf. and Model., 46, 2369-2380 (2006)
50
50 Classifying the WADA Prohibited List Aims & Background. Methods. Data. Results. Conclusions.
51
51 Aims & Background
52
52 Aims & Background. Much drug abuse in sport involves novel compounds such as the “designer steroid” THG. [Structure shown: tetrahydrogestrinone (THG).]
53
53 Aims & Background Hence the World Anti-Doping Agency (WADA) prohibits classes of bioactivity as well as specific molecules. Analogues are prohibited using the “similar chemical structure or similar biological effect(s)” criterion.
54
54 WADA Prohibited Classes: Anabolic Agents (S1); Hormones and Related Substances (S2); Beta-2-agonists (S3); Anti-estrogenic Agents (S4); Diuretics and Masking Agents (S5); Stimulants (S6); Narcotics (S7); Cannabinoids (S8); Glucocorticoids (S9); Alcohol (P1); Beta Blockers (P2).
55
55 Predicting Bioactivities We seek to predict whether a molecule exhibits one of these bioactivities. Such a classifier would be powerful as an in silico pre-filter for experimental methods such as assays.
56
56 Methods
57
57 Chemical Space Use descriptor-based fingerprints to locate molecules in chemical space. Similar Property Principle suggests molecules close together in chemical space often share common bioactivity.
58
58 Machine Learning Use Machine Learning classification algorithms to predict bioactivity from location of molecules in chemical space. Random Forest. k-Nearest Neighbours.
59
59 Fingerprints CDK (Chemistry Development Kit) fingerprint. Unity 2D. MACCS key. MOE 2D (2004). Typed Atom Distance. Typed Graph Distance.
60
60 CDK Fingerprint CDK fingerprint resembles Daylight. All bond paths up to a length of 6 are generated. A hashing function is used to map these paths onto a fingerprint of 1024 bits.
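The path-and-hash idea described on this slide can be sketched as follows. This is not the CDK's implementation: the hash used here (MD5 from the standard library), the path string format, and the omission of bond types are all simplifying assumptions, and the function names are hypothetical.

```python
import hashlib

def paths_up_to(atoms, bonds, max_bonds=6):
    """Enumerate linear atom paths containing at most max_bonds bonds."""
    neighbours = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)
    found = set()

    def extend(path):
        labels = [atoms[i] for i in path]
        # A path and its reverse describe the same feature: keep one spelling.
        found.add(min("-".join(labels), "-".join(reversed(labels))))
        if len(path) - 1 == max_bonds:
            return
        for nxt in neighbours[path[-1]]:
            if nxt not in path:          # no atom is visited twice
                extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return found

def path_fingerprint(atoms, bonds, n_bits=1024, max_bonds=6):
    """Hash every path onto one position of an n_bits-long bit list."""
    bits = [0] * n_bits
    for path in paths_up_to(atoms, bonds, max_bonds):
        digest = hashlib.md5(path.encode()).digest()
        bits[int.from_bytes(digest[:4], "big") % n_bits] = 1
    return bits

# Heavy atoms of ethanol as a tiny example graph: C-C-O.
atoms = ["C", "C", "O"]
bonds = [(0, 1), (1, 2)]
ethanol_paths = paths_up_to(atoms, bonds)
ethanol_fp = path_fingerprint(atoms, bonds)
print(sorted(ethanol_paths), sum(ethanol_fp))
```

Because different paths can hash to the same position, the number of set bits may be smaller than the number of distinct paths.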
61
61 Unity 2D Fingerprint. Unity is similar to CDK, but based on substructures rather than just paths. Substructures present in the molecule are enumerated. A hashing function is used to map these substructures onto a fingerprint of 992 bits.
62
62 Classification Algorithms Random Forest (RF). k-Nearest Neighbours (k-NN).
63
63 Random Forest. Decision tree based learner. Each tree is built on a bootstrap sample of the data. Parameters: number of trees in the forest (ntree) and number of descriptors tried at each node (mtry). Each tree predicts the label of a molecule. Majority vote = class label of molecule.
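The ingredients listed on this slide (bootstrap samples, a random subset of mtry descriptors, ntree trees, majority vote) can be illustrated with a deliberately tiny sketch. It uses one-split trees (stumps) instead of full decision trees, and the data, function names, and seed are made-up assumptions, so it is a toy rather than a real Random Forest implementation.

```python
import random
from collections import Counter

def train_stump(rows, labels, features):
    """Pick the (feature, threshold) split with the fewest training errors."""
    best = None
    for f in features:
        for threshold in sorted({r[f] for r in rows}):
            left = [l for r, l in zip(rows, labels) if r[f] <= threshold]
            right = [l for r, l in zip(rows, labels) if r[f] > threshold]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(l != left_label for l in left)
                      + sum(l != right_label for l in right))
            if best is None or errors < best[0]:
                best = (errors, f, threshold, left_label, right_label)
    return best[1:]

def train_forest(rows, labels, ntree=100, mtry=1, seed=0):
    """Train ntree stumps, each on a bootstrap sample and mtry random descriptors."""
    rng = random.Random(seed)
    n, n_features = len(rows), len(rows[0])
    forest = []
    for _ in range(ntree):
        sample = [rng.randrange(n) for _ in range(n)]   # bootstrap: with replacement
        features = rng.sample(range(n_features), mtry)  # descriptors tried at the node
        forest.append(train_stump([rows[i] for i in sample],
                                  [labels[i] for i in sample], features))
    return forest

def predict(forest, row):
    """Each tree votes for a label; the majority vote is the prediction."""
    votes = Counter(left if row[f] <= t else right for f, t, left, right in forest)
    return votes.most_common(1)[0][0]

# Made-up two-descriptor data with two well-separated classes.
rows = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.3), (0.8, 0.9), (0.9, 0.8), (0.7, 0.7)]
labels = ["inactive"] * 3 + ["active"] * 3
forest = train_forest(rows, labels, ntree=100, mtry=1)
print(predict(forest, (0.05, 0.05)), predict(forest, (0.95, 0.95)))
```

Individual stumps can be wrong, for example when a bootstrap sample happens to contain only one class, which is why the vote over many trees matters.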
64
64 Random Forest. [Diagram: a decision tree whose nodes branch on A > x1 or A < x1, B > x2 or B < x2, and C > x3 or C < x3, ending in Yes/No decisions.] A Random Forest contains many such trees.
66
66 k-Nearest Neighbours Instance based learner. Take a query point x and find the closest k points from the training set to it using Euclidean distance in descriptor space. k is a variable describing the number of neighbours to be considered. Class of x determined by majority vote of class labels of k neighbours. Ties broken randomly (only occurs for even k).
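The procedure on this slide can be sketched directly: sort training points by Euclidean distance to the query, take the k nearest, and vote, breaking ties at random. The training points and the function name `knn_predict` are made up for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train_points, train_labels, query, k=1, rng=random):
    """Classify query by majority vote among its k nearest training points."""
    order = sorted(range(len(train_points)),
                   key=lambda i: math.dist(train_points[i], query))
    votes = Counter(train_labels[i] for i in order[:k])
    top = max(votes.values())
    winners = sorted(label for label, count in votes.items() if count == top)
    # A tie between labels is broken randomly, as described on the slide.
    return winners[0] if len(winners) == 1 else rng.choice(winners)

# Made-up points in a two-dimensional descriptor space.
points = [(0, 0), (0, 1), (1, 0), (5, 5), (5, 6), (6, 5)]
labels = ["A", "A", "A", "B", "B", "B"]
print(knn_predict(points, labels, (0.2, 0.2), k=3))
print(knn_predict(points, labels, (5.5, 5.5), k=1))
```

With two classes, a tie can only arise when k is even, which matches the slide's remark.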
67
67 k-Nearest Neighbours
69
69 k-Nearest Neighbours Local method. Uses only a very small number of near neighbours to make its prediction. Suitable for predicting activity classes with multiple clusters in chemical space. Therefore good for WADA classes with multiple receptors.
70
70 Performance Measure: Matthews Correlation Coefficient (MCC). Range: −1 ≤ MCC ≤ 1. Balance between predicting positives & negatives.
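The formula itself does not survive in this transcript. The standard definition from a confusion matrix is MCC = (TP·TN − FP·FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN)), and the sketch below computes it. Returning 0 when the denominator is zero is a common convention adopted here as an assumption, and the function name `mcc` is illustrative.

```python
from math import sqrt

def mcc(tp, tn, fp, fn):
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator

print(mcc(50, 50, 0, 0))    # every prediction correct
print(mcc(0, 0, 50, 50))    # every prediction wrong
print(mcc(25, 25, 25, 25))  # no better than chance
```

Because all four counts enter the formula, a classifier cannot score well by predicting only the majority class.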
71
71 Data
72
72 The Dataset 5245 molecules (5235 for CDK). Molecules taken from WADA banned list and from corresponding activity classes in MDDR. 367 explicitly allowed substances.
73
73 Data by Class
WADA Class   Number of Molecules
S1           47
S2           272
S3           367
S4           928
S5           1000
S6           804
S7           195
S8           1000
S9           26
P2           239
Allowed      367
74
74 Fivefold Cross-validation. We test for membership of each prohibited class separately. All calculations use 5-fold cross-validation: 80% of the molecules form the training set and 20% the test set, repeated 5 times so that each molecule is in exactly 1 test set.
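A minimal sketch of such a split is given below: the indices are shuffled once, dealt into five folds, and each fold serves as the test set exactly once. The function name `five_fold_splits` and the seed are assumptions; the dataset size of 5245 is taken from the earlier slide.

```python
import random

def five_fold_splits(n_items, n_folds=5, seed=0):
    """Yield (train, test) index lists; every index is tested exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

splits = list(five_fold_splits(5245))
for train, test in splits:
    print(len(train), len(test))
```

With 5245 molecules each test fold holds 1049 molecules and each training set 4196, which is the 80%/20% division described above.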
75
75 False Positives False Positives arise in two ways: (1) A molecule predicted positive on an incorrect activity class; (2) An explicitly allowed molecule predicted positive.
76
76 Results
77
77 Results: Random Forest Aggregated over 10 classes
78
78 Unity ≈ CDK > MACCS > others.
79
79 100 trees sufficient; little improvement with more.
80
80 Results: k-Nearest Neighbours Aggregated over 10 classes
82
82 Unity ≈ CDK > MACCS > others.
83
83 k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k.
84
84 k = 1 best; poor performance at k = 2 due to ties. MCC falls off with increasing k. Unity ≈ CDK.
85
85 Results: Comparison, Recall v Precision. Aggregated over 10 classes. [Plot panels: Recall, Precision.]
86
86 RF gives higher precision, k-NN higher recall.
87
87 Results: Comparison Analysed by class
88
88 Classes vary in difficulty of prediction; independent of classification algorithm.
89
89 Conclusions
90
90 Major Conclusion Can use Informatics to predict whether or not a molecule exhibits a prohibited bioactivity.
91
91 Conclusions Can successfully predict active molecules (MCC ≈ 0.83). Unity ≈ CDK > MACCS > others. RF & k-NN give similar MCC. k-NN higher recall. RF higher precision; RF less likely to find false positives.
92
92 Conclusions RF results vary little with ntree. k-NN results best for k = 1. Performance decreases at higher k. Odd k avoids problems with ties (k = 2 is worse than k = 3). Activity classes show consistent prediction difficulty pattern.
93
93 www-mitchell.ch.cam.ac.uk/ jbom1@cam.ac.uk
94
94 Acknowledgements: People Carole Ouvrard, Ed Cannon, David Palmer, Florian Nigsch, Chrysi Kirtay, Laura Hughes, Jo Bailey, Noel O’Boyle, Daniel Almonacid, Gemma Holliday, Jen Ryder, Dushy Puvanendrampillai, Andreas Bender.
95
95 A¢know£€dg€m€nt$: Funding Unilever
96
96 No significant correlation overall; though smallest class S9 is hardest to predict.
98
98 [Structures shown: tetrahydrogestrinone (THG), gestrinone, trenbolone.]