Published by Shanna Hutchinson. Modified over 9 years ago.
1
QSAR/QSPR modeling
Quantitative Structure-Activity Relationships (QSAR) and Quantitative Structure-Property Relationships (QSPR)
Alexandre Varnek, Faculté de Chimie, ULP, Strasbourg, France
2
QSAR/QSPR models: Development, Validation, Application
3
Development of QSAR models
- Selection and curation of experimental data
- Preparation of training and test sets (optional)
- Selection of an initial set of descriptors and their normalisation
- Variable selection
- Selection of a machine-learning method
Validation of models
- Training/test set
- Cross-validation: internal, external
Application of the models
- Models' applicability domain
4
Development of QSAR models: experimental data, descriptors, mathematical techniques, statistical criteria
5
Preparation of training and test sets
1. Splitting of an initial data set into a training set and a test set (the test set holds 10–15 % of the data)
2. Building of structure-property models on the training set
3. Selection of the best models according to statistical criteria
4. "Prediction" calculations on the test set using the best structure-property models
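The splitting step can be illustrated with a short Python sketch. It assumes a purely random split and a 15 % test fraction. The function name `split_train_test` and the compound labels are hypothetical, and the slides do not say how the split was actually done.

```python
import random

def split_train_test(compounds, test_fraction=0.15, seed=0):
    """Randomly split a list of compounds into (training set, test set)."""
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set of 100 compound identifiers
compounds = [f"cmpd_{i}" for i in range(100)]
train, test = split_train_test(compounds)
print(len(train), len(test))
```

A random split ignores the recommendations on activity range and class balance; a stratified or activity-sorted split would be needed to respect them.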
6
Recommendations to prepare a test set
(i) experimental methods for determination of activities in the training and test sets should be similar;
(ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%;
(iii) the balance between active and inactive compounds should be respected for uniform sampling of the data.
Reference: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215
7
Selection of descriptors for a QSAR model
QSAR models should be reduced to a set of descriptors that is as information-rich but as small as possible.
Rules of thumb: good "spread", 5-6 structure points per descriptor.
Objective selection (independent variable only)
- Statistical criteria of correlations
- Pairwise selection (forward or backward stepwise selection)
- Principal Component Analysis
- Partial Least Squares analysis
- Genetic Algorithm
- …
Subjective selection
- Descriptor selection based on mechanistic studies
8
Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs)
1. identify a subset of columns (variables) with significant correlation to the response;
2. remove columns (variables) with small variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.
D. C. Whitley, M. G. Ford, D. J. Livingstone, J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168
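Steps 2 and 3 of this strategy can be sketched in Python as follows. This is not the algorithm of Whitley et al.: the thresholds `min_variance` and `max_corr`, the function names and the toy descriptor table are all assumptions chosen for illustration.

```python
from statistics import mean, pvariance

def pearson(u, v):
    """Pearson correlation coefficient of two equally long sequences."""
    mu, mv = mean(u), mean(v)
    num = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    den = (sum((a - mu) ** 2 for a in u) * sum((b - mv) ** 2 for b in v)) ** 0.5
    return num / den if den else 0.0

def preprocess(columns, min_variance=1e-3, max_corr=0.95):
    """columns: dict mapping descriptor name -> list of values.
    Returns the names of the descriptors that are kept."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # step 2: (nearly) constant column
        if any(abs(pearson(values, columns[k])) > max_corr for k in kept):
            continue  # step 3: duplicates information already kept
        kept.append(name)
    return kept

# Hypothetical descriptor table for five compounds
columns = {
    "n_C": [1, 2, 3, 4, 5],
    "const": [1, 1, 1, 1, 1],
    "twice_n_C": [2, 4, 6, 8, 10],
    "n_O": [0, 1, 0, 2, 1],
}
print(preprocess(columns))
```

Here `const` is dropped for its zero variance and `twice_n_C` because it is perfectly correlated with `n_C`.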
9
Machine-Learning Methods
10
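Fitting the models' parameters
$Y = F(a_i, X_i)$
$X_i$: descriptors (independent variables)
$a_i$: fitted parameters
The goal is to minimize the Residual Sum of Squares (RSS): $RSS = \sum_i (Y_i - \hat{Y}_i)^2$, where $\hat{Y}_i$ is the value calculated by the model for compound i.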
11
Multiple Linear Regression

Activity   Descriptor
Y_1        X_1
Y_2        X_2
…          …
Y_n        X_n

$Y_i = a_0 + a_1 X_{i1}$
12
Multiple Linear Regression
[Figure: regression line y = ax + b with slope a and intercept b; the fit minimizes the Residual Sum of Squares (RSS)]
13
Multiple Linear Regression

Activity   Descr 1   Descr 2   …   Descr m
Y_1        X_11      X_12      …   X_1m
Y_2        X_21      X_22      …   X_2m
…          …         …         …   …
Y_n        X_n1      X_n2      …   X_nm

$Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_m X_{im}$
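As an illustration of how the coefficients $a_0 \dots a_m$ can be obtained, the sketch below solves the least-squares normal equations in plain Python. The helper names `fit_mlr` and `predict_mlr` and the toy data are assumptions, and a real study would normally rely on a numerical library rather than hand-written elimination.

```python
def fit_mlr(X, y):
    """Least-squares fit of y_i = a_0 + sum_j a_j * X[i][j].
    Returns the coefficients [a_0, a_1, ..., a_m]."""
    rows = [[1.0] + [float(v) for v in r] for r in X]   # prepend intercept column
    p = len(rows[0])
    # Augmented matrix of the normal equations (A^T A) a = A^T y
    M = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(p):
        piv = max(range(c, p), key=lambda row: abs(M[row][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(p):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [M[i][p] / M[i][i] for i in range(p)]

def predict_mlr(coeffs, x):
    """Evaluate the fitted linear equation for one descriptor vector x."""
    return coeffs[0] + sum(a * v for a, v in zip(coeffs[1:], x))

# Toy data generated from Y = 1 + 2*X1 - 0.5*X2 (no noise)
X = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in X]
coeffs = fit_mlr(X, y)
print([round(c, 6) for c in coeffs])
```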
14
kNN (k Nearest Neighbors)
The activity Y of a compound is assessed as a weighted mean of the activities $Y_i$ of its k nearest neighbors in the chemical space (A. Tropsha, A. Golbraikh, 2003).
[Figure: training set compounds in the Descriptor 1 – Descriptor 2 plane]
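A minimal Python sketch of such a prediction is given below. The slide does not specify the weighting scheme or the distance metric, so inverse-distance weights and Euclidean distance are assumed here, and the function name `knn_predict` and the data are hypothetical.

```python
import math

def knn_predict(x, train_X, train_y, k=3):
    """Predict Y for descriptor vector x as the inverse-distance weighted
    mean of the activities of its k nearest training compounds."""
    neighbors = sorted((math.dist(x, xi), yi)
                       for xi, yi in zip(train_X, train_y))[:k]
    for d, yi in neighbors:
        if d == 0.0:
            return yi          # x coincides with a training compound
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * yi for w, (_, yi) in zip(weights, neighbors)) / sum(weights)

# Hypothetical training set in a two-descriptor space
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
print(knn_predict((0.5, 0.5), train_X, train_y, k=3))
```

The query point is equidistant from its three nearest neighbors, so the prediction reduces to their plain mean.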
15
Biological and Artificial Neuron
16
Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer – to properties being predicted, neurons in the hidden layer – to nonlinear latent variables
17
QSAR/QSPR models: Development, Validation, Application
18
Validating the QSAR Equation
How well does the model predict the activity of known compounds?
$r^2$ is the fraction of the total variation in the dependent variable that is explained by the regression equation.
For a perfect model:
- All data points would reside on the diagonal of the actual vs. predicted plot.
- All variance existing in the original data is explained by the model.
19
Calculating $r^2$
Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around the regression line)
20
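Calculating $r^2$
With $y_i$ the observed values, $\hat{y}_i$ the values calculated by the regression equation and $\bar{y}$ the mean of the observed values:
Original variance: $TSS = \sum_i (y_i - \bar{y})^2$
Explained variance: $ESS = \sum_i (\hat{y}_i - \bar{y})^2$ (improvement in predicting y from just using the mean of y)
Variance around the regression line: $RSS = \sum_i (y_i - \hat{y}_i)^2$
$r^2 = ESS / TSS = 1 - RSS / TSS$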
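The decomposition can be checked numerically with a small Python sketch. The data are invented, the names `fit_line` and `variance_decomposition` are hypothetical, and the symbols TSS, ESS and RSS stand for the original, explained and residual sums of squares. The identity TSS = ESS + RSS holds here because the calculated values come from a least-squares fit with an intercept.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b of y = a*x + b."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def variance_decomposition(y_obs, y_calc):
    """Return (TSS, ESS, RSS) for observed and calculated values."""
    y_mean = sum(y_obs) / len(y_obs)
    tss = sum((y - y_mean) ** 2 for y in y_obs)
    ess = sum((yc - y_mean) ** 2 for yc in y_calc)
    rss = sum((y - yc) ** 2 for y, yc in zip(y_obs, y_calc))
    return tss, ess, rss

# Invented one-descriptor data set
xs = [1.0, 2.0, 3.0, 4.0]
ys = [1.1, 1.9, 3.2, 3.8]
a, b = fit_line(xs, ys)
tss, ess, rss = variance_decomposition(ys, [a * x + b for x in xs])
r2 = ess / tss
print(round(r2, 4))
```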
21
F-test
Tests the assumption that a significant portion of the original variance has been explained by the model.
In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points) differs significantly from 0:
$F = \frac{ESS / k}{RSS / (N - k - 1)}$
A ratio of 0 implies that ESS = 0, i.e., the model did not explain any of the variance.
22
F-distribution
As the number of data points N decreases, the probability of getting large $r^2$ values purely by chance increases. The critical F-value grows as N and k decrease, so a larger F-value is then required for the test to be significant.
23
Calculating F Values
1. Calculate F from $r^2$, N and k: $F = \frac{r^2 (N - k - 1)}{k (1 - r^2)}$.
2. Select a significance level (e.g., 0.05).
3. Look up the F-value from an F-distribution derived for the correct N and k at the selected significance level.
4. If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level.
Example: $r^2$ = 0.89, N = 7, k = 1, F = 40.46.
For an F-distribution with N = 7, k = 1 (1 and 5 degrees of freedom), a value of 40.46 corresponds to a confidence level of about 0.9986. Thus, the equation is significant at this level: the probability that the correlation is fortuitous is about 0.14 %.
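The worked example can be reproduced with a few lines of Python. This sketch only computes F from $r^2$, N and k; the function name `f_value` is hypothetical, and the table look-up of the critical value is not implemented.

```python
def f_value(r2, n, k):
    """F statistic for a regression with k parameters fitted on n data points."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

F = f_value(0.89, 7, 1)
print(round(F, 2))
```

With the rounded $r^2$ = 0.89 this gives F ≈ 40.45, in agreement with the value 40.46 quoted in the example up to the rounding of $r^2$.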
24
Validation of Models
5-fold external cross-validation procedure
25
Cross Validation
A measure of the predictive ability of the model (as opposed to the measure of fit produced by $r^2$).
$r^2$ always increases as more descriptors are added.
$Q^2$ initially increases as more parameters are added but then starts to decrease, indicating overfitting of the data.
Thus $Q^2$ is a better indicator of the model quality.
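A 5-fold cross-validation can be sketched in Python as below. The slides give no formula for $Q^2$, so the common definition $Q^2 = 1 - PRESS/TSS$ is assumed, where PRESS is the sum of squared errors of the predictions made for left-out compounds. The one-descriptor model, the interleaved fold assignment, the function names and the data are likewise illustrative assumptions.

```python
def fit_line(xs, ys):
    """Least-squares slope a and intercept b of y = a*x + b."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return a, my - a * mx

def q2_kfold(xs, ys, n_folds=5):
    """Cross-validated Q^2 = 1 - PRESS/TSS for a one-descriptor linear model."""
    n = len(xs)
    press = 0.0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))      # every n_folds-th compound
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        a, b = fit_line(train_x, train_y)            # model never sees the fold
        press += sum((ys[i] - (a * xs[i] + b)) ** 2 for i in test_idx)
    y_mean = sum(ys) / n
    tss = sum((y - y_mean) ** 2 for y in ys)
    return 1.0 - press / tss

# Invented data: a linear trend with a small alternating perturbation
xs = [float(i) for i in range(10)]
ys = [2 * x + 1 + (0.1 if i % 2 == 0 else -0.1) for i, x in enumerate(xs)]
q2 = q2_kfold(xs, ys)
print(round(q2, 4))
```

Each compound is predicted exactly once by a model that was built without it, which is what distinguishes $Q^2$ from the fitting statistic $r^2$.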
26
Other Model Validation Parameters
1. s is the standard deviation about the regression line: $s = \sqrt{RSS / (N - k - 1)}$, where N is the number of observations and k is the number of variables. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR.
2. Scrambling of y.
27
Statistical tests for "chance correlations"
Scrambling: to mix randomly
- Y values (Y-scrambling), or
- X values (X-scrambling), or
- simultaneously Y and X values (X,Y-scrambling)
Randomization: to generate random numbers
- from Y min to Y max (Y-randomization),
- from X min to X max (X-randomization), or
- simultaneously for Y and X (X,Y-randomization)
Calculate the statistical parameters of the correlations and compare them with those obtained for the model.
28
Scrambling
[Figure: the properties Pro.1, Pro.2, Pro.3, …, Pro.n are randomly reassigned among the structures Struc.1, Struc.2, Struc.3, …, Struc.n]
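Y-scrambling can be illustrated with the following Python sketch, which compares the $r^2$ of a one-descriptor linear model on the real data with the $r^2$ values obtained after randomly permuting the property values. The number of trials, the seed, the function names and the data are assumptions made for the example.

```python
import random

def r2_line(xs, ys):
    """r^2 of the least-squares line (squared Pearson correlation)."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy * sxy / (sxx * syy)

def y_scrambling(xs, ys, n_trials=100, seed=0):
    """Return r^2 of the real model and the r^2 values of scrambled models."""
    rng = random.Random(seed)
    scrambled = []
    for _ in range(n_trials):
        shuffled = list(ys)
        rng.shuffle(shuffled)          # break the structure-property link
        scrambled.append(r2_line(xs, shuffled))
    return r2_line(xs, ys), scrambled

# Invented data: a linear trend with a small alternating perturbation
xs = list(range(20))
ys = [0.5 * x + (0.3 if x % 2 else -0.3) for x in xs]
r2_true, r2_scrambled = y_scrambling(xs, ys)
mean_scrambled = sum(r2_scrambled) / len(r2_scrambled)
print(round(r2_true, 3), round(mean_scrambled, 3))
```

A real model whose $r^2$ is not clearly above the scrambled values would be suspected of chance correlation.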
29
QSAR/QSPR models: Development, Validation, Application
30
Prediction performance of QSPR models
Robustness of QSPR models:
- Descriptor type;
- Descriptor selection;
- Machine-learning methods;
- Validation of models.
Applicability domain of models:
- Is a test compound similar to the training set compounds?
31
Applicability domain of QSAR models
The new compound will be predicted by the model only if:
$D_i \le \langle D_k \rangle + Z \times s_k$
with Z an empirical parameter (0.5 by default).
INSIDE THE DOMAIN: will be predicted
OUTSIDE THE DOMAIN: will not be predicted
[Figure: test compounds and the training set in the Descriptor 1 – Descriptor 2 plane]
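A distance-based domain check of this kind can be sketched in Python as follows. The slide does not define the quantities in the criterion, so the sketch makes several assumptions: $D_i$ is taken as the mean Euclidean distance from the test compound to its k nearest training compounds, and the threshold is the mean of the same quantity over the training compounds plus Z times its standard deviation. All function names and data are hypothetical.

```python
import math
from statistics import mean, pstdev

def mean_knn_distance(x, others, k):
    """Mean Euclidean distance from x to its k nearest points in `others`."""
    return mean(sorted(math.dist(x, o) for o in others)[:k])

def ad_threshold(train_X, k=1, z=0.5):
    """Mean + z * standard deviation of the k-nearest-neighbor distances
    computed inside the training set (each compound vs. all the others)."""
    d = [mean_knn_distance(x, train_X[:i] + train_X[i + 1:], k)
         for i, x in enumerate(train_X)]
    return mean(d) + z * pstdev(d)

def in_domain(x, train_X, k=1, z=0.5):
    """True if the test compound x falls inside the applicability domain."""
    return mean_knn_distance(x, train_X, k) <= ad_threshold(train_X, k, z)

# Hypothetical training set in a two-descriptor space
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
print(in_domain((0.5, 0.5), train_X), in_domain((5.0, 5.0), train_X))
```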
32
Applicability domain of QSAR models
Range-based methods: Bounding Box (BB)
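A bounding box is commonly defined by the minimum and maximum of each descriptor over the training set. The Python sketch below assumes that definition; the function names and the data are hypothetical.

```python
def bounding_box(train_X):
    """Per-descriptor (minimum, maximum) over the training set."""
    columns = list(zip(*train_X))
    return [min(c) for c in columns], [max(c) for c in columns]

def in_bounding_box(x, box):
    """True if every descriptor of x lies within the training range."""
    lows, highs = box
    return all(lo <= v <= hi for v, lo, hi in zip(x, lows, highs))

# Hypothetical training set with two descriptors
train_X = [(0.0, 10.0), (2.0, 12.0), (1.0, 15.0)]
box = bounding_box(train_X)
print(in_bounding_box((1.5, 11.0), box), in_bounding_box((3.0, 11.0), box))
```

The check is cheap, but it treats the descriptors independently and therefore also accepts empty regions inside the box.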
33
Ensemble modeling
Should one use only one individual model or many models?
34
Hunting season … Single hunter
35
Hunting season … Many hunters
36
Ensemble modelling Model 1 Model 2 Model 3 Model 4
37
Property (Y) predictions using best fit models

Compound     model 1   model 2   …   mean ± s
Compound 1   Y_11      Y_12      …   <Y_1> ± s_1
Compound 2   Y_21      Y_22      …   <Y_2> ± s_2
…            …         …         …   …
Compound m   Y_m1      Y_m2      …   <Y_m> ± s_m

Grubbs' statistics is used to exclude outliers.
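The "mean ± s" column can be computed as in this Python sketch. The predictions are invented, the function name `consensus` is hypothetical, and the Grubbs outlier exclusion mentioned on the slide is not implemented here.

```python
from statistics import mean, stdev

def consensus(predictions):
    """predictions: one list of individual model outputs per compound.
    Returns a (mean, sample standard deviation) pair for each compound."""
    return [(mean(p), stdev(p)) for p in predictions]

# Invented predictions of four models for two compounds
predictions = [
    [1.0, 1.2, 0.8, 1.0],
    [2.0, 2.0, 2.0, 2.0],
]
results = consensus(predictions)
for y_mean, s in results:
    print(f"{y_mean:.2f} ± {s:.2f}")
```

A large s signals that the individual models disagree, so the spread can serve as a rough indicator of how reliable the consensus value is.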
38
Calculation of Descriptors
DataSet → ISIDA FRAGMENTOR → the pattern matrix

C-C-C-C-C-C   C-C-C-N-C-C   C=O   C-C-C-N   C-N-C-C*C   Etc.
0             10            1     5         0
0             8             1     4         0
0             4             1     2         4
39
PATTERN MATRIX + PROPERTY VALUES (-0.222, 0.973, -0.066, …)
LEARNING STAGE: building of QSAR models
VALIDATION STAGE: filtering of QSAR models -> selection of the most predictive ones
40
Example: linear QSPR model
$\mathrm{PROPERTY}_{calc} = -0.36 \times N_{\text{C-C-C-N-C-C}} + 0.27 \times N_{\text{C=O}} + 0.12 \times N_{\text{C-N-C*C}} + \dots$
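Evaluating such an equation amounts to a weighted sum of counts. The Python sketch below assumes that each N is the number of occurrences of the corresponding fragment in the molecule, and it uses only the three terms shown on the slide. The remaining terms are elided in the source, so the result is a partial sum, and the fragment counts are hypothetical.

```python
# Coefficients of the three terms shown on the slide; the other terms of the
# model are not given in the source, so they are left out here.
COEFFS = {
    "C-C-C-N-C-C": -0.36,
    "C=O": 0.27,
    "C-N-C*C": 0.12,
}

def partial_property(fragment_counts):
    """Sum of coefficient * fragment count over the known terms only."""
    return sum(coef * fragment_counts.get(fragment, 0)
               for fragment, coef in COEFFS.items())

# Hypothetical fragment counts for one molecule
counts = {"C-C-C-N-C-C": 10, "C=O": 1, "C-N-C*C": 0}
value = partial_property(counts)
print(round(value, 2))
```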
41
Virtual screening with QSAR/QSPR models
42
Virtual Screening
Database → QSPR model (screening and hits selection) → Hits → Experimental tests
Compounds rejected by the screening: useless compounds
43
Combinatorial Library Design
44
Generation of Virtual Combinatorial Libraries
[Figure: a Markush structure with substituents R1, R2, R3 and the compounds generated from a given choice of substituents]
45
The types of variation in Markush structures:
1. Substituent variation (R1): R1 = Me, Et, Pr
2. Position variation (R2): R2 = NH2
3. Frequency variation: n = 1–3
4. Homology variation (R3), only for patent search: R3 = alkyl or heterocycle
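Enumerating a virtual library from such variations is a Cartesian product of the allowed choices, as the Python sketch below shows. R1 and the range of n come from the slide. The ring positions for R2 are hypothetical, since the slide does not list them, and the homology variation (R3) is left out because a generic class such as "alkyl" cannot be enumerated without further limits.

```python
from itertools import product

R1 = ["Me", "Et", "Pr"]        # substituent variation (from the slide)
R2_POSITIONS = [2, 3, 4]       # hypothetical attachment positions for NH2
N_REPEAT = [1, 2, 3]           # frequency variation, n = 1-3 (from the slide)

# One dictionary per virtual compound of the library
library = [
    {"R1": r1, "R2_position": pos, "n": n}
    for r1, pos, n in product(R1, R2_POSITIONS, N_REPEAT)
]
print(len(library), library[0])
```

The library size grows multiplicatively with the number of choices per variation site, which is why even small Markush structures can yield thousands of virtual compounds.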
46
IN SILICO design of new compounds
47
"In silico" design of new compounds
- Acquisition of Data;
- Acquisition of Knowledge;
- Exploitation of Knowledge
48
ISIDA combinatorial module
The combinatorial module generates virtual libraries based on the Markush structures.
[Figure: ISIDA workflow in seven numbered steps linking Markush structure, Database, QSAR models, Filtering, Similarity Search, Applicability domains, Assessment of properties, Hits selection, and Synthesis and experimental tests]
ISIDA: 1000 molecules/second
49
COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of $\mathrm{UO_2^{2+}}$ by monoamides
[Figure: monoamide structure, R = H, alkyl]
$D = [U]_{\text{organic phase}} / [U]_{\text{aqueous phase}}$
A. Varnek, D. Fourches, V. Solov'ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N°4
50
SOLVENT EXTRACTION OF METALS
[Figure: extraction scheme with the species M1+, M2+, An-, and L]
51
COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Extraction of $\mathrm{UO_2^{2+}}$ by monoamides
La Hague plant (Usine de La Hague), France
Reprocessing of the spent nuclear fuel
PUREX process
TBP: tributyl phosphate
52
Goal: theoretical design of new uranyl binders more efficient than previously studied molecules
1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960)
2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch., 17, 87 (1999)
53
ISIDA
Virtual library: 11,000 compounds
VIRTUAL SCREENING (DATABASE, DATA TREATMENT, EXPERT SYSTEM, PREDICTOR)
Hits selection
Selected hits: 21 compounds
54
“In silico” design of uranyl binders with ISIDA
55
Experimental vs Predicted logD
[Figure: logD plotted against the new amides (ID)]
56
Enrichment of the initial data set by new efficient extractants
logD > 0.9: 4 compounds (previously studied), 9 compounds (newly synthesized)
[Figure: number of compounds vs. logD for previously studied amides and newly synthesized amides]
57
Classification Models
58
Confusion Matrix
For N instances, K classes and a classifier, $N_{ij}$ is the number of instances of class i classified as j.

         Class1   Class2   …   ClassK
Class1   N_11     N_12     …   N_1K
Class2   N_21     N_22     …   N_2K
…        …        …        …   …
ClassK   N_K1     N_K2     …   N_KK
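Building this matrix from lists of true and predicted labels takes only a few lines of Python. The function name `confusion_matrix` and the labels in the example are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] = number of instances of class i classified as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

# Hypothetical two-class example: active / inactive compounds
true_labels = ["active", "active", "inactive", "inactive", "inactive"]
predicted = ["active", "inactive", "inactive", "inactive", "active"]
matrix = confusion_matrix(true_labels, predicted, ["active", "inactive"])
print(matrix)
```

Rows correspond to the true class and columns to the predicted class, so the correctly classified instances sit on the diagonal.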
59
Classification Evaluation
- Global measures of success: measures are estimated on all classes
- Local measures of success: measures are estimated for each class
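The slide does not name specific measures. As an illustration, the Python sketch below assumes accuracy as a global measure and recall and precision as local, per-class measures, all computed from a confusion matrix whose rows are true classes and whose columns are predicted classes. The function names and the matrix values are invented.

```python
def accuracy(matrix):
    """Global measure: fraction of all instances that were classified correctly."""
    total = sum(map(sum, matrix))
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def recall(matrix, i):
    """Local measure: fraction of true class-i instances predicted as class i."""
    row_total = sum(matrix[i])
    return matrix[i][i] / row_total if row_total else 0.0

def precision(matrix, i):
    """Local measure: fraction of class-i predictions that are truly class i."""
    col_total = sum(row[i] for row in matrix)
    return matrix[i][i] / col_total if col_total else 0.0

# Invented confusion matrix for two classes
matrix = [[8, 2],
          [1, 9]]
print(accuracy(matrix), recall(matrix, 0), round(precision(matrix, 0), 3))
```

On unbalanced data sets the global and local measures can diverge strongly, which is why both kinds are usually reported.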
60
"The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties."
George S. Hammond, Norris Award Lecture, 1968