Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships Alexandre Varnek Faculté de Chimie, ULP, Strasbourg, FRANCE.


1 Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships Alexandre Varnek Faculté de Chimie, ULP, Strasbourg, FRANCE QSAR/QSPR modeling

2 QSAR/QSPR models: Development, Validation, Application

3 Development of QSAR models: selection and curation of experimental data; preparation of training and test sets (optionally); selection of an initial set of descriptors and their normalisation; variable selection; selection of a machine-learning method. Validation of models: training/test set; cross-validation (internal, external). Application of the models: models' applicability domain.

4 Development of QSAR models: experimental data, descriptors, mathematical techniques, statistical criteria

5 Preparation of training and test sets. The initial data set is split into a training set and a test set (10 - 15 % of the data). Structure-property models are built on the training set, the best models are selected according to statistical criteria, and "prediction" calculations are then performed for the test set using the best structure-property models.
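A minimal sketch of such a split, assuming a purely random selection of the test compounds (the slide does not say how they are chosen). The function name `split_train_test`, the 15 % default and the toy data are illustrative assumptions.

```python
import random

def split_train_test(compounds, test_fraction=0.15, seed=0):
    """Randomly split a data set into (training set, test set).

    Assumption: random selection; the slide only gives the test-set size.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical data set: (compound id, measured property)
data = [(f"cmpd_{i}", 0.1 * i) for i in range(40)]
train, test = split_train_test(data)
print(len(train), len(test))
```

Models would then be fitted on `train` only, with `test` kept aside for the "prediction" calculations.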

6 Recommendations to prepare a test set: (i) experimental methods for determination of activities in the training and test sets should be similar; (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%; (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data. Reference: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215.

7 Selection of descriptors for QSAR model. QSAR models should be reduced to a set of descriptors which is as information rich but as small as possible. Rules of thumb: good "spread", 5-6 structure points per descriptor. Objective selection (independent variable only): statistical criteria of correlations; pairwise selection (forward or backward stepwise selection); Principal Component Analysis; Partial Least Squares analysis; Genetic Algorithm; ... Subjective selection: descriptor selection based on mechanistic studies.

8 Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs): 1. identify a subset of columns (variables) with significant correlation to the response; 2. remove columns (variables) with small variance; 3. remove columns (variables) with no unique information; 4. identify a subset of variables on which to construct a model; 5. address the problem of chance correlation. Reference: D. C. Whitley, M. G. Ford, D. J. Livingstone, J. Chem. Inf. Comput. Sci. 2000, 40, 1160-1168.
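Steps 2 and 3 of this strategy can be sketched as follows. This is an illustration only, not the procedure of the cited paper: the thresholds `min_variance` and `max_abs_r`, and the reading of "no unique information" as near-perfect pairwise correlation, are assumptions.

```python
from statistics import mean, pvariance

def pearson_r(u, v):
    """Pearson correlation coefficient between two columns."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

def preprocess(columns, min_variance=1e-8, max_abs_r=0.99):
    """columns: dict name -> list of values. Returns the names of kept columns."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) <= min_variance:
            continue  # step 2: (near-)constant column
        if any(abs(pearson_r(values, columns[k])) >= max_abs_r for k in kept):
            continue  # step 3: redundant with a column that is already kept
        kept.append(name)
    return kept

# Hypothetical descriptor columns for four compounds
descriptors = {
    "n_C": [6, 5, 4, 7],
    "const": [1, 1, 1, 1],        # zero variance
    "n_C_x2": [12, 10, 8, 14],    # exactly 2 * n_C
    "n_N": [0, 1, 1, 0],
}
kept_columns = preprocess(descriptors)
print(kept_columns)
```

Which of two correlated columns survives depends on column order here; a real workflow would need an explicit rule for that choice.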

9 Machine-Learning Methods

10 Fitting models' parameters: $Y = F(a_i, X_i)$, where $X_i$ are the descriptors (independent variables) and $a_i$ are the fitted parameters. The goal is to minimize the Residual Sum of Squares (RSS).

11 Multiple Linear Regression
Activity | Descriptor
$Y_1$ | $X_1$
$Y_2$ | $X_2$
… | …
$Y_n$ | $X_n$
Model: $Y_i = a_0 + a_1 X_{i1}$
[Figure: plot of Y against X]

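12 Multiple Linear Regression [Figure: straight line $y = ax + b$ fitted to the data points, with slope a and intercept b chosen to minimize the Residual Sum of Squares (RSS)]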
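For the one-descriptor case y = ax + b, the parameters that minimize RSS have a closed form. The sketch below uses the standard least-squares formulas; the data points are made up for illustration.

```python
def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b, rss)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx                      # slope
    b = my - a * mx                    # intercept
    rss = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))
    return a, b, rss

# Hypothetical descriptor values and activities
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
a, b, rss = fit_line(x, y)
print(f"a = {a:.2f}, b = {b:.2f}, RSS = {rss:.3f}")
```

Any other choice of a and b gives a larger RSS on the same data, which is what "fitting" means on these slides.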

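13 Multiple Linear Regression
Activity | Descr 1 | Descr 2 | … | Descr m
$Y_1$ | $X_{11}$ | $X_{12}$ | … | $X_{1m}$
$Y_2$ | $X_{21}$ | $X_{22}$ | … | $X_{2m}$
… | … | … | … | …
$Y_n$ | $X_{n1}$ | $X_{n2}$ | … | $X_{nm}$
Model: $Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_m X_{im}$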

14 kNN (k Nearest Neighbors). The activity Y of a compound is assessed by calculating a weighted mean of the activities $Y_i$ of its k nearest neighbors in the chemical space (A. Tropsha, A. Golbraikh, 2003). [Figure: training set compounds in the Descriptor 1 / Descriptor 2 plane]
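A minimal sketch of a kNN prediction. The slide does not specify the weighting scheme or the distance, so inverse-distance weights and the Euclidean distance in descriptor space are assumptions, as are the toy data.

```python
import math

def knn_predict(query, training_set, k=3):
    """training_set: list of (descriptor_vector, activity) pairs.

    Assumptions: Euclidean distance, weights = 1 / distance.
    """
    neighbors = sorted((math.dist(query, x), y) for x, y in training_set)[:k]
    for d, y in neighbors:
        if d == 0.0:
            return y                   # query coincides with a training compound
    weights = [1.0 / d for d, _ in neighbors]
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

# Hypothetical training set in a two-descriptor space
training = [((0.0, 0.0), 1.0), ((1.0, 0.0), 2.0), ((0.0, 1.0), 3.0), ((5.0, 5.0), 10.0)]
prediction = knn_predict((0.5, 0.0), training, k=2)
print(prediction)
```

In practice the descriptors would be normalised first, so that no single descriptor dominates the distance.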

15 Biological and Artificial Neuron

16 Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer – to properties being predicted, neurons in the hidden layer – to nonlinear latent variables

17 QSAR/QSPR models: Development, Validation, Application

18 Validating the QSAR Equation. How well does the model predict the activity of known compounds? $r^2$ is the fraction of the total variation in the dependent variable that is explained by the regression equation. For a perfect model, all data points would reside on the diagonal of the actual vs predicted plot, and all variance existing in the original data would be explained by the model.

19 Calculating $r^2$. Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around the regression line). [Figure: original variance compared with the variance around the regression line]

20 Calculating $r^2$. Three quantities are involved: the original variance; the explained variance (the improvement in predicting y over just using the mean of y); and the variance around the regression line.
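Under the standard sum-of-squares definitions, the three quantities can be written as below. ESS and RSS are the symbols used on the F-test slide; TSS, $\bar{y}$ (mean of the observed values) and $\hat{y}_i$ (value predicted by the equation) are notation assumed here.

```latex
\begin{aligned}
\text{Original variance:}\quad & \mathrm{TSS} = \sum_{i=1}^{N} (y_i - \bar{y})^2 \\
\text{Explained variance:}\quad & \mathrm{ESS} = \sum_{i=1}^{N} (\hat{y}_i - \bar{y})^2 \\
\text{Variance around regression line:}\quad & \mathrm{RSS} = \sum_{i=1}^{N} (y_i - \hat{y}_i)^2 \\
\mathrm{TSS} = \mathrm{ESS} + \mathrm{RSS}, \qquad & r^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}
\end{aligned}
```

The decomposition TSS = ESS + RSS holds for a least-squares fit with an intercept term.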

21 F-test. Tests the assumption that a significant portion of the original variance has been explained by the model. In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points) differs significantly from 0. A ratio of 0 would imply that ESS = 0, i.e., that the model did not explain any of the variance.

22 F-distribution. As N decreases, the probability of getting large $r^2$ values purely by chance increases. Accordingly, the critical F-value rises as N and k decrease, so a larger calculated F-value is required for the test to be significant. [Table: critical F-values as a function of N and k]

23 Calculating F Values. Calculate F as the ratio of the explained variance ESS/k to the unexplained variance RSS/(N-k-1). Select a significance level (e.g., 0.05). Look up the F-value from an F-distribution derived for the correct N and k at the selected significance level. If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level. Example: $r^2$ = 0.89, N = 7, k = 1, F = 40.46. For an F-distribution with N = 7, k = 1, a value of 40.46 corresponds to a confidence level of 0.9997. Thus, the equation is significant at this level: the probability that the correlation is fortuitous is < 0.03%.
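The equation referred to on this slide is not reproduced in the transcript. Assuming the usual relation between F and $r^2$, F = ($r^2$/k) / ((1 - $r^2$)/(N - k - 1)), which follows from the ESS/k to RSS/(N-k-1) ratio of the F-test slide, the example can be checked as follows.

```python
def f_value(r2, n, k):
    """F statistic from r^2, number of data points n and number of parameters k.

    Assumed formula: F = (r2 / k) / ((1 - r2) / (n - k - 1)).
    """
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Example from the slide: r^2 = 0.89, N = 7, k = 1
F = f_value(0.89, 7, 1)
print(round(F, 2))
```

This gives about 40.45 from the rounded $r^2$ = 0.89, close to the 40.46 quoted on the slide, which was presumably computed from an unrounded $r^2$.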

24 Validation of Models: 5-fold external cross-validation procedure

25 Cross Validation. $Q^2$ is a measure of the predictive ability of the model (as opposed to the measure of fit produced by $r^2$). $r^2$ never decreases as more descriptors are added. $Q^2$ initially increases as more parameters are added but then starts to decrease, indicating overfitting. Thus $Q^2$ is a better indicator of model quality.
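A sketch of how $Q^2$ might be computed by 5-fold cross-validation for a one-descriptor linear model. The definition $Q^2$ = 1 - PRESS/TSS (PRESS being the sum of squared errors on the left-out compounds), the interleaved fold assignment and the data are all assumptions of this example.

```python
def fit_line(x, y):
    """Least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    a = sxy / sxx
    return a, my - a * mx

def q2_kfold(x, y, n_folds=5):
    """Q^2 = 1 - PRESS / TSS, each compound predicted by a model that never saw it."""
    n = len(x)
    press = 0.0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))      # every n_folds-th compound
        x_train = [x[i] for i in range(n) if i not in test_idx]
        y_train = [y[i] for i in range(n) if i not in test_idx]
        a, b = fit_line(x_train, y_train)
        press += sum((y[i] - (a * x[i] + b)) ** 2 for i in test_idx)
    my = sum(y) / n
    tss = sum((yi - my) ** 2 for yi in y)
    return 1.0 - press / tss

# Hypothetical, nearly linear data
x = [float(i) for i in range(10)]
y = [1.1, 2.9, 5.2, 6.8, 9.1, 11.0, 12.8, 15.1, 17.2, 18.9]
q2 = q2_kfold(x, y)
print(round(q2, 4))
```

Unlike $r^2$, this number can go down when a descriptor that only fits noise is added, because the left-out compounds are predicted worse.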

26 Other Model Validation Parameters: 1. s is the standard deviation about the regression line, $s = \sqrt{\mathrm{RSS}/(N-k-1)}$, where N is the number of observations and k is the number of variables. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR. 2. Scrambling of y.

27 Statistical tests for « chance correlations ». Scrambling: to mix randomly the Y values (Y-scrambling), the X values (X-scrambling), or simultaneously the Y and X values (X,Y-scrambling). Randomization: to generate random numbers from Y min to Y max (Y-randomization), from X min to X max (X-randomization), or to do this simultaneously for Y and X (X,Y-randomization). Calculate the statistical parameters of the resulting correlations and compare them with those obtained for the model.
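Y-scrambling can be sketched as follows for a one-descriptor linear model, where $r^2$ equals the squared Pearson correlation. The number of trials, the seed and the data are illustrative assumptions.

```python
import random

def r_squared(x, y):
    """r^2 of the least-squares line y = a*x + b (squared Pearson correlation)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    return sxy ** 2 / (sxx * syy)

def y_scrambling(x, y, n_trials=200, seed=0):
    """Return r^2 of the real model and the r^2 values obtained with shuffled Y."""
    rng = random.Random(seed)
    r2_scrambled = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)            # break the structure-property link
        r2_scrambled.append(r_squared(x, y_perm))
    return r_squared(x, y), r2_scrambled

# Hypothetical data: y is close to 2x + 1
x = [float(i) for i in range(1, 11)]
noise = [0.1, -0.1, 0.2, -0.2, 0.1, 0.0, -0.2, 0.1, 0.2, -0.1]
y = [2.0 * xi + 1.0 + e for xi, e in zip(x, noise)]
r2_model, r2_scrambled = y_scrambling(x, y)
mean_scrambled = sum(r2_scrambled) / len(r2_scrambled)
print(round(r2_model, 3), round(mean_scrambled, 3))
```

If the scrambled runs gave $r^2$ values comparable to that of the real model, the original correlation would be suspected of being a chance correlation.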

28 Scrambling [Figure: structures Struc.1, Struc.2, Struc.3, ..., Struc.n paired with properties Pro.1, Pro.2, Pro.3, ..., Pro.n, shown before and after the property values are randomly reassigned]

29 QSAR/QSPR models: Development, Validation, Application

30 Prediction performance of QSPR models depends on: the robustness of the QSPR models (descriptor type; descriptor selection; machine-learning methods; validation of models) and the applicability domain of the models (is a test compound similar to the training set compounds?).

31 Applicability domain of QSAR models. The new compound will be predicted by the model only if $D_i \le \langle D_k \rangle + Z \times s_k$, with Z an empirical parameter (0.5 by default). INSIDE THE DOMAIN: will be predicted. OUTSIDE THE DOMAIN: will not be predicted. [Figure: test compound and training set in the Descriptor 1 / Descriptor 2 plane]
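A sketch of a distance-based applicability-domain check. It assumes that the criterion has the form $D_i \le \langle D_k \rangle + Z \times s_k$, where $\langle D_k \rangle$ and $s_k$ are the mean and standard deviation of the distances between training compounds and their k nearest training neighbors, and $D_i$ is the same kind of distance for the test compound. These definitions, the Euclidean distance and the grid data are assumptions of this example, not taken from the slide.

```python
import math
from statistics import mean, pstdev

def knn_mean_distance(point, others, k):
    """Mean Euclidean distance from point to its k nearest compounds in others."""
    dists = sorted(math.dist(point, o) for o in others)
    return mean(dists[:k])

def ad_threshold(training, k=3, z=0.5):
    """<D_k> + Z * s_k computed over the training set (assumed definition)."""
    d = [knn_mean_distance(p, training[:i] + training[i + 1:], k)
         for i, p in enumerate(training)]
    return mean(d) + z * pstdev(d)

def inside_domain(compound, training, k=3, z=0.5):
    """True if the compound would be predicted by the model."""
    return knn_mean_distance(compound, training, k) <= ad_threshold(training, k, z)

# Hypothetical training set: a 3 x 3 grid in a two-descriptor space
training = [(float(i), float(j)) for i in range(3) for j in range(3)]
print(inside_domain((1.0, 1.2), training))    # close to the training compounds
print(inside_domain((10.0, 10.0), training))  # far from all training compounds
```

A larger Z widens the domain, so more test compounds are predicted, at the cost of less reliable predictions near its edge.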

32 Applicability domain of QSAR models. Range-based methods: Bounding Box (BB)
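A bounding box can be sketched as the per-descriptor minimum and maximum over the training set; a compound is inside the domain if every descriptor value falls within these ranges. Function names and data are illustrative.

```python
def bounding_box(training):
    """Per-descriptor (lows, highs) over the training set."""
    lows = [min(col) for col in zip(*training)]
    highs = [max(col) for col in zip(*training)]
    return lows, highs

def in_bounding_box(compound, lows, highs):
    """True if every descriptor value lies within the training range."""
    return all(lo <= v <= hi for v, lo, hi in zip(compound, lows, highs))

# Hypothetical training set with two descriptors
training = [(0.0, 1.0), (2.0, 3.0), (1.0, 5.0)]
lows, highs = bounding_box(training)
print(in_bounding_box((1.0, 2.0), lows, highs))
print(in_bounding_box((3.0, 2.0), lows, highs))
```

The box ignores correlations between descriptors, so it may include empty regions of the descriptor space that contain no training compound.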

33 Ensemble modeling: should one use only one individual model or many models?

34 Hunting season … Single hunter

35 Hunting season … Many hunters

36 Ensemble modelling Model 1 Model 2 Model 3 Model 4

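37 Property (Y) predictions using best fit models
Compound | model 1 | model 2 | … | mean ± s
Compound 1 | $Y_{11}$ | $Y_{12}$ | … | $\langle Y_1 \rangle \pm s_1$
Compound 2 | $Y_{21}$ | $Y_{22}$ | … | $\langle Y_2 \rangle \pm s_2$
… | … | … | … | …
Compound m | $Y_{m1}$ | $Y_{m2}$ | … | $\langle Y_m \rangle \pm s_m$
Grubbs statistics is used to exclude the outliers.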
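A sketch of a consensus prediction for one compound: the mean ± s over the individual models, with the most extreme prediction removed while its Grubbs statistic G = |Y - mean| / s exceeds a critical value. The critical value depends on the number of models and the significance level and must come from a Grubbs table; `g_crit=1.7` below is only a placeholder, and the iterative removal is an assumption about how the test is applied.

```python
from statistics import mean, stdev

def consensus(predictions, g_crit=None):
    """Return (mean, s) over the model predictions for one compound.

    g_crit: critical Grubbs value supplied by the caller (placeholder here);
    None disables outlier exclusion.
    """
    values = list(predictions)
    while g_crit is not None and len(values) > 2:
        m, s = mean(values), stdev(values)
        if s == 0:
            break
        suspect = max(values, key=lambda v: abs(v - m))
        if abs(suspect - m) / s <= g_crit:
            break                      # no outlier at this critical value
        values.remove(suspect)         # drop the outlier and test again
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)

# Hypothetical predictions of five models for one compound
y_models = [1.0, 1.1, 0.9, 1.05, 3.0]
y_mean, y_s = consensus(y_models, g_crit=1.7)
print(round(y_mean, 4), round(y_s, 4))
```

Here the prediction 3.0 is excluded and the consensus is computed from the remaining four models.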

38 Calculation of Descriptors. ISIDA FRAGMENTOR converts the data set into the pattern matrix of fragment counts.
C-C-C-C-C-C | C-C-C-N-C-C | C=O | C-C-C-N | C-N-C-C*C | etc.
0 | 10 | 1 | 5 | 0
0 | 8 | 1 | 4 | 0
0 | 4 | 1 | 2 | 4

39 LEARNING STAGE: pattern matrix + property values (e.g., -0.222, 0.973, -0.066) -> building of QSAR models. VALIDATION STAGE: filtering of the QSAR models -> selection of the most predictive ones.

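40 Example: linear QSPR model. $\mathrm{PROPERTY}_{calc} = -0.36 \times N_{\text{C-C-C-N-C-C}} + 0.27 \times N_{\text{C=O}} + 0.12 \times N_{\text{C-N-C*C}} + \dots$, where $N$ is the number of occurrences of the given fragment in the molecule.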
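Evaluating such a fragment-based model is a dot product between coefficients and fragment counts. The sketch below uses only the three terms shown on the slide, so the result is a partial sum (the terms hidden behind "…" are unknown), and the fragment counts are hypothetical.

```python
# Coefficients of the three terms shown on the slide; the remaining terms are not given.
COEFFICIENTS = {"C-C-C-N-C-C": -0.36, "C=O": 0.27, "C-N-C*C": 0.12}

def property_calc(fragment_counts):
    """Partial sum of the linear model over the known fragment terms."""
    return sum(coef * fragment_counts.get(fragment, 0)
               for fragment, coef in COEFFICIENTS.items())

# Hypothetical fragment counts for one molecule
counts = {"C-C-C-N-C-C": 2, "C=O": 1, "C-N-C*C": 0}
value = property_calc(counts)
print(round(value, 2))
```

Fragments absent from a molecule simply contribute zero, which is why a single pattern matrix can serve many different models.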

41 Virtual screening with QSAR/QSPR models

42 Virtual Screening: Database -> QSPR model (screening and hits selection) -> Hits -> Experimental tests. Useless compounds are discarded.

43 Combinatorial Library Design

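44 Generation of Virtual Combinatorial Libraries [Figure: a Markush structure with substituents R1, R2, R3 and the groups allowed for each of them, from which the individual library structures are generated]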

45 The types of variation in Markush structures: 1. Substituent variation ($R_1$), e.g. $R_1$ = Me, Et, Pr; 2. Position variation ($R_2$), e.g. $R_2$ = NH$_2$; 3. Frequency variation, e.g. n = 1 - 3; 4. Homology variation ($R_3$), e.g. $R_3$ = alkyl or heterocycle (only for patent search).
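Enumerating a virtual library from such variation lists amounts to a Cartesian product. In the sketch below the variation lists are taken from the slide, while the dictionary layout is an assumption; the position variation of $R_2$ is left out because the allowed positions are not given.

```python
from itertools import product

# Variation lists from the slide; R2 = NH2 is kept fixed in this sketch.
R1 = ["Me", "Et", "Pr"]                # substituent variation
N_REPEAT = [1, 2, 3]                   # frequency variation, n = 1 - 3
R3 = ["alkyl", "heterocycle"]          # homology variation

library = [
    {"R1": r1, "R2": "NH2", "n": n, "R3": r3}
    for r1, n, r3 in product(R1, N_REPEAT, R3)
]
print(len(library))
print(library[0])
```

The library size is the product of the list lengths (3 x 3 x 2 = 18 here), which is why virtual libraries grow so quickly with each added variation.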

46 IN SILICO design of new compounds

47 « In silico » design of new compounds: - Acquisition of Data; - Acquisition of Knowledge; - Exploitation of Knowledge

48 ISIDA combinatorial module. The combinatorial module generates virtual libraries based on the Markush structures. [Figure: seven-step ISIDA workflow linking the database, QSAR models, the Markush structure and combinatorial module, filtering, similarity search, applicability domains, assessment of properties, hits selection, and synthesis and experimental tests; throughput: 1000 molecules/second]

49 COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of UO$_2^{2+}$ by monoamides (R = H, alkyl). Distribution ratio: $D = [U]_{\text{organic phase}} / [U]_{\text{aqueous phase}}$. Reference: A. Varnek, D. Fourches, V. Solov'ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N°4.

50 SOLVENT EXTRACTION OF METALS [Figure: metal cations M1+ and M2+, anion An-, and ligand L]

51 COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Extraction of UO$_2^{2+}$ by monoamides. Usine de La HAGUE, France: reprocessing of the spent nuclear fuel (PUREX process). TBP: tributyl phosphate.

52 Goal: theoretical design of new uranyl binders more efficient than previously studied molecules. References: 1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960); 2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch., 17, 87 (1999).

53 ISIDA workflow components: DATABASE, DATA TREATMENT, EXPERT SYSTEM, PREDICTOR, VIRTUAL SCREENING, hits selection. Virtual library: 11,000 compounds. Selected hits: 21 compounds.

54 “In silico” design of uranyl binders with ISIDA

55 Experimental vs Predicted logD [Figure: logD plotted against the ID of the new amides]

56 Enrichment of the initial data set by new efficient extractants. logD > 0.9: 4 compounds (previously studied), 9 compounds (newly synthesized). [Figure: number of compounds vs logD for previously studied amides and newly synthesized amides]

57 Classification Models

58 Confusion Matrix. For N instances, K classes and a classifier, $N_{ij}$ is the number of instances of class i classified as j.
 | Class1 | Class2 | … | ClassK
Class1 | $N_{11}$ | $N_{12}$ | … | $N_{1K}$
Class2 | $N_{21}$ | $N_{22}$ | … | $N_{2K}$
… | … | … | … | …
ClassK | $N_{K1}$ | $N_{K2}$ | … | $N_{KK}$

59 Classification Evaluation. Global measures of success: measures are estimated on all classes. Local measures of success: measures are estimated for each class.
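A sketch of a confusion matrix with one global and one local measure. The slides do not name specific measures; accuracy (global) and per-class recall (local) are common choices assumed here, and the labels are made up.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """matrix[i][j] = number of instances of class i classified as class j."""
    index = {c: i for i, c in enumerate(classes)}
    k = len(classes)
    matrix = [[0] * k for _ in range(k)]
    for t, p in zip(true_labels, predicted_labels):
        matrix[index[t]][index[p]] += 1
    return matrix

def accuracy(matrix):
    """Global measure: fraction of all instances on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def recall_per_class(matrix):
    """Local measure: for each class i, N_ii divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]

# Hypothetical true and predicted classes for ten compounds
classes = ["active", "inactive"]
true_y = ["active"] * 4 + ["inactive"] * 6
pred_y = ["active", "active", "active", "inactive"] + ["inactive"] * 5 + ["active"]
cm = confusion_matrix(true_y, pred_y, classes)
print(cm)
print(accuracy(cm), recall_per_class(cm))
```

With unbalanced classes the global accuracy can look good while the recall of the minority class is poor, which is the reason for reporting local measures as well.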

60 "The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties." George S. Hammond, Norris Award Lecture, 1968

