Quantitative Structure-Activity Relationships / Quantitative Structure-Property Relationships. Alexandre Varnek, Faculté de Chimie, ULP, Strasbourg, FRANCE.

QSAR/QSPR modeling

QSAR/QSPR models: Development, Validation, Application

Development of QSAR models
- Selection and curation of experimental data
- Preparation of training and test sets (optionally)
- Selection of an initial set of descriptors and their normalisation
- Variable selection
- Selection of a machine-learning method
Validation of models
- Training/test set
- Cross-validation: internal, external
Application of the models
- Models' applicability domain

Development of the QSAR models: experimental data, descriptors, mathematical techniques, statistical criteria

Preparation of training and test sets
1. Splitting of an initial data set into a training set and a test set (10 – 15 % of the data).
2. Building of structure - property models on the training set.
3. Selection of the best models according to statistical criteria.
4. “Prediction” calculations for the test set using the best structure - property models.
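The splitting step can be sketched as follows. This is a minimal illustration assuming a purely random split; the function name `split_train_test`, the compound labels and the fixed seed are hypothetical and not taken from the slides.

```python
import random

def split_train_test(compounds, test_fraction=0.15, seed=0):
    """Randomly split a data set into (training set, test set)."""
    rng = random.Random(seed)          # fixed seed only for reproducibility
    shuffled = list(compounds)
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

data = [f"cmpd_{i}" for i in range(40)]   # hypothetical compound identifiers
train, test = split_train_test(data, test_fraction=0.15)
print(len(train), len(test))  # 34 6
```

A random split does not by itself guarantee that the test set follows the recommendations on activity range and class balance; those would need to be checked separately.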

Recommendations to prepare a test set:
(i) experimental methods for determination of activities in the training and test sets should be similar;
(ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%;
(iii) the balance between active and inactive compounds should be respected for uniform sampling of the data.
Reference: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37, 2206-2215

Selection of descriptors for QSAR model
QSAR models should be reduced to a set of descriptors which is as information-rich but as small as possible.
Rules of thumb: good “spread”, 5-6 structure points per descriptor.
Objective selection (independent variable only):
- Statistical criteria of correlations
- Pairwise selection (Forward or Backward Stepwise selection)
- Principal Component Analysis
- Partial Least Squares analysis
- Genetic Algorithm
- ...
Subjective selection:
- Descriptors selection based on mechanistic studies

Preprocessing strategy for the derivation of models for use in structure - activity relationships (QSARs)
1. identify a subset of columns (variables) with significant correlation to the response;
2. remove columns (variables) with small variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.
D. C. Whitley, M. G. Ford, D. J. Livingstone, J. Chem. Inf. Comput. Sci. 2000, 40, 1160 - 1168
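Steps 2 and 3 of the list above can be illustrated with a simple filter. This is only a sketch, not the procedure of the cited paper: the thresholds `min_variance` and `max_abs_corr`, the function names and the toy descriptor values are all assumptions chosen for illustration.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def filter_descriptors(columns, min_variance=1e-3, max_abs_corr=0.95):
    """Drop near-constant columns and columns redundant with one already kept."""
    kept = {}
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # step 2: small variance
        if any(abs(pearson(values, other)) > max_abs_corr for other in kept.values()):
            continue  # step 3: no unique information
        kept[name] = values
    return kept

descriptors = {
    "d1": [1.0, 2.0, 3.0, 4.0, 5.0],
    "d2": [2.0, 4.0, 6.0, 8.0, 10.0],  # exactly 2 * d1, so redundant
    "d3": [1.0, 1.0, 1.0, 1.0, 1.0],   # constant
    "d4": [5.0, 1.0, 4.0, 2.0, 3.0],
}
print(list(filter_descriptors(descriptors)))  # ['d1', 'd4']
```

The result depends on the order in which columns are visited: of two correlated descriptors, the one seen first is kept.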

Machine-Learning Methods

Fitting models’ parameters
$Y = F(a_i, X_i)$, where $X_i$ are descriptors (independent variables) and $a_i$ are fitted parameters.
The goal is to minimize the Residual Sum of Squares, $\mathrm{RSS} = \sum_i (Y_i^{\mathrm{exp}} - Y_i^{\mathrm{calc}})^2$.

Multiple Linear Regression
Activity | Descriptor
$Y_1$ | $X_1$
$Y_2$ | $X_2$
… | …
$Y_n$ | $X_n$
$Y_i = a_0 + a_1 X_{i1}$
[Figure: plot of Y versus X.]

Multiple Linear Regression: for the line $y = ax + b$, the slope $a$ and the intercept $b$ are fitted by minimizing the Residual Sum of Squares (RSS).

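Multiple Linear Regression
Activity | Descr 1 | Descr 2 | … | Descr m
$Y_1$ | $X_{11}$ | $X_{12}$ | … | $X_{1m}$
$Y_2$ | $X_{21}$ | $X_{22}$ | … | $X_{2m}$
… | … | … | … | …
$Y_n$ | $X_{n1}$ | $X_{n2}$ | … | $X_{nm}$
$Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_m X_{im}$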
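A least-squares fit of the coefficients $a_0 \dots a_m$ can be sketched in plain Python by solving the normal equations. The helper names `solve_linear_system` and `fit_mlr` and the toy data are assumptions for illustration; in practice a numerical library would normally be used.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_mlr(X, y):
    """Return [a0, a1, ..., am] minimizing RSS for Y_i = a0 + sum_j a_j * X_ij."""
    rows = [[1.0] + list(r) for r in X]  # leading 1 carries the intercept a0
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)

# Toy data generated from Y = 1 + 2*X1 - 0.5*X2, so the fit should recover it.
X = [[1, 0], [0, 1], [1, 1], [2, 1], [1, 3]]
y = [1 + 2 * x1 - 0.5 * x2 for x1, x2 in X]
coeffs = fit_mlr(X, y)
print([round(a, 6) for a in coeffs])  # [1.0, 2.0, -0.5]
```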

kNN (k Nearest Neighbors)
The activity Y of a compound is assessed by calculating a weighted mean of the activities $Y_i$ of its k nearest neighbors in the chemical space (A. Tropsha, A. Golbraikh, 2003).
[Figure: training set compounds in the Descriptor 1 / Descriptor 2 plane.]
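A minimal sketch of such a prediction is given below. The slide does not specify the weighting scheme, so inverse-distance weights and Euclidean distance are assumed here; the function name `knn_predict` and the toy data are hypothetical.

```python
import math

def knn_predict(query, train_X, train_y, k=3):
    """Weighted mean of the activities of the k nearest training compounds."""
    neighbors = sorted((math.dist(query, x), y) for x, y in zip(train_X, train_y))[:k]
    for d, y in neighbors:
        if d == 0.0:
            return y  # query coincides with a training compound
    weights = [1.0 / d for d, _ in neighbors]  # assumed inverse-distance weights
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]  # two descriptors
train_y = [1.0, 2.0, 4.0, 10.0]
print(knn_predict((0.0, 1.0), train_X, train_y, k=2))  # 2.5
```

In the example the two nearest neighbors are both at distance 1, so the prediction is the plain average of their activities, 1.0 and 4.0.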

Biological and Artificial Neuron

Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer – to properties being predicted, neurons in the hidden layer – to nonlinear latent variables

Validating the QSAR Equation
How well does the model predict the activity of known compounds?
$r^2$ is the fraction of the total variation in the dependent variable that is explained by the regression equation.
For a perfect model: all data points would reside on the diagonal of the predicted versus actual plot, and all variance existing in the original data is explained by the model.

Calculating $r^2$
Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around the regression line)

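Calculating $r^2$
Original variance: $\mathrm{TSS} = \sum_i (y_i - \bar{y})^2$
Explained variance: $\mathrm{ESS} = \sum_i (\hat{y}_i - \bar{y})^2$, the improvement in predicting y over just using the mean of y
Variance around the regression line: $\mathrm{RSS} = \sum_i (y_i - \hat{y}_i)^2$
$r^2 = \mathrm{ESS}/\mathrm{TSS} = 1 - \mathrm{RSS}/\mathrm{TSS}$
Here $y_i$ are the observed values, $\hat{y}_i$ the values calculated by the equation, and $\bar{y}$ the mean of the observed values.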
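The calculation can be written out in a few lines. The sketch assumes the usual sums-of-squares definitions (original variance around the mean, residual variance around the regression line) and uses $r^2 = 1 - \mathrm{RSS}/\mathrm{TSS}$, which equals the explained fraction for a least-squares fit with an intercept. The function name and the numbers are illustrative only.

```python
def r_squared(y_obs, y_calc):
    """r2 = 1 - RSS/TSS for observed and calculated values."""
    y_mean = sum(y_obs) / len(y_obs)
    tss = sum((y - y_mean) ** 2 for y in y_obs)              # original variance
    rss = sum((y - yc) ** 2 for y, yc in zip(y_obs, y_calc))  # residual variance
    return 1.0 - rss / tss

y_obs = [1.0, 2.0, 3.0, 4.0]
y_calc = [1.1, 1.9, 3.2, 3.8]
print(round(r_squared(y_obs, y_calc), 3))  # 0.98
```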

F-test
Tests the assumption that a significant portion of the original variance has been explained by the model. In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points), $F = \dfrac{\mathrm{ESS}/k}{\mathrm{RSS}/(N-k-1)}$, differs significantly from 0. A ratio of 0 would imply that ESS = 0, i.e., the model didn’t explain any of the variance.

F-distribution
As N decreases, the probability of getting large $r^2$ values purely by chance increases. Accordingly, in the tabulated F-distribution, a larger F-value is generally required for the test to be significant as N and k decrease.
[Table: critical F-values as a function of N and k.]

Calculating F Values
1. Calculate F: $F = \dfrac{r^2 / k}{(1 - r^2)/(N - k - 1)}$.
2. Select a significance level (e.g., 0.05).
3. Look up the F-value from an F-distribution derived for the correct N and k at the selected significance level.
4. If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level.
Example: $r^2$ = 0.89, N = 7, k = 1, F = 40.46. For an F-distribution with N = 7, k = 1, a value of 40.46 corresponds to a confidence level of 0.9997. Thus, the equation is significant at this level. The probability that the correlation is fortuitous is < 0.03%
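As a quick check of the example, F can be computed directly from $r^2$, N and k. The sketch assumes the standard relation $F = (r^2/k) / ((1 - r^2)/(N - k - 1))$; the function name is hypothetical.

```python
def f_statistic(r2, n, k):
    """F value of a regression with k descriptors fitted on n data points."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

print(round(f_statistic(0.89, 7, 1), 2))  # 40.45
```

With the rounded value $r^2$ = 0.89 this gives about 40.45, close to the 40.46 quoted in the example; the small difference presumably comes from rounding of $r^2$.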

Validation of Models: 5-fold external cross-validation procedure

Cross Validation
$Q^2$ is a measure of the predictive ability of the model (as opposed to the measure of fit produced by $r^2$). $r^2$ never decreases as more descriptors are added. $Q^2$ initially increases as more parameters are added but then starts to decrease, indicating overfitting of the data. Thus $Q^2$ is a better indicator of the model quality.
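A k-fold estimate of $Q^2$ can be sketched as follows. The sketch assumes the common definition $Q^2 = 1 - \mathrm{PRESS}/\mathrm{TSS}$, where PRESS sums the squared errors of predictions made for compounds left out of the fit, and it uses a one-descriptor linear model to stay short. The function names, the way folds are assigned and the toy data are all illustrative assumptions.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum((xi - mx) ** 2 for xi in x)
    return a, my - a * mx

def q_squared(x, y, n_folds=5):
    """Q2 = 1 - PRESS/TSS, each compound predicted by a model that never saw it."""
    n = len(x)
    press = 0.0
    for fold in range(n_folds):
        test_idx = set(range(fold, n, n_folds))  # every n_folds-th compound
        train_x = [x[i] for i in range(n) if i not in test_idx]
        train_y = [y[i] for i in range(n) if i not in test_idx]
        a, b = fit_line(train_x, train_y)
        press += sum((y[i] - (a * x[i] + b)) ** 2 for i in test_idx)
    y_mean = sum(y) / n
    tss = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - press / tss

x = [float(i) for i in range(10)]
y = [2.0 * xi + 1.0 + (0.3 if i % 2 else -0.3) for i, xi in enumerate(x)]  # line plus small noise
q2 = q_squared(x, y, n_folds=5)
print(round(q2, 3))
```

In a full workflow, descriptor selection would also be repeated inside each fold so that the left-out compounds stay truly external.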

Other Model Validation Parameters
1. $s$ is the standard deviation about the regression line, $s = \sqrt{\mathrm{RSS}/(N - k - 1)}$, where N is the number of observations and k is the number of variables. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR.
2. Scrambling of y.

Statistical tests for "chance correlations"
Scrambling: to mix randomly Y values (Y-scrambling), or X values (X-scrambling), or simultaneously Y and X values (X,Y-scrambling).
Randomization: to generate random numbers from Y min to Y max (Y-randomization), from X min to X max (X-randomization), or do this simultaneously for Y and X (X,Y-randomization).
Calculate statistical parameters of correlations and compare them with those obtained for the model.
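Y-scrambling can be sketched as below: the property values are shuffled many times, the model is refitted each time, and the resulting $r^2$ values are compared with the $r^2$ of the real model. A one-descriptor linear model is assumed for brevity; the function names, the number of trials and the toy data are hypothetical.

```python
import random

def r2_line(x, y):
    """Squared Pearson correlation, i.e. r2 of a one-descriptor linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_scrambling(x, y, n_trials=100, seed=0):
    """r2 values obtained after randomly permuting the Y values."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the structure-property link
        scores.append(r2_line(x, y_perm))
    return scores

x = [float(i) for i in range(20)]
y = [3.0 * xi + 2.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x)]
real_r2 = r2_line(x, y)
scrambled = y_scrambling(x, y)
print(round(real_r2, 3), round(sum(scrambled) / len(scrambled), 3))
```

If the scrambled models reached $r^2$ values comparable to the real one, the original correlation would be suspected to be fortuitous.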

[Figure: Scrambling. The property values Pro.1 ... Pro.n are randomly reassigned among the structures Struc.1 ... Struc.n.]

Prediction performance of QSPR models for a test compound
Robustness of QSPR models:
- Descriptors type;
- Descriptors selection;
- Machine-learning methods;
- Validation of models.
Applicability domain of models: is a test compound similar to the training set compounds?

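Applicability domain of QSAR models
The new compound will be predicted by the model only if: $D_i \le \langle D_k \rangle + Z \times s_k$, with $Z$ an empirical parameter (0.5 by default). Here $D_i$ is the distance from the test compound to its nearest neighbors in the training set, and $\langle D_k \rangle$ and $s_k$ are the mean and the standard deviation of the distances between training set compounds and their k nearest neighbors.
[Figure: training set and test compounds in the Descriptor 1 / Descriptor 2 plane. A test compound inside the domain will be predicted; a test compound outside the domain will not be predicted.]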
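A distance-based domain check of this kind can be sketched as follows. The details are assumptions: Euclidean distance, the test compound's distance taken as the mean distance to its k nearest training compounds, and the threshold built from the mean and standard deviation of k-nearest-neighbor distances within the training set. All function names and data points are illustrative.

```python
import math
from statistics import mean, pstdev

def knn_distances(point, others, k):
    """Distances from a point to its k nearest neighbors among others."""
    return sorted(math.dist(point, o) for o in others)[:k]

def ad_threshold(train_X, k=3, z=0.5):
    """Mean of training kNN distances plus z times their standard deviation."""
    all_d = []
    for i, x in enumerate(train_X):
        others = train_X[:i] + train_X[i + 1:]  # exclude the compound itself
        all_d.extend(knn_distances(x, others, k))
    return mean(all_d) + z * pstdev(all_d)

def inside_domain(query, train_X, k=3, z=0.5):
    """True if the test compound is close enough to the training set."""
    d_i = mean(knn_distances(query, train_X, k))
    return d_i <= ad_threshold(train_X, k, z)

train_X = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 2)]
print(inside_domain((0.5, 0.5), train_X, k=2))  # True
print(inside_domain((5.0, 5.0), train_X, k=2))  # False
```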

Applicability domain of QSAR models: range-based methods, Bounding Box (BB)
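A bounding box check is the simplest range-based approach: a test compound is inside the domain if each of its descriptor values lies within the minimum and maximum seen in the training set. The sketch below uses hypothetical function names and toy data.

```python
def bounding_box(train_X):
    """Per-descriptor (min, max) ranges of the training set."""
    return [(min(col), max(col)) for col in zip(*train_X)]

def in_bounding_box(query, box):
    """True if every descriptor value of the query lies within its range."""
    return all(lo <= q <= hi for q, (lo, hi) in zip(query, box))

train_X = [(0.0, 1.0), (2.0, 3.0), (1.0, 5.0)]
box = bounding_box(train_X)
print(box)                                # [(0.0, 2.0), (1.0, 5.0)]
print(in_bounding_box((1.5, 4.0), box))   # True
print(in_bounding_box((2.5, 4.0), box))   # False
```

The box ignores empty regions inside the ranges and correlations between descriptors, which is why distance-based criteria are often used as well.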

Ensemble modeling: should one use a single individual model or many models?

Hunting season … Single hunter

Hunting season … Many hunters

Ensemble modeling: Model 1, Model 2, Model 3, Model 4

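Property (Y) predictions using best fit models
Compound | model 1 | model 2 | … | mean ± s
Compound 1 | $Y_{11}$ | $Y_{12}$ | … | $\langle Y_1 \rangle \pm \Delta Y_1$
Compound 2 | $Y_{21}$ | $Y_{22}$ | … | $\langle Y_2 \rangle \pm \Delta Y_2$
… | … | … | … | …
Compound m | $Y_{m1}$ | $Y_{m2}$ | … | $\langle Y_m \rangle \pm \Delta Y_m$
Grubbs' test is used to exclude outliers.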
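The consensus value for each compound can be sketched as the mean and sample standard deviation of the individual model predictions. The outlier-exclusion step (Grubbs' test) is deliberately left out of this sketch, and the function name and numbers are hypothetical.

```python
from statistics import mean, stdev

def consensus(predictions):
    """Mean and sample standard deviation over the individual model predictions."""
    return mean(predictions), stdev(predictions)

# Hypothetical predictions of three models for two compounds.
model_predictions = {
    "Compound 1": [1.0, 1.2, 1.1],
    "Compound 2": [2.0, 2.4, 2.2],
}
for name, preds in model_predictions.items():
    y_mean, y_std = consensus(preds)
    print(f"{name}: {y_mean:.2f} ± {y_std:.2f}")
# Compound 1: 1.10 ± 0.10
# Compound 2: 2.20 ± 0.20
```

A large spread across models can also serve as a warning that the prediction for that compound is unreliable.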

Calculation of Descriptors
ISIDA FRAGMENTOR converts the data set into the pattern matrix of fragment counts.
Fragments: C-C-C-C-C-C, C-C-C-N-C-C, C=O, C-C-C-N, C-N-C-C*C, etc.
Pattern matrix (one row per compound):
0 10 1 5 0
0 8 1 4 0
0 4 1 2 4

LEARNING STAGE: pattern matrix + property values (e.g., -0.222, 0.973, -0.066) -> building of QSAR models
VALIDATION STAGE: filtering of QSAR models -> selection of the most predictive ones

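Example: linear QSPR model
$\mathrm{PROPERTY}_{calc} = -0.36 \, N_{\text{C-C-C-N-C-C}} + 0.27 \, N_{\text{C=O}} + 0.12 \, N_{\text{C-N-C*C}} + \dots$, where each $N$ is the number of occurrences of the corresponding fragment in the molecule.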

Virtual screening with QSAR/QSPR models

Virtual Screening
Compounds from a database are screened with a QSPR model. Screening and hits selection separate the hits, which go on to experimental tests, from the useless compounds.

Combinatorial Library Design

Generation of Virtual Combinatorial Libraries
[Figure: a Markush structure with variable substituents R1, R2, R3 and the structures enumerated from it.]

The types of variation in Markush structures:
1. Substituent variation (R1)
2. Position variation (R2)
3. Frequency variation
4. Homology variation (R3) (only for patent search)
Example: R1 = Me, Et, Pr; R2 = NH2; R3 = alkyl or heterocycle; n = 1 – 3
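Enumeration of a virtual library from such variation lists amounts to a Cartesian product. In the sketch below, R1 and n follow the example on the slide, while the two R3 entries are hypothetical stand-ins for "alkyl or heterocycle"; real enumeration would assemble chemical structures rather than dictionaries of labels.

```python
from itertools import product

r1_options = ["Me", "Et", "Pr"]   # substituent variation (from the slide)
n_options = [1, 2, 3]             # frequency variation, n = 1 - 3
r3_options = ["Bu", "pyridyl"]    # hypothetical examples of "alkyl or heterocycle"

# Every combination of the variable parts defines one virtual compound.
library = [
    {"R1": r1, "n": n, "R3": r3}
    for r1, n, r3 in product(r1_options, n_options, r3_options)
]
print(len(library))  # 18
print(library[0])    # {'R1': 'Me', 'n': 1, 'R3': 'Bu'}
```

The library size grows multiplicatively with the number of options per variable position, which is why fast screening is needed downstream.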

IN SILICO design of new compounds

"In silico" design of new compounds:
- Acquisition of Data;
- Acquisition of Knowledge;
- Exploitation of Knowledge

ISIDA combinatorial module
The combinatorial module generates virtual libraries based on the Markush structures.
Workflow elements shown in the scheme: database and QSAR models; Markush structure; filtering by similarity search and applicability domains; assessment of properties with QSAR models; hits selection; synthesis and experimental tests.
ISIDA screening speed: 1000 molecules/second

COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of $\mathrm{UO_2^{2+}}$ by monoamides (R = H, alkyl)
Distribution ratio: $D = [\mathrm{U}]_{\text{organic phase}} / [\mathrm{U}]_{\text{aqueous phase}}$
A. Varnek, D. Fourches, V. Solov’ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N°4

SOLVENT EXTRACTION OF METALS
[Figure: metal cations M1+ and M2+, anion An-, and ligand L.]

COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Extraction of $\mathrm{UO_2^{2+}}$ by monoamides
Reprocessing of the spent nuclear fuel at the La Hague plant (Usine de La Hague), France. PUREX process. TBP: tributyl phosphate

Goal: theoretical design of new uranyl binders more efficient than previously studied molecules
Previously studied molecules:
1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960)
2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch., 17, 87 (1999)

ISIDA workflow: DATABASE -> DATA TREATMENT -> PREDICTOR (EXPERT SYSTEM) -> VIRTUAL SCREENING -> Hits selection
Virtual library: 11,000 compounds. Selected hits: 21 compounds.

“In silico” design of uranyl binders with ISIDA

[Figure: experimental vs predicted logD for the new amides (by ID).]

Enrichment of the initial data set by new efficient extractants with logD > 0.9: 4 compounds (previously studied), 9 compounds (newly synthesized).
[Figure: number of compounds vs logD for previously studied and newly synthesized amides.]

Classification Models

Confusion Matrix
For N instances, K classes and a classifier, $N_{ij}$ is the number of instances of class i classified as j.
 | Class 1 | Class 2 | … | Class K
Class 1 | $N_{11}$ | $N_{12}$ | … | $N_{1K}$
Class 2 | $N_{21}$ | $N_{22}$ | … | $N_{2K}$
… | … | … | … | …
Class K | $N_{K1}$ | $N_{K2}$ | … | $N_{KK}$

Classification Evaluation
Global measures of success: measures are estimated on all classes.
Local measures of success: measures are estimated for each class.
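The following sketch builds a confusion matrix and derives one global and one local measure from it. Overall accuracy and per-class recall are chosen here only as common examples of each kind; the slides do not name specific measures, and the function names and labels are hypothetical.

```python
def confusion_matrix(true_labels, predicted_labels, classes):
    """N[i][j] = number of instances of class i classified as class j."""
    index = {c: i for i, c in enumerate(classes)}
    N = [[0] * len(classes) for _ in classes]
    for t, p in zip(true_labels, predicted_labels):
        N[index[t]][index[p]] += 1
    return N

def accuracy(N):
    """Global measure: fraction of all instances on the diagonal."""
    total = sum(sum(row) for row in N)
    return sum(N[i][i] for i in range(len(N))) / total

def recall_per_class(N):
    """Local measure: fraction of each true class that was recovered."""
    return [N[i][i] / sum(N[i]) for i in range(len(N))]

classes = ["active", "inactive"]
true_labels = ["active"] * 3 + ["inactive"] * 5
predicted = ["active", "active", "inactive",
             "inactive", "inactive", "inactive", "active", "inactive"]
N = confusion_matrix(true_labels, predicted, classes)
print(N)            # [[2, 1], [1, 4]]
print(accuracy(N))  # 0.75
print([round(r, 3) for r in recall_per_class(N)])  # [0.667, 0.8]
```

For unbalanced data sets the global accuracy can look good while the recall of the minority class is poor, which is why both kinds of measures are reported.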

"The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties." George S. Hammond, Norris Award Lecture, 1968
