Quantitative Structure-Activity Relationships / Quantitative Structure-Property Relationships. Alexandre Varnek, Faculté de Chimie, ULP, Strasbourg, FRANCE.


QSAR/QSPR modeling

Development, Validation, and Application of QSAR/QSPR models

Development of QSAR models:
- Selection and curation of experimental data
- Preparation of training and test sets (optionally)
- Selection of an initial set of descriptors and their normalisation
- Variable selection
- Selection of a machine-learning method
- Validation of models: training/test set; cross-validation (internal, external)
- Application of the models: applicability domain of the models

Development of QSAR models requires: experimental data, descriptors, mathematical techniques, and statistical criteria.

Preparation of training and test sets:
- Splitting of an initial data set into a training set and a test set (typically 10 - 15 % of the data)
- Building of structure - property models on the training set
- Selection of the best models according to statistical criteria
- "Prediction" calculations on the test set using the best structure - property models

Recommendations to prepare a test set:
(i) experimental methods for determination of activities in the training and test sets should be similar;
(ii) the activity values should span several orders of magnitude, but should not exceed the activity values in the training set by more than 10%;
(iii) the balance between active and inactive compounds should be respected for uniform sampling of the data.
Reference: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37,

Selection of descriptors for QSAR models. A QSAR model should be reduced to a set of descriptors that is as information-rich, yet as small, as possible. Rules of thumb: a good "spread" of descriptor values; 5-6 structure points per descriptor.
Objective selection (independent variables only):
- statistical criteria of correlations
- pairwise selection (forward or backward stepwise selection)
- Principal Component Analysis
- Partial Least Squares analysis
- Genetic Algorithm, ...
Subjective selection:
- descriptor selection based on mechanistic studies

Preprocessing strategy for the derivation of models for use in structure - activity relationships (QSARs):
1. identify a subset of columns (variables) with significant correlation to the response;
2. remove columns (variables) with small variance;
3. remove columns (variables) with no unique information;
4. identify a subset of variables on which to construct a model;
5. address the problem of chance correlation.
D. C. Whitley, M. G. Ford, D. J. Livingstone, J. Chem. Inf. Comput. Sci. 2000, 40,

Machine-Learning Methods

Fitting model parameters. Given a model Y = F(a i, X i), where the X i are descriptors (independent variables) and the a i are fitted parameters, the goal is to minimize the Residual Sum of Squares (RSS).
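As a minimal illustration (a sketch, not the slides' own code; the function names are hypothetical), the one-descriptor linear case can be fitted in closed form by minimizing the RSS:

```python
# Hedged sketch: closed-form least-squares fit of y = a0 + a1*x,
# i.e. the parameter values that minimize the Residual Sum of Squares.
def fit_least_squares(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a1 = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
          / sum((x - mx) ** 2 for x in xs))
    a0 = my - a1 * mx
    return a0, a1

def rss(xs, ys, a0, a1):
    # Residual Sum of Squares of the fitted line
    return sum((y - (a0 + a1 * x)) ** 2 for x, y in zip(xs, ys))
```

For perfectly linear data the fitted parameters reproduce the generating line and the RSS vanishes.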

Multiple Linear Regression (one-descriptor case): Y i = a 0 + a 1 X i1, fitted to a table of activities Y 1, ..., Y n against descriptor values X 1, ..., X n.

Figure: the fitted line y = ax + b, with slope a and intercept b chosen to minimize the Residual Sum of Squares (RSS).

Multiple Linear Regression: Y i = a 0 + a 1 X i1 + a 2 X i2 + ... + a m X im, fitted to a table of n activities (Y 1, ..., Y n) against m descriptors (X i1, ..., X im).
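A multiple linear regression of this form can be solved via the normal equations. The sketch below (illustrative, with hypothetical names; Gaussian elimination stands in for a library solver) recovers the coefficients a 0 ... a m:

```python
# Hedged sketch: multiple linear regression Y = a0 + a1*X1 + ... + am*Xm,
# solved via the normal equations (X^T X) a = X^T Y.
def mlr_fit(X, Y):
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    m = len(rows[0])
    # Build the normal equations A a = b
    A = [[sum(r[i] * r[j] for r in rows) for j in range(m)] for i in range(m)]
    b = [sum(r[i] * y for r, y in zip(rows, Y)) for i in range(m)]
    # Gaussian elimination with partial pivoting
    for col in range(m):
        piv = max(range(col, m), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, m):
            f = A[r][col] / A[col][col]
            for c in range(col, m):
                A[r][c] -= f * A[col][c]
            b[r] -= f * b[col]
    # Back substitution
    a = [0.0] * m
    for i in range(m - 1, -1, -1):
        a[i] = (b[i] - sum(A[i][j] * a[j] for j in range(i + 1, m))) / A[i][i]
    return a  # [a0, a1, ..., am]
```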

kNN (k Nearest Neighbors). The activity Y of a test compound is assessed by calculating a weighted mean of the activities Y i of its k nearest neighbors in the chemical (descriptor) space of the training set (A. Tropsha, A. Golbraikh, 2003).
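A minimal sketch of this idea (assuming inverse-distance weighting, one common choice; the slides do not specify the weighting scheme):

```python
import math

# Hedged sketch: kNN property prediction as an inverse-distance-weighted
# mean of the activities of the k nearest training compounds.
def knn_predict(train_X, train_y, query, k=3, eps=1e-9):
    nearest = sorted(
        (math.dist(x, query), y) for x, y in zip(train_X, train_y)
    )[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]  # eps guards d == 0
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)
```

A query close to one training compound is pulled strongly toward that compound's activity.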

Biological and Artificial Neuron

Multilayer Neural Network. Neurons in the input layer correspond to descriptors, neurons in the output layer to the properties being predicted, and neurons in the hidden layer to nonlinear latent variables.

Development, Validation, and Application of QSAR/QSPR models

Validating the QSAR equation. How well does the model predict the activity of known compounds? r 2 is the fraction of the total variation in the dependent variable that is explained by the regression equation. For a perfect model, all data points in the predicted-vs-actual plot would reside on the diagonal, and all variance existing in the original data would be explained by the model.

Calculating r 2. Original variance = explained variance (i.e., variance explained by the equation) + unexplained variance (i.e., residual variance around the regression line).

Calculating r 2. Original variance: TSS = Σ(y i - ȳ)². Explained variance: ESS = Σ(ŷ i - ȳ)², the improvement in predicting y over just using the mean of y. Variance around the regression line: RSS = Σ(y i - ŷ i)². Hence r 2 = ESS/TSS = 1 - RSS/TSS.
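The variance decomposition above translates directly into code (a sketch with a hypothetical function name):

```python
# Hedged sketch: r2 from the variance decomposition, r2 = 1 - RSS/TSS.
def r_squared(y_obs, y_pred):
    mean_y = sum(y_obs) / len(y_obs)
    tss = sum((y - mean_y) ** 2 for y in y_obs)             # original variance
    rss = sum((y - p) ** 2 for y, p in zip(y_obs, y_pred))  # residual variance
    return 1.0 - rss / tss
```

Predicting every point exactly gives r2 = 1; predicting the mean of y for every point gives r2 = 0.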

F-test. Tests the assumption that a significant portion of the original variance has been explained by the model. In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the residual variance (RSS/(N - k - 1); N = number of data points) significantly differs from what is expected under the null hypothesis that ESS = 0, i.e., that the model explains none of the variance.

F-distribution. As N and k decrease, the probability of getting large r 2 values purely by chance increases. Thus, as N and k decrease, a larger F-value is required for the test to be significant.

Calculating F values. Calculate F according to the above equation. Select a significance level (e.g., 0.05). Look up the F-value from an F-distribution derived for the correct N and k at the selected significance level. If the calculated F-value is larger than the tabulated F-value, then the regression equation is significant at this significance level. Example: r 2 = 0.89, N = 7, k = 1, F = . For an F-distribution with N = 7, k = 1, this value corresponds to a significance level of , so the equation is significant at this level. The probability that the correlation is fortuitous is < 0.03%.
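The numeric F-value of the slide's example did not survive transcription, but since ESS/TSS = r2 and RSS/TSS = 1 - r2, F can be recomputed from r2, N, and k alone. A sketch (hypothetical function name):

```python
# Hedged sketch: the F statistic expressed through r2,
# F = (ESS/k) / (RSS/(N - k - 1)) = (r2/k) / ((1 - r2)/(N - k - 1)).
def f_statistic(r2, n, k):
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))
```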

Validation of models: 5-fold external cross-validation procedure.

Cross-validation. Q 2 is a measure of the predictive ability of the model (as opposed to the measure of fit produced by r 2 ). r 2 always increases as more descriptors are added. Q 2 initially increases as more parameters are added, but then starts to decrease, indicating overfitting of the data. Thus Q 2 is a better indicator of model quality.

Other model validation parameters.
1. s, the standard deviation about the regression line: s = sqrt(RSS / (N - k - 1)), where N is the number of observations and k is the number of variables. This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity; the smaller the value of s, the better the QSAR.
2. Scrambling of y.

Statistical tests for "chance correlations":
Scrambling: randomly mix the Y values (Y-scrambling), the X values (X-scrambling), or both Y and X values simultaneously (X,Y-scrambling).
Randomization: generate random numbers from Y min to Y max (Y-randomization), from X min to X max (X-randomization), or both simultaneously (X,Y-randomization).
Then calculate the statistical parameters of the resulting correlations and compare them with those obtained for the real model.
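A sketch of the Y-scrambling variant (illustrative, with hypothetical names; a simple one-descriptor r 2 stands in for the full model): a real model should outscore every scrambled refit.

```python
import random

# Hedged sketch: Y-scrambling test for chance correlation. r2 of a simple
# linear fit is compared against r2 values obtained after randomly mixing
# the Y column.
def linear_r2(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

def y_scrambling(xs, ys, n_trials=50, seed=0):
    rng = random.Random(seed)
    scrambled = []
    for _ in range(n_trials):
        perm = ys[:]
        rng.shuffle(perm)            # mix the Y column at random
        scrambled.append(linear_r2(xs, perm))
    return linear_r2(xs, ys), max(scrambled)
```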

Scrambling: the structures (Struc.1, ..., Struc.n) are randomly re-paired with the properties (Pro.1, ..., Pro.n).

Development, Validation, and Application of QSAR/QSPR models

Is a test compound similar to the training set compounds? The prediction performance of a QSPR model on a test compound depends on the descriptor type, descriptor selection, the machine-learning method, and the validation of the models; hence the need for robust QSPR models and an applicability domain of the models.

Applicability domain of QSAR models. A new (test) compound will be predicted by the model only if D i ≤ D̄ k + Z × s k, where D̄ k and s k are the mean and standard deviation of the distances between training compounds and their k nearest neighbors, and Z is an empirical parameter (0.5 by default). A compound inside the domain will be predicted; a compound outside the domain will not be predicted.
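A sketch of this distance-based criterion (illustrative; it uses the single nearest neighbour, i.e. k = 1, and a hypothetical function name):

```python
import math

# Hedged sketch: distance-based applicability domain. A test compound is
# predicted only if its distance to the nearest training compound does not
# exceed Dmean + Z*s, where Dmean and s are the mean and standard deviation
# of nearest-neighbour distances inside the training set.
def in_applicability_domain(train, query, z=0.5):
    n = len(train)
    nn = [min(math.dist(train[i], train[j]) for j in range(n) if j != i)
          for i in range(n)]
    mean_d = sum(nn) / n
    s = math.sqrt(sum((d - mean_d) ** 2 for d in nn) / (n - 1))
    d_query = min(math.dist(query, p) for p in train)
    return d_query <= mean_d + z * s
```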

Applicability domain of QSAR models: range-based methods, e.g. the Bounding Box (BB).

Ensemble modeling: should one use only one individual model, or many models?

Hunting season … Single hunter

Hunting season … Many hunters

Ensemble modelling Model 1 Model 2 Model 3 Model 4

Property (Y) predictions using the best-fit models. For each compound, the predictions of the individual models are combined: compound 1: Y 11, Y 12, ..., mean ± ΔY 1; compound 2: Y 21, Y 22, ..., mean ± ΔY 2; ...; compound m: Y m1, Y m2, ..., mean ± ΔY m. Grubbs' statistic is used to exclude outliers.
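The consensus step can be sketched as follows (illustrative, hypothetical name; the Grubbs outlier filtering the slide mentions would be applied to each row before averaging):

```python
import statistics

# Hedged sketch: consensus prediction over an ensemble of best-fit models.
# Each compound gets the mean of its individual predictions; the standard
# deviation serves as an uncertainty estimate.
def consensus(predictions):
    # predictions[i] holds the values Y_i1, Y_i2, ... for compound i
    return [(statistics.mean(row), statistics.stdev(row))
            for row in predictions]
```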

Calculation of descriptors. The ISIDA FRAGMENTOR counts substructural fragments (C-C-C-C-C-C, C-C-C-N-C-C, C=O, C-C-C-N, C-N-C*C, etc.) in every molecule of the data set to produce the pattern matrix.

LEARNING STAGE: pattern matrix + property values -> building of models -> QSAR models. VALIDATION STAGE: filtering of the QSAR models -> selection of the most predictive ones.

Example: a linear QSPR model. PROPERTY calc = a 0 + a 1 × N C-C-C-N-C-C + a 2 × N C=O + a 3 × N C-N-C*C + ..., where N F is the occurrence count of fragment F and the a i are fitted coefficients.

Virtual screening with QSAR/QSPR models

Virtual screening. The database is screened with the QSPR model; hits are selected and submitted to experimental tests, while useless compounds are discarded.

Combinatorial Library Design

Generation of virtual combinatorial libraries: given a Markush structure and lists of candidate substituents for R1, R2, and R3, all combinations are enumerated.

The types of variation in Markush structures:
1. Substituent variation (R 1 = Me, Et, Pr)
2. Position variation (R 2 = NH 2 )
3. Frequency variation (n = 1 - 3)
4. Homology variation (R 3 = alkyl or heterocycle; only for patent search)

IN SILICO design of new compounds

"In silico" design of new compounds:
- Acquisition of Data;
- Acquisition of Knowledge;
- Exploitation of Knowledge.

ISIDA combinatorial module. The combinatorial module generates virtual libraries based on the Markush structures (about 1000 molecules/second). Workflow: (1) QSAR models; (2) filtering of the database; (3) similarity search; (4) applicability domains; (5) assessment of properties with QSAR models; (6) hits selection; (7) synthesis and experimental tests.

COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: binding of UO 2 2+ by monoamides (R = H, alkyl). The distribution ratio is D = [U] organic phase / [U] aqueous phase. A. Varnek, D. Fourches, V. Solov'ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N°4.

SOLVENT EXTRACTION OF METALS (scheme: metal cations M 1 +, M 2 +, anions An -, and ligand L).

COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: extraction of UO 2 2+ by monoamides. PUREX process: reprocessing of the spent nuclear fuel (Usine de La HAGUE, France). TBP: tributyl phosphate.

Goal: theoretical design of new uranyl binders more efficient than previously studied molecules.
1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960)
2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch., 17, 87 (1999)

ISIDA expert system: DATABASE -> DATA TREATMENT -> PREDICTOR -> VIRTUAL SCREENING -> hits selection. Virtual library: cmpds; selected hits: 21 cmpds.

“In silico” design of uranyl binders with ISIDA

Figure: experimental vs. predicted logD for the new amides (by ID).

Enrichment of the initial data set by new efficient extractants (logD > 0.9): 4 compounds previously studied, 9 compounds newly synthesized. (Histogram: number of compounds vs. logD for previously studied and newly synthesized amides.)

Classification Models

Confusion matrix. For N instances, K classes, and a classifier, N ij is the number of instances of class i classified as class j:
Class1: N 11, N 12, ..., N 1K
Class2: N 21, N 22, ..., N 2K
...
ClassK: N K1, N K2, ..., N KK
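Building such a matrix is a single counting pass (a sketch with a hypothetical function name; classes are assumed to be encoded as integers 0 ... K-1):

```python
# Hedged sketch: K x K confusion matrix, where entry [i][j] counts
# instances of true class i classified as class j.
def confusion_matrix(y_true, y_pred, k):
    m = [[0] * k for _ in range(k)]
    for t, p in zip(y_true, y_pred):
        m[t][p] += 1
    return m
```

Diagonal entries count correct classifications; off-diagonal entries count the misclassifications.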

Classification evaluation. Global measures of success are estimated over all classes; local measures of success are estimated for each class.

"The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties." George S. Hammond, Norris Award Lecture, 1968.