Review of methods to assess a QSAR Applicability Domain
Joanna Jaworska, Procter & Gamble European Technical Center, Brussels, Belgium
and Nina Nikolova-Jeliazkova, IPP, Bulgarian Academy of Sciences, Sofia, Bulgaria

Contents
- Why do we need an applicability domain?
- What is an applicability domain?
  - Training data set coverage vs. predictive domain
- Methods for identification of training set coverage
- Methods for identification of the predictive domain
- Practical use / software availability

Why do we need an applicability domain for a QSAR?
- Use of QSAR models for decision making is increasing
  - Cost and time effective
  - Alternative to animal testing
- Concerns relate to evaluating the quality of model predictions and preventing potential misuse of a model
  - Acceptance of a result = prediction from within the applicability domain
- Elements of prediction quality:
  - Define whether the model is suitable to predict the activity of a queried chemical
  - Assess the uncertainty of the model's result

QSAR models as high-consequence computing – can we learn from others?
- In the past, QSAR research focused on analysis of experimental data and development of QSAR models
- The definition of a QSAR applicability domain has not been addressed in the past
  - Acceptance of a QSAR result was left to the discretion of an expert
  - This is no longer classic computational toxicology
  - Currently the methods and software are not well integrated
- However, computational physicists and engineers are working on the same topic
  - Reliability theory and uncertainty analysis
  - Increasingly dominated by Bayesian approaches

What is an applicability domain?
- The Setubal report (2002) provided a philosophical definition of the applicability domain, but not one that can be translated directly into computation.
- The training data set from which a QSAR model is derived provides the basis for estimating its applicability domain.
- The training set data, when projected into the model's multivariate parameter space, define regions populated with data and empty regions.
- The populated regions define the applicability domain of a model, i.e. the space in which the model is suitable for prediction. This stems from the fact that, in general, interpolation is more reliable than extrapolation.

Experience using the QSAR training set domain as the application domain
- Interpolative predictive accuracy, defined as predictive accuracy within the training set, is in general greater than extrapolative predictive accuracy.
- The average prediction error outside the application domain defined by the training set ranges is about twice the prediction error inside the domain.
- Note that this holds only on average: there are many individual compounds with low error outside the domain, as well as individual compounds with high error inside the domain.
- For more information, see the poster.

What have we missed while defining the applicability domain?
- The approach discussed so far addresses ONLY training data set coverage.
- Is the applicability domain of two different models developed on the same data set the same or different?
- Clearly, we need to take the model itself into account.

Applicability domain – evolved view
- Assessing whether a prediction comes from the interpolation region representing the training set says nothing about model accuracy
  - The only link to the model is through the model variables (descriptors)
- The model's predictive error is eventually needed to decide whether to accept a result
  - The predictive error is related to experimental data variability and parameter uncertainty
  - Quantitative assessment of the prediction error allows transparent decision making, where different cutoff values for acceptable error can be used for different management applications

Applicability domain estimation – a two-step process
- Step 1 – Estimation of the application domain
  - Define training data set coverage by interpolation
- Step 2 – Model uncertainty quantification
  - Calculate the uncertainty of predictions, i.e. the predictive error

Application domain of a QSAR
[Figure: training set of chemicals projected into a multivariate descriptor space; legend: estrogen binding training set vs. HVPC database.]
Applicability domain = populated regions in the multivariate descriptor space?

Application domain estimation
- Most current QSAR models are not LFERs; they are statistical models with varying degrees of mechanistic interpretation, usually developed a posteriori.
- Application of a statistical model is confined to the interpolation region of the data used to develop it, i.e. the training set.
- Mathematically, interpolation within the projection of the training set in the model's descriptor space is equivalent to estimating a multivariate convex hull (see the sketch below).
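A minimal sketch of such a convex-hull membership test, assuming a low-dimensional descriptor space (Delaunay triangulation becomes impractical in high dimensions); the names X_train and x_query are illustrative, not part of the original presentation:

```python
import numpy as np
from scipy.spatial import Delaunay

def in_convex_hull(X_train, x_query):
    """True if the query point lies inside the convex hull of the training set."""
    hull = Delaunay(X_train)                 # triangulate the training points
    return hull.find_simplex(x_query) >= 0   # find_simplex returns -1 outside the hull

rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 2))           # toy 2-D descriptor matrix
print(in_convex_hull(X_train, np.array([0.1, 0.2])))    # likely inside
print(in_convex_hull(X_train, np.array([10.0, 10.0])))  # clearly outside
```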

Is the classic definition of interpolation sufficient?
In reality, often:
- Data are sparse and non-homogeneous; group contribution methods are especially vulnerable to the "curse of dimensionality"
- Data in the training set are not chosen according to an experimental design, because we are doing retrospective evaluations
- Empty regions within the interpolation space may therefore exist
- The relationship within the empty regions can differ from the derived model, and we cannot verify this without additional data

Interpolation vs. extrapolation
- 1D: the parameter range determines the interpolation region
- >2D: is empty space within the ranges still interpolation?
[Figures: BCF training set; SRC KOWWIN training set.]

Interpolation vs. extrapolation (linear models)
- 1D linear model: predictions within the interpolation range do not exceed the training set endpoint values
- 2D linear model: predictions can exceed the training set endpoint values even within the descriptor ranges (toy illustration below)
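A toy numerical illustration of the 2-D case; the model and points are invented for illustration only:

```python
# Hypothetical 2-D linear model y = x1 + x2; training points (1, 0) and (0, 1)
# both have endpoint value 1, so each descriptor's range is [0, 1].
def model(x1, x2):
    return x1 + x2

print(model(1, 0), model(0, 1))  # training predictions: 1 1
print(model(1, 1))               # query inside both ranges, but prediction = 2
```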

Approaches to determine interpolation regions
- Descriptor ranges
- Distances
- Geometric
- Probabilistic

Ranges of descriptors
- Very simple (sketched below)
- Works for high-dimensional models
  - The only practical solution for group contribution methods
  - The KOWWIN model contains over 500 descriptors
- Cannot detect holes in the interpolated space
- Assumes a homogeneous distribution of the data
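A minimal sketch of the range check, assuming X_train is the training descriptor matrix (rows = chemicals) and x_query is the descriptor vector of a query chemical; both names are illustrative:

```python
import numpy as np

def within_ranges(X_train, x_query):
    """True if every descriptor of the query lies within the training min/max range."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    return bool(np.all((x_query >= lo) & (x_query <= hi)))
```

As the slide notes, this scales to hundreds of descriptors but cannot detect empty regions inside the hyper-rectangle.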

Distance approach (a sketch follows below)
- Euclidean distance
  - Assumes a Gaussian distribution of the data
  - Assumes no correlation between descriptors
- Mahalanobis distance
  - Assumes a Gaussian distribution of the data
  - Accounts for correlation between descriptors
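A minimal sketch of the Mahalanobis variant, which accounts for descriptor correlations through the training covariance matrix; X_train, x_query and the percentile cutoff mentioned in the comment are illustrative assumptions:

```python
import numpy as np

def mahalanobis_distance(X_train, x_query):
    """Mahalanobis distance of a query point from the training set centroid."""
    mean = X_train.mean(axis=0)
    cov_inv = np.linalg.pinv(np.cov(X_train, rowvar=False))  # pseudo-inverse is safer if descriptors are collinear
    d = x_query - mean
    return float(np.sqrt(d @ cov_inv @ d))

# A common convention is to flag a query as outside the domain when its distance
# exceeds, e.g., the 95th percentile of the distances of the training compounds.
```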

Probabilistic approach (sketched below)
- Does not assume a standard distribution: a solution for the general multivariate case via non-parametric density estimation
- The probability density is the most accurate approach to identify regions containing data
- Can find internal empty regions and differentiate between regions of differing density
- Accounts for correlations and skewness
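A minimal sketch of a non-parametric (kernel) density check: a query point falling in a low-density region is treated as outside the domain. The default bandwidth and the percentile threshold are illustrative choices, not from the presentation:

```python
import numpy as np
from scipy.stats import gaussian_kde

def density_domain_check(X_train, x_query, percentile=5):
    """True if the query's density is at least as high as the lowest-density training points."""
    kde = gaussian_kde(X_train.T)                    # scipy expects shape (ndim, npoints)
    train_density = kde(X_train.T)                   # density at each training point
    threshold = np.percentile(train_density, percentile)
    return float(kde(x_query.reshape(-1, 1))) >= threshold
```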

Bayesian probabilistic approach to classification
- The Bayesian classification rule provides theoretically optimal decision boundaries with the smallest classification error
- Estimate the density of each data set
- Read off the probability density value of the new point for each data set
- Classify the point to the data set with the highest probability density value (see the sketch below)

References:
- Duda R. O., Hart P. E., Pattern Classification and Scene Analysis, Wiley, 1973
- Duda R. O., Hart P. E., Stork D. G., Pattern Classification, 2nd ed., John Wiley & Sons, 2000
- Devroye L., Gyorfi L., Lugosi G., A Probabilistic Theory of Pattern Recognition, Springer, 1996
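A minimal sketch of this density-comparison rule, assuming each data set is given as an array of descriptor rows; the proportional priors and all names are illustrative:

```python
import numpy as np
from scipy.stats import gaussian_kde

def classify_by_density(datasets, x_query, priors=None):
    """Assign the query to the data set with the highest (prior-weighted) density.

    datasets: list of (n_i, ndim) arrays; returns the index of the winning set.
    """
    if priors is None:
        priors = [len(d) for d in datasets]          # priors proportional to set sizes
    scores = [p * float(gaussian_kde(d.T)(x_query.reshape(-1, 1)))
              for d, p in zip(datasets, priors)]
    return int(np.argmax(scores))
```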

Probability density estimation: multidimensional approximations
- Approximation 1: assume the descriptors x_i are independent
  - Estimate each 1D density with a fast algorithm
  - Estimate the nD density as the product of the 1D densities
  - Does not account for correlation between descriptors
- Approximation 2: extract principal components first (sketched below)
  - Estimate a 1D density with a fast algorithm on each principal component
  - Estimate the nD density as the product of the 1D densities
  - Accounts for linear correlations via PCA (rotation of the coordinate system)
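A minimal sketch of the second approximation, assuming X_train and x_query as before; PCA removes linear correlations, then the multivariate density is approximated as the product of 1D densities along the principal components:

```python
import numpy as np
from scipy.stats import gaussian_kde

def pca_product_density(X_train, x_query):
    """Approximate the multivariate density at x_query as a product of 1-D densities in PC space."""
    mean = X_train.mean(axis=0)
    Xc = X_train - mean
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)    # rows of Vt = principal axes
    scores = Xc @ Vt.T                                    # training data in PC coordinates
    q = (x_query - mean) @ Vt.T                           # query in PC coordinates
    density = 1.0
    for j in range(scores.shape[1]):
        density *= float(gaussian_kde(scores[:, j])(q[j]))  # product of 1-D kernel densities
    return density
```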

Various approximations of the application domain may lead to different results:
(a) ranges, (b) distance based, (c) distribution based.
[Figure: panels (a), (b), (c) show the domain boundary produced by each approach.]

Interpolation regions and the applicability domain of a model
Is it correct to say:
- "the prediction result is always reliable for a point within the application region"?
- "the prediction is always unreliable if the point is outside the application region"?

Assessment of predictive error
- Assessing the predictive error amounts to quantifying model uncertainty given the uncertainty of the model parameters
  - Calculate the uncertainty of the model coefficients
  - Propagate this uncertainty through the model to assess the prediction uncertainty
- Analytical method of variances, if the model is linear in its parameters, e.g. y = a*x1 + b*x2
- Numerical Monte Carlo method (sketched below)
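A minimal sketch of Monte Carlo propagation of coefficient uncertainty for the linear model y = a*x1 + b*x2. The coefficient estimates and their covariance would normally come from the regression fit; the numbers below are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
coef_mean = np.array([0.8, -1.2])        # fitted coefficients a, b (illustrative)
coef_cov = np.array([[0.010, 0.002],
                     [0.002, 0.040]])    # covariance of the coefficient estimates (illustrative)
x_query = np.array([2.0, 0.5])           # descriptors of the query chemical

samples = rng.multivariate_normal(coef_mean, coef_cov, size=10_000)
y_pred = samples @ x_query               # one prediction per sampled coefficient vector
print(f"prediction {y_pred.mean():.3f} +/- {y_pred.std():.3f}")
```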

Methods to assess the predictive error of the model
- Training set error
- Test error
- Predictive error
  - External validation error
  - Cross-validation
  - Bootstrap
A cross-validation sketch follows below.
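A minimal sketch of a cross-validated error estimate, assuming X (descriptor matrix) and y (measured activities) are NumPy arrays and using a simple linear model as a stand-in; any other estimator could be substituted:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def cv_rmse(X, y, folds=5):
    """Cross-validated root-mean-squared error of a linear regression model."""
    scores = cross_val_score(LinearRegression(), X, y,
                             scoring="neg_root_mean_squared_error", cv=folds)
    return -scores.mean()
```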

Conclusions
- Applicability domain assessment is not a one-step evaluation. It requires:
  - estimation of the application domain, i.e. data set coverage
  - estimation of the predictive error of the model
- Various methods exist for estimating the interpolated space; boundaries defined by different methods can be very different. Be honest and do not apply "easy" methods if their assumptions will be violated.
- It is important to differentiate between dense and empty regions in descriptor space, because the relationship within empty regions can differ from the model and we cannot verify this without additional data.
- To avoid the complexity of finding the application domain after model development, use experimental design before model development.

Conclusions (2)
- Different methods of uncertainty quantification exist; the choice depends on the type of model (linear, nonlinear).

Practical use / software availability
- For uncertainty propagation, can we advertise Busy?

COVERAGE Application

Thank you! Acknowledgements to Tom Aldenberg (RIVM)

Interpolation regions and applicability domain of a model
Example: two data sets, represented by two different 1D descriptors X1 and X2 (green points and red points), with two models fitted over them: a linear model (green) and a nonlinear model (red). The magenta point is within the coverage of both data sets. Is the prediction reliable?
[Figure: experimental activity plotted against X1 and X2, showing both data sets, both fitted models, and the magenta query point.]
Coverage estimation should be used only as a warning, not as a final decision on "model applicability".

Possible reasons for the error (the gap between the true relationship and the models):
- The model is missing an important parameter
- Wrong type of model
- Non-unique nature of the descriptors
[Figure: experimental activity vs. descriptor, contrasting the true relationship with the fitted models.]

Correct predictions outside of the data set coverage
Example: the same two data sets, represented by two different 1D descriptors X1 and X2 (green points and red points), with two models: a linear model (green) and a nonlinear model (red). The magenta point is OUT of the coverage of both data sets. The prediction could still be correct, if the model is close to the TRUE RELATIONSHIP outside the training data set.