LMO & Jackknife

Presentation transcript:

1 LMO & Jackknife. If a QSPR/QSAR model has a high average q2 in LMO validation, it can reasonably be concluded that the model is robust. Leave-many-out (LMO) validation: an internal validation procedure, like LOO. LMO employs a smaller training set than LOO and can be repeated many more times, because of the large number of possible combinations when many compounds are left out of the training set. With n objects in the data set, form G cancellation groups of equal size (G = n/mj, typically 2 < G < 10): each round keeps n − mj objects in the training set and mj objects in the validation set, and q2 is computed from the mj estimates.
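The course code shown later is MATLAB; as a language-neutral illustration, the LMO loop can be sketched in Python/NumPy. The toy data, the lmo_q2 name and the plain least-squares model are illustrative assumptions, not part of the original material:

```python
import numpy as np

def lmo_q2(D, y, m, n_repeats=100, seed=0):
    """Leave-many-out CV: repeatedly leave m objects out, fit ordinary
    least squares on the rest, and pool the predictions into q2."""
    rng = np.random.default_rng(seed)
    n = len(y)
    press, ss_tot = 0.0, 0.0
    for _ in range(n_repeats):
        out = rng.choice(n, size=m, replace=False)   # validation objects
        keep = np.setdiff1d(np.arange(n), out)       # training objects
        b, *_ = np.linalg.lstsq(D[keep], y[keep], rcond=None)
        press += np.sum((y[out] - D[out] @ b) ** 2)
        ss_tot += np.sum((y[out] - y[keep].mean()) ** 2)
    return 1.0 - press / ss_tot

# toy data: y depends linearly on two descriptors plus a little noise
rng = np.random.default_rng(1)
D = rng.normal(size=(30, 2))
y = D @ np.array([2.0, -1.0]) + 0.05 * rng.normal(size=30)
q2 = lmo_q2(D, y, m=6)   # m = 6 of n = 30 left out, i.e. G = n/m = 5
print(round(q2, 3))
```

A high average q2 (close to 1) is what the slide's robustness criterion looks for.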

2 Jackknife: the training set is divided into a number of subsamples (SubSampNo > G); each subsample is split into SubTrain and SubValid sets, giving SubSampNo estimations of the parameters (instead of a time-consuming repetition of the experiment). It is used along with LMO cross-validation (internal validation).

3 [Diagram: LMO with n = 6, m = 2, G = 3 — subsamples SubTrain1/SubValid1 through SubTrain4/SubValid4, pooled into q2TOT; the number of subsamples is much larger than the number of molecules in the training set.]

4 [Diagram: Jackknife with n = 6, m = 2, G = 3 — subsamples SubTrain1 through SubTrain4, each giving a coefficient vector (b1, b2, b3, b4, …, bsn); the number of subsamples is much larger than the number of molecules in the training set.]

5 Generating random row permutations for the subsamples (MATLAB):

    for i = 1:subSampNo
        PERMUT(i,:) = randperm(Dr);
    end

For example, with 9 subsamples and 6 molecules in the training set:

    for i = 1:9                      % 9 subsamples
        PERMUT(i,:) = randperm(6);   % 6 molecules in train
    end

Each row of PERMUT is then split into a SubTrain part and a SubValid part.

6 For each of the 9 subsamples, the SubTrain set yields a coefficient vector bi (from Db = y, solved with the pseudoinverse as b = D+y) and the SubValid set yields q2i; b1 … b9 and q21 … q29 are pooled into q2TOT and into histograms of the coefficients. [Table of the 9 SubTrain/SubValid index sets lost in extraction.]
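The per-subsample estimation on this slide (b = D+y on each SubTrain set) can be sketched as follows; the n = 6, m = 2 split mirrors the toy example, and the data and names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, n_sub = 6, 2, 9            # molecules, left-out size, subsamples
D = rng.normal(size=(n, 2))      # toy descriptor matrix
y = D @ np.array([1.5, -0.5])    # exact linear response, for clarity

B = np.empty((n_sub, D.shape[1]))
for i in range(n_sub):
    perm = rng.permutation(n)                  # like MATLAB's randperm
    sub_train, sub_valid = perm[m:], perm[:m]
    # b = D+ y : pseudoinverse solution on the SubTrain set
    B[i] = np.linalg.pinv(D[sub_train]) @ y[sub_train]

# the rows of B give the distribution (histogram) of each coefficient
print(B.mean(axis=0).round(3))
```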

7 [Histogram: distribution of the b values for the 3rd descriptor, collected from b1 … b9; descriptors d1 … dn.]

8 Jackknife on all 31 molecules and all 53 descriptors, 200 subsamples (using MLR). [Histograms of the coefficient distributions for descriptor No. 25 and descriptor No. 15.]

9 Jackknife on all 31 samples and all 53 descriptors (using MLR). Histogram with a fitted normal for descriptor No. 15 (and similarly No. 25):

    >> histfit(bJACK(:,15), 20);

10 What is the probability that 0.0 differs from the population only by chance? To determine this probability, all the data in the population, and the value 0.0, should be standardized to z scores.

11 >> disttool — for z = −1.5, the tail area gives the probability that −1.5 differs from μ by chance.

12 >> disttool — the one-tailed area for z = −1.5 is 0.067; ×2 = 0.134 = p (two-tailed). The probability that the difference between −1.5 and μ is due to random error is 13.4%; since p > 0.05, −1.5 is not significantly different from the population (p < 0.05 would indicate a significant difference). >> cdf gives the area to the left of z.
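The disttool reading can be reproduced from the standard-normal CDF; a minimal sketch (the helper name is illustrative):

```python
import math

def p_two_tailed(z):
    """Two-tailed p for a standard-normal z, via the error function."""
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))   # area left of z
    return 2.0 * min(cdf, 1.0 - cdf)

print(round(p_two_tailed(-1.5), 3))   # 0.134, as on the slide
```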

13 All descriptors, MLR: q2TOT = …; number of descriptors with p < 0.05 = 0, i.e. no significant descriptors.

14 All descriptors, PLS, lv = 14: q2TOT = …; number of descriptors with p < 0.05 = 28, i.e. 28 significant descriptors.

15 All descriptors, PLS, lv = 14: q2TOT = …. [Table: descriptor No. vs p value; the numeric values were lost in extraction.] Significant descriptors with p < 0.05 can be sorted according to p value, for doing a forward selection.
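Sorting the significant descriptors by p value as a starting point for forward selection can be sketched as follows; the descriptor numbers and p values are invented for illustration:

```python
import numpy as np

desc_no = np.array([10, 25, 15, 42, 7])            # hypothetical descriptor IDs
p_vals  = np.array([0.30, 0.001, 0.04, 0.20, 0.01])

sig = p_vals < 0.05                  # keep only significant descriptors
order = np.argsort(p_vals[sig])      # smallest p first
ranked = desc_no[sig][order]         # candidate order for forward selection
print(ranked)                        # [25  7 15]
```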

16 q2TOT at different numbers of latent variables (lv) in PLS, applying all descriptors; the program is run 4 times. [Table: lv vs number of significant variables per run; values lost in extraction. Too many latent variables → overfitting; too few → loss of information.]

17 Scanning the number of latent variables and the number of descriptors:

    for lv = 6:13          % number of latent variables in PLS
        for i = lv:18      % number of descriptors
            [p, Z, q2TOTbox(lv,i), q2, bJACK] = ...
                jackknife(D(:,SRTDpDESC(1:i,1)), y, 150, 27, 2, lv);
        end
    end

[Surface of q2TOT vs lv and number of descriptors.] Maximum q2TOT at lv = 7 and 7 descriptors.

18 As an example, three significant descriptors with p < 0.05:

    D = Dini(:,[ ]);       % selected descriptor indices (lost in extraction)
    [q2, bJACK] = jackknife(D, y, 500, 27)

19 The jackknife function:

    [p, Z, q2TOTbox(lv,i), q2, bJACK] = ...
        jackknife(D(:,[ ]), y, 150, 27, 2, 7);

    [ ] : selected descriptors
    150 : number of subset samples in the jackknife
    27  : number of samples in the training set of each subset
    2   : calibration method (1, MLR; 2, PLS)
    7   : number of latent variables in PLS

Jackknife is a method for determining the significant descriptors, besides LMO CV, as internal validation, and it can be applied for descriptor selection.

20 Exercise: apply the jackknife to a selected set of descriptors using MLR, and determine the results and the significance of the descriptors.

21 Cross model validation (CMV). Anderssen, et al., Reducing over-optimism in variable selection by cross model validation, Chemom Intell Lab Syst (2006) 84. Validation during variable selection, not after it. Gidskehaug, et al., Cross model validation and optimization of bilinear regression models, Chemom Intell Lab Syst (2008) 93. CMV: the data set → a number of Train and Test sets; each Train → subsamples → SubTrain and SubValid.

22 [Diagram: CMV with n = 15, m = 3, G = 3 — for each Train/Test split, the jackknife selects the variables and the number of latent variables, a PLS model (b1) predicts the Test set, and q2CMV1, q2CMV2, … are collected. The Test set makes no contribution to the variable and lv selection process.]

23 [Diagram continued: … q2CMVm.] CMV is an effective external validation.

24 The crossmv function:

    [q2TOT, q2CMV] = crossmv(trainD, trainy, testD, testy, selVAR, 7)

    selVAR : set of selected descriptors (the applied calibration method is PLS)
    7      : number of latent variables in PLS

CMV is an effective external validation method.

25 Bootstrapping: bootstrap re-sampling, another approach to internal validation. Wehrens, et al., The bootstrap: a tutorial, Chemom Intell Lab Syst (2002) 54. There is only one data set, and it should be representative of the population from which it was drawn. Bootstrapping is a simulation of random selection: K groups of size n are generated by repeated random selection (with replacement) of n objects from the original data set.

26 Some of the objects can be included in the same random sample several times, while other objects will never be selected. The model obtained on the n randomly selected objects is used to predict the target properties for the excluded samples, plus q2 estimation, as in LMO.

27 Generating bootstrap samples (MATLAB):

    for i = 1:10        % number of subsamples in the bootstrap
        for j = 1:6     % Dr = 6, number of molecules in Train
            RND = randperm(6);
            bootSamp(i,j) = RND(1);   % one random index, with replacement
        end
    end

Each row of bootSamp is a SubTrain set with the same number of molecules as Train; the molecules not present in SubTrain form the SubValid set. Each subsample gives bi and q2i (b1 … b10, q21 … q210).
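In Python/NumPy the same bootstrap sampling (with replacement) can be generated in one call; the names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sub, n_train = 10, 6
# each row is a bootstrap SubTrain sample the same size as Train,
# drawn with replacement (indices 1..6, to mirror the MATLAB code)
boot_samp = rng.integers(1, n_train + 1, size=(n_sub, n_train))

# objects never drawn in a row form that row's SubValid set
sub_valid = [sorted(set(range(1, n_train + 1)) - set(row))
             for row in boot_samp]
print(boot_samp[0], sub_valid[0])
```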

28 Bootstrapping

29 The distributions of the b values are not normal → nonparametric estimation of the confidence limits. Sort the 200 subsample estimates; 200 × 0.025 = 5, so the 5th value from the left and the 5th from the right are the 95% confidence limits. [Histograms: a descriptor is significant when the interval excludes zero, not significant when it includes zero.]
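The sorted "5th from each end" rule is the percentile method; a sketch with simulated, deliberately non-normal coefficient estimates (names and data illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b = rng.gamma(shape=2.0, scale=1.0, size=200)  # 200 skewed (non-normal) draws

b_sorted = np.sort(b)
k = int(round(200 * 0.025))                # 200 x 0.025 = 5
lo, hi = b_sorted[k - 1], b_sorted[-k]     # 5th from left, 5th from right
significant = not (lo <= 0.0 <= hi)        # interval excluding 0 => significant
print(round(lo, 3), round(hi, 3), significant)
```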

30 [Histograms: one descriptor with a small effect that is nevertheless significant (confidence interval excludes zero); another that is not significant.]

31 The bootstrp function:

    [bBOOT] = bootstrp(trainD, trainy, 1000, 2, 7)

    1000 : number of subset samples in bootstrapping
           (number of molecules in each SubTrain set = number of molecules in Train)
    2    : calibration method (1, MLR; 2, PLS)
    7    : number of latent variables in PLS

Bootstrap is a method for determining the confidence interval for the descriptors.

32 Model validation. Y-randomization: random shuffling of the dependent-variable vector, and development of a new QSAR model using the original independent-variable matrix. The process is repeated a number of times. Sometimes high q2 values are obtained, indicating chance correlation or structural redundancy of the training set. Expected: QSAR models with low R2 and LOO q2 values, i.e. no acceptable model should be obtainable by this method.
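Y-randomization can be sketched as follows: fit on the real y, then refit on shuffled copies and check that the fit collapses. The toy data and the plain least-squares fit are illustrative stand-ins for the course's PLS models:

```python
import numpy as np

def r2(D, y):
    """Coefficient of determination for a least-squares fit of y on D."""
    b, *_ = np.linalg.lstsq(D, y, rcond=None)
    res = y - D @ b
    return 1.0 - np.sum(res**2) / np.sum((y - y.mean())**2)

rng = np.random.default_rng(0)
D = rng.normal(size=(40, 3))
y = D @ np.array([1.0, 2.0, -1.0]) + 0.1 * rng.normal(size=40)

r2_real = r2(D, y)
# refit with randomly shuffled y: R2 should collapse toward zero
r2_shuffled = [r2(D, rng.permutation(y)) for _ in range(20)]
print(round(r2_real, 3), round(max(r2_shuffled), 3))
```

If a shuffled fit scored as well as the real one, that would signal chance correlation.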

33 Training and test. External validation: selecting training and test sets. (a) Finding a new experimentally tested set: not a simple task. (b) Splitting the data set into a training set (for establishing the QSAR model) and a test set (for external validation). Both training and test sets should separately span the whole descriptor space occupied by the entire data set; ideally, each member of the test set should be close to one point in the training set.

34 Approaches for creating training and test sets: 1. Straightforward random selection. Yasri, et al., Toward an optimal procedure for variable selection and QSAR model building, J Chem Inf Comput Sci (2001) 41. 2. Activity sampling. Kauffman, et al., QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically based numerical descriptors, J Chem Inf Comput Sci (2001) 41. Mattioni, et al., Development of QSAR and classification models for a set of carbonic anhydrase inhibitors, J Chem Inf Comput Sci (2002) 42.

35 3. Systematic clustering techniques Burden, et al Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J Chem Inf Comput Sci (2000) 40, Snarey, et al Comparison of algorithms for dissimilarity-based compound selection, J Mol Graph Model (1997) 15, Self organizing maps (SOMs) Gramatica, et al QSAR study on the tropospheric degradation of organic compounds, Chemosphere (1999) 38, Better than random selection Training and test

36 [Kohonen map, 53 × 31 Selwood data matrix, with the columns (molecules) as input: sampling from all regions of the molecule space. Example splits: arrangement 1 — test molecules 19, 18, 3, 20, 4, 23, 14, 15, 16; arrangement 2 — test molecules 27, 12, 3, 7, 30, 23, 11, 16; the other molecules form the training set.]

37 [Table: RMSEP and RMSECV for sample selection by Kohonen map combined with descriptor selection by the Kohonen correlation map (descriptors 35, 36, 37, 40, 44, 43, 51) vs by p value (descriptors 51, 37, 35, 38, 39, 36, 15); the numeric values were lost in extraction. Descriptor 15: correlation with activity!]

38 5. Kennard–Stone. Kennard, et al., Computer aided design of experiments, Technometrics (1969) 11. Bourguignon, et al., Optimization in irregularly shaped regions: pH and solvent strength in reverse-phase HPLC separation, Anal Chem (1994) 66. 6. Factorial and D-optimal design. Eriksson, et al., Multivariate design and modeling in QSAR. Tutorial, Chemometr Intell Lab Syst (1996) 34. Mitchell, et al., Algorithm for the construction of "D-optimal" experimental designs, Technometrics (2000) 42.
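The Kennard–Stone rule (approach 5) successively picks the sample farthest from those already chosen, so the training set spans the descriptor space; a minimal sketch with illustrative data:

```python
import numpy as np

def kennard_stone(X, k):
    """Indices of k samples chosen by the Kennard-Stone max-min rule."""
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # start from the two most distant samples
    chosen = list(np.unravel_index(np.argmax(dist), dist.shape))
    while len(chosen) < k:
        remaining = [i for i in range(len(X)) if i not in chosen]
        # next: the sample farthest from its nearest already-chosen sample
        chosen.append(max(remaining, key=lambda i: dist[i, chosen].min()))
    return chosen

rng = np.random.default_rng(0)
X = rng.normal(size=(12, 2))     # 12 compounds, 2 descriptors
train_idx = kennard_stone(X, 6)  # training set; the rest become the test set
print(sorted(int(i) for i in train_idx))
```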

39 Gramatica, et al., QSAR modeling of bioconcentration factors by theoretical molecular descriptors, Quant Struct-Act Relat (2003) 22. D-optimal: selection of the samples that maximize the determinant |X'X|, where X'X is the variance–covariance (information) matrix of the independent variables (descriptors), or of the independent plus dependent variables. These samples are spread across the whole area occupied by the representative points and constitute the training set; the points not selected are used as the test set. => well-balanced structural diversity and representativity of the entire data space (descriptors and responses).

40 Candidate training sets, e.g.:

    trainD1 = [D(1:3:end,:); D(2:3:end,:)];
    trainD2 = D([1:2 5: :end],:);

[Table: detCovDy for each training set and selected-descriptor set — with all descriptors (D = Dini) the determinant is vanishingly small (≈ e−236 for trainD1, 2.13e−243 for trainD2, !!); with reduced descriptor sets D = Dini(:,[ ]) it rises to 2.18e53 and 5.90e08 (trainD1), 2.66e53 and 4.45e08 (trainD2); the descriptor index lists were lost in extraction.] Optimum selection of descriptors and molecules in the training set can be performed using detCovDy (D-optimal).
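The detCovDy comparison above can be sketched as follows: compute the information-matrix determinant |X'X| for candidate training sets and prefer the larger one. Data and names are illustrative; the course's detCovDy may also include the response, as slide 39 notes:

```python
import numpy as np

def d_criterion(X):
    """|X'X|: determinant of the information matrix of a candidate set."""
    return np.linalg.det(X.T @ X)

rng = np.random.default_rng(0)
D = rng.normal(size=(30, 3))     # toy descriptor matrix

# two candidate training sets, in the spirit of trainD1/trainD2 above
cand1 = D[0::3]                  # every 3rd molecule starting at the 1st
cand2 = D[1::3]                  # every 3rd molecule starting at the 2nd
best = max((cand1, cand2), key=d_criterion)   # D-optimal choice
print(d_criterion(cand1) > d_criterion(cand2))
```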

41 Model applicability domain. No matter how robust, significant and validated a QSAR model may be, it cannot be expected to reliably predict the modeled property for the entire universe of chemicals! Leverage is a criterion for determining the applicability domain of the model for a query compound:

    h = x' (X'X)^-1 x

    x : descriptor vector of the query compound
    X : matrix of training-set independent variables
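The leverage h = x'(X'X)^-1 x can be computed directly; a sketch with illustrative data, where the two query compounds simply demonstrate low vs high leverage:

```python
import numpy as np

def leverage(x, X):
    """h = x' (X'X)^-1 x for a query x against training matrix X."""
    return float(x @ np.linalg.inv(X.T @ X) @ x)

rng = np.random.default_rng(0)
X = rng.normal(size=(25, 3))     # training-set descriptor matrix

inside = X.mean(axis=0)                           # query near the data centre
outside = X.mean(axis=0) + 10.0 * X.std(axis=0)   # query far outside

print(round(leverage(inside, X), 4), round(leverage(outside, X), 4))
```

A high leverage for a test compound means it lies outside the training-set space.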

42 Using all descriptors, the leverages for all the test samples are very high. This means that the test samples are not in the space of the training samples and cannot be predicted.

43 Using a reduced number of descriptors ( ), the leverages for the test samples are similar to those of the training samples. This means that the test samples are in the space of the training samples and can be predicted.