1. LMO & Jackknife

Leave-many-out (LMO) validation is an internal validation procedure, like LOO. LMO employs a smaller training set than LOO and can be repeated many more times, because of the large number of possible combinations when many compounds are left out of the training set.

With n objects in the data set divided into G cancellation groups of equal size m_j (G = n/m_j, typically 2 < G < 10), each round uses n - m_j objects as the training set and m_j objects as the validation set, giving a q2 from the predictions of the m_j left-out objects. If a QSPR/QSAR model has a high average q2 in LMO validation, it can reasonably be concluded that the model is robust.
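A minimal MATLAB sketch of one LMO round over G cancellation groups (the variable names D and y, the MLR fit, and the q2 pooling are illustrative assumptions, not the course code):

n = size(D,1);                 % D: n x p descriptor matrix, y: n x 1 activity
G = 3;  m = n/G;               % G cancellation groups of m objects each
idx = randperm(n);             % random assignment of objects to groups
PRESS = 0;  TSS = sum((y - mean(y)).^2);
for g = 1:G
    valid = idx((g-1)*m+1 : g*m);      % the m left-out objects
    train = setdiff(idx, valid);       % the remaining n-m objects
    b = pinv(D(train,:)) * y(train);   % MLR fit on the training part
    yhat = D(valid,:) * b;             % predict the left-out objects
    PRESS = PRESS + sum((y(valid) - yhat).^2);
end
q2 = 1 - PRESS/TSS             % q2 for this group assignment; repeat and average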
2. LMO & Jackknife

Jackknife: the training set is divided into a number of subsamples (SubSampNo > G), and each subsample is split into SubTrain and SubValid parts. This yields SubSampNo estimations of the parameters (instead of a time-consuming repetition of the experiment), along with LMO cross-validation as internal validation.
3. LMO & Jackknife

[Diagram: LMO with n = 6, m = 2, G = 3. Each subsample is split into a SubTrain set (SubTrain1 ... SubTrain4) and a SubValid set (SubValid1 ... SubValid4); the number of subsamples greatly exceeds the number of molecules in the training set, and the individual q2 values are pooled into q2TOT.]
4. LMO & Jackknife

[Diagram: jackknife with n = 6, m = 2, G = 3. Each SubTrain set (SubTrain1 ... SubTrain4) yields its own coefficient vector (b1, b2, b3, b4, ..., b_sn); again the number of subsamples greatly exceeds the number of molecules in the training set.]
5. LMO & Jackknife

>> for i=1:subSampNo
       PERMUT(i,:) = randperm(Dr);
   end

For example, with 9 subsamples and 6 molecules in the training set:

for i = 1:9                      % 9 subsamples
    PERMUT(i,:) = randperm(6);   % 6 molecules in train
end

PERMUT =
    (SubTrain)    (SubValid)
    6  5  2  4  |  3  1
    1  6  3  5  |  2  4
    5  2  6  4  |  3  1
    5  4  2  1  |  6  3
    5  4  1  6  |  2  3
    2  6  5  1  |  3  4
    1  2  6  5  |  3  4
    6  2  1  5  |  4  3
    4  5  1  6  |  3  2
6. LMO & Jackknife

For each subsample, the model is fit on the SubTrain set (D b = y, solved as b = D+ y, where D+ is the pseudoinverse of D) and evaluated on the SubValid set:

SubTrain sets:        SubValid sets:
6 5 2 4 → b1          3 1 → q2_1
1 6 3 5 → b2          2 4 → q2_2
5 2 6 4 → b3          3 1 → q2_3
5 4 2 1 → b4          6 3 → q2_4
5 4 1 6 → b5          2 3 → q2_5
2 6 5 1 → b6          3 4 → q2_6
1 2 6 5 → b7          3 4 → q2_7
6 2 1 5 → b8          4 3 → q2_8
4 5 1 6 → b9          3 2 → q2_9

The individual q2 values are pooled into q2TOT, and the collected b vectors can be examined as histograms.
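A sketch of this per-subsample loop, continuing from the PERMUT matrix above (D, y, the reference mean inside q2, and the pooling of q2TOT are assumptions; conventions vary):

nTrain = 4;  subSampNo = 9;
bJACK = zeros(subSampNo, size(D,2));   % one coefficient vector per subsample
q2 = zeros(subSampNo, 1);
for i = 1:subSampNo
    tr = PERMUT(i, 1:nTrain);          % SubTrain molecules
    va = PERMUT(i, nTrain+1:end);      % SubValid molecules
    b = pinv(D(tr,:)) * y(tr);         % b = D+ y on the SubTrain set
    bJACK(i,:) = b';
    yhat = D(va,:) * b;                % predict the SubValid molecules
    q2(i) = 1 - sum((y(va)-yhat).^2) / sum((y(va)-mean(y(tr))).^2);
end
q2TOT = mean(q2);                      % pooled q2 (one common convention)
hist(bJACK(:,3))                       % distribution of b for the 3rd descriptor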
7. LMO & Jackknife

[Figure: histogram of the b values of the 3rd descriptor (descriptors d1, d2, ..., dn), collected from the nine SubTrain fits b1 ... b9 listed above.]
8. LMO & Jackknife

Jackknife on all 31 molecules and all 53 descriptors, 200 subsamples (using MLR).

[Figure: distributions of b for descriptor No. 25 and descriptor No. 15.]
9. LMO & Jackknife

Jackknife on all 31 samples and all 53 descriptors (using MLR):

>> histfit(bJACK(:,15),20);

[Figure: histograms with fitted normal curves for descriptor No. 25 and descriptor No. 15.]
10. What is the probability that 0.0 differs from the population of b values merely by chance? To determine this probability, the population data, and the value 0.0, must be standardized to z scores.
11. >> disttool

[Figure: normal distribution with z = -1.5 marked; the tail area is the probability that a value at |z| = 1.5 differs from μ by chance.]
12. >> disttool

The cdf gives the area to the left of z: for z = -1.5 it is 0.0668, so the two-tailed p = 2 × 0.0668 = 0.134. In other words, the probability that the difference between -1.5 and μ is due to random error is 13.4%. Since p > 0.05, -1.5 is not significantly different from the population; p < 0.05 would indicate a significant difference.
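The same numbers can be reproduced without disttool; a short sketch using normcdf from the Statistics Toolbox (the application to a jackknife descriptor column is an assumption):

z = -1.5;
pLeft = normcdf(z)            % area to the left of z: 0.0668
p2 = 2*normcdf(-abs(z))       % two-tailed p: 0.1336
% for descriptor k of the jackknife: z = (0 - mean(bJACK(:,k))) / std(bJACK(:,k))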
13. LMO & Jackknife

All descriptors, MLR: q2TOT = -407.46; no descriptor has p < 0.05, i.e., the number of significant descriptors is 0.
14. LMO & Jackknife

All descriptors, PLS with lv = 14: q2TOT = -0.0988; 28 descriptors have p < 0.05, i.e., 28 significant descriptors.
15. LMO & Jackknife

All descriptors, PLS, lv = 14: q2TOT = -0.0988; 28 descriptors with p < 0.05. The significant descriptors can be sorted according to p value, e.g. for doing a forward selection:

Desc No    p
-------    -----------
51         1.4002e-022
37         1.383e-010
35         8.605e-009
38         9.1021e-009
39         1.8559e-008
36         8.7005e-008
15         0.00027689
1          0.00038808
2          0.00040547
45         0.00059674
32         0.00063731
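A sketch of this sorting step; the two-column layout of SRTDpDESC (descriptor index, then p) is an assumption consistent with its use as SRTDpDESC(1:i,1) on slide 17:

[pSorted, order] = sort(p);     % ascending p values, best descriptor first
SRTDpDESC = [order pSorted];    % column 1: descriptor index, column 2: p
% forward selection then tries the top i descriptors, SRTDpDESC(1:i,1)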
16. LMO & Jackknife

q2TOT at different numbers of latent variables in PLS (applying all descriptors), from 4 runs of the program:

lv   run 1    run 2    run 3    run 4
 8   -.0411   .0776    -.0431   .0270
 9   .2200    .2340    .3641    .2576
10   .1721    .1147    .2391    .1434    (37 signif. var.)
11   .2855    .1948    .0667    .2372
12   .1847    .1275    .2390    .2184
13   -.0343   -.1439   .0120    .0049
14   -.2578   -.2460   -.3010   -.0989   (28 signif. var.)

Beyond the optimum number of latent variables, overfitting sets in and the information content decreases.
17. LMO & Jackknife

for lv = 6:13                  % number of latent variables in PLS
    for i = lv:18              % number of descriptors tried
        [p, Z, q2TOTbox(lv,i), q2, bJACK] = ...
            jackknife(D(:,SRTDpDESC(1:i,1)), y, 150, 27, 2, lv);
    end
end

[Figure: q2TOT as a function of lv and the number of descriptors; maximum q2TOT at lv = 7 with 7 descriptors.]
18. LMO & Jackknife

As an example, three significant descriptors (p < 0.05):

D = Dini(:,[38 50 3]);
[q2, bJACK] = jackknife(D, y, 500, 27)
19. LMO & Jackknife: the jackknife function

[p, Z, q2TOTbox(lv,i), q2, bJACK] = ...
    jackknife(D(:,[34 38 45 51]), y, 150, 27, 2, 7);

[34 38 45 51]: selected descriptors
150: number of subset samples in the jackknife
27: number of samples in the training set of each subset
2: calibration method (1, MLR; 2, PLS)
7: number of latent variables in PLS

The jackknife is a method for determining the significant descriptors, besides serving as LMO CV (internal validation), and it can be applied for descriptor selection.
20. Exercise: apply the jackknife to a selected set of descriptors, using MLR, and examine the results and the significance of the descriptors.
21. Cross model validation (CMV)

Anderssen et al., Reducing over-optimism in variable selection by cross model validation, Chemom Intell Lab Syst (2006) 84, 69-74.
Gidskehaug et al., Cross model validation and optimization of bilinear regression models, Chemom Intell Lab Syst (2008) 93, 1-10.

In CMV, validation takes place during variable selection, not after it. The data set is split into a number of train and test sets; each train set is further split into subsamples, and each subsample into SubTrain and SubValid parts.
22. CMV

[Diagram: n = 15, m = 3, G = 3, split into Train and Test. Within each train set, the jackknife selects the variables and the number of latent variables; the resulting PLS model (b1) then predicts the test set, giving q2CMV1, q2CMV2, ... The test set makes no contribution to the variable- and lv-selection process.]
23. CMV

[Diagram continued: repeating over all splits yields q2CMV1 ... q2CMVm.] CMV is an effective external validation.
24. CMV: the crossmv function

[q2TOT, q2CMV] = crossmv(trainD, trainy, testD, testy, selVAR, 7)

selVAR: set of selected descriptors (the applied calibration method is PLS)
7: number of latent variables in PLS

CMV is an effective external validation method.
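A sketch of a possible outer loop over the G train/test splits around this function (the splitting, the index bookkeeping, and how selVAR is passed are assumptions; only the crossmv call itself is from the slide):

n = size(D,1);  G = 3;  m = n/G;
idx = randperm(n);                    % random assignment to G splits
q2CMV = zeros(G,1);
for g = 1:G
    test = idx((g-1)*m+1 : g*m);      % external test objects for this split
    train = setdiff(idx, test);       % test plays no role in var/lv selection
    [q2TOT, q2CMV(g)] = crossmv(D(train,:), y(train), ...
                                D(test,:), y(test), selVAR, 7);
end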
25. Bootstrapping

Wehrens et al., The bootstrap: a tutorial, Chemom Intell Lab Syst (2002) 54, 35-52.

Bootstrap re-sampling is another approach to internal validation. There is only one data set, and it should be representative of the population from which it was drawn. Bootstrapping simulates random selection from that population: K groups of size n are generated by repeated random selection, with replacement, of n objects from the original data set.
26. Bootstrapping

Some objects can be included in the same random sample several times, while other objects will never be selected. The model obtained on the n randomly selected objects is used to predict the target properties for the excluded samples, and q2 is estimated as in LMO.
27. Bootstrapping

for i = 1:10                      % number of subsamples in the bootstrap
    for j = 1:6                   % Dr = 6, number of molecules in Train
        RND = randperm(6);
        bootSamp(i,j) = RND(1);   % one random draw with replacement
    end
end

bootSamp = (SubTrain)           SubValid (not present in SubTrain):
  5 5 6 3 6 1  → b1             2 4    → q2_1
  4 2 6 3 2 6  → b2             1 5    → q2_2
  2 5 3 1 2 4  → b3             6      → q2_3
  2 3 1 4 4 1  → b4             5 6    → q2_4
  3 3 2 6 3 3  → b5             1 4 5  → q2_5
  5 5 6 4 4 3  → b6             1 2    → q2_6
  4 3 6 1 1 2  → b7             5      → q2_7
  2 2 5 4 5 1  → b8             3 6    → q2_8
  3 3 2 3 3 5  → b9             1 4 6  → q2_9
  2 3 1 6 4 6  → b10            5      → q2_10

Each SubTrain set has the same number of molecules as Train; the SubValid set contains the molecules not present in SubTrain.
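Continuing from bootSamp above, a sketch of the fit-and-predict step (D, y, and the q2 convention are assumptions, as before):

q2 = zeros(10,1);
for i = 1:10
    tr = bootSamp(i,:);              % molecules drawn with replacement
    va = setdiff(1:6, tr);           % molecules never drawn: SubValid
    b = pinv(D(tr,:)) * y(tr);       % duplicated rows enter the fit twice
    yhat = D(va,:) * b;
    q2(i) = 1 - sum((y(va)-yhat).^2) / sum((y(va)-mean(y(tr))).^2);
end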
28. Bootstrapping

[Figure: bootstrap distributions of b for descriptors 38, 50, and 15.]
29. Bootstrapping

The distributions of the b values are not normal, so the confidence limits are estimated nonparametrically. The 200 subsample values are sorted; since 200 × 0.025 = 5, the 5th value from the left and the 5th from the right are the 95% confidence limits. If this interval excludes zero, the descriptor is significant; if it includes zero, it is not.
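A minimal sketch of these percentile limits, assuming bBOOT holds the 200 bootstrap coefficient vectors (one row per subsample):

k = 15;                          % descriptor of interest
bs = sort(bBOOT(:,k));           % 200 sorted bootstrap values
lo = bs(5);  hi = bs(end-4);     % 5th from the left and 5th from the right
signif = (lo > 0) || (hi < 0)    % 95% interval excluding zero => significant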
30. Bootstrapping

[Figure: 95% bootstrap confidence limits for three descriptors.]

Desc 38: [-12e-5, -1.5e-5]   small effect, but significant
Desc 50: [0.1113, 0.5131]    significant
Desc 15: [-0.0181, 0.0250]   not significant
31. Bootstrapping: the bootstrp function

[bBOOT] = bootstrp(trainD, trainy, 1000, 2, 7)

1000: number of subset samples in the bootstrapping (the number of molecules in each SubTrain set equals the number of molecules in Train)
2: calibration method (1, MLR; 2, PLS)
7: number of latent variables in PLS

The bootstrap is a method for determining the confidence interval of each descriptor's coefficient.
32. Model validation

Y-randomization: the dependent-variable vector is randomly shuffled and a new QSAR model is developed using the original independent-variable matrix, repeating the process a number of times. QSAR models with low R2 and LOO q2 values are expected; sometimes high q2 values are obtained instead, indicating chance correlation or structural redundancy in the training set. No acceptable model should be obtainable by this method.
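A sketch of one Y-randomization repeat (D, y, the MLR fit, and the use of R2 on the fitted values are assumptions for illustration):

nRep = 100;
R2rand = zeros(nRep,1);
for r = 1:nRep
    yShuf = y(randperm(length(y)));      % shuffle the dependent variable
    b = pinv(D) * yShuf;                 % refit on the original descriptors
    yhat = D * b;
    R2rand(r) = 1 - sum((yShuf-yhat).^2) / sum((yShuf-mean(yShuf)).^2);
end
% R2rand values approaching the real model's R2 signal chance correlation.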
33. Training and test

External validation requires a training set and a test set, obtained either by:
a. finding a newly tested experimental set, which is not a simple task; or
b. splitting the data set into a training set (for establishing the QSAR model) and a test set (for external validation).

Both the training and test sets should separately span the whole descriptor space occupied by the entire data set. Ideally, each member of the test set should be close to at least one point in the training set.
34. Training and test

Approaches for creating training and test sets:

1. Straightforward random selection
Yasri et al., Toward an optimal procedure for variable selection and QSAR model building, J Chem Inf Comput Sci (2001) 41, 1218-1227.

2. Activity sampling
Kauffman et al., QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically based numerical descriptors, J Chem Inf Comput Sci (2001) 41, 1553-1560.
Mattioni et al., Development of quantitative structure-activity relationship and classification models for a set of carbonic anhydrase inhibitors, J Chem Inf Comput Sci (2002) 42, 94-102.
35. Training and test

3. Systematic clustering techniques
Burden et al., Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J Chem Inf Comput Sci (2000) 40, 1423-1430.
Snarey et al., Comparison of algorithms for dissimilarity-based compound selection, J Mol Graph Model (1997) 15, 372-385.

4. Self-organizing maps (SOMs): better than random selection
Gramatica et al., QSAR study on the tropospheric degradation of organic compounds, Chemosphere (1999) 38, 1371-1378.
36. Training and test

Kohonen map of the 53 × 31 Selwood data matrix, with the columns (molecules) as input: sampling from all regions of the molecule space.

[Figure: Kohonen map whose cells group molecules, e.g., (19, 18), (4, 23, 14), (3, 20), (15, 16).]

Test/train arrangements obtained from the map:
arrangement 1: test = 19, 18, 3, 20, 4, 23, 14, 15, 16; train = the others
arrangement 2: test = 27, 12, 3, 7, 30, 23, 11, 16
37. Training and test

Sample selection    Descriptor selection    RMSEP    RMSECV
Kohonen             Kohonen                 0.6251   0.4384
Kohonen             p-value                 0.6432   0.4205

Descriptor selection using the Kohonen correlation map: 35, 36, 37, 40, 44, 43, 51
Descriptor selection using p-value: 51, 37, 35, 38, 39, 36, 15
(descriptor 15: correlation with activity!)
38. Training and test

5. Kennard-Stone
Kennard et al., Computer aided design of experiments, Technometrics (1969) 11, 137-148.
Bourguignon et al., Optimization in irregularly shaped regions: pH and solvent strength in reverse-phase HPLC separation, Anal Chem (1994) 66, 893-904.

6. Factorial and D-optimal design
Eriksson et al., Multivariate design and modeling in QSAR. Tutorial, Chemometr Intell Lab Syst (1996) 34, 1-19.
Mitchell et al., Algorithm for the construction of "D-optimal" experimental designs, Technometrics (2000) 42, 48-54.
39. Training and test

Gramatica et al., QSAR modeling of bioconcentration factors by theoretical molecular descriptors, Quant Struct-Act Relat (2003) 22, 374-385.

D-optimal design: selection of the samples that maximize the determinant |X'X|, where X'X is the variance-covariance (information) matrix of the independent variables (descriptors), or of the independent plus dependent variables. The selected samples are spread across the whole area occupied by the representative points and constitute the training set; the points not selected are used as the test set. The result is a well-balanced structural diversity and representativity of the entire data space (descriptors and responses).
40. Training and test

trianD1 = [D(1:3:end,:); D(2:3:end,:)];
trianD2 = D([1:2 5:13 17 21 22 25:end],:);

Selected descriptors                    detCovDy, trianD1   detCovDy, trianD2
D = Dini;  % all                        -3.48e-236 !!       2.13e-243 !!
D = Dini(:,[51 37 35 38 39 36 15]);     2.18e53             2.66e53
D = Dini(:,[38 50 3]);                  5.90e08             4.45e08

Optimum selection of descriptors and molecules for the training set can be performed using detCovDy (D-optimal design).
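A sketch of how such a detCovDy value might be computed; the exact definition used in the course is not shown, so this assumes the determinant of the covariance of descriptors plus response:

trainD = D(1:2:end,:);  trainy = y(1:2:end);   % an example candidate split
detCovDy = det(cov([trainD trainy]))
% a larger detCovDy means the candidate training set spans the data space
% (descriptors and response) more evenly; compare candidates and keep the best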
41. Model applicability domain: leverage

No matter how robust, significant, and validated a QSAR model may be, it cannot be expected to reliably predict the modeled property for the entire universe of chemicals. Leverage is a criterion for determining whether a query compound lies within the applicability domain of the model:

h = x' (X'X)^(-1) x

x: descriptor vector of the query compound
X: matrix of the training-set independent variables
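A minimal sketch of this criterion (Xtest and the 3p/n warning threshold, common in the QSAR literature, are assumptions; only h itself is defined on the slide):

hstar = 3 * size(X,2) / size(X,1);   % assumed warning leverage, 3p/n
XtXinv = inv(X' * X);
nTest = size(Xtest,1);               % Xtest: query compounds as rows
h = zeros(nTest,1);
for i = 1:nTest
    x = Xtest(i,:)';
    h(i) = x' * XtXinv * x;          % leverage of query compound i
end
outside = h > hstar                  % compounds outside the applicability domain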
42. Leverage

Using all descriptors, the leverages of all test samples are very high. This means that the test samples are not in the space of the training samples and cannot be predicted.
43. Leverage

Using a subset of descriptors (38, 50, 3, 13, 24), the leverages of the test samples are similar to those of the training samples. This means that the test samples are in the space of the training samples and can be predicted.