
1 LMO & Jackknife
If a QSPR/QSAR model has a high average q2 in LMO validation, it can reasonably be concluded that the model is robust.
Leave-many-out (LMO) validation: an internal validation procedure, like LOO. LMO uses a smaller training set than LOO and can be repeated many more times, because there are many possible combinations of compounds to leave out of the training set.
With n objects in the data set, form G cancellation groups of equal size m_j (G = n/m_j, typically 2 < G < 10). For each of a large number of groups, n - m_j objects form the training set and m_j objects form the validation set, so each group yields its own q2 estimate.
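A minimal MATLAB sketch of one LMO round (illustrative only: the descriptor matrix D, the response y and the group size m are assumed names, and MLR via the pseudoinverse stands in for whatever calibration method is actually used):

n = size(D,1);                         % number of objects in the data set
m = 2;                                 % objects left out in each group
idx = randperm(n);                     % random ordering of the objects
val = idx(1:m);                        % m objects -> validation group
trn = idx(m+1:end);                    % n - m objects -> training set
b = pinv(D(trn,:)) * y(trn);           % calibrate on the training objects (MLR stand-in)
press = sum((y(val) - D(val,:)*b).^2); % squared prediction error of the left-out objects
% repeating this for many groups and pooling PRESS against the total sum of
% squares of y gives the average q2 of the LMO validation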

2 LMO & Jackknife
Jackknife: the training set is divided into a number of subsamples (SubSampNo > G), and each subsample is split into SubTrain and SubValid. This gives SubSampNo estimates of the model parameters (instead of a time-consuming repetition of the experiment), and is used along with LMO cross-validation as internal validation.

3 LMO & Jackknife
[Scheme: LMO with n = 6, m = 2, G = 3; the number of subsamples is much larger than the number of molecules in the training set. Each subsample i gives SubTrain_i and SubValid_i (SubTrain1/SubValid1, ..., SubTrain4/SubValid4, ...), and the pooled predictions give q2TOT.]

4 LMO & Jackknife
[Scheme: Jackknife with n = 6, m = 2, G = 3; the number of subsamples is much larger than the number of molecules in the training set. Each SubTrain_i (SubTrain1, ..., SubTrain4, ...) yields its own coefficient vector b1, b2, b3, b4, ..., b_sn.]

5 LMO & Jackknife
>> for i=1:subSampNo
       PERMUT(i,:)=randperm(Dr);
   end

For example, with 9 subsamples and 6 molecules in the training set:

for i=1:9                        % 9 subsamples
    PERMUT(i,:)=randperm(6);     % 6 molecules in Train
end

PERMUT =
     6     5     2     4     3     1
     1     6     3     5     2     4
     5     2     6     4     3     1
     5     4     2     1     6     3
     5     4     1     6     2     3
     2     6     5     1     3     4
     1     2     6     5     3     4
     6     2     1     5     4     3
     4     5     1     6     3     2

In each row, the first four columns form SubTrain and the last two columns form SubValid.

6 LMO & Jackknife
SubTrain sets (D b = y, so b = D+ y):    SubValid sets:
6 5 2 4 → b1                             3 1 → q2_1
1 6 3 5 → b2                             2 4 → q2_2
5 2 6 4 → b3                             3 1 → q2_3
5 4 2 1 → b4                             6 3 → q2_4
5 4 1 6 → b5                             2 3 → q2_5
2 6 5 1 → b6                             3 4 → q2_6
1 2 6 5 → b7                             3 4 → q2_7
6 2 1 5 → b8                             4 3 → q2_8
4 5 1 6 → b9                             3 2 → q2_9
The b vectors collected over all subsamples give a histogram for each coefficient, and the pooled SubValid predictions give q2TOT.
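As a sketch, the calculation for the i-th subsample could look as follows (MLR via the pseudoinverse; the q2 formula shown is one common choice and is an assumption):

trn = PERMUT(i,1:4);              % SubTrain_i: four molecules
vld = PERMUT(i,5:6);              % SubValid_i: two molecules
b = pinv(D(trn,:)) * y(trn);      % solve D*b = y, i.e. b = D^+ * y
yp = D(vld,:) * b;                % predict the SubValid molecules
q2(i) = 1 - sum((y(vld)-yp).^2) / sum((y(vld)-mean(y(trn))).^2);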

7 LMO & Jackknife
[Figure: the nine SubTrain sets (6 5 2 4 → b1, 1 6 3 5 → b2, ..., 4 5 1 6 → b9) each give a coefficient vector b = (d1, d2, d3, ..., dn); the histogram shows the distribution of b for the 3rd descriptor.]

8 LMO & Jackknife
Jackknife on all 31 molecules and all 53 descriptors, with 200 subsamples (using MLR).
[Histograms of the jackknife coefficients for descriptor No 25 and descriptor No 15.]

9 LMO & Jackknife
Jackknife on all 31 samples and all 53 descriptors (using MLR).
[Fitted histograms of the jackknife coefficients for descriptor No 25 and descriptor No 15, e.g.:]
>> histfit(bJACK(:,15),20);

10
What is the probability that 0.0 differs from the population only by chance? To determine this probability, all data in the population, and the value 0.0, should be standardized to z.

11
>> disttool
z = -1.5: the probability that -1.5 differs from μ only by chance.

12
>> disttool
x2 = 0.134 = p (two-tailed): the probability that the difference between -1.5 and μ is due to random error is 0.134. Since p > 0.05, -1.5 is not significantly different from the population; p < 0.05 would indicate a significant difference.
>> cdf gives the area to the left of z: 0.0668 for z = -1.5 (so the two-tailed p = 2 × 0.0668 ≈ 0.134).
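The same numbers can be reproduced without disttool, using normcdf (a sketch; bJACK(:,k) is assumed to hold the jackknife estimates of the k-th coefficient):

bk = bJACK(:,k);                      % jackknife distribution of one coefficient
z = (0 - mean(bk)) / std(bk);         % standardize the value 0.0 against that distribution
p = 2 * normcdf(-abs(z));             % two-tailed p; e.g. z = -1.5 gives 2*0.0668 = 0.134
% p < 0.05 means 0.0 lies outside the distribution, i.e. the descriptor is significant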

13 LMO & Jackknife
All descriptors, MLR: q2TOT = -407.46; number of descriptors with p < 0.05 = 0, i.e. no significant descriptors.

14 LMO & Jackknife
All descriptors, PLS, lv = 14: q2TOT = -0.0988; number of descriptors with p < 0.05 = 28, i.e. 28 significant descriptors.

15 LMO & Jackknife
All descriptors, PLS, lv = 14: q2TOT = -0.0988, with 28 descriptors at p < 0.05. The significant descriptors can be sorted by p value, e.g. for a forward selection:
---------------------------------
Desc No       p
---------------------------------
51            1.4002e-022
37            1.383e-010
35            8.605e-009
38            9.1021e-009
39            1.8559e-008
36            8.7005e-008
15            0.00027689
1             0.00038808
2             0.00040547
45            0.00059674
32            0.00063731
---------------------------------
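Slide 17 indexes a variable called SRTDpDESC; as a sketch, such a sorted list could be built like this (the exact construction used in the original code is an assumption):

[pSrt, idx] = sort(p(:));                 % ascending jackknife p values
keep = pSrt < 0.05;                       % retain only the significant descriptors
SRTDpDESC = [idx(keep), pSrt(keep)];      % column 1: descriptor number, column 2: p value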

16 LMO & Jackknife
q2TOT at different numbers of latent variables in PLS (applying all descriptors), for 4 runs of the program:
lv      run 1     run 2     run 3     run 4
 8     -0.0411    0.0776   -0.0431    0.0270
 9      0.2200    0.2340    0.3641    0.2576
10      0.1721    0.1147    0.2391    0.1434    (37 signif. variables)
11      0.2855    0.1948    0.0667    0.2372
12      0.1847    0.1275    0.2390    0.2184
13     -0.0343   -0.1439    0.0120    0.0049
14     -0.2578   -0.2460   -0.3010   -0.0989    (28 signif. variables)
Beyond the optimum lv, the information content decreases and the model overfits.

17 LMO & Jackknife
for lv=6:13                          % number of latent variables in PLS
    for i=lv:18                      % number of descriptors taken from the sorted list
        [p,Z,q2TOTbox(lv,i),q2,bJACK]=...
            jackknife(D(:,SRTDpDESC(1:i,1)), y, 150, 27, 2, lv);
    end
end
[Surface of q2TOT versus lv and number of descriptors; the maximum q2TOT is found at lv = 7 with 7 descriptors.]

18 LMO & Jackknife
D=Dini(:,[38 50 3]);
[q2, bJACK]=jackknife(D, y, 500, 27)
An example with three significant descriptors (p < 0.05).

19 LMO & Jackknife: the jackknife function
[p,Z,q2TOTbox(lv,i),q2,bJACK]=...
    jackknife(D(:,[34 38 45 51]), y, 150, 27, 2, 7);
[34 38 45 51]: selected descriptors
150: number of subset samples in the jackknife
27: number of samples in the training set of each subset
2: calibration method (1, MLR; 2, PLS)
7: number of latent variables in PLS
The jackknife determines the significant descriptors and, beside LMO CV, serves as internal validation; it can also be applied for descriptor selection.

20
Exercise: apply the jackknife to a selected set of descriptors using MLR, and examine the results and the significance of the descriptors.

21 Cross model validation (CMV)
Anderssen et al., Reducing over-optimism in variable selection by cross model validation, Chemom Intell Lab Syst (2006) 84, 69-74.
Validation is performed during variable selection, not after it.
Gidskehaug et al., Cross model validation and optimization of bilinear regression models, Chemom Intell Lab Syst (2008) 93, 1-10.
CMV: the data set is split into a number of Train and Test sets; each Train set is further split into subsamples, and each subsample into SubTrain and SubValid.

22 CMV
[Scheme: n = 15, m = 3, G = 3. For each Train/Test split, the jackknife on the Train set selects the variables and the number of latent variables; the resulting PLS model (b1) predicts the Test set, giving q2CMV1, q2CMV2, ... The Test set makes no contribution to the variable- and lv-selection process.]
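A hedged sketch of this outer loop (the splitting pattern, the jackknife argument values, and the use of MLR on the selected variables as a stand-in for the PLS model of the scheme are all assumptions):

lv = 7; G = 3; n = size(D,1);
for g = 1:G
    te = g:G:n;                       % Test molecules of this segment
    tr = setdiff(1:n, te);            % Train molecules
    % variable selection with the jackknife on the Train part only
    [p, Z, q2TOTg, q2, bJACK] = jackknife(D(tr,:), y(tr), 150, numel(tr)-2, 2, lv);
    selVAR = find(p < 0.05);          % significant descriptors of this segment
    % model built on Train predicts the untouched Test part
    b = pinv(D(tr,selVAR)) * y(tr);
    yp = D(te,selVAR) * b;
    q2CMV(g) = 1 - sum((y(te)-yp).^2) / sum((y(te)-mean(y(tr))).^2);
end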

23 CMV
[Scheme continued: repeating over all segments gives q2CMV1, ..., q2CMVm.]
CMV is an effective external validation.

24 CMV: the crossmv function
[q2TOT,q2CMV]=crossmv(trainD,trainy,testD,testy,selVAR,7)
selVAR: set of selected descriptors (the applied calibration method is PLS)
7: number of latent variables in PLS
CMV is an effective external validation method.

25 Bootstrapping
Bootstrap re-sampling: another approach to internal validation.
Wehrens et al., The bootstrap: a tutorial, Chemom Intell Lab Syst (2000) 54, 35-52.
There is only one data set, and it should be representative of the population from which it was drawn. Bootstrapping simulates random selection from that population: K groups of size n are generated by repeatedly drawing n objects at random, with replacement, from the original data set.

26 Bootstrapping
Some objects may be included in the same random sample several times, while other objects are never selected. The model obtained from the n randomly selected objects is used to predict the target properties of the excluded objects, and q2 is estimated as in LMO.

27 Bootstrapping
for i=1:10                  % number of subsamples in the bootstrap
    for j=1:6               % Dr = 6 molecules in Train
        RND=randperm(6);
        bootSamp(i,j)=RND(1);   % draw one molecule at a time, with replacement
    end
end
Each row of bootSamp is one SubTrain of 6 molecules (the same number of molecules as Train) drawn with replacement; the molecules never drawn in that row form the corresponding SubValid. For example, row 1 = [5 5 6 3 6 1] → b1 with SubValid = [2 4] → q2_1, ..., and row 10 → b10 with its own SubValid → q2_10.

28 Bootstrapping
[Figure: bootstrap distributions of the coefficients of descriptors 38, 50 and 15.]

29 Bootstrapping
The distributions of the b values are not normal, so the confidence limits are estimated non-parametrically: with 200 sorted subsamples, 200 × 0.025 = 5, so the 5th value from the left and the 5th value from the right are the 95% confidence limits. A coefficient is significant when this interval does not include zero, and not significant when it does.
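As a sketch, assuming bBOOT (returned by the bootstrp call of slide 31) holds 200 bootstrap estimates per descriptor, the limits for the k-th coefficient are:

bs = sort(bBOOT(:,k));              % sorted bootstrap estimates of the k-th coefficient
lo = bs(5);                         % 200 x 0.025 = 5, i.e. the 5th value from the left
hi = bs(end-4);                     % the 5th value from the right
isSignif = (lo > 0) || (hi < 0);    % significant if the 95% interval excludes zero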

30 Bootstrapping
95% confidence limits of b:
Descriptor        38          50          15
lower limit      -12e-5      0.1113     -0.0181
upper limit      -1.5e-5     0.5131      0.0250
Descriptor 38: small effect, but significant. Descriptor 15: not significant.

31 Bootstrapping: the bootstrp function
[bBOOT]=bootstrp(trainD, trainy,1000,2,7)
1000: number of subset samples in the bootstrapping (the number of molecules in each SubTrain set equals the number of molecules in Train)
2: calibration method (1, MLR; 2, PLS)
7: number of latent variables in PLS
The bootstrap is a method for determining the confidence interval of each descriptor's coefficient.

32 Model validation
Y-randomization: the dependent-variable vector is randomly shuffled and a new QSAR model is developed using the original independent-variable matrix. The process is repeated a number of times. The expected outcome is QSAR models with low R2 and LOO q2 values; if high q2 values are sometimes obtained, this indicates chance correlation or structural redundancy in the training set, and an acceptable model cannot be obtained for that data set by the current modeling method.
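A minimal y-scrambling sketch (MLR via the pseudoinverse stands in for the actual calibration method, and the number of repetitions, 100, is an arbitrary choice):

for r = 1:100
    yp = y(randperm(numel(y)));               % shuffle the dependent variable only
    b = pinv(D) * yp;                         % refit with the original descriptor matrix
    R2rand(r) = 1 - sum((yp - D*b).^2) / sum((yp - mean(yp)).^2);
end
% R2rand (and the corresponding LOO q2) should stay low; consistently high values
% indicate chance correlation or structural redundancy of the training set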

33 Training and test
External validation requires selecting training and test sets:
a. Finding a newly tested experimental set: not a simple task.
b. Splitting the data set into a training set (for establishing the QSAR model) and a test set (for external validation).
Both the training and the test set should separately span the whole descriptor space occupied by the entire data set; ideally, each member of the test set should be close to at least one point of the training set.

34 Training and test
Approaches for creating training and test sets:
1. Straightforward random selection
Yasri et al., Toward an optimal procedure for variable selection and QSAR model building, J Chem Inf Comput Sci (2001) 41, 1218-1227.
2. Activity sampling
Kauffman et al., QSAR and k-nearest neighbor classification analysis of selective cyclooxygenase-2 inhibitors using topologically based numerical descriptors, J Chem Inf Comput Sci (2001) 41, 1553-1560.
Mattioni et al., Development of quantitative structure-activity relationship and classification models for a set of carbonic anhydrase inhibitors, J Chem Inf Comput Sci (2002) 42, 94-102.

35 Training and test
3. Systematic clustering techniques
Burden et al., Use of automatic relevance determination in QSAR studies using Bayesian neural networks, J Chem Inf Comput Sci (2000) 40, 1423-1430.
Snarey et al., Comparison of algorithms for dissimilarity-based compound selection, J Mol Graph Model (1997) 15, 372-385.
4. Self-organizing maps (SOMs): better than random selection
Gramatica et al., QSAR study on the tropospheric degradation of organic compounds, Chemosphere (1999) 38, 1371-1378.

36 Training and test
Kohonen map on the Selwood data matrix (53 × 31): the columns (molecules) are used as input for the Kohonen map, so that the selected samples come from all regions of the molecule space.
Test set, arrangement 1: molecules 19, 18, 3, 20, 4, 23, 14, 15, 16 (the other molecules form the training set).
Test set, arrangement 2: molecules 27, 12, 3, 7, 30, 23, 11, 16, ...
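A sketch of Kohonen-map-based sample selection, using selforgmap from MATLAB's neural-network toolbox as a stand-in for whatever SOM implementation was actually used (the map size is arbitrary; D is the 31 × 53 molecules-by-descriptors matrix):

net = selforgmap([3 3]);              % 3 x 3 Kohonen map
net = train(net, D');                 % columns of D' = molecules, described by their descriptors
unit = vec2ind(net(D'));              % map unit hit by each molecule
% choosing one molecule from each occupied unit samples all regions of molecule
% space; those molecules become the test set and the rest the training set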

37 Training and test
                                                              RMSEP     RMSECV
Sample selection (Kohonen), descriptor selection (Kohonen)    0.6251    0.4384
Sample selection (Kohonen), descriptor selection (P-value)    0.6432    0.4205
Descriptors selected using the Kohonen correlation map: 35, 36, 37, 40, 44, 43, 51
Descriptors selected using the P-value: 51, 37, 35, 38, 39, 36, 15 (correlation with activity!)

38 Training and test
5. Kennard-Stone design
Kennard et al., Computer aided design of experiments, Technometrics (1969) 11, 137-148.
Bourguignon et al., Optimization in irregularly shaped regions: pH and solvent strength in reverse-phase HPLC separation, Anal Chem (1994) 66, 893-904.
6. Factorial and D-optimal design
Eriksson et al., Multivariate design and modeling in QSAR. Tutorial, Chemometr Intell Lab Syst (1996) 34, 1-19.
Mitchell et al., Algorithm for the construction of "D-optimal" experimental designs, Technometrics (2000) 42, 48-54.

39 Training and test
Gramatica et al., QSAR modeling of bioconcentration factors by theoretical molecular descriptors, Quant Struct-Act Relat (2003) 22, 374-385.
D-optimal design: selection of the samples that maximize the determinant |X'X|, where X'X is the variance-covariance (information) matrix of the independent variables (descriptors), or of the independent plus dependent variables. The selected samples span the whole area occupied by the representative points and constitute the training set; the points not selected are used as the test set. This gives a well-balanced structural diversity and representativity of the entire data space (descriptors and responses).

40 Training and test
trainD1 = [D(1:3:end,:); D(2:3:end,:)];
trainD2 = D([1:2 5:13 17 21 22 25:end],:);

Selected descriptors                        detCovDy (trainD1)    detCovDy (trainD2)
D=Dini;                          % all       -3.48e-236 !!          2.13e-243 !!
D=Dini(:,[51 37 35 38 39 36 15]);             2.18e53                2.66e53
D=Dini(:,[38 50 3]);                          5.90e08                4.45e08

Optimum selection of the descriptors and of the molecules in the training set can be performed using detCovDy (D-optimal design).
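The detCovDy routine itself is not shown; one plausible reading, as a sketch only, is the determinant of the covariance matrix of the candidate training descriptors augmented with the response (trainy1, the response of trainD1, is a hypothetical name):

Dy = [trainD1, trainy1];              % descriptors plus response of the candidate training set
detCovDy = det(cov(Dy));              % D-optimal-style criterion: a larger determinant
                                      % indicates a better spread over the data space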

41 Leverage: model applicability domain
No matter how robust, significant and validated a QSAR model may be, it cannot be expected to reliably predict the modeled property for the entire universe of chemicals. Leverage is a criterion for deciding whether a query compound lies within the applicability domain of the model:
h = x' (X'X)^(-1) x
x: descriptor vector of the query compound
X: matrix of the training-set independent variables
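A sketch of the leverage calculation (Xtrain for the training-set descriptor matrix and xq for the query compound's column vector are assumed names; the 3p/n warning leverage is one common choice of threshold):

XtXinv = pinv(Xtrain' * Xtrain);                % pseudoinverse of X'X for the training set
hTrain = diag(Xtrain * XtXinv * Xtrain');       % leverages of the training compounds
hQuery = xq' * XtXinv * xq;                     % leverage of the query compound
hStar = 3 * size(Xtrain,2) / size(Xtrain,1);    % warning leverage, h* = 3p/n
inDomain = hQuery <= hStar;                     % inside the applicability domain?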

42 Leverage
Using all descriptors, the leverages of all test samples are very high: the test samples lie outside the space of the training samples and cannot be predicted.

43 Leverage
Using a reduced set of descriptors (38, 50, 3, 13, 24), the leverages of the test samples are similar to those of the training samples: the test samples lie within the space of the training samples and can be predicted.

