Subset Selection Problem Oxana Rodionova & Alexey Pomerantsev Semenov Institute of Chemical Physics Russian Chemometric Society Moscow.

1 Subset Selection Problem Oxana Rodionova & Alexey Pomerantsev Semenov Institute of Chemical Physics Russian Chemometric Society Moscow

2 Outline: Introduction (What is a representative subset?); Training set and test set; Influential subset selection (Boundary subset, Kennard-Stone subset, Models' comparison); Conclusions

3 What is a representative subset? [Diagram: data X and Y, the model Y(X), and subsets I, II and III of X and Y]

4 Influential Subset. Training set X (n × m), Y (n × k) → Model I (A factors), RMSEP 1. Influential subset X (l × m), Y (l × k), l < n → Model II (A factors), RMSEP 2. Question: is Model II ~ Model I in quality of prediction?

5 Training and Test Sets. The entire data set (K objects) is divided into a training set (N objects) and a test set (K − N objects).

6 Statistical Tests. D. Jouan-Rimbaud, D.L. Massart, C.A. Saby, C. Puel, Characterisation of the representativity of selected sets of samples in multivariate calibration and pattern recognition, Analytica Chimica Acta 350 (1997) 149-161. Tests: generalization of Bartlett's test; Hotelling T²-test. Properties compared between the two clouds of points: orientation; dispersion around their means; similar position in space.
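As an illustration of one of the tests named on this slide, here is a minimal sketch of the standard two-sample Hotelling T² statistic, which compares the positions (means) of two clouds. It is not taken from the cited paper: the function name `hotelling_t2` is hypothetical, and the sketch assumes more objects than variables so that the pooled covariance is invertible. For NIR spectra with 118 wavelengths and 19 objects that assumption fails, so the statistic would have to be computed on a few score vectors (T space) instead.

```python
import numpy as np

def hotelling_t2(a, b):
    """Two-sample Hotelling T^2 and the corresponding F statistic.

    a, b: arrays of shape (n1, p) and (n2, p). Hypothetical helper.
    Assumes n1 + n2 - 2 >= p so the pooled covariance can be inverted.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n1, p = a.shape
    n2 = b.shape[0]
    diff = a.mean(axis=0) - b.mean(axis=0)
    # Pooled covariance of the two groups (each with ddof=1).
    s_pooled = ((n1 - 1) * np.cov(a, rowvar=False)
                + (n2 - 1) * np.cov(b, rowvar=False)) / (n1 + n2 - 2)
    s_pooled = np.atleast_2d(s_pooled)
    t2 = (n1 * n2) / (n1 + n2) * diff @ np.linalg.solve(s_pooled, diff)
    # Under equal means and covariances, f_stat ~ F(p, n1 + n2 - p - 1).
    f_stat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * t2
    return float(t2), float(f_stat)
```

A small T² indicates that the subset and the full set occupy a similar position in space; it says nothing about orientation or dispersion, which the other tests on the slide address.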

7 RPV. Influential Subset → Boundary Samples

9 Whole Wheat Samples (Data description). X: NIR spectra of whole wheat (118 wavelengths). Y: moisture content. Entire set: N = 139, data pre-processed. Training set = 99 objects; test set = 40 objects. PLS model, 4 PCs. SIC modeling, b_sic = 1.5.

10 Boundary subset. Training set (n × m), n = 99 → Model 1. Boundary samples: l = 19. 'Redundant subset': n − l = 80.

11 Boundary Subset. Training set (n = 99) → Model 1; boundary subset (l = 19) → Model 2; both models are applied to the test set. 4 PLS components, b_sic = 1.5.

12 SIC prediction. Model 1 (Training set) → Test set; Model 2 (Boundary subset) → Test set

13 Quality of prediction (PLS models). Model 1 (Training set) → Test set: RMSEC = 0.303, RMSEP = 0.337, mean calibration leverage = 0.051, maximum calibration leverage = 0.25. Model 2 (Boundary set) → Test set: RMSEC = 0.461, RMSEP = 0.357, mean calibration leverage = 0.26, maximum calibration leverage = 0.45.
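The quantities on this slide can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: `rmse` uses the plain mean of squared residuals (some packages divide RMSEC by the residual degrees of freedom instead), and `leverages` assumes the common definition h_i = 1/n + t_i' (T'T)^-1 t_i for a mean-centered model with score matrix T of shape n × A. Under that definition the mean leverage is exactly (A + 1)/n, which gives 5/99 ≈ 0.051 and 5/19 ≈ 0.26 for four components, consistent with the means reported above.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error; RMSEC on calibration data, RMSEP on test data."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def leverages(scores):
    """Calibration leverages h_i = 1/n + t_i' (T'T)^-1 t_i.

    scores: (n, A) matrix of calibration scores T (assumed definition).
    """
    T = np.asarray(scores, dtype=float)
    n = T.shape[0]
    gram_inv = np.linalg.inv(T.T @ T)
    # Quadratic form t_i' G t_i evaluated for every row i at once.
    return 1.0 / n + np.einsum("ij,jk,ik->i", T, gram_inv, T)
```

Because the mean leverage depends only on A and n, the maximum leverage is the more informative figure when comparing subsets of equal size.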

14 Kennard-Stone Method. Aim: select samples that are uniformly distributed over the predictors' space. Objects are chosen sequentially in X or T space. d_jr, j = 1, ..., k, is the squared Euclidean distance from candidate object r to the k objects already in the subset.
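The sequential selection described on this slide can be sketched as below. The slide does not show the selection rule itself, so the sketch assumes the usual Kennard-Stone formulation: start from the two most distant objects, then repeatedly add the candidate r whose smallest distance d_jr to the objects already chosen is largest. The function name `kennard_stone` is hypothetical.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return indices of n_select rows of X chosen by maximin distance.

    X: (n, m) array of objects in X space or score (T) space.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    if not 2 <= n_select <= n:
        raise ValueError("n_select must be between 2 and the number of objects")
    # Full matrix of squared Euclidean distances between objects.
    diff = X[:, None, :] - X[None, :, :]
    d = np.einsum("ijk,ijk->ij", diff, diff)
    # Seed the subset with the two most distant objects.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    selected = [int(i), int(j)]
    # min_d[r] = min over selected j of d_jr; selected objects are masked out.
    min_d = np.minimum(d[i], d[j])
    min_d[selected] = -1.0
    while len(selected) < n_select:
        r = int(np.argmax(min_d))
        selected.append(r)
        min_d = np.minimum(min_d, d[r])
        min_d[selected] = -1.0
    return selected
```

For the data set in this presentation the call would be `kennard_stone(X_train, 19)` on the 99 training objects, where `X_train` stands for the spectra or their PLS scores.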

15 Kennard-Stone Subset. Training set (n = 99) → Model 1, 4 PLS components. Boundary subset → Model 2. K-S subset (l = 19) → Model 3.

16 Boundary Subset & K-S Subset (SIC prediction)

17 Boundary Subset & K-S Subset (PLS models). Model 2 (Boundary set) → Test set: RMSEC = 0.461, RMSEP = 0.357, mean calibration leverage = 0.26, maximum calibration leverage = 0.45. Model 3 (K-S set) → Test set: RMSEC = 0.229, RMSEP = 0.368, mean calibration leverage = 0.26, maximum calibration leverage = 0.73.

18 'Redundant samples'. Training set (N = 99) → Model 1 (PLS, 4 PCs, b_sic = 1.5). Test set: N1 = 40. Boundary set (L = 19) → Model 2; redundant set RS_2, N − L = 80. Kennard-Stone set (L = 19) → Model 3; redundant set RS_3, N − L = 80.

19 Prediction of Redundant Sets. Model 2 (Boundary set) → RS_2: RMSEP = 0.267. Model 3 (K-S set) → RS_3: RMSEP = 0.338.

20 Model comparison. The entire data set (139 objects) is split randomly 10 times into a training set (99 objects) and a test set (40 objects); the results are averaged over the 10 splits.
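The repeated random splitting on this slide can be sketched as follows. The sizes 139, 99 and 10 come from the slide; the generator name `repeated_splits` and the `seed` argument are assumptions added for illustration and reproducibility.

```python
import numpy as np

def repeated_splits(n_total=139, n_train=99, n_repeats=10, seed=0):
    """Yield (train_idx, test_idx) index arrays for repeated random splits."""
    rng = np.random.default_rng(seed)
    for _ in range(n_repeats):
        perm = rng.permutation(n_total)
        # First n_train shuffled indices form the training set, the rest the test set.
        yield perm[:n_train], perm[n_train:]
```

Each pair of index arrays would be used to fit the models on the training part, compute RMSEP on the test part, and average the resulting values over the 10 repeats.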

21 Conclusions: 1. A model built on the Boundary Subset can predict all other samples with an accuracy no worse than the calibration error evaluated on the whole data set. 2. The Boundary Subset is significantly smaller than the whole Training Set. Questions: 1. Prediction ability: how to evaluate it? 2. Representativity: how to verify it?

