WSC-6 Critical levels in projection Alexey Pomerantsev Semenov Institute of Chemical Physics, Moscow
WSC-6 Projection approach
WSC-6 Scores & Orthogonal Distances. OD: distance to the model. SD: distance within the model.
WSC-6 Where applied: SIMCA, classification, PLS/PCR, influence plot, MSPC
WSC-6 Giants battle at ICS-L, April 2007. Svante Wold: "The ratios of residual variances of PCA are fairly well F-distributed. This is easy - the shape of the distribution of a ratio of two variances usually looks like an F." Barry Wise: "No, the residuals from PCA don't follow an F-distribution unless you fuss with the degrees of freedom, and there are better alternatives in any case."
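WSC-6 Full PCA Decomposition. $X = T P^{t}$, where $X$ is $I \times J$, $T$ is $I \times K$, $P^{t}$ is $K \times J$, and $K = \mathrm{rank}(X) \le \min(I, J)$; $\Lambda = T^{t} T = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$.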
WSC-6 Truncated PCA Decomposition. $X = T_A P_A^{t} + E_A$ with $A \le K$, where $T_A$ is $I \times A$, $P_A^{t}$ is $A \times J$, and $E_A$ is $I \times J$.
WSC-6 Score distance (SD), $h_i$. Leverage $= h_i + 1/I$; Mahalanobis distance $= (h_i)^{1/2}$.
WSC-6 Orthogonal distance (OD), $v_i$. Variance per sample $= v_i/J$; Q statistic $= v_i$.
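The two distances can be sketched in a few lines of Python. The slides do not spell out the formulas, so this assumes the usual definitions: $h_i = \sum_a t_{ia}^2/\lambda_a$ with $\lambda_a = \sum_i t_{ia}^2$ (orthogonal score columns), and $v_i = \sum_j e_{ij}^2$. The function name and the toy matrices are hypothetical.

```python
def score_and_orthogonal_distances(scores, residuals):
    """Return (h, v) from score rows t_i and residual rows e_i.

    Assumed definitions (not stated on the slides):
      h_i = sum_a t_ia^2 / lambda_a,  lambda_a = sum_i t_ia^2
      v_i = sum_j e_ij^2
    """
    n_components = len(scores[0])
    # Diagonal of T^t T; valid as eigenvalues only if score columns are orthogonal.
    eigenvalues = [sum(row[a] ** 2 for row in scores) for a in range(n_components)]
    h = [sum(row[a] ** 2 / eigenvalues[a] for a in range(n_components)) for row in scores]
    v = [sum(e * e for e in row) for row in residuals]
    return h, v


# Toy data: 4 samples, A = 2 components, J = 3 variables.
T = [[1.0, 0.0], [-1.0, 0.0], [0.0, 2.0], [0.0, -2.0]]
E = [[1.0, 2.0, 0.0], [0.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]

h, v = score_and_orthogonal_distances(T, E)
n_samples, n_variables = len(E), len(E[0])

leverage = [hi + 1.0 / n_samples for hi in h]      # h_i + 1/I
mahalanobis = [hi ** 0.5 for hi in h]              # (h_i)^(1/2)
variance_per_sample = [vi / n_variables for vi in v]  # v_i / J
```

Under these definitions the $h_i$ sum to $A$ over the calibration set, which is a quick sanity check on any implementation.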
WSC-6 Distribution of distances: the shape? Let $x = h/h_0$ or $x = v/v_0$. Then $x \sim \chi^2(N)/N$, where $N$ is the number of degrees of freedom (DoF), so $E(x) = 1$ and $D(x) = 2/N$.
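The moments follow from $E[\chi^2(N)] = N$ and $D[\chi^2(N)] = 2N$ after dividing by $N$. A small simulation can confirm them; this is only an illustrative sketch, and the function name, sample size, and seed are arbitrary choices.

```python
import random
from statistics import fmean, pvariance


def scaled_chi2_sample(n_dof, size, seed=0):
    """Draw `size` values of chi2(n_dof)/n_dof.

    chi2(N) is a gamma distribution with shape N/2 and scale 2.
    """
    rng = random.Random(seed)
    return [rng.gammavariate(n_dof / 2.0, 2.0) / n_dof for _ in range(size)]


x = scaled_chi2_sample(n_dof=5, size=50_000)
print("mean:", fmean(x))          # should be close to 1
print("variance:", pvariance(x))  # should be close to 2/5
```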
WSC-6 Example: Leon Rusinov data. I=1440, A=6, $N_h = 5$, $N_v = 1$. [Figure panels: SD, OD]
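WSC-6 Distribution of distances: DoF? Given $x_1, \ldots, x_I \sim \chi^2(N)/N$ with $x = h/h_0$ or $x = v/v_0$, $N = ?$ Two estimators: Method of Moments and Interquartile Approach. The latter uses the ordered sample $x_{(1)} \le x_{(2)} \le \ldots \le x_{(I-1)} \le x_{(I)}$ and its interquartile range (IQR), which leaves ¼ of the points below and ¼ above.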
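Both estimators can be sketched as follows. This is one possible implementation, not necessarily the one behind the slides. The method of moments uses $D(x) = 2/N$, giving $N = 2\,\bar{h}^2 / s^2$ with $h_0 = \bar{h}$. The interquartile version matches the sample ratio IQR/median to the same ratio for $\chi^2(N)$ and solves for $N$ by bisection. All function names are hypothetical, and the $\chi^2$ quantiles are computed with a plain series expansion to keep the example free of third-party libraries.

```python
import math
import random
from statistics import fmean, pvariance, quantiles


def chi2_cdf(x, n_dof):
    """CDF of chi2(n_dof) via the series for the lower incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    a, z = 0.5 * n_dof, 0.5 * x
    term = total = 1.0 / a
    k = 1
    while term > 1e-16 * total:
        term *= z / (a + k)
        total += term
        k += 1
    log_p = math.log(total) + a * math.log(z) - z - math.lgamma(a)
    return min(1.0, math.exp(log_p))


def chi2_quantile(p, n_dof):
    """Invert chi2_cdf by bisection."""
    lo, hi = 0.0, n_dof + 20.0 * math.sqrt(2.0 * n_dof) + 50.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, n_dof) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def dof_moments(sample):
    """Method of moments: returns (N, scale), scale being h0 or v0."""
    mean = fmean(sample)
    return 2.0 * mean * mean / pvariance(sample, mean), mean


def dof_iqr(sample, n_min=0.5, n_max=200.0):
    """Interquartile approach: match IQR/median of the sample to chi2(N)."""
    q1, median, q3 = quantiles(sample, n=4)
    target = (q3 - q1) / median

    def ratio(n):
        return (chi2_quantile(0.75, n) - chi2_quantile(0.25, n)) / chi2_quantile(0.5, n)

    lo, hi = n_min, n_max
    for _ in range(40):
        mid = 0.5 * (lo + hi)
        if ratio(mid) > target:  # spread too wide for this N, so N must be larger
            lo = mid
        else:
            hi = mid
    n = 0.5 * (lo + hi)
    return n, n * median / chi2_quantile(0.5, n)


# Synthetic check: distances distributed as h0 * chi2(5)/5 with h0 = 3.
rng = random.Random(42)
distances = [3.0 * rng.gammavariate(2.5, 2.0) / 5.0 for _ in range(20_000)]
print("moments:", dof_moments(distances))
print("IQR:    ", dof_iqr(distances))
```

The moment estimator depends on the sample variance, so a single gross outlier can shift it substantially; the quartile-based estimator ignores the tails, which is the robustness trade-off.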
WSC-6 Type I error. I=100. [Figure: for several values of the type I error, the number of points that are out]
WSC-6 SIM Data. MSPC task. I=100, J=25, A=5, α=0.05
WSC-6 SD & OD values
WSC-6 DoF Estimates. Interquartile Approach: $N_h = 5.7$, $N_v = 21.6$. Method of Moments: $N_h = 5.0$, $N_v = 20.0$.
WSC-6 Acceptance areas: conventional. I=100, α=0.05
WSC-6 Acceptance areas: Sum of CHIs. I=100, α=0.05
WSC-6 Acceptance areas: Ratio of CHIs. I=100, α=0.05
WSC-6 Wilson-Hilferty approximation for Chi
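The formula itself did not survive on this slide. The standard Wilson-Hilferty result is that $(\chi^2(N)/N)^{1/3}$ is approximately normal with mean $1 - 2/(9N)$ and variance $2/(9N)$. A minimal sketch of the resulting quantile approximation follows; the function name is hypothetical.

```python
from statistics import NormalDist


def wh_quantile(p, n_dof):
    """Wilson-Hilferty approximation to the p-quantile of chi2(N)/N."""
    z = NormalDist().inv_cdf(p)   # standard normal quantile
    a = 2.0 / (9.0 * n_dof)
    return (1.0 - a + z * a ** 0.5) ** 3


# Approximate 95% critical value of chi2(10); the exact value is about 18.31.
print(10 * wh_quantile(0.95, 10))
```

The approximation is quite accurate for moderate $N$ and degrades for very small $N$, which matters when the estimated DoF is close to 1.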
WSC-6 Acceptance areas: Wilson-Hilferty. I=100, α=0.05
WSC-6 Modified Wilson-Hilferty approximation. $1 - \gamma = P_0 + P_1 + P_2 + P_3 = \Phi(r) - \tfrac{1}{4}\exp\left(-\tfrac{1}{2} r^2\right)$, $r = r(\gamma)$
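The relation has no closed-form inverse, so $r(\gamma)$ has to be found numerically. A minimal sketch using bisection is shown below; it relies on the left-hand side increasing in $r$ for $r \ge 0$, from 0.25 at $r = 0$ towards 1. The closed form is consistent with reading $P_0, \ldots, P_3$ as the probabilities of a quadrant, two half-strips, and a quarter-disc of radius $r$ under two independent standard normal variables, but that reading is an assumption, as are the function names.

```python
import math
from statistics import NormalDist


def coverage(r):
    """Phi(r) - (1/4) exp(-r^2 / 2), the right-hand side of the slide's relation."""
    return NormalDist().cdf(r) - 0.25 * math.exp(-0.5 * r * r)


def solve_r(gamma, tol=1e-12):
    """Find r >= 0 such that coverage(r) = 1 - gamma, by bisection."""
    target = 1.0 - gamma
    if not 0.25 < target < 1.0:
        raise ValueError("gamma must lie strictly between 0 and 0.75")
    lo, hi = 0.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if coverage(mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


r = solve_r(0.05)
print(r, coverage(r))
```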
WSC-6 Acceptance areas: modified Wilson-Hilferty. I=100, α=0.05
WSC-6 Areas Validation: variation of α
WSC-6 BMT Data. SIMCA. I=45, J=3501, A=2, $N_h = 3$, $N_v = 2$, α=0.025
WSC-6 Extremes & Outliers in calibration set. $\gamma$ is the significance level for outliers; $\alpha = 1 - (1 - \gamma)^{1/I}$. [Plot labels: extreme, outlier] Calibration set: I=45, γ I = 45 = 1.25, $I_{out} = 2$
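The relation on this slide reads most naturally as $\alpha = 1 - (1 - \gamma)^{1/I}$: if each of $I$ independent samples falls outside an area with probability $\alpha$, then at least one does so with probability $\gamma = 1 - (1 - \alpha)^I$. The sketch below works under that assumption; the function names are hypothetical, and the value $\gamma = 0.05$ is chosen only for illustration.

```python
def alpha_from_gamma(gamma, n_samples):
    """Per-sample level alpha such that P(at least one of I samples is out) = gamma."""
    return 1.0 - (1.0 - gamma) ** (1.0 / n_samples)


def expected_extremes(alpha, n_samples):
    """Expected number of samples outside the alpha-acceptance area."""
    return alpha * n_samples


alpha_out = alpha_from_gamma(gamma=0.05, n_samples=45)
print("per-sample level for the outlier area:", alpha_out)
print("expected extremes at alpha = 0.025, I = 45:", expected_extremes(0.025, 45))
```

A sample beyond the $\alpha$-area but inside the outlier area would count as an extreme, which is expected to occur now and then; only samples beyond the outlier area are treated as outliers.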
WSC-6 SIMCA Classification without G07-4. New set: $I_{new} = 30$ (10 Genuine + 20 Fakes), γ I new = 10 = 0.25, $I_{out} = 3$
WSC-6 What’s up? This classification is absolutely wrong, but Oxana will explain how to fix it.
WSC-6 GRAIN Data. Influence plots. I=123, J=118, A=4, α=0.01, $N_h = 5.7$, $N_v = 3.0$, $N_u = 1.0$. [Figure panels: X, Y]
WSC-6 Orthogonal distance to Y
WSC-6 Back to WSC-4
WSC-4 Training set Model 1 Boundary subset l=19 Boundary samples (WSC-4)
WSC-6 Influence plots for X and Y. [Figure panels: Y, X. Legend: Calibration, Boundary (SIC)]
WSC-6 Box or Egg? I<30
WSC-6 Conclusion 1. The $\chi^2$-distribution can be used in the modeling of the score and orthogonal distances.
WSC-6 Conclusion 2. Any classification problem should be solved with respect to a given type I error. Five such acceptance areas have been presented, but only two are recommended: one for I>30 and one for I<30.
WSC-6 Conclusion 3. Estimation of DoF is a key challenge in projection modeling. A data-driven estimator of DoF, rather than a theory-driven one, should be used. The method of moments is effective but sensitive to outliers. The IQR estimator is a robust but less effective alternative. More examples will be demonstrated in the subsequent presentation by Oxana.