Presentation is loading. Please wait.

Presentation is loading. Please wait.

Dual data driven SIMCA as a one-class classifier 20.02.14 1 WSC-9 Alexey Pomerantsev ICP RAS.

Similar presentations


Presentation on theme: "Dual data driven SIMCA as a one-class classifier 20.02.14 1 WSC-9 Alexey Pomerantsev ICP RAS."— Presentation transcript:

1 Dual data driven SIMCA as a one-class classifier 20.02.14 1 WSC-9 Alexey Pomerantsev ICP RAS

2 One-class classifier, e.g. SIMCA 20.02.14 2 WSC-9 Target Alternative

3 Standard bi-variate normal distribution 20.02.14 3 WSC-9

4 20.02.14 4 WSC-9 Extremes and Outliers α is Extreme significance  =0.01 γ =0.05 γ is Outlier significance

5 Extreme plot 20.02.14 5 WSC-9

6 20.02.14 6 WSC-9 Principal Component Analysis I A TATA A PAPA EAEA + X I =× J J t I J Karl Pearson, 1901

7 20.02.14 7 WSC-9 Scores & Orthogonal Distances SD: distance within the model OD: distance to the model

8 ODSD SD & OD distributions 20.02.14 8 WSC-9 Nomikos P, MacGregor JF. Technometrics 1995; 37: 41–59. De Braekeleer K, et al. Chemom. Intell. Lab. Syst.1999; 46:103–116. Westerhuis JA, Gurden SP, Smilde AK. Chemom. Intell. Lab. Syst. 2000; 51: 95–114 Pomerantsev A. J. Chemometrics, 2008, 22 : 601-609

9 20.02.14 9 WSC-9 Data Driven SIMCA SDOD h u = v u 0 = ? N = ?

10 Total Distance 20.02.14 10 WSC-9 Scores distance (SD) Orthogonal distance (OD) Total distance (TD)

11 20.02.14 11 WSC-9 Tolerance Areas α is Extreme significance γ is Outlier significance

12 20.02.14 12 WSC-9 Classical Data Driven (CDD) SIMCA Classical Method of Moments Given Then Where

13 20.02.14 13 WSC-9 Robust Data Driven (RDD) SIMCA Robust Method of Moments M=median(u) R=interquartile(u) Given Then Where u (1) ≤ u (2 ) ≤.... ≤ u (I-1) ≤ u (I) ½ ¼

14 20.02.14 14 WSC-9 Dual Data Driven SIMCA Given Then X=T t P+E h=(h 1,...., h I ) v=(v 1,...., v I ) CDD SIMCARDD SIMCA Yes CDD SIMCA No RDD SIMCA

15 20.02.14 15 WSC-9 Case study I. Simulated data with outliers The numbers of variables, J=3 The numbers of objects, I=100 The number of principal components, A=2 The  properties are: E(  ) = 0, v 11 = v 22 = v 33 = 0.28, rank(V) = 2. The  component properties are: E(  ) = 0,  =0.05 (first 97 objects) E(  ) = 0,  =0.2 (last 3 objects)

16 20.02.14 16 WSC-9 SIMCA plots

17 20.02.14 17 WSC-9 REFERENCE & RDD-SIMCA

18 20.02.14 18 WSC-9 Totally in 10 data sets with outliers Expected

19 20.02.14 19 WSC-9 Case study II. Real world data with 2 groups Substance in the closed PE bags, 82 drums measured by NIR. Totally: 246 spectra Group G1: 200 objects Group G2: 46 objects ACA 642 (2009) 222-227

20 20.02.14 20 WSC-9 Probe position effect

21 20.02.14 21 WSC-9 Extreme plots Clean subset G1Contaminated dataset G1+G2 Expected number of extremes N=  I

22 20.02.14 22 WSC-9 Results of separation Subset G1 revealedSubset G2 revealed

23 Reference 20.02.14 23 WSC-9

24 One-class classification 20.02.14 24 WSC-9 Alternatives Type II error =1 − Type I error

25 How to find β in case AC is known 20.02.14 25 WSC-9 Target Alternative

26 Two-classes discrimination: plums & apples 20.02.14 26 WSC-9 mesh size ?

27 Errors of Type I and Type II 20.02.14 27 WSC-9

28 PC1 PC2 Type II error β 20.02.14 28 WSC-9 Target Alternative PCAPCA χ'2χ'2 χ2χ2

29 Non-central chi-squared distribution 20.02.14 29 WSC-9 the noncentrality parameter chi-squared distribution non-central chi-squared distribution

30 Calculation of β 20.02.14 30 WSC-9 Total distance of Target class (TC) Type II error h 0 =?,v 0 =?, N h =?, N v =? Total distance of Alternative class (AC)

31 20.02.14 31 WSC-9 Case study II. Real world data with 2 groups Substance in the closed PE bags, 82 drums measured by NIR. Totally: 246 spectra Group G1: 200 objects Group G2: 46 objects Type II error estimation

32 G2 = AC1 + AC2 + AC3 + AC4 20.02.14 32 WSC-9

33 Total distance c distributions 20.02.14 33 WSC-9 β α

34 Type II validation 20.02.14 34 WSC-9

35 Risk management 20.02.14 35 WSC-9 given α calculated c crit found β given β found α calculated c crit

36 20.02.14 36 WSC-9 Conclusion 1 Extreme objects play an important role in data analysis. These objects should not be confused with outliers. The number of extremes should be compared to the expected number, coupled with the significance level . Clean datasetContaminated dataset

37 20.02.14 37 WSC-9 Conclusion 2 Errors in decision making are inevitable. Reducing one error, we increase the other. The researcher's task is to find the balance of risks. Our approach provides such an opportunity. Examples will be presented in Oxana’s lecture. β α

38 20.02.14 38 WSC-9 Conclusion 3 The proposed Dual Data Driven PCA/SIMCA approach looks like a fine competitor to the pure classical and to the strictly robust methods. This technique has demonstrated a proper performance in the analysis of both regular and contaminated data sets. Clean datasetContaminated dataset

39 20.02.14 39 WSC-9 Thank you for your attention A Lawyer’s Mistake


Download ppt "Dual data driven SIMCA as a one-class classifier 20.02.14 1 WSC-9 Alexey Pomerantsev ICP RAS."

Similar presentations


Ads by Google