Download presentation
Presentation is loading. Please wait.
Published byOpal Ray Modified over 9 years ago
1
Dual data driven SIMCA as a one-class classifier 20.02.14 1 WSC-9 Alexey Pomerantsev ICP RAS
2
One-class classifier, e.g. SIMCA 20.02.14 2 WSC-9 Target Alternative
3
Standard bi-variate normal distribution 20.02.14 3 WSC-9
4
20.02.14 4 WSC-9 Extremes and Outliers α is Extreme significance =0.01 γ =0.05 γ is Outlier significance
5
Extreme plot 20.02.14 5 WSC-9
6
20.02.14 6 WSC-9 Principal Component Analysis I A TATA A PAPA EAEA + X I =× J J t I J Karl Pearson, 1901
7
20.02.14 7 WSC-9 Scores & Orthogonal Distances SD: distance within the model OD: distance to the model
8
ODSD SD & OD distributions 20.02.14 8 WSC-9 Nomikos P, MacGregor JF. Technometrics 1995; 37: 41–59. De Braekeleer K, et al. Chemom. Intell. Lab. Syst.1999; 46:103–116. Westerhuis JA, Gurden SP, Smilde AK. Chemom. Intell. Lab. Syst. 2000; 51: 95–114 Pomerantsev A. J. Chemometrics, 2008, 22 : 601-609
9
20.02.14 9 WSC-9 Data Driven SIMCA SDOD h u = v u 0 = ? N = ?
10
Total Distance 20.02.14 10 WSC-9 Scores distance (SD) Orthogonal distance (OD) Total distance (TD)
11
20.02.14 11 WSC-9 Tolerance Areas α is Extreme significance γ is Outlier significance
12
20.02.14 12 WSC-9 Classical Data Driven (CDD) SIMCA Classical Method of Moments Given Then Where
13
20.02.14 13 WSC-9 Robust Data Driven (RDD) SIMCA Robust Method of Moments M=median(u) R=interquartile(u) Given Then Where u (1) ≤ u (2 ) ≤.... ≤ u (I-1) ≤ u (I) ½ ¼
14
20.02.14 14 WSC-9 Dual Data Driven SIMCA Given Then X=T t P+E h=(h 1,...., h I ) v=(v 1,...., v I ) CDD SIMCARDD SIMCA Yes CDD SIMCA No RDD SIMCA
15
20.02.14 15 WSC-9 Case study I. Simulated data with outliers The numbers of variables, J=3 The numbers of objects, I=100 The number of principal components, A=2 The properties are: E( ) = 0, v 11 = v 22 = v 33 = 0.28, rank(V) = 2. The component properties are: E( ) = 0, =0.05 (first 97 objects) E( ) = 0, =0.2 (last 3 objects)
16
20.02.14 16 WSC-9 SIMCA plots
17
20.02.14 17 WSC-9 REFERENCE & RDD-SIMCA
18
20.02.14 18 WSC-9 Totally in 10 data sets with outliers Expected
19
20.02.14 19 WSC-9 Case study II. Real world data with 2 groups Substance in the closed PE bags, 82 drums measured by NIR. Totally: 246 spectra Group G1: 200 objects Group G2: 46 objects ACA 642 (2009) 222-227
20
20.02.14 20 WSC-9 Probe position effect
21
20.02.14 21 WSC-9 Extreme plots Clean subset G1Contaminated dataset G1+G2 Expected number of extremes N= I
22
20.02.14 22 WSC-9 Results of separation Subset G1 revealedSubset G2 revealed
23
Reference 20.02.14 23 WSC-9
24
One-class classification 20.02.14 24 WSC-9 Alternatives Type II error =1 − Type I error
25
How to find β in case AC is known 20.02.14 25 WSC-9 Target Alternative
26
Two-classes discrimination: plums & apples 20.02.14 26 WSC-9 mesh size ?
27
Errors of Type I and Type II 20.02.14 27 WSC-9
28
PC1 PC2 Type II error β 20.02.14 28 WSC-9 Target Alternative PCAPCA χ'2χ'2 χ2χ2
29
Non-central chi-squared distribution 20.02.14 29 WSC-9 the noncentrality parameter chi-squared distribution non-central chi-squared distribution
30
Calculation of β 20.02.14 30 WSC-9 Total distance of Target class (TC) Type II error h 0 =?,v 0 =?, N h =?, N v =? Total distance of Alternative class (AC)
31
20.02.14 31 WSC-9 Case study II. Real world data with 2 groups Substance in the closed PE bags, 82 drums measured by NIR. Totally: 246 spectra Group G1: 200 objects Group G2: 46 objects Type II error estimation
32
G2 = AC1 + AC2 + AC3 + AC4 20.02.14 32 WSC-9
33
Total distance c distributions 20.02.14 33 WSC-9 β α
34
Type II validation 20.02.14 34 WSC-9
35
Risk management 20.02.14 35 WSC-9 given α calculated c crit found β given β found α calculated c crit
36
20.02.14 36 WSC-9 Conclusion 1 Extreme objects play an important role in data analysis. These objects should not be confused with outliers. The number of extremes should be compared to the expected number, coupled with the significance level . Clean datasetContaminated dataset
37
20.02.14 37 WSC-9 Conclusion 2 Errors in decision making are inevitable. Reducing one error, we increase the other. The researcher's task is to find the balance of risks. Our approach provides such an opportunity. Examples will be presented in Oxana’s lecture. β α
38
20.02.14 38 WSC-9 Conclusion 3 The proposed Dual Data Driven PCA/SIMCA approach looks like a fine competitor to the pure classical and to the strictly robust methods. This technique has demonstrated a proper performance in the analysis of both regular and contaminated data sets. Clean datasetContaminated dataset
39
20.02.14 39 WSC-9 Thank you for your attention A Lawyer’s Mistake
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.