Published by Loraine Joanna Harrison. Modified over 9 years ago.
1
10.02.05 WSC-4
Simple View on Simple Interval Calculation (SIC)
Alexey Pomerantsev, Oxana Rodionova (Institute of Chemical Physics, Moscow) and Kurt Varmuza (Vienna Technical University)
© Kurt Varmuza
2
CAC, Lisbon, September 2004
3
Leisured Agenda
1. Why are errors limited?
2. Simple calculations, indeed! Univariate case
3. Complicated SIC. Bivariate case
4. Conclusions
4
Part 1. Why are errors limited?
5
Water in wheat. NIR spectra by Lumex Co.
6
Histogram for Y (water contents). 141 samples
7
Normal Probability Plot for Y. Plot labels: 3%, 21%, 38%
8
PLS Regression. Whole data set
9
PLS Regression. Marked “outliers”
10
PLS Regression. Revised data set
11
Histogram for Y. Revised data set. 124 samples
12
Normal Probability Plot. Revised data set. Plot labels: 31%, 81%, 96%
13
Histogram for Y. Revised data set. Axis marks: m-3σ, m-2σ, m-σ, m, m+σ, m+2σ, m+3σ
14
Error Distribution
Panels: Normal distribution; Truncated normal distribution (label 3.5); Both distributions
15
Main SIC postulate
All errors are limited!
There exists a Maximum Error Deviation, β, such that for any error ε: Prob{ |ε| > β } = 0
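The postulate can be illustrated with a truncated normal error. The sketch below is only an illustration: the rejection sampler, the value of sigma and the choice of truncating at 3.5 sigma are assumptions made here, not settings taken from the slides.

```python
import random

def truncated_normal_error(sigma, beta, rng):
    """Draw one error from N(0, sigma^2) restricted to [-beta, +beta] by rejection."""
    while True:
        eps = rng.gauss(0.0, sigma)
        if abs(eps) <= beta:
            return eps

rng = random.Random(0)       # fixed seed, for reproducibility only
sigma = 1.0                  # hypothetical standard deviation
beta = 3.5 * sigma           # hypothetical Maximum Error Deviation
errors = [truncated_normal_error(sigma, beta, rng) for _ in range(10_000)]
max_abs_error = max(abs(e) for e in errors)
print(max_abs_error <= beta)  # no sampled error exceeds beta, by construction
```

With such a distribution the event |ε| > β never occurs, which is exactly what the postulate demands.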
16
Part 2. Simple calculations
17
Case study. Simple Univariate Model
Data:
               x      y
Training  C1   1.0    1.28
          C2   2.0    1.68
          C3   4.0    4.25
          C4   5.0    5.32
Test      T1   3.0    3.35
          T2   4.5    6.19
          T3   5.5    5.40
Model: y = ax + ε
Error distribution
18
OLS calibration
OLS calibration minimizes the sum of squared residuals.
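For the no-intercept model y = ax of the case study, the least-squares slope has a closed form, a = Σxy / Σx². The following minimal sketch applies it to the four training samples; the function and variable names are illustrative only.

```python
# Training samples (x, y) of the univariate case study
train = [(1.0, 1.28), (2.0, 1.68), (4.0, 4.25), (5.0, 5.32)]

def ols_slope(data):
    """Least-squares slope for the no-intercept model y = a*x."""
    sxy = sum(x * y for x, y in data)
    sxx = sum(x * x for x, _ in data)
    return sxy / sxx

a_ols = ols_slope(train)
residuals = [y - a_ols * x for x, y in train]
sse = sum(e * e for e in residuals)   # the quantity that OLS minimizes
print(round(a_ols, 3), round(sse, 3))
```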
19
Uncertainties in OLS
t_3(P) is the quantile of Student's t-distribution for probability P with 3 degrees of freedom.
20
SIC calibration
The Maximum Error Deviation is known: β = 0.7 (= 2.5s)
|ε| < β
21
SIC calibration
               x      y      a_min   a_max
Training  C1   1.0    1.28   0.58    1.98
          C2   2.0    1.68   0.49    1.19
          C3   4.0    4.25   0.89    1.24
          C4   5.0    5.32   0.92    1.20
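The a_min and a_max columns follow from the error bound: for a sample with x > 0, the condition |y - ax| ≤ β holds exactly when (y - β)/x ≤ a ≤ (y + β)/x. A short sketch with β = 0.7 (the helper name is hypothetical):

```python
beta = 0.7
train = {"C1": (1.0, 1.28), "C2": (2.0, 1.68), "C3": (4.0, 4.25), "C4": (5.0, 5.32)}

def slope_bounds(x, y, beta):
    """Slopes a that satisfy |y - a*x| <= beta, assuming x > 0."""
    return (y - beta) / x, (y + beta) / x

bounds = {name: slope_bounds(x, y, beta) for name, (x, y) in train.items()}
for name, (a_min, a_max) in bounds.items():
    print(name, round(a_min, 4), round(a_max, 4))
```

Rounded to two decimals, the values agree with the table.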
22
Region of Possible Values
The RPV is marked on the a_min and a_max table of the SIC calibration slide.
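In the univariate case the Region of Possible Values is the intersection of the per-sample slope intervals. A minimal sketch, assuming x > 0 for every sample and β = 0.7:

```python
beta = 0.7
train = [(1.0, 1.28), (2.0, 1.68), (4.0, 4.25), (5.0, 5.32)]

def rpv_interval(data, beta):
    """Intersect the intervals [(y - beta)/x, (y + beta)/x]; None if the RPV is empty."""
    lo = max((y - beta) / x for x, y in data)
    hi = min((y + beta) / x for x, y in data)
    return (lo, hi) if lo <= hi else None

rpv = rpv_interval(train, beta)
print(rpv)
```

The lower end comes from C4 (about 0.92) and the upper end from C2 (1.19), so these two samples alone determine the RPV.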
23
SIC prediction
           x      y      v-      v+
Test  T1   3.0    3.35   2.77    3.57
      T2   4.5    6.19   4.16    5.36
      T3   5.5    5.40   5.08    6.55
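For y = ax with x > 0, the prediction interval is the image of the RPV: v- = x·a_lo and v+ = x·a_hi, where [a_lo, a_hi] is the RPV. The sketch below recomputes the table under that assumption.

```python
beta = 0.7
train = [(1.0, 1.28), (2.0, 1.68), (4.0, 4.25), (5.0, 5.32)]
test_x = {"T1": 3.0, "T2": 4.5, "T3": 5.5}

# RPV for the slope a: intersection of the per-sample intervals (all x > 0)
a_lo = max((y - beta) / x for x, y in train)
a_hi = min((y + beta) / x for x, y in train)

def sic_prediction(x, a_lo, a_hi):
    """Prediction interval [v-, v+] for a new x > 0."""
    return x * a_lo, x * a_hi

intervals = {name: sic_prediction(x, a_lo, a_hi) for name, x in test_x.items()}
for name, (v_minus, v_plus) in intervals.items():
    print(name, round(v_minus, 3), round(v_plus, 3))
```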
24
Object Status. Calibration Set
(Table of x, y, a_min and a_max as on the SIC calibration slide.)
Samples C2 and C4 are the boundary objects. They form the RPV.
Samples C1 and C3 are insiders. They could be removed from the calibration set without changing the RPV.
25
Object Status. Test Set
Let’s consider what happens when a new sample is added to the calibration set.
26
Object Status. Insider
If we add sample T1, the RPV doesn’t change. This object is an insider. The prediction interval lies inside the error interval.
27
Object Status. Outlier
If we add sample T2, the RPV disappears. This object is an outlier. The prediction interval lies outside the error interval.
28
Object Status. Outsider
If we add sample T3, the RPV becomes smaller. This object is an outsider. The prediction interval overlaps the error interval.
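The three cases can be checked by comparing each prediction interval [v-, v+] with the error interval [y - β, y + β]. The sketch below uses the rounded values from the SIC prediction slide and β = 0.7; the function name is illustrative.

```python
beta = 0.7
# (y, v_minus, v_plus) for the test samples
test = {"T1": (3.35, 2.77, 3.57), "T2": (6.19, 4.16, 5.36), "T3": (5.40, 5.08, 6.55)}

def object_status(y, v_minus, v_plus, beta):
    """Classify a sample by how [v-, v+] sits relative to [y - beta, y + beta]."""
    if y - beta <= v_minus and v_plus <= y + beta:
        return "insider"      # prediction interval inside the error interval
    if v_plus < y - beta or v_minus > y + beta:
        return "outlier"      # the two intervals do not intersect
    return "outsider"         # partial overlap

statuses = {name: object_status(*vals, beta) for name, vals in test.items()}
print(statuses)
```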
29
SIC-Residual and SIC-Leverage
Definition 1. The SIC-residual is defined as r(x, y) = (y - (v+ + v-)/2) / β. This is a characteristic of bias.
Definition 2. The SIC-leverage is defined as h(x) = ((v+ - v-)/2) / β. This is a normalized precision.
They characterize interactions between prediction and error intervals.
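As a numerical illustration, the sketch below evaluates both quantities for the three test samples. It assumes the usual SIC definitions, r = (y - (v+ + v-)/2)/β and h = (v+ - v-)/(2β), together with β = 0.7 and the rounded prediction intervals from the SIC prediction slide.

```python
beta = 0.7
# (y, v_minus, v_plus) for the test samples
test = {"T1": (3.35, 2.77, 3.57), "T2": (6.19, 4.16, 5.36), "T3": (5.40, 5.08, 6.55)}

def sic_residual(y, v_minus, v_plus, beta):
    """Offset of y from the centre of the prediction interval, in units of beta."""
    return (y - (v_plus + v_minus) / 2.0) / beta

def sic_leverage(v_minus, v_plus, beta):
    """Half-width of the prediction interval, in units of beta."""
    return (v_plus - v_minus) / (2.0 * beta)

rh = {name: (sic_residual(y, lo, hi, beta), sic_leverage(lo, hi, beta))
      for name, (y, lo, hi) in test.items()}
for name, (r, h) in rh.items():
    print(name, round(r, 3), round(h, 3))
```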
30
Object Status Plot
Using simple algebra one can prove the following statements.
Statement 1. An object (x, y) is an insider iff |r(x, y)| ≤ 1 - h(x). Presented by triangle BCD.
Statement 2. An object (x, y) is an outlier iff |r(x, y)| > 1 + h(x). Presented by lines AB and DE.
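One way the algebra might go is sketched below. It assumes the usual SIC definitions of the residual and leverage, r = (y - m)/β and h = (v+ - v-)/(2β) with m the midpoint of the prediction interval, and reads "insider" as "prediction interval inside the error interval" and "outlier" as "the two intervals do not intersect".

```latex
\begin{aligned}
&\text{Let } m = \tfrac{1}{2}\,(v^{+} + v^{-}), \qquad
  r = \frac{y - m}{\beta}, \qquad
  h = \frac{v^{+} - v^{-}}{2\beta}, \\
&\text{so that } v^{\pm} = m \pm \beta h \text{ and } y = m + \beta r. \\[6pt]
&\textbf{Insider: } y - \beta \le v^{-} \text{ and } v^{+} \le y + \beta \\
&\quad\iff m + \beta r - \beta \le m - \beta h
  \text{ and } m + \beta h \le m + \beta r + \beta \\
&\quad\iff r \le 1 - h \text{ and } r \ge -(1 - h)
  \iff |r| \le 1 - h. \\[6pt]
&\textbf{Outlier: } v^{-} > y + \beta \text{ or } v^{+} < y - \beta \\
&\quad\iff m - \beta h > m + \beta r + \beta
  \text{ or } m + \beta h < m + \beta r - \beta \\
&\quad\iff -r > 1 + h \text{ or } r > 1 + h
  \iff |r| > 1 + h.
\end{aligned}
```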
31
Object Status Classification
Classes: insiders, outsiders, outliers, absolute outsiders
32
OLS Confidence versus SIC Prediction
The true response value, y, is always located within the SIC prediction interval. This has been confirmed by simulations repeated 100,000 times. Thus Prob{ v- < y < v+ } = 1.00.
Confidence intervals tend to infinity as P increases. Confidence intervals are unreasonably wide!
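A small simulation in the same spirit is sketched below. It is not the authors' simulation: the true slope, the uniform error distribution, the seed and the reduced number of repetitions (10,000) are all assumptions chosen here for illustration.

```python
import random

def sic_interval(train, beta, x_new):
    """SIC prediction interval for the model y = a*x with all x > 0."""
    lo = max((y - beta) / x for x, y in train)
    hi = min((y + beta) / x for x, y in train)
    return x_new * lo, x_new * hi

rng = random.Random(1)            # fixed seed, for reproducibility only
a_true, beta = 1.0, 0.7           # hypothetical true slope; beta as in the case study
xs = [1.0, 2.0, 4.0, 5.0]
x_new = 3.0
tol = 1e-12                       # guards against floating-point round-off
misses = 0
for _ in range(10_000):
    # Errors are bounded by beta, as the SIC postulate requires
    train = [(x, a_true * x + rng.uniform(-beta, beta)) for x in xs]
    v_minus, v_plus = sic_interval(train, beta, x_new)
    y_true = a_true * x_new
    if not (v_minus - tol <= y_true <= v_plus + tol):
        misses += 1
print(misses)
```

Because the true slope always satisfies every inequality that defines the RPV, the true response cannot fall outside the interval, so the run should report no misses.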
33
Beta Estimation. Minimum
The RPV is shown for β = 0.7, 0.6, 0.5 and 0.4; no RPV is marked for β = 0.3.
β > b_min = 0.3
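For the univariate case study, the smallest β that still leaves a non-empty RPV can be computed directly. The RPV is non-empty when (y_i - β)/x_i ≤ (y_j + β)/x_j for every ordered pair of samples, which rearranges to β ≥ (y_i/x_i - y_j/x_j) / (1/x_i + 1/x_j). The sketch below takes the largest such bound; it assumes all x > 0, and the function name is illustrative.

```python
from itertools import permutations

train = [(1.0, 1.28), (2.0, 1.68), (4.0, 4.25), (5.0, 5.32)]

def minimal_beta(data):
    """Smallest beta for which the RPV of y = a*x is non-empty (all x > 0)."""
    return max(
        (yi / xi - yj / xj) / (1.0 / xi + 1.0 / xj)
        for (xi, yi), (xj, yj) in permutations(data, 2)
    )

b_min = minimal_beta(train)
print(round(b_min, 2))
```

The binding pair is C4 and C2, the same two boundary objects that form the RPV at β = 0.7, and the result rounds to the b_min = 0.3 quoted on the slide.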
34
Beta Estimation from Regression Residuals
e = y_measured - y_predicted
b_OLS = max{|e_1|, |e_2|, ..., |e_n|}; b_OLS = 0.4
b_SIC = b_OLS · C(n); b_SIC = 0.8
Prob{ β < b_SIC } = 0.90
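The b_OLS value can be reproduced from the case-study training data with a no-intercept least-squares fit. The factor C(n) is not given in this transcript, so the sketch stops at b_OLS.

```python
train = [(1.0, 1.28), (2.0, 1.68), (4.0, 4.25), (5.0, 5.32)]

# Least-squares slope for y = a*x
a_ols = sum(x * y for x, y in train) / sum(x * x for x, _ in train)

# Residuals e = y_measured - y_predicted
residuals = [y - a_ols * x for x, y in train]

# b_OLS is the largest absolute residual
b_ols = max(abs(e) for e in residuals)
print(round(b_ols, 2))
```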
35
1-2-3-4 Sigma Rule
Rule pairs: 1s with RMSEC, 2s with b_min, 3s with b_OLS, 4s with b_SIC
RMSEC = 0.2 = 1s
b_min = 0.3 = 1.5s
b_OLS = 0.4 = 2s
b_SIC = 0.8 = 4s
36
Part 3. Complicated SIC. Bivariate case
37
Octane Rating Example (by K. Esbensen)
X-values are NIR measurements over 226 wavelengths.
Y-values are reference measurements of octane number.
Training set = 24 samples
Test set = 13 samples
38
Calibration
39
PLS Decomposition
y = Xb, with X of size n × p, b of size p × 1 and y of size n × 1.
PLS with 2 PCs gives y - y_0·1 = Ta, with T of size n × 2, a of size 2 × 1 and 1 a column of n ones.
40
1-2-3-4 Sigma Rule for Octane Example
RMSEC = 0.27 = 1s
b_min = 0.48 = 1.8s
b_OLS = 0.58 = 2.2s
b_SIC = 0.88 = 3.3s
β = b_SIC = 0.88
41
RPV in Two-Dimensional Case
y_1 - y_0 - β ≤ t_11 a_1 + t_12 a_2 ≤ y_1 - y_0 + β
y_2 - y_0 - β ≤ t_21 a_1 + t_22 a_2 ≤ y_2 - y_0 + β
...
y_n - y_0 - β ≤ t_n1 a_1 + t_n2 a_2 ≤ y_n - y_0 + β
We have a system of 2n = 48 inequalities in two parameters, a_1 and a_2.
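Each sample contributes a pair of inequalities, which together say |y_i - y_0 - (t_i1 a_1 + t_i2 a_2)| ≤ β. A candidate (a_1, a_2) belongs to the RPV when all of them hold. The sketch below tests membership on a small made-up data set; the scores, responses and β are hypothetical and are not the octane data.

```python
def in_rpv(a, scores, y_centered, beta):
    """True if a = (a1, a2) satisfies |y_i - y0 - (t_i1*a1 + t_i2*a2)| <= beta for every i."""
    return all(
        abs(yc - (t1 * a[0] + t2 * a[1])) <= beta
        for (t1, t2), yc in zip(scores, y_centered)
    )

# Hypothetical toy data: score rows (t_i1, t_i2) and centred responses y_i - y0
scores = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y_centered = [2.0, 3.0, 5.2]
beta = 0.5

inside = in_rpv((2.0, 3.0), scores, y_centered, beta)    # all three bounds hold
outside = in_rpv((3.0, 3.0), scores, y_centered, beta)   # first sample is off by 1.0
print(inside, outside)
```

With n samples this check involves 2n one-sided inequalities, which is where the count of 48 for the 24 training samples comes from.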
42
Region of Possible Values
43
Close view on RPV. Calibration Set
Samples: 24
Boundary samples: C7, C9, C13, C14, C18, C23
Panels: RPV in parameter space; Object Status Plot
44
SIC Prediction with Linear Programming
Linear Programming Problem: v- and v+ are the minimum and maximum of y over the RPV.
Vertex #   a_1     a_2     t^t a    y
1          13.91   16.36   -0.40    88.86
2          14.22   18.36   -0.35    88.90
3          16.79   26.66   -0.24    89.01
4          19.91   26.61   -0.46    88.79
5          20.41   13.16   -0.96    88.30
6          17.44   13.52   -0.74    88.52
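A linear function over a convex polygon attains its extremes at vertices, so once the RPV vertices are known the prediction limits can be read off the vertex table. The sketch below does only that last step with the six listed vertices; it does not solve the linear program itself, and the tuple layout is an illustrative choice.

```python
# (vertex number, a1, a2, t^t a, y) as listed in the vertex table
vertices = [
    (1, 13.91, 16.36, -0.40, 88.86),
    (2, 14.22, 18.36, -0.35, 88.90),
    (3, 16.79, 26.66, -0.24, 89.01),
    (4, 19.91, 26.61, -0.46, 88.79),
    (5, 20.41, 13.16, -0.96, 88.30),
    (6, 17.44, 13.52, -0.74, 88.52),
]

lowest = min(vertices, key=lambda v: v[4])    # vertex giving v-
highest = max(vertices, key=lambda v: v[4])   # vertex giving v+
v_minus, v_plus = lowest[4], highest[4]
print(lowest[0], v_minus, highest[0], v_plus)
```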
45
Octane Prediction. Test Set
Legend: Reference values; PLS ± 2RMSEP; SIC prediction
Panels: Prediction intervals: SIC & PLS; Object Status Plot
46
Conclusions
Real errors are limited. The truncated normal distribution is a much more realistic model for practical applications than an unlimited error distribution.
Postulating that all errors are limited, we can derive a new concept of data modeling: the SIC method. It is based on this single assumption and nothing else.
The SIC approach gives us a new view of old chemometric problems such as outliers and influential samples. I think this is an interesting and helpful view.
47
OLS versus SIC
Panels: SIC-Residuals vs. OLS-Residuals; SIC-Leverages vs. OLS-Leverages; SIC Object Status Plot; OLS/PLS Influence Plot
48
Statistical view on OLS & SIC
Comparison of OLS and SIC: statistics and deviation.
Let’s take a sample {x_1, ..., x_n} from a distribution with finite support [-1, +1]. The mean value a is unknown!
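One possible reading of this setting is x_i = a + ε_i with |ε_i| ≤ 1. Under that assumption the OLS estimate of a is the sample mean, while the SIC-style bounds are max(x) - 1 ≤ a ≤ min(x) + 1. The sketch below compares the two on simulated data; the true value, the uniform errors, the sample size and the seed are all hypothetical.

```python
import random

rng = random.Random(2)        # fixed seed, for reproducibility only
a_true = 5.0                  # hypothetical unknown mean
sample = [a_true + rng.uniform(-1.0, 1.0) for _ in range(50)]

# OLS view: point estimate by the sample mean
mean_estimate = sum(sample) / len(sample)

# SIC view: every x_i restricts a to [x_i - 1, x_i + 1]; intersect all of them
a_low = max(sample) - 1.0
a_high = min(sample) + 1.0
midrange_estimate = (a_low + a_high) / 2.0

print(round(mean_estimate, 3), round(a_low, 3), round(a_high, 3))
```

The interval [a_low, a_high] contains the true value by construction and can only shrink as more observations arrive.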