WSC-4. Simple View on Simple Interval Calculation (SIC). Alexey Pomerantsev, Oxana Rodionova (Institute of Chemical Physics, Moscow) and Kurt Varmuza (Vienna Technical University). © Kurt Varmuza
CAC, Lisbon, September 2004
Leisured Agenda: 1. Why are errors limited? 2. Simple calculations, indeed! Univariate case. 3. Complicated SIC. Bivariate case. 4. Conclusions.
Part 1. Why are errors limited?
Water in wheat. NIR spectra by Lumex.
Histogram for Y (water content). 141 samples.
Normal Probability Plot for Y. Marked values: 3%, 21%, 38%.
PLS Regression. Whole data set.
PLS Regression. Marked “outliers”.
PLS Regression. Revised data set.
Histogram for Y. Revised data set. 124 samples.
Normal Probability Plot. Revised data set. Marked values: 31%, 81%, 96%.
Histogram for Y. Revised data set. Axis marks: m−3σ, m−2σ, m−σ, m, m+σ, m+2σ, m+3σ.
Error Distribution. Panels: Normal distribution; Truncated normal distribution (label: 3.5); Both distributions. Axis marks: −β, +β.
Main SIC postulate: All errors are limited! There exists a Maximum Error Deviation, β, such that for any error ε, Prob{|ε| > β} = 0.
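The postulate can be illustrated with a truncated normal error. This is a minimal sketch, not taken from the slides: it assumes rejection sampling, and the values of `sigma` and `beta` are hypothetical.

```python
import random

def truncated_normal(sigma, beta, rng):
    """Draw one error from N(0, sigma^2) conditioned on |error| <= beta."""
    while True:
        e = rng.gauss(0.0, sigma)
        if abs(e) <= beta:  # reject anything beyond the maximum error deviation
            return e

rng = random.Random(0)
sigma, beta = 0.2, 0.7  # hypothetical values chosen for illustration
errors = [truncated_normal(sigma, beta, rng) for _ in range(10_000)]

# Empirical counterpart of Prob{|eps| > beta} = 0
share_exceeding = sum(abs(e) > beta for e in errors) / len(errors)
```

By construction no draw exceeds β, which is exactly what the postulate demands of real errors.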
Part 2. Simple calculations
Case study. Simple Univariate Model. Data: table of x and y for training samples C1 to C4 and test samples T1 to T3. Model: y = ax + ε. Error distribution.
OLS calibration. OLS calibration minimizes the sum of squared residuals.
Uncertainties in OLS. t_3(P) is the quantile of Student's t-distribution for probability P with 3 degrees of freedom.
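SIC calibration. The Maximum Error Deviation is known: β = 0.7 (= 2.5s), so |ε| < β.
SIC calibration. Figure: training samples C1 to C4 in the (x, y) plane with the slopes a_min and a_max.
Region of Possible Values (RPV). Figure: training samples C1 to C4 with a_min and a_max; the RPV is the interval of slopes between them.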
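For the univariate model y = ax + ε with |ε| ≤ β, each training sample restricts the slope to an interval, and the RPV is the intersection of these intervals. The sketch below assumes x > 0 for all samples; the data values are hypothetical (the slide's table is not recoverable), chosen so that C2 and C4 end up as the boundary samples.

```python
def rpv_interval(x, y, beta):
    """Return (a_min, a_max) for y = a*x + eps, |eps| <= beta, all x > 0.

    Returns None when the constraints are incompatible (empty RPV).
    """
    lows = [(yi - beta) / xi for xi, yi in zip(x, y)]   # a >= (y - beta) / x
    highs = [(yi + beta) / xi for xi, yi in zip(x, y)]  # a <= (y + beta) / x
    a_min, a_max = max(lows), min(highs)
    return (a_min, a_max) if a_min <= a_max else None

# Hypothetical training data for C1..C4
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [1.0, 2.6, 3.1, 3.9]
rpv = rpv_interval(x_train, y_train, beta=0.7)
```

With these numbers the lower bound comes from C2 and the upper bound from C4, so the RPV is roughly [0.95, 1.15].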
SIC prediction. Figure: test samples T1 to T3 in the (x, y) plane with prediction intervals [v⁻, v⁺].
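In the univariate case the prediction interval follows directly from the RPV: the response at a new x ranges over a·x for every slope a in [a_min, a_max]. A minimal sketch; the RPV bounds and the test point are hypothetical.

```python
def predict_interval(x_new, a_min, a_max):
    """SIC prediction interval (v_minus, v_plus) for the model y = a*x."""
    ends = (a_min * x_new, a_max * x_new)
    return min(ends), max(ends)  # min/max keeps the order right for x_new < 0

# Hypothetical RPV and test point
v_minus, v_plus = predict_interval(2.5, a_min=0.95, a_max=1.15)
```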
Object Status. Calibration Set. Samples C2 and C4 are the boundary objects: they form the RPV. Samples C1 and C3 are insiders: they could be removed from the calibration set without changing the RPV.
Object Status. Test Set. Let's consider what happens when a new sample is added to the calibration set.
Object Status. Insider. If we add sample T1, the RPV doesn't change. This object is an insider. The prediction interval lies inside the error interval.
Object Status. Outlier. If we add sample T2, the RPV disappears. This object is an outlier. The prediction interval lies outside the error interval.
Object Status. Outsider. If we add sample T3, the RPV becomes smaller. This object is an outsider. The prediction interval overlaps the error interval without lying inside it.
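The three cases can be checked by intersecting the current RPV with the slope interval allowed by the new sample. This is a sketch for the univariate model with x > 0; the RPV bounds and the three test samples are hypothetical.

```python
def status_after_adding(a_min, a_max, x_new, y_new, beta):
    """Classify a new sample (x_new > 0) by its effect on the RPV [a_min, a_max]."""
    low = (y_new - beta) / x_new    # slopes compatible with the new sample
    high = (y_new + beta) / x_new
    new_min, new_max = max(a_min, low), min(a_max, high)
    if new_min > new_max:
        return "outlier"    # RPV disappears
    if new_min == a_min and new_max == a_max:
        return "insider"    # RPV unchanged
    return "outsider"       # RPV becomes smaller

# Hypothetical RPV and test samples T1..T3
a_min, a_max, beta = 0.95, 1.15, 0.7
status_t1 = status_after_adding(a_min, a_max, 2.5, 2.6, beta)
status_t2 = status_after_adding(a_min, a_max, 3.0, 5.0, beta)
status_t3 = status_after_adding(a_min, a_max, 5.0, 5.5, beta)
```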
SIC-Residual and SIC-Leverage. They characterize the interaction between the prediction interval [v⁻, v⁺] and the error interval [y − β, y + β]. Definition 1. The SIC-residual is defined as r = (y − (v⁺ + v⁻)/2) / β. This is a characteristic of bias. Definition 2. The SIC-leverage is defined as h = (v⁺ − v⁻) / (2β). This is a normalized precision.
Object Status Plot. Using simple algebra one can prove the following statements. Statement 1. An object (x, y) is an insider iff |r(x, y)| ≤ 1 − h(x); presented by triangle BCD. Statement 2. An object (x, y) is an outlier iff |r(x, y)| > 1 + h(x); presented by lines AB and DE.
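Statements 1 and 2 translate into a short classification rule. The sketch below assumes that r is the offset of y from the center of the prediction interval and h is the half-width of that interval, both divided by β. This form is consistent with the two statements, but the sign convention of r is an assumption. The numbers are hypothetical.

```python
def sic_residual_leverage(v_minus, v_plus, y, beta):
    """Return (r, h) under the assumed definitions (center offset, half-width)."""
    r = (y - (v_plus + v_minus) / 2.0) / beta
    h = (v_plus - v_minus) / (2.0 * beta)
    return r, h

def object_status(r, h):
    """Apply Statement 1 (insider) and Statement 2 (outlier)."""
    if abs(r) <= 1.0 - h:
        return "insider"
    if abs(r) > 1.0 + h:
        return "outlier"
    return "outsider"

beta = 0.7
# Hypothetical (v_minus, v_plus, y) triples for three test samples
samples = {"T1": (2.375, 2.875, 2.6), "T2": (2.85, 3.45, 5.0), "T3": (4.75, 5.75, 5.5)}
statuses = {name: object_status(*sic_residual_leverage(vm, vp, y, beta))
            for name, (vm, vp, y) in samples.items()}
```

The insider condition is the same as requiring v⁺ ≤ y + β and v⁻ ≥ y − β, and the outlier condition is the same as requiring the two intervals to be disjoint.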
Object Status Classification. Plot regions: insiders, outsiders, absolute outsiders, outliers.
OLS Confidence versus SIC Prediction. The true response value, y, is always located within the SIC prediction interval. This has been confirmed by simulations repeated 100,000 times. Thus Prob{v⁻ < y < v⁺} = 1.00. OLS confidence intervals tend to infinity as P increases. Confidence intervals are unreasonably wide!
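A simulation of this kind is easy to repeat on a small scale. This is a sketch under stated assumptions, not the authors' experiment: the univariate model y = ax + ε, uniformly distributed errors bounded by β, hypothetical design points, and 20,000 repetitions instead of 100,000. "True response" is taken to mean the error-free value a·x.

```python
import random

def sic_interval(x, y, beta, x_new):
    """SIC prediction interval at x_new for y = a*x + eps (all x, x_new > 0)."""
    a_min = max((yi - beta) / xi for xi, yi in zip(x, y))
    a_max = min((yi + beta) / xi for xi, yi in zip(x, y))
    return a_min * x_new, a_max * x_new

rng = random.Random(42)
a_true, beta = 1.0, 0.7          # hypothetical true slope and error bound
x = [1.0, 2.0, 3.0, 4.0]         # hypothetical design points
x_new = 2.5
trials, hits = 20_000, 0
for _ in range(trials):
    y = [a_true * xi + rng.uniform(-beta, beta) for xi in x]
    v_minus, v_plus = sic_interval(x, y, beta, x_new)
    # small tolerance guards against floating-point rounding at the boundary
    if v_minus - 1e-12 <= a_true * x_new <= v_plus + 1e-12:
        hits += 1
coverage = hits / trials
```

Coverage is complete because the true slope satisfies every constraint that defines the RPV, so it can never be cut out.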
Beta Estimation. Minimum β. Panels: the RPV for β = 0.7, 0.6, 0.5, 0.4 and 0.3. β > b_min = 0.3.
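For the univariate model, b_min can be found in closed form: the RPV is non-empty only if every lower bound (y_i − β)/x_i stays below every upper bound (y_j + β)/x_j, which gives β ≥ (y_i/x_i − y_j/x_j) / (1/x_i + 1/x_j) for each pair. The sketch assumes x > 0 and uses hypothetical data, so it does not reproduce the slide's value of 0.3.

```python
from itertools import permutations

def b_min(x, y):
    """Smallest beta for which the RPV of y = a*x + eps is non-empty (x > 0)."""
    best = 0.0
    for (xi, yi), (xj, yj) in permutations(zip(x, y), 2):
        # pairwise requirement: (yi - beta)/xi <= (yj + beta)/xj
        best = max(best, (yi / xi - yj / xj) / (1.0 / xi + 1.0 / xj))
    return best

# Hypothetical training data
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [1.0, 2.6, 3.1, 3.9]
b_min_value = b_min(x_train, y_train)
```

At β = b_min the RPV collapses to a single slope; any smaller β makes it empty.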
Beta Estimation from Regression Residuals. e = y_measured − y_predicted. b_OLS = max{|e_1|, |e_2|, ..., |e_n|}; here b_OLS = 0.4. b_SIC = b_OLS · C(n), with Prob{β < b_SIC} = 0.90; here b_SIC = 0.8.
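The b_OLS estimate is simply the largest absolute calibration residual. The sketch below assumes an OLS fit of the no-intercept model y = ax and uses hypothetical data. The factor C(n) is not specified on the slide, so it is left as a caller-supplied argument rather than computed.

```python
def ols_slope(x, y):
    """Least-squares slope for the no-intercept model y = a*x."""
    return sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)

def b_ols(x, y):
    """Largest absolute residual of the OLS fit."""
    a = ols_slope(x, y)
    return max(abs(yi - a * xi) for xi, yi in zip(x, y))

def b_sic(b_ols_value, c_n):
    """b_SIC = b_OLS * C(n); c_n must be supplied by the caller."""
    return b_ols_value * c_n

# Hypothetical training data
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [1.0, 2.6, 3.1, 3.9]
b_ols_value = b_ols(x_train, y_train)
```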
Sigma Rule. 1s: RMSEC; 2s: b_min; 3s: b_OLS; 4s: b_SIC. Here RMSEC = 0.2 = 1s, b_min = 0.3 = 1.5s, b_OLS = 0.4 = 2s, b_SIC = 0.8 = 4s.
Part 3. Complicated SIC. Bivariate case
Octane Rating Example (by K. Esbensen). X-values are NIR measurements over 226 wavelengths. Y-values are reference measurements of octane number. Training set = 24 samples. Test set = 13 samples.
Calibration
PLS Decomposition. Diagram: Xb = y with X (n × p), b (p × 1), y (n × 1); PLS with 2 PCs gives Ta = y − y_0·1 with T (n × 2), a (2 × 1), and y − y_0·1 (n × 1).
Sigma Rule for Octane Example. RMSEC = 0.27 = 1s, b_min = 0.48 = 1.8s, b_OLS = 0.58 = 2.2s, b_SIC = 0.88 = 3.3s. β = b_SIC = 0.88.
RPV in Two-Dimensional Case. y_i − y_0 − β ≤ t_i1·a_1 + t_i2·a_2 ≤ y_i − y_0 + β for i = 1, 2, ..., n. We have a system of 2n = 48 inequalities in the two parameters a_1 and a_2.
Region of Possible Values
Close view on RPV. Calibration Set. Samples: 24. Boundary samples: C7, C9, C13, C14, C18, C23. Panels: RPV in parameter space; Object Status Plot.
SIC Prediction with Linear Programming. Linear programming problem: v⁻ = min tᵗa and v⁺ = max tᵗa over a in the RPV. Table columns: Vertex #, a_1, a_2, tᵗa.
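With only two parameters, the linear programs can be solved by listing the vertices of the RPV polygon and evaluating tᵗa at each one, since a linear function attains its extremes at vertices. The sketch below swaps a general LP solver for this brute-force enumeration. It assumes a bounded, non-empty RPV, and the scores and centered responses are hypothetical, not the octane data.

```python
from itertools import combinations

def rpv_vertices(T, yc, beta, tol=1e-9):
    """Vertices of {a : |yc_i - t_i . a| <= beta for all i} in two dimensions."""
    # Each sample contributes two boundary lines: t_i . a = yc_i - beta and yc_i + beta
    lines = [(t, c + s * beta) for t, c in zip(T, yc) for s in (-1.0, 1.0)]
    vertices = []
    for (p, b1), (q, b2) in combinations(lines, 2):
        det = p[0] * q[1] - p[1] * q[0]
        if abs(det) < tol:
            continue  # parallel lines never intersect
        a = ((b1 * q[1] - b2 * p[1]) / det, (p[0] * b2 - q[0] * b1) / det)
        feasible = all(abs(c - (t[0] * a[0] + t[1] * a[1])) <= beta + tol
                       for t, c in zip(T, yc))
        known = any(abs(a[0] - v[0]) < 1e-7 and abs(a[1] - v[1]) < 1e-7
                    for v in vertices)
        if feasible and not known:
            vertices.append(a)
    return vertices

def sic_predict(t_new, vertices, y0=0.0):
    """Prediction interval: extremes of y0 + t_new . a over the RPV vertices."""
    values = [y0 + t_new[0] * a[0] + t_new[1] * a[1] for a in vertices]
    return min(values), max(values)

# Hypothetical scores T and centered responses yc = y - y0
T = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
yc = [1.0, 2.0, 3.0]
verts = rpv_vertices(T, yc, beta=0.5)
v_minus, v_plus = sic_predict((2.0, 1.0), verts)
```

The enumeration costs O(n²) line intersections with an O(n) feasibility check each, which is fine for 48 inequalities; a dedicated LP solver would be the usual choice for larger problems.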
Octane Prediction. Test Set. Panels: Prediction intervals, SIC & PLS (legend: reference values, PLS ± 2 RMSEP, SIC prediction); Object Status Plot.
Conclusions. Real errors are limited. The truncated normal distribution is a much more realistic model for practical applications than an unlimited error distribution. Postulating that all errors are limited, we can derive a new concept of data modeling: the SIC method. It is based on this single assumption and nothing else. The SIC approach gives us a new view of old chemometric problems such as outliers and influential samples. I think that this is an interesting and helpful view.
OLS versus SIC. Panels: SIC-Residuals vs. OLS-Residuals; SIC-Leverages vs. OLS-Leverages; SIC Object Status Plot; OLS/PLS Influence Plot.
Statistical view on OLS & SIC. Let's have a sample {x_1, ..., x_n} from a distribution with finite support [−1, +1]. The mean value a is unknown! Table: columns OLS and SIC; rows Statistics and Deviation.
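One way to read this example is sketched below. It assumes each observation is x_i = a + ε_i with ε_i confined to [−1, +1]. Under that assumption the OLS estimate of a is the sample mean, while the SIC-style estimate is the interval of all values of a compatible with every observation. The table entries on the slide are not recoverable, so this is an illustration rather than a reproduction; the true mean and the uniform error law are hypothetical.

```python
import random

rng = random.Random(1)
a_true = 3.0  # hypothetical unknown mean
sample = [a_true + rng.uniform(-1.0, 1.0) for _ in range(50)]

# OLS view: point estimate
mean_estimate = sum(sample) / len(sample)

# SIC view: every x_i implies x_i - 1 <= a <= x_i + 1, so intersect the intervals
sic_low = max(sample) - 1.0
sic_high = min(sample) + 1.0
```

The interval [max x − 1, min x + 1] always contains the true mean and shrinks as the sample range approaches the full width of the support.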