10.02.08 1 WSC-6 Critical levels in projection Alexey Pomerantsev Semenov Institute of Chemical Physics, Moscow.

Presentation transcript:

WSC-6 Critical levels in projection Alexey Pomerantsev Semenov Institute of Chemical Physics, Moscow

WSC-6 Projection approach

WSC-6 Scores & Orthogonal Distances. OD: distance to the model. SD: distance within the model.

WSC-6 Where applied: SIMCA (classification); PLS/PCR (influence plot); MSPC

WSC-6 Giants battle at ICS-L, April 2007.
"The ratios of residual variances of PCA are fairly well F-distributed. This is easy - the shape of the distribution of a ratio of two variances usually looks like an F." (Svante Wold)
"No, the residuals from PCA don't follow an F-distribution unless you fuss with the degrees of freedom, and there are better alternatives in any case." (Barry Wise)

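WSC-6 Full PCA Decomposition: $K = \mathrm{rank}(X) \le \min(I, J)$; $X = TP^t$; $\Lambda = T^tT = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$. Dimensions: $X$ is $I \times J$, $T$ is $I \times K$, $P^t$ is $K \times J$.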

WSC-6 Truncated PCA Decomposition: $A \le K$; $X = T_A P_A^t + E_A$. Dimensions: $X$ and $E_A$ are $I \times J$, $T_A$ is $I \times A$, $P_A^t$ is $A \times J$.

WSC-6 Score distance (SD), $h_i$. Leverage $= h_i + 1/I$; Mahalanobis distance $= (h_i)^{1/2}$.

WSC-6 Orthogonal distance (OD), $v_i$. Variance per sample $= v_i/J$; Q statistic $= v_i$.
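For reference, the two distances can be written out explicitly. The block below is a sketch that assumes the standard PCA definitions (they are not spelled out in the transcript): $\mathbf{t}_i$ is row $i$ of $T_A$, $\mathbf{e}_i$ is row $i$ of $E_A$, and $\lambda_a$ are the diagonal elements of $T^tT$ from the full decomposition.

```latex
h_i = \mathbf{t}_i^{t}\,(T_A^{t}T_A)^{-1}\,\mathbf{t}_i
    = \sum_{a=1}^{A} \frac{t_{ia}^{2}}{\lambda_a},
\qquad
v_i = \mathbf{e}_i^{t}\,\mathbf{e}_i
    = \sum_{j=1}^{J} e_{ij}^{2}
```

Under these assumed definitions the slide's relations follow directly: the Mahalanobis distance is $\sqrt{h_i}$ and the Q statistic is $v_i$ itself.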

WSC-6 Distribution of distances: the shape? $x = h/h_0$ or $x = v/v_0$; $x \sim \chi^2(N)/N$, where $N$ = DoF; $E(x) = 1$, $D(x) = 2/N$.
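The two moments quoted on this slide are easy to check numerically. The sketch below draws a scaled chi-squared sample with the Python standard library and compares its mean and variance with 1 and 2/N; the value N = 5, the sample size, and the function name are illustrative assumptions.

```python
import random
import statistics


def scaled_chi2_sample(dof, size, rng):
    """Draw `size` values of chi2(dof)/dof.

    chi2(dof) is a Gamma distribution with shape dof/2 and scale 2.
    """
    return [rng.gammavariate(dof / 2.0, 2.0) / dof for _ in range(size)]


rng = random.Random(0)   # fixed seed so the run is repeatable
N = 5                    # illustrative number of degrees of freedom
x = scaled_chi2_sample(N, 20000, rng)

mean_x = statistics.fmean(x)
var_x = statistics.pvariance(x)
print(f"mean     = {mean_x:.3f} (theory 1)")
print(f"variance = {var_x:.3f} (theory {2 / N:.3f})")
```

The sample mean should land near 1 and the sample variance near 2/N, which is what makes the variance usable for estimating N.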

WSC-6 Example: Leon Rusinov data. I=1440, A=6, $N_h$=5, $N_v$=1. Panels: SD, OD.

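WSC-6 Distribution of distances: DoF? $x = h/h_0$ or $x = v/v_0$; $x_1, \ldots, x_I \sim \chi^2(N)/N$, $N = ?$ Two estimators: Method of Moments; Interquartile Approach, based on the ordered sample $x_{(1)} \le x_{(2)} \le \ldots \le x_{(I-1)} \le x_{(I)}$, where the IQR spans the middle of the sample between its lower and upper quarters (¼ each).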
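Both estimators can be sketched in a few lines. The method of moments uses $D(x) = 2/N$, so $N \approx 2\,\mathrm{mean}^2/\mathrm{variance}$. For the interquartile approach, the version below is only an assumption about how such an estimator might look: it matches the sample IQR/median ratio against the ratio implied by the Wilson-Hilferty quantile approximation and solves for N by bisection. The function names, the search bounds, and the simulated data are hypothetical.

```python
import math
import random
import statistics

Z75 = statistics.NormalDist().inv_cdf(0.75)  # upper-quartile z-score, about 0.6745


def dof_moments(d):
    """Method of moments: for d = d0 * chi2(N)/N, var(d) / mean(d)^2 = 2/N."""
    d0 = statistics.fmean(d)
    return 2.0 * d0 ** 2 / statistics.variance(d), d0


def wh_iqr_over_median(n):
    """IQR/median of chi2(n) under the Wilson-Hilferty approximation."""
    a = 2.0 / (9.0 * n)
    s = Z75 * math.sqrt(a)
    q1, q2, q3 = (1 - a - s) ** 3, (1 - a) ** 3, (1 - a + s) ** 3
    return (q3 - q1) / q2


def dof_iqr(d, lo=1.0, hi=500.0):
    """Robust estimate: match the sample IQR/median ratio by bisection."""
    q1, q2, q3 = statistics.quantiles(d, n=4)
    target = (q3 - q1) / q2
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        # The ratio decreases with N, so a too-large ratio means N is too small.
        if wh_iqr_over_median(mid) > target:
            lo = mid
        else:
            hi = mid
    n = 0.5 * (lo + hi)
    d0 = q2 / (1 - 2.0 / (9.0 * n)) ** 3  # sample median = d0 * median(chi2(N)/N)
    return n, d0


# Simulated distances with known N = 5 and scale 0.05 (illustrative values).
rng = random.Random(1)
h = [0.05 * rng.gammavariate(2.5, 2.0) / 5 for _ in range(5000)]
n_mom, h0_mom = dof_moments(h)
n_iqr, h0_iqr = dof_iqr(h)
print(f"moments: N = {n_mom:.2f}, h0 = {h0_mom:.4f}")
print(f"IQR:     N = {n_iqr:.2f}, h0 = {h0_iqr:.4f}")
```

On clean simulated data both estimates should sit close to the true N. The quartile-based version depends only on the middle half of the ordered sample, which is the reason it reacts less to a few extreme values.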

WSC-6 Type I error α. I=100. [Five panels, one per value of α: in the first a single point is out; in the others several points are out.]
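The link between the type I error and the number of points that fall out is simple arithmetic: with I independent regular samples, about α·I of them are expected outside the acceptance area. A minimal sketch follows; the α values in the loop are illustrative and are not taken from the slide, and the function names are hypothetical.

```python
def expected_out(alpha, n_samples):
    """Expected number of regular samples outside the acceptance area."""
    return alpha * n_samples


def prob_at_least_one_out(alpha, n_samples):
    """Chance that at least one of n independent regular samples is rejected."""
    return 1.0 - (1.0 - alpha) ** n_samples


I = 100
for alpha in (0.01, 0.05, 0.10):  # illustrative values, not the slide's
    print(alpha, expected_out(alpha, I), round(prob_at_least_one_out(alpha, I), 3))
```

Even a small α makes it very likely that some calibration samples fall outside the area, so a few points out is the expected behaviour rather than a sign of a bad model.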

WSC-6 SIM Data. MSPC task. I=100, J=25, A=5, α=0.05

WSC-6 SD & OD values

WSC-6 DoF Estimates. Interquartile Approach: $N_h$ = 5.7, $N_v$ = 21.6. Method of Moments: $N_h$ = 5.0, $N_v$ = 20.0.

WSC-6 Acceptance areas: conventional. I=100, α=0.05

WSC-6 Acceptance areas: Sum of CHIs. I=100, α=0.05
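One way to read "sum of CHIs" is sketched below. It assumes that $N_h h/h_0$ and $N_v v/v_0$ behave as independent chi-squared variables, so that their sum is chi-squared with $N_h + N_v$ degrees of freedom and a sample is accepted when the sum stays below the $(1-\alpha)$ quantile. That acceptance rule, the function names, and the use of the Wilson-Hilferty formula in place of an exact quantile are assumptions of this sketch, not statements taken from the slide.

```python
import math
import random
from statistics import NormalDist


def chi2_quantile_wh(p, dof):
    """Wilson-Hilferty approximation to the chi-square quantile."""
    a = 2.0 / (9.0 * dof)
    return dof * (1.0 - a + NormalDist().inv_cdf(p) * math.sqrt(a)) ** 3


def accepted(h, v, h0, v0, n_h, n_v, alpha):
    """True if (h, v) lies inside the 'sum of chi-squares' acceptance area."""
    c = n_h * h / h0 + n_v * v / v0
    return c <= chi2_quantile_wh(1.0 - alpha, n_h + n_v)


# Monte Carlo check with N_h=5, N_v=20, alpha=0.05.
# h0 = v0 = 1 and independent h, v are simplifying assumptions.
rng = random.Random(0)
n_h, n_v, alpha, trials = 5, 20, 0.05, 20000
rejected = 0
for _ in range(trials):
    h = rng.gammavariate(n_h / 2.0, 2.0) / n_h
    v = rng.gammavariate(n_v / 2.0, 2.0) / n_v
    rejected += not accepted(h, v, 1.0, 1.0, n_h, n_v, alpha)

print(f"rejected fraction = {rejected / trials:.3f} (target {alpha})")
```

In the (h, v) plane this rule cuts off a triangle bounded by a straight line, in contrast to a rectangle built from two separate cut-offs.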

WSC-6 Acceptance areas: Ratio of CHIs. I=100, α=0.05

WSC-6 Wilson-Hilferty approximation for Chi
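For readers who do not have it to hand, the classical Wilson-Hilferty result is stated below in its textbook form; whether the slide used exactly this parametrisation is an assumption. Here $z_p$ denotes the standard normal quantile.

```latex
\left(\frac{\chi^2(N)}{N}\right)^{1/3}
  \;\approx\; \mathcal{N}\!\left(1 - \frac{2}{9N},\; \frac{2}{9N}\right),
\qquad
\chi^2_{p}(N) \;\approx\; N\left(1 - \frac{2}{9N} + z_p\sqrt{\frac{2}{9N}}\right)^{3}
```

As a quick worked check, for $N = 5$ and $p = 0.95$ ($z_p \approx 1.645$) the formula gives about 11.04, against the tabulated value 11.07.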

WSC-6 Acceptance areas: Wilson-Hilferty. I=100, α=0.05

WSC-6 Modified Wilson-Hilferty approximation: $1 - \gamma = P_0 + P_1 + P_2 + P_3 = \Phi(r) - \tfrac{1}{4}\exp(-\tfrac{1}{2}r^2)$, $r = r(\gamma)$

WSC-6 Acceptance areas: modified Wilson-Hilferty. I=100, α=0.05

WSC-6 Areas Validation: variation of α

WSC-6 BMT Data. SIMCA. I=45, J=3501, A=2, $N_h$=3, $N_v$=2, α=0.025

WSC-6 Extremes & Outliers in calibration set. α is significance level for outliers: α = 1 – (1 – α)^(1/I). Labels: extreme, outlier. Calibration set: I=45; γ × I = α × 45 = 1.25; $I_{out}$=2
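The expression $1 - (1 - \cdot)^{1/I}$ on this slide converts a level that applies to the whole set of I samples into a level for a single sample. The sketch below shows that conversion and its inverse. The neutral names `overall_level` and `per_sample_level` are hypothetical, chosen because the transcript does not make clear which symbol plays which role, and the value 0.025 is illustrative.

```python
def per_sample_level(overall_level, n_samples):
    """Level for one sample so that n independent samples together
    exceed the threshold at least once with probability overall_level."""
    return 1.0 - (1.0 - overall_level) ** (1.0 / n_samples)


def overall_from_per_sample(level, n_samples):
    """Inverse conversion: probability of at least one exceedance among n."""
    return 1.0 - (1.0 - level) ** n_samples


I = 45           # calibration set size from the slide
overall = 0.025  # illustrative value, an assumption
single = per_sample_level(overall, I)
print(f"per-sample level  = {single:.6f}")
print(f"back to overall   = {overall_from_per_sample(single, I):.6f}")
```

The per-sample level is much stricter than the overall one, which is why a threshold built this way flags far fewer samples than the threshold used for extremes.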

WSC-6 SIMCA Classification without G07-4. New set: $I_{new}$=30 (10 Genuine + 20 Fakes); γ × $I_{new}$ = α × 10 = 0.25; $I_{out}$=3

WSC-6 What’s up? This classification is absolutely wrong, but Oxana will explain how to fix it.

WSC-6 GRAIN Data. Influence plots. I=123, J=118, A=4, α=0.01, $N_h$=5.7, $N_v$=3.0, $N_u$=1.0. Panels: X, Y.

WSC-6 Orthogonal distance to Y

WSC-6 Back to WSC-4

WSC-4 Training set. Model 1. Boundary subset, l=19. Boundary samples (WSC-4).

WSC-6 Influence plots for X and Y. Panels: Y, X. Legend: Calibration, Boundary (SIC).

WSC-6 Box or Egg? I<30

WSC-6 Conclusion 1. The χ²-distribution can be used in the modeling of the score and orthogonal distances.

WSC-6 Conclusion 2. Any classification problem should be solved with respect to a given type I error. Five acceptance areas have been presented, but only two are recommended: one for I>30 and one for I<30.

WSC-6 Conclusion 3. Estimation of DoF is a key challenge in projection modeling. A data-driven estimator of DoF, rather than a theory-driven one, should be used. The method of moments is efficient but sensitive to outliers. The IQR estimator is a robust but less efficient alternative. More examples will be demonstrated in the subsequent presentation by Oxana.