Data Driven SIMCA – more than One-Class Classifier

Presentation transcript:

Data Driven SIMCA – more than One-Class Classifier. Oxana Rodionova, Alexey Pomerantsev. Semenov Institute of Chemical Physics, RAS, Moscow; Russian Chemometric Society. WSC-11

Soft Independent Modeling of Class Analogy (SIMCA). S. Wold, "Pattern Recognition by Means of Disjoint Principal Components Models" (1976). [Figure: axes t1, t2, t3.] Disjoint PCA class modeling. Cut-off levels using orthogonal distances. A new object is compared with each class by calculating the orthogonal distance. … many additions and modifications.

Data driven approach. Projection. Orthogonal distance $v_i$. Score distance $h_i$.
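
The two distances named on this slide can be sketched with a plain SVD-based PCA. This is a minimal illustration, not the authors' code: the definitions used here (score distance as $h_i = t_i^T (T^T T)^{-1} t_i$, orthogonal distance as the squared norm of the residual row), the column centering, and the function name `pca_distances` are assumptions, since the slide's formulas did not survive in the transcript.

```python
import numpy as np

def pca_distances(X, n_components):
    """Return score distances h and orthogonal distances v for each row of X."""
    Xc = X - X.mean(axis=0)                      # column-wise centering (assumed)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    A = n_components
    T = U[:, :A] * s[:A]                         # scores, shape (I, A)
    P = Vt[:A].T                                 # loadings, shape (J, A)
    E = Xc - T @ P.T                             # residuals after projection
    h = np.sum((T / s[:A]) ** 2, axis=1)         # h_i = t_i^T (T^T T)^{-1} t_i
    v = np.sum(E ** 2, axis=1)                   # v_i = squared norm of residual row
    return h, v

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 20))                   # illustrative data only
h, v = pca_distances(X, n_components=3)
```

With this definition the score distances of the training objects sum to the number of components, which gives a quick sanity check on an implementation.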

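Distribution of distances: DoF estimation. The scaled distance is $x = h/h_0$ (score distance) or $x = v/v_0$ (orthogonal distance), and $x_1, \ldots, x_I \sim \chi^2(N)/N$ with unknown $N$. Two ways to estimate $N$: the Method of Moments and the Interquartile Approach. The latter uses the ordered sample $x_{(1)} \le x_{(2)} \le \ldots \le x_{(I-1)} \le x_{(I)}$: the IQR covers the central half of the sample, leaving ¼ of the objects on each side.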

Total Distance and Cut-off Level. Total distance (TD). The cut-off level is set for a given α, the rate of wrong rejections of target-class samples (the type I error).
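
The formulas on this slide are missing from the transcript. The sketch below assumes the form commonly given in the published DD-SIMCA descriptions: total distance $c = N_h h/h_0 + N_v v/v_0$, treated as $\chi^2(N_h + N_v)$, with the cut-off taken as the $(1-\alpha)$ quantile of that distribution. The function names and all numeric values are hypothetical and serve only as an illustration.

```python
import numpy as np
from scipy.stats import chi2

def total_distance(h, v, h0, v0, Nh, Nv):
    """Total distance c = Nh*h/h0 + Nv*v/v0 (assumed form)."""
    return Nh * np.asarray(h) / h0 + Nv * np.asarray(v) / v0

def cutoff(alpha, Nh, Nv):
    """Critical level: (1 - alpha) quantile of chi-squared with Nh + Nv DoF."""
    return chi2.ppf(1.0 - alpha, Nh + Nv)

# Hypothetical distances for three objects
h = np.array([0.02, 0.05, 0.30])
v = np.array([1.0, 0.8, 6.0])
c = total_distance(h, v, h0=0.03, v0=1.0, Nh=3, Nv=5)
c_crit = cutoff(0.05, Nh=3, Nv=5)
accepted = c <= c_crit          # True means the object is attributed to the target class
```

Under these assumptions, α directly controls the expected share of target-class objects that fall outside the acceptance area.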

$\chi^2$ distribution (reminder 1). $x = h/h_0$ or $x = v/v_0$; $x \sim \chi^2(N)/N$, where $N$ is the number of degrees of freedom (DoF); $E(x) = 1$, $D(x) = 2/N$. A chi-squared variable with $N$ degrees of freedom is defined as the sum of the squares of $N$ independent standard normal random variables.
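
The moments quoted on this slide lead directly to a method-of-moments estimate. If a distance $d$ satisfies $d/d_0 \sim \chi^2(N)/N$, then $E(d) = d_0$ and $D(d) = 2 d_0^2/N$, so $d_0$ is estimated by the sample mean and $N$ by $2\,\bar d^{\,2}/s^2$. The sketch below checks this on simulated data; the function name `dof_moments` and the simulation settings are illustrative assumptions.

```python
import numpy as np

def dof_moments(d):
    """Method-of-moments estimates of the scale d0 and the DoF N."""
    d = np.asarray(d, dtype=float)
    d0 = d.mean()                          # E(d) = d0
    N = 2.0 * d0 ** 2 / d.var(ddof=1)      # D(d) = 2 * d0^2 / N
    return d0, N

# Simulated distances with known parameters (hypothetical values)
rng = np.random.default_rng(1)
true_N, true_d0 = 5, 0.7
d = true_d0 * rng.chisquare(true_N, size=50000) / true_N
d0, N = dof_moments(d)
```

Because the estimate relies on the sample variance, a few gross outliers can distort it strongly, which is the usual motivation for a robust alternative.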

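$\chi^2$ distribution (reminder 2). $E$ is a 100×20 matrix with entries drawn from $N(0, \sigma)$. Summing the squared entries of each row gives a 100×1 vector whose elements follow $\sum_{j=1}^{20} E(i,j)^2 \sim \chi^2(20)$, in units of $\sigma^2$.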

Simulated example 1: Gaussian noise only. $E$ is a 100×20 matrix with entries drawn from $N(0, \sigma)$. PCA DoFs: Rank(E) = K = 20; DoF(SD) = A (the number of principal components); DoF(OD) = K − A.

Simulated example 2: structure and no noise. $S$ is a 100×200 matrix, $S = U \Lambda V^T$, Rank(S) = 6, $\Lambda^T = (150, 100, 50, 20, 1, 0.001)$. PCA DoFs.

DoFs for matrix S, α = 0.05.
PC = 1. Theory: $N_h = 1$, $N_v = 5$, 10 out. Estimate: $N_v = 2$, 6 out.
PC = 2. Theory: $N_h = 2$, $N_v = 4$, 9 out. Estimate: $N_v = 1$, 4 out.

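Simulated example 3: structure + additive noise. $X = S + E$, where $S$ is the 100×200 structure matrix with $\Lambda^T = (150, 100, 50, 20, 1, 0.001)$ and $E$ is a 100×200 noise matrix with entries drawn from $N(0, \sigma)$. PCA: estimates of DoFs for various $\sigma$.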
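
A data set of this kind can be generated in a few lines. The sketch below is one possible construction, not the one used for the slides: drawing $U$ and $V$ as random orthonormal matrices via QR, the random seed, and the noise level `sigma = 0.1` are all assumptions, since the transcript does not say how they were chosen.

```python
import numpy as np

rng = np.random.default_rng(3)
I, J = 100, 200
lam = np.array([150, 100, 50, 20, 1, 0.001])   # diagonal of Lambda from the slide

# Random orthonormal columns via QR (an assumed way to pick U and V)
U, _ = np.linalg.qr(rng.normal(size=(I, lam.size)))
V, _ = np.linalg.qr(rng.normal(size=(J, lam.size)))

S = U @ np.diag(lam) @ V.T                     # structure part, rank 6
sigma = 0.1                                    # hypothetical noise level
E = rng.normal(0.0, sigma, size=(I, J))        # additive Gaussian noise
X = S + E                                      # data matrix passed to PCA

sv_S = np.linalg.svd(S, compute_uv=False)      # reproduces lam, then numerical zeros
sv_X = np.linalg.svd(X, compute_uv=False)      # noise lifts all singular values
```

Comparing `sv_S` with `sv_X` for several values of `sigma` shows which structural components remain above the noise floor, which is what drives the change in the estimated DoFs.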

Extreme plot. Dependence on α (panels: α = 0.1, α = 0.05, α = 0.01). The plot shows the observed number of extremes versus the theoretically expected number, calculated as $n = \alpha I$. It is obtained by varying $\alpha = n/I$.
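
The data behind an extreme plot can be computed as follows. This is a sketch under stated assumptions: the total distances are taken to follow a chi-squared distribution with `dof` degrees of freedom, an object counts as an extreme when its distance exceeds the (1 − α) quantile, and the tolerance band $n \pm 2\sqrt{n(1 - n/I)}$ is the binomial approximation; the function name `extreme_plot_data` and the simulated input are hypothetical.

```python
import numpy as np
from scipy.stats import chi2

def extreme_plot_data(c, dof):
    """Expected and observed numbers of extremes for alpha = n/I, n = 1..I.

    c   : total distances of the I objects
    dof : degrees of freedom of the total distance (assumed chi-squared)
    """
    c = np.asarray(c, dtype=float)
    I = c.size
    expected = np.arange(1, I + 1)                 # n = alpha * I
    alpha = expected / I
    c_crit = chi2.ppf(1.0 - alpha, dof)            # cut-off for each alpha
    observed = (c[None, :] > c_crit[:, None]).sum(axis=1)
    half_width = 2.0 * np.sqrt(expected * (1.0 - alpha))   # assumed tolerance band
    return expected, observed, half_width

rng = np.random.default_rng(2)
c = rng.chisquare(8, size=100)                     # stand-in for total distances
expected, observed, half_width = extreme_plot_data(c, dof=8)
inside = np.abs(observed - expected) <= half_width # points within the band
```

Plotting `observed` against `expected` together with the band gives the extreme plot; points that leave the band systematically suggest that the model complexity or the DoF estimates do not fit the data.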

Extreme plot. Training and test sets. [Figure: extreme plots for the training set (80 objects) and the test set (20 objects); panels PC = 4, PC = 3, PC = 7, PC = 2, PC = 1.]

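Simulated example 4: the test set partly carries a different structure. Training set (100×200): $S_{training} = U \Lambda V^t + E(0, \sigma)$. Test set (100×200): $S_{test} = U \Lambda V_n^t + E_1(0, \sigma)$. Here $U$ is 100×6, $V^t$ and $V_n^t$ are 6×200, Rank(S) = 6, $S = U \Lambda V^T$, $\Lambda^T = (150, 100, 50, 20, 1, 0.001)$.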

Simulated example 4 (PCs = 3). [Figure: training set and test set; PC = 3, $N_h = 4$, $N_v = 1$.]

Simulated example 4 (PCs = 4). [Figure: training set and test set; PC = 4, $N_h = 5$, $N_v = 1$.]

Real-world example: “Olives in brine”. 3 classes, 233 objects, 1258 variables. Measurements: NIR spectra in DR mode, 4000-10000 cm⁻¹. Class 1: training set of 75 objects, test set of 44 objects. Reference: O.Ye. Rodionova, P. Oliveri, A.L. Pomerantsev, "Rigorous and compliant approaches to one-class classification", Chemom. Intell. Lab. Syst. 159, 89-96 (2016).

Class 1. Training set: 75×1258, PC = 4, α = 0.05.

Application of the Extreme plot (1). ‘Olives in brine’, Class 1. Test set: 44 objects. [Figure panels: PCs = 4, PCs = 5, PCs = 7, PCs = 2.]

Application of the Extreme plot (2). Assessment of instrument performance for monitoring tablet quality and for anti-counterfeiting. Instrument: MicroNIR 1700 by VIAVI Solutions. Working range: 908-1676 nm. Resolution: less than 12.5 nm. Aims: 1. Compare the results of 2 instruments. 2. Compare the results of measurements on 2 days.

Anti-inflammatory medicine packed in PVC blisters. Objects: 50 tablets from 5 batches. Dataset: 50 × 125. DD-SIMCA model, PCs = 3. Datasets: “Instrument #1 day 1”, “Second day”, “Second instrument”.

New datasets. [Figure: extreme plots for the “Second day” and “Second instrument” datasets; panels PCs = 3, PCs = 4, PCs = 2, PCs = 1.]

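Interquartile Approach. Outlier detection. DoF: classical and robust estimates. The scaled distance is $x = h/h_0$ or $x = v/v_0$, and $x_1, \ldots, x_I \sim \chi^2(N)/N$ with unknown $N$. The classical estimate uses the Method of Moments; the robust estimate uses the Interquartile Approach, based on the ordered sample $x_{(1)} \le x_{(2)} \le \ldots \le x_{(I-1)} \le x_{(I)}$, where the IQR covers the central half of the sample and leaves ¼ of the objects on each side.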
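
One way to turn the interquartile idea into an estimator is sketched below. It is an illustrative scheme, not necessarily the one used by the authors: the ratio IQR/median of the sample is matched against the same ratio for $\chi^2(N)$ over an integer grid of $N$, and the scale $d_0$ is then recovered from the median. The function name `dof_robust`, the grid limit, and the simulated contamination are assumptions.

```python
import numpy as np
from scipy.stats import chi2

def dof_robust(d, max_dof=250):
    """Robust estimates of the scale d0 and DoF N from the median and IQR."""
    d = np.asarray(d, dtype=float)
    q25, med, q75 = np.percentile(d, [25, 50, 75])
    ratio = (q75 - q25) / med                      # scale-free sample statistic
    grid = np.arange(1, max_dof + 1)               # candidate integer DoFs
    theo = (chi2.ppf(0.75, grid) - chi2.ppf(0.25, grid)) / chi2.ppf(0.5, grid)
    N = int(grid[np.argmin(np.abs(theo - ratio))]) # closest theoretical ratio
    d0 = med * N / chi2.ppf(0.5, N)                # median(d) = d0 * q50(N) / N
    return d0, N

# Simulated distances with 1% gross outliers (hypothetical settings)
rng = np.random.default_rng(4)
clean = 0.7 * rng.chisquare(5, size=50000) / 5
d = np.concatenate([clean, np.full(500, 100.0)])
d0, N = dof_robust(d)
```

Because only quartiles enter the estimate, the contaminating values barely move it, whereas a variance-based estimate computed on the same sample would be dominated by them.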

Real-world example (Poster #23): Confocal Raman Spectroscopy and MDA in Evaluation of Spermatozoa with Normal and Abnormal Morphology. Morphology classification: 125 Normal, 36 Abnormal. Aims: study the sperm nuclear DNA quality; compare the results of morphology and Raman spectroscopy analysis in revealing normal and abnormal cells.

Sequential application of DD-SIMCA for outlier detection. ‘Normal’ model. Initial step. [Figure panels for PCs = 4 and PCs = 3. Classical estimates: $N_v = 1$, $N_h = 1$; $N_v = 1$, $N_h = 2$. Robust estimates: $N_v = 2$, $N_h = 3$; $N_v = 3$, $N_h = 4$.]

Sequential application of DD-SIMCA for outlier detection. ‘Normal’ model. Final step. [Figure panels for PCs = 4 and PCs = 3. Classical estimates: $N_v = 2$, $N_h = 3$; $N_v = 2$, $N_h = 3$. Robust estimates: $N_v = 3$, $N_h = 4$; $N_v = 2$, $N_h = 3$.]

Objects partitioning. Morphology classification: 125 Normal, 36 Abnormal. Spectral classification: of the 125 morphologically Normal objects, 102 are Normal and 23 are Abnormal; of the 36 morphologically Abnormal objects, 17 are Normal and 19 are Abnormal. These 17 are the ‘Abnormal’-‘Normal’ objects.

Conclusions (1). PCA: determination of the number of principal components for: detailed description of the X-data; detection of hidden structures, even in higher PCs; separation of structure from random noise; revealing the main common features of the X-data without analyzing between-object differences; outlier detection. Tool: estimation of the DoF for the SD and OD distances.

Conclusions (2). PCA, application to various datasets for: comparing to what extent the test set is similar to the training set; comparing new data with the training objects. Tool: Extreme plots for the training and test/new sets.

Conclusions (3). Both tools may be used not only for DD-SIMCA but also for the preliminary analysis of any data set. We acknowledge partial funding from the IAEA within the framework of projects D5240 and G42007.

Software tools. For Chemometrics Add-In users: SIMCA Template.xlsb. For MATLAB users: DD-SIMCA, a MATLAB GUI tool, available on GitHub (https://github.com/yzontov/dd-simca.git). Reference: Y.V. Zontov, O.Ye. Rodionova, S.V. Kucheryavskiy, A.L. Pomerantsev, "DD-SIMCA – A MATLAB GUI tool for data driven SIMCA approach", Chemom. Intell. Lab. Syst. 167, 23-28 (2017).