CLASSIFICATION

Periodic Table of Elements

1789 Lavoisier, 1869 Mendeleev

Measures of similarity: i) distance, ii) angular (correlation)

Figure: two objects, $x_k^T$ and $x_l^T$, plotted in the two-dimensional variable space (Var 1, Var 2). The difference between the object vectors is defined as the Euclidean distance between the objects, $d_{kl} = \|x_k^T - x_l^T\|$; the angle between the vectors gives the angular (correlation) measure.

Measuring similarity. Distance: i) Euclidean, ii) Minkowski ("Manhattan", "taxicab"), iii) Mahalanobis (for correlated variables).

Figure: distance between two points $p_1$ and $p_2$ in the $(X_1, X_2)$ plane. Euclidean: $d = \sqrt{(x_{11}-x_{21})^2 + (x_{12}-x_{22})^2}$. Manhattan: $d = |x_{11}-x_{21}| + |x_{12}-x_{22}|$.
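
A minimal sketch of these distance measures in Python (NumPy assumed; the data values are made up for illustration):

import numpy as np

def euclidean(a, b):
    # Straight-line distance between the object vectors a and b
    return np.sqrt(np.sum((a - b) ** 2))

def manhattan(a, b):
    # City-block ("taxicab") distance: sum of absolute coordinate differences
    return np.sum(np.abs(a - b))

def mahalanobis(a, b, cov):
    # Accounts for correlated variables via the inverse covariance matrix
    diff = a - b
    return np.sqrt(diff @ np.linalg.inv(cov) @ diff)

x_k = np.array([1.0, 2.0])
x_l = np.array([4.0, 6.0])
print(euclidean(x_k, x_l))   # 5.0
print(manhattan(x_k, x_l))   # 7.0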

Classification using distance: the nearest neighbor(s) define the membership of an object. KNN (K nearest neighbors), e.g. K = 1 or K = 3.
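
A small illustrative KNN classifier (a sketch with hypothetical data; any distance from the previous block could replace the Euclidean one used here):

import numpy as np
from collections import Counter

def knn_classify(X_train, y_train, x_new, k=3):
    # Euclidean distance from the new object to every training object
    dists = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]       # indices of the K closest objects
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]     # majority class among the neighbors

X = np.array([[0.0, 0.1], [0.2, 0.0], [1.0, 1.1], [1.2, 0.9]])
y = np.array(["class 1", "class 1", "class 2", "class 2"])
print(knn_classify(X, y, np.array([0.1, 0.2]), k=1))  # class 1
print(knn_classify(X, y, np.array([1.1, 1.0]), k=3))  # class 2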

Classification: $X_1$ and $X_2$ are uncorrelated, $\mathrm{cov}(X_1, X_2) = 0$, for both subsets (classes) => KNN can be used to measure similarity.

Classification (figure: classes 1-4 in the $(X_1, X_2)$ plane with PC1 and PC2 axes). Univariate classification can NOT provide a good separation between class 1 and class 2; bivariate classification (KNN) provides separation. For class 3 and class 4, PC analysis provides excellent separation on PC2.

Classification: $X_1$ and $X_2$ are correlated, $\mathrm{cov}(X_1, X_2) \neq 0$, for both "classes" (high $X_1$ => high $X_2$). KNN fails, but PC analysis provides the correct classification.

Classification: cluster methods like KNN (K nearest neighbors) use all the data in the calculation of distances. Drawback: no separation of noise from information. Cure: use scores from the major PCs.
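
A sketch of that cure: project the objects onto the few major principal components (which carry the information) and measure distances on the scores, leaving the noise in the residuals. It reuses the hypothetical knn_classify from the previous sketch:

import numpy as np

def pca_scores(X, n_components=2):
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    P = Vt[:n_components]        # loadings of the major PCs
    T = (X - mean) @ P.T         # scores: coordinates along the PCs
    return T, P, mean

# T, P, mean = pca_scores(X_train)
# t_new = (x_new - mean) @ P.T   # project a new object the same way
# knn_classify(T, y_train, t_new, k=3)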

VARIABLE CORRELATION AND SIMILARITY BETWEEN OBJECTS

CORRELATION & SIMILARITY (figure: objects in variable space, Var 1 vs. Var 2).

CORRELATION & SIMILARITY: SUPERVISED COMPARISON (SIMCA). Figure: a separate PC model is fitted to each class (PC class 1, PC class 2) in variable space (Var 1, Var 2).

CORRELATION & SIMILARITY: UNSUPERVISED COMPARISON (PCA). Figure: one common PC model (PC1, PC2) is fitted to all objects in variable space (Var 1, Var 2).

CORRELATION & SIMILARITY (figure: in variable space, an object vector $x_k^T$ is decomposed into its projection on the class model, $x_c^T$, and a residual $e_k^T$).

CORRELATION & SIMILARITY. Unsupervised: PCA (score plot), fuzzy clustering. Supervised: SIMCA.

CORRELATION & SIMILARITY: application to characterisation and correlation of crude oils, Kvalheim et al. (1985), Anal. Chem.

CORRELATION & SIMILARITY (figure: chromatograms of Sample 1, Sample 2, ..., Sample N).

CORRELATION & SIMILARITY: score plot of $t_1$ vs. $t_2$ (PC1, PC2).

Soft Independent Modelling of Class Analogies (SIMCA)

SIMCA: Data (variance) = Model (covariance pattern) + Residuals (unique variance, noise). The model part carries the angular correlation; the residual part carries the distance.

SIMCA data matrix: rows are objects, columns are variables, entries $x_{ki}$, $i = 1, \dots, M$. Objects $1, \dots, N$ form the training set (reference set), divided into Class 1, Class 2, ..., Class Q; objects $N+1, \dots, N+N'$ are the unassigned objects, the test set. Class: a group of similar objects. Object: a sample, an individual. Variable: a feature, characteristic, attribute.

SIMCA example: a data matrix of chromatograms, with peak areas as the entries $x_{ki}$. The training set (reference set), objects $1, \dots, N$, contains samples from Oil field 1, Oil field 2, ..., Oil field Q; the new samples, objects $N+1, \dots, N+N'$, form the test set.

PC MODELS. Zero-component model: $x_{ki} = \bar{x}_i + e_{ki}$, or in row-vector form $x'_k = \bar{x}' + e'_k$. One-component model: $x_{ki} = \bar{x}_i + t_k p'_i + e_{ki}$, or $x'_k = \bar{x}' + t_k p' + e'_k$.

PC MODELS 2’ 3’ 1’ X p1p1 p2p2 x ki = x i +  t k p’ i + e ki x’ k = x’ + t k1 p’ 1 + t k2 p’ 2 + e’ k

PRINCIPAL COMPONENT CLASS MODEL: $X^c = \bar{X}^c + T^c P'^c + E^c$, where $\bar{X}^c + T^c P'^c$ is the information (structure) and $E^c$ is the noise. Indices: $k = 1, 2, \dots, N$ (object, sample); $i = 1, 2, \dots, M$ (variable); $a = 1, 2, \dots, A$ (principal component); $c = 1, 2, \dots, C$ (class).
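
A minimal sketch of this class model in Python, computed with an SVD (NIPALS, which the slides assume, extracts the same components one at a time):

import numpy as np

def pc_class_model(Xc, A):
    # Fit an A-component PC model X = mean + T P' + E to one class
    mean = Xc.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc - mean, full_matrices=False)
    T = (U * s)[:, :A]        # scores t_ka, k = 1..N, a = 1..A
    P = Vt[:A]                # loadings p'_ai
    E = Xc - mean - T @ P     # residuals e_ki (the noise part)
    return mean, T, P, E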

PC MODELS: deletion pattern for the leave-one-group-of-elements-out-at-a-time cross-validation procedure developed by Wold.

CROSS-VALIDATING PC MODELS: i) calculate scores and loadings for PC $a+1$ ($t_{a+1}$ and $p'_{a+1}$), excluding the elements in one group; ii) predict values for the excluded elements, $\hat{e}_{ki,a+1} = t_{k,a+1} p'_{a+1,i}$; iii) sum the squared prediction errors over the elements; iv) repeat i)-iii) for all the other groups of elements; v) compare the total prediction error (PRESS) with the residual sum of squares of the $a$-component model, adjusting for degrees of freedom.
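
A hedged sketch of the PRESS comparison: for brevity it leaves out whole objects rather than Wold's groups of individual elements, but the decision rule is the same (keep PC $a+1$ only if it predicts left-out data better than the $a$-component model fits it):

import numpy as np

def press(X, a):
    # PRESS for an a-component model, leave-one-object-out deletion
    total = 0.0
    for k in range(len(X)):
        X_train = np.delete(X, k, axis=0)
        mean = X_train.mean(axis=0)
        _, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
        P = Vt[:a]                    # loadings from the reduced data
        r = X[k] - mean
        e = r - (r @ P.T) @ P         # residual of the left-out object
        total += np.sum(e ** 2)
    return total

# Keep PC a+1 if press(X, a + 1) is clearly smaller than the residual sum
# of squares of the a-component model (adjusted for degrees of freedom).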

Figure: 1-component PC model with class envelopes at the residual limits $s_{max}$ for p = 0.05 and p = 0.01 around PC 1.

Residual Standard Deviation (RSD). Mean RSD of the class: $s_0 = \sqrt{\sum_k \sum_i e_{ki}^2 / ((N-A-1)(M-A))}$. RSD of object k: $s_k = \sqrt{\sum_i e_{ki}^2 / (M-A)}$. (Figure: $s_0$ and $s_{max}$ around PC 1.)

Figure: score limits along PC 1. The normal range $[t_{min}, t_{max}]$ is extended by $\frac{1}{2}s_t$ at each end to give $t_{lower}$ and $t_{upper}$; $s_{max}$ bounds the residual distance.

CLASSIFICATION OF A NEW OBJECT: i) fit the object to the class model; ii) compare the residual distance of the object to the class model with the average residual distance of the objects used to obtain the class (F-test).

CLASSIFICATION OF A NEW OBJECT. i) Fit the object to the class model: for $a = 1, 2, \dots, A$, calculate the scores $t_{ka}$, then the residuals of the object, $e_{ki} = x_{ki} - \bar{x}_i - \sum_a t_{ka} p'_{ai}$. ii) Compare the residual distance of the object with the average residual distance of the objects used to obtain the class (F-test): $F = s_k^2 / s_0^2$; $F > F_{critical}$ => $k \notin$ class q, otherwise $k \in$ class q.
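
A sketch of steps i)-ii) using pc_class_model from above; the degrees of freedom (M - A per object, (N - A - 1)(M - A) for the class) follow common SIMCA practice and are an assumption here:

import numpy as np
from scipy import stats

def classify_new_object(x_new, mean, P, E, alpha=0.05):
    N, M = E.shape                       # training objects x variables
    A = P.shape[0]                       # number of principal components
    r = x_new - mean
    t = r @ P.T                          # scores of the new object
    e = r - t @ P                        # residuals to the class model
    s_k2 = np.sum(e ** 2) / (M - A)      # residual variance of the object
    s_02 = np.sum(E ** 2) / ((N - A - 1) * (M - A))  # class residual variance
    F = s_k2 / s_02
    F_crit = stats.f.ppf(1 - alpha, M - A, (N - A - 1) * (M - A))
    return F <= F_crit                   # True => object accepted by the class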

Figure: the same score and residual limits along PC 1 as above ($t_{lower}$, $t_{upper}$, $s_{max}$), now with objects outside the model marked.

Detection of atypical objects (figure along PC 1). Object k: $s_k > RSD_{max}$ => k is outside the class. Object l: $t_l$ is outside the "normal area" $[t_{min} - \frac{1}{2}s_t,\; t_{max} + \frac{1}{2}s_t]$ => calculate the distance $s_l$ to the extreme point; $s_l > RSD_{max}$ => l is outside the class.

Detection of outliers: 1. score plots; 2. Dixon tests on each latent variable; 3. normal plots of scores for each latent variable; 4. test of residuals, F-test (class model).

MODELLING POWER, DISCRIMINATION POWER

MODELLING POWER: the variable's contribution to class model q (intra-class variation): $MP_i^q = 1 - s_{i,A}^q / s_{i,0}^q$. $MP_i = 1.0$ => variable i is completely explained by the class model; $MP_i = 0.0$ => variable i does NOT contribute to the class model.

DISCRIMINATION POWER: the variable's ability to separate two class models (inter-class variation), i.e. the ratio of the residual variance of variable i when the objects of each class are fitted to the other class's model to its residual variance within the classes' own models. $DP_i^{r,q} = 1.0$ => no discrimination power; $DP_i^{r,q} > 3$-$4$ => "good" discrimination power.
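
A sketch of both quantities per variable, following the standard SIMCA definitions (the DP formula itself was only shown as a figure in the slides, so its exact form here is an assumption):

import numpy as np

def modelling_power(Xq, Eq):
    # MP_i = 1 - s_i,A / s_i,0: residual SD after A components vs. raw SD
    s_iA = Eq.std(axis=0, ddof=1)
    s_i0 = Xq.std(axis=0, ddof=1)
    return 1.0 - s_iA / s_i0

def discrimination_power(E_qq, E_rr, E_qr, E_rq):
    # E_qr holds the residuals of class q objects fitted to class r's model,
    # etc.; DP_i is the ratio of inter-class to intra-class residual variance
    inter = (E_qr ** 2).sum(axis=0) + (E_rq ** 2).sum(axis=0)
    intra = (E_qq ** 2).sum(axis=0) + (E_rr ** 2).sum(axis=0)
    return np.sqrt(inter / intra)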

SEPARATION BETWEEN CLASSES (figure: each object has a residual distance to its own class and to the other class, e.g. $s_k(q)$ and $s_k(r)$ for object k of class q, and $s_l(r)$ and $s_l(q)$ for object l of class r). Worst ratio: the smallest $s_l(q)/s_l(r)$ over the objects $l \in r$. Class distance: computed from the same residuals; a large class distance => "good separation".

POLISHED CLASSES: 1) remove "outliers"; 2) remove variables with both low modelling power (MP) and low discrimination power (DP < 2-3).

How does SIMCA differ from other multivariate methods? i) It models the systematic intra-class variation (angular correlation). ii) Assuming a normally distributed population, the residuals can be used to decide class belonging (F-test)! iii) "Closed" models. iv) It considers correlation, important for large data sets. v) SIMCA separates noise from the systematic (predictive) variation in each class.

Linear Discriminant Analysis (LDA): a separating surface. New classes? Outliers? The asymmetric case? Looking for dissimilarities.

MISSING DATA (figure: class models $f_1(x_1, x_2)$ and $f_2(x_1, x_2)$ in the $(x_1, x_2)$ plane, with missing values marked "?").

WHEN DOES SIMCA WORK? 1. Similarity between objects in the same class, homogeneous data. 2. Some variables relevant for the problem in question (MP, DP). 3. At least 5 objects and 3 variables.

ALGORITHM FOR SIMCA MODELLING: 1) read raw data; 2) pretreatment of data (square root, normalise, standardise, and more); 3) variable weighting; 4) select subset/class; 5) fit a cross-validated PC model; 6) outliers? if yes, remove them and remodel ("polished" subsets); 7) more classes? if yes, repeat from 4); 8) evaluation of subsets: eliminate variables with low modelling and discrimination power; 9) fit new objects.
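
Putting the pieces together, a hedged outline of this workflow using the sketches above (the function names are this document's illustrations, not a published API):

import numpy as np

def simca_fit(classes, A=2):
    # classes: dict mapping class name -> pretreated data matrix
    models = {}
    for name, Xc in classes.items():
        mean, T, P, E = pc_class_model(Xc, A)   # choose A by cross-validation
        models[name] = (mean, P, E)
    return models

def simca_predict(models, x_new, alpha=0.05):
    # Fit the new object to every class model; SIMCA can accept an object
    # into several classes, or into none (a possible new class or outlier)
    return [name for name, (mean, P, E) in models.items()
            if classify_new_object(x_new, mean, P, E, alpha)]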