X.2 Linear Discriminant Analysis: 2-Class

X.2 Linear Discriminant Analysis: 2-Class
- Supervised versus unsupervised approaches for dimension reduction.
- Initial attempt based on the axis between the means of a 2-class data set.
- Adaptation to account for in-class variance: the Fisher discriminant.
- Algorithm for identifying the optimal axis corresponding to this discriminant.

Limitations of PCA
Often, we wish to discriminate between classes (e.g., healthy vs. diseased, molecular identification based on compiled Raman or MS data, determination of the origin of complex samples, chemical fingerprinting, etc.). The outcome of the measurement is therefore: i) assignment of an unknown to a class, and ii) assessment of the confidence of that assignment.
PCA is "unsupervised", in that the algorithm does not use class information in dimension reduction. The entire data set, including samples of both known and unknown origin, is treated equally in the analysis. The directions of maximum variance in the data do not necessarily correspond to the directions of maximum class discrimination/resolution.

LDA: 2-Class
Let's revisit the earlier example, in which just two discrete wavelengths from the UV-Vis spectra are plotted to allow visualization in 2-D. Dimension reduction by PCA does not provide clear discrimination in this case.
[Figure: Class a and Class b plotted against Abs(1) and Abs(2), with the PC1 axis overlaid; the classes overlap when projected onto PC1.]
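A minimal numerical sketch of this point, assuming two synthetic, correlated Gaussian classes (none of the numbers come from the slides): PC1 follows the pooled direction of maximum variance, which here runs roughly parallel to both classes rather than between them, so projecting onto PC1 barely separates them.

```python
import numpy as np

rng = np.random.default_rng(1)

# Two elongated, parallel classes: the direction of maximum variance is shared,
# while the class separation lies roughly perpendicular to it.
cov = [[1.00, 0.95], [0.95, 1.00]]
Xa = rng.multivariate_normal([0.0, 0.5], cov, size=100)   # class a: (Abs(1), Abs(2)) pairs
Xb = rng.multivariate_normal([0.5, 0.0], cov, size=100)   # class b
X = np.vstack([Xa, Xb])

# PC1 = eigenvector of the pooled covariance matrix with the largest eigenvalue
evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
pc1 = evecs[:, np.argmax(evals)]

# Project both classes onto PC1: the projected distributions overlap heavily
score_a, score_b = Xa @ pc1, Xb @ pc1
ratio = abs(score_a.mean() - score_b.mean()) / (score_a.std() + score_b.std())
print("separation-to-spread along PC1:", round(ratio, 2))  # ~0 => poor discrimination
```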

LDA: 2-Class (2)
Using class information, the simplest first choice for a new coordinate to discriminate between a and b is the axis connecting the means of each class.
[Figure: Class a and Class b plotted against Abs(1) and Abs(2), with the axis along µa − µb connecting the two class means.]

LDA: 2-Class (3)
The mean vector (spectrum) of each class is calculated independently, where na and nb are the numbers of measured spectra obtained for classes a and b, respectively. For later stages, it is useful to define a single scalar value that can be used for optimization. If we let w be the new test axis under consideration for discrimination, the scalar property J(w) can be calculated from the projections of each mean vector onto w; the projected means µa and µb are scalars.
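The equations on this slide are not reproduced in the transcript. In the usual Fisher-LDA notation they correspond to the following (a reconstruction, with x_i denoting the i-th measured spectrum):

$$\boldsymbol{\mu}_a = \frac{1}{n_a}\sum_{i \in a}\mathbf{x}_i, \qquad \boldsymbol{\mu}_b = \frac{1}{n_b}\sum_{i \in b}\mathbf{x}_i$$

$$\tilde{\mu}_a = \mathbf{w}^{\mathsf{T}}\boldsymbol{\mu}_a, \qquad \tilde{\mu}_b = \mathbf{w}^{\mathsf{T}}\boldsymbol{\mu}_b, \qquad J(\mathbf{w}) = \left|\tilde{\mu}_a - \tilde{\mu}_b\right| = \left|\mathbf{w}^{\mathsf{T}}\left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)\right|$$

This separation of the projected means is the quantity that the following slides refine into the Fisher criterion.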

LDA: 2-Class (4)
However, a selection based on the separation of the means neglects the influence of the variance about each mean. In the example below, the two classes are not resolved using the difference in means alone.
[Figure: Class a and Class b plotted against Abs(1) and Abs(2); projection onto the µa − µb axis leaves the classes overlapped.]

LDA: 2-Class (5)
Weighting by the in-class variance along a particular choice of w shifts the optimal resolution/discrimination away from that expected from the difference in means alone.
[Figure: Class a and Class b plotted against Abs(1) and Abs(2), showing the µa − µb axis, an alternative axis w, and the value of J as the axis is varied.]

LDA: 2-Class (6)
How about a separation direction based on the definition of RESOLUTION?
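The slide's figure and formula are not reproduced here; one common chromatography-style definition of resolution between the two projected class distributions (an assumed form, for concreteness) is

$$R = \frac{\left|\tilde{\mu}_a - \tilde{\mu}_b\right|}{2\left(\tilde{s}_a + \tilde{s}_b\right)}$$

i.e., the separation of the projected means relative to the projected spreads, the same separation-over-spread structure that the Fisher discriminant formalizes on the next slide.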

LDA: 2-Class (7)
The Fisher linear discriminant includes a weighting by an equivalent of the within-class variance along the test axis w. Improved discrimination corresponds to minimizing the in-class variance along w, so the within-class variance term should appear in the denominator. Here yi is the scalar projection of the ith data vector xi onto the w axis, and sa² is the variance about the projected mean of class a along the w axis. sa² can be rewritten in matrix notation to isolate the parts that are dependent on and independent of w.
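The equations accompanying this slide are missing from the transcript; in standard notation they correspond to (a reconstruction):

$$y_i = \mathbf{w}^{\mathsf{T}}\mathbf{x}_i, \qquad J(\mathbf{w}) = \frac{\left(\tilde{\mu}_a - \tilde{\mu}_b\right)^2}{\tilde{s}_a^{\,2} + \tilde{s}_b^{\,2}}$$

$$\tilde{s}_a^{\,2} = \sum_{i \in a}\left(y_i - \tilde{\mu}_a\right)^2 = \sum_{i \in a}\mathbf{w}^{\mathsf{T}}\left(\mathbf{x}_i - \boldsymbol{\mu}_a\right)\left(\mathbf{x}_i - \boldsymbol{\mu}_a\right)^{\mathsf{T}}\mathbf{w} = \mathbf{w}^{\mathsf{T}}\mathbf{S}_a\,\mathbf{w}$$

where the class scatter matrix S_a (and, analogously, S_b) is independent of w.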

LDA: 2-Class (8)
We can perform additional manipulations to express the numerator in terms of w-dependent and w-independent contributions, such that J becomes a ratio of two quadratic forms in w.
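A reconstruction of the manipulation referred to, in standard notation:

$$\left(\tilde{\mu}_a - \tilde{\mu}_b\right)^2 = \left[\mathbf{w}^{\mathsf{T}}\left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)\right]^2 = \mathbf{w}^{\mathsf{T}}\left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)\left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)^{\mathsf{T}}\mathbf{w} = \mathbf{w}^{\mathsf{T}}\mathbf{S}_B\,\mathbf{w}$$

such that

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\mathsf{T}}\mathbf{S}_B\,\mathbf{w}}{\mathbf{w}^{\mathsf{T}}\mathbf{S}_W\,\mathbf{w}}, \qquad \mathbf{S}_W = \mathbf{S}_a + \mathbf{S}_b$$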

LDA: 2-Class
In summary:
- SB is the between-class scatter; maximizing J corresponds to maximizing the separation between the class means.
- SW is the within-class scatter; minimizing it improves the prospects for separation between the classes.

LDA: 2-Class
There are two equivalent ways to optimize J: set its derivative with respect to w equal to zero, or select the eigenvector with the maximum eigenvalue. Because J is simply a scalar, we can move it around within the equation.
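The two routes lead to the same condition (equations reconstructed in standard notation):

$$\frac{\partial J}{\partial \mathbf{w}} = 0 \;\Longrightarrow\; \mathbf{S}_B\,\mathbf{w} = J\,\mathbf{S}_W\,\mathbf{w} \;\Longrightarrow\; \mathbf{S}_W^{-1}\mathbf{S}_B\,\mathbf{w} = J\,\mathbf{w}$$

Moving the scalar J across the equality turns the stationarity condition into a (generalized) eigenvalue problem in which the eigenvalue is J itself.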

LDA: 2-Class (9)
The optimal axis w* is obtained by setting the derivative of J equal to zero. Equivalently, this result can be cast as identifying the eigenvector that maximizes the corresponding eigenvalue J. The eigenvector(s) of the matrix Sw⁻¹SB correspond to the optimal direction(s) of w, with those corresponding to larger values of J providing greater discrimination/resolution. For the two-class system, solving the eigenvector/eigenvalue problem leads to a concise analytical expression for the optimal direction based on the Fisher discriminant.
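For the two-class case, the concise analytical expression referred to is the standard Fisher result (reconstructed):

$$\mathbf{S}_B\,\mathbf{w} = \left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)\left[\left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)^{\mathsf{T}}\mathbf{w}\right] \;\Longrightarrow\; \mathbf{w}^{*} \propto \mathbf{S}_W^{-1}\left(\boldsymbol{\mu}_a - \boldsymbol{\mu}_b\right)$$

since SB·w always points along µa − µb (the bracketed factor is a scalar) and only the direction of w affects J.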

LDA: 2-Class Example
[Figure: the two-class data with the optimal axis w overlaid.]

LDA: 2-Class Example (2)
Now we can quantify our ability to resolve the two classes, and assess the reliability of assignment, with scalar values by using the mean and variance projected along that new axis. Note: maximizing J is almost mathematically equivalent to maximizing the resolution R. J is the nonzero eigenvalue, which corresponds to the resolution.
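A compact sketch of the full two-class procedure on synthetic data. The two-Gaussian test set, variable names, and the resolution formula R = |Δµ̃| / [2(s̃a + s̃b)] are assumptions for illustration, not values taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "two-wavelength" data for classes a and b (partially overlapping)
cov = [[0.30, 0.25], [0.25, 0.30]]
Xa = rng.multivariate_normal([1.0, 2.0], cov, size=60)
Xb = rng.multivariate_normal([2.0, 1.5], cov, size=60)

# Class means and within-class scatter matrices (sums of outer products)
mu_a, mu_b = Xa.mean(axis=0), Xb.mean(axis=0)
Sa = (Xa - mu_a).T @ (Xa - mu_a)
Sb = (Xb - mu_b).T @ (Xb - mu_b)
Sw = Sa + Sb

# Optimal Fisher axis for the 2-class problem: w* proportional to Sw^-1 (mu_a - mu_b)
w = np.linalg.solve(Sw, mu_a - mu_b)
w /= np.linalg.norm(w)

# Scalar projections of every spectrum onto w*
ya, yb = Xa @ w, Xb @ w

# Fisher criterion J: squared separation of projected means over summed projected scatter
scatter = ((ya - ya.mean())**2).sum() + ((yb - yb.mean())**2).sum()
J = (ya.mean() - yb.mean())**2 / scatter

# Chromatography-style resolution along w* (assumed definition, as noted above)
R = abs(ya.mean() - yb.mean()) / (2.0 * (ya.std(ddof=1) + yb.std(ddof=1)))
print(f"w* = {np.round(w, 3)},  J = {J:.3f},  R = {R:.2f}")
```

Projecting onto w* reduces each spectrum to a single scalar, so the class assignment and its reliability can be judged from the projected means and spreads alone.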