Elements of Pattern Recognition CNS/EE-148 -- Lecture 5 M. Weber P. Perona.


What is Classification? We want to assign objects to classes based on a selection of attributes (features). Examples:
– (age, income) → {credit worthy, not credit worthy}
– (blood cell count, body temperature) → {flu, hepatitis B, hepatitis C}
– (pixel vector) → {Bill Clinton, coffee cup}
The feature vector can be continuous, discrete, or mixed.

What is Classification? We want to find a function from measurements to class labels, i.e. a decision boundary. [Figure: space of feature vectors (x_1, x_2) with regions labeled Signal 1, Signal 2, and Noise.] Statistical methods use the joint pdf p(C, x); assume p(C, x) is known for now.

Some Terminology
– p(C) is called the prior (a priori probability).
– p(x|C) is called the class-conditional density, or the likelihood of C with respect to x.
– p(C|x) is called the posterior (a posteriori probability).

Examples. One measurement, symmetric cost, equal priors: a bad case. [Figure: class-conditional densities p(x|C_1) and p(x|C_2) plotted against x.]

Examples. One measurement, symmetric cost, equal priors: a good case. [Figure: class-conditional densities p(x|C_1) and p(x|C_2) plotted against x.]

How to Make the Best Decision? (Bayes Decision Theory) Define a cost function for mistakes (e.g. the zero-one loss for a symmetric cost). Minimize the expected loss (risk) over the entire joint density p(C, x); it suffices to make the optimal decision for each individual x. Result: decide according to the maximum posterior probability, i.e. choose the class C_i that maximizes p(C_i|x).
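A minimal sketch of this maximum-posterior rule for the one-measurement examples above, assuming two Gaussian class-conditional densities; the means, variances and priors below are illustrative assumptions, not values from the lecture.

```python
import numpy as np
from scipy.stats import norm

# Illustrative class-conditional densities p(x|C_1), p(x|C_2) and priors p(C_1), p(C_2)
# (parameter values are assumptions for this sketch).
means, stds, priors = [0.0, 2.0], [1.0, 1.0], [0.5, 0.5]

def posterior(x):
    """Posteriors p(C_i|x) proportional to p(x|C_i) p(C_i)."""
    joint = np.array([norm.pdf(x, m, s) * p for m, s, p in zip(means, stds, priors)])
    return joint / joint.sum()

def decide(x):
    """Bayes decision: pick the class with maximum posterior probability."""
    return int(np.argmax(posterior(x))) + 1  # returns 1 or 2

print(decide(0.3), decide(1.7))  # -> 1 2 (boundary at x = 1 for these parameters)
```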

Two Classes, C_1, C_2. It is helpful to consider the likelihood ratio p(x|C_1) / p(x|C_2): decide C_1 if it exceeds the threshold p(C_2) / p(C_1), using the known priors p(C_i), or ignore them (i.e. treat the priors as equal). For a more elaborate loss function the same test applies, with a threshold that also depends on the losses (the proof is easy). A function g(x) whose value determines the decision in this way is called a discriminant function.

Discriminant Functions for Multivariate Gaussian Class-Conditional Densities. Take two multivariate Gaussians in d dimensions, p(x|C_i) = N(μ_i, Σ_i). Since log is monotonic, we can look at log g(x). The quadratic term (x - μ_i)^T Σ_i^{-1} (x - μ_i) is the squared Mahalanobis distance; additive constants that do not depend on the class are superfluous and can be dropped.
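For concreteness, a small sketch of the Gaussian log-discriminant and the Mahalanobis distance it contains; the class parameters below are assumed for illustration.

```python
import numpy as np

def log_discriminant(x, mu, Sigma, prior):
    """log g_i(x) = -1/2 (x-mu)^T Sigma^{-1} (x-mu) - 1/2 log|Sigma| + log p(C_i) + const."""
    d = x - mu
    maha2 = d @ np.linalg.solve(Sigma, d)            # squared Mahalanobis distance
    return -0.5 * maha2 - 0.5 * np.linalg.slogdet(Sigma)[1] + np.log(prior)

# Illustrative 2-D classes (assumed parameters).
mu1, Sigma1 = np.array([0.0, 0.0]), np.array([[2.0, 0.5], [0.5, 1.0]])
mu2, Sigma2 = np.array([3.0, 1.0]), np.array([[1.0, 0.0], [0.0, 1.0]])

x = np.array([1.0, 0.5])
scores = [log_discriminant(x, mu1, Sigma1, 0.5),
          log_discriminant(x, mu2, Sigma2, 0.5)]
print("decide class", int(np.argmax(scores)) + 1)
```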

Mahalanobis Distance. Iso-distance lines are iso-probability lines. [Figure: class means μ_1 and μ_2 in the (x_1, x_2) plane with elliptical iso-distance contours and the resulting decision surface.]

Case 1: Σ_i = σ² I. The discriminant functions simplify to g_i(x) = -||x - μ_i||² / (2σ²) + log p(C_i), i.e. a comparison of (scaled) Euclidean distances to the class means.

Decision Boundary. On the boundary the two discriminants are equal; rearranging, we obtain the matched filter, with an expression for the threshold: the decision depends on x only through the correlation (μ_1 - μ_2)^T x.

Two Signals and Additive White Gaussian Noise. [Figure: signals μ_1 and μ_2 in the (x_1, x_2) plane; the measurement x is classified using x - μ_2 and the direction μ_1 - μ_2.]
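A sketch of the matched-filter form of the decision rule for this spherical-covariance case, assuming two known signal templates, equal priors, and a made-up noise level.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed signal templates mu1, mu2 and white Gaussian noise level sigma.
mu1 = np.array([1.0, 0.0, 1.0, 0.0])
mu2 = np.array([0.0, 1.0, 0.0, 1.0])
sigma = 0.5

def matched_filter_decide(x):
    """With Sigma_i = sigma^2 I and equal priors, decide C_1 iff
    (mu1 - mu2)^T x > 0.5 * (||mu1||^2 - ||mu2||^2)."""
    w = mu1 - mu2
    threshold = 0.5 * (mu1 @ mu1 - mu2 @ mu2)
    return 1 if w @ x > threshold else 2

x = mu1 + sigma * rng.standard_normal(mu1.shape)   # noisy observation of signal 1
print(matched_filter_decide(x))                    # usually -> 1
```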

Case 2: Σ_i = Σ. Two classes, 2-D measurements, the p(x|C_i) are multivariate Gaussians with equal covariance matrices. The derivation is similar:
– the quadratic term x^T Σ^{-1} x drops out of the comparison since it is independent of the class,
– we obtain a linear decision surface.
Matlab demo.
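A sketch of the resulting linear decision rule, estimating the shared covariance from samples; the synthetic data below is an assumed stand-in for the Matlab demo.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 2-D Gaussian classes with a shared covariance (assumed parameters).
Sigma = np.array([[1.0, 0.6], [0.6, 2.0]])
mu1, mu2 = np.array([0.0, 0.0]), np.array([2.5, 1.0])
X1 = rng.multivariate_normal(mu1, Sigma, size=200)
X2 = rng.multivariate_normal(mu2, Sigma, size=200)

# Linear discriminant: w = Sigma^{-1}(mu1 - mu2), threshold at the midpoint (equal priors).
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S = 0.5 * (np.cov(X1, rowvar=False) + np.cov(X2, rowvar=False))  # pooled covariance
w = np.linalg.solve(S, m1 - m2)
b = -0.5 * w @ (m1 + m2)

predict = lambda X: np.where(X @ w + b > 0, 1, 2)   # class 1 iff w^T x + b > 0
acc = np.concatenate([predict(X1) == 1, predict(X2) == 2]).mean()
print(f"training accuracy: {acc:.2f}")
```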

Case 3: General Covariance Matrix. See transparency.

Isn't this too simple? Not at all. It is true that images form complicated manifolds (from a pixel point of view, translation, rotation and scaling are all highly non-linear operations), but the high dimensionality helps.

Assume Unknown Class Densities. In real life, we do not know the class-conditional densities, but we do have example data. This puts us in the typical machine learning scenario: we want to learn a function, c(x), from examples. Why not just estimate the class densities from examples and apply the previous ideas?
– Learning a Gaussian (a simple density) in N dimensions needs at least on the order of N² samples: 10x10 pixels → 10,000 examples!
– Avoid estimating densities whenever you can! (it is too general a problem)
– The posterior is generally simpler than the class-conditional density (see transparency).

Remember PCA? The principal components are the eigenvectors of the covariance matrix. Use the reconstruction error for recognition (e.g. Eigenfaces).
– Good: reduces dimensionality.
– Bad: no model within the subspace; linearity may be inappropriate; the covariance is not the appropriate thing to optimize for discrimination.
[Figure: data in the (x_1, x_2) plane with mean μ, a sample x, and the first principal direction u_1.]
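A small sketch of PCA as described above: the principal components as eigenvectors of the covariance matrix, and the reconstruction error used for recognition. The random data is an assumed stand-in.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 10))  # assumed data: 100 samples in 10-D

# Principal components = eigenvectors of the covariance matrix.
mu = X.mean(axis=0)
C = np.cov(X - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(C)                  # ascending eigenvalues
U = eigvecs[:, ::-1][:, :3]                           # top-3 principal directions

def reconstruction_error(x):
    """Project onto the subspace spanned by U and measure the residual norm."""
    z = U.T @ (x - mu)                                # coordinates in the subspace
    x_hat = mu + U @ z                                # reconstruction
    return np.linalg.norm(x - x_hat)

print(reconstruction_error(X[0]), reconstruction_error(rng.standard_normal(10)))
```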

Fisher's Linear Discriminant. Goal: reduce dimensionality before training classifiers etc. (feature selection). This is a similar goal to PCA, but Fisher has classification in mind: find projection directions such that separation is easiest (Eigenfaces vs. Fisherfaces). [Figure: two classes in the (x_1, x_2) plane and a projection direction that separates them.]

Fisher's Linear Discriminant. Assume we have n d-dimensional samples x_1, ..., x_n, with n_1 from set (class) X_1 and n_2 from set X_2. We form the linear combinations y_i = w^T x_i and obtain y_1, ..., y_n. Only the direction of w is important.

Objective for Fisher. Measure the separation as the distance between the means after projecting (k = 1, 2): m_k' = w^T m_k. Measure the scatter after projecting: s_k^2 = sum over y in class k of (y - m_k')^2. The objective becomes to maximize J(w) = |m_1' - m_2'|^2 / (s_1^2 + s_2^2).

We need to make the dependence on w explicit. Defining the within-class scatter matrix S_W = S_1 + S_2, with S_k = sum over x in X_k of (x - m_k)(x - m_k)^T, we obtain s_1^2 + s_2^2 = w^T S_W w. Similarly for the separation, with the between-class scatter matrix S_B = (m_1 - m_2)(m_1 - m_2)^T, we get |m_1' - m_2'|^2 = w^T S_B w. Finally we can write J(w) = (w^T S_B w) / (w^T S_W w).

Fisher's Solution. J(w) is called a generalized Rayleigh quotient. Any w that maximizes J must satisfy the generalized eigenvalue problem S_B w = λ S_W w. Since S_B is very singular (rank 1), and S_B w is always in the direction of (m_1 - m_2), we are done: w ∝ S_W^{-1} (m_1 - m_2).
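A direct sketch of Fisher's solution w ∝ S_W^{-1} (m_1 - m_2), reusing the kind of synthetic two-class data assumed earlier.

```python
import numpy as np

rng = np.random.default_rng(3)

# Two synthetic 2-D classes (assumed parameters, as in the earlier sketch).
X1 = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.6], [0.6, 2.0]], size=150)
X2 = rng.multivariate_normal([2.5, 1.0], [[1.0, 0.6], [0.6, 2.0]], size=150)

m1, m2 = X1.mean(axis=0), X2.mean(axis=0)

def scatter(X, m):
    """Within-class scatter S_k = sum_x (x - m)(x - m)^T."""
    D = X - m
    return D.T @ D

S_W = scatter(X1, m1) + scatter(X2, m2)
w = np.linalg.solve(S_W, m1 - m2)        # Fisher direction, up to scale
w /= np.linalg.norm(w)

# The projected class means are well separated along w.
print((X1 @ w).mean(), (X2 @ w).mean())
```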

Comments on FLD. We did not follow Bayes decision theory; FLD is useful for many types of densities. Fisher can be extended (see demo):
– more than one projection direction,
– more than two clusters.
Let's try it out: Matlab demo.

Fisher vs. Bayes. Assume we do have identical Gaussian class densities. Then Bayes says w = Σ^{-1} (μ_1 - μ_2), while Fisher says w ∝ S_W^{-1} (m_1 - m_2). Since S_W is proportional to the covariance matrix, w is in the same direction in both cases. Comforting...

What have we achieved?
– Found out that the maximum-posterior strategy is optimal. Always.
– Looked at different cases of Gaussian class densities, where we could derive simple decision rules. Gaussian classifiers do a reasonable job!
– Learned about FLD, which is useful and often preferable to PCA.

Just for Fun: Support Vector Machine. Very fashionable... state of the art? Does not model densities; fits the decision surface directly. Maximizes the margin → reduces "complexity". The decision surface depends only on nearby samples (the support vectors). Matlab demo. [Figure: two classes in the (x_1, x_2) plane with a maximum-margin separating line.]
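A sketch of fitting a linear SVM, standing in for the Matlab demo; it assumes scikit-learn is available and uses synthetic data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(4)

# Two synthetic 2-D classes (assumed data).
X = np.vstack([rng.normal([0, 0], 1.0, size=(100, 2)),
               rng.normal([3, 2], 1.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="linear", C=1.0).fit(X, y)   # maximum-margin linear classifier
print("support vectors per class:", clf.n_support_)
print("training accuracy:", clf.score(X, y))
```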

Learning Algorithms. [Diagram: examples (x_i, y_i) drawn from p(x, y) and a set of candidate functions feed into a learning algorithm, which outputs a learned function y = f(x).]

Assume Unknown Class Densities: SVM Examples. Densities are hard to estimate → avoid it.
– Example from Ripley; gives intuitions about overfitting.
We need to learn:
– a standard machine learning problem,
– training/test sets.
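A minimal sketch of the training/test-set discipline mentioned above; it uses synthetic overlapping classes as an assumed stand-in for the Ripley example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

rng = np.random.default_rng(5)

# Synthetic, overlapping 2-D classes (assumed stand-in for the Ripley data).
X = np.vstack([rng.normal([0, 0], 1.5, size=(200, 2)),
               rng.normal([2, 1], 1.5, size=(200, 2))])
y = np.array([0] * 200 + [1] * 200)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# A very flexible model can fit the training set closely yet generalize worse:
# compare training accuracy with held-out test accuracy.
clf = SVC(kernel="rbf", gamma=50.0).fit(X_train, y_train)
print("train accuracy:", clf.score(X_train, y_train))
print("test accuracy: ", clf.score(X_test, y_test))
```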