Pattern Classification & Decision Theory

How are we doing on the pass sequence? Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data. Can we track the man in the white shirt? Not very well. [Scatter plot: feature x vs. hand-labeled horizontal coordinate t.] Regression fails to identify that there really are two classes of solution.

Decision theory We wish to classify an input x as belonging to one of K classes, where class k is denoted C_k. Example: Buffalo digits, 10 classes, 16×16 images; each image x is a 256-dimensional vector with x_m ∈ [0,1].

Decision theory We wish to classify an input x as belonging to one of K classes, where class k is denoted C_k. Partition the input space into regions R_1, …, R_K, so that if x ∈ R_k, our classifier predicts class C_k. How should we choose the partition? Suppose x is presented with probability p(x) and the distribution over the class labels given x is p(C_k|x). Then p(correct) = Σ_k ∫_{R_k} p(x) p(C_k|x) dx. This is maximized by assigning each x to the region whose class maximizes p(C_k|x).
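
Below is a minimal Python sketch of this rule (the posterior values and class count are invented for illustration, not from the lecture): given per-class posteriors p(C_k|x) for a batch of inputs, predicting the class with the largest posterior maximizes the expected probability of a correct decision.

```python
import numpy as np

# Hypothetical posteriors p(C_k | x) for 4 inputs and K = 3 classes;
# each row sums to 1.
posteriors = np.array([
    [0.70, 0.20, 0.10],
    [0.10, 0.60, 0.30],
    [0.30, 0.30, 0.40],
    [0.05, 0.05, 0.90],
])

# Bayes-optimal decision: assign each x to the class with the largest posterior.
predictions = posteriors.argmax(axis=1)

# On these inputs, the probability of a correct decision is the
# average of the winning posteriors.
p_correct = posteriors.max(axis=1).mean()
print(predictions, p_correct)
```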

Three approaches to pattern classification 1. Discriminative and non-probabilistic – Learn a discriminant function f(x), which maps x directly to a class label. 2. Discriminative and probabilistic – For each class k, learn the probability model p(C_k|x) – Use this probability to classify a new input x.

Three approaches to pattern classification 3. Generative – For each class k, learn the generative probability model p(x|C_k) – Also, learn the class probabilities p(C_k) – Use Bayes' rule to classify a new input: p(C_k|x) = p(x|C_k) p(C_k) / p(x), where p(x) = Σ_j p(x|C_j) p(C_j).

Three approaches to pattern classification 1. Discriminative and non-probabilistic – Learn a discriminant function f(x), which maps x directly to a class label. 2. Discriminative and probabilistic – For each class k, learn the probability model p(C_k|x) – Use this probability to classify a new input x.

Can we use regression to learn discriminant functions? [Plot: two-class training data and a candidate discriminant function f(x).]

Can we use regression to learn discriminant functions? What do the classification regions look like? Is there any sense in which squared error is an appropriate cost function for classification? We should be careful not to interpret integer-valued class labels as ordinal targets for regression.

The one-of-K representation For more than 2 classes, each class is represented by a binary vector t with a single 1 indicating the class, e.g. t = (0, …, 0, 1, 0, …, 0)^T. This gives K regression problems, one per class output y_k(x). To classify x, pick the class k with the largest y_k(x).
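
A short Python sketch of this recipe (the toy data, the bias-column trick, and the use of NumPy's least-squares solver are my assumptions, not details from the slide): encode labels as one-of-K vectors, solve the K least-squares problems jointly, and classify by the largest output y_k(x).

```python
import numpy as np

def one_of_k(labels, K):
    """Encode integer labels 0..K-1 as one-of-K (one-hot) target vectors."""
    T = np.zeros((len(labels), K))
    T[np.arange(len(labels)), labels] = 1.0
    return T

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))                  # toy inputs (invented)
labels = (X[:, 0] + X[:, 1] > 0).astype(int)  # toy 2-class problem
K = 2

Xb = np.hstack([np.ones((len(X), 1)), X])     # prepend a bias column
T = one_of_k(labels, K)
W, *_ = np.linalg.lstsq(Xb, T, rcond=None)    # solves the K regression problems at once

Y = Xb @ W                                    # y_k(x) for every input and class
pred = Y.argmax(axis=1)                       # pick the class with the largest y_k(x)
print("training accuracy:", (pred == labels).mean())
```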

Let's focus on binary classification Predict target t ∈ {0,1} from input vector x. Denote the m-th input of training case n by x_nm.

Classification boundary for linear regression Values of x where y(x) = 0.5 are ambiguous – these form the classification boundary. For these points, Σ_m w_m x_m = 0.5. If x is (M+1)-dimensional, this defines an M-dimensional hyperplane separating the classes.

How well does linear regression work? It works well in some cases, but there are two problems: due to linearity, extreme x's cause extreme y(x,w)'s; due to squared error, extreme y(x,w)'s dominate learning. [Comparison plots: linear regression vs. logistic regression (more later).]

Clipping off extreme y's To clip off extremes of Σ_m w_m x_m, we can use a sigmoid function: y(x) = σ(Σ_m w_m x_m), where σ(a) = 1/(1+e^(−a)). Now, squared error won't penalize extreme x's. But y is now a non-linear function of w, so learning is harder.

How the observed error propagates back to the parameters E(w) = ½ Σ_n ( t_n − σ(Σ_m w_m x_nm) )². The rate of change of E w.r.t. w_m is ∂E(w)/∂w_m = − Σ_n ( t_n − y_n ) y_n ( 1 − y_n ) x_nm, where y_n = σ(Σ_m w_m x_nm). – Useful fact: σ'(a) = σ(a) ( 1 − σ(a) ). Compare with linear regression: ∂E(w)/∂w_m = − Σ_n ( t_n − y_n ) x_nm.
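
The two gradients above, written out as a NumPy sketch (the toy data and array names are invented for illustration):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))       # rows are training cases n, columns are inputs x_nm
t = (X[:, 0] > 0).astype(float)    # toy binary targets (invented)
w = rng.normal(scale=0.1, size=3)  # some current parameter values

# Sigmoid + squared error: dE/dw_m = -sum_n (t_n - y_n) * y_n * (1 - y_n) * x_nm
y = sigmoid(X @ w)
grad_sigmoid_sq = -(X.T @ ((t - y) * y * (1 - y)))

# Plain linear regression: dE/dw_m = -sum_n (t_n - y_n) * x_nm
y_lin = X @ w
grad_linear = -(X.T @ (t - y_lin))

print(grad_sigmoid_sq, grad_linear)
```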

The effect of clipping Regression with sigmoid: ∂E(w)/∂w_m = − Σ_n ( t_n − y_n ) y_n ( 1 − y_n ) x_nm. Linear regression: ∂E(w)/∂w_m = − Σ_n ( t_n − y_n ) x_nm. For these outliers, both (t_n − y_n) ≈ 0 and y_n(1 − y_n) ≈ 0, so the outliers won't hold back improvement of the boundary.

Squared error for learning probabilities E = ½ (t − y)². If t = 0 and y ≈ 1, y is moderately pulled down (gradient ≈ 1). If t = 0 and y ≈ 0, y is weakly pulled down (gradient ≈ 0). [Plot of E versus y for t = 0: dE/dy = 1 at y = 1 and dE/dy = 0 at y = 0.] Problems: Being certainly wrong is often undesirable. Often, tiny differences between small probabilities count a lot.

Three approaches to pattern classification 1. Discriminative and non-probabilistic – Learn a discriminant function f(x), which maps x directly to a class label. 2. Discriminative and probabilistic – For each class k, learn the probability model p(C_k|x) – Use this probability to classify a new input x.

Logistic regression: Binary likelihood As before, we use y(x) = σ(Σ_m w_m x_m), where σ(a) = 1/(1+e^(−a)). Now, use a binary likelihood: p(t|x) = y(x)^t (1 − y(x))^(1−t). Data log-likelihood: L = Σ_n [ t_n ln σ(Σ_m w_m x_nm) + ( 1 − t_n ) ln ( 1 − σ(Σ_m w_m x_nm) ) ]. Unlike linear regression, L is nonlinear in the w's, so gradient-based optimizers are needed.
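
A minimal gradient-ascent sketch for maximizing this log-likelihood (the step size, iteration count, and toy data are assumptions, not values from the lecture); it uses the gradient ∂L/∂w_m = Σ_n (t_n − y_n) x_nm derived two slides below.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
X = np.hstack([np.ones((100, 1)), X])                       # bias input
t = (X[:, 1] + 0.5 * X[:, 2] + 0.3 * rng.normal(size=100) > 0).astype(float)

w = np.zeros(X.shape[1])
lr = 0.1                                                    # assumed step size
for _ in range(500):
    y = sigmoid(X @ w)
    w += lr * (X.T @ (t - y))                               # ascend the log-likelihood

# Report the final log-likelihood (clipping avoids log(0) for near-saturated outputs).
y = np.clip(sigmoid(X @ w), 1e-12, 1 - 1e-12)
log_lik = np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))
print("log-likelihood:", log_lik)
```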

Binary likelihood for learning probabilities E = −ln(1 − y) for t = 0. If t = 0 and y ≈ 1, y is strongly pulled down (gradient → ∞). If t = 0 and y ≈ 0, y is moderately pulled down (gradient ≈ 1). [Plot comparing E = −ln(1 − y) with the squared error E = ½ (t − y)² for t = 0.]

How the observed error propagates back to the parameters L = Σ_n [ t_n ln σ(Σ_m w_m x_nm) + ( 1 − t_n ) ln ( 1 − σ(Σ_m w_m x_nm) ) ]. The rate of change of L w.r.t. w_m is ∂L/∂w_m = Σ_n ( t_n − y_n ) x_nm. Compare with sigmoid plus squared error: ∂E(w)/∂w_m = − Σ_n ( t_n − y_n ) y_n ( 1 − y_n ) x_nm. Compare with linear regression: ∂E(w)/∂w_m = − Σ_n ( t_n − y_n ) x_nm.

How well does logistic regression work? [Comparison plots: logistic regression vs. linear regression.]

Multiclass logistic regression Create one set of weights per class and define y_k(x) = Σ_m w_km x_m. The K-class generalization of the sigmoid function is p(t|x) = exp(Σ_k t_k y_k(x)) / Σ_k exp(y_k(x)), which is equivalent to p(C_k|x) = exp(y_k(x)) / Σ_j exp(y_j(x)). Learning: similar to logistic regression (see textbook).
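
A short sketch of this softmax posterior (the max-subtraction is a standard numerical-stability trick, not something stated on the slide; the score values are invented):

```python
import numpy as np

def softmax(Y):
    """Row-wise p(C_k|x) = exp(y_k) / sum_j exp(y_j), computed stably."""
    Y = Y - Y.max(axis=1, keepdims=True)   # subtracting the row max leaves the result unchanged
    expY = np.exp(Y)
    return expY / expY.sum(axis=1, keepdims=True)

# Hypothetical per-class scores y_k(x) = sum_m w_km x_m for 3 inputs and K = 4 classes.
Y = np.array([[ 2.0, 0.5, -1.0, 0.0],
              [ 0.1, 0.2,  0.3, 0.4],
              [-3.0, 1.0,  1.0, 0.5]])
P = softmax(Y)
print(P, P.argmax(axis=1))   # posteriors and the predicted classes
```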

Three approaches to pattern classification 3. Generative – For each class k, learn the generative probability model p(x|C_k) – Also, learn the class probabilities p(C_k) – Use Bayes' rule to classify a new input: p(C_k|x) = p(x|C_k) p(C_k) / p(x), where p(x) = Σ_j p(x|C_j) p(C_j).

Gaussian generative models We can assume each element of x is independent and Gaussian, given the class: p(x|C_k) = Π_m p(x_m|C_k) = Π_m N(x_m | μ_km, σ_km²). [Contour plot of the density, centered at (μ_k1, μ_k2) with widths σ_k1, σ_k2; for an isotropic Gaussian, σ_k1² = σ_k2².]

Learning a Buffalo digit classifier (5000 training cases) The generative ML estimates of μ_km and σ_km² are just the per-class data means and variances. [Images of the learned means and variances for each class; black = low variance, white = high variance.] The classes are equally frequent, so p(C_k) = 1/10. To classify a new input x, compute (in the log-domain!) ln p(C_k) + Σ_m ln N(x_m | μ_km, σ_km²) for each class and pick the largest.

A problem with the ML estimate Some pixels were constant across all training images within a class, so σ²_ML = 0. This causes numerical problems when evaluating Gaussian densities, but it is also an overfitting problem. Common hack: add σ²_min to all variances. More principled approaches: – Regularize σ² – Place a prior on σ² and use MAP – Place a prior on σ² and use Bayesian learning
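
A sketch of the whole generative classifier, including the σ²_min hack from this slide (the helper names, the toy stand-in data, and the 0.01 floor used as a default are assumptions for illustration; the lecture's results are on the real 16×16 Buffalo digits):

```python
import numpy as np

def fit_gaussian_nb(X, labels, K, var_min=0.01):
    """ML estimates of per-class, per-pixel means and variances, with a variance floor."""
    M = X.shape[1]
    mu, var, prior = np.zeros((K, M)), np.zeros((K, M)), np.zeros(K)
    for k in range(K):
        Xk = X[labels == k]
        mu[k] = Xk.mean(axis=0)
        var[k] = Xk.var(axis=0) + var_min   # add sigma^2_min so no variance is exactly zero
        prior[k] = len(Xk) / len(X)
    return mu, var, prior

def classify(X, mu, var, prior):
    """Pick argmax_k of the log joint: ln p(C_k) + sum_m ln N(x_m | mu_km, var_km)."""
    log_joint = np.log(prior)[None, :] + np.stack([
        (-0.5 * np.log(2 * np.pi * var[k])
         - (X - mu[k]) ** 2 / (2 * var[k])).sum(axis=1)
        for k in range(len(prior))
    ], axis=1)
    return log_joint.argmax(axis=1)

# Toy stand-in for the digit data: 500 random 256-dimensional "images" in [0, 1].
rng = np.random.default_rng(3)
X = rng.random((500, 256))
labels = rng.integers(0, 10, size=500)
mu, var, prior = fit_gaussian_nb(X, labels, K=10)
print("training error:", (classify(X, mu, var, prior) != labels).mean())
```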

Classifying test data (5000 test cases) Adding σ²_min = 0.01 to the variances, we obtain: – Error rate on training set = 16.00% (std dev .5%) – Error rate on test set = 16.72% (std dev .5%) [Plot of training and test error rates versus log₁₀ σ²_min.]

Full-covariance Gaussian models Let x = Λy, where y is isotropic Gaussian and Λ is an M×M rotation and scale matrix. This generates a full-covariance Gaussian. Defining Σ = (Λ^(−T) Λ^(−1))^(−1), we obtain p(x|C_k) = N(x | μ_k, Σ_k) = (2π)^(−M/2) |Σ_k|^(−1/2) exp( −½ (x − μ_k)^T Σ_k^(−1) (x − μ_k) ), where Σ is the covariance matrix, Σ_jk = COV(x_j, x_k), and |Σ_k| is its determinant.
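
A hedged sketch of fitting and evaluating a full-covariance Gaussian (the slogdet/solve route is my choice for numerical stability; the toy data follow the x = Λy construction on the slide with an invented Λ):

```python
import numpy as np

def fit_gaussian(X):
    """ML estimates: mean vector and full covariance matrix Sigma_jk = Cov(x_j, x_k)."""
    mu = X.mean(axis=0)
    Sigma = np.cov(X, rowvar=False, bias=True)   # ML (1/N) covariance estimate
    return mu, Sigma

def log_density(X, mu, Sigma):
    """log N(x | mu, Sigma) = -0.5 * [ M*log(2*pi) + log|Sigma| + (x-mu)^T Sigma^{-1} (x-mu) ]."""
    M = len(mu)
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.einsum('ni,ni->n', diff, np.linalg.solve(Sigma, diff.T).T)
    return -0.5 * (M * np.log(2 * np.pi) + logdet + maha)

# Toy data: x = Lambda @ y with isotropic Gaussian y, as on the slide.
rng = np.random.default_rng(4)
Lam = np.array([[2.0, 0.5], [0.0, 1.0]])
X = rng.normal(size=(1000, 2)) @ Lam.T
mu, Sigma = fit_gaussian(X)
print(Sigma)                    # should be close to Lam @ Lam.T
print(log_density(X[:3], mu, Sigma))
```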

Generative models easily induce non-linear decision boundaries The following three-class problem shows how three axis-aligned Gaussians can induce nonlinear decision boundaries

Generative models easily induce non-linear decision boundaries Two Gaussians can be used to account for inliers and outliers

How are we doing on the pass sequence? Bayesian regression and estimation enable us to track the man in the striped shirt based on labeled data. Can we track the man in the white shirt? Not very well. [Scatter plot: feature x vs. hand-labeled horizontal coordinate t.] Regression fails to identify that there really are two classes of solution.

Using classification to improve tracking The position of the man in the striped shirt can be used to classify the tracking mode of the man in the white shirt. [Plots: position of man in white shirt vs. feature; position of man in striped shirt vs. feature.]

Using classification to improve tracking x_s and t_s = feature and position of man in striped shirt; x_w and t_w = feature and position of man in white shirt. For the man in the white shirt, hand-label two regions (classes) and learn two trackers, p(t_w|x_w, C_1) and p(t_w|x_w, C_2). [Plots: p(t_s|x_s) as a function of feature x_s; the class posteriors p(C_1|t_s) and p(C_2|t_s) as functions of position t_s; the two trackers as functions of feature x_w.]

Using classification to improve tracking The classifier can be obtained using the generative approach, where each class-conditional likelihood p(t_s|C_k) is a Gaussian. Note: linear classifiers won't work. [Plots: p(t_s|x_s), the class-conditional likelihoods p(t_s|C_k), and the class posteriors p(C_k|t_s).]

Questions?

How are we doing on the pass sequence? We can now track both men, provided with –Hand-labeled coordinates of both men in 30 frames –Hand-extracted features (stripe detector, white blob detector) –Hand-labeled classes for the white-shirt tracker We have a framework for how to optimally make decisions and track the men

How are we doing on the pass sequence? We can now track both men, provided with –Hand-labeled coordinates of both men in 30 frames –Hand-extracted features (stripe detector, white blob detector) –Hand-labeled classes for the white-shirt tracker We have a framework for how to optimally make decisions and track the men This takes too much time to do by hand!

Lecture 4 Appendix

Binary classification regions for linear regression R_1 is defined by y(x) = w^T x + w_0 ≥ 0, and vice versa for R_2. Values of x satisfying y(x) = 0 are on the decision boundary, which is a (D−1)-dimensional hyperplane. – w specifies the orientation of the decision hyperplane – −w_0/||w|| specifies the distance from the hyperplane to the origin – The distance from input x to the hyperplane is y(x)/||w||
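
A tiny numeric check of this geometry (w, w_0, and the test point are made-up values):

```python
import numpy as np

w = np.array([3.0, 4.0])       # orientation of the decision hyperplane (made up)
w0 = -5.0                      # bias; the boundary is w.x + w0 = 0

def y(x):
    return w @ x + w0

x = np.array([2.0, 1.0])
norm_w = np.linalg.norm(w)

print("distance from hyperplane to origin:", -w0 / norm_w)    # -w0 / ||w||
print("signed distance from x to hyperplane:", y(x) / norm_w) # y(x) / ||w||
```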

K-ary classification regions for linear regression x ∈ R_k if y_k(x) > y_j(x) for all j ≠ k. Each resulting classification region is contiguous and has linear boundaries.

Fisher's linear discriminant and least squares Fisher: viewing y = w^T x as a projection, pick w to maximize the distance between the means of the data sets, while also minimizing the variances of the data sets. This result is also obtained using linear regression, by setting t = N/N_1 for class 1 and t = −N/N_2 for class 2, where N_k = # training cases in class k.
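
A sketch comparing the two routes on toy data (the data are invented; the Fisher direction is computed as S_W^(−1)(m_2 − m_1), a standard form not spelled out on the slide). Up to sign and scale, the least-squares weight vector obtained with these special targets should align with the Fisher direction.

```python
import numpy as np

rng = np.random.default_rng(5)
X1 = rng.normal(loc=[0.0, 0.0], scale=[1.0, 2.0], size=(40, 2))   # class 1 (toy)
X2 = rng.normal(loc=[3.0, 1.0], scale=[1.0, 2.0], size=(60, 2))   # class 2 (toy)
N1, N2 = len(X1), len(X2)
N = N1 + N2

# Fisher direction: w proportional to S_W^{-1} (m2 - m1), with S_W the within-class scatter.
m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
w_fisher = np.linalg.solve(S_W, m2 - m1)

# Least squares with the special targets t = N/N1 (class 1) and t = -N/N2 (class 2).
X = np.vstack([X1, X2])
Xb = np.hstack([np.ones((N, 1)), X])
t = np.concatenate([np.full(N1, N / N1), np.full(N2, -N / N2)])
w_ls = np.linalg.lstsq(Xb, t, rcond=None)[0][1:]   # drop the bias component

cos = (w_fisher @ w_ls) / (np.linalg.norm(w_fisher) * np.linalg.norm(w_ls))
print("cosine between the two directions:", cos)   # close to +/-1, depending on sign convention
```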

In what sense is logistic regression linear? The log-odds can be written thus: ln[ y(x) / (1 − y(x)) ] = Σ_m w_m x_m. Each input contributes linearly to the log-odds.

Gaussian likelihoods and logistic regression For two classes, if their covariance matrices are equal, Σ_1 = Σ_2 = Σ, we can write the log-odds as ln[ p(C_1|x) / p(C_2|x) ] = w^T x + w_0, where w = Σ^(−1)(μ_1 − μ_2) and w_0 = −½ μ_1^T Σ^(−1) μ_1 + ½ μ_2^T Σ^(−1) μ_2 + ln[ p(C_1)/p(C_2) ]. So p(C_1|x) = σ(w^T x + w_0): classifiers using equal-covariance Gaussian generative models form a subset of logistic regression classifiers.
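
A numeric sanity check of this equivalence (all parameter values below are invented): build two Gaussians with a shared covariance, compute the exact posterior by Bayes' rule, and compare it with σ(w^T x + w_0) using the w and w_0 given above.

```python
import numpy as np
from scipy.stats import multivariate_normal

mu1, mu2 = np.array([1.0, 0.0]), np.array([-1.0, 1.0])   # invented class means
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])               # shared covariance
p1, p2 = 0.4, 0.6                                         # invented class priors

Si = np.linalg.inv(Sigma)
w = Si @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Si @ mu1 + 0.5 * mu2 @ Si @ mu2 + np.log(p1 / p2)

x = np.array([0.5, -0.2])
# Exact posterior from Bayes' rule.
l1 = p1 * multivariate_normal.pdf(x, mu1, Sigma)
l2 = p2 * multivariate_normal.pdf(x, mu2, Sigma)
posterior = l1 / (l1 + l2)
# Logistic form.
logistic = 1.0 / (1.0 + np.exp(-(w @ x + w0)))
print(posterior, logistic)   # the two numbers agree
```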

Linear models, or classifiers with linear boundaries Don't be fooled: such classifiers can be very hard to learn, and they may have boundaries that are highly nonlinear in x (e.g., via basis functions). All this means is that in some space the boundaries are linear.