Mathematical Foundations of BME


580.704 Mathematical Foundations of BME
Reza Shadmehr
Classification via regression. Fisher linear discriminant. Bayes classifier. Confidence and error rate of the Bayes classifier.

Classification via regression. Suppose we wish to classify vector x as belonging to either class C0 or C1. We can approach the problem as if it were regression: code each class with a target value (y = 0 for C0, y = 1 for C1), fit a linear model to the targets, and classify according to the fitted value.
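A minimal sketch of this idea in Python with NumPy (the synthetic data, variable names, and the 0.5 threshold are my assumptions, not part of the slides):

import numpy as np

rng = np.random.default_rng(0)

# Synthetic two-class data in 2-D (assumed for illustration).
X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))   # class C0, target y = 0
X1 = rng.normal(loc=[+1.5, +1.5], scale=1.0, size=(80, 2))   # class C1, target y = 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(len(X0)), np.ones(len(X1))])

# Least-squares fit of y on [1, x1, x2].
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classify by thresholding the fitted value at 0.5.
y_hat = A @ w
labels = (y_hat > 0.5).astype(int)
print("training accuracy:", np.mean(labels == y))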

Classification via regression. Model: \hat{y} = w_0 + w_1 x_1 + w_2 x_2, fit to targets y \in \{0, 1\}. [Figure: fitted regression surface over the (x1, x2) plane; the fitted y ranges from about -0.5 to 1.5.]

Classification via regression: concerns. [Figure: two scatter plots in the (x1, x2) plane, each with the regression-based decision line drawn as a dotted line. Left: this classification looks good. Right: this one not so good.] The problem is that in the example on the right there are many more red points than green. Each red point's distance from the dotted line is the error contributed by that point, so the errors of the red points sum to a much larger value than those of the green points, because the variance of the red points is larger; the fit is pulled toward the red class. In addition, some values of x can give a fitted y that falls outside our target range of 0 to 1.

Classification via regression: concerns. Model: \hat{y} = w_0 + w^T x with y \in \{0, 1\}. Since y is a random variable that can take on only the values 0 or 1, the error in the regression will not be normally distributed. Moreover, the variance of the error (which equals the variance of y) depends on x, unlike in ordinary linear regression, where the noise variance is assumed constant.
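To make these two points explicit (a short derivation in my own notation, writing p(x) for the probability that y = 1 given x):

\[
y \mid x \;\sim\; \mathrm{Bernoulli}\bigl(p(x)\bigr),
\qquad \mathrm{E}[y \mid x] = p(x),
\qquad \mathrm{Var}[y \mid x] = p(x)\bigl(1 - p(x)\bigr),
\]

so the noise is not Gaussian and its variance changes with x, vanishing only where p(x) equals 0 or 1.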

Regression as projection. A linear regression function projects each data point: each data point x^{(n)} = [x_1, x_2]^T is projected onto the direction w, giving the scalar z^{(n)} = w^T x^{(n)}. For a given w, there will be a specific distribution of the projected points z = \{z^{(1)}, z^{(2)}, \dots, z^{(n)}\}. We can study how well the projected points are distributed into classes. [Figure: the same two-class data in the (x1, x2) plane projected onto three different directions, each giving a different distribution of projected points.]
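As a concrete illustration (a sketch only; the data and the two candidate directions are my assumptions):

import numpy as np

rng = np.random.default_rng(2)
X0 = rng.normal([-1.0, 0.0], 1.0, size=(50, 2))   # class 0
X1 = rng.normal([+2.0, 1.0], 1.0, size=(50, 2))   # class 1

# Two candidate projection directions (unit vectors).
for w in (np.array([1.0, 0.0]), np.array([0.0, 1.0])):
    z0, z1 = X0 @ w, X1 @ w                       # projected points z = w^T x
    gap = abs(z1.mean() - z0.mean())
    spread = z0.std() + z1.std()
    print(f"w = {w}: separation of projected class means = {gap:.2f}, "
          f"within-class spread = {spread:.2f}")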

Fisher discriminant analysis. Suppose we wish to classify vector x as belonging to either class C0 or C1.
Class y = 0: n_0 points, mean m_0, covariance S_0.
Class y = 1: n_1 points, mean m_1, covariance S_1.
Class descriptions in the classification (projected) space: after projection, each class is summarized by its projected mean and variance, \mu_i = w^T m_i and \sigma_i^2 = w^T S_i w (i.e., the mean and variance of \hat{y} for the x's that belong to class 0 or class 1).

Fisher discriminant analysis. Find w so that when each point is projected into the classification space, the classes are maximally separated. This is expressed by the Fisher criterion, the ratio of the separation of the projected class means to the spread of the projected classes:
J(w) = \frac{(\mu_1 - \mu_0)^2}{\sigma_0^2 + \sigma_1^2} = \frac{\bigl(w^T (m_1 - m_0)\bigr)^2}{w^T (S_0 + S_1) w}.
[Figure: two projection directions applied to the same data, one giving large separation of the projected classes, the other small separation.]

Fisher discriminant analysis. Let S = S_0 + S_1. S is symmetric positive definite, so we can always write it as S = R^T R, where R is a "square root" matrix of S. Using R, change the coordinate system of J from w to v = R w:
J(v) = \frac{\bigl(v^T R^{-T} (m_1 - m_0)\bigr)^2}{v^T v}.

Fisher discriminant analysis. J(v) depends on v only through its direction: the dot product of a vector of norm 1 with another vector is maximized when the two have the same direction. The maximizing v is therefore proportional to R^{-T}(m_1 - m_0), and transforming back,
w = c \,(S_0 + S_1)^{-1} (m_1 - m_0),
where c is an arbitrary constant (only the direction of w matters). Note that to do this analysis the entire data set needs to be mean zero; otherwise the S_0 and S_1 matrices will not be invertible, as they will only be positive semi-definite.
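A minimal NumPy sketch of this result (the synthetic data and variable names are my own, not from the slides):

import numpy as np

rng = np.random.default_rng(1)

# Two classes of 2-D points (assumed synthetic data for illustration).
X0 = rng.multivariate_normal([0.0, 0.0], [[2.0, 0.8], [0.8, 1.0]], size=60)
X1 = rng.multivariate_normal([3.0, 2.0], [[1.5, -0.4], [-0.4, 1.2]], size=90)

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S0 = np.cov(X0, rowvar=False) * (len(X0) - 1)    # class scatter matrices
S1 = np.cov(X1, rowvar=False) * (len(X1) - 1)

# Fisher direction: w is proportional to (S0 + S1)^{-1} (m1 - m0).
w = np.linalg.solve(S0 + S1, m1 - m0)
w = w / np.linalg.norm(w)                        # the scale of w is arbitrary

# Projected class statistics: the means should be well separated relative to the spread.
z0, z1 = X0 @ w, X1 @ w
print("projected means:", z0.mean(), z1.mean())
print("projected stds: ", z0.std(), z1.std())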

Bayesian classification. Suppose we wish to classify vector x as belonging to one of the classes {1, …, L}. We are given labeled data and need to form a classification function based on the posterior probability of each class,
P(C_l \mid x) = \frac{p(x \mid C_l)\, P(C_l)}{p(x)},
where p(x \mid C_l) is the likelihood, P(C_l) is the prior, and p(x) = \sum_l p(x \mid C_l) P(C_l) is the marginal. Classify x into the class l that maximizes the posterior probability.

Classification when distributions have equal variance. Suppose we wish to classify a person as male or female based on height x.
What we have: the class-conditional densities p(x | male) and p(x | female), each Gaussian with its own mean but the same variance.
What we want: the posterior probabilities P(male | x) and P(female | x).
Assume equal prior probability of being male or female: P(male) = P(female) = 1/2.
[Figure: the female and male class-conditional height densities over roughly 160 to 200 cm, and the resulting mixture density. Note that the two densities have equal variance.]
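A small sketch of this computation (the particular means, variance, and test heights are my assumptions, not values from the slides):

import numpy as np
from scipy.stats import norm

# Assumed class-conditional densities: equal variance, different means.
mu_f, mu_m, sigma = 165.0, 180.0, 7.0
prior_f = prior_m = 0.5

def posterior_male(x):
    """P(male | x) by Bayes rule for the two-class height example."""
    num_m = norm.pdf(x, mu_m, sigma) * prior_m
    num_f = norm.pdf(x, mu_f, sigma) * prior_f
    return num_m / (num_m + num_f)

for x in (160.0, 172.5, 190.0):
    print(f"height {x} cm -> P(male | x) = {posterior_male(x):.3f}")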

Classification when distributions have equal variance. [Figure: the class-conditional densities with the decision boundary marked, and the corresponding posterior probabilities as a function of height.] To classify, we really don't need to compute the posterior probability. All we need is the ratio
\frac{P(C_0 \mid x)}{P(C_1 \mid x)} = \frac{p(x \mid C_0)\, P(C_0)}{p(x \mid C_1)\, P(C_1)}.
If this ratio is greater than 1, we choose class 0; otherwise class 1. The boundary between classes occurs where the ratio is 1; in other words, the boundary occurs where the log of the ratio is 0.
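For the equal-variance, equal-prior Gaussian case this boundary can be worked out in closed form (the algebra below is implied by the slide rather than shown on it):

\[
\log\frac{p(x \mid C_0)}{p(x \mid C_1)}
= \frac{(x-\mu_1)^2 - (x-\mu_0)^2}{2\sigma^2}
= \frac{\mu_0-\mu_1}{\sigma^2}\left(x - \frac{\mu_0+\mu_1}{2}\right)
= 0
\;\;\Longrightarrow\;\;
x^{*} = \frac{\mu_0+\mu_1}{2},
\]

so with equal priors and equal variances the decision boundary lies midway between the two class means.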

Uncertainty of the classifier. Starting with our likelihood and prior, we compute the posterior probability P(C_1 \mid x) as a function of x. For each x, the class label follows a binomial (Bernoulli) distribution with this probability, so we can compute the variance of this distribution:
\mathrm{Var} = P(C_1 \mid x)\bigl(1 - P(C_1 \mid x)\bigr).
The variance is at most 0.25, reached where the posterior equals 0.5, so classification is most uncertain at the decision boundary. [Figure: the posterior P(C_1 | x) and its variance as a function of height; the variance peaks at 0.25 at the decision boundary.]

Classification when distributions have unequal variance.
What we have: class-conditional densities p(x | C_0) and p(x | C_1) that are Gaussian with different means and different variances.
Classification: as before, compare the posteriors, i.e., the class-conditional densities weighted by the priors.
Assume: equal prior probability of belonging to either class, as before.
[Figure: the two class-conditional densities with unequal variances, the resulting posterior as a function of height, and the posterior variance. Because the variances differ, the log of the ratio is quadratic in x, so the decision regions need not be contiguous.]
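A sketch of finding the boundaries numerically in this case (the means, variances, and priors are assumptions for illustration):

import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

# Assumed class-conditional densities with unequal variances.
mu0, sigma0 = 165.0, 6.0     # class 0
mu1, sigma1 = 180.0, 12.0    # class 1
p0 = p1 = 0.5                # equal priors

def log_ratio(x):
    """log[ p(x|C0) P(C0) / (p(x|C1) P(C1)) ]; zero at a decision boundary."""
    return (np.log(norm.pdf(x, mu0, sigma0) * p0)
            - np.log(norm.pdf(x, mu1, sigma1) * p1))

# The log-ratio is quadratic in x, so there can be two roots (two boundaries).
grid = np.linspace(100, 250, 2001)
vals = log_ratio(grid)
sign_change = np.nonzero(np.diff(np.sign(vals)))[0]
roots = [brentq(log_ratio, grid[i], grid[i + 1]) for i in sign_change]
print("decision boundaries (cm):", [round(r, 1) for r in roots])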

Bayes error rate: probability of misclassification. [Figure: the two weighted class-conditional densities with the decision boundary marked; the shaded area on one side is the probability of data belonging to C_0 that we classify as C_1, and the shaded area on the other side is the probability of data belonging to C_1 that we classify as C_0.]
P(\mathrm{error}) = \int_{R_1} p(x \mid C_0)\, P(C_0)\, dx + \int_{R_0} p(x \mid C_1)\, P(C_1)\, dx,
where R_0 and R_1 are the decision regions. In general it is actually quite hard to compute P(error), because we need to integrate the posterior probabilities over decision regions that may be discontinuous (for example, when the distributions have unequal variances). To help with this, there is the Chernoff bound.
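A numerical sketch of P(error) for the unequal-variance example above (same assumed parameters; it uses the fact, stated on the next slide, that the error is the area under the smaller of the two weighted densities):

import numpy as np
from scipy.stats import norm

mu0, sigma0, mu1, sigma1 = 165.0, 6.0, 180.0, 12.0   # assumed parameters
p0 = p1 = 0.5

x = np.linspace(100, 260, 20001)
weighted0 = norm.pdf(x, mu0, sigma0) * p0
weighted1 = norm.pdf(x, mu1, sigma1) * p1

# P(error) = integral over x of min[ p(x|C0) P(C0), p(x|C1) P(C1) ].
p_error = np.trapz(np.minimum(weighted0, weighted1), x)
print(f"Bayes error rate (numerical): {p_error:.4f}")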

Bayes error rate: Chernoff bound. In the two-class classification problem, we note that the classification error depends on the area under the minimum of the two posterior probabilities:
P(\mathrm{error}) = \int \min\bigl[\, p(x \mid C_0) P(C_0),\; p(x \mid C_1) P(C_1) \,\bigr]\, dx.
[Figure: the posterior probabilities of the two classes as a function of height; the error corresponds to the area under the smaller of the two curves.]

Bayes error rate: Chernoff bound. To bound the minimum, we will need the following inequality: for a, b \ge 0 and 0 \le \beta \le 1,
\min[a, b] \le a^{\beta} b^{1-\beta}.
To see why, note that a^{\beta} b^{1-\beta} = b\,(a/b)^{\beta}, and, without loss of generality, suppose that b is smaller than a. Then a/b > 1, and we have b\,(a/b)^{\beta} \ge b = \min[a, b]. So we can think of the term a^{\beta} b^{1-\beta} (for any value of \beta in [0, 1]) as an upper bound on \min[a, b]. Returning to our P(error) problem, we can replace the min[.] function with this inequality:
P(\mathrm{error}) \le P(C_0)^{\beta} P(C_1)^{1-\beta} \int p(x \mid C_0)^{\beta}\, p(x \mid C_1)^{1-\beta}\, dx.
The bound is found by numerically finding the value of \beta that minimizes the above expression. The key benefit here is that our search is in the one-dimensional space of \beta, and we also got rid of the discontinuous decision regions.
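A numerical sketch of the bound for the same assumed Gaussian example (minimizing over beta with a bounded scalar search; all parameters are illustrative assumptions):

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize_scalar

mu0, sigma0, mu1, sigma1 = 165.0, 6.0, 180.0, 12.0   # assumed parameters
p0 = p1 = 0.5

x = np.linspace(100, 260, 20001)
f0 = norm.pdf(x, mu0, sigma0)
f1 = norm.pdf(x, mu1, sigma1)

def chernoff_bound(beta):
    """Upper bound on P(error): P0^beta * P1^(1-beta) * integral of f0^beta f1^(1-beta)."""
    return (p0 ** beta) * (p1 ** (1 - beta)) * np.trapz(f0 ** beta * f1 ** (1 - beta), x)

res = minimize_scalar(chernoff_bound, bounds=(0.0, 1.0), method="bounded")
print(f"Chernoff bound: {res.fun:.4f} at beta = {res.x:.3f}")

Comparing this value with the numerically integrated Bayes error rate from the previous sketch shows how tight (or loose) the bound is for a given pair of densities.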