1
580.704 Mathematical Foundations of BME
Reza Shadmehr
Topics: classification via regression; the Fisher linear discriminant; the Bayes classifier; confidence and error rate of the Bayes classifier.
2
Classification via regression
Suppose we wish to classify a vector x as belonging to either class C0 or C1. We can approach the problem as if it were regression: code the class label as y = 0 or y = 1 and fit a regression model to the labeled data.
3
Classification via regression
Model: a linear function of the inputs, $\hat{y} = w_0 + w_1 x_1 + w_2 x_2$, fit to the 0/1 class labels. [Figure: data points in the (x1, x2) plane and the fitted regression surface; the vertical axis is y, ranging from about -0.5 to 1.5.]
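A minimal sketch of this idea, using synthetic two-dimensional data and a least-squares fit to 0/1 labels (none of the numbers come from the slides):

```python
import numpy as np

# Minimal sketch: classification via least-squares regression on 0/1 labels.
# Two synthetic 2-D Gaussian classes (illustrative data, not from the slides).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))   # class C0, label y = 0
X1 = rng.normal(loc=[+1.5, +1.5], scale=1.0, size=(50, 2))   # class C1, label y = 1
X = np.vstack([X0, X1])
y = np.concatenate([np.zeros(50), np.ones(50)])

# Design matrix with an intercept column, then ordinary least squares.
A = np.column_stack([np.ones(len(X)), X])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

# Classify by thresholding the fitted value at 0.5.
y_hat = A @ w
labels = (y_hat > 0.5).astype(int)
print("training accuracy:", np.mean(labels == y))
```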
4
Classification via regression: concerns
Model: the same linear regression fit to the 0/1 labels. [Figure: two example datasets with the fitted function shown as a dotted line. Left: this classification looks good. Right: this one not so good.] The problem in the example on the right is that there are many more red points than green points. Each red point's distance from the dotted line is the error in that point, and because the variance of the red points is larger, the regression error summed over the red points is much larger than over the green points, pulling the fit toward the red class. A further concern: sometimes an x can give us a y that is outside our range (outside 0-1).
5
Classification via regression: concerns
Model: the same linear regression on the 0/1 labels. Since y is a random variable that can take on only the values 0 or 1, the regression error will not be normally distributed. Moreover, the variance of the error (which equals the variance of y given x) depends on x: if p(x) is the probability that y = 1 given x, then $\operatorname{Var}(y \mid x) = p(x)\bigl(1 - p(x)\bigr)$. This is unlike ordinary regression, where the error variance is assumed constant.
6
Regression as projection
A linear regression function projects each data point onto a line. Each data point $x^{(n)} = [x_1, x_2]^\top$ is projected onto the direction of the weight vector $w$: $z^{(n)} = w^\top x^{(n)}$. For a given $w$, there will be a specific distribution of the projected points $z = \{z^{(1)}, z^{(2)}, \ldots, z^{(n)}\}$, and we can study how well the projected points separate into the two classes. [Figure: the same data projected onto different choices of w, giving different distributions of z.]
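A small sketch of the projection view, again with synthetic data; the direction w below is an arbitrary illustrative choice, not a value from the slides:

```python
import numpy as np

# Minimal sketch: project 2-D points onto a direction w and examine how the
# projected values z = w^T x distribute within each class (illustrative data).
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[-1.0, -1.0], scale=1.0, size=(50, 2))   # class 0
X1 = rng.normal(loc=[+1.5, +1.5], scale=1.0, size=(50, 2))   # class 1

w = np.array([1.0, 1.0])          # one candidate projection direction
z0 = X0 @ w                        # projected points for class 0
z1 = X1 @ w                        # projected points for class 1

# Class-conditional statistics of the projected points: well-separated means
# and small within-class variances indicate a good direction w.
print("class 0: mean %.2f, var %.2f" % (z0.mean(), z0.var()))
print("class 1: mean %.2f, var %.2f" % (z1.mean(), z1.var()))
```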
7
Fisher discriminant analysis
Suppose we wish to classify vector x as belonging to either class C0 or C1.
Class y = 0: $n_0$ points, mean $m_0$, variance $S_0$.
Class y = 1: $n_1$ points, mean $m_1$, variance $S_1$.
Class descriptions in the classification (or projected) space: the projected mean $\mu_k = w^\top m_k$ and projected variance $\sigma_k^2 = w^\top S_k\, w$ of each class (i.e., the mean and variance of $\hat{y}$ for the x's that belong to class 0 or class 1).
8
Fisher discriminant analysis
Find w so that when each point is projected into the classification space, the classes are maximally separated. [Figure: two projection directions for the same data, one giving large separation between the projected classes, the other small separation.]
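For reference, the standard form of Fisher's criterion in the notation above (the original slides may weight the two class variances slightly differently, for example by the class sizes $n_0$ and $n_1$) is the ratio of the squared separation of the projected means to the summed projected variances:

```latex
% Fisher's criterion: between-class separation over within-class spread of the
% projected points (standard form; the slide's exact weighting may differ).
J(w) \;=\; \frac{(\mu_1 - \mu_0)^2}{\sigma_0^2 + \sigma_1^2}
      \;=\; \frac{\bigl(w^\top (m_1 - m_0)\bigr)^2}{\,w^\top (S_0 + S_1)\, w\,}
```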
9
Fisher discriminant analysis
The matrix $S$ in the denominator of $J$ is symmetric positive definite, so we can always write it as $S = R^\top R$, where $R$ is a "square root" matrix. Using $R$, change the coordinate system of $J$ from $w$ to $v = R\,w$ (so $w = R^{-1} v$): the denominator becomes $w^\top S\, w = v^\top v$, while the numerator becomes $\bigl(v^\top R^{-\top}(m_1 - m_0)\bigr)^2$.
10
Fisher discriminant analysis
The dot product of a unit-norm vector with another vector is maximized when the two point in the same direction. $J$ is therefore maximized when $v$ is parallel to $R^{-\top}(m_1 - m_0)$, which gives, up to an arbitrary constant, $w \propto S^{-1}(m_1 - m_0)$. Note that to do this analysis, the entire data set needs to be mean zero. Otherwise, the $S_0$ and $S_1$ matrices will not be invertible, as they will be positive semi-definite.
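A minimal sketch of the resulting Fisher discriminant recipe (synthetic data; the covariance values and the midpoint threshold are illustrative choices, not taken from the slides):

```python
import numpy as np

# Minimal sketch of the Fisher linear discriminant for two classes
# (illustrative data; the overall scaling of w is arbitrary).
rng = np.random.default_rng(1)
X0 = rng.multivariate_normal([-1.0, -1.0], [[1.0, 0.4], [0.4, 1.0]], size=60)
X1 = rng.multivariate_normal([+1.5, +1.0], [[1.0, 0.4], [0.4, 1.0]], size=60)

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)          # class means
S0 = np.cov(X0, rowvar=False)                       # class covariance, class 0
S1 = np.cov(X1, rowvar=False)                       # class covariance, class 1

# Fisher direction: w proportional to (S0 + S1)^(-1) (m1 - m0)
w = np.linalg.solve(S0 + S1, m1 - m0)

# Project both classes onto w; a simple threshold is the midpoint of the
# projected class means (assuming roughly equal class sizes and spreads).
z0, z1 = X0 @ w, X1 @ w
threshold = 0.5 * (z0.mean() + z1.mean())
acc = (np.mean(z0 < threshold) + np.mean(z1 >= threshold)) / 2
print("w =", w, " balanced accuracy = %.2f" % acc)
```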
11
Bayesian classification
Suppose we wish to classify vector x as belonging to one of the classes {1, …, L}. We are given labeled data and need to form a classification function from the posterior probability given by Bayes rule,

$p(C_l \mid x) = \dfrac{p(x \mid C_l)\, p(C_l)}{p(x)},$

where $p(x \mid C_l)$ is the likelihood, $p(C_l)$ the prior, and $p(x)$ the marginal. Classify x into the class l that maximizes the posterior probability.
12
Classification when distributions have equal variance
Suppose we wish to classify a person as male or female based on height. What we have: the class-conditional densities p(height | female) and p(height | male). What we want: the posterior probability of each class given the height. Assume equal prior probability of being male or female. [Figure: the two class-conditional height densities (female and male) over roughly 160-200 cm; note that the two densities have equal variance.]
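A minimal sketch of this example, assuming Gaussian class-conditional densities; the means and the shared standard deviation below are illustrative values, not taken from the slides:

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch of the height example. The means and the shared standard
# deviation below are illustrative assumptions, not values from the slides.
mu_female, mu_male, sigma = 165.0, 178.0, 7.0
prior_female = prior_male = 0.5                    # equal priors

def posterior_male(height):
    """P(male | height) via Bayes rule with equal-variance Gaussian likelihoods."""
    like_m = norm.pdf(height, mu_male, sigma)
    like_f = norm.pdf(height, mu_female, sigma)
    marginal = like_m * prior_male + like_f * prior_female
    return like_m * prior_male / marginal

for h in (160, 171.5, 185):
    print("height %.1f cm -> P(male | height) = %.3f" % (h, posterior_male(h)))
```

With these assumed numbers the posterior crosses 0.5 at the midpoint of the two means, which is the decision boundary discussed on the next slide.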
13
Classification when distributions have equal variance
[Figure: the posterior probabilities of the two classes as functions of height, and the class-conditional densities, with the decision boundary marked where the posteriors cross.] To classify, we really don't need to compute the posterior probabilities. All we need is the ratio

$\dfrac{p(x \mid C_0)\, P(C_0)}{p(x \mid C_1)\, P(C_1)}.$

If this ratio is greater than 1, then we choose class 0, otherwise class 1. The boundaries between classes occur where the ratio is 1; in other words, the boundary occurs where the log of the ratio is 0.
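A short worked version of the "Decision boundary =" expression, assuming Gaussian likelihoods with means $\mu_0$ and $\mu_1$, a common variance $\sigma^2$, and equal priors:

```latex
% Log posterior ratio for two Gaussians with equal variance and equal priors:
\log\frac{p(x \mid C_0)\,P(C_0)}{p(x \mid C_1)\,P(C_1)}
  = -\frac{(x-\mu_0)^2}{2\sigma^2} + \frac{(x-\mu_1)^2}{2\sigma^2}
  = \frac{(\mu_0-\mu_1)}{\sigma^2}\,x + \frac{\mu_1^2-\mu_0^2}{2\sigma^2}.
% Setting the log ratio to zero gives a single linear boundary at the midpoint:
x^{*} = \frac{\mu_0 + \mu_1}{2}.
```

With unequal priors the log of the prior ratio is added, which shifts the boundary toward the mean of the less probable class.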
14
Uncertainty of the classifier
Starting with our likelihood and prior, we compute a posterior probability distribution as a function of x. For two classes this is a Bernoulli (binomial) distribution over the class label, and we can compute its variance: $\operatorname{Var}(y \mid x) = p(C_1 \mid x)\bigl(1 - p(C_1 \mid x)\bigr)$. [Figure: the posterior probability and its variance plotted against height; the variance peaks at 0.25.] Classification is most uncertain at the decision boundary, where the posterior equals 0.5 and the variance is largest.
15
Classification when distributions have unequal variance
What we have: class-conditional densities for the two classes that now have unequal variances. Assume: equal prior probabilities, as before. Classification: choose the class with the larger posterior probability. [Figure: the two class-conditional densities, the resulting posterior probabilities, and the posterior variance as functions of height.]
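To make the link to the discontinuous decision regions mentioned on a later slide explicit, here is the log ratio for Gaussian likelihoods with unequal variances (equal priors assumed for brevity; this derivation is an addition, not text from the slide):

```latex
% With unequal variances the log posterior ratio is quadratic in x
% (equal priors assumed for brevity):
\log\frac{p(x \mid C_0)}{p(x \mid C_1)}
  = \log\frac{\sigma_1}{\sigma_0}
    - \frac{(x-\mu_0)^2}{2\sigma_0^2}
    + \frac{(x-\mu_1)^2}{2\sigma_1^2}.
% Setting this to zero gives a quadratic equation with up to two roots,
% so the decision region for one class can be two disjoint intervals.
```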
16
Bayes error rate: Probability of misclassification
[Figure: the two weighted class densities with the decision boundary marked; one shaded region is the probability of data belonging to C0 that we classify as C1, the other the probability of data belonging to C1 that we classify as C0.]

$P(\text{error}) = \int_{R_1} p(x \mid C_0)\,P(C_0)\,dx \;+\; \int_{R_0} p(x \mid C_1)\,P(C_1)\,dx,$

where $R_l$ denotes the region of x assigned to class l. In general, it is actually quite hard to compute P(error), because we will need to integrate over decision regions that may be discontinuous (for example, when the distributions have unequal variances). To help with this, there is the Chernoff bound.
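A minimal numerical sketch of this expression for the equal-variance version of the height example (the means, shared standard deviation, and equal priors are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: P(error) for the equal-variance height example, where the
# single decision boundary is the midpoint of the means (equal priors).
# The means and shared standard deviation are illustrative assumptions.
mu0, mu1, sigma = 165.0, 178.0, 7.0
prior0 = prior1 = 0.5
boundary = 0.5 * (mu0 + mu1)

# P(error) = P(C0) * P(x > boundary | C0) + P(C1) * P(x < boundary | C1)
p_error = (prior0 * (1.0 - norm.cdf(boundary, mu0, sigma))
           + prior1 * norm.cdf(boundary, mu1, sigma))
print("P(error) = %.4f" % p_error)
```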
17
Bayes error rate: Chernoff bound
In the two-class classification problem, we note that the classification error depends on the area under the minimum of the two weighted densities,

$P(\text{error}) = \int \min\bigl[\,p(x \mid C_0)\,P(C_0),\; p(x \mid C_1)\,P(C_1)\,\bigr]\,dx,$

which is the same as integrating the smaller of the two posterior probabilities against $p(x)$. [Figure: the two posterior probabilities as functions of height; at each x the error is governed by the smaller of the two.]
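A minimal numerical sketch of this "area under the minimum" form for the unequal-variance case (parameters are illustrative assumptions); note that no explicit decision regions are needed:

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: numerical P(error) via the "area under the minimum" form,
# for the unequal-variance case (parameters are illustrative assumptions).
# No explicit decision regions are needed: just integrate the pointwise minimum.
mu0, sigma0, prior0 = 165.0, 6.0, 0.5
mu1, sigma1, prior1 = 178.0, 9.0, 0.5

x = np.linspace(120.0, 230.0, 20001)
joint0 = prior0 * norm.pdf(x, mu0, sigma0)
joint1 = prior1 * norm.pdf(x, mu1, sigma1)
p_error = np.trapz(np.minimum(joint0, joint1), x)
print("estimated P(error) = %.4f" % p_error)
```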
18
Bayes error rate: Chernoff bound
To bound the minimum, we will need the following inequality: for any $a, b > 0$ and any $0 \le \beta \le 1$,

$\min[a, b] \;\le\; a^{\beta}\, b^{\,1-\beta}.$

To see why, suppose without loss of generality that $b$ is smaller than $a$. Then $a/b > 1$, and

$a^{\beta} b^{\,1-\beta} = \left(\tfrac{a}{b}\right)^{\beta} b \;\ge\; b = \min[a, b].$

So we can think of the term $a^{\beta} b^{\,1-\beta}$ (for any value of $\beta$ in $[0,1]$) as an upper bound on $\min[a, b]$. Returning to our P(error) problem, we can replace the min[ ] function with this inequality:

$P(\text{error}) \;\le\; P(C_0)^{\beta}\, P(C_1)^{\,1-\beta} \int p(x \mid C_0)^{\beta}\, p(x \mid C_1)^{\,1-\beta}\, dx.$

The bound is found by numerically finding the value of $\beta$ that minimizes the right-hand side. The key benefit is that our search is in the one-dimensional space of $\beta$, and we have also gotten rid of the discontinuous decision regions.
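A minimal sketch of the one-dimensional search over β for the same illustrative two-Gaussian example (a plain grid search; the integral is evaluated numerically rather than in closed form):

```python
import numpy as np
from scipy.stats import norm

# Minimal sketch: Chernoff bound on P(error) for the two-Gaussian example,
# found by a 1-D grid search over beta (parameters are illustrative).
mu0, sigma0, prior0 = 165.0, 6.0, 0.5
mu1, sigma1, prior1 = 178.0, 9.0, 0.5

x = np.linspace(120.0, 230.0, 20001)
p0 = norm.pdf(x, mu0, sigma0)
p1 = norm.pdf(x, mu1, sigma1)

def chernoff_bound(beta):
    """P(C0)^beta * P(C1)^(1-beta) * integral of p0^beta * p1^(1-beta) dx."""
    return (prior0 ** beta) * (prior1 ** (1.0 - beta)) * \
           np.trapz(p0 ** beta * p1 ** (1.0 - beta), x)

betas = np.linspace(0.01, 0.99, 99)
bounds = np.array([chernoff_bound(b) for b in betas])
best = np.argmin(bounds)
print("best beta = %.2f, Chernoff bound on P(error) = %.4f"
      % (betas[best], bounds[best]))
```

Comparing this bound with the numerical P(error) from the previous sketch shows how tight the bound is for a given pair of densities.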