
Jeff Howbert, Introduction to Machine Learning, Winter 2012
Slide 1: Classification - Bayesian Classifiers

Slide 2: Bayesian classification
• A probabilistic framework for solving classification problems.
  – Used where class assignment is not deterministic, i.e. a particular set of attribute values will sometimes be associated with one class, sometimes with another.
  – Requires estimation of the posterior probability p(C_i | x) for each class C_i, given a set of attribute values x.
  – Decision theory is then used to make predictions for a new sample x.

Slide 3: Bayesian classification
• Conditional probability: p(C | A) = p(A, C) / p(A) and p(A | C) = p(A, C) / p(C)
• Bayes theorem: p(C | A) = p(A | C) p(C) / p(A)
  – p(C | A) is the posterior probability, p(A | C) the likelihood, p(C) the prior probability, and p(A) the evidence.

Slide 4: Example of Bayes theorem
• Given:
  – A doctor knows that meningitis causes stiff neck 50% of the time.
  – The prior probability of any patient having meningitis is 1/50,000.
  – The prior probability of any patient having a stiff neck is 1/20.
• If a patient has a stiff neck, what is the probability that he/she has meningitis?
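A worked answer (not spelled out in the transcript, but it follows directly from Bayes theorem on slide 3), with M = meningitis and S = stiff neck:

  p(M | S) = p(S | M) p(M) / p(S) = (0.5 × 1/50,000) / (1/20) = 0.0002

So even given a stiff neck, the probability of meningitis is only about 1 in 5,000.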

Slide 5: Bayesian classifiers
• Treat each attribute and the class label as random variables.
• Given a sample x with attributes (x_1, x_2, …, x_n):
  – The goal is to predict class C.
  – Specifically, we want to find the value of C_i that maximizes p(C_i | x_1, x_2, …, x_n).
• Can we estimate p(C_i | x_1, x_2, …, x_n) directly from data?

Slide 6: Bayesian classifiers
Approach:
• Compute the posterior probability p(C_i | x_1, x_2, …, x_n) for each value of C_i using Bayes theorem:

  p(C_i | x_1, x_2, …, x_n) = p(x_1, x_2, …, x_n | C_i) p(C_i) / p(x_1, x_2, …, x_n)

• Choose the value of C_i that maximizes p(C_i | x_1, x_2, …, x_n).
• This is equivalent to choosing the value of C_i that maximizes p(x_1, x_2, …, x_n | C_i) p(C_i). (We can ignore the denominator - why?)
• It is easy to estimate the priors p(C_i) from data. (How?)
• The real challenge: how to estimate p(x_1, x_2, …, x_n | C_i)?

Slide 7: Bayesian classifiers
• How to estimate p(x_1, x_2, …, x_n | C_i)?
• In the general case, where the attributes x_j have dependencies, this requires estimating the full joint distribution p(x_1, x_2, …, x_n | C_i) for each class C_i.
• There is almost never enough data to make such estimates with confidence.

Slide 8: Naïve Bayes classifier
• Assume independence among the attributes x_j when the class is given:

  p(x_1, x_2, …, x_n | C_i) = p(x_1 | C_i) p(x_2 | C_i) … p(x_n | C_i)

• It is usually straightforward and practical to estimate p(x_j | C_i) for all x_j and C_i.
• A new sample is classified to C_i if p(C_i) ∏_j p(x_j | C_i) is maximal (see the sketch below).
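Since the deck stays at the level of formulas, here is a minimal sketch of that decision rule. The dictionary layout and the predict function are illustrative choices, not course code; the probabilities are the ones quoted on slides 9 and 12 (discrete attributes only, Income omitted).

```python
# Minimal naive Bayes decision rule: pick the class maximizing the prior
# times the product of per-attribute conditional probabilities.
priors = {"No": 7 / 10, "Yes": 3 / 10}                 # class priors p(C_i)
conditionals = {                                       # p(x_j | C_i)
    "No":  {"Refund=No": 4 / 7, "Status=Married": 4 / 7},
    "Yes": {"Refund=No": 1.0,   "Status=Married": 0.0},
}

def predict(attribute_values):
    """Return the class C_i that maximizes p(C_i) * prod_j p(x_j | C_i)."""
    best_class, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for v in attribute_values:
            score *= conditionals[c][v]
        if score > best_score:
            best_class, best_score = c, score
    return best_class

print(predict(["Refund=No", "Status=Married"]))        # -> No
```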

Slide 9: How to estimate p(x_j | C_i) from data?
• Class priors: p(C_i) = N_i / N, e.g. p(No) = 7/10, p(Yes) = 3/10.
• For discrete attributes: p(x_j | C_i) = |x_ji| / N_i, where |x_ji| is the number of instances in class C_i having attribute value x_j.
  – Examples: p(Status = Married | No) = 4/7, p(Refund = Yes | Yes) = 0 (see the counting sketch below).
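A sketch of how these counts turn into estimates. The training table itself is not in the transcript; the ten records below are illustrative (Income omitted), but they were chosen so that they reproduce every statistic quoted on this slide.

```python
# Estimating class priors and discrete conditional probabilities by counting.
# These illustrative records reproduce: p(No)=7/10, p(Yes)=3/10,
# p(Status=Married | No)=4/7, p(Refund=Yes | Yes)=0.
from collections import Counter, defaultdict

records = [  # (Refund, Marital Status, class label "Evade")
    ("Yes", "Single",   "No"),  ("No", "Married",  "No"),
    ("No",  "Single",   "No"),  ("Yes", "Married", "No"),
    ("No",  "Divorced", "Yes"), ("No", "Married",  "No"),
    ("Yes", "Divorced", "No"),  ("No", "Single",   "Yes"),
    ("No",  "Married",  "No"),  ("No", "Single",   "Yes"),
]

class_counts = Counter(label for _, _, label in records)            # N_i
priors = {c: n / len(records) for c, n in class_counts.items()}     # p(C_i) = N_i / N

value_counts = defaultdict(Counter)                                 # |x_ji| per class
for refund, status, label in records:
    value_counts[label][("Refund", refund)] += 1
    value_counts[label][("Status", status)] += 1

def conditional(attr, value, label):
    """p(x_j = value | C_i = label) = |x_ji| / N_i."""
    return value_counts[label][(attr, value)] / class_counts[label]

print(priors)                                   # {'No': 0.7, 'Yes': 0.3}
print(conditional("Status", "Married", "No"))   # 0.571... = 4/7
print(conditional("Refund", "Yes", "Yes"))      # 0.0
```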

Slide 10: How to estimate p(x_j | C_i) from data?
• For continuous attributes:
  – Discretize the range into bins → replace with an ordinal attribute (a small sketch follows below).
  – Two-way split: (x_j < v) or (x_j ≥ v) → replace with a binary attribute.
  – Probability density estimation:
    · assume the attribute follows some standard parametric probability distribution (usually a Gaussian)
    · use the data to estimate the parameters of the distribution (e.g. mean and variance)
    · once the distribution is known, use it to estimate the conditional probability p(x_j | C_i)
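The first option can be sketched in a couple of lines; the income values and bin edges below are illustrative choices, not taken from the slides.

```python
# Discretize a continuous attribute into bins, then treat the bin index
# as an ordinal attribute.
import numpy as np

income = np.array([125, 100, 70, 120, 95, 60, 220, 85, 75, 90])  # in $1000s
bin_edges = [80, 100, 150]            # bins: <80, 80-100, 100-150, >=150
income_binned = np.digitize(income, bin_edges)
print(income_binned)                  # [2 2 0 2 1 0 3 1 0 1]
```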

Slide 11: How to estimate p(x_j | C_i) from data?
• Gaussian distribution:

  p(x_j | C_i) = (1 / sqrt(2π σ²_ji)) exp( -(x_j - μ_ji)² / (2 σ²_ji) )

  – one Gaussian (μ_ji, σ²_ji) for each (x_j, C_i) pair.
• For (Income | Class = No):
  – sample mean = 110
  – sample variance = 2975
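Plugging the quoted mean and variance into the normal density reproduces the value used on the next slide; a quick check (not course code):

```python
# Gaussian class-conditional density for Income given Class = No, using the
# sample mean (110) and sample variance (2975) quoted above.
import math

def gaussian_pdf(x, mean, variance):
    return math.exp(-((x - mean) ** 2) / (2 * variance)) / math.sqrt(2 * math.pi * variance)

print(gaussian_pdf(120, 110, 2975))   # ≈ 0.0072, the value used on slide 12
```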

Slide 12: Example of using the naïve Bayes classifier
Given a test record x = (Refund = No, Status = Married, Income = 120K):
• p(x | Class = No) = p(Refund = No | Class = No) × p(Married | Class = No) × p(Income = 120K | Class = No)
  = 4/7 × 4/7 × 0.0072 = 0.0024
• p(x | Class = Yes) = p(Refund = No | Class = Yes) × p(Married | Class = Yes) × p(Income = 120K | Class = Yes)
  = 1 × 0 × 1.2 × 10⁻⁹ = 0
• Since p(x | No) p(No) > p(x | Yes) p(Yes), we have p(No | x) > p(Yes | x), so the predicted class is No.

Slide 13: Naïve Bayes classifier
• Problem: if one of the conditional probabilities is zero, then the entire expression becomes zero.
• This is a significant practical problem, especially when training samples are limited.
• Ways to improve probability estimation (N_ic = count of value x_j in class C_i, N_c = count of class C_i):
  – Original estimate: p(x_j | C_i) = N_ic / N_c
  – Laplace correction: p(x_j | C_i) = (N_ic + 1) / (N_c + c), where c is the number of classes
  – m-estimate: p(x_j | C_i) = (N_ic + m·p) / (N_c + m), where p is a prior probability and m is a parameter
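A small sketch of the two corrections, applied to the zero count p(Refund = Yes | Yes) = 0/3 from slide 9; the m and p values used for the m-estimate are arbitrary illustrative choices.

```python
# Smoothing the raw count estimate so that a single zero count no longer
# forces the whole product to zero.
def laplace(n_ic, n_c, c):
    """Laplace correction: (N_ic + 1) / (N_c + c), c = number of classes."""
    return (n_ic + 1) / (n_c + c)

def m_estimate(n_ic, n_c, m, p):
    """m-estimate: (N_ic + m*p) / (N_c + m), p = prior, m = parameter."""
    return (n_ic + m * p) / (n_c + m)

print(0 / 3)                        # raw estimate: 0.0
print(laplace(0, 3, 2))             # 0.2
print(m_estimate(0, 3, 3, 0.1))     # 0.05 (with illustrative m = 3, p = 0.1)
```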

Slide 14: Example of Naïve Bayes classifier
• X: attributes, M: class = mammal, N: class = non-mammal.
• [Numeric details of this worked example are not included in the transcript.]
• Result: p(X | M) p(M) > p(X | N) p(N), so the example is classified as a mammal.

Slide 15: Summary of naïve Bayes
• Robust to isolated noise samples.
• Handles missing values by ignoring the sample during probability estimate calculations.
• Robust to irrelevant attributes.
• NOT robust to redundant attributes.
  – The independence assumption does not hold in this case.
  – Use other techniques such as Bayesian Belief Networks (BBN).

