
1 Bayesian Learning Reading: C. Haruechaiyasak, “A tutorial on naive Bayes classification” (linked from class website)

2 Conditional Probability The probability of an event, given that some other event has occurred: P(A | B) = P(A ∩ B) / P(B), provided P(B) > 0.

3 Example Consider choosing a card from a well-shuffled standard deck of 52 playing cards. Given that the first card chosen is an ace, what is the probability that the second card chosen will be an ace?

4 Example (continued) Given that the first card chosen is an ace, 51 cards remain and 3 of them are aces, so the probability that the second card is also an ace is P(second ace | first ace) = 3/51 = 1/17 ≈ 0.059.

5 Deriving Bayes Rule From the definition of conditional probability, P(A | B) = P(A ∩ B) / P(B) and P(B | A) = P(A ∩ B) / P(A). Solving both for P(A ∩ B) and equating gives Bayes rule: P(A | B) = P(B | A) P(A) / P(B).

6 Bayesian Learning

7 Application to Machine Learning In machine learning we have a space H of hypotheses h1, h2, ..., hn and a set D of data. We want to calculate P(h | D) for each hypothesis h.

8 Terminology – Prior probability of h, P(h): probability that hypothesis h is true given our prior knowledge. If there is no prior knowledge, all h ∈ H are equally probable. – Posterior probability of h, P(h | D): probability that hypothesis h is true, given the data D. – Likelihood of D, P(D | h): probability that we will see data D, given that hypothesis h is true.

9 Bayes Rule: Machine Learning Formulation P(h | D) = P(D | h) P(h) / P(D).

10 The Monty Hall Problem You are a contestant on a game show. There are 3 doors, A, B, and C. There is a new car behind one of them and goats behind the other two. Monty Hall, the host, knows what is behind the doors. He asks you to pick a door, any door. You pick door A. Monty tells you he will open a door, different from A, that has a goat behind it. He opens door B: behind it there is a goat. Monty now gives you a choice: Stick with your original choice A or switch to C. Should you switch? http://math.ucsd.edu/~crypto/Monty/monty.html

11 Bayesian probability formulation Hypothesis space H: h1 = Car is behind door A, h2 = Car is behind door B, h3 = Car is behind door C. Data D = Monty opened B to show a goat. What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?

12 Event space The event space is the set of all possible configurations of cars and goats behind doors A, B, and C. X = Car behind door A, Y = Goat behind door B.

13 Bayes Rule: Event space X = Car behind door A, Y = Goat behind door B.

14 Using Bayes’ Rule to solve the Monty Hall problem

15 You pick door A. Data D = Monty opened door B to show a goat.
Hypothesis space H: h1 = Car is behind door A, h2 = Car is behind door C, h3 = Car is behind door B.
What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?
Prior probabilities: P(h1) = 1/3, P(h2) = 1/3, P(h3) = 1/3.
Likelihoods: P(D | h1) = 1/2, P(D | h2) = 1, P(D | h3) = 0.
P(D) = P(D | h1) P(h1) + P(D | h2) P(h2) + P(D | h3) P(h3) = 1/6 + 1/3 + 0 = 1/2.
By Bayes rule: P(h1 | D) = P(D | h1) P(h1) / P(D) = (1/2)(1/3) / (1/2) = 1/3, and P(h2 | D) = P(D | h2) P(h2) / P(D) = (1)(1/3) / (1/2) = 2/3.
So you should switch!
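The same calculation can be written out directly. A minimal Python sketch (not part of the original slides) that reproduces the Monty Hall posteriors:

```python
# Minimal sketch: Monty Hall posteriors via Bayes rule.
# Hypotheses follow the slide: h1 = car behind A (your pick),
# h2 = car behind C, h3 = car behind B. Data D = Monty opens B and shows a goat.

priors = {"h1": 1/3, "h2": 1/3, "h3": 1/3}
# Likelihood of D under each hypothesis:
#   car behind A -> Monty opens B or C with equal chance -> 1/2
#   car behind C -> Monty must open B                     -> 1
#   car behind B -> Monty never opens B                   -> 0
likelihoods = {"h1": 1/2, "h2": 1.0, "h3": 0.0}

p_d = sum(likelihoods[h] * priors[h] for h in priors)           # P(D) = 1/2
posteriors = {h: likelihoods[h] * priors[h] / p_d for h in priors}
print(posteriors)   # {'h1': 0.333..., 'h2': 0.666..., 'h3': 0.0} -> switch to C
```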

16 MAP (“maximum a posteriori”) Learning Bayes rule: P(h | D) = P(D | h) P(h) / P(D). Goal of learning: find the maximum a posteriori hypothesis h_MAP = argmax_{h ∈ H} P(h | D) = argmax_{h ∈ H} P(D | h) P(h) / P(D) = argmax_{h ∈ H} P(D | h) P(h), because P(D) is a constant independent of h.

17 Note: If every h ∈ H is equally probable, then h_MAP = argmax_{h ∈ H} P(D | h). This is called the “maximum likelihood hypothesis,” h_ML.

18 A Medical Example Toby takes a test for leukemia. The test has two outcomes: positive and negative. It is known that if the patient has leukemia, the test is positive 98% of the time. If the patient does not have leukemia, the test is positive 3% of the time. It is also known that 0.8% (0.008) of the population has leukemia. Toby’s test is positive. Which is more likely: Toby has leukemia or Toby does not have leukemia?

19 Hypothesis space: h1 = Toby has leukemia, h2 = Toby does not have leukemia. Prior: 0.008 of the population has leukemia, so P(h1) = 0.008 and P(h2) = 0.992. Likelihood: P(+ | h1) = 0.98, P(− | h1) = 0.02, P(+ | h2) = 0.03, P(− | h2) = 0.97. Posterior knowledge: the blood test is + for this patient.

20 In summary P(h1) = 0.008, P(h2) = 0.992, P(+ | h1) = 0.98, P(− | h1) = 0.02, P(+ | h2) = 0.03, P(− | h2) = 0.97. Thus: P(+ | h1) P(h1) = 0.98 × 0.008 ≈ 0.0078 and P(+ | h2) P(h2) = 0.03 × 0.992 ≈ 0.0298, so h_MAP = h2: Toby most probably does not have leukemia.

21 What is P(leukemia | +)? Normalizing: P(h1 | +) = 0.0078 / (0.0078 + 0.0298) ≈ 0.21 and P(h2 | +) = 0.0298 / (0.0078 + 0.0298) ≈ 0.79. These are called the “posterior” probabilities.
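A short Python sketch (added here, not from the slides) that reproduces these numbers:

```python
# Minimal sketch: posterior probability of leukemia given a positive test.
p_h1, p_h2 = 0.008, 0.992          # priors: leukemia, no leukemia
p_pos_h1, p_pos_h2 = 0.98, 0.03    # likelihood of a positive test under each

p_pos = p_pos_h1 * p_h1 + p_pos_h2 * p_h2        # P(+) = 0.0376
p_h1_pos = p_pos_h1 * p_h1 / p_pos               # P(leukemia | +)    ~ 0.21
p_h2_pos = p_pos_h2 * p_h2 / p_pos               # P(no leukemia | +) ~ 0.79
print(round(p_h1_pos, 3), round(p_h2_pos, 3))
```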

22 In-Class Exercises 1-2

23 Quiz 5 (this Thursday) What you need to know: the definition of Information Gain and how to calculate it; Bayes rule and how to derive it from the definition of conditional probability; how to solve problems similar to the in-class exercises using Bayes rule. Note: the Naïve Bayes algorithm won’t be on this quiz. No calculator needed.

24 In-class exercises 2b-2c

25 Bayesianism vs. Frequentism Classical (frequentist) probability: –The probability of a particular event is defined relative to its frequency in a sample space of events. –E.g., the probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses. Bayesian probability: –Combines a measure of “prior” belief you have in a proposition with your subsequent observations of events. Example: a Bayesian can assign a probability to the statement “There was life on Mars a billion years ago,” but a frequentist cannot.

26 Independence and Conditional Independence Two random variables, X and Y, are independent if P(X, Y) = P(X) P(Y). Two random variables, X and Y, are independent given Z if P(X, Y | Z) = P(X | Z) P(Y | Z). Examples?
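To make the first definition concrete, here is a small Python check on an invented joint distribution (two independent fair coin flips); the distribution and variable names are illustrative only:

```python
# Check independence on a toy joint distribution over two binary variables.
import itertools

# P(X, Y) for two fair, independent coin flips: every pair has probability 1/4.
p_xy = {(x, y): 0.25 for x, y in itertools.product([0, 1], repeat=2)}
p_x = {x: sum(p_xy[(x, y)] for y in [0, 1]) for x in [0, 1]}   # marginal P(X)
p_y = {y: sum(p_xy[(x, y)] for x in [0, 1]) for y in [0, 1]}   # marginal P(Y)

independent = all(abs(p_xy[(x, y)] - p_x[x] * p_y[y]) < 1e-12
                  for x in [0, 1] for y in [0, 1])
print(independent)   # True: P(X, Y) = P(X) P(Y) for every pair of values
```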

27 Naive Bayes Classifier Let f(x) be a target function for classification: f(x) ∈ {+1, −1}. Let x = <x1, x2, ..., xn> be a vector of attribute values. We want to find the most probable class value, given the data x: class_MAP = argmax_class P(class | x1, x2, ..., xn).

28 By Bayes Theorem: class_MAP = argmax_class P(x1, x2, ..., xn | class) P(class) / P(x1, x2, ..., xn) = argmax_class P(x1, x2, ..., xn | class) P(class). P(class) can be estimated from the training data. How? However, it is in general not practical to use the training data to estimate P(x1, x2, ..., xn | class). Why not?

29 Naive Bayes classifier: Assume P(x1, x2, ..., xn | class) = ∏i P(xi | class), i.e., the attributes are conditionally independent given the class. Is this a good assumption? Given this assumption, here’s how to classify an instance x = <x1, x2, ..., xn>: class_NB(x) = argmax_class P(class) ∏i P(xi | class). Estimate the values of these various probabilities over the training set.

30 Training data:
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Test data:
D15  Sunny     Cool  High      Strong  ?

31 In practice, use training data to compute a probabilistic model:

32 Day  Outlook  Temp  Humidity  Wind    PlayTennis
D15  Sunny    Cool  High      Strong  ?

33 In practice, use training data to compute a probabilistic model and apply it to the test instance:
Day  Outlook  Temp  Humidity  Wind    PlayTennis
D15  Sunny    Cool  High      Strong  ?
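As an illustration, the following Python sketch (not from the slides) estimates these probabilities from the 14 training rows and classifies D15; variable names and structure are my own:

```python
from collections import Counter, defaultdict

# (Outlook, Temp, Humidity, Wind, PlayTennis) rows D1..D14 from the table above.
train = [
    ("Sunny","Hot","High","Weak","No"),     ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),  ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No"),
]
test = ("Sunny", "Cool", "High", "Strong")          # D15

class_counts = Counter(row[-1] for row in train)    # {'Yes': 9, 'No': 5}
cond_counts = defaultdict(Counter)                  # (feature index, class) -> value counts
for row in train:
    for i, value in enumerate(row[:-1]):
        cond_counts[(i, row[-1])][value] += 1

scores = {}
for c, n_c in class_counts.items():
    score = n_c / len(train)                        # P(class)
    for i, value in enumerate(test):
        score *= cond_counts[(i, c)][value] / n_c   # P(x_i = value | class)
    scores[c] = score
print(scores, max(scores, key=scores.get))          # naive Bayes prediction for D15
```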

34 In-class exercise 3a

35 Estimating probabilities / Smoothing Recap: In the previous example, we had a training set and a new example, <Outlook = Sunny, Temp = Cool, Humidity = High, Wind = Strong>. We asked: what classification is given by a naive Bayes classifier? Let n(c) be the number of training instances with class c, and n(xi = ai, c) be the number of training instances with attribute value xi = ai and class c. Then P(xi = ai | c) is estimated as n(xi = ai, c) / n(c).

36 Problem with this method: if n(c) is very small, it gives a poor estimate. E.g., P(Outlook = Overcast | no) = 0.

37 Now suppose we want to classify a new instance whose Outlook value is Overcast. Then the product P(no) ∏i P(xi | no) contains the factor P(Outlook = Overcast | no) = 0, so the estimate for class “no” is zero. This incorrectly gives us zero probability due to the small sample.

38 One solution: Laplace smoothing (also called “add-one” smoothing). For each class cj and attribute xi with value ai, add one “virtual” instance. That is, recalculate: P(xi = ai | cj) = (n(xi = ai, cj) + 1) / (n(cj) + k), where k is the number of possible values of attribute xi.

39 Training data: the same 14 rows (D1–D14) as above.
Add virtual instances for Outlook:
Outlook = Sunny: Yes; Outlook = Overcast: Yes; Outlook = Rain: Yes
Outlook = Sunny: No; Outlook = Overcast: No; Outlook = Rain: No
With add-one smoothing, P(Outlook = Overcast | No) goes from 0/5 to (0 + 1) / (5 + 3) = 1/8.
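A minimal Python sketch of the smoothed estimate, assuming k = 3 possible Outlook values:

```python
def laplace_estimate(count_value_and_class, count_class, k):
    """Add-one (Laplace) smoothed estimate of P(x_i = a_i | class)."""
    return (count_value_and_class + 1) / (count_class + k)

# From the table: 0 of the 5 'No' days have Outlook = Overcast; Outlook has k = 3 values.
print(laplace_estimate(0, 5, 3))   # 0.125, i.e. 1/8 instead of 0
```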

40 Etc.

41 In-class exercise 3b

42 Naive Bayes on continuous-valued attributes How to deal with continuous-valued attributes? Two possible solutions: –Discretize –Assume a particular probability distribution of the attribute values within each class (estimate its parameters from the training data)

43 Simplest discretization method For each attribute xi, create k equal-size bins in the interval from min(xi) to max(xi), and choose thresholds between the bins. Example with Humidity values 25, 38, 50, 80, 90, 92, 96, 99 and thresholds 40, 80, 120: estimate P(Humidity < 40 | yes), P(40 ≤ Humidity < 80 | yes), P(80 ≤ Humidity < 120 | yes), and likewise P(Humidity < 40 | no), P(40 ≤ Humidity < 80 | no), P(80 ≤ Humidity < 120 | no).
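A small Python sketch of equal-width binning (illustrative only; the slide’s hand-chosen thresholds of 40, 80, 120 will not match the automatic cut points exactly):

```python
def equal_width_bins(values, k):
    """Split [min(values), max(values)] into k equal-width bins; return a bin index per value."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Values equal to the maximum fall into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

humidity = [25, 38, 50, 80, 90, 92, 96, 99]
print(equal_width_bins(humidity, 3))   # bin index (0..2) for each humidity value
```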

44 Questions: What should k be? What if some bins have very few instances? There is a tradeoff between discretization bias and variance: the more bins, the lower the bias, but the higher the variance, due to the small sample size per bin.

45 Alternative simple (but effective) discretization method (Yang & Webb, 2001) Let n = number of training examples. For each attribute Ai, create approximately √n bins: sort the values of Ai in ascending order and put about √n of them in each bin. Add-one smoothing of the probabilities is not needed. This gives a good balance between discretization bias and variance.

46 Alternative simple (but effective) discretization method (Yang & Webb, 2001), applied to Humidity: 25, 38, 50, 80, 90, 92, 96, 99 (sort the values and create about √n bins with about √n values each).
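A sketch of this equal-frequency scheme, assuming the rule lost from the slide is the paper’s √n heuristic (about √n bins containing about √n sorted values each):

```python
import math

def equal_frequency_bins(values):
    """Yang & Webb-style binning: ~sqrt(n) sorted values per bin (assumed rule)."""
    ordered = sorted(values)
    size = max(1, round(math.sqrt(len(ordered))))
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

humidity = [25, 38, 50, 80, 90, 92, 96, 99]
print(equal_frequency_bins(humidity))
# e.g. [[25, 38, 50], [80, 90, 92], [96, 99]] with n = 8, sqrt(n) ~ 3
```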

47 In-class exercise 4

48 Gaussian Naïve Bayes Assume that within each class c, the values of each numeric feature xi are normally distributed: P(xi | c) = N(xi; μi,c, σi,c) = 1/(σi,c √(2π)) exp(−(xi − μi,c)² / (2 σi,c²)), where μi,c is the mean of feature i given class c and σi,c is the standard deviation of feature i given class c. We estimate μi,c and σi,c from the training data.

49 Example
f1   f2   Class
3.0  5.1  +
4.1  6.3  +
7.2  9.8  +
2.0  1.1  −
4.1  2.0  −
8.1  9.4  −

50 Example (continued): using the same data table, estimate μi,c and σi,c for each feature and each class. Fill in the rest…

51 N1,+ = N(x; 4.8, 1.8), N1,− = N(x; 4.7, 2.5) http://homepage.stat.uiowa.edu/~mbognar/applets/normal.html

52 Now, suppose you have a new example x, with f1 = 5.2, f2 = 6.3. What is class_NB(x)?

53 Now, suppose you have a new example x, with f1 = 5.2, f2 = 6.3. What is class_NB(x)?

54 Now, suppose you have a new example x, with f1 = 5.2, f2 = 6.3. What is class_NB(x)?
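The whole computation can be sketched in Python. The code below (not from the slides) estimates per-class means and population standard deviations, which is consistent with the N(x; 4.8, 1.8) and N(x; 4.7, 2.5) quoted earlier, and then scores the new example:

```python
import math
from collections import defaultdict

data = [  # (f1, f2, class) rows from the table above
    (3.0, 5.1, "+"), (4.1, 6.3, "+"), (7.2, 9.8, "+"),
    (2.0, 1.1, "-"), (4.1, 2.0, "-"), (8.1, 9.4, "-"),
]

def gaussian(x, mu, sigma):
    """Normal density N(x; mu, sigma)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# Estimate per-class mean and (population) standard deviation of each feature.
by_class = defaultdict(list)
for f1, f2, c in data:
    by_class[c].append((f1, f2))

params = {}   # params[c][i] = (mu, sigma) for feature i given class c
for c, rows in by_class.items():
    params[c] = []
    for i in range(2):
        vals = [r[i] for r in rows]
        mu = sum(vals) / len(vals)
        sigma = math.sqrt(sum((v - mu) ** 2 for v in vals) / len(vals))
        params[c].append((mu, sigma))

x = (5.2, 6.3)                     # new example
scores = {}
for c, rows in by_class.items():
    score = len(rows) / len(data)  # P(class)
    for i, xi in enumerate(x):
        mu, sigma = params[c][i]
        score *= gaussian(xi, mu, sigma)
    scores[c] = score
print(scores, max(scores, key=scores.get))   # '+' has the higher score here
```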

55 Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani) The naive Bayes classifier is called “naive” because it assumes attributes are independent of one another.

56 This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

57 Experiments Compare five classification methods on 30 data sets from the UCI ML database. SBC = Simple Bayesian Classifier Default = “Choose class with most representatives in data” C4.5 = Quinlan’s decision tree induction system PEBLS = An instance-based learning system CN2 = A rule-induction system

58 For SBC, numeric values were discretized into ten equal-length intervals.

59 (Results table from the paper: accuracy of each classifier on the 30 UCI domains.)

60 Summary measures reported: –Number of domains in which SBC was more accurate versus less accurate than the corresponding classifier. –Same as line 1, but counting only differences significant at the 95% confidence level. –Average rank over all domains (1 is best in each domain).

61 Measuring Attribute Dependence They used a simple, pairwise mutual information measure: for attributes Am and An, dependence is defined as D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C), where AmAn is a “derived attribute” whose values consist of the possible combinations of values of Am and An. Note: if Am and An are independent given the class C, then D(Am, An | C) = 0.
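A possible implementation of this measure, under the assumption that the lost formula is the class-conditional mutual information D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C); the toy data at the end is invented for illustration:

```python
import math
from collections import Counter

def cond_entropy(pairs):
    """H(A | C) from a list of (a, c) pairs (empirical distribution)."""
    n = len(pairs)
    joint = Counter(pairs)
    c_marg = Counter(c for _, c in pairs)
    return -sum(p / n * math.log2((p / n) / (c_marg[c] / n)) for (a, c), p in joint.items())

def dependence(am, an, cls):
    """Assumed form: D(Am, An | C) = H(Am|C) + H(An|C) - H(AmAn|C); 0 iff conditionally independent."""
    derived = [((a, b), c) for a, b, c in zip(am, an, cls)]   # the 'derived attribute' AmAn
    return (cond_entropy(list(zip(am, cls))) + cond_entropy(list(zip(an, cls)))
            - cond_entropy(derived))

# Tiny invented example: A2 copies A1, so dependence given C should be > 0.
a1 = [0, 0, 1, 1]; a2 = [0, 0, 1, 1]; c = ["x", "y", "x", "y"]
print(dependence(a1, a2, c))   # positive (1 bit here)
```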

62 Results: (1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes. (2) There is no correlation between the degree of attribute dependence and SBC’s rank. But why?

63 Explanation: Suppose C = {+1, −1} are the possible classes. Let x be a new example with attributes <x1, x2, ..., xn>. What the naive Bayes classifier does is calculate two quantities, its estimates of P(+1 | x) and P(−1 | x), and return the class that has the maximum probability given x.

64 The probability calculations are correct only if the independence assumption is correct. However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct! The latter covers a lot more cases than the former. Thus, the SBC is effective in many cases in which the independence assumption does not hold.

