Bayesian Learning Reading: C. Haruechaiyasak, “A tutorial on naive Bayes classification” (linked from class website)

Conditional Probability Probability of an event given the occurrence of some other event.
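In standard notation, the definition (assuming P(B) > 0) is: P(A | B) = P(A ∩ B) / P(B). Equivalently, P(A ∩ B) = P(A | B) P(B).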

Example Consider choosing a card from a well-shuffled standard deck of 52 playing cards. Given that the first card chosen is an ace, what is the probability that the second card chosen will be an ace?
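A worked answer: once the first ace has been removed, 3 of the remaining 51 cards are aces, so P(second card is an ace | first card is an ace) = 3/51 = 1/17 ≈ 0.059.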

Deriving Bayes Rule
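The derivation follows from the definition of conditional probability: P(A ∩ B) = P(A | B) P(B), and likewise P(A ∩ B) = P(B | A) P(A). Setting the two right-hand sides equal and dividing by P(B) (assumed nonzero) gives Bayes rule: P(A | B) = P(B | A) P(A) / P(B).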

Bayesian Learning

Application to Machine Learning In machine learning we have a space H of hypotheses: h1, h2, ..., hn. We also have a set D of data. We want to calculate P(h | D).

Terminology
– Prior probability of h, P(h): probability that hypothesis h is true given our prior knowledge. If we have no prior knowledge, all h ∈ H are equally probable.
– Posterior probability of h, P(h | D): probability that hypothesis h is true, given the data D.
– Likelihood of D, P(D | h): probability that we will see data D, given that hypothesis h is true.

Bayes Rule: Machine Learning Formulation
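In this notation, Bayes rule reads: P(h | D) = P(D | h) P(h) / P(D).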

The Monty Hall Problem You are a contestant on a game show. There are 3 doors, A, B, and C. There is a new car behind one of them and goats behind the other two. Monty Hall, the host, knows what is behind the doors. He asks you to pick a door, any door. You pick door A. Monty tells you he will open a door, different from A, that has a goat behind it. He opens door B: behind it there is a goat. Monty now gives you a choice: Stick with your original choice A or switch to C. Should you switch?

Bayesian probability formulation Hypothesis space H: h1 = car is behind door A; h2 = car is behind door B; h3 = car is behind door C. Data D = Monty opened door B to show a goat. What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?

Event space The event space is the set of all possible configurations of cars and goats behind doors A, B, and C. Define the events X = car behind door A and Y = goat behind door B.

Bayes Rule: Event space (diagram relating the events X = car behind door A and Y = goat behind door B)

Using Bayes’ Rule to solve the Monty Hall problem

You pick door A. Data D = Monty opened door B to show a goat.
Hypothesis space H: h1 = car is behind door A; h2 = car is behind door C; h3 = car is behind door B.
What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?
Prior probabilities: P(h1) = 1/3, P(h2) = 1/3, P(h3) = 1/3
Likelihoods: P(D | h1) = 1/2, P(D | h2) = 1, P(D | h3) = 0
P(D) = P(D|h1)P(h1) + P(D|h2)P(h2) + P(D|h3)P(h3) = 1/6 + 1/3 + 0 = 1/2
By Bayes rule:
P(h1 | D) = P(D|h1)P(h1) / P(D) = (1/2)(1/3) / (1/2) = 1/3
P(h2 | D) = P(D|h2)P(h2) / P(D) = (1)(1/3) / (1/2) = 2/3
So you should switch!
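The calculation can also be checked empirically. The following short Python simulation is an illustration that is not part of the original slides; the function name and trial count are arbitrary choices.

import random

def monty_hall(trials=100000):
    """Estimate win rates for the 'stay' and 'switch' strategies."""
    stay_wins = switch_wins = 0
    for _ in range(trials):
        doors = ['A', 'B', 'C']
        car = random.choice(doors)   # door hiding the car
        pick = 'A'                   # the contestant always picks door A
        # Monty opens a door that is neither the pick nor the car
        opened = random.choice([d for d in doors if d != pick and d != car])
        # switching means taking the remaining unopened door
        switched = next(d for d in doors if d != pick and d != opened)
        stay_wins += (pick == car)
        switch_wins += (switched == car)
    return stay_wins / trials, switch_wins / trials

print(monty_hall())   # roughly (0.333, 0.667), matching the Bayes computation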

MAP (“maximum a posteriori”) Learning Goal of learning: find the maximum a posteriori hypothesis h_MAP, i.e., the hypothesis that maximizes the posterior P(h | D) given by Bayes rule. Because P(D) is a constant independent of h, it can be dropped from the maximization.
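In symbols: h_MAP = argmax over h ∈ H of P(h | D) = argmax over h ∈ H of P(D | h) P(h) / P(D) = argmax over h ∈ H of P(D | h) P(h).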

Note: If every h ∈ H is equally probable a priori, then the prior P(h) is the same for all hypotheses and can also be dropped, so h_MAP reduces to the hypothesis that maximizes the likelihood P(D | h). This is called the “maximum likelihood hypothesis,” h_ML.
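In symbols: h_ML = argmax over h ∈ H of P(D | h).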

A Medical Example Toby takes a test for leukemia. The test has two outcomes: positive and negative. It is known that if the patient has leukemia, the test is positive 98% of the time. If the patient does not have leukemia, the test is positive 3% of the time. It is also known that 0.8% of the population has leukemia. Toby’s test is positive. Which is more likely: Toby has leukemia or Toby does not have leukemia?

Hypothesis space: h1 = Toby has leukemia; h2 = Toby does not have leukemia. Prior: 0.8% of the population has leukemia, thus P(h1) = 0.008 and P(h2) = 0.992. Likelihood: P(+ | h1) = 0.98, P(− | h1) = 0.02; P(+ | h2) = 0.03, P(− | h2) = 0.97. Posterior knowledge: the blood test is + for this patient.

In summary: P(h1) = 0.008, P(h2) = 0.992; P(+ | h1) = 0.98, P(− | h1) = 0.02; P(+ | h2) = 0.03, P(− | h2) = 0.97. Thus:
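P(+ | h1) P(h1) = 0.98 × 0.008 = 0.00784, and P(+ | h2) P(h2) = 0.03 × 0.992 = 0.02976. Since 0.02976 > 0.00784, h_MAP = h2: it is more likely that Toby does not have leukemia.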

What is P(leukemia | +)? To answer this, normalize the two quantities above by P(+); the resulting values are called the “posterior” probabilities.
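P(+) = P(+ | h1) P(h1) + P(+ | h2) P(h2) = 0.00784 + 0.02976 = 0.0376, so P(leukemia | +) = 0.00784 / 0.0376 ≈ 0.21 and P(no leukemia | +) = 0.02976 / 0.0376 ≈ 0.79.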

In-Class Exercises 1-2

Quiz 5 (this Thursday) What you need to know:
– Definition of information gain and how to calculate it
– Bayes rule and how to derive it from the definition of conditional probability
– How to solve problems similar to the in-class exercises using Bayes rule
Note: the Naïve Bayes algorithm won’t be on this quiz. No calculator needed.

In-class exercises 2b-2c

Bayesianism vs. Frequentism
Classical (frequentist) probability:
– The probability of a particular event is defined relative to its frequency in a sample space of events.
– E.g., the probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses.
Bayesian probability:
– Combines a measure of “prior” belief you have in a proposition with your subsequent observations of events.
Example: a Bayesian can assign a probability to the statement “There was life on Mars a billion years ago,” but a frequentist cannot.

Independence and Conditional Independence Two random variables, X and Y, are independent if P(X, Y) = P(X) P(Y) (equivalently, P(X | Y) = P(X)). Two random variables, X and Y, are independent given Z if P(X, Y | Z) = P(X | Z) P(Y | Z). Examples?

Naive Bayes Classifier Let f(x) be a target function for classification: f(x) ∈ {+1, −1}. Let x = ⟨x1, x2, ..., xn⟩ be a vector of attribute values. We want to find the most probable class value, h_MAP, given the data x:
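h_MAP = argmax over c ∈ {+1, −1} of P(c | x1, x2, ..., xn).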

By Bayes theorem, this is equivalent to maximizing P(x1, x2, ..., xn | class) P(class). P(class) can be estimated from the training data. How? However, it is in general not practical to use the training data to estimate P(x1, x2, ..., xn | class). Why not?
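Spelled out: class_MAP = argmax over c of P(x1, x2, ..., xn | c) P(c) / P(x1, x2, ..., xn) = argmax over c of P(x1, x2, ..., xn | c) P(c), since the denominator does not depend on the class c.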

Naive Bayes classifier: assume the attributes are conditionally independent given the class, i.e., P(x1, x2, ..., xn | class) = P(x1 | class) P(x2 | class) ··· P(xn | class). Is this a good assumption? Given this assumption, here’s how to classify an instance x = ⟨x1, x2, ..., xn⟩ (see the decision rule below); the values of these various probabilities are estimated over the training set.
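The resulting decision rule is: class_NB(x) = argmax over c of P(c) P(x1 | c) P(x2 | c) ··· P(xn | c).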

Training data:
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Test data:
D15  Sunny     Cool  High      Strong  ?

In practice, use the training data to compute a probabilistic model (the class prior P(PlayTennis) and the conditional probabilities P(xi | PlayTennis) for each attribute value), then apply it to the test instance:
Day  Outlook  Temp  Humidity  Wind    PlayTennis
D15  Sunny    Cool  High      Strong  ?
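As an illustration that is not part of the original slides, here is a minimal Python sketch of this computation on the PlayTennis data; the names are my own, and the probabilities are raw frequency estimates without smoothing.

from collections import Counter, defaultdict

# (Outlook, Temp, Humidity, Wind, PlayTennis) rows from the table above
data = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]

class_counts = Counter(row[-1] for row in data)      # n(c)
feature_counts = defaultdict(Counter)                # n(x_i = a_i, c)
for *features, c in data:
    for i, value in enumerate(features):
        feature_counts[c][(i, value)] += 1

def nb_score(x, c):
    """P(c) times the product of P(x_i | c), estimated by raw frequencies."""
    score = class_counts[c] / len(data)
    for i, value in enumerate(x):
        score *= feature_counts[c][(i, value)] / class_counts[c]
    return score

x = ("Sunny", "Cool", "High", "Strong")              # test instance D15
scores = {c: nb_score(x, c) for c in class_counts}
print(scores)                        # approx {'No': 0.0206, 'Yes': 0.0053}
print(max(scores, key=scores.get))   # -> 'No'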

In-class exercise 3a

Estimating probabilities / Smoothing Recap: In the previous example, we had a training set and a new example, ⟨Outlook=Sunny, Temp=Cool, Humidity=High, Wind=Strong⟩, and we asked: what classification is given by a naive Bayes classifier? Let n(c) be the number of training instances with class c, and n(xi = ai, c) be the number of training instances with attribute value xi = ai and class c. Then the estimate used so far is P(xi = ai | c) = n(xi = ai, c) / n(c).

Problem with this method: if the counts are very small, this gives a poor estimate; in particular, an attribute value that never occurs with a class gets probability zero. E.g., P(Outlook = Overcast | no) = 0.

Now suppose we want to classify a new instance whose Outlook value is Overcast. Then the product for class “no” contains the factor P(Outlook = Overcast | no) = 0, so the entire estimate for “no” is zero. This incorrectly gives us zero probability due to the small sample.

One solution: Laplace smoothing (also called “add-one” smoothing). For each class cj and attribute xi with value ai, add one “virtual” instance. That is, recalculate the estimate as shown below, where k is the number of possible values of attribute xi.
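The smoothed estimate, consistent with the 1/8 computed in the example that follows, is: P(xi = ai | cj) = (n(xi = ai, cj) + 1) / (n(cj) + k).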

Using the training data above, add virtual instances for Outlook:
Outlook=Sunny: Yes, Outlook=Overcast: Yes, Outlook=Rain: Yes
Outlook=Sunny: No, Outlook=Overcast: No, Outlook=Rain: No
Then P(Outlook=Overcast | No) = (0 + 1) / (5 + 3) = 1/8

Etc.

In-class exercise 3b

Naive Bayes on continuous-valued attributes How to deal with continuous-valued attributes? Two possible solutions:
– Discretize the values into bins.
– Assume a particular probability distribution of the attribute values within each class (and estimate its parameters from the training data).

Simplest discretization method For each attribute xi, create k equal-size bins in the interval from min(xi) to max(xi), and choose thresholds between the bins.
Example: Humidity values 25, 38, 50, 80, 90, 92, 96, 99 with thresholds at 40, 80, and 120 give the estimates
P(Humidity < 40 | yes), P(40 <= Humidity < 80 | yes), P(80 <= Humidity < 120 | yes)
P(Humidity < 40 | no), P(40 <= Humidity < 80 | no), P(80 <= Humidity < 120 | no)

Questions: What should k be? What if some bins have very few instances? There is a trade-off between discretization bias and variance: the more bins, the lower the bias, but the higher the variance, due to the small sample size in each bin.

Alternative simple (but effective) discretization method (Yang & Webb, 2001): Let n = the number of training examples. For each attribute Ai, create approximately √n bins: sort the values of Ai in ascending order and put about √n of them in each bin. Add-one smoothing of the probabilities is not needed. This gives a good balance between discretization bias and variance.

Example: the Humidity values 25, 38, 50, 80, 90, 92, 96, 99, sorted in ascending order, would be split into bins of roughly equal size in this way.
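A minimal Python sketch of this equal-frequency binning, assuming roughly √n values per bin (an illustration, not from the original slides):

import math

def equal_frequency_bins(values):
    """Split sorted values into bins of about sqrt(n) values each."""
    values = sorted(values)
    n = len(values)
    per_bin = max(1, round(math.sqrt(n)))
    return [values[i:i + per_bin] for i in range(0, n, per_bin)]

humidity = [25, 38, 50, 80, 90, 92, 96, 99]
print(equal_frequency_bins(humidity))
# -> [[25, 38, 50], [80, 90, 92], [96, 99]]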

In-class exercise 4

Gaussian Naïve Bayes Assume that within each class, the values of each numeric feature are normally distributed: p(xi | c) = N(xi; μi,c, σi,c), where μi,c is the mean of feature i given class c, and σi,c is the standard deviation of feature i given class c. We estimate μi,c and σi,c from the training data.
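Here N(x; μ, σ) denotes the normal density: N(x; μ, σ) = (1 / (σ √(2π))) exp(−(x − μ)² / (2σ²)).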

Example: a small training set with two numeric features, f1 and f2, and class labels + and −. For each class, estimate the mean and standard deviation of each feature (fill in the rest…).

For feature f1: N1,+ = N(x; 4.8, 1.8) and N1,− = N(x; 4.7, 2.5).

Now, suppose you have a new example x, with f1 = 5.2 and f2 = 6.3. What is class_NB(x)?

Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani) The naive Bayes classifier is called “naive” because it assumes the attributes are independent of one another given the class.

This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?

Experiments Compare five classification methods on 30 data sets from the UCI ML database:
– SBC = the Simple Bayesian Classifier
– Default = “choose the class with the most representatives in the data”
– C4.5 = Quinlan’s decision tree induction system
– PEBLS = an instance-based learning system
– CN2 = a rule-induction system

For SBC, numeric values were discretized into ten equal- length intervals.

Results were summarized three ways: the number of domains in which SBC was more accurate versus less accurate than the corresponding classifier; the same comparison, counting only differences significant at the 95% confidence level; and the average rank over all domains (1 is best in each domain).

Measuring Attribute Dependence They used a simple, pairwise mutual-information measure. For attributes Am and An, the dependence D(Am, An | C) is defined in terms of a “derived attribute” AmAn, whose values consist of the possible combinations of values of Am and An. Note: if Am and An are independent given the class, then D(Am, An | C) = 0.
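One way to write such a measure, consistent with this description (a reconstruction; the paper gives the exact definition), is the class-conditional mutual information: D(Am, An | C) = H(Am | C) + H(An | C) − H(AmAn | C), where H denotes entropy. This quantity is zero exactly when Am and An are conditionally independent given C.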

Results: (1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes. (2) There is no correlation between the degree of attribute dependence and SBC’s rank. But why?

Explanation: Suppose C = {+1, −1} are the possible classes. Let x be a new example with attributes ⟨x1, x2, ..., xn⟩. What the naive Bayes classifier does is calculate two quantities, one per class, and return the class with the maximum estimated probability given x.
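Concretely, the two quantities are the naive Bayes scores P(+1) P(x1 | +1) ··· P(xn | +1) and P(−1) P(x1 | −1) ··· P(xn | −1); whichever is larger determines the predicted class.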

The probability calculations are correct only if the independence assumption is correct. However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct! For example, if the true posteriors are 0.6 and 0.4 but the SBC estimates them as 0.99 and 0.01, the estimates are badly calibrated yet the predicted class is still correct. The latter condition covers far more cases than the former. Thus, the SBC is effective in many cases in which the independence assumption does not hold.