Naïve Bayes Classification 10-701 Recitation, 1/25/07 Jonathan Huang.


Things We'd Like to Do
- Spam classification: given an email, predict whether it is spam or not
- Medical diagnosis: given a list of symptoms, predict whether a patient has cancer or not
- Weather: based on temperature, humidity, etc., predict if it will rain tomorrow

Bayesian Classification
Problem statement:
- Given features X_1, X_2, …, X_n
- Predict a label Y

Another Application: Digit Recognition
- X_1, …, X_n ∈ {0, 1} (black vs. white pixels)
- Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
(Figure: the pixel features are fed into a classifier, which outputs 5)

The Bayes Classifier
In class, we saw that a good strategy is to predict the most probable label given the features:
argmax_y P(Y = y | X_1, …, X_n)
(for example: what is the probability that the image represents a 5 given its pixels?)
So… how do we compute that?

The Bayes Classifier
Use Bayes Rule!
P(Y | X_1, …, X_n) = P(X_1, …, X_n | Y) P(Y) / P(X_1, …, X_n)
Here P(X_1, …, X_n | Y) is the likelihood, P(Y) is the prior, and P(X_1, …, X_n) is a normalization constant.
Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label.
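To make the factorization concrete, here is a tiny numeric sketch of Bayes rule for the two-digit task; all probability values are invented purely for illustration:

```python
# Toy Bayes-rule computation with made-up numbers (illustration only).
# Suppose P(Y=5) = 0.6, P(Y=6) = 0.4, and for an observed image x:
# P(x | Y=5) = 0.002 and P(x | Y=6) = 0.001.
prior = {5: 0.6, 6: 0.4}
likelihood = {5: 0.002, 6: 0.001}

# Normalization constant P(x) = sum over classes of likelihood * prior.
evidence = sum(likelihood[y] * prior[y] for y in prior)

# Posterior P(Y=y | x) = P(x | Y=y) * P(Y=y) / P(x).
posterior = {y: likelihood[y] * prior[y] / evidence for y in prior}
print(posterior)  # roughly {5: 0.75, 6: 0.25} -> predict 5
```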

The Bayes Classifier
Let's expand this for our digit recognition task:
P(Y=5 | X_1, …, X_n) = P(X_1, …, X_n | Y=5) P(Y=5) / P(X_1, …, X_n)
P(Y=6 | X_1, …, X_n) = P(X_1, …, X_n | Y=6) P(Y=6) / P(X_1, …, X_n)
To classify, we'll simply compute these two probabilities and predict based on which one is greater.

Model Parameters
For the Bayes classifier, we need to "learn" two functions, the likelihood and the prior.
How many parameters are required to specify the prior for our digit recognition example?
Just 1: P(Y=5) = θ determines P(Y=6) = 1 − θ.

Model Parameters
How many parameters are required to specify the likelihood?
(Supposing that each image is 30×30 pixels)
2(2^900 − 1): one distribution over the 2^900 possible pixel configurations for each of the two classes, minus the sum-to-one constraints.

Model Parameters
The problem with explicitly modeling P(X_1, …, X_n | Y) is that there are usually way too many parameters:
- We'll run out of space
- We'll run out of time
- And we'll need tons of training data (which is usually not available)
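As a quick sanity check of how badly the explicit joint model scales, this sketch computes both parameter counts for the 30×30 binary-pixel, two-class example:

```python
# Parameter counts for the digit example: n = 900 binary features, 2 classes.
n = 30 * 30

# Full joint model: each class needs 2**n - 1 free parameters, times 2 classes.
joint_params = 2 * (2**n - 1)

# Naive Bayes: one Bernoulli parameter per feature per class.
naive_params = 2 * n

print(len(str(joint_params)))  # 272 -- the joint count is a 272-digit number
print(naive_params)            # 1800
```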

The Naïve Bayes Model
The Naïve Bayes assumption: assume that all features are independent given the class label Y.
Equationally speaking:
P(X_1, …, X_n | Y) = ∏_i P(X_i | Y)
(We will discuss the validity of this assumption later.)

Why is this useful?
- # of parameters for modeling P(X_1, …, X_n | Y): 2(2^n − 1)
- # of parameters for modeling P(X_1 | Y), …, P(X_n | Y): 2n

Naïve Bayes Training
Now that we've decided to use a Naïve Bayes classifier, we need to train it with some data.
(Figure: MNIST training data)

Naïve Bayes Training
Training in Naïve Bayes is easy:
- Estimate P(Y=v) as the fraction of records with Y=v
- Estimate P(X_i=u | Y=v) as the fraction of records with Y=v for which X_i=u
(This corresponds to maximum likelihood estimation of the model parameters.)
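As a concrete illustration (not from the original slides), here is a minimal sketch of this counting procedure, assuming binary features stored in a NumPy array X of shape (records, features) and a label vector y; all names are illustrative:

```python
import numpy as np

def train_naive_bayes(X, y):
    """Maximum-likelihood training for binary features X (m x n) and labels y (length m).

    Returns class priors P(Y=v) and per-feature likelihoods P(X_i=1 | Y=v),
    each estimated as a simple fraction of the training records."""
    classes = np.unique(y)
    priors, likelihoods = {}, {}
    for v in classes:
        rows = X[y == v]                    # all records with Y = v
        priors[v] = len(rows) / len(X)      # fraction of records with Y = v
        likelihoods[v] = rows.mean(axis=0)  # fraction of those records with X_i = 1
    return priors, likelihoods
```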

Naïve Bayes Training
In practice, some of these counts can be zero.
- Fix this by adding "virtual" counts: pretend every feature value was seen a few extra times for every class, so no estimated probability is exactly zero
- (This is like putting a prior on the parameters and doing MAP estimation instead of MLE)
- This is called smoothing
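One standard way to realize the virtual counts is add-k (Laplace) smoothing; the sketch below is a hedged variant of the training function above that adds k pseudo-observations of each feature value per class (the default k = 1 is an assumption, not something the slide specifies):

```python
import numpy as np

def train_naive_bayes_smoothed(X, y, k=1.0):
    """Like the MLE training sketch, but adds k "virtual" counts to each outcome
    (add-k / Laplace smoothing), so no estimated probability is exactly zero."""
    classes = np.unique(y)
    priors, likelihoods = {}, {}
    for v in classes:
        rows = X[y == v]
        priors[v] = len(rows) / len(X)
        # Pretend we saw k extra records with X_i = 1 and k extra with X_i = 0.
        likelihoods[v] = (rows.sum(axis=0) + k) / (len(rows) + 2 * k)
    return priors, likelihoods
```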

Naïve Bayes Training For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

Naïve Bayes Classification
To classify a new example x, compute the score P(Y=v) ∏_i P(X_i = x_i | Y=v) for each class and predict the class with the larger score.
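A minimal sketch of that classification step, reusing the illustrative training functions above (names are assumptions, not from the slides):

```python
import numpy as np

def classify(x, priors, likelihoods):
    """Predict the class maximizing P(Y=v) * prod_i P(X_i = x_i | Y=v).

    x is a binary feature vector; priors/likelihoods come from the training
    sketches above.  (Products of many small numbers underflow -- see the
    numerical-stability slides for the log-space fix.)"""
    best_v, best_score = None, -1.0
    for v, prior in priors.items():
        p = likelihoods[v]                        # P(X_i = 1 | Y = v) for each i
        per_feature = np.where(x == 1, p, 1 - p)  # P(X_i = x_i | Y = v)
        score = prior * per_feature.prod()
        if score > best_score:
            best_v, best_score = v, score
    return best_v
```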

Outputting Probabilities
What's nice about Naïve Bayes (and generative models in general) is that it returns probabilities:
- These probabilities can tell us how confident the algorithm is
- So… don't throw away those probabilities!

Performance on a Test Set
Naïve Bayes is often a good choice if you don't have much training data!

Naïve Bayes Assumption
Recall the Naïve Bayes assumption: that all features are independent given the class label Y.
Does this hold for the digit recognition problem?

Exclusive-OR Example
For an example where conditional independence fails: Y = XOR(X_1, X_2)

X_1  X_2  P(Y=0 | X_1, X_2)  P(Y=1 | X_1, X_2)
 0    0          1                  0
 0    1          0                  1
 1    0          0                  1
 1    1          1                  0
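The failure is easy to check numerically: given Y, knowing X_1 determines X_2 (since X_2 = XOR(Y, X_1)), so the features are strongly dependent given the label, yet Naïve Bayes only looks at per-feature marginals, which carry no information here. A small sketch (illustrative, not from the slides):

```python
import numpy as np

# The four XOR records, each appearing equally often.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])  # y = XOR(x1, x2)

# Naive Bayes estimates: P(Y=0) = P(Y=1) = 1/2, and for every feature and
# class, P(X_i = 1 | Y = v) = 1/2.  So for any input the two unnormalized
# scores are identical and the classifier can do no better than guessing.
for v in (0, 1):
    rows = X[y == v]
    print(v, rows.mean(axis=0))  # prints [0.5 0.5] for both classes
```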

Actually, the Naïve Bayes assumption is almost never true.
Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold.

Numerical Stability
It is often the case that machine learning algorithms need to work with very small numbers:
- Imagine computing the probability of 2000 independent coin flips
- MATLAB thinks that (0.5)^2000 = 0

Numerical Stability
Instead of comparing P(Y=5 | X_1, …, X_n) with P(Y=6 | X_1, …, X_n), compare their logarithms.
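A short sketch of both the underflow and the log-space comparison (Python underflows just like MATLAB here; the helper reuses the illustrative priors/likelihoods from the training sketches above):

```python
import numpy as np

print(0.5 ** 2000)  # 0.0 -- the true value underflows double precision

def log_score(x, prior, p):
    """log P(Y=v) + sum_i log P(X_i = x_i | Y=v) for one class.

    Use the smoothed estimates for p so that no probability is exactly zero."""
    per_feature = np.where(x == 1, p, 1 - p)
    return np.log(prior) + np.log(per_feature).sum()

# Compare log_score(x, priors[5], likelihoods[5]) with
# log_score(x, priors[6], likelihoods[6]) and predict the larger one;
# since log is monotonic this gives the same decision without underflow.
```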

Recovering the Probabilities
Suppose that for some (unknown) constant K, we have:
log P(Y=5 | X_1, …, X_n) = K + α_5
and
log P(Y=6 | X_1, …, X_n) = K + α_6
How would we recover the original probabilities?

Recovering the Probabilities
Given log P(Y=v | X_1, …, X_n) = K + α_v, the probabilities must sum to one, so for any constant C:
P(Y=v | X_1, …, X_n) = exp(α_v + C) / Σ_v' exp(α_v' + C)
One suggestion: set C such that the greatest α_i is shifted to zero, i.e. C = −max_i α_i.
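This shift is the usual log-sum-exp normalization; a minimal sketch (the function name is illustrative):

```python
import numpy as np

def normalize_log_scores(alphas):
    """Turn per-class log scores (each equal to log P(Y=v | x) up to a shared
    unknown constant) back into probabilities that sum to one."""
    alphas = np.asarray(alphas, dtype=float)
    shifted = alphas - alphas.max()  # C = -max alpha: the largest score becomes 0
    unnorm = np.exp(shifted)         # safe to exponentiate after the shift
    return unnorm / unnorm.sum()

print(normalize_log_scores([-1000.0, -1001.0]))  # about [0.73, 0.27]
```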

Recap
- We defined a Bayes classifier but saw that it's intractable to compute P(X_1, …, X_n | Y)
- We then used the Naïve Bayes assumption – that everything is independent given the class label Y
- A natural question: is there some happy compromise where we only assume that some features are conditionally independent? Stay tuned…

Conclusions
Naïve Bayes is:
- Really easy to implement and often works well
- Often a good first thing to try
- Commonly used as a "punching bag" for smarter algorithms

Questions?