Naïve Bayes Classification
Material borrowed from Jonathan Huang, I. H. Witten and E. Frank's "Data Mining", Jeremy Wyatt, and others.


Things We'd Like to Do
– Spam classification: given an email, predict whether it is spam or not
– Medical diagnosis: given a list of symptoms, predict whether a patient has disease X or not
– Weather: based on temperature, humidity, etc., predict whether it will rain tomorrow

Bayesian Classification
Problem statement:
– Given features X1, X2, …, Xn
– Predict a label Y

Another Application: Digit Recognition
– X1, …, Xn ∈ {0, 1} (black vs. white pixels)
– Y ∈ {5, 6} (predict whether a digit is a 5 or a 6)
(Figure: a handwritten digit image fed into a classifier, which outputs "5".)

The Bayes Classifier
In class, we saw that a good strategy is to predict the most probable label given the features:
argmax_y P(Y = y | X1, …, Xn)
– (for example: what is the probability that the image represents a 5 given its pixels?)
So… how do we compute that?

The Bayes Classifier
Use Bayes rule:
P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)
where P(X1, …, Xn | Y) is the likelihood, P(Y) is the prior, and P(X1, …, Xn) is a normalization constant.
Why did this help? We think we might be able to specify how features are "generated" by the class label.

The Bayes Classifier
Let's expand this for our digit recognition task:
P(Y = 5 | X1, …, Xn) = P(X1, …, Xn | Y = 5) P(Y = 5) / P(X1, …, Xn)
P(Y = 6 | X1, …, Xn) = P(X1, …, Xn | Y = 6) P(Y = 6) / P(X1, …, Xn)
To classify, we simply compute these two probabilities and predict based on which one is greater.

Model Parameters
For the Bayes classifier, we need to "learn" two functions, the likelihood and the prior.
How many parameters are required to specify the prior for our digit recognition example?
– (Just one: P(Y = 5), since P(Y = 6) = 1 − P(Y = 5).)

Model Parameters
How many parameters are required to specify the likelihood P(X1, …, Xn | Y)?
– (Supposing that each image is 30x30 pixels, so n = 900)
– Answer: 2(2^900 − 1): one probability for each of the 2^900 pixel configurations, minus one because they sum to 1, for each of the two classes.

Model Parameters
The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
– We'll run out of space
– We'll run out of time
– And we'll need tons of training data (which is usually not available)

The Naïve Bayes Model
The Naïve Bayes assumption: assume that all features are independent given the class label Y.
Equationally speaking:
P(X1, …, Xn | Y) = P(X1 | Y) × P(X2 | Y) × … × P(Xn | Y)
(We will discuss the validity of this assumption later.)

Why is this useful?
– # of parameters for modeling P(X1, …, Xn | Y): 2(2^n − 1)
– # of parameters for modeling P(X1 | Y), …, P(Xn | Y): 2n
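To make the gap concrete, here is a quick sanity check for the 30x30 digit example (n = 900 binary pixels); the numbers follow directly from the two formulas above:

```python
# Parameter counts for n binary features and 2 classes
n = 900  # 30x30 binary pixels

full_joint = 2 * (2**n - 1)  # explicitly modeling P(X1,...,Xn | Y)
naive_bayes = 2 * n          # modeling each P(Xi | Y) separately

print(f"full joint:  about 10^{len(str(full_joint)) - 1} parameters")
print(f"naive Bayes: {naive_bayes} parameters")
```

The full joint needs on the order of 10^271 parameters; Naïve Bayes needs 1800.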

Naïve Bayes Training
Now that we've decided to use a Naïve Bayes classifier, we need to train it with some data.
(Figure: sample handwritten digit images from the MNIST training data.)

Naïve Bayes Training
Training in Naïve Bayes is easy:
– Estimate P(Y = v) as the fraction of records with Y = v
– Estimate P(Xi = u | Y = v) as the fraction of records with Y = v for which Xi = u
(This corresponds to maximum likelihood estimation of the model parameters; a sketch follows below.)
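A minimal sketch of this counting procedure (the data format, a list of (features, label) pairs, is an assumption for illustration, not code from the slides):

```python
from collections import Counter, defaultdict

def train_naive_bayes(data):
    """Maximum likelihood estimates from (features, label) pairs."""
    n_records = len(data)
    label_counts = Counter(y for _, y in data)
    value_counts = defaultdict(Counter)  # (feature index, label) -> value counts

    for x, y in data:
        for i, value in enumerate(x):
            value_counts[(i, y)][value] += 1

    # P(Y = v): fraction of records with Y = v
    prior = {y: c / n_records for y, c in label_counts.items()}
    # P(Xi = u | Y = v): fraction of records with Y = v for which Xi = u
    likelihood = {
        key: {u: c / label_counts[key[1]] for u, c in counts.items()}
        for key, counts in value_counts.items()
    }
    return prior, likelihood
```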

Naïve Bayes Training
In practice, some of these counts can be zero.
Fix this by adding "virtual" counts:
– (This is like putting a prior on the parameters and doing MAP estimation instead of MLE)
– This is called smoothing (see the sketch below)
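In the sketch above, the smoothed version would replace the plain fraction with a Laplace (add-one) estimate; a minimal sketch, assuming binary features by default:

```python
def smoothed_estimate(count, class_count, n_values=2):
    """Laplace (add-one) smoothed estimate of P(Xi = u | Y = v).

    count:       number of records with Xi = u and Y = v
    class_count: number of records with Y = v
    n_values:    number of distinct values feature i can take
    """
    return (count + 1) / (class_count + n_values)

# A value never seen with class v still gets nonzero probability:
print(smoothed_estimate(0, 9))  # 1/11 ~ 0.0909 instead of 0
```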

Naïve Bayes Training For binary digits, training amounts to averaging all of the training fives together and all of the training sixes together.

Naïve Bayes Classification
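The classification rule on this slide was an image in the original deck; the following is a hedged Python rendering of the same idea, matching the `train_naive_bayes` sketch above (the tiny floor for unseen feature values is an assumption, standing in for the smoothing discussed on the previous slides):

```python
def predict(x, prior, likelihood):
    """Return the label maximizing prior * product of per-feature likelihoods."""
    best_label, best_score = None, -1.0
    for y, p_y in prior.items():
        score = p_y
        for i, value in enumerate(x):
            # Unseen (feature, value) pairs get a tiny floor probability here;
            # smoothing is the principled fix.
            score *= likelihood.get((i, y), {}).get(value, 1e-9)
        if score > best_score:
            best_label, best_score = y, score
    return best_label

# Toy usage with two one-record classes:
prior, likelihood = train_naive_bayes([((1, 0), "5"), ((0, 1), "6")])
print(predict((1, 0), prior, likelihood))  # "5"
```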

Another Example of the Naïve Bayes Classifier
The weather data, with counts and probabilities:

                 play = yes        play = no
outlook
  sunny          2   (2/9)         3   (3/5)
  overcast       4   (4/9)         0   (0/5)
  rainy          3   (3/9)         2   (2/5)
temperature
  hot            2   (2/9)         2   (2/5)
  mild           4   (4/9)         2   (2/5)
  cool           3   (3/9)         1   (1/5)
humidity
  high           3   (3/9)         4   (4/5)
  normal         6   (6/9)         1   (1/5)
windy
  false          6   (6/9)         2   (2/5)
  true           3   (3/9)         3   (3/5)
play (prior)     9   (9/14)        5   (5/14)

A new day:
outlook = sunny, temperature = cool, humidity = high, windy = true, play = ?

Likelihood of yes = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 ≈ 0.0053
Likelihood of no = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 ≈ 0.0206
Therefore, the prediction is No.
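These products are easy to check; a small sketch, with the values read off the table above for the new day (sunny, cool, high, true):

```python
from math import prod

likelihood_yes = prod([2/9, 3/9, 3/9, 3/9]) * 9/14
likelihood_no = prod([3/5, 1/5, 4/5, 3/5]) * 5/14

print(f"likelihood of yes: {likelihood_yes:.4f}")  # 0.0053
print(f"likelihood of no:  {likelihood_no:.4f}")   # 0.0206
print("prediction:", "yes" if likelihood_yes > likelihood_no else "no")
```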

Outputting Probabilities
What's nice about Naïve Bayes (and generative models in general) is that it returns probabilities:
– These probabilities can tell us how confident the algorithm is
– So… don't throw away those probabilities! (See the snippet below.)
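Continuing the weather example, the two unnormalized scores above can be normalized into proper posterior probabilities; a minimal sketch:

```python
# Unnormalized scores from the weather example above
score_yes, score_no = 0.0053, 0.0206

total = score_yes + score_no
print(f"P(play = yes | new day) = {score_yes / total:.2f}")  # about 0.20
print(f"P(play = no  | new day) = {score_no / total:.2f}")   # about 0.80
```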

Naïve Bayes Assumption Recall the Naïve Bayes assumption: – that all features are independent given the class label Y Does this hold for the digit recognition problem?

Exclusive-OR Example
For an example where conditional independence fails:
– Y = XOR(X1, X2)

X1   X2   P(Y=0 | X1,X2)   P(Y=1 | X1,X2)
0    0         1                0
0    1         0                1
1    0         0                1
1    1         1                0

Given Y, knowing X1 completely determines X2, so the features are maximally dependent given the class.
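A quick check of why Naïve Bayes breaks here: given either class, each feature is individually uniform, so the per-feature likelihoods carry no information. A small sketch (the four equally likely input pairs are the whole distribution):

```python
from itertools import product

# The full XOR distribution: all four (x1, x2) pairs equally likely
rows = [(x1, x2, x1 ^ x2) for x1, x2 in product([0, 1], repeat=2)]

for y in (0, 1):
    matching = [r for r in rows if r[2] == y]
    p_x1 = sum(r[0] for r in matching) / len(matching)
    p_x2 = sum(r[1] for r in matching) / len(matching)
    print(f"P(X1=1 | Y={y}) = {p_x1}, P(X2=1 | Y={y}) = {p_x2}")
# Both print 0.5 -> the naive factorization cannot distinguish the classes.
```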

Actually, the Naïve Bayes assumption is almost never true. Still… Naïve Bayes often performs surprisingly well even when its assumptions do not hold.

Recap
We defined a Bayes classifier but saw that it's intractable to compute P(X1, …, Xn | Y).
We then used the Naïve Bayes assumption: that everything is independent given the class label Y.
A natural question: is there some happy compromise where we only assume that some features are conditionally independent?
– Stay tuned…

Conclusions Naïve Bayes is: – Really easy to implement and often works well – Often a good first thing to try – Commonly used as a “punching bag” for smarter algorithms

Cross-Validation
Cross-validation avoids overlapping test sets:
– First step: split the data into k subsets of equal size
– Second step: use each subset in turn for testing, and the remainder for training
This is called k-fold cross-validation. (A sketch follows below.)
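A minimal sketch of the procedure (the `train` and `evaluate` callables stand in for whatever classifier is being assessed, so they are assumptions here, not part of the slides):

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle record indices 0..n-1 and deal them into k roughly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(data, k, train, evaluate):
    """Average test score over k folds; each record is tested exactly once."""
    scores = []
    for fold in k_fold_indices(len(data), k):
        fold_set = set(fold)
        test = [data[i] for i in fold]
        train_set = [d for i, d in enumerate(data) if i not in fold_set]
        model = train(train_set)
        scores.append(evaluate(model, test))
    return sum(scores) / k
```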

Leave-One-Out Cross-Validation
Leave-one-out is a particular form of cross-validation:
– Set the number of folds to the number of training instances
– I.e., for n training instances, build the classifier n times
Makes the best use of the data and involves no random subsampling, but is very computationally expensive
– (exception: nearest-neighbor classifiers, where leaving one instance out requires no retraining)
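In terms of the `cross_validate` sketch above, leave-one-out is just the k = n special case; the toy `train`/`evaluate` pair here is a stand-in for illustration:

```python
from collections import Counter

# Dummy classifier for illustration: always predict the majority training label
def train(records):
    return Counter(y for _, y in records).most_common(1)[0][0]

def evaluate(model, records):
    return sum(model == y for _, y in records) / len(records)

data = [((0,), "yes")] * 9 + [((1,), "no")] * 5
print(cross_validate(data, k=len(data), train=train, evaluate=evaluate))
```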