1 Naïve Bayes Classifiers CS 171/271

2 Definition
A classifier is a system that categorizes instances.
Inputs to a classifier: feature/attribute values of a given instance.
Output of a classifier: predicted category for that instance.

3 Classifiers
[Diagram: feature values X1, X2, X3, …, Xn are fed into the classifier, which outputs a category Y]
Example: X1 (motility) = “flies”, X2 (number of legs) = 2, X3 (height) = 6 in → Y = “bird”

4 Learning from datasets
In the context of a learning agent, a classifier’s intelligence will be based on a dataset consisting of instances with known categories.
Typical goal of a classifier: predict the category of a new instance that is rationally consistent with the dataset.

5 Classifier algorithm (approach 1)
Select all instances in the dataset that match the input tuple (X1,X2,…,Xn) of feature values.
Determine the distribution of Y-values for all the matches.
Output the Y-value representing the most instances.
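The following is a minimal Python sketch of approach 1 (my own illustration; the slides give no code). It assumes the dataset is stored as a list of (feature_tuple, category) pairs and scans every instance on each query, which is exactly the cost discussed on the next slide.

```python
from collections import Counter

def classify_by_matching(dataset, query):
    """Approach 1: return the most common category among instances whose
    feature tuple exactly equals `query`."""
    counts = Counter(category for features, category in dataset if features == query)
    if not counts:
        return None  # no exact match in the dataset
    return counts.most_common(1)[0][0]
```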

6 Problems with this approach
Classification time is proportional to the dataset size.
Time complexity: O( m ), where m is the dataset size.
Not practical if the dataset is huge.

7 Pre-computing distributions (approach 2)
What if we pre-compute all distributions for all possible tuples?
The classification process is then a simple matter of looking up the pre-computed distribution.
The time complexity burden will be in the pre-computation stage, done only once.
Still not practical if the number of features is not small:
suppose there are only two possible values per feature and there are n features -> 2^n possible combinations!
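As a rough sketch of why approach 2 blows up (an assumed implementation, not from the slides), the pre-computed table would need one entry per possible feature tuple:

```python
from itertools import product

# Illustrative feature value sets (these match the organism example used later).
feature_values = [
    ["walks", "flies", "swims"],     # motility
    [2, 4, 6, 8],                    # number of legs
    ["small", "large"],              # size
    ["fur", "scales", "feathers"],   # body-covering
]

all_tuples = list(product(*feature_values))
print(len(all_tuples))  # 3 * 4 * 2 * 3 = 72 entries; with n binary features it would be 2**n

# The expensive, pre-computed lookup table would then be something like:
# table = {t: classify_by_matching(dataset, t) for t in all_tuples}
```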

8 What we need
Typically, n (the number of features) will be in the hundreds and m (the number of instances in the dataset) will be in the tens of thousands.
We want a classifier that pre-computes enough so that it does not need to scan through the instances during the query, but we do not want to pre-compute too many values.

9 Probability Notation
What we want to estimate from our dataset is a conditional probability.
P( Y=c | X1=v1, X2=v2, …, Xn=vn ) represents the probability that the category of the instance is c, given that the feature values are v1,v2,…,vn (the input).
In our classifier, we output the c with maximum probability.

10 Bayes Theorem
Bayes theorem allows us to invert a conditional probability:
P( A=a | B=b ) = P( B=b | A=a ) P( A=a ) / P( B=b )
Why and how will this help? The answer will come later.

11
[Diagram: a population of size U divided by the events A=a and B=b into four regions: W = neither holds, X = only A=a, Y = only B=b, Z = both hold]
Suppose U = W+X+Y+Z.
P( A=a | B=b ) = Z/(Z+Y)
P( B=b | A=a ) = Z/(Z+X)
P( A=a ) = (Z+X)/U
P( B=b ) = (Z+Y)/U
P( A=a ) / P( B=b ) = (Z+X)/(Z+Y)
P( B=b | A=a ) P( A=a ) / P( B=b ) = [ Z/(Z+X) ] (Z+X)/(Z+Y) = Z/(Z+Y) = P( A=a | B=b )
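A quick numeric check of this identity (the region counts below are made up purely for illustration):

```python
# W = neither event, X = only A=a, Y = only B=b, Z = both events.
W, X, Y, Z = 40, 25, 20, 15
U = W + X + Y + Z

p_a_given_b = Z / (Z + Y)
p_b_given_a = Z / (Z + X)
p_a = (Z + X) / U
p_b = (Z + Y) / U

# Bayes theorem: P(A=a | B=b) = P(B=b | A=a) P(A=a) / P(B=b)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
```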

12 Another helpful equivalence
Assuming that two events are independent, the probability that both events occur is equal to the product of their individual probabilities:
P( X1=v1, X2=v2 ) = P( X1=v1 ) P( X2=v2 )

13 Goal: maximize this quantity over all possible Y-values
P( Y=c | X1=v1, X2=v2, …, Xn=vn )
= P( X1=v1, X2=v2, …, Xn=vn | Y=c ) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn )   [by Bayes theorem]
= P( X1=v1 | Y=c ) P( X2=v2 | Y=c ) … P( Xn=vn | Y=c ) P( Y=c ) / P( X1=v1, X2=v2, …, Xn=vn )   [the critical step: assume the features are independent given Y=c]
We can ignore the divisor since it remains the same regardless of the Y-value.

14 And here it is…
We want a classifier to compute max P( Y=c | X1=v1, X2=v2, …, Xn=vn ).
We get the same c if we instead compute max P(X1=v1|Y=c) P(X2=v2|Y=c)…P(Xn=vn|Y=c) P(Y=c).
These values can be pre-computed, and the number of computations is not combinatorially explosive.

15 Building a classifier (approach 3)
For each category c, estimate
P( Y=c ) = (number of c-instances) / (total number of instances)
For each category c and each feature Xi, determine the distribution P( Xi | Y=c ):
for each possible value v of Xi, estimate
P( Xi=v | Y=c ) = (number of c-instances where Xi=v) / (number of c-instances)
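A minimal Python sketch of this preparation stage (my own illustration, assuming the same list-of-(feature_tuple, category) layout as before):

```python
from collections import Counter, defaultdict

def train_naive_bayes(dataset):
    """Estimate P(Y=c) and P(Xi=v | Y=c) by counting over the dataset."""
    class_counts = Counter(category for _, category in dataset)
    # value_counts[c][i][v] = number of c-instances whose feature i has value v
    value_counts = defaultdict(lambda: defaultdict(Counter))
    for features, category in dataset:
        for i, v in enumerate(features):
            value_counts[category][i][v] += 1

    total = len(dataset)
    priors = {c: n / total for c, n in class_counts.items()}  # P(Y=c)
    likelihoods = {c: {i: {v: n / class_counts[c] for v, n in vals.items()}
                       for i, vals in feats.items()}
                   for c, feats in value_counts.items()}      # P(Xi=v | Y=c)
    return priors, likelihoods
```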

16 Using the classifier (approach 3)
For a given input tuple (v1,v2,…,vn), determine the category c that yields
max P(X1=v1|Y=c) P(X2=v2|Y=c)…P(Xn=vn|Y=c) P(Y=c)
by looking up the terms from the pre-computed values.
Output category c.
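And a matching sketch of the query stage, under the same assumptions; note that unseen feature values get probability zero here, since the slides do not discuss smoothing:

```python
def classify_naive_bayes(priors, likelihoods, query):
    """Return the category c maximizing P(X1=v1|Y=c)...P(Xn=vn|Y=c) P(Y=c)."""
    best_category, best_score = None, -1.0
    for c, prior in priors.items():
        score = prior
        for i, v in enumerate(query):
            score *= likelihoods[c].get(i, {}).get(v, 0.0)  # looked-up P(Xi=v | Y=c)
        if score > best_score:
            best_category, best_score = c, score
    return best_category
```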

17 Example
Suppose we wanted a classifier that categorizes organisms according to certain characteristics.
Organism categories (Y) are: mammal, bird, fish, insect, spider.
Characteristics (X1,X2,X3,X4): motility (walks, flies, swims), number of legs (2, 4, 6, 8), size (small, large), body-covering (fur, scales, feathers).
The dataset contains 1000 organism samples.
m = 1000, n = 4, number of categories = 5
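To make this concrete, the sketches above could be exercised on a few made-up organism records (the actual 1000-sample dataset is not given, so these rows are purely illustrative):

```python
# Hypothetical instances in the (motility, legs, size, covering) scheme of this example.
toy_dataset = [
    (("flies", 2, "small", "feathers"), "bird"),
    (("walks", 4, "large", "fur"), "mammal"),
    (("swims", 2, "small", "scales"), "fish"),   # the legs value is arbitrary here
    (("walks", 6, "small", "scales"), "insect"),
    (("walks", 8, "small", "fur"), "spider"),
]

priors, likelihoods = train_naive_bayes(toy_dataset)
print(classify_naive_bayes(priors, likelihoods, ("flies", 2, "small", "feathers")))  # bird
```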

18 Comparing approaches
Approach 1: requires scanning all tuples for matching feature values; this entails 1000*4 = 4000 comparisons per query, plus counting the occurrences of each category among the matches.
Approach 2: pre-compute probabilities.
Preparation: for each of the 3*4*2*3 = 72 combinations, determine the probability of each category (72*5 = 360 computations).
Query: straightforward lookup of the answer.

19 Comparing approaches
Approach 3: Naïve Bayes classifier.
Preparation: compute the P(Y=c) probabilities, 5 of them; compute P( Xi=v | Y=c ), 5*(3+4+2+3) = 60 of them.
Query: straightforward computation of 5 probabilities; determine the maximum and return the category that yields it.

20 About the Naïve Bayes Classifier
The computations and resources required are reasonable, both for the preparatory stage and the actual query stage, even if the number n of features is in the thousands!
The classifier is naïve because it assumes independence of the features (this is likely not the case).
It turns out that the classifier works well in practice even with this limitation.
Logs of probabilities are often used instead of the actual probabilities to avoid underflow when computing the probability products.
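A small illustration of the log-probability trick (my own sketch; the probability lookups from the earlier sketches could be swapped in):

```python
import math

# Multiplying many small probabilities underflows to 0.0 in floating point;
# summing their logarithms does not, and comparisons between categories still work.
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)    # 0.0 due to underflow (the true value is 1e-500)

log_score = sum(math.log(p) for p in probs)
print(log_score)  # roughly -1151.29
```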

21 Related areas of study
Density estimators: alternate methods of computing the probabilities.
Feature selection: eliminate unnecessary or redundant features (those that don’t help as much with classification) in order to reduce the value of n.