Classification: Naïve Bayes Classifier


Classification: Naïve Bayes Classifier (© Tan, Steinbach, Kumar, Introduction to Data Mining)

Joint, Marginal, Conditional Probability… We often need to determine probabilities of events that result from combining other events in various ways. There are several types of combinations and relationships between events: the complement of an event [everything other than that event], the intersection of two events [event A and event B, also written A·B], and the union of two events [event A or event B, also written A+B].

Example of Joint Probability Why are some mutual fund managers more successful than others? One possible factor is where the manager earned his or her MBA. The following table compares mutual fund performance against the ranking of the school where the fund manager earned their MBA:

                           Fund outperforms    Fund doesn't outperform
                           the market          the market
Top-20 MBA program           .11                 .29
Not top-20 MBA program       .06                 .54

E.g., .11 is the probability that a mutual fund outperforms the market AND the manager was in a top-20 MBA program; it is a joint probability [intersection].

Example of Joint Probability Alternatively, we could introduce shorthand notation to represent the events: A1 = Fund manager graduated from a top-20 MBA program; A2 = Fund manager did not graduate from a top-20 MBA program; B1 = Fund outperforms the market; B2 = Fund does not outperform the market.

         B1      B2
A1      .11     .29
A2      .06     .54

E.g. P(A2 and B1) = .06 = the probability a fund outperforms the market and the manager isn't from a top-20 school.

Marginal Probabilities… Marginal probabilities are computed by adding across rows and down columns; that is, they are calculated in the margins of the table:

          B1      B2     P(Ai)
A1       .11     .29      .40
A2       .06     .54      .60
P(Bj)    .17     .83     1.00

P(A2) = .06 + .54 = .60: "what's the probability a fund manager isn't from a top school?"
P(B1) = .11 + .06 = .17: "what's the probability a fund outperforms the market?"
BOTH margins must add to 1 (a useful error check).

Conditional Probability… Conditional probability is used to determine how two events are related; that is, we can determine the probability of one event given the occurrence of another related event. Experiment: randomly select one student in class. P(randomly selected student is male) = ? P(randomly selected student is male | student is in the 3rd row) = ? Conditional probabilities are written as P(A | B), read as "the probability of A given B", and calculated as P(A | B) = P(A and B) / P(B).

Conditional Probability… Again, the probability of an event given that another event has occurred is called a conditional probability. Rearranging the definition gives the multiplication rule: P(A and B) = P(A)·P(B | A) = P(B)·P(A | B); both forms are true. Keep this in mind!

Conditional Probability… Example 6.2 • What’s the probability that a fund will outperform the market given that the manager graduated from a top-20 MBA program? Recall: A1 = Fund manager graduated from a top-20 MBA program A2 = Fund manager did not graduate from a top-20 MBA program B1 = Fund outperforms the market B2 = Fund does not outperform the market Thus, we want to know “what is P(B1 | A1) ?”

Conditional Probability… We want to calculate P(B1 | A1).

          B1      B2     P(Ai)
A1       .11     .29      .40
A2       .06     .54      .60
P(Bj)    .17     .83     1.00

P(B1 | A1) = P(A1 and B1) / P(A1) = .11 / .40 = .275. Thus, there is a 27.5% chance that a fund will outperform the market given that the manager graduated from a top-20 MBA program.
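These table manipulations are easy to check in a few lines of Python; the dictionary layout and event labels below are just illustrative.

```python
# Joint probabilities from the mutual-fund table (labels A1/A2, B1/B2 as above)
joint = {
    ("A1", "B1"): 0.11, ("A1", "B2"): 0.29,   # top-20 MBA program
    ("A2", "B1"): 0.06, ("A2", "B2"): 0.54,   # not a top-20 MBA program
}

# Marginals: sum the joint probabilities over the other event
p_A1 = joint[("A1", "B1")] + joint[("A1", "B2")]   # 0.40
p_B1 = joint[("A1", "B1")] + joint[("A2", "B1")]   # 0.17

# Conditional: P(B1 | A1) = P(A1 and B1) / P(A1)
p_B1_given_A1 = joint[("A1", "B1")] / p_A1
print(round(p_A1, 2), round(p_B1, 2), round(p_B1_given_A1, 3))   # 0.4 0.17 0.275
```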

Independence… One of the objectives of calculating conditional probability is to determine whether two events are related. In particular, we would like to know whether they are independent, that is, whether the probability of one event is unaffected by the occurrence of the other event. Two events A and B are said to be independent if P(A | B) = P(A) and P(B | A) = P(B). For example, P(you have a flat tire going home | your radio quits working) should equal P(you have a flat tire going home): the radio failing tells you nothing about the tire.

Are B1 and A1 Independent?

Independence… For example, we saw that P(B1 | A1) = .275 The marginal probability for B1 is: P(B1) = 0.17 Since P(B1|A1) ≠ P(B1), B1 and A1 are not independent events. Stated another way, they are dependent. That is, the probability of one event (B1) is affected by the occurrence of the other event (A1).

Determine the probability that a fund outperforms (B1) or the manager graduated from a top-20 MBA program (A1).

Union… Determine the probability that a fund outperforms (B1) or the manager graduated from a top-20 MBA program (A1). A1 or B1 occurs whenever A1 and B1 occurs, A1 and B2 occurs, or A2 and B1 occurs…

          B1      B2     P(Ai)
A1       .11     .29      .40
A2       .06     .54      .60
P(Bj)    .17     .83     1.00

P(A1 or B1) = .11 + .29 + .06 = .46
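The same answer follows from the inclusion-exclusion rule for unions; a quick check with the values taken from the table above:

```python
# P(A1 or B1) = P(A1) + P(B1) - P(A1 and B1)
p_A1, p_B1, p_A1_and_B1 = 0.40, 0.17, 0.11
print(round(p_A1 + p_B1 - p_A1_and_B1, 2))   # 0.46
```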

Data Mining Classification: Naïve Bayes Classifier (© Tan, Steinbach, Kumar, Introduction to Data Mining)

Classification: Definition Given a collection of records (training set), each record contains a set of attributes, one of which is the class. Find a model for the class attribute as a function of the values of the other attributes. Goal: previously unseen records should be assigned a class as accurately as possible. A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
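As a concrete sketch of this workflow, the snippet below splits a labeled dataset into training and test sets, builds a model on the former, and measures accuracy on the latter. It uses scikit-learn and the bundled iris data purely as placeholders; any dataset and classifier could stand in.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training set builds the model; the held-out test set estimates its accuracy
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = GaussianNB()             # the classifier is interchangeable here
model.fit(X_train, y_train)      # learn the class as a function of the attributes
y_pred = model.predict(X_test)   # assign classes to previously unseen records

print("test accuracy:", accuracy_score(y_test, y_pred))
```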

Illustrating Classification Task

Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
Decision Tree based Methods
Rule-based Methods
Memory based reasoning
Neural Networks
Naïve Bayes and Bayesian Belief Networks
Support Vector Machines

Example of a Decision Tree A model (decision tree) built from the training data; Refund and MarSt (marital status) are categorical attributes, TaxInc (taxable income) is continuous, and Cheat is the class. Splitting attributes:
Refund = Yes: NO
Refund = No: split on MarSt
  MarSt = Married: NO
  MarSt = Single, Divorced: split on TaxInc
    TaxInc < 80K: NO
    TaxInc > 80K: YES

Another Example of Decision Tree Built from the same training data, this tree splits on MarSt first:
MarSt = Married: NO
MarSt = Single, Divorced: split on Refund
  Refund = Yes: NO
  Refund = No: split on TaxInc
    TaxInc < 80K: NO
    TaxInc > 80K: YES
There could be more than one tree that fits the same data!

Decision Tree Classification Task

Apply Model to Test Data Start from the root of the tree and, at each node, follow the branch that matches the test record's attribute value (Refund, then MarSt, then TaxInc) until a leaf is reached. For the test record traced in the slides (Refund = No, MarSt = Married), the path Refund = No, MarSt = Married ends at the leaf NO, so Cheat is assigned "No".
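The decision procedure encoded by the first tree can be written as a few nested conditions; the dictionary-based record format below is an assumption for illustration.

```python
def classify_cheat(record):
    """Apply the example decision tree (Refund -> MarSt -> TaxInc) to one record."""
    if record["Refund"] == "Yes":
        return "No"                                   # leaf: NO
    if record["MarSt"] == "Married":                  # Refund = No branch
        return "No"                                   # leaf: NO
    # Single or Divorced: split on taxable income
    return "No" if record["TaxInc"] < 80_000 else "Yes"

# The test record traced in the slides: Refund = No, MarSt = Married
print(classify_cheat({"Refund": "No", "MarSt": "Married", "TaxInc": 80_000}))   # No
```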

Bayes Classifier A probabilistic framework for solving classification problems. Conditional probability: P(C | A) = P(A, C) / P(A) and P(A | C) = P(A, C) / P(C). Bayes theorem: P(C | A) = P(A | C) P(C) / P(A).

Bayes Theorem Given a hypothesis h and data D which bears on the hypothesis: P(h) is the unconditional probability of h, the prior probability; P(D) is the unconditional probability of D; P(D|h) is the conditional probability of D given h, the likelihood; P(h|D) is the conditional probability of h given D, the posterior probability. Bayes theorem relates them: P(h|D) = P(D|h) P(h) / P(D).

Example of Bayes Theorem Given: A doctor knows that meningitis causes stiff neck 50% of the time Prior probability of any patient having meningitis is 1/50,000 Prior probability of any patient having stiff neck is 1/20 If a patient has stiff neck, what’s the probability he/she has meningitis?

Example of Bayes Theorem By Bayes theorem, P(Meningitis | Stiff neck) = P(Stiff neck | Meningitis) P(Meningitis) / P(Stiff neck) = (0.5 × 1/50,000) / (1/20) = 0.0002. So even given a stiff neck, the probability of meningitis is only 0.02%.
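A quick numerical check of this calculation:

```python
# Bayes theorem for the meningitis example
p_s_given_m = 0.5        # P(stiff neck | meningitis)
p_m = 1 / 50_000         # prior P(meningitis)
p_s = 1 / 20             # prior P(stiff neck)

print(round(p_s_given_m * p_m / p_s, 6))   # 0.0002
```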

Bayesian Classifiers Consider each attribute and class label as random variables Given a record with attributes (A1, A2,…,An) Goal is to predict class C Specifically, we want to find the value of C that maximizes P(C| A1, A2,…,An ) Can we estimate P(C| A1, A2,…,An ) directly from data?

Bayesian Classifiers Approach: compute the posterior probability P(C | A1, A2, …, An) for all values of C using Bayes theorem: P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An). Choose the value of C that maximizes P(C | A1, A2, …, An). Since the denominator does not depend on C, this is equivalent to choosing the value of C that maximizes P(A1, A2, …, An | C) P(C). The remaining question: how to estimate P(A1, A2, …, An | C)?

The Bayes Classifier P(C | A1, A2, …, An) = P(A1, A2, …, An | C) P(C) / P(A1, A2, …, An), where P(A1, A2, …, An | C) is the likelihood, P(C) is the prior, and P(A1, A2, …, An) is the normalization constant.

Maximum A Posteriori Based on Bayes theorem, we can compute the Maximum A Posteriori (MAP) hypothesis for the data. We are interested in the best hypothesis from some space H given observed training data D (H: set of all hypotheses): h_MAP = argmax over h in H of P(h|D) = argmax over h in H of P(D|h) P(h) / P(D) = argmax over h in H of P(D|h) P(h). Note that we can drop P(D), as the probability of the data is constant (and independent of the hypothesis).

Naïve Bayes Classifier Assume independence among the attributes Ai when the class is given: P(A1, A2, …, An | Cj) = P(A1 | Cj) P(A2 | Cj) … P(An | Cj). We can estimate P(Ai | Cj) for all Ai and Cj from the training data. A new point is classified to Cj if P(Cj) ∏i P(Ai | Cj) is maximal.
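A minimal from-scratch sketch of this rule for categorical attributes follows. The data layout (records as dicts) and function names are illustrative, and no smoothing is applied; the zero-probability issue is addressed a few slides below.

```python
from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate P(Cj) and P(Ai = v | Cj) by simple counting (categorical attributes)."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)          # (class, attribute) -> Counter of values
    for rec, c in zip(records, labels):
        for attr, value in rec.items():
            value_counts[(c, attr)][value] += 1
    priors = {c: n / len(labels) for c, n in class_counts.items()}
    def cond_prob(attr, value, c):
        return value_counts[(c, attr)][value] / class_counts[c]
    return priors, cond_prob

def predict(record, priors, cond_prob):
    """Return the class Cj maximizing P(Cj) * prod_i P(Ai | Cj)."""
    scores = {c: prior for c, prior in priors.items()}
    for c in scores:
        for attr, value in record.items():
            scores[c] *= cond_prob(attr, value, c)
    return max(scores, key=scores.get)
```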

Example: Play Tennis. The training data consists of 14 records with attributes Outlook, Temperature, Humidity, and Wind and the class Play (Yes/No).

Example: Learning Phase

Outlook       Play=Yes   Play=No
Sunny           2/9        3/5
Overcast        4/9        0/5
Rain            3/9        2/5

Temperature   Play=Yes   Play=No
Hot             2/9        2/5
Mild            4/9        2/5
Cool            3/9        1/5

Humidity      Play=Yes   Play=No
High            3/9        4/5
Normal          6/9        1/5

Wind          Play=Yes   Play=No
Strong          3/9        3/5
Weak            6/9        2/5

P(Play=Yes) = 9/14   P(Play=No) = 5/14

Example: Test Phase Given a new instance, x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong). Look up the tables:
P(Outlook=Sunny|Play=Yes) = 2/9, P(Temperature=Cool|Play=Yes) = 3/9, P(Humidity=High|Play=Yes) = 3/9, P(Wind=Strong|Play=Yes) = 3/9, P(Play=Yes) = 9/14
P(Outlook=Sunny|Play=No) = 3/5, P(Temperature=Cool|Play=No) = 1/5, P(Humidity=High|Play=No) = 4/5, P(Wind=Strong|Play=No) = 3/5, P(Play=No) = 5/14
MAP rule:
P(Yes|x') ∝ [P(Sunny|Yes) P(Cool|Yes) P(High|Yes) P(Strong|Yes)] P(Play=Yes) = 0.0053
P(No|x') ∝ [P(Sunny|No) P(Cool|No) P(High|No) P(Strong|No)] P(Play=No) = 0.0206
Given the fact that P(Yes|x') < P(No|x'), we label x' as "No".
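The two scores above (unnormalized posteriors) are easy to reproduce:

```python
# x' = (Outlook=Sunny, Temperature=Cool, Humidity=High, Wind=Strong)
p_yes = (2/9) * (3/9) * (3/9) * (3/9) * (9/14)
p_no  = (3/5) * (1/5) * (4/5) * (3/5) * (5/14)
print(round(p_yes, 4), round(p_no, 4))   # 0.0053 0.0206 -> predict "No"
```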

Naïve Bayes Classifier If one of the conditional probabilities is zero, then the entire expression becomes zero. Probability estimation alternatives:
Original: P(Ai | C) = N_ic / N_c
Laplace: P(Ai | C) = (N_ic + 1) / (N_c + c)
m-estimate: P(Ai | C) = (N_ic + m·p) / (N_c + m)
where N_c is the number of training records of class C, N_ic is the number of class-C records with attribute value Ai, c is the number of classes, p is a prior probability, and m is a parameter.
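A small sketch of the Laplace-corrected estimate (argument names are illustrative) shows how it keeps an unseen attribute value from zeroing out the whole product:

```python
def laplace_estimate(n_ic, n_c, c):
    """P(Ai | C) with Laplace correction: (N_ic + 1) / (N_c + c).

    n_ic: number of class-C training records having attribute value Ai
    n_c:  number of class-C training records
    c:    number of classes (as in the slide's legend)
    """
    return (n_ic + 1) / (n_c + c)

# A value never seen with class C gets a small non-zero probability instead of 0
print(laplace_estimate(0, 9, 2))   # ~0.091 rather than 0
```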

Problem Solving A: attributes M: mammals N: non-mammals

Solution A: attributes M: mammals N: non-mammals P(A|M)P(M) > P(A|N)P(N) => Mammals

Classifier Evaluation Metrics: Confusion Matrix

Actual class \ Predicted class        C1                      ¬C1
C1                                    True Positives (TP)     False Negatives (FN)
¬C1                                   False Positives (FP)    True Negatives (TN)

Example of Confusion Matrix:

Actual class \ Predicted class    buy_computer = yes   buy_computer = no   Total
buy_computer = yes                      6954                   46           7000
buy_computer = no                        412                 2588           3000
Total                                   7366                 2634          10000

Given m classes, an entry CM(i,j) in a confusion matrix indicates the number of tuples in class i that were labeled by the classifier as class j. The matrix may have extra rows/columns to provide totals.

Classifier Evaluation Metrics: Accuracy, Error Rate, Sensitivity and Specificity

A \ P     C      ¬C
C         TP     FN     P
¬C        FP     TN     N
          P'     N'     All

Classifier Accuracy (recognition rate): percentage of test set tuples that are correctly classified. Accuracy = (TP + TN) / All
Error rate: 1 - Accuracy, or Error rate = (FP + FN) / All
Class Imbalance Problem: one class may be rare, e.g. fraud or HIV-positive, so there is a significant majority of the negative class and a minority of the positive class. In that case accuracy alone is misleading, so we also use:
Sensitivity (true positive recognition rate) = TP / P
Specificity (true negative recognition rate) = TN / N

Measuring Error
Error rate = # of errors / # of instances = (FN + FP) / N
Recall = # of found positives / # of positives = TP / (TP + FN) = sensitivity = hit rate
Precision = # of found positives / # of found = TP / (TP + FP)
Specificity = TN / (TN + FP)
False alarm rate = FP / (FP + TN) = 1 - Specificity
(Lecture Notes for E. Alpaydın, 2004, Introduction to Machine Learning, © The MIT Press, V1.1)
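Applying these formulas to the buy_computer confusion matrix above (positive class: buy_computer = yes) gives a concrete feel for the metrics:

```python
# Counts from the buy_computer confusion matrix
TP, FN = 6954, 46
FP, TN = 412, 2588
total = TP + FN + FP + TN            # 10000

accuracy    = (TP + TN) / total      # 0.9542
error_rate  = (FP + FN) / total      # 0.0458
recall      = TP / (TP + FN)         # sensitivity / hit rate, ~0.993
precision   = TP / (TP + FP)         # ~0.944
specificity = TN / (TN + FP)         # ~0.863
false_alarm = FP / (FP + TN)         # 1 - specificity, ~0.137

print(accuracy, error_rate, recall, precision, specificity, false_alarm)
```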