Data Mining with Naïve Bayesian Methods


Data Mining with Naïve Bayesian Methods
Instructor: Qiang Yang, Hong Kong University of Science and Technology (Qyang@cs.ust.hk)
Thanks: Dan Weld, Eibe Frank
Lecture 1

How to Predict?
From a new day's data we wish to predict the decision.
Applications:
Text analysis
Spam email classification
Gene analysis
Network intrusion detection

Naïve Bayesian Models
Two assumptions: attributes are
equally important
statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known).
Although based on assumptions that are almost never correct, this scheme works well in practice!

Why Naïve?
Assume the attributes are independent, given the class. What does that mean?
[Naïve Bayes structure: the class node "play" is the parent of the attribute nodes "outlook", "temp", "humidity", "windy".]
Pr(outlook=sunny | windy=true, play=yes) = Pr(outlook=sunny | play=yes)

Weather data set

Outlook    Windy   Play
sunny      FALSE   no
sunny      TRUE    no
overcast   FALSE   yes
rainy      FALSE   yes
rainy      FALSE   yes
rainy      TRUE    no
overcast   TRUE    yes
sunny      FALSE   no
sunny      FALSE   yes
rainy      FALSE   yes
sunny      TRUE    yes
overcast   TRUE    yes
overcast   FALSE   yes
rainy      TRUE    no

(The standard 14-day weather data; only the Outlook, Windy and Play columns are shown here.)

Is the assumption satisfied?
Counts from the weather data above:
#(play=yes) = 9
#(sunny, yes) = 2
#(windy, yes) = 3
#(sunny, windy, yes) = 1
Pr(outlook=sunny | windy=true, play=yes) = 1/3
Pr(outlook=sunny | play=yes) = 2/9
Pr(windy=true | outlook=sunny, play=yes) = 1/2
Pr(windy=true | play=yes) = 3/9
Thus, the independence assumption is NOT satisfied exactly. But we can tolerate some error (see later slides).
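A minimal Python sketch, assuming the 14-instance weather data above, that reproduces these counts:

# (outlook, windy, play) triples from the weather data above
data = [
    ("sunny", False, "no"), ("sunny", True, "no"), ("overcast", False, "yes"),
    ("rainy", False, "yes"), ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"), ("sunny", False, "yes"),
    ("rainy", False, "yes"), ("sunny", True, "yes"), ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]

n_yes = sum(1 for o, w, p in data if p == "yes")                            # 9
n_sunny_yes = sum(1 for o, w, p in data if o == "sunny" and p == "yes")     # 2
n_windy_yes = sum(1 for o, w, p in data if w and p == "yes")                # 3
n_sunny_windy_yes = sum(1 for o, w, p in data
                        if o == "sunny" and w and p == "yes")               # 1

print(n_sunny_windy_yes / n_windy_yes)   # Pr(sunny | windy=true, yes) = 1/3
print(n_sunny_yes / n_yes)               # Pr(sunny | yes)             = 2/9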

Probabilities for the weather data

Relative frequencies of each attribute value, per class:

                        Yes    No
Outlook     Sunny       2/9    3/5
            Overcast    4/9    0/5
            Rainy       3/9    2/5
Temperature Hot         2/9    2/5
            Mild        4/9    2/5
            Cool        3/9    1/5
Humidity    High        3/9    4/5
            Normal      6/9    1/5
Windy       False       6/9    2/5
            True        3/9    3/5
Play        (prior)     9/14   5/14

A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Likelihood of the two classes:
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P("yes") = 0.0053 / (0.0053 + 0.0206) = 0.205
P("no")  = 0.0206 / (0.0053 + 0.0206) = 0.795
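The same calculation as a short Python sketch (the dictionaries simply restate the relative frequencies from the table; names such as likelihood are only for this illustration):

# Per-class relative frequencies taken from the table above
p_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
p_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}
prior = {"yes": 9/14, "no": 5/14}

def likelihood(cond_probs, prior_prob):
    """Multiply the class prior by the conditional probability of each attribute value."""
    result = prior_prob
    for p in cond_probs.values():
        result *= p
    return result

l_yes = likelihood(p_yes, prior["yes"])   # ~0.0053
l_no = likelihood(p_no, prior["no"])      # ~0.0206

# Normalize to obtain posterior probabilities
print(l_yes / (l_yes + l_no))   # ~0.205
print(l_no / (l_yes + l_no))    # ~0.795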

Bayes' rule
Probability of event H given evidence E:
Pr(H | E) = Pr(E | H) × Pr(H) / Pr(E)
A priori probability of H: Pr(H)
Probability of the event before evidence has been seen.
A posteriori probability of H: Pr(H | E)
Probability of the event after evidence has been seen.
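As a one-line illustration on the weather data above: Pr(play=yes | outlook=sunny) = Pr(sunny | yes) × Pr(yes) / Pr(sunny) = (2/9 × 9/14) / (5/14) = 2/5.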

Naïve Bayes for classification
Classification learning: what is the probability of the class given an instance?
Evidence E = an instance (its attribute values E1, E2, …, En)
Event H = class value for the instance (Play=yes or Play=no)
Naïve Bayes assumption: the evidence can be split into independent parts (i.e. the attributes of the instance are independent, given the class):
Pr(H | E) = Pr(E1 | H) × Pr(E2 | H) × … × Pr(En | H) × Pr(H) / Pr(E)
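A compact Python sketch of such a classifier for nominal attributes; training just counts attribute value/class combinations (the helper names train and predict are only illustrative):

from collections import defaultdict

def train(instances, labels):
    """Count class frequencies and attribute value/class co-occurrences."""
    class_counts = defaultdict(int)
    value_counts = defaultdict(int)   # key: (attribute index, value, class)
    for x, y in zip(instances, labels):
        class_counts[y] += 1
        for i, v in enumerate(x):
            value_counts[(i, v, y)] += 1
    return class_counts, value_counts

def predict(x, class_counts, value_counts):
    """Return normalized class probabilities Pr(H|E) ~ Pr(E1|H)...Pr(En|H)Pr(H)."""
    n = sum(class_counts.values())
    scores = {}
    for c, nc in class_counts.items():
        score = nc / n                                   # prior Pr(H)
        for i, v in enumerate(x):
            score *= value_counts[(i, v, c)] / nc        # Pr(Ei | H), 0 if unseen
        scores[c] = score
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Usage on the weather data: X = list of (outlook, temp, humidity, windy) tuples,
# y = list of "yes"/"no" labels, then predict(("sunny", "cool", "high", "true"), *train(X, y)).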

The weather data example
Evidence E (a new day):
Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?
Probability for class "yes":
Pr(yes | E) = Pr(outlook=sunny | yes) × Pr(temp=cool | yes) × Pr(humidity=high | yes) × Pr(windy=true | yes) × Pr(yes) / Pr(E)
            = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr(E)

The "zero-frequency problem"
What if an attribute value never occurs with some class value (e.g. "Outlook = overcast" never occurs with class "no": the 0/5 entry in the table above)?
The estimated conditional probability will be zero!
The a posteriori probability will then also be zero, no matter how likely the other attribute values are!
Remedy: add 1 to the count for every attribute value/class combination (Laplace estimator; see the sketch below).
Result: probabilities will never be zero! (This also stabilizes the probability estimates.)
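A minimal sketch of the Laplace estimator, using the outlook counts for class "no" from the table above (laplace_prob is an illustrative helper, not a library function):

def laplace_prob(value_class_count, class_count, n_values):
    """Pr(attribute=value | class) with add-one (Laplace) smoothing.
    n_values = number of distinct values the attribute can take."""
    return (value_class_count + 1) / (class_count + n_values)

# Outlook has 3 values (sunny, overcast, rainy); class "no" has 5 instances.
print(laplace_prob(0, 5, 3))   # Pr(overcast | no) = 1/8 instead of 0
print(laplace_prob(3, 5, 3))   # Pr(sunny | no)    = 4/8
print(laplace_prob(2, 5, 3))   # Pr(rainy | no)    = 3/8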

Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate.
Example: attribute outlook for class yes, with a constant μ split equally over the three values:
Sunny:    (2 + μ/3) / (9 + μ)
Overcast: (4 + μ/3) / (9 + μ)
Rainy:    (3 + μ/3) / (9 + μ)
The weights don't need to be equal, as long as they sum to 1: use μp1, μp2, μp3 with p1 + p2 + p3 = 1 instead of μ/3.
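For instance, with μ = 3 and equal weights this gives Pr(sunny | yes) = (2 + 1) / (9 + 3) = 3/12 = 0.25; adding 1 per value, i.e. the Laplace estimator, is exactly this special case.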

Missing values
Training: the instance is not included in the frequency count for that attribute value/class combination.
Classification: the attribute is simply omitted from the calculation.
Example (Outlook is missing):
Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?
Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
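A small sketch of the classification-time rule (illustrative helper names; the conditional probabilities shown are those for class "yes"): simply skip any attribute whose value is unknown.

def likelihood_with_missing(x, cond_probs, prior_prob):
    """Multiply Pr(Ei | H) only for attributes whose value is present."""
    result = prior_prob
    for attribute, value in x.items():
        if value is None:          # missing value: omit from the calculation
            continue
        result *= cond_probs[(attribute, value)]
    return result

# Example: Outlook is missing on the new day
cond_yes = {("temp", "cool"): 3/9, ("humidity", "high"): 3/9, ("windy", "true"): 3/9}
day = {"outlook": None, "temp": "cool", "humidity": "high", "windy": "true"}
print(likelihood_with_missing(day, cond_yes, 9/14))   # ~0.0238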

Dealing with numeric attributes
Usual assumption: attributes have a normal (Gaussian) probability distribution, given the class.
The probability density function of the normal distribution is defined by two parameters:
The sample mean:        μ = (1/n) × Σ xi
The standard deviation: σ = √( (1/(n−1)) × Σ (xi − μ)² )
The density function:   f(x) = 1/(√(2π) σ) × exp( −(x − μ)² / (2σ²) )
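A small Python sketch of this density function, checked against the statistics on the next slide (normal_density is an illustrative name):

import math

def normal_density(x, mean, std):
    """Gaussian probability density f(x) with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

# Temperature = 66 for class "yes" (mean 73, std dev 6.2, from the next slide)
print(normal_density(66, 73, 6.2))   # ~0.0340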

Statistics for the weather data
Outlook, Windy and Play have the same counts and relative frequencies as in the table above.
Numeric attributes are summarized per class by the mean and standard deviation of their observed values (temperature 83, 70, 68, … for "yes"; 85, 80, 65, … for "no"; similarly for humidity):

              Temperature       Humidity
              Yes     No        Yes     No
mean          73      74.6      79.1    86.2
std dev       6.2     7.9       10.2    9.7

Example density value:
f(temperature=66 | yes) = 1/(√(2π) × 6.2) × exp( −(66 − 73)² / (2 × 6.2²) ) = 0.0340

Classifying a new day
A new day:
Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?
Likelihood of "yes" = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 = 0.000036
Likelihood of "no"  = 3/5 × 0.0291 × 0.0380 × 3/5 × 5/14 = 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%
Missing values during training are not included in the calculation of the mean and standard deviation.
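Putting the pieces together for this day in Python, reusing the normal_density sketch from above (means and standard deviations as in the statistics slide):

l_yes = (2/9) * normal_density(66, 73, 6.2) * normal_density(90, 79.1, 10.2) * (3/9) * (9/14)
l_no  = (3/5) * normal_density(66, 74.6, 7.9) * normal_density(90, 86.2, 9.7) * (3/5) * (5/14)
print(l_yes / (l_yes + l_no))   # ~0.21, i.e. P("yes") ≈ 20.9%
print(l_no / (l_yes + l_no))    # ~0.79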

Probability densities
Relationship between probability and density:
Pr[ c − ε/2 ≤ x ≤ c + ε/2 ] ≈ ε × f(c)
But this doesn't change the calculation of the a posteriori probabilities, because ε cancels out.
Exact relationship:
Pr[ a ≤ x ≤ b ] = ∫ from a to b of f(t) dt

Example of Naïve Bayes in Weka Use Weka Naïve Bayes Module to classify Weather.nominal.arff

Discussion of Naïve Bayes
Naïve Bayes works surprisingly well, even when the independence assumption is clearly violated.
Why? Because classification doesn't require accurate probability estimates, as long as the maximum probability is assigned to the correct class.
However, adding too many redundant attributes will cause problems (e.g. identical attributes).
Note also: many numeric attributes are not normally distributed.