1 Bayesian Classification Instructor: Qiang Yang Hong Kong University of Science and Technology Thanks: Dan Weld, Eibe Frank

2 Goal: Mining Probability Models
Probability Basics
Our state s in the world W is distributed according to a probability distribution:
  0 <= Pr(s) <= 1 for all s ∈ S
  Σ_{s ∈ S} Pr(s) = 1
For subsets S1 and S2: Pr(S1 ∪ S2) = Pr(S1) + Pr(S2) - Pr(S1 ∩ S2)
Bayes Rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)

3 Weather data set

Outlook   Temperature  Humidity  Windy  Play
sunny     hot          high      FALSE  no
sunny     hot          high      TRUE   no
overcast  hot          high      FALSE  yes
rainy     mild         high      FALSE  yes
rainy     cool         normal    FALSE  yes
rainy     cool         normal    TRUE   no
overcast  cool         normal    TRUE   yes
sunny     mild         high      FALSE  no
sunny     cool         normal    FALSE  yes
rainy     mild         normal    FALSE  yes
sunny     mild         normal    TRUE   yes
overcast  mild         high      TRUE   yes
overcast  hot          normal    FALSE  yes
rainy     mild         high      TRUE   no

4 Basics
Unconditional or prior probability:
  Pr(Play=yes) + Pr(Play=no) = 1
  Pr(Play=yes) is sometimes written as Pr(Play)
The table has 9 yes and 5 no, so Pr(Play=yes) = 9/(9+5) = 9/14 and thus Pr(Play=no) = 5/14
Joint probability of Play and Windy: Pr(Play=x, Windy=y), summed over all values x and y, should be 1

            Windy=True   Windy=False
Play=yes      3/14          6/14
Play=no        ?             ?

(From the data table, the two missing entries are 3/14 and 2/14, so the four joint probabilities sum to 1.)
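A minimal Python sketch (my addition, not part of the original slides) that recovers these numbers from the (Windy, Play) column of the weather table; the variable names are mine.

```python
from collections import Counter

# The 14 (Windy, Play) pairs from the weather data set on slide 3.
data = [
    ("FALSE", "no"), ("TRUE", "no"), ("FALSE", "yes"), ("FALSE", "yes"),
    ("FALSE", "yes"), ("TRUE", "no"), ("TRUE", "yes"), ("FALSE", "no"),
    ("FALSE", "yes"), ("FALSE", "yes"), ("TRUE", "yes"), ("TRUE", "yes"),
    ("FALSE", "yes"), ("TRUE", "no"),
]
n = len(data)

# Prior probability of the class: Pr(Play=yes) = 9/14, Pr(Play=no) = 5/14.
play_counts = Counter(play for _, play in data)
prior = {play: count / n for play, count in play_counts.items()}
print(prior)

# Joint probability Pr(Play, Windy); the four entries sum to 1.
joint = Counter(data)
for (windy, play), count in sorted(joint.items()):
    print(f"Pr(Play={play}, Windy={windy}) = {count}/{n} = {count / n:.3f}")
```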

5 Probability Basics
Conditional probability Pr(A|B):
  #(Windy=False) = 8; within those 8, #(Play=yes) = 6
  Pr(Play=yes | Windy=False) = 6/8
  Pr(Windy=False) = 8/14, Pr(Play=yes) = 9/14
Applying Bayes Rule, Pr(B|A) = Pr(A|B) Pr(B) / Pr(A):
  Pr(Windy=False | Play=yes) = (6/8) × (8/14) / (9/14) = 6/9

Windy  Play
FALSE  no
TRUE   no
FALSE  yes
FALSE  yes
FALSE  yes
TRUE   no
TRUE   yes
FALSE  no
FALSE  yes
FALSE  yes
TRUE   yes
TRUE   yes
FALSE  yes
TRUE   no
(8 rows have Windy=False; 6 of those have Play=yes)
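The Bayes Rule step as a short sketch (my addition), working directly from the counts on this slide; exact fractions come from the standard library.

```python
from fractions import Fraction

n_total = 14               # instances in the weather data set
n_windy_false = 8          # instances with Windy=False
n_yes_and_windy_false = 6  # of those, instances with Play=yes
n_yes = 9                  # instances with Play=yes overall

pr_yes_given_windy_false = Fraction(n_yes_and_windy_false, n_windy_false)  # 6/8
pr_windy_false = Fraction(n_windy_false, n_total)                          # 8/14
pr_yes = Fraction(n_yes, n_total)                                          # 9/14

# Bayes Rule: Pr(Windy=False | Play=yes) = Pr(Play=yes | Windy=False) * Pr(Windy=False) / Pr(Play=yes)
pr_windy_false_given_yes = pr_yes_given_windy_false * pr_windy_false / pr_yes
print(pr_windy_false_given_yes)  # 2/3, i.e. 6/9
```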

6 Conditional Independence
Suppose C = True.
  Pr(A|C) = Pr(A, C) / Pr(C) = 0.04 / 0.1 = 0.4
  Pr(A|P, C) = Pr(A, P, C) / Pr(P, C) = 0.032 / 0.080 = 0.4
"A and P are independent given C":
  Pr(A | P, C) = Pr(A | C), and also Pr(P | A, C) = Pr(P | C)
(The slide's joint table lists Pr(C, A, P) for all eight T/F combinations of C, A, P; for example, Pr(C=T, A=T, P=T) = 0.032.)
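To make the check concrete, here is a sketch (my addition) with a hypothetical joint table: the C=True entries are chosen to match the slide's numbers (Pr(C)=0.1, Pr(A,C)=0.04, Pr(P,C)=0.08, Pr(A,P,C)=0.032), while the C=False entries are made up purely for illustration.

```python
# Hypothetical joint distribution Pr(C, A, P); keys are (C, A, P) truth values.
joint = {
    (True,  True,  True):  0.032, (True,  True,  False): 0.008,
    (True,  False, True):  0.048, (True,  False, False): 0.012,
    (False, True,  True):  0.018, (False, True,  False): 0.072,
    (False, False, True):  0.162, (False, False, False): 0.648,
}

def prob(c=None, a=None, p=None):
    """Joint/marginal probability, summing out any variable left as None."""
    return sum(pr for (ci, ai, pi), pr in joint.items()
               if (c is None or ci == c) and (a is None or ai == a) and (p is None or pi == p))

pr_a_given_c  = prob(c=True, a=True) / prob(c=True)                   # 0.04  / 0.1  = 0.4
pr_a_given_pc = prob(c=True, a=True, p=True) / prob(c=True, p=True)   # 0.032 / 0.08 = 0.4
print(pr_a_given_c, pr_a_given_pc)  # equal, so A and P are conditionally independent given C
```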

7 Naïve Bayesian Models
Two assumptions: attributes are
  - equally important
  - statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known).
Although based on assumptions that are almost never correct, this scheme works well in practice!

8 Why Naïve?
Assume the attributes are independent, given the class. What does that mean?
[Figure: class node "play" with arrows to the attribute nodes outlook, temp, humidity, windy]
Pr(outlook=sunny | windy=true, play=yes) = Pr(outlook=sunny | play=yes)

9 Weather data set (the 9 instances with Play = yes)

Outlook   Windy  Play
overcast  FALSE  yes
rainy     FALSE  yes
rainy     FALSE  yes
overcast  TRUE   yes
sunny     FALSE  yes
rainy     FALSE  yes
sunny     TRUE   yes
overcast  TRUE   yes
overcast  FALSE  yes

10 Is the assumption satisfied?
  #yes = 9, #(sunny, yes) = 2, #(windy, yes) = 3, #(sunny and windy, yes) = 1
  Pr(outlook=sunny | windy=true, play=yes) = 1/3
  Pr(outlook=sunny | play=yes) = 2/9
  Pr(windy=true | outlook=sunny, play=yes) = 1/2
  Pr(windy=true | play=yes) = 3/9
Thus, the assumption is NOT satisfied. But we can tolerate some errors (see later slides).
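A quick sketch (my addition) that reproduces these counts from the nine Play=yes instances above and compares the two estimates that the independence assumption says should be equal.

```python
from fractions import Fraction

# The nine (Outlook, Windy) pairs with Play=yes, from the table above.
yes_rows = [
    ("overcast", False), ("rainy", False), ("rainy", False),
    ("overcast", True),  ("sunny", False), ("rainy", False),
    ("sunny", True),     ("overcast", True), ("overcast", False),
]

n_yes = len(yes_rows)                                              # 9
n_sunny = sum(o == "sunny" for o, _ in yes_rows)                   # 2
n_windy = sum(w for _, w in yes_rows)                              # 3
n_sunny_and_windy = sum(o == "sunny" and w for o, w in yes_rows)   # 1

# If Outlook were independent of Windy given Play=yes, these two would be equal:
print(Fraction(n_sunny_and_windy, n_windy))  # Pr(sunny | windy, yes) = 1/3
print(Fraction(n_sunny, n_yes))              # Pr(sunny | yes)        = 2/9
```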

11 Probabilities for the weather data

Outlook      Yes  No     Temperature  Yes  No     Humidity  Yes  No     Windy   Yes  No     Play   Yes   No
Sunny         2    3     Hot           2    2     High       3    4     False    6    2             9     5
Overcast      4    0     Mild          4    2     Normal     6    1     True     3    3
Rainy         3    2     Cool          3    1

Sunny        2/9  3/5    Hot          2/9  2/5    High      3/9  4/5    False   6/9  2/5           9/14  5/14
Overcast     4/9  0/5    Mild         4/9  2/5    Normal    6/9  1/5    True    3/9  3/5
Rainy        3/9  2/5    Cool         3/9  1/5

A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

12 Likelihood of the two classes
For "yes" = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For "no"  = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
  P("yes" | E) = 0.0053 / (0.0053 + 0.0206) = 0.205
  P("no" | E)  = 0.0206 / (0.0053 + 0.0206) = 0.795
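The same arithmetic as a small sketch (my addition); the conditional probabilities are read off the table on slide 11, and the function name is mine.

```python
# Naive Bayes scores for the new day: Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True.
p_given_yes = {"outlook=sunny": 2/9, "temp=cool": 3/9, "humidity=high": 3/9, "windy=true": 3/9}
p_given_no  = {"outlook=sunny": 3/5, "temp=cool": 1/5, "humidity=high": 4/5, "windy=true": 3/5}
p_yes, p_no = 9/14, 5/14

def likelihood(cond_probs, prior):
    score = prior
    for p in cond_probs.values():
        score *= p            # multiply the per-attribute conditional probabilities
    return score

l_yes = likelihood(p_given_yes, p_yes)  # ≈ 0.0053
l_no  = likelihood(p_given_no, p_no)    # ≈ 0.0206

# Normalize so the two posterior probabilities sum to 1.
print(l_yes / (l_yes + l_no))  # ≈ 0.205
print(l_no / (l_yes + l_no))   # ≈ 0.795
```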

13 Bayes' rule
Probability of event H given evidence E:
  Pr(H | E) = Pr(E | H) Pr(H) / Pr(E)
A priori probability of H, Pr(H): probability of the event before evidence has been seen.
A posteriori probability of H, Pr(H | E): probability of the event after evidence has been seen.

14 Naïve Bayes for classification
Classification learning: what's the probability of the class given an instance?
  Evidence E = an instance
  Event H = class value for the instance (Play=yes or Play=no)
Naïve Bayes assumption: the evidence can be split into independent parts (i.e. the attributes of the instance are independent given the class):
  Pr(H | E) = Pr(E1 | H) × Pr(E2 | H) × … × Pr(En | H) × Pr(H) / Pr(E)

15 The weather data example

Outlook  Temp.  Humidity  Windy  Play
Sunny    Cool   High      True   ?

Evidence E = (Outlook=Sunny, Temp=Cool, Humidity=High, Windy=True)
Probability for class "yes":
  Pr(yes | E) = Pr(Outlook=Sunny | yes) × Pr(Temp=Cool | yes) × Pr(Humidity=High | yes) × Pr(Windy=True | yes) × Pr(yes) / Pr(E)
              = (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr(E)

16 The "zero-frequency problem"
What if an attribute value doesn't occur with every class value? In this data set, for example, Outlook = overcast never occurs for class "no", so the estimated Pr(Outlook=overcast | no) = 0/5 = 0.
The a posteriori probability would then also be zero, no matter how likely the other attribute values are!
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator).
Result: probabilities will never be zero! (Also: stabilizes probability estimates.)
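A sketch of the Laplace estimator (my addition), applied to the Outlook counts for class "no" from slide 11.

```python
from fractions import Fraction

# Outlook counts among the 5 "no" instances.
counts_no = {"sunny": 3, "overcast": 0, "rainy": 2}
n_no = 5
n_values = len(counts_no)   # 3 possible Outlook values

# Raw relative frequencies: Pr(overcast | no) = 0, which would zero out any product it appears in.
raw = {v: Fraction(c, n_no) for v, c in counts_no.items()}

# Laplace estimator: add 1 to every count, so the denominator grows by the number of values.
laplace = {v: Fraction(c + 1, n_no + n_values) for v, c in counts_no.items()}

print(raw)      # sunny 3/5, overcast 0, rainy 2/5
print(laplace)  # sunny 1/2, overcast 1/8, rainy 3/8 -- never zero, and still sums to 1
```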

17 Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate.
Example: attribute Outlook for class "yes", with a constant μ divided among the values by weights p1, p2, p3:
  Sunny:    (2 + μ·p1) / (9 + μ)
  Overcast: (4 + μ·p2) / (9 + μ)
  Rainy:    (3 + μ·p3) / (9 + μ)
The weights don't need to be equal (but they must sum to 1).
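A sketch of this weighted smoothing (my addition); the value of μ and the prior weights are assumptions chosen for illustration.

```python
def smoothed_estimate(count, class_total, mu, prior_weight):
    """(count + mu * prior_weight) / (class_total + mu)."""
    return (count + mu * prior_weight) / (class_total + mu)

# Outlook counts for class "yes": sunny 2, overcast 4, rainy 3 (out of 9 yes instances).
counts = {"sunny": 2, "overcast": 4, "rainy": 3}
mu = 3.0
weights = {"sunny": 1/3, "overcast": 1/3, "rainy": 1/3}  # equal weights; they only need to sum to 1

for value, count in counts.items():
    print(value, smoothed_estimate(count, 9, mu, weights[value]))
# sunny 0.25, overcast ~0.417, rainy ~0.333 -- the three estimates still sum to 1
```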

18 Missing values
Training: the instance is not included in the frequency count for that attribute value-class combination.
Classification: the attribute is omitted from the calculation.
Example:

Outlook  Temp.  Humidity  Windy  Play
?        Cool   High      True   ?

Likelihood of "yes" = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of "no"  = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P("yes") = 0.0238 / (0.0238 + 0.0343) = 41%
P("no")  = 0.0343 / (0.0238 + 0.0343) = 59%
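A sketch (my addition) of how a missing attribute value is simply skipped in the product; the probabilities come from the table on slide 11, and None marks the missing Outlook.

```python
# The instance to classify; Outlook is missing.
instance = {"outlook": None, "temp": "cool", "humidity": "high", "windy": "true"}

cond = {
    "yes": {"outlook": {"sunny": 2/9, "overcast": 4/9, "rainy": 3/9},
            "temp": {"cool": 3/9}, "humidity": {"high": 3/9}, "windy": {"true": 3/9}},
    "no":  {"outlook": {"sunny": 3/5, "overcast": 0/5, "rainy": 2/5},
            "temp": {"cool": 1/5}, "humidity": {"high": 4/5}, "windy": {"true": 3/5}},
}
prior = {"yes": 9/14, "no": 5/14}

scores = {}
for cls in ("yes", "no"):
    score = prior[cls]
    for attr, value in instance.items():
        if value is None:       # missing value: omit the attribute from the calculation
            continue
        score *= cond[cls][attr][value]
    scores[cls] = score

total = sum(scores.values())
print({cls: round(s / total, 2) for cls, s in scores.items()})  # {'yes': 0.41, 'no': 0.59}
```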

19 Dealing with numeric attributes
Usual assumption: attributes have a normal (Gaussian) probability distribution, given the class.
The probability density function for the normal distribution is defined by two parameters:
  The sample mean μ:        μ = (1/n) Σ_{i=1..n} x_i
  The standard deviation σ: σ = sqrt( (1/(n-1)) Σ_{i=1..n} (x_i - μ)² )
  The density function:     f(x) = 1/(σ·sqrt(2π)) · exp( -(x - μ)² / (2σ²) )
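The three formulas as a small sketch (my addition); the function names are mine.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def std_dev(xs):
    m = mean(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / (len(xs) - 1))   # n-1 denominator, as above

def gaussian_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# With the temperature statistics for class "yes" from the next slide (mean 73, std dev 6.2):
print(round(gaussian_density(66, 73, 6.2), 4))   # ≈ 0.034
```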

20 Statistics for the weather data

Outlook      Yes  No     Temperature   Yes    No     Humidity    Yes    No     Windy   Yes  No     Play   Yes   No
Sunny         2    3     (values)       …      …     (values)     …      …     False    6    2             9     5
Overcast      4    0     mean          73     74.6   mean        79.1   86.2   True     3    3
Rainy         3    2     std dev        6.2    7.9   std dev     10.2    9.7

Sunny        2/9  3/5                                                          False   6/9  2/5           9/14  5/14
Overcast     4/9  0/5                                                          True    3/9  3/5
Rainy        3/9  2/5

Example density value: f(temperature = 66 | yes) = 0.0340

21 Classifying a new day
A new day:

Outlook  Temp.  Humidity  Windy  Play
Sunny    66     90        true   ?

Likelihood of "yes" = 2/9 × f(temperature=66 | yes) × f(humidity=90 | yes) × 3/9 × 9/14
                    = 2/9 × 0.0340 × 0.0221 × 3/9 × 9/14 ≈ 0.000036
Likelihood of "no"  = 3/5 × f(temperature=66 | no) × f(humidity=90 | no) × 3/5 × 5/14
                    = 3/5 × 0.0279 × 0.0381 × 3/5 × 5/14 ≈ 0.000136
P("yes") = 0.000036 / (0.000036 + 0.000136) = 20.9%
P("no")  = 0.000136 / (0.000036 + 0.000136) = 79.1%
Missing values during training: not included in the calculation of the mean and standard deviation.
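Putting the pieces together as a sketch (my addition): nominal attributes use the relative frequencies from slide 11, and numeric attributes use the Gaussian density with the per-class mean and standard deviation from slide 20.

```python
import math

def gaussian_density(x, mu, sigma):
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

# Per-class statistics (slide 20): nominal attributes as relative frequencies,
# numeric attributes as (mean, standard deviation).
prior = {"yes": 9/14, "no": 5/14}
outlook_sunny = {"yes": 2/9, "no": 3/5}
windy_true = {"yes": 3/9, "no": 3/5}
temperature = {"yes": (73.0, 6.2), "no": (74.6, 7.9)}
humidity = {"yes": (79.1, 10.2), "no": (86.2, 9.7)}

# New day: Outlook=Sunny, Temperature=66, Humidity=90, Windy=true.
scores = {}
for cls in ("yes", "no"):
    scores[cls] = (outlook_sunny[cls]
                   * gaussian_density(66, *temperature[cls])
                   * gaussian_density(90, *humidity[cls])
                   * windy_true[cls]
                   * prior[cls])

total = sum(scores.values())
print({cls: round(s / total, 2) for cls, s in scores.items()})  # {'yes': 0.21, 'no': 0.79}
```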

22 Probability densities
Relationship between probability and density:
  Pr[c - ε/2 ≤ x ≤ c + ε/2] ≈ ε · f(c)
But this doesn't change the calculation of a posteriori probabilities, because ε cancels out.
Exact relationship:
  Pr[a ≤ x ≤ b] = ∫_a^b f(t) dt
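A quick numerical check of the approximation (my addition), using the temperature-given-yes Gaussian from slide 20; the choices of ε and the step count are arbitrary.

```python
import math

def gaussian_density(x, mu=73.0, sigma=6.2):    # temperature | yes (slide 20)
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

c, eps = 66.0, 0.1

# Pr[c - eps/2 <= x <= c + eps/2] by simple midpoint integration of the density.
steps = 1000
width = eps / steps
integral = sum(gaussian_density(c - eps / 2 + (i + 0.5) * width) * width for i in range(steps))

print(eps * gaussian_density(c), integral)   # both ≈ 0.0034, so Pr ≈ eps * f(c) for small eps
```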

23 Discussion of Naïve Bayes
Naïve Bayes works surprisingly well, even if the independence assumption is clearly violated.
Why? Because classification doesn't require accurate probability estimates as long as the maximum probability is assigned to the correct class.
However: adding too many redundant attributes will cause problems (e.g. identical attributes).
Note also: many numeric attributes are not normally distributed.