1 Bayesian Classification
Instructor: Qiang Yang, Hong Kong University of Science and Technology
Thanks: Dan Weld, Eibe Frank

2 Goal: Mining Probability Models
Probability Basics
- Our state s in world W is distributed according to a probability distribution Pr
- 0 ≤ Pr(s) ≤ 1 for all s ∈ S
- Σ_{s ∈ S} Pr(s) = 1
- For subsets S1 and S2: Pr(S1 ∪ S2) = Pr(S1) + Pr(S2) − Pr(S1 ∩ S2)
- Bayes Rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)

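3 Weather data set

| Outlook | Temperature | Humidity | Windy | Play |
|---|---|---|---|---|
| sunny | hot | high | FALSE | no |
| sunny | hot | high | TRUE | no |
| overcast | hot | high | FALSE | yes |
| rainy | mild | high | FALSE | yes |
| rainy | cool | normal | FALSE | yes |
| rainy | cool | normal | TRUE | no |
| overcast | cool | normal | TRUE | yes |
| sunny | mild | high | FALSE | no |
| sunny | cool | normal | FALSE | yes |
| rainy | mild | normal | FALSE | yes |
| sunny | mild | normal | TRUE | yes |
| overcast | mild | high | TRUE | yes |
| overcast | hot | normal | FALSE | yes |
| rainy | mild | high | TRUE | no |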

4 Basics
Unconditional or Prior Probability
- Pr(Play=yes) + Pr(Play=no) = 1
- Pr(Play=yes) is sometimes written as Pr(Play)
- The table has 9 yes and 5 no, so Pr(Play=yes) = 9/(9+5) = 9/14 and thus Pr(Play=no) = 5/14
Joint Probability of Play and Windy
- Pr(Play=x, Windy=y), summed over all values x and y, should be 1
- Slide table (rows Play=yes, Play=no; columns Windy=True, Windy=False), entries as shown on the slide: 3/14, ?, 6/14
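The joint table on this slide can be tabulated directly from the weather data. The sketch below is one way to do it, assuming the (Windy, Play) pairs are copied in row order from the weather data set; the names `ROWS` and `joint_table` are illustrative and not part of the original slides.

```python
from collections import Counter
from fractions import Fraction

# (Windy, Play) pairs for the 14 days, in the row order of the weather data set.
ROWS = [
    (False, "no"), (True, "no"), (False, "yes"), (False, "yes"),
    (False, "yes"), (True, "no"), (True, "yes"), (False, "no"),
    (False, "yes"), (False, "yes"), (True, "yes"), (True, "yes"),
    (False, "yes"), (True, "no"),
]

def joint_table(rows):
    """Map each (windy, play) pair to its relative frequency Pr(Windy, Play)."""
    total = len(rows)
    return {pair: Fraction(n, total) for pair, n in Counter(rows).items()}

joint = joint_table(ROWS)
for (windy, play), prob in sorted(joint.items()):
    # Fraction prints in lowest terms, so 6/14 is shown as 3/7.
    print(f"Pr(Windy={windy}, Play={play}) = {prob}")
print("sum over all cells =", sum(joint.values()))
```

Because the four cells are relative frequencies over the same 14 rows, they necessarily sum to 1, which is the check the slide asks for.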

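5 Probability Basics
Conditional Probability Pr(A|B)
- #(Windy=False) = 8
- Within the 8, #(Play=yes) = 6
- Pr(Play=yes | Windy=False) = 6/8
- Pr(Windy=False) = 8/14
- Pr(Play=yes) = 9/14
Applying Bayes Rule: Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
- Pr(Windy=False | Play=yes) = (6/8 × 8/14) / (9/14) = 6/9
In the table, * on Windy marks the 8 rows with Windy=False, and * on Play marks the 6 of those rows with Play=yes.

| Windy | Play |
|---|---|
| *FALSE | no |
| TRUE | no |
| *FALSE | *yes |
| *FALSE | *yes |
| *FALSE | *yes |
| TRUE | no |
| TRUE | yes |
| *FALSE | no |
| *FALSE | *yes |
| *FALSE | *yes |
| TRUE | yes |
| TRUE | yes |
| *FALSE | *yes |
| TRUE | no |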
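The Bayes-rule step above can be checked with exact fractions. This is a small sketch using only the counts quoted on the slide; the variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted on the slide (14 days in total).
N_DAYS = 14
N_NOT_WINDY = 8           # days with Windy=False
N_PLAY_YES = 9            # days with Play=yes
N_YES_AND_NOT_WINDY = 6   # days with Play=yes among the Windy=False days

pr_not_windy = Fraction(N_NOT_WINDY, N_DAYS)
pr_yes = Fraction(N_PLAY_YES, N_DAYS)
pr_yes_given_not_windy = Fraction(N_YES_AND_NOT_WINDY, N_NOT_WINDY)

# Bayes' rule with A = (Play=yes) and B = (Windy=False):
# Pr(B|A) = Pr(A|B) Pr(B) / Pr(A)
pr_not_windy_given_yes = pr_yes_given_not_windy * pr_not_windy / pr_yes

# Direct count for comparison: 6 of the 9 Play=yes days are not windy.
direct = Fraction(N_YES_AND_NOT_WINDY, N_PLAY_YES)
print(pr_not_windy_given_yes, direct)
```

Both routes give 6/9 (printed in lowest terms as 2/3), so applying Bayes' rule agrees with counting the rows directly.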

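6 Conditional Independence
"A and P are independent given C": Pr(A | P,C) = Pr(A | C)
Variables: Cavity (C), Probe Catches (P), Ache (A)

| C | A | P | Probability |
|---|---|---|---|
| F | F | F | |
| F | F | T | |
| F | T | F | |
| F | T | T | |
| T | F | F | 0.012 |
| T | F | T | 0.048 |
| T | T | F | 0.008 |
| T | T | T | 0.032 |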

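7 Conditional Independence
"A and P are independent given C": Pr(A | P,C) = Pr(A | C), and also Pr(P | A,C) = Pr(P | C)
Suppose C=True:
- Pr(A|C) = Pr(A,C) / Pr(C) = 0.04 / 0.1 = 0.4
- Pr(A|P,C) = 0.032 / (0.032 + 0.048) = 0.032 / 0.080 = 0.4

| C | A | P | Probability |
|---|---|---|---|
| F | F | F | |
| F | F | T | |
| F | T | F | |
| F | T | T | |
| T | F | F | 0.012 |
| T | F | T | 0.048 |
| T | T | F | 0.008 |
| T | T | T | 0.032 |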
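The conditional-independence claim can be verified numerically for the C=True half of the table. The sketch below is an illustration under a stated assumption: 0.032 and 0.008 appear on the slides, while 0.048 and 0.012 are back-solved from the slide's own totals (0.080 and 0.1). The function names are hypothetical.

```python
import math

# Pr(C=True, A=a, P=p), keyed by (a, p).
# Assumption: 0.032 and 0.008 are read from the slides; 0.048 and 0.012
# are back-solved from the totals 0.080 and 0.1 used in the slide's arithmetic.
JOINT_C_TRUE = {
    (False, False): 0.012,
    (False, True): 0.048,
    (True, False): 0.008,
    (True, True): 0.032,
}

def pr_ache_given_cavity(joint):
    """Pr(A=True | C=True): sum out P, then divide by Pr(C=True)."""
    pr_c = sum(joint.values())
    pr_a_and_c = sum(v for (ache, _), v in joint.items() if ache)
    return pr_a_and_c / pr_c

def pr_ache_given_probe_and_cavity(joint, probe):
    """Pr(A=True | P=probe, C=True)."""
    pr_p_and_c = sum(v for (_, p), v in joint.items() if p == probe)
    return joint[(True, probe)] / pr_p_and_c

print(pr_ache_given_cavity(JOINT_C_TRUE))
print(pr_ache_given_probe_and_cavity(JOINT_C_TRUE, True))
print(pr_ache_given_probe_and_cavity(JOINT_C_TRUE, False))
```

All three values come out at 0.4 (up to floating-point rounding): once C is known to be true, learning P does not change the probability of A.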

8 Conditional Independence
Can encode the joint probability distribution in compact form
The eight-row joint table over C, A and P is replaced by a network (Cavity → Probe Catches, Cavity → Ache) with one conditional probability table (CPT) per node:
- P(C) = 0.1

| C | P(P) |
|---|---|
| T | 0.8 |
| F | 0.4 |

| C | P(A) |
|---|---|
| T | 0.4 |
| F | |
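To see why the CPTs are a compact encoding, the joint entries can be rebuilt as Pr(C, A, P) = Pr(C) × Pr(A|C) × Pr(P|C). The sketch below covers only the C=True rows, because the value of P(A) for C=F is missing from this transcript; the helper names are illustrative.

```python
import math

# CPT entries shown on the slide. Pr(A=True | C=False) is missing from the
# transcript, so this sketch only rebuilds the C=True rows of the joint table.
PR_C = 0.1            # Pr(C=True)
PR_P_GIVEN_C = 0.8    # Pr(P=True | C=True)
PR_A_GIVEN_C = 0.4    # Pr(A=True | C=True)

def bernoulli(p_true, value):
    """Probability that a Boolean variable takes `value`, given Pr(True)."""
    return p_true if value else 1.0 - p_true

def joint_given_cavity(ache, probe):
    """Pr(C=True, A=ache, P=probe) = Pr(C) * Pr(A|C) * Pr(P|C)."""
    return PR_C * bernoulli(PR_A_GIVEN_C, ache) * bernoulli(PR_P_GIVEN_C, probe)

for ache in (False, True):
    for probe in (False, True):
        print(f"C=T A={ache} P={probe}: {joint_given_cavity(ache, probe):.3f}")
```

Three numbers (0.1, 0.8, 0.4) are enough to regenerate the four C=True joint entries, for example 0.1 × 0.4 × 0.8 = 0.032 for C, A and P all true.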

9 Creating a Network
1: Bayes net = representation of a joint probability distribution (JPD)
2: Bayes net = set of conditional independence statements
If we create the correct structure, one that represents causality, then we get a good network:
- one that is small, and therefore easy to compute with
- one whose numbers are easy to fill in

10 Example
- My house alarm system just sounded (A).
- Both an earthquake (E) and a burglary (B) could set it off.
- John will probably hear the alarm; if so he'll call (J). But sometimes John calls even when the alarm is silent.
- Mary might hear the alarm and call too (M), but not as reliably.
We could be assured a complete and consistent model by fully specifying the joint distribution: Pr(A, E, B, J, M), Pr(A, E, B, J, ~M), etc.

11 Structural Models (HK book 7.4.3)
Instead of starting with numbers, we will start with structural relationships among the variables:
- There is a direct causal relationship from Earthquake to Alarm
- There is a direct causal relationship from Burglar to Alarm
- There is a direct causal relationship from Alarm to JohnCall
- Earthquake and Burglar tend to occur independently
- etc.

12 Possible Bayesian Network
Nodes: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
Edges: Burglary → Alarm, Earthquake → Alarm, Alarm → JohnCalls, Alarm → MaryCalls

13 Graphical Models and Problem Parameters
What probabilities do I need to specify to ensure a complete, consistent model, given
- the variables I have identified
- the dependence and independence relationships I have specified by building a graph structure?
Answer:
- provide an unconditional (prior) probability for every node in the graph with no parents
- for all remaining nodes, provide a conditional probability table Prob(Child | Parent1, Parent2, Parent3) for all possible combinations of Parent1, Parent2, Parent3 values
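The answer above can be turned into a parameter count for the alarm example. The sketch below assumes all five variables are Boolean and that Alarm is the only parent of both JohnCalls and MaryCalls (the MaryCalls edge is read from the CPT layout on the following slide); `PARENTS` and the two functions are illustrative names.

```python
# Parents of each node in the alarm network.
# Assumption: every variable is Boolean, and Alarm -> MaryCalls is inferred
# from the CPT layout rather than stated in the structural-model list.
PARENTS = {
    "Burglary": [],
    "Earthquake": [],
    "Alarm": ["Burglary", "Earthquake"],
    "JohnCalls": ["Alarm"],
    "MaryCalls": ["Alarm"],
}

def network_parameters(parents):
    """One probability per combination of parent values for each Boolean node."""
    return sum(2 ** len(node_parents) for node_parents in parents.values())

def full_joint_parameters(n_variables):
    """A full joint table over n Boolean variables needs 2**n - 1 numbers."""
    return 2 ** n_variables - 1

print(network_parameters(PARENTS))          # priors plus CPT rows
print(full_joint_parameters(len(PARENTS)))  # fully specified joint distribution
```

Under these assumptions the network needs 1 + 1 + 4 + 2 + 2 = 10 numbers, compared with 31 for the full joint distribution over the same five variables.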

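14 Complete Bayesian Network
Nodes: Burglary, Earthquake, Alarm, JohnCalls, MaryCalls
- P(B) = .001
- P(E) = .002
- P(A) is tabulated for (B, E) = (T, T), (T, F), (F, T), (F, F)
- P(J) is tabulated for A = T and A = F
- P(M) is tabulated for A = T and A = F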

15 Microsoft Bayesian Belief Net
Can be used to construct and reason with Bayesian networks
Consider the example

20 Mining for Structural Models
- Difficult to mine
- Some methods have been proposed, but so far without good results
- Often requires a domain expert's knowledge
- Once set up, a Bayesian network can be used to answer probabilistic queries
- Microsoft Bayesian Network Software

21 Use the Bayesian Net for Prediction
From a new day's data we wish to predict the decision
- New data: X
- Class label: C
Predicting the class of X is the same as asking for the value of Pr(C|X):
- compute Pr(C=yes|X) and Pr(C=no|X)
- compare the two

22 Naïve Bayesian Models
Two assumptions: attributes are
- equally important
- statistically independent (given the class value)
This means that knowledge about the value of a particular attribute doesn't tell us anything about the value of another attribute (if the class is known)
Although based on assumptions that are almost never correct, this scheme works well in practice!

23 Why Naïve?
Assume the attributes are independent, given the class
What does that mean?
Network: play → outlook, play → temp, play → humidity, play → windy
Pr(outlook=sunny | windy=true, play=yes) = Pr(outlook=sunny | play=yes)

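24 Weather data set

| Outlook | Windy | Play |
|---|---|---|
| overcast | FALSE | yes |
| rainy | FALSE | yes |
| rainy | FALSE | yes |
| overcast | TRUE | yes |
| sunny | FALSE | yes |
| rainy | FALSE | yes |
| sunny | TRUE | yes |
| overcast | TRUE | yes |
| overcast | FALSE | yes |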

25 Is the assumption satisfied?
Counts among the nine Play=yes days (Outlook, Windy, Play table on slide 24):
- #yes = 9
- #sunny = 2
- #(windy, yes) = 3
- #(sunny | windy, yes) = 1
Resulting probabilities:
- Pr(outlook=sunny | windy=true, play=yes) = 1/3
- Pr(outlook=sunny | play=yes) = 2/9
- Pr(windy | outlook=sunny, play=yes) = 1/2
- Pr(windy | play=yes) = 3/9
Thus, the assumption is NOT satisfied. But we can tolerate some errors (see later slides)
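The comparison on this slide can be reproduced from the nine Play=yes rows. The sketch below is illustrative; `YES_DAYS` and `pr_sunny` are names introduced here, and the rows are copied from the Outlook/Windy/Play table.

```python
from fractions import Fraction

# (Outlook, Windy) for the nine Play=yes days, in table order.
YES_DAYS = [
    ("overcast", False), ("rainy", False), ("rainy", False),
    ("overcast", True), ("sunny", False), ("rainy", False),
    ("sunny", True), ("overcast", True), ("overcast", False),
]

def pr_sunny(days):
    """Relative frequency of Outlook=sunny among the given days."""
    return Fraction(sum(outlook == "sunny" for outlook, _ in days), len(days))

windy_days = [day for day in YES_DAYS if day[1]]

pr_sunny_given_yes = pr_sunny(YES_DAYS)
pr_sunny_given_windy_yes = pr_sunny(windy_days)

print(pr_sunny_given_yes, pr_sunny_given_windy_yes)
print("independent given class:", pr_sunny_given_yes == pr_sunny_given_windy_yes)
```

The two fractions differ (2/9 versus 1/3), so on this data outlook and windy are not exactly independent given play=yes.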

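26 Probabilities for the weather data

| Attribute | Value | Count Yes | Count No | Fraction Yes | Fraction No |
|---|---|---|---|---|---|
| Outlook | Sunny | 2 | 3 | 2/9 | 3/5 |
| Outlook | Overcast | 4 | 0 | 4/9 | 0/5 |
| Outlook | Rainy | 3 | 2 | 3/9 | 2/5 |
| Temperature | Hot | 2 | 2 | 2/9 | 2/5 |
| Temperature | Mild | 4 | 2 | 4/9 | 2/5 |
| Temperature | Cool | 3 | 1 | 3/9 | 1/5 |
| Humidity | High | 3 | 4 | 3/9 | 4/5 |
| Humidity | Normal | 6 | 1 | 6/9 | 1/5 |
| Windy | False | 6 | 2 | 6/9 | 2/5 |
| Windy | True | 3 | 3 | 3/9 | 3/5 |
| Play | | 9 | 5 | 9/14 | 5/14 |

A new day:

| Outlook | Temp. | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Cool | High | True | ? |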

27 Likelihood of the two classes
For “yes” = 2/9 × 3/9 × 3/9 × 3/9 × 9/14 = 0.0053
For “no” = 3/5 × 1/5 × 4/5 × 3/5 × 5/14 = 0.0206
Conversion into a probability by normalization:
P(“yes”|E) = 0.0053 / (0.0053 + 0.0206) = 0.205
P(“no”|E) = 0.0206 / (0.0053 + 0.0206) = 0.795
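The likelihood and normalization steps can be sketched in a few lines. This is a minimal illustration built on the per-class counts from the weather data, not the implementation used in any particular tool; `COUNTS`, `likelihood` and `posterior` are names chosen here.

```python
from math import prod

# Per-class counts of each attribute value, taken from the weather data.
COUNTS = {
    "yes": {
        "Outlook": {"Sunny": 2, "Overcast": 4, "Rainy": 3},
        "Temp": {"Hot": 2, "Mild": 4, "Cool": 3},
        "Humidity": {"High": 3, "Normal": 6},
        "Windy": {"False": 6, "True": 3},
    },
    "no": {
        "Outlook": {"Sunny": 3, "Overcast": 0, "Rainy": 2},
        "Temp": {"Hot": 2, "Mild": 2, "Cool": 1},
        "Humidity": {"High": 4, "Normal": 1},
        "Windy": {"False": 2, "True": 3},
    },
}
CLASS_TOTALS = {"yes": 9, "no": 5}

def likelihood(cls, day):
    """Prior of the class times the product of per-attribute relative frequencies."""
    n = CLASS_TOTALS[cls]
    prior = n / sum(CLASS_TOTALS.values())
    return prior * prod(COUNTS[cls][attr][value] / n for attr, value in day.items())

def posterior(day):
    """Normalize the class likelihoods so that they sum to 1."""
    scores = {cls: likelihood(cls, day) for cls in CLASS_TOTALS}
    total = sum(scores.values())
    return {cls: score / total for cls, score in scores.items()}

NEW_DAY = {"Outlook": "Sunny", "Temp": "Cool", "Humidity": "High", "Windy": "True"}
for cls, p in posterior(NEW_DAY).items():
    print(f"P({cls} | E) = {p:.3f}")
```

The normalization is what turns the two unnormalized scores into probabilities; the shared factor Pr(E) never has to be computed explicitly.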

28 Bayes’ rule
Probability of event H given evidence E:
Pr(H|E) = Pr(E|H) Pr(H) / Pr(E)
A priori probability of H, Pr(H): probability of the event before evidence has been seen
A posteriori probability of H, Pr(H|E): probability of the event after evidence has been seen

29 Naïve Bayes for classification
Classification learning: what’s the probability of the class given an instance?
- Evidence E = an instance
- Event H = class value for the instance (Play=yes, Play=no)
Naïve Bayes assumption: evidence can be split into independent parts E1, ..., En (i.e. the attributes of the instance are independent given the class), so that
Pr(H|E) = Pr(E1|H) × Pr(E2|H) × ... × Pr(En|H) × Pr(H) / Pr(E)

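30 The weather data example
Evidence E:

| Outlook | Temp. | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | Cool | High | True | ? |

Probability for class “yes”:
Pr(yes|E) = Pr(Outlook=Sunny|yes) × Pr(Temp.=Cool|yes) × Pr(Humidity=High|yes) × Pr(Windy=True|yes) × Pr(yes) / Pr(E)
= (2/9 × 3/9 × 3/9 × 3/9 × 9/14) / Pr(E)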

31 The “zero-frequency problem”
What if an attribute value doesn’t occur with every class value (e.g. “Humidity = high” for class “yes”)?
- Probability will be zero!
- A posteriori probability will also be zero! (No matter how likely the other values are!)
Remedy: add 1 to the count for every attribute value-class combination (Laplace estimator)
Result: probabilities will never be zero! (also: stabilizes probability estimates)

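32 Modified probability estimates
In some cases adding a constant different from 1 might be more appropriate
Example: attribute outlook for class yes, with constant μ split equally over the three values:

| Sunny | Overcast | Rainy |
|---|---|---|
| (2 + μ/3) / (9 + μ) | (4 + μ/3) / (9 + μ) | (3 + μ/3) / (9 + μ) |

Weights don’t need to be equal (if they sum to 1):

| Sunny | Overcast | Rainy |
|---|---|---|
| (2 + μ p1) / (9 + μ) | (4 + μ p2) / (9 + μ) | (3 + μ p3) / (9 + μ) |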
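One common way to write such a modified estimate is (count + μ × weight) / (total + μ), where the weights sum to 1. The sketch below assumes that form; with μ equal to the number of values and equal weights it reduces to the add-one Laplace estimator. The function name `smoothed` and its parameters are illustrative.

```python
from fractions import Fraction

def smoothed(counts, mu=None, weights=None):
    """Return (count + mu * weight) / (total + mu) for each attribute value.

    Assumed form of the estimator: mu defaults to the number of values and
    weights default to equal shares, which is the add-one Laplace estimator.
    """
    if mu is None:
        mu = len(counts)
    if weights is None:
        weights = {value: Fraction(1, len(counts)) for value in counts}
    if sum(weights.values()) != 1:
        raise ValueError("weights must sum to 1")
    total = sum(counts.values())
    return {value: (n + mu * weights[value]) / (total + mu)
            for value, n in counts.items()}

# Counts of each outlook value for class yes in the weather data.
OUTLOOK_YES = {"Sunny": 2, "Overcast": 4, "Rainy": 3}

laplace = smoothed(OUTLOOK_YES)  # adds 1 to each count: 3/12, 5/12, 4/12
print(laplace)

# A value that never occurs with the class no longer gets probability zero.
print(smoothed({"High": 0, "Normal": 9}, mu=2))
```

Passing the weights as `Fraction` values keeps the sum-to-1 check exact; with floating-point weights the check may need a tolerance.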

33 Missing values
Training: the instance is not included in the frequency count for the attribute value-class combination
Classification: the attribute will be omitted from the calculation
Example:

| Outlook | Temp. | Humidity | Windy | Play |
|---|---|---|---|---|
| ? | Cool | High | True | ? |

Likelihood of “yes” = 3/9 × 3/9 × 3/9 × 9/14 = 0.0238
Likelihood of “no” = 1/5 × 4/5 × 3/5 × 5/14 = 0.0343
P(“yes”) = 0.0238 / (0.0238 + 0.0343) = 41%
P(“no”) = 0.0343 / (0.0238 + 0.0343) = 59%

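34 Dealing with numeric attributes
Usual assumption: attributes have a normal or Gaussian probability distribution (given the class)
The probability density function for the normal distribution is defined by two parameters:
- The sample mean μ: $\mu = \frac{1}{n}\sum_{i=1}^{n} x_i$
- The standard deviation σ: $\sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i-\mu)^2}$
The density function f(x): $f(x) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$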

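35 Statistics for the weather data

| Attribute | Value | Yes | No |
|---|---|---|---|
| Outlook | Sunny | 2/9 | 3/5 |
| Outlook | Overcast | 4/9 | 0/5 |
| Outlook | Rainy | 3/9 | 2/5 |
| Temperature | mean | 73 | 74.6 |
| Temperature | std dev | 6.2 | 7.9 |
| Humidity | mean | | |
| Humidity | std dev | | |
| Windy | False | 6/9 | 2/5 |
| Windy | True | 3/9 | 3/5 |
| Play | | 9/14 | 5/14 |

Example density value: $f(\text{temperature}=66 \mid \text{yes}) = \frac{1}{\sqrt{2\pi}\cdot 6.2}\, e^{-\frac{(66-73)^2}{2\cdot 6.2^2}} = 0.0340$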
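A density value for a numeric attribute can be computed from the class mean and standard deviation. The sketch below evaluates the normal density for temperature 66 under class yes, using the mean 73 and standard deviation 6.2 from the table; the function name `gaussian_density` is illustrative.

```python
import math

def gaussian_density(x, mean, std_dev):
    """Normal probability density at x for the given mean and standard deviation."""
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (math.sqrt(2.0 * math.pi) * std_dev)

# Temperature statistics for Play=yes: mean 73, standard deviation 6.2.
density = gaussian_density(66, mean=73, std_dev=6.2)
print(f"f(temperature=66 | yes) = {density:.4f}")
```

In a naïve Bayes calculation this density takes the place of the relative frequency that a nominal attribute would contribute to the product.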

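36 Classifying a new day
A new day:

| Outlook | Temp. | Humidity | Windy | Play |
|---|---|---|---|---|
| Sunny | 66 | 90 | true | ? |

Missing values during training: not included in the calculation of mean and standard deviation
Likelihood of “yes” = 2/9 × f(Temp.=66 | yes) × f(Humidity=90 | yes) × 3/9 × 9/14
Likelihood of “no” = 3/5 × f(Temp.=66 | no) × f(Humidity=90 | no) × 3/5 × 5/14
P(“yes”) = Likelihood of “yes” / (Likelihood of “yes” + Likelihood of “no”) = 20.9%
P(“no”) = Likelihood of “no” / (Likelihood of “yes” + Likelihood of “no”) = 79.1%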

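37 Probability densities
Relationship between probability and density: $\Pr\left[c - \frac{\varepsilon}{2} < x < c + \frac{\varepsilon}{2}\right] \approx \varepsilon \cdot f(c)$
But: this doesn’t change the calculation of a posteriori probabilities, because ε cancels out
Exact relationship: $\Pr[a \le x \le b] = \int_a^b f(t)\,dt$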

38 Example of Naïve Bayes in Weka
Use the Weka Naïve Bayes module to classify Weather.nominal.arff

43 Discussion of Naïve Bayes
Naïve Bayes works surprisingly well (even if the independence assumption is clearly violated)
Why? Because classification doesn’t require accurate probability estimates as long as the maximum probability is assigned to the correct class
However: adding too many redundant attributes will cause problems (e.g. identical attributes)
Note also: many numeric attributes are not normally distributed