1 Chapter 6 Bayesian Learning lecture slides of Raymond J. Mooney, University of Texas at Austin.

Slides:



Advertisements
Similar presentations
Bayesian networks Chapter 14 Section 1 – 2. Outline Syntax Semantics Exact computation.
Advertisements

Naïve Bayes. Bayesian Reasoning Bayesian reasoning provides a probabilistic approach to inference. It is based on the assumption that the quantities of.
Bayesian Networks CSE 473. © Daniel S. Weld 2 Last Time Basic notions Atomic events Probabilities Joint distribution Inference by enumeration Independence.
Uncertainty Everyday reasoning and decision making is based on uncertain evidence and inferences. Classical logic only allows conclusions to be strictly.
Identifying Conditional Independencies in Bayes Nets Lecture 4.
For Monday Read chapter 18, sections 1-2 Homework: –Chapter 14, exercise 8 a-d.
Text Categorization CSC 575 Intelligent Information Retrieval.
For Monday Finish chapter 14 Homework: –Chapter 13, exercises 8, 15.
Bayesian network inference
Inference in Bayesian Nets
Bayesian Networks. Motivation The conditional independence assumption made by naïve Bayes classifiers may seem to rigid, especially for classification.
Bayesian networks Chapter 14 Section 1 – 2.
Bayesian Belief Networks
Bayesian Networks What is the likelihood of X given evidence E? i.e. P(X|E) = ?
Text Classification – Naive Bayes
1 Bayesian Networks Chapter ; 14.4 CS 63 Adapted from slides by Tim Finin and Marie desJardins. Some material borrowed from Lise Getoor.
Bayesian Reasoning. Tax Data – Naive Bayes Classify: (_, No, Married, 95K, ?)
CS Bayesian Learning1 Bayesian Learning. CS Bayesian Learning2 States, causes, hypotheses. Observations, effect, data. We need to reconcile.
For Monday after Spring Break Read Homework: –Chapter 13, exercise 6 and 8 May be done in pairs.
CSCI 121 Special Topics: Bayesian Network Lecture #1: Reasoning Under Uncertainty.
11 CS 388: Natural Language Processing: Discriminative Training and Conditional Random Fields (CRFs) for Sequence Labeling Raymond J. Mooney University.
1 Naïve Bayes A probabilistic ML algorithm. 2 Axioms of Probability Theory All probabilities between 0 and 1 True proposition has probability 1, false.
Bayesian networks Chapter 14. Outline Syntax Semantics.
1 CS 343: Artificial Intelligence Bayesian Networks Raymond J. Mooney University of Texas at Austin.
1 CS 343: Artificial Intelligence Probabilistic Reasoning and Naïve Bayes Raymond J. Mooney University of Texas at Austin.
1 CS 391L: Machine Learning: Bayesian Learning: Beyond Naïve Bayes Raymond J. Mooney University of Texas at Austin.
Bayesian networks. Motivation We saw that the full joint probability can be used to answer any question about the domain, but can become intractable as.
For Wednesday Read Chapter 11, sections 1-2 Program 2 due.
Aprendizagem Computacional Gladys Castillo, UA Bayesian Networks Classifiers Gladys Castillo University of Aveiro.
1 CS 391L: Machine Learning: Bayesian Learning: Naïve Bayes Raymond J. Mooney University of Texas at Austin.
Bayesian Statistics and Belief Networks. Overview Book: Ch 13,14 Refresher on Probability Bayesian classifiers Belief Networks / Bayesian Networks.
1 Text Categorization. 2 Categorization Given: –A description of an instance, x  X, where X is the instance language or instance space. –A fixed set.
Classification Techniques: Bayesian Classification
Mehdi Ghayoumi MSB rm 132 Ofc hr: Thur, a Machine Learning.
1 Naïve Bayes Classification CS 6243 Machine Learning Modified from the slides by Dr. Raymond J. Mooney
Marginalization & Conditioning Marginalization (summing out): for any sets of variables Y and Z: Conditioning(variant of marginalization):
Exact Inference in Bayes Nets. Notation U: set of nodes in a graph X i : random variable associated with node i π i : parents of node i Joint probability:
Inference Algorithms for Bayes Networks
1 Text Categorization Slides based on R. Mooney (UT Austin)
1 CMSC 671 Fall 2001 Class #20 – Thursday, November 8.
CS 2750: Machine Learning Probability Review Prof. Adriana Kovashka University of Pittsburgh February 29, 2016.
Conditional Independence As with absolute independence, the equivalent forms of X and Y being conditionally independent given Z can also be used: P(X|Y,
Belief Networks CS121 – Winter Other Names Bayesian networks Probabilistic networks Causal networks.
1 Probabilistic ML algorithm Naïve Bayes and Maximum Likelyhood.
Chapter 12. Probability Reasoning Fall 2013 Comp3710 Artificial Intelligence Computing Science Thompson Rivers University.
CS 2750: Machine Learning Bayesian Networks Prof. Adriana Kovashka University of Pittsburgh March 14, 2016.
Web-Mining Agents Data Mining Prof. Dr. Ralf Möller Universität zu Lübeck Institut für Informationssysteme Karsten Martiny (Übungen)
Bayesian Networks Chapter 2 (Duda et al.) – Section 2.11 CS479/679 Pattern Recognition Dr. George Bebis.
Review of Probability.
CS 2750: Machine Learning Review
CS 2750: Machine Learning Directed Graphical Models
Bayesian networks Chapter 14 Section 1 – 2.
Web-Mining Agents Part: Data Mining
Computer Science Department
Read R&N Ch Next lecture: Read R&N
Learning Bayesian Network Models from Data
Classification Techniques: Bayesian Classification
Text Categorization.
Read R&N Ch Next lecture: Read R&N
Bayesian Statistics and Belief Networks
Class #19 – Tuesday, November 3
Probabilistic ML algorithm Naïve Bayes and Maximum Likelyhood
Class #16 – Tuesday, October 26
Hankz Hankui Zhuo Bayesian Networks Hankz Hankui Zhuo
Belief Networks CS121 – Winter 2003 Belief Networks.
Bayesian networks Chapter 14 Section 1 – 2.
Probabilistic Reasoning
Read R&N Ch Next lecture: Read R&N
Bayesian Networks CSE 573.
Presentation transcript:

1 Chapter 6 Bayesian Learning lecture slides of Raymond J. Mooney, University of Texas at Austin

2 Outline

3 Two Roles of Bayesian Methods:

4

5 Axioms of Probability Theory All probabilities between 0 and 1 True proposition has probability 1, false has probability 0. P(true) = 1 P(false) = 0. The probability of disjunction is: A B

6 Conditional Probability P(A | B) is the probability of A given B Assumes that B is all and only information known. Defined by: A B

7 Independence A and B are independent iff: Therefore, if A and B are independent: These two constraints are logically equivalent

8 Joint Distribution The joint probability distribution for a set of random variables, X 1,…,X n gives the probability of every combination of values (an n- dimensional array with v n values if all variables are discrete with v values, all v n values must sum to 1): P(X 1,…,X n ) The probability of all possible conjunctions (assignments of values to some subset of variables) can be calculated by summing the appropriate subset of values from the joint distribution. Therefore, all conditional probabilities can also be calculated. circlesquare red blue circlesquare red blue0.20 positive negative

9 Bayes Theorem Simple proof from definition of conditional probability: QED: (Def. cond. prob.)

10

11

12

13

14

15 P(cancer) =.008P(¬cancer) =.992 P(+ | cancer)=.98P(- | cancer)=.02 P(+ | ¬cancer)=.03P(- | ¬cancer)=.97 Two alternative hypotheses: cancer, ¬cancer P(+ | cancer) p(cancer) = (.98)(.008) =.0078 P(+ | ¬cancer) p(¬cancer) = (.03)(.992) =.0298 h MAP = ¬cancer P(cancer | +) = 0078 / ( ) =.21 P(¬cancer | +) = 1 - P(cancer | +) =.79

16 Provides a standard against which we may compare the performance of other learning algorithms

17

18

19 Every consistent hypothesis has a posterior probability 1 / |VS H,D | Every inconsistent hypothesis has a posterior probability 0 Thıus, every consistent hypothesis is a MAP learner.

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35 Naïve Bayes Generative Model Size Color Shape Positive Negative pos neg pos neg sm med lg med sm med lg red blue grn circ sqr tri circ sqr tri sm lg med sm lg med lg sm blue red grn blue grn red grn blue circ sqr tri circ sqr circ tri Category

36 Naïve Bayes Inference Problem Size Color Shape Positive Negative pos neg pos neg sm med lg med sm med lg red blue grn circ sqr tri circ sqr tri sm lg med sm lg med lg sm blue red grn blue grn red grn blue circ sqr tri circ sqr circ tri Category lg red circ ??

37 Naïve Bayesian Categorization If we assume features of an instance are independent given the category (conditionally independent). Therefore, we then only need to know P(X i | Y) for each possible pair of a feature-value and a category. If Y and all X i and binary, this requires specifying only 2n parameters: –P(X i =true | Y=true) and P(X i =true | Y=false) for each X i –P(X i =false | Y) = 1 – P(X i =true | Y) Compared to specifying 2 n parameters without any independence assumptions.

38 Naïve Bayes Example Probabilitypositivenegative P(Y)0.5 P(small | Y)0.4 P(medium | Y) P(large | Y) P(red | Y) P(blue | Y) P(green | Y) P(square | Y) P(triangle | Y) P(circle | Y) Test Instance:

39 Naïve Bayes Example Probabilitypositivenegative P(Y)0.5 P(medium | Y) P(red | Y) P(circle | Y) P(positive | X) = P(positive)*P(medium | positive)*P(red | positive)*P(circle | positive) / P(X) 0.5 * 0.1 * 0.9 * 0.9 = / P(X) P(negative | X) = P(negative)*P(medium | negative)*P(red | negative)*P(circle | negative) / P(X) 0.5 * 0.2 * 0.3 * 0.3 = / P(X) P(positive | X) + P(negative | X) = / P(X) / P(X) = 1 P(X) = ( ) = = / = = / = Test Instance:

40 Estimating Probabilities Normally, probabilities are estimated based on observed frequencies in the training data. If D contains n k examples in category y k, and n ijk of these n k examples have the jth value for feature X i, x ij, then: However, estimating such probabilities from small training sets is error-prone. If due only to chance, a rare feature, X i, is always false in the training data,  y k :P(X i =true | Y=y k ) = 0. If X i =true then occurs in a test example, X, the result is that  y k : P(X | Y=y k ) = 0 and  y k : P(Y=y k | X) = 0

41 Probability Estimation Example Probabilitypositivenegative P(Y)0.5 P(small | Y)0.5 P(medium | Y)0.0 P(large | Y)0.5 P(red | Y) P(blue | Y) P(green | Y)0.0 P(square | Y)0.0 P(triangle | Y) P(circle | Y) ExSizeColorShapeCategory 1smallredcirclepositive 2largeredcirclepositive 3smallredtrianglenegitive 4largebluecirclenegitive Test Instance X: P(positive | X) = 0.5 * 0.0 * 1.0 * 1.0 = 0 P(negative | X) = 0.5 * 0.0 * 0.5 * 0.5 = 0

42 Smoothing To account for estimation from small samples, probability estimates are adjusted or smoothed. Laplace smoothing using an m-estimate assumes that each feature is given a prior probability, p, that is assumed to have been previously observed in a “virtual” sample of size m. For binary features, p is simply assumed to be 0.5.

43 Laplace Smothing Example Assume training set contains 10 positive examples: –4: small –0: medium –6: large Estimate parameters as follows (if m=1, p=1/3) –P(small | positive) = (4 + 1/3) / (10 + 1) = –P(medium | positive) = (0 + 1/3) / (10 + 1) = 0.03 –P(large | positive) = (6 + 1/3) / (10 + 1) = –P(small or medium or large | positive) = 1.0

44 Continuous Attributes If X i is a continuous feature rather than a discrete one, need another way to calculate P(X i | Y). Assume that X i has a Gaussian distribution whose mean and variance depends on Y. During training, for each combination of a continuous feature X i and a class value for Y, y k, estimate a mean, μ ik, and standard deviation σ ik based on the values of feature X i in class y k in the training data. During testing, estimate P(X i | Y=y k ) for a given example, using the Gaussian distribution defined by μ ik and σ ik.

45 Comments on Naïve Bayes Tends to work well despite strong assumption of conditional independence. Experiments show it to be quite competitive with other classification methods on standard UCI datasets. Although it does not produce accurate probability estimates when its independence assumptions are violated, it may still pick the correct maximum-probability class in many cases. –Able to learn conjunctive concepts in any case Does not perform any search of the hypothesis space. Directly constructs a hypothesis from parameter estimates that are easily calculated from the training data. –Strong bias Not guarantee consistency with training data. Typically handles noise well since it does not even focus on completely fitting the training data.

46

47

48

49

50

51

52

53

54 Bayesian Networks Directed Acyclic Graph (DAG) –Nodes are random variables –Edges indicate causal influences Burglary Earthquake Alarm JohnCalls MaryCalls

55 Conditional Probability Tables Each node has a conditional probability table (CPT) that gives the probability of each of its values given every possible combination of values for its parents (conditioning case). –Roots (sources) of the DAG that have no parents are given prior probabilities. Burglary Earthquake Alarm JohnCalls MaryCalls P(B).001 P(E).002 BEP(A) TT.95 TF.94 FT.29 FF.001 AP(M) T.70 F.01 AP(J) T.90 F.05

56 CPT Comments Probability of false not given since rows must add to 1. Example requires 10 parameters rather than 2 5 –1=31 for specifying the full joint distribution. Number of parameters in the CPT for a node is exponential in the number of parents (fan-in).

57 Joint Distributions for Bayes Nets A Bayesian Network implicitly defines a joint distribution. Example Therefore an inefficient approach to inference is: –1) Compute the joint distribution using this equation. –2) Compute any desired conditional probability using the joint distribution.

58 Naïve Bayes as a Bayes Net Naïve Bayes is a simple Bayes Net Y X1X1 X2X2 … XnXn Priors P(Y) and conditionals P(X i |Y) for Naïve Bayes provide CPTs for the network.

59 Bayes Net Inference Given known values for some evidence variables, determine the posterior probability of some query variables. Example: Given that John calls, what is the probability that there is a Burglary? Burglary Earthquake Alarm JohnCalls MaryCalls ??? John calls 90% of the time there is an Alarm and the Alarm detects 94% of Burglaries so people generally think it should be fairly high. However, this ignores the prior probability of John calling.

60 Bayes Net Inference Example: Given that John calls, what is the probability that there is a Burglary? Burglary Earthquake Alarm JohnCalls MaryCalls ??? John also calls 5% of the time when there is no Alarm. So over 1,000 days we expect 1 Burglary and John will probably call. However, he will also call with a false report 50 times on average. So the call is about 50 times more likely a false report: P(Burglary | JohnCalls) ≈ 0.02 P(B).001 AP(J) T.90 F.05

61 Bayes Net Inference Example: Given that John calls, what is the probability that there is a Burglary? Burglary Earthquake Alarm JohnCalls MaryCalls ??? Actual probability of Burglary is since the alarm is not perfect (an Earthquake could have set it off or it could have gone off on its own). On the other side, even if there was not an alarm and John called incorrectly, there could have been an undetected Burglary anyway, but this is unlikely. P(B).001 AP(J) T.90 F.05

62 Complexity of Bayes Net Inference In general, the problem of Bayes Net inference is NP-hard (exponential in the size of the graph). For singly-connected networks or polytrees in which there are no undirected loops, there are linear- time algorithms based on belief propagation. –Each node sends local evidence messages to their children and parents. –Each node updates belief in each of its possible values based on incoming messages from it neighbors and propagates evidence on to its neighbors. There are approximations to inference for general networks based on loopy belief propagation that iteratively refines probabilities that converge to accurate values in the limit.

63

64

65

66

67

68

69

70

71

72

73