Still More Uncertainty
Administrivia: read Chapters 14 and 15.
Axioms of Probability Theory
Just three axioms are enough to build the entire theory:
1. Range of probabilities: 0 ≤ P(A) ≤ 1
2. P(true) = 1 and P(false) = 0
3. Probability of a disjunction of events: P(A ∨ B) = P(A) + P(B) − P(A ∧ B)
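As a quick sanity check, the third axiom can be verified numerically. This is a minimal sketch with made-up probabilities (P(A) = 0.5, P(B) = 0.4, P(A ∧ B) = 0.2 are arbitrary):

```python
# Hypothetical numbers: P(A), P(B), and P(A and B).
p_a, p_b, p_ab = 0.5, 0.4, 0.2

# Axiom 3 (inclusion-exclusion): P(A or B) = P(A) + P(B) - P(A and B)
p_a_or_b = p_a + p_b - p_ab

# Axiom 1 still holds for the result, and the value is ~0.7.
assert 0.0 <= p_a_or_b <= 1.0
assert abs(p_a_or_b - 0.7) < 1e-9
```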
Bayes rules! A simple proof follows from the definition of conditional probability: P(A|B) = P(B|A) P(A) / P(B)
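The rule can be checked numerically against the definition of conditional probability. The probabilities below are arbitrary, made-up values:

```python
# Hypothetical values for P(A), P(B), and the joint P(A and B).
p_a, p_b, p_ab = 0.3, 0.5, 0.15

p_a_given_b = p_ab / p_b   # definition: P(A|B) = P(A,B) / P(B)
p_b_given_a = p_ab / p_a   # definition: P(B|A) = P(A,B) / P(A)

# Bayes rule: P(A|B) = P(B|A) P(A) / P(B)
assert abs(p_a_given_b - p_b_given_a * p_a / p_b) < 1e-12
```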
Independence: P(A ∧ B) = P(A) P(B)
© Daniel S. Weld
Conditional Independence
A and B are not independent, since P(A|B) < P(A)
Conditional Independence
But A and B are made independent by conditioning on C: P(A|C) = P(A|B,C)
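A common-cause model makes this concrete. The sketch below uses hypothetical numbers (C flips a fair coin; A and B each hold with probability 0.9 when C is true and 0.1 otherwise) and enumerates the joint to check that A and B are dependent marginally but independent given C:

```python
from itertools import product

# Hypothetical common-cause model: C influences both A and B.
p_c = 0.5
p_a_given = {True: 0.9, False: 0.1}   # P(A=true | C)
p_b_given = {True: 0.9, False: 0.1}   # P(B=true | C)

def joint(a, b, c):
    pc = p_c if c else 1 - p_c
    pa = p_a_given[c] if a else 1 - p_a_given[c]
    pb = p_b_given[c] if b else 1 - p_b_given[c]
    return pc * pa * pb

def prob(pred):
    # Enumerate all 8 outcomes and sum those matching the predicate.
    return sum(joint(a, b, c)
               for a, b, c in product([True, False], repeat=3)
               if pred(a, b, c))

p_a          = prob(lambda a, b, c: a)
p_a_given_b  = prob(lambda a, b, c: a and b) / prob(lambda a, b, c: b)
p_a_given_c  = prob(lambda a, b, c: a and c) / prob(lambda a, b, c: c)
p_a_given_bc = prob(lambda a, b, c: a and b and c) / prob(lambda a, b, c: b and c)

assert abs(p_a - p_a_given_b) > 0.1                 # A and B are dependent...
assert abs(p_a_given_c - p_a_given_bc) < 1e-9       # ...but independent given C
```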
Joint Distribution
All you need to know: it can answer any question, via inference by enumeration. But it is exponential in both time and space. Solution: exploit conditional independence.
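Inference by enumeration can be sketched directly on a small full joint table. The three binary variables and their probabilities below are a hypothetical toy example; any query becomes a ratio of sums over table entries, but the table itself grows exponentially with the number of variables:

```python
# A hypothetical full joint over three binary variables (entries sum to 1).
joint = {
    # (toothache, cavity, catch): probability
    (1, 1, 1): 0.108, (1, 1, 0): 0.012,
    (1, 0, 1): 0.016, (1, 0, 0): 0.064,
    (0, 1, 1): 0.072, (0, 1, 0): 0.008,
    (0, 0, 1): 0.144, (0, 0, 0): 0.576,
}

def p(pred):
    # Sum the joint entries whose outcome satisfies the predicate.
    return sum(pr for outcome, pr in joint.items() if pred(*outcome))

# P(cavity | toothache) = P(cavity, toothache) / P(toothache)  -> ~0.6
p_cavity_given_toothache = p(lambda t, c, k: t and c) / p(lambda t, c, k: t)
```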
Sample Bayes Net
[Figure: Earthquake and Burglary are parents of Alarm; Radio depends on Earthquake; Nbr1Calls and Nbr2Calls depend on Alarm. CPTs shown for Pr(B) and Pr(A|E,B).]
Pr(A|E,B):  ¬e,¬b: 0.1;  ¬e,b: 0.8;  e,¬b: 0.15;  e,b: 0.99
Given Parents, X is Independent of Non-Descendants
Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))
[Figure: example network with Sprinkler On and Rained as parents of Grass Wet, illustrating a node's Markov blanket]
Galaxy Quest
Two astronomers, in Chile and Alaska, measure (M1, M2) the number N of stars. Normally there is a chance of error, e, of undercounting or overcounting by one star. But sometimes, with probability f, a telescope can be out of focus (events F1 and F2), in which case the affected scientist will undercount by 3 stars.
Structures
[Figure: three candidate network structures relating F1, F2, N, M1, and M2]
Inference in BNs
We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence. The graphical independence representation yields efficient inference, with computations organized by network topology. Two simple algorithms: variable elimination (VE) and junction trees.
Variable Elimination Procedure
The initial potentials are the CPTs in the BN. Repeat until only the query variable remains:
- Choose another variable to eliminate
- Multiply all potentials that contain the variable
- If there is no evidence for the variable, sum the variable out and replace the original potentials with the new result
- Otherwise, remove the variable based on the evidence
Finally, normalize the remaining potential to get the final distribution over the query variable.
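The steps above can be sketched on a tiny chain network A → B → C with the query P(C). All CPT values are made up; factors are represented as dicts from assignment tuples to numbers:

```python
# Minimal variable-elimination sketch on a chain A -> B -> C (hypothetical CPTs).
p_a   = {(0,): 0.6, (1,): 0.4}                    # P(A)
p_b_a = {(0, 0): 0.7, (0, 1): 0.3,                # P(B|A), keys are (a, b)
         (1, 0): 0.2, (1, 1): 0.8}
p_c_b = {(0, 0): 0.9, (0, 1): 0.1,                # P(C|B), keys are (b, c)
         (1, 0): 0.5, (1, 1): 0.5}

# Eliminate A: multiply all potentials mentioning A, then sum A out.
tau_b = {(b,): sum(p_a[(a,)] * p_b_a[(a, b)] for a in (0, 1)) for b in (0, 1)}

# Eliminate B: multiply remaining potentials mentioning B, then sum B out.
tau_c = {(c,): sum(tau_b[(b,)] * p_c_b[(b, c)] for b in (0, 1)) for c in (0, 1)}

# Normalize the final potential over the query variable C.
z = sum(tau_c.values())
p_c = {c: v / z for c, v in tau_c.items()}
```

With these numbers the eliminations give an intermediate factor tau_b = (0.5, 0.5) and a final distribution P(C=0) = 0.7, P(C=1) = 0.3.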
Junction Tree Inference Algorithm
- Incorporate evidence: for each evidence variable, go to one table that includes the variable; set to 0 all entries that disagree with the evidence; renormalize this potential
- Upward step: pass messages to parents
- Downward step: pass messages to children
Learning
- Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
- Learning parameters for a Bayesian network
- Learning the structure of Bayesian networks
Topics
- Bayesian networks overview
- Inference: variable elimination, junction trees
- Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
- Learning parameters for a Bayesian network
- Learning the structure of Bayesian networks
Which coin will I use?
Coin flip: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9
Uniform prior: P(C1) = P(C2) = P(C3) = 1/3. All hypotheses are equally likely before we make any observations.
Your Estimate?
What is the probability of heads after two experiments (heads, then tails)?
Most likely coin: C2, so the best estimate for P(H) is P(H|C2) = 0.5.
(P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3)
Maximum Likelihood Estimate
The hypothesis that best fits the observed data, assuming a uniform prior. After two experiments (heads, tails), the most likely coin is C2, so the best estimate for P(H) is P(H|C2) = 0.5 (with P(C2) = 1/3).
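A minimal sketch of the ML computation for the three-coin example, assuming the data is one heads and one tails as on the slide (with a uniform prior, the prior drops out and we just maximize the likelihood):

```python
# Coin biases from the slides.
p_h = {"C1": 0.1, "C2": 0.5, "C3": 0.9}

def likelihood(coin, heads, tails):
    # Probability of the observed flip sequence under this coin.
    return p_h[coin] ** heads * (1 - p_h[coin]) ** tails

lik = {c: likelihood(c, heads=1, tails=1) for c in p_h}
ml_coin = max(lik, key=lik.get)   # C2: 0.25 beats 0.09 for C1 and C3
```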
Using Prior Knowledge
Should we always use a uniform prior? Background knowledge: heads means we have a take-home midterm, and Dan likes take-homes, so Dan is more likely to use a coin biased in his favor.
P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70
P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9
Maximum A Posteriori (MAP) Estimate
The hypothesis that best fits the observed data, assuming a non-uniform prior. Most likely coin: C3 (with P(C3) = 0.70), so the best estimate for P(H) is P(H|C3) = 0.9.
Bayesian Estimate
Minimizes prediction error, given the data and (generally) assuming a non-uniform prior. Posterior over the coins after observing heads then tails: P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. The prediction averages over all hypotheses: Σᵢ P(H|Cᵢ) P(Cᵢ|HT) = 0.680.
(P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9)
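These numbers can be reproduced with a short script; the priors and coin biases are the ones given on the slides, and the data is heads followed by tails:

```python
# Coin biases and the non-uniform prior from the slides.
p_h   = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
prior = {"C1": 0.05, "C2": 0.25, "C3": 0.70}

# Likelihood of the data HT under each coin is p * (1 - p).
unnorm = {c: prior[c] * p_h[c] * (1 - p_h[c]) for c in p_h}
z = sum(unnorm.values())
posterior = {c: v / z for c, v in unnorm.items()}   # ~ {C1: 0.035, C2: 0.481, C3: 0.485}

# MAP keeps only the single most probable coin; Bayesian averages over all of them.
map_coin = max(posterior, key=posterior.get)                 # C3 -> MAP estimate 0.9
bayes_estimate = sum(posterior[c] * p_h[c] for c in p_h)     # ~ 0.680
```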
Summary For Now

Estimate               Prior     Hypothesis used
Maximum Likelihood     Uniform   The most likely
Maximum A Posteriori   Any       The most likely
Bayesian               Any       Weighted combination of all
Topics
- Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian
- Learning parameters for a Bayesian network
- Learning the structure of Bayesian networks
Parameter Estimation and Bayesian Networks
We have: the Bayes net structure and a table of observations. We need: the Bayes net parameters.
Parameter Estimation and Bayesian Networks
Given a prior and the observed data, compute either the MAP or the Bayesian estimate for each parameter, e.g. P(B): prior + data = posterior.
What Prior to Use?
The following are two common priors:
- Binary variable: Beta prior. Conjugate to the binomial likelihood, so the posterior is again a Beta and is easy to compute.
- Discrete variable: Dirichlet prior. Conjugate to the multinomial likelihood, so the posterior is again a Dirichlet.
One Prior: Beta Distribution
Beta(a, b): f(p) ∝ p^(a−1) (1−p)^(b−1), with normalizing constant Γ(a+b) / (Γ(a) Γ(b)). For any positive integer y, Γ(y) = (y−1)!
Beta Distribution
Example: flip a coin with a Beta distribution as the prior over p [prob(heads)], parameterized by two positive numbers a, b. The mean of the distribution, E[p], is a/(a+b), so:
- specify our prior belief for p as a/(a+b)
- specify confidence in this belief with high initial values for a and b
Update the prior based on data by incrementing a for every heads outcome and incrementing b for every tails outcome. So after h heads out of n flips, our posterior distribution says P(heads) = (a+h)/(a+b+n).
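A minimal sketch of this update rule, with a made-up Beta(2,2) prior and made-up flip data:

```python
# Hypothetical prior Beta(2,2): mean 0.5, equivalent sample size 4.
a, b = 2.0, 2.0

# Made-up data: h = 3 heads out of n = 4 flips.
flips = ["H", "H", "T", "H"]
for f in flips:
    if f == "H":
        a += 1   # each heads increments a
    else:
        b += 1   # each tails increments b

# Posterior mean: (a + h) / (a + b + n) = (2 + 3) / (2 + 2 + 4) = 5/8
p_heads = a / (a + b)
```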
Parameter Estimation and Bayesian Networks
P(B|data) = ? With prior Beta(1,4): prior P(B) = 1/(1+4) = 20%, with equivalent sample size 5. Prior + data = Beta(3,7), giving P(B) = 3/10 = 0.3 and P(¬B) = 0.7.
Parameter Estimation and Bayesian Networks
Each CPT entry is estimated the same way: P(A|E,B) = ?  P(A|E,¬B) = ?  P(A|¬E,B) = ?  P(A|¬E,¬B) = ?
For each entry, prior Beta(2,3) + data = Beta(3,4).
What if we don’t know structure?
Learning the Structure of Bayesian Networks
Search through the space of possible network structures! (For now, assume we observe all variables.) For each structure, learn the parameters, then pick the structure that fits the observed data best. Caveat: won't we end up fully connected, since adding edges never hurts the fit? Yes, so when scoring, add a penalty for model complexity.
Learning the Structure of Bayesian Networks
Search through the space; for each structure, learn the parameters; pick the one that fits the observed data best; penalize complex models. Problem? There are exponentially many networks, and we need to learn parameters for each, so exhaustive search is out of the question. So what now?
Structure Learning as Search
Local search: start with some network structure; try to make a change (add, delete, or reverse an edge); see if the new network is any better. What should the initial state be? A uniform prior over random networks? Based on prior knowledge? An empty network? And how do we evaluate networks?
[Figure: a sequence of local-search moves among candidate network structures over the nodes A, B, C, D, E]
Score Functions
- Bayesian Information Criterion (BIC): score = log P(D | BN) − penalty, where penalty = ½ (# parameters) × log(# data points)
- MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well
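A sketch of the BIC trade-off. The log-likelihoods and parameter counts for the two candidate structures below are made-up numbers, chosen so the denser network fits the data slightly better but loses once the complexity penalty is applied:

```python
import math

def bic(log_likelihood, n_params, n_data):
    # BIC score: log P(D|BN) - (1/2) * (# parameters) * log(# data points)
    return log_likelihood - 0.5 * n_params * math.log(n_data)

n_data = 1000
sparse = bic(log_likelihood=-2150.0, n_params=10, n_data=n_data)
dense  = bic(log_likelihood=-2140.0, n_params=40, n_data=n_data)

# The dense net's 10-nat likelihood gain is swamped by its ~104-nat extra penalty.
best = "sparse" if sparse > dense else "dense"
```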
Naïve Bayes
A class variable with features F1, F2, ..., FN. Assume the features are conditionally independent given the class variable. Works well in practice, but forces probabilities towards 0 and 1.
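A minimal naive Bayes sketch with made-up numbers (two hypothetical classes, two binary features); it also shows how the product of per-feature likelihood ratios pushes the posterior toward 0 or 1:

```python
# Hypothetical model: class prior and per-feature conditional probabilities.
prior = {"spam": 0.4, "ham": 0.6}
p_feat = {"spam": {"f1": 0.8, "f2": 0.7},   # P(feature present | class)
          "ham":  {"f1": 0.1, "f2": 0.3}}

def classify(present):
    # Naive Bayes: P(class | features) proportional to P(class) * prod_i P(f_i | class)
    unnorm = {}
    for c in prior:
        score = prior[c]
        for f, on in present.items():
            score *= p_feat[c][f] if on else 1 - p_feat[c][f]
        unnorm[c] = score
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

# With both features present, the posterior is already extreme (~0.93 for spam).
post = classify({"f1": True, "f2": True})
```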
Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997]
A class variable with features F1, F2, ..., FN. Models a limited set of dependencies among the features (a tree, so each feature gets at most one other feature as an extra parent). Guaranteed to find the best such structure, and runs in polynomial time.