Still More Uncertainty
Administrivia: Read Ch. 14, 15
Axioms of Probability Theory
Just 3 are enough to build the entire theory!
1. Range of probabilities: 0 ≤ P(A) ≤ 1
2. P(true) = 1 and P(false) = 0
3. Probability of a disjunction of events: P(A ∨ B) = P(A) + P(B) - P(A ∧ B)
Bayes rules!
Simple proof from the definition of conditional probability.
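The one-line proof, written out. By the definition of conditional probability,

```latex
P(A \mid B) = \frac{P(A \wedge B)}{P(B)}, \qquad P(B \mid A) = \frac{P(A \wedge B)}{P(A)},
```

so P(A ∧ B) = P(B|A) P(A), and substituting into the first equation gives Bayes' rule:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}.
```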
Independence
A and B are independent iff P(AB) = P(A)P(B).
Conditional Independence
A & B are not independent, since P(A|B) < P(A).
Conditional Independence
But A & B are made conditionally independent by C: P(A|C) = P(A|B,C).
Joint Distribution
All you need to know: it can answer any question, by inference by enumeration.
But... exponential in both time & space.
Solution: exploit conditional independence.
Sample Bayes Net
Structure: Earthquake → Radio; Earthquake → Alarm ← Burglary; Alarm → Nbr1Calls; Alarm → Nbr2Calls.
Pr(B=t) = 0.05, Pr(B=f) = 0.95.
Pr(A=t|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99). (Values in parentheses are Pr(A=f|E,B).)
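A minimal sketch of inference by enumeration on this network, in Python. Pr(B) and the Pr(A|E,B) rows follow the table above (assuming the four rows are e,b / e,¬b / ¬e,b / ¬e,¬b in that order); the remaining CPTs (Pr(E), Pr(N1|A), Pr(N2|A)) are made-up placeholder values, since the slide does not give them.

```python
from itertools import product

# CPTs for the burglary network. Pr(B) and Pr(A|E,B) follow the slide;
# the others are illustrative placeholders.
P_B = {True: 0.05, False: 0.95}
P_E = {True: 0.03, False: 0.97}                      # placeholder
P_A = {(True, True): 0.9, (True, False): 0.2,        # Pr(A=t | E, B)
       (False, True): 0.85, (False, False): 0.01}
P_N1 = {True: 0.9, False: 0.05}                      # Pr(N1=t | A), placeholder
P_N2 = {True: 0.7, False: 0.01}                      # Pr(N2=t | A), placeholder

def joint(b, e, a, n1, n2):
    """Full joint via the chain rule along the network structure.
    Radio is omitted: it is neither observed nor queried, so it sums out to 1."""
    def cond(p_true, value):
        return p_true if value else 1.0 - p_true
    return (P_B[b] * P_E[e]
            * cond(P_A[(e, b)], a)
            * cond(P_N1[a], n1)
            * cond(P_N2[a], n2))

def query_burglary(n1, n2):
    """P(B | N1=n1, N2=n2) by summing the joint over the hidden variables E and A."""
    scores = {}
    for b in (True, False):
        scores[b] = sum(joint(b, e, a, n1, n2)
                        for e, a in product((True, False), repeat=2))
    z = sum(scores.values())
    return {b: p / z for b, p in scores.items()}

print(query_burglary(n1=True, n2=True))
```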
Given Parents, X is Independent of Non-Descendants
Given Markov Blanket, X is Independent of All Other Nodes
MB(X) = Par(X) ∪ Childs(X) ∪ Par(Childs(X))
[Example network: Sprinkler On → Grass Wet ← Rained]
Galaxy Quest
Two astronomers, in Chile & Alaska, measure (M1, M2) the number (N) of stars. Normally there is a chance of error, e, of undercounting or overcounting by 1 star. But sometimes, with probability f, a telescope can be out of focus (events F1 and F2), in which case the affected scientist will undercount by 3 stars.
Structures
[Figure: three candidate network structures over F1, F2, N, M1, M2 for the problem above]
Inference in BNs
We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence.
The graphical independence representation yields efficient inference, with computations organized by network topology.
Two simple algorithms: variable elimination (VE) and junction trees.
Variable Elimination Procedure
The initial potentials are the CPTs in the BN.
Repeat until only the query variable remains:
1. Choose another variable to eliminate.
2. Multiply all potentials that contain the variable.
3. If there is no evidence for the variable, sum the variable out and replace the original potentials by the new result; else, remove the variable based on the evidence.
Finally, normalize the remaining potential to get the final distribution over the query variable. (The two factor operations this relies on are sketched below.)
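A minimal sketch of the two factor operations (multiply, sum out) for Boolean variables; a "potential" is represented as a tuple of variable names plus a table mapping value tuples to numbers. This is an illustrative fragment, not a full VE implementation: evidence handling and the choice of elimination ordering are omitted.

```python
from itertools import product

# A factor is (vars, table): vars is a tuple of variable names,
# table maps a tuple of Boolean values (one per variable) to a number.

def multiply(f1, f2):
    """Pointwise product of two factors over the union of their variables."""
    vars1, t1 = f1
    vars2, t2 = f2
    out_vars = vars1 + tuple(v for v in vars2 if v not in vars1)
    table = {}
    for values in product((True, False), repeat=len(out_vars)):
        assign = dict(zip(out_vars, values))
        table[values] = (t1[tuple(assign[v] for v in vars1)]
                         * t2[tuple(assign[v] for v in vars2)])
    return out_vars, table

def sum_out(factor, var):
    """Eliminate var by summing it out of the factor."""
    vars_, t = factor
    i = vars_.index(var)
    out_vars = vars_[:i] + vars_[i + 1:]
    table = {}
    for values, p in t.items():
        key = values[:i] + values[i + 1:]
        table[key] = table.get(key, 0.0) + p
    return out_vars, table

# Example: eliminate B from P(B) * P(A|B)
f_b = (("B",), {(True,): 0.05, (False,): 0.95})
f_ab = (("A", "B"), {(True, True): 0.9, (True, False): 0.01,
                     (False, True): 0.1, (False, False): 0.99})
print(sum_out(multiply(f_b, f_ab), "B"))   # -> marginal factor over A
```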
Junction Tree Inference Algorithm
Incorporate evidence: for each evidence variable, go to one table that includes that variable, set to 0 all entries that disagree with the evidence, and renormalize this potential.
Upward step: pass messages to parents.
Downward step: pass messages to children.
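A tiny sketch of just the "incorporate evidence" step on one table, reusing the factor representation from the previous block; message passing itself is not shown.

```python
def incorporate_evidence(factor, var, observed_value):
    """Zero out entries that disagree with the evidence, then renormalize the potential."""
    vars_, t = factor
    i = vars_.index(var)
    zeroed = {values: (p if values[i] == observed_value else 0.0)
              for values, p in t.items()}
    z = sum(zeroed.values())
    return vars_, {values: p / z for values, p in zeroed.items()}
```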
Learning
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian learning.
Parameters for a Bayesian network.
Learning the structure of Bayesian networks.
Topics
Bayesian networks overview.
Inference: variable elimination, junction trees.
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian learning.
Parameters for a Bayesian network.
Learning the structure of Bayesian networks.
Coin Flip: Which coin will I use?
P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9
P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3
Uniform prior: all hypotheses are equally likely before we make any observations.
What is the probability of heads after two experiments (heads, then tails)? Your estimate?
Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5.
(P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3.)
Maximum Likelihood Estimate
The hypothesis that best fits the observed data, assuming a uniform prior.
After 2 experiments (heads, tails): most likely coin is C2, so the best estimate for P(H) is P(H|C2) = 0.5.
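Working the numbers: with data d = (H, T), the likelihood of each coin is

```latex
P(d \mid C_1) = 0.1 \times 0.9 = 0.09, \quad
P(d \mid C_2) = 0.5 \times 0.5 = 0.25, \quad
P(d \mid C_3) = 0.9 \times 0.1 = 0.09,
```

so C2 is the maximum-likelihood hypothesis.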
Using Prior Knowledge
Should we always use a uniform prior?
Background knowledge: heads => we have a take-home midterm. Dan likes take-homes... => Dan is more likely to use a coin biased in his favor.
P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70   (P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9 as before)
Maximum A Posteriori (MAP) Estimate
The hypothesis that best fits the observed data, assuming a non-uniform prior.
Most likely coin: C3 (P(C3) = 0.70), so the best estimate for P(H) is P(H|C3) = 0.9.
Bayesian Estimate
Minimizes prediction error, given the data and (generally) assuming a non-uniform prior.
P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485.
Estimate: P(H|HT) = Σ_i P(H|C_i) P(C_i|HT) = 0.680.
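Working the numbers behind the last two slides: with the non-uniform prior and d = (H, T), the posteriors are proportional to P(d|C_i) P(C_i),

```latex
\begin{aligned}
P(C_1 \mid d) &\propto 0.09 \times 0.05 = 0.0045 &&\Rightarrow\ 0.035\\
P(C_2 \mid d) &\propto 0.25 \times 0.25 = 0.0625 &&\Rightarrow\ 0.481\\
P(C_3 \mid d) &\propto 0.09 \times 0.70 = 0.0630 &&\Rightarrow\ 0.485
\end{aligned}
```

so C3 is the MAP hypothesis, while the Bayesian estimate averages over all three:

```latex
P(H \mid d) = 0.1(0.035) + 0.5(0.481) + 0.9(0.485) \approx 0.680.
```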
Summary For Now
Maximum Likelihood Estimate: uniform prior; uses the most likely hypothesis.
Maximum A Posteriori Estimate: any prior; uses the most likely hypothesis.
Bayesian Estimate: any prior; uses a weighted combination of hypotheses.
Topics
Parameter estimation: Maximum Likelihood (ML), Maximum A Posteriori (MAP), Bayesian learning.
Parameters for a Bayesian network.
Learning the structure of Bayesian networks.
Parameter Estimation and Bayesian Networks
[Table of observed true/false values for the network's variables]
We have: Bayes net structure and observations.
We need: Bayes net parameters.
Parameter Estimation and Bayesian Networks
P(B) = ?
Start with a prior, add the data, and then compute either the MAP or the Bayesian estimate.
What Prior to Use?
The following are two common priors:
Binary variable: a Beta prior, conjugate to the binomial likelihood, so the posterior is again a Beta and is easy to compute.
Discrete variable: a Dirichlet prior, conjugate to the multinomial likelihood, so the posterior is again a Dirichlet.
One Prior: Beta Distribution
Beta(a,b): P(p) = Γ(a+b) / (Γ(a) Γ(b)) · p^(a-1) (1-p)^(b-1), for p in [0,1].
For any positive integer y, Γ(y) = (y-1)!
Beta Distribution
Example: flip a coin with a Beta distribution as the prior over p [prob(heads)].
Parameterized by two positive numbers: a, b.
The mean of the distribution, E[p], is a/(a+b).
Specify our prior belief as p = a/(a+b); specify confidence in this belief with high initial values for a and b.
Update the prior based on data by incrementing a for every heads outcome and incrementing b for every tails outcome.
So after h heads out of n flips, our posterior distribution says P(heads) = (a+h)/(a+b+n).
Parameter Estimation and Bayesian Networks
P(B|data) = ?
Prior: Beta(1,4), i.e., prior P(B) = 1/(1+4) = 20%, with equivalent sample size 5.
"+ data" gives the posterior Beta(3,7), i.e., P(B) = 0.3 and P(¬B) = 0.7.
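A minimal sketch of the Beta update from this slide: start from Beta(1,4) and add the observed counts. The counts themselves (2 rows with B, 3 with ¬B) are an assumption, chosen because they are what turns (1,4) into (3,7).

```python
def beta_update(a, b, heads, tails):
    """Posterior Beta parameters and posterior mean after observing counts."""
    a_post, b_post = a + heads, b + tails
    return a_post, b_post, a_post / (a_post + b_post)

# Prior Beta(1,4): P(B) = 1/(1+4) = 0.20, equivalent sample size 5.
# Data: 2 rows with B, 3 rows with ¬B (assumed; the slide only shows the result).
print(beta_update(1, 4, heads=2, tails=3))   # -> (3, 7, 0.3)
```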
Parameter Estimation and Bayesian Networks
Now estimate each row of the Alarm CPT: P(A|E,B), P(A|E,¬B), P(A|¬E,B), P(A|¬E,¬B) = ?
Give each row its own prior, e.g., Beta(2,3); "+ data" updates it, e.g., to Beta(3,4).
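For a CPT row such as P(A|E,¬B), the same Beta update applies, but only to the data rows whose parent values match that row. A small sketch, assuming the data is a list of dicts over the network's variables (the example data values are made up):

```python
def cpt_row_estimate(data, child, parents, a, b):
    """Beta(a,b)-smoothed estimate of P(child=t | parents) from matching rows."""
    rows = [r for r in data if all(r[v] == val for v, val in parents.items())]
    h = sum(1 for r in rows if r[child])
    return (a + h) / (a + b + len(rows))

# e.g. with a Beta(2,3) prior for the P(A | E, ¬B) row:
data = [{"E": True, "B": False, "A": True},
        {"E": True, "B": False, "A": False},
        {"E": False, "B": False, "A": False}]
print(cpt_row_estimate(data, "A", {"E": True, "B": False}, a=2, b=3))
```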
What if we don’t know structure?
Learning The Structure of Bayesian Networks
Search through the space... of possible network structures! (For now, assume we observe all variables.)
For each structure, learn parameters; pick the one that fits the observed data best.
Caveat: won't we end up fully connected? Problem!
Fix: when scoring, add a penalty for model complexity.
Learning The Structure of Bayesian Networks
Search through the space; for each structure, learn parameters; pick the one that fits the observed data best; penalize complex models.
Problem? An exponential number of networks! And we need to learn parameters for each!
Exhaustive search is out of the question. So what now?
Structure Learning as Search: Local Search
Start with some network structure; try to make a change (add, delete, or reverse an edge); see if the new network is any better.
What should the initial state be? A uniform prior over random networks? Based on prior knowledge? The empty network?
How do we evaluate networks? (A hill-climbing skeleton is sketched after the figure below; score functions follow.)
[Figure: example neighborhood in the search space: networks over A, B, C, D, E related by adding, deleting, or reversing a single edge]
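A hill-climbing skeleton for this search. Here `score` is a stand-in for whatever scoring function is chosen (e.g., the BIC score on the next slide), structures are represented as sets of directed edges, and acyclicity checking is omitted for brevity; this is a sketch, not a complete structure learner.

```python
def neighbors(edges, variables):
    """Structures reachable by adding, deleting, or reversing one edge."""
    out = []
    for x in variables:
        for y in variables:
            if x == y:
                continue
            if (x, y) in edges:
                out.append(edges - {(x, y)})                 # delete edge
                out.append(edges - {(x, y)} | {(y, x)})      # reverse edge
            else:
                out.append(edges | {(x, y)})                 # add edge
    return out

def hill_climb(variables, score, start=frozenset()):
    """Greedy local search over structures, by default starting from the empty network."""
    current = frozenset(start)
    while True:
        best = max(neighbors(current, variables), key=score, default=current)
        if score(best) <= score(current):
            return current
        current = frozenset(best)
```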
Score Functions
Bayesian Information Criterion (BIC): log P(D | BN) minus a penalty, where penalty = ½ (# parameters) × log(# data points).
MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well.
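Written out, the BIC score of a candidate network BN with maximum-likelihood parameters on a dataset D of N points is

```latex
\text{BIC}(BN; D) \;=\; \log P(D \mid BN, \hat{\theta})
\;-\; \tfrac{1}{2}\,(\#\,\text{parameters})\,\log N.
```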
Naïve Bayes
[Structure: Class Value is the parent of every feature F1, F2, ..., FN]
Assume that features are conditionally independent given the class variable.
Works well in practice.
Forces probabilities towards 0 and 1.
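A minimal sketch of prediction under the Naïve Bayes assumption with Boolean features; the class names and all parameter values below are illustrative placeholders, not values from the slides.

```python
import math

# P(class) and P(F_i = t | class) for two classes; illustrative numbers only.
p_class = {"spam": 0.3, "ham": 0.7}
p_feat = {"spam": [0.8, 0.6, 0.1], "ham": [0.1, 0.3, 0.4]}

def predict(features):
    """Return P(class | features) under the conditional-independence assumption."""
    log_scores = {}
    for c in p_class:
        s = math.log(p_class[c])
        for p, f in zip(p_feat[c], features):
            s += math.log(p if f else 1.0 - p)
        log_scores[c] = s
    z = sum(math.exp(s) for s in log_scores.values())
    return {c: math.exp(s) / z for c, s in log_scores.items()}

print(predict([True, True, False]))   # posteriors tend to be pushed towards 0 or 1
```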
Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997]
[Structure: Class Value is the parent of every feature, plus a tree of edges among the features F1, ..., FN]
Models a limited set of dependencies between features.
Guaranteed to find the best such structure.
Runs in polynomial time.