Bayesian Networks Statistical Learning CSE 573 Based on lecture notes from David Page and Dan Weld.


1 Bayesian Networks Statistical Learning CSE 573 Based on lecture notes from David Page and Dan Weld

2 © Daniel S. Weld. Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Parameters for a Bayesian network; Learning the structure of Bayesian networks

3 Inference by Enumeration: P(toothache ∨ cavity) = .20 + .072 + .008 = .28, summing the entries of the full joint distribution in which toothache or cavity is true. [joint-distribution table figure omitted]

4 Problems with Enumeration: worst-case time O(d^n), where d = max arity of the random variables (e.g., d = 2 for Boolean T/F) and n = number of random variables. Space complexity is also O(d^n), the size of the joint distribution. Problem: it is hard or impossible to estimate all O(d^n) entries for large problems.
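The enumeration scheme above can be sketched in a few lines of Python. The toy two-variable joint and the variable names are illustrative placeholders, not numbers from the slides:

```python
from itertools import product

def enumerate_query(joint, variables, query_var, evidence):
    """P(query_var | evidence), by summing matching entries of the full
    joint table -- O(d^n) work, as the slide warns."""
    dist = {}
    for value in (True, False):
        total = 0.0
        for assignment in product((True, False), repeat=len(variables)):
            world = dict(zip(variables, assignment))
            if world[query_var] != value:
                continue
            if any(world[var] != val for var, val in evidence.items()):
                continue
            total += joint[assignment]
        dist[value] = total
    z = sum(dist.values())
    return {v: p / z for v, p in dist.items()}  # normalize over the query

# Illustrative 2-variable joint distribution (entries sum to 1):
variables = ("Cavity", "Toothache")
joint = {(True, True): 0.12, (True, False): 0.08,
         (False, True): 0.08, (False, False): 0.72}
posterior = enumerate_query(joint, variables, "Cavity", {"Toothache": True})
# posterior[True] = 0.12 / (0.12 + 0.08) = 0.6
```

For n variables the inner loop visits all 2^n joint entries, which is exactly the blow-up the slide is pointing at.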

5 Independence: A and B are independent iff P(A | B) = P(A), or equivalently P(B | A) = P(B). These two constraints are logically equivalent. Therefore, if A and B are independent: P(A, B) = P(A) P(B).

6 Conditional Independence: P(A, B | C) = P(A | C) P(B | C), or equivalently P(A | B, C) = P(A | C). Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear! Conditional independence is the most basic and robust form of knowledge about uncertain environments.

7 Bayes Nets: in general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation and inference. BNs provide a graphical representation of the conditional independence relations in P: 1. usually quite compact; 2. requires assessment of fewer parameters, those being quite natural (e.g., causal); 3. (usually) efficient inference: query answering and belief update.

8 An Example Bayes Net: nodes Earthquake, Burglary, Radio, Alarm, Nbr1Calls, Nbr2Calls. Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

9 Earthquake Example (con't): if I know whether the Alarm rang, no other evidence influences my degree of belief in Nbr1Calls: P(N1|N2,A,E,B) = P(N1|A); also P(N2|N1,A,E,B) = P(N2|A), and P(E|B) = P(E). By the chain rule we have P(N1,N2,A,E,B) = P(N1|N2,A,E,B) · P(N2|A,E,B) · P(A|E,B) · P(E|B) · P(B) = P(N1|A) · P(N2|A) · P(A|B,E) · P(E) · P(B). The full joint requires only 10 parameters (cf. 63).

10 Bayesian Networks: the graphical structure of a BN reflects conditional independence among variables. Each variable X is a node in the DAG. Edges denote direct probabilistic influence (usually interpreted causally); the parents of X are denoted Par(X). Each node X has a conditional probability distribution P(X | Par(X)). X is conditionally independent of all non-descendants given its parents.

11 Conditional Probability Tables (network as in slide 8): Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A|E,B): e,b: 0.9 (0.1); e,¬b: 0.2 (0.8); ¬e,b: 0.85 (0.15); ¬e,¬b: 0.01 (0.99).

12 Conditional Probability Tables: for a complete specification of the joint distribution, quantify the BN. For each variable X, specify the CPT P(X | Par(X)); the number of parameters is locally exponential in |Par(X)|. If X1, X2, ..., Xn is any topological sort of the network, then we are assured: P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) ··· P(X2 | X1) · P(X1) = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) ··· P(X1).
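The factored form on this slide can be checked mechanically: any joint entry is a product of one CPT lookup per variable. The sketch below uses the slides' P(B) and P(A|E,B) numbers; the CPTs for E, N1 (Nbr1Calls), and N2 (Nbr2Calls) are made-up placeholders, since the deck does not give them:

```python
def joint_prob(assign, cpts, parents):
    """P(x1..xn) = product over variables of P(xi | Par(xi))."""
    p = 1.0
    for var, table in cpts.items():
        key = (assign[var],) + tuple(assign[pa] for pa in parents[var])
        p *= table[key]
    return p

parents = {"B": (), "E": (), "A": ("E", "B"), "N1": ("A",), "N2": ("A",)}
cpts = {
    "B": {(True,): 0.05, (False,): 0.95},          # from the slides
    "E": {(True,): 0.01, (False,): 0.99},          # assumed value
    "A": {},                                       # filled in below
    "N1": {(True, True): 0.9, (False, True): 0.1,  # assumed values
           (True, False): 0.05, (False, False): 0.95},
    "N2": {(True, True): 0.8, (False, True): 0.2,  # assumed values
           (True, False): 0.1, (False, False): 0.9},
}
# Alarm CPT from the slides: P(a | e, b) for each (e, b) setting.
for (e, b), pa in {(True, True): 0.9, (True, False): 0.2,
                   (False, True): 0.85, (False, False): 0.01}.items():
    cpts["A"][(True, e, b)] = pa
    cpts["A"][(False, e, b)] = 1 - pa

p = joint_prob(dict(B=True, E=True, A=True, N1=True, N2=True), cpts, parents)
# p = 0.05 * 0.01 * 0.9 * 0.9 * 0.8 = 0.000324
```

Ten CPT parameters (1 + 1 + 4 + 2 + 2) determine all 2^5 joint entries, which is the point of the factorization.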

13 Given Parents, X is Independent of Non-Descendants

14 © Daniel S. Weld 14 For Example EarthquakeBurglary Alarm Nbr2CallsNbr1Calls Radio

16 Given Markov Blanket, X is Independent of All Other Nodes. MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))

17 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Parameters for a Bayesian network; Learning the structure of Bayesian networks

18 Inference in BNs: the graphical independence representation yields efficient inference schemes. We generally want to compute Pr(X) or Pr(X|E), where E is (conjunctive) evidence. Computations are organized by network topology. Two simple algorithms: variable elimination (VE) and junction trees.

19 Variable Elimination: a factor is a function from some set of variables to a specific value, e.g., f(E,A,N1). CPTs are factors, e.g., P(A|E,B) is a function of A, E, B. VE works by eliminating all variables in turn until there is a factor over only the query variable.

20 Joint Distributions & CPDs vs. Potentials. CPT for P(B | A): P(b|a) = .1, P(¬b|a) = .9, P(b|¬a) = .6, P(¬b|¬a) = .4 (for each setting of the parent, the values of the child sum to 1). Potential: an arbitrary table, e.g., entries .2, .3, .4, .5. Potentials occur when we temporarily forget the meaning associated with a table: 1. entries must be non-negative; 2. entries don't have to sum to 1. They arise when incorporating evidence. CPDs and joints represent probability distributions: 1. for a CPT, given a specific setting of the parents, the values of the child must sum to 1; 2. for a joint, all entries sum to 1.

21 Multiplying Potentials: multiply entries that agree on the shared variable(s). Example: f1(A,B) with a,b = .1; a,¬b = .2; ¬a,b = .5; ¬a,¬b = .8, times f2(B,C) with b,c = .2; ¬b,c = .3; b,¬c = .4; ¬b,¬c = .5, gives f(A,B,C) with a,b,c = .02; a,¬b,c = .06; a,b,¬c = .04; a,¬b,¬c = .10; ¬a,b,c = .10; ¬a,¬b,c = .24; ¬a,b,¬c = .20; ¬a,¬b,¬c = .40.

26 Two more operations. Marginalize (sum out) a variable: given f(A,B) with a,b = .1; ¬a,b = .2; a,¬b = .5; ¬a,¬b = .8, summing out A gives b = .3, ¬b = 1.3. Normalize a potential: multiply by α = 1 / (sum of entries); here α = 1/1.6, giving entries .0625, .125, .3125, .5.

27 Key Observation: Σa (P1 × P2) = (Σa P1) × P2 if A does not appear in P2. Example: P1(A,B) with a,b = .1; a,¬b = .2; ¬a,b = .5; ¬a,¬b = .8, and P2(B,C) with b,c = .2; ¬b,c = .3; b,¬c = .4; ¬b,¬c = .5. Both orders give the same table over (B,C): b,c = .12; ¬b,c = .30; b,¬c = .24; ¬b,¬c = .50, but summing out A first (Σa P1: b = .6, ¬b = 1.0) multiplies smaller tables.
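The identity can be verified directly with two small table operations. This is a minimal sketch, not a full VE implementation; the table values follow the slides' example:

```python
from itertools import product

def multiply(f1, f2):
    """Pointwise product of two factors; a factor is a (vars, table) pair."""
    (v1, t1), (v2, t2) = f1, f2
    vs = v1 + tuple(x for x in v2 if x not in v1)
    table = {}
    for assign in product((True, False), repeat=len(vs)):
        world = dict(zip(vs, assign))
        table[assign] = (t1[tuple(world[x] for x in v1)]
                         * t2[tuple(world[x] for x in v2)])
    return vs, table

def sum_out(f, var):
    """Marginalize var out of a factor by summing matching entries."""
    vs, t = f
    i = vs.index(var)
    out = {}
    for k, p in t.items():
        nk = k[:i] + k[i + 1:]
        out[nk] = out.get(nk, 0.0) + p
    return vs[:i] + vs[i + 1:], out

P1 = (("A", "B"), {(True, True): 0.1, (True, False): 0.2,
                   (False, True): 0.5, (False, False): 0.8})
P2 = (("B", "C"), {(True, True): 0.2, (True, False): 0.4,
                   (False, True): 0.3, (False, False): 0.5})

lhs = sum_out(multiply(P1, P2), "A")        # sum out A after multiplying
rhs = multiply(sum_out(P1, "A"), P2)        # sum out A first: smaller tables
```

Both `lhs` and `rhs` come out to the same (B,C) table (.12, .24, .30, .50), but the right-hand order never builds the 8-entry three-variable table.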

30 Variable Elimination Procedure: the initial potentials are the CPTs in the BN. Repeat until only the query variable remains: 1. choose another variable to eliminate; 2. multiply all potentials that contain the variable; 3. if there is no evidence for the variable, sum the variable out and replace the original potentials by the new result; 4. else, remove the variable based on the evidence. Normalize the remaining potential to get the final distribution over the query variable.
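Putting the steps together, here is a compact runnable sketch of the whole procedure, applied to the six-variable network of the following slides (A→B, A→C, B→D, C→E, {D,E}→F) with the CPT values recoverable from them. Computed without the slides' intermediate rounding, the query P(F | C = true) comes out near .62 / .38:

```python
from itertools import product

def multiply(f1, f2):
    (v1, t1), (v2, t2) = f1, f2
    vs = v1 + tuple(x for x in v2 if x not in v1)
    table = {}
    for assign in product((True, False), repeat=len(vs)):
        world = dict(zip(vs, assign))
        table[assign] = (t1[tuple(world[x] for x in v1)]
                         * t2[tuple(world[x] for x in v2)])
    return vs, table

def sum_out(f, var):
    vs, t = f
    i = vs.index(var)
    out = {}
    for k, p in t.items():
        out[k[:i] + k[i + 1:]] = out.get(k[:i] + k[i + 1:], 0.0) + p
    return vs[:i] + vs[i + 1:], out

def restrict(f, var, value):
    """Incorporate evidence var = value by dropping disagreeing rows."""
    vs, t = f
    i = vs.index(var)
    return (vs[:i] + vs[i + 1:],
            {k[:i] + k[i + 1:]: p for k, p in t.items() if k[i] == value})

def variable_elim(factors, evidence, order):
    for var, val in evidence.items():
        factors = [restrict(f, var, val) if var in f[0] else f for f in factors]
    for var in order:                      # eliminate one variable per pass
        related = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        prod = related[0]
        for f in related[1:]:
            prod = multiply(prod, f)
        factors.append(sum_out(prod, var))
    result = factors[0]
    for f in factors[1:]:
        result = multiply(result, f)
    z = sum(result[1].values())            # normalize the final potential
    return result[0], {k: p / z for k, p in result[1].items()}

def cpt(child_and_parents, p_true):
    """Factor for P(child | parents); p_true maps parent settings to
    P(child = True | parents)."""
    table = {}
    for pk, p in p_true.items():
        table[(True,) + pk] = p
        table[(False,) + pk] = 1 - p
    return child_and_parents, table

factors = [
    cpt(("A",), {(): 0.2}),
    cpt(("B", "A"), {(True,): 0.8, (False,): 0.3}),
    cpt(("C", "A"), {(True,): 0.1, (False,): 0.6}),
    cpt(("D", "B"), {(True,): 0.4, (False,): 0.7}),
    cpt(("E", "C"), {(True,): 0.5, (False,): 0.9}),
    cpt(("F", "D", "E"), {(True, True): 0.9, (False, True): 0.6,
                          (True, False): 0.7, (False, False): 0.1}),
]
_, dist = variable_elim(factors, {"C": True}, order=["A", "B", "D", "E"])
# dist[(True,)] = P(F = true | C = true) = 0.6218
```

The slides report ≈ .628/.372 because they round the intermediate potentials; the exact arithmetic gives .6218/.3782.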

31 Example network (A→B, A→C, B→D, C→E, D,E→F): P(A,B,C,D,E,F) = P(A) P(B|A) P(C|A) P(D|B) P(E|C) P(F|D,E). CPTs: P(a) = .2; P(b|a) = .8, P(b|¬a) = .3; P(c|a) = .1, P(c|¬a) = .6; P(d|b) = .4, P(d|¬b) = .7; P(e|c) = .5, P(e|¬c) = .9; P(f|d,e) = .9, P(f|¬d,e) = .6, P(f|d,¬e) = .7, P(f|¬d,¬e) = .1.

32 Query: P(F | C = true). Elimination ordering: A, B, C, D, E. P(F) = Σa,b,c,d,e ( P(F|D,E) P(D|B) P(E|C) P(B|A) P(C|A) P(A) ) = Σe [ Σd [ Σc [ Σb [ (Σa P(B|A) P(C|A) P(A)) P(D|B) ] P(E|C) ] P(F|D,E) ] ]

33 Query P(F | C = true), ordering A, B, C, D, E. Before eliminating A, multiply all potentials involving A: P(B|A), P(C|A), P(A). The product over (A,B,C): a,b,c = .016; a,¬b,c = .004; a,b,¬c = .144; a,¬b,¬c = .036; ¬a,b,c = .144; ¬a,¬b,c = .336; ¬a,b,¬c = .096; ¬a,¬b,¬c = .224. Sum out A: b,c = .16; b,¬c = .24; ¬b,c = .34; ¬b,¬c = .26.

34 Now eliminate B: multiply all potentials involving B, i.e., the (B,C) potential (.16, .24, .34, .26) and P(D|B) (P(d|b) = .4, P(d|¬b) = .7). Sum out B: c,d = .302; c,¬d = .198; ¬c,d = .278; ¬c,¬d = .222.

35 Next eliminate C: multiply all potentials involving C, i.e., the (C,D) potential and P(E|C) (P(e|c) = .5, P(e|¬c) = .9). We have evidence C = true, so drop the ¬c entries, leaving d,e = .151; d,¬e = .151; ¬d,e = .099; ¬d,¬e = .099.

36 Next eliminate D: multiply all potentials involving D, i.e., the (D,E) potential and P(F|D,E). Sum out D: e,f = .195; e,¬f = .055; ¬e,f = .116; ¬e,¬f = .134.

37 Finally, eliminate E by summing out: f = .195 + .116 = .311; ¬f = .055 + .134 = .189. Normalize with α = 1/.5 = 2: P(f | c) = .622, P(¬f | c) = .378.

38 Notes on VE: each operation is simply a multiplication of factors and a summing-out of a variable. Complexity is determined by the size of the largest factor: 1. e.g., in the example, 3 variables (not 5); 2. linear in the number of variables; 3. exponential in the size of the largest factor, and the elimination ordering greatly impacts factor size; 4. finding an optimal elimination ordering is NP-hard; 5. heuristics and special structure (e.g., polytrees) help. Practically, inference is much more tractable using structure of this sort.

39 Junction Trees: Motivation. Standard algorithms (e.g., variable elimination) are inefficient if the undirected graph underlying the Bayes net contains cycles. We can avoid cycles if we turn highly interconnected subsets of the nodes into "supernodes".

40 Step 1: Make the Graph Moral A F E C D B Add edge between non-adjacent parents of same child

41 Step 2: Remove Directionality A F E C D B

42 Step 3: Triangulate the Graph A F E C D B While there are cycles with length > 3 and no chord, add chord

45 Triangulation Checking. The following algorithm terminates successfully only if the graph is triangulated: choose any node in the graph and label it 1; for i = 2 to n: 1. choose the node with the most labeled neighbors and label it i; 2. if any two labeled neighbors of i are not adjacent to each other, fail. Otherwise, succeed.
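This check (a maximum cardinality search) is short to implement. The first test graph below is the moralized, triangulated six-node graph from the preceding slides (cliques ABC, BCD, CDE, DEF); the chordless 4-cycle is an added counterexample:

```python
def graph(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def is_triangulated(adj):
    """Maximum cardinality search: label the node with the most labeled
    neighbors; fail if two labeled neighbors of it are not adjacent."""
    nodes = list(adj)
    labeled = [nodes[0]]
    unlabeled = set(nodes[1:])
    while unlabeled:
        node = max(unlabeled, key=lambda n: len(adj[n] & set(labeled)))
        prev = adj[node] & set(labeled)
        for u in prev:
            for v in prev:
                if u != v and v not in adj[u]:
                    return False           # two labeled neighbors not adjacent
        labeled.append(node)
        unlabeled.remove(node)
    return True

triangulated = graph([("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"),
                      ("C", "D"), ("C", "E"), ("D", "E"), ("D", "F"),
                      ("E", "F")])
square = graph([("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")])
```

`is_triangulated(triangulated)` succeeds, while the chordless square fails at the last labeled node, whose two labeled neighbors are opposite corners.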

46 Is It Triangulated Yet? Worked example: labeling the six-node graph 1 through 6 in this way, every newly labeled node's labeled neighbors are pairwise adjacent, so the check succeeds: the graph is triangulated.

52 Triangulation: Key Points In general, many triangulations may exist The only efficient algorithms are heuristic Jensen and Jensen (1994) showed that any scheme for exact inference (belief updating given evidence) must perform triangulation (perhaps hidden as in Draper 1995)

53 Step 4: Build the Clique Graph A F E C D B Find all cliques in moralized, triangulated graph If two cliques intersect, they are joined by an edge labeled with their intersection Clique: maximal complete subgraph (e.g., ABC, BCD)

54 Step 4: Build the Clique Graph (continued). Nodes: ABC, BCD, CDE, DEF. Edges labeled with intersections: ABC-BCD: B,C; BCD-CDE: C,D; CDE-DEF: D,E; ABC-CDE: C; BCD-DEF: D.

55 Junction Trees A junction tree is a subgraph of the clique graph that 1.Is a tree 2.Contains all the nodes of the clique graph 3.Satisfies the junction tree property Junction tree property: For each pair U, V of cliques with intersection S, all cliques on the path between U and V contain S

56 Clique Graph to Junction Tree: we can perform exact inference efficiently on a junction tree (although CPTs may be large). But can we always build a junction tree? If so, how? Let the weight of an edge in the clique graph be the cardinality of the separator. Then any maximum-weight spanning tree is a junction tree (Jensen & Jensen 1994).

57 Step 5: Build the Junction Tree. Keep the edges ABC-BCD (B,C), BCD-CDE (C,D), and CDE-DEF (D,E).

58 Step 6: Choose a Root (any clique node of the junction tree).

59 Step 7: Populate Clique Nodes For each distribution (CPT) in the original Bayes Net, put this distribution into one of the clique nodes that contains all the variables referenced by the CPT. (At least one such node must exist because of the moralization step). For each clique node, take the product of the distributions (as in variable elimination).

60 Step 8: Assign CPTs. P(A,B,C) = P(A) P(B|A) P(C|A) goes to clique ABC; P(D|B) to BCD; P(E|C) to CDE; P(F|D,E) to DEF.

61 Junction Tree Inference Algorithm. Incorporate evidence: for each evidence variable, go to one table including that variable, set to 0 all entries that disagree with the evidence, and renormalize this potential. Upward step: pass messages to parents. Downward step: pass messages to children.

62 Upward Step: each leaf sends a message to its parent. The message is the marginal of its table, summing out any variable not in the separator. When a parent receives a message from a child, it multiplies its table by the message table to obtain its new table. When a parent has received messages from all its children, it repeats the process. This continues until the root has received messages from all its children.

63 Downward Step: the root sends a message to each child. The root divides its current table by the message received from that child, marginalizes the resulting table to the separator, and sends this to the child. The child multiplies the message from its parent by the child's current table. The process repeats (the child acts as root) and continues until all leaves have received messages from their parents.

64 Answering Queries: Final Step With junction tree, can query any variable Find clique node containing that variable and sum out the other variables to obtain answer If given new evidence, we must repeat the Upward-Downward process Only need to compute junction tree once! A junction tree can be thought of as storing the subjoints computed during elimination See Finn V. Jensen “Bayesian Networks and Decision Graphs” for algorithm description

65 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Parameters for a Bayesian network; Learning the structure of Bayesian networks

66 Coin Flip. Which coin will I use? Three coins C1, C2, C3 with P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9 and P(C1) = P(C2) = P(C3) = 1/3. Prior: probability of a hypothesis before we make any observations.

67 Coin Flip (same setup). Uniform prior: all hypotheses are equally likely before we make any observations.

68 Experiment 1: Heads. Which coin did I use? P(C1|H) = ? P(C2|H) = ? P(C3|H) = ?

69 Experiment 1: Heads. P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600. Posterior: probability of a hypothesis given data.

70 Terminology. Prior: probability of a hypothesis before we see any data. Uniform prior: a prior that makes all hypotheses equally likely. Posterior: probability of a hypothesis after we see some data. Likelihood: probability of the data given a hypothesis.
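These quantities are one application of Bayes' rule, P(Ci | data) ∝ P(data | Ci) P(Ci). A small sketch that reproduces the posteriors shown in the surrounding slides:

```python
def coin_posterior(priors, p_heads, flips):
    """P(C_i | flips) ∝ P(flips | C_i) P(C_i); flips is a string like 'HT'."""
    weights = []
    for prior, ph in zip(priors, p_heads):
        like = 1.0
        for f in flips:                    # i.i.d. flips: multiply per-flip
            like *= ph if f == "H" else 1 - ph
        weights.append(prior * like)
    z = sum(weights)
    return [w / z for w in weights]        # normalize

p_heads = [0.1, 0.5, 0.9]
uniform = [1 / 3] * 3
post_h = coin_posterior(uniform, p_heads, "H")    # [0.067, 0.333, 0.600]
post_ht = coin_posterior(uniform, p_heads, "HT")  # [0.21, 0.58, 0.21]
```

After a single head the posterior already favors C3; the following tail pulls it back toward C2, matching the slides' numbers.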

71 Experiment 2: Tails. Which coin did I use? P(C1|HT) = ? P(C2|HT) = ? P(C3|HT) = ?

72 Experiment 2: Tails. P(C1|HT) = 0.21, P(C2|HT) = 0.58, P(C3|HT) = 0.21.

74 Your Estimate? What is the probability of heads after two experiments? Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5.

75 Maximum Likelihood Estimate: the hypothesis that best fits the observed data, assuming a uniform prior. Here: most likely coin C2, best estimate P(H) = P(H|C2) = 0.5.

76 Using Prior Knowledge. Should we always use a uniform prior? Background knowledge: heads means we have a take-home midterm, and Dan likes take-homes, so Dan is more likely to use a coin biased in his favor.

77 Using Prior Knowledge. We can encode it in the prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.

78 Experiment 1: Heads (informative prior). Which coin did I use? P(C1|H) = ? P(C2|H) = ? P(C3|H) = ?

79 Experiment 1: Heads. P(C1|H) = 0.006, P(C2|H) = 0.165, P(C3|H) = 0.829. Compare with the uniform-prior posterior after experiment 1: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600.

80 Experiment 2: Tails. Which coin did I use? P(C1|HT) = ? P(C2|HT) = ? P(C3|HT) = ?

81 Experiment 2: Tails. P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485.

83 Your Estimate? What is the probability of heads after two experiments? Most likely coin: C3. Best estimate for P(H): P(H|C3) = 0.9.

84 Maximum A Posteriori (MAP) Estimate: the hypothesis that best fits the observed data, assuming a non-uniform prior. Here: most likely coin C3 (P(C3) = 0.70), best estimate P(H) = P(H|C3) = 0.9.

85 Did We Do the Right Thing? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485.

86 Did We Do the Right Thing? With P(C2|HT) = 0.481 and P(C3|HT) = 0.485, C2 and C3 are almost equally likely, yet MAP commits entirely to C3.

87 A Better Estimate. Recall P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. Weighting each coin's bias by its posterior: P(H|data) = 0.035 × 0.1 + 0.481 × 0.5 + 0.485 × 0.9 = 0.680.

88 Bayesian Estimate: P(H|data) = Σi P(H|Ci) P(Ci|data) = 0.680. Minimizes prediction error given the data, (generally) assuming a non-uniform prior.
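The three estimates can be computed side by side. A sketch for the HT data with the slides' informative prior (.05, .25, .70):

```python
p_heads = [0.1, 0.5, 0.9]
informative = [0.05, 0.25, 0.70]

def coin_posterior(priors, flips):
    """P(C_i | flips) ∝ P(flips | C_i) P(C_i)."""
    weights = []
    for prior, ph in zip(priors, p_heads):
        like = 1.0
        for f in flips:
            like *= ph if f == "H" else 1 - ph
        weights.append(prior * like)
    z = sum(weights)
    return [w / z for w in weights]

post = coin_posterior(informative, "HT")     # [0.035, 0.481, 0.485]
# MAP: bias of the single most probable coin under the informative prior.
map_estimate = p_heads[max(range(3), key=lambda i: post[i])]
# ML: same thing under a uniform prior.
ml_post = coin_posterior([1 / 3] * 3, "HT")
ml_estimate = p_heads[max(range(3), key=lambda i: ml_post[i])]
# Bayesian: posterior-weighted combination of all three hypotheses.
bayes_estimate = sum(p * ph for p, ph in zip(post, p_heads))
```

This gives ML = 0.5, MAP = 0.9, and Bayesian = 0.680, matching the slides: the Bayesian estimate refuses to throw away the nearly tied C2.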

89 Comparison. After more experiments (HT followed by eight more heads): ML (maximum likelihood): P(H) = 0.5; after the 10 experiments, P(H) = 0.9. MAP (maximum a posteriori): P(H) = 0.9; after 10 experiments, P(H) = 0.9. Bayesian: P(H) = 0.68; after 10 experiments, P(H) = 0.9.

90 Comparison ML (Maximum Likelihood): Easy to compute MAP (Maximum A Posteriori): Still easy to compute Incorporates prior knowledge Bayesian: Minimizes error => great when data is scarce Potentially much harder to compute

91 Summary For Now. Maximum likelihood estimate: uniform prior; use the most likely hypothesis. Maximum a posteriori estimate: any prior; use the most likely hypothesis. Bayesian estimate: any prior; use a weighted combination of hypotheses.

92 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Parameters for a Bayesian network; Learning the structure of Bayesian networks

93 Parameter Estimation and Bayesian Networks. Training data: columns E, B, R, A, J, M with rows of true/false observations (T F T T F T; F F F F F T; F T F T T T; F F F T T T; F T F F F F; ...). We have the Bayes net structure and the observations; we need the Bayes net parameters.

94 Parameter Estimation and Bayesian Networks. P(B) = ? Prior + data: now compute either the MAP or the Bayesian estimate.

95 What Prior to Use? Two common priors: for a binary variable, a Beta prior, conjugate to the binomial likelihood, so the posterior is again a Beta and is easy to compute; for a discrete variable, a Dirichlet prior, conjugate to the multinomial likelihood, so the posterior is again a Dirichlet and is easy to compute.

96 One Prior: Beta Distribution. Beta(a,b) has density p(θ) ∝ θ^(a-1) (1-θ)^(b-1), with normalizing constant Γ(a+b) / (Γ(a) Γ(b)). For any positive integer y, Γ(y) = (y-1)!.


98 Beta Distribution. Example: flip a coin with a Beta distribution as the prior over p = prob(heads). 1. Parameterized by two positive numbers a, b. 2. The mean of the distribution, E[p], is a/(a+b). 3. Specify our prior belief as p = a/(a+b). 4. Specify confidence in this belief with high initial values for a and b. Updating our prior belief based on data: increment a for every heads outcome, increment b for every tails outcome. So after h heads out of n flips, our posterior distribution gives P(heads) = (a+h)/(a+b+n).
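The update rule above is small enough to write out directly; the prior strength and flip counts below are illustrative choices, not values from the slides:

```python
def beta_update(a, b, heads, tails):
    """Posterior of a Beta(a, b) prior after observing coin flips:
    heads increment a, tails increment b."""
    return a + heads, b + tails

def mean_p_heads(a, b):
    """Posterior-mean estimate E[p] = a / (a + b)."""
    return a / (a + b)

a, b = 2, 2                                  # weak prior centered on 0.5
a, b = beta_update(a, b, heads=8, tails=2)   # observe 8 heads, 2 tails
estimate = mean_p_heads(a, b)                # (2 + 8) / (4 + 10) = 10/14
```

With a stronger prior (say Beta(50, 50)), the same ten flips would barely move the estimate, which is exactly the "confidence via high initial values" point on the slide.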

99 Parameter Estimation and Bayesian Networks. P(B) = ? Prior Beta(1,4) + data (two B = T, three B = F among the five rows) = Beta(3,7), giving P(B) = .3, P(¬B) = .7.

100 Parameter Estimation and Bayesian Networks. P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ?

101 Parameter Estimation and Bayesian Networks. P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? Prior Beta(2,3) + data = Beta(3,4).

102 What if we don’t know structure?

103 Learning the Structure of Bayesian Networks. Search through the space of possible network structures! (For now, assume we observe all variables.) For each structure, learn parameters; pick the one that fits the observed data best. Caveat: won't we end up fully connected, since extra edges can only fit the data at least as well? Fix: when scoring, add a penalty proportional to model complexity. Any remaining problem?

104 Learning The Structure of Bayesian Networks Search thru the space For each structure, learn parameters Pick the one that fits observed data best Problem? Exponential number of networks! And we need to learn parameters for each! Exhaustive search out of the question! So what now?

105 Structure Learning as Search: local search. 1. Start with some network structure. 2. Try to make a change (add, delete, or reverse an edge). 3. See if the new network is any better. What should the initial state be? A uniform prior over random networks? Based on prior knowledge? An empty network? How do we evaluate networks?

106 [figure: five candidate network structures over nodes A, B, C, D, E]

107 Score Functions. Bayesian Information Criterion (BIC): log P(D | BN) minus a penalty, where penalty = ½ (# parameters) log (# data points). MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well.
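The BIC score is a one-liner; the log-likelihood values and parameter counts below are made-up placeholders standing in for two candidate structures:

```python
import math

def bic_score(log_likelihood, num_params, num_points):
    """BIC: fit (log-likelihood of the data under the network) minus
    half the number of free parameters times log(# data points)."""
    return log_likelihood - 0.5 * num_params * math.log(num_points)

# A sparser structure with a slightly worse fit can still win:
sparse = bic_score(-100.0, num_params=10, num_points=1000)
dense = bic_score(-95.0, num_params=30, num_points=1000)
```

Here the denser network gains 5 nats of fit but pays 20 extra parameters × ½ log(1000) ≈ 69 nats of penalty, so local search would keep the sparser structure.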

108 Naïve Bayes: class variable with features F1 ... FN; assume the features are conditionally independent given the class variable. Works well in practice, but forces probabilities towards 0 and 1.

109 Tree Augmented Naïve Bayes (TAN) [Friedman,Geiger & Goldszmidt 1997] F 2F N-2F N-1F NF 1F 3 Class Value … Models limited set of dependencies Guaranteed to find best structure Runs in polynomial time

