1
Bayesian Networks and Statistical Learning. CSE 573. Based on lecture notes from David Page and Dan Weld.
2
© Daniel S. Weld 2 Topics Bayesian networks overview Inference Variable elimination Junction trees Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks
3
© Daniel S. Weld 3 Inference by Enumeration P(toothache ∨ cavity) = 0.20 + 0.072 + 0.008 = 0.28: sum the entries of the full joint distribution consistent with the query (P(toothache) = 0.20, plus the two cavity-without-toothache entries 0.072 and 0.008).
4
Problems with Enumeration Worst case time: O(d^n), where d = max arity of the random variables (e.g., d = 2 for Boolean T/F) and n = number of random variables. Space complexity is also O(d^n): the size of the joint distribution. Problem: hard/impossible to estimate all O(d^n) entries for large problems.
5
© Daniel S. Weld 5 Independence A and B are independent iff: P(A | B) = P(A), or P(B | A) = P(B). These two constraints are logically equivalent. Therefore, if A and B are independent: P(A, B) = P(A) P(B).
6
© Daniel S. Weld 6 Conditional Independence P(A, B | C) = P(A | C) P(B | C), or equivalently P(A | B, C) = P(A | C). Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!! Conditional independence is the most basic & robust form of knowledge about uncertain environments.
7
© Daniel S. Weld 7 Bayes Nets In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation & inference. BNs provide a graphical representation of the conditional independence relations in P: 1. usually quite compact; 2. requires assessment of fewer parameters, those being quite natural (e.g., causal); 3. efficient (usually) inference: query answering and belief update.
8
© Daniel S. Weld 8 An Example Bayes Net Nodes: Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls. Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A | E,B): (e,b) 0.9 (0.1); (e,¬b) 0.2 (0.8); (¬e,b) 0.85 (0.15); (¬e,¬b) 0.01 (0.99).
9
© Daniel S. Weld 9 Earthquake Example (con't) If I know whether the Alarm went off, no other evidence influences my degree of belief in Nbr1Calls: P(N1|N2,A,E,B) = P(N1|A); also P(N2|N1,A,E,B) = P(N2|A) and P(E|B) = P(E). By the chain rule we have P(N1,N2,A,E,B) = P(N1|N2,A,E,B) · P(N2|A,E,B) · P(A|E,B) · P(E|B) · P(B) = P(N1|A) · P(N2|A) · P(A|B,E) · P(E) · P(B). The full joint then requires only 10 parameters (cf. 63). Nodes: Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls.
10
© Daniel S. Weld 10 Bayesian Networks The graphical structure of a BN reflects the conditional independence among variables. Each variable X is a node in the DAG. Edges denote direct probabilistic influence (usually interpreted causally); the parents of X are denoted Par(X). Each node X has a conditional probability distribution P(X | Par(X)). X is conditionally independent of all nondescendants given its parents.
11
© Daniel S. Weld 11 Conditional Probability Tables Nodes: Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls. Pr(B=t) = 0.05, Pr(B=f) = 0.95. Pr(A | E,B): (e,b) 0.9 (0.1); (e,¬b) 0.2 (0.8); (¬e,b) 0.85 (0.15); (¬e,¬b) 0.01 (0.99).
12
© Daniel S. Weld 12 Conditional Probability Tables For a complete specification of the joint distribution, quantify the BN: for each variable X, specify the CPT P(X | Par(X)); the number of parameters is locally exponential in |Par(X)|. If X1, X2, ..., Xn is any topological sort of the network, then we are assured: P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) ··· P(X2 | X1) · P(X1) = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) ··· P(X1).
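The factorization above can be turned directly into code: a full-joint entry is just a product of local CPT lookups. A minimal sketch for the burglary network follows; the Pr(B) and Pr(A|E,B) numbers are read off the earlier CPT slide, while the P(E), P(N1|A) and P(N2|A) values are made-up placeholders for illustration.

```python
# Chain-rule factorization for the burglary network (illustrative sketch only).
P_B_TRUE = 0.05
P_E_TRUE = 0.01                                   # assumed, not on the slides
P_A = {(True, True): 0.9, (True, False): 0.2,     # P(A=t | E, B), keyed by (E, B)
       (False, True): 0.85, (False, False): 0.01}
P_N1 = {True: 0.9, False: 0.05}                   # P(N1=t | A), assumed
P_N2 = {True: 0.8, False: 0.1}                    # P(N2=t | A), assumed

def bern(p_true, value):
    """Probability of a Boolean `value` when P(True) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(n1, n2, a, e, b):
    """P(N1,N2,A,E,B) = P(N1|A) P(N2|A) P(A|E,B) P(E) P(B)."""
    return (bern(P_N1[a], n1) * bern(P_N2[a], n2) *
            bern(P_A[(e, b)], a) * bern(P_E_TRUE, e) * bern(P_B_TRUE, b))

# e.g. both neighbours call, the alarm rang, no earthquake, burglary happened:
print(joint(True, True, True, False, True))       # 0.9 * 0.8 * 0.85 * 0.99 * 0.05
```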
13
© Daniel S. Weld 13 Given Parents, X is Independent of Non-Descendants
14
© Daniel S. Weld 14 For Example: Earthquake, Burglary, Alarm, Radio, Nbr1Calls, Nbr2Calls (the burglary network again).
16
© Daniel S. Weld 16 Given Markov Blanket, X is Independent of All Other Nodes MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))
17
© Daniel S. Weld 17 Topics Bayesian networks overview Inference Variable elimination Junction trees Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks
18
© Daniel S. Weld 18 Inference in BNs The graphical independence representation yields efficient inference schemes We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence Computations organized by network topology Two simple algorithms: Variable elimination (VE) Junction trees
19
© Daniel S. Weld 19 Variable Elimination A factor is a function from some set of variables to a value: e.g., f(E,A,N1). CPTs are factors, e.g., P(A|E,B) is a function of A, E, B. VE works by eliminating all variables in turn until there is a factor over only the query variable.
20
Joint Distributions & CPDs vs. Potentials CPT for P(B | A): for each value of A the two entries sum to 1 (e.g., 0.1/0.9 and 0.6/0.4). Potential over A, B: entries 0.2, 0.3, 0.4, 0.5. CPTs and joints represent probability distributions: 1. for a CPT, for each specific setting of the parents the values of the child must sum to 1; 2. for a joint, all entries sum to 1. Potentials occur when we temporarily forget the meaning associated with the table: 1. entries must be non-negative; 2. they don't have to sum to 1. Potentials arise when incorporating evidence.
21
Multiplying Potentials f1(A,B) × f2(B,C) = f3(A,B,C), where each entry of f3 is the product of the matching entries: f3(a,b,c) = f1(a,b) · f2(b,c).
f1(A,B): f1(a,b) = 0.1, f1(a,¬b) = 0.2, f1(¬a,b) = 0.5, f1(¬a,¬b) = 0.8.
f2(B,C): f2(b,c) = 0.2, f2(b,¬c) = 0.4, f2(¬b,c) = 0.3, f2(¬b,¬c) = 0.5.
f3(A,B,C): a row: (b,c) 0.02, (¬b,c) 0.06, (b,¬c) 0.04, (¬b,¬c) 0.10; ¬a row: (b,c) 0.10, (¬b,c) 0.24, (b,¬c) 0.20, (¬b,¬c) 0.40. (E.g., f3(a,b,c) = 0.1 × 0.2 = 0.02.)
26
Marginalize (sum out) a variable: e.g., summing B out of the potential f1(A,B) above (0.1, 0.2, 0.5, 0.8) gives a potential over A: a → 0.1 + 0.2 = 0.3, ¬a → 0.5 + 0.8 = 1.3. Normalize a potential: multiply by α = 1/(sum of entries) = 1/1.6 here, giving 0.0625, 0.125, 0.3125, 0.5.
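A minimal sketch of potentials as Python objects, with the three operations used on the last few slides (pointwise product, summing out a variable, normalization). The dictionary-of-tuples representation and the names are my own choices, not the course code.

```python
from itertools import product

class Factor:
    """A potential: a table of non-negative numbers indexed by assignments to `vars`."""
    def __init__(self, vars, table):
        self.vars = tuple(vars)              # e.g. ('A', 'B')
        self.table = dict(table)             # e.g. {(True, True): 0.1, ...}

    def __mul__(self, other):
        """Pointwise product over the union of the two variable sets."""
        vars = self.vars + tuple(v for v in other.vars if v not in self.vars)
        table = {}
        for assign in product([True, False], repeat=len(vars)):
            env = dict(zip(vars, assign))
            table[assign] = (self.table[tuple(env[v] for v in self.vars)] *
                             other.table[tuple(env[v] for v in other.vars)])
        return Factor(vars, table)

    def sum_out(self, var):
        """Marginalize: add up entries that agree on every variable except `var`."""
        i = self.vars.index(var)
        new_vars = self.vars[:i] + self.vars[i + 1:]
        table = {}
        for assign, p in self.table.items():
            key = assign[:i] + assign[i + 1:]
            table[key] = table.get(key, 0.0) + p
        return Factor(new_vars, table)

    def normalize(self):
        """Scale the entries so they sum to 1."""
        z = sum(self.table.values())
        return Factor(self.vars, {k: v / z for k, v in self.table.items()})

# The potentials from the slides above:
f1 = Factor('AB', {(True, True): 0.1, (True, False): 0.2,
                   (False, True): 0.5, (False, False): 0.8})
f2 = Factor('BC', {(True, True): 0.2, (True, False): 0.4,
                   (False, True): 0.3, (False, False): 0.5})
f3 = f1 * f2                     # the eight-entry product table (0.02 ... 0.40)
print(f1.sum_out('B').table)     # {(True,): 0.3, (False,): 1.3}
print(f1.normalize().table)      # 0.0625, 0.125, 0.3125, 0.5
```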
27
Key Observation Σ_A (P1 × P2) = (Σ_A P1) × P2, provided A does not appear in P2.
Left side: build f1(A,B) × f2(B,C) (the 8-entry table above), then sum out A: (b,c) 0.12, (¬b,c) 0.30, (b,¬c) 0.24, (¬b,¬c) 0.50.
Right side: first sum A out of f1 (b → 0.6, ¬b → 1.0), then multiply by f2(B,C): the same 0.12, 0.30, 0.24, 0.50.
So summations can be pushed inside products, which is what makes variable elimination efficient.
30
Variable Elimination Procedure The initial potentials are the CPTs in the BN. Repeat until only the query variable remains: 1. Choose another variable to eliminate. 2. Multiply all potentials that contain the variable. 3. If there is no evidence for the variable, sum the variable out and replace the original potentials by the new result. 4. Else, remove the variable based on the evidence. Finally, normalize the remaining potential to get the final distribution over the query variable.
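A sketch of this loop on top of the Factor class from the potentials slide above. The restrict helper implements step 4 (dropping entries inconsistent with the evidence); everything here is illustrative code under those assumptions, not the course implementation.

```python
def restrict(factor, evidence):
    """Keep only entries consistent with `evidence` and drop those variables."""
    keep = [i for i, v in enumerate(factor.vars) if v not in evidence]
    new_vars = tuple(factor.vars[i] for i in keep)
    table = {}
    for assign, p in factor.table.items():
        if all(assign[i] == evidence[v]
               for i, v in enumerate(factor.vars) if v in evidence):
            table[tuple(assign[i] for i in keep)] = p
    return Factor(new_vars, table)

def variable_elimination(factors, query, evidence, order):
    """Eliminate the variables in `order`, then normalize over the query variable."""
    factors = [restrict(f, evidence) for f in factors]        # incorporate evidence
    for var in order:
        if var == query or var in evidence:
            continue
        touching = [f for f in factors if var in f.vars]       # multiply these ...
        if not touching:
            continue
        prod = touching[0]
        for f in touching[1:]:
            prod = prod * f
        others = [f for f in factors if var not in f.vars]
        factors = others + [prod.sum_out(var)]                 # ... then sum out var
    result = factors[0]
    for f in factors[1:]:
        result = result * f
    return result.normalize()
```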
31
An example network over A, B, C, D, E, F with edges A → B, A → C, B → D, C → E, D → F, E → F.
P(A,B,C,D,E,F) = P(A) P(B|A) P(C|A) P(D|B) P(E|C) P(F|D,E)
CPTs: P(a) = 0.2; P(b|a) = 0.8, P(b|¬a) = 0.3; P(c|a) = 0.1, P(c|¬a) = 0.6; P(d|b) = 0.4, P(d|¬b) = 0.7; P(e|c) = 0.5, P(e|¬c) = 0.9; P(f|d,e) = 0.9, P(f|d,¬e) = 0.7, P(f|¬d,e) = 0.6, P(f|¬d,¬e) = 0.1.
32
Query: P(F | C = true). Elimination ordering: A, B, C, D, E.
P(F) = Σ_{a,b,c,d,e} P(F|D,E) P(D|B) P(E|C) P(B|A) P(C|A) P(A)
     = Σ_e [ Σ_d [ Σ_c [ Σ_b [ ( Σ_a P(B|A) P(C|A) P(A) ) P(D|B) ] P(E|C) ] P(F|D,E) ] ]
33
Query: P(F | C = true), elimination ordering A, B, C, D, E. Before eliminating A, multiply all potentials involving A: P(A) (0.2, 0.8), P(B|A) (0.8/0.2 for a; 0.3/0.7 for ¬a), P(C|A) (0.1/0.9 for a; 0.6/0.4 for ¬a).
Product over (A,B,C): a row: (b,c) 0.016, (b,¬c) 0.144, (¬b,c) 0.004, (¬b,¬c) 0.036; ¬a row: (b,c) 0.144, (b,¬c) 0.096, (¬b,c) 0.336, (¬b,¬c) 0.224.
Sum out A, leaving a potential over (B,C): (b,c) 0.16, (b,¬c) 0.24, (¬b,c) 0.34, (¬b,¬c) 0.26.
34
Now eliminate B: multiply all potentials involving B: the (B,C) potential (0.16, 0.24, 0.34, 0.26) and P(D|B) (d|b 0.4, ¬d|b 0.6, d|¬b 0.7, ¬d|¬b 0.3).
Product over (B,C,D): (b,c,d) 0.064, (b,c,¬d) 0.096, (b,¬c,d) 0.096, (b,¬c,¬d) 0.144, (¬b,c,d) 0.238, (¬b,c,¬d) 0.102, (¬b,¬c,d) 0.182, (¬b,¬c,¬d) 0.078.
Sum out B, leaving a potential over (C,D): (c,d) 0.302, (c,¬d) 0.198, (¬c,d) 0.278, (¬c,¬d) 0.222.
35
Next, eliminate C: multiply all potentials involving C: the (C,D) potential and P(E|C) (e|c 0.5, e|¬c 0.9).
Product over (C,D,E): (c,d,e) 0.151, (c,d,¬e) 0.151, (c,¬d,e) 0.099, (c,¬d,¬e) 0.099, (¬c,d,e) 0.250, (¬c,d,¬e) 0.028, (¬c,¬d,e) 0.200, (¬c,¬d,¬e) 0.022.
We have evidence C = true, so instead of summing C out we drop the entries inconsistent with the evidence (the ¬c rows), leaving a potential over (D,E): (d,e) 0.151, (d,¬e) 0.151, (¬d,e) 0.099, (¬d,¬e) 0.099.
36
Next, eliminate D: multiply all potentials involving D: the (D,E) potential and P(F|D,E).
Product over (D,E,F): (d,e,f) 0.136, (d,e,¬f) 0.015, (d,¬e,f) 0.106, (d,¬e,¬f) 0.040, (¬d,e,f) 0.059, (¬d,e,¬f) 0.040, (¬d,¬e,f) 0.010, (¬d,¬e,¬f) 0.089.
Sum out D, leaving a potential over (E,F): (e,f) 0.195, (e,¬f) 0.055, (¬e,f) 0.116, (¬e,¬f) 0.129.
37
Finally, eliminate E: sum E out of the (E,F) potential: f → 0.195 + 0.116 = 0.311, ¬f → 0.055 + 0.129 = 0.184.
Normalize with α = 1/(0.311 + 0.184) = 2.0202: P(f | c) ≈ 0.628, P(¬f | c) ≈ 0.372.
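Running the variable_elimination sketch from above on this example reproduces the answer up to rounding (the slides round intermediate tables to three decimals). The CPT numbers below are read off the slides; the P(F|D,E) entries are a reconstruction from the tables shown.

```python
T, F = True, False
cpts = [
    Factor('A',   {(T,): 0.2, (F,): 0.8}),
    Factor('BA',  {(T, T): 0.8, (F, T): 0.2, (T, F): 0.3, (F, F): 0.7}),
    Factor('CA',  {(T, T): 0.1, (F, T): 0.9, (T, F): 0.6, (F, F): 0.4}),
    Factor('DB',  {(T, T): 0.4, (F, T): 0.6, (T, F): 0.7, (F, F): 0.3}),
    Factor('EC',  {(T, T): 0.5, (F, T): 0.5, (T, F): 0.9, (F, F): 0.1}),
    Factor('FDE', {(T, T, T): 0.9, (F, T, T): 0.1, (T, T, F): 0.7, (F, T, F): 0.3,
                   (T, F, T): 0.6, (F, F, T): 0.4, (T, F, F): 0.1, (F, F, F): 0.9}),
]
answer = variable_elimination(cpts, query='F', evidence={'C': True},
                              order=['A', 'B', 'D', 'E'])
print(answer.table)   # roughly {(True,): 0.62, (False,): 0.38}; the slides get 0.628 / 0.372
```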
38
© Daniel S. Weld 38 Notes on VE Each operation is a simple multiplication of factors followed by summing out a variable. Complexity is determined by the size of the largest factor: 1. e.g., in the example, 3 variables (not 5); 2. linear in the number of variables; 3. exponential in the size of the largest factor, and the elimination ordering greatly impacts factor size; 4. finding an optimal elimination ordering is NP-hard; 5. use heuristics or special structure (e.g., polytrees). Practically, inference is much more tractable using structure of this sort.
39
Junction Trees: Motivation Standard algorithms (e.g., variable elimination) are inefficient if the undirected graph underlying the Bayes Net contains cycles. We can avoid cycles if we turn highly interconnected subsets of the nodes into "supernodes".
40
Step 1: Make the Graph Moral A F E C D B Add edge between non-adjacent parents of same child
41
Step 2: Remove Directionality A F E C D B
42
Step 3: Triangulate the Graph A F E C D B While there are cycles of length > 3 with no chord, add a chord (here, edges B-C and C-D are added).
45
Triangulation Checking The following algorithm only succeeds if the graph is triangulated: Choose any node in the graph and label it 1. For I = 2 to n: 1. choose the node with the most labeled neighbors and label it I; 2. if any two labeled neighbors of I are not adjacent to each other, fail. Otherwise succeed. © Daniel S. Weld 45
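A sketch of this check in Python (maximum cardinality search), returning False instead of failing; the graph used in the usage example is the moralized, triangulated version of the running A..F example.

```python
def is_triangulated(adj):
    """Label nodes in maximum-cardinality order; fail if a newly labeled node's
    already-labeled neighbours are not pairwise adjacent."""
    labeled = []
    unlabeled = set(adj)
    while unlabeled:
        # Pick the unlabeled node with the most already-labeled neighbours.
        node = max(unlabeled, key=lambda n: len(adj[n] & set(labeled)))
        nbrs = adj[node] & set(labeled)
        for u in nbrs:
            if not (nbrs - {u}) <= adj[u]:   # two labeled neighbours not adjacent
                return False
        labeled.append(node)
        unlabeled.remove(node)
    return True

# Moralized and triangulated A..F example (edges B-C and C-D added):
graph = {'A': {'B', 'C'}, 'B': {'A', 'C', 'D'}, 'C': {'A', 'B', 'D', 'E'},
         'D': {'B', 'C', 'E', 'F'}, 'E': {'C', 'D', 'F'}, 'F': {'D', 'E'}}
print(is_triangulated(graph))   # True
```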
46
Is It Triangulated Yet? A F E C D B (the nodes are labeled 1 through 6 in turn; at every step the newly labeled node's labeled neighbors are pairwise adjacent, so the check succeeds)
52
Triangulation: Key Points In general, many triangulations may exist The only efficient algorithms are heuristic Jensen and Jensen (1994) showed that any scheme for exact inference (belief updating given evidence) must perform triangulation (perhaps hidden as in Draper 1995)
53
Step 4: Build the Clique Graph A F E C D B Find all cliques in moralized, triangulated graph If two cliques intersect, they are joined by an edge labeled with their intersection Clique: maximal complete subgraph (e.g., ABC, BCD)
54
Step 4: Build the Clique Graph Cliques: ABC, BCD, CDE, DEF. Edges and separators: ABC-BCD (B,C), BCD-CDE (C,D), CDE-DEF (D,E), ABC-CDE (C), BCD-DEF (D).
55
Junction Trees A junction tree is a subgraph of the clique graph that 1.Is a tree 2.Contains all the nodes of the clique graph 3.Satisfies the junction tree property Junction tree property: For each pair U, V of cliques with intersection S, all cliques on the path between U and V contain S
56
Clique Graph to Junction Tree We can perform exact inference efficiently on a junction tree (although CPTs may be large). But can we always build a junction tree? If so, how? Let the weight of an edge in the clique graph be the cardinality of the separator. Then any maximum-weight spanning tree is a junction tree (Jensen & Jensen 1994).
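A sketch of this construction: build every clique-graph edge, weight it by separator size, and take a maximum-weight spanning tree (Kruskal with a small union-find). The clique list matches the running example; this is illustrative code, not the course implementation.

```python
def junction_tree(cliques):
    """Maximum-weight spanning tree over the clique graph; weight = |separator|."""
    edges = []
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            sep = cliques[i] & cliques[j]
            if sep:
                edges.append((len(sep), i, j, sep))
    edges.sort(reverse=True)               # heaviest separators first

    parent = list(range(len(cliques)))     # union-find forest
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, i, j, sep in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                       # adding the edge keeps it a tree
            parent[ri] = rj
            tree.append((cliques[i], cliques[j], sep))
    return tree

cliques = [frozenset('ABC'), frozenset('BCD'), frozenset('CDE'), frozenset('DEF')]
for u, v, sep in junction_tree(cliques):
    print(set(u), '--', set(v), 'separator', set(sep))
# Picks ABC-BCD (B,C), BCD-CDE (C,D), CDE-DEF (D,E), as on the next slide.
```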
57
Step 5: Build the Junction Tree ABC CDE DEF BCD C,D B,C D,E
58
Step 6: Choose a Root ABC CDE DEFBCD C,D B,C D,E
59
Step 7: Populate Clique Nodes For each distribution (CPT) in the original Bayes Net, put this distribution into one of the clique nodes that contains all the variables referenced by the CPT. (At least one such node must exist because of the moralization step). For each clique node, take the product of the distributions (as in variable elimination).
60
Step 8: Assign CPTs (junction tree: ABC - BCD - CDE - DEF, separators B,C / C,D / D,E). P(A), P(B|A) and P(C|A) all go to clique ABC; their product is the potential P(A,B,C) (the table 0.016, 0.144, 0.004, 0.036, 0.144, 0.096, 0.336, 0.224 computed earlier). P(D|B) goes to clique BCD, P(E|C) to clique CDE, and P(F|D,E) to clique DEF.
61
Junction Tree Inference Algorithm Incorporate Evidence: For each evidence variable, go to one table including that variable Set to 0 all entries that disagree with evidence Renormalize this potential Upward Step: Pass message to parents Downward Step: Pass message to children
62
Upward Step Each leaf sends a message to its parent Message is the marginal of its table, summing out any variable not in the separator When a parent receives a message from a child, it multiplies its table by the message table to obtain its new table When a parent receives messages from all its children, it repeats the process This process continues until the root receives messages from all its children © Daniel S. Weld 62
63
Downward Step The root sends a message to each child: it divides its current table by the message received from that child, marginalizes the resulting table to the separator, and sends the result to the child. The child multiplies the message from its parent by the child's current table. The process repeats (the child acts as root) and continues until all leaves receive messages from their parents. © Daniel S. Weld 63
64
Answering Queries: Final Step With junction tree, can query any variable Find clique node containing that variable and sum out the other variables to obtain answer If given new evidence, we must repeat the Upward-Downward process Only need to compute junction tree once! A junction tree can be thought of as storing the subjoints computed during elimination See Finn V. Jensen “Bayesian Networks and Decision Graphs” for algorithm description
65
© Daniel S. Weld 65 Topics Bayesian networks overview Inference Variable elimination Junction trees Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks
66
Coin Flip Three coins: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. Which coin will I use? P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Prior: probability of a hypothesis before we make any observations.
67
Coin Flip P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. Which coin will I use? P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Uniform Prior: all hypotheses are equally likely before we make any observations.
68
Experiment 1: Heads Which coin did I use? P(C1|H) = ? P(C2|H) = ? P(C3|H) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3.
69
Experiment 1: Heads Which coin did I use? P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.6. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3. Posterior: probability of a hypothesis given data.
70
Terminology Prior: probability of a hypothesis before we see any data. Uniform prior: a prior that makes all hypotheses equally likely. Posterior: probability of a hypothesis after we see some data. Likelihood: probability of the data given a hypothesis.
71
Experiment 2: Tails Which coin did I use? P(C1|HT) = ? P(C2|HT) = ? P(C3|HT) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3.
72
Experiment 2: Tails Which coin did I use? P(C1|HT) = 0.21, P(C2|HT) = 0.58, P(C3|HT) = 0.21. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = P(C2) = P(C3) = 1/3.
74
Your Estimate? What is the probability of heads after two experiments? Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5. (Coins: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; uniform prior.)
75
Your Estimate? Most likely coin: C2, with prior P(C2) = 1/3. Best estimate for P(H): P(H|C2) = 0.5. Maximum Likelihood Estimate: the hypothesis that best fits the observed data, assuming a uniform prior.
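A sketch that reproduces the two experiments above: start from the prior, multiply in the likelihood of each flip, and renormalize. The coin names and biases follow the slides; the function name is my own.

```python
def coin_posterior(prior, flips):
    """Posterior over the three coins after a sequence of flips ('H' / 'T')."""
    p_heads = {'C1': 0.1, 'C2': 0.5, 'C3': 0.9}
    post = dict(prior)
    for flip in flips:
        for c in post:                       # multiply in the flip's likelihood
            post[c] *= p_heads[c] if flip == 'H' else 1 - p_heads[c]
        z = sum(post.values())
        post = {c: p / z for c, p in post.items()}   # renormalize
    return post

uniform = {'C1': 1/3, 'C2': 1/3, 'C3': 1/3}
print(coin_posterior(uniform, 'H'))    # ~ {'C1': 0.066, 'C2': 0.333, 'C3': 0.600}
print(coin_posterior(uniform, 'HT'))   # ~ {'C1': 0.21,  'C2': 0.58,  'C3': 0.21}
# With a uniform prior the most probable coin after HT is C2, so the ML estimate is P(H) = 0.5.
```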
76
Using Prior Knowledge Should we always use a uniform prior? Background knowledge: heads => we have a take-home midterm; Dan likes take-homes… => Dan is more likely to use a coin biased in his favor. (Coins: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9.)
77
Using Prior Knowledge We can encode it in the prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. (Coins: P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9.)
78
Experiment 1: Heads Which coin did I use? P(C1|H) = ? P(C2|H) = ? P(C3|H) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.
79
Experiment 1: Heads Which coin did I use? P(C1|H) = 0.006, P(C2|H) = 0.165, P(C3|H) = 0.829. Compare with the posterior under the uniform prior after Experiment 1: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.600. (P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.)
80
Experiment 2: Tails Which coin did I use? P(C1|HT) = ? P(C2|HT) = ? P(C3|HT) = ? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.
81
Experiment 2: Tails Which coin did I use? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. (P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.)
83
Your Estimate? What is the probability of heads after two experiments? Most likely coin: C3. Best estimate for P(H): P(H|C3) = 0.9. (P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70.)
84
Your Estimate? Most likely coin: C3, with prior P(C3) = 0.70. Best estimate for P(H): P(H|C3) = 0.9. Maximum A Posteriori (MAP) Estimate: the hypothesis that best fits the observed data, assuming a non-uniform prior.
86
Did We Do The Right Thing? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485: C2 and C3 are almost equally likely.
87
A Better Estimate Recall: P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. P(H) = P(C1|HT)·P(H|C1) + P(C2|HT)·P(H|C2) + P(C3|HT)·P(H|C3) = 0.680.
88
Bayesian Estimate P(H) = Σ_i P(Ci|HT) · P(H|Ci) = 0.680, using P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. Bayesian Estimate: minimizes prediction error, given the data and (generally) assuming a non-uniform prior.
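Reusing the coin_posterior sketch from the uniform-prior example, a few lines reproduce this slide: the posterior under the biased prior, and the posterior-weighted (Bayesian) estimate of P(H).

```python
biased = {'C1': 0.05, 'C2': 0.25, 'C3': 0.70}
posterior = coin_posterior(biased, 'HT')        # ~ {'C1': 0.035, 'C2': 0.481, 'C3': 0.485}

p_heads = {'C1': 0.1, 'C2': 0.5, 'C3': 0.9}
bayes_estimate = sum(posterior[c] * p_heads[c] for c in posterior)
print(bayes_estimate)                           # ~ 0.68: a weighted average, not one coin's bias
```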
89
Comparison After more experiments: H, T, then 8 more heads. ML (Maximum Likelihood): P(H) = 0.5; after 10 experiments: P(H) = 0.9. MAP (Maximum A Posteriori): P(H) = 0.9; after 10 experiments: P(H) = 0.9. Bayesian: P(H) = 0.68; after 10 experiments: P(H) ≈ 0.9.
90
Comparison ML (Maximum Likelihood): Easy to compute MAP (Maximum A Posteriori): Still easy to compute Incorporates prior knowledge Bayesian: Minimizes error => great when data is scarce Potentially much harder to compute
91
Summary For Now Maximum Likelihood Estimate: prior is uniform; uses the most likely hypothesis. Maximum A Posteriori Estimate: any prior; uses the most likely hypothesis. Bayesian Estimate: any prior; uses a weighted combination of hypotheses.
92
© Daniel S. Weld 92 Topics Bayesian networks overview Inference Variable elimination Junction trees Parameter Estimation: Maximum Likelihood (ML) Maximum A Posteriori (MAP) Bayesian Learning Parameters for a Bayesian Network Learning Structure of Bayesian Networks
93
Parameter Estimation and Bayesian Networks
Data (one row per record, columns E B R A J M):
T F T T F T
F F F F F T
F T F T T T
F F F T T T
F T F F F F
...
We have: the Bayes Net structure and observations. We need: the Bayes Net parameters.
94
Parameter Estimation and Bayesian Networks (same data as above) P(B) = ? Prior + data: now compute either the MAP or the Bayesian estimate.
95
What Prior to Use? The following are two common priors. Binary variable: Beta prior; it is conjugate to the binomial likelihood, so the posterior is again a Beta and is easy to compute. Discrete variable: Dirichlet prior; it is conjugate to the multinomial likelihood, so the posterior is again a Dirichlet and is easy to compute. © Daniel S. Weld 95
96
One Prior: Beta Distribution Beta(p; a, b) = Γ(a+b) / (Γ(a) Γ(b)) · p^(a-1) (1-p)^(b-1). For any positive integer y, Γ(y) = (y-1)!.
98
Beta Distribution Example: flip a coin with a Beta distribution as the prior over p [prob(heads)]. 1. Parameterized by two positive numbers a, b. 2. The mean of the distribution, E[p], is a/(a+b). 3. Specify our prior belief as p = a/(a+b). 4. Specify confidence in this belief with high initial values for a and b. Updating our prior belief based on data: increment a for every heads outcome, increment b for every tails outcome. So after h heads out of n flips, our posterior distribution says P(heads) = (a+h)/(a+b+n).
99
Parameter Estimation and Bayesian Networks (same data as above) P(B) = ? Prior Beta(1,4) + data (B is true in 2 of the 5 records) = posterior Beta(3,7), giving P(B) = 0.3, P(¬B) = 0.7.
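A minimal sketch of the conjugate update on this slide: the Beta(1,4) prior plus two B=true and three B=false records gives Beta(3,7) and a point estimate of 0.3. The helper name is my own.

```python
def beta_update(a, b, successes, failures):
    """Beta(a, b) prior + Bernoulli data -> Beta(a + successes, b + failures).
    Returns the posterior hyperparameters and the posterior mean."""
    a_post, b_post = a + successes, b + failures
    return a_post, b_post, a_post / (a_post + b_post)

# P(B): prior Beta(1, 4), and B is true in 2 of the 5 records in the table above.
print(beta_update(1, 4, successes=2, failures=3))   # (3, 7, 0.3)
```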
100
Parameter Estimation and Bayesian Networks (same data as above) P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ?
101
Parameter Estimation and Bayesian Networks (same data as above) P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? For example, with a Beta(2,3) prior, a row of the CPT whose matching records contain one alarm and one non-alarm has posterior Beta(3,4).
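The same idea applied to every row of P(A | E, B): group the records by the parents' values and run the conjugate update per group. A sketch follows; the data rows are copied from the table above, and applying the Beta(2,3) prior uniformly to all four rows is an assumption.

```python
# Each record is (E, B, R, A, J, M), matching the five rows shown earlier.
data = [
    (True,  False, True,  True,  False, True),
    (False, False, False, False, False, True),
    (False, True,  False, True,  True,  True),
    (False, False, False, True,  True,  True),
    (False, True,  False, False, False, False),
]
a0, b0 = 2, 3                                   # assumed Beta(2,3) prior for every row
for e in (True, False):
    for b in (True, False):
        rows = [r for r in data if r[0] == e and r[1] == b]
        alarms = sum(r[3] for r in rows)        # records with A = true
        quiet = len(rows) - alarms
        mean = (a0 + alarms) / (a0 + b0 + len(rows))
        print(f"P(A | E={e}, B={b}) = {mean:.2f}   posterior Beta({a0 + alarms},{b0 + quiet})")
# Rows with no matching records simply keep the prior Beta(2,3).
```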
102
What if we don’t know structure?
103
Learning The Structure of Bayesian Networks Search thru the space… of possible network structures! (For now, assume we observe all variables.) For each structure, learn parameters. Pick the one that fits the observed data best. Caveat: won't we end up fully connected? When scoring, add a penalty for model complexity. Problem!?
104
Learning The Structure of Bayesian Networks Search thru the space For each structure, learn parameters Pick the one that fits observed data best Problem? Exponential number of networks! And we need to learn parameters for each! Exhaustive search out of the question! So what now?
105
Structure Learning as Search Local Search 1.Start with some network structure 2.Try to make a change (add or delete or reverse edge) 3.See if the new network is any better What should the initial state be? Uniform prior over random networks? Based on prior knowledge? Empty network? How do we evaluate networks? © Daniel S. Weld 105
106
(Diagram: several candidate network structures over A, B, C, D, E.)
107
Score Functions Bayesian Information Criterion (BIC): log P(D | BN) minus a penalty, where penalty = ½ (# parameters) · log(# data points). MAP score: P(BN | D) ∝ P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well. © Daniel S. Weld 107
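A small sketch of the BIC score as defined above, with higher scores being better; the two candidate log-likelihoods and parameter counts are made-up numbers for illustration.

```python
import math

def bic_score(log_likelihood, num_params, num_data):
    """BIC: log P(D | BN) minus 1/2 * (#parameters) * log(#data points)."""
    return log_likelihood - 0.5 * num_params * math.log(num_data)

# Two hypothetical structures fit to the same 1000 records (log-likelihoods made up):
print(bic_score(-2312.0, num_params=10, num_data=1000))   # sparser network
print(bic_score(-2305.0, num_params=25, num_data=1000))   # denser network, larger penalty
```

Here the denser network fits slightly better but its extra parameters cost more than the fit gains, so the sparser network gets the higher BIC score.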
108
Naïve Bayes Class Value with features F1, F2, …, FN as children. Assume that the features are conditionally independent given the class variable. Works well in practice. Forces probabilities towards 0 and 1.
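A bare-bones sketch of the model: maximum-likelihood parameters for Boolean features, and prediction as P(class) times the product of per-feature likelihoods. Because many independent factors are multiplied together, the normalized posterior tends toward 0 or 1, which is the "forces probabilities towards 0 and 1" effect noted above. No smoothing is applied here; that is a simplifying assumption.

```python
from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Maximum-likelihood Naive Bayes for Boolean features (no smoothing)."""
    n = len(labels)
    class_count = Counter(labels)
    true_count = defaultdict(Counter)            # class -> feature index -> #True
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            true_count[y][i] += int(v)
    prior = {y: c / n for y, c in class_count.items()}
    cond = {y: [true_count[y][i] / class_count[y] for i in range(len(rows[0]))]
            for y in class_count}
    return prior, cond

def predict(prior, cond, row):
    """P(class | f1..fN) proportional to P(class) * product_i P(f_i | class)."""
    score = {y: prior[y] for y in prior}
    for y in prior:
        for i, v in enumerate(row):
            score[y] *= cond[y][i] if v else 1 - cond[y][i]
    z = sum(score.values()) or 1.0
    return {y: s / z for y, s in score.items()}
```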
109
Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997] Class Value with features F1, …, FN, plus a tree of dependencies among the features. Models a limited set of dependencies. Guaranteed to find the best such structure. Runs in polynomial time.