Bayesian Networks and Statistical Learning, CSE 573. Based on lecture notes from David Page and Dan Weld.


© Daniel S. Weld 2 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Learning parameters for a Bayesian network; Learning the structure of Bayesian networks

© Daniel S. Weld 3 Inference by Enumeration P(toothache ∨ cavity) = .20 + ??

Problems with Enumeration Worst-case time: O(d^n), where d = max arity of the random variables (e.g., d = 2 for Boolean T/F) and n = number of random variables. Space complexity is also O(d^n), the size of the joint distribution. Problem: it is hard or impossible to estimate all O(d^n) entries for large problems.

© Daniel S. Weld 5 Independence A and B are independent iff P(A | B) = P(A) and P(B | A) = P(B). These two constraints are logically equivalent. Therefore, if A and B are independent: P(A, B) = P(A) P(B).

© Daniel S. Weld 6 Conditional Independence P(A, B | C) = P(A | C) P(B | C), or equivalently P(A | B, C) = P(A | C). Often, using conditional independence reduces the storage complexity of the joint distribution from exponential to linear!! Conditional independence is the most basic & robust form of knowledge about uncertain environments.

© Daniel S. Weld 7 Bayes Nets In general, a joint distribution P over a set of variables (X1 × ... × Xn) requires exponential space for representation & inference. BNs provide a graphical representation of the conditional independence relations in P: 1. usually quite compact; 2. requires assessment of fewer parameters, those being quite natural (e.g., causal); 3. efficient (usually) inference: query answering and belief update.

© Daniel S. Weld 8 An Example Bayes Net Earthquake Burglary Alarm Nbr1Calls Nbr2Calls Radio. Pr(B=t) Pr(B=f). Pr(A|E,B): e,b 0.9 (0.1); e,¬b 0.2 (0.8); ¬e,b 0.85 (0.15); ¬e,¬b 0.01 (0.99)

© Daniel S. Weld 9 Earthquake Example (con't) If I know whether the Alarm went off, no other evidence influences my degree of belief in Nbr1Calls: P(N1|N2,A,E,B) = P(N1|A); also P(N2|N1,A,E,B) = P(N2|A) and P(E|B) = P(E). By the chain rule we have P(N1,N2,A,E,B) = P(N1|N2,A,E,B) · P(N2|A,E,B) · P(A|E,B) · P(E|B) · P(B) = P(N1|A) · P(N2|A) · P(A|E,B) · P(E) · P(B). The factored joint requires only 10 parameters (cf. 63 for the full joint table). Earthquake Burglary Alarm Nbr1Calls Nbr2Calls Radio

© Daniel S. Weld 10 Bayesian Networks The graphical structure of a BN reflects conditional independence among variables. Each variable X is a node in the DAG. Edges denote direct probabilistic influence (usually interpreted causally); the parents of X are denoted Par(X). Each node X has a conditional probability distribution P(X | Par(X)). X is conditionally independent of all non-descendants given its parents.

© Daniel S. Weld 11 Conditional Probability Tables Earthquake Burglary Alarm Nbr1Calls Nbr2Calls Radio. Pr(B=t) Pr(B=f). Pr(A|E,B): e,b 0.9 (0.1); e,¬b 0.2 (0.8); ¬e,b 0.85 (0.15); ¬e,¬b 0.01 (0.99)

© Daniel S. Weld 12 Conditional Probability Tables For a complete specification of the joint distribution, quantify the BN: for each variable X, specify the CPT P(X | Par(X)); the number of parameters is locally exponential in |Par(X)|. If X1, X2, ..., Xn is any topological sort of the network, then we are assured: P(Xn, Xn-1, ..., X1) = P(Xn | Xn-1, ..., X1) · P(Xn-1 | Xn-2, ..., X1) · ... · P(X2 | X1) · P(X1) = P(Xn | Par(Xn)) · P(Xn-1 | Par(Xn-1)) · ... · P(X1)
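As a hedged illustration of this factorization (not from the slides), here is a minimal Python sketch that computes one entry of the joint for the earthquake network using the chain rule. The P(A=true | E,B) entries follow the slide's CPT (0.9, 0.2, 0.85, 0.01), with their assignment to parent-value combinations reconstructed from the transcript; the priors on B and E and the neighbour-call probabilities are made-up placeholders, and Radio is omitted to match the five-variable chain rule above.

```python
# Minimal sketch (not from the slides) of the chain-rule factorization:
# P(N1,N2,A,E,B) = P(N1|A) P(N2|A) P(A|E,B) P(E) P(B)
P_B_true = 0.01                                   # assumed prior P(B=true)
P_E_true = 0.02                                   # assumed prior P(E=true)
P_A_true = {(True, True): 0.9, (True, False): 0.2,      # P(A=true | E=e, B=b); values from the slide's CPT,
            (False, True): 0.85, (False, False): 0.01}  # mapping to (e, b) combinations reconstructed
P_N1_true = {True: 0.9, False: 0.05}              # assumed P(N1=true | A=a)
P_N2_true = {True: 0.7, False: 0.01}              # assumed P(N2=true | A=a)

def bern(p_true, value):
    """P(X = value) for a Boolean variable with P(X = true) = p_true."""
    return p_true if value else 1.0 - p_true

def joint(n1, n2, a, e, b):
    """One entry of the joint, built from the local CPTs (10 parameters in total)."""
    return (bern(P_N1_true[a], n1) * bern(P_N2_true[a], n2) *
            bern(P_A_true[(e, b)], a) * bern(P_E_true, e) * bern(P_B_true, b))

# Example: both neighbours call, the alarm rings, burglary but no earthquake.
print(joint(True, True, True, False, True))
```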

© Daniel S. Weld 13 Given Parents, X is Independent of Non-Descendants

© Daniel S. Weld 14 For Example Earthquake Burglary Alarm Nbr1Calls Nbr2Calls Radio

© Daniel S. Weld 15 For Example Earthquake Burglary Alarm Nbr1Calls Nbr2Calls Radio

© Daniel S. Weld 16 Given Markov Blanket, X is Independent of All Other Nodes MB(X) = Par(X) ∪ Children(X) ∪ Par(Children(X))

© Daniel S. Weld 17 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Learning parameters for a Bayesian network; Learning the structure of Bayesian networks

© Daniel S. Weld 18 Inference in BNs The graphical independence representation yields efficient inference schemes We generally want to compute Pr(X), or Pr(X|E) where E is (conjunctive) evidence Computations organized by network topology Two simple algorithms: Variable elimination (VE) Junction trees

© Daniel S. Weld 19 Variable Elimination A factor is a function from some set of variables to a value, e.g., f(E,A,N1). CPTs are factors, e.g., P(A|E,B) is a function of A, E, B. VE works by eliminating all variables in turn until there is a factor over only the query variable.

Joint Distributions & CPDs vs. Potentials A CPT for P(B | A) and a joint distribution represent probability distributions: for a CPT, for each specific setting of the parents, the values of the child must sum to 1; for a joint, all entries sum to 1. Potentials occur when we temporarily forget the meaning associated with the table: they must be non-negative, but they do not have to sum to 1. Potentials arise when incorporating evidence.

Multiplying Potentials (worked example: a potential over A,B and a potential over B,C are multiplied entry by entry on matching values of the shared variable B, giving a potential over A,B,C; table values omitted)

Marginalize (sum out) a variable: summing a potential over A,B over the values of A gives a potential over B alone. Normalize a potential: multiply by α so that the entries sum to 1. (Worked tables omitted.)
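A minimal sketch of the three potential operations just described (multiplication, summing out, normalization). The representation of a potential as a tuple of variable names plus a table keyed by truth assignments is my own choice, not from the slides.

```python
from itertools import product as cartesian

# A potential is a pair (vars, table): vars is a tuple of variable names and
# table maps each full assignment (a tuple of True/False values) to a number.

def multiply(p1, p2):
    """Pointwise product of two potentials, matching on their shared variables."""
    v1, t1 = p1
    v2, t2 = p2
    out_vars = v1 + tuple(v for v in v2 if v not in v1)
    table = {}
    for assign in cartesian([True, False], repeat=len(out_vars)):
        env = dict(zip(out_vars, assign))
        table[assign] = (t1[tuple(env[v] for v in v1)] *
                         t2[tuple(env[v] for v in v2)])
    return out_vars, table

def sum_out(p, var):
    """Marginalize (sum out) one variable from a potential."""
    vars_, table = p
    i = vars_.index(var)
    out = {}
    for assign, val in table.items():
        key = assign[:i] + assign[i + 1:]
        out[key] = out.get(key, 0.0) + val
    return vars_[:i] + vars_[i + 1:], out

def normalize(p):
    """Multiply by alpha so that the entries sum to 1."""
    vars_, table = p
    z = sum(table.values())
    return vars_, {k: v / z for k, v in table.items()}
```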

Key Observation Σ_a (P1 × P2) = (Σ_a P1) × P2 if A is not in P2: summation over a variable can be pushed inside the product, past any factors that do not mention that variable. (Worked tables omitted.)

Variable Elimination Procedure The initial potentials are the CPTs in the BN. Repeat until only the query variable remains: 1. Choose another variable to eliminate. 2. Multiply all potentials that contain the variable. 3. If there is no evidence for the variable, sum the variable out and replace the original potentials by the new result. 4. Else, remove the variable based on the evidence. Finally, normalize the remaining potential to get the final distribution over the query variable.
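A sketch of this elimination loop, reusing multiply, sum_out, and normalize from the previous snippet; the restrict helper for evidence handling and the function names are my own assumptions.

```python
# Sketch of the procedure above, reusing multiply, sum_out, and normalize from
# the previous snippet; restrict (evidence handling) is an added helper.

def restrict(p, var, value):
    """Keep only rows consistent with evidence var = value and drop that variable."""
    vars_, table = p
    i = vars_.index(var)
    out = {a[:i] + a[i + 1:]: v for a, v in table.items() if a[i] == value}
    return vars_[:i] + vars_[i + 1:], out

def variable_elimination(potentials, order, evidence=None):
    """Eliminate the variables in `order` (all non-query variables); return the
    normalized potential over whatever variables remain (the query)."""
    evidence = evidence or {}
    for var in order:
        touched = [p for p in potentials if var in p[0]]
        rest = [p for p in potentials if var not in p[0]]
        if not touched:
            continue
        combined = touched[0]
        for p in touched[1:]:
            combined = multiply(combined, p)                   # step 2: multiply
        if var in evidence:
            combined = restrict(combined, var, evidence[var])  # step 4: evidence
        else:
            combined = sum_out(combined, var)                  # step 3: sum out
        potentials = rest + [combined]
    result = potentials[0]
    for p in potentials[1:]:
        result = multiply(result, p)
    return normalize(result)
```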

Example network over A, B, C, D, E, F (CPTs omitted): P(A,B,C,D,E,F) = P(A) P(B|A) P(C|A) P(D|B) P(E|C) P(F|D,E)

Query: P(F | C = true). Elimination ordering: A, B, C, D, E. P(F) = Σ_{a,b,c,d,e} P(F|D,E) P(D|B) P(E|C) P(B|A) P(C|A) P(A) = Σ_e [ Σ_d [ Σ_c [ Σ_b [ ( Σ_a P(B|A) P(C|A) P(A) ) P(D|B) ] P(E|C) ] P(F|D,E) ] ]

Query: P(F | C = true), elimination ordering A, B, C, D, E. Before eliminating A, multiply all potentials involving A (P(A), P(B|A), P(C|A)), then sum out A. (Tables omitted.)

Next, eliminate B: multiply all potentials involving B, then sum out B. (Tables omitted.)

Next, eliminate C: multiply all potentials involving C. We have evidence that C = true, so instead of summing out, drop the ¬c entries. (Tables omitted.)

Next, eliminate D: multiply all potentials involving D, then sum out D. (Tables omitted.)

Finally, eliminate E by summing it out, then normalize the remaining potential over F with α to obtain P(F | C = true). (Tables omitted.)

© Daniel S. Weld 38 Notes on VE Each operation is a simple multiplication of factors followed by summing out a variable. Complexity is determined by the size of the largest factor (in the example, 3 variables, not 5): linear in the number of variables, exponential in the size of the largest factor. The elimination ordering greatly impacts factor size; finding an optimal elimination ordering is NP-hard, so we use heuristics or exploit special structure (e.g., polytrees). Practically, inference is much more tractable using structure of this sort.

Junction Trees: Motivation Standard algorithms (e.g., variable elimination) are inefficient if the undirected graph underlying the Bayes Net contains cycles We can avoid cycles if we turn highly- interconnected subsets of the nodes into “supernodes”

Step 1: Make the Graph Moral A F E C D B Add an edge between non-adjacent parents of the same child

Step 2: Remove Directionality A F E C D B

Step 3: Triangulate the Graph A F E C D B While there is a cycle of length > 3 with no chord, add a chord

Triangulation Checking The following algorithm succeeds if and only if the graph is triangulated: Choose any node in the graph and label it 1. For i = 2 to n: (1) choose the unlabeled node with the most labeled neighbors and label it i; (2) if any two labeled neighbors of i are not adjacent to each other, fail. Otherwise, succeed. © Daniel S. Weld 45
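A sketch of this checking procedure (a maximum-cardinality-search style test). The adjacency-dict representation of the undirected graph and the function name are my own assumptions.

```python
# The graph is an adjacency dict: each node maps to the set of its neighbours.

def is_triangulated(adj):
    """Label nodes 1..n, always picking the node with the most labeled neighbours;
    fail if any two labeled neighbours of the new node are not adjacent."""
    nodes = list(adj)
    if not nodes:
        return True
    labeled = [nodes[0]]                     # label an arbitrary node 1
    unlabeled = set(nodes[1:])
    while unlabeled:
        nxt = max(unlabeled, key=lambda n: len(adj[n] & set(labeled)))
        prev = adj[nxt] & set(labeled)
        for u in prev:                       # the already-labeled neighbours
            for v in prev:                   # must be pairwise adjacent
                if u != v and v not in adj[u]:
                    return False
        labeled.append(nxt)
        unlabeled.remove(nxt)
    return True
```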

Is It Triangulated Yet? A F E C D B (worked example: the nodes are labeled 1, 2, 3, ... following the procedure above)

Triangulation: Key Points In general, many triangulations may exist The only efficient algorithms are heuristic Jensen and Jensen (1994) showed that any scheme for exact inference (belief updating given evidence) must perform triangulation (perhaps hidden as in Draper 1995)

Step 4: Build the Clique Graph A F E C D B Find all cliques in moralized, triangulated graph If two cliques intersect, they are joined by an edge labeled with their intersection Clique: maximal complete subgraph (e.g., ABC, BCD)

Step 4: Build the Clique Graph Cliques: ABC, BCD, CDE, DEF; edges labeled with their separators B,C; C,D; D,E; C; D

Junction Trees A junction tree is a subgraph of the clique graph that 1.Is a tree 2.Contains all the nodes of the clique graph 3.Satisfies the junction tree property Junction tree property: For each pair U, V of cliques with intersection S, all cliques on the path between U and V contain S

Clique Graph to Junction Tree We can perform exact inference efficiently on a junction tree (although CPTs may be large). But can we always build a junction tree? If so, how? Let the weight of an edge in the clique graph be the cardinality of the separator. Then any maximum-weight spanning tree is a junction tree (Jensen & Jensen 1994).
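A sketch of this construction: a Kruskal-style maximum-weight spanning tree over the clique graph, with separator size as edge weight. The clique representation, union-find helper, and function names are my own choices, not from the slides.

```python
# cliques: list of frozensets of variables. Returns junction-tree edges
# (i, j, separator), chosen greedily by separator size (Kruskal style).

def junction_tree(cliques):
    edges = []
    for i in range(len(cliques)):
        for j in range(i + 1, len(cliques)):
            sep = cliques[i] & cliques[j]
            if sep:
                edges.append((len(sep), i, j, sep))
    edges.sort(key=lambda e: e[0], reverse=True)   # heaviest separators first

    parent = list(range(len(cliques)))             # union-find for cycle detection
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for _, i, j, sep in edges:
        ri, rj = find(i), find(j)
        if ri != rj:                               # keep the structure a forest
            parent[ri] = rj
            tree.append((i, j, sep))
    return tree

# The cliques of the running example give the chain ABC - BCD - CDE - DEF:
print(junction_tree([frozenset("ABC"), frozenset("BCD"),
                     frozenset("CDE"), frozenset("DEF")]))
```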

Step 5: Build the Junction Tree Cliques ABC, BCD, CDE, DEF connected in a chain with separators B,C; C,D; D,E

Step 6: Choose a Root ABC CDE DEF BCD C,D B,C D,E

Step 7: Populate Clique Nodes For each distribution (CPT) in the original Bayes Net, put this distribution into one of the clique nodes that contains all the variables referenced by the CPT. (At least one such node must exist because of the moralization step). For each clique node, take the product of the distributions (as in variable elimination).

Step 8: Assign CPTs The product P(A) P(B|A) P(C|A) = P(A,B,C) is assigned to clique ABC; P(D|B) to BCD; P(E|C) to CDE; P(F|D,E) to DEF. (Separators B,C; C,D; D,E; table values omitted.)

Junction Tree Inference Algorithm Incorporate Evidence: For each evidence variable, go to one table that includes that variable, set to 0 all entries that disagree with the evidence, and renormalize this potential. Upward Step: pass messages to parents. Downward Step: pass messages to children.

Upward Step Each leaf sends a message to its parent Message is the marginal of its table, summing out any variable not in the separator When a parent receives a message from a child, it multiplies its table by the message table to obtain its new table When a parent receives messages from all its children, it repeats the process This process continues until the root receives messages from all its children © Daniel S. Weld 62

Downward Step The root sends a message to each child: the root divides its current table by the message received from that child, marginalizes the resulting table to the separator, and sends this to the child. The child multiplies the message from its parent by the child's current table. The process repeats (the child acts as root) and continues until all leaves receive messages from their parents. © Daniel S. Weld 63

Answering Queries: Final Step With junction tree, can query any variable Find clique node containing that variable and sum out the other variables to obtain answer If given new evidence, we must repeat the Upward-Downward process Only need to compute junction tree once! A junction tree can be thought of as storing the subjoints computed during elimination See Finn V. Jensen “Bayesian Networks and Decision Graphs” for algorithm description

© Daniel S. Weld 65 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Learning parameters for a Bayesian network; Learning the structure of Bayesian networks

Coin Flip Which coin will I use? Three coins C1, C2, C3 with P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9 and P(C1) = P(C2) = P(C3) = 1/3. Prior: Probability of a hypothesis before we make any observations

Coin Flip Which coin will I use? Three coins C1, C2, C3 with P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9 and P(C1) = P(C2) = P(C3) = 1/3. Uniform Prior: All hypotheses are equally likely before we make any observations

Experiment 1: Heads Which coin did I use? P(C1|H) = ?, P(C2|H) = ?, P(C3|H) = ?. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3

Experiment 1: Heads Which coin did I use? P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.6. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Posterior: Probability of a hypothesis given data

Terminology Prior: Probability of a hypothesis before we see any data. Uniform Prior: A prior that makes all hypotheses equally likely. Posterior: Probability of a hypothesis after we see some data. Likelihood: Probability of the data given a hypothesis.

Experiment 2: Tails Which coin did I use? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. P(C1|HT) = ?, P(C2|HT) = ?, P(C3|HT) = ?

Experiment 2: Tails Which coin did I use? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. P(C1|HT) = 0.21, P(C2|HT) = 0.58, P(C3|HT) = 0.21

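A small sketch that reproduces the numbers above: repeated application of Bayes' rule to the three-coin hypothesis space, starting from the uniform prior. The function and variable names are my own.

```python
p_heads = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
prior   = {"C1": 1/3, "C2": 1/3, "C3": 1/3}

def update(belief, outcome):
    """One step of Bayes' rule: posterior is proportional to likelihood times prior."""
    unnorm = {c: belief[c] * (p_heads[c] if outcome == "H" else 1 - p_heads[c])
              for c in belief}
    z = sum(unnorm.values())
    return {c: v / z for c, v in unnorm.items()}

after_h  = update(prior, "H")     # approx. C1: 0.066, C2: 0.333, C3: 0.600
after_ht = update(after_h, "T")   # approx. C1: 0.21,  C2: 0.58,  C3: 0.21
print(after_h, after_ht)
```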
Your Estimate? What is the probability of heads after two experiments? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 1/3, P(C2) = 1/3, P(C3) = 1/3. Most likely coin: C2. Best estimate for P(H): P(H|C2) = 0.5

Your Estimate? Most likely coin: C2, with P(C2) = 1/3. Best estimate for P(H): P(H|C2) = 0.5. Maximum Likelihood Estimate: The best hypothesis that fits the observed data, assuming a uniform prior

Using Prior Knowledge Should we always use a uniform prior? Background knowledge: Heads => we have a take-home midterm. Dan likes take-homes… => Dan is more likely to use a coin biased in his favor. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9

Using Prior Knowledge We can encode it in the prior: P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9

Experiment 1: Heads Which coin did I use? P(C1|H) = ?, P(C2|H) = ?, P(C3|H) = ?. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70

Experiment 1: Heads Which coin did I use? P(C1|H) = 0.006, P(C2|H) = 0.165, P(C3|H) = 0.829. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. Compare with the posterior under the uniform prior after Experiment 1: P(C1|H) = 0.066, P(C2|H) = 0.333, P(C3|H) = 0.6

Experiment 2: Tails Which coin did I use? P(C1|HT) = ?, P(C2|HT) = ?, P(C3|HT) = ?. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70

Experiment 2: Tails Which coin did I use? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485

Your Estimate? What is the probability of heads after two experiments? P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9; P(C1) = 0.05, P(C2) = 0.25, P(C3) = 0.70. Most likely coin: C3. Best estimate for P(H): P(H|C3) = 0.9

Your Estimate? Most likely coin: C3, with P(C3) = 0.70. Best estimate for P(H): P(H|C3) = 0.9. Maximum A Posteriori (MAP) Estimate: The best hypothesis that fits the observed data, assuming a non-uniform prior

Did We Do The Right Thing? P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. C2 and C3 are almost equally likely

A Better Estimate P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. Recall: P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. A better estimate averages over the coins: P(H) = Σ_i P(H|C_i) P(C_i|HT)

Bayesian Estimate P(C1|HT) = 0.035, P(C2|HT) = 0.481, P(C3|HT) = 0.485. P(H|C1) = 0.1, P(H|C2) = 0.5, P(H|C3) = 0.9. P(H) = Σ_i P(H|C_i) P(C_i|HT) ≈ 0.68. Bayesian Estimate: Minimizes prediction error, given data and (generally) assuming a non-uniform prior
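A short sketch of this weighted estimate, using the posterior values from the slides (which came from the 0.05/0.25/0.70 prior).

```python
p_heads      = {"C1": 0.1, "C2": 0.5, "C3": 0.9}
posterior_ht = {"C1": 0.035, "C2": 0.481, "C3": 0.485}   # posteriors from the slides

# Weight each coin's P(H) by its posterior probability instead of picking one coin.
p_next_heads = sum(posterior_ht[c] * p_heads[c] for c in p_heads)
print(round(p_next_heads, 2))   # about 0.68, the Bayesian value in the comparison slide
```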

Comparison After more experiments, the sequence becomes HT followed by 8 more heads (10 flips total). ML (Maximum Likelihood): P(H) = 0.5; after the 10 experiments, P(H) = 0.9. MAP (Maximum A Posteriori): P(H) = 0.9; after the 10 experiments, P(H) = 0.9. Bayesian: P(H) = 0.68; after the 10 experiments, P(H) = 0.9.

Comparison ML (Maximum Likelihood): Easy to compute MAP (Maximum A Posteriori): Still easy to compute Incorporates prior knowledge Bayesian: Minimizes error => great when data is scarce Potentially much harder to compute

Summary For Now Maximum Likelihood Estimate: uniform prior; uses the most likely hypothesis. Maximum A Posteriori Estimate: any prior; uses the most likely hypothesis. Bayesian Estimate: any prior; uses a weighted combination of hypotheses.

© Daniel S. Weld 92 Topics: Bayesian networks overview; Inference: variable elimination, junction trees; Parameter estimation: maximum likelihood (ML), maximum a posteriori (MAP), Bayesian learning; Learning parameters for a Bayesian network; Learning the structure of Bayesian networks

Parameter Estimation and Bayesian Networks Data over the variables E, B, R, A, J, M (one row per observation): T F T T F T / F F F F F T / F T F T T T / F F F T T T / F T F F F F / ... We have: the Bayes net structure and observations. We need: the Bayes net parameters.

Parameter Estimation and Bayesian Networks (E B R A J M data as above) P(B) = ? Prior + data: now compute either the MAP or the Bayesian estimate.

What Prior to Use? The following are two common priors. For a binary variable: a Beta prior; the likelihood is binomial and the posterior is again a Beta, so the posterior is easy to compute. For a discrete variable: a Dirichlet prior; the likelihood is multinomial and the posterior is again a Dirichlet, so the posterior is easy to compute. © Daniel S. Weld 95

One Prior: Beta Distribution Beta(p; a, b) ∝ p^(a-1) (1-p)^(b-1), with normalizing constant Γ(a+b) / (Γ(a) Γ(b)). For any positive integer y, Γ(y) = (y-1)!

Beta Distribution Example: flip a coin with a Beta distribution as the prior over p = prob(heads). 1. Parameterized by two positive numbers a, b. 2. The mean of the distribution, E[p], is a/(a+b). 3. Specify our prior belief for p as a/(a+b). 4. Specify confidence in this belief with high initial values for a and b. Updating our prior belief based on data: increment a for every heads outcome, increment b for every tails outcome. So after h heads out of n flips, our posterior distribution gives P(heads) = (a+h)/(a+b+n)
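A minimal sketch of this counting update; the Beta(1,1) starting point and the example flip sequence are illustrative assumptions, not from the slides.

```python
def beta_update(a, b, flips):
    """Start from Beta(a, b); add 1 to a for each heads and 1 to b for each tails."""
    for f in flips:
        if f == "H":
            a += 1
        else:
            b += 1
    return a, b

a, b = beta_update(1, 1, "HTHHH")   # Beta(1,1) is a uniform prior over p
print(a / (a + b))                  # posterior mean (a+h)/(a+b+n) = 5/7
```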

Parameter Estimation and Bayesian Networks (E B R A J M data as above) P(B) = ? Prior Beta(1,4) + data = Beta(3,7), giving the estimate P(B) = 0.3, P(¬B) = 0.7

Parameter Estimation and Bayesian Networks (E B R A J M data as above) P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ?

Parameter Estimation and Bayesian Networks (E B R A J M data as above) P(A|E,B) = ? P(A|E,¬B) = ? P(A|¬E,B) = ? P(A|¬E,¬B) = ? Prior Beta(2,3) + data = Beta(3,4)

What if we don’t know structure?

Learning The Structure of Bayesian Networks Search through the space of possible network structures! (For now, assume we observe all variables.) For each structure, learn parameters; pick the one that fits the observed data best. Caveat: won't we end up fully connected? When scoring, add a penalty for model complexity. Problem?

Learning The Structure of Bayesian Networks Search through the space; for each structure, learn parameters; pick the one that fits the observed data best. Problem? There are exponentially many networks, and we need to learn parameters for each! Exhaustive search is out of the question. So what now?

Structure Learning as Search Local Search: 1. Start with some network structure. 2. Try to make a change (add, delete, or reverse an edge). 3. See if the new network is any better. What should the initial state be? A uniform prior over random networks? Based on prior knowledge? An empty network? How do we evaluate networks? © Daniel S. Weld 105
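A hedged sketch of this local search as greedy hill climbing over single-edge changes. The edge-set representation, the acyclicity check, and the score callback are placeholders of my own; any scoring function (for example, the BIC score defined below) could be plugged in.

```python
def neighbours(edges, nodes):
    """All structures reachable by adding, deleting, or reversing one edge."""
    edges = set(edges)
    for x in nodes:
        for y in nodes:
            if x == y:
                continue
            if (x, y) in edges:
                yield edges - {(x, y)}                    # delete an edge
                yield (edges - {(x, y)}) | {(y, x)}       # reverse an edge
            elif (y, x) not in edges:
                yield edges | {(x, y)}                    # add an edge

def is_acyclic(edges, nodes):
    """Reject candidate structures that contain a directed cycle (DFS check)."""
    children = {n: [y for (x, y) in edges if x == n] for n in nodes}
    state = {n: 0 for n in nodes}        # 0 = unseen, 1 = on stack, 2 = done
    def dfs(n):
        state[n] = 1
        for c in children[n]:
            if state[c] == 1 or (state[c] == 0 and not dfs(c)):
                return False
        state[n] = 2
        return True
    return all(state[n] == 2 or dfs(n) for n in nodes)

def hill_climb(nodes, score, start=frozenset()):
    """Greedy local search: keep taking the best single-edge change."""
    current = set(start)
    while True:
        best, best_score = None, score(current)
        for cand in neighbours(current, nodes):
            if is_acyclic(cand, nodes) and score(cand) > best_score:
                best, best_score = cand, score(cand)
        if best is None:
            return current               # no neighbour improves the score
        current = best
```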

(Figure: five candidate network structures over the variables A, B, C, D, E.)

Score Functions Bayesian Information Criterion (BIC): log P(D | BN) minus a penalty, where Penalty = ½ (# parameters) log (# data points). MAP score: P(BN | D) = P(D | BN) P(BN); P(BN) must decay exponentially with the number of parameters for this to work well. © Daniel S. Weld 107
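A small sketch of the BIC score as defined above; the log-likelihood, parameter count, and data count are assumed to be computed elsewhere, and the function name is my own.

```python
import math

def bic_score(log_likelihood, num_params, num_data_points):
    """log P(D | BN) minus half the parameter count times log of the data count."""
    return log_likelihood - 0.5 * num_params * math.log(num_data_points)
```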

Naïve Bayes F1 F2 F3 ... FN-2 FN-1 FN, Class Value. Assume that the features are conditionally independent given the class variable. Works well in practice. Forces probabilities towards 0 and 1.
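A minimal Naïve Bayes sketch for Boolean features matching the structure shown above (the class as the single parent of every feature). The Laplace (Beta(1,1)-style) smoothing and all names are my own additions, not from the slides.

```python
from collections import defaultdict

def train(examples):
    """examples: list of (boolean_feature_tuple, class_label). Returns (priors, cpts)."""
    class_counts = defaultdict(int)
    true_counts = defaultdict(lambda: defaultdict(int))   # class -> feature index -> #True
    for feats, label in examples:
        class_counts[label] += 1
        for i, f in enumerate(feats):
            if f:
                true_counts[label][i] += 1
    n = len(examples)
    num_feats = len(examples[0][0])
    priors = {c: class_counts[c] / n for c in class_counts}
    cpts = {c: {i: (true_counts[c][i] + 1) / (class_counts[c] + 2)   # smoothed P(F_i = true | c)
                for i in range(num_feats)}
            for c in class_counts}
    return priors, cpts

def predict(priors, cpts, feats):
    """Return the class maximizing P(c) * product_i P(f_i | c)."""
    def score(c):
        p = priors[c]
        for i, f in enumerate(feats):
            p *= cpts[c][i] if f else 1 - cpts[c][i]
        return p
    return max(priors, key=score)
```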

Tree Augmented Naïve Bayes (TAN) [Friedman, Geiger & Goldszmidt 1997] F1 F2 F3 ... FN-2 FN-1 FN, Class Value. Models a limited set of dependencies. Guaranteed to find the best structure within this class. Runs in polynomial time.