UIUC CS 497: Section EA Lecture #6

UIUC CS 497: Section EA Lecture #6 Reasoning in Artificial Intelligence Professor: Eyal Amir Spring Semester 2004 (Based on slides by Lise Getoor and Alvaro Cardenas (UMD), in turn based on slides by Nir Friedman (Hebrew U))

Last Time Tree decomposition in first-order logic Applications: provably better computational bounds when treewidth is low in propositional logic; eliminating possible interactions between clauses Applications: planning, spatial reasoning

Today Probabilistic graphical models Treewidth methods: Variable elimination Clique tree algorithm Applications du jour: Sensor Networks

Independent Random Variables Two variables X and Y are independent if P(X = x|Y = y) = P(X = x) for all values x,y That is, learning the value of Y does not change the prediction of X If X and Y are independent then P(X,Y) = P(X|Y)P(Y) = P(X)P(Y) In general, if X1,…,Xp are independent, then P(X1,…,Xp) = P(X1)...P(Xp) Requires only O(p) parameters
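A minimal Python sketch of the parameter count above (the marginals are hypothetical, not from the slides): a product distribution over p independent binary variables is specified by p numbers, while an arbitrary joint would need 2^p - 1.

```python
# Minimal sketch: a product distribution over p independent binary variables.
# The marginals below are hypothetical numbers chosen only for illustration.
from itertools import product

p_marginals = [0.2, 0.7, 0.5]          # P(X1=1), P(X2=1), P(X3=1): p parameters

def joint_independent(assignment):
    """P(X1=x1,...,Xp=xp) as a product of marginals, using independence."""
    prob = 1.0
    for x, p1 in zip(assignment, p_marginals):
        prob *= p1 if x == 1 else (1.0 - p1)
    return prob

# Sanity check: the joint sums to 1 over all 2^p assignments.
total = sum(joint_independent(a) for a in product([0, 1], repeat=len(p_marginals)))
print(joint_independent((1, 0, 1)), total)   # 0.2 * 0.3 * 0.5 = 0.03, total = 1.0
```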

Conditional Independence Unfortunately, most random variables of interest are not independent of each other A more suitable notion is that of conditional independence Two variables X and Y are conditionally independent given Z if P(X = x|Y = y,Z = z) = P(X = x|Z = z) for all values x,y,z That is, learning the value of Y does not change the prediction of X once we know the value of Z Notation: I(X, Y | Z)

Example: Family Trees Noisy stochastic process, Example: Pedigree (figure: Homer, Marge, Bart, Lisa, Maggie) A node represents an individual's genotype Modeling assumption: ancestors can affect descendants' genotypes only by passing genetic material through intermediate generations

Markov Assumption We now make this independence assumption more precise for directed acyclic graphs (DAGs) Each random variable X is independent of its non-descendants, given its parents Pa(X) Formally, I(X, NonDesc(X) | Pa(X)) (Figure: X with an ancestor, a parent, descendants, and non-descendants Y1, Y2)

Markov Assumption Example Earthquake Radio Burglary Alarm Call In this example: I ( E, B ) I ( B, {E, R} ) I ( R, {A, B, C} | E ) I ( A, R | B,E ) I ( C, {B, E, R} | A)

I-Maps A DAG G is an I-Map of a distribution P if all Markov assumptions implied by G are satisfied by P (Assuming G and P both use the same set of random variables) Examples: (two small graphs over X and Y, with and without an edge between them)

Factorization Given that G is an I-Map of P, can we simplify the representation of P? Example: since I(X,Y), we have that P(X|Y) = P(X) Applying the chain rule, P(X,Y) = P(X|Y) P(Y) = P(X) P(Y) Thus, we have a simpler representation of P(X,Y)

Factorization Theorem Thm: if G is an I-Map of P, then P(X1,…,Xp) = Πi P(Xi | Pa(Xi)) Proof: By the chain rule, P(X1,…,Xp) = Πi P(Xi | X1,…,Xi-1), where wlog X1,…,Xp is an ordering consistent with G From this ordering, {X1,…,Xi-1} ⊆ NonDesc(Xi) ∪ Pa(Xi) Since G is an I-Map, I(Xi, NonDesc(Xi) | Pa(Xi)) Hence P(Xi | X1,…,Xi-1) = P(Xi | Pa(Xi)) We conclude that P(X1,…,Xp) = Πi P(Xi | Pa(Xi))

Factorization Example Earthquake Radio Burglary Alarm Call P(C,A,R,E,B) = P(B)P(E|B)P(R|E,B)P(A|R,B,E)P(C|A,R,B,E) versus P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A)
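To make the factored form concrete, here is a hedged Python sketch of the joint for the Earthquake network, P(C,A,R,E,B) = P(B) P(E) P(R|E) P(A|B,E) P(C|A); all CPT values are hypothetical and chosen only to illustrate the parameter savings.

```python
# Sketch of the factored Earthquake joint with hypothetical CPTs (binary variables).
P_B = {1: 0.01, 0: 0.99}                                   # P(B)
P_E = {1: 0.02, 0: 0.98}                                   # P(E)
P_R_given_E = {1: {1: 0.9, 0: 0.1}, 0: {1: 0.01, 0: 0.99}} # P_R_given_E[e][r] = P(R=r | E=e)
P_A_given_BE = {(1, 1): 0.95, (1, 0): 0.94,
                (0, 1): 0.29, (0, 0): 0.001}               # P(A=1 | B=b, E=e)
P_C_given_A = {1: 0.9, 0: 0.05}                            # P(C=1 | A=a)

def joint(c, a, r, e, b):
    """P(C=c, A=a, R=r, E=e, B=b) via the factorization above."""
    pa = P_A_given_BE[(b, e)] if a == 1 else 1 - P_A_given_BE[(b, e)]
    pc = P_C_given_A[a] if c == 1 else 1 - P_C_given_A[a]
    return P_B[b] * P_E[e] * P_R_given_E[e][r] * pa * pc

# 5 binary variables, but only 1 + 1 + 2 + 4 + 2 = 10 parameters instead of 2^5 - 1 = 31.
print(joint(c=1, a=1, r=0, e=0, b=1))
```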

Consequences We can write P in terms of “local” conditional probabilities If G is sparse, that is, |Pa(Xi)| < k, then each conditional probability can be specified compactly; e.g. for binary variables, these require O(2^k) params ⇒ the representation of P is compact, linear in the number of variables

Summary We defined the following concepts: The Markov independencies of a DAG G: I(Xi, NonDesc(Xi) | Pa(Xi)) G is an I-Map of a distribution P if P satisfies the Markov independencies implied by G We proved the factorization theorem: if G is an I-Map of P, then P(X1,…,Xp) = Πi P(Xi | Pa(Xi))

Conditional Independencies Let Markov(G) be the set of Markov independencies implied by G The factorization theorem shows: G is an I-Map of P ⇒ P(X1,…,Xp) = Πi P(Xi | Pa(Xi)) We can also show the opposite: Thm: P(X1,…,Xp) = Πi P(Xi | Pa(Xi)) ⇒ G is an I-Map of P

Proof (Outline) Example: (figure: a small graph over X, Y, and Z)

Implied Independencies Does a graph G imply additional independencies as a consequence of Markov(G)? We can define a logic of independence statements Some axioms: I( X ; Y | Z ) ⇒ I( Y; X | Z ) I( X ; Y1, Y2 | Z ) ⇒ I( X; Y1 | Z )

d-separation A procedure d-sep(X; Y | Z, G) that, given a DAG G and sets X, Y, and Z, returns either yes or no Goal: d-sep(X; Y | Z, G) = yes iff I(X;Y|Z) follows from Markov(G)

Paths Intuition: dependency must “flow” along paths in the graph A path is a sequence of neighboring variables Examples (in the Earthquake network): R ← E → A ← B, and C ← A ← E → R

Paths We want to know when a path is active (creates dependency between end nodes) or blocked (cannot create dependency between end nodes) We want to classify situations in which paths are active.

Path Blockage Three cases. Common cause (R ← E → A): the path is blocked when E is in the evidence, and active (unblocked) otherwise.

Path Blockage Three cases (cont.). Intermediate cause (E → A → C): the path is blocked when A is in the evidence, and active otherwise.

Path Blockage Three cases (cont.). Common effect (E → A ← B, with C a descendant of A): the path is blocked when neither A nor any of its descendants is in the evidence, and active when A or one of its descendants is in the evidence.

Path Blockage -- General Case A path is active, given evidence Z, if whenever we have a v-structure configuration A → B ← C on the path, B or one of its descendants is in Z, and no other node on the path is in Z A path is blocked, given evidence Z, if it is not active.

Example d-sep(R,B)? (Earthquake network over E, B, R, A, C)

Example d-sep(R,B) = yes d-sep(R,B|A)? (Earthquake network over E, B, R, A, C)

Example d-sep(R,B) = yes d-sep(R,B|A) = no d-sep(R,B|E,A)? (Earthquake network over E, B, R, A, C)

d-Separation X is d-separated from Y, given Z, if all paths from a node in X to a node in Y are blocked, given Z. Checking d-separation can be done efficiently (linear time in the number of edges) Bottom-up phase: mark all nodes whose descendants are in Z X-to-Y phase: traverse (BFS) all edges on paths from X to Y and check whether they are blocked
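The sketch below checks d-separation in Python. It enumerates paths and applies the path-blockage rule directly, rather than the linear-time marking procedure described on this slide, so it is only a readable illustration; the DAG encoding (a dict from each node to its parent list) is our own convention.

```python
# Sketch: d-separation by direct path checking on a DAG g = {node: [parents]}.
# Exponential in the number of paths, but fine for small examples.

def descendants(g, node):
    """All descendants of `node` in the DAG."""
    children = {n: [c for c, ps in g.items() if n in ps] for n in g}
    out, stack = set(), [node]
    while stack:
        for c in children[stack.pop()]:
            if c not in out:
                out.add(c)
                stack.append(c)
    return out

def simple_paths(g, x, y, path):
    """All simple undirected paths from x to y (path already contains x)."""
    if x == y:
        yield path
        return
    neighbors = set(g[x]) | {c for c, ps in g.items() if x in ps}
    for n in neighbors:
        if n not in path:
            yield from simple_paths(g, n, y, path + [n])

def d_separated(g, x, y, z):
    """True iff every path from x to y is blocked given the evidence set z."""
    for path in simple_paths(g, x, y, [x]):
        active = True
        for a, b, c in zip(path, path[1:], path[2:]):
            if a in g[b] and c in g[b]:                       # v-structure a -> b <- c
                if b not in z and not (descendants(g, b) & z):
                    active = False                            # unobserved collider blocks
            elif b in z:                                      # observed chain/fork node blocks
                active = False
        if active:
            return False                                      # found an active path
    return True

# Earthquake network: E -> R, E -> A, B -> A, A -> C.
g = {"E": [], "B": [], "R": ["E"], "A": ["E", "B"], "C": ["A"]}
print(d_separated(g, "R", "B", set()))    # True:  d-sep(R,B) = yes
print(d_separated(g, "R", "B", {"A"}))    # False: d-sep(R,B|A) = no
```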

Soundness Thm: If G is an I-Map of P and d-sep( X; Y | Z, G ) = yes, then P satisfies I( X; Y | Z ) Informally: any independence reported by d-separation is satisfied by the underlying distribution

Completeness Thm: If d-sep( X; Y | Z, G ) = no, then there is a distribution P such that G is an I-Map of P and P does not satisfy I( X; Y | Z ) Informally: any independence not reported by d-separation might be violated by the underlying distribution We cannot determine this by examining the graph structure alone

Summary: Structure We explored DAGs as a representation of conditional independencies: Markov independencies of a DAG Tight correspondence between Markov(G) and the factorization defined by G d-separation, a sound & complete procedure for computing the consequences of the independencies Notions of minimal I-Maps and P-Maps This theory is the basis for defining Bayesian networks

Inference We now have compact representations of probability distributions: Bayesian Networks Markov Networks Network describes a unique probability distribution P How do we answer queries about P? We use inference as a name for the process of computing answers to such queries

Today Probabilistic graphical models Treewidth methods: Variable elimination Clique tree algorithm Applications du jour: Sensor Networks

Queries: Likelihood There are many types of queries we might ask. Most of these involve evidence Evidence e is an assignment of values to a set E of variables in the domain Without loss of generality, E = { Xk+1, …, Xn } Simplest query: compute the probability of the evidence, P(e) = Σ over x1,…,xk of P(x1,…,xk, e) This is often referred to as computing the likelihood of the evidence

Queries: A posteriori belief Often we are interested in the conditional probability of a variable given the evidence, P(X | e) = P(X, e) / P(e) This is the a posteriori belief in X, given evidence e A related task is computing the term P(X, e), i.e., the likelihood of e and X = x for each value x of X
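As a sanity check on these two queries, the brute-force sketch below computes a likelihood P(e) and an a-posteriori belief P(X | e) by summing the hypothetical joint() from the earlier Earthquake sketch; it is illustrative only, not an efficient inference method.

```python
# Sketch: likelihood and a-posteriori belief by enumeration over the 5 binary
# Earthquake variables, reusing joint(c, a, r, e, b) defined in the earlier sketch.
from itertools import product

def likelihood(evidence):
    """P(e): sum the joint over all assignments consistent with the evidence dict."""
    total = 0.0
    for c, a, r, e, b in product([0, 1], repeat=5):
        assign = {"C": c, "A": a, "R": r, "E": e, "B": b}
        if all(assign[k] == v for k, v in evidence.items()):
            total += joint(c, a, r, e, b)
    return total

def a_posteriori(var, evidence):
    """P(var | e) = P(var, e) / P(e) for a binary variable."""
    pe = likelihood(evidence)
    return {v: likelihood({**evidence, var: v}) / pe for v in (0, 1)}

print(a_posteriori("B", {"C": 1}))   # belief in burglary after receiving a call
```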

A posteriori belief This query is useful in many cases: Prediction: what is the probability of an outcome given the starting condition (the target is a descendant of the evidence) Diagnosis: what is the probability of a disease/fault given the symptoms (the target is an ancestor of the evidence) The direction of the edges between variables does not restrict the direction of the queries

Queries: MAP In this query we want to find the maximum a posteriori assignment for some variables of interest (say X1,…,Xl) That is, x1,…,xl maximize the probability P(x1,…,xl | e) Note that this is equivalent to maximizing P(x1,…,xl, e)

Queries: MAP We can use MAP for: Classification: find the most likely label, given the evidence Explanation: what is the most likely scenario, given the evidence

Complexity of Inference Thm: Computing P(X = x) in a Bayesian network is NP-hard Not surprising, since we can simulate Boolean gates.

Approaches to inference Exact inference Inference in Simple Chains Variable elimination Clustering / join tree algorithms Approximate inference – next time Stochastic simulation / sampling methods Markov chain Monte Carlo methods Mean field theory – your presentation

Variable Elimination General idea: Write the query in the form P(X1, e) = Σ over xp … Σ over x3 Σ over x2 of Πi P(xi | Pai) Iteratively: move all irrelevant terms outside of the innermost sum, perform the innermost sum to get a new term, and insert the new term into the product

Example “Asia” network: Visit to Asia Smoking Lung Cancer Tuberculosis Abnormality in Chest Bronchitis X-Ray Dyspnea

We want to compute P(d) Need to eliminate: v,s,x,t,l,a,b Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) “Brute force approach”: P(d) = Σ over v,s,x,t,l,a,b of P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) Complexity is exponential in the size of the graph (number of variables): O(N^T), where T = number of variables and N = number of states for each variable

We want to compute P(d) Need to eliminate: v,s,x,t,l,a,b Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) Eliminate: v Compute: fv(t) = Σ over v of P(v) P(t|v), leaving P(s) P(l|s) P(b|s) fv(t) P(a|t,l) P(x|a) P(d|a,b) Note: fv(t) = P(t) In general, the result of elimination is not necessarily a probability term

We want to compute P(d) Need to eliminate: s,x,t,l,a,b Current factors: P(s) P(l|s) P(b|s) fv(t) P(a|t,l) P(x|a) P(d|a,b) Eliminate: s Compute: fs(b,l) = Σ over s of P(s) P(b|s) P(l|s), leaving fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b) Summing over s results in a factor with two arguments, fs(b,l) In general, the result of elimination may be a function of several variables

We want to compute P(d) Need to eliminate: x,t,l,a,b Current factors: fv(t) fs(b,l) P(a|t,l) P(x|a) P(d|a,b) Eliminate: x Compute: fx(a) = Σ over x of P(x|a), leaving fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b) Note: fx(a) = 1 for all values of a !!

We want to compute P(d) Need to eliminate: t,l,a,b Current factors: fv(t) fs(b,l) fx(a) P(a|t,l) P(d|a,b) Eliminate: t Compute: ft(a,l) = Σ over t of fv(t) P(a|t,l), leaving fs(b,l) fx(a) ft(a,l) P(d|a,b)

We want to compute P(d) Need to eliminate: l,a,b Current factors: fs(b,l) fx(a) ft(a,l) P(d|a,b) Eliminate: l Compute: fl(a,b) = Σ over l of fs(b,l) ft(a,l), leaving fx(a) fl(a,b) P(d|a,b)

We want to compute P(d) Need to eliminate: a,b Current factors: fx(a) fl(a,b) P(d|a,b) Eliminate: a, then b Compute: fa(b,d) = Σ over a of fx(a) fl(a,b) P(d|a,b), then fb(d) = Σ over b of fa(b,d), which is P(d)

Different elimination ordering: Need to eliminate: a,b,x,t,v,s,l Initial factors: P(v) P(s) P(t|v) P(l|s) P(b|s) P(a|t,l) P(x|a) P(d|a,b) The intermediate factors are now much larger; for example, eliminating a first produces a factor ga(t,l,x,b,d) over five variables Complexity is exponential in the size of the factors!

Variable Elimination We now understand variable elimination as a sequence of rewriting operations Actual computation is done in elimination step Exactly the same computation procedure applies to Markov networks Computation depends on order of elimination
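The rewriting view translates directly into code. Below is a hedged Python sketch of variable elimination over explicit factor tables, restricted to binary variables and applied to a small chain B → A → C with hypothetical CPTs; it is a toy implementation, not the course's reference code.

```python
# Sketch: variable elimination with factors stored as {assignment_tuple: value}
# dicts plus an explicit variable list. Binary variables only, for brevity.
from itertools import product

class Factor:
    def __init__(self, variables, table):
        self.vars = list(variables)      # e.g. ["A", "B"]
        self.table = dict(table)         # e.g. {(a, b): value}

    def __mul__(self, other):
        vs = self.vars + [v for v in other.vars if v not in self.vars]
        table = {}
        for assign in product([0, 1], repeat=len(vs)):
            env = dict(zip(vs, assign))
            table[assign] = (self.table[tuple(env[v] for v in self.vars)] *
                             other.table[tuple(env[v] for v in other.vars)])
        return Factor(vs, table)

    def sum_out(self, var):
        vs = [v for v in self.vars if v != var]
        table = {}
        for assign, val in self.table.items():
            env = dict(zip(self.vars, assign))
            key = tuple(env[v] for v in vs)
            table[key] = table.get(key, 0.0) + val
        return Factor(vs, table)

def eliminate(factors, order):
    """For each variable in order: multiply the factors that mention it, sum it out."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f.vars]
        rest = [f for f in factors if var not in f.vars]
        prod = touching[0]
        for f in touching[1:]:
            prod = prod * f
        factors = rest + [prod.sum_out(var)]
    result = factors[0]
    for f in factors[1:]:
        result = result * f
    return result

# Hypothetical binary CPTs for the chain B -> A -> C; compute P(C) by eliminating B, then A.
fB = Factor(["B"], {(0,): 0.9, (1,): 0.1})                                        # P(B)
fA = Factor(["A", "B"], {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7})     # P(A|B)
fC = Factor(["C", "A"], {(0, 0): 0.95, (1, 0): 0.05, (0, 1): 0.4, (1, 1): 0.6})   # P(C|A)
pC = eliminate([fB, fA, fC], order=["B", "A"])
print(pC.vars, pC.table)   # the marginal P(C); its entries sum to 1
```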

Markov Network (Undirected Graphical Models) A graph with hyper-edges (multi-vertex edges) Every hyper-edge e = (x1…xk) has a potential function fe(x1…xk) The probability distribution is P(x1,…,xn) = (1/Z) Π over hyper-edges e of fe(xe), where Z is a normalization constant
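A minimal sketch of this definition, assuming a tiny two-potential network over binary variables A, B, C with made-up potential values: the distribution is the product of the edge potentials divided by the normalization constant Z.

```python
# Sketch: a Markov network A - B - C with hypothetical potentials f1(A,B) and f2(B,C).
from itertools import product

f1 = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}   # potential on (A, B)
f2 = {(0, 0): 3.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}   # potential on (B, C)

def unnormalized(a, b, c):
    return f1[(a, b)] * f2[(b, c)]

# Z sums the unnormalized product over all joint assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in product([0, 1], repeat=3))

def P(a, b, c):
    return unnormalized(a, b, c) / Z

print(Z, P(1, 1, 1))
```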

Complexity of variable elimination Suppose in one elimination step we compute fx(y1,…,yk) = Σ over x of Π over i = 1..m of fi(x, Yi), where each Yi ⊆ {y1,…,yk} This requires m · |Val(X)| · Πj |Val(Yj)| multiplications: for each value of x, y1, …, yk, we do m multiplications It requires |Val(X)| · Πj |Val(Yj)| additions: for each value of y1, …, yk, we do |Val(X)| additions Complexity is exponential in the number of variables in the intermediate factor

Undirected graph representation At each stage of the procedure, we have an algebraic term that we need to evaluate In general this term is of the form Σ over the remaining variables of Πi fi(Zi), where the Zi are sets of variables We now draw a graph with an undirected edge X--Y if X and Y are arguments of some factor, that is, if X and Y appear together in some Zi Note: this is the Markov network that describes the probability distribution over the variables we have not yet eliminated

Chordal Graphs An elimination ordering induces an undirected chordal graph Maximal cliques of the induced graph are factors in the elimination Factors in the elimination are cliques in the graph Complexity is exponential in the size of the largest clique in the graph (Figure: the Asia network over V, S, T, L, A, B, X, D and its induced chordal graph)

Induced Width The size of the largest clique in the induced graph is thus an indicator for the complexity of variable elimination This quantity is called the induced width of a graph according to the specified ordering Finding a good ordering for a graph is equivalent to finding the minimal induced width of the graph
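The induced width for a given ordering can be computed by simulating elimination with fill-in edges, as in the sketch below. The moralized edge list is our reconstruction of the Asia example and the two orderings are illustrative only.

```python
# Sketch: induced width of an undirected graph under an elimination ordering.
# Eliminating a node connects its remaining neighbors (fill-in edges); the induced
# width reported here is the largest such neighbor set (largest induced clique minus one).

def induced_width(edges, order):
    adj = {v: set() for v in order}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    width, eliminated = 0, set()
    for v in order:
        nbrs = adj[v] - eliminated
        width = max(width, len(nbrs))
        for a in nbrs:                       # add fill-in edges among remaining neighbors
            adj[a] |= nbrs - {a}
        eliminated.add(v)
    return width

# Reconstructed moralized Asia graph (T-L and A-B are the moralization edges).
edges = [("V", "T"), ("S", "L"), ("S", "B"), ("T", "A"), ("L", "A"), ("T", "L"),
         ("A", "X"), ("A", "D"), ("B", "D"), ("A", "B")]
print(induced_width(edges, ["V", "S", "X", "T", "L", "A", "B", "D"]))  # a good ordering
print(induced_width(edges, ["A", "B", "X", "T", "V", "S", "L", "D"]))  # eliminating A first is much worse
```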

PolyTrees A polytree is a network where there is at most one path from one variable to another Thm: Inference in a polytree is linear in the representation size of the network This assumes a tabular CPT representation (Figure: an example polytree over nodes A through H)

Today Probabilistic graphical models Treewidth methods: Variable elimination Clique tree algorithm Applications du jour: Sensor Networks

Junction Tree Why junction tree? More efficient for some tasks than variable elimination We can avoid cycles if we turn highly-interconnected subsets of the nodes into “supernodes” (clusters) Objective: compute P(V = v | E = e), where v is a value of a variable V and e is the evidence for a set of variables E

Properties of Junction Tree An undirected tree Each node is a cluster (nonempty set) of variables Running intersection property: given two clusters X and Y, all clusters on the path between X and Y contain X ∩ Y Separator sets (sepsets): the intersection of adjacent clusters (Example: clusters ABD, ADE, DEF connected by sepsets AD and DE)

Potentials A potential φA over a set of variables A maps each instantiation of A to a nonnegative real number Marginalization: φX = Σ over A \ X of φA, the marginalization of φA into X ⊆ A Multiplication: φAB = φA φB, the multiplication of φA and φB

Properties of Junction Tree Belief potentials: map each instantiation of a cluster or sepset into a real number Constraints: Consistency: for each cluster X and neighboring sepset S, Σ over X \ S of φX = φS The joint distribution: P(U) = (Π over clusters X of φX) / (Π over sepsets S of φS)

Properties of Junction Tree If a junction tree satisfies these properties, it follows that: For each cluster (or sepset) X, φX = P(X) The probability distribution of any variable V can be obtained by marginalizing any cluster (or sepset) X that contains V: P(V) = Σ over X \ {V} of φX

Building Junction Trees DAG → Moral Graph → Triangulated Graph → Identifying Cliques → Junction Tree

Constructing the Moral Graph (Figure: the example DAG over nodes A through H)

Constructing The Moral Graph Add undirected edges between all co-parents which are not currently joined (“marrying” the parents) (Figure: the example DAG with the added moral edges)

Constructing The Moral Graph Add undirected edges between all co-parents which are not currently joined (“marrying” the parents) Drop the directions of the arcs (Figure: the resulting undirected moral graph over nodes A through H)
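Moralization is easy to express in code. The sketch below marries co-parents and drops arc directions for a DAG given as a child-to-parents dictionary; the eight-node example DAG is an assumption on our part, chosen to be consistent with the clique list that appears two slides later.

```python
# Sketch: construct the moral graph of a DAG given as {child: [parents]}.
from itertools import combinations

def moral_graph(dag):
    und = {v: set() for v in dag}
    for child, parents in dag.items():
        for p in parents:                        # drop directions on the original arcs
            und[child].add(p)
            und[p].add(child)
        for a, b in combinations(parents, 2):    # marry co-parents
            und[a].add(b)
            und[b].add(a)
    return und

# Assumed example DAG (consistent with cliques ABD, ACE, ADE, CEG, DEF, EGH after triangulation).
dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B"], "E": ["C"],
       "F": ["D", "E"], "G": ["C"], "H": ["E", "G"]}
print(moral_graph(dag)["E"])   # E gains edges to its co-parents D (for F) and G (for H)
```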

Triangulating An undirected graph is triangulated iff every cycle of length > 3 contains an edge that connects two nonadjacent nodes on the cycle (Figure: the triangulated moral graph)

Identifying Cliques A clique is a subgraph of an undirected graph that is complete and maximal In the example, the cliques are ABD, ACE, ADE, CEG, DEF, and EGH

Junction Tree A junction tree is a subgraph of the clique graph that is a tree, contains all the cliques, and satisfies the running intersection property (Example: clusters ABD, ADE, ACE, CEG, DEF, EGH connected by sepsets AD, AE, CE, DE, EG)

Principle of Inference DAG → Junction Tree → Initialization → Inconsistent Junction Tree → Propagation → Consistent Junction Tree → Marginalization

Example: Create Join Tree HMM with 2 time steps: hidden variables X1, X2 and observations Y1, Y2 Junction Tree: clusters (X1,Y1), (X1,X2), (X2,Y2) connected by sepsets X1 and X2

Example: Initialization Assign each variable's conditional probability table to a cluster that contains its family: X1 → cluster (X1,Y1), potential P(X1); Y1 → cluster (X1,Y1), potential P(Y1|X1); X2 → cluster (X1,X2), potential P(X2|X1); Y2 → cluster (X2,Y2), potential P(Y2|X2)

Example: Collect Evidence Choose an arbitrary clique, e.g. (X1,X2), where all potential functions will be collected. Call recursively on neighboring cliques for messages: 1. Call (X1,Y1): Projection: φX1 = Σ over Y1 of φX1,Y1 Absorption: φX1,X2 ← φX1,X2 · (new φX1) / (old φX1)

Example: Collect Evidence (cont.) 2. Call (X2,Y2): Projection: φX2 = Σ over Y2 of φX2,Y2 Absorption: φX1,X2 ← φX1,X2 · (new φX2) / (old φX2)

Example: Distribute Evidence Pass messages recursively to neighboring nodes Pass message from (X1,X2) to (X1,Y1): 1. Projection: φX1 = Σ over X2 of φX1,X2 2. Absorption: φX1,Y1 ← φX1,Y1 · (new φX1) / (old φX1)

Example: Distribute Evidence (cont.) Pass message from (X1,X2) to (X2,Y2): 1. Projection: φX2 = Σ over X1 of φX1,X2 2. Absorption: φX2,Y2 ← φX2,Y2 · (new φX2) / (old φX2)
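Putting the collect pass into code: a hedged Python sketch of projection and absorption for the two-slice HMM junction tree, with hypothetical CPT numbers. The distribute pass is symmetric (project the root potential onto each sepset and absorb into the leaf clusters) and is omitted for brevity.

```python
# Sketch: collect-evidence pass on the junction tree (X1,Y1)-[X1]-(X1,X2)-[X2]-(X2,Y2).
from itertools import product

# Hypothetical CPTs for binary variables.
P_X1 = {0: 0.6, 1: 0.4}
P_X2_given_X1 = {(0, 0): 0.7, (1, 0): 0.3, (0, 1): 0.2, (1, 1): 0.8}   # key (x2, x1)
P_Y_given_X = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.2, (1, 1): 0.8}     # key (y, x)

# Initialization: phi(X1,Y1) = P(X1)P(Y1|X1), phi(X1,X2) = P(X2|X1), phi(X2,Y2) = P(Y2|X2).
phi_X1Y1 = {(x1, y1): P_X1[x1] * P_Y_given_X[(y1, x1)] for x1, y1 in product([0, 1], repeat=2)}
phi_X1X2 = {(x1, x2): P_X2_given_X1[(x2, x1)] for x1, x2 in product([0, 1], repeat=2)}
phi_X2Y2 = {(x2, y2): P_Y_given_X[(y2, x2)] for x2, y2 in product([0, 1], repeat=2)}
sep_X1 = {0: 1.0, 1: 1.0}
sep_X2 = {0: 1.0, 1: 1.0}

def project(phi, axis):
    """Marginalize a two-variable cluster potential onto the variable at `axis`."""
    out = {0: 0.0, 1: 0.0}
    for key, val in phi.items():
        out[key[axis]] += val
    return out

# Collect into (X1,X2): message from (X1,Y1) over sepset X1, then from (X2,Y2) over X2.
new_sep_X1 = project(phi_X1Y1, axis=0)                                               # projection
phi_X1X2 = {k: v * new_sep_X1[k[0]] / sep_X1[k[0]] for k, v in phi_X1X2.items()}     # absorption
sep_X1 = new_sep_X1
new_sep_X2 = project(phi_X2Y2, axis=0)
phi_X1X2 = {k: v * new_sep_X2[k[1]] / sep_X2[k[1]] for k, v in phi_X1X2.items()}
sep_X2 = new_sep_X2

# With no evidence, phi(X1,X2) now equals the joint P(X1,X2); its X2-marginal is P(X2).
print(project(phi_X1X2, axis=1))
```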

Example: Inference with evidence Assume we want to compute P(X2 | Y1=0, Y2=1) (state estimation) Assign likelihoods to the potential functions during initialization: give likelihood 1 to the observed values Y1 = 0 and Y2 = 1 and likelihood 0 to the other values

Example: Inference with evidence (cont.) Repeating the same steps as in the previous case, we obtain the unnormalized posterior over X2 in the cluster potentials; normalizing it gives P(X2 | Y1=0, Y2=1)

Next Time Approximate Probabilistic Inference via sampling Gibbs Priority MCMC

THE END

Example: Naïve Bayesian Model A common model in early diagnosis: symptoms are conditionally independent given the disease (or fault) Thus, if X1,…,Xp denote the symptoms exhibited by the patient (headache, high fever, etc.) and H denotes the hypothesis about the patient's health, then P(X1,…,Xp,H) = P(H)P(X1|H)…P(Xp|H) This naïve Bayesian model allows a compact representation It does, however, embody strong independence assumptions
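A short sketch of the resulting update, with hypothetical numbers for a two-symptom, binary-hypothesis example: the posterior over H is proportional to P(H) times the product of the symptom likelihoods.

```python
# Sketch: naive Bayes posterior P(H | x1..xp) ∝ P(H) * prod_i P(xi | H), hypothetical CPTs.
P_H = {0: 0.9, 1: 0.1}                         # prior: healthy (0) vs. sick (1)
P_X_given_H = [
    {(1, 0): 0.2, (1, 1): 0.8},                # P(symptom 1 present | H)
    {(1, 0): 0.1, (1, 1): 0.7},                # P(symptom 2 present | H)
]

def naive_bayes_posterior(xs):
    unnorm = {}
    for h, ph in P_H.items():
        p = ph
        for x, cpt in zip(xs, P_X_given_H):
            p1 = cpt[(1, h)]
            p *= p1 if x == 1 else (1.0 - p1)
        unnorm[h] = p
    z = sum(unnorm.values())
    return {h: p / z for h, p in unnorm.items()}

print(naive_bayes_posterior([1, 1]))   # both symptoms present: posterior shifts toward H = 1
```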

Elimination on Trees Formally, for any tree, there is an elimination ordering with induced width = 1 Thm: Inference on trees is linear in the number of variables