Dependency Models – abstraction of Probability distributions


Dependency Models – abstraction of probability distributions. A dependency model M over a finite set of elements U is a rule that assigns a truth value to the predicate IM(X,Z,Y) for every three disjoint subsets X, Y, Z of U. IM(X,Z,Y) reads "X is independent of Y given Z in M".
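Probability distributions are the motivating instance: IM becomes conditional independence. As a concrete anchor for the definitions that follow, here is a minimal brute-force sketch of that predicate for a finite joint table (the names joint and is_independent are illustrative, not from the lecture):

    from collections import defaultdict

    def is_independent(joint, X, Z, Y, tol=1e-12):
        # IP(X, Z, Y): X independent of Y given Z in a finite joint table.
        # joint maps full assignments (tuples over all variables) to probabilities;
        # X, Z, Y are disjoint tuples of variable indices.
        def marginal(indices):
            m = defaultdict(float)
            for assignment, p in joint.items():
                m[tuple(assignment[i] for i in indices)] += p
            return m
        pXZY, pXZ, pZY, pZ = (marginal(X + Z + Y), marginal(X + Z),
                              marginal(Z + Y), marginal(Z))
        # I(X,Z,Y) iff P(x,z,y) * P(z) = P(x,z) * P(z,y) for all assignments.
        for xz, pxz in pXZ.items():
            x, z = xz[:len(X)], xz[len(X):]
            for zy, pzy in pZY.items():
                if zy[:len(Z)] != z:
                    continue
                y = zy[len(Z):]
                if abs(pXZY.get(x + z + y, 0.0) * pZ[z] - pxz * pzy) > tol:
                    return False
        return True

    # Two fair coins and a bell that rings iff they agree (variables: c1, c2, bell).
    joint = {(0, 0, 1): 0.25, (0, 1, 0): 0.25, (1, 0, 0): 0.25, (1, 1, 1): 0.25}
    print(is_independent(joint, (0,), (), (1,)))    # True: the coins are independent
    print(is_independent(joint, (0,), (2,), (1,)))  # False: dependent given the bell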

Important Properties of Independence: I(X,Z,Y) → [ I(X,Z,Y∪W) ↔ I(X,Z∪Y,W) ], or spelled out, I(X,Z,Y) → { [ I(X,Z,Y∪W) → I(X,Z∪Y,W) ] ∧ [ ¬I(X,Z,Y∪W) → ¬I(X,Z∪Y,W) ] }. When learning an irrelevant fact Y, the relevance status of every fact W remains unchanged.

Undirected Graphs can represent Independence. Let G = (V,E) be an undirected graph. For disjoint sets of nodes X, Y, and Z, define IG(X,Z,Y) to hold if and only if every path between a node in X and a node in Y passes through a node in Z. In the textbook the notation <X|Z|Y>G is also used.
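This separation predicate is easy to test mechanically: delete the nodes of Z and ask whether any path from X still reaches Y. A minimal breadth-first sketch (adjacency sets; the name separates is illustrative), shown on the four-node example of the next slide:

    from collections import deque

    def separates(adj, X, Z, Y):
        # IG(X, Z, Y): every path between X and Y passes through a node in Z.
        X, Z, Y = set(X), set(Z), set(Y)
        seen, frontier = set(X), deque(X)
        while frontier:                      # BFS that never enters Z
            u = frontier.popleft()
            for v in adj[u]:
                if v in Z or v in seen:
                    continue
                if v in Y:
                    return False             # an X-Y path avoiding Z exists
                seen.add(v)
                frontier.append(v)
        return True

    # Four-node cycle M1 - F1 - M2 - F2 - M1:
    adj = {'M1': {'F1', 'F2'}, 'M2': {'F1', 'F2'},
           'F1': {'M1', 'M2'}, 'F2': {'M1', 'M2'}}
    print(separates(adj, {'M1'}, {'F1', 'F2'}, {'M2'}))  # True
    print(separates(adj, {'M1'}, {'F1'}, {'M2'}))        # False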

Example (the four-node cycle M1 – F1 – M2 – F2 – M1): M = { IG(M1,{F1,F2},M2), IG(F1,{M1,M2},F2) }, closed under symmetry.

Definitions: 1. G = (U,E) is an I-map of a model M over U if IG(X,Z,Y) ⇒ IM(X,Z,Y) for all disjoint subsets X, Y, Z of U. 2. G is a perfect map of M if IG(X,Z,Y) ⇔ IM(X,Z,Y) for all disjoint subsets X, Y, Z of U. 3. M is graph-isomorph if there exists a graph G such that G is a perfect map of M.
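On small element sets both definitions can be checked by exhaustive enumeration, given the two predicates as callables (a sketch; is_imap and subsets are illustrative names — replace the implication with a biconditional to test for a perfect map):

    from itertools import combinations

    def subsets(items):
        for r in range(len(items) + 1):
            yield from combinations(items, r)

    def is_imap(U, IG, IM):
        # G is an I-map of M iff IG(X,Z,Y) implies IM(X,Z,Y)
        # for all disjoint non-empty X, Y and disjoint Z.
        for X in subsets(U):
            if not X:
                continue
            rest = [u for u in U if u not in X]
            for Y in subsets(rest):
                if not Y:
                    continue
                for Z in subsets([u for u in rest if u not in Y]):
                    if IG(X, Z, Y) and not IM(X, Z, Y):
                        return False
        return True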

Undirected Graphs cannot always be perfect maps. Strong Union: IG(X,Z,Y) ⇒ IG(X,Z∪W,Y). This property holds for graph separation but not for conditional independence in probability. So if G is an I-map of P and represents IP(X,Z,Y) as a separation, it cannot also represent the negation ¬IP(X,Z∪W,Y): asserting the first separation forces the second. A property of separation needed later, in the proof of Theorem 3: IG(X,S,Y) ⇒ [ IG(X,S∪Y,δ) or IG(Y,S∪X,δ) ] for every element δ outside X∪S∪Y.

Undirected Graphs as I-maps. Definition: An undirected graph G is a minimal I-map of a dependency model M if it is an I-map of M and deleting any edge of G would make it cease to be an I-map of M. Such a graph is called a Markov network of M.

THEOREM 3 [Pearl and Paz 1985]: Every dependency model M satisfying symmetry, decomposition, and intersection has a unique minimal I-map G0 = (U, E0), where the vertices U are the elements of M and the edges E0 are defined by (α,β) ∉ E0 ⇔ IM(α, U-{α,β}, β). (3.11) Proof: by descending induction, as done for THEOREM 2 in the homework.
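Equation (3.11) doubles as an algorithm: query the model once per pair and keep exactly the edges whose endpoints remain dependent given everything else. A sketch under the theorem's assumptions (IM is any callable implementing the model's predicate; minimal_imap_edges is an illustrative name):

    from itertools import combinations

    def minimal_imap_edges(U, IM):
        # Construct E0 of the Markov network via (3.11):
        # (a, b) is an edge iff NOT IM(a, U - {a, b}, b).
        edges = set()
        for a, b in combinations(sorted(U), 2):
            rest = tuple(u for u in U if u not in (a, b))
            if not IM((a,), rest, (b,)):
                edges.add((a, b))
        return edges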

Proof: First we prove that G0 is an I-map of M, namely, that for every three disjoint non-empty subsets X, S, Y of U, IG(X,S,Y) ⇒ IM(X,S,Y). (Eq. 3.II) (i) Let n = |U|. For |S| = n-2, Eq. 3.II follows directly from (3.11). (ii) Assume the claim holds for every separating set S' of size |S'| ≥ k (k ≤ n-2), and let S be any set with |S| = k-1 and IG(X,S,Y). We consider two cases: X∪S∪Y equals U, or is a proper subset of U. (iii) If X∪S∪Y = U, then X or Y has at least two elements. Assume Y does, so Y = Y' ∪ {γ}. From IG(X,S,Y), graph separation yields IG(X, S∪{γ}, Y') and IG(X, S∪Y', γ). By the induction hypothesis, IM(X, S∪{γ}, Y') and IM(X, S∪Y', γ), which imply IM(X,S,Y) by Intersection, as claimed.

(iv) If X∪S∪Y ≠ U, then there exists an element δ not in X∪S∪Y. From IG(X,S,Y) we get IG(X, S∪{δ}, Y) by strong union, and also [ IG(X, S∪Y, δ) or IG(δ, S∪X, Y) ] by the separation property quoted with Theorem 3. The separating sets are all of size at least k, hence by the induction hypothesis either IM(X, S∪{δ}, Y) and IM(X, S∪Y, δ), or IM(X, S∪{δ}, Y) and IM(δ, S∪X, Y). Applying Intersection and then Decomposition in either case yields IM(X,S,Y), as claimed.

Next we prove that G0 is minimal, namely, that no edge can be removed from G0 without it ceasing to be an I-map of M. Deleting an edge (α,β) leaves α separated from β by U-{α,β} in the remaining graph. So if the remaining graph were still an I-map, then IM(α, U-{α,β}, β) would hold; but then, by the definition of G0 (Eq. 3.11), the edge (α,β) was never in E0 in the first place. Hence no edge of G0 can be removed, and G0 is edge-minimal. Finally, we claim that G0 is the unique Markov network of M. Let G be any I-map of M. If a pair (α,β) is not an edge of G, then IG(α, U-{α,β}, β), hence IM(α, U-{α,β}, β), hence (α,β) ∉ E0 by (3.11). So every I-map of M contains all the edges of G0; and since G0 is itself an I-map, a minimal I-map can contain no edge beyond E0. Hence G0 is the unique Markov network of M.

Pairwise Basis of a Markov Network. The set of all independence statements defined by (α,β) ∉ E0 ⇔ IM(α, U-{α,β}, β) (3.11) is called the pairwise basis of G. This set consists of at most n(n-1)/2 independence statements, one per missing edge, that define the Markov network of M.

Neighboring Basis of a Markov Network. The set of all independence statements defined by IM(α, B(α), U - {α} - B(α)) for each vertex α, (3.12) where B(α) is the set of neighbors of α, is called the neighboring basis of G. This set consists of n independence statements, one per vertex, that define the neighbors of each vertex and hence define a graph. Is this graph the unique Markov network G0 of M???

Alternative Construction of the Markov Network. THEOREM 4 [Pearl and Paz 1985]: Every element α ∈ U in a dependency model M satisfying symmetry, decomposition, intersection, and weak union has a unique Markov boundary BI(α). Moreover, BI(α) equals the set BG0(α) of vertices neighboring α in the minimal I-map G0 (the Markov network). Proof: (i) First we show that BI(α) is unique. Take two Markov blankets B1 and B2 of α, i.e., IM(α, B1, U-B1-{α}) and IM(α, B2, U-B2-{α}). By Intersection, also IM(α, B1∩B2, U-(B1∩B2)-{α}) {HMW!}, so B1∩B2 is again a blanket. Hence BI(α), the minimal blanket, is the intersection of all blankets in the set BL*I(α).

Proof continued: It remains to show the second claim, that the graph G1 constructed by connecting each vertex α to the members of its boundary BI(α) is the same as the graph G0 constructed via the edge definition (Eq. 3.11). (ii) For every Markov boundary BI(α) and every element β outside BI(α) ∪ {α}, weak union gives: IM(α, BI(α), U - BI(α) - {α}) ⇒ IM(α, U-{α,β}, β). Hence every edge absent from G1 is also absent from G0; in other words, the set of neighbors of each vertex α in G0 is a subset of its neighbors in G1: BG0(α) ⊆ BI(α). Equality actually holds, because BG0(α) is by itself a Markov blanket of α (by the I-map property of G0), and the boundary BI(α) is the intersection of all blankets. Hence BG0(α) = BI(α), and thus G0 = G1.

Insufficiency of local tests for non-strictly-positive probability distributions. Consider the case X = Y = Z = W (four variables constrained to be equal). What is a Markov network for it? Is it unique? The Intersection property is critical!
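To see the failure concretely, take four binary variables forced to be equal. Every pairwise statement of (3.11) holds, so the pairwise construction returns the empty graph, yet the empty graph is not an I-map. A short demonstration, reusing is_independent from the sketch on the first slide:

    # Four binary variables forced equal: P(0,0,0,0) = P(1,1,1,1) = 1/2.
    joint = {(0, 0, 0, 0): 0.5, (1, 1, 1, 1): 0.5}
    U = (0, 1, 2, 3)
    for a, b in [(a, b) for a in U for b in U if a < b]:
        rest = tuple(u for u in U if u not in (a, b))
        # Each pair is independent given the rest (the rest determines
        # everything), so (3.11) deletes every edge.
        assert is_independent(joint, (a,), rest, (b,))
    # But the empty graph claims marginal independence, which is false:
    print(is_independent(joint, (0,), (), (1,)))  # False - not an I-map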

Markov Networks with probabilities. 1. Define for each (maximal) clique Ci a non-negative function g(Ci), called the compatibility function. 2. Take the product ∏i g(Ci) over all cliques. 3. Define P(X1,…,Xn) = K·∏i g(Ci), where K is a normalizing factor (the inverse of the sum of the product over all assignments).
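A minimal sketch of steps 1-3 for binary variables (clique potentials given as tables; joint_from_potentials and agree are illustrative names):

    from itertools import product

    def joint_from_potentials(n, cliques):
        # cliques: list of (indices, table) pairs; table maps assignments of
        # the clique's variables to a non-negative compatibility value.
        # Returns P(x1..xn) = K * prod_i g(Ci) as a dict over full assignments.
        unnormalized = {}
        for x in product((0, 1), repeat=n):
            score = 1.0
            for indices, table in cliques:
                score *= table[tuple(x[i] for i in indices)]
            unnormalized[x] = score
        K = 1.0 / sum(unnormalized.values())   # normalizing factor
        return {x: K * s for x, s in unnormalized.items()}

    # Chain X0 - X1 - X2 with pairwise potentials that favor agreement:
    agree = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 2.0}
    P = joint_from_potentials(3, [((0, 1), agree), ((1, 2), agree)])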

P(, BG(), U-  -BG()) = f1(,BG()) f2 (U-) (*) Theorem 6 [Hammersley and Clifford 1971]: If a probability function P is formed by a normalized product of non negative functions on the cliques of G, then G is an I-map of P. Proof: It suffices to show (Theorem 4) that the neighborhood basis of G holds in P. Namely, show that I(, BG(), U-  -BG()) hold in P, or just that: P(, BG(), U-  -BG()) = f1(,BG()) f2 (U-) (*) Let J stand for the set of indices marking all cliques in G that include . = f1(,BG()) f2 (U-) The first product contains only variables adjacent to  because Cj is a clique. The second product does not contain . Hence (*) holds.

Note: The theorem and its converse hold also for extreme probabilities, but the presented proof does not apply, due to the use of Intersection in Theorem 4. Theorem X: Every undirected graph G has a distribution P such that G is a perfect map of P. (In light of the previous notes, P must have the form of a product over cliques.)

Proof Sketch of Theorem X. Theorem Y (Completeness): Given a graph G, for every independence statement σ = I(α,Z,β) that does NOT hold in G, there exists a probability distribution Pσ that satisfies all independence statements that hold in G and does not satisfy σ. Proof of Theorem Y: Pick a path in G between α and β that does not contain a node from Z. Define a probability distribution that is a perfect map of this chain, and multiply it by arbitrary marginal probabilities on all other nodes not on the path, forming Pσ. Sketch for Theorem X (Strong Completeness): "Multiply" all the Pσ (via an Armstrong relation) to obtain a P that is a perfect map of G. (Continue here with "proof by intimidation"…)

Interesting conclusion of Theorem Y: All independence statements that follow, for strictly positive probability distributions, from the neighborhood basis are derivable via symmetry, decomposition, intersection, and weak union. These axioms are (sound and) complete for neighborhood bases, and (sound and) complete for pairwise bases as well. In fact, for saturated statements (those whose span of variables is all of U), conditional independence and vertex separation have exactly the same axioms. Isn't that amazing? (See paper P2.)

Drawback: interpreting the links is not simple. Another drawback is the difficulty with extreme probabilities: there is no local test for I-mapness. Both drawbacks disappear in the class of decomposable models, which are a special case of Bayesian networks.

Decomposable Models. Example: Markov chains and Markov trees. Assume the chain X1 – X2 – X3 – X4 is an I-map of some P(x1,x2,x3,x4) and was constructed using the methods just described. The compatibility functions on the links can be easily interpreted in the case of chains, and likewise for trees. This idea in fact works for all chordal graphs.
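For the chain this interpretation is explicit. By the chain rule and the Markov property, P(x1,x2,x3,x4) = P(x1) P(x2|x1) P(x3|x2) P(x4|x3) = P(x1,x2) P(x2,x3) P(x3,x4) / ( P(x2) P(x3) ), so each link can carry the marginal over its two endpoints as its compatibility function, with the marginals of the shared (separator) variables serving as divisors.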

Chordal Graphs

Interpretation of the links (figure: three overlapping cliques, Clique 1 – Clique 3). A probability distribution that can be written as a product of low-order marginals divided by a product of low-order marginals is said to be decomposable.

Importance of Decomposability. When assigning compatibility functions, it suffices to use marginal probabilities on the cliques and to make sure they are locally consistent. Marginals can be assessed from experts or estimated directly from data.

Main results on d-separation. The definition of ID(X,Z,Y) is such that: Soundness [Theorem 9]: ID(X,Z,Y) = yes implies that IP(X,Z,Y) follows from the boundary basis Basis(D). Completeness [Theorem 10]: ID(X,Z,Y) = no implies that IP(X,Z,Y) does not follow from the boundary basis Basis(D).
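ID can be tested mechanically. One standard route (equivalent to the path-based definition of d-separation) goes through moralization: X and Y are d-separated by Z iff Z separates them in the moral graph of the smallest ancestral set containing X∪Y∪Z. A sketch (parents maps each node to a tuple of its parents; d_separated is an illustrative name):

    from collections import deque
    from itertools import combinations

    def d_separated(parents, X, Z, Y):
        # ID(X, Z, Y) via moralization: Z d-separates X from Y iff Z separates
        # X from Y in the moralized ancestral graph of X ∪ Y ∪ Z.
        X, Z, Y = set(X), set(Z), set(Y)
        # 1. Smallest ancestral set containing X ∪ Y ∪ Z.
        keep, stack = set(), list(X | Y | Z)
        while stack:
            u = stack.pop()
            if u not in keep:
                keep.add(u)
                stack.extend(parents[u])
        # 2. Moralize: undirected parent-child edges, plus "marry" co-parents.
        adj = {u: set() for u in keep}
        for u in keep:
            for p in parents[u]:
                adj[u].add(p); adj[p].add(u)
            for p, q in combinations(parents[u], 2):
                adj[p].add(q); adj[q].add(p)
        # 3. Ordinary separation test: BFS from X that never enters Z.
        seen, frontier = set(X), deque(X)
        while frontier:
            u = frontier.popleft()
            for v in adj[u]:
                if v in Z or v in seen:
                    continue
                if v in Y:
                    return False
                seen.add(v)
                frontier.append(v)
        return True

    # Two coins and a bell (Coin1 -> Bell <- Coin2):
    parents = {'Coin1': (), 'Coin2': (), 'Bell': ('Coin1', 'Coin2')}
    print(d_separated(parents, {'Coin1'}, set(), {'Coin2'}))     # True
    print(d_separated(parents, {'Coin1'}, {'Bell'}, {'Coin2'}))  # False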

Claim 1: Each vertex Xi in a Bayesian network is d-separated from all its non-descendants given its parents pai. Proof: Each path from Xi to a non-descendant passes either through its parents or through its descendants. All paths via its parents are blocked because pai is given, and all paths via descendants are blocked because they pass through converging edges → Z ← where Z is not given. Hence, by the definition of d-separation, the claim holds: ID(Xi, pai, non-descendantsi).

Claim 2: Every topological order d in a BN entails the same set of independence assumptions. Proof: By Claim 1, ID(Xi, pai, non-descendantsi) holds. For each topological order d on {1,…,n}, ID(Xd(i), pad(i), non-descendantsd(i)) holds as well, and from soundness (Theorem 9), so does IP(Xd(i), pad(i), non-descendantsd(i)). By the decomposition property of conditional independence, IP(Xd(i), pad(i), S) holds for every subset S of non-descendantsd(i). Hence Xi is independent, given its parents, also of S = {all variables that precede Xi in an arbitrary topological order d}.