
1 Radu Marinescu 4C @ University College Cork

2 Uncertainty in medical diagnosis  Diseases produce symptoms  In diagnosis, observed symptoms => disease ID  Uncertainties Symptoms may not occur Symptoms may not be reported Diagnostic tests are not perfect – False positive, false negative How do we estimate confidence?  P(disease | symptoms, tests) = ?

3 Uncertainty in medical decision-making  Physicians, patients must decide on treatments  Treatments may not be successful  Treatments may have unpleasant side effects Choosing treatments  Weigh risks of adverse outcomes People are BAD at reasoning intuitively about probabilities  Provide systematic analysis

4 Probabilistic modeling with joint distributions Conditional independence and factorization Belief (or Bayesian) networks  Example networks and software Inference in belief networks  Exact inference Variable elimination, join-tree clustering, AND/OR search  Approximate inference Mini-clustering, belief propagation, sampling

5 Judea Pearl. “Probabilistic reasoning in intelligent systems”, 1988 Stuart Russell & Peter Norvig. “Artificial Intelligence. A Modern Approach”, 2002 (Ch 13-17) Kevin Murphy. "A Brief Introduction to Graphical Models and Bayesian Networks" http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html Rina Dechter. "Bucket Elimination: A Unifying Framework for Probabilistic Inference" http://www.ics.uci.edu/~csp/R48a.ps Rina Dechter. "Mini-Buckets: A General Scheme for Approximating Inference" http://www.ics.uci.edu/~csp/r62a.pdf Rina Dechter & Robert Mateescu. "AND/OR Search Spaces for Graphical Models". http://www.ics.uci.edu/~csp/r126.pdf

6 A problem domain is modeled by a list of (discrete) random variables: X1, X2, …, Xn Knowledge about the problem is represented by a joint probability distribution: P(X1, X2, …, Xn)

7 Alarm (Pearl88)  Story: In Los Angeles, burglary and earthquake are common. They both can trigger an alarm. In case of alarm, two neighbors John and Mary may call 911  Problem: estimate the probability of a burglary based on who has or has not called  Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M)  Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)

8 Defines probabilities for all possible value assignments to the variables in the set

9 What is the probability of burglary given that Mary called, P(B=y | M=y)? Compute the marginal probability: P(B=y, M=y) = ∑E,A,J P(B=y, E, A, J, M=y) Compute the answer (reasoning by conditioning): P(B=y | M=y) = P(B=y, M=y) / P(M=y)
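For concreteness, this brute-force computation can be written directly over the full joint. A minimal Python sketch, assuming binary variables and using the CPT numbers from Russell & Norvig's version of the Burglary example (an assumption; the slides do not list the numbers):

```python
from itertools import product

# CPTs for the Burglary-Earthquake-Alarm example; the numeric values are
# taken from Russell & Norvig's textbook version (an assumption, the slides
# themselves do not give numbers).
P_B = {True: 0.001, False: 0.999}
P_E = {True: 0.002, False: 0.998}
P_A = {(True, True): 0.95, (True, False): 0.94,
       (False, True): 0.29, (False, False): 0.001}   # P(A=true | B, E)
P_J = {True: 0.90, False: 0.05}                       # P(J=true | A)
P_M = {True: 0.70, False: 0.01}                       # P(M=true | A)

def joint(b, e, a, j, m):
    """P(B,E,A,J,M) as the product of the five CPT entries."""
    pa = P_A[(b, e)] if a else 1 - P_A[(b, e)]
    pj = P_J[a] if j else 1 - P_J[a]
    pm = P_M[a] if m else 1 - P_M[a]
    return P_B[b] * P_E[e] * pa * pj * pm

# Reasoning by conditioning: P(B=y | M=y) = P(B=y, M=y) / P(M=y),
# where each term is a sum of the joint over the remaining variables.
p_b_m = sum(joint(True, e, a, j, True) for e, a, j in product([True, False], repeat=3))
p_m   = sum(joint(b, e, a, j, True) for b, e, a, j in product([True, False], repeat=4))
print("P(B=y | M=y) =", p_b_m / p_m)
```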

10 Probability theory well-established and well understood In theory, can perform arbitrary inference among the variables given a joint probability. This is because the joint probability contains information of all aspects of the relationships among the variables  Diagnostic inference: From effects to causes Example: P(B=y | M=y)  Predictive inference: From causes to effects Example: P(M=y | B=y)  Combining evidence: P(B=y | J=y, M=y, E=n) All inference sanctioned by probability theory and hence has clear semantics

11 In the Alarm example:  32 numbers needed (parameters)  Quite unnatural to assess P(B=y, E=y, A=y, J=y, M=y)  Computing P(B=y | M=y) takes 29 additions In general,  P(X1, X2, …, Xn) needs at least 2^n numbers to specify the joint probability distribution  Knowledge acquisition difficult (complex, unnatural)  Exponential storage and inference

12 Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks  Example networks and software Inference in belief networks  Exact inference  Approximate inference Miscellaneous  Mixed networks, influence diagrams, etc.

13 Overcome the problem of exponential size by exploiting conditional independencies  The chain rule of probability: P(X1, X2, …, Xn) = P(X1) P(X2|X1) ⋯ P(Xn|X1, …, Xn-1)  No gains yet. The number of parameters required by the factors is still O(2^n)

14 A random variable X is conditionally independent of a set of random variables Y given a set of random variables Z if  P(X | Y, Z) = P(X | Z) Intuitively:  Y tells us nothing more about X than we know by knowing Z  As far as X is concerned, we can ignore Y if we know Z

15 About P(Xi|X1,…,Xi-1):  Domain knowledge usually allows one to identify a subset pa(Xi) ⊆ {X1, …, Xi-1} such that: given pa(Xi), Xi is independent of all variables in {X1,…,Xi-1} \ pa(Xi), i.e. P(Xi | X1, …, Xi-1) = P(Xi | pa(Xi)) Then P(X1, X2, …, Xn) = ∏i P(Xi | pa(Xi)) Joint distribution factorized! The number of parameters might have been substantially reduced

16 pa(B) = {}, pa(E) = {}, pa(A) = {B,E}, pa(J) = {A}, pa(M) = {A} Conditional probability tables (CPT)

17 Model size reduced from 32 to 2+2+4+4+8=20 Model construction easier  Fewer parameters to assess  Parameters more natural to assess, e.g., P(B=y), P(J=y | A=y), P(A=y | B=y, E=y), etc. Inference easier. We will see this later.
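A short sketch verifying this parameter count from the parent sets on slide 16 (binary domains assumed):

```python
# Parent sets from slide 16; all variables are binary.
parents = {"B": [], "E": [], "A": ["B", "E"], "J": ["A"], "M": ["A"]}
k = 2  # domain size

full_joint_entries = k ** len(parents)                                 # 2^5 = 32
factored_entries = sum(k ** (1 + len(pa)) for pa in parents.values())  # 2+2+8+4+4 = 20
print(full_joint_entries, factored_entries)                            # 32 20
```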

18 Probabilistic modeling with joint distributions Conditional Independence and factorization Belief networks  Example networks and software Inference in belief networks  Exact inference  Approximate inference

19 Graphically represent the conditional independency relationships:  Construct a directed graph by drawing an arc from X j to X i iff X j  pa(X i )  Also attach the CPT P(X i | pa(X i )) to node X i BE A JM P(B)P(E) P(A|B,E) P(J|A)P(M|A)

20 A belief network is:  A directed acyclic graph (DAG), where: each node represents a random variable and is associated with the conditional probability of the node given its parents  Represents the joint probability distribution: P(X1, …, Xn) = ∏i P(Xi | pa(Xi))  A variable is conditionally independent of its non-descendants given its parents

21 3 basic independence structures Burglary Alarm JohnCalls 1: chain Burglary Alarm Earthquake 2: common descendants MaryCalls Alarm JohnCalls 3: common ancestors

22 Burglary Alarm JohnCalls 1. JohnCalls is independent of Burglary given Alarm

23 Burglary Alarm Earthquake 2. Burglary is independent of Earthquake not knowing Alarm. Burglary and Earthquake become dependent given Alarm!!

24 MaryCalls Alarm JohnCalls 3. MaryCalls is independent of JohnCalls given Alarm.

25 BN models many conditional independence relations relating distant variables and sets, which are defined in terms of the graphical criterion called d-separation d-separation = conditional independence  Let X, Y and Z be three sets of nodes  If X and Y are d-separated by Z, then X and Y are conditionally independent given Z: P(X|Y, Z) = P(X|Z) d-separation in the graph:  A is d-separated from B given C if every undirected path between them is blocked Path blocking  3 cases that expand on three basic independence structures

26 Path blocking at a node Z on a path between X and Y:  With a "linear" substructure (X → Z → Y) or a "wedge" substructure (X ← Z → Y, common ancestor): the path is blocked if Z is in C  With a "vee" substructure (X → Z ← Y, common descendant): the path is blocked if neither Z nor any of its descendants is in C

27 Example: nodes 1-5 with edges 1 → 2, 1 → 3, 2 → 4, 3 → 4, 4 → 5. X = {2} and Y = {3} are d-separated by Z = {1}: path 2 ← 1 → 3 is blocked by 1 ∈ Z; path 2 → 4 ← 3 is blocked because 4 and all its descendants are outside Z. X = {2} and Y = {3} are not d-separated by Z = {1,5}: path 2 ← 1 → 3 is blocked by 1 ∈ Z, but path 2 → 4 ← 3 is activated because 5 (which is a descendant of 4) is in Z; learning the value of consequence 5 renders causes 2 and 3 dependent.

28 Given a probability distribution P on a set of variables {X 1, …, X n }, a belief network B representing P is a minimal I-map (Pearl88)  I-mapness: every d-separation condition displayed in B corresponds to a valid conditional independence relationship in P  Minimal: none of the arrows in B can be deleted without destroying its I-mapness

29 Rewrite the full joint probability using the product rule:
P(B,E,A,J,M) = P(J|B,E,A,M) P(B,E,A,M) = P(J|A) P(B,E,A,M)
= P(J|A) P(M|B,E,A) P(B,E,A) = P(J|A) P(M|A) P(B,E,A)
= P(J|A) P(M|A) P(A|B,E) P(B,E)
= P(J|A) P(M|A) P(A|B,E) P(B) P(E)

30 The "alarm" network: Monitoring Intensive-Care Patients. 37 variables, 509 parameters (instead of 2^37). (Figure: the full network, with variables such as HR, BP, CVP, SAO2, ARTCO2, VENTLUNG, INTUBATION, PULMEMBOLUS, etc.)

31 GeNIe (University of Pittsburgh) - free  http://genie.sis.pitt.edu SamIam (UCLA) - free  http://reasoning.cs.ucla.edu/SamIam/ Hugin - commercial  http://www.hugin.com Netica - commercial  http://www.norsys.com UCI Lab - free but no GUI  http://graphmod.ics.uci.edu/

32

33 Belief networks are used in:  Genetic linkage analysis  Speech recognition  Medical diagnosis  Probabilistic error correcting coding  Monitoring and diagnosis in distributed systems  Troubleshooting (Microsoft)  …

34 Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Inference in belief networks  Exact inference  Approximate inference

35 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search (tree, graph)

36 Smoking Bronchitis Lung cancer X-ray Dyspnoea P(Lung cancer = yes | Smoking = no, Dyspnoea = yes) ?

37 Belief updating Maximum probable explanation (MPE) Maximum a posteriori hypothesis (MAP)

38 Belief updating by Variable Elimination (network with CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C)): P(A|E=0) = α P(A,E=0) = α ∑E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = α P(A) ∑E=0 ∑D ∑C P(C|A) ∑B P(B|A) P(D|A,B) P(E|B,C), where the innermost sum defines λB(A,D,C,E).

39 Moralize ("marry parents") the graph, then place each CPT in the bucket of its latest variable. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A)
Bucket D:
Bucket E: E=0
Bucket A: P(A)

40 ELIMINATION: multiply (*) and sum (∑). bucket(B): { P(E|B,C), P(D|A,B), P(B|A) } → λB(A,C,D,E) = ∑B P(B|A)·P(D|A,B)·P(E|B,C) OBSERVED BUCKET: bucket(B): { P(E|B,C), P(D|A,B), P(B|A), B=1 } → λB(A) = P(B=1|A), λB(A,D) = P(D|A,B=1), λB(E,C) = P(E|B=1,C)

41

42

43 Elimination operator: ∑∏. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A) → P(A,E=0)
w* = 4 "induced width" (max clique size)
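A compact sketch of bucket elimination for P(A, E=0) along this ordering. The CPT values are random placeholders (the slides give no numbers), and factors are plain Python dicts rather than any particular library's API:

```python
import itertools, random

def make_cpt(var, parents, rng):
    """Random placeholder CPT P(var | parents) over binary variables."""
    scope = tuple(parents) + (var,)
    table = {}
    for pa in itertools.product([0, 1], repeat=len(parents)):
        p = rng.random()
        table[pa + (0,)] = p
        table[pa + (1,)] = 1 - p
    return scope, table

def multiply(f, g):
    fs, ft = f
    gs, gt = g
    scope = tuple(dict.fromkeys(fs + gs))            # union of scopes
    table = {}
    for asg in itertools.product([0, 1], repeat=len(scope)):
        a = dict(zip(scope, asg))
        table[asg] = ft[tuple(a[v] for v in fs)] * gt[tuple(a[v] for v in gs)]
    return scope, table

def sum_out(f, var):
    fs, ft = f
    scope = tuple(v for v in fs if v != var)
    table = {}
    for asg, val in ft.items():
        key = tuple(x for v, x in zip(fs, asg) if v != var)
        table[key] = table.get(key, 0.0) + val
    return scope, table

rng = random.Random(0)
cpts = [make_cpt("A", [], rng), make_cpt("B", ["A"], rng), make_cpt("C", ["A"], rng),
        make_cpt("D", ["A", "B"], rng), make_cpt("E", ["B", "C"], rng)]

order = ["A", "E", "D", "C", "B"]
buckets = {v: [] for v in order}
for f in cpts:                                       # bucket of the latest scope variable
    buckets[max(f[0], key=order.index)].append(f)

evidence = {"E": 0}
for var in reversed(order[1:]):                      # process buckets B, C, D, E
    if not buckets[var]:
        continue
    combined = buckets[var][0]
    for f in buckets[var][1:]:
        combined = multiply(combined, f)
    if var in evidence:                              # observed bucket: instantiate, do not sum
        s, t = combined
        scope = tuple(v for v in s if v != var)
        table = {tuple(dict(zip(s, asg))[v] for v in scope): val
                 for asg, val in t.items() if dict(zip(s, asg))[var] == evidence[var]}
        msg = (scope, table)
    else:
        msg = sum_out(combined, var)
    dest = max(msg[0], key=order.index) if msg[0] else order[0]
    buckets[dest].append(msg)                        # pass the message down

result = buckets["A"][0]
for f in buckets["A"][1:]:
    result = multiply(result, f)
print("P(A, E=0):", {a: round(v, 4) for a, v in result[1].items()})
```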

44 B C D E A A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) Induced width of the ordering w*(d) || max width of the nodes A BC ED

45 w*(d) – induced width of the moral graph along ordering d A BC ED “Moral” graph B C D E A w*(d 1 ) = 4 E D C B A w*(d 2 ) = 2

46 Finding a minimum induced-width ordering is NP-complete. A tree has induced width of 1. Greedy ordering algorithms:  Min-width  Min induced-width  Max-cardinality  Min-fill (thought to be the best)  Anytime min-width (via Branch-and-Bound)
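As an illustration of the min-fill heuristic listed above, a small sketch that repeatedly eliminates the node whose removal adds the fewest fill edges; the example graph is the moral graph of the five-variable network from the earlier slides, and the function returns the elimination order together with its induced width:

```python
def min_fill_ordering(adj):
    """Greedy min-fill: pick the node whose elimination adds the fewest fill
    edges, connect its remaining neighbors, remove it, and repeat.
    `adj` maps node -> set of neighbors; returns (order, induced width)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    order, width = [], 0
    while adj:
        def fill(v):
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill)
        width = max(width, len(adj[v]))          # width of v when it is eliminated
        nbrs = list(adj[v])
        for i in range(len(nbrs)):               # add the fill edges
            for j in range(i + 1, len(nbrs)):
                adj[nbrs[i]].add(nbrs[j]); adj[nbrs[j]].add(nbrs[i])
        for u in nbrs:
            adj[u].discard(v)
        del adj[v]
        order.append(v)
    return order, width

# Moral graph of the A,B,C,D,E example (edges AB, AC, AD, BC, BD, BE, CE):
moral = {"A": {"B", "C", "D"}, "B": {"A", "C", "D", "E"},
         "C": {"A", "B", "E"}, "D": {"A", "B"}, "E": {"B", "C"}}
print(min_fill_ordering(moral))                  # finds an ordering of induced width 2
```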

47 Smoking Bronchitis Lung Cancer X-ray Dyspnoea

48 Probabilistic decoding  A stream of bits is transmitted across a noisy channel and the problem is to recover the transmitted stream given the observed output and parity check bits x0x0 x1x1 x2x2 x3x3 x4x4 u0u0 u1u1 u2u2 u3u3 u4u4 y0uy0u y1uy1u y2uy2u y3uy3u y4uy4u y0xy0x y1xy1x y2xy2x y3xy3x y4xy4x Transmitted bits Parity check bits Received bits (observed) Received parity check bits (observed)

49 Medical diagnosis  Given some observed symptoms, determine the most likely subset of diseases that may explain the symptoms Symptom2 Symptom3 Symptom4 Symptom5 Symptom1 Symptom6 Disease1 Disease2Disease4 Disease6 Disease5 Disease3 Disease7

50 Genetic linkage analysis  Given the genotype information of a pedigree, infer the maximum likelihood haplotype configuration (maternal and paternal) of the unobserved individuals 2 1 A B a b A a B b 3 genotyped haplotype S 23m L 21f L 21m L 23m X 21 S 23f L 22f L 22m L 23f X 22 X 23 S 13m L 11f L 11m L 13m X 11 S 13f L 12f L 12m L 13f X 12 X 13 Locus 1 Locus 2 (Fishelson & Geiger, 2002)

51 Most Probable Explanation by Variable Elimination (same network; CPTs P(A), P(B|A), P(C|A), P(D|A,B), P(E|B,C)): MPE = maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C), where the innermost maximization defines λB(A,D,C,E).

52 Maxing out a variable from a function:
f(A,B,C):
A B C   f(A,B,C)
T T T   0.03
T T F   0.07
T F T   0.54
T F F   0.36
F T T   0.06
F T F   0.14
F F T   0.48
F F F   0.32
max out B →
A C   f(A,C)
T T   0.54
T F   0.36
F T   0.48
F F   0.32
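The max-out operation is a one-line pass over the factor table; a small sketch reproducing the two tables above:

```python
# f(A,B,C) from the table on this slide; keys are (A, B, C) truth values.
f_abc = {(True, True, True): 0.03, (True, True, False): 0.07,
         (True, False, True): 0.54, (True, False, False): 0.36,
         (False, True, True): 0.06, (False, True, False): 0.14,
         (False, False, True): 0.48, (False, False, False): 0.32}

def max_out(table, axis):
    """Eliminate one variable (given by its position in the key) by maximization."""
    out = {}
    for key, val in table.items():
        reduced = key[:axis] + key[axis + 1:]
        out[reduced] = max(out.get(reduced, 0.0), val)
    return out

f_ac = max_out(f_abc, axis=1)   # max over B
print(f_ac)  # {(T,T): 0.54, (T,F): 0.36, (F,T): 0.48, (F,F): 0.32}, as in the slide
```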

53 Elimination/combination operators: max, ∏. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A) → MPE value
width: 4, 3, 1, 1, 0; w* = 4 "induced width" (max clique size)

54 Recovering the MPE assignment (forward pass over the buckets):
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A)
a' = argmax P(A)·λE(A)
e' = 0
d' = argmax λC(a',D,e')
c' = argmax P(C|a')·λB(a',d',C,e')
b' = argmax P(e'|B,c')·P(d'|a',B)·P(B|a')
Return (a', b', c', d', e')

55 w*(d) – induced width of the moral graph along ordering d A BC ED “Moral” graph B C D E A w*(d 1 ) = 4 E D C B A w*(d 2 ) = 2

56 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search (tree, graph)

57 Motivation  BE computes P(evidence) or P(X|evidence) where X is the last variable in the ordering  What if we need all marginal probabilities P(X i |evidence), where X i  {X 1, X 2, …, X n } ? Run BE n times with X i being the last variable Inefficient! – induced width may vary significantly from one ordering to another SOLUTION: Bucket-Tree Elimination (BTE)

58 Variable elimination can be viewed as message passing (elimination) along a bucket tree. Any node (bucket) can be the root. Complexity: time and space exponential in the induced width. Ordering: A, B, C, D, E.
Bucket E: P(E|B,C)
Bucket D: P(D|A,B)
Bucket C: P(C|A), λE(B,C)
Bucket B: P(B|A), λD(A,B), λC(A,B)
Bucket A: P(A), λB(A)

59 Bucket Tree  A bucket tree has each bucket B i as a node and there is an arc from B i to B j if the function created at B i was placed in B j Graph-based definition  Let G d be the induced graph along d. Each variable X and its earlier neighbors is a node B X. There is an arc from B X to B Y if Y is the closest parent to X.

60 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) Belief network E D C B A Induced graph E,B,C A,B,D A,B,C B,A A E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) Bucket tree P(C|A)

61 u XnXn X2X2 X1X1 v h(u,v) … Compute the message: h(x 1,u) h(x n,u) elim(u,v) = vars(u) – vars(v)

62 E,B,C A,B,D A,B,C B,A A E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) π A (A) π C (B,C) π B (A,B) A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A)

63 E,B,C : P(E|B,C) A,B,D : P(D|A,B) A,B,C : P(C|A) B,A : P(B|A) A : P(A) E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) π A (A) π C (B,C) π B (A,B)

64 G,F F,B,CD,B,A A,B,CB,A A F B,CA,B A G,F F,B,CD,B,A A,B,C F B,C A,B G,F A,B,C,D,F F A BC FD P(A) P(B|A) P(F|B,C) P(D|A,B) Time-space trade off! G P(C|A) P(G|F)

65 A tree decomposition for a belief network ‹X,D,G,P› is a triple ‹T,χ,ψ›, where T=(V,E) is a tree, and χ and ψ are labeling functions, associating with each vertex v  V two sets χ(v)  V and ψ(v)  P such that:  For each function (CPT) p i  P there is exactly one vertex such that p i  ψ(v) and scope(p i )  χ(v)  For each variable X i  X, the set {v  V | X i  χ(v)} forms a connected sub-tree (running intersection property) A join-tree is a tree decomposition where all clusters are maximal  E.g., a bucket-tree is a tree decomposition but not a join-tree

66 The width (aka treewidth) of a tree decomposition ‹T,χ,ψ› is max |χ(v)|, and its hyperwidth is max |ψ(v)|. Given two adjacent vertices u and v of a tree decomposition, the separator of u and v is defined as sep(u,v) = χ(u) ∩ χ(v)

67 Good join trees using triangulation  Create induced graph G’ along some ordering d  Identify all maximal cliques in G’  Order cliques {C 1, C 2, …, C t } by rank of the highest vertex in each clique  Form the join tree by connecting each C i to a predecessor C j (j < i) sharing the largest number of vertices with C i

68 E D C B A Induced graph A BC ED Moral graph ECB C3C3 DBA C2C2 CBA C1C1 P(A) P(B|A) P(C|A) P(E|B,C)P(D|A,B) BC P(E|B,C) P(D|A,B) P(A), P(B|A), P(C|A) AB Treewidth = 3 Separator size = 2 χ(C 3 ) ψ(C 3 ) separators

69 Join tree for the network with CPTs P(A), P(B|A), P(C|A,B), P(D|B), P(F|C,D), P(E|B,F), P(G|E,F):
Cluster 1: ABC: P(A), P(B|A), P(C|A,B)
Cluster 2: BCDF: P(D|B), P(F|C,D)
Cluster 3: BEF: P(E|B,F)
Cluster 4: EFG: P(G|E,F)
Separators: BC (1-2), BF (2-3), EF (3-4)

70 A B CD F E G ABC BCDF BEF EFG BC BF EF 1 2 3 4 Time: O(exp(w+1)) Space: O(exp(sep))

71 Correctness and completeness  Algorithm CTE is correct, i.e. it computes the exact joint probability of a single variable and the evidence Time complexity: O(deg × (n+N) × d^(w*+1)) Space complexity: O(N × d^sep) » deg = max degree of a node in T » n = number of variables (= number of CPTs) » N = number of nodes in T » d = maximum domain size » w* = induced width » sep = separator size

72 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  Cycle cutset scheme  VE+C hybrid  AND/OR search (tree, graph)

73 0000 01010101 0101010101010101 0101 E C D B A 01 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=0|A=0,B=0) P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=1|A=0,B=0) … P(A=0)P(B=1|A=0)P(C=1|A=0)P(E=0|B=1,C=1)P(D=1|A=0,B=1) ∑ = P(A=0, E=0)

74 0000 01010101 0101010101010101 0101 E C D B A 01 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) P(A=0, E=0)P(A=1, E=0)

75 IDEA: condition until w* of the remaining graph gets small enough! 0000 0101 E C D B A 01 Search Elimination A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) w* = 1w* loop cutset w* = ww* = 0 searchw-cutsetelimination

76 Condition until we get a polytree (no loops)  subset of conditioning variables = loop-cutset A BC ED BC ED A=0 BC ED A=1 P(B|D=0) = P(B,A=0|D=0) + P(B,A=1|D=0) Loop-cutset method is time exponential in loop-cutset size and linear space!

77 Identify a w-cutset, C w, of the network  Finding smallest loop-cutset/w-cutset is NP-hard For each assignment of the cutset, solve by VE the conditioned subproblem Aggregate the solutions over all cutset assignments Time complexity: exp(|C w | + w) Space complexity: exp(w)
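A schematic sketch of the cutset-conditioning loop just described; brute-force enumeration of the remaining variables stands in for the variable-elimination call on each conditioned subproblem, and the toy factors in the usage example are placeholders:

```python
import itertools

def cutset_conditioning(variables, cutset, factors, evidence):
    """Schematic cutset conditioning: enumerate assignments to the cutset,
    solve each conditioned subproblem (here by brute force, standing in for
    variable elimination), and aggregate the results -> unnormalized P(evidence).
    `factors` are callables taking a full assignment dict; all variables binary."""
    rest = [v for v in variables if v not in cutset and v not in evidence]
    total = 0.0
    for cut_vals in itertools.product([0, 1], repeat=len(cutset)):
        asg = dict(evidence)
        asg.update(zip(cutset, cut_vals))
        sub = 0.0                                   # conditioned subproblem
        for rest_vals in itertools.product([0, 1], repeat=len(rest)):
            asg.update(zip(rest, rest_vals))
            p = 1.0
            for f in factors:
                p *= f(asg)
            sub += p
        total += sub                                # aggregate over cutset assignments
    return total

# toy usage with placeholder factors over a 3-variable loop A-B-C (an assumption):
fA  = lambda a: 0.6 if a["A"] == 0 else 0.4
fAB = lambda a: 0.9 if a["A"] == a["B"] else 0.1
fBC = lambda a: 0.8 if a["B"] == a["C"] else 0.2
fCA = lambda a: 0.7 if a["C"] == a["A"] else 0.3
print(cutset_conditioning(["A", "B", "C"], cutset=["A"],
                          factors=[fA, fAB, fBC, fCA], evidence={"C": 1}))
```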

78

79 Eliminate

80

81

82

83 Condition

84 ...

85 All algorithms generalize to any graphical model  Through general operations of combination and marginalization  General BE, BTE, CTE, VE+C  Applicable to Markov networks, to constraint optimization, to counting number of solutions in SAT/CSP, etc.

86 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  Cycle cutset scheme  AND/OR search (tree, graph)

87 Search: Conditioning Complete Incomplete Gradient Descent Complete Incomplete Tree Clustering Variable Elimination Mini-Clustering(i) Mini-Bucket(i) Stochastic Local Search DFS search Inference: Elimination Time: exp(treewidth) Space:exp(treewidth) Time: exp(n) Space: linear AND/OR search Time: exp( treewidth*log n ) Space: linear Hybrids Space: exp(treewidth) Time: exp(treewidth) Time: exp(pathwidth) Space: exp(pathwidth) Belief Propagation Bucket Elimination

88 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  Cycle cutset  VE+C hybrid  AND/OR search spaces AND/OR tree search AND/OR graph search

89 A D BC E F Ordering: A B E C D F A D BC E F 01010101 0101010101010101 01010101010101010101010101010101 0101 E C F D B A 01

90 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 01010101 01010101 01010101 01010101 0101 0101 0101 0101 A D BC E F A D B CE F Moral graphDFS tree A D BC E F A D BC E F

91 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 01010101 01010101 01010101 01010101 0101 0101 0101 0101 E 01010101 0 C 101010101010101 F 0101010101010101010101010101010101010101010101010101010101010101 D 01010101010101010101010101010101 0 B 101 A 01 AND/OR OR A D BC E F A D B CE F 1 1 1 0 1 0

92 92 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 01010101 01010101 01010101 01010101 0101 0101 0101 0101 E 01010101 0 C 101010101010101 F 0101010101010101010101010101010101010101010101010101010101010101 D 01010101010101010101010101010101 0 B 101 A 01 AND/OR OR A D BC E F A D B CE F AND/OR size: exp(4), OR size exp(6)

93 The AND/OR search tree of R relative to a spanning-tree, T, has:  Alternating levels of: OR nodes (variables) and AND nodes (values) Successor function:  The successors of OR nodes X are all its consistent values along its path  The successors of AND are all X child variables in T A solution is a consistent subtree Task: compute the value of the root node A D BC E F A D B CE F

94 (a) Graph; (b) DFS tree, depth = 3; (c) Pseudo tree, depth = 2; (d) Chain, depth = 6. (Figure: the same 7-node graph arranged in each of the four ways.) (Freuder85, Bayardo & Miranker95)

95 N = number of nodes, P = number of parents. MIN-FILL ordering. 100 instances.

96 Finding a minimum-depth DFS tree or pseudo tree is NP-complete, but: given a tree decomposition whose treewidth is w*, there exists a pseudo tree T of G whose depth m satisfies m <= w* log n (Bayardo & Miranker96, Bodlaender & Gilbert91)

97 FA C BD E E A C B D F (AF) (EF) (A) (AB) (AC) (BC) (AE) (BD) (DE) Bucket-tree based on dd: A B C E D F E A C B D F Induced graph E A C B DF Bucket-tree used as pseudo tree AND/OR search tree Bucket-tree ABE A ABC ABAB BDEBDEAEF bucket-A bucket-E bucket-B bucket-C bucket-Dbucket-F (AE) (BE)

98 Depth-first traversal of the induced graph constructed along some elimination ordering (e.g., min-fill)  Sometimes can get slightly different trees than those obtained from the bucket-tree Recursive decomposition of the dual hypergraph while minimizing the separator size at each step  Functions (CPTs) are vertices in the dual hypergraph, while variables are hyperedges  Separator = set of hyperedges (i.e., variables)

99 Bayesian Networks Repository

100 Theorem: Any AND/OR search tree based on a pseudo tree is sound and complete (expresses all and only solutions) Theorem: Size of the AND/OR search tree is O(n k^m); size of the OR search tree is O(k^n) Theorem: Size of the AND/OR search tree can be bounded by O(exp(w* log n)) Related to: (Freuder85; Dechter90, Bayardo et al. 96, Darwiche01, Bacchus et al. 03) When the pseudo tree is a chain we get an OR space

101 Random graphs with 20 nodes, 20 edges and 2 values per node

102 v(n) is the value of the tree T(n) for the task:  Optimization (MPE): v(n) is the optimal solution in T(n)  Belief updating: v(n) is the probability of evidence in T(n). Goal: compute the value of the root node recursively using DFS search of the AND/OR tree. Theorem: Complexity of AND/OR DFS search is:  Space: O(n)  Time: O(n k^m)  Time: O(exp(w* log n))
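A minimal sketch of this depth-first value computation for the probability of evidence; the pseudo tree, CPT numbers and evidence below are illustrative placeholders, not the example from the slides:

```python
import random, itertools

rng = random.Random(1)
def cpt(var, parents):
    """Random placeholder CPT over binary variables, stored as (scope, table)."""
    t = {}
    for pa in itertools.product([0, 1], repeat=len(parents)):
        p = rng.random(); t[pa + (0,)] = p; t[pa + (1,)] = 1 - p
    return (tuple(parents) + (var,), t)

cpts = [cpt("A", []), cpt("B", ["A"]), cpt("C", ["A"]),
        cpt("E", ["A", "B"]), cpt("D", ["B", "C"])]
pseudo_children = {"A": ["B"], "B": ["E", "C"], "C": ["D"], "E": [], "D": []}
evidence = {"E": 0, "D": 1}

def weight(var, assignment):
    """w(X,x): product of CPTs mentioning X whose scope is now fully assigned."""
    w = 1.0
    for scope, table in cpts:
        if var in scope and all(v in assignment for v in scope):
            w *= table[tuple(assignment[v] for v in scope)]
    return w

def value_or(var, assignment):
    """Value of an OR node: marginalize (sum) over the variable's values."""
    total = 0.0
    values = [evidence[var]] if var in evidence else [0, 1]
    for x in values:
        assignment[var] = x
        v_and = 1.0                           # AND node: product of child OR values
        for child in pseudo_children[var]:
            v_and *= value_or(child, assignment)
        total += weight(var, assignment) * v_and
        del assignment[var]
    return total

print("P(evidence) =", value_or("A", {}))     # value of the root node
```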

103 (Figure: weighted AND/OR search tree for a 5-variable network over A, B, C, D, E.) CPTs and evidence: A: P(A); B: P(B|A); C: P(C|A); E: P(E|A,B), evidence E=0; D: P(D|B,C), evidence D=1. Arc weight w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path.

104 OR node 1 A 2k w(A,1) w(A,2) w(A,k) v(A,1) … AND node 0 X1X1 X2X2 XmXm … v(X 1 )v(X 2 )v(X m ) NOTE: the value of a terminal AND node is 1 the weight of an OR-AND arc for which no CPTs are fully instantiated is 1

105 AND node: combination operator (product). OR node: marginalization operator (summation). Value of a node = updated belief for the sub-problem below. (Figure: the weighted AND/OR tree with node values computed bottom-up.) Result: P(D=1, E=0) = 0.3028·0.6 + 0.1559·0.4 = 0.24408

106 k = domain size m = depth of pseudo-tree n = number of variables w*= treewidth

107 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search spaces AND/OR tree search AND/OR graph search

108 Any two nodes that root identical sub-trees or sub-graphs can be merged

109

110 A D BC E F GH J K A D B CE F G H J K A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 0101 0101 0101 0101 OR AND 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01

111 A OR 0 AND 1 B OR B 0 AND 1 0 1 E OR C EC EC EC DFDF DFDF DFDF DFDF AND 0101 0101 0101 0101 OR AND 0 G HH 0101 01 1 G HH 0101 01 0 J KK 0101 01 1 J KK 0101 01 A D BC E F GH J K A D B CE F G H J K

112 One way of recognizing nodes that can be merged context(X) = ancestors of X in the pseudo tree that are connected to X, or to descendants of X [ ] [A] [AB] [AE] [BC] [AB] A D B EC F pseudo tree A E C B F D A E C B F D

113 .7.8 0 A B 0 EC 0 D 01 1 D 01 01 1 EC 0 D 01 1 D 01 01 1 B 0 EC 0101 1 EC 0101 A D BC E.7.8.9.5 Evidence: E=0.4.5.7.2.8.2.8.1.9.1.9.4.6.1.9.6.4.9.8.9.5.7.5.8.9.7.5.4.5.7.2.88.54.89.52.352.27.623.104.3028.1559.24408.3028.1559 A D B CE [ ] [A] [AB] [BC] [AB] Context Cache table for D Result: P(D=1,E=0)

114 C 0 K 0 H 0 L 01 NN 0101 FFF 1 1 0101 F G 01 1 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 G 01 G 01 G 01 M 01 M 01 M 01 M 01 P 01 P 01 O 01 O 01 O 01 O 01 L 01 NN 0101 P 01 P 01 O 01 O 01 O 01 O 01 D 01 D 01 D 01 D 01 K 0 H 0 L 01 NN 0101 1 1 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 P 01 P 01 O 01 O 01 O 01 O 01 L 01 NN 0101 P 01 P 01 O 01 O 01 O 01 O 01 D 01 D 01 D 01 D 01 BA C E FG H J D K M L N O P C HK D M F G A B E J O L N P [AB] [AF] [CHAE] [CEJ] [CD] [CHAB] [CHA] [CH] [C] [ ] [CKO] [CKLN] [CKL] [CK] [C] (C K H A B E J L N O D P M F G)

115 Theorem: The maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree. C HK D M F G A B E J O L N P [AB] [AF] [CHAE] [CEJ] [CD] [CHAB] [CHA] [CH] [C] [ ] [CKO] [CKLN] [CKL] [CK] [C] (C K H A B E J L N O D P M F G) BA C E FG H J D K M L N O P max context size = treewidth

116 G E K F L H C B A M J D E K L H C A M J ABC BDEF BDFG EFH FHK HJKLM treewidth = 3 = (max cluster size) - 1 ABC BDEFGEFHFHKJKLM pathwidth = 4 = (max cluster size) - 1 D G B F TREE CHAIN

117 AO(i): searches depth-first, caching on i-contexts  i = the max size of a cache table (i.e. the number of variables in a context) Spectrum: i = 0: Space O(n), Time O(exp(w* log n)); intermediate i: Space O(exp(i)), Time O(exp(m_i + i)); i = w*: Space O(exp(w*)), Time O(exp(w*))

118 k = domain size n = number of variables w*= treewidth pw*= pathwidth w* ≤ pw* ≤ w* log n

119 Recursive Conditioning (RC) (Darwiche01)  Can be viewed as an AND/OR graph search algorithm guided by tree  Guiding tree structure is called “dtree” Value Elimination (VE) (Bacchus et al.03)  Also an AND/OR graph search algorithm using an advanced caching scheme based on components rather than graph-based contexts  Can use dynamic variable orderings

120 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search spaces AND/OR tree search AND/OR graph search

121 A C BK G L DF H M J E A C B K G L D F H M J E A C BK G L DF H M J E C B K G L D F H M J E 3-cutset A C BK G L DF H M J E C K G L D F H M J E 2-cutset A C BK G L DF H M J E L D F H M J E 1-cutset

122 A C B K G L D F H M J E A C B K G L D F H M J E A C B K G L D F H M J E pseudo tree1-cutset treemoral graph

123 AO(i): searches depth-first, caching on i-contexts  i = the max size of a cache table (i.e. the number of variables in a context) Spectrum: i = 0: Space O(n), Time O(exp(w* log n)); intermediate i: Space O(exp(i)), Time O(exp(m_i + i)); i = w*: Space O(exp(w*)), Time O(exp(w*))

124 Definition:  T_w is a w-cutset tree relative to backbone pseudo tree T, iff T_w roots T and when removed, yields treewidth w. Theorem:  AO(i) time complexity for backbone T is time O(exp(i+m_i)) and space O(i), m_i is the depth of the T_i tree. Better than w-cutset: O(exp(i+c_i)) where c_i is the number of nodes in T_i

125 Variable elimination (inference)  Bucket elimination  Bucket-Tree elimination  Cluster-Tree elimination Conditioning (search)  VE+C hybrid  AND/OR search for Most Probable Explanations

126 Solved by BE in time and space exponential in treewidth w* Solved by Conditioning in linear space and time exponential in the number of variables n It can be solved by AND/OR search:  Tree search: space O(n), time O(exp(w* log n))  Graph search: time and space O(exp(w*))

127 (Figure: the weighted AND/OR search tree from earlier, repeated.) CPTs and evidence: A: P(A); B: P(B|A); C: P(C|A); E: P(E|A,B), evidence E=0; D: P(D|B,C), evidence D=1. Arc weight w(X,x) = product of CPTs that contain X and whose scope is fully instantiated along the path.

128 OR node 1 A 2k w(A,1) w(A,2) w(A,k) v(A,1) … AND node 0 X1X1 X2X2 XmXm … v(X 1 )v(X 2 )v(X m ) NOTE: the value of a terminal AND node is 1 the weight of an OR-AND arc for which no CPTs are fully instantiated is 1

129 AND node: combination operator (product). OR node: marginalization operator (maximization). Value of a node = MPE value for the sub-problem below. (Figure: the weighted AND/OR tree with node values computed bottom-up by max/product.) Result: MPE(D=1, E=0) = max(0.12·0.6, 0.081·0.4) = 0.072

130 Branch and Bound in an OR search tree (Lawler & Wood66): g(n) = cost of the search path to n; h(n) estimates the optimal cost below n; upper bound UB(n) = g(n) · h(n); prune below n if UB(n) ≤ LB, the current lower bound (best solution found so far).

131 0 D 0 (A=0, B=0, C=0, D=0) 0 A BC 0 0 A BC 00 D 1 (A=0, B=0, C=0, D=1) 0 A BC 01 D 0 (A=0, B=1, C=0, D=0) 0 A BC 01 D 1 (A=0, B=1, C=0, D=1) A BC D Pseudo tree Extension(T’) – solution trees that extend T’

132 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 6485 45 45 24 9 9 2500 0 0 0 1 0 0 D 0 C 1 v(D,0) 3 350 0 9 tip nodes F 1 3 35 0 F v(F) A B C DE F A B CD E F f*(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * v(D,0) * v(F)

133 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 6485 45 45 24 9 9 2500 0 0 0 1 0 0 D 0 C 1 h(D,0) = 4 3 350 0 9 tip nodes F 1 3 35 0 F h(F) = 5 A B C DE F A B CD E F f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’) h(n) ≥ v(n)

134 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 1 0 D E E 0101 01 C 1 0 B 01 f(T’) ≤ LB LB (Marinescu and Dechter, 05)

135 Associate each node n with a heuristic upper bound h(n) on v(n) EXPAND (top-down)  Evaluate f(T') of the current partial solution sub-tree T', and prune the search if f(T') ≤ LB  Expand the tip node n by generating its successors PROPAGATE (bottom-up)  Update the value of the parent p of n OR nodes: maximization AND nodes: product
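A small sketch of this depth-first Branch-and-Bound loop, written over a plain OR search tree for simplicity (the AND/OR version adds problem decomposition but prunes with the same f(T') ≤ LB test); the factor weights and the trivial heuristic in the usage example are placeholders:

```python
def dfs_branch_and_bound(variables, factors, heuristic):
    """Schematic depth-first Branch and Bound for MPE over an OR search tree.
    `factors[X]` is a function(assignment)->weight applied when X is assigned;
    `heuristic(i, assignment)` must upper-bound the best achievable product
    for variables[i:]. All variables are binary."""
    best = {"lb": 0.0, "asg": None}

    def expand(i, g, assignment):
        if i == len(variables):                    # full assignment reached
            if g > best["lb"]:
                best["lb"], best["asg"] = g, dict(assignment)
            return
        var = variables[i]
        for value in (0, 1):
            assignment[var] = value
            g_new = g * factors[var](assignment)   # extend the path cost g(n)
            # prune if the optimistic bound g(n)*h(n) cannot beat the best so far
            if g_new * heuristic(i + 1, assignment) > best["lb"]:
                expand(i + 1, g_new, assignment)
            del assignment[var]

    expand(0, 1.0, {})
    return best["lb"], best["asg"]

# toy usage with made-up weights and the trivial (admissible) heuristic h = 1.0:
fs = {"A": lambda a: 0.6 if a["A"] else 0.4,
      "B": lambda a: 0.9 if a["B"] == a["A"] else 0.1}
print(dfs_branch_and_bound(["A", "B"], fs, lambda i, a: 1.0))   # (0.54, {'A': 1, 'B': 1})
```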

136 The principle of relaxed models  Mini-Bucket Elimination for belief networks (Pearl86)

137 Min-fill pseudo tree. Time limit 1 hour. (Sang et al.05)

138 (Fishelson&Geiger02) Min-fill pseudo tree. Time limit 3 hours.

139 Associate each node n with a heuristic upper bound h(n) on v(n) EXPAND (top-down)  Evaluate f(T’) of the current partial solution sub-tree T’, and prune search if f(T’) ≤ LB  If not in cache, expand the tip node n by generating its successors PROPAGATE (bottom-up)  Update value of the parent p of n OR nodes: maximization AND nodes: multiplication  Cache value of n, based on context

140 Best-first search expands first the node with the best heuristic evaluation function among all nodes encountered so far It never expands nodes whose cost is beyond the optimal one, unlike depth-first search algorithms (Dechter & Pearl85) Superior among memory intensive algorithms employing the same heuristic function

141 Maintains the set of best partial solution trees EXPAND (top-down)  Traces down marked connectors from root (best partial solution tree)  Expands a tip node n by generating its successors n’  Associate each successor with heuristic estimate h(n’) Initialize v(n’) = h(n’) REVISE (bottom-up)  Updates node values v(n) OR nodes: maximization AND nodes: multiplication  Marks the most promising solution tree from the root  Label the nodes as SOLVED: OR is SOLVED if marked child is SOLVED AND is SOLVED if all children are SOLVED Terminate when root node is SOLVED [specializes Nilsson’s AO* to graphical models (Nilsson80)] (Marinescu & Dechter, 07)

142 Min-fill pseudo tree. Time limit 1 hour.

143 Solved by BE in time and space exponential in constrained induced width w* Solved by AND/OR search:  Tree search: space O(n), time O(exp(w* log n))  Graph search: time and space O(exp(w*))

144 A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) A BC ED Moralize (marry parents) Variables A and B are the hypothesis variables, variable E is evidence

145 Bucket elimination for MAP (SUM buckets for E, D, C; MAX buckets for B, A):
Bucket E: P(E|B,C), E = 0
Bucket D: P(D|A,B)
Bucket C: P(C|A), λE(B,C)
Bucket B: P(B|A), λC(A,B), λD(A,B)
Bucket A: P(A), λB(A) → MAP value

146 Elimination order is important: SUM variables are eliminated first, followed by the MAX variables  ordering: A, B, C, D, E is legal  ordering: A, C, D, E, B is illegal Induced width corresponding to a legal elimination order is called constrained induced width cw*  Typically it may be far larger than the unconstrained induced width, ie cw* ≥ w* When interleaving MAX and SUM (using unconstrained orderings) the result is an Upper Bound on the MAP value  Can be used as a guiding heuristic function for search

147 AND node: combination operator (product). OR node: MAX for hypothesis variables, SUM otherwise. (Figure: the weighted AND/OR tree with node values computed bottom-up.) Result: MAP(D=1, E=0) = max(0.162·0.6, 0.0936·0.4) = 0.0972

148 Pseudo tree must be consistent with the constrained elimination order Graph search via context-based caching Time and space complexity  Tree search: Space linear, time O(exp(cw*log n))  Graph search: Time and space O(exp(cw*))

149 Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Inference in belief networks  Exact inference  Approximate inference

150 Mini-Bucket Elimination  Mini-clustering Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling

151 Search: Conditioning Complete Incomplete Gradient Descent Complete Incomplete Tree Clustering Variable Elimination Mini-Clustering(i) Mini-Bucket(i) Stochastic Local Search DFS search Inference: Elimination Time: exp(treewidth) Space:exp(treewidth) Time: exp(n) Space: linear AND/OR search Time: exp( treewidth*log n ) Space: linear Hybrids Space: exp(treewidth) Time: exp(treewidth) Time: exp(pathwidth) Space: exp(pathwidth) Belief Propagation Bucket Elimination

152 Given a belief network and some evidence, MPE = ? MPE = maxA,E=0,D,C,B P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = maxA P(A) maxE=0 maxD maxC P(C|A) maxB P(B|A) P(D|A,B) P(E|B,C), where the innermost maximization defines λB(A,D,C,E) (Variable Elimination).

153 Elimination operator: max∏. Ordering: A, E, D, C, B.
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A), λB(A,D,C,E)
Bucket D: λC(A,D,E)
Bucket E: E=0, λD(A,E)
Bucket A: P(A), λE(A) → MPE
width: 4, 3, 1, 1, 0; w* = 4 "induced width" (max clique size)

154 Computation in a bucket is time and space exponential in the number of variables involved (i.e., width) Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables The idea is similar to i-consistency: bound the size of recorded dependencies (Dechter 2003)

155 Split a bucket into mini-buckets => bound complexity. For maximization, splitting the bucket of X into mini-buckets Q1 and Q2 gives maxX ∏(Q1 ∪ Q2) ≤ (maxX ∏ Q1) · (maxX ∏ Q2), so processing the mini-buckets separately yields an upper bound.

156 Bucket B: Bucket C: Bucket D: Bucket E: Bucket A: P(E|B,C) P(D|A,B), P(B|A) P(C|A) E=0 P(A) λ B (C,E) λ C (A,D,E) Upper Bound on MPE value λE(A)λE(A) λ B (A,D) λ D (A,E) 4 variables: split 3 variables: OK 2 variables: OK 1 variable: OK Mini-buckets max∏

157 Bucket B: Bucket C: Bucket D: Bucket E: Bucket A: P(E|B,C), P(D|A,B), P(B|A) P(C|A) E=0 P(A) λ B (C,E) λ C (A,D,E) λE(A)λE(A) λ B (A,D) λ D (A,E) a’ = argmax P(A) ∙ λ E (A) e’ = 0 d’ = argmax λ C (a’,D,e’) ∙ ∙ λ C (a’,D) c’ = argmax P(C|a’) ∙ ∙ λ C (C,e’) b’ = argmax P(e’|B,c’) ∙ ∙ P(d’|a’,B) ∙ P(B|a’) Return (a’, b’, c’, d’, e’) A Lower Bound can also be computed as the probability of the sub-optimal assignment P(a’, b’, c’, d’, e’)

158 Bucket B: Bucket C: Bucket D: Bucket E: Bucket A: P(E|B,C) P(D|A,B), P(B|A) P(C|A) E=0 P(A) λ B (C,E) λ C (A,D,E) Upper Bound on P(evidence) λE(A)λE(A) λ B (A,D) λ D (A,E) 4 variables: split 3 variables: OK 2 variables: OK 1 variable: OK Mini-buckets ∑∏

159 If we process all mini-buckets by summation then we get an unnecessarily large upper bound on the probability of evidence Tighter upper bound  Process first mini-bucket by summation and remaining ones by maximization We can also get a lower bound on P(evidence)  Process first mini-bucket by summation and remaining ones by minimization

160 Controlling parameter i (called i-bound)  Maximum number of distinct variables in a mini-bucket  Outputs both a lower and an upper bound Complexity: O(exp(i)) time and space As i-bound increases, both accuracy and time complexity increase  Clearly, if i = w*, then we have pure BE Possible use of mini-bucket approximations  As anytime algorithms (Dechter & Rish, 1997)  As heuristic functions for depth-first and best-first search (Kask & Dechter, 2001), (Marinescu & Dechter, 2005)
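The i-bound controls how a bucket gets split. A small sketch of one greedy mini-bucket partitioning step (placing the largest scopes first into the first mini-bucket that still fits is one common rule, not necessarily the exact one used in the cited papers):

```python
def partition_into_minibuckets(scopes, i_bound):
    """Greedy mini-bucket partitioning: place each function (represented by its
    scope, a frozenset of variable names) into the first mini-bucket whose
    combined scope stays within i_bound variables; otherwise open a new one.
    Processing mini-buckets separately yields a bound instead of the exact message."""
    minibuckets = []                      # list of [combined_scope, [member scopes]]
    for scope in sorted(scopes, key=len, reverse=True):
        for mb in minibuckets:
            if len(mb[0] | scope) <= i_bound:
                mb[0] |= scope
                mb[1].append(scope)
                break
        else:
            minibuckets.append([set(scope), [scope]])
    return minibuckets

# bucket(B) = { P(E|B,C), P(D|A,B), P(B|A) } with i-bound 3 splits into
# { P(E|B,C) } and { P(D|A,B), P(B|A) }, as on the earlier mini-bucket slide.
bucket_B = [frozenset("BCE"), frozenset("ABD"), frozenset("AB")]
for combined, members in partition_into_minibuckets(bucket_B, 3):
    print(sorted(combined), [sorted(m) for m in members])
```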

161 Static Mini-Buckets  Pre-compiled  Reduced overhead  Less accurate  Static variable ordering Dynamic Mini-Buckets  Computed dynamically  Higher overhead  High accuracy  Dynamic variable ordering

162 OR AND OR AND OR AND A 0 B 0 D E E 0101 01 C 1 1 6485 45 45 24 9 9 2500 0 0 0 1 0 0 D 0 C 1 h(D,0) = 4 3 350 0 9 tip nodes F 1 3 35 0 F h(F) = 5 A B C DE F A B CD E F f(T’) = w(A,0) * w(B,1) * w(C,0) * w(D,0) * h(D,0) * h(F) ≥ f*(T’) h(n) ≥ v(n)

163 A f(A,B) B f(B,C) C f(B,F) F f(A,G) f(F,G) G f(B,E) f(C,E) E f(A,D) f(B,D) f(C,D) D h G (A,F) h F (A,B) h B (A) h E (B,C)h D (A,B,C) h C (A,B) AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h*(a, b, c) = h D (a, b, c) * h E (b, c) (Dechter99)

164 A f(A,B) B f(B,C) C f(B,F) F f(A,G) f(F,G) G f(B,E) f(C,E) E f(B,D) f(C,D) D h G (A,F) h F (A,B) h B (A) h E (B,C)h D (B,C) h C (B) h D (A) f(A,D) D mini-buckets AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h(a, b, c) = h D (a) * h D (b, c) * h E (b, c) ≥ h*(a, b, c) MBE(3)

165 A f(a,b) B f(b,C) C f(b,F) F f(a,G) f(F,G) G f(b,E) f(C,E) E f(a,D) f(b,D) f(C,D) D h G (F) h F () h B () h E (C)h D (C) h C () AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h(a, b, c) = h D (c) * h E (c) = h*(a, b, c) MBE(3)

166 s1196 ISCAS’89 circuit.

167 Mini-Bucket Elimination  Mini-clustering (tree decompositions) Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling  Particle filtering

168 Correctness and completeness:  Algorithm CTE is correct, i.e. it computes the exact posterior joint probability of all single variables (or subsets) and the evidence. Time complexity: O(deg × (n+N) × d^(w*+1)) Space complexity: O(N × d^sep) where deg = the maximum degree of a node, n = number of variables (= number of CPTs), N = number of nodes in the tree decomposition, d = the maximum domain size of a variable, w* = the induced width, sep = the separator size

169 A B C p(a), p(b|a), p(c|a,b) B C D F p(d|b), p(f|c,d) h (1,2) (b,c) B E F p(e|b,f), h (2,3) (b,f) E F G p(g|e,f) 2 4 1 3 EF BC BF sep(2,3)={B,F} elim(2,3)={C,D} G E F C D B A

170 Motivation:  Time and space complexity of Cluster Tree Elimination depend on the induced width w* of the problem  When the induced width w* is big, CTE algorithm becomes infeasible The basic idea:  Try to reduce the size of the cluster (the exponent); partition each cluster into mini-clusters with less variables  Accuracy parameter i = maximum number of variables in a mini-cluster  The idea was explored for variable elimination (MBE)

171 Split a cluster into mini-clusters => bound complexity

172 A B C p(a), p(b|a), p(c|a,b) B E F p(e|b,f) E F G p(g|e,f) 2 4 1 3 EF BC BF Cluster Tree Elimination Mini-Clustering, i=3 G E F C D B A B C D F p(d|b), p(f|c,d) 2 B C D F p(d|b), h (1,2) (b,c), p(f|c,d) sep(2,3)= {B,F} elim(2,3) = {C,D} C D F B C D C D F p(f|c,d) p(d|b), h (1,2) (b,c) p(f|c,d)

173 EF BF BC ABC 2 4 1 3 BEF EFG BCDF

174 Correctness and completeness:  Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(X i,e) of each variable and each of its values. Time & space complexity: O(exp(i))

175 Mini-Bucket Elimination  Mini-clustering Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling  Particle filtering

176 Belief propagation is exact for poly-trees (Pearl, 1988) IBP - applying BP iteratively to cyclic networks No guarantees for convergence Works well for many coding networks

177 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI The graph IBP works on (dual graph) A D I B E J F G C H Belief network P(A) P(B|A,C) P(C) P(D|A,B,E)P(E|B,C) P(F|C,D,E) P(G|H,F) P(H) P(I|F,G)P(J|H,G,I)

178 IBP is applied to a loopy network iteratively  not an anytime algorithm  when it converges, it converges very fast MC applies bounded inference along a tree decomposition  MC is an anytime algorithm controlled by i-bound  MC converges in two passes up and down the tree IJGP combines:  the iterative feature of IBP  the anytime feature of MC

179  Apply Cluster Tree Elimination to any join-graph  We commit to graphs that are minimal I-maps  Avoid cycles as long as I-mapness is not violated  Result: use minimal arc-labeled join-graphs

180 A D I B E J F G C H A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI Belief networkThe graph IBP works on (dual graph)

181 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABABBCBC BEBE C C DEDECECE F H F FGFGGHGHH GI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABABBCBC C DEDECECE H F FGFGGHGH GI

182 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGFGGHGH GIGI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGHGH GIGI

183 a) Minimal arc-labeled join graphb) Join-graph obtained by collapsing nodes of graph a) c) Minimal arc-labeled join graph A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BCBC CDECECE F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BCBC DECECE F FGH GI

184 ABCDE FGHIGHIJ CDEF CDE F GHI a) Minimal arc-labeled join graphb) Tree decomposition ABCDE FGI BCE GHIJ CDEF FGH BC DECE F FGH GI

185 A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BC DECE F FGH GI ABCDE FGHIGHIJ CDEF CDE F GHI more accuracy less complexity

186 ABCDE FGI BCE GHIJ CDEF FGH BCBC CDE CECE F FGH GI ABCDE p(a), p(c), p(b|ac), p(d|abe),p(e|b,c) h(3,1)(bc) BCD CDEF BCBC CDE CECE 13 2 h (3,1) (bc) h (1,2) Minimal arc-labeled: sep(1,2)={D,E} elim(1,2)={A,B,C} Non-minimal arc-labeled: sep(1,2)={C,D,E} elim(1,2)={A,B}

187 We want arc-labeled decompositions such that:  the cluster size (internal width) is bounded by i (the accuracy parameter)  the width of the decomposition as a graph (external width) is as small as possible – closer to a tree Possible approaches to build decompositions:  partition-based algorithms - inspired by the mini-bucket decomposition  grouping-based algorithms

188 G E F C D B A a) schematic mini-bucket(i), i=3 b) minimal arc-labeled join-graph decomposition CDB CAB BA A CB P(D|B) P(C|A,B) P(A) BA P(B|A) FCD P(F|C,D) GFE EBF BF EF P(E|B,F) P(G|F,E) B CD BF A F G: (GFE) E: (EBF) (EF) F: (FCD) (BF) D: (DB) (CD) C: (CAB) (CB) B: (BA) (AB) (B) A: (A)

189 IJGP(i) applies BP to min arc-labeled join-graph, whose cluster size is bounded by i On join-trees IJGP finds exact beliefs! IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman and Weiss, 2001) Complexity of one iteration:  time: O(deg(n+N) d i+1 )  space: O(Nd  )

190 evidence=0 evidence=5

191 evidence=0evidence=5

192

193 IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC Empirical evaluation showed the potential of IJGP, which improves with iteration and most of the time with i-bound, and scales up to large networks IJGP is almost always superior, often by a high margin, to IBP and MC Based on all our experiments, we think that IJGP provides a practical breakthrough to the task of belief updating #CSP: can use IJGP to generate solution counts estimates for depth-first Branch-and-Bound search

194 Mini-Bucket Elimination  Mini-clustering Iterative Belief Propagation  IJGP – Iterative Joint Graph Propagation Sampling  Forward sampling  Gibbs sampling (MCMC)  Importance sampling

195 Structural Approximations  Eliminate some dependencies Remove edges Mini-Bucket and Mini-Clustering approaches Local Search  Approach for optimization tasks: MPE, MAP Favorite MAX-CSP/WCSP/WSAT local search solver! Sampling  Generate random samples and compute values of interest from samples, not original network

196 Input: Bayesian network with set of nodes X Sample = a tuple with assigned values s=(X 1 =x 1,X 2 =x 2,…,X k =x k ) Tuple may include all variables (except evidence) or a subset Sampling schemas dictate how to generate samples (tuples) Ideally, samples are distributed according to P(X|E)

197 Given a set of variables X = {X 1, X 2, … X n } that represent joint probability distribution  (X) and some function g(X), we can compute expected value of g(X) :

198 Given independent, identically distributed samples (iid) S 1, S 2, …S T from  (X), it follows from Strong Law of Large Numbers: A sample S t is an instantiation:

199 Given random variable X, D(X)={0, 1} Given P(X) = {0.3, 0.7} Generate k=10 samples: 0,1,1,1,0,1,1,0,1,0 Approximate P'(X): P'(X=0) = 4/10 = 0.4, P'(X=1) = 6/10 = 0.6

200 Given random variable X, D(X)={0, 1} Given P(X) = {0.3, 0.7} Sample X  P (X)  draw random number r  [0, 1]  If (r < 0.3) then set X=0  Else set X=1 Can generalize for any domain size

201 Same idea: generate a set of samples T Estimate posterior marginal P(X i |E) from samples Challenge: X is a vector and P(X) is a huge distribution represented by BN Need to know:  How to generate a new sample ?  How many samples T do we need ?  How to estimate P(E=e) and P(X i |e) ?

202 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks

203 Forward Sampling  Case with No evidence E={}  Case with Evidence E=e

204 Input: Bayesian network X= {X 1,…,X N }, N- #nodes, T - # samples Output: T samples Process nodes in topological order – first process the ancestors of a node, then the node itself: 1.For t = 1 to T 2. For i = 1 to N 3. X i  sample x i t from P(x i | pa i )

205 What does it mean to sample x i t from P(X i | pa i ) ? Assume D(X i )={0,1} Assume P(X i | pa i ) = (0.3, 0.7) Draw a random number r from [0,1] If r falls in [0,0.3], set X i = 0 If r falls in [0.3,1], set X i = 1 010.3 r

206 X1X1 X4X4 X2X2 X3X3

207 Task: given T samples {S1, S2, …, ST}, estimate P(Xi = xi): P'(Xi = xi) = (number of samples in which Xi = xi) / T. Basically, count the proportion of samples where Xi = xi
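A minimal sketch of forward sampling and the counting estimator above, on a toy two-node chain with placeholder CPT numbers:

```python
import random

def forward_sample(nodes, cpts, rng):
    """Draw one sample in topological order: x_i ~ P(X_i | pa_i).
    `nodes` is a topological order; cpts[X] = (parents, table) where
    table[parent_values] = P(X=1 | parents). All variables are binary."""
    sample = {}
    for x in nodes:
        parents, table = cpts[x]
        p1 = table[tuple(sample[p] for p in parents)]
        sample[x] = 1 if rng.random() < p1 else 0
    return sample

# toy two-node chain A -> B with placeholder numbers (an assumption):
cpts = {"A": ([], {(): 0.3}),
        "B": (["A"], {(0,): 0.8, (1,): 0.1})}
rng = random.Random(0)
samples = [forward_sample(["A", "B"], cpts, rng) for _ in range(10000)]

# estimate P(B=1) as the proportion of samples with B=1 (the counter above)
print(sum(s["B"] == 1 for s in samples) / len(samples))   # ≈ 0.8*0.7 + 0.1*0.3 = 0.59
```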

208 Input: Bayesian network X= {X 1,…,X N }, N- #nodes E – evidence, T - # samples Output: T samples consistent with E 1.For t=1 to T 2. For i=1 to N 3. X i  sample x i t from P(x i | pa i ) 4. If X i in E and X i  x i, reject sample: 5. i = 1 and go to step 2

209 X1X1 X4X4 X2X2 X3X3

210 Let Y be a subset of evidence nodes s.t. Y=u

211 Theorem: Let  s (y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most  with probability at least 1-  it is enough to have: Derived from Chebychev’s Bound.

212 Advantages: P(x i | pa(x i )) is readily available Samples are independent ! Drawbacks: If evidence E is rare (P(e) is low), then we will reject most of the samples! Since P(y) in estimate of T is unknown, must estimate P(y) from samples themselves! If P(e) is small, T will become very big!

213 Forward Sampling  High Rejection Rate Fix evidence values  Gibbs sampling (MCMC)  Likelihood Weighting  Importance Sampling

214 Forward Sampling  High rejection rate  Samples are independent Fix evidence values  Gibbs sampling (MCMC)  Likelihood Weighting  Importance Sampling

215 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling

216 Markov Chain Monte Carlo method (Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994) Samples are dependent, form Markov Chain Sample from P’(X|e) which converges to P(X|e) Guaranteed to converge when all P > 0 Methods to improve convergence:  Blocking  Rao-Blackwellised

217 A sample t  [1,2,…], is an instantiation of all variables in the network: Sampling process  Fix values of observed variables e  Instantiate node values in sample x 0 at random  Generate samples x 1,x 2,…x T from P(X|e)  Compute posteriors from samples

218 Generate sample x t+1 from x t : In short, for i=1 to N: Process All variables In Some Order

219 Markov blanket: markov(Xi) = the parents of Xi, the children of Xi, and the children's other parents. Sampling Xi only requires its Markov blanket: P(Xi | x_{-i}) = P(Xi | markov_i) ∝ P(Xi | pa_i) · ∏ over children Xj of P(xj | pa_j)

220 Input: X, E Output: T samples {x t } Fix evidence E Generate samples from P(X | E) 1.For t = 1 to T (compute samples) 2. For i = 1 to N (loop through variables) 3. X i  sample x i t from P(X i | markov t \ X i )

221 Query: P(xi|e) = ? Method 1: count the number of samples where Xi = xi: P'(xi|e) = (1/T) ∑t 1{xi^t = xi} Method 2: average probability (mixture estimator): P'(xi|e) = (1/T) ∑t P(xi | markov_i^t)
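A small Gibbs-sampling sketch on a three-node chain with the evidence clamped, using the Markov-blanket conditionals from slide 219 and the counting estimator (Method 1); the CPT numbers are placeholders:

```python
import random

# Chain A -> B -> C with evidence C = 1; placeholder CPTs, binary variables.
P_A = {(): 0.3}                              # P(A=1)
P_B = {(0,): 0.8, (1,): 0.2}                 # P(B=1 | A)
P_C = {(0,): 0.1, (1,): 0.9}                 # P(C=1 | B)

def bernoulli(table, parents, value):
    p1 = table[parents]
    return p1 if value == 1 else 1 - p1

def gibbs(T, rng):
    state = {"A": 0, "B": 0, "C": 1}         # evidence C=1 stays clamped
    counts = {"A": 0, "B": 0}
    for _ in range(T):                        # no burn-in in this sketch
        # P(A | markov(A)) ∝ P(A) * P(B | A): parent term times child term
        w = [bernoulli(P_A, (), a) * bernoulli(P_B, (a,), state["B"]) for a in (0, 1)]
        state["A"] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
        # P(B | markov(B)) ∝ P(B | A) * P(C | B)
        w = [bernoulli(P_B, (state["A"],), b) * bernoulli(P_C, (b,), 1) for b in (0, 1)]
        state["B"] = 1 if rng.random() < w[1] / (w[0] + w[1]) else 0
        counts["A"] += state["A"]; counts["B"] += state["B"]
    return {v: c / T for v, c in counts.items()}

print(gibbs(50000, random.Random(0)))        # estimates of P(A=1|C=1), P(B=1|C=1)
```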

222 X = {X 1,X 2,…,X 9 } E = {X 9 } X1 X4 X8X5 X2 X3 X9 X7 X6

223 X 1 = x 1 0 X 6 = x 6 0 X 2 = x 2 0 X 7 = x 7 0 X 3 = x 3 0 X 8 = x 8 0 X 4 = x 4 0 X 5 = x 5 0 X1 X4 X8X5 X2 X3 X9 X7 X6

224 X 1  P (X 1 |X 0 2,…,X 0 8,X 9 ) E = {X 9 } P (X 1 =0 |X 0 2,X 0 3,X 9 } = αP(X 1 =0)P(X 0 2 |X 1 =0)P(X 3 0 |X 1 =0) P (X 1 =1 |X 0 2,X 0 3,X 9 } = αP(X 1 =1)P(X 0 2 |X 1 =1)P(X 3 0 |X 1 =1) X1 X4 X8X5 X2 X3 X9 X7 X6

225 X 2  P(X 2 |X 1 1,…,X 0 8,X 9 } E = {X 9 } Markov blanket for X 2 is: {X 2, X 1, X 4, X 5, X 3 } X1 X4 X8X5 X2 X3 X9 X7 X6

226

227 We want to sample from P(X | E) But … starting point is random Solution: throw away first K samples Known As “Burn-In” What is K ? Hard to tell. Use intuition. Alternatives: sample first sample values from approximate P(x|e)  For example, run IBP first

228 Converges to the stationary distribution π*: π* = π* P, where P is the transition kernel with p_ij = P(Xi → Xj) Guaranteed to converge iff the chain is:  irreducible  aperiodic  ergodic (∀ i,j: p_ij > 0)

229 Advantage :  guaranteed to converge to P(X|E), as long as P i > 0 Disadvantage :  convergence may be slow Problems:  Samples are dependent !  Statistical variance is too big in high-dimensional problems

230 Objectives: 1.Reduce dependence between samples (autocorrelation)  Skip samples  Randomize Variable Sampling Order 2.Reduce variance  Blocking Gibbs Sampling  Rao-Blackwellisation

231 Pick only every k-th sample (Geyer, 1992)  Can reduce dependence between samples!  Increases variance!  Wastes samples!

232 Random Scan Gibbs Sampler  Pick each next variable X i for update at random with probability p i,  i p i = 1. In the simplest case, p i are distributed uniformly.  In some instances, reduces variance (MacEachern, Peruggia, 1999)

233 Sample several variables together, as a block Example: Given three variables X,Y,Z, with domains of size 2, group Y and Z together to form a variable W={Y,Z} with domain size 4. Then, given sample (x t,y t,z t ), compute next sample: X t+1  P(y t,z t )=P(w t ) (y t+1,z t+1 )=W t+1  P(x t+1 ) + Can improve convergence greatly when two variables are strongly correlated! - Domain of the block variable grows exponentially with the #variables in a block!

234 Do not sample all variables! Sample a subset! Example: Given three variables X,Y,Z, sample only X and Y, sum out Z. Given sample (x t,y t ), compute next sample: x t+1  P(y t ) y t+1  P(x t+1 )

235 Bottom line: reducing number of variables in a sample reduce variance!

236 Standard Gibbs: P(x|y,z),P(y|x,z),P(z|x,y)(1) Blocking: P(x|y,z), P(y,z|x)(2) Rao-Blackwellised: P(x|y), P(y|x)(3) Var3 < Var2 < Var1 ( Liu, Wong, Kong, 1994 ) XY Z

237 Select C  X (possibly cycle-cutset), |C| = m Fix evidence E Initialize nodes with random values: For i=1 to m: c i to C i = c 0 i For t=1 to n, generate samples: For i=1 to m: C i =c i t+1  P(c i |c 1 t+1,…,c i-1 t+1,c i+1 t,…,c m t,e)

238 Generate sample c t+1 from c t :

239 How to choose C ?  Special case: C is cycle-cutset, O(N)  General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed.  Pick C wisely so as to minimize w  notion of w- cutset

240 C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w Complexity of exact inference:  bounded by w ! Cycle-cutset is a special case!

241 Query:  c i  C, P(c i |e)=? same as Gibbs: Special case of w-cutset Query: P(x i |e) = ? computed while generating sample t compute after generating sample t (easy because C is a cut-set)

242 X1 X7 X5 X4 X2 X9 X8 X3 E=x 9 X6

243 X1 X7 X6X5 X4 X2 X9 X8 X3 Sample a new value for X 2 :

244 X1 X7 X6X5 X4 X2 X9 X8 X3 Sample a new value for X 5 :

245 X1 X7 X6X5 X4 X2 X9 X8 X3 Query P(x 2 |e) for sampling node X 2 : Sample 1 Sample 2 Sample 3

246 X1 X7 X6X5 X4 X2 X9 X8 X3 Query P(x 3 |e) for non-sampled node X 3 :

247 MSE vs. #samples (left) and time (right) Non-Ergodic (1 deterministic CPT entry) |X| = 179, |C| = 8, 2<= D(X i )<=4, |E| = 35 Exact Time = 122 sec using Loop-Cutset Conditioning

248 MSE vs. #samples (left) and time (right) Ergodic, |X| = 360, D(X i )=2, |C| = 21, |E| = 36 Exact Time > 60 min using Cutset Conditioning Exact Values obtained via Bucket Elimination

249 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling

250 “Clamping” evidence + Forward sampling + Weighting samples by evidence likelihood Works well for likely evidence!

251 eeeee Sample in topological order over X ! eeee x i  P(X i |pa i ) P(X i |pa i ) is a look-up in CPT!

252

253 Estimate posterior marginals P(Xi | e) from the weighted samples: P'(xi | e) = ∑t w^(t) · 1{Xi^(t) = xi} / ∑t w^(t), where w^(t) is the likelihood weight of sample t

254 Converges to exact posterior marginals Generates samples fast Sampling distribution is close to prior (especially if E  Leaf Nodes) Increasing sampling variance  Convergence may be slow  Many samples with P(x (t) )=0 rejected
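A minimal likelihood-weighting sketch: evidence is clamped, non-evidence variables are forward-sampled, and each sample is weighted by the likelihood of the clamped values (toy chain and placeholder numbers, as in the earlier sampling sketch):

```python
import random

def likelihood_weighting(nodes, cpts, evidence, T, rng):
    """Likelihood weighting sketch: clamp evidence, forward-sample the rest,
    weight each sample by the likelihood of the clamped evidence values.
    cpts[X] = (parents, table) with table[parent_values] = P(X=1 | parents)."""
    weighted = []
    for _ in range(T):
        sample, w = {}, 1.0
        for x in nodes:                                   # topological order
            parents, table = cpts[x]
            p1 = table[tuple(sample[p] for p in parents)]
            if x in evidence:
                sample[x] = evidence[x]                   # clamp, do not sample
                w *= p1 if evidence[x] == 1 else 1 - p1   # multiply in its likelihood
            else:
                sample[x] = 1 if rng.random() < p1 else 0
        weighted.append((sample, w))
    return weighted

# posterior marginal P(A=1 | C=1) on a toy chain A -> B -> C (placeholder CPTs)
cpts = {"A": ([], {(): 0.3}), "B": (["A"], {(0,): 0.8, (1,): 0.2}),
        "C": (["B"], {(0,): 0.1, (1,): 0.9})}
ws = likelihood_weighting(["A", "B", "C"], cpts, {"C": 1}, 50000, random.Random(0))
num = sum(w for s, w in ws if s["A"] == 1)
den = sum(w for s, w in ws)
print(num / den)
```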

255 Forward Sampling Gibbs Sampling (MCMC)  Blocking  Rao-Blackwellised Likelihood Weighting Importance Sampling

256 In general, it is hard to sample from target distribution P(X|E) Generate samples from sampling (proposal) distribution Q(X) Weigh each sample against P(X|E)

257

258 Given a distribution called the proposal distribution Q (such that P(Z=z,e)>0 => Q(Z=z)>0) w(Z=z) is called importance weight

259 Underlying principle, Approximate Average over a set of numbers by an average over a set of sampled numbers

260 Express the problem as computing the average over a set of real numbers Sample a subset of real numbers Approximate the true average by sample average.  True Average: Average of (0.11, 0.24, 0.55, 0.77, 0.88,0.99)=0.59  Sample Average over 2 samples: Average of (0.24, 0.77) = 0.505

261 Express Q in product form:  Q(Z)=Q(Z 1 )Q(Z 2 |Z 1 )….Q(Z n |Z 1,..Z n-1 ) Sample along the order Z 1,..Z n Example:  Q(Z 1 )=(0.2,0.8)  Q(Z 2 |Z 1 )=(0.2,0.8,0.1,0.9)  Q(Z 3 |Z 1,Z 2 )=Q(Z 3 |Z 1 )=(0.5,0.5,0.3,0.7)

262 Each Sample Z=z  Sample Z 1 =z 1 from Q(Z 1 )  Sample Z 2 =z 2 from Q(Z 2 |Z 1 =z1)  Sample Z 3 =z 3 from Q(Z 3 |Z1=z1) Generate N such samples
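A generic importance-sampling sketch of the estimator described above: sample z from the proposal Q and average the weights w(z) = P(z,e)/Q(z); the numbers are placeholders for a single binary Z:

```python
import random

# Estimate P(e) = sum_z P(z, e) by sampling z ~ Q and averaging w(z) = P(z,e)/Q(z).
P_ze = {0: 0.02, 1: 0.18}          # placeholder unnormalized P(Z=z, e); true P(e) = 0.20
Q    = {0: 0.5, 1: 0.5}            # proposal distribution over Z

def importance_estimate(T, rng):
    total = 0.0
    for _ in range(T):
        z = 0 if rng.random() < Q[0] else 1   # sample z ~ Q
        total += P_ze[z] / Q[z]               # importance weight w(z)
    return total / T

print(importance_estimate(100000, random.Random(0)))   # ≈ 0.20
```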

263 Q= Prior Distribution = CPTs of the Bayesian network

264 lung Cancer Smoking X-ray Bronchitis Dyspnoea P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S) P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

265 lung Cancer Smoking X-ray Bronchitis Dyspnoea P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S) Q=Prior Q(S,C,D)=Q(S)*Q(C|S)*Q(D|C,B=0) =P(S)P(C|S)P(D|C,B=0) Sample S=s from P(S) Sample C=c from P(C|S=s) Sample D=d from P(D|C=c,B=0)

266

267

