1
Radu Marinescu 4C @ University College Cork
2
Uncertainty in medical diagnosis Diseases produce symptoms In diagnosis, observed symptoms => disease ID Uncertainties Symptoms may not occur Symptoms may not be reported Diagnostic tests are not perfect – False positive, false negative How do we estimate confidence? P(disease | symptoms, tests) = ?
3
Uncertainty in medical decision-making Physicians, patients must decide on treatments Treatments may not be successful Treatments may have unpleasant side effects Choosing treatments Weigh risks of adverse outcomes People are BAD at reasoning intuitively about probabilities Provide systematic analysis
4
Probabilistic modeling with joint distributions Conditional independence and factorization Belief (or Bayesian) networks Example networks and software Inference in belief networks Exact inference Variable elimination, join-tree clustering, AND/OR search Approximate inference Mini-clustering, belief propagation, sampling
5
Judea Pearl. “Probabilistic reasoning in intelligent systems”, 1988 Stuart Russell & Peter Norvig. “Artificial Intelligence. A Modern Approach”, 2002 (Ch 13-17) Kevin Murphy. "A Brief Introduction to Graphical Models and Bayesian Networks" http://www.cs.ubc.ca/~murphyk/Bayes/bnintro.html Rina Dechter. "Bucket Elimination: A Unifying Framework for Probabilistic Inference" http://www.ics.uci.edu/~csp/R48a.ps Rina Dechter. "Mini-Buckets: A General Scheme for Approximating Inference" http://www.ics.uci.edu/~csp/r62a.pdf Rina Dechter & Robert Mateescu. "AND/OR Search Spaces for Graphical Models". http://www.ics.uci.edu/~csp/r126.pdf
6
A problem domain is modeled by a list of (discrete) random variables: X 1, X 2, …, X n Knowledge about the problem is represented by a joint probability distribution:P(X 1, X 2, …, X n )
7
Alarm (Pearl88) Story: In Los Angeles, burglary and earthquake are common. They both can trigger an alarm. In case of alarm, two neighbors John and Mary may call 911 Problem: estimate the probability of a burglary based on who has or has not called Variables: Burglary (B), Earthquake (E), Alarm (A), JohnCalls (J), MaryCalls (M) Knowledge required by the probabilistic approach in order to solve this problem: P(B, E, A, J, M)
8
Defines probabilities for all possible value assignments to the variables in the set
9
What is the probability of burglary given that Mary called, P(B=y | M=y)? Compute the marginal probabilities P(M=y) and P(B=y, M=y) by summing entries of the joint, then answer by conditioning: P(B=y | M=y) = P(B=y, M=y) / P(M=y).
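As a concrete illustration of reasoning by conditioning, here is a minimal Python sketch; the dictionary encoding of the joint and the function name are illustrative assumptions, not part of the slides.

```python
# A minimal sketch (not from the slides): the joint P(B,E,A,J,M) is assumed
# to be stored as a dict mapping a tuple of 'y'/'n' values to a probability.
def p_burglary_given_mary(joint):
    """Answer P(B=y | M=y) by summing entries of the joint table."""
    p_m = sum(p for (b, e, a, j, m), p in joint.items() if m == 'y')
    p_bm = sum(p for (b, e, a, j, m), p in joint.items()
               if b == 'y' and m == 'y')
    return p_bm / p_m          # reasoning by conditioning
```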
10
Probability theory well-established and well understood In theory, can perform arbitrary inference among the variables given a joint probability. This is because the joint probability contains information of all aspects of the relationships among the variables Diagnostic inference: From effects to causes Example: P(B=y | M=y) Predictive inference: From causes to effects Example: P(M=y | B=y) Combining evidence: P(B=y | J=y, M=y, E=n) All inference sanctioned by probability theory and hence has clear semantics
11
In the Alarm example: 32 numbers (parameters) are needed. It is quite unnatural to assess P(B=y, E=y, A=y, J=y, M=y). Computing P(B=y | M=y) takes 29 additions. In general, P(X1, X2, …, Xn) needs at least 2^n numbers to specify the joint probability distribution. Knowledge acquisition is difficult (complex, unnatural); storage and inference are exponential.
12
Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Example networks and software Inference in belief networks Exact inference Approximate inference Miscellaneous Mixed networks, influence diagrams, etc.
13
Overcome the problem of exponential size by exploiting conditional independencies. The chain rule of probability: P(X1, …, Xn) = P(X1) P(X2|X1) ⋯ P(Xn|X1, …, Xn-1). No gains yet: the number of parameters required by the factors is still O(2^n).
14
A random variable X is conditionally independent of a set of random variables Y given a set of random variables Z if P(X | Y, Z) = P(X | Z) Intuitively: Y tells us nothing more about X than we know by knowing Z As far as X is concerned, we can ignore Y if we know Z
15
About P(Xi | X1, …, Xi-1): domain knowledge usually allows one to identify a subset pa(Xi) ⊆ {X1, …, Xi-1} such that, given pa(Xi), Xi is independent of all variables in {X1, …, Xi-1} \ pa(Xi), i.e. P(Xi | X1, …, Xi-1) = P(Xi | pa(Xi)). Then P(X1, …, Xn) = ∏i P(Xi | pa(Xi)): the joint distribution is factorized! The number of parameters may be substantially reduced.
16
pa(B) = {}, pa(E) = {}, pa(A) = {B,E}, pa(J) = {A}, pa(M) = {A} Conditional probability tables (CPT)
17
Model size reduced from 32 to 2+2+4+4+8 = 20. Model construction is easier: fewer parameters to assess, and the parameters are more natural to assess, e.g., P(B=y), P(J=y | A=y), P(A=y | B=y, E=y), etc. Inference is also easier (we will see this later).
18
Probabilistic modeling with joint distributions Conditional Independence and factorization Belief networks Example networks and software Inference in belief networks Exact inference Approximate inference
19
Graphically represent the conditional independence relationships: construct a directed graph by drawing an arc from Xj to Xi iff Xj ∈ pa(Xi), and attach the CPT P(Xi | pa(Xi)) to node Xi. (Figure: the alarm network B, E → A → J, M with CPTs P(B), P(E), P(A|B,E), P(J|A), P(M|A).)
20
A belief network is a directed acyclic graph (DAG) where each node represents a random variable and is associated with the conditional probability of the node given its parents. It represents the joint probability distribution P(X1, …, Xn) = ∏i P(Xi | pa(Xi)). A variable is conditionally independent of its non-descendants given its parents.
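To make the factorized joint concrete, here is a small Python sketch; the CPT encoding (cpts[X][(value, parent_values)]) and the function name are assumptions for illustration only.

```python
# Sketch: evaluating the factorized joint of a belief network.
def joint_probability(assignment, cpts, parents):
    """P(x1,...,xn) = product over i of P(xi | pa(xi))."""
    p = 1.0
    for X, value in assignment.items():
        pa_vals = tuple(assignment[P] for P in parents[X])
        p *= cpts[X][(value, pa_vals)]
    return p
```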
21
3 basic independence structures: 1. chain (Burglary → Alarm → JohnCalls); 2. common descendant (Burglary → Alarm ← Earthquake); 3. common ancestor (MaryCalls ← Alarm → JohnCalls).
22
1. Chain (Burglary → Alarm → JohnCalls): JohnCalls is independent of Burglary given Alarm.
23
2. Common descendant (Burglary → Alarm ← Earthquake): Burglary is independent of Earthquake when Alarm is not observed; Burglary and Earthquake become dependent given Alarm!
24
3. Common ancestor (MaryCalls ← Alarm → JohnCalls): MaryCalls is independent of JohnCalls given Alarm.
25
A BN encodes many conditional independence relations between distant variables and sets of variables; these are defined in terms of a graphical criterion called d-separation. d-separation = conditional independence: let X, Y and Z be three sets of nodes; if X and Y are d-separated by Z, then X and Y are conditionally independent given Z, i.e. P(X | Y, Z) = P(X | Z). d-separation in the graph: X is d-separated from Y given Z if every undirected path between them is blocked. Path blocking: 3 cases that correspond to the three basic independence structures.
26
With a “linear” substructure X → Z → Y or a “wedge” substructure X ← Z → Y (common ancestor), the path is blocked if Z is in C. With a “vee” substructure X → Z ← Y (common descendant), the path is blocked if neither Z nor any of its descendants is in C.
27
Example (nodes 1–5): X = {2} and Y = {3} are d-separated by Z = {1}: the path 2–1–3 is blocked because 1 ∈ Z, and the path 2–4–3 is blocked because 4 and all its descendants are outside Z. X = {2} and Y = {3} are not d-separated by Z = {1,5}: the path 2–1–3 is still blocked by 1 ∈ Z, but the path 2–4–3 is activated because 5 (a descendant of 4) is in Z; learning the value of the consequence 5 renders its causes 2 and 3 dependent.
28
Given a probability distribution P on a set of variables {X 1, …, X n }, a belief network B representing P is a minimal I-map (Pearl88) I-mapness: every d-separation condition displayed in B corresponds to a valid conditional independence relationship in P Minimal: none of the arrows in B can be deleted without destroying its I-mapness
29
Rewrite the full joint probability using the product rule: P(B,E,A,J,M) = P(J|B,E,A,M) P(B,E,A,M) = P(J|A) P(B,E,A,M); P(B,E,A,M) = P(M|B,E,A) P(B,E,A) = P(M|A) P(B,E,A); P(B,E,A) = P(A|B,E) P(B,E) = P(A|B,E) P(B) P(E). Altogether: P(B,E,A,J,M) = P(J|A) P(M|A) P(A|B,E) P(B) P(E).
30
The “alarm” network: monitoring intensive-care patients. 37 variables (PCWP, CO, HRBP, HREKG, HRSAT, …), 509 parameters (instead of 2^37).
31
GeNIe (University of Pittsburgh) – free: http://genie.sis.pitt.edu. SamIam (UCLA) – free: http://reasoning.cs.ucla.edu/SamIam/. Hugin – commercial: http://www.hugin.com. Netica – commercial: http://www.norsys.com. UCI Lab – free but no GUI: http://graphmod.ics.uci.edu/
33
Belief networks are used in: Genetic linkage analysis Speech recognition Medical diagnosis Probabilistic error correcting coding Monitoring and diagnosis in distributed systems Troubleshooting (Microsoft) …
34
Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Inference in belief networks Exact inference Approximate inference
35
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) VE+C hybrid AND/OR search (tree, graph)
36
Smoking Bronchitis Lung cancer X-ray Dyspnoea P(Lung cancer = yes | Smoking = no, Dyspnoea = yes) ?
37
Belief updating Maximum probable explanation (MPE) Maximum a posteriori hypothesis (MAP)
38
Variable elimination for belief updating: P(A | E=0) = α P(A, E=0) = α ∑_{E=0,D,C,B} P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = α P(A) ∑_{E=0} ∑_D ∑_C P(C|A) ∑_B P(B|A) P(D|A,B) P(E|B,C); the innermost sum produces the function λ_B(A,D,C,E).
39
Moralize the graph (“marry parents”). Ordering: A, E, D, C, B. Initial buckets:
Bucket B: P(E|B,C), P(D|A,B), P(B|A)
Bucket C: P(C|A)
Bucket D:
Bucket E: E=0
Bucket A: P(A)
40
ELIMINATION: multiply (×) and sum (∑).
bucket(B) = { P(E|B,C), P(D|A,B), P(B|A) } → λ_B(A,C,D,E) = ∑_B P(B|A) × P(D|A,B) × P(E|B,C)
OBSERVED BUCKET: bucket(B) = { P(E|B,C), P(D|A,B), P(B|A), B=1 } → the observation is simply substituted: λ_B(A) = P(B=1|A), λ_B(A,D) = P(D|A,B=1), λ_B(C,E) = P(E|B=1,C)
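A minimal sketch of one bucket-elimination step (multiply the functions mentioning a variable and sum it out); the factor representation and names are assumptions, not the slides' notation.

```python
from itertools import product

# A factor is a pair (scope_tuple, table); table maps a tuple of values,
# ordered as in scope_tuple, to a number.
def eliminate(var, factors, domains):
    """Multiply all factors that mention `var`, then sum `var` out."""
    relevant = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    new_scope = tuple(sorted({v for s, _ in relevant for v in s} - {var}))
    table = {}
    for vals in product(*(domains[v] for v in new_scope)):
        asg = dict(zip(new_scope, vals))
        total = 0.0
        for x in domains[var]:
            asg[var] = x
            prod = 1.0
            for scope, t in relevant:
                prod *= t[tuple(asg[v] for v in scope)]
            total += prod
        table[vals] = total            # this is the message lambda_var
    return rest + [(new_scope, table)]
```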
43
Elimination operator: ∑∏.
Bucket B: P(E|B,C), P(D|A,B), P(B|A) → λ_B(A,D,C,E)
Bucket C: P(C|A), λ_B(A,D,C,E) → λ_C(A,D,E)
Bucket D: λ_C(A,D,E) → λ_D(A,E)
Bucket E: E=0, λ_D(A,E) → λ_E(A)
Bucket A: P(A), λ_E(A) → P(A, E=0)
w* = 4, the “induced width” (max clique size).
44
B C D E A A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) Induced width of the ordering w*(d) || max width of the nodes A BC ED
45
w*(d) – the induced width of the moral graph along ordering d. For the example: the ordering d1 (processing B, C, D, E, A) gives w*(d1) = 4, while the ordering d2 (processing E, D, C, B, A) gives w*(d2) = 2.
46
Finding a minimum induced-width ordering is NP-complete. A tree has induced width of 1. Greedy ordering heuristics: min-width, min induced-width, max-cardinality, min-fill (regarded as the best in practice), and anytime min-width (via Branch-and-Bound). See the sketch below for min-fill.
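A hedged sketch of the greedy min-fill heuristic; the adjacency-dictionary representation is an assumption for illustration.

```python
def min_fill_ordering(adj):
    """Greedy min-fill: repeatedly pick the node whose elimination adds
    the fewest fill-in edges, connect its neighbors, and remove it.
    `adj` maps node -> set of neighbors (the moral graph)."""
    adj = {v: set(ns) for v, ns in adj.items()}
    order = []
    while adj:
        def fill(v):
            ns = list(adj[v])
            return sum(1 for i in range(len(ns)) for j in range(i + 1, len(ns))
                       if ns[j] not in adj[ns[i]])
        v = min(adj, key=fill)
        ns = list(adj[v])
        for i in range(len(ns)):
            for j in range(i + 1, len(ns)):
                adj[ns[i]].add(ns[j]); adj[ns[j]].add(ns[i])
        for u in ns:
            adj[u].discard(v)
        del adj[v]
        order.append(v)
    return order
```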
47
Smoking Bronchitis Lung Cancer X-ray Dyspnoea
48
Probabilistic decoding: a stream of bits is transmitted across a noisy channel, and the problem is to recover the transmitted stream given the observed output bits and parity-check bits. (Figure: transmitted bits x0…x4, u0…u4, parity-check bits, and the observed received bits and received parity-check bits.)
49
Medical diagnosis: given some observed symptoms, determine the most likely subset of diseases that may explain the symptoms. (Figure: a two-layer network with diseases Disease1–Disease7 as parents of symptoms Symptom1–Symptom6.)
50
Genetic linkage analysis: given the genotype information of a pedigree, infer the maximum-likelihood haplotype configuration (maternal and paternal) of the unobserved individuals (Fishelson & Geiger, 2002). (Figure: the pedigree model with genotype, haplotype and selector variables for two loci.)
51
MPE = max_{A,E=0,D,C,B} P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = max_A P(A) max_{E=0} max_D max_C P(C|A) max_B P(B|A) P(D|A,B) P(E|B,C); variable elimination: the innermost maximization produces λ_B(A,D,C,E).
52
Example: maxing out B from f(A,B,C).
f(A,B,C): TTT = 0.03, TTF = 0.07, TFT = 0.54, TFF = 0.36, FTT = 0.06, FTF = 0.14, FFT = 0.48, FFF = 0.32
f(A,C) = max_B f(A,B,C): TT = 0.54, TF = 0.36, FT = 0.48, FF = 0.32
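A small sketch of max-marginalization mirroring the table above; the factor encoding is an illustrative assumption.

```python
def max_out(table, scope, var):
    """Max-marginalize `var` out of a factor: g(rest) = max_x f(rest, x)."""
    i = scope.index(var)
    new_scope = scope[:i] + scope[i + 1:]
    g = {}
    for key, val in table.items():
        new_key = key[:i] + key[i + 1:]
        g[new_key] = max(g.get(new_key, float('-inf')), val)
    return new_scope, g

# f(A,B,C) from the slide; True/False abbreviated T/F
f = {('T','T','T'): 0.03, ('T','T','F'): 0.07, ('T','F','T'): 0.54,
     ('T','F','F'): 0.36, ('F','T','T'): 0.06, ('F','T','F'): 0.14,
     ('F','F','T'): 0.48, ('F','F','F'): 0.32}
scope, g = max_out(f, ('A', 'B', 'C'), 'B')
# g == {('T','T'): 0.54, ('T','F'): 0.36, ('F','T'): 0.48, ('F','F'): 0.32}
```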
53
Elimination/combination operators: max, ∏.
Bucket B: P(E|B,C), P(D|A,B), P(B|A) → λ_B(A,D,C,E)
Bucket C: P(C|A), λ_B(A,D,C,E) → λ_C(A,D,E)
Bucket D: λ_C(A,D,E) → λ_D(A,E)
Bucket E: E=0, λ_D(A,E) → λ_E(A)
Bucket A: P(A), λ_E(A) → MPE value
w* = 4, the “induced width” (max clique size).
54
Generating the MPE assignment by back-substitution, processing the buckets in reverse order:
Bucket A: a' = argmax_A P(A) · λ_E(A)
Bucket E: e' = 0
Bucket D: d' = argmax_D λ_C(a', D, e')
Bucket C: c' = argmax_C P(C|a') · λ_B(a', d', C, e')
Bucket B: b' = argmax_B P(e'|B, c') · P(d'|a', B) · P(B|a')
Return (a', b', c', d', e')
55
w*(d) – the induced width of the moral graph along ordering d. For the example: the ordering d1 (processing B, C, D, E, A) gives w*(d1) = 4, while the ordering d2 (processing E, D, C, B, A) gives w*(d2) = 2.
56
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) VE+C hybrid AND/OR search (tree, graph)
57
Motivation: BE computes P(evidence) or P(X|evidence), where X is the last variable in the ordering. What if we need all marginal probabilities P(Xi | evidence), for every Xi ∈ {X1, X2, …, Xn}? Running BE n times with Xi as the last variable is inefficient – the induced width may vary significantly from one ordering to another. SOLUTION: Bucket-Tree Elimination (BTE).
58
Variable elimination can be viewed as message passing (elimination) along a bucket tree; the messages are the functions λ_E(B,C), λ_D(A,B), λ_C(A,B), λ_B(A). Any node (bucket) can be the root. Complexity: time and space exponential in the induced width. (Figure: the example network, its buckets, and the bucket tree.)
59
Bucket Tree A bucket tree has each bucket B i as a node and there is an arc from B i to B j if the function created at B i was placed in B j Graph-based definition Let G d be the induced graph along d. Each variable X and its earlier neighbors is a node B X. There is an arc from B X to B Y if Y is the closest parent to X.
60
A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) Belief network E D C B A Induced graph E,B,C A,B,D A,B,C B,A A E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) Bucket tree P(C|A)
61
Computing the message from bucket u to its neighbor v: h(u,v) is obtained by multiplying u's own functions with the messages h(x1,u), …, h(xn,u) received from u's other neighbors, and eliminating (summing over) elim(u,v) = vars(u) − vars(v).
62
E,B,C A,B,D A,B,C B,A A E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) π A (A) π C (B,C) π B (A,B) A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A)
63
E,B,C : P(E|B,C) A,B,D : P(D|A,B) A,B,C : P(C|A) B,A : P(B|A) A : P(A) E D C B A λ E (B,C) λ D (A,B) λ C (A,B) λ B (A) π A (A) π C (B,C) π B (A,B)
64
G,F F,B,CD,B,A A,B,CB,A A F B,CA,B A G,F F,B,CD,B,A A,B,C F B,C A,B G,F A,B,C,D,F F A BC FD P(A) P(B|A) P(F|B,C) P(D|A,B) Time-space trade off! G P(C|A) P(G|F)
65
A tree decomposition for a belief network ‹X,D,G,P› is a triple ‹T,χ,ψ›, where T = (V,E) is a tree and χ and ψ are labeling functions associating with each vertex v ∈ V two sets χ(v) ⊆ X and ψ(v) ⊆ P such that: for each function (CPT) pi ∈ P there is exactly one vertex v such that pi ∈ ψ(v) and scope(pi) ⊆ χ(v); and for each variable Xi ∈ X, the set {v ∈ V | Xi ∈ χ(v)} forms a connected sub-tree (the running intersection property). A join-tree is a tree decomposition where all clusters are maximal; e.g., a bucket-tree is a tree decomposition but not a join-tree.
66
The width (aka treewidth) of a tree decomposition ‹T,χ,ψ› is max_v |χ(v)|, and its hyperwidth is max_v |ψ(v)|. Given two adjacent vertices u and v of a tree decomposition, the separator of u and v is defined as sep(u,v) = χ(u) ∩ χ(v).
67
Good join trees using triangulation Create induced graph G’ along some ordering d Identify all maximal cliques in G’ Order cliques {C 1, C 2, …, C t } by rank of the highest vertex in each clique Form the join tree by connecting each C i to a predecessor C j (j < i) sharing the largest number of vertices with C i
68
E D C B A Induced graph A BC ED Moral graph ECB C3C3 DBA C2C2 CBA C1C1 P(A) P(B|A) P(C|A) P(E|B,C)P(D|A,B) BC P(E|B,C) P(D|A,B) P(A), P(B|A), P(C|A) AB Treewidth = 3 Separator size = 2 χ(C 3 ) ψ(C 3 ) separators
69
Join tree for the example network over A–G: cluster 1 = ABC with P(A), P(B|A), P(C|A,B); cluster 2 = BCDF with P(D|B), P(F|C,D); cluster 3 = BEF with P(E|B,F); cluster 4 = EFG with P(G|E,F); separators BC, BF, EF.
70
A B CD F E G ABC BCDF BEF EFG BC BF EF 1 2 3 4 Time: O(exp(w+1)) Space: O(exp(sep))
71
Correctness and completeness: algorithm CTE is correct, i.e. it computes the exact joint probability of a single variable and the evidence. Time complexity: O(deg · (n+N) · d^(w*+1)). Space complexity: O(N · d^sep), where deg = max degree of a node in T, n = number of variables (= number of CPTs), N = number of nodes in T, d = maximum domain size, w* = induced width, sep = separator size.
72
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) Cycle cutset scheme VE+C hybrid AND/OR search (tree, graph)
73
Conditioning (search): enumerate the full search tree over A, B, C, D, E. Each leaf contributes a product of CPT entries, e.g. P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=0|A=0,B=0), P(A=0)P(B=0|A=0)P(C=0|A=0)P(E=0|B=0,C=0)P(D=1|A=0,B=0), …, P(A=0)P(B=1|A=0)P(C=1|A=0)P(E=0|B=1,C=1)P(D=1|A=0,B=1); summing the leaves below A=0 consistent with E=0 gives P(A=0, E=0).
74
(Figure: the same OR search tree; the subtrees below A=0 and A=1 yield P(A=0, E=0) and P(A=1, E=0), respectively.)
75
IDEA: condition until the induced width w* of the remaining graph gets small enough! This yields a spectrum between search and elimination: pure search conditions on everything (remaining w* = 0), conditioning on a loop cutset leaves w* = 1, a w-cutset leaves w* = w, and pure elimination conditions on nothing.
76
Condition until we get a polytree (no loops); the subset of conditioning variables is a loop-cutset. Example: conditioning on A (A=0 and A=1) gives P(B|D=0) = P(B, A=0 | D=0) + P(B, A=1 | D=0). The loop-cutset method is time exponential in the loop-cutset size and uses linear space!
77
Identify a w-cutset C_w of the network (finding the smallest loop-cutset/w-cutset is NP-hard). For each assignment of the cutset, solve the conditioned subproblem by VE, and aggregate the solutions over all cutset assignments (a sketch follows below). Time complexity: exp(|C_w| + w). Space complexity: exp(w).
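A minimal sketch of the VE+C loop, assuming a hypothetical `solve_conditioned` hook that runs bucket elimination on the subproblem obtained by fixing the cutset variables.

```python
from itertools import product

def cutset_conditioning(cutset, domains, solve_conditioned):
    """Enumerate all assignments to the w-cutset and aggregate the values
    returned for each conditioned subproblem, e.g. P(e, C_w = assignment)."""
    total = 0.0
    for vals in product(*(domains[c] for c in cutset)):
        assignment = dict(zip(cutset, vals))
        total += solve_conditioned(assignment)
    return total               # aggregated over all cutset assignments
```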
79
Eliminate
83
Condition
84
...
85
All algorithms generalize to any graphical model Through general operations of combination and marginalization General BE, BTE, CTE, VE+C Applicable to Markov networks, to constraint optimization, to counting number of solutions in SAT/CSP, etc.
86
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) VE+C hybrid Cycle cutset scheme AND/OR search (tree, graph)
87
(Overview figure.) Inference by elimination: complete methods – Bucket Elimination, Variable Elimination, Tree Clustering, Belief Propagation – run in time and space exp(treewidth); incomplete variants are Mini-Bucket(i) and Mini-Clustering(i). Search by conditioning: complete methods – DFS search, time exp(n) and linear space, and AND/OR search, time exp(treewidth·log n) and linear space; incomplete variants are Stochastic Local Search and Gradient Descent. Hybrids of search and inference trade space for time, ranging from exp(pathwidth) to exp(treewidth).
88
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) Cycle cutset VE+C hybrid AND/OR search spaces AND/OR tree search AND/OR graph search
89
(Figure: example network over A, B, C, D, E, F; the ordering A, B, E, C, D, F and the corresponding full OR search tree.)
90
(Figure: the moral graph, a DFS tree rooted at A, and the corresponding AND/OR search tree with alternating OR levels (variables A, B, E, C, D, F) and AND levels (values 0/1).)
91
(Figure: the OR search tree and the AND/OR search tree for the same problem, shown side by side.)
92
The AND/OR search tree has size exp(4), while the OR search tree has size exp(6).
93
The AND/OR search tree of R relative to a spanning tree T has alternating levels of OR nodes (variables) and AND nodes (values). Successor function: the successors of an OR node X are all of X's values consistent with the assignment along its path; the successors of an AND node ⟨X,x⟩ are all the child variables of X in T. A solution is a consistent subtree. Task: compute the value of the root node.
94
(Figure: (a) a graph on nodes 1–7; (b) a DFS tree of depth 3; (c) a pseudo tree of depth 2; (d) a chain of depth 6.) (Freuder 85, Bayardo & Miranker 95)
95
N = number of nodes, P = number of parents. MIN-FILL ordering. 100 instances.
96
Finding a minimum-depth DFS tree or pseudo tree is NP-complete, but: given a tree decomposition whose treewidth is w*, there exists a pseudo tree T of G whose depth m satisfies m ≤ w* · log n (Bayardo & Miranker 96, Bodlaender & Gilbert 91).
97
FA C BD E E A C B D F (AF) (EF) (A) (AB) (AC) (BC) (AE) (BD) (DE) Bucket-tree based on dd: A B C E D F E A C B D F Induced graph E A C B DF Bucket-tree used as pseudo tree AND/OR search tree Bucket-tree ABE A ABC ABAB BDEBDEAEF bucket-A bucket-E bucket-B bucket-C bucket-Dbucket-F (AE) (BE)
98
Depth-first traversal of the induced graph constructed along some elimination ordering (e.g., min-fill) Sometimes can get slightly different trees than those obtained from the bucket-tree Recursive decomposition of the dual hypergraph while minimizing the separator size at each step Functions (CPTs) are vertices in the dual hypergraph, while variables are hyperedges Separator = set of hyperedges (i.e., variables)
99
Bayesian Networks Repository
100
Theorem: any AND/OR search tree based on a pseudo tree is sound and complete (it expresses all and only the solutions). Theorem: the size of the AND/OR search tree is O(n·k^m), whereas the size of the OR search tree is O(k^n). Theorem: the size of the AND/OR search tree can be bounded by O(exp(w* · log n)). Related to (Freuder 85; Dechter 90; Bayardo et al. 96; Darwiche 01; Bacchus et al. 03). When the pseudo tree is a chain we get an OR space.
101
Random graphs with 20 nodes, 20 edges and 2 values per node
102
v(n) is the value of the tree T(n) for the task: for optimization (MPE), v(n) is the optimal solution in T(n); for belief updating, v(n) is the probability of evidence in T(n). Goal: compute the value of the root node recursively using DFS search of the AND/OR tree. Theorem: the complexity of AND/OR DFS search is space O(n) and time O(n·k^m), i.e. O(exp(w* · log n)).
103
(Figure: the weighted AND/OR search tree for the example with evidence D=1 and E=0; the CPTs are A: P(A), B: P(B|A), C: P(C|A), D: P(D|B,C) with D=1, E: P(E|A,B) with E=0.) Arc weight w(X,x) = the product of the CPTs that contain X and whose scope is fully instantiated along the path.
104
(Figure: an OR node A with AND children 1, 2, …, k, arc weights w(A,1), …, w(A,k) and child values v(A,1), …; an AND node with OR children X1, …, Xm and values v(X1), …, v(Xm).) NOTE: the value of a terminal AND node is 1; the weight of an OR–AND arc for which no CPT is fully instantiated is 1.
105
AND node: combination operator (product). OR node: marginalization operator (summation). The value of a node is the updated belief for the sub-problem below it. (Figure: evaluating the weighted AND/OR tree for evidence D=1, E=0 bottom-up.) Result: P(D=1, E=0) = 0.3028 · 0.6 + 0.1559 · 0.4 = 0.24408.
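A minimal sketch of the recursive value computation for belief updating on an AND/OR search tree; `kind`, `children` and `weight` are attributes of a hypothetical node class, not defined on the slides.

```python
def value(node):
    """OR nodes sum weighted child values; AND nodes multiply child values."""
    if not node.children:                      # terminal AND node
        return 1.0
    if node.kind == 'OR':                      # marginalization (summation)
        return sum(child.weight * value(child) for child in node.children)
    result = 1.0                               # AND node: combination (product)
    for child in node.children:
        result *= value(child)
    return result
```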
106
k = domain size m = depth of pseudo-tree n = number of variables w*= treewidth
107
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) VE+C hybrid AND/OR search spaces AND/OR tree search AND/OR graph search
108
Any two nodes that root identical sub-trees or sub-graphs can be merged
110
(Figure: a larger network over A–K, its pseudo tree, and the full AND/OR search tree; many subtrees are identical.)
111
(Figure: the corresponding AND/OR search graph obtained by merging the identical subproblems.)
112
One way of recognizing nodes that can be merged: context(X) = the ancestors of X in the pseudo tree that are connected to X or to descendants of X. (Figure: pseudo tree over A, B, C, D, E, F with contexts [ ], [A], [AB], [AE], [BC], [AB].)
113
.7.8 0 A B 0 EC 0 D 01 1 D 01 01 1 EC 0 D 01 1 D 01 01 1 B 0 EC 0101 1 EC 0101 A D BC E.7.8.9.5 Evidence: E=0.4.5.7.2.8.2.8.1.9.1.9.4.6.1.9.6.4.9.8.9.5.7.5.8.9.7.5.4.5.7.2.88.54.89.52.352.27.623.104.3028.1559.24408.3028.1559 A D B CE [ ] [A] [AB] [BC] [AB] Context Cache table for D Result: P(D=1,E=0)
114
C 0 K 0 H 0 L 01 NN 0101 FFF 1 1 0101 F G 01 1 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 G 01 G 01 G 01 M 01 M 01 M 01 M 01 P 01 P 01 O 01 O 01 O 01 O 01 L 01 NN 0101 P 01 P 01 O 01 O 01 O 01 O 01 D 01 D 01 D 01 D 01 K 0 H 0 L 01 NN 0101 1 1 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 A 01 BB 0 1 0 1 EEEE 0101 JJJJ 0101 P 01 P 01 O 01 O 01 O 01 O 01 L 01 NN 0101 P 01 P 01 O 01 O 01 O 01 O 01 D 01 D 01 D 01 D 01 BA C E FG H J D K M L N O P C HK D M F G A B E J O L N P [AB] [AF] [CHAE] [CEJ] [CD] [CHAB] [CHA] [CH] [C] [ ] [CKO] [CKLN] [CKL] [CK] [C] (C K H A B E J L N O D P M F G)
115
Theorem: the maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree. (Figure: the pseudo tree for the larger example, with ordering (C K H A B E J L N O D P M F G); the largest context size equals the treewidth.)
116
(Figure: for the same graph, a TREE decomposition has treewidth = 3 = (max cluster size) − 1, while a CHAIN decomposition has pathwidth = 4 = (max cluster size) − 1.)
117
AO(i): searches depth-first, caching on i-contexts; i = the maximum size of a cache table (i.e. the number of variables in a context). For i = 0: space O(n), time O(exp(w*·log n)). For i = w*: space O(exp w*), time O(exp w*). In between: space O(exp i), time O(exp(m_i + i)).
118
k = domain size n = number of variables w*= treewidth pw*= pathwidth w* ≤ pw* ≤ w* log n
119
Recursive Conditioning (RC) (Darwiche01) Can be viewed as an AND/OR graph search algorithm guided by tree Guiding tree structure is called “dtree” Value Elimination (VE) (Bacchus et al.03) Also an AND/OR graph search algorithm using an advanced caching scheme based on components rather than graph-based contexts Can use dynamic variable orderings
120
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) VE+C hybrid AND/OR search spaces AND/OR tree search AND/OR graph search
121
A C BK G L DF H M J E A C B K G L D F H M J E A C BK G L DF H M J E C B K G L D F H M J E 3-cutset A C BK G L DF H M J E C K G L D F H M J E 2-cutset A C BK G L DF H M J E L D F H M J E 1-cutset
122
A C B K G L D F H M J E A C B K G L D F H M J E A C B K G L D F H M J E pseudo tree1-cutset treemoral graph
123
AO(i): searches depth-first, caching on i-contexts; i = the maximum size of a cache table (i.e. the number of variables in a context). For i = 0: space O(n), time O(exp(w*·log n)). For i = w*: space O(exp w*), time O(exp w*). In between: space O(exp i), time O(exp(m_i + i)).
124
Definition: T_w is a w-cutset tree relative to backbone pseudo tree T, iff T_w roots T and when removed, yields treewidth w. Theorem: AO(i) time complexity for backbone T is time O(exp(i+m_i)) and space O(i), m_i is the depth of the T_i tree. Better than w-cutset: O(exp(i+c_i)) where c_i is the number of nodes in T_i
125
Variable elimination (inference) Bucket elimination Bucket-Tree elimination Cluster-Tree elimination Conditioning (search) VE+C hybrid AND/OR search for Most Probable Explanations
126
Solved by BE in time and space exponential in treewidth w* Solved by Conditioning in linear space and time exponential in the number of variables n It can be solved by AND/OR search: Tree search: space O(n), time O(exp(w* log n)) Graph search: time and space O(exp(w*))
127
(Figure: the weighted AND/OR search tree for the example with evidence D=1 and E=0; the CPTs are A: P(A), B: P(B|A), C: P(C|A), D: P(D|B,C) with D=1, E: P(E|A,B) with E=0.) Arc weight w(X,x) = the product of the CPTs that contain X and whose scope is fully instantiated along the path.
128
(Figure: an OR node A with AND children 1, 2, …, k, arc weights w(A,1), …, w(A,k) and child values v(A,1), …; an AND node with OR children X1, …, Xm and values v(X1), …, v(Xm).) NOTE: the value of a terminal AND node is 1; the weight of an OR–AND arc for which no CPT is fully instantiated is 1.
129
AND node: combination operator (product). OR node: marginalization operator (maximization). The value of a node is the MPE value for the sub-problem below it. (Figure: evaluating the weighted AND/OR tree for evidence D=1, E=0 bottom-up.) Result: MPE(D=1, E=0) = max(0.12 · 0.6, 0.081 · 0.4) = 0.072.
130
Depth-first Branch-and-Bound on an OR search tree (Lawler & Wood 66): g(n) is the cost of the search path to n, h(n) estimates the optimal cost below n, and the upper bound is UB(n) = g(n) · h(n). Prune n if UB(n) ≤ LB, the lower bound given by the best solution found so far.
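A hedged sketch of this pruning rule for a maximization task; `children`, `g`, `h` and `leaf_value` are hypothetical callbacks, since the slides only define g(n), h(n) and UB(n) = g(n)·h(n).

```python
def dfs_branch_and_bound(root, children, g, h, leaf_value):
    """Depth-first Branch-and-Bound over an OR search tree (maximization)."""
    best = [0.0]                              # LB: best solution found so far

    def dfs(n):
        if g(n) * h(n) <= best[0]:            # UB(n) <= LB: prune
            return
        kids = children(n)
        if not kids:                          # full assignment reached
            best[0] = max(best[0], leaf_value(n))
            return
        for c in kids:
            dfs(c)

    dfs(root)
    return best[0]
```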
131
(Figure: partial solution trees in the AND/OR space, e.g. corresponding to (A=0, B=0, C=0, D=0), (A=0, B=0, C=0, D=1), (A=0, B=1, C=0, D=0), (A=0, B=1, C=0, D=1).) Extension(T') = the set of solution trees that extend the partial solution tree T'.
132
(Figure: a partial solution tree T' with tip nodes (D,0) and F.) The exact value of the best extension of T' is f*(T') = w(A,0) · w(B,1) · w(C,0) · w(D,0) · v(D,0) · v(F).
133
If each node has a heuristic estimate h(n) ≥ v(n), then the heuristic evaluation of the partial solution tree, f(T') = w(A,0) · w(B,1) · w(C,0) · w(D,0) · h(D,0) · h(F), is an upper bound: f(T') ≥ f*(T'). (Figure: the same partial tree with heuristic estimates h(D,0) = 4 and h(F) = 5 at the tip nodes.)
134
Prune the search below the current partial solution tree T' whenever f(T') ≤ LB, the value of the best solution found so far (Marinescu & Dechter, 05).
135
Associate each node n with a heuristic upper bound h(n) on v(n) EXPAND (top-down) Evaluate f(T’) of the current partial solution sub- tree T’, and prune search if f(T’) ≤ LB Expand the tip node n by generating its successors PROPAGATE (bottom-up) Update value of the parent p of n OR nodes: maximization AND nodes: product
136
The principle of relaxed models Mini-Bucket Elimination for belief networks (Pearl86)
137
Min-fill pseudo tree. Time limit 1 hour. (Sang et al.05)
138
(Fishelson&Geiger02) Min-fill pseudo tree. Time limit 3 hours.
139
Associate each node n with a heuristic upper bound h(n) on v(n) EXPAND (top-down) Evaluate f(T’) of the current partial solution sub-tree T’, and prune search if f(T’) ≤ LB If not in cache, expand the tip node n by generating its successors PROPAGATE (bottom-up) Update value of the parent p of n OR nodes: maximization AND nodes: multiplication Cache value of n, based on context
140
Best-first search expands first the node with the best heuristic evaluation function among all nodes encountered so far It never expands nodes whose cost is beyond the optimal one, unlike depth-first search algorithms (Dechter & Pearl85) Superior among memory intensive algorithms employing the same heuristic function
141
Maintains the set of best partial solution trees EXPAND (top-down) Traces down marked connectors from root (best partial solution tree) Expands a tip node n by generating its successors n’ Associate each successor with heuristic estimate h(n’) Initialize v(n’) = h(n’) REVISE (bottom-up) Updates node values v(n) OR nodes: maximization AND nodes: multiplication Marks the most promising solution tree from the root Label the nodes as SOLVED: OR is SOLVED if marked child is SOLVED AND is SOLVED if all children are SOLVED Terminate when root node is SOLVED [specializes Nilsson’s AO* to graphical models (Nilsson80)] (Marinescu & Dechter, 07)
142
Min-fill pseudo tree. Time limit 1 hour.
143
Solved by BE in time and space exponential in constrained induced width w* Solved by AND/OR search: Tree search: space O(n), time O(exp(w* log n)) Graph search: time and space O(exp(w*))
144
A BC ED P(A) P(B|A) P(E|B,C) P(D|A,B) P(C|A) A BC ED Moralize (marry parents) Variables A and B are the hypothesis variables, variable E is evidence
145
Bucket E (SUM): P(E|B,C), E=0 → λ_E(B,C)
Bucket D (SUM): P(D|A,B) → λ_D(A,B)
Bucket C (SUM): P(C|A), λ_E(B,C) → λ_C(A,B)
Bucket B (MAX): P(B|A), λ_D(A,B), λ_C(A,B) → λ_B(A)
Bucket A (MAX): P(A), λ_B(A) → MAP value
146
The elimination order is important: the SUM variables must be eliminated first, followed by the MAX variables. The ordering A, B, C, D, E is legal; the ordering A, C, D, E, B is illegal. The induced width corresponding to a legal elimination order is called the constrained induced width cw*; typically it may be far larger than the unconstrained induced width, i.e. cw* ≥ w*. When MAX and SUM are interleaved (using unconstrained orderings) the result is an upper bound on the MAP value, which can be used as a guiding heuristic function for search.
147
AND node: combination operator (product). OR node: MAX for hypothesis variables, SUM otherwise. (Figure: evaluating the weighted AND/OR tree for evidence D=1, E=0.) Result: MAP(D=1, E=0) = max(0.162 · 0.6, 0.0936 · 0.4) = 0.0972.
148
Pseudo tree must be consistent with the constrained elimination order Graph search via context-based caching Time and space complexity Tree search: Space linear, time O(exp(cw*log n)) Graph search: Time and space O(exp(cw*))
149
Probabilistic modeling with joint distributions Conditional independence and factorization Belief networks Inference in belief networks Exact inference Approximate inference
150
Mini-Bucket Elimination Mini-clustering Iterative Belief Propagation IJGP – Iterative Joint Graph Propagation Sampling Forward sampling Gibbs sampling (MCMC) Importance sampling
151
(Overview figure, repeated.) Inference by elimination: complete methods – Bucket Elimination, Variable Elimination, Tree Clustering, Belief Propagation – run in time and space exp(treewidth); incomplete variants are Mini-Bucket(i) and Mini-Clustering(i). Search by conditioning: complete methods – DFS search, time exp(n) and linear space, and AND/OR search, time exp(treewidth·log n) and linear space; incomplete variants are Stochastic Local Search and Gradient Descent. Hybrids of search and inference trade space for time, ranging from exp(pathwidth) to exp(treewidth).
152
Given a belief network and some evidence: MPE = max_{A,E=0,D,C,B} P(A) P(B|A) P(C|A) P(D|A,B) P(E|B,C) = max_A P(A) max_{E=0} max_D max_C P(C|A) max_B P(B|A) P(D|A,B) P(E|B,C); variable elimination: the innermost maximization produces λ_B(A,D,C,E).
153
Elimination/combination operators: max, ∏.
Bucket B: P(E|B,C), P(D|A,B), P(B|A) → λ_B(A,D,C,E)
Bucket C: P(C|A), λ_B(A,D,C,E) → λ_C(A,D,E)
Bucket D: λ_C(A,D,E) → λ_D(A,E)
Bucket E: E=0, λ_D(A,E) → λ_E(A)
Bucket A: P(A), λ_E(A) → MPE
w* = 4, the “induced width” (max clique size).
154
Computation in a bucket is time and space exponential in the number of variables involved (i.e., width) Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables The idea is similar to i-consistency: bound the size of recorded dependencies (Dechter 2003)
155
Split a bucket into mini-buckets => bound complexity
156
Mini-buckets (operators max, ∏): bucket B contains 4 variables, so it is split into the mini-buckets { P(E|B,C) } and { P(D|A,B), P(B|A) }, producing λ_B(C,E) and λ_B(A,D). Buckets C (3 variables), D (2 variables) and E (1 variable) are within the bound and are processed as in BE, producing λ_C(A,D,E), λ_D(A,E) and λ_E(A). Bucket A combines P(A) with the incoming messages and yields an Upper Bound on the MPE value.
157
Generating a sub-optimal assignment by back-substitution over the mini-bucket messages: a' = argmax_A P(A) · λ_E(A); e' = 0; d' = argmax_D λ_C(a',D,e') · λ_C(a',D); c' = argmax_C P(C|a') · λ_C(C,e'); b' = argmax_B P(e'|B,c') · P(d'|a',B) · P(B|a'). Return (a', b', c', d', e'). A Lower Bound can also be computed as the probability of this sub-optimal assignment, P(a', b', c', d', e').
158
Mini-buckets (operators ∑, ∏): the same partition – bucket B (4 variables) is split into { P(E|B,C) } and { P(D|A,B), P(B|A) }, producing λ_B(C,E) and λ_B(A,D); buckets C, D and E are processed as in BE, producing λ_C(A,D,E), λ_D(A,E) and λ_E(A); bucket A yields an Upper Bound on P(evidence).
159
If we process all mini-buckets by summation then we get an unnecessarily large upper bound on the probability of evidence Tighter upper bound Process first mini-bucket by summation and remaining ones by maximization We can also get a lower bound on P(evidence) Process first mini-bucket by summation and remaining ones by minimization
160
Controlling parameter i (called i-bound) Maximum number of distinct variables in a mini-bucket Outputs both a lower and an upper bound Complexity: O(exp(i)) time and space As i-bound increases, both accuracy and time complexity increase Clearly, if i = w*, then we have pure BE Possible use of mini-bucket approximations As anytime algorithms (Dechter & Rish, 1997) As heuristic functions for depth-first and best-first search (Kask & Dechter, 2001), (Marinescu & Dechter, 2005)
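A hedged sketch of one way to partition a bucket's functions so that no mini-bucket has more than i distinct variables; the greedy rule (largest functions first) is just one simple choice, not the method prescribed by the slides.

```python
def partition_mini_buckets(functions, i_bound):
    """Greedy partition of a bucket into mini-buckets whose combined scope
    has at most i_bound distinct variables; a function is (scope, table)."""
    mini_buckets = []                       # list of [scope_set, [functions]]
    for scope, table in sorted(functions, key=lambda f: -len(f[0])):
        placed = False
        for mb in mini_buckets:
            if len(mb[0] | set(scope)) <= i_bound:
                mb[0] |= set(scope)
                mb[1].append((scope, table))
                placed = True
                break
        if not placed:
            mini_buckets.append([set(scope), [(scope, table)]])
    return mini_buckets
```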
161
Static Mini-Buckets Pre-compiled Reduced overhead Less accurate Static variable ordering Dynamic Mini-Buckets Computed dynamically Higher overhead High accuracy Dynamic variable ordering
162
If each node has a heuristic estimate h(n) ≥ v(n), then the heuristic evaluation of the partial solution tree, f(T') = w(A,0) · w(B,1) · w(C,0) · w(D,0) · h(D,0) · h(F), is an upper bound: f(T') ≥ f*(T'). (Figure: the same partial tree with heuristic estimates h(D,0) = 4 and h(F) = 5 at the tip nodes.)
163
A f(A,B) B f(B,C) C f(B,F) F f(A,G) f(F,G) G f(B,E) f(C,E) E f(A,D) f(B,D) f(C,D) D h G (A,F) h F (A,B) h B (A) h E (B,C)h D (A,B,C) h C (A,B) AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h*(a, b, c) = h D (a, b, c) * h E (b, c) (Dechter99)
164
A f(A,B) B f(B,C) C f(B,F) F f(A,G) f(F,G) G f(B,E) f(C,E) E f(B,D) f(C,D) D h G (A,F) h F (A,B) h B (A) h E (B,C)h D (B,C) h C (B) h D (A) f(A,D) D mini-buckets AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h(a, b, c) = h D (a) * h D (b, c) * h E (b, c) ≥ h*(a, b, c) MBE(3)
165
A f(a,b) B f(b,C) C f(b,F) F f(a,G) f(F,G) G f(b,E) f(C,E) E f(a,D) f(b,D) f(C,D) D h G (F) h F () h B () h E (C)h D (C) h C () AB CD E F G A B CF G DE Ordering: (A, B, C, D, E, F, G) h(a, b, c) = h D (c) * h E (c) = h*(a, b, c) MBE(3)
166
s1196 ISCAS’89 circuit.
167
Mini-Bucket Elimination Mini-clustering (tree decompositions) Iterative Belief Propagation IJGP – Iterative Joint Graph Propagation Sampling Forward sampling Gibbs sampling (MCMC) Importance sampling Particle filtering
168
Correctness and completeness: algorithm CTE is correct, i.e. it computes the exact posterior joint probability of all single variables (or subsets) and the evidence. Time complexity: O(deg · (n+N) · d^(w*+1)). Space complexity: O(N · d^sep), where deg = the maximum degree of a node, n = number of variables (= number of CPTs), N = number of nodes in the tree decomposition, d = the maximum domain size of a variable, w* = the induced width, and sep = the separator size.
169
A B C p(a), p(b|a), p(c|a,b) B C D F p(d|b), p(f|c,d) h (1,2) (b,c) B E F p(e|b,f), h (2,3) (b,f) E F G p(g|e,f) 2 4 1 3 EF BC BF sep(2,3)={B,F} elim(2,3)={C,D} G E F C D B A
170
Motivation: Time and space complexity of Cluster Tree Elimination depend on the induced width w* of the problem When the induced width w* is big, CTE algorithm becomes infeasible The basic idea: Try to reduce the size of the cluster (the exponent); partition each cluster into mini-clusters with less variables Accuracy parameter i = maximum number of variables in a mini-cluster The idea was explored for variable elimination (MBE)
171
Split a cluster into mini-clusters => bound complexity
172
A B C p(a), p(b|a), p(c|a,b) B E F p(e|b,f) E F G p(g|e,f) 2 4 1 3 EF BC BF Cluster Tree Elimination Mini-Clustering, i=3 G E F C D B A B C D F p(d|b), p(f|c,d) 2 B C D F p(d|b), h (1,2) (b,c), p(f|c,d) sep(2,3)= {B,F} elim(2,3) = {C,D} C D F B C D C D F p(f|c,d) p(d|b), h (1,2) (b,c) p(f|c,d)
173
EF BF BC ABC 2 4 1 3 BEF EFG BCDF
174
Correctness and completeness: Algorithm MC(i) computes a bound (or an approximation) on the joint probability P(X i,e) of each variable and each of its values. Time & space complexity: O(exp(i))
175
Mini-Bucket Elimination Mini-clustering Iterative Belief Propagation IJGP – Iterative Joint Graph Propagation Sampling Forward sampling Gibbs sampling (MCMC) Importance sampling Particle filtering
176
Belief propagation is exact for poly-trees (Pearl, 1988) IBP - applying BP iteratively to cyclic networks No guarantees for convergence Works well for many coding networks
177
A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI The graph IBP works on (dual graph) A D I B E J F G C H Belief network P(A) P(B|A,C) P(C) P(D|A,B,E)P(E|B,C) P(F|C,D,E) P(G|H,F) P(H) P(I|F,G)P(J|H,G,I)
178
IBP is applied to a loopy network iteratively not an anytime algorithm when it converges, it converges very fast MC applies bounded inference along a tree decomposition MC is an anytime algorithm controlled by i-bound MC converges in two passes up and down the tree IJGP combines: the iterative feature of IBP the anytime feature of MC
179
Apply Cluster Tree Elimination to any join-graph We commit to graphs that are minimal I-maps Avoid cycles as long as I-mapness is not violated Result: use minimal arc-labeled join-graphs
180
A D I B E J F G C H A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI Belief networkThe graph IBP works on (dual graph)
181
A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABABBCBC BEBE C C DEDECECE F H F FGFGGHGHH GI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABABBCBC C DEDECECE H F FGFGGHGH GI
182
A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGFGGHGH GIGI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGHGH GIGI
183
a) Minimal arc-labeled join graphb) Join-graph obtained by collapsing nodes of graph a) c) Minimal arc-labeled join graph A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BCBC CDECECE F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BCBC DECECE F FGH GI
184
ABCDE FGHIGHIJ CDEF CDE F GHI a) Minimal arc-labeled join graphb) Tree decomposition ABCDE FGI BCE GHIJ CDEF FGH BC DECE F FGH GI
185
A ABDE FGI ABC BCE GHIJ CDEF FGH C H A C AABBC BE C C DECE F H F FGGHH GI A ABDE FGI ABC BCE GHIJ CDEF FGH C H A ABBC C DECE H F FGH GI ABCDE FGI BCE GHIJ CDEF FGH BC DECE F FGH GI ABCDE FGHIGHIJ CDEF CDE F GHI more accuracy less complexity
186
ABCDE FGI BCE GHIJ CDEF FGH BCBC CDE CECE F FGH GI ABCDE p(a), p(c), p(b|ac), p(d|abe),p(e|b,c) h(3,1)(bc) BCD CDEF BCBC CDE CECE 13 2 h (3,1) (bc) h (1,2) Minimal arc-labeled: sep(1,2)={D,E} elim(1,2)={A,B,C} Non-minimal arc-labeled: sep(1,2)={C,D,E} elim(1,2)={A,B}
187
We want arc-labeled decompositions such that: the cluster size (internal width) is bounded by i (the accuracy parameter) the width of the decomposition as a graph (external width) is as small as possible – closer to a tree Possible approaches to build decompositions: partition-based algorithms - inspired by the mini-bucket decomposition grouping-based algorithms
188
G E F C D B A a) schematic mini-bucket(i), i=3 b) minimal arc-labeled join-graph decomposition CDB CAB BA A CB P(D|B) P(C|A,B) P(A) BA P(B|A) FCD P(F|C,D) GFE EBF BF EF P(E|B,F) P(G|F,E) B CD BF A F G: (GFE) E: (EBF) (EF) F: (FCD) (BF) D: (DB) (CD) C: (CAB) (CB) B: (BA) (AB) (B) A: (A)
189
IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i. On join-trees IJGP finds the exact beliefs! IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman and Weiss, 2001). Complexity of one iteration: time O(deg · (n+N) · d^(i+1)), space O(N · d^i).
190
evidence=0 evidence=5
191
evidence=0evidence=5
193
IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC Empirical evaluation showed the potential of IJGP, which improves with iteration and most of the time with i-bound, and scales up to large networks IJGP is almost always superior, often by a high margin, to IBP and MC Based on all our experiments, we think that IJGP provides a practical breakthrough to the task of belief updating #CSP: can use IJGP to generate solution counts estimates for depth-first Branch-and-Bound search
194
Mini-Bucket Elimination Mini-clustering Iterative Belief Propagation IJGP – Iterative Joint Graph Propagation Sampling Forward sampling Gibbs sampling (MCMC) Importance sampling
195
Structural Approximations Eliminate some dependencies Remove edges Mini-Bucket and Mini-Clustering approaches Local Search Approach for optimization tasks: MPE, MAP Favorite MAX-CSP/WCSP/WSAT local search solver! Sampling Generate random samples and compute values of interest from samples, not original network
196
Input: Bayesian network with set of nodes X Sample = a tuple with assigned values s=(X 1 =x 1,X 2 =x 2,…,X k =x k ) Tuple may include all variables (except evidence) or a subset Sampling schemas dictate how to generate samples (tuples) Ideally, samples are distributed according to P(X|E)
197
Given a set of variables X = {X1, X2, …, Xn} with joint probability distribution P(X) and some function g(X), we can compute the expected value of g(X): E_P[g(X)] = Σ_x g(x) · P(x).
198
Given independent, identically distributed (iid) samples S1, S2, …, ST from P(X), it follows from the Strong Law of Large Numbers that the sample average (1/T) Σ_t g(S_t) converges to E_P[g(X)] as T → ∞. A sample S_t is an instantiation of all the variables.
199
Given a random variable X with D(X) = {0, 1} and P(X) = (0.3, 0.7), generate k = 10 samples: 0,1,1,1,0,1,1,0,1,0. Approximate P'(X) by the sample frequencies: P'(X=0) = 4/10 = 0.4, P'(X=1) = 6/10 = 0.6.
200
Given a random variable X with D(X) = {0, 1} and P(X) = (0.3, 0.7), sample X ~ P(X): draw a random number r ∈ [0, 1]; if r < 0.3 then set X = 0, else set X = 1. This generalizes to any domain size.
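A minimal sketch of this inverse-CDF draw; the function name is an assumption and is reused in the later sampling sketches.

```python
import random

def sample_value(values, probs):
    """Draw one value from a discrete distribution, e.g. values=(0, 1),
    probs=(0.3, 0.7) returns 0 with probability 0.3."""
    r = random.random()                     # uniform in [0, 1)
    cumulative = 0.0
    for v, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return v
    return values[-1]                       # guard against rounding error
```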
201
Same idea: generate a set of samples T Estimate posterior marginal P(X i |E) from samples Challenge: X is a vector and P(X) is a huge distribution represented by BN Need to know: How to generate a new sample ? How many samples T do we need ? How to estimate P(E=e) and P(X i |e) ?
202
Forward Sampling Gibbs Sampling (MCMC) Blocking Rao-Blackwellised Likelihood Weighting Importance Sampling Sequential Monte-Carlo (Particle Filtering) in Dynamic Bayesian Networks
203
Forward Sampling Case with No evidence E={} Case with Evidence E=e
204
Input: Bayesian network X = {X1, …, XN}, N = #nodes, T = #samples. Output: T samples. Process nodes in topological order – first process the ancestors of a node, then the node itself: 1. For t = 1 to T; 2. For i = 1 to N; 3. sample x_i^t from P(Xi | pa_i).
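A hedged sketch of this forward-sampling loop, reusing the hypothetical CPT encoding and sample_value() from the earlier sketches.

```python
def forward_sample(order, cpts, parents, domains):
    """Visit variables in topological order and draw each Xi from
    P(Xi | pa(Xi)) given the already-sampled parent values."""
    sample = {}
    for X in order:                         # ancestors before descendants
        pa_vals = tuple(sample[P] for P in parents[X])
        probs = [cpts[X][(x, pa_vals)] for x in domains[X]]
        sample[X] = sample_value(domains[X], probs)
    return sample
```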
205
What does it mean to sample x i t from P(X i | pa i ) ? Assume D(X i )={0,1} Assume P(X i | pa i ) = (0.3, 0.7) Draw a random number r from [0,1] If r falls in [0,0.3], set X i = 0 If r falls in [0.3,1], set X i = 1 010.3 r
206
X1X1 X4X4 X2X2 X3X3
207
Task: given T samples {S1, S2, …, ST}, estimate P(Xi = xi): count the proportion of samples in which Xi = xi, i.e. P'(Xi = xi) = (1/T) · #{t : x_i^t = xi}.
208
Input: Bayesian network X = {X1, …, XN}, N = #nodes, evidence E, T = #samples. Output: T samples consistent with E. 1. For t = 1 to T; 2. For i = 1 to N; 3. sample x_i^t from P(Xi | pa_i); 4. if Xi ∈ E and x_i^t ≠ e_i, reject the sample: 5. set i = 1 and go to step 2.
209
X1X1 X4X4 X2X2 X3X3
210
Let Y be a subset of evidence nodes s.t. Y=u
211
Theorem: Let s (y) be the estimate of P(y) resulting from a randomly chosen sample set S with T samples. Then, to guarantee relative error at most with probability at least 1- it is enough to have: Derived from Chebychev’s Bound.
212
Advantages: P(x i | pa(x i )) is readily available Samples are independent ! Drawbacks: If evidence E is rare (P(e) is low), then we will reject most of the samples! Since P(y) in estimate of T is unknown, must estimate P(y) from samples themselves! If P(e) is small, T will become very big!
213
Forward Sampling High Rejection Rate Fix evidence values Gibbs sampling (MCMC) Likelihood Weighting Importance Sampling
214
Forward Sampling High rejection rate Samples are independent Fix evidence values Gibbs sampling (MCMC) Likelihood Weighting Importance Sampling
215
Forward Sampling Gibbs Sampling (MCMC) Blocking Rao-Blackwellised Likelihood Weighting Importance Sampling
216
Markov Chain Monte Carlo method (Gelfand and Smith, 1990, Smith and Roberts, 1993, Tierney, 1994) Samples are dependent, form Markov Chain Sample from P’(X|e) which converges to P(X|e) Guaranteed to converge when all P > 0 Methods to improve convergence: Blocking Rao-Blackwellised
217
A sample t [1,2,…], is an instantiation of all variables in the network: Sampling process Fix values of observed variables e Instantiate node values in sample x 0 at random Generate samples x 1,x 2,…x T from P(X|e) Compute posteriors from samples
218
Generate sample x^{t+1} from x^t: in short, for i = 1 to N, sample x_i^{t+1} from P(Xi | x_1^{t+1}, …, x_{i-1}^{t+1}, x_{i+1}^t, …, x_N^t, e), processing all variables in some order.
219
Markov blanket: the Markov blanket of Xi consists of its parents, its children, and its children's other parents; conditioned on its Markov blanket, Xi is independent of all remaining variables, so P(Xi | x \ xi, e) = P(Xi | markov blanket of Xi).
220
Input: X, E. Output: T samples {x^t}. Fix the evidence E and generate samples from P(X | E): 1. For t = 1 to T (compute samples); 2. For i = 1 to N (loop through variables); 3. sample x_i^t from P(Xi | markov_t \ Xi).
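A minimal sketch of this Gibbs loop; `markov_conditional(X, state)` is a hypothetical hook returning P(X | markov blanket of X in `state`) as a list aligned with domains[X], and sample_value() is reused from the earlier sketch.

```python
def gibbs_samples(order, evidence, markov_conditional, domains, T, init):
    """Generate T dependent samples approximating P(X | e)."""
    state = dict(init, **evidence)          # evidence variables stay fixed
    samples = []
    for _ in range(T):
        for X in order:
            if X in evidence:
                continue
            probs = markov_conditional(X, state)
            state[X] = sample_value(domains[X], probs)
        samples.append(dict(state))
    return samples
```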
221
Query: P(xi | e) = ? Method 1 (histogram estimator): count the fraction of samples in which Xi = xi. Method 2 (mixture estimator): average the conditional probability over the samples, P'(xi | e) = (1/T) Σ_t P(xi | markov blanket of Xi in sample t).
222
X = {X 1,X 2,…,X 9 } E = {X 9 } X1 X4 X8X5 X2 X3 X9 X7 X6
223
X 1 = x 1 0 X 6 = x 6 0 X 2 = x 2 0 X 7 = x 7 0 X 3 = x 3 0 X 8 = x 8 0 X 4 = x 4 0 X 5 = x 5 0 X1 X4 X8X5 X2 X3 X9 X7 X6
224
X 1 P (X 1 |X 0 2,…,X 0 8,X 9 ) E = {X 9 } P (X 1 =0 |X 0 2,X 0 3,X 9 } = αP(X 1 =0)P(X 0 2 |X 1 =0)P(X 3 0 |X 1 =0) P (X 1 =1 |X 0 2,X 0 3,X 9 } = αP(X 1 =1)P(X 0 2 |X 1 =1)P(X 3 0 |X 1 =1) X1 X4 X8X5 X2 X3 X9 X7 X6
225
X 2 P(X 2 |X 1 1,…,X 0 8,X 9 } E = {X 9 } Markov blanket for X 2 is: {X 2, X 1, X 4, X 5, X 3 } X1 X4 X8X5 X2 X3 X9 X7 X6
227
We want to sample from P(X | E), but the starting point is random. Solution: throw away the first K samples (known as “burn-in”). What is K? Hard to tell – use intuition. Alternative: initialize the first sample from an approximation of P(x|e), for example by running IBP first.
228
Convergence to the stationary distribution π: π = πP, where P is the transition kernel with entries p_ij = P(x^i → x^j). The chain is guaranteed to converge iff it is irreducible, aperiodic and ergodic (p_ij > 0 for all i, j).
229
Advantage : guaranteed to converge to P(X|E), as long as P i > 0 Disadvantage : convergence may be slow Problems: Samples are dependent ! Statistical variance is too big in high-dimensional problems
230
Objectives: 1.Reduce dependence between samples (autocorrelation) Skip samples Randomize Variable Sampling Order 2.Reduce variance Blocking Gibbs Sampling Rao-Blackwellisation
231
Pick only every k-th sample (Geyer, 1992). This can reduce the dependence between samples, but it increases variance and wastes samples!
232
Random Scan Gibbs Sampler Pick each next variable X i for update at random with probability p i, i p i = 1. In the simplest case, p i are distributed uniformly. In some instances, reduces variance (MacEachern, Peruggia, 1999)
233
Sample several variables together, as a block. Example: given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4. Then, given sample (x^t, y^t, z^t), compute the next sample: x^{t+1} ~ P(X | y^t, z^t) = P(X | w^t), then w^{t+1} = (y^{t+1}, z^{t+1}) ~ P(W | x^{t+1}). (+) Can improve convergence greatly when two variables are strongly correlated! (−) The domain of the block variable grows exponentially with the number of variables in a block!
234
Do not sample all variables – sample a subset! Example: given three variables X, Y, Z, sample only X and Y and sum out Z. Given sample (x^t, y^t), compute the next sample: x^{t+1} ~ P(X | y^t), then y^{t+1} ~ P(Y | x^{t+1}).
235
Bottom line: reducing number of variables in a sample reduce variance!
236
Standard Gibbs: P(x|y,z), P(y|x,z), P(z|x,y) (1). Blocking: P(x|y,z), P(y,z|x) (2). Rao-Blackwellised: P(x|y), P(y|x) (3). Then Var(3) < Var(2) < Var(1) (Liu, Wong & Kong, 1994).
237
Select C ⊆ X (possibly a cycle-cutset), |C| = m. Fix the evidence E. Initialize the cutset nodes with random values: for i = 1 to m, set C_i = c_i^0. For t = 1 to T, generate samples: for i = 1 to m, sample C_i = c_i^{t+1} from P(C_i | c_1^{t+1}, …, c_{i-1}^{t+1}, c_{i+1}^t, …, c_m^t, e).
238
Generate sample c t+1 from c t :
239
How to choose C ? Special case: C is cycle-cutset, O(N) General case: apply Bucket Tree Elimination (BTE), O(exp(w)) where w is the induced width of the network when nodes in C are observed. Pick C wisely so as to minimize w notion of w- cutset
240
C=w-cutset of the network, a set of nodes such that when C and E are instantiated, the adjusted induced width of the network is w Complexity of exact inference: bounded by w ! Cycle-cutset is a special case!
241
Query: c i C, P(c i |e)=? same as Gibbs: Special case of w-cutset Query: P(x i |e) = ? computed while generating sample t compute after generating sample t (easy because C is a cut-set)
242
X1 X7 X5 X4 X2 X9 X8 X3 E=x 9 X6
243
X1 X7 X6X5 X4 X2 X9 X8 X3 Sample a new value for X 2 :
244
X1 X7 X6X5 X4 X2 X9 X8 X3 Sample a new value for X 5 :
245
X1 X7 X6X5 X4 X2 X9 X8 X3 Query P(x 2 |e) for sampling node X 2 : Sample 1 Sample 2 Sample 3
246
X1 X7 X6X5 X4 X2 X9 X8 X3 Query P(x 3 |e) for non-sampled node X 3 :
247
MSE vs. #samples (left) and time (right) Non-Ergodic (1 deterministic CPT entry) |X| = 179, |C| = 8, 2<= D(X i )<=4, |E| = 35 Exact Time = 122 sec using Loop-Cutset Conditioning
248
MSE vs. #samples (left) and time (right) Ergodic, |X| = 360, D(X i )=2, |C| = 21, |E| = 36 Exact Time > 60 min using Cutset Conditioning Exact Values obtained via Bucket Elimination
249
Forward Sampling Gibbs Sampling (MCMC) Blocking Rao-Blackwellised Likelihood Weighting Importance Sampling
250
“Clamping” evidence + Forward sampling + Weighting samples by evidence likelihood Works well for likely evidence!
251
Sample the non-evidence variables in topological order over X: x_i ~ P(Xi | pa_i), where P(Xi | pa_i) is a look-up in the CPT; evidence variables are clamped to their observed values.
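A hedged sketch of drawing one likelihood-weighting sample, reusing the hypothetical encodings and sample_value() from the earlier sketches.

```python
def lw_sample(order, cpts, parents, domains, evidence):
    """Evidence variables are clamped and the weight accumulates their
    likelihood P(e_i | pa_i); the other variables are forward sampled."""
    sample, weight = {}, 1.0
    for X in order:
        pa_vals = tuple(sample[P] for P in parents[X])
        if X in evidence:
            sample[X] = evidence[X]
            weight *= cpts[X][(evidence[X], pa_vals)]   # evidence likelihood
        else:
            probs = [cpts[X][(x, pa_vals)] for x in domains[X]]
            sample[X] = sample_value(domains[X], probs)
    return sample, weight
```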
253
Estimate the posterior marginals P(Xi | e) by normalized weighted counts over the samples: P'(xi | e) = Σ_t w_t · 1{x_i^t = xi} / Σ_t w_t.
254
Converges to the exact posterior marginals and generates samples fast. However, the sampling distribution is close to the prior (especially if the evidence sits at the leaf nodes), which increases the sampling variance; convergence may be slow, and many samples with P(x^(t)) = 0 are effectively wasted.
255
Forward Sampling Gibbs Sampling (MCMC) Blocking Rao-Blackwellised Likelihood Weighting Importance Sampling
256
In general, it is hard to sample from target distribution P(X|E) Generate samples from sampling (proposal) distribution Q(X) Weigh each sample against P(X|E)
258
Given a distribution called the proposal distribution Q (such that P(Z=z, e) > 0 ⇒ Q(Z=z) > 0), each sample z is weighted by w(Z=z) = P(Z=z, e) / Q(Z=z); w(Z=z) is called the importance weight.
259
Underlying principle: approximate the average over a set of numbers by the average over a set of sampled numbers.
260
Express the problem as computing the average over a set of real numbers Sample a subset of real numbers Approximate the true average by sample average. True Average: Average of (0.11, 0.24, 0.55, 0.77, 0.88,0.99)=0.59 Sample Average over 2 samples: Average of (0.24, 0.77) = 0.505
261
Express Q in product form: Q(Z)=Q(Z 1 )Q(Z 2 |Z 1 )….Q(Z n |Z 1,..Z n-1 ) Sample along the order Z 1,..Z n Example: Q(Z 1 )=(0.2,0.8) Q(Z 2 |Z 1 )=(0.2,0.8,0.1,0.9) Q(Z 3 |Z 1,Z 2 )=Q(Z 3 |Z 1 )=(0.5,0.5,0.3,0.7)
262
Each Sample Z=z Sample Z 1 =z 1 from Q(Z 1 ) Sample Z 2 =z 2 from Q(Z 2 |Z 1 =z1) Sample Z 3 =z 3 from Q(Z 3 |Z1=z1) Generate N such samples
263
Q= Prior Distribution = CPTs of the Bayesian network
264
lung Cancer Smoking X-ray Bronchitis Dyspnoea P(D|C,B) P(B|S) P(S) P(X|C,S) P(C|S) P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)
265
(Same network, with evidence B=0.) With Q = prior: Q(S,C,D) = Q(S) · Q(C|S) · Q(D|C,B=0) = P(S) P(C|S) P(D|C,B=0). Sample S=s from P(S), sample C=c from P(C|S=s), sample D=d from P(D|C=c,B=0).