Approximation Techniques bounded inference 275b
SP22 Mini-buckets: “local inference” The idea is similar to i-consistency: bound the size of recorded dependencies Computation in a bucket is time and space exponential in the number of variables involved Therefore, partition functions in a bucket into “mini-buckets” on smaller number of variables
SP23 Mini-bucket approximation: MPE task Split a bucket into mini-buckets =>bound complexity
SP24 Approx-mpe(i) Input: i – max number of variables allowed in a mini-bucket Output: [lower bound (P of a sub-optimal solution), upper bound] Example: approx-mpe(3) versus elim-mpe
SP25 Properties of approx-mpe(i) Complexity: O(exp(2i)) time and O(exp(i)) space. Accuracy: determined by the upper/lower (U/L) bound ratio. As i increases, both accuracy and complexity increase. Possible uses of mini-bucket approximations: as anytime algorithms (Dechter and Rish, 1997); as heuristics in best-first search (Kask and Dechter, 1999). Other tasks: similar mini-bucket approximations exist for belief updating, MAP and MEU (Dechter and Rish, 1997)
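The mini-bucket idea for MPE can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the factor representation, the greedy first-fit `partition` rule, the helper names and the CPT numbers are all assumptions made for the example, and only the upper bound is computed (the lower bound from a sub-optimal solution is omitted). The network structure P(a) P(b|a) P(c|a) P(e|b,c) P(d|a,b) is the one used later in these slides.

```python
from itertools import product

DOM = (0, 1)  # assumption: all variables are binary

def factor(scope, fn):
    """Tabulate fn over every assignment to the variables in scope."""
    return (tuple(scope), {v: fn(*v) for v in product(DOM, repeat=len(scope))})

def max_out(factors, var):
    """Multiply the given factors and maximize the product over var."""
    scope = sorted({v for s, _ in factors for v in s} - {var})
    table = {}
    for vals in product(DOM, repeat=len(scope)):
        env = dict(zip(scope, vals))
        best = 0.0
        for x in DOM:
            env[var] = x
            prod = 1.0
            for s, t in factors:
                prod *= t[tuple(env[v] for v in s)]
            best = max(best, prod)
        table[vals] = best
    return (tuple(scope), table)

def partition(bucket, i):
    """Greedy first-fit split into mini-buckets over at most i variables."""
    minis = []
    for f in sorted(bucket, key=lambda f: -len(f[0])):
        for m in minis:
            if len(set(f[0]).union(*(g[0] for g in m))) <= i:
                m.append(f)
                break
        else:
            minis.append([f])
    return minis

def approx_mpe_upper(factors, order, i):
    """Upper bound on the MPE value; exact when no bucket needs splitting."""
    factors = list(factors)
    for var in order:
        bucket = [f for f in factors if var in f[0]]
        factors = [f for f in factors if var not in f[0]]
        factors += [max_out(m, var) for m in partition(bucket, i)]
    bound = 1.0
    for _, t in factors:  # every remaining factor has an empty scope
        bound *= t[()]
    return bound

def bern(p, x):
    """Probability of x under a Bernoulli with P(x=1) = p."""
    return p if x == 1 else 1.0 - p

# Hypothetical CPT values, chosen only for illustration.
net = [
    factor("a", lambda a: bern(0.4, a)),
    factor("ab", lambda a, b: bern(0.8 if a else 0.3, b)),
    factor("ac", lambda a, c: bern(0.2 if a else 0.7, c)),
    factor("bce", lambda b, c, e: bern(0.1 + 0.4 * b + 0.4 * c, e)),
    factor("abd", lambda a, b, d: bern(0.2 + 0.3 * a + 0.4 * b, d)),
]
order = ["b", "c", "d", "e", "a"]
upper = approx_mpe_upper(net, order, i=3)  # bucket of b is split in two
exact = approx_mpe_upper(net, order, i=5)  # no split: plain elimination
print(upper, exact)
```

With i=3 the bucket of b is split into {P(e|b,c)} and {P(d|a,b), P(b|a)}, so b is maximized independently in each part and the result can only overestimate the exact value.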
SP26 Anytime Approximation
SP27 Bounded elimination for belief updating The mini-bucket idea is the same: we can apply a sum in each mini-bucket or, better, a sum in one mini-bucket and max in the rest (or min, for a lower bound). Approx-bel-max(i,m) generates upper and lower bounds on beliefs and approximates elim-bel. Approx-map(i,m): max buckets are maximized, sum buckets are processed by sum-max. It approximates elim-map.
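A small numeric check may help show why "one sum and the rest max" is preferable to a sum in every mini-bucket. The two functions f and g below are hypothetical non-negative tables over a single eliminated variable; they stand in for the products of two mini-buckets.

```python
# Hypothetical non-negative functions of the eliminated variable x.
f = [0.2, 0.5, 0.3]
g = [0.9, 0.1, 0.4]

exact = sum(fx * gx for fx, gx in zip(f, g))  # sum_x f(x) g(x)
upper_sum_max = sum(f) * max(g)               # sum in one mini-bucket, max in the other
lower_sum_min = sum(f) * min(g)               # min instead of max gives a lower bound
upper_sum_sum = sum(f) * sum(g)               # a sum in every mini-bucket

print(lower_sum_min, exact, upper_sum_max, upper_sum_sum)
```

Both upper bounds are valid, but the sum-max bound is never looser than the sum-sum bound, because max(g) is at most sum(g) for non-negative g.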
SP28 Empirical Evaluation (Dechter and Rish, 1997; Rish thesis, 1999) Randomly generated networks Uniform random probabilities Random noisy-OR CPCS networks Probabilistic decoding Comparing approx-mpe and anytime-mpe versus elim-mpe
SP29 Random networks Uniform random: 60 nodes, 90 edges (200 instances) In 80% of cases, times speed-up while U/L<2 Noisy-OR – even better results Exact elim-mpe was infeasible; approx-mpe took 0.1 to 80 sec.
SP210 CPCS networks – medical diagnosis (noisy-OR model) Test case: no evidence [Table: Time (sec) per algorithm, for anytime-mpe( ), anytime-mpe( ) and elim-mpe, on cpcs422 and cpcs360]
SP211 The effect of evidence More likely evidence=>higher MPE => higher accuracy (why?) Likely evidence versus random (unlikely) evidence
SP212 Probabilistic decoding Error-correcting linear block code State-of-the-art: approximate algorithm – iterative belief propagation (IBP) (Pearl’s poly-tree algorithm applied to loopy networks)
SP213 Iterative Belief Propagation Belief propagation is exact for poly-trees IBP - applying BP iteratively to cyclic networks No guarantees for convergence Works well for many coding networks
SP214 approx-mpe vs. IBP Bit error rate (BER) as a function of noise (sigma):
SP215 Mini-buckets: summary Mini-buckets – local inference approximation Idea: bound size of recorded functions Approx-mpe(i) - mini-bucket algorithm for MPE Better results for noisy-OR than for random problems Accuracy increases with decreasing noise in Accuracy increases for likely evidence Sparser graphs -> higher accuracy Coding networks: approx-mpe outperforms IBP on low-induced-width codes
SP216 Heuristic search Mini-buckets record upper-bound heuristics The evaluation function over Best-first: expand a node with maximal evaluation function Branch and Bound: prune if f >= upper bound Properties: an exact algorithm Better heuristics lead to more pruning
SP217 Heuristic Function Given a cost function P(a,b,c,d,e) = P(a) P(b|a) P(c|a) P(e|b,c) P(d|a,b), define an evaluation function over a partial assignment as the probability of its best extension: $f^*(a,e,d) = \max_{b,c} P(a,b,c,d,e) = P(a) \cdot \max_{b,c} P(b|a)\, P(c|a)\, P(e|b,c)\, P(d|a,b) = g(a,e,d) \cdot H^*(a,e,d)$
SP218 Heuristic Function $H^*(a,e,d) = \max_{b,c} P(b|a)\, P(c|a)\, P(e|b,c)\, P(d|a,b) = \max_c P(c|a) \max_b \left[ P(e|b,c)\, P(b|a)\, P(d|a,b) \right] \le \left[ \max_c P(c|a) \max_b P(e|b,c) \right] \cdot \left[ \max_b P(b|a)\, P(d|a,b) \right] = H(a,e,d)$, and therefore $f(a,e,d) = g(a,e,d) \cdot H(a,e,d) \ge f^*(a,e,d)$. The heuristic function H is compiled during the preprocessing stage of the Mini-Bucket algorithm.
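SP219 Heuristic Function Mini-bucket trace: bucket B: mini-bucket {P(e|b,c)} records $h^B(e,c)$ and mini-bucket {P(d|a,b), P(b|a)} records $h^B(d,a)$; bucket C: {P(c|a), $h^B(e,c)$} records $h^C(e,a)$; bucket D: {$h^B(d,a)$} records $h^D(a)$; bucket E: {$h^C(e,a)$} records $h^E(a)$; bucket A: {P(a), $h^E(a)$, $h^D(a)$}. The evaluation function $f(x^p)$ can be computed using the functions recorded by the Mini-Bucket scheme and can be used to estimate the probability of the best extension of a partial assignment $x^p = \{x_1, \dots, x_p\}$: $f(x^p) = g(x^p) \cdot H(x^p)$. For example, $H(a,e,d) = h^B(d,a) \cdot h^C(e,a)$ and $g(a,e,d) = P(a)$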
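The relation between the exact value H*(a,e,d) and the mini-bucket heuristic H(a,e,d) = h^B(d,a) h^C(e,a) can be checked numerically. The sketch below assumes binary variables and hypothetical CPT values (the slides give none); the function names mirror the recorded functions h^B and h^C and are otherwise illustrative.

```python
from itertools import product

def bern(p, x):
    """Probability of x under a Bernoulli with P(x=1) = p."""
    return p if x == 1 else 1.0 - p

# Hypothetical CPTs for P(a) P(b|a) P(c|a) P(e|b,c) P(d|a,b).
def p_a(a): return bern(0.4, a)
def p_b(b, a): return bern(0.8 if a else 0.3, b)
def p_c(c, a): return bern(0.2 if a else 0.7, c)
def p_e(e, b, c): return bern(0.1 + 0.4 * b + 0.4 * c, e)
def p_d(d, a, b): return bern(0.2 + 0.3 * a + 0.4 * b, d)

def h_star(a, e, d):
    """Exact value: maximize jointly over b and c."""
    return max(p_b(b, a) * p_c(c, a) * p_e(e, b, c) * p_d(d, a, b)
               for b, c in product((0, 1), repeat=2))

# Functions recorded by the mini-bucket scheme (bucket B is split in two).
def h_B_ec(e, c): return max(p_e(e, b, c) for b in (0, 1))
def h_B_da(d, a): return max(p_b(b, a) * p_d(d, a, b) for b in (0, 1))
def h_C_ea(e, a): return max(p_c(c, a) * h_B_ec(e, c) for c in (0, 1))

def h(a, e, d):
    """Mini-bucket heuristic H(a,e,d) = h^B(d,a) * h^C(e,a)."""
    return h_B_da(d, a) * h_C_ea(e, a)

for a, e, d in product((0, 1), repeat=3):
    f_star = p_a(a) * h_star(a, e, d)  # probability of the best extension
    f = p_a(a) * h(a, e, d)            # evaluation function f = g * H
    print((a, e, d), round(f_star, 4), round(f, 4))
```

In every row the evaluation function f is at least f*, which is the admissibility property the search relies on.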
SP220 Properties Heuristic is monotone Heuristic is admissible Heuristic is computed in linear time IMPORTANT: Mini-buckets generate heuristics of varying strength using a control parameter, the bound i Higher bound -> more preprocessing -> stronger heuristics -> less search Allows controlled trade-off between preprocessing and search
SP221 Empirical Evaluation of mini-bucket heuristics
SP222 Cluster Tree Elimination - properties Correctness and completeness: Algorithm CTE is correct, i.e. it computes the exact joint probability of a single variable and the evidence. Time complexity: $O(deg \cdot (n+N) \cdot d^{w^*+1})$ Space complexity: $O(N \cdot d^{sep})$ where deg = the maximum degree of a node, n = number of variables (= number of CPTs), N = number of nodes in the tree decomposition, d = the maximum domain size of a variable, w* = the induced width, sep = the separator size
SP223 Mini-Clustering for belief updating Motivation: Time and space complexity of Cluster Tree Elimination depend on the induced width w* of the problem When the induced width w* is big, the CTE algorithm becomes infeasible The basic idea: Try to reduce the size of the cluster (the exponent); partition each cluster into mini-clusters with fewer variables Accuracy parameter i = maximum number of variables in a mini-cluster The idea was explored for variable elimination (Mini-Bucket)
SP224 Idea of Mini-Clustering Split a cluster into mini-clusters => bound complexity
SP225 Mini-Clustering - example [Figure: cluster tree with clusters ABC, BCDF, BEF, EFG and separators BC, BF, EF]
SP226 Cluster Tree Elimination vs. Mini-Clustering [Figure: the cluster tree with clusters ABC, BCDF, BEF, EFG and separators BC, BF, EF, shown once for each algorithm]
SP227 Mini-Clustering Correctness and completeness: Algorithm MC(i) computes a bound (or an approximation) on the joint probability $P(X_i, e)$ of each variable and each of its values. Time & space complexity: $O(n \cdot hw^* \cdot d^i)$ where $hw^* = \max_u |\{ f \mid scope(f) \cap \chi(u) \neq \emptyset \}|$
SP228 Experimental results Algorithms: Exact IBP Gibbs sampling (GS) MC with normalization (approximate) Networks (all variables are binary): Coding networks CPCS 54, 360, 422 Grid networks (MxM) Random noisy-OR networks Random networks Measures: Normalized Hamming Distance (NHD) BER (Bit Error Rate) Absolute error Relative error Time
SP229 Random networks - Absolute error [Plots: evidence=0, evidence=10]
SP230 Noisy-OR networks - Absolute error [Plots: evidence=10, evidence=20]
SP231 Grid 15x evidence
SP232 CPCS422 - Absolute error [Plots: evidence=0, evidence=10]
SP233 Coding networks - Bit Error Rate [Plots: sigma=0.22, sigma=0.51]
SP234 Mini-Clustering summary MC extends the partition based approximation from mini-buckets to general tree decompositions for the problem of belief updating Empirical evaluation demonstrates its effectiveness and superiority (for certain types of problems, with respect to the measures considered) relative to other existing algorithms
SP235 What is IJGP? IJGP is an approximate algorithm for belief updating in Bayesian networks IJGP is a version of join-tree clustering which is both anytime and iterative IJGP applies message passing along a join-graph, rather than a join-tree Empirical evaluation shows that IJGP is almost always superior to other approximate schemes (IBP, MC)
SP236 Iterative Belief Propagation - IBP Belief propagation is exact for poly-trees IBP - applying BP iteratively to cyclic networks No guarantees for convergence Works well for many coding networks One step: update $BEL(U_1)$ [Figure: nodes U1, U2, U3, X1, X2]
SP237 IJGP - Motivation IBP: is applied to a loopy network iteratively; is not an anytime algorithm; when it converges, it converges very fast. MC: applies bounded inference along a tree decomposition; is an anytime algorithm controlled by the i-bound; converges in two passes, up and down the tree. IJGP combines the iterative feature of IBP with the anytime feature of MC
SP238 IJGP - The basic idea Apply Cluster Tree Elimination to any join-graph We commit to graphs that are minimal I-maps Avoid cycles as long as I-mapness is not violated Result: use minimal arc-labeled join-graphs
SP239 IJGP - Example [Figure: a) belief network over variables A, B, C, D, E, F, G, H, I, J; b) the graph IBP works on, with nodes A, ABDE, FGI, ABC, BCE, GHIJ, CDEF, FGH, C, H]
SP240 Arc-minimal join-graph [Figure: two join-graphs over clusters A, ABDE, FGI, ABC, BCE, GHIJ, CDEF, FGH, C, H]
SP241 Minimal arc-labeled join-graph [Figure: two join-graphs over clusters A, ABDE, FGI, ABC, BCE, GHIJ, CDEF, FGH, C, H]
SP242 Join-graph decompositions [Figure: a) minimal arc-labeled join graph over clusters A, ABDE, FGI, ABC, BCE, GHIJ, CDEF, FGH, C, H; b) join-graph obtained by collapsing nodes of graph a), with clusters ABCDE, FGI, BCE, GHIJ, CDEF, FGH; c) minimal arc-labeled join graph over the clusters of b)]
SP243 Tree decomposition [Figure: a) minimal arc-labeled join graph with clusters ABCDE, FGI, BCE, GHIJ, CDEF, FGH; b) tree decomposition with clusters ABCDE, CDEF, FGHI, GHIJ and separators CDE, F, GHI]
SP244 Join-graphs [Figure: the four decompositions side by side, from the graph IBP works on to the tree decomposition; axis labels: more accuracy, less complexity]
SP245 Message propagation [Figure: join-graph with clusters ABCDE, FGI, BCE, GHIJ, CDEF, FGH] Cluster 1 (ABCDE) contains p(a), p(c), p(b|a,c), p(d|a,b,e), p(e|b,c) and the message $h_{(3,1)}(b,c)$, and sends $h_{(1,2)}$ to cluster 2 (CDEF). Minimal arc-labeled: sep(1,2)={D,E}, elim(1,2)={A,B,C}. Non-minimal arc-labeled: sep(1,2)={C,D,E}, elim(1,2)={A,B}
SP246 Bounded decompositions We want arc-labeled decompositions such that: the cluster size (internal width) is bounded by i (the accuracy parameter) the width of the decomposition as a graph (external width) is as small as possible Possible approaches to build decompositions: partition-based algorithms - inspired by the mini-bucket decomposition grouping-based algorithms
SP247 Partition-based algorithms [Figure: a) schematic mini-bucket(i), i=3; b) arc-labeled join-graph decomposition] Mini-buckets per variable: G: (GFE); E: (EBF) (EF); F: (FCD) (BF); D: (DB) (CD); C: (CAB) (CB); B: (BA) (AB) (B); A: (A). Functions: P(A), P(B|A), P(C|A,B), P(D|B), P(E|B,F), P(F|C,D), P(G|F,E)
SP248 IJGP properties IJGP(i) applies BP to a minimal arc-labeled join-graph whose cluster size is bounded by i. On join-trees IJGP finds exact beliefs. IJGP is a Generalized Belief Propagation algorithm (Yedidia, Freeman, Weiss 2001). Complexity of one iteration: time: $O(deg \cdot (n+N) \cdot d^{i+1})$ space: O(Nd )
SP249 Empirical evaluation Algorithms: Exact, IBP, MC, IJGP Measures: Absolute error, Relative error, Kullback-Leibler (KL) distance, Bit Error Rate, Time Networks (all variables are binary): Random networks, Grid networks (MxM), CPCS 54, 360, 422, Coding networks
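The slides do not spell out how the accuracy measures are computed. The sketch below uses common textbook definitions for a single variable's belief, which is an assumption on my part; the averaging over variables and problem instances used in the experiments is not reproduced here, and the function names are illustrative.

```python
import math

def absolute_error(exact, approx):
    """Mean absolute difference between exact and approximate beliefs."""
    return sum(abs(p - q) for p, q in zip(exact, approx)) / len(exact)

def relative_error(exact, approx):
    """Mean absolute difference divided by the exact belief (exact > 0)."""
    return sum(abs(p - q) / p for p, q in zip(exact, approx)) / len(exact)

def kl_distance(exact, approx):
    """Kullback-Leibler distance sum_x p(x) log(p(x) / q(x))."""
    return sum(p * math.log(p / q) for p, q in zip(exact, approx) if p > 0)

def bit_error_rate(sent, decoded):
    """Fraction of bit positions where the decoded word differs."""
    return sum(s != d for s, d in zip(sent, decoded)) / len(sent)

exact_belief = [0.5, 0.5]
approx_belief = [0.4, 0.6]
print(absolute_error(exact_belief, approx_belief))
print(relative_error(exact_belief, approx_belief))
print(kl_distance(exact_belief, approx_belief))
print(bit_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```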
SP250 Random networks - KL at convergence evidence=0 evidence=5
SP251 Random networks - KL vs. iterations [Plots: evidence=0, evidence=5]
SP252 Random networks - Time
SP253 Coding networks - BER [Plots: sigma=.22, sigma=.32, sigma=.51, sigma=.65]
SP254 Coding networks - Time
SP255 IJGP summary IJGP borrows the iterative feature from IBP and the anytime virtues of bounded inference from MC Empirical evaluation showed the potential of IJGP, which improves with iteration and most of the time with i-bound, and scales up to large networks IJGP is almost always superior, often by a high margin, to IBP and MC Based on all our experiments, we think that IJGP provides a practical breakthrough to the task of belief updating
SP256 Random networks N=80, 100 instances, w*=15
SP257 Random networks N=80, 100 instances, w*=15
SP258 CPCS 54, CPCS360 CPCS360: 5 instances, w*=20 CPCS54: 100 instances, w*=15
SP259 Graph coloring problems [Figure: variables $X_1, \dots, X_n$ and nodes $H_1, \dots, H_k$; each $H_k$ has parents $X_i$, $X_j$ and CPT $P(H_k | X_i, X_j)$]
SP260 Graph coloring problems
SP261 Inference power of IBP - summary IBP’s inference of zero beliefs converges in a finite number of iterations and is sound. The results extend to generalized belief propagation algorithms, in particular to IJGP. We identified classes of networks for which IBP: can infer zeros, and therefore is likely to be good; cannot infer zeros, although there are many of them (graph coloring), and therefore is bad. Based on the analysis it is easy to synthesize belief networks that are hard for IBP. The success of IBP for coding networks can be explained by: many extreme beliefs; an easy-for-arc-consistency flat network
SP262 “Road map” CSPs: complete algorithms CSPs: approximations Belief nets: complete algorithms Belief nets: approximations Local inference: mini-buckets Stochastic simulations Variational techniques MDPs
SP263 Stochastic Simulation Forward sampling (logic sampling) Likelihood weighting Markov Chain Monte Carlo (MCMC): Gibbs sampling
SP264 Approximation via Sampling
SP265 Forward Sampling (logic sampling (Henrion, 1988))
SP266 Forward sampling (example) Drawback: high rejection rate!
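A minimal sketch of forward (logic) sampling with rejection, on a hypothetical two-node network A -> B with evidence B=1. The CPT values, function names and sample count are assumptions chosen for illustration; the acceptance rate returned alongside the estimate makes the drawback visible.

```python
import random

P_A = 0.3                        # hypothetical P(A=1)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}   # hypothetical P(B=1 | A)

def forward_sample(rng):
    """Sample every variable in topological order from its CPT."""
    a = 1 if rng.random() < P_A else 0
    b = 1 if rng.random() < P_B_GIVEN_A[a] else 0
    return a, b

def rejection_estimate(n, seed=0):
    """Estimate P(A=1 | B=1); samples that contradict the evidence are discarded."""
    rng = random.Random(seed)
    kept = hits = 0
    for _ in range(n):
        a, b = forward_sample(rng)
        if b != 1:
            continue  # rejected: inconsistent with evidence B=1
        kept += 1
        hits += a
    return hits / kept, kept / n

estimate, acceptance = rejection_estimate(20000)
exact = P_A * 0.9 / (P_A * 0.9 + (1 - P_A) * 0.2)
print(estimate, exact, acceptance)
```

Here P(B=1) is about 0.41, so well over half of the samples are thrown away; with less likely evidence the waste grows accordingly.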
SP267 Likelihood Weighting (Fung and Chang, 1990; Shachter and Peot, 1990) Works well for likely evidence! “Clamping” evidence + forward sampling + weighting samples by evidence likelihood
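Likelihood weighting on a hypothetical two-node network A -> B with evidence B=1 can be sketched as follows. The evidence variable is clamped rather than sampled, and each sample is weighted by the likelihood of the evidence given its sampled parents. The CPT values and names are assumptions for illustration.

```python
import random

P_A = 0.3                        # hypothetical P(A=1)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}   # hypothetical P(B=1 | A)

def likelihood_weighting(n, seed=0):
    """Estimate P(A=1 | B=1) without rejecting any sample."""
    rng = random.Random(seed)
    weighted_hits = total_weight = 0.0
    for _ in range(n):
        a = 1 if rng.random() < P_A else 0  # sample the non-evidence variable
        w = P_B_GIVEN_A[a]                  # weight = P(B=1 | a), B is clamped
        weighted_hits += w * a
        total_weight += w
    return weighted_hits / total_weight

estimate = likelihood_weighting(20000)
exact = P_A * 0.9 / (P_A * 0.9 + (1 - P_A) * 0.2)
print(estimate, exact)
```

Every sample contributes, but when the evidence is unlikely most weights are tiny and a few samples dominate the estimate.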
SP268 Gibbs Sampling (Geman and Geman, 1984) Markov Chain Monte Carlo (MCMC): create a Markov chain of samples Advantage: guaranteed to converge to P(X) Disadvantage: convergence may be slow
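A minimal Gibbs sampler on a hypothetical chain A -> B -> C with evidence C=1. Each non-evidence variable is resampled from its distribution given its Markov blanket. The CPT values, burn-in length and sample count are assumptions chosen for illustration.

```python
import random

P_A = 0.3                        # hypothetical P(A=1)
P_B_GIVEN_A = {1: 0.9, 0: 0.2}   # hypothetical P(B=1 | A)
P_C_GIVEN_B = {1: 0.8, 0: 0.1}   # hypothetical P(C=1 | B)

def bern(p, x):
    """Probability of x under a Bernoulli with P(x=1) = p."""
    return p if x == 1 else 1.0 - p

def gibbs(n, burn_in=1000, seed=0):
    """Estimate P(A=1 | C=1) from a Markov chain over (A, B)."""
    rng = random.Random(seed)
    a, b = 0, 0  # arbitrary initial state
    hits = 0
    for t in range(n + burn_in):
        # Resample A given its Markov blanket {B}.
        w1 = P_A * bern(P_B_GIVEN_A[1], b)
        w0 = (1 - P_A) * bern(P_B_GIVEN_A[0], b)
        a = 1 if rng.random() < w1 / (w1 + w0) else 0
        # Resample B given its Markov blanket {A, C}, with C clamped to 1.
        w1 = P_B_GIVEN_A[a] * P_C_GIVEN_B[1]
        w0 = (1 - P_B_GIVEN_A[a]) * P_C_GIVEN_B[0]
        b = 1 if rng.random() < w1 / (w1 + w0) else 0
        if t >= burn_in:
            hits += a
    return hits / n

def joint(a, b):
    """P(A=a, B=b, C=1), used only to compute the exact answer."""
    return bern(P_A, a) * bern(P_B_GIVEN_A[a], b) * P_C_GIVEN_B[b]

exact = (joint(1, 0) + joint(1, 1)) / sum(joint(a, b) for a in (0, 1) for b in (0, 1))
estimate = gibbs(20000)
print(estimate, exact)
```

This chain mixes quickly because all conditional probabilities are well away from 0 and 1; near-deterministic CPTs are the typical cause of the slow convergence mentioned above.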
SP269 Gibbs Sampling (cont’d) (Pearl, 1988) Markov blanket :