
1 Bozhena Bidyuk Vibhav Gogate Rina Dechter
Sampling Techniques for Probabilistic and Deterministic Graphical Models. Bozhena Bidyuk, Vibhav Gogate, Rina Dechter. Graphical models are the most popular knowledge-representation schemes in AI and computer science for representing and reasoning with probabilistic and deterministic relationships. Exact reasoning is hard when the graph is dense (the treewidth is high), so approximation schemes must be used at some point. Approximation with guarantees is also hard, but less difficult when the information is purely probabilistic, with no strict constraints. When some of the knowledge involves constraints, approximation becomes hard as well; in particular, sampling schemes become hard. In this talk we address the issue of sampling in the presence of hard constraints. We present SampleSearch, which interleaves sampling and search, and analyze the implications of this idea for importance sampling. We identify the rejection problem, show how it can be handled, and discuss when doing so is cost-effective; an empirical evaluation demonstrates how cost-effective the scheme is. We then move to the orthogonal approach of exploiting problem structure in sampling and present our strategy.

2 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Rao-Blackwellisation; AND/OR importance sampling.
Here is a brief outline of my talk. I'll start by describing some preliminaries and motivating applications. Then I'll describe the main problem investigated here, the poor performance of sampling-based algorithms in the presence of determinism, and then the four main contributions.

3 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Cutset-based Variance Reduction; AND/OR importance sampling.

4 Probabilistic Reasoning; Graphical models
Bayesian networks, constraint networks, mixed networks; queries; exact algorithms using inference, search, and hybrids; graph parameters: treewidth, cycle-cutset, w-cutset.

5 Bayesian Networks (Pearl, 1988)
Belief networks provide a formalism for reasoning under uncertainty. A belief network is defined by a directed acyclic graph over nodes representing the variables of interest. The arcs signify the existence of direct influences between linked variables, and the strengths of those influences are quantified by conditional probabilities. Example: a network over Smoking, lung Cancer, Bronchitis, X-ray, and Dyspnoea with CPTs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B), so that P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B). The primary queries over belief or Markov networks are finding the probability of evidence, finding posterior probabilities, and finding the most likely scenario given the evidence (MPE). We can also consider the MAP query, which is like the MPE query but focuses on a subset of the variables. Bayesian networks are popular in the medical domain: imagine we want to express the relationship between smoking, lung cancer, bronchitis, X-rays, and dyspnoea; a directed acyclic graph captures the direct causal influences and quantifies them with probabilistic relationships. Once such a network is specified (the representation question), we can use algorithms to answer queries such as belief updating, P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?, and probability of evidence, P(smoking = no, dyspnoea = yes) = ?

6 Queries
Probability of evidence (or partition function); posterior marginals (beliefs); Most Probable Explanation (MPE).
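For reference, a standard formulation of these queries for a Bayesian network over X = {X1, ..., Xn} with evidence E = e (our notation, not necessarily the slide's):

```latex
P(e) = \sum_{x \,:\, x_E = e} \ \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}_i)
\qquad
P(x_i \mid e) = \frac{P(x_i, e)}{P(e)}
\qquad
\mathrm{MPE}: \ x^{*} = \arg\max_{x \,:\, x_E = e} \ \prod_{i=1}^{n} P(x_i \mid \mathrm{pa}_i)
```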

7 Constraint Networks
Map coloring. Variables: countries (A, B, C, etc.). Values: colors (red, green, blue). Constraints: adjacent countries must take different colors; e.g., the constraint between A and B is the relation listing all pairs of distinct colors. The constraint graph connects the variables A through G. Tasks: find a solution; count solutions; find a good one.

8 Propositional Satisfiability
φ = {(¬C), (A ∨ B ∨ C), (¬A ∨ B ∨ E), (¬B ∨ C ∨ D)}.

9 Mixed Networks: Mixing Belief and Constraints
Belief (Bayesian) networks and constraint networks. The focus is on graphical models having determinism, and mixed networks are a unifying framework for such models. A graphical model captures knowledge using a collection of functions whose interaction can be captured by a graph representation. Bayesian networks allow reasoning under conditions of uncertainty: given a directed acyclic graph over random variables, the joint distribution is expressed as a product of conditional probability tables (CPTs), one per variable given its parents in the network. Constraint networks are deterministic graphical models consisting of a set of variables, a set of values for each variable, and constraints that restrict the assignments that can be made to the variables; a solution is an assignment to all variables such that all constraints are satisfied. Mixed networks augment a probabilistic graphical model having determinism with a constraint network. The constraint network is used to model zero probabilities as constraints; for example, the zero probabilities in the Bayesian network can be modeled as constraints, or constraints may be specified externally. The central aim is to exploit the power of constraint-processing techniques for efficient probabilistic inference. The same queries apply (e.g., weighted counts).

10 Belief Updating
P(a | e=0) ∝ P(a, e=0) = Σ_{b,c,d} P(a) P(b|a) P(c|a) P(d|b,a) P(e=0|b,c) = P(a) Σ_c P(c|a) Σ_b P(b|a) P(e=0|b,c) Σ_d P(d|b,a). Computed over the "moral" graph of A, B, C, D, E by Variable Elimination.

11 Bucket Elimination Algorithm elim-bel (Dechter 1996)
Elimination operator. The buckets along the ordering (processed from B down to A): bucket B: P(b|a), P(d|b,a), P(e|b,c); bucket C: P(c|a); bucket D: (empty); bucket E: e=0; bucket A: P(a). Messages are passed from bucket to bucket, and the result is P(a|e=0). Induced width w* = 4; time and space are exp(w*).

12 Bucket Elimination Query: Elimination Order: d,e,b,c
Original functions and messages are organized in a bucket tree. BE is a unifying algorithmic framework for probabilistic inference that organizes computations using 'buckets'. Say we want to compute P(a|e=0) given the elimination order d,e,b,c. Using BE, we proceed as follows: 1) partition/assign the original functions/CPTs into buckets using the specified elimination order; 2) process from top to bottom, eliminating the variable in each bucket from subsequent computations. Each bucket contains a set of functions, either original functions/CPTs or functions generated by the algorithm. BE is also a special case of tree elimination in which the tree structure upon which messages are passed, the bucket tree, is determined by the variable elimination order; the nodes of the tree are referred to as buckets. BE processes the bucket tree from leaves to root, at each bucket performing two steps: 1) combination (multiplication in the case of BNs) and 2) elimination (summation in the case of P(e)/belief updating in BNs). Time and space are exp(w*).

13 Complexity of Elimination
The effect of the ordering: eliminating the variables of the same "moral" graph over A, B, C, D, E along different orderings induces different graphs, and hence different induced widths and complexities.

14 Cutset-Conditioning
Cycle cutset = {A,B,C}. Conditioning on (instantiating) the cutset variables one by one breaks all cycles; once A, B, and C are assigned, the remaining network is a tree.

15 Search Over the Cutset Space: exp(w), where w is a user-controlled parameter
Time: exp(w + c(w)). Example: a graph-coloring problem. Inference may require too much memory, so condition on some of the variables (e.g., enumerate A=yellow, A=green, then B=red, B=blue, ...). If we continue recursively all the way, we obtain a search space that can be traversed in any manner; or we can stop and solve the remaining problem by inference once it is not far from a tree, yielding the methods known as cycle-cutset and w-cutset conditioning.

16 Linkage Analysis
There are many applications of Bayesian networks in bioinformatics. One recent and quite successful application of graphical models is linkage analysis: given a family tree (pedigree), the phenotype of the individuals in the tree at the studied trait (affected/unaffected/unknown), and partial, unordered genotype information at some marker loci, compute the most probable location of the disease gene. This is done by placing the disease gene at an assumed location, represented by the distance, theta, from a known locus, and computing the probability of the data given that location. The example pedigree has 6 individuals, with haplotype information for individuals {2, 3}, genotype information for {6}, and the rest unknown.

17 Linkage Analysis: 6 People, 3 Markers
The network models the locus variables (L), genotype/phenotype variables (X), and selector variables (S) of 6 individuals over 3 markers. Here is an example of using Bayesian networks for linkage analysis: it models the genetic inheritance in a family of 6 individuals relative to some genes of interest, and the task is to find a disease gene on a chromosome. This domain yields very hard probabilistic networks that contain both probabilistic information and deterministic relationships, and it drives many of the methods we currently develop.

18 Applications: Determinism is more ubiquitous than you may think!
Transportation planning (Liao et al. 2004, Gogate et al. 2005): predicting and inferring the car-travel activity of individuals.
Genetic linkage analysis (Fishelson and Geiger, 2002): associate the functionality of genes with their location on chromosomes.
Functional/software verification (Bergeron, 2000): generating random test programs to check the validity of hardware.
First-order probabilistic models (Domingos et al. 2006, Milch et al. 2005): citation matching.
Graphical models that have determinism occur quite frequently in real-world applications. One example is transportation planning, in which, given GPS data about a person, the problem is to infer and predict the travel-activity routines of individuals. Another application is in computational biology, specifically genetic linkage analysis, in which we are given a chromosome and some marker locations on it, and the task is to figure out whether any DNA near the markers affects a particular disease. A third application is functional or software verification, in which we are given a circuit or a piece of code and the task is to generate test programs that check whether the circuit or software conforms to its specification. A final application is first-order probabilistic models, which have a lot of determinism and are used for things like citation matching or figuring out author relationships.

19 Inference vs Conditioning-Search
Inference: exp(w*) time and space, using a tree decomposition with clusters such as ABC, BDEF, DGF, EFH, FHK, HJ, KLM. Search: exp(n) time and O(n) space, enumerating assignments (A=yellow, A=green, B=blue, B=red, ...). Search + inference hybrids: space exp(w), time exp(w + c(w)).

20 Approximation
Since inference, search, and hybrids are too expensive when the graph is dense (high treewidth):
Bounding inference: mini-bucket and mini-clustering; belief propagation.
Bounding search: sampling.
Goal: an anytime scheme.

21 Approximation
Bounding search: sampling. Since inference, search, and hybrids are too expensive when the graph is dense (high treewidth), we either bound inference (mini-bucket, mini-clustering, belief propagation) or bound search (sampling). Goal: an anytime scheme.

22 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Rao-Blackwellisation; AND/OR importance sampling.

23 Outline
Definitions and background on statistics; theory of importance sampling; likelihood weighting; state-of-the-art importance sampling techniques.

24 A sample. Given a set of variables X = {X1, ..., Xn}, a sample, denoted S^t, is an instantiation of all the variables: S^t = (x1^t, ..., xn^t). Note that the subscripts index the variables while the superscripts index the samples.

25 How to draw a sample ? Univariate distribution
Example: given a random variable X with domain {0, 1} and a distribution P(X) = (0.3, 0.7), the task is to generate samples of X from P. How? Draw a random number r ∈ [0, 1]; if r < 0.3 then set X=0, else set X=1. Naturally, a fundamental question is how to draw a sample given a distribution. I'll first describe how to sample from a univariate distribution: let X be a binary random variable with domain {0,1} and P a distribution over X. To sample a value for X, we first draw a random real number from the uniform distribution over [0,1]; if its value is less than 0.3, we set X to zero, otherwise we set it to 1.
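A minimal Python sketch of the procedure just described (inverse-CDF sampling for a discrete distribution); the function name and interface are ours:

```python
import random

def sample_univariate(values, probs):
    """Draw one value from a discrete distribution by inverting the CDF."""
    r = random.random()              # uniform draw in [0, 1)
    cumulative = 0.0
    for v, p in zip(values, probs):
        cumulative += p
        if r < cumulative:
            return v
    return values[-1]                # guard against floating-point round-off

# The slide's example: P(X=0) = 0.3, P(X=1) = 0.7
samples = [sample_univariate([0, 1], [0.3, 0.7]) for _ in range(10)]
```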

26 How to draw a sample? Multi-variate distribution
Let X = {X1, ..., Xn} be a set of variables. Express the distribution in product form, then sample the variables one by one from left to right, along the ordering dictated by the product form. In the Bayesian network literature this is called logic sampling.

27 Logic sampling (example)
Here is an illustration of logic sampling on a simple Bayesian network with four variables.
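The network pictured on this slide is not recoverable from the transcript; the sketch below illustrates logic (ancestral) sampling on a hypothetical two-variable network A → B with made-up CPTs:

```python
import random

def sample_discrete(probs):
    """Return an index drawn according to the probability vector probs."""
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

# Hypothetical CPTs for A -> B (both binary); NOT the network on the slide.
P_A = [0.6, 0.4]                       # P(A)
P_B_given_A = [[0.9, 0.1],             # P(B | A=0)
               [0.2, 0.8]]             # P(B | A=1)

def logic_sample():
    """Sample each variable in topological order from its CPT given its sampled parents."""
    a = sample_discrete(P_A)
    b = sample_discrete(P_B_given_A[a])
    return a, b
```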

28 Expected value and Variance
Expected value: given a probability distribution P(X) and a function g(X) defined over a set of variables X = {X1, X2, ..., Xn}, the expected value of g w.r.t. P is the sum, over all configurations x, of g(x) weighted by P(x). Variance: the variance of g w.r.t. P is the expected squared deviation from this mean. We will call g(x) the g-value and P(x) the p-value of a configuration.
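Written out in the standard way (our notation):

```latex
E_P[g] = \sum_{x} g(x)\, P(x),
\qquad
\mathrm{Var}_P[g] = \sum_{x} \big(g(x) - E_P[g]\big)^{2} P(x) = E_P[g^{2}] - \big(E_P[g]\big)^{2}
```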

29 Monte Carlo Estimate Estimator:
An estimator is a function of the samples. It produces an estimate of the unknown parameter of the sampling distribution.
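For reference, the standard Monte Carlo estimator referred to here is

```latex
\hat{g}_N = \frac{1}{N} \sum_{t=1}^{N} g(S^{t}), \qquad S^{t} \sim P
```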

30 Example: Monte Carlo estimate
Given: a distribution P(X) = (0.3, 0.7), and g(X) = 40 if X equals 0, 50 if X equals 1. The exact expected value is E_P[g(X)] = 40·0.3 + 50·0.7 = 47. Generate k samples from P, e.g.: 0,1,1,1,0,1,1,0,1,0.
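A small Python sketch reproducing this example with the ten samples listed on the slide:

```python
# Samples listed on the slide and the g-values it defines.
samples = [0, 1, 1, 1, 0, 1, 1, 0, 1, 0]
g = {0: 40, 1: 50}

estimate = sum(g[s] for s in samples) / len(samples)
print(estimate)   # 46.0 for these ten samples; the exact value is 47
```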

31 Outline
Definitions and background on statistics; theory of importance sampling; likelihood weighting; state-of-the-art importance sampling techniques.

32 Importance sampling: Main idea
Transform the probabilistic inference problem into the problem of computing the expected value of a random variable w.r.t. a distribution Q. Generate random samples from Q. Estimate the expected value from the generated samples.

33 Importance sampling for P(e)
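For reference, the standard importance sampling identity and estimator for P(e), writing Z = X \ E (our notation):

```latex
P(e) = \sum_{z} P(z, e)
     = \sum_{z} \frac{P(z, e)}{Q(z)}\, Q(z)
     = E_{Q}\!\left[\frac{P(Z, e)}{Q(Z)}\right],
\qquad
\widehat{P}(e) = \frac{1}{N} \sum_{k=1}^{N} w(z^{k}),
\quad
w(z^{k}) = \frac{P(z^{k}, e)}{Q(z^{k})},\ \ z^{k} \sim Q
```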

34 Properties of IS estimate of P(e)
Convergence: by the law of large numbers. Unbiased. Variance: the variance of the importance weights divided by N.

35 Properties of IS estimate of P(e)
Mean squared error of the estimator: MSE = variance + (bias)^2. The important point is that, because the estimator of P(e) is unbiased, the bias term is zero, so the mean squared error of the Monte Carlo estimate equals its variance and goes down as the number of samples increases.

36 Estimating P(Xi|e)
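For reference, the usual ratio (normalized) importance sampling estimator for this query (our notation):

```latex
\widehat{P}(x_i \mid e)
  = \frac{\sum_{k=1}^{N} w(z^{k})\, \delta_{x_i}(z^{k})}
         {\sum_{k=1}^{N} w(z^{k})},
\qquad
\delta_{x_i}(z^{k}) =
  \begin{cases} 1 & \text{if } z^{k} \text{ assigns } X_i = x_i \\ 0 & \text{otherwise} \end{cases}
```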

37 Properties of the IS estimator for P(Xi|e)
Convergence: by the weak law of large numbers. Asymptotically unbiased. Variance: harder to analyze; Liu suggests a measure called the "effective sample size".

38 Effective Sample size Ideal estimator
The ideal estimator samples directly from the posterior; the effective sample size measures how much the actual estimator deviates from the ideal one.

39 Outline
Definitions and background on statistics; theory of importance sampling; likelihood weighting; state-of-the-art importance sampling techniques.

40 Likelihood Weighting: Proposal Distribution
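For reference, the standard likelihood-weighting proposal and the resulting sample weight (our notation):

```latex
Q_{\mathrm{LW}}(x) = \prod_{X_i \notin E} P\big(x_i \mid \mathrm{pa}_i\big)\Big|_{E = e},
\qquad
w(x) = \frac{P(x, e)}{Q_{\mathrm{LW}}(x)} = \prod_{E_j \in E} P\big(e_j \mid \mathrm{pa}_j\big)
```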

41 Likelihood Weighting: Sampling
Sample in topological order over X. Clamp the evidence variables to their observed values; sample each non-evidence variable xi ~ P(Xi | pai), which is a simple look-up in the CPT!
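A minimal sketch of likelihood weighting on the same hypothetical A → B network used in the logic-sampling sketch above, with evidence on B (illustrative only, not the slide's network):

```python
import random

P_A = [0.6, 0.4]                            # hypothetical P(A)
P_B_given_A = [[0.9, 0.1], [0.2, 0.8]]      # hypothetical P(B | A)

def sample_discrete(probs):
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

def lw_sample(b_evidence):
    """One likelihood-weighted sample with evidence B = b_evidence."""
    a = sample_discrete(P_A)                # non-evidence variable: sample from its CPT
    weight = P_B_given_A[a][b_evidence]     # evidence variable: clamp it, multiply its CPT entry
    return a, weight

# The probability of evidence P(B=1) is estimated by the average weight.
N = 10_000
p_b1_hat = sum(lw_sample(1)[1] for _ in range(N)) / N
```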

42 Outline
Definitions and background on statistics; theory of importance sampling; likelihood weighting; state-of-the-art importance sampling techniques.

43 Proposal selection One should try to select a proposal that is as close as possible to the posterior distribution.

44 Proposal Distributions used in Literature
AIS-BN (adaptive proposal): Cheng and Druzdzel, 2000. Iterative Belief Propagation: Changhe and Druzdzel, 2003. Iterative Join Graph Propagation (IJGP) and variable ordering: Gogate and Dechter, 2005.

45 Perfect sampling using Bucket Elimination
Algorithm: Run Bucket elimination on the problem along an ordering o=(XN,..,X1). Sample along the reverse ordering: (X1,..,XN) At each variable Xi, recover the probability P(Xi|x1,...,xi-1) by referring to the bucket.

46 Bucket elimination (BE) Algorithm elim-bel (Dechter 1996)
Elimination operator. bucket B: P(B|A), P(D|B,A), P(e|B,C); bucket C: P(C|A); bucket D: (empty); bucket E; bucket A: P(A). Processing the buckets from B down to A along the ordering yields P(e).

47 Sampling from the output of BE (Dechter 2002)
bucket B: P(B|A), P(D|B,A), P(e|B,C); bucket C: P(C|A); bucket D; bucket E; bucket A: P(A). After BE has been run, sample the variables in the reverse order, constructing the distribution of each bucket's variable from the functions in that bucket.

48 Mini-buckets: “local inference”
Computation in a bucket is time and space exponential in the number of variables involved. Therefore, partition the functions in a bucket into "mini-buckets" over smaller numbers of variables. The size of each mini-bucket can be controlled, yielding polynomial complexity.

49 Mini-Bucket Elimination
Space and time constraints: the maximum scope size of any newly generated function is bounded by 2, so BE, which would generate a function of scope size 3, cannot be used. Bucket B is therefore split into mini-buckets that are summed over B separately, producing hB(C,e) (placed in bucket C) and hB(A,D) (placed in bucket D). Continuing, bucket C produces hC(A,e), bucket D produces hD(A), bucket E produces hE(A), and bucket A combines P(A), hE(A), hD(A) into an approximation of P(e).

50 Sampling from the output of MBE
bucket B: P(e|B,C), P(B|A), P(D|B,A); bucket C: P(C|A), hB(C,e); bucket D: hB(A,D); bucket E: hC(A,e); bucket A: P(A), hE(A), hD(A). Sampling is the same as in BE-sampling except that now we construct Q from a randomly selected "mini-bucket".

51 IJGP-Sampling (Gogate and Dechter, 2005)
Iterative Join Graph Propagation (IJGP) is a generalized belief propagation scheme (Yedidia et al., 2002). IJGP yields better approximations of P(X|E) than MBE (Dechter, Kask and Mateescu, 2002). The output of IJGP has the same form as the mini-bucket "clusters". Currently the best-performing IS scheme!

52 Adaptive Importance Sampling

53 Adaptive Importance Sampling
General case: given k proposal distributions, take N samples from each distribution and use them to approximate P(e).

54 Estimating Q'(z)

55 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Rao-Blackwellisation; AND/OR importance sampling.

56 Markov Chain. A Markov chain x1, x2, x3, x4, ... is a discrete random process with the property that the next state depends only on the current state (the Markov property). If P(Xt | xt-1) does not depend on t (time-homogeneous) and the state space is finite, it is often expressed as a transition function (aka transition matrix).

57 Example: Drunkard’s Walk
A random walk on the number line where, at each step, the position may change by +1 or −1 with equal probability; the transition matrix P assigns probability 1/2 to each of the two neighboring states.

58 Example: Weather Model
A two-state chain over {rain, sun}; a sample trajectory is rain, rain, rain, sun, rain. The dynamics are specified by a transition matrix P.

59 Multi-Variable System
A state is an assignment of values to all the variables: the state at time t is (x1^t, x2^t, x3^t), and it transitions to (x1^{t+1}, x2^{t+1}, x3^{t+1}).

60 Bayesian Network System
A Bayesian network is a representation of the joint probability distribution over two or more variables; here the variables X1, X2, X3 have states (x1^t, x2^t, x3^t) at time t and (x1^{t+1}, x2^{t+1}, x3^{t+1}) after one transition.

61 Stationary Distribution Existence
If the Markov chain is time-homogeneous, then the vector π(X) is a stationary distribution (aka invariant or equilibrium distribution, aka "fixed point") if its entries sum to 1 and satisfy π = πP. A finite-state Markov chain has a unique stationary distribution if and only if: the chain is irreducible, and all of its states are positive recurrent.
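In symbols (standard statement, our notation):

```latex
\sum_{x} \pi(x) = 1,
\qquad
\pi(x') = \sum_{x} \pi(x)\, P(x' \mid x) \ \ \text{for all } x'
\qquad \text{(in matrix form, } \pi = \pi P\text{)}
```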

62 Irreducible. A state x is irreducible if, under the transition rule, one has nonzero probability of moving from x to any other state and then coming back in a finite number of steps. If one state is irreducible, then all the states must be irreducible (Liu, Ch. 12, p. 249).

63 Recurrent. A state x is recurrent if the chain returns to x with probability 1. Let M(x) be the expected number of steps to return to state x; state x is positive recurrent if M(x) is finite. The recurrent states in a finite-state chain are positive recurrent.

64 Stationary Distribution Convergence
Consider an infinite Markov chain. If the chain is both irreducible and aperiodic, then it converges to its stationary distribution and the initial state is not important in the limit. "The most useful feature of a 'good' Markov chain is its fast forgetfulness of its past..." (Liu, Ch. 12.1).

65 Aperiodic. Define d(i) = g.c.d.{n > 0 | it is possible to go from i to i in n steps}, where g.c.d. means the greatest common divisor of the integers in the set. If d(i) = 1 for every i, then the chain is aperiodic. Positive recurrent, aperiodic states are ergodic.

66 Markov Chain Monte Carlo
How do we estimate P(X), e.g., P(X|e)? Generate samples that form a Markov chain with stationary distribution π = P(X|e), then estimate π from the samples (observed states): the visited states x0, ..., xn can be viewed as "samples" from the distribution π.

67 MCMC Summary
Convergence is guaranteed in the limit. The initial state is not important, but typically we throw away the first K samples ("burn-in"). Samples are dependent, not i.i.d. Convergence (the mixing rate) may be slow: the stronger the correlation between states, the slower the convergence!

68 Gibbs Sampling (Geman&Geman,1984)
A Gibbs sampler is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables. Sample a new value for one variable at a time from that variable's conditional distribution. The samples form a Markov chain with stationary distribution P(X|e).

69 Gibbs Sampling: Illustration
The process of Gibbs sampling can be understood as a random walk in the space of all instantiations of X=x (remember drunkard’s walk): In one step we can reach instantiations that differ from current one by value assignment to at most one variable (assume randomized choice of variables Xi).

70 Ordered Gibbs Sampler Generate sample xt+1 from xt :
In short, for i = 1 to N: sample x_i^{t+1} ~ P(Xi | x1^{t+1}, ..., x_{i-1}^{t+1}, x_{i+1}^t, ..., xN^t, e), processing all variables in some order.

71 Transition Probabilities in BN
Given its Markov blanket (parents, children, and children's parents), Xi is independent of all other nodes, so only the Markov blanket of Xi is needed; the computation is linear in the size of the Markov blanket!
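The conditional used in the Gibbs update, written out in the standard form (our notation):

```latex
P\big(x_i \mid x_{-i}\big) = P\big(x_i \mid \mathrm{markov}_i\big)
\;\propto\; P\big(x_i \mid \mathrm{pa}_i\big) \prod_{X_j \in \mathrm{ch}(X_i)} P\big(x_j \mid \mathrm{pa}_j\big)
```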

72 Ordered Gibbs Sampling Algorithm (Pearl,1988)
Input: X, E=e. Output: T samples {x^t}.
Fix evidence E=e; initialize x^0 at random.
For t = 1 to T (compute samples):
  For i = 1 to N (loop through variables):
    x_i^{t+1} ~ P(Xi | markov_i^t)
  End For
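A schematic Python sketch of this loop; conditional(i, x) is a hypothetical helper that returns P(Xi | markov_i) as a probability vector (in a Bayesian network, the normalized product of the CPTs that mention Xi):

```python
import random

def sample_discrete(probs):
    r, cum = random.random(), 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

def ordered_gibbs(x0, conditional, num_vars, T):
    """Ordered Gibbs sampler: resample each variable in turn from its conditional."""
    x, samples = list(x0), []            # x0: initial assignment with evidence clamped
    for _ in range(T):
        for i in range(num_vars):        # loop through the (non-evidence) variables
            x[i] = sample_discrete(conditional(i, x))
        samples.append(list(x))
    return samples
```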

73 Gibbs Sampling Example - BN
The current state assigns a value to every variable of the network X1, ..., X9, e.g., X1 = x1^0, X2 = x2^0, and so on.

74 Gibbs Sampling Example - BN

75 Answering Queries P(xi |e) = ?
Method 1: count the fraction of samples in which Xi = xi (the histogram estimator, using a Dirac delta function). Method 2: average the conditional probability of xi over the samples (the mixture estimator). The mixture estimator converges faster (consider the estimates for values of Xi never observed in the samples; this is proved via the Rao-Blackwell theorem).
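The two estimators, written out in the standard way (our notation):

```latex
\widehat{P}_{\mathrm{hist}}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} \delta\big(x_i^{t} = x_i\big),
\qquad
\widehat{P}_{\mathrm{mix}}(x_i \mid e) = \frac{1}{T} \sum_{t=1}^{T} P\big(x_i \mid x_{-i}^{t}, e\big)
```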

76 Rao-Blackwell Theorem
Rao-Blackwell Theorem: let the set of random variables X be composed of two groups of variables, R and L. Then, for the joint distribution π(R, L) and a function of interest g (e.g., the mean or covariance), the following result applies (Casella & Robert, 1996; Liu et al., 1995). The theorem makes a weak promise, but works well in practice; the improvement depends on the choice of R and L.
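The variance decomposition behind the theorem, in the form usually quoted (our notation):

```latex
\mathrm{Var}\big[g(R, L)\big]
  = \mathrm{Var}\big[\,E[g(R, L) \mid R]\,\big] + E\big[\,\mathrm{Var}[g(R, L) \mid R]\,\big]
  \;\ge\; \mathrm{Var}\big[\,E[g(R, L) \mid R]\,\big]
```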

77 Importance vs. Gibbs. Gibbs estimates average over the samples drawn from the chain; importance sampling estimates weight each sample drawn from Q by its importance weight w^t.

78 Gibbs Sampling: Convergence
Sample from `P(X|e)P(X|e) Converges iff chain is irreducible and ergodic Intuition - must be able to explore all states: if Xi and Xj are strongly correlated, Xi=0 Xj=0, then, we cannot explore states with Xi=1 and Xj=1 All conditions are satisfied when all probabilities are positive Convergence rate can be characterized by the second eigen-value of transition matrix

79 Gibbs: Speeding Convergence
Reduce dependence between samples (autocorrelation): skip samples; randomize the variable sampling order; employ blocking (grouping); use multiple chains. Reduce variance (covered in the next section).

80 Blocking Gibbs Sampler
Sample several variables together, as a block. Example: given three variables X, Y, Z with domains of size 2, group Y and Z together to form a variable W = {Y, Z} with domain size 4; then, given sample (x^t, y^t, z^t), compute the next sample by drawing the block W jointly. + Can improve convergence greatly when two variables are strongly correlated! - The domain of the block variable grows exponentially with the number of variables in a block!

81 Gibbs: Multiple Chains
Generate M chains of size K Each chain produces independent estimate Pm: Estimate P(xi|e) as average of Pm (xi|e) : Treat Pm as independent random variables.

82 Gibbs Sampling Summary
A Markov Chain Monte Carlo method (Gelfand and Smith, 1990; Smith and Roberts, 1993; Tierney, 1994). Samples are dependent and form a Markov chain. We sample from an estimate that converges to P(X|e); convergence is guaranteed when all P > 0. Methods to improve convergence: blocking and Rao-Blackwellisation.

83 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Rao-Blackwellisation; AND/OR importance sampling.

84 Outline
Rejection problem; backtrack-free distribution (constructing it in practice); SampleSearch (constructs the backtrack-free distribution on the fly; approximate estimators); experiments.

85 Outline
Rejection problem; backtrack-free distribution (constructing it in practice); SampleSearch (constructs the backtrack-free distribution on the fly; approximate estimators); experiments.

86 Rejection problem Importance sampling requirement
P(z,e) > 0 ⇒ Q(z) > 0. When P(z,e) = 0 but Q(z) > 0, the weight of the sample is zero and it is rejected. The probability of generating a rejected sample should be very small; otherwise the estimate will be zero.

87 Rejection Problem. Constraints: A≠B, A≠C. If z violates either A≠B or A≠C then P(z,e) = 0, yet Q is positive everywhere on the search tree over A, B, C (with arc probabilities such as 0.8/0.2 and 0.4/0.6), so the sampler can reach non-solution leaves. Blue leaves correspond to solutions, i.e., g(x) > 0; red leaves correspond to non-solutions, i.e., g(x) = 0.

88 Outline
Rejection problem; backtrack-free distribution (constructing it in practice); SampleSearch (constructs the backtrack-free distribution on the fly; approximate estimators); experiments.

89 Backtrack-free distribution: A rejection-free distribution
QF(branch) = 0 if there are no solutions under it; QF(branch) ∝ Q(branch) otherwise. Constraints: A≠B, A≠C. Blue leaves correspond to solutions, i.e., g(x) > 0; red leaves correspond to non-solutions, i.e., g(x) = 0.

90 Generating samples from QF
Constraints: A≠B, A≠C. QF(branch) = 0 if there are no solutions under it; QF(branch) ∝ Q(branch) otherwise. Invoke an oracle at each branch: the oracle returns True if there is a solution under the branch, and False otherwise.

91 Generating samples from QF
Gogate et al., UAI 2005; Gogate and Dechter, UAI 2005. Constraints: A≠B, A≠C. Possible oracles: (a) adaptive consistency as a pre-processing step, giving constant-time table look-ups but requiring time exponential in the treewidth of the constraint portion; (b) a complete CSP solver, which must be run at each assignment.

92 Outline
Rejection problem; backtrack-free distribution (constructing it in practice); SampleSearch (constructs the backtrack-free distribution on the fly; approximate estimators); experiments.

93 Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. The sampler walks down the search tree over A, B, C, drawing each value from the proposal Q (arc probabilities such as 0.8/0.2 and 0.4/0.6).

94 Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. (Search-tree illustration of sampling from Q.)

95 Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. When a branch fails, resume sampling from the current partial assignment. (Search-tree illustration.)

96 Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. When a constraint is violated, the value is rejected and search continues until P(sample, e) > 0. (Search-tree illustration.)

97 Gogate and Dechter, AISTATS 2007, AAAI 2007
Generate more samples. Constraints: A≠B, A≠C. (Search-tree illustration.)

98 Gogate and Dechter, AISTATS 2007, AAAI 2007
Generate more samples. Constraints: A≠B, A≠C. (Search-tree illustration.)

99 Traces of SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. The assignments visited by successive runs form traces in the search tree. (Illustration of several traces.)

100 SampleSearch: Sampling Distribution
Gogate and Dechter, AISTATS 2007, AAAI 2007. Problem: due to the search, the samples are no longer i.i.d. from Q. Theorem: SampleSearch generates i.i.d. samples from the backtrack-free distribution QF.
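A schematic Python sketch of the sample-then-backtrack idea (not the authors' implementation; Q, consistent, and the data layout are our assumptions, and the importance-weighting details are in the cited papers):

```python
import random

def sample_discrete(probs):
    total = sum(probs)
    r, cum = random.random() * total, 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

def sample_search(Q, consistent, num_vars, domains):
    """Sample variables along Q; on a constraint violation, prune the value and resume.

    Q          -- Q[i](partial) returns the proposal distribution of X_i given the
                  partial assignment of X_0..X_{i-1} (a probability vector)
    consistent -- consistent(partial) is False iff the partial assignment already
                  violates some constraint
    Assumes the problem has at least one solution.
    """
    assignment = []
    pruned = [set() for _ in range(num_vars)]   # values ruled out at each level
    i = 0
    while i < num_vars:
        probs = [0.0 if v in pruned[i] else Q[i](assignment)[v]
                 for v in range(domains[i])]
        if sum(probs) == 0.0:                   # dead end: backtrack one level
            pruned[i] = set()
            i -= 1
            pruned[i].add(assignment.pop())     # never retry the failed value
            continue
        v = sample_discrete(probs)
        assignment.append(v)
        if not consistent(assignment):
            pruned[i].add(assignment.pop())     # resume sampling at this level
            continue
        i += 1
    return assignment                           # every returned sample is a solution
```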

101 The Sampling distribution QF of SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. What is the probability of generating A=0? QF(A=0) = 0.8, because SampleSearch is systematic. What is the probability of generating (A=0, B=1)? QF(B=1 | A=0) = 1, again because SampleSearch is systematic. What is the probability of generating (A=0, B=0)? Simple: QF(B=0 | A=0) = 0, since all samples generated by SampleSearch are solutions. This is exactly the backtrack-free distribution.

102 Outline
Rejection problem; backtrack-free distribution (constructing it in practice); SampleSearch (constructs the backtrack-free distribution on the fly; approximate estimators); experiments.

103 Asymptotic approximations of QF
Gogate and Dechter, AISTATS 2007, AAAI 2007. Some branches are never explored during sampling ("holes"); we do not know whether they contain solutions. IF Hole THEN the upper approximation sets UF = Q (i.e., assume there is a solution at the other branch) and the lower approximation sets LF = 0 (i.e., assume no solution at the other branch).

104 Approximations: Convergence in the limit
Gogate and Dechter, AISTATS 2007, AAAI 2007. Store all traces generated so far; "?" marks holes where branches were never visited.

105 Approximations: Convergence in the limit
Gogate and Dechter, AISTATS 2007, AAAI 2007. From the combined sample tree, update U and L: IF Hole THEN UFN = Q and LFN = 0.

106 Upper and Lower Approximations
Asymptotically unbiased. Upper and lower bounds on the unbiased sample mean. Linear time and space overhead. Bias versus variance tradeoff: the bias is the difference between the upper and the lower approximation.

107 Improving Naive SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Better search strategy: any state-of-the-art CSP/SAT solver can be used, e.g., MiniSat (Eén and Sörensson, 2006); all theorems and results still hold. Better importance function: use the output of generalized belief propagation to compute the initial importance function Q (Gogate and Dechter, 2005).

108 Experiments
Tasks: weighted counting; marginals. Benchmarks: satisfiability problems (counting solutions), linkage networks, relational instances (first-order probabilistic networks), grid networks, logistics planning instances. Algorithms: SampleSearch/UB, SampleSearch/LB; SampleCount (Gomes et al., 2007); ApproxCount (Wei and Selman, 2007); RELSAT (Bayardo and Pehoushek, 2000); Edge Deletion Belief Propagation (Choi and Darwiche, 2006); Iterative Join Graph Propagation (Dechter et al., 2002); Variable Elimination and Conditioning (VEC); EPIS (Changhe and Druzdzel, 2006).

109 Results: Solution Counts Langford instances
Time Bound: 10 hrs

110

111 Results: Probability of Evidence Linkage instances (UAI 2006 evaluation)
Time Bound: 3 hrs

112 Results: Probability of Evidence Linkage instances (UAI 2008 evaluation)
Time Bound: 3 hrs

113

114 Results on Marginals Evaluation Criteria
The evaluation measure is always bounded between 0 and 1 and lower-bounds the KL distance; when probabilities close to zero are present, the KL distance may tend to infinity.

115 Results: Posterior Marginals Linkage instances (UAI 2006 evaluation)
Time Bound: 3 hrs Distance measure: Hellinger distance

116

117 Summary: SampleSearch
Manages the rejection problem while sampling via systematic backtracking search. The sampling distribution of SampleSearch is the backtrack-free distribution QF, which is expensive to compute; an approximation of QF based on storing all traces yields an asymptotically unbiased estimator with linear time and space overhead, and bounds the sample mean from above and below. Empirically, when a substantial number of zero probabilities are present, SampleSearch-based schemes dominate their pure-sampling counterparts and generalized belief propagation.

118 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Rao-Blackwellisation; AND/OR importance sampling.

119 Sampling: Performance
Gibbs sampling: reduce the dependence between samples. Importance sampling: reduce the variance. Achieve both by sampling a subset of the variables and integrating out the rest (reducing dimensionality), aka Rao-Blackwellisation, and exploit graph structure to manage the extra cost.

120 Smaller Subset State-Space
Smaller state-space is easier to cover

121 Smoother Distribution

122 Speeding Up Convergence
Mean squared error of the estimator = variance + (bias)^2. In the case of an unbiased estimator, the bias is 0, so reducing variance speeds up convergence!

123 Rao-Blackwellisation
Liu, Ch.2.3

124 Rao-Blackwellisation
"Carry out analytical computation as much as possible" (Liu). X = R ∪ L. Importance sampling: sample R and integrate out L. Gibbs sampling: the autocovariances are lower (less correlation between samples); if Xi and Xj are strongly correlated, e.g., Xi=0 if and only if Xj=0, include only one of them in the sampling set. (Liu, Ch. 2.5.5)

125 Blocking Gibbs Sampler vs. Collapsed
Given three variables X, Y, Z: (1) standard Gibbs samples each variable from its conditional given the other two; (2) blocking samples X and then the block (Y,Z) jointly; (3) collapsed Gibbs integrates Z out and samples only X and Y. Moving from (1) to (3) gives faster convergence.

126 Collapsed Gibbs Sampling Generating Samples
Generate sample c^{t+1} from c^t: in short, for i = 1 to K, sample c_i^{t+1} ~ P(Ci | c^t \ ci, e).

127 Collapsed Gibbs Sampler
Input: C X, E=e Output: T samples {ct } Fix evidence E=e, initialize c0 at random For t = 1 to T (compute samples) For i = 1 to N (loop through variables) cit+1  P(Ci | ct\ci) End For

128 Calculation Time. Computing P(ci | c^t \ ci, e) is more expensive (it requires inference). Trading #samples for smaller variance: generate more samples with higher covariance, or fewer samples with lower covariance. We must control the time spent computing the sampling probabilities in order to be time-effective!

129 Exploiting Graph Properties
Recall that computation time is exponential in the adjusted induced width of the graph. A w-cutset is a subset of variables such that, when they are observed, the induced width of the graph is w; when the sampled variables form a w-cutset, inference is exp(w) (e.g., using Bucket Tree Elimination). A cycle-cutset is a special case of a w-cutset. Sampling a w-cutset gives w-cutset sampling!

130 What If C=Cycle-Cutset ?
P(x2, x5, x9) can be computed using Bucket Elimination: once the cycle cutset {X2, X5, X9} is instantiated, the remaining network is a tree, so the computation complexity is O(N).

131 Computing Transition Probabilities
Compute the joint probabilities P(ci, c \ ci, e) for each value of the sampled variable (over the network X1, ..., X9), then normalize to obtain the transition probability P(Ci | c \ ci, e).

132 Cutset Sampling-Answering Queries
Query: ci C, P(ci |e)=? same as Gibbs: computed while generating sample t using bucket tree elimination Query: xi X\C, P(xi |e)=? compute after generating sample t using bucket tree elimination

133 Cutset Sampling vs. Cutset Conditioning

134 Cutset Sampling Example
Estimating P(x2 | e) for the sampled node X2: average the conditionals computed for samples 1, 2, 3.

135 Cutset Sampling Example
Estimating P(x3 | e) for the non-sampled node X3: compute P(x3 | c^t, e) by tree inference for each sample and average.

136 CPCS54 Test Results MSE vs. #samples (left) and time (right)
Ergodic, |X|=54, D(Xi)=2, |C|=15, |E|=3 Exact Time = 30 sec using Cutset Conditioning

137 CPCS179 Test Results MSE vs. #samples (left) and time (right)
Non-Ergodic (1 deterministic CPT entry) |X| = 179, |C| = 8, 2<= D(Xi)<=4, |E| = 35 Exact Time = 122 sec using Cutset Conditioning

138 CPCS360b Test Results MSE vs. #samples (left) and time (right)
Ergodic, |X| = 360, D(Xi)=2, |C| = 21, |E| = 36 Exact Time > 60 min using Cutset Conditioning Exact Values obtained via Bucket Elimination

139 Random Networks MSE vs. #samples (left) and time (right)
|X| = 100, D(Xi) =2,|C| = 13, |E| = 15-20 Exact Time = 30 sec using Cutset Conditioning

140 Coding Networks Cutset Transforms Non-Ergodic Chain to Ergodic
MSE vs. time. Non-Ergodic, |X| = 100, D(Xi)=2, |C| = 13-16, |E| = 50. (Coding network with variable layers x, u, p, y.) Sample the ergodic subspace U = {U1, U2, ..., Uk}. Exact Time = 50 sec using Cutset Conditioning.

141 Non-Ergodic Hailfinder
MSE vs. #samples (left) and time (right) Non-Ergodic, |X| = 56, |C| = 5, 2 <=D(Xi) <=11, |E| = 0 Exact Time = 2 sec using Loop-Cutset Conditioning

142 CPCS360b - MSE MSE vs. Time Ergodic, |X| = 360, |C| = 26, D(Xi)=2
Exact Time = 50 min using BTE

143 Cutset Importance Sampling
(Gogate & Dechter, 2005) and (Bidyuk & Dechter, 2006) Apply Importance Sampling over cutset C where P(ct,e) is computed using Bucket Elimination

144 Likelihood Cutset Weighting (LCS)
Z = topological order over {C, E}. Generating sample t+1: the required conditionals are computed while generating sample t using bucket tree elimination, and can be memoized for some number of instances K (based on the memory available). KL[P(C|e), Q(C)] ≤ KL[P(X|e), Q(X)].

145 Pathfinder 1

146 Pathfinder 2

147 Link

148 Summary
Importance sampling: i.i.d. samples; unbiased estimator; generates samples fast; samples from Q; rejects samples with zero weight; improves with a cutset.
Gibbs sampling: dependent samples; biased estimator; generates samples more slowly; samples from an estimate of P(X|e); does not converge in the presence of constraints; improves with a cutset.

149 CPCS360b LW – likelihood weighting
LCS – likelihood weighting on a cutset

150 CPCS422b LW – likelihood weighting
LCS – likelihood weighting on a cutset

151 Coding Networks LW – likelihood weighting
LCS – likelihood weighting on a cutset

152 Overview
Probabilistic Reasoning / Graphical models; Importance Sampling; Markov Chain Monte Carlo: Gibbs Sampling; Sampling in the presence of Determinism; Rao-Blackwellisation; AND/OR importance sampling.

153 Motivation
Expected value of the number on the face of a single fair die: 3.5. What is the expected value of the product of the numbers on the faces of k dice? As we saw, the main idea in sampling is to express the problem as an expectation and then compute a Monte Carlo estimate. The main idea in AND/OR sampling is to derive a better Monte Carlo estimate that takes conditional independence into account. I'll demonstrate the main idea using this dice example.

154 Monte Carlo estimate Perform the following experiment N times.
Toss all the k dice. Record the product of the numbers on the top face of each die. Report the average over the N runs.

155 How the sample average converges?
10 dice; the exact answer is (3.5)^10.

156 But This is Really Dumb? The dice are independent.
A better Monte Carlo estimate: perform the experiment N times; for each die, record the average of its outcomes; take the product of these averages. The conventional estimate is instead the average of the products.
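A small simulation contrasting the two estimators for this dice example (illustrative only):

```python
import math
import random

K, N = 10, 100_000                       # 10 dice, N repeated experiments
rolls = [[random.randint(1, 6) for _ in range(K)] for _ in range(N)]

# Conventional Monte Carlo estimate: the average of the products.
avg_of_products = sum(math.prod(r) for r in rolls) / N

# Estimate exploiting independence: the product of the per-die averages.
prod_of_avgs = math.prod(sum(r[i] for r in rolls) / N for i in range(K))

exact = 3.5 ** K                         # (3.5)^10 for ten fair dice
print(avg_of_products, prod_of_avgs, exact)
```
For the same number of experiments, the product of averages is typically far closer to the exact value.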

157 How the sample Average Converges

158 Moral of the story. Make use of (conditional) independence to get better results. It is used extensively for exact inference: Bucket Elimination (Dechter, 1996), junction tree (Lauritzen and Spiegelhalter, 1988), Value Elimination (Bacchus et al., 2004), Recursive Conditioning (Darwiche, 2001), BTD (Jegou et al., 2002), AND/OR search (Dechter and Mateescu, 2007). How can it be used for sampling? AND/OR importance sampling.

159 Background: AND/OR search space
The AND/OR search space is a general scheme that can be used to exactly solve various combinatorial problems in graphical models. Given a graphical model, the AND/OR search space is driven by a pseudo-tree which captures the problem decomposition (here, A is the root with children B and C, each variable with domain {0,1,2}). Given the pseudo-tree, the associated AND/OR search tree has alternating levels of OR nodes and AND nodes: the OR nodes are labeled with variables, and the AND nodes are labeled with values and correspond to value assignments in the domains of the variables. The root of the AND/OR search tree is an OR node labeled with the root of the pseudo-tree. Semantically, the OR states represent alternative solutions, whereas the AND states represent problem decomposition into independent sub-problems, all of which need to be solved (e.g., given A, the sub-problems over B and C are independent). When the pseudo-tree is a chain, the AND/OR search tree coincides with the regular OR search tree.

160 AND/OR sampling: Example
The example Bayesian network has factors P(A), P(B|A), P(C|A), P(D|B), P(F|C); its pseudo-tree places A at the root with children B and C, D below B, and F below C.

161 AND/OR Importance Sampling (General Idea)
Gogate and Dechter, UAI 2008, CP 2008. Pseudo-tree with A at the root and children B and C. Decompose the expectation: the main idea in AND/OR sampling is to use the pseudo-tree to decompose the expectation. First, phrase the original summation problem as an expectation with respect to a proposal distribution Q.

162 AND/OR Importance Sampling (General Idea)
By applying simple symbolic manipulations, the terms in brackets become conditional expectations: the first bracket gives the conditional expectation over B given A, and the second the conditional expectation over C given A. Don't be alarmed by the term conditional expectation; it is just an average over all events in which A is instantiated to the specific value.

163 AND/OR Importance Sampling (General Idea)
Compute all the expectations separately. How? Record all the samples; for each sample that has A=a, estimate the conditional expectations at B and at C separately using the generated samples; then combine the results.

164 AND/OR Importance Sampling
Suppose we have generated four samples over A, B, C and arrange them on an AND/OR tree. Each arc into an AND node is labeled with two quantities: the ratio of f over Q for the functions instantiated at that assignment, and the number of times the assignment was sampled. At each OR node we estimate the conditional expectation of its variable given the assignment above it; this conditional expectation is simply a weighted average of the quantities below it.

165 Gogate and Dechter, UAI 2008, CP2008
Gogate and Dechter, UAI 2008, CP 2008. At all AND nodes: separate (independent) components; operator: product. At all OR nodes: conditional expectations given the assignment above; operator: weighted average.

166 Algorithm AND/OR Importance Sampling
Gogate and Dechter, UAI 2008, CP 2008.
1. Construct a pseudo-tree and a proposal distribution Q along it.
2. Generate samples x^1, ..., x^N from Q along the ordering O.
3. Build an AND/OR sample tree for the samples x^1, ..., x^N along O.
4. For all leaf nodes i of the AND/OR tree: if i is an AND node, v(i) = 1; else v(i) = 0.
5. For every node n from the leaves to the root: if n is an AND node, v(n) = product of its children's values; if n is an OR node, v(n) = (weighted) average of its children's values.
6. Return v(root node).
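A minimal sketch of the value computation on an AND/OR sample tree, under our own assumed node representation (each AND node stores the arc label f/Q and the sample count described on the earlier slides); this illustrates the product/weighted-average recursion, not the authors' code:

```python
import math

class Node:
    """Node of an AND/OR sample tree (hypothetical representation).

    OR nodes are labeled with variables; their children are AND nodes, one per sampled
    value. Each AND node carries the arc label from the slides: the ratio f/Q of the
    functions instantiated by that assignment (arc_weight) and the number of samples
    that used the assignment (count)."""
    def __init__(self, kind, children=None, arc_weight=1.0, count=1):
        self.kind = kind                  # 'AND' or 'OR'
        self.children = children or []
        self.arc_weight = arc_weight
        self.count = count

def value(node):
    """Bottom-up value computation: product at AND nodes, weighted average at OR nodes."""
    if not node.children:
        return 1.0 if node.kind == 'AND' else 0.0
    if node.kind == 'AND':                # independent components: multiply
        return math.prod(value(c) for c in node.children)
    total = sum(c.count for c in node.children)
    return sum(c.count * c.arc_weight * value(c) for c in node.children) / total
```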

167 # samples in AND/OR vs Conventional
Gogate and Dechter, UAI 2008, CP 2008. The four generated samples correspond to 8 samples in the AND/OR space versus 4 samples in conventional importance sampling. For example, A=0, B=2, C=0 is not generated but is still considered in the AND/OR space. AND/OR sampling has lower error because it computes an average over more samples. To count the samples, recall that an OR node represents alternatives while an AND node stands for decomposition into independent sub-problems, so we sum over OR nodes and take a product over AND nodes.

168 Why AND/OR Importance Sampling
Gogate and Dechter, UAI 2008, CP 2008. AND/OR estimates have smaller variance. The variance reduction is easy to prove for the case of complete independence (Goodman, 1960) and more complicated to prove for the general conditional independence case (see Vibhav Gogate's thesis). Note the squared term.

169 AND/OR Graph sampling
Example with variables A, B, C, D over domains {0,1,2} and not-equal constraints. Merging identical sub-problems turns the AND/OR sample tree into an AND/OR sample graph: the same samples correspond to 8 samples in the AND/OR sample tree but 12 samples in the AND/OR sample graph.

170 Combining AND/OR sampling and w-cutset sampling
Reduce the variance of the weights: Rao-Blackwellised w-cutset sampling (Bidyuk and Dechter, 2007). Increase the number of samples: AND/OR tree and graph sampling (Gogate and Dechter, 2008). Combine the two. Because Z-hat is an unbiased estimate, its mean squared error, i.e., its quality, depends on its variance: the smaller the variance, the better the quality. Looking at the expression for the variance, we can reduce it either by reducing the variance of the weights or by increasing the number of samples N. The numerator depends on how good Q is; the denominator depends on how fast the samples are generated. The two known graph-based approaches to reduce variance are Rao-Blackwellised importance sampling and AND/OR importance sampling, and this work combines the two.

171 Algorithm AND/OR w-cutset sampling
Given an integer constant w:
1. Partition the set of variables into K and R, such that the treewidth of R is bounded by w.
2. AND/OR sampling on K: construct a pseudo-tree of K, compute Q(K) consistent with it, generate samples from Q(K), and store them on an AND/OR tree.
3. Rao-Blackwellisation (exact inference) at each leaf: for each leaf node of the tree, compute Z(R|g), where g is the assignment on the path from the leaf to the root.
4. Value computation, recursively from the leaves to the root: at each AND node compute the product of the children's values; at each OR node compute a weighted average over the children's values.
5. Return the value of the root node.
As the name suggests, AND/OR w-cutset sampling combines AND/OR sampling and w-cutset sampling: perform AND/OR sampling over the variables in K, then at each leaf of the AND/OR sample tree perform exact inference (Rao-Blackwellisation), and finally perform value computation as in AND/OR sampling to obtain an estimate of the partition function.

172 AND/OR w-cutset sampling: Step 1: Partition the set of variables
Practical constraint: we can only afford exact inference if the treewidth of the remaining variables is bounded, here by 1. Partition the variables of the graphical model accordingly.

173 AND/OR w-cutset sampling: Step 2: AND/OR sampling over {A,B,C}
A start pseudo-tree over {A,B,C} takes the conditional independence properties into account; a chain start pseudo-tree over {A,B,C} samples A, B, and C without taking conditional independence into account.

174 AND/OR w-cutset sampling: Step 2: AND/OR sampling over {A,B,C}
(Illustration: the pseudo-tree over {A,B,C} and the corresponding graphical model.)

175 AND/OR w-cutset sampling: Step 2: AND/OR sampling over {A,B,C}
Samples: (C=0,A=0,B=1), (C=0,A=1,B=1), (C=1,A=0,B=0), (C=1,A=1,B=0), arranged on the AND/OR tree of the pseudo-tree over {C, A, B}.

176 AND/OR w-cutset sampling: Step 3: Exact inference at each leaf
(Illustration: exact inference over the remaining variables R is performed at each leaf of the AND/OR sample tree.)

177 AND/OR w-cutset sampling: Step 4: Value computation
The value of the root node C is the estimate of the partition function, obtained by taking products at AND nodes and weighted averages at OR nodes.

178 Properties and Improvements
The basic underlying scheme for sampling remains the same; the only thing that changes is what we estimate from the samples, so it can be combined with any state-of-the-art importance sampling technique. Graph vs. tree sampling: graph sampling takes full advantage of the conditional independence properties uncovered from the primal graph, reducing the variance even further.

179 AND/OR w-cutset sampling Advantages and Disadvantages
Advantages: variance reduction; relatively fewer calls to the Rao-Blackwellisation step due to efficient caching (lazy Rao-Blackwellisation); dynamic Rao-Blackwellisation when context-specific or logical dependencies are present; particularly suitable for Markov logic networks (Richardson and Domingos, 2006). Disadvantages: increases time and space complexity, so fewer samples may be generated.

180 Take away Figure: Variance Hierarchy and Complexity
The variance hierarchy relates IS, w-cutset IS, AND/OR tree IS, AND/OR graph IS, AND/OR w-cutset tree IS, and AND/OR w-cutset graph IS, together with their time and space complexities (expressed in terms of n, N, the sample-tree size t, and exp(w)). As we go down the hierarchy, the variance reduces and the quality of the estimates improves. At the top is the conventional importance sampling scheme, then the two graph-based IS schemes; the variance of w-cutset IS is incomparable to that of AND/OR tree and graph sampling. The two new schemes introduced here have the smallest variance. But there is no free lunch: as we move down the variance hierarchy, the complexity of the schemes typically increases.

181 Experiments
Benchmarks: linkage analysis; graph coloring. Algorithms: OR tree sampling, AND/OR tree sampling, AND/OR graph sampling, and the w-cutset versions of the three schemes above.

182 Results: Probability of Evidence Linkage instances (UAI 2006 evaluation)
Time Bound: 1hr

183

184 Results: Probability of Evidence Linkage instances (UAI 2008 evaluation)
Time Bound: 1hr

185

186 Results: Solution counting Graph coloring instance
Time Bound: 1hr

187 Summary: AND/OR Importance sampling
AND/OR sampling: a general scheme to exploit conditional independence in sampling. Theoretical guarantees: lower sampling error than conventional sampling. Variance reduction orthogonal to Rao-Blackwellised sampling. Better empirical performance than conventional sampling.


Download ppt "Bozhena Bidyuk Vibhav Gogate Rina Dechter"

Similar presentations


Ads by Google