1
SampleSearch: Importance Sampling in the presence of Determinism
Rina Dechter, Bren School of Information and Computer Sciences, University of California, Irvine. Graphical models are the most popular knowledge representation schemes in AI and computer science for representing and reasoning with probabilistic and deterministic relationships. Exact reasoning is hard when the graph is dense (treewidth is high), so approximation schemes must be used at some point. Approximations with guarantees are also hard, but they are not as difficult when the information is purely probabilistic, with no strict constraints. However, when some of the knowledge involves constraints, approximation is hard as well; in particular, sampling schemes become hard. In this talk we address this issue of sampling in the presence of hard constraints. I will present SampleSearch, which interleaves sampling with search, and analyze the implications of the idea for importance sampling. We will identify the rejection problem, show how it can be handled, and when handling it is cost-effective. An empirical evaluation will demonstrate how cost-effective the scheme is. I will then move to the orthogonal approach of exploiting problem structure in sampling and show our strategy. Joint work with Vibhav Gogate.
2
Overview Introduction: Mixed graphical models
SampleSearch: Sampling with Searching Exploiting structure in sampling: AND/OR Importance sampling Here is a brief outline of my talk. I’ll start by describing some preliminaries and motivating applications. Then I’ll describe the main problem which we investigated in this thesis, which is the poor performance of sampling-based algorithms in the presence of determinism. Then I’ll describe the four main contributions of this thesis.
4
Bayesian Networks: Representation (Pearl, 1988)
[Figure: network over Smoking (S), lung Cancer (C), Bronchitis (B), X-ray (X), Dyspnoea (D), with CPTs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).] Bayesian networks are a compact representation of a joint distribution; here we assume only discrete distributions. Given a directed acyclic graph over a set of random variables, the joint distribution over all the variables can be expressed as a product of conditional distributions (CPTs), one per variable X, defined as P(X | parents of X in the graph). For this network: P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B). Two queries are of interest; one is belief updating, e.g. P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
5
Mixed Networks: Mixing Belief and Constraints
[Figure: a belief (Bayesian) network and a constraint network over variables A, B, C, D, E, F, with an example CPT P(D|B,C).] The focus is on graphical models having determinism, and mixed networks are a unifying framework for such models. A graphical model captures knowledge using a collection of functions whose interactions can be captured by a graph. Bayesian networks allow reasoning under conditions of uncertainty: a Bayesian network is a compact representation of a joint probability distribution which, given a directed acyclic graph over the random variables, is expressed as a product of conditional probability tables (CPTs), one for each variable given its parents. Constraint networks are deterministic graphical models consisting of a set of variables, a set of values for each variable, and constraints that restrict the assignments the variables can take; a solution is an assignment to all variables that satisfies all constraints. Mixed networks augment a probabilistic graphical model having determinism with a constraint network. The constraint network is used to model zero probabilities as constraints; for example, the zero probabilities in the Bayesian network can be modeled as constraints. The central aim is to exploit the power of constraint-processing techniques for efficient probabilistic inference. Constraints could be specified externally or may occur as zeros in the belief network.
6
Mixed networks: Distribution and Queries
The distribution represented by a mixed network T = (B, R). Queries: weighted counting (equivalent to P(e), the partition function, and solution counting) and the marginal distribution. The distribution defined by a mixed network is the distribution represented by the Bayesian network projected onto the solutions of the constraint portion, normalized by a constant M. Evidence is a special type of unary constraint. The common queries over a mixed network are computing the normalization constant M and the marginal distribution of a variable; M is the same as P(e), the partition function, or the solution count. The process of answering these queries is called inference.
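The defining formulas appeared as images on the slide; the following is a hedged reconstruction of the standard mixed-network definitions implied by the notes above (B is the Bayesian network, R the constraint network, sol(R) its set of solutions):

$$
P_{\mathcal{T}}(x) \;=\;
\begin{cases}
\dfrac{1}{M}\, P_{\mathcal{B}}(x), & x \in sol(\mathcal{R}),\\[4pt]
0, & \text{otherwise,}
\end{cases}
\qquad
M \;=\; \sum_{x \,\in\, sol(\mathcal{R})} P_{\mathcal{B}}(x).
$$

Weighted counting asks for M; the marginal query asks for P_T(X_i = v).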
7
Applications. Determinism: more ubiquitous than you may think! Transportation planning (Liao et al. 2004, Gogate et al. 2005): predicting and inferring the car travel activity of individuals. Genetic linkage analysis (Fishelson and Geiger, 2002): associating the functionality of genes with their location on chromosomes. Functional/software verification (Bergeron, 2000): generating random test programs to check the validity of hardware. First-order probabilistic models (Domingos et al. 2006, Milch et al. 2005): citation matching. Graphical models that have determinism occur quite frequently in many real-world applications. One example is transportation planning, in which, given GPS data about a person, the problem is to infer and predict the travel activity routines of individuals. Another application is computational biology, specifically genetic linkage analysis, in which, given a chromosome and some marker locations on it, the task is to figure out whether a gene near a marker affects a particular disease. A third application is functional or software verification, in which you are given a circuit or a piece of code and the task is to generate test programs that check whether the circuit or the software conforms to its specification. A final application is first-order probabilistic models, which have a lot of determinism and are used for tasks like citation matching or figuring out author relationships.
8
Functions on Orange nodes are deterministic
[Figure: Bayesian network fragments for locus 1 and locus 2, with locus variables L, selector variables S, and phenotype variables X; the functions on the orange nodes are deterministic.]
9
Bayesian Network for Recombination
[Figure: Bayesian network for 3 people and two loci, with locus variables L, selector variables S, and phenotype variables X; some relationships are deterministic and others probabilistic.] A Bayesian network for 3 people and two loci is given here. Theta is the parameter controlling the distance between the two markers. The linkage instances are interesting as graphical models because they involve both probabilistic and deterministic relationships. The query is P(e | Θ).
10
Linkage analysis: 6 people, 3 markers
[Figure: Bayesian network for linkage analysis with 6 people and 3 markers.] Here is an example of using Bayesian networks for linkage analysis. It models the genetic inheritance in a family of 6 individuals relative to some genes of interest. The task is to find a disease gene on a chromosome. This domain yields very hard probabilistic networks that contain both probabilistic information and deterministic relationships, and it drives many of the methods that we currently develop.
11
Approximate Inference
Approximations are hard with determinism. A randomized polynomial-time ε-approximation is possible when no zeros are present (Karp 1993, Cheng 2001); ε-approximation is NP-hard in the presence of zeros. Gibbs sampling is problematic when the MCMC chain is not ergodic. Current remedies: replace zeros with very small values (Laplace correction: Naive Bayes, NLP), which gives bad performance when the zeros or determinism are real! The goal is to perform effective sampling-based approximate inference in the presence of determinism. Compared with exact inference, which in fact becomes easier when determinism is present in the graphical model, approximate inference rather counter-intuitively becomes harder. In particular, there exists a randomized polynomial-time algorithm that bounds the marginal probabilities with relative error epsilon when there is no determinism; however, when determinism is present, the problem of bounding marginal probabilities with relative error epsilon becomes NP-hard. Other approaches in the machine learning and NLP communities for tackling zeros use the Laplace correction, namely replacing all zeros by very small values. This works quite well when the determinism is not real (for example, missing data), but when the determinism is real, the Laplace correction can yield very bad performance.
12
Overview Introduction: Mixed graphical models
SampleSearch: Sampling with Searching Rejection problem Recovery and analysis Empirical evaluation Exploiting structure in sampling: AND/OR Importance sampling Here is a brief outline of my talk. I’ll start by describing some preliminaries and motivating applications. Then I’ll describe the main problem which we investigated in this thesis, which is the poor performance of sampling-based algorithms in the presence of determinism. Then I’ll describe the four main contributions of this thesis.
13
Importance Sampling: Overview
Given a proposal or importance distribution Q(z) such that f(z) > 0 implies Q(z) > 0, rewrite the sum as an expectation; given i.i.d. samples z1,...,zN from Q(z), estimate it by a sample average. Importance sampling is a general technique that can be used to approximate any sum of the form M = Σz f(z), where f is a function over a set of variables. The idea is to multiply and divide by a distribution Q(z) and transform the sum into an expectation with respect to Q. Q is often called the proposal or importance distribution, which is why this scheme is called importance sampling. Q could be any distribution as long as it satisfies the condition above; however, for effective performance Q should be as close as possible to the posterior distribution of the mixed network. One then generates samples z1,...,zN from Q and estimates the sum by an average over the samples.
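As a concrete illustration (not from the slides), here is a minimal sketch of the importance-sampling estimator M̂ = (1/N) Σ_i f(z_i)/Q(z_i); the names f, sample_Q, and prob_Q are hypothetical stand-ins for the mixed-network function and its proposal distribution.

```python
def importance_sampling_estimate(f, sample_Q, prob_Q, N=10000):
    """Estimate M = sum_z f(z) by averaging f(z)/Q(z) over i.i.d. samples from Q."""
    total = 0.0
    for _ in range(N):
        z = sample_Q()             # draw z ~ Q
        total += f(z) / prob_Q(z)  # importance weight of this sample
    return total / N
```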
14
How to generate i.i.d. samples from Q(X)
We have n variables X1,...,Xn, and Q is assumed to be expressed in product form along an ordering of the variables, Q(X) = Q(X1) Q(X2|X1) ··· Q(Xn|X1,...,Xn-1), where each component is specified over only a polynomial number of variables.
15
Generating i.i.d. samples from Q
[Figure: probability tree over A, B, C; each branch is labeled with the probability of taking it given its parent.] We assume that Q is discrete and can be expressed in product form along an ordering of the variables, and that Q can be specified in polynomial space, namely each component of Q involves only a polynomial number of variables. Once Q is expressed in product form, we can represent it using a probability tree as shown. The probability tree here is over three variables, with one variable at each level. Each branch in the tree is labeled with the probability of reaching that branch given its parent. To sample from this probability tree, one simply selects each branch with the appropriate probability until a leaf node is reached. The probability of reaching a leaf node is the product of the probabilities on the path from the root to that leaf; for example, the probability of sampling the leaf C=1 is 0.8 × 0.6 × 0.8.
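A small sketch (not from the slides) of sampling from such a product-form Q by walking down the probability tree; the `components` interface is an assumption made for illustration.

```python
import random

def sample_from_Q(components):
    """Draw one sample from Q(X) = prod_i Q(X_i | X_1..X_{i-1}) by walking down
    the probability tree, choosing each branch with its probability.

    components[i](assignment) -> {value: probability} for X_i given the values
    assigned so far (an assumed interface mirroring the tree on the slide)."""
    assignment, path_prob = [], 1.0
    for component in components:
        dist = component(assignment)
        values, probs = zip(*dist.items())
        v = random.choices(values, weights=probs)[0]
        path_prob *= dist[v]          # probability of the path so far
        assignment.append(v)
    # e.g. the leaf C=1 on the slide is reached with probability 0.8 * 0.6 * 0.8
    return assignment, path_prob
```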
16
Rejection Problem Toss a biased coin P(H)=0.8, P(T)=0.2 Say we get H
[Figure: the probability tree over A, B, C from the previous slide; the biased coin toss corresponds to choosing a branch for A.]
17
Rejection Problem. Toss a biased coin with P(H)=0.4, P(T)=0.6.
Say we get a head. [Figure: the subtree under A=0, with branches B=0/B=1 (0.4/0.6) and the C values below them.]
19
Rejection Problem. Toss a biased coin with P(H)=0.4, P(T)=0.6.
Say we get a head. [Figure: the sampled path A=0, B=0 and the C branches below it.] A large number of the assignments generated will be rejected and thrown away.
20
Rejection Problem Constraints: A≠B A≠C
[Figure: the probability tree over A, B, C; blue leaves correspond to solutions, i.e. f(x) > 0, red leaves to non-solutions, i.e. f(x) = 0.] Unfortunately, in the presence of constraints or determinism, the proposal distribution may suffer from the rejection problem: it may generate samples that are inconsistent, i.e. not solutions of the constraint portion. In the tree, all blue leaves are solutions while all red leaves are non-solutions, and as we can see there is a positive probability of generating a non-solution. To understand why this is a problem, look at the importance-sampling expression: f is the true function and Q is the proposal distribution. If a sample x_i is not a solution, its f-value is zero and the ratio f(x_i)/Q(x_i) is zero. We call samples whose weight is zero rejected samples. If we only generate rejected samples, our estimate M̂ will be zero.
21
Revising Q to the backtrack-free distribution:
QF(branch) = 0 if there are no solutions under it; QF(branch) = 1 if there are no solutions under the other branch; QF(branch) = Q(branch) otherwise. Constraints: A≠B, A≠C. [Figure: the probability tree with the branch probabilities revised accordingly; blue leaves are solutions (f(x) > 0), red leaves are non-solutions (f(x) = 0).] To remedy this situation we can try to modify Q so that it is easy to sample from and has the same set of solutions (the same non-zero tuples) as f, the true function. It is easy to see that if we modify Q so that it respects all constraints in the graphical model, we will never generate inconsistent samples or non-solutions. In constraint-satisfaction jargon, this process of generating a solution without backtracking is called making the constraint network backtrack-free, so we call the modified proposal, which always generates samples that are solutions, the backtrack-free distribution. It is defined as shown: at each branch we check whether there are solutions under that branch, and if there are, we keep the branch as it is.
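A sketch (not from the slides) of the branch-level rule above, generalized by renormalizing Q over the branches an oracle reports as consistent; in the binary case this reduces to the 0 / 1 / Q(branch) rule. The `has_solution` oracle is the one discussed on the following slides.

```python
def backtrack_free_branch_probs(Q_branch, has_solution, assignment, domain):
    """Backtrack-free branch probabilities QF at one node of the probability tree.

    Q_branch(a, v)  : original proposal probability of extending a by value v
    has_solution(a) : assumed oracle saying whether a can be extended to a solution
    """
    probs = {}
    for v in domain:
        if has_solution(assignment + [v]):
            probs[v] = Q_branch(assignment, v)
        else:
            probs[v] = 0.0                    # no solutions under this branch
    total = sum(probs.values())
    # renormalize over the surviving branches (equals Q when all survive)
    return {v: (p / total if total > 0 else 0.0) for v, p in probs.items()}
```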
22
Generating samples from QF
QF(branch) = 0 if there are no solutions under it; QF(branch) = 1 if there are no solutions under the other branch; QF(branch) = Q(branch) otherwise. Constraints: A≠B, A≠C. Invoke an oracle at each branch: the oracle returns True if there is a solution under the branch and False otherwise. [Figure: the tree with a "solution?" query posed at each branch along the sampled path A=0, B=1, C=1.]
23
Generating samples from QF
Gogate et al., UAI 2005; Gogate and Dechter, UAI 2005. Constraints: A≠B, A≠C. Oracles in practice: adaptive consistency as a pre-processing step, which is too costly, O(exp(treewidth)); or a complete CSP solver, invoked O(nd) times for each sample. [Figure: the backtrack-free tree along the path A=0, B=1, C=1.]
24
Gogate et al., UAI 2005, Gogate and Dechter, UAI 2005
Approximations of QF: use i-consistency instead of adaptive consistency. O(n^i) time and space complexity; it identifies some zeros so that they are never sampled. Cons: too weak when the constraint portion is hard. [Figure: a spectrum from no processing, through polynomial consistency enforcement, to NP-hard exponential time and space, trading the rejection rate against the percentage of zeros identified.]
25
Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. [Figure: the proposal probability tree over A, B, C.]
26
Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. [Figure: the full proposal tree over A, B, C; SampleSearch begins sampling down it.]
27
Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. [Figure: an inconsistent branch has been pruned and its sibling's probability set to 1.] Resume sampling.
28
Algorithm SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. [Figure: a constraint is violated along the sampled path; the violating value is pruned and sampling resumes.] Constraint violated; repeat until f(sample) > 0.
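A rough sketch (for illustration, not the paper's implementation) of the naive SampleSearch loop just described: sample along Q, prune a value when it violates a constraint, backtrack on dead ends, and continue until a full consistent sample (f(sample) > 0) is produced. The interfaces `domains`, `Q_branch`, and `consistent` are assumptions; the actual algorithm plugs in a CSP/SAT solver, as described on the later slides.

```python
import random

def sample_search(domains, Q_branch, consistent):
    """Return one consistent sample; assumes the constraint portion is satisfiable.

    domains[i]    : list of values of variable i (in the sampling ordering)
    Q_branch(a,v) : proposal probability of value v given partial assignment a
    consistent(a) : constraint check on a partial assignment
    """
    n = len(domains)
    assignment, pruned = [], [set() for _ in range(n)]
    while len(assignment) < n:
        i = len(assignment)
        options = [v for v in domains[i] if v not in pruned[i]]
        if not options:                       # dead end: backtrack one level
            pruned[i].clear()
            bad = assignment.pop()
            pruned[len(assignment)].add(bad)
            continue
        weights = [Q_branch(assignment, v) for v in options]
        v = random.choices(options, weights=weights)[0]
        if consistent(assignment + [v]):
            assignment.append(v)              # extend and resume sampling
        else:
            pruned[i].add(v)                  # constraint violated: prune value
    return assignment
```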
29
Gogate and Dechter, AISTATS 2007, AAAI 2007
Generate more samples. Constraints: A≠B, A≠C. [Figure: the full proposal tree over A, B, C for the next sample.]
30
Gogate and Dechter, AISTATS 2007, AAAI 2007
Generate more samples. Constraints: A≠B, A≠C. [Figure: the proposal tree during generation of another sample.]
31
Traces of SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. [Figure: several possible traces of SampleSearch, each ending in the consistent sample A=0, B=1, C=1.]
32
SampleSearch: Sampling Distribution
Gogate and Dechter, AISTATS 2007, AAAI 2007. Problem: due to the search, the samples are no longer i.i.d. from Q. Theorem: SampleSearch generates i.i.d. samples from the backtrack-free distribution. Now that we know how to always generate consistent samples, we are not done yet, because our task is to estimate the probability of evidence, i.e. the normalization constant M. Using Q to estimate M would be incorrect, because SampleSearch does not generate samples from Q due to its backtracking search component. We can prove, however, that SampleSearch generates samples from the backtrack-free distribution, which, as we already saw, is quite expensive to compute.
33
The Sampling distribution QF of SampleSearch
Gogate and Dechter, AISTATS 2007, AAAI 2007. Constraints: A≠B, A≠C. What is the probability of generating A=0? QF(A=0) = 0.8. Why? SampleSearch is systematic. What is the probability of generating (A=0, B=1)? QF(B=1 | A=0) = 1. Why? SampleSearch is systematic. What is the probability of generating (A=0, B=0)? Simple: QF(B=0 | A=0) = 0, since all samples generated by SampleSearch are solutions. This is exactly the backtrack-free distribution. [Figure: the tree with the corresponding revised branch probabilities.]
34
Asymptotic approximations of QF
Gogate and Dechter, AISTATS 2007, AAAI 2007. QF(branch) = 0 if there are no solutions under it; QF(branch) = Q(branch) otherwise. IF a branch is a hole (its consistency is unknown), THEN set UF = Q (i.e. assume there is a solution at the other branch) and LF = 0 (i.e. assume no solution at the other branch). [Figure: a trace of SampleSearch with some branches marked "no solutions here" and others marked "? don't know" (holes).] Let us assume that SampleSearch traversed the trace shown in the figure to sample the blue solution. To know the exact backtrack-free distribution at each branch point, we have to figure out whether there is a solution on the other branch. If there is no solution on the other branch, then according to the definition of the backtrack-free distribution we set that branch's probability to zero; otherwise we keep it the same. In its trace, SampleSearch has figured out that some branches lead to solutions or non-solutions, but it has not determined the consistency of other branches, which are the holes or question marks in the figure. A simple approximation suggests itself: set each hole either to Q or to 0, corresponding to the two cases in the definition of the backtrack-free distribution.
35
Approximations: Convergence in the limit
Gogate and Dechter, AISTATS 2007, AAAI 2007. Store all possible traces. [Figure: three stored traces, each a path through the tree ending in A=0, B=1, C=1, with holes "?" at branches whose consistency is unknown.]
36
Approximations: Convergence in the limit
Gogate and Dechter, AISTATS 2007, AAAI 2007. From the combined sample tree, update U and L: IF a branch is still a hole, THEN set UFN = Q and LFN = 0. [Figure: the combined sample tree after N samples.]
37
Improving Naive SampleSearch: The IJGP-wc-SS algorithm
Gogate and Dechter, AISTATS 2007, AAAI 2007. Better search strategy: can use any state-of-the-art CSP/SAT solver, e.g. MiniSat (Eén and Sörensson, 2006). Better proposal distribution: use the output of IJGP, a generalized belief propagation scheme, to compute the initial importance function. w-cutset importance sampling (Bidyuk and Dechter, 2007): reduce variance by sampling from a subspace.
38
Experiments. Tasks: weighted counting and marginals.
Benchmarks: satisfiability problems (counting solutions), linkage networks, relational instances (first-order probabilistic networks), grid networks, logistics planning instances. Algorithms: IJGP-wc-SS/LB and IJGP-wc-SS/UB; IJGP-wc-IS (vanilla algorithm that does not perform search); SampleCount (Gomes et al. 2007, SAT); ApproxCount (Wei and Selman, 2007, SAT); EPIS (Yuan and Druzdzel, 2006); RELSAT (Bayardo and Pehoushek, 2000, SAT); Edge Deletion Belief Propagation (Choi and Darwiche, 2006); Iterative Join Graph Propagation (Dechter et al., 2002); Variable Elimination and Conditioning (VEC).
39
Results: Probability of Evidence Linkage instances (UAI 2006 evaluation)
Time bound: 3 hrs. M: number of samples generated in 3 hrs. Z: probability of evidence.
40
Results: Probability of Evidence Relational instances (UAI 2008 evaluation)
Time Bound: 10 hrs M: number of samples generated in 10 hrs Z: Probability of Evidence
42
Results: Solution Counts Latin Square instances (size 8 to 16)
Here, I’ll explain the methodology: each algorithm was run for the same amount of time (10 hrs). Also explain what n, k, c, and w stand for. If exact results are not known, we report the lower bound; the higher the lower bound, the better the scheme. Also, ApproxCount does not output a lower bound. Time bound: 10 hrs. M: number of samples generated in 10 hrs. Z: solution counts.
43
Results: Solution Counts Langford instances
Time Bound: 10 hrs M: number of samples generated in 10 hrs Z: Solution Counts
45
Results on Marginals Evaluation Criteria
The Hellinger distance: always bounded between 0 and 1, and it lower bounds the KL distance. When probabilities close to zero are present, the KL distance may tend to infinity.
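For concreteness, a small sketch (not from the slides) of the Hellinger distance between two discrete marginals, using one common normalization that keeps the distance in [0, 1]:

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as dicts
    {value: probability}."""
    keys = set(p) | set(q)
    s = sum((math.sqrt(p.get(k, 0.0)) - math.sqrt(q.get(k, 0.0))) ** 2 for k in keys)
    return math.sqrt(s / 2.0)

# e.g. hellinger({0: 1.0, 1: 0.0}, {0: 0.5, 1: 0.5}) stays finite even when an
# estimated probability is exactly zero, whereas the KL distance can diverge.
```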
46
Results: Posterior Marginals Linkage instances (UAI 2006 evaluation)
Time bound: 3 hrs. The table shows the Hellinger distance (∆) and the number of samples M.
48
Summary: SampleSearch
SampleSearch manages the rejection problem while sampling. Its sampling distribution is the backtrack-free distribution QF. Approximating QF by storing all traces yields an asymptotically unbiased estimator with linear time and space overhead, and the weighted counts can be bounded from above and below. Empirically, when a substantial number of zero probabilities are present, SampleSearch dominates.
49
AND/OR Importance sampling
Using structure within importance sampling. Our final contribution is AND/OR importance sampling, which utilizes structure within importance sampling.
50
Overview Introduction: Mixed graphical models
SampleSearch: Sampling with Searching Exploiting structure in sampling: AND/OR Importance sampling Here is a brief outline of my talk. I’ll start by describing some preliminaries and motivating applications. Then I’ll describe the main problem which we investigated in this thesis which is that of poor performance of sampling based algorithms in presence of determinism. Then I’ll describe the four main contributions of this thesis.
51
Motivation. Given a proposal distribution Q(z), importance sampling totally disregards the structure of f(z) while approximating M. Consider two sums, both over four variables {1,2,3,4} but with different structure; the second sum is sparser than the first. Intuitively, we can expect that the second sum, which has a much sparser graph structure, is easier to approximate than the first, and we will build on this intuition to generalize importance sampling.
52
OR Search Tree Constraint Satisfaction – Counting Solutions A B C D E
[Figure: constraint relations RABC, RBCD, RABE, RAEF over variables A–F, and the full OR search tree along the ordering A, E, C, B, F, D with 0/1 leaves.] Defined by a variable ordering; fixed, but may be dynamic.
53
AND/OR Search Tree Constraint Satisfaction – Counting Solutions A=a
[Figure: the constraint relations, the primal graph over A–F, and a DFS (pseudo) tree that guides the AND/OR search tree; the nodes A=a, B=b, C=c, D=d, E=e, F=f illustrate one assignment path.]
54
AND/OR Search Tree Constraint Satisfaction – Counting Solutions
[Figure: the constraint relations, a pseudo tree over A–F, and the corresponding AND/OR search tree with alternating OR levels (variables) and AND levels (their values).] Defined by a variable ordering; fixed, but may be dynamic.
55
AND/OR Tree: OR – case analysis on variable values
OR – case analysis on variable values; AND – decomposition into independent subproblems. [Figure: pseudo tree and the corresponding AND/OR search tree.] Defined by a variable ordering; fixed, but may be dynamic.
56
Complexity of AND/OR Tree Search
Space: O(n). Time: O(n·k^m), which is O(n·k^(w*·log n)) [Freuder & Quinn 85], [Collin, Dechter & Katz 91], [Bayardo & Miranker 95], [Darwiche 01], compared with O(k^n) for plain OR tree search. (k = domain size, m = depth of pseudo-tree, n = number of variables, w* = treewidth.)
57
Background: AND/OR search space
[Figure: a problem over X, Y, Z with domains {0,1,2} and disequality constraints involving Z; a pseudo tree with Z as root and X, Y as children, and the corresponding AND/OR tree; a chain pseudo tree and its OR tree.] To take advantage of the structure of the graphical model within importance sampling, we use the notion of AND/OR search spaces for graphical models. Given a graphical model, the AND/OR search space is driven by a pseudo tree which captures the problem decomposition; for example, given Z, X and Y are independent of each other. Given the pseudo tree, the associated AND/OR search tree has alternating levels of OR nodes and AND nodes. The OR nodes are circles labeled with variables, while the AND nodes are squares labeled with values and correspond to value assignments from the variables' domains. The root of the AND/OR search tree is an OR node labeled with the root of the pseudo tree T. Semantically, the OR states represent alternative solutions, whereas the AND states represent problem decomposition into independent sub-problems, all of which need to be solved. When the pseudo tree is a chain, it does not take advantage of decomposition, and the AND/OR search tree coincides with the regular OR search tree.
58
Example Bayesian network
[Tables: P(Z) with P(Z=0)=0.8, P(Z=1)=0.2; CPTs P(X|Z) and P(Y|Z); and the combined evidence functions f(Z), f(X,Z), f(Y,Z).] We will use this Bayesian network to explain the main idea in AND/OR importance sampling. Here we are interested in computing the probability of evidence, i.e. weighted counting; the evidence is A=0, B=0. When we instantiate the variable A to 0, the CPT on A becomes a function over X; similarly, when we instantiate B, the CPT on B becomes a function over Y. For simplicity, and to save space on the slides, we combine the CPTs mentioning X into a single function f(X,Z) and those involving Y into a single function f(Y,Z). The probability of evidence P(a,b) can then be computed by summing X, Y, Z out of the product of the three functions. Evidence: A=0, B=0.
59
Recap: Conventional Importance Sampling
Gogate and Dechter, UAI 2008, CP 2008. [Figure: the network over Z, X, Y with evidence nodes A, B.] The AND/OR idea: decompose this expectation. Let us quickly recap how conventional importance sampling converts the summation into an expectation: given a proposal distribution Q, we multiply and divide by Q, yielding the expectation shown. The main idea in AND/OR importance sampling is to decompose this expectation based on the structure of the Bayesian network.
60
AND/OR Importance Sampling (General Idea)
[Figure: the network over Z, X, Y with evidence nodes A, B, and the pseudo tree with Z as root and X, Y as children.] Decompose the expectation. Consider the expression for M. By applying simple symbolic manipulations, we can rewrite it so that the terms in brackets are conditional expectations: the first bracket gives the conditional expectation of X given Z, and the second gives the conditional expectation of Y given Z. These conditional expectations can also be read off the pseudo-tree structure.
61
AND/OR Importance Sampling (General Idea)
Estimate all the conditional expectations separately. How? Record all samples; for each sample that has Z=j, estimate the conditional expectations of X|Z and Y|Z using the samples corresponding to Z=j; then combine the results. So how do we use this decomposition for sampling? The idea is to compute the expectations separately: first, record all the samples; next, for each Z=j, estimate the conditional expectation at X and at Y separately using the generated samples; finally, combine the results.
62
AND/OR Importance Sampling
Gogate and Dechter, UAI 2008, CP 2008. [Figure: four samples over Z, X, Y arranged on an AND/OR sample tree; each arc is labeled with a pair <f/Q, count>, e.g. <1.6,2> and <0.4,2> at the Z level.] The operations can be expressed intuitively on an AND/OR search tree. Assume we have generated four samples and simply arrange them on the AND/OR tree. Each arc is labeled with a pair of numbers: the first is the ratio of f over Q, and the second keeps track of how many times the assignment was sampled. Coming back to the computation: at each OR node we estimate the conditional expectation of the variable at that node given the value of the variable ordered above it; this conditional expectation is simply the average of the children. For example, the conditional expectation of X given Z=0 is simply the average of the quantities below it. At each AND node we combine the results, namely take the product, based on the decomposed expression.
63
Example Bayesian network
[The example Bayesian network and its CPTs from the earlier slide, repeated; evidence A=0, B=0.]
64
Gogate and Dechter, UAI 2008, CP2008
AND/OR Importance Sampling. [Figure: the four samples arranged on the AND/OR sample tree with <f/Q, count> arc labels.] Putting these ideas together, we can compute a new estimate of M as follows. We arrange the samples on an AND/OR sample tree. At each AND node we compute a product, and at each OR node we compute a weighted average; the value of the root node is the estimate of the weighted counts. All AND nodes: separate components, operator = product. All OR nodes: conditional expectations given the assignment above, operator = average.
65
Algorithm AND/OR Importance Sampling
Gogate and Dechter, UAI 2008, CP 2008. Algorithm AND/OR Importance Sampling:
1. Generate samples x1,...,xN from Q along the ordering O.
2. Build an AND/OR sample tree for the samples x1,...,xN along O.
3. FOR all leaf nodes i of the AND/OR tree: IF AND node, v(i) = 1; ELSE v(i) = 0.
4. FOR every node n from the leaves to the root: IF AND node, v(n) = product of children; IF OR node, v(n) = average of children.
5. Return v(root node).
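A small illustrative sketch (not the paper's implementation) of this computation for the three-variable example with pseudo tree Z → {X, Y}; the functions fZ, fX, fY and proposal components qZ, qX, qY are hypothetical stand-ins for the example network's factors and proposal.

```python
from collections import defaultdict

def and_or_estimate(samples, fZ, fX, fY, qZ, qX, qY):
    """AND/OR importance-sampling estimate of M = sum_{z,x,y} fZ(z)*fX(x,z)*fY(y,z)
    for the pseudo tree Z -> {X, Y}: averages at OR nodes, products at AND nodes."""
    by_z = defaultdict(list)
    for z, x, y in samples:          # samples drawn i.i.d. from Q(z)Q(x|z)Q(y|z)
        by_z[z].append((x, y))
    N = len(samples)
    est = 0.0
    for z, group in by_z.items():
        # OR nodes X and Y under the AND node Z=z: conditional expectations
        # estimated only from the samples that share Z=z.
        vx = sum(fX(x, z) / qX(x, z) for x, _ in group) / len(group)
        vy = sum(fY(y, z) / qY(y, z) for _, y in group) / len(group)
        # AND node Z=z: product of the independent components;
        # OR node Z: frequency-weighted average over the sampled values of Z.
        est += (len(group) / N) * (fZ(z) / qZ(z)) * vx * vy
    return est
```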
66
# samples in AND/OR vs Conventional
Gogate and Dechter, UAI 2008, CP 2008. [Figure: the four samples on the AND/OR sample tree.] 8 samples in AND/OR space versus 4 samples in conventional importance sampling: for example, Z=0, X=2, Y=0 is not generated but is still considered in AND/OR space (see the red branches in the AND/OR tree). AND/OR sampling has lower error because it computes an average by considering more samples. To count the number of samples, remember that an OR node represents alternatives while an AND node stands for decomposition into independent sub-problems, each of which must be solved; so we sum over all alternatives at OR nodes and take a product over AND nodes.
67
Properties of AND/OR Importance Sampling
Gogate and Dechter, UAI 2008, CP 2008. An unbiased estimate of the weighted counts. The AND/OR estimate has lower variance than the conventional importance-sampling estimate. The variance reduction is easy to prove for the case of complete independence (Goodman, 1960) and more complicated to prove for the general conditional-independence case (Gogate's thesis and papers).
68
AND/OR w-cutset (Rao-Blackwellised) sampling
Gogate and Dechter, under review, CP 2009. Rao-Blackwellisation (Rao, 1963): partition X into K and R such that we can compute P(R|k) efficiently; sample from K and sum out R. The estimate averages the weighted counts conditioned on K=k_i. w-cutset sampling (Bidyuk and Dechter, 2003): select K such that the treewidth of R after removing K is bounded by w. We can even combine AND/OR sampling with Rao-Blackwellised w-cutset sampling, yielding further variance reduction. The idea in Rao-Blackwellisation is to combine exact inference with sampling, which reduces variance by the Rao-Blackwell theorem. Here the set of variables is partitioned into K and R such that it is easy to do exact inference on R given K, and the Rao-Blackwell estimate is as shown below. w-cutset sampling is an elegant implementation of Rao-Blackwellisation in which the set K is selected so that the treewidth of R given K is bounded by w; thus the exact inference step can be carried out in polynomial time (exponential only in w), yielding a very efficient practical scheme.
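The estimate itself appeared as an image on the slide; the following is a hedged reconstruction of the standard w-cutset (Rao-Blackwellised) importance-sampling estimate of the weighted counts, under the assumption that the inner sum is computed exactly:

$$
\hat{M} \;=\; \frac{1}{N}\sum_{i=1}^{N} \frac{\sum_{r} f\!\left(k^{(i)}, r\right)}{Q\!\left(k^{(i)}\right)},
$$

where $k^{(1)},\dots,k^{(N)}$ are i.i.d. samples of the cutset variables $K$ drawn from $Q$, and the sum over $r$ (the remaining variables $R$) is computed exactly by inference whose complexity is exponential only in $w$.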
69
AND/OR w-cutset sampling
Gogate and Dechter, under review, CP 2009. Perform AND/OR tree or graph sampling on K and exact inference on R. Orthogonal approaches. Theorem: combining AND/OR sampling and w-cutset sampling yields further variance reduction. [Figure: a graphical model over A–G, its full pseudo tree, and pseudo trees (OR and AND/OR) restricted to the cutset variables.] The main idea in AND/OR w-cutset sampling is to perform AND/OR sampling on the w-cutset variables K and exact inference on R. The main point is that AND/OR sampling is orthogonal to w-cutset Rao-Blackwellised sampling, and the two can be combined to yield further variance reduction.
70
From Search Trees to Search Graphs
Any two nodes that root identical subtrees (subgraphs) can be merged
72
Merging Based on Context
One way of recognizing nodes that can be merged: context(X) = the ancestors of X in the pseudo tree that are connected to X, or to descendants of X. [Figure: a pseudo tree over A–F with the contexts [ ], [A], [AB], [AB], [BC], [AE] annotated at its nodes.] If two nodes root the same subtree, they can be merged; the minimal AND/OR graph is obtained when all redundancy is removed. A sketch of computing contexts follows below.
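A small sketch (an illustration, not code from the paper) of computing the context from the definition above, assuming the pseudo tree is given as a parent map and the primal graph as a neighbor map:

```python
def pseudo_tree_context(x, parent, neighbors):
    """Context of variable x: ancestors of x in the pseudo tree that are connected
    (in the primal graph) to x or to one of x's descendants.

    parent    : {variable: pseudo-tree parent, None for the root}
    neighbors : {variable: list of primal-graph neighbors}
    (both are assumed representations chosen for this sketch)."""
    # collect x and all of its pseudo-tree descendants
    children = {}
    for v, p in parent.items():
        children.setdefault(p, []).append(v)
    stack, subtree = [x], set()
    while stack:
        v = stack.pop()
        subtree.add(v)
        stack.extend(children.get(v, []))
    # walk up the ancestor chain, keeping ancestors linked to the subtree
    context, a = [], parent[x]
    while a is not None:
        if any(n in subtree for n in neighbors.get(a, [])):
            context.append(a)
        a = parent[a]
    return context
```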
73
AND/OR Search Graph Constraint Satisfaction – Counting Solutions
[Figure: the constraint relations, the pseudo tree over A–F with contexts [ ], [A], [AB], [AB], [BC], [AE], and the context-minimal AND/OR search graph with a cache table for D.] Defined by a variable ordering; fixed, but may be dynamic.
74
How Big Is the Context? Theorem: the maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree. [Figure: a pseudo tree over the variables C, K, H, A, B, E, J, L, N, O, D, P, M, F, G with the context of each node annotated; max context size = treewidth.]
75
Complexity of AND/OR Graph Search
Space and time: O(n·k^w*) for AND/OR graph search, versus O(n·k^pw*) for OR graph search. (k = domain size, n = number of variables, w* = treewidth, pw* = pathwidth; w* ≤ pw* ≤ w*·log n.)
76
AND/OR Graphs. [Figure: a problem over A, B, C, D with domains {0,1,2} and disequality constraints; its full AND/OR tree and the merged AND/OR graph.] The figure shows a full AND/OR tree corresponding to the pseudo tree. Node A does not depend on C, so the subtrees under C=0 and C=1 are identical with respect to A; therefore they can be merged, yielding an AND/OR graph that is even more compact and uses more conditional independencies than an AND/OR tree.
77
AND/OR graph sampling. [Figure: an OR sample tree, an AND/OR sample tree, and the corresponding AND/OR sample graph for the problem over A, B, C, D with domains {0,1,2}.] We can apply the same idea and convert an AND/OR sample tree into an AND/OR sample graph by merging all identical subtrees. This yields a new structure on which we can perform the same computations, namely averaging at OR nodes and taking products at AND nodes, to obtain a new estimate, and this new estimate has lower variance.
78
Variance Hierarchy and Complexity
Variance hierarchy and complexity (time, space):
IS: O(nN), O(1)
w-cutset IS: O(cN + (n-c)N·exp(w)), O(1)
AND/OR tree IS: O(nN), O(h)
AND/OR graph IS: O(nN·w*), O(nN)
AND/OR w-cutset tree IS: O(cN + (n-c)N·exp(w)), O(h + (n-c)·exp(w))
AND/OR w-cutset graph IS: O(cN·w* + (n-c)N·exp(w)), O(cN + (n-c)·exp(w))
79
Experiments. Benchmarks: linkage analysis, graph coloring, grids.
Algorithms: OR tree sampling, AND/OR tree sampling, AND/OR graph sampling, and the w-cutset versions of the three schemes above.
80
Results: Probability of Evidence Linkage instances (UAI 2006 evaluation)
Time Bound: 1hr
81
Here we see that the scheme that employs the most decomposition, AND/OR w-cutset graph sampling, yields the best performance.
82
Summary: AND/OR Importance sampling
AND/OR sampling: a general scheme to exploit conditional independence in sampling. Theoretical guarantees: lower sampling error than conventional sampling. The variance reduction is orthogonal to Rao-Blackwellised sampling. Better empirical performance than conventional sampling.
83
Conclusion. Effective sampling in the presence of determinism, addressing the rejection and non-convergence problems: SampleSearch manages rejection while sampling; SampleSearch-SIR and SampleSearch-MH are convergent MCMC sampling schemes; lower-bounding schemes extend the Markov inequality to multiple samples; AND/OR importance sampling is a post-processing step that uses conditional independence to reduce the variance of the estimates.
84
Mixed Networks (Mateescu and Dechter, 2004)
[Figure: a belief network and a constraint network over A–F, with an example CPT P(D|B,C), and the moral mixed graph.] We often have both probabilistic information and constraints. Our approach is to allow the two representations to co-exist explicitly; it has virtues for the user interface and for computation. Complex CNF queries, e.g. P((A ∨ B) ∧ (¬C ∨ D)).
85
Transportation Planning: Graphical model
[Figure: a dynamic Bayesian network with, at each time step t, nodes d_t, w_t, g_t, F_t, r_t, v_t, l_t, y_t linked to the previous time step.] D: time of day (discrete). W: day of week (discrete). G: collection of locations where the person spends a significant amount of time (discrete). F: counter. Route: a hidden variable that predicts what path the person takes (discrete). Location: a pair (e, d), where e is the edge on which the person is and d is the distance of the person from one of the end-points of the edge (continuous). Velocity: continuous. GPS reading: (lat, lon, spd, utc). Here is an example dynamic model of the car travel activity routines of individuals.