SampleSearch: Importance Sampling in the presence of Determinism


SampleSearch: Importance Sampling in the presence of Determinism. Rina Dechter, Bren School of Information and Computer Sciences, University of California, Irvine. Graphical models are the most popular knowledge representation schemes in AI and computer science for representing and reasoning with probabilistic and deterministic relationships. Exact reasoning is hard when the graph is dense (the treewidth is high), so approximation schemes must be used at some point. Approximation with guarantees is also hard, but not as difficult when the information is purely probabilistic, with no strict constraints. However, when some of the knowledge involves constraints, approximation is hard as well; in particular, sampling schemes become hard. In this talk we address this issue of sampling in the presence of hard constraints. I will present SampleSearch, which interleaves sampling and search, and analyze the implications of the idea for importance sampling. We will identify the rejection problem, show how it can be handled, and when doing so is cost-effective. Empirical evaluation will demonstrate how cost-effective the scheme is. I will then move to the orthogonal approach of exploiting problem structure in sampling and show our strategy. Joint work with Vibhav Gogate.

Overview. Introduction: Mixed graphical models. SampleSearch: Sampling with searching. Exploiting structure in sampling: AND/OR importance sampling. Here is a brief outline of my talk. I'll start by describing some preliminaries and motivating applications. Then I'll describe the main problem investigated in this work, namely the poor performance of sampling-based algorithms in the presence of determinism, and then its main contributions.

Bayesian Networks: Representation (Pearl, 1988). (Slide: the Smoking / lung Cancer / Bronchitis / X-ray / Dyspnoea network with CPTs P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B).) Bayesian networks are a compact representation of a joint distribution; here we assume only discrete distributions. Given a directed acyclic graph over a set of random variables, the joint distribution over all the variables can be expressed as a product of conditional distributions (CPTs), one per variable X conditioned on its parents in the graph: P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B). A typical query is belief updating, e.g. P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
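As a worked form of the belief-updating query (this is just the definition of conditional probability applied to the factorization above, not a formula shown on the slide):

$$P(C{=}\mathrm{yes}\mid S{=}\mathrm{no}, D{=}\mathrm{yes}) = \frac{\sum_{B,X} P(S{=}\mathrm{no})\,P(C{=}\mathrm{yes}\mid S{=}\mathrm{no})\,P(B\mid S{=}\mathrm{no})\,P(X\mid C{=}\mathrm{yes},S{=}\mathrm{no})\,P(D{=}\mathrm{yes}\mid C{=}\mathrm{yes},B)}{\sum_{C,B,X} P(S{=}\mathrm{no})\,P(C\mid S{=}\mathrm{no})\,P(B\mid S{=}\mathrm{no})\,P(X\mid C,S{=}\mathrm{no})\,P(D{=}\mathrm{yes}\mid C,B)}.$$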

Mixed Networks: Mixing Belief and Constraints. (Slide: a Bayesian network and a constraint network over variables A–F, with a CPT for D and the constraint relations.) The focus is on graphical models having determinism, and mixed networks are a unifying framework for such graphical models. A graphical model captures knowledge using a collection of functions whose interaction can be captured by a graph representation. Bayesian networks allow reasoning under conditions of uncertainty; a Bayesian network is a compact representation of a joint probability distribution. Given a directed acyclic graph over random variables, the joint distribution can be expressed as a product of conditional probability tables (CPTs), one per variable conditioned on its parents in the network. Constraint networks are deterministic graphical models that contain a set of variables, a set of values for each variable, and constraints which restrict the assignments that can be made to the variables. A solution is an assignment to all variables such that all constraints are satisfied. Mixed networks are graphical models which augment a probabilistic graphical model having determinism with a constraint network. The constraint network is basically used to model zero probabilities as constraints; the constraints can also be specified externally. For example, the zero probabilities in the Bayesian network can be modeled as constraints as shown. The central aim is to exploit the power of constraint processing techniques for efficient probabilistic inference. Constraints can be specified externally or may occur as zeros in the belief network.

Mixed networks: Distribution and Queries. The distribution represented by a mixed network T = (B, R). Queries: weighted counting (equivalent to P(e), the partition function, and solution counting) and marginal distributions. The distribution defined by a mixed network is the distribution represented by the Bayesian network projected onto the solutions of the constraint portion; it is normalized by a constant M. Evidence is a special type of unary constraint. The common queries over the mixed network are computing the normalization constant M and the marginal distribution over a variable. M is the same as P(e), the partition function, and the solution count. Two queries: marginals and counting. The process of answering these queries is called inference.
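The slide's formulas are not preserved in the transcript; a reconstruction from the surrounding description, writing sol(R) for the set of solutions of the constraint network R, is:

$$P_T(x) = \begin{cases} \dfrac{1}{M}\prod_{i} P(x_i \mid x_{\mathrm{pa}(i)}) & \text{if } x \in \mathrm{sol}(R),\\ 0 & \text{otherwise,} \end{cases} \qquad M = \sum_{x \in \mathrm{sol}(R)} \prod_{i} P(x_i \mid x_{\mathrm{pa}(i)}).$$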

Applications. Determinism: more ubiquitous than you may think! Transportation planning (Liao et al. 2004, Gogate et al. 2005): predicting and inferring the car-travel activity of individuals. Genetic linkage analysis (Fishelson and Geiger, 2002): associate the functionality of genes with their location on chromosomes. Functional/software verification (Bergeron, 2000): generating random test programs to check the validity of hardware. First-order probabilistic models (Domingos et al. 2006, Milch et al. 2005): citation matching. Graphical models that have determinism occur quite frequently in many real-world applications. One example is transportation planning, in which, given GPS data about a person, the problem is to infer and predict the travel activity routines of that individual. Another application is in computational biology, specifically genetic linkage analysis, in which, given a chromosome and some marker locations on it, the task is to figure out whether DNA near the markers affects a particular disease. A third application is functional or software verification, in which you are given a circuit or a piece of code and the task is to generate test programs that check whether the circuit or the software conforms to the specification. A final application is first-order probabilistic models, which have a lot of determinism and are used for tasks such as citation matching or figuring out author–author relationships.

Functions on orange nodes are deterministic. (Slide: pedigree Bayesian-network fragments for locus 1 and locus 2, with locus variables Lijm/Lijf, phenotype variables Xij, and selector variables Sijm/Sijf.)

Bayesian Network for Recombination. (Slide: a Bayesian network over locus 1 and locus 2 for three people, with some edges encoding deterministic relationships and others probabilistic ones; the query is P(e | Θ).) A Bayesian network for 3 people and two loci is given here. Theta is the parameter controlling the distance between the two markers. The linkage instances are interesting as graphical models because they involve both probabilistic and deterministic relationships.

Linkage analysis: 6 people, 3 markers. (Slide: the pedigree Bayesian network over loci 1–3 for six individuals, with locus, phenotype, and selector variables.) Here is an example of using Bayesian networks for linkage analysis. It models the genetic inheritance in a family of 6 individuals relative to some genes of interest. The task is to find a disease gene on a chromosome. This domain yields very hard probabilistic networks that contain both probabilistic information and deterministic relationships, and it drives many of the methods that we currently develop.

Approximate Inference. Approximations are hard with determinism. A randomized polynomial ε-approximation is possible when no zeros are present (Karp 1993, Cheng 2001); ε-approximation is NP-hard in the presence of zeros. Gibbs sampling is problematic when the MCMC chain is not ergodic. Current remedies: replace zeros with very small values (Laplace correction: Naive Bayes, NLP), which gives bad performance when the zeros or determinism are real! The goal is to perform effective sampling-based approximate inference in the presence of determinism. Compared with exact inference, which in fact becomes easier when determinism is present in the graphical model, approximate inference rather counter-intuitively becomes harder. In particular, there exists a randomized polynomial-time algorithm that bounds the marginal probabilities with relative error epsilon when there is no determinism. However, when determinism is present, the problem of bounding marginal probabilities with relative error epsilon suddenly becomes NP-hard. Other approaches in the machine learning and NLP communities tackle zeros using the so-called Laplace correction, namely replacing all zeros with very small values and continuing. This works quite well when the determinism is not real, for instance with missing data; however, when the determinism is real, the Laplace correction can yield very bad performance.

Overview. Introduction: Mixed graphical models. SampleSearch: Sampling with searching — the rejection problem; recovery and analysis; empirical evaluation. Exploiting structure in sampling: AND/OR importance sampling. Here is a brief outline of my talk. I'll start by describing some preliminaries and motivating applications. Then I'll describe the main problem investigated in this work, namely the poor performance of sampling-based algorithms in the presence of determinism, and then its main contributions.

Importance Sampling: Overview. Given a proposal or importance distribution Q(z) such that f(z) > 0 implies Q(z) > 0, rewrite the sum as an expectation with respect to Q; given i.i.d. samples z1, ..., zN from Q(z), estimate it by a sample average. Importance sampling is a general technique which can be used to approximate any sum of the form M = Σz f(z), where f is a function over a set of variables. The idea is to multiply and divide by a distribution Q(z) and transform the sum into an expectation with respect to Q. Q is often called the proposal distribution or the importance distribution, which is why this scheme is called importance sampling. Q can be any distribution as long as it satisfies the condition above; however, for effective performance Q should be as close as possible to the posterior distribution of the mixed network. One then generates samples z1 to zN from Q and estimates the sum by an average over the samples.
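The formulas on this slide were lost in the transcript; the standard importance-sampling rewrite they refer to is:

$$M = \sum_{z} f(z) = \sum_{z} Q(z)\,\frac{f(z)}{Q(z)} = \mathbb{E}_{Q}\!\left[\frac{f(z)}{Q(z)}\right], \qquad \hat{M}_N = \frac{1}{N}\sum_{i=1}^{N} \frac{f(z^i)}{Q(z^i)}.$$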

How to generate i.i.d. samples from Q(X) We have n variables

Generating i.i.d. samples from Q. (Slide: a probability tree over variables A, B, C; each branch is labeled with its probability, e.g. Q(A=0)=0.8, Q(B=1|A=0)=0.6, Q(C=1|A=0,B=1)=0.8.) We assume that Q is discrete and can be expressed in a product form along an ordering of the variables. Also, we assume that Q can be specified in polynomial space, namely that each component of Q involves only a polynomial number of variables. Once Q is expressed in product form, we can represent it by a probability tree as shown. The probability tree here is over three variables, with one variable at each level. Each branch in the tree is labeled with the probability of reaching that branch given its parent. To sample from this probability tree, one simply selects each branch with the appropriate probability until a leaf node is reached. The probability of reaching a leaf is the product of the probabilities on the path from the root to that leaf; for example, the probability of sampling the leaf C=1 is 0.8 × 0.6 × 0.8.
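A minimal sketch of this ancestral-sampling procedure in Python. The ordering A, B, C and the probabilities mentioned on the slide are used where available; the remaining branch values are illustrative assumptions, as are all helper names:

```python
import random

# Product-form proposal Q along the ordering A, B, C: Q(A), Q(B|A), Q(C|A,B).
# Values not given on the slide are filled in only for illustration.
Q_A = {0: 0.8, 1: 0.2}
Q_B_given_A = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.8, 1: 0.2}}
Q_C_given_AB = {(0, 0): {0: 0.8, 1: 0.2}, (0, 1): {0: 0.2, 1: 0.8},
                (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.8, 1: 0.2}}

def sample_from(dist):
    """Draw one value from a {value: probability} table."""
    r, acc = random.random(), 0.0
    for value, p in dist.items():
        acc += p
        if r <= acc:
            return value
    return value  # guard against floating-point rounding

def sample_Q():
    """One i.i.d. sample (a, b, c) from Q together with its probability Q(a, b, c)."""
    a = sample_from(Q_A)
    b = sample_from(Q_B_given_A[a])
    c = sample_from(Q_C_given_AB[(a, b)])
    q = Q_A[a] * Q_B_given_A[a][b] * Q_C_given_AB[(a, b)][c]
    return (a, b, c), q

print([sample_Q() for _ in range(3)])
```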

Rejection Problem. To sample A, toss a biased coin with P(H)=0.8, P(T)=0.2; say we get H, so A=0 is selected. (Slide: the probability tree over A, B, C with the A branches highlighted.)

Rejection Problem. Next, to sample B given A=0, toss a biased coin with P(H)=0.4, P(T)=0.6; say we get a head, so B=0 is selected. (Slide: the subtree under A=0.)

Rejection Problem. (Slide: the sampled path A=0, B=0 and the C branches below it.) A large number of the assignments generated this way will be rejected, i.e., thrown away.

Rejection Problem. Constraints: A≠B, A≠C. (Slide: the proposal tree; blue leaves correspond to solutions, i.e. f(x) > 0, and red leaves to non-solutions, i.e. f(x) = 0.) Unfortunately, in the presence of constraints or determinism, the proposal distribution may suffer from the rejection problem, in that it may generate samples which are inconsistent, i.e., not solutions of the constraint portion. In the tree, all blue leaves are solutions while all red leaves are non-solutions, and as we can see there is a positive probability of generating a non-solution. To understand why this is a problem, look at the importance-sampling expression: f is the true function while Q is the proposal distribution. If a sample x^i is not a solution, its f-value is zero and the ratio is zero. We call such samples, whose weight is zero, rejected samples. If we generate only rejected samples, our estimate M̂ will be zero.

Revising Q to the backtrack-free distribution: QF(branch) = 0 if there are no solutions under it; QF(branch) = 1 if there are no solutions under the other branch; QF(branch) = Q(branch) otherwise. Constraints: A≠B, A≠C. (Slide: the tree with revised branch probabilities; branches leading to no solution are set to 0 and their siblings to 1. Blue leaves are solutions, f(x) > 0; red leaves are non-solutions, f(x) = 0.) To remedy the rejection problem we can try to modify Q so that it is easy to sample from and has the same set of solutions (the same non-zero tuples) as f, the true function. It is easy to see that if we modify Q so that it respects all constraints in the graphical model, we will never generate inconsistent samples or non-solutions. In constraint-satisfaction jargon, this process of generating a solution without backtracking is called making the constraint network backtrack-free; so we call our modification of the proposal, which always generates samples that are solutions, the backtrack-free distribution. It is defined as shown: at each branch, we check whether there are solutions under that branch, and if there are, we keep the branch as it is.

Generating samples from QF. QF(branch) = 0 if there are no solutions under it; QF(branch) = 1 if there are no solutions under the other branch; QF(branch) = Q(branch) otherwise. Constraints: A≠B, A≠C. (Slide: the tree annotated with "solution?" queries at each branch.) Solution: invoke an oracle at each branch; the oracle returns True if there is a solution under the branch and False otherwise.

Generating samples from QF (Gogate et al., UAI 2005; Gogate and Dechter, UAI 2005). Constraints: A≠B, A≠C. Oracles in practice: adaptive consistency as a pre-processing step, or a complete CSP solver. Both are too costly: adaptive consistency is O(exp(treewidth)), and the solver must be invoked O(n·d) times for each sample. (Slide: the backtrack-free tree with the consistent path A=0, B=1, C=1.)

Approximations of QF (Gogate et al., UAI 2005; Gogate and Dechter, UAI 2005). Use i-consistency instead of adaptive consistency: O(n^i) time and space complexity, and it identifies some zeros so that they are never sampled. Cons: too weak when the constraint portion is hard. (Slide: a spectrum from no processing, through polynomial consistency enforcement, to NP-hard exponential time and space, trading off the percentage of zeros identified against the rejection rate.)

Algorithm SampleSearch (Gogate and Dechter, AISTATS 2007, AAAI 2007). Constraints: A≠B, A≠C. (Slide: the proposal tree over A, B, C.)

Algorithm SampleSearch (continued). Constraints: A≠B, A≠C. (Slide: SampleSearch samples values down the tree; when a sampled value violates a constraint, its branch probability is set to 0 and the sibling's to 1.)

Algorithm SampleSearch: resume sampling. Constraints: A≠B, A≠C. (Slide: under A=0 the branch B=0 has been pruned, B=1 now has probability 1, and sampling resumes from there.)

Algorithm SampleSearch: constraint violated — repeat until f(sample) > 0. Constraints: A≠B, A≠C. (Slide: C=0 violates A≠C, so it is pruned and C=1 gets probability 1; the process continues until a consistent sample with f(sample) > 0 is produced.)

Generate more samples (Gogate and Dechter, AISTATS 2007, AAAI 2007). Constraints: A≠B, A≠C. (Slide: the next sample starts again from the original proposal tree.)

Generate more samples. Constraints: A≠B, A≠C. (Slide: the new sample again prunes branches that violate constraints as it proceeds.)

Traces of SampleSearch (Gogate and Dechter, AISTATS 2007, AAAI 2007). Constraints: A≠B, A≠C. (Slide: four traces of SampleSearch; each trace records the path explored, including pruned branches such as B=0 and C=0, and ends in the consistent sample A=0, B=1, C=1.)
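A minimal sketch of the SampleSearch loop for this toy example, assuming the ordering A, B, C, the constraints A≠B and A≠C, and the branch probabilities used earlier. The helper names are illustrative, not from the original code, and unspecified branch probabilities are assumptions:

```python
import random

ORDER = ["A", "B", "C"]

def q_table(var, assignment):
    """Proposal Q(var | earlier variables); probabilities follow the slides where
    given, remaining branches are illustrative."""
    if var == "A":
        return {0: 0.8, 1: 0.2}
    if var == "B":
        return {0: 0.4, 1: 0.6} if assignment["A"] == 0 else {0: 0.8, 1: 0.2}
    return {0: 0.2, 1: 0.8}  # C

def consistent(assignment):
    """Hard constraints of the example: A != B and A != C."""
    a = assignment["A"]
    return all(assignment[v] != a for v in ("B", "C") if v in assignment)

def draw(dist):
    """Draw a value from an (unnormalized) {value: weight} table."""
    r, acc = random.random() * sum(dist.values()), 0.0
    for value, weight in dist.items():
        acc += weight
        if r <= acc:
            return value
    return value

def sample_search():
    """One run of SampleSearch: sample along ORDER and, whenever the partial
    assignment violates a constraint, prune that value and resample the variable.
    In this toy problem a consistent extension always exists at each step, so no
    deeper backtracking is shown; a full implementation also backtracks to
    earlier variables."""
    assignment = {}
    for var in ORDER:
        dist = dict(q_table(var, assignment))
        while True:
            value = draw(dist)
            assignment[var] = value
            if consistent(assignment):
                break
            del dist[value]        # constraint violated: prune this branch
            del assignment[var]    # ... and resume sampling this variable
    return assignment              # always a solution of the constraint portion

print([sample_search() for _ in range(3)])
```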

SampleSearch: Sampling Distribution (Gogate and Dechter, AISTATS 2007, AAAI 2007). Problem: due to the search, the samples are no longer i.i.d. from Q. Theorem: SampleSearch generates i.i.d. samples from the backtrack-free distribution. Now that we know how to always generate consistent samples, we are not done yet, because our task is to estimate the probability of evidence, i.e., the normalization constant M. If we used Q to weight the samples, the estimate would be incorrect, because SampleSearch does not generate samples from Q due to its backtracking search component. Ironically, we can prove that SampleSearch generates samples from the backtrack-free distribution, which, as we already saw, is quite expensive to compute.

The sampling distribution QF of SampleSearch (Gogate and Dechter, AISTATS 2007, AAAI 2007). Constraints: A≠B, A≠C. What is the probability of generating A=0? QF(A=0)=0.8. Why? SampleSearch is systematic. What is the probability of generating (A=0, B=1)? QF(B=1|A=0)=1. Why? SampleSearch is systematic. What is the probability of generating (A=0, B=0)? Simple: QF(B=0|A=0)=0, since all samples generated by SampleSearch are solutions. This is exactly the backtrack-free distribution.

Asymptotic approximations of QF (Gogate and Dechter, AISTATS 2007, AAAI 2007). QF(branch) = 0 if there are no solutions under it; QF(branch) = Q(branch) otherwise. A "hole" is a branch whose consistency is not yet known. IF hole THEN the upper approximation sets UF = Q (i.e., as if there is a solution at the other branch) and the lower approximation sets LF = 0 (i.e., as if there is no solution at the other branch). Let us assume that SampleSearch traversed the trace shown in the figure to sample the blue solution. To know the exact backtrack-free distribution at each branch point, we have to determine whether there is a solution on the other branch. If there is no solution on the other branch, then by the definition of the backtrack-free distribution we set the probability of that branch to zero; otherwise we keep it the same. In its trace, SampleSearch has determined that some branches lead to solutions or to non-solutions, but it has not determined the consistency of other branches — these are the holes (question marks) in the figure. A simple approximation suggests itself: set each hole either to Q or to 0, corresponding to the definition of the backtrack-free distribution.

Approximations: Convergence in the limit (Gogate and Dechter, AISTATS 2007, AAAI 2007). Store all possible traces. (Slide: three traces rooted at A=0, each with some branches resolved and some holes "?" remaining.)

Approximations: Convergence in the limit (Gogate and Dechter, AISTATS 2007, AAAI 2007). From the combined sample tree, update U and L: IF hole THEN UFN = Q and LFN = 0. (Slide: the combined sample tree over the stored traces.)

Improving naive SampleSearch: the IJGP-wc-SS algorithm (Gogate and Dechter, AISTATS 2007, AAAI 2007). Better search strategy: can use any state-of-the-art CSP/SAT solver, e.g. MiniSat (Eén and Sörensson, 2006). Better proposal distribution: use the output of IJGP, a generalized belief propagation scheme, to compute the initial importance function. w-cutset importance sampling (Bidyuk and Dechter, 2007): reduce variance by sampling from a subspace.

Experiments. Tasks: weighted counting, marginals. Benchmarks: satisfiability problems (counting solutions), linkage networks, relational instances (first-order probabilistic networks), grid networks, logistics planning instances. Algorithms: IJGP-wc-SS/LB and IJGP-wc-SS/UB; IJGP-wc-IS (vanilla algorithm that does not perform search); SampleCount (Gomes et al., 2007, SAT); ApproxCount (Wei and Selman, 2007, SAT); EPIS (Yuan and Druzdzel, 2006); RELSAT (Bayardo and Pehoushek, 2000, SAT); Edge Deletion Belief Propagation (Choi and Darwiche, 2006); Iterative Join Graph Propagation (Dechter et al., 2002); Variable Elimination and Conditioning (VEC).

Results: Probability of Evidence. Linkage instances (UAI 2006 evaluation). Time bound: 3 hrs. M: number of samples generated within the time bound; Z: probability of evidence.

Results: Probability of Evidence Relational instances (UAI 2008 evaluation) Time Bound: 10 hrs M: number of samples generated in 10 hrs Z: Probability of Evidence

Results: Solution Counts. Latin Square instances (size 8 to 16). Here I'll explain the methodology: each algorithm was run for the same amount of time (10 hrs). Also explain what n, k, c and w stand for. If exact results are not known, we report the lower bound; the higher the lower bound, the better the particular scheme. Also, ApproxCount does not output a lower bound. Time bound: 10 hrs. M: number of samples generated in 10 hrs; Z: solution counts.

Results: Solution Counts Langford instances Time Bound: 10 hrs M: number of samples generated in 10 hrs Z: Solution Counts

Results on Marginals. Evaluation criterion: the Hellinger distance. It is always bounded between 0 and 1 and lower-bounds the KL distance; when probabilities close to zero are present, the KL distance may tend to infinity.
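For reference, the Hellinger distance between two discrete marginals can be computed as below (a generic sketch; the exact averaging over variables used in the experiments is not spelled out on the slides):

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete distributions given as
    {value: probability} dicts over the same domain; always in [0, 1]."""
    values = set(p) | set(q)
    s = sum((math.sqrt(p.get(v, 0.0)) - math.sqrt(q.get(v, 0.0))) ** 2 for v in values)
    return math.sqrt(s / 2.0)

# Example: exact vs. approximate marginal of a binary variable
print(hellinger({0: 0.3, 1: 0.7}, {0: 0.25, 1: 0.75}))
```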

Results: Posterior Marginals Linkage instances (UAI 2006 evaluation) Time Bound: 3 hrs Table shows the Hellinger distance (∆) and Number of samples: M

Summary: SampleSearch. It manages the rejection problem while sampling. Its sampling distribution is the backtrack-free distribution QF. Approximating QF by storing all traces yields an asymptotically unbiased estimator with linear time and space overhead. The weighted counts can be bounded from above and below. Empirically, when a substantial number of zero probabilities are present, SampleSearch dominates.

AND/OR Importance Sampling: using structure within importance sampling. Our final contribution is AND/OR importance sampling, which utilizes structure within importance sampling.

Overview Introduction: Mixed graphical models SampleSearch: Sampling with Searching Exploiting structure in sampling: AND/OR Importance sampling Here is a brief outline of my talk. I’ll start by describing some preliminaries and motivating applications. Then I’ll describe the main problem which we investigated in this thesis which is that of poor performance of sampling based algorithms in presence of determinism. Then I’ll describe the four main contributions of this thesis.

Motivation. Given Q(z), importance sampling totally disregards the structure of f(z) while approximating it. Consider two sums, both over the four variables {1, 2, 3, 4} but with different structure, the second being sparser than the first. Given a proposal distribution Q(z), importance sampling totally disregards the structure of f(z) while approximating M. Intuitively, we can expect the second sum, which has a much sparser graph structure, to be easier to approximate than the first, and we will build on this intuition to generalize importance sampling.
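The two sums shown on the slide are not recoverable from the transcript; as an illustration of the intended contrast (a hypothetical pair, not the slide's own), compare

$$M_1 = \sum_{z_1,z_2,z_3,z_4} f_1(z_1,z_2)\,f_2(z_2,z_3)\,f_3(z_3,z_4)\,f_4(z_4,z_1) \quad\text{with}\quad M_2 = \sum_{z_1,z_2,z_3,z_4} g_1(z_1,z_2)\,g_2(z_1,z_3)\,g_3(z_1,z_4),$$

where in the second sum, once $z_1$ is fixed, the summations over $z_2$, $z_3$ and $z_4$ decompose into independent pieces.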

OR Search Tree (Constraint Satisfaction – Counting Solutions). (Slide: a constraint problem over A–F with relations RABC, RBCD, RABE, RAEF, and the full OR search tree along the ordering A, E, C, B, F, D.) Defined by a variable ordering, fixed but may be dynamic.

AND/OR Search Tree (Constraint Satisfaction – Counting Solutions). (Slide: the same constraint problem with relations RABC, RBCD, RABE, RAEF; a DFS tree over A, B, C, D, E, F guides the AND/OR search, alternating variable (OR) nodes and value-assignment (AND) nodes A=a, B=b, C=c, D=d, E=e, F=f.)

AND/OR Search Tree (Constraint Satisfaction – Counting Solutions). (Slide: the pseudo tree for the problem and the corresponding AND/OR search tree, with alternating OR levels for the variables A, B, C, E, D, F and AND levels for their values 0/1.) Defined by a variable ordering, fixed but may be dynamic.

AND/OR Tree. OR nodes: case analysis on variable values. AND nodes: decomposition into independent subproblems. (Slide: the pseudo tree and the AND/OR search tree with alternating OR and AND levels.) Defined by a variable ordering, fixed but may be dynamic.

Complexity of AND/OR Tree Search. Space: O(n). Time: O(n k^m), which is O(n k^{w* log n}) [Freuder & Quinn 85], [Collin, Dechter & Katz 91], [Bayardo & Miranker 95], [Darwiche 01], compared with O(k^n) for plain OR tree search. Here k = domain size, m = depth of the pseudo tree, n = number of variables, w* = treewidth.

Background: AND/OR search space. (Slide: a problem over X, Y, Z with domains {0,1,2} and ≠ constraints; the pseudo tree with Z at the root and X, Y as children; the corresponding AND/OR tree; and the OR tree obtained from a chain pseudo tree.) To take advantage of the structure of the graphical model within importance sampling, we will use the notion of AND/OR search spaces for graphical models. Given a graphical model, the AND/OR search space is driven by a pseudo tree which captures the problem decomposition; for example, given Z, X and Y are independent of each other. Given the pseudo tree, the associated AND/OR search tree has alternating levels of OR nodes and AND nodes. The OR nodes are circles labeled with variables, while the AND nodes are squares labeled with values and correspond to value assignments from the domains of the variables. The root of the AND/OR search tree is an OR node labeled with the root of the pseudo tree T. Semantically, the OR states represent alternative solutions, whereas the AND states represent problem decomposition into independent sub-problems, all of which need to be solved. When the pseudo tree is a chain, it does not take advantage of decomposition, and the AND/OR search tree coincides with the regular OR search tree.

Example Bayesian network. (Slide: a Bayesian network over Z, X, Y with evidence children A of X and B of Y; its CPTs P(Z), P(X|Z), P(Y|Z) and the evidence functions over X and Y obtained by instantiating A=0 and B=0.) We will use the given Bayesian network to explain the main idea in AND/OR importance sampling. Here we are interested in computing the probability of evidence, i.e. weighted counting; the evidence is A=0, B=0. When we instantiate the variable A to 0, the CPT on A becomes a function over X; similarly, when we instantiate B, the CPT on B becomes a function over Y. For simplicity and to save space on the slides, we combine the CPTs mentioning X into a single function f(XZ) and those mentioning Y into a single function f(YZ). The probability of evidence P(a,b) can then be computed by summing X, Y, Z out of the product of the three functions.

Recap: Conventional Importance Sampling (Gogate and Dechter, UAI 2008, CP 2008). AND/OR idea: decompose this expectation. Let us quickly recap how conventional importance sampling converts the summation into an expectation: given a proposal distribution Q, we multiply and divide by Q, yielding an expectation with respect to Q. The main idea in AND/OR importance sampling is to decompose this expectation based on the structure of the Bayesian network.

AND/OR Importance Sampling (General Idea). Pseudo tree: Z at the root with children X and Y. Decompose the expectation: consider the expression for M. By applying simple symbolic manipulations, we can rewrite the expression as shown. The terms in brackets are nothing but conditional expectations: the first bracket gives the conditional expectation of the X-term given Z, and the second gives the conditional expectation of the Y-term given Z. The conditional expectations can also be read off by following the pseudo-tree structure.
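The rewritten expression is not preserved in the transcript; a reconstruction for this example, assuming Q factors along the pseudo tree as Q(Z) Q(X|Z) Q(Y|Z) and f factors as f(Z) f(X,Z) f(Y,Z) as set up earlier, is:

$$M = \sum_{Z} Q(Z)\,\frac{f(Z)}{Q(Z)}\left[\sum_{X} Q(X\mid Z)\,\frac{f(X,Z)}{Q(X\mid Z)}\right]\left[\sum_{Y} Q(Y\mid Z)\,\frac{f(Y,Z)}{Q(Y\mid Z)}\right] = \mathbb{E}_{Q(Z)}\!\left[\frac{f(Z)}{Q(Z)}\,\mathbb{E}_{Q(X\mid Z)}\!\left[\frac{f(X,Z)}{Q(X\mid Z)}\right]\mathbb{E}_{Q(Y\mid Z)}\!\left[\frac{f(Y,Z)}{Q(Y\mid Z)}\right]\right].$$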

AND/OR Importance Sampling (General Idea). Estimate all the conditional expectations separately. How? Record all samples; for each sample that has Z=j, estimate the conditional expectations for X|Z and Y|Z using the samples corresponding to Z=j; then combine the results. So how do we use this decomposition for sampling? The idea is to compute the expectations separately: first, we record all the samples; next, for each value Z=j, we estimate the conditional expectation at X and at Y separately using the generated samples; finally, we combine the results.

AND/OR Importance Sampling (Gogate and Dechter, UAI 2008, CP 2008). (Slide: four samples over Z, X, Y arranged on the AND/OR tree; each arc is labeled with a pair ⟨w, #⟩, where w is the f/Q ratio and # counts how many times that assignment was sampled, e.g. ⟨1.6, 2⟩ on a Z arc.) The operations can be intuitively expressed on an AND/OR search tree. Assume we have generated four samples; we simply arrange them on an AND/OR tree. The arcs are labeled with pairs of numbers: the first is simply the ratio of f over Q, and the second keeps track of how many times the assignment was sampled. Now, coming back to the computation: at each OR node, we estimate the conditional expectation of the variable at that node given the value of the variable above it. The fancy term "conditional expectation" here just means an average of the children; for example, the conditional expectation of X given Z=0 is simply the average of the quantities below it. At each AND node, we combine the results, namely take the product, based on the decomposition shown above.

Example Bayesian network (recap). (Slide: the same Bayesian network over Z, X, Y with evidence A=0, B=0 and its CPTs, shown again for reference; see the earlier example slide.)

AND/OR Importance Sampling (Gogate and Dechter, UAI 2008, CP 2008). (Slide: the four samples arranged on the AND/OR tree with ⟨ratio, count⟩ arc labels.) Putting these ideas together, we can compute a new estimate for M as follows. We arrange the samples on an AND/OR search tree; at each AND node we compute a product, and at each OR node we compute a weighted average. The value of the root node is then an estimate of the weighted counts. All AND nodes: separate components, operator = product. All OR nodes: conditional expectations given the assignment above, operator = average.

Algorithm AND/OR Importance Sampling (Gogate and Dechter, UAI 2008, CP 2008).
1. Generate samples x1, ..., xN from Q along the ordering O.
2. Build an AND/OR sample tree for the samples x1, ..., xN along O.
3. FOR all leaf nodes i of the AND/OR tree: IF i is an AND node THEN v(i) = 1 ELSE v(i) = 0.
4. FOR every node n from the leaves to the root: IF n is an AND node THEN v(n) = product of its children; IF n is an OR node THEN v(n) = average of its children (weighted by the arc labels).
5. Return v(root node).
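A compact sketch of this procedure in Python for the Z → {X, Y} pseudo tree used in the example. The sample data and helper names are illustrative, and the weights are the per-variable f/Q ratios described on the earlier slides:

```python
from collections import defaultdict

# Each sample is an assignment (z, x, y) together with its per-variable weights
# wZ = f(z)/Q(z), wX = f(x,z)/Q(x|z), wY = f(y,z)/Q(y|z).  Numbers are made up.
samples = [
    {"z": 0, "x": 0, "y": 1, "wZ": 0.8, "wX": 0.12, "wY": 0.36},
    {"z": 0, "x": 1, "y": 0, "wZ": 0.8, "wX": 0.20, "wY": 0.16},
    {"z": 1, "x": 2, "y": 1, "wZ": 0.2, "wX": 0.14, "wY": 0.28},
    {"z": 1, "x": 0, "y": 2, "wZ": 0.2, "wX": 0.08, "wY": 0.84},
]

def and_or_estimate(samples):
    """AND/OR importance-sampling estimate of M for the pseudo tree Z -> {X, Y}:
    group samples by the value of Z, average the X- and Y-weights within each
    group (OR nodes), multiply the branch averages (AND node), and finally
    average over Z weighted by how often each value of Z was sampled."""
    by_z = defaultdict(list)
    for s in samples:
        by_z[s["z"]].append(s)
    total, N = 0.0, len(samples)
    for z, group in by_z.items():
        n_z = len(group)
        wZ = group[0]["wZ"]                        # same for every sample with this z
        avg_x = sum(s["wX"] for s in group) / n_z  # OR node for X given Z=z
        avg_y = sum(s["wY"] for s in group) / n_z  # OR node for Y given Z=z
        total += n_z * wZ * avg_x * avg_y          # AND node: product of the branches
    return total / N                               # root OR node: average over Z

print(and_or_estimate(samples))
```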

Number of samples in AND/OR vs conventional space (Gogate and Dechter, UAI 2008, CP 2008). (Slide: the same four samples arranged on the AND/OR tree.) There are 8 samples in the AND/OR space versus 4 samples in conventional importance sampling. For example, Z=0, X=2, Y=0 is never generated but is still considered in the AND/OR space (see the red branches in the AND/OR tree); AND/OR sampling has lower error because it computes an average over these additional combinations. Let us count the number of samples: remember that an OR node represents alternatives, while an AND node stands for decomposition into independent sub-problems, each of which must be solved. So we sum over all alternatives at OR nodes and take a product over all AND nodes.

Properties of AND/OR Importance Sampling (Gogate and Dechter, UAI 2008, CP 2008). It gives an unbiased estimate of the weighted counts, and the AND/OR estimate has lower variance than the conventional importance sampling estimate. The variance reduction is easy to prove for the case of complete independence (Goodman, 1960) and more complicated to prove for the general conditional-independence case (see Gogate's thesis and papers).

AND/OR w-cutset (Rao-Blackwellised) sampling (Gogate and Dechter, under review, CP 2009). Rao-Blackwellisation (Rao, 1963): partition X into K and R such that we can compute P(R|k) efficiently; sample K and sum out R exactly. The estimate uses the weighted counts conditioned on K = k^i. w-cutset sampling (Bidyuk and Dechter, 2003): select K such that the treewidth of the network after removing K is bounded by w. We can even combine AND/OR sampling with Rao-Blackwellised w-cutset sampling, yielding further variance reduction. The idea in Rao-Blackwellisation is to combine exact inference with sampling, which reduces variance by the Rao-Blackwell theorem. Here the set of variables is partitioned into K and R such that it is easy to do exact inference on R given K, and the Rao-Blackwell estimate is as shown. w-cutset sampling is an elegant implementation of Rao-Blackwellisation in which the set K is selected so that the treewidth of R given K is bounded by w; thus the exact-inference step can be carried out in polynomial time (exponential only in the bound w), yielding a very efficient practical scheme.
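The estimate referred to on this slide is not preserved in the transcript; a reconstruction consistent with the description (sample K from a proposal Q, and compute the weighted count of the remaining variables exactly) is:

$$\hat{M}_N = \frac{1}{N}\sum_{i=1}^{N} \frac{W(k^i)}{Q(k^i)}, \qquad W(k) = \sum_{r} \prod_{j} f_j\big((k,r)_{\mathrm{scope}(f_j)}\big),$$

where W(k) is the weighted count conditioned on K = k, computed exactly, e.g. by bucket elimination in time exponential only in w.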

AND/OR w-cutset sampling (Gogate and Dechter, under review, CP 2009). Perform AND/OR tree or graph sampling on K and exact inference on R. Orthogonal approaches. Theorem: combining AND/OR sampling and w-cutset sampling yields further variance reduction. (Slide: a graphical model over A–G, its full pseudo tree, a start pseudo tree on the cutset variables, and an OR pseudo tree on the cutset variables.) The main idea in AND/OR w-cutset sampling is to perform AND/OR sampling on the w-cutset variables K and exact inference on R. The main point here is that AND/OR sampling is orthogonal to w-cutset Rao-Blackwellised sampling, and the two can be combined to yield further variance reduction.

From Search Trees to Search Graphs Any two nodes that root identical subtrees (subgraphs) can be merged

Merging Based on Context. One way of recognizing nodes that can be merged: context(X) = the ancestors of X in the pseudo tree that are connected to X or to descendants of X. (Slide: the pseudo tree over A–F with the context of each variable annotated, e.g. [ ], [A], [AB], [AB], [BC], [AE].) If two nodes root identical subtrees, they can be merged; the minimal graph is obtained when all such redundancy is removed.

AND/OR Search Graph (Constraint Satisfaction – Counting Solutions). (Slide: the relations RABC, RBCD, RABE, RAEF, the pseudo tree with contexts [ ], [A], [AB], [AB], [BC], [AE], the context-minimal AND/OR graph, and a cache table for D indexed by its context {B, C}.) Defined by a variable ordering, fixed but may be dynamic.

How Big Is the Context? Theorem: the maximum context size for a pseudo tree is equal to the treewidth of the graph along the pseudo tree. (Slide: a pseudo tree over the ordering (C K H A B E J L N O D P M F G) with the context of each variable annotated; the maximum context size equals the treewidth.)

Complexity of AND/OR Graph Search. Space and time: O(n k^{w*}) for the AND/OR search graph, versus O(n k^{pw*}) for the OR search graph. Here k = domain size, n = number of variables, w* = treewidth, pw* = pathwidth, with w* ≤ pw* ≤ w* log n.

AND/OR Graphs. (Slide: a problem over A, B, C, D with domains {0,1,2} and ≠ constraints; the full AND/OR tree and the merged AND/OR graph.) The figure shows a full AND/OR tree corresponding to the pseudo tree. We can see that node A does not depend on C, and therefore the subtree below it is identical for C=0 and C=1. The two can therefore be merged, yielding an AND/OR graph which is even more compact and exploits more conditional independencies than an AND/OR tree.

AND/OR graph sampling. (Slide: for the same ≠ problem over A, B, C, D, the OR sample tree, the AND/OR sample tree, and the AND/OR sample graph.) We can implement the same idea and convert an AND/OR sample tree into an AND/OR sample graph by merging all identical subtrees. This yields a new structure on which we can perform similar computations, namely averaging at OR nodes and taking products at AND nodes, to obtain a new estimate. This new estimate has lower variance.

Variance Hierarchy and Complexity O(nN) O(1) Variance Hierarchy and Complexity IS O(cN+(n-c)Nexp(w)) O(1) O(nN) O(h) w-cutset IS AND/OR Tree IS O(nNw*) O(nN) O(cN+(n-c)Nexp(w)) O(h+(n-c)exp(w)) AND/OR Graph IS AND/OR w-cutset Tree IS O(cNw*+(n-c)Nexp(w)) O(cN+(n-c)exp(w)) AND/OR w-cutset Graph IS

Experiments. Benchmarks: linkage analysis, graph coloring, grids. Algorithms: OR tree sampling, AND/OR tree sampling, AND/OR graph sampling, and the w-cutset versions of the three schemes above.

Results: Probability of Evidence Linkage instances (UAI 2006 evaluation) Time Bound: 1hr

Here we see that the scheme that employs the most decomposition, AND/OR w-cutset graph sampling, yields the best performance.

Summary: AND/OR Importance sampling AND/OR sampling: A general scheme to exploit conditional independence in sampling Theoretical guarantees: lower sampling error than conventional sampling Variance reduction orthogonal to Rao-Blackwellised sampling. Better empirical performance than conventional sampling.

Conclusion. Effective sampling in the presence of determinism, addressing the rejection and non-convergence problems. SampleSearch manages rejection while sampling. SampleSearch-SIR and SampleSearch-MH: convergent MCMC sampling schemes. Lower-bounding schemes: extending the Markov inequality to multiple samples. AND/OR importance sampling: post-processing that uses conditional independence to reduce the variance of the estimates.

Mixed Networks (Mateescu and Dechter, 2004). (Slide: a belief network and a constraint network over A–F, the moral mixed graph, and a CPT for D.) We often have both probabilistic information and constraints. Our approach is to allow the two representations to co-exist explicitly; this has virtues both for the user interface and for computation. It also supports complex CNF queries, e.g. P((A ∨ B) ∧ (¬C ∨ D)).

Transportation Planning: Graphical model. (Slide: a two-slice dynamic Bayesian network with variables d, w, g, F, r, v, l, y at times t-1 and t.) D: time of day (discrete). W: day of week (discrete). G: the collection of locations where the person spends a significant amount of time (discrete). F: a counter. Route: a hidden variable that predicts which path the person takes (discrete). Location: a pair (e, d), where e is the edge on which the person is and d is the distance of the person from one of the end-points of that edge (continuous). Velocity: continuous. GPS reading: (lat, lon, spd, utc). Here is an example dynamic model of the car-travel activity routines of individuals.