Approximate Inference (Slides by Nir Friedman)

When can we hope to approximate? Two situations:
- Highly stochastic distributions: “far” evidence is discarded.
- “Peaked” distributions: improbable values are ignored.

Stochasticity & Approximations
Consider a chain X_1 → X_2 → X_3 → … → X_{n+1} with
P(X_{i+1} = t | X_i = t) = 1 − ε and P(X_{i+1} = f | X_i = f) = 1 − ε.
Computing the probability of X_{n+1} given X_1 amounts to summing over how many “flips” occur along the chain: X_{n+1} agrees with X_1 after an even number of flips and disagrees after an odd number, so
P(X_{n+1} = t | X_1 = t) = Σ_{k even} C(n, k) ε^k (1 − ε)^{n−k},
and the odd-k sum gives the complementary probability.

[Plot: P(X_n = t | X_1 = t) as a function of ε, for n = 5, 10, 20.]

Stochastic Processes
- This behavior of a chain (a Markov process) is called mixing.
- In general Bayes nets there is a similar behavior: if probabilities are far from 0 and 1, then the effect of “far” evidence vanishes (and so can be discarded in approximations).
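As a quick illustration of the mixing behavior above, here is a minimal sketch (Python/NumPy assumed; not part of the original slides, and the ε values are arbitrary) that computes the n-step transition probability by powering the 2×2 transition matrix; it converges to 1/2 regardless of the starting state:

```python
import numpy as np

def p_stay_true(eps, n):
    """P(X_{n+1} = t | X_1 = t) after n steps of the symmetric two-state chain."""
    T = np.array([[1 - eps, eps],    # from t: stay at t with prob 1 - eps
                  [eps, 1 - eps]])   # from f: stay at f with prob 1 - eps
    # The [0, 0] entry equals 1/2 + (1 - 2*eps)**n / 2, which tends to 1/2.
    return np.linalg.matrix_power(T, n)[0, 0]

for eps in (0.05, 0.2, 0.4):
    print(eps, [round(p_stay_true(eps, n), 4) for n in (5, 10, 20)])
```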

“Peaked” distributions
- If the distribution is “peaked”, then most of the mass is on a few instances.
- If we can focus on these instances, we can ignore the rest.
[Sketch: probability mass concentrated on a few instances.]

Global conditioning
Fixing the values of a set of variables (e.g. A and B) at the beginning of the summation can decrease the size of the tables formed by variable elimination; in this way space is traded for time. Special case: choose to fix a set of nodes that “breaks all loops”. This method is called cutset conditioning.
[Diagram: the network over A, B, C, D, E, I, J, K, L, M is split into simpler networks, one per assignment (a, b) to A and B.]
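The identity behind conditioning (standard, though not spelled out on the slide) writes the query as a sum over assignments to the fixed variables, with each summand handled by exact inference in the simplified network:

```latex
P(Y \mid e) \;=\; \sum_{a,b} P(Y \mid a, b, e)\, P(a, b \mid e)
          \;\propto\; \sum_{a,b} P(Y, a, b, e)
```

Each term P(Y, a, b, e) is computed with A = a and B = b clamped, which is what makes the per-term elimination cheap.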

Bounded conditioning
Fixing the values of A and B: by examining only the probable assignments of A and B, we perform several simple computations instead of a complex one.

Bounded conditioning
- Choose A and B so that P(Y, e | a, b) can be computed easily, e.g. a cycle cutset.
- Search for highly probable assignments to A, B:
  - Option 1: select a, b with high P(a, b).
  - Option 2: select a, b with high P(a, b | e).
- We need to search for such high-mass values, and that can be hard.

Bounded Conditioning
Advantages:
- Combines exact inference within an approximation.
- Continuous: more time can be used to examine more cases.
- Bounds: the unexamined mass is used to compute error bars.
Possible problems:
- P(a, b) is prior mass, not posterior mass.
- If the posterior P(a, b | e) is significantly different, computation can be wasted on irrelevant assignments.

Network Simplifications
- In these approaches, we try to replace the original network with a simpler one.
  - The resulting network allows fast exact methods.

Network Simplifications
Typical simplifications:
- Remove parts of the network.
- Remove edges.
- Reduce the number of values (value abstraction).
- Replace a sub-network with a simpler one (model abstraction).
These simplifications are often made with respect to the particular evidence and query.

Stochastic Simulation
- Suppose our goal is to compute the likelihood of the evidence, P(e), where e is an assignment to some of the variables in {X_1, …, X_n}.
- Assume that we can sample instances according to the distribution P(x_1, …, x_n).
- What, then, is the probability that a random sample satisfies e? Answer: simply P(e), which is what we wish to compute.
- Each sample simulates the tossing of a biased coin with probability P(e) of “heads”.

Stochastic Sampling
- Intuition: given a sufficient number of samples x[1], …, x[N], we can estimate
  P̂(e) = (1/N) Σ_k 1[x[k] satisfies e],
  where each indicator term is a zero or a one.
- The law of large numbers implies that as N grows, this estimate converges to P(e) with high probability.
- How many samples do we need to get a reliable estimate? We will not discuss this issue here.
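The slides do not pursue the sample-size question, but a rough, standard rule of thumb (not from the slides) is that the standard error of the estimate is about sqrt(p(1 − p)/N), so a rare event with probability p needs on the order of 1/p samples before we even expect a single positive sample. A minimal Python sketch, with an arbitrary illustrative p_true:

```python
import math
import random

def estimate_bernoulli(p_true, n):
    """Estimate a biased coin's P(heads) from n tosses, with the usual standard error."""
    hits = sum(random.random() < p_true for _ in range(n))
    p_hat = hits / n
    stderr = math.sqrt(p_hat * (1 - p_hat) / n) if 0 < p_hat < 1 else float("nan")
    return p_hat, stderr

for n in (100, 10_000, 1_000_000):
    print(n, estimate_bernoulli(0.01, n))   # the estimate only stabilizes once n >> 1/p
```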

Sampling a Bayesian Network
- If P(X_1, …, X_n) is represented by a Bayesian network, can we efficiently sample from it?
- Idea: sample according to the structure of the network.
  - Write the distribution using the chain rule, and then sample each variable given its parents.

Logic sampling: example
[Sequence of slides stepping through one sample of the Burglary / Earthquake / Alarm / Call / Radio network. Each variable is sampled in topological order from its CPT given the values already drawn for its parents: first B from P(B) (P(b) = 0.03), then E from P(E), then A from P(A | B, E), then C from P(C | A), and finally R from P(R | E), yielding one complete sample (b, e, a, c, r). The CPT entries on the slides are not fully readable here.]

Logic Sampling
Let X_1, …, X_n be an order of the variables consistent with arc direction.
for i = 1, …, n do
  sample x_i from P(X_i | pa_i)
  (Note: since Pa_i ⊆ {X_1, …, X_{i-1}}, we have already assigned values to them.)
return x_1, …, x_n
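A runnable sketch of logic sampling (Python assumed; apart from P(b) = 0.03, the CPT numbers are made up for illustration because the slides' tables are not fully readable, and the rejection-based conditional estimate anticipates the next slides):

```python
import random

def sample_once():
    """Forward-sample one complete instance in topological order."""
    b = random.random() < 0.03                       # P(b) = 0.03 (from the slide)
    e = random.random() < 0.001                      # illustrative value
    p_a = {(True, True): 0.95, (True, False): 0.94,  # illustrative CPT for Alarm
           (False, True): 0.29, (False, False): 0.001}[(b, e)]
    a = random.random() < p_a
    c = random.random() < (0.9 if a else 0.05)       # illustrative CPT for Call
    r = random.random() < (0.8 if e else 0.01)       # illustrative CPT for Radio
    return dict(B=b, E=e, A=a, C=c, R=r)

def logic_sampling_estimate(query_var, evidence, n=100_000):
    """Estimate P(query_var = True | evidence) by discarding samples that miss e."""
    kept = hits = 0
    for _ in range(n):
        s = sample_once()
        if all(s[v] == val for v, val in evidence.items()):
            kept += 1
            hits += s[query_var]
    return hits / kept if kept else float("nan")

print(logic_sampling_estimate("B", {"A": True}))   # e.g. P(Burglary | Alarm)
```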

Logic Sampling
- Sampling a complete instance is linear in the number of variables, regardless of the structure of the network.
- However, if P(e) is small, we need many samples to get a decent estimate.

Can we sample from P(X_i | e)?
- If the evidence e is at the roots of the Bayes network: easily.
- If the evidence is at the leaves of the network, we have a problem: our sampling method proceeds according to the order of the nodes in the network.
[Diagram: a small network over Z, R, B, X with evidence A = a at a leaf.]

Likelihood Weighting
- Can we ensure that all of our samples satisfy e?
- One simple (but wrong) solution: when we need to sample a variable Y that is assigned a value by e, use its specified value.
- For example: we know Y = 1. Sample X from P(X), then take Y = 1.
- Is this a sample from P(X, Y | Y = 1)? NO.
[Diagram: X → Y.]

Likelihood Weighting
- Problem: these samples of X are from P(X).
- Solution: penalize samples in which P(Y = 1 | X) is small.
- We now sample as follows: let x[i] be a sample from P(X), and let w[i] = P(Y = 1 | X = x[i]).

Likelihood Weighting
Let X_1, …, X_n be an order of the variables consistent with arc direction.
w = 1
for i = 1, …, n do
  if X_i = x_i has been observed:
    w ← w · P(X_i = x_i | pa_i)
  else:
    sample x_i from P(X_i | pa_i)
return x_1, …, x_n, and w
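A matching sketch of likelihood weighting (Python assumed, reusing the same illustrative CPTs as the logic-sampling sketch above; Radio is omitted to keep it short). Evidence variables are clamped and contribute a factor to the weight, and the query is answered by a ratio of weighted counts:

```python
import random

def weighted_sample(evidence):
    """One likelihood-weighted sample; evidence variables are clamped, not sampled."""
    w = 1.0
    b = evidence.get("B", random.random() < 0.03)
    if "B" in evidence:
        w *= 0.03 if evidence["B"] else 0.97
    e = evidence.get("E", random.random() < 0.001)
    if "E" in evidence:
        w *= 0.001 if evidence["E"] else 0.999
    p_a = {(True, True): 0.95, (True, False): 0.94,
           (False, True): 0.29, (False, False): 0.001}[(b, e)]
    a = evidence.get("A", random.random() < p_a)
    if "A" in evidence:
        w *= p_a if evidence["A"] else 1 - p_a
    p_c = 0.9 if a else 0.05
    c = evidence.get("C", random.random() < p_c)
    if "C" in evidence:
        w *= p_c if evidence["C"] else 1 - p_c
    return dict(B=b, E=e, A=a, C=c), w

def lw_estimate(query_var, evidence, n=100_000):
    """P(query_var = True | evidence) as a ratio of weighted counts."""
    num = den = 0.0
    for _ in range(n):
        s, w = weighted_sample(evidence)
        den += w
        num += w * s[query_var]
    return num / den

print(lw_estimate("B", {"A": True}))   # every sample is consistent with A = a
```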

Likelihood weighting: example
[Sequence of slides stepping through one weighted sample of the same network with evidence A = a and R = r. The unobserved variables B, E, and C are sampled from their CPTs as before, while the observed variables are clamped to their evidence values and the weight accumulates the corresponding conditional probabilities (in the slides' run, Weight = 0.6 · 0.3). The CPT entries on the slides are not fully readable here.]

Likelihood Weighting
- Why does this make sense?
- When N is large, we expect roughly N · P(X = x) of the samples to have x[i] = x.
- Thus, each such sample carries weight P(Y = 1 | x), so
  Σ_i w[i] · 1[x[i] = x] ≈ N · P(X = x) · P(Y = 1 | x) = N · P(X = x, Y = 1),
  and dividing by Σ_i w[i] ≈ N · P(Y = 1) gives P(X = x | Y = 1).

Summary
Approximate inference is needed for large pedigrees. We have seen a few methods today; some are suitable for genetic linkage analysis and some are not. There are many other approximation algorithms: variational methods, MCMC, and others. In next semester's Bioinformatics project course (236524), we will offer projects that implement some approximation methods and embed them in the Superlink software.