EM for BNs Kalman Filters Gaussian BNs

Presentation transcript:

EM for BNs, Kalman Filters, Gaussian BNs
Graphical Models – 10-708
Carlos Guestrin, Carnegie Mellon University, November 17th, 2006
Readings: K&F: 16.2, 4.5, 12.2, 12.3; Jordan Chapter 20

Thus far, fully supervised learning
We have assumed fully supervised learning, where every variable is observed in every training example. Many real problems have missing data: some variables are hidden or unobserved in some (or all) examples.

The general learning problem with missing data
Marginal likelihood, with x observed and z missing: log P(D | θ) = ∑_j log P(x_j | θ) = ∑_j log ∑_z P(x_j, z | θ).

E-step
x is observed, z is missing. Compute the probability of the missing data given the current parameter estimate θ(t): Q(t+1)(z | x_j) = P(z | x_j, θ(t)) for each x_j. This is the probability computed during a classification step, and corresponds to the "classification step" in K-means.
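As a concrete illustration of the E-step (not from the lecture), here is a minimal sketch that computes the responsibilities Q(z | x_j) for a one-dimensional Gaussian mixture; the model and all names (pis, mus, sigmas) are assumptions made for the example.

    import numpy as np

    def e_step(X, pis, mus, sigmas):
        """Q(z=k | x_j) ∝ pi_k * N(x_j; mu_k, sigma_k^2) for a 1-D Gaussian mixture."""
        X, pis, mus, sigmas = map(np.asarray, (X, pis, mus, sigmas))
        log_p = (np.log(pis)[None, :]
                 - 0.5 * np.log(2 * np.pi * sigmas[None, :] ** 2)
                 - 0.5 * ((X[:, None] - mus[None, :]) / sigmas[None, :]) ** 2)
        log_p -= log_p.max(axis=1, keepdims=True)   # subtract the max for numerical stability
        Q = np.exp(log_p)
        return Q / Q.sum(axis=1, keepdims=True)     # normalize over z for each x_j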

Jensen's inequality
Theorem: log ∑_z P(z) f(z) ≥ ∑_z P(z) log f(z)

Applying Jensen's inequality
Use: log ∑_z P(z) f(z) ≥ ∑_z P(z) log f(z)
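In the standard EM derivation (the step the slide applies on the board), the inequality is used with P(z) = Q(z | x) and f(z) = P(x, z | θ) / Q(z | x), giving a lower bound on the marginal log likelihood:

    \log P(x \mid \theta)
      = \log \sum_z Q(z \mid x)\,\frac{P(x, z \mid \theta)}{Q(z \mid x)}
      \;\ge\; \sum_z Q(z \mid x)\,\log \frac{P(x, z \mid \theta)}{Q(z \mid x)}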

The M-step maximizes lower bound on weighted data
The lower bound from Jensen's inequality corresponds to a weighted dataset:
<x1, z=1> with weight Q(t+1)(z=1|x1)
<x1, z=2> with weight Q(t+1)(z=2|x1)
<x1, z=3> with weight Q(t+1)(z=3|x1)
<x2, z=1> with weight Q(t+1)(z=1|x2)
<x2, z=2> with weight Q(t+1)(z=2|x2)
<x2, z=3> with weight Q(t+1)(z=3|x2)
…

The M-step
Maximization step: θ(t+1) = argmax_θ ∑_j ∑_z Q(t+1)(z|x_j) log P(x_j, z | θ). Use expected counts instead of counts: if learning requires Count(x, z), use E_{Q(t+1)}[Count(x, z)].
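A minimal sketch of how expected counts replace hard counts, assuming discrete x values and a Q matrix like the one produced by the E-step sketch above (all names are illustrative):

    import numpy as np
    from collections import defaultdict

    def expected_counts(xs, Q):
        """ExCount(x, z) = sum over data points j with x_j = x of Q(z | x_j)."""
        counts = defaultdict(float)
        for j, x in enumerate(xs):
            for z in range(Q.shape[1]):
                counts[(x, z)] += Q[j, z]
        return counts   # use these wherever MLE would use Count(x, z)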

Convergence of EM
Define the potential function F(θ, Q) = ∑_j ∑_z Q(z|x_j) log [P(x_j, z | θ) / Q(z|x_j)]. EM corresponds to coordinate ascent on F, and thus maximizes a lower bound on the marginal log likelihood (as seen in the Machine Learning class last semester).

Announcements
Lectures for the rest of the semester:
Special time: Monday Nov 20, 5:30-7pm, Wean 4615A: Gaussian GMs & Kalman Filters
Special time: Monday Nov 27, 5:30-7pm, Wean 4615A: Dynamic BNs
Wed. 11/30, regular class time: Causality (Richard Scheines)
Friday 12/1, regular class time: Finish Dynamic BNs & Overview of Advanced Topics
Deadlines & Presentations:
Project Poster Presentations: Dec. 1st, 3-6pm (NSH Atrium)
Project write-up: Dec. 8th by 2pm, by email; the 8-page limit will be strictly enforced
Final: out Dec. 1st, due Dec. 15th by 2pm (strict deadline)

Data likelihood for BNs
(Example BN: Flu, Allergy, Sinus, Headache, Nose.) Given the structure, the log likelihood of fully observed data decomposes over the CPTs:
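Written out, this is the standard decomposition (superscript (j) indexes data points):

    \log P(\mathcal{D} \mid \theta, G)
      = \sum_{j=1}^{m} \sum_{i} \log P\!\left(x_i^{(j)} \,\middle|\, \mathrm{pa}_{X_i}^{(j)}, \theta\right)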

Marginal likelihood
(Example BN: Flu, Allergy, Sinus, Headache, Nose.) What if S (Sinus) is hidden?

Log likelihood for BNs with hidden data
(Example BN: Flu, Allergy, Sinus, Headache, Nose.) Marginal likelihood, with O observed and H hidden: log P(D | θ, G) = ∑_j log ∑_h P(o(j), h | θ, G).

E-step for BNs
The E-step computes the probability of the hidden variables h given the observations o, Q(t+1)(h | o) = P(h | o, θ(t)). This corresponds to inference in the BN.

The M-step for BNs
Maximization step: use expected counts instead of counts. If learning requires Count(h, o), use E_{Q(t+1)}[Count(h, o)].

M-step for each CPT
The M-step decomposes per CPT. Standard MLE: θ_{x|pa} = Count(x, pa) / Count(pa). With missing data, the M-step uses expected counts instead: θ_{x|pa} = ExCount(x, pa) / ExCount(pa).

Computing expected counts
The M-step requires expected counts: for a set of variables A, we must compute ExCount(A = a). Some of A in example j will be observed, denoted A_O = a_O(j); some of A will be hidden, denoted A_H. Use inference (the E-step computes the expected counts): ExCount(t+1)(A_O = a_O(j), A_H = a_H) ← P(A_H = a_H | A_O = a_O(j), θ(t)), accumulated over data points j.
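A sketch of this accumulation, assuming a hypothetical `infer_posterior(hidden_vars, evidence, theta)` routine that runs BN inference and returns a dictionary mapping joint assignments of the hidden variables (as tuples of (var, value) pairs) to their posterior probabilities; all names here are illustrative, not part of the lecture:

    from collections import defaultdict

    def expected_counts_family(data, family_vars, infer_posterior, theta):
        """Accumulate ExCount(A = a) for the family A = {X_i} ∪ Pa_{X_i}."""
        ExCount = defaultdict(float)
        for obs in data:                                   # obs: dict var -> observed value
            a_obs = {v: obs[v] for v in family_vars if v in obs}
            hidden = [v for v in family_vars if v not in obs]
            # posterior P(A_H | A_O = a_O(j), theta) from the E-step's inference
            for a_hid, p in infer_posterior(hidden, a_obs, theta).items():
                assignment = tuple(sorted({**a_obs, **dict(a_hid)}.items()))
                ExCount[assignment] += p
        return ExCount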

Data need not be hidden in the same way
When data is fully observed, a data point is a complete assignment to all the variables. When data is partially observed, the unobserved variables can be different for different data points (e.g., Sinus hidden in one example, Nose hidden in another). Same framework, just change the definition of expected counts: ExCount(t+1)(A_O = a_O(j), A_H = a_H) ← P(A_H = a_H | A_O = a_O(j), θ(t)).

Learning structure with missing data [K&F 16.6]
Known BN structure: use expected counts; the learning algorithm doesn't change.
Unknown BN structure: we can use expected counts and score models as when we talked about structure learning, but this is very slow... e.g., a greedy algorithm would need to redo inference for every edge we test.
(Much faster) Structure-EM: iterate: compute expected counts; do some structure search (e.g., many greedy steps); repeat.
Theorem: converges to a local optimum of the marginal log-likelihood (details in the book).
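A minimal sketch of the Structure-EM loop described above; `compute_expected_counts`, `greedy_structure_search`, and `fit_parameters` are hypothetical helpers standing in for BN inference, structure search on the expected-count "completed" data, and the M-step:

    def structure_em(data, G, theta, n_iters=10):
        """Structure-EM sketch: alternate expected counts with structure search."""
        for _ in range(n_iters):
            ExCounts = compute_expected_counts(data, G, theta)  # E-step: BN inference
            G = greedy_structure_search(G, ExCounts)            # many greedy structure steps
            theta = fit_parameters(G, ExCounts)                 # M-step from expected counts
        return G, theta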

What you need to know about learning with missing data
EM for Bayes Nets. E-step: inference computes the expected counts; we only need expected counts over X_i and Pa_{X_i}. M-step: expected counts are used to estimate the parameters. Which variables are hidden can change per data point. This also lets you use labeled and unlabeled data → some data points are complete, some include hidden variables. Structure-EM: iterate between computing expected counts and many structure search steps.

Adventures of our BN hero
Compact representation for probability distributions. Fast inference. Fast learning. Approximate inference. But… who are the most popular kids?
1. Naïve Bayes
2 and 3. Hidden Markov models (HMMs) and Kalman Filters

The Kalman Filter
An HMM with Gaussian distributions. Has been around for at least 50 years. Possibly the most used graphical model ever: it's what runs your cruise control, tracks missiles, controls robots, … And it's so simple, which may explain why it's used so widely. Many interesting models build on it. An example of a Gaussian BN (more on this later).

Example of KF – SLAT: Simultaneous Localization and Tracking [Funiak, Guestrin, Paskin, Sukthankar '06]
Place some cameras around an environment; you don't know where they are. You could measure all the locations, but that requires lots of grad student (Stano) time. Intuition: a person walks around; if camera 1 sees the person and then camera 2 sees the person, we learn about the relative positions of the cameras.

Example of KF – SLAT Simultaneous Localization and Tracking [Funiak, Guestrin, Paskin, Sukthankar ’06]

Multivariate Gaussian
N(μ; Σ) with mean vector μ and covariance matrix Σ; density p(x) = (2π)^(-d/2) |Σ|^(-1/2) exp(-½ (x-μ)ᵀ Σ⁻¹ (x-μ)).
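For reference, a small sketch evaluating that log-density with numpy (the function name is illustrative):

    import numpy as np

    def gaussian_logpdf(x, mu, Sigma):
        """log N(x; mu, Sigma) via a Cholesky factorization of the covariance."""
        d = len(mu)
        L = np.linalg.cholesky(Sigma)
        z = np.linalg.solve(L, x - mu)               # whitened residual
        log_det = 2.0 * np.sum(np.log(np.diag(L)))   # log |Sigma|
        return -0.5 * (d * np.log(2 * np.pi) + log_det + z @ z)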

Conditioning a Gaussian
Joint Gaussian: p(X, Y) ~ N(μ; Σ). Conditional linear Gaussian: p(Y | X) ~ N(μ_{Y|X}; σ²_{Y|X}).

Gaussian is a "Linear Model"
Conditional linear Gaussian: p(Y | X) ~ N(β₀ + βX; σ²).

Conditioning a Gaussian
Joint Gaussian: p(X, Y) ~ N(μ; Σ). Conditional linear Gaussian: p(Y | X) ~ N(μ_{Y|X}; Σ_{YY|X}).
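The slide leaves the formulas for μ_{Y|X} and Σ_{YY|X} on the board; here is a sketch using the standard Gaussian conditioning identities (index sets and names are illustrative):

    import numpy as np

    def condition_gaussian(mu, Sigma, idx_x, idx_y, x_value):
        """Parameters of p(Y | X = x_value) for a joint Gaussian N(mu, Sigma) over (X, Y)."""
        mu_x, mu_y = mu[idx_x], mu[idx_y]
        Sxx = Sigma[np.ix_(idx_x, idx_x)]
        Syx = Sigma[np.ix_(idx_y, idx_x)]
        Syy = Sigma[np.ix_(idx_y, idx_y)]
        gain = Syx @ np.linalg.inv(Sxx)
        mu_y_given_x = mu_y + gain @ (x_value - mu_x)   # mean shifts with the evidence
        Sigma_yy_given_x = Syy - gain @ Syx.T           # covariance shrinks, independent of x
        return mu_y_given_x, Sigma_yy_given_x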

Conditional Linear Gaussian (CLG) – general case
p(Y | X) ~ N(β₀ + B X; Σ_{YY|X}).

Understanding a linear Gaussian – the 2d case Variance increases over time (motion noise adds up) Object doesn’t necessarily move in a straight line

Tracking with a Gaussian 1
p(X₀) ~ N(μ₀; Σ₀). p(X_{i+1} | X_i) ~ N(B X_i + β; Σ_{X_{i+1}|X_i}).

Tracking with Gaussians 2 – Making observations
We have p(X_i). The detector observes O_i = o_i, and we want to compute p(X_i | O_i = o_i) using Bayes rule. This requires a CLG observation model: p(O_i | X_i) ~ N(W X_i + v; Σ_{O_i|X_i}).

Operations in Kalman filter
(Figure: chain X₁ → X₂ → X₃ → X₄ → X₅ with observations O₁, …, O₅.) Compute the belief state p(X_t | o_1, …, o_t). Start with the prior p(X₀). At each time step t: condition on the observation; prediction (multiply in the transition model); roll-up (marginalize the previous time step). I'll describe one implementation of the KF; there are others, e.g., the information filter.
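A minimal moment-form sketch of the prediction and conditioning steps (the lecture carries them out in canonical form below; the matrix names B, beta, Q, W, v, R are illustrative, matching the CLG models on the previous slides):

    import numpy as np

    def kf_predict(mu, Sigma, B, beta, Q):
        """Prediction: push the belief through p(X_{t+1} | X_t) = N(B X_t + beta; Q)."""
        return B @ mu + beta, B @ Sigma @ B.T + Q

    def kf_condition(mu, Sigma, o, W, v, R):
        """Condition on observation o under p(O_t | X_t) = N(W X_t + v; R)."""
        S = W @ Sigma @ W.T + R                    # innovation covariance
        K = Sigma @ W.T @ np.linalg.inv(S)         # Kalman gain
        mu_new = mu + K @ (o - (W @ mu + v))
        Sigma_new = Sigma - K @ W @ Sigma
        return mu_new, Sigma_new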

Exponential family representation of Gaussian: Canonical Form
p(x) ∝ exp(-½ xᵀ Λ x + ηᵀ x), with Λ = Σ⁻¹ (the precision matrix) and η = Σ⁻¹ μ.

Canonical form
Standard form and canonical form are related by Λ = Σ⁻¹ and η = Σ⁻¹ μ. Conditioning is easy in canonical form; marginalization is easy in standard form.

Conditioning in canonical form
First multiply (a factor product in canonical form just adds the (Λ, η) parameters); then condition on the value B = y.
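A sketch of these two operations under the standard canonical parameterization Λ = Σ⁻¹, η = Σ⁻¹μ (the names and index handling are illustrative):

    import numpy as np

    def multiply_canonical(Lam1, eta1, Lam2, eta2):
        """Factor product: canonical parameters simply add."""
        return Lam1 + Lam2, eta1 + eta2

    def condition_canonical(Lam, eta, idx_a, idx_b, y):
        """Condition the canonical Gaussian over (A, B) on B = y: keep the A block."""
        Lam_aa = Lam[np.ix_(idx_a, idx_a)]
        Lam_ab = Lam[np.ix_(idx_a, idx_b)]
        return Lam_aa, eta[idx_a] - Lam_ab @ y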

Operations in Kalman filter
(Figure: chain X₁ → X₂ → X₃ → X₄ → X₅ with observations O₁, …, O₅.) Compute the belief state p(X_t | o_1, …, o_t). Start with the prior p(X₀). At each time step t: condition on the observation; prediction (multiply in the transition model); roll-up (marginalize the previous time step).

Prediction & roll-up in canonical form
First multiply in the transition model p(X_{t+1} | X_t); then marginalize X_t.
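Marginalization in canonical form eliminates a block of variables via a Schur complement; a sketch continuing the canonical-form helpers above (names are illustrative):

    import numpy as np

    def marginalize_canonical(Lam, eta, idx_keep, idx_out):
        """Marginalize out the variables in idx_out (the roll-up step)."""
        Laa = Lam[np.ix_(idx_keep, idx_keep)]
        Lab = Lam[np.ix_(idx_keep, idx_out)]
        Lbb_inv = np.linalg.inv(Lam[np.ix_(idx_out, idx_out)])
        Lam_marg = Laa - Lab @ Lbb_inv @ Lab.T                  # Schur complement
        eta_marg = eta[idx_keep] - Lab @ Lbb_inv @ eta[idx_out]
        return Lam_marg, eta_marg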

What if observations are not CLG?
Often observations are not CLG. CLG requires O_i = β X_i + β₀ + ε with Gaussian noise ε. Consider a motion detector: O_i = 1 if a person is likely to be in the region. The posterior is not Gaussian.

Linearization: incorporating non-linear evidence
p(O_i | X_i) is not CLG, but… Find a Gaussian approximation of p(X_i, O_i) = p(X_i) p(O_i | X_i). Instantiate the evidence O_i = o_i and obtain a Gaussian for p(X_i | O_i = o_i). Why do we hope this would be any good? Locally, a Gaussian may be OK.

Linearization as integration
Gaussian approximation of p(X_i, O_i) = p(X_i) p(O_i | X_i): we need to compute the moments E[O_i], E[O_i²], E[O_i X_i]. Note: each of these integrals is of a Gaussian times an arbitrary function.

Linearization as numerical integration
The integrand is a product of a Gaussian with an arbitrary function, so we can do effective numerical integration with Gaussian quadrature methods: approximate the integral as a weighted sum over integration points, where the quadrature rule defines the location of the points and the weights. This is exact if the arbitrary function is a polynomial of bounded degree, but the number of integration points is exponential in the number of dimensions d. Requiring exactness only for monomials needs exponentially fewer points; with 2d+1 points, this method is equivalent to the unscented Kalman filter, and it generalizes to many more points.
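A minimal sketch of the 2d+1-point version (an unscented-transform-style approximation of the moments E[O], Cov[O], Cov[X, O]; the observation function g, the kappa parameter, and all names are illustrative assumptions, not the lecture's exact quadrature rule):

    import numpy as np

    def unscented_moments(mu, Sigma, g, kappa=1.0):
        """Approximate moments of O = g(X) for X ~ N(mu, Sigma) using 2d+1 sigma points."""
        d = len(mu)
        L = np.linalg.cholesky((d + kappa) * Sigma)
        points = [mu] + [mu + L[:, i] for i in range(d)] + [mu - L[:, i] for i in range(d)]
        weights = np.array([kappa / (d + kappa)] + [0.5 / (d + kappa)] * (2 * d))
        O = np.array([g(p) for p in points])
        mean_o = weights @ O
        cov_oo = sum(w * np.outer(o - mean_o, o - mean_o) for w, o in zip(weights, O))
        cross_xo = sum(w * np.outer(p - mu, o - mean_o) for w, p, o in zip(weights, points, O))
        return mean_o, cov_oo, cross_xo   # feed these into a Gaussian approx. of p(X, O)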

Operations in non-linear Kalman filter
(Figure: chain X₁ → X₂ → X₃ → X₄ → X₅ with observations O₁, …, O₅.) Compute the belief state p(X_t | o_1, …, o_t). Start with the prior p(X₀). At each time step t: condition on the observation (using numerical integration); prediction (multiply in the transition model, using numerical integration); roll-up (marginalize the previous time step).

What you need to know about Kalman Filters
Kalman filter: probably the most used BN; assumes Gaussian distributions; equivalent to a linear system; computations are simple matrix operations. Non-linear Kalman filter: usually the observation or motion model is not CLG; use numerical integration to find a Gaussian approximation.