EM for BNs
Graphical Models (10-708) – Carlos Guestrin, Carnegie Mellon University, November 24th, 2008. Readings: 18.1, 18.2, 18.3


2 Thus far, fully supervised learning
We have assumed fully supervised learning, where every variable is observed in every training example. Many real problems have missing data (some variables are unobserved in some examples).

The general learning problem with missing data
Marginal likelihood, where x is observed and z is missing.
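For concreteness, the marginal log-likelihood objective in its standard form (a reconstruction, not a transcription of the slide): with observed x^(j) and hidden z, the objective sums out the hidden variables inside the log.

```latex
\ell(\theta : \mathcal{D})
  \;=\; \sum_{j=1}^{m} \log P\big(x^{(j)} \mid \theta\big)
  \;=\; \sum_{j=1}^{m} \log \sum_{z} P\big(x^{(j)}, z \mid \theta\big)
```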

E-step
x is observed, z is missing. Compute the probability of the missing data given the current choice of θ: Q(z|x^(j)) for each x^(j). This corresponds to the "classification step" in k-means (the probability computed when assigning points to clusters).
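A minimal sketch of this step for a mixture of Gaussians, one common instance of the general problem (illustrative only; the function names and the SciPy density call are my assumptions, not the lecture's code):

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Q(z | x^(j)): responsibility of each mixture component z for each point x^(j)."""
    m, k = X.shape[0], len(weights)
    Q = np.zeros((m, k))
    for z in range(k):
        # unnormalized P(z) * P(x^(j) | z) under the current parameters theta
        Q[:, z] = weights[z] * multivariate_normal.pdf(X, mean=means[z], cov=covs[z])
    Q /= Q.sum(axis=1, keepdims=True)  # normalize each row so it sums to 1
    return Q
```

In k-means the analogous step makes a hard assignment; here Q keeps the soft probabilities.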

Jensen's inequality
Theorem: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)

Applying Jensen's inequality
Use: log Σ_z P(z) f(z) ≥ Σ_z P(z) log f(z)
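The way the bound is usually instantiated (the standard EM derivation, written in the notation of the E-step slide rather than copied from this one): multiply and divide by Q(z|x) inside the sum, then apply the inequality with f(z) = P(x, z | θ) / Q(z|x).

```latex
\log P(x \mid \theta)
  \;=\; \log \sum_{z} Q(z \mid x)\,\frac{P(x, z \mid \theta)}{Q(z \mid x)}
  \;\ge\; \sum_{z} Q(z \mid x)\,\log \frac{P(x, z \mid \theta)}{Q(z \mid x)}
```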

The M-step maximizes a lower bound on weighted data
Lower bound from Jensen's (as above). Corresponds to a weighted dataset:
- (x^(1), z=1) with weight Q^(t+1)(z=1|x^(1))
- (x^(1), z=2) with weight Q^(t+1)(z=2|x^(1))
- (x^(1), z=3) with weight Q^(t+1)(z=3|x^(1))
- (x^(2), z=1) with weight Q^(t+1)(z=1|x^(2))
- (x^(2), z=2) with weight Q^(t+1)(z=2|x^(2))
- (x^(2), z=3) with weight Q^(t+1)(z=3|x^(2))
- ...

The M-step
Maximization step: use expected counts instead of counts.
- If learning requires Count(x, z)
- Use E_{Q^(t+1)}[Count(x, z)]
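Continuing the Gaussian-mixture sketch above, the matching M-step does weighted maximum likelihood, where the weights are the Q values (again an illustrative sketch, not course code):

```python
import numpy as np

def m_step(X, Q):
    """Re-estimate parameters from the soft assignments Q (the expected counts)."""
    m, _ = X.shape
    Nk = Q.sum(axis=0)                    # E_Q[Count(z)]: expected points per component
    weights = Nk / m                      # mixing proportions
    means = (Q.T @ X) / Nk[:, None]       # weighted means
    covs = []
    for z in range(Q.shape[1]):
        diff = X - means[z]
        covs.append((Q[:, z, None] * diff).T @ diff / Nk[z])  # weighted covariances
    return weights, means, covs
```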

Convergence of EM
Define a potential function F(θ, Q). EM corresponds to coordinate ascent on F:
- thus it maximizes a lower bound on the marginal log-likelihood
- as seen in the Machine Learning class last semester
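One standard way to write this potential function (a reconstruction based on the usual EM analysis; H denotes entropy):

```latex
F(\theta, Q)
  \;=\; \sum_{j} \Big(
        \mathbb{E}_{Q(z \mid x^{(j)})}\!\big[\log P(x^{(j)}, z \mid \theta)\big]
        \;+\; H\big(Q(\cdot \mid x^{(j)})\big)
      \Big)
```

The E-step maximizes F over Q with θ fixed, the M-step maximizes F over θ with Q fixed, so each iteration can only increase F, and F lower-bounds the marginal log-likelihood.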

Data likelihood for BNs
Given the structure, the log-likelihood of fully observed data decomposes per CPT. (Figure: the Flu, Allergy, Sinus, Headache, Nose example network.)
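The decomposition in question, in standard form: with structure G fixed and all variables observed, the log-likelihood splits into one term per variable and data point.

```latex
\log P(\mathcal{D} \mid \theta, G)
  \;=\; \sum_{j=1}^{m} \sum_{i} \log P\!\big(x_i^{(j)} \mid \mathbf{pa}_{X_i}^{(j)},\, \theta\big)
```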

Marginal likelihood
What if S (Sinus) is hidden?

Log-likelihood for BNs with hidden data
Marginal likelihood, where O is observed and H is hidden.

E-step for BNs
The E-step computes the probability of the hidden variables h given the observations o, which corresponds to inference in the BN.

The M-step for BNs
Maximization step: use expected counts instead of counts.
- If learning requires Count(h, o)
- Use E_{Q^(t+1)}[Count(h, o)]

M-step for each CPT
The M-step decomposes per CPT:
- Standard MLE: θ_{x_i | pa_i} = Count(x_i, pa_i) / Count(pa_i)
- The M-step uses expected counts instead: θ_{x_i | pa_i} = ExCount(x_i, pa_i) / ExCount(pa_i)
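A tiny sketch of that normalization, assuming the expected counts have already been collected into a dict keyed by (x_i value, parent assignment); this layout is hypothetical:

```python
def cpt_from_expected_counts(ex_counts):
    """theta_{x_i | pa_i} = ExCount(x_i, pa_i) / ExCount(pa_i)."""
    parent_totals = {}
    for (x_val, pa_val), count in ex_counts.items():
        parent_totals[pa_val] = parent_totals.get(pa_val, 0.0) + count
    return {(x_val, pa_val): count / parent_totals[pa_val]
            for (x_val, pa_val), count in ex_counts.items()}
```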

Computing expected counts
The M-step requires expected counts:
- We observe O = o
- For a set of variables A, we must compute ExCount(A = a)
- Some of A in example j will be observed: denote these A_O = a_O^(j)
- Some of A will be hidden: denote these A_H
Use inference (the E-step computes the expected counts): ExCount^(t+1)(A_O = a_O, A_H = a_H)
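A minimal sketch of how these expected counts could be accumulated for one set of variables A, assuming an inference routine `posterior_marginal(hidden_vars, observed)` that returns the posterior over the hidden part under the current parameters (that routine, and all names here, are illustrative assumptions):

```python
from collections import defaultdict

def expected_counts(data, A, posterior_marginal):
    """ExCount(A = a) accumulated over data points with possibly different hidden variables.

    data: list of dicts {observed variable name: value} (one dict per data point)
    A:    tuple of variable names whose joint counts the M-step needs
    """
    ex_counts = defaultdict(float)
    for obs in data:
        a_obs = tuple((v, obs[v]) for v in A if v in obs)   # A_O = a_O^(j)
        a_hidden = [v for v in A if v not in obs]           # A_H
        if not a_hidden:
            ex_counts[a_obs] += 1.0                          # fully observed: ordinary count
            continue
        for assignment, prob in posterior_marginal(a_hidden, obs).items():
            ex_counts[a_obs + tuple(zip(a_hidden, assignment))] += prob  # soft count
    return ex_counts
```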

Data need not be hidden in the same way
- When data is fully observed, a data point is a complete assignment to all variables
- When data is partially observed, a data point is an assignment to only the observed variables
- The unobserved variables can be different for different data points (e.g., some points may be fully labeled while others have hidden variables)
Same framework, just change the definition of expected counts:
- Observed vars in point j: O^(j) = o^(j)
- Consider a set of variables A
- ExCount^(t+1)(A = a)

Poster printing
Poster session: Friday Dec 1st, 3-6pm in the NSH Atrium.
- There will be a popular vote for best poster. Invite your friends!
- Please be ready to set up your poster at 2:45pm sharp.
- We will provide posterboards, easels and pins. The posterboards are 30x40 inches.
- We don't have a specific poster format for you to use: you can either bring a big poster or print a set of regular-sized pages and pin them together.
- Unfortunately, we don't have a budget to pay for printing. If you are an SCS student, SCS has a poster printer you can use which prints on a 36" wide roll of paper. If you are a student outside SCS, check with your department to see if there are printing facilities for big posters (I don't know what is offered outside SCS), or print a set of regular-sized pages.
We are looking forward to a great poster session!

Learning structure with missing data [K&F 18.4]
- Known BN structure: use expected counts; the learning algorithm doesn't change.
- Unknown BN structure: can use expected counts and score models as when we talked about structure learning, but this is very slow (e.g., a greedy algorithm would need to redo inference for every edge we test).
(Much faster) Structure-EM. Iterate:
- compute expected counts
- do some structure search (e.g., many greedy steps)
- repeat
Theorem: converges to a local optimum of the marginal log-likelihood (details in the book). A skeleton of the loop is sketched below.
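The skeleton below shows the shape of that loop (a sketch only; the three callables are hypothetical stand-ins for inference, structure search, and parameter fitting, passed in so the snippet stays self-contained):

```python
def structure_em(data, structure, params,
                 compute_expected_counts,  # (data, structure, params) -> expected counts
                 structure_search,         # (structure, ex_counts) -> improved structure
                 fit_parameters,           # (structure, ex_counts) -> MLE parameters
                 n_iters=20):
    """Structure-EM: alternate expected-count computation with many structure-search steps."""
    for _ in range(n_iters):
        ex_counts = compute_expected_counts(data, structure, params)  # E-step: inference once
        structure = structure_search(structure, ex_counts)            # many greedy moves, no re-inference
        params = fit_parameters(structure, ex_counts)                 # M-step from expected counts
    return structure, params
```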

What you need to know about learning BNs with missing data
EM for Bayes nets:
- E-step: inference computes expected counts (only need expected counts over X_i and Pa_{X_i})
- M-step: expected counts are used to estimate parameters
- Which variables are hidden can change per data point
- This also lets you use labeled and unlabeled data together: some data points are complete, some include hidden variables
Structure-EM: iterate between computing expected counts and many structure-search steps.

MNs & CRFs with missing data
MNs with missing data:
- Model P(X), with part of X hidden
- Use EM to optimize
- Same ideas as for BNs
CRFs with missing data:
- Model P(Y|X)
- What's hidden? Part of Y, all of Y, or part of X

22 Kalman Filters, Gaussian BNs
Graphical Models – Carlos Guestrin, Carnegie Mellon University, November 24th, 2008. Readings: K&F: 6.1, 6.2, 6.3, 14.1, 14.2

23 Adventures of our BN hero
Compact representation for probability distributions; fast inference; fast learning; approximate inference. But... who are the most popular kids?
1. Naïve Bayes
2 and 3. Hidden Markov models (HMMs) and Kalman filters

24 The Kalman Filter
- An HMM with Gaussian distributions
- Has been around for at least 60 years
- Possibly the most used graphical model ever
- It's what does your cruise control, tracks missiles, controls robots, ...
- And it's so simple... possibly explaining why it's so used
- Many interesting models build on it... it's an example of a Gaussian BN (more on this later)

25 Example of KF – SLAT: Simultaneous Localization and Tracking [Funiak, Guestrin, Paskin, Sukthankar '06]
- Place some cameras around an environment; we don't know where they are
- Could measure all locations, but that requires lots of grad-student (Stano) time
- Intuition: a person walks around; if camera 1 sees the person and then camera 2 sees the person, we learn about the relative positions of the cameras

26 Example of KF – SLAT: Simultaneous Localization and Tracking [Funiak, Guestrin, Paskin, Sukthankar '06]

27 Multivariate Gaussian
Mean vector μ and covariance matrix Σ.
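For reference, the density parameterized by that mean vector and covariance matrix, in its standard n-dimensional form:

```latex
p(\mathbf{x})
  \;=\; \mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \Sigma)
  \;=\; \frac{1}{(2\pi)^{n/2}\,|\Sigma|^{1/2}}
        \exp\!\Big(-\tfrac{1}{2}\,(\mathbf{x}-\boldsymbol{\mu})^{\top}\Sigma^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big)
```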

28 Conditioning a Gaussian
- Joint Gaussian: p(X, Y) ~ N(μ; Σ)
- Conditional linear Gaussian: p(Y|X) ~ N(μ_{Y|X}; σ²_{Y|X})

29 Gaussian is a "Linear Model"
Conditional linear Gaussian: p(Y|X) ~ N(β₀ + βX; σ²)

30 Conditioning a Gaussian
- Joint Gaussian: p(X, Y) ~ N(μ; Σ)
- Conditional linear Gaussian: p(Y|X) ~ N(μ_{Y|X}; Σ_{YY|X})
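The conditional mean and covariance referred to here follow the standard Gaussian conditioning identities, written in block notation (a reconstruction for reference, not the slide's own derivation):

```latex
\mu_{Y|X=x} \;=\; \mu_Y + \Sigma_{YX}\,\Sigma_{XX}^{-1}\,(x - \mu_X),
\qquad
\Sigma_{YY|X} \;=\; \Sigma_{YY} - \Sigma_{YX}\,\Sigma_{XX}^{-1}\,\Sigma_{XY}
```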

31 Conditional Linear Gaussian (CLG) – general case
Conditional linear Gaussian: p(Y|X) ~ N(β₀ + βX; Σ_{YY|X})

32 Understanding a linear Gaussian – the 2d case
- Variance increases over time (motion noise adds up)
- The object doesn't necessarily move in a straight line

33 Tracking with a Gaussian 1
- p(X₀) ~ N(μ₀, Σ₀)
- p(X_{i+1}|X_i) ~ N(B X_i + β; Σ_{X_{i+1}|X_i})
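Under these two assumptions the time update has the familiar closed form (standard Kalman prediction, stated with the matrix B and offset β used above, which are reconstructed symbols):

```latex
\mu_{i+1}^{-} \;=\; B\,\mu_i + \beta,
\qquad
\Sigma_{i+1}^{-} \;=\; B\,\Sigma_i\,B^{\top} + \Sigma_{X_{i+1}|X_i}
```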

34 Tracking with Gaussians 2 – Making observations
- We have p(X_i)
- Detector observes O_i = o_i
- Want to compute p(X_i | O_i = o_i); use Bayes rule
- Requires a CLG observation model: p(O_i | X_i) ~ N(W X_i + v; Σ_{O_i|X_i})
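Putting the transition and observation models together gives the usual predict/update recursion. The NumPy sketch below is illustrative only; matrix names follow the reconstructed notation above, and Q, R stand for the transition and observation noise covariances:

```python
import numpy as np

def kalman_step(mu, Sigma, o, B, beta, Q, W, v, R):
    """One predict + update step for the linear-Gaussian (Kalman filter) model.

    Transition:  p(X_{i+1} | X_i = x) = N(B x + beta, Q)
    Observation: p(O_i     | X_i = x) = N(W x + v,    R)
    """
    # Predict: push the current Gaussian belief through the linear dynamics
    mu_pred = B @ mu + beta
    Sigma_pred = B @ Sigma @ B.T + Q

    # Update: condition on the observation O_i = o via Bayes rule
    S = W @ Sigma_pred @ W.T + R                 # innovation covariance
    K = Sigma_pred @ W.T @ np.linalg.inv(S)      # Kalman gain
    mu_new = mu_pred + K @ (o - (W @ mu_pred + v))
    Sigma_new = (np.eye(len(mu_new)) - K @ W) @ Sigma_pred
    return mu_new, Sigma_new
```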