Expectation-Maximization

Markoviana Reading Group
Fatih Gelgi, ASU, 2005

Outline
- What is EM? Intuitive explanation
- Example: Gaussian mixture
- Algorithm
- Generalized EM
- Applications: HMM (Baum-Welch), K-means
- Discussion

What is EM?
Two main applications:
- The data has missing values, due to problems with or limitations of the observation process.
- Optimizing the likelihood function directly is extremely hard, but the likelihood can be simplified by assuming the existence of (and values for) additional missing or hidden variables.

Key Idea
The observed data U is generated by some distribution and is called the incomplete data. Assume that a complete data set exists, Z = (U, J), where J is the missing or hidden data.
Maximize the posterior probability of the parameters Θ given the data U, marginalizing over J:

    Θ* = argmax_Θ ∑_J P(Θ, J | U)

Intuitive Explanation of EM
Alternate between estimating the unknowns Θ and the hidden variables J. In each iteration, instead of finding the single best J ∈ J, compute a distribution over the space J.
EM is a lower-bound maximization process (Minka, 1998):
- E-step: construct a local lower bound to the posterior distribution.
- M-step: optimize the bound.

Intuitive Explanation of EM (cont.)
As a lower-bound approximation method, EM sometimes converges faster than gradient descent or Newton's method.

Example: Mixture Components (figure)

Example (cont'd): True Likelihood of Parameters (figure)

Example (cont'd): Iterations of EM (figure)
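
The plots these example slides refer to are not reproduced in the transcript, but the iterations are easy to regenerate. The following is a minimal sketch, not taken from the slides, of EM for a 1-D two-component Gaussian mixture; the E-step responsibilities play the role of the distribution f^t(J) over the hidden component labels.

```python
import numpy as np

def em_gmm_1d(x, n_iter=50, seed=0):
    """EM for a two-component 1-D Gaussian mixture (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Initial guesses for mixing weights, means, and variances.
    pi = np.array([0.5, 0.5])
    mu = rng.choice(x, size=2, replace=False)
    var = np.array([x.var(), x.var()])

    for _ in range(n_iter):
        # E-step: responsibilities r[n, k] = P(component k | x_n, current params).
        dens = np.stack([
            pi[k] * np.exp(-0.5 * (x - mu[k])**2 / var[k]) / np.sqrt(2 * np.pi * var[k])
            for k in range(2)
        ], axis=1)
        r = dens / dens.sum(axis=1, keepdims=True)

        # M-step: maximize the expected complete-data log-likelihood.
        nk = r.sum(axis=0)
        pi = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu)**2).sum(axis=0) / nk

    return pi, mu, var

# Toy data: two well-separated clusters.
x = np.concatenate([np.random.normal(-2, 0.5, 200), np.random.normal(3, 1.0, 300)])
print(em_gmm_1d(x))
```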

Lower-bound Maximization
Posterior probability: P(Θ | U) ∝ ∑_J P(U, J, Θ).
Logarithm of the joint distribution: log P(U, Θ) = log ∑_J P(U, J, Θ), which is difficult to maximize directly because the sum sits inside the logarithm.
Idea: start with a guess Θ^t, compute an easily computed lower bound B(Θ; Θ^t) to the function log P(Θ | U), and maximize the bound instead.

Lower-bound Maximization (cont.)
Construct a tractable lower bound B(Θ; Θ^t) that contains a sum of logarithms:

    B(Θ; Θ^t) = ∑_J f^t(J) log [ P(U, J, Θ) / f^t(J) ]

where f^t(J) is an arbitrary probability distribution over the hidden data J. By Jensen's inequality,

    B(Θ; Θ^t) ≤ log ∑_J f^t(J) · P(U, J, Θ) / f^t(J) = log P(U, Θ).

Optimal Bound
B(Θ; Θ^t) touches the objective function log P(U, Θ) at Θ^t. To achieve this, maximize B(Θ^t; Θ^t) with respect to f^t(J), introducing a Lagrange multiplier λ to enforce the constraint ∑_J f^t(J) = 1:

    L = λ ( 1 - ∑_J f^t(J) ) + ∑_J f^t(J) [ log P(U, J, Θ^t) - log f^t(J) ]

Optimal Bound (cont.)
Derivative with respect to f^t(J):

    ∂L/∂f^t(J) = -λ + log P(U, J, Θ^t) - log f^t(J) - 1

Setting it to zero and normalizing, the bound is maximized at

    f^t(J) = P(J | U, Θ^t),

the posterior over the hidden data given the current parameter estimate.
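
A quick numerical check of the last three slides (my own sketch, not from the deck): for a toy joint distribution over a small discrete hidden space, any distribution f gives B ≤ log P(U, Θ), and the posterior P(J | U, Θ^t) attains equality, so the bound touches the objective at Θ^t.

```python
import numpy as np

# Toy joint P(U, J, Theta^t) as a vector over a discrete hidden variable J.
joint = np.array([0.08, 0.02, 0.15, 0.05])   # P(U, J=j, Theta^t) for j = 0..3
log_p_u_theta = np.log(joint.sum())          # log P(U, Theta^t)

def bound(f):
    """B(Theta^t; Theta^t) = sum_J f(J) log [ P(U, J, Theta^t) / f(J) ]."""
    return np.sum(f * (np.log(joint) - np.log(f)))

rng = np.random.default_rng(0)
f_random = rng.dirichlet(np.ones(4))     # arbitrary distribution over J
f_posterior = joint / joint.sum()        # P(J | U, Theta^t)

print(bound(f_random), "<=", log_p_u_theta)       # strict inequality (almost surely)
print(bound(f_posterior), "==", log_p_u_theta)    # equality: the bound touches
```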

Maximizing the Bound
Re-write B(Θ; Θ^t) in terms of expectations under f^t(J) = P(J | U, Θ^t):

    B(Θ; Θ^t) = E_{J|U,Θ^t}[ log P(U, J | Θ) ] + log P(Θ) - E_{J|U,Θ^t}[ log f^t(J) ],

where the last term does not depend on Θ. Finally,

    Θ^{t+1} = argmax_Θ B(Θ; Θ^t) = argmax_Θ { E_{J|U,Θ^t}[ log P(U, J | Θ) ] + log P(Θ) }.

EM Algorithm
Iterate until convergence:
- E-step: compute f^t(J) = P(J | U, Θ^t) and the expectation Q(Θ; Θ^t) = E_{J|U,Θ^t}[ log P(U, J | Θ) ].
- M-step: set Θ^{t+1} = argmax_Θ B(Θ; Θ^t) = argmax_Θ { Q(Θ; Θ^t) + log P(Θ) }.
EM converges to a local maximum of log P(U, Θ), which is also a local maximum of log P(Θ | U), since the two differ only by the constant log P(U).
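
A minimal sketch of the loop above, assuming user-supplied callbacks; the names e_step, m_step, and log_posterior are placeholders, not from the slides.

```python
def em(U, theta0, e_step, m_step, log_posterior, tol=1e-6, max_iter=200):
    """Generic EM loop: alternate E- and M-steps until log P(Theta | U) stops improving.

    e_step(U, theta)        -> f, the posterior over the hidden data J given U and theta
    m_step(U, f)            -> new theta maximizing the bound B(theta; theta_t)
    log_posterior(U, theta) -> log P(theta | U) up to a constant, used only to monitor convergence
    """
    theta = theta0
    prev = log_posterior(U, theta)
    for _ in range(max_iter):
        f = e_step(U, theta)          # E-step: f^t(J) = P(J | U, theta^t)
        theta = m_step(U, f)          # M-step: theta^{t+1} = argmax_theta B(theta; theta^t)
        cur = log_posterior(U, theta)
        assert cur >= prev - 1e-9, "EM should never decrease the objective"
        if cur - prev < tol:
            break
        prev = cur
    return theta
```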

A Relation to the Log-Posterior
An alternative view is to maximize the expected log-posterior of the parameters,

    E_{J|U,Θ^t}[ log P(Θ | U, J) ],

which, up to terms that do not depend on Θ, is the same objective: maximization with respect to Θ again gives Θ^{t+1} = argmax_Θ { E_{J|U,Θ^t}[ log P(U, J | Θ) ] + log P(Θ) }.

Generalized EM
Assume log P(Θ | U) and the bound B are differentiable in Θ. EM then converges to a point Θ* where ∂ log P(Θ | U) / ∂Θ = 0, i.e. a stationary point (usually a local maximum) of the log-posterior.
GEM: instead of setting Θ^{t+1} = argmax_Θ B(Θ; Θ^t), just find a Θ^{t+1} such that B(Θ^{t+1}; Θ^t) > B(Θ^t; Θ^t). GEM is also guaranteed to converge.
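
One way to realize a GEM step, shown as a sketch with hypothetical names (bound, gem_m_step): take a single gradient move on B and accept it only if the bound actually increases, which is all the condition above requires.

```python
import numpy as np

def gem_m_step(bound, theta, lr=0.1, eps=1e-6):
    """One GEM-style partial M-step: move theta (a 1-D numpy array) uphill on
    B(. ; theta_t) just far enough that the bound increases, instead of
    maximizing it exactly.

    bound(theta) -> B(theta; theta_t) for the current, fixed theta_t.
    """
    # Finite-difference gradient of the bound at the current theta.
    grad = np.zeros_like(theta, dtype=float)
    for i in range(len(theta)):
        step = np.zeros_like(theta, dtype=float)
        step[i] = eps
        grad[i] = (bound(theta + step) - bound(theta - step)) / (2 * eps)

    # Backtracking: halve the step until the bound actually improves.
    b0 = bound(theta)
    while lr > 1e-12:
        candidate = theta + lr * grad
        if bound(candidate) > b0:
            return candidate
        lr *= 0.5
    return theta  # no improving step found; keep theta (still a valid GEM iterate)
```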

HMM – Baum-Welch Revisited
Estimate the parameters Θ = (a, b, π) so that the expected number of individually correct states is maximized.
γ_t(i) is the probability of being in state S_i at time t.
ξ_t(i, j) is the probability of being in state S_i at time t and in state S_j at time t+1.
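
The expressions behind these two quantities are not legible in the transcript; in terms of the forward variables α_t(i) and backward variables β_t(i), the standard formulas from Rabiner's tutorial are:

```latex
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_k \alpha_t(k)\,\beta_t(k)},
\qquad
\xi_t(i,j) = \frac{\alpha_t(i)\,a_{ij}\,b_j(O_{t+1})\,\beta_{t+1}(j)}
                  {\sum_k \sum_l \alpha_t(k)\,a_{kl}\,b_l(O_{t+1})\,\beta_{t+1}(l)},
\qquad
\gamma_t(i) = \sum_j \xi_t(i,j).
```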

Baum-Welch: E-step
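
The E-step equations on this slide are not reproduced in the transcript. Below is a sketch of the standard forward-backward computation of γ and ξ (no numerical scaling, so it is only suitable for short observation sequences).

```python
import numpy as np

def baum_welch_e_step(obs, pi, A, B):
    """Forward-backward pass: gamma[t, i] = P(state i at t | obs), and
    xi[t, i, j] = P(state i at t, state j at t+1 | obs).

    obs : observation indices, length T
    pi  : initial state distribution, shape (N,)
    A   : transition matrix, A[i, j] = P(j | i), shape (N, N)
    B   : emission matrix, B[i, k] = P(symbol k | state i), shape (N, M)
    """
    obs = np.asarray(obs)
    T, N = len(obs), len(pi)

    # Forward variables alpha[t, i] = P(o_1..o_t, state i at t).
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward variables beta[t, i] = P(o_{t+1}..o_T | state i at t).
    beta = np.zeros((T, N))
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    likelihood = alpha[-1].sum()              # P(O | model)
    gamma = alpha * beta / likelihood
    xi = (alpha[:-1, :, None] * A[None, :, :] *
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / likelihood
    return gamma, xi, likelihood
```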

Baum-Welch: M-step
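
Likewise, a sketch of the re-estimation step given the expected counts from the E-step above; baum_welch_m_step and n_symbols are my names, not the slides'.

```python
import numpy as np

def baum_welch_m_step(obs, gamma, xi, n_symbols):
    """Re-estimate (pi, A, B) from the expected counts gamma and xi."""
    obs = np.asarray(obs)

    # pi_i: expected frequency of state i at time 1.
    pi_new = gamma[0]

    # a_ij: expected number of i -> j transitions
    #       / expected number of transitions out of i.
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]

    # b_i(k): expected number of times in state i emitting symbol k
    #         / expected number of times in state i.
    B_new = np.zeros((gamma.shape[1], n_symbols))
    for k in range(n_symbols):
        B_new[:, k] = gamma[obs == k].sum(axis=0)
    B_new /= gamma.sum(axis=0)[:, None]

    return pi_new, A_new, B_new
```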

K-Means
Problem: given data X and the number of clusters K, find the clusters.
Clustering is based on centroids: a point belongs to the cluster with the closest centroid.
Hidden variables: the centroids of the clusters!

K-Means (cont.)
Starting with initial centroids Θ^0:
- E-step: split the data into K clusters according to their distances to the centroids (this computes the distribution f^t(J)).
- M-step: update the centroids (this computes Θ^{t+1}).
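
A compact sketch of this procedure (function name and defaults are mine), with the assignment step written as the E-step and the centroid update as the M-step.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means, written to mirror the E-step / M-step view above."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # initial Theta^0

    for _ in range(n_iter):
        # E-step: assign each point to its closest centroid (a hard f^t(J)).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)

        # M-step: recompute each centroid as the mean of its assigned points.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return centroids, labels
```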

K-Means Example (K = 2) (figure): pick seeds; reassign clusters; compute centroids; converged!

Discussion
Is EM a primal-dual algorithm?

References
- A. P. Dempster, N. M. Laird, and D. B. Rubin, "Maximum Likelihood from Incomplete Data via the EM Algorithm", Journal of the Royal Statistical Society, Series B (Methodological), Vol. 39, No. 1, 1977, pp. 1-38.
- F. Dellaert, "The Expectation Maximization Algorithm", Tech. Rep. GIT-GVU-02-20, 2002.
- T. Minka, "Expectation-Maximization as Lower Bound Maximization", 1998.
- Y. Chang and M. Kölsch, presentation: "Expectation Maximization", UCSB, 2002.
- K. Andersson, presentation: "Model Optimization using the EM Algorithm", COSC 7373, 2001.

Thanks!