CSE 517 Natural Language Processing Winter 2015

CSE 517 Natural Language Processing, Winter 2015
Expectation Maximization
Yejin Choi, University of Washington

(Recap) Expectation Maximization for HMM
- Initialize transition and emission parameters (random, uniform, or more informed initialization).
- Iterate until convergence:
  - E-step: compute expected counts.
  - M-step: compute new transition and emission parameters (using the expected counts computed above).
How does this relate to the general form of EM?
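To make the recap concrete, here is a minimal numpy sketch of this E/M loop for an HMM (Baum-Welch). It is an illustrative sketch, not the course's reference implementation: the function names (`forward_backward`, `baum_welch`) and the unscaled forward/backward recursions are my own choices, and a practical implementation would use scaling or log-space arithmetic for long sequences.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Posteriors for one observation sequence (unscaled; short sequences only).
    obs: list of observation indices; pi: (S,); A: (S,S) transitions; B: (S,V) emissions."""
    T, S = len(obs), len(pi)
    alpha = np.zeros((T, S))
    beta = np.zeros((T, S))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    Z = alpha[T - 1].sum()                       # likelihood p(obs)
    gamma = alpha * beta / Z                     # state posteriors, shape (T, S)
    xi = np.zeros((T - 1, S, S))                 # pairwise (transition) posteriors
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * (B[:, obs[t + 1]] * beta[t + 1])[None, :]) / Z
    return gamma, xi, Z

def baum_welch(sequences, S, V, n_iter=20, seed=0):
    """EM for an HMM with S states and V symbols over a list of index sequences."""
    rng = np.random.default_rng(seed)
    pi = rng.dirichlet(np.ones(S))
    A = rng.dirichlet(np.ones(S), size=S)
    B = rng.dirichlet(np.ones(V), size=S)
    for _ in range(n_iter):
        # E-step: accumulate expected counts over all sequences.
        pi_c = np.zeros(S)
        A_c = np.zeros((S, S))
        B_c = np.zeros((S, V))
        for obs in sequences:
            gamma, xi, _ = forward_backward(obs, pi, A, B)
            pi_c += gamma[0]
            A_c += xi.sum(axis=0)
            for t, o in enumerate(obs):
                B_c[:, o] += gamma[t]
        # M-step: closed form for products of multinomials = normalize expected counts.
        pi = pi_c / pi_c.sum()
        A = A_c / A_c.sum(axis=1, keepdims=True)
        B = B_c / B_c.sum(axis=1, keepdims=True)
    return pi, A, B
```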

Expectation Maximization (General Form)
Input: model and unlabeled data
- Initialize parameters.
- Until convergence:
  - E-step (expectation): compute the posteriors over the hidden variables (while fixing the model parameters).
  - M-step (maximization): compute parameters that maximize the expected log likelihood, where the expectation is taken under the posteriors computed in the E-step.
Result: learn parameters that (locally) maximize the likelihood of the data.
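In symbols, with observed data x, hidden variables y, and parameters θ, the two steps above are usually written as follows (a standard formulation; the slide's own equations are not preserved in the transcript):

```latex
% E-step: posterior over the hidden variables under the current parameters
q(y) \;=\; p\!\left(y \mid x;\, \theta^{\text{old}}\right)

% M-step: maximize the expected complete-data log likelihood under q
\theta^{\text{new}} \;=\; \arg\max_{\theta} \; \mathbb{E}_{q(y)}\!\left[\, \log p(x, y;\, \theta) \,\right]
```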

Expectation Maximization
E-step (expectation): compute the posteriors (while fixing the model parameters).
- We don't actually need the full posteriors; we only need the "sufficient statistics" that matter for the M-step, which boil down to "expected counts" of events.
- This is computationally expensive when y is a structured, multivariate hidden variable.
M-step (maximization): compute parameters that maximize the expected log likelihood.
- For models that are a product of multinomials (e.g., naive Bayes, HMM, PCFG), closed forms exist: the "maximum likelihood estimates (MLEs)".
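For instance, for an HMM the "expected counts" that act as sufficient statistics are posterior marginals summed over positions, the standard forward-backward quantities (notation mine, not from the slide):

```latex
\hat{c}(s \to s') \;=\; \sum_{t} p\!\left(y_t = s,\, y_{t+1} = s' \mid x;\, \theta\right),
\qquad
\hat{c}(s \to w) \;=\; \sum_{t \,:\, x_t = w} p\!\left(y_t = s \mid x;\, \theta\right)
```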

Some Questions about EM
- Does EM always converge?
- Does EM converge even with an approximate E-step? (hard EM vs. soft EM)
- Does EM converge to a global optimum or a local optimum (or a saddle point)?
- EM improves the "likelihood". How, given that what the M-step maximizes is the "expected log likelihood"?
- What are the maximum likelihood estimates (MLEs) used in the M-step?
- When should we use EM (or not)?

EM improves the likelihood
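The slide's derivation is not preserved in the transcript; the standard lower-bound argument for why each EM iteration cannot decrease the log likelihood is sketched below.

```latex
% Exact decomposition: for any distribution q over the hidden variables y,
\log p(x;\theta)
  \;=\; \underbrace{\mathbb{E}_{q(y)}\!\left[\log \tfrac{p(x,y;\theta)}{q(y)}\right]}_{\mathcal{L}(q,\theta)}
  \;+\; \mathrm{KL}\!\left(q(y) \,\big\|\, p(y \mid x;\theta)\right)
  \;\ge\; \mathcal{L}(q,\theta).

% E-step: q(y) := p(y \mid x; \theta^{\text{old}}) makes the KL term zero,
% so the bound is tight: \mathcal{L}(q, \theta^{\text{old}}) = \log p(x; \theta^{\text{old}}).
% M-step: \theta^{\text{new}} = \arg\max_\theta \mathcal{L}(q,\theta), hence
\log p(x;\theta^{\text{new}})
  \;\ge\; \mathcal{L}(q,\theta^{\text{new}})
  \;\ge\; \mathcal{L}(q,\theta^{\text{old}})
  \;=\; \log p(x;\theta^{\text{old}}).
```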

Maximum Likelihood Estimates
Supervised learning for:
1. Language Models
2. HMM
3. PCFG
(count-based estimates; see the sketch below)
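The slide's estimates themselves did not survive the transcript; in their standard count-based (relative-frequency) form they are:

```latex
% Bigram language model
p(w' \mid w) \;=\; \frac{c(w, w')}{c(w)}

% HMM: transitions and emissions
p(s' \mid s) \;=\; \frac{c(s \to s')}{\sum_{s''} c(s \to s'')},
\qquad
p(w \mid s) \;=\; \frac{c(s \to w)}{\sum_{w'} c(s \to w')}

% PCFG: rule probabilities
p(\alpha \to \beta \mid \alpha) \;=\; \frac{c(\alpha \to \beta)}{\sum_{\gamma} c(\alpha \to \gamma)}
```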

Maximum Likelihood Estimates
Models:
1. Language Models
2. HMM
3. PCFG
What's common? Each is a product of multinomials*!
*"Multinomial" is a conflated term here; "categorical distribution" is more correct.

MLEs maximize Likelihood
Supervised learning for:
1. Language Models
2. HMM
3. PCFG
The count-based MLEs happen to be intuitive, and we can also prove that:
- MLE with actual counts maximizes the log likelihood.
- MLE with expected counts maximizes the expected log likelihood.

MLE for multinomial distributions
Let's first consider a simpler case: a single multinomial (categorical) distribution with parameters θ_1, ..., θ_K. We want to learn the parameters that maximize the (log) likelihood of the training data:
  ℓ(θ) = Σ_k c_k log θ_k
Since the model is a multinomial, the likelihood depends on the training data only through
  c_k := the count of outcome k in the training data.
For example, for a unigram LM, θ_apple = p(apple) and c_apple := count(apple) in the training corpus.

MLE for multinomial distributions
Learning parameters θ that maximize Σ_k c_k log θ_k such that Σ_k θ_k = 1 is equivalent to learning parameters that maximize the Lagrangian
  L(θ, λ) = Σ_k c_k log θ_k + λ (1 − Σ_k θ_k)
λ is called a Lagrange multiplier. You can add additional λ terms, one for each equality constraint; each λ term encodes one constraint.

MLE for multinomial distributions
Learning parameters θ that maximize Σ_k c_k log θ_k such that Σ_k θ_k = 1 is equivalent to learning parameters that maximize the Lagrangian above. Find the optimal parameters by setting the partial derivatives to 0:
  ∂L/∂θ_k = c_k / θ_k − λ = 0  ⟹  θ_k = c_k / λ
Enforcing Σ_k θ_k = 1 gives λ = Σ_j c_j, so
  θ_k = c_k / Σ_j c_j
We have the MLE! This generalizes to a product of multinomials, e.g., HMM and PCFG: for each probability distribution that needs to sum to 1, create a different λ term.
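As a small sanity check of the closed form just derived, here is a count-and-normalize sketch for a unigram LM. The helper name and toy corpus are illustrative, not from the slides.

```python
from collections import Counter

def unigram_mle(tokens):
    """MLE for a single categorical/multinomial distribution:
    theta_k = c_k / sum_j c_j, i.e., count and normalize."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: c / total for word, c in counts.items()}

corpus = "the cat ate the apple the end".split()
theta = unigram_mle(corpus)
print(theta["the"])    # 3/7 ≈ 0.4286
print(theta["apple"])  # 1/7 ≈ 0.1429
```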

When to use EM (or not)
The ultimate goal of (unsupervised) learning is to find the parameters θ that maximize the likelihood of the training data.
- For some models, it is difficult to find the parameters that maximize the log likelihood directly.
- For such models, it is sometimes very easy to find the parameters that maximize the expected log likelihood: use EM! For example, there are closed-form solutions (MLEs) for models that are a product of multinomials (i.e., categorical distributions).
- If optimizing the expected log likelihood is not any easier than optimizing the log likelihood, there is no need to use EM.
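The objective referred to at the top of this slide is, in its usual form, the marginal log likelihood of the observed data, summing out the hidden variable (the slide's own equation is not in the transcript):

```latex
\theta^{*} \;=\; \arg\max_{\theta} \; \sum_{i=1}^{n} \log \sum_{y} p\!\left(x^{(i)}, y;\, \theta\right)
```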

Other EM Variants
- Generalized EM (GEM): when the exact M-step is difficult, find a θ that improves, but does not necessarily maximize, the expected log likelihood. Converges to a local optimum.
- Stochastic EM: when the exact E-step is difficult, approximate the expected counts by Monte Carlo sampling. Asymptotically converges to a local optimum.
- Hard EM: when the exact E-step is difficult, find the single best prediction of the hidden variable y and put all the probability mass (= 1) on that prediction. K-means is hard EM. It will converge as long as each M-step improves the expected log likelihood.
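Since the slide identifies K-means with hard EM, here is a minimal numpy sketch that makes the correspondence explicit: the "E-step" assigns each point all of its probability mass on a single nearest cluster, and the "M-step" refits the cluster means. Function and variable names are illustrative, not from the lecture.

```python
import numpy as np

def kmeans_hard_em(X, k, n_iter=50, seed=0):
    """K-means viewed as hard EM for a spherical Gaussian mixture.
    X: (n, d) data matrix; returns (centers, hard assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # init from data points
    z = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # Hard E-step: each point puts all its probability mass on the nearest center.
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)  # (n, k)
        z = dists.argmin(axis=1)
        # M-step: refit each center as the mean of its assigned points.
        new_centers = np.array([
            X[z == j].mean(axis=0) if np.any(z == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, z

# Toy usage: two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
centers, assignments = kmeans_hard_em(X, k=2)
```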