Expectation Maximization Dekang Lin Department of Computing Science University of Alberta

Objectives
Expectation Maximization (EM) is perhaps the most often used, and most often only half understood, algorithm for unsupervised learning. It is very intuitive, and many people rely on their intuition to apply the algorithm in different problem domains. I will present a proof of the EM Theorem that explains why the algorithm works. Hopefully this will help in applying EM when intuition alone is not enough.

Model Building with Partial Observations
Our goal is to build a probabilistic model. A model is defined by a set of parameters θ. The model parameters can be estimated from a set of training examples x_1, x_2, ..., x_n, where the x_i's are independently and identically distributed (iid). Unfortunately, we only get to observe part of each training example: x_i = (t_i, y_i), and we can only observe y_i. How do we build the model?

Example: POS Tagging
Complete data: a sentence (a sequence of words) and the corresponding sequence of POS tags.
Observed data: the sentence.
Unobserved data: the sequence of tags.
Model: an HMM with transition/emission probability tables.

Training with Tagged Corpus
Pierre NNP Vinken NNP , , 61 CD years NNS old JJ , , will MD join VB the DT board NN as IN a DT nonexecutive JJ director NN Nov. NNP 29 CD . .
Mr. NNP Vinken NNP is VBZ chairman NN of IN Elsevier NNP N.V. NNP , , the DT Dutch NNP publishing VBG group NN . .
Rudolph NNP Agnew NNP , , 55 CD years NNS old JJ and CC former JJ chairman NN of IN Consolidated NNP Gold NNP Fields NNP PLC NNP , , was VBD named VBN a DT nonexecutive JJ director NN of IN this DT British JJ industrial JJ conglomerate NN . .
c(JJ) = 7, c(JJ, NN) = 4, so P(NN | JJ) = 4/7
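As a hedged illustration of this kind of counting (not part of the original slides), the Python sketch below estimates transition probabilities such as P(NN | JJ) by counting tag bigrams; the tiny (word, tag) corpus at the bottom is a made-up placeholder for the real tagged data.

from collections import Counter

def transition_probs(tagged_sentences):
    """Maximum likelihood estimate of P(next_tag | tag) from tag-bigram counts."""
    unigrams, bigrams = Counter(), Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        for t1, t2 in zip(tags, tags[1:]):
            unigrams[t1] += 1
            bigrams[(t1, t2)] += 1
    return {(t1, t2): c / unigrams[t1] for (t1, t2), c in bigrams.items()}

# Toy corpus in (word, tag) form; real training would use the full treebank.
corpus = [
    [("a", "DT"), ("nonexecutive", "JJ"), ("director", "NN")],
    [("the", "DT"), ("Dutch", "NNP"), ("publishing", "VBG"), ("group", "NN")],
]
probs = transition_probs(corpus)
print(probs.get(("JJ", "NN")))  # MLE estimate of P(NN | JJ) on the toy corpus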

Example: Parsing
Complete data: a sentence and its parse tree.
Observed data: the sentence.
Unobserved data: the nonterminal categories and their relationships that form the parse tree.
Model: a PCFG, or anything else that allows one to compute the probability of a parse tree.

Example: Semantic Labeling
Complete data: (context, cluster, word).
Observed data: (context, word).
Unobserved data: cluster.
Model: P(context, cluster, word) = P(context) P(cluster | context) P(word | cluster).

What is the Best Model?
There are many possible models, and many possible ways to set the model parameters. We obviously want the "best" model. Which model is the best? The model that assigns the highest probability to the observations is the best: maximize Π_i P_θ(y_i), or equivalently Σ_i log P_θ(y_i). This is known as maximum likelihood estimation (MLE). (What about maximizing the probability of the hidden data instead?)
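To make the MLE criterion concrete, here is a small illustrative Python sketch (the observations and candidate parameter values are invented for the example) that scores candidate settings of a single-coin model by their observed-data log-likelihood and keeps the best one.

import math

observations = "HTTHT"        # hypothetical observed tosses
candidates = [0.3, 0.5, 0.7]  # hypothetical candidate values of P(H)

def log_likelihood(p, obs):
    # sum_i log P_theta(y_i) for a single biased coin
    return sum(math.log(p if y == "H" else 1.0 - p) for y in obs)

best = max(candidates, key=lambda p: log_likelihood(p, observations))
print(best)  # the candidate that assigns the highest probability to the observations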

MLE Example
A coin with P(H) = p and P(T) = q. We observed m H's and n T's. What are p and q according to MLE?
Maximize Σ_i log P_θ(y_i) = log(p^m q^n) = m log p + n log q under the constraint p + q = 1.
Lagrange method: define g(p, q) = m log p + n log q + λ(p + q - 1) and solve the equations ∂g/∂p = 0 and ∂g/∂q = 0.
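Here is a short worked sketch of how solving these equations gives the familiar estimates:

\[
\frac{\partial g}{\partial p} = \frac{m}{p} + \lambda = 0, \qquad
\frac{\partial g}{\partial q} = \frac{n}{q} + \lambda = 0
\quad\Longrightarrow\quad p = -\frac{m}{\lambda}, \quad q = -\frac{n}{\lambda}.
\]
\[
p + q = 1 \;\Longrightarrow\; \lambda = -(m + n)
\;\Longrightarrow\; p = \frac{m}{m + n}, \qquad q = \frac{n}{m + n}.
\]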

Example
Suppose we have two coins. Coin 1 is fair; coin 2 generates H with probability p. Each coin has probability ½ of being chosen and tossed. The complete data is (1, H), (1, T), (2, T), (1, H), (2, T). We only know the result of each toss, but not which coin was chosen, so the observed data is H, T, T, H, T.
Problem: suppose the observations include m H's and n T's. How do we estimate p to maximize Σ_i log P_θ(y_i)?
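In this particular case the observed-data likelihood can still be maximized directly, since P(H) = ½·½ + ½·p = ¼ + p/2. The Python sketch below (an illustration, not from the slides) compares the resulting closed-form estimate with a simple grid search over p.

import math

def closed_form_p(m, n):
    # Setting the derivative of m*log(1/4 + p/2) + n*log(3/4 - p/2) to zero
    # gives p = (3m - n) / (2(m + n)); clamp to [0, 1] since p is a probability.
    return min(1.0, max(0.0, (3.0 * m - n) / (2.0 * (m + n))))

def log_likelihood(p, m, n):
    return m * math.log(0.25 + 0.5 * p) + n * math.log(0.75 - 0.5 * p)

m, n = 2, 3  # the observed H, T, T, H, T
grid_best = max((i / 1000.0 for i in range(1, 1000)),
                key=lambda p: log_likelihood(p, m, n))
print(closed_form_p(m, n), round(grid_best, 3))  # both should be close to 0.3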

Need for an Iterative Algorithm
Unfortunately, we often cannot find the best θ by solving equations.
Example: three coins, 0, 1, and 2, with probabilities p_0, p_1, and p_2 of generating H.
Experiment: toss coin 0; if it comes up H, toss coin 1 three times; if it comes up T, toss coin 2 three times. Only the three tosses of coin 1 or coin 2 are observed, not the outcome of coin 0.
Observations: a set of sequences of three tosses each.
What is the MLE for p_0, p_1, and p_2?

Overview of EM
Create an initial model θ_0, arbitrarily, randomly, or from a small set of training examples.
Use the current model θ' to obtain another model θ such that Σ_i log P_θ(y_i) > Σ_i log P_θ'(y_i).
Repeat the above step until reaching a local maximum. We are guaranteed to find a better model after each iteration (until a local maximum is reached).
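As a concrete instance of this loop, here is a minimal Python sketch of EM for the three-coin example from the previous slide; it is illustrative only, and the observation sequences at the bottom are hypothetical placeholders. The E-step computes, for each observed triple, the posterior probability that coin 0 came up H (so that coin 1 was tossed); the M-step re-estimates p_0, p_1, and p_2 from the resulting expected counts.

def em_three_coins(observations, p0, p1, p2, iterations=50):
    # Coin 0 (hidden) selects coin 1 or coin 2, which is then tossed three times (observed).
    for _ in range(iterations):
        w_sum = h1 = n1 = h2 = n2 = 0.0
        for seq in observations:
            h = seq.count("H")
            t = len(seq) - h
            # E-step: posterior probability that coin 0 came up H (coin 1 was used)
            like1 = p0 * (p1 ** h) * ((1 - p1) ** t)
            like2 = (1 - p0) * (p2 ** h) * ((1 - p2) ** t)
            w = like1 / (like1 + like2)
            # Accumulate expected (pseudo) counts
            w_sum += w
            h1 += w * h
            n1 += w * len(seq)
            h2 += (1 - w) * h
            n2 += (1 - w) * len(seq)
        # M-step: re-estimate the parameters from the expected counts
        p0 = w_sum / len(observations)
        p1 = h1 / n1
        p2 = h2 / n2
    return p0, p1, p2

# Hypothetical observations (the actual sequences from the slide are not reproduced here).
data = ["HHH", "TTT", "HHT", "TTH", "HHH"]
print(em_three_coins(data, p0=0.6, p1=0.7, p2=0.4))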

Maximizing Likelihood
How do we find a better model θ given a model θ'? Can we use the Lagrange method to maximize Σ_i log P_θ(y_i) directly? If this could be done, there would be no need to iterate!

EM Theorem
The following EM Theorem holds, where Σ_t denotes summation over all possible values t of the unobserved data:
If Σ_i Σ_t P_θ'(t | y_i) log P_θ(t, y_i) ≥ Σ_i Σ_t P_θ'(t | y_i) log P_θ'(t, y_i), then Σ_i log P_θ(y_i) ≥ Σ_i log P_θ'(y_i).
This theorem is similar to (but not identical to, nor does it follow from) the EM Theorem in [Jelinek 1997, p. 148], although the proof is almost identical.

What Does the EM Theorem Mean?
If we can find a θ that maximizes Σ_i Σ_t P_θ'(t | y_i) log P_θ(t, y_i), the same θ will also satisfy the condition Σ_i log P_θ(y_i) ≥ Σ_i log P_θ'(y_i), which is what each iteration of the EM algorithm needs. We can maximize the former by taking its partial derivatives with respect to the parameters in θ.

EM Theorem: Why?
Why is optimizing Σ_i Σ_t P_θ'(t | y_i) log P_θ(t, y_i) easier than optimizing Σ_i log P_θ(y_i)?
P_θ(t, y_i) involves the complete data and is usually a product of a set of parameters, so its logarithm is a sum that is easy to differentiate. P_θ(y_i), by contrast, usually involves a summation over all hidden variables, and that summation sits inside the logarithm.
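To illustrate the contrast with the semantic-labeling model from earlier (a hedged sketch; the probability tables below are invented placeholders), the complete-data probability is a plain product of parameters, while the observed-data probability has to sum that product over every possible cluster:

# Hypothetical parameter tables for P(context) * P(cluster | context) * P(word | cluster)
p_context = {"ctx": 1.0}
p_cluster_given_context = {("c1", "ctx"): 0.6, ("c2", "ctx"): 0.4}
p_word_given_cluster = {("drink", "c1"): 0.2, ("drink", "c2"): 0.05}

def p_complete(context, cluster, word):
    # Complete data: a simple product of parameters, so its log is a sum of log-parameters.
    return (p_context[context]
            * p_cluster_given_context[(cluster, context)]
            * p_word_given_cluster[(word, cluster)])

def p_observed(context, word):
    # Observed data: the hidden cluster must be summed out, inside any logarithm we take.
    return sum(p_complete(context, c, word) for c in ("c1", "c2"))

print(p_complete("ctx", "c1", "drink"), p_observed("ctx", "drink"))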

EM Theorem: Proof
Σ_i log P_θ(y_i) - Σ_i log P_θ'(y_i)
= Σ_i Σ_t P_θ'(t | y_i) log P_θ(y_i) - Σ_i Σ_t P_θ'(t | y_i) log P_θ'(y_i)    (because Σ_t P_θ'(t | y_i) = 1)
= Σ_i Σ_t P_θ'(t | y_i) [log P_θ(t, y_i) - log P_θ(t | y_i)] - Σ_i Σ_t P_θ'(t | y_i) [log P_θ'(t, y_i) - log P_θ'(t | y_i)]
= [Σ_i Σ_t P_θ'(t | y_i) log P_θ(t, y_i) - Σ_i Σ_t P_θ'(t | y_i) log P_θ'(t, y_i)] - Σ_i Σ_t P_θ'(t | y_i) log (P_θ(t | y_i) / P_θ'(t | y_i))
≥ Σ_i Σ_t P_θ'(t | y_i) log P_θ(t, y_i) - Σ_i Σ_t P_θ'(t | y_i) log P_θ'(t, y_i),
because Σ_t P_θ'(t | y_i) log (P_θ(t | y_i) / P_θ'(t | y_i)) ≤ 0 by Jensen's inequality. So if the right-hand side is ≥ 0, then Σ_i log P_θ(y_i) ≥ Σ_i log P_θ'(y_i), which proves the theorem.

Jensen's Inequality
The proof used the inequality Σ_t P_θ'(t | y_i) log (P_θ(t | y_i) / P_θ'(t | y_i)) ≤ 0.
More generally, if p and q are probability distributions, then Σ_x p(x) log (q(x) / p(x)) ≤ 0.
Even more generally, if f is a convex function, then E[f(x)] ≥ f(E[x]).
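A one-line sketch of why the general inequality follows from the convexity statement (log is concave, so Jensen's inequality flips direction):

\[
\sum_x p(x)\,\log\frac{q(x)}{p(x)}
\;\le\;
\log\sum_x p(x)\,\frac{q(x)}{p(x)}
\;=\;
\log\sum_x q(x)
\;=\;
\log 1
\;=\; 0.
\]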

What is Σ_t P_θ'(t | y_i) log P_θ(t, y_i)?
It is the expected value of log P_θ(t, y_i) according to the model θ'. The EM Theorem states that we can get a better model by maximizing the sum, over all instances, of this expectation.

A Generic Set-Up for EM
Assume P_θ(t, y) is a product of a set of parameters, and that θ consists of M groups of parameters, where the parameters in each group sum to 1. Let u_jk be the k-th parameter in group j, so that Σ_m u_jm = 1.
Let T_jk be the subset of the hidden data such that if t is in T_jk, the computation of P_θ(t, y_i) involves u_jk. Let n(t, y_i) be the number of times u_jk is used in P_θ(t, y_i), i.e., P_θ(t, y_i) = u_jk^{n(t, y_i)} · v(t, y_i), where v(t, y_i) is the product of all the other parameters.

Maximizing the expectation with respect to the parameters in θ, subject to Σ_m u_jm = 1, yields the update
u_jk = C_jk / Σ_m C_jm, where C_jk = Σ_i Σ_{t ∈ T_jk} P_θ'(t | y_i) n(t, y_i)
is the pseudo count of instances involving u_jk.
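A brief sketch of the calculation behind this update, using the Lagrange method from the coin example (writing C_{jk} for the pseudo count):

\[
\sum_i \sum_t P_{\theta'}(t \mid y_i)\,\log P_\theta(t, y_i)
= \sum_m C_{jm}\,\log u_{jm} + \text{terms not involving group } j,
\qquad
C_{jk} = \sum_i \sum_{t \in T_{jk}} P_{\theta'}(t \mid y_i)\,n(t, y_i).
\]
\[
\frac{\partial}{\partial u_{jk}}\Big(\sum_m C_{jm}\,\log u_{jm} + \lambda_j\big(\textstyle\sum_m u_{jm} - 1\big)\Big)
= \frac{C_{jk}}{u_{jk}} + \lambda_j = 0
\;\Longrightarrow\;
u_{jk} = \frac{C_{jk}}{\sum_m C_{jm}}.
\]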

Summary
EM Theorem
Intuition
Proof
Generic set-up