LECTURE 15: REESTIMATION, EM AND MIXTURES

LECTURE 15: REESTIMATION, EM AND MIXTURES

Objectives:
  Reestimation Equations
  Continuous Distributions
  Gaussian Mixture Models
  EM Derivation of Reestimation

Resources:
  D.H.S.: Chapter 3 (Part 3)
  F.J.: Statistical Methods
  D.L.D.: Mixture Modeling
  R.S.: Semicontinuous HMMs
  C.C.: Ten Years of HMMs
  ISIP: HMM Overview
  ISIP: Software

Evaluation and Decoding (Review)

Problem No. 3: Learning
Let us derive the learning algorithm heuristically, and then formally derive it using the EM Theorem.
Define a backward probability, βi(t), analogous to αi(t). This represents the probability that we are in state ωi at time t and will generate the remainder of the target sequence, from [t+1, T].
Define the probability of a transition from ωi(t-1) to ωj(t), given that the model generated the entire training sequence, vT.
The expected number of transitions from ωi to ωj at any time is simply the sum of this quantity over all times t = 1, …, T.
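The equations for these quantities were rendered as images on the original slide; the following is a sketch in standard DHS-style notation, where aij are the transition probabilities, bj(v(t)) is the probability that state ωj emits the symbol observed at time t, and θ is the full parameter set:

```latex
% Backward probability (computed recursively from t = T down to t = 1):
\beta_i(t) \;=\; \sum_{j} a_{ij}\, b_j\big(v(t+1)\big)\, \beta_j(t+1)

% Posterior probability of a transition from omega_i at t-1 to omega_j at t,
% given that the model generated the entire training sequence v^T:
\gamma_{ij}(t) \;=\; \frac{\alpha_i(t-1)\, a_{ij}\, b_j\big(v(t)\big)\, \beta_j(t)}
                          {P(v^T \mid \theta)}

% Expected number of i -> j transitions over the whole sequence:
\sum_{t=1}^{T} \gamma_{ij}(t)
```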

Problem No. 3: Learning (Cont.)
The expected number of transitions out of ωi at any time is obtained by also summing over all destination states.
Therefore, a new estimate for the transition probability is the ratio of these two expected counts.
We can similarly reestimate the output probability distribution by counting how often each symbol is emitted from each state.
The resulting reestimation algorithm is known as the Baum-Welch, or forward-backward, algorithm:
Convergence is very quick, typically within three iterations for complex systems.
Seeding the parameters with good initial guesses is very important.
The forward-backward principle is used in many algorithms.
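Putting the forward pass, the backward pass, and these expected counts together gives one Baum-Welch iteration. The sketch below is for the discrete HMM case; the function and variable names are illustrative rather than taken from any particular toolkit, and the scaling or log-space arithmetic needed in practice to avoid underflow is omitted for clarity.

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch (forward-backward) reestimation step for a discrete HMM.

    A   : (N, N) transition probabilities a_ij
    B   : (N, M) emission probabilities b_j(k) for the M discrete symbols
    pi  : (N,)   initial state probabilities
    obs : (T,)   integer observation sequence
    Returns updated (A, B, pi).
    """
    obs = np.asarray(obs)
    N, T = A.shape[0], len(obs)

    # Forward pass: alpha[t, i] = P(o_1..o_t, state i at time t)
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]

    # Backward pass: beta[t, i] = P(o_{t+1}..o_T | state i at time t)
    beta = np.zeros((T, N))
    beta[T - 1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])

    evidence = alpha[T - 1].sum()  # P(obs | model)

    # xi[t, i, j]: posterior probability of a transition i -> j at time t
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        xi[t] = (alpha[t][:, None] * A * B[:, obs[t + 1]] * beta[t + 1]) / evidence

    # gamma[t, i]: posterior probability of being in state i at time t
    gamma = alpha * beta / evidence

    # Reestimation: expected transition counts / expected occupancy counts
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    pi_new = gamma[0]
    return A_new, B_new, pi_new
```

Repeating this update until the likelihood stops improving implements the iterative training described above.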

Continuous Probability Density Functions
The discrete HMM incorporates a discrete probability distribution, captured in the matrix B, to describe the probability of outputting a symbol.
Signal measurements, or feature vectors, are continuous-valued N-dimensional vectors.
In order to use our discrete HMM technology, we must vector quantize (VQ) this data – reduce the continuous-valued vectors to discrete values chosen from a set of M codebook vectors.
Initially, most HMMs were based on VQ front ends. However, over the past 30 years the continuous density model has become widely accepted.
We can use a Gaussian mixture model to represent arbitrary distributions.
Note that this amounts to assigning a mixture weight, a mean vector, and a covariance matrix for each mixture component.
If we use a 39-dimensional feature vector, 128 mixture components per state, and thousands of states, we will experience a significant increase in complexity.
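The mixture density referenced above (the equation itself was an image on the slide) has the standard form below, where cjm are the mixture weights for state j; the symbols follow common usage and may differ slightly from the original:

```latex
b_j(\mathbf{x}) \;=\; \sum_{m=1}^{M} c_{jm}\,
   \mathcal{N}\big(\mathbf{x};\, \boldsymbol{\mu}_{jm}, \boldsymbol{\Sigma}_{jm}\big),
\qquad c_{jm} \ge 0, \quad \sum_{m=1}^{M} c_{jm} = 1
```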

Continuous Probability Density Functions (Cont.)
It is common to assume that the features are decorrelated, so that the covariance matrix can be approximated as a diagonal matrix to reduce complexity. We refer to this as variance-weighting.
Also, note that for a single mixture component, the log of the output probability at each state becomes a Mahalanobis distance, which creates a nice connection between HMMs and PCA. In fact, HMMs can be viewed in some ways as a generalization of PCA.
Write the joint probability of the data and the states given the model. The data likelihood can then be considered the sum, over all possible state sequences, ω, and all possible mixture components, K, of the product of the corresponding transition, mixture weight, and Gaussian densities.
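A sketch of these expressions in common CDHMM notation (the original equations were images; here ω = (ω(1), …, ω(T)) denotes a state sequence and K = (k1, …, kT) a sequence of mixture-component indices):

```latex
p(\mathbf{x}_1^T, \boldsymbol{\omega}, K \mid \theta)
  \;=\; \prod_{t=1}^{T} a_{\omega(t-1)\,\omega(t)}\;
        c_{\omega(t)\,k_t}\;
        \mathcal{N}\!\big(\mathbf{x}_t;\,
          \boldsymbol{\mu}_{\omega(t)k_t},
          \boldsymbol{\Sigma}_{\omega(t)k_t}\big)

p(\mathbf{x}_1^T \mid \theta)
  \;=\; \sum_{\boldsymbol{\omega}} \sum_{K}
        p(\mathbf{x}_1^T, \boldsymbol{\omega}, K \mid \theta)
```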

Application of EM
We can write the auxiliary function, Q, as an expectation of the complete-data log-likelihood over the hidden state sequences and mixture components.
The log term has the expected decomposition into a sum of transition, mixture weight, and Gaussian terms.
Hence the auxiliary function has three components:
The first term resembles the term we had for discrete distributions.
The remaining two terms are new: they involve the mixture weights and the parameters of the Gaussians.
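In the usual EM notation (a sketch; θ̄ denotes the current parameter estimates and θ the new ones), the auxiliary function and its decomposition are:

```latex
Q(\theta, \bar{\theta})
  \;=\; \sum_{\boldsymbol{\omega}} \sum_{K}
        p(\boldsymbol{\omega}, K \mid \mathbf{x}_1^T, \bar{\theta})\,
        \log p(\mathbf{x}_1^T, \boldsymbol{\omega}, K \mid \theta)

\log p(\mathbf{x}_1^T, \boldsymbol{\omega}, K \mid \theta)
  \;=\; \sum_{t=1}^{T} \log a_{\omega(t-1)\,\omega(t)}
   \;+\; \sum_{t=1}^{T} \log c_{\omega(t)\,k_t}
   \;+\; \sum_{t=1}^{T} \log \mathcal{N}\!\big(\mathbf{x}_t;\,
          \boldsymbol{\mu}_{\omega(t)k_t}, \boldsymbol{\Sigma}_{\omega(t)k_t}\big)
```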

Reestimation Equations
Maximization of the auxiliary function, Q, requires differentiation with respect to the parameters of the Gaussian: the means and covariances.
This results in reestimation formulas that are weighted averages of the data, where the intermediate variable, the mixture-component occupancy probability, is computed from the forward and backward probabilities.
The mixture coefficients can be reestimated using a similar equation.
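A sketch of these reestimation equations in common notation (the symbols, particularly the occupancy γt(j, m), are standard but may differ from those on the original slide):

```latex
\gamma_t(j, m)
  \;=\; \frac{\alpha_j(t)\,\beta_j(t)}{\sum_{i}\alpha_i(t)\,\beta_i(t)}
        \cdot
        \frac{c_{jm}\,\mathcal{N}(\mathbf{x}_t;\,\boldsymbol{\mu}_{jm},\boldsymbol{\Sigma}_{jm})}
             {\sum_{m'} c_{jm'}\,\mathcal{N}(\mathbf{x}_t;\,\boldsymbol{\mu}_{jm'},\boldsymbol{\Sigma}_{jm'})}

\hat{\boldsymbol{\mu}}_{jm}
  \;=\; \frac{\sum_{t=1}^{T} \gamma_t(j,m)\,\mathbf{x}_t}
             {\sum_{t=1}^{T} \gamma_t(j,m)},
\qquad
\hat{\boldsymbol{\Sigma}}_{jm}
  \;=\; \frac{\sum_{t=1}^{T} \gamma_t(j,m)\,
              (\mathbf{x}_t - \hat{\boldsymbol{\mu}}_{jm})
              (\mathbf{x}_t - \hat{\boldsymbol{\mu}}_{jm})^{\mathsf T}}
             {\sum_{t=1}^{T} \gamma_t(j,m)}

\hat{c}_{jm}
  \;=\; \frac{\sum_{t=1}^{T} \gamma_t(j,m)}
             {\sum_{t=1}^{T} \sum_{m'} \gamma_t(j,m')}
```

Each estimate is a weighted average of the data, with the occupancy probabilities acting as the weights.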

Case Study: Speech Recognition
Based on a noisy communication channel model in which the intended message is corrupted by a sequence of noisy channels.
The Bayesian approach is most common. Objective: minimize word error rate by maximizing P(W|A).
P(A|W): Acoustic Model
P(W): Language Model
P(A): Evidence (ignored)
Acoustic models use hidden Markov models with Gaussian mixtures. P(W) is estimated using probabilistic N-gram models. Parameters can be trained using generative (ML) or discriminative (e.g., MMIE, MCE, or MPE) approaches.
[Block diagram: Input Speech → Acoustic Front-end → Search, using Acoustic Models P(A|W) and Language Model P(W) → Recognized Utterance]
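The Bayesian decomposition referenced above (standard, though the equation itself appeared only as an image on the slide):

```latex
\hat{W} \;=\; \arg\max_{W} P(W \mid A)
        \;=\; \arg\max_{W} \frac{P(A \mid W)\,P(W)}{P(A)}
        \;=\; \arg\max_{W} P(A \mid W)\,P(W)
```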

Case Study: Acoustic Modeling in Speech Recognition

Case Study: Fingerspelling Recognition


Summary
Hidden Markov Model: a directed graph in which nodes, or states, represent output (emission) distributions and arcs represent transitions between these nodes. Emission distributions can be discrete or continuous.
Continuous Density Hidden Markov Models (CDHMMs): the most popular form of HMMs for many signal processing problems. Gaussian mixture distributions are commonly used (a weighted sum of Gaussian pdfs).
Subproblems: Practical implementations of HMMs require computationally efficient solutions to:
Evaluation: computing the probability that we arrive in a state at time t and produce a particular observation at that state.
Decoding: finding the path through the network with the highest likelihood (e.g., best path).
Learning: estimating the parameters of the model from data. The EM algorithm plays an important role in learning because it guarantees that iterative reestimation converges (to a local maximum of the likelihood).
There are many generalizations of HMMs, including Bayesian networks, deep belief networks, and finite state automata. HMMs are very effective at modeling data that has a sequential structure.