Gaussian Mixture Model and the EM algorithm in Speech Recognition

Gaussian Mixture Model and the EM algorithm in Speech Recognition
Puskás János-Pál

Speech Recognition
Develop a method for computers to “understand” speech using mathematical methods.

The Hidden Markov Model

First-order observable Markov Model
A set of states Q = q1, q2, …, qN; the state at time t is qt.
The current state depends only on the previous state (first-order Markov assumption).
A transition probability matrix A, where $a_{ij} = P(q_t = j \mid q_{t-1} = i)$.
A special initial probability vector π, where $\pi_i = P(q_1 = i)$.
Constraints: $\sum_{j=1}^{N} a_{ij} = 1$ for every i, and $\sum_{i=1}^{N} \pi_i = 1$.
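To make the definitions concrete, here is a minimal Python sketch (the 3-state example and its numbers are made up for illustration) that checks the two constraints and samples a state sequence using only A and π:

import numpy as np

# Transition matrix A: A[i, j] = P(q_t = j | q_{t-1} = i)  (made-up 3-state example)
A = np.array([[0.7, 0.2, 0.1],
              [0.3, 0.5, 0.2],
              [0.2, 0.3, 0.5]])
# Initial probability vector pi: pi[i] = P(q_1 = i)
pi = np.array([0.6, 0.3, 0.1])

# Constraints: every row of A sums to 1, and pi sums to 1
assert np.allclose(A.sum(axis=1), 1.0) and np.isclose(pi.sum(), 1.0)

# Sample a state sequence: the current state depends only on the previous one
rng = np.random.default_rng(0)
states = [rng.choice(3, p=pi)]
for t in range(1, 10):
    states.append(rng.choice(3, p=A[states[-1]]))
print(states)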

Problem: how do we apply the HMM to continuous observations?
We have assumed that the output alphabet V has a finite number of symbols, but spectral feature vectors are real-valued. How do we deal with real-valued features?
Decoding: given ot, how do we compute P(ot|q)?
Learning: how do we modify EM to deal with real-valued features?

HMM in Speech Recognition

Gaussian Distribution
For a D-dimensional input vector o, the Gaussian distribution with mean μ and positive definite covariance matrix Σ can be expressed as
$\mathcal{N}(o; \mu, \Sigma) = \frac{1}{(2\pi)^{D/2}\,|\Sigma|^{1/2}} \exp\!\left(-\tfrac{1}{2}(o-\mu)^\top \Sigma^{-1} (o-\mu)\right)$
The distribution is completely described by the D parameters representing μ and the D(D+1)/2 parameters representing the symmetric covariance matrix Σ.
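A minimal numpy sketch of this density, useful as a sanity check (the 2-dimensional numbers are made up for illustration):

import numpy as np

def gaussian_pdf(o, mu, sigma):
    """N(o; mu, sigma) for a D-dimensional observation o."""
    D = len(mu)
    diff = o - mu
    norm = (2 * np.pi) ** (-D / 2) * np.linalg.det(sigma) ** (-0.5)
    return norm * np.exp(-0.5 * diff @ np.linalg.inv(sigma) @ diff)

# Made-up example with D = 2
mu = np.array([0.0, 1.0])
sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
print(gaussian_pdf(np.array([0.5, 0.5]), mu, sigma))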

Is it enough?
A single Gaussian may do a bad job of modeling the distribution in any one dimension.
Solution: mixtures of Gaussians.
(Figure from Chen, Picheny et al. slides)

Gaussian Mixture Models (GMM)
A Gaussian mixture model represents the output density as a weighted sum of M Gaussians:
$p(o) = \sum_{m=1}^{M} c_m\, \mathcal{N}(o; \mu_m, \Sigma_m), \qquad c_m \ge 0, \quad \sum_{m=1}^{M} c_m = 1$
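A short sketch of this mixture density (the weights, means, and covariances below are made-up illustrative values):

import numpy as np
from scipy.stats import multivariate_normal

def gmm_pdf(o, weights, means, covs):
    """p(o) = sum_m c_m * N(o; mu_m, Sigma_m)."""
    return sum(c * multivariate_normal.pdf(o, mean=mu, cov=cov)
               for c, mu, cov in zip(weights, means, covs))

# Two-component, 2-dimensional example with made-up parameters
weights = [0.4, 0.6]
means = [np.array([0.0, 0.0]), np.array([3.0, 1.0])]
covs = [np.eye(2), np.diag([2.0, 0.5])]
print(gmm_pdf(np.array([1.0, 0.5]), weights, means, covs))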

GMM Estimation
We assume the data is generated by a set of N distinct sources, but we only observe the input observation ot without knowing which source it came from.
Summary: each state has a likelihood function parameterized by:
M mixture weights
M mean vectors of dimensionality D
Either M covariance matrices of size D×D, or (more likely) M diagonal covariance matrices of size D×D, which is equivalent to M variance vectors of dimensionality D
(One possible container for these parameters is sketched below.)
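A minimal sketch of such a per-state parameter set, assuming the diagonal-covariance option (the class and field names are my own, and M = 16, D = 39 are just typical values, not taken from the slides):

import numpy as np
from dataclasses import dataclass

@dataclass
class StateGMM:
    weights: np.ndarray    # shape (M,)   M mixture weights, summing to 1
    means: np.ndarray      # shape (M, D) M mean vectors
    variances: np.ndarray  # shape (M, D) M variance vectors (diagonal covariances)

M, D = 16, 39  # e.g. 16 mixtures over 39-dimensional feature vectors
state = StateGMM(weights=np.full(M, 1.0 / M),
                 means=np.zeros((M, D)),
                 variances=np.ones((M, D)))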

Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance.
(Figure: plots of P(o|q) against o for Gaussians with different means; P(o|q) is highest at the mean and low for o far from the mean.)

The EM Algorithm
The EM algorithm is an iterative algorithm with two steps. In the Expectation step, it tries to “guess” the values of the zt’s. In the Maximization step, it updates the parameters of our model based on those guesses.
The random variable zt indicates which of the N Gaussians each ot came from. The zt’s are latent random variables, meaning they are hidden/unobserved. This is what makes our estimation problem difficult.

The EM Algorithm in Speech Recognition
The posterior probability $\gamma_t(i)$ of zt (the “fuzzy membership” of ot to the i-th Gaussian):
$\gamma_t(i) = \frac{c_i\, \mathcal{N}(o_t; \mu_i, \Sigma_i)}{\sum_{j=1}^{M} c_j\, \mathcal{N}(o_t; \mu_j, \Sigma_j)}$
Mixture weight update: $\hat{c}_i = \frac{1}{T} \sum_{t=1}^{T} \gamma_t(i)$
Mean vector update: $\hat{\mu}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\, o_t}{\sum_{t=1}^{T} \gamma_t(i)}$
Covariance matrix update: $\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} \gamma_t(i)\,(o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^\top}{\sum_{t=1}^{T} \gamma_t(i)}$
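A minimal numpy sketch of these E- and M-step updates for a diagonal-covariance GMM (the function names, initialization scheme, and synthetic data are my own choices, not from the original slides):

import numpy as np

def log_gaussian_diag(X, mu, var):
    """log N(x; mu, diag(var)) for each row of X."""
    D = X.shape[1]
    return (-0.5 * (D * np.log(2 * np.pi) + np.sum(np.log(var)))
            - 0.5 * np.sum((X - mu) ** 2 / var, axis=1))

def em_gmm(X, M, n_iter=50, seed=0):
    rng = np.random.default_rng(seed)
    T, D = X.shape
    # Initialization: uniform weights, random data points as means, global variance
    weights = np.full(M, 1.0 / M)
    means = X[rng.choice(T, size=M, replace=False)].copy()
    variances = np.tile(X.var(axis=0), (M, 1))
    for _ in range(n_iter):
        # E-step: gamma[t, i] = posterior ("fuzzy membership") of o_t to Gaussian i
        log_p = np.stack([np.log(weights[m]) + log_gaussian_diag(X, means[m], variances[m])
                          for m in range(M)], axis=1)
        log_p -= log_p.max(axis=1, keepdims=True)
        gamma = np.exp(log_p)
        gamma /= gamma.sum(axis=1, keepdims=True)
        # M-step: mixture weight, mean, and (diagonal) covariance updates
        Nm = gamma.sum(axis=0)
        weights = Nm / T
        means = (gamma.T @ X) / Nm[:, None]
        for m in range(M):
            diff = X - means[m]
            variances[m] = (gamma[:, m] @ diff ** 2) / Nm[m] + 1e-6  # small floor for stability
    return weights, means, variances

# Synthetic data: two clusters in 2-D
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, (200, 2)), rng.normal(4.0, 0.5, (200, 2))])
print(em_gmm(X, M=2)[0])  # recovered mixture weights, roughly [0.5, 0.5]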

Baum-Welch for Mixture Models
Let’s define the probability of being in state j at time t with the k-th mixture component accounting for ot:
$\gamma_t(j,k) = \frac{\alpha_t(j)\,\beta_t(j)}{\sum_{j'=1}^{N} \alpha_t(j')\,\beta_t(j')} \cdot \frac{c_{jk}\, \mathcal{N}(o_t; \mu_{jk}, \Sigma_{jk})}{\sum_{m=1}^{M} c_{jm}\, \mathcal{N}(o_t; \mu_{jm}, \Sigma_{jm})}$
Now the mixture parameters are re-estimated as $\gamma_t(j,k)$-weighted averages:
$\hat{c}_{jk} = \frac{\sum_t \gamma_t(j,k)}{\sum_t \sum_{k'} \gamma_t(j,k')}, \qquad \hat{\mu}_{jk} = \frac{\sum_t \gamma_t(j,k)\, o_t}{\sum_t \gamma_t(j,k)}$
and analogously for $\hat{\Sigma}_{jk}$, using outer products of $(o_t - \hat{\mu}_{jk})$.

The Forward and Backward algorithms
Forward (α) algorithm: $\alpha_t(j) = P(o_1 \cdots o_t,\, q_t = j \mid \lambda)$, computed recursively as
$\alpha_1(j) = \pi_j\, b_j(o_1), \qquad \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big]\, b_j(o_{t+1})$
Backward (β) algorithm: $\beta_t(i) = P(o_{t+1} \cdots o_T \mid q_t = i, \lambda)$, computed recursively as
$\beta_T(i) = 1, \qquad \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)$
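A compact numpy sketch of both recursions (the toy 2-state model is made up; in the continuous case each entry of B_obs would be the GMM likelihood $b_j(o_t)$ for the corresponding state):

import numpy as np

def forward(A, pi, B_obs):
    """alpha[t, j] = P(o_1..o_t, q_t = j); B_obs[t, j] = b_j(o_t)."""
    T, N = B_obs.shape
    alpha = np.zeros((T, N))
    alpha[0] = pi * B_obs[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B_obs[t]
    return alpha

def backward(A, B_obs):
    """beta[t, i] = P(o_{t+1}..o_T | q_t = i)."""
    T, N = B_obs.shape
    beta = np.ones((T, N))
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B_obs[t + 1] * beta[t + 1])
    return beta

# Toy 2-state example with made-up numbers
A = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([0.5, 0.5])
B_obs = np.array([[0.7, 0.1], [0.6, 0.2], [0.1, 0.8]])  # state likelihoods of each observation
alpha, beta = forward(A, pi, B_obs), backward(A, B_obs)
print(alpha[-1].sum(), (alpha[0] * beta[0]).sum())  # both equal P(O | model)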

How to train mixtures?
Choose M (often 16; or M can be tuned optimally), then use one of various splitting or clustering algorithms.
One simple method for “splitting”:
1. Compute the global mean μ and global variance σ².
2. Split into two Gaussians, with means μ ± ε (sometimes ε is 0.2σ).
3. Run Forward-Backward to retrain.
4. Go to 2 until we have 16 mixtures.
Or choose the starting clusters with the K-means algorithm. (The splitting step is sketched below.)
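A small sketch of the splitting step under these assumptions (each mean is perturbed by ε = 0.2σ per dimension; the Forward-Backward retraining between splits is only indicated by a comment):

import numpy as np

def split_mixtures(weights, means, variances, eps_scale=0.2):
    """Double the number of Gaussians: each component is split around mu +/- eps."""
    eps = eps_scale * np.sqrt(variances)              # per-dimension 0.2 * sigma
    new_weights = np.concatenate([weights, weights]) / 2.0
    new_means = np.concatenate([means + eps, means - eps], axis=0)
    new_vars = np.concatenate([variances, variances], axis=0)
    return new_weights, new_means, new_vars

# Start from one Gaussian: the global mean and variance of the data X
X = np.random.default_rng(0).normal(size=(500, 3))
weights = np.array([1.0])
means = X.mean(axis=0, keepdims=True)
variances = X.var(axis=0, keepdims=True)
while len(weights) < 16:
    weights, means, variances = split_mixtures(weights, means, variances)
    # ... run Forward-Backward (or EM) here to retrain before the next split ...
print(len(weights))  # 16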

The Covariance Matrix
Represents the correlations between dimensions in a Gaussian.
Symmetric and positive definite.
D(D+1)/2 parameters when x has D dimensions.

But: assume diagonal covariance
I.e., assume that the features in the feature vector are uncorrelated. This isn’t true for FFT features, but is true for MFCC features.
Computation and storage are much cheaper with a diagonal covariance: only the diagonal entries are non-zero, and the diagonal contains the variance of each dimension, $\sigma_{ii}^2$.
So this means we consider the variance of each acoustic feature (dimension) separately.

Diagonal Covariance Matrix
Simplified model: the principal axes are aligned with the coordinate axes.
D parameters.
Assumes independence between the components of x.

Cost of Gaussians in High Dimensions
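As a worked example of this cost (assuming D = 39, a typical MFCC feature dimensionality that the slides do not state explicitly):

$\text{full covariance: } D + \frac{D(D+1)}{2} = 39 + 780 = 819 \text{ parameters per Gaussian}$
$\text{diagonal covariance: } D + D = 39 + 39 = 78 \text{ parameters per Gaussian}$

With M = 16 mixtures per state, that is 13,104 versus 1,248 parameters per state, before even counting the number of states.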

How does the system work?

References
Lawrence R. Rabiner, “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition”, Proceedings of the IEEE, 1989.
Magyar nyelvi beszédtechnológiai alapismeretek [Fundamentals of Hungarian-language speech technology], multimedia software CD, Nikol Kkt., 2002.
Dan Jurafsky, “CS Speech Recognition and Synthesis”, Lectures 8–10, Stanford University, 2005.