Statistical Models for Automatic Speech Recognition

Statistical Models for Automatic Speech Recognition Lukáš Burget

Basic rules of probability theory. Sum rule: $P(x) = \sum_y P(x,y)$. Product rule: $P(x,y) = P(x|y)\,P(y) = P(y|x)\,P(x)$. Bayes rule: $P(x|y) = \dfrac{P(y|x)\,P(x)}{P(y)}$.

Continuous random variables. P(x) denotes a probability, p(x) a probability density function. Sum rule: $P\bigl(x \in (a,b)\bigr) = \int_a^b p(x)\,\mathrm{d}x$; marginalization: $p(x) = \int p(x,y)\,\mathrm{d}y$.

Speech recognition problem. Feature extraction: preprocessing of the speech signal to satisfy the needs of the subsequent recognition process (dimensionality reduction, preserving only the "important" information, decorrelation). Popular features are MFCCs: psycho-acoustically motivated modifications applied to short-time spectra. For convenience, we will use one-dimensional features in most of our examples (e.g. short-time energy).

Classifying speech frames (figure: class-conditional densities p(x) for unvoiced and voiced frames over the feature x).

Classifying speech frames. Mathematically, we ask the following question: $\mathrm{class}^* = \operatorname*{argmax}_{\mathrm{class}} P(\mathrm{class}|x)$. But the value we read from the probability distribution is $p(x|\mathrm{class})$. According to Bayes rule, the above can be rewritten as: $\mathrm{class}^* = \operatorname*{argmax}_{\mathrm{class}} \dfrac{p(x|\mathrm{class})\,P(\mathrm{class})}{p(x)} = \operatorname*{argmax}_{\mathrm{class}} p(x|\mathrm{class})\,P(\mathrm{class})$.

Multi-class classification. The class most likely to be correct is given by: $\mathrm{class}^* = \operatorname*{argmax}_{\mathrm{class}} p(x|\mathrm{class})\,P(\mathrm{class})$ (figure: densities p(x) for silence, unvoiced and voiced). But we do not know the true distributions, …
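
For illustration, here is a minimal sketch (not from the original slides) of this Bayes decision for one-dimensional features, assuming Gaussian class-conditional densities; all class means, variances and priors below are made-up numbers.

```python
import numpy as np

# Hypothetical class-conditional Gaussians for silence, unvoiced and voiced frames
# (parameters are made up for illustration only).
classes = {
    "silence":  {"mean": -2.0, "var": 0.5, "prior": 0.3},
    "unvoiced": {"mean":  0.0, "var": 1.0, "prior": 0.3},
    "voiced":   {"mean":  3.0, "var": 1.5, "prior": 0.4},
}

def log_gauss(x, mean, var):
    """Log of the univariate Gaussian density N(x; mean, var)."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def classify(x):
    """Return the class maximizing p(x|class) * P(class)."""
    scores = {c: log_gauss(x, p["mean"], p["var"]) + np.log(p["prior"])
              for c, p in classes.items()}
    return max(scores, key=scores.get)

print(classify(2.5))   # -> 'voiced' for these made-up parameters
```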

Estimation of parameters. … we only see some training examples (figure: training samples of unvoiced, voiced and silence frames along the feature axis x).

Estimation of parameters. … we only see some training examples. Let's decide on some parametric model (e.g. a Gaussian distribution) and estimate its parameters from the data. Here we use the frequentist approach: estimate and rely on distributions that tell us how frequently we have seen similar feature values x for the individual classes.

Maximum Likelihood Estimation. In the next part, we will use ML estimation of the model parameters: $\hat{\Theta} = \operatorname*{argmax}_{\Theta} p(X|\Theta)$. This allows us to estimate the parameters Θ of each class individually, using only the data for that class; therefore, for convenience, we omit the class identities in the following equations. The models we are going to examine are: a single Gaussian, the Gaussian Mixture Model (GMM), and the Hidden Markov Model (HMM). We want to solve three fundamental problems: evaluation of the model (computing the likelihood of the features given the model), training the model (finding ML estimates of its parameters), and finding the most likely values of the hidden variables.

Gaussian distribution (univariate): $p(x) = \mathcal{N}(x;\,\mu,\,\sigma^2) = \dfrac{1}{\sqrt{2\pi\sigma^2}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$. ML estimates of the parameters: $\mu = \dfrac{1}{N}\sum_i x_i$, $\quad \sigma^2 = \dfrac{1}{N}\sum_i (x_i-\mu)^2$.
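
A small numpy sketch (an illustration, not code from the slides) of these ML estimates and of evaluating the resulting density on made-up data:

```python
import numpy as np

x = np.array([1.2, 0.8, 1.9, 1.4, 0.7, 1.6])   # made-up 1-D training data

# ML estimates: sample mean and the biased (1/N) sample variance
mu = x.mean()
var = ((x - mu) ** 2).mean()

def gauss_pdf(x, mu, var):
    """Univariate Gaussian density N(x; mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

print(mu, var, gauss_pdf(1.0, mu, var))
```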

Why the Gaussian distribution? It occurs naturally: by the central limit theorem, summing the values of many independently generated random variables gives approximately Gaussian-distributed observations. Examples: summing the outcomes of N dice; Galton's board https://www.youtube.com/watch?v=03tx4v0i7MA

Gaussian distribution (multivariate): $p(x_1,\dots,x_D) = \mathcal{N}(\mathbf{x};\,\boldsymbol{\mu},\,\boldsymbol{\Sigma}) = \dfrac{1}{\sqrt{(2\pi)^D |\boldsymbol{\Sigma}|}}\,e^{-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})}$. ML estimates of the parameters: $\boldsymbol{\mu} = \dfrac{1}{N}\sum_i \mathbf{x}_i$, $\quad \boldsymbol{\Sigma} = \dfrac{1}{N}\sum_i (\mathbf{x}_i-\boldsymbol{\mu})(\mathbf{x}_i-\boldsymbol{\mu})^T$.
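
The multivariate estimates follow the same pattern; a brief sketch with made-up 2-D data:

```python
import numpy as np

X = np.array([[1.0, 2.0], [1.5, 1.8], [0.7, 2.4], [1.2, 2.1]])  # N x D made-up data

mu = X.mean(axis=0)                         # mean vector: (1/N) sum of x_i
centered = X - mu
Sigma = centered.T @ centered / len(X)      # covariance: (1/N) sum of (x_i - mu)(x_i - mu)^T

print(mu, Sigma, sep="\n")
```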

Gaussian Mixture Model (GMM): $p(\mathbf{x}|\Theta) = \sum_z \pi_z\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_z,\,\boldsymbol{\Sigma}_z)$, where $\Theta = \{\pi_z, \boldsymbol{\mu}_z, \boldsymbol{\Sigma}_z\}$ and $\sum_z \pi_z = 1$. We can see the sum above just as a function defining the shape of the probability density function, or …

Gaussian Mixture Model: $p(\mathbf{x}) = \sum_z p(\mathbf{x}|z)\,P(z) = \sum_z \pi_z\,\mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_z,\,\boldsymbol{\Sigma}_z)$, or we can see it as a generative probabilistic model described by a Bayesian network with a categorical latent random variable z identifying the Gaussian component that generates the observation x (figure: two-node network with $p(\mathbf{x},z) = p(\mathbf{x}|z)P(z)$). Observations are assumed to be generated as follows: randomly select a Gaussian component according to the probabilities P(z), then generate the observation x from the selected Gaussian distribution. To evaluate p(x), we have to marginalize out z. There is no closed-form solution for training.
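
A minimal sketch of this generative view (sample z, then x) and of evaluating p(x) by marginalizing out z, for a univariate GMM with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
pi  = np.array([0.5, 0.3, 0.2])       # component priors P(z), made up
mu  = np.array([-2.0, 0.0, 3.0])      # component means
var = np.array([0.5, 1.0, 1.5])       # component variances

def sample(n):
    """Generate n observations: pick a component z ~ P(z), then x ~ N(mu_z, var_z)."""
    z = rng.choice(len(pi), size=n, p=pi)
    return rng.normal(mu[z], np.sqrt(var[z]))

def gmm_pdf(x):
    """p(x) = sum_z P(z) N(x; mu_z, var_z): marginalize the latent component z."""
    comp = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.sum(pi * comp)

data = sample(5)
print(data, gmm_pdf(0.3))
```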

Training GMM – Viterbi training. An intuitive and approximate iterative algorithm for training GMM parameters (a code sketch follows below): 1. Using the current model parameters, let the Gaussians classify the data as if the Gaussian components were different classes (even though both the data and all the components correspond to a single class modeled by the GMM). 2. Re-estimate the parameters of each Gaussian using the data associated with it in the previous step. 3. Repeat the previous two steps until the algorithm converges.
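
A rough sketch of this hard-assignment ("Viterbi") training for a univariate GMM; the initial parameters are made up and the code is only meant to illustrate the three steps above.

```python
import numpy as np

def viterbi_train_gmm(x, mu, var, pi, n_iter=10):
    """Hard-assignment training: each frame is given to its best-scoring Gaussian,
    then each Gaussian is re-estimated from the frames assigned to it."""
    for _ in range(n_iter):
        # Step 1: classify every frame by the component with the highest pi_z * N(x; mu_z, var_z).
        log_lik = (-0.5 * (np.log(2 * np.pi * var) + (x[:, None] - mu) ** 2 / var)
                   + np.log(pi))
        z = log_lik.argmax(axis=1)
        # Step 2: re-estimate each component from its assigned frames.
        for k in range(len(mu)):
            xk = x[z == k]
            if len(xk) > 0:
                mu[k] = xk.mean()
                var[k] = ((xk - mu[k]) ** 2).mean() + 1e-6   # floor to avoid zero variance
                pi[k] = len(xk) / len(x)
    return mu, var, pi

x = np.concatenate([np.random.normal(-2, 0.7, 200), np.random.normal(1, 1.0, 300)])
print(viterbi_train_gmm(x, mu=np.array([-1.0, 0.5]),
                        var=np.array([1.0, 1.0]), pi=np.array([0.5, 0.5])))
```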

Training GMM – EM algorithm. Expectation-Maximization is a very general tool applicable to different generative models with latent (hidden) variables. Here, we only show the result of its application to the problem of re-estimating GMM parameters. It is guaranteed to increase the likelihood of the training data in every iteration; however, it is not guaranteed to find the global optimum. The algorithm is very similar to the Viterbi training presented above, but instead of hard alignments of frames to Gaussian components, the posterior probabilities (responsibilities) calculated with the old model are used as soft weights, and the parameters are re-estimated as weighted averages (a code sketch follows below):
$\gamma_{ki} = \dfrac{\pi_k^{(old)}\,\mathcal{N}\bigl(x_i;\,\mu_k^{(old)},\,\sigma_k^{2\,(old)}\bigr)}{\sum_j \pi_j^{(old)}\,\mathcal{N}\bigl(x_i;\,\mu_j^{(old)},\,\sigma_j^{2\,(old)}\bigr)} = \dfrac{p(x_i|k)\,P(k)}{\sum_j p(x_i|j)\,P(j)} = P(k|x_i)$
$\mu_k^{(new)} = \dfrac{\sum_i \gamma_{ki}\,x_i}{\sum_i \gamma_{ki}}, \qquad \sigma_k^{2\,(new)} = \dfrac{\sum_i \gamma_{ki}\,(x_i-\mu_k^{(new)})^2}{\sum_i \gamma_{ki}}, \qquad \pi_k^{(new)} = \dfrac{\sum_i \gamma_{ki}}{\sum_k \sum_i \gamma_{ki}}$
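
A compact sketch implementing these updates (soft responsibilities instead of hard assignments), intended only as an illustration of the formulas above on made-up data:

```python
import numpy as np

def em_step(x, mu, var, pi):
    """One EM iteration for a univariate GMM.
    E-step: responsibilities gamma[i, k] = P(k | x_i) under the old parameters.
    M-step: weighted ML re-estimates of mu, var and pi."""
    # E-step
    comp = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    joint = comp * pi                      # p(x_i | k) P(k)
    gamma = joint / joint.sum(axis=1, keepdims=True)
    # M-step
    nk = gamma.sum(axis=0)                 # effective number of frames per component
    mu_new = (gamma * x[:, None]).sum(axis=0) / nk
    var_new = (gamma * (x[:, None] - mu_new) ** 2).sum(axis=0) / nk
    pi_new = nk / nk.sum()
    return mu_new, var_new, pi_new

x = np.concatenate([np.random.normal(-2, 0.7, 200), np.random.normal(1, 1.0, 300)])
mu, var, pi = np.array([-1.0, 0.5]), np.array([1.0, 1.0]), np.array([0.5, 0.5])
for _ in range(20):
    mu, var, pi = em_step(x, mu, var, pi)
print(mu, var, pi)
```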

GMM to be learned

EM algorithm (figures: successive EM iterations fitting the GMM to the example data).

Classifying a stationary sequence (unvoiced, voiced, silence). Under the frame-independence assumption, $p(X|\mathrm{class}) = \prod_i p(x_i|\mathrm{class})$, so classes are compared by multiplying (or summing the logs of) the per-frame likelihoods.
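
Continuing the earlier per-frame sketch (same made-up class parameters), a small illustration of classifying a whole sequence under the frame-independence assumption by summing per-frame log-likelihoods:

```python
import numpy as np

# Made-up class-conditional Gaussians, as in the per-frame example above.
classes = {
    "silence":  {"mean": -2.0, "var": 0.5, "prior": 0.3},
    "unvoiced": {"mean":  0.0, "var": 1.0, "prior": 0.3},
    "voiced":   {"mean":  3.0, "var": 1.5, "prior": 0.4},
}

def classify_sequence(x):
    """argmax_class [ log P(class) + sum_i log p(x_i | class) ], frames assumed independent."""
    x = np.asarray(x)
    scores = {}
    for c, p in classes.items():
        loglik = -0.5 * (np.log(2 * np.pi * p["var"]) + (x - p["mean"]) ** 2 / p["var"])
        scores[c] = np.log(p["prior"]) + loglik.sum()
    return max(scores, key=scores.get)

print(classify_sequence([2.4, 3.1, 2.8, 3.5]))   # -> 'voiced' for these made-up numbers
```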

Modeling more general sequences: Hidden Markov Models (figure: a left-to-right HMM with state output distributions b1(x), b2(x), b3(x)). Generative model: for each frame, the model moves from one state to another according to a transition probability aij and generates a feature vector from the probability distribution bj(.) associated with the state that was entered. When evaluating such a model, we do not see which path through the states was taken. Let's start by evaluating the HMM for a particular state sequence.

(Figure: left-to-right HMM with transition probabilities a11, a22, a33, a12, a23, a34 and output distributions b1(x), b2(x), b3(x).) For the state sequence S = (1, 1, 2, 3, 3): P(X,S|Θ) = b1(x1) a11 b1(x2) a12 b2(x3) a23 b3(x4) a33 b3(x5)

Evaluating HMM for a particular state sequence: P(X,S|Θ) = b1(x1) a11 b1(x2) a12 b2(x3) a23 b3(x4) a33 b3(x5)

Evaluating HMM for a particular state sequence. The joint likelihood of the observed sequence X and the state sequence S can be decomposed as follows: $P(X,S|\Theta) = P(X|S,\Theta)\,P(S|\Theta)$, where $P(S|\Theta) = \prod_t a_{s_{t-1} s_t}$ is the prior probability of the hidden variable – the state sequence S (for a GMM, the corresponding term was $P(z) = \pi_z$), and $P(X|S,\Theta) = \prod_t b_{s_t}(x_t)$ is the likelihood of the observed sequence X given the state sequence S (for a GMM, the corresponding term was $p(\mathbf{x}|z) = \mathcal{N}(\mathbf{x};\,\boldsymbol{\mu}_z,\,\boldsymbol{\Sigma}_z)$).
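
A tiny sketch of this decomposition for a 3-state left-to-right model like the one above: given a concrete state sequence, the joint log-likelihood is the sum of the traversed log transition probabilities and the per-state log output likelihoods. The transition values and Gaussian output parameters below are made up.

```python
import numpy as np

# Made-up transition matrix a[i, j] for a 3-state left-to-right HMM.
a = np.array([[0.6, 0.4, 0.0],
              [0.0, 0.7, 0.3],
              [0.0, 0.0, 0.9]])
# Made-up Gaussian output distributions b_j(x).
means, variances = np.array([-1.0, 0.5, 2.0]), np.array([1.0, 0.8, 1.2])

def log_b(j, x):
    """log b_j(x): log-likelihood of frame x in state j."""
    return -0.5 * (np.log(2 * np.pi * variances[j]) + (x - means[j]) ** 2 / variances[j])

def log_joint(x_seq, s_seq):
    """log P(X, S | Theta) = log b_{s_1}(x_1) + sum_t [ log a_{s_{t-1} s_t} + log b_{s_t}(x_t) ]."""
    ll = log_b(s_seq[0], x_seq[0])                    # model assumed to start in s_seq[0]
    for t in range(1, len(x_seq)):
        ll += np.log(a[s_seq[t - 1], s_seq[t]]) + log_b(s_seq[t], x_seq[t])
    return ll

# e.g. the path 1,1,2,3,3 (0-based: 0,0,1,2,2) for five frames
print(log_joint([-0.8, -1.1, 0.4, 1.9, 2.2], [0, 0, 1, 2, 2]))
```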

Evaluating HMM (for any state sequence). Since we do not know the underlying state sequence, we must marginalize – compute and sum the likelihoods over all possible paths: $P(X|\Theta) = \sum_S P(X,S|\Theta)$.
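
This marginalization is what the forward algorithm computes efficiently; a minimal log-domain sketch with a made-up two-state model (using scipy's logsumexp):

```python
import numpy as np
from scipy.special import logsumexp

def forward_loglik(log_start, log_a, log_b_seq):
    """log P(X|Theta) = log sum_S P(X,S|Theta), via the forward recursion
    alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] * b_j(x_t), computed in the log domain.
    log_b_seq[t, j] holds log b_j(x_t)."""
    alpha = log_start + log_b_seq[0]
    for log_b_t in log_b_seq[1:]:
        alpha = logsumexp(alpha[:, None] + log_a, axis=0) + log_b_t
    return float(logsumexp(alpha))

# Toy 2-state example with made-up parameters (already in the log domain).
log_start = np.log(np.array([0.9, 0.1]))
log_a = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
log_b_seq = np.log(np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]))   # T x N state likelihoods
print(forward_loglik(log_start, log_a, log_b_seq))
```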

Finding the best (Viterbi) path: $\hat{S} = \operatorname*{argmax}_S P(X,S|\Theta)$.
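
A matching sketch of the Viterbi recursion, which replaces the sum in the forward algorithm with a max and backtracks the best state sequence (same style of made-up toy model as above):

```python
import numpy as np

def viterbi(log_start, log_a, log_b_seq):
    """Viterbi decoding: the forward recursion with max instead of sum, plus backtracking.
    log_b_seq[t, j] holds log b_j(x_t). Returns (best log P(X,S|Theta), best state sequence)."""
    delta = log_start + log_b_seq[0]
    backptr = []
    for log_b_t in log_b_seq[1:]:
        scores = delta[:, None] + log_a            # scores[i, j]: best path ending with i -> j
        backptr.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + log_b_t
    path = [int(delta.argmax())]
    for bp in reversed(backptr):
        path.append(int(bp[path[-1]]))
    return float(delta.max()), path[::-1]

# Toy 2-state example with made-up (log-domain) parameters.
log_start = np.log(np.array([0.9, 0.1]))
log_a = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
log_b_seq = np.log(np.array([[0.8, 0.2], [0.6, 0.4], [0.1, 0.9]]))   # T x N state likelihoods
print(viterbi(log_start, log_a, log_b_seq))
```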

Training HMMs – Viterbi training. Similar to the approximate training we have already seen for GMMs: 1. For each training utterance, find the Viterbi path through the HMM, which associates feature frames with states. 2. Re-estimate the state distributions using the associated feature frames. 3. Repeat steps 1 and 2 until the algorithm converges.

Training HMMs using EM (figure: trellis of HMM states s over time t). Instead of a single hard Viterbi alignment, state occupation posteriors computed over all paths are used as soft weights for re-estimation.

Isolated word recognition (figure: one HMM per word, here YES and NO); the word whose model gives the highest likelihood for the utterance is recognized.
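
A sketch of this decision under made-up word models: build a left-to-right HMM per word, evaluate each with the forward algorithm, and pick the best-scoring word. All parameters here are invented for illustration; real models would be trained from data.

```python
import numpy as np
from scipy.special import logsumexp

def make_model(means):
    """Left-to-right HMM with one Gaussian per state (made-up transition values)."""
    n = len(means)
    a = np.full((n, n), 1e-10)
    for i in range(n):
        if i + 1 < n:
            a[i, i], a[i, i + 1] = 0.6, 0.4
        else:
            a[i, i] = 1.0
    start = np.full(n, 1e-10); start[0] = 1.0
    return {"log_start": np.log(start), "log_a": np.log(a),
            "means": np.array(means), "vars": np.ones(n)}

def log_b_seq(model, x_seq):
    """T x N matrix of per-state Gaussian log-likelihoods."""
    d = np.asarray(x_seq)[:, None] - model["means"]
    return -0.5 * (np.log(2 * np.pi * model["vars"]) + d ** 2 / model["vars"])

def forward_loglik(model, x_seq):
    lb = log_b_seq(model, x_seq)
    alpha = model["log_start"] + lb[0]
    for t in range(1, len(lb)):
        alpha = logsumexp(alpha[:, None] + model["log_a"], axis=0) + lb[t]
    return float(logsumexp(alpha))

words = {"YES": make_model([-1.0, 0.5, 2.0]), "NO": make_model([1.5, -0.5])}
utterance = [-0.9, -1.1, 0.3, 0.6, 1.8, 2.2]
print(max(words, key=lambda w: forward_loglik(words[w], utterance)))   # -> 'YES' here
```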

Connected word recognition (figure: word models for YES and NO connected in a loop with silence models "sil"). See which words were traversed by the Viterbi path.

Phoneme-based models (figure: the word YES built by concatenating the phoneme models y, eh, s).

Using a language model – unigram (figure: parallel phoneme-based word models one /w ah n/, two /t uw/, three /th r iy/ and silence, entered with unigram probabilities P(one), P(two), P(three)).

Using a language model – bigram (figure: the word models one /w ah n/, two /t uw/, three /th r iy/ with silence models, connected by bigram transition probabilities P(W2|W1)).

Other basic ASR topics not covered by this presentation: context-dependent models, training of phoneme-based models, feature extraction (delta parameters, de-correlation of features), full-covariance vs. diagonal-covariance modeling, adaptation to the speaker or acoustic condition, language modeling (LM smoothing, back-off), discriminative training (MMI or MPE), and so on.