Speech Technology Lab, EEM4R Spoken Language Processing - Introduction: Training HMMs. Version 4: February 2005.

Presentation transcript:

Training HMMs. Version 4: February 2005

Outline
- Parameter estimation
- Maximum Likelihood (ML) parameter estimation
- ML for Gaussian PDFs
- ML for HMMs: the Baum-Welch algorithm
- HMM adaptation:
  - MAP estimation
  - MLLR

Discrete variables
- Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}
- Suppose that y_1, y_2, …, y_N are samples of the random variable Y
- If c_m is the number of times that y_n = x_m, then an estimate of the probability that Y takes the value x_m is given by P(x_m) ≈ c_m / N
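A minimal sketch of this counting estimate in Python (the function name and the toy sample set are illustrative, not from the slides):

    from collections import Counter

    def relative_frequencies(samples):
        """Estimate P(x_m) as c_m / N, where c_m is the number of occurrences of x_m."""
        counts = Counter(samples)
        total = len(samples)
        return {symbol: count / total for symbol, count in counts.items()}

    # Example: N = 6 samples of a discrete variable over the set {'a', 'b', 'c'}
    print(relative_frequencies(['a', 'b', 'a', 'c', 'a', 'b']))
    # {'a': 0.5, 'b': 0.333..., 'c': 0.166...}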

Discrete Probability Mass Function (figure: the number of occurrences of each symbol, their total, and the resulting probability estimates)

Continuous Random Variables
- In most practical applications the data are not restricted to a finite set of values: they can take any value in N-dimensional space
- Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
- …but there are generalisations of this approach which are applicable to continuous variables; these are referred to as non-parametric methods

Continuous Random Variables
- An alternative is to use a parametric model
- In a parametric model, probabilities are defined by a small set of parameters
- The simplest example is a normal, or Gaussian, model
- A Gaussian probability density function (PDF) is defined by two parameters:
  - its mean μ, and
  - its variance σ²

Gaussian PDF
- 'Standard' 1-dimensional Gaussian PDF:
  - mean μ = 0
  - variance σ² = 1

Gaussian PDF (figure: a Gaussian curve with the shaded area between a and b equal to P(a ≤ x ≤ b))

Gaussian PDF
- For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:
  p(y | μ, σ²) = (1 / sqrt(2πσ²)) exp( -(y - μ)² / (2σ²) )
  - The constant 1/sqrt(2πσ²) ensures that the area under the curve is 1
  - The exponential term defines the 'bell' shape
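As a quick sketch, the formula above can be written directly in Python (standard library only; the function name is illustrative):

    import math

    def gaussian_pdf(y, mu, var):
        """1-D Gaussian density: (1 / sqrt(2*pi*var)) * exp(-(y - mu)**2 / (2*var))."""
        norm = 1.0 / math.sqrt(2.0 * math.pi * var)            # constant: area under curve is 1
        return norm * math.exp(-(y - mu) ** 2 / (2.0 * var))   # 'bell'-shaped exponential term

    print(gaussian_pdf(0.0, 0.0, 1.0))   # peak of the standard Gaussian, about 0.3989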

More examples (figure: Gaussian PDFs with σ² = 0.1, 1.0, 5.0 and 10.0)

Fitting a Gaussian PDF to Data
- Suppose y = y_1, …, y_n, …, y_T is a sequence of T data values
- Given a Gaussian PDF p with mean μ and variance σ², define the probability of the whole sequence as the product of the individual probabilities:
  p(y | μ, σ²) = p(y_1 | μ, σ²) × p(y_2 | μ, σ²) × … × p(y_T | μ, σ²)
- How do we choose μ and σ² to maximise this probability?
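In practice this product is computed as a sum of logs to avoid numerical underflow. A small sketch, reusing the gaussian_pdf helper from the previous example (the data values are made up for illustration):

    import math

    def log_likelihood(y, mu, var):
        """log p(y | mu, var) = sum over t of log p(y_t | mu, var), treating the samples as independent."""
        return sum(math.log(gaussian_pdf(y_t, mu, var)) for y_t in y)

    data = [1.2, 0.7, 1.9, 1.4]
    print(log_likelihood(data, 1.0, 1.0))
    print(log_likelihood(data, 1.3, 0.2))   # a better-fitting mean and variance give a higher value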

Fitting a Gaussian PDF to Data (figure: the same data shown with a poorly fitting Gaussian and a well-fitting Gaussian)

Maximum Likelihood Estimation
- Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised
- Terminology:
  - p(y | μ, σ²) as a function of y is the probability (density) of y
  - p(y | μ, σ²) as a function of μ, σ² is the likelihood of μ, σ²
- Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ²

ML estimation of μ, σ²
- Intuitively:
  - The maximum likelihood estimate of μ should be the average value of y_1, …, y_T (the sample mean)
  - The maximum likelihood estimate of σ² should be the variance of y_1, …, y_T (the sample variance)
- This turns out to be true: p(y | μ, σ²) is maximised by setting
  μ = (1/T) Σ_t y_t   and   σ² = (1/T) Σ_t (y_t - μ)²
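A sketch of these estimates in Python (the function name is illustrative):

    def ml_estimates(y):
        """Maximum-likelihood mean and variance: the sample mean and sample variance."""
        T = len(y)
        mu = sum(y) / T
        var = sum((y_t - mu) ** 2 for y_t in y) / T   # note: divides by T, not T - 1
        return mu, var

    print(ml_estimates([1.2, 0.7, 1.9, 1.4]))   # approximately (1.3, 0.185)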

Proof (for μ)
- First note that maximising p(y) is the same as maximising log p(y)
- At a maximum: d/dμ log p(y | μ, σ²) = 0
- Also: log p(y | μ, σ²) = Σ_t log p(y_t | μ, σ²), and for the Gaussian PDF d/dμ log p(y_t | μ, σ²) = (y_t - μ) / σ²
- So, setting Σ_t (y_t - μ) / σ² = 0 gives Σ_t y_t = T μ, i.e. μ = (1/T) Σ_t y_t; a similar argument for σ² gives the sample variance

ML training for HMMs
- Now consider:
  - an N-state HMM M, each of whose states is associated with a Gaussian PDF
  - a training sequence y_1, …, y_T
- For simplicity, assume that each y_t is 1-dimensional

ML training for HMMs
- If we knew that:
  - y_1, …, y_e(1) correspond to state 1
  - y_e(1)+1, …, y_e(2) correspond to state 2
  - …
  - y_e(n-1)+1, …, y_e(n) correspond to state n
  - …
- then we could set the mean of state n to the average value of y_e(n-1)+1, …, y_e(n)

ML Training for HMMs
- y_1, …, y_e(1), y_e(1)+1, …, y_e(2), y_e(2)+1, …, y_T
- Unfortunately we don't know that y_e(n-1)+1, …, y_e(n) correspond to state n…

Solution
1. Define an initial HMM, M_0
2. Use the Viterbi algorithm to compute the optimal state sequence between M_0 and y_1, …, y_T
(figure: trellis aligning the states of M_0 against the observations y_1, y_2, y_3, …, y_t, …, y_T)

Solution (continued)
- Use the optimal state sequence to segment y (figure: y_1, …, y_e(1), y_e(1)+1, …, y_e(2), …, y_T divided into one consecutive segment per state)
- Re-estimate the parameters from these segments to get a new model M_1

Solution (continued)
- Now repeat the whole process using M_1 instead of M_0, to get a new model M_2
- Then repeat again using M_2 to get a new model M_3, and so on
- p(y | M_0) ≤ p(y | M_1) ≤ p(y | M_2) ≤ … ≤ p(y | M_n) ≤ …
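A compact sketch of this loop (often called Viterbi training, as a later slide notes). The viterbi_align helper and the model interface are hypothetical placeholders, and transition-probability re-estimation is omitted for brevity:

    def viterbi_training(model, y, num_iterations=10):
        """Alternate Viterbi alignment with per-state ML re-estimation of the Gaussians."""
        for _ in range(num_iterations):
            states = viterbi_align(model, y)    # hypothetical: best state for each frame under the current model
            for n in model.state_ids():         # hypothetical model interface
                segment = [y_t for y_t, s in zip(y, states) if s == n]
                if segment:
                    # Re-fit this state's Gaussian by ML on the frames aligned to it
                    mu = sum(segment) / len(segment)
                    var = sum((y_t - mu) ** 2 for y_t in segment) / len(segment)
                    model.set_mean(n, mu)
                    model.set_var(n, var)
        return model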

Local optimization (figure: p(y | M) plotted against the model M; the sequence M_0, M_1, …, M_n converges to a local optimum, which need not be the global optimum)

Baum-Welch optimization
- The algorithm just described is often called Viterbi training or Viterbi reestimation
- It is often used to train large sets of HMMs
- An alternative method is called Baum-Welch reestimation

Baum-Welch Reestimation (figure: trellis of states against observations y_1, y_2, …, y_t, y_t+1, …, y_T; rather than a hard assignment of frames to states, each state s_i is given an occupation probability at time t, written γ_t(i) = P(s_i | y_t))

'Forward' Probabilities (figure: trellis over observations y_1, y_2, …, y_t, y_t+1, …, y_T illustrating the forward probabilities)

'Backward' Probabilities (figure: trellis over observations y_1, y_2, …, y_t, y_t+1, …, y_T illustrating the backward probabilities)

'Forward-Backward' Algorithm (figure: trellis over observations y_1, y_2, …, y_t, y_t+1, …, y_T combining the forward and backward probabilities)
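The slides show only the trellis diagrams, so as a sketch under standard assumptions, here is a minimal, unscaled forward-backward pass that produces the state occupation probabilities γ_t(i) used by Baum-Welch re-estimation. A real implementation would scale the recursions or work in the log domain to avoid underflow, and B[t][i] would be each state's Gaussian evaluated at y_t:

    def forward_backward(pi, A, B):
        """Return gamma[t][i] = P(state i at time t | whole observation sequence).

        pi[i]   : initial probability of state i
        A[i][j] : transition probability from state i to state j
        B[t][i] : likelihood of observation y_t given state i
        """
        T, N = len(B), len(pi)
        # Forward pass: alpha[t][i] = p(y_1..y_t, state i at time t)
        alpha = [[0.0] * N for _ in range(T)]
        for i in range(N):
            alpha[0][i] = pi[i] * B[0][i]
        for t in range(1, T):
            for j in range(N):
                alpha[t][j] = sum(alpha[t - 1][i] * A[i][j] for i in range(N)) * B[t][j]
        # Backward pass: beta[t][i] = p(y_{t+1}..y_T | state i at time t)
        beta = [[1.0] * N for _ in range(T)]
        for t in range(T - 2, -1, -1):
            for i in range(N):
                beta[t][i] = sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))
        # Combine: the soft frame-to-state weights used to re-estimate means and variances
        gamma = []
        for t in range(T):
            norm = sum(alpha[t][i] * beta[t][i] for i in range(N))
            gamma.append([alpha[t][i] * beta[t][i] / norm for i in range(N)])
        return gamma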

Adaptation
- A modern large-vocabulary continuous speech recognition system has many thousands of parameters
- Many hours of speech data are used to train the system
- The speech data comes from many speakers
- Hence the recogniser is 'speaker independent'
- But performance for an individual would be better if the system were speaker dependent

Adaptation
- For a single speaker, only a small amount of training data is available
- Viterbi reestimation or Baum-Welch reestimation will not work

'Parameters vs training data' (figure: recognition performance plotted against the number of parameters, for a larger and a smaller training set)

Adaptation
- Two common approaches to adaptation (with small amounts of training data):
  - Bayesian adaptation, also known as MAP adaptation (MAP = Maximum A Posteriori)
  - Transform-based adaptation, also known as MLLR (Maximum Likelihood Linear Regression)

Bayesian adaptation
- Uses the well-trained, 'speaker-independent' HMM as a prior for the estimate of the parameters of the speaker-dependent HMM
- E.g. (figure): a speaker-independent state PDF, the sample mean of the speaker's adaptation data, and the MAP estimate of the mean, which lies between the two
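One standard way to realise this (the exact formula is not spelled out on the slides, so treat it as an assumption) interpolates the speaker-independent mean with the sample mean of the adaptation data, weighted by how much data aligns with the state; tau is a prior weight chosen by the system builder:

    def map_adapted_mean(mu_si, tau, gammas, y):
        """MAP estimate of a state mean.

        mu_si     : speaker-independent (prior) mean
        tau       : prior weight (how strongly to trust the prior)
        gammas[t] : state occupation probability for adaptation frame t
        y[t]      : adaptation data
        With little data the estimate stays close to mu_si; with lots of data
        it moves towards the sample mean of the speaker-dependent data.
        """
        occupancy = sum(gammas)
        weighted_sum = sum(g * y_t for g, y_t in zip(gammas, y))
        return (tau * mu_si + weighted_sum) / (tau + occupancy)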

Transform-based adaptation
- Each acoustic vector is typically 40-dimensional
- So a linear transform of the acoustic data needs 40 x 40 = 1600 parameters
- This is far fewer than the tens or hundreds of thousands of parameters needed to train the whole system
- Estimate a linear transform that maps the speaker-independent parameters to speaker-dependent parameters

Transform-based adaptation (figure: speaker-independent parameters, speaker-dependent data points, the 'best fit' transform, and the resulting adapted parameters)
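A minimal sketch of how a single shared MLLR transform would be applied once it has been estimated; estimating A and b by maximum likelihood from the adaptation data is the substantive part of MLLR and is not shown here (the function name is illustrative):

    def apply_mllr_transform(A, b, means):
        """Apply one shared affine transform to every speaker-independent mean:
        mu_adapted = A * mu + b, where A is d x d and b has length d (d is about 40 here).
        """
        d = len(b)
        adapted = []
        for mu in means:
            adapted.append([sum(A[i][j] * mu[j] for j in range(d)) + b[i] for i in range(d)])
        return adapted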

Summary
- Maximum Likelihood (ML) estimation
- Viterbi HMM parameter estimation
- Baum-Welch HMM parameter estimation
- Forward and backward probabilities
- Adaptation:
  - Bayesian adaptation
  - Transform-based adaptation