EEM4R Spoken Language Processing - Introduction
Training HMMs
Version 4: February 2005
Outline
– Parameter estimation
– Maximum Likelihood (ML) parameter estimation
– ML for Gaussian PDFs
– ML for HMMs – the Baum-Welch algorithm
– HMM adaptation:
  – MAP estimation
  – MLLR
Discrete variables
– Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}
– Suppose that y_1, y_2, …, y_N are samples of the random variable Y
– If c_m is the number of times that y_n = x_m, then an estimate of the probability that Y takes the value x_m is given by:

P(x_m) ≈ c_m / N
Discrete Probability Mass Function
[Figure: bar chart of the number of occurrences of each symbol; dividing each count by the total number of samples gives the probability mass function]
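The counting estimate above is easy to state in code. A minimal Python sketch (the function name and sample data are illustrative, not from the slides):

```python
from collections import Counter

def estimate_pmf(samples):
    """Estimate P(x_m) = c_m / N by counting occurrences."""
    counts = Counter(samples)          # c_m for each symbol x_m
    total = len(samples)               # N
    return {x: c / total for x, c in counts.items()}

# Example: N = 10 samples drawn from the set {a, b, c}
y = ["a", "b", "a", "c", "a", "b", "a", "c", "b", "a"]
print(estimate_pmf(y))                 # {'a': 0.5, 'b': 0.3, 'c': 0.2}
```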
Continuous Random Variables
– In most practical applications the data are not restricted to a finite set of values – they can take any value in N-dimensional space
– Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
– …but there are generalisations of this approach which are applicable to continuous variables – these are referred to as non-parametric methods
Continuous Random Variables
– An alternative is to use a parametric model
– In a parametric model, probabilities are defined by a small set of parameters
– The simplest example is a normal, or Gaussian, model
– A Gaussian probability density function (PDF) is defined by two parameters: its mean μ and its variance σ²
Gaussian PDF
‘Standard’ 1-dimensional Gaussian PDF:
– mean μ = 0
– variance σ² = 1
Gaussian PDF
[Figure: the area under the curve between a and b gives P(a ≤ x ≤ b)]
Gaussian PDF
For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

p(y | μ, σ²) = (1 / √(2πσ²)) · exp( −(y − μ)² / (2σ²) )

The factor 1/√(2πσ²) is a constant that ensures the area under the curve is 1; the exponential term defines the ‘bell’ shape.
More examples
[Figure: Gaussian PDFs with variances σ² = 0.1, 1.0, 5.0 and 10.0]
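The density formula above translates directly into code. A minimal Python sketch (the function name is illustrative):

```python
import numpy as np

def gaussian_pdf(y, mean, var):
    """1-D Gaussian density: normalising constant times the 'bell' term."""
    const = 1.0 / np.sqrt(2.0 * np.pi * var)      # makes the area under the curve 1
    return const * np.exp(-((y - mean) ** 2) / (2.0 * var))

# 'Standard' Gaussian (mean 0, variance 1) evaluated at a few points
print(gaussian_pdf(np.array([-1.0, 0.0, 1.0]), mean=0.0, var=1.0))
# peak value at y = 0 is 1/sqrt(2*pi) ≈ 0.3989
```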
Fitting a Gaussian PDF to Data
Suppose y = y_1, …, y_t, …, y_T is a sequence of T data values. Given a Gaussian PDF p with mean μ and variance σ², define:

p(y | μ, σ²) = ∏_{t=1}^{T} p(y_t | μ, σ²)

How do we choose μ and σ² to maximise this probability?
Fitting a Gaussian PDF to Data
[Figure: two Gaussians fitted to the same data – one a poor fit, one a good fit]
Maximum Likelihood Estimation
– Define the best fitting Gaussian to be the one such that p(y | μ, σ²) is maximised.
– Terminology:
  – p(y | μ, σ²) as a function of y is the probability (density) of y
  – p(y | μ, σ²) as a function of μ, σ² is the likelihood of μ, σ²
– Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ²
ML estimation of μ, σ²
Intuitively:
– the maximum likelihood estimate of μ should be the average value of y_1, …, y_T (the sample mean)
– the maximum likelihood estimate of σ² should be the variance of y_1, …, y_T (the sample variance)
This turns out to be true: p(y | μ, σ²) is maximised by setting:

μ̂ = (1/T) ∑_{t=1}^{T} y_t,    σ̂² = (1/T) ∑_{t=1}^{T} (y_t − μ̂)²
Proof
First note that maximising p(y) is the same as maximising log p(y):

log p(y | μ, σ²) = −(T/2) log(2πσ²) − ∑_{t=1}^{T} (y_t − μ)² / (2σ²)

At a maximum:

∂/∂μ log p(y | μ, σ²) = 0

Also:

∂/∂μ log p(y | μ, σ²) = ∑_{t=1}^{T} (y_t − μ) / σ²

So ∑_t (y_t − μ) = 0, which gives μ = (1/T) ∑_t y_t. (A similar argument with respect to σ² gives the sample variance.)
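A short Python sketch tying the last few slides together: it computes the ML estimates (sample mean and variance) and the log-likelihood they maximise. The function names are illustrative:

```python
import numpy as np

def ml_gaussian_fit(y):
    """ML estimates: sample mean and sample variance of y_1..y_T."""
    mean = y.mean()                    # (1/T) * sum_t y_t
    var = ((y - mean) ** 2).mean()     # (1/T) * sum_t (y_t - mean)^2
    return mean, var

def log_likelihood(y, mean, var):
    """log p(y | mean, var) = sum_t log p(y_t | mean, var)."""
    T = len(y)
    return -0.5 * T * np.log(2.0 * np.pi * var) - ((y - mean) ** 2).sum() / (2.0 * var)

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=1000)   # true mean 3, variance 4
m, v = ml_gaussian_fit(y)
print(m, v)                                     # close to 3 and 4
print(log_likelihood(y, m, v))                  # maximised at the ML estimates
```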
ML training for HMMs
Now consider:
– an N-state HMM M, each of whose states is associated with a Gaussian PDF
– a training sequence y_1, …, y_T
For simplicity, assume that each y_t is 1-dimensional.
ML training for HMMs
If we knew that:
– y_1, …, y_e(1) correspond to state 1
– y_e(1)+1, …, y_e(2) correspond to state 2
– …
– y_e(n−1)+1, …, y_e(n) correspond to state n
– …
then we could set the mean of state n to the average value of y_e(n−1)+1, …, y_e(n).
ML Training for HMMs
y_1, …, y_e(1), y_e(1)+1, …, y_e(2), y_e(2)+1, …, y_T
Unfortunately we don’t know that y_e(n−1)+1, …, y_e(n) correspond to state n…
Solution
1. Define an initial HMM, M_0
2. Use the Viterbi algorithm to compute the optimal state sequence for y_1, …, y_T given M_0
[Figure: trellis aligning the states of M_0 with the observations y_1, y_2, y_3, …, y_t, …, y_T]
Solution (continued)
Use the optimal state sequence to segment y:
y_1, …, y_e(1), y_e(1)+1, …, y_e(2), …, y_T
Reestimate the parameters to get a new model, M_1.
Solution (continued)
Now repeat the whole process using M_1 instead of M_0 to get a new model M_2, then repeat again using M_2 to get a new model M_3, and so on. Each iteration can only improve the fit:

p(y | M_0) ≤ p(y | M_1) ≤ p(y | M_2) ≤ … ≤ p(y | M_n) ≤ …
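A sketch of this Viterbi reestimation loop under some simplifying assumptions: 1-dimensional observations, one Gaussian per state, a fixed log-transition matrix log_trans passed in, and the model starting in state 0. All names are illustrative, and transition probabilities are not reestimated here:

```python
import numpy as np

def log_gauss(y, mean, var):
    return -0.5 * (np.log(2 * np.pi * var) + (y - mean) ** 2 / var)

def viterbi_align(y, means, variances, log_trans):
    """Best state sequence for y under a left-to-right HMM (start in state 0)."""
    T, N = len(y), len(means)
    delta = np.full((T, N), -np.inf)      # best log score ending in state i at time t
    psi = np.zeros((T, N), dtype=int)     # backpointers
    delta[0, 0] = log_gauss(y[0], means[0], variances[0])
    for t in range(1, T):
        for j in range(N):
            scores = delta[t - 1] + log_trans[:, j]
            psi[t, j] = scores.argmax()
            delta[t, j] = scores.max() + log_gauss(y[t], means[j], variances[j])
    states = np.zeros(T, dtype=int)
    states[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):        # trace the best path backwards
        states[t] = psi[t + 1, states[t + 1]]
    return states

def viterbi_train(y, means, variances, log_trans, n_iter=5):
    """Viterbi reestimation: align with M_k, then refit each state's Gaussian."""
    for _ in range(n_iter):
        states = viterbi_align(y, means, variances, log_trans)  # segment y
        for n in range(len(means)):
            y_n = y[states == n]                  # frames assigned to state n
            if len(y_n) > 1:
                means[n] = y_n.mean()             # new mean: average of its frames
                variances[n] = max(y_n.var(), 1e-3)  # floor keeps the density well-defined
    return means, variances                       # parameters of the final model
```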
Local optimization
[Figure: p(y | M) plotted against M – the sequence M_0, M_1, …, M_n climbs to a local optimum, which is not necessarily the global optimum]
Baum-Welch optimization
– The algorithm just described is often called Viterbi training or Viterbi reestimation
– It is often used to train large sets of HMMs
– An alternative method is called Baum-Welch reestimation
Baum-Welch Reestimation
Instead of assigning each y_t to a single state, compute for every state s_i and time t the probability that the model is in state s_i at time t, given the whole observation sequence:

γ_t(i) = P(s_i at time t | y)

[Figure: trellis over y_1, …, y_t, y_t+1, …, y_T with every state active at each time]
‘Forward’ Probabilities
α_t(i) = p(y_1, …, y_t, state s_i at time t | M)
– the probability of the partial observation sequence up to time t, ending in state s_i, computed recursively from left to right
‘Backward’ Probabilities
β_t(i) = p(y_t+1, …, y_T | state s_i at time t, M)
– the probability of the remaining observations from time t+1 to T, given state s_i at time t, computed recursively from right to left
‘Forward-Backward’ Algorithm
Combining the two gives the state occupation probabilities:

γ_t(i) = α_t(i) β_t(i) / p(y | M),   where p(y | M) = ∑_i α_T(i)

These γ_t(i) replace the hard segmentation of Viterbi training: each y_t contributes to the reestimation of every state, weighted by γ_t(i).
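A compact Python sketch of the forward-backward computation of γ_t(i). Array names are illustrative, and a practical implementation would scale α and β or work in the log domain to avoid underflow on long sequences:

```python
import numpy as np

def forward_backward(obs_like, trans, init):
    """gamma_t(i) = P(state i at time t | y) for an HMM.
    obs_like[t, i] = p(y_t | state i); trans[i, j] = a_ij; init[i] = pi_i."""
    T, N = obs_like.shape
    alpha = np.zeros((T, N))              # forward: p(y_1..y_t, state i at t)
    beta = np.zeros((T, N))               # backward: p(y_{t+1}..y_T | state i at t)
    alpha[0] = init * obs_like[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ trans) * obs_like[t]
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = trans @ (obs_like[t + 1] * beta[t + 1])
    py = alpha[-1].sum()                  # p(y | M)
    gamma = alpha * beta / py             # state occupation probabilities
    return gamma, py

# Example: 2-state HMM, 3 observations with precomputed state likelihoods
obs_like = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
trans = np.array([[0.7, 0.3], [0.0, 1.0]])
gamma, py = forward_backward(obs_like, trans, init=np.array([1.0, 0.0]))
print(gamma.sum(axis=1))                  # each row sums to 1
```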
Adaptation
– A modern large-vocabulary continuous speech recognition system has many thousands of parameters
– Many hours of speech data are used to train the system
– The speech data comes from many speakers, hence the recogniser is ‘speaker independent’
– But performance for an individual would be better if the system were speaker dependent
Adaptation
– For a single speaker, only a small amount of training data is available
– Viterbi reestimation or Baum-Welch reestimation will not work
‘Parameters vs training data’
[Figure: performance as a function of the number of parameters, for a larger and a smaller training set]
Adaptation
Two common approaches to adaptation (with small amounts of training data):
– Bayesian adaptation, also known as MAP adaptation (MAP = Maximum A Posteriori)
– Transform-based adaptation, also known as MLLR (MLLR = Maximum Likelihood Linear Regression)
Bayesian adaptation
Uses the well-trained, ‘speaker-independent’ HMM as a prior for the estimate of the parameters of the speaker-dependent HMM.
[Figure: the MAP estimate of a state mean lies between the mean of the speaker-independent state PDF and the sample mean of the adaptation data]
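A common form of the MAP update for a state mean interpolates between the prior (speaker-independent) mean and the sample mean of the adaptation data, with a weight τ that is a design choice. A minimal sketch, with τ and the function name assumed for illustration:

```python
import numpy as np

def map_adapt_mean(prior_mean, y, tau=10.0):
    """MAP estimate of a state mean: weighted interpolation between the
    speaker-independent (prior) mean and the sample mean of the adaptation
    data y. tau controls how much weight the prior carries (assumed value)."""
    T = len(y)
    return (tau * prior_mean + T * y.mean()) / (tau + T)

# With little data (small T) the estimate stays near the prior;
# with lots of data it approaches the sample mean.
y = np.array([1.2, 0.8, 1.1])
print(map_adapt_mean(prior_mean=0.0, y=y, tau=10.0))   # pulled towards 0
```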
Transform-based adaptation
– Each acoustic vector is typically 40-dimensional
– So a linear transform of the acoustic data needs 40 × 40 = 1600 parameters
– This is much less than the tens or hundreds of thousands of parameters needed to train the whole system
– Estimate a linear transform that maps the speaker-independent parameters to speaker-dependent parameters
Transform-based adaptation
[Figure: a ‘best fit’ transform maps the speaker-independent parameters towards the speaker-dependent data points, giving the adapted parameters]
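A sketch of the idea behind MLLR, simplified to a weighted least-squares fit of a transform W = [b A] from the speaker-independent means to state-level sample means of the adaptation data. Real MLLR maximises likelihood and weights by the state covariances; this version only illustrates the transform estimation, and all names are assumptions:

```python
import numpy as np

def estimate_mllr_transform(si_means, data_means, counts):
    """Least-squares sketch of an MLLR-style transform:
    adapted_mean = A @ si_mean + b, written as W @ [1, si_mean].
    si_means[n]:   speaker-independent mean of state n (shape N x d)
    data_means[n]: sample mean of state n's adaptation frames
    counts[n]:     number of frames assigned to state n (row weights)"""
    N, d = si_means.shape
    xi = np.hstack([np.ones((N, 1)), si_means])        # extended vectors [1, mu]
    w = np.sqrt(counts)[:, None]                       # weight rows by frame count
    Wt, *_ = np.linalg.lstsq(xi * w, data_means * w, rcond=None)
    return Wt.T                                        # W has shape d x (d+1)

def apply_transform(W, mean):
    """Adapted mean: W @ [1, mean]."""
    return W @ np.concatenate([[1.0], mean])
```

With only one transform shared across all states, a few seconds of adaptation data can move every state PDF at once, which is the point of the method.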
Summary
– Maximum Likelihood (ML) estimation
– Viterbi HMM parameter estimation
– Baum-Welch HMM parameter estimation
– Forward and backward probabilities
– Adaptation:
  – Bayesian adaptation
  – Transform-based adaptation