Slide 1: Training HMMs
EEM4R Spoken Language Processing - Introduction
Speech Technology Lab
Version 4: February 2005
Slide 2: Outline
– Parameter estimation
– Maximum Likelihood (ML) parameter estimation
– ML for Gaussian PDFs
– ML for HMMs: the Baum-Welch algorithm
– HMM adaptation: MAP estimation, MLLR
Slide 3: Discrete variables
Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, …, x_M}.
Suppose that y_1, y_2, …, y_N are samples of the random variable Y.
If c_m is the number of times that y_n = x_m, then an estimate of the probability that Y takes the value x_m is given by:

P(x_m) ≈ c_m / N
Slide 4: Discrete Probability Mass Function

Symbol             1    2    3    4    5    6    7    8    9  Total
Num. occurrences 120  231   90   87   63   57  156  203   91   1098
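As a quick illustration (not part of the original slides), a minimal Python sketch that turns the counts in the table above into relative-frequency probability estimates:

```python
# Minimal sketch: relative-frequency estimate of a discrete PMF
# from the symbol counts on the slide above.
counts = {1: 120, 2: 231, 3: 90, 4: 87, 5: 63, 6: 57, 7: 156, 8: 203, 9: 91}

N = sum(counts.values())                      # total number of observations (1098)
pmf = {x: c / N for x, c in counts.items()}   # P(x_m) ~= c_m / N

for symbol, prob in pmf.items():
    print(f"P({symbol}) = {prob:.3f}")
```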
Slide 5: Continuous Random Variables
In most practical applications the data are not restricted to a finite set of values: they can take any value in N-dimensional space.
Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities…
…but there are generalisations of this approach which are applicable to continuous variables; these are referred to as non-parametric methods.
Slide 6: Continuous Random Variables
An alternative is to use a parametric model.
In a parametric model, probabilities are defined by a small set of parameters.
The simplest example is a normal, or Gaussian, model.
A Gaussian probability density function (PDF) is defined by two parameters:
– its mean μ, and
– its variance σ²
Slide 7: Gaussian PDF
The 'standard' 1-dimensional Gaussian PDF has:
– mean μ = 0
– variance σ² = 1
Slide 8: Gaussian PDF
[Figure: Gaussian PDF with the area under the curve between points a and b shaded, representing P(a ≤ x ≤ b)]
Slide 9: Gaussian PDF
For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:

p(x | μ, σ²) = (1 / √(2πσ²)) exp( -(x - μ)² / (2σ²) )

The factor 1/√(2πσ²) is a constant that ensures the area under the curve is 1; the exponential term defines the 'bell' shape.
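A minimal Python sketch of this formula; the function name gaussian_pdf is my own choice, not from the slides:

```python
import math

def gaussian_pdf(x, mean, var):
    """1-D Gaussian density: normalising constant times the 'bell' term."""
    norm = 1.0 / math.sqrt(2.0 * math.pi * var)          # makes the area integrate to 1
    return norm * math.exp(-((x - mean) ** 2) / (2.0 * var))

# e.g. the 'standard' Gaussian of the earlier slide: mean 0, variance 1
print(gaussian_pdf(0.0, 0.0, 1.0))   # about 0.3989
```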
Slide 10: More examples
[Figure: further example Gaussian PDFs with parameter values 0.1, 1.0, 10.0 and 5.0; different variances give bells of different widths]
Slide 11: Fitting a Gaussian PDF to Data
Suppose y = y_1, …, y_n, …, y_T is a sequence of T data values.
Given a Gaussian PDF p with mean μ and variance σ², define:

p(y | μ, σ²) = ∏_{n=1}^{T} p(y_n | μ, σ²)

How do we choose μ and σ² to maximise this probability?
Slide 12: Fitting a Gaussian PDF to Data
[Figure: the same data shown with two candidate Gaussian PDFs, one a poor fit and one a good fit]
Slide 13: Maximum Likelihood Estimation
Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised.
Terminology:
– p(y | μ, σ²) as a function of y is the probability (density) of y
– p(y | μ, σ²) as a function of μ, σ² is the likelihood of μ, σ²
Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ².
Slide 14: ML estimation of μ, σ²
Intuitively:
– the maximum likelihood estimate of μ should be the average value of y_1, …, y_T (the sample mean)
– the maximum likelihood estimate of σ² should be the variance of y_1, …, y_T (the sample variance)
This turns out to be true: p(y | μ, σ²) is maximised by setting:

μ = (1/T) Σ_{t=1}^{T} y_t        σ² = (1/T) Σ_{t=1}^{T} (y_t - μ)²
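These closed-form estimates translate directly into code. A small sketch (the variable names are my own choice):

```python
def ml_gaussian_fit(y):
    """Maximum-likelihood mean and variance of a 1-D sample y_1..y_T."""
    T = len(y)
    mean = sum(y) / T                                    # sample mean
    var = sum((yt - mean) ** 2 for yt in y) / T          # sample variance (divide by T, not T-1)
    return mean, var

y = [1.2, 0.7, 1.9, 1.1, 0.4]
print(ml_gaussian_fit(y))
```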
Slide 15: Proof
First note that maximising p(y) is the same as maximising log(p(y)), and log p(y | μ, σ²) = Σ_{t=1}^{T} log p(y_t | μ, σ²).
At a maximum:

∂/∂μ log p(y | μ, σ²) = 0

Also

∂/∂μ log p(y_t | μ, σ²) = (y_t - μ) / σ²

So

Σ_{t=1}^{T} (y_t - μ) / σ² = 0,  which rearranges to  μ = (1/T) Σ_{t=1}^{T} y_t  (the sample mean).

A similar argument with respect to σ² gives the sample variance.
Slide 16: ML training for HMMs
Now consider:
– an N-state HMM M, each of whose states is associated with a Gaussian PDF
– a training sequence y_1, …, y_T
For simplicity, assume that each y_t is 1-dimensional.
Slide 17: ML training for HMMs
If we knew that:
– y_1, …, y_e(1) correspond to state 1
– y_e(1)+1, …, y_e(2) correspond to state 2
– …
– y_e(n-1)+1, …, y_e(n) correspond to state n
– …
then we could set the mean of state n to the average value of y_e(n-1)+1, …, y_e(n).
Slide 18: ML Training for HMMs
y_1, …, y_e(1), y_e(1)+1, …, y_e(2), y_e(2)+1, …, y_T
Unfortunately we don't know that y_e(n-1)+1, …, y_e(n) correspond to state n…
Slide 19: Solution
1. Define an initial HMM M_0.
2. Use the Viterbi algorithm to compute the optimal state sequence between M_0 and y_1, …, y_T.
[Figure: trellis aligning the states of M_0 with the observations y_1, y_2, y_3, …, y_t, …, y_T]
Slide 20: Solution (continued)
3. Use the optimal state sequence to segment y into y_1, …, y_e(1), y_e(1)+1, …, y_e(2), …, y_T.
4. Reestimate the parameters from these segments to get a new model M_1.
Slide 21: Solution (continued)
Now repeat the whole process using M_1 instead of M_0, to get a new model M_2.
Then repeat again using M_2 to get a new model M_3, and so on.
At each iteration the likelihood does not decrease:

p(y | M_0) ≤ p(y | M_1) ≤ p(y | M_2) ≤ … ≤ p(y | M_n) ≤ …
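A rough Python sketch of this iterative scheme, under a simplifying assumption: each frame is assigned to its individually most likely state rather than by a full Viterbi alignment over the HMM's transition structure, so this is a hard-assignment approximation of the loop on the slides, not a complete implementation:

```python
import math

def gaussian_logpdf(x, mean, var):
    """Log of the 1-D Gaussian density."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)

def hard_assignment_training(means, variances, y, num_iterations=10):
    """Hard-assignment approximation of the alignment/reestimation loop.

    Each frame y_t is assigned to its individually most likely state (a real
    implementation would use the Viterbi algorithm so that the transition
    structure constrains the state sequence), then each state's Gaussian is
    re-fitted from the frames assigned to it.
    """
    for _ in range(num_iterations):
        # 'Alignment' step: pick the most likely state for every frame
        assign = [max(range(len(means)),
                      key=lambda s: gaussian_logpdf(yt, means[s], variances[s]))
                  for yt in y]
        # Reestimation step: re-fit each state's mean and variance
        for s in range(len(means)):
            frames = [yt for yt, a in zip(y, assign) if a == s]
            if frames:
                means[s] = sum(frames) / len(frames)
                variances[s] = max(sum((f - means[s]) ** 2 for f in frames) / len(frames), 1e-3)
    return means, variances

# Toy usage: two states, data clustered around 0 and around 5
means, variances = hard_assignment_training([0.0, 1.0], [1.0, 1.0],
                                             [0.1, -0.2, 0.3, 4.8, 5.1, 5.3])
print(means, variances)
```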
Slide 22: Local optimization
[Figure: p(y | M) plotted as a function of the model M; the sequence M_0, M_1, …, M_n climbs towards a local optimum of the likelihood, which may not be the global optimum]
Slide 23: Baum-Welch optimization
The algorithm just described is often called Viterbi training or Viterbi reestimation.
It is often used to train large sets of HMMs.
An alternative method is called Baum-Welch reestimation.
Slide 24: Baum-Welch Reestimation
[Figure: trellis of states against the observations y_1, …, y_t, y_t+1, …, y_T]
Rather than assigning each observation to a single state, Baum-Welch uses the probability of occupying state s_i when y_t is observed:

P(s_i | y_t) = γ_t(i)
Slide 25: 'Forward' Probabilities
[Figure: trellis of states against the observations y_1, …, y_t, y_t+1, …, y_T]

α_t(i) = p(y_1, …, y_t, state s_i at time t | M)

the probability of the observations up to time t and of being in state s_i at time t.
Slide 26: 'Backward' Probabilities
[Figure: trellis of states against the observations y_1, …, y_t, y_t+1, …, y_T]

β_t(i) = p(y_t+1, …, y_T | state s_i at time t, M)

the probability of the remaining observations, given state s_i at time t.
Slide 27: 'Forward-Backward' Algorithm
[Figure: trellis of states against the observations y_1, …, y_t, y_t+1, …, y_T]
Combining the two gives the state occupation probability:

γ_t(i) = α_t(i) β_t(i) / p(y | M),   where   p(y | M) = Σ_i α_T(i)

These probabilities weight the observations when the state means and variances are reestimated.
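A minimal, unscaled Python sketch of the forward, backward and combined computations, assuming the state output densities B[t][i] = p(y_t | state i) have already been evaluated; a real implementation would scale the recursions or work in the log domain to avoid underflow:

```python
def forward_backward(pi, A, B):
    """Forward-backward sketch for an N-state HMM.

    pi : initial state probabilities, length N
    A  : transition probabilities, A[i][j] = P(state j at t+1 | state i at t)
    B  : B[t][i] = p(y_t | state i), already evaluated for the observed sequence
    Returns gamma[t][i] = P(state i at time t | y) and p(y | M).
    """
    T, N = len(B), len(pi)

    # Forward pass: alpha[t][i] = p(y_1..y_t, state i at t | M)
    alpha = [[0.0] * N for _ in range(T)]
    for i in range(N):
        alpha[0][i] = pi[i] * B[0][i]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[t][j] * sum(alpha[t - 1][i] * A[i][j] for i in range(N))

    # Backward pass: beta[t][i] = p(y_{t+1}..y_T | state i at t, M)
    beta = [[1.0] * N for _ in range(T)]
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[t + 1][j] * beta[t + 1][j] for j in range(N))

    p_y = sum(alpha[T - 1][i] for i in range(N))          # p(y | M)
    gamma = [[alpha[t][i] * beta[t][i] / p_y for i in range(N)] for t in range(T)]
    return gamma, p_y

# Toy usage: 2 states, 3 frames, output densities invented for illustration
pi = [0.7, 0.3]
A = [[0.8, 0.2], [0.1, 0.9]]
B = [[0.5, 0.1], [0.4, 0.2], [0.1, 0.6]]
gamma, p_y = forward_backward(pi, A, B)
print(p_y)
print(gamma[0])   # occupation probabilities of the two states at t = 1
```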
Slide 28: Adaptation
A modern large-vocabulary continuous speech recognition system has many thousands of parameters.
Many hours of speech data are used to train the system (e.g. 200+ hours!).
The speech data comes from many speakers, hence the recogniser is 'speaker independent'.
But performance for an individual would be better if the system were speaker dependent.
Slide 29: Adaptation
For a single speaker, only a small amount of training data is available.
Viterbi reestimation or Baum-Welch reestimation will not work with so little data.
Slide 30: 'Parameters vs training data'
[Figure: recognition performance plotted against the number of parameters, one curve for a larger training set and one for a smaller training set]
Slide 31: Adaptation
Two common approaches to adaptation (with small amounts of training data):
– Bayesian adaptation, also known as MAP adaptation (MAP = Maximum A Posteriori)
– Transform-based adaptation, also known as MLLR (MLLR = Maximum Likelihood Linear Regression)
Slide 32: Bayesian adaptation
Uses a well-trained, 'speaker-independent' HMM as a prior for the estimate of the parameters of the speaker-dependent HMM.
[Figure: example showing a speaker-independent state PDF, the sample mean of the adaptation data, and the MAP estimate of the mean lying between the two]
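A small sketch of MAP adaptation of a single state mean, assuming a hard assignment of the adaptation frames to the state and a prior weight tau that is not specified on the slides:

```python
def map_adapt_mean(mu_si, adaptation_frames, tau=10.0):
    """MAP estimate of a state mean under simple assumptions.

    mu_si : speaker-independent mean (acts as the prior mean)
    adaptation_frames : frames assigned to this state for the new speaker
    tau : prior weight (a tuning constant, chosen here for illustration)

    The MAP mean interpolates between the prior mean and the sample mean,
    leaning on the prior when there is little adaptation data.
    """
    n = len(adaptation_frames)
    if n == 0:
        return mu_si                       # no data: keep the speaker-independent mean
    sample_mean = sum(adaptation_frames) / n
    return (tau * mu_si + n * sample_mean) / (tau + n)

print(map_adapt_mean(0.0, [1.0, 1.2, 0.8]))   # pulled only part-way towards the sample mean
```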
Slide 33: Transform-based adaptation
Each acoustic vector is typically 40-dimensional.
So a linear transform of the acoustic data needs 40 x 40 = 1600 parameters.
This is much less than the tens or hundreds of thousands of parameters needed to train the whole system.
Estimate a linear transform that maps the speaker-independent parameters into speaker-dependent parameters.
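A sketch of how one shared transform adapts every state mean. Estimating A and b by maximum likelihood (the 'ML' in MLLR) is not shown here, and all array shapes and values are invented for illustration:

```python
import numpy as np

def apply_mllr_transform(means, A, b):
    """Apply a single (global) linear transform to every state mean.

    means : array of shape (num_states, d) of speaker-independent means
    A, b  : d x d matrix and d-vector; in MLLR these are chosen to maximise
            the likelihood of the adaptation data, which is not shown here.
    """
    return means @ A.T + b                # adapted mean for each state: A @ mu + b

d = 40                                    # typical acoustic vector dimension from the slide
rng = np.random.default_rng(0)
means = rng.normal(size=(500, d))         # toy stand-in for 500 state means
A = np.eye(d) + 0.01 * rng.normal(size=(d, d))
b = 0.1 * rng.normal(size=d)
adapted = apply_mllr_transform(means, A, b)
print(adapted.shape)                      # (500, 40)
```

Because every state shares the same A and b, a useful adaptation can be estimated from far less data than would be needed to re-train each state's parameters separately.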
Slide 34: Transform-based adaptation
[Figure: speaker-independent parameters mapped by a 'best fit' transform onto the speaker-dependent data points, giving the adapted parameters]
Slide 35: Summary
– Maximum Likelihood (ML) estimation
– Viterbi HMM parameter estimation
– Baum-Welch HMM parameter estimation
– Forward and backward probabilities
– Adaptation: Bayesian adaptation, transform-based adaptation