
Speech Technology Lab. EEM4R Spoken Language Processing - Introduction: Training HMMs. Version 4: February 2005.


Slide 1: EEM4R Spoken Language Processing - Introduction: Training HMMs (Speech Technology Lab, version 4, February 2005)

Slide 2: Outline
- Parameter estimation
- Maximum Likelihood (ML) parameter estimation
- ML for Gaussian PDFs
- ML for HMMs: the Baum-Welch algorithm
- HMM adaptation: MAP estimation and MLLR

Slide 3: Discrete variables
- Suppose that Y is a random variable which can take any value in a discrete set X = {x_1, x_2, ..., x_M}.
- Suppose that y_1, y_2, ..., y_N are samples of the random variable Y.
- If c_m is the number of times that y_n = x_m, then an estimate of the probability that a sample takes the value x_m is given by:
    P(x_m) ≈ c_m / N

Slide 4: Discrete Probability Mass Function

Symbol             1    2    3   4   5   6    7    8   9   Total
Num. Occurrences  120  231  90  87  63  57  156  203  91   1098
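To make the counting estimate concrete, here is a minimal Python sketch (not from the slides) that computes P(x_m) ≈ c_m / N from the occurrence counts in the table above.

```python
# Estimate a discrete probability mass function from occurrence counts.
counts = {1: 120, 2: 231, 3: 90, 4: 87, 5: 63, 6: 57, 7: 156, 8: 203, 9: 91}

N = sum(counts.values())                            # total number of samples (1098)
pmf = {symbol: c / N for symbol, c in counts.items()}

for symbol, p in pmf.items():
    print(f"P({symbol}) = {p:.3f}")
print("Total probability:", sum(pmf.values()))      # should come to 1.0
```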

Slide 5: Continuous Random Variables
- In most practical applications the data are not restricted to a finite set of values: they can take any value in N-dimensional space.
- Simply counting the number of occurrences of each value is no longer a viable way of estimating probabilities...
- ...but there are generalisations of this approach which are applicable to continuous variables; these are referred to as non-parametric methods.

Slide 6: Continuous Random Variables
- An alternative is to use a parametric model.
- In a parametric model, probabilities are defined by a small set of parameters.
- The simplest example is a normal, or Gaussian, model.
- A Gaussian probability density function (PDF) is defined by two parameters: its mean μ and its variance σ².

Slide 7: Gaussian PDF
- The 'standard' 1-dimensional Gaussian PDF has mean μ = 0 and variance σ² = 1.

Slide 8: Gaussian PDF
[Figure: the area under the PDF between a and b gives P(a ≤ x ≤ b).]

Slide 9: Gaussian PDF
- For a 1-dimensional Gaussian PDF p with mean μ and variance σ²:
    p(y | μ, σ²) = (1 / √(2πσ²)) · exp( -(y - μ)² / (2σ²) )
- The factor 1 / √(2πσ²) is a constant that ensures the area under the curve is 1; the exponential term defines the 'bell' shape.
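As a quick illustration (my own sketch, not from the slides), the 1-D Gaussian PDF above can be written directly in Python; summing it over a fine grid confirms that the normalising constant makes the area come to 1.

```python
import numpy as np

def gaussian_pdf(y, mu, var):
    """1-D Gaussian density with mean mu and variance var."""
    return np.exp(-(y - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

# 'Standard' Gaussian: mu = 0, var = 1 (slide 7)
y = np.linspace(-10.0, 10.0, 20001)
area = gaussian_pdf(y, mu=0.0, var=1.0).sum() * (y[1] - y[0])   # simple Riemann sum
print(f"area under the curve ~ {area:.6f}")                      # ~ 1.0
```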

Slide 10: More examples
[Figure: example Gaussian PDFs with variances σ² = 0.1, 1.0, 5.0 and 10.0.]

Slide 11: Fitting a Gaussian PDF to Data
- Suppose y = y_1, ..., y_t, ..., y_T is a sequence of T data values.
- Given a Gaussian PDF p with mean μ and variance σ², define:
    p(y | μ, σ²) = ∏_{t=1}^{T} p(y_t | μ, σ²)
- How do we choose μ and σ² to maximise this probability?

Slide 12: Fitting a Gaussian PDF to Data
[Figure: two Gaussians overlaid on the same data; one is a poor fit, the other a good fit.]

Slide 13: Maximum Likelihood Estimation
- Define the best-fitting Gaussian to be the one such that p(y | μ, σ²) is maximised.
- Terminology:
  - p(y | μ, σ²) as a function of y is the probability (density) of y.
  - p(y | μ, σ²) as a function of μ, σ² is the likelihood of μ, σ².
- Maximising p(y | μ, σ²) with respect to μ, σ² is called Maximum Likelihood (ML) estimation of μ, σ².

Slide 14: ML estimation of μ, σ²
- Intuitively:
  - The maximum likelihood estimate of μ should be the average value of y_1, ..., y_T (the sample mean).
  - The maximum likelihood estimate of σ² should be the variance of y_1, ..., y_T (the sample variance).
- This turns out to be true: p(y | μ, σ²) is maximised by setting:
    μ̂ = (1/T) Σ_t y_t
    σ̂² = (1/T) Σ_t (y_t - μ̂)²
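A minimal sketch (with made-up synthetic data, not from the slides) of the ML estimates, checking that the log-likelihood at the ML fit beats that of a perturbed fit:

```python
import numpy as np

def log_likelihood(y, mu, var):
    """Sum over t of log p(y_t | mu, var) for a 1-D Gaussian."""
    return np.sum(-0.5 * np.log(2.0 * np.pi * var) - (y - mu) ** 2 / (2.0 * var))

rng = np.random.default_rng(0)
y = rng.normal(loc=3.0, scale=2.0, size=500)   # synthetic data with true mu = 3, sigma = 2

mu_ml = y.mean()                               # (1/T) * sum_t y_t
var_ml = y.var()                               # (1/T) * sum_t (y_t - mu_ml)^2  (ML, not unbiased)

print(f"ML estimates: mu = {mu_ml:.3f}, var = {var_ml:.3f}")
print("log-likelihood at the ML fit:  ", log_likelihood(y, mu_ml, var_ml))
print("log-likelihood at a worse fit: ", log_likelihood(y, mu_ml + 1.0, 2.0 * var_ml))
```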

Slide 15: Proof
- First note that maximising p(y) is the same as maximising log p(y), since log is monotonic.
- At a maximum, the derivative with respect to μ is zero:
    ∂/∂μ log p(y | μ, σ²) = 0
- Also,
    log p(y | μ, σ²) = Σ_t log p(y_t | μ, σ²) = -(T/2) log(2πσ²) - Σ_t (y_t - μ)² / (2σ²)
  so
    ∂/∂μ log p(y | μ, σ²) = Σ_t (y_t - μ) / σ²
- Setting this to zero gives Σ_t y_t = T μ, i.e. μ̂ = (1/T) Σ_t y_t.

Slide 16: ML training for HMMs
- Now consider:
  - an N-state HMM M, each of whose states is associated with a Gaussian PDF, and
  - a training sequence y_1, ..., y_T.
- For simplicity, assume that each y_t is 1-dimensional.

Slide 17: ML training for HMMs
- If we knew that:
  - y_1, ..., y_e(1) correspond to state 1,
  - y_e(1)+1, ..., y_e(2) correspond to state 2,
  - ...
  - y_e(n-1)+1, ..., y_e(n) correspond to state n,
  - ...
- then we could set the mean of state n to the average value of y_e(n-1)+1, ..., y_e(n) (as in the sketch below).
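If the segment boundaries e(1), e(2), ... were known, the state parameters would just be per-segment averages. A minimal Python sketch with invented observations and boundaries:

```python
import numpy as np

y = np.array([1.1, 0.9, 1.0, 4.8, 5.2, 5.0, 5.1, 9.0, 8.9, 9.2])   # made-up observations
e = [3, 7, 10]   # hypothetical segment ends: state 1 gets y[0:3], state 2 y[3:7], state 3 y[7:10]

start = 0
for n, end in enumerate(e, start=1):
    segment = y[start:end]
    print(f"state {n}: mean = {segment.mean():.2f}, var = {segment.var():.2f}")
    start = end
```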

Slide 18: ML Training for HMMs
- The training sequence is partitioned as y_1, ..., y_e(1), y_e(1)+1, ..., y_e(2), y_e(2)+1, ..., y_T.
- Unfortunately we don't know that y_e(n-1)+1, ..., y_e(n) correspond to state n...

Slide 19: Solution
1. Define an initial HMM, M_0.
2. Use the Viterbi algorithm to compute the optimal state sequence aligning M_0 with y_1, ..., y_T.
[Figure: trellis of observations y_1, y_2, y_3, ..., y_t, ..., y_T with the optimal state sequence through it.]

Slide 20: Solution (continued)
- Use the optimal state sequence to segment y.
[Figure: the observations y_1, ..., y_e(1), y_e(1)+1, ..., y_e(2), ..., y_T segmented into per-state blocks.]
- Re-estimate the parameters to get a new model M_1.

Slide 21: Solution (continued)
- Now repeat the whole process using M_1 instead of M_0 to get a new model M_2.
- Then repeat again using M_2 to get a new model M_3, and so on.
- This gives a sequence of models with
    p(y | M_0) ≤ p(y | M_1) ≤ p(y | M_2) ≤ ... ≤ p(y | M_n) ≤ ...
- A minimal sketch of this loop is given below.
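The iteration on slides 19-21 can be summarised as a simple loop. Note that `viterbi_align` and `reestimate` below are hypothetical helpers standing in for the Viterbi alignment step and the per-state ML re-estimation described above; this is a sketch of the control flow, not a full implementation.

```python
def viterbi_train(model, y, n_iters=10):
    """Iteratively re-align and re-estimate until the alignment score stops improving.

    Assumed (hypothetical) helpers:
      viterbi_align(model, y) -> (state_sequence, score)   # slide 19
      reestimate(model, y, state_sequence) -> new model     # slide 20
    """
    prev_score = float("-inf")
    for _ in range(n_iters):
        states, score = viterbi_align(model, y)   # optimal state sequence for the current model
        model = reestimate(model, y, states)      # per-state ML estimates from that segmentation
        if score <= prev_score:                   # no further improvement: a local optimum
            break
        prev_score = score
    return model
```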

Slide 22: Local optimization
[Figure: p(y | M) plotted against the model M; the sequence M_0, M_1, ..., M_n climbs to a local optimum, which need not be the global optimum.]

Slide 23: Baum-Welch optimization
- The algorithm just described is often called Viterbi training or Viterbi reestimation.
- It is often used to train large sets of HMMs.
- An alternative method is called Baum-Welch reestimation.

Slide 24: Baum-Welch Reestimation
- Rather than assigning each frame to a single state, Baum-Welch weights each frame y_t by the state occupation probability P(s_i | y_t) = γ_t(i).
[Figure: observations y_1, y_2, ..., y_t, y_t+1, ..., y_T, each associated with every state via γ_t(i).]

Slide 25: 'Forward' Probabilities
[Figure: the forward probability α_t(i) = p(y_1, ..., y_t, state i at time t | M), accumulated left to right over y_1, ..., y_T.]

Slide 26: 'Backward' Probabilities
[Figure: the backward probability β_t(i) = p(y_t+1, ..., y_T | state i at time t, M), accumulated right to left.]

Slide 27: 'Forward-Backward' Algorithm
[Figure: combining the forward and backward probabilities at time t gives the state occupation probability γ_t(i) ∝ α_t(i) β_t(i), which is used to weight each frame in the re-estimation formulas.]
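A compact sketch of the forward, backward and occupation probabilities for a 2-state HMM with one Gaussian output per state. This is my own illustration with made-up parameters; a real implementation would scale the recursions or work in log space to avoid underflow on long sequences.

```python
import numpy as np

def gaussian_pdf(y, mu, var):
    return np.exp(-(y - mu) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

def forward_backward(y, pi, A, means, variances):
    """Return state occupation probabilities gamma[t, i] for a small Gaussian HMM."""
    T, N = len(y), len(pi)
    b = np.array([[gaussian_pdf(y[t], means[i], variances[i]) for i in range(N)]
                  for t in range(T)])                       # b[t, i] = p(y_t | state i)

    alpha = np.zeros((T, N))                                # forward: p(y_1..y_t, state i)
    alpha[0] = pi * b[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * b[t]

    beta = np.zeros((T, N))                                 # backward: p(y_t+1..y_T | state i)
    beta[-1] = 1.0
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (b[t + 1] * beta[t + 1])

    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)         # normalise each frame's weights

# Toy 2-state example with invented parameters
y = np.array([0.2, -0.1, 0.1, 4.9, 5.2, 5.1])
pi = np.array([0.5, 0.5])
A = np.array([[0.8, 0.2],
              [0.2, 0.8]])
gamma = forward_backward(y, pi, A, means=np.array([0.0, 5.0]), variances=np.array([1.0, 1.0]))
print(np.round(gamma, 3))    # soft assignment of each frame to the two states
```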

Slide 28: Adaptation
- A modern large-vocabulary continuous speech recognition system has many thousands of parameters.
- Many hours of speech data are used to train the system (e.g. 200+ hours).
- The speech data come from many speakers.
- Hence the recogniser is 'speaker independent'.
- But performance for an individual would be better if the system were speaker dependent.

Slide 29: Adaptation
- For a single speaker, only a small amount of training data is available.
- With so little data, Viterbi reestimation or Baum-Welch reestimation of all the parameters will not work.

Slide 30: 'Parameters vs training data'
[Figure: performance plotted against the number of parameters, with one curve for a larger training set and one for a smaller training set.]

Slide 31: Adaptation
- Two common approaches to adaptation with small amounts of training data:
  - Bayesian adaptation, also known as MAP adaptation (MAP = Maximum a Posteriori).
  - Transform-based adaptation, also known as MLLR (Maximum Likelihood Linear Regression).

Slide 32: Bayesian adaptation
- Uses the well-trained, 'speaker-independent' HMM as a prior for the estimate of the parameters of the speaker-dependent HMM.
[Figure: the MAP estimate of a state mean lies between the mean of the speaker-independent state PDF (the prior) and the sample mean of the adaptation data.]
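A common form of the MAP update for a state mean interpolates between the speaker-independent mean and the sample mean of the adaptation data, weighted by a prior count τ. The sketch below assumes that standard interpolation formula (it is not taken from these slides); τ and the data are illustrative.

```python
import numpy as np

def map_adapt_mean(mu_si, adaptation_frames, tau=10.0):
    """MAP estimate of a state mean: the speaker-independent mean mu_si acts as a
    prior worth tau 'virtual' frames, combined with n frames of adaptation data."""
    n = len(adaptation_frames)
    sample_mean = np.mean(adaptation_frames)
    return (tau * mu_si + n * sample_mean) / (tau + n)

mu_si = 0.0
print(map_adapt_mean(mu_si, np.array([1.2, 0.8, 1.1])))   # little data: stays near the prior
print(map_adapt_mean(mu_si, np.full(200, 1.0)))           # much data: moves to the sample mean
```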

Slide 33: Transform-based adaptation
- Each acoustic vector is typically 40-dimensional.
- So a linear transform of the acoustic data needs 40 x 40 = 1600 parameters.
- This is much less than the tens or hundreds of thousands of parameters needed to train the whole system.
- Estimate a linear transform that maps the speaker-independent parameters into speaker-dependent parameters.
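The idea on this slide can be sketched as a least-squares fit: given adaptation frames aligned to states, estimate A and b so that A·μ_SI + b best predicts the data assigned to each state. This is only a simplified illustration of the transform-based (MLLR-style) idea under an assumed alignment, not the full maximum-likelihood estimation; all names and data below are invented.

```python
import numpy as np

def estimate_transform(si_means, frames, frame_to_state):
    """Least-squares estimate of A, b such that frame y_t ~ A @ mu_{s_t} + b."""
    D = si_means.shape[1]
    # Extended means [mu; 1] for each frame, so the offset b is estimated jointly.
    X = np.hstack([si_means[frame_to_state], np.ones((len(frames), 1))])
    # With few distinct states this system is under-determined; lstsq returns the
    # minimum-norm solution, which is fine for illustration.
    W, *_ = np.linalg.lstsq(X, frames, rcond=None)
    A, b = W[:D].T, W[D]
    return A, b

# Toy 2-D example: speaker-independent means and a few aligned adaptation frames
si_means = np.array([[0.0, 0.0], [5.0, 5.0]])
frame_to_state = np.array([0, 0, 1, 1, 1])
frames = np.array([[0.5, 0.6], [0.4, 0.7], [5.6, 5.4], [5.5, 5.6], [5.7, 5.3]])

A, b = estimate_transform(si_means, frames, frame_to_state)
adapted_means = si_means @ A.T + b        # speaker-adapted means
print(np.round(adapted_means, 2))
```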

Slide 34: Transform-based adaptation
[Figure: a 'best fit' transform maps the speaker-independent parameters onto the speaker-dependent data points to give the adapted parameters.]

Slide 35: Summary
- Maximum Likelihood (ML) estimation
- Viterbi HMM parameter estimation
- Baum-Welch HMM parameter estimation
- Forward and backward probabilities
- Adaptation: Bayesian adaptation and transform-based adaptation

