1
Statistical Models for Automatic Speech Recognition
Lukáš Burget
2
Feature extraction
Preprocessing of the speech signal to satisfy the needs of the following recognition process (dimensionality reduction, preserving only the "important" information, decorrelation). Popular features are MFCCs: modifications based on psycho-acoustic findings applied to short-time spectra. For convenience, we will use one-dimensional features in most of our examples (e.g. short-time energy).
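Since the examples use one-dimensional short-time energy, here is a minimal sketch of such a feature extractor in Python; the 25 ms / 10 ms framing values are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def short_time_log_energy(signal, sample_rate, frame_ms=25, shift_ms=10):
    """One-dimensional feature: log energy of each analysis frame.
    The 25 ms / 10 ms framing values are illustrative defaults, not from the slides."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // shift)
    energies = np.empty(n_frames)
    for i in range(n_frames):
        frame = signal[i * shift: i * shift + frame_len]
        energies[i] = np.log(np.sum(frame ** 2) + 1e-10)   # floor avoids log(0) in silence
    return energies
```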
3
Classifying single speech frame
unvoiced voiced
4
Classifying single speech frame
unvoiced voiced
Mathematically, we ask the following question: P(voiced|x) > P(unvoiced|x)? But the value we read from the probability distribution is p(x|class). According to Bayes rule, the above can be rewritten as: p(x|voiced) P(voiced) > p(x|unvoiced) P(unvoiced), since the common denominator p(x) cancels.
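A minimal sketch of this decision rule, assuming each class is modeled by a single 1-D Gaussian over the frame feature; all means, variances, and priors below are made-up illustrative values.

```python
import numpy as np

def gauss(x, mu, var):
    """Likelihood of a 1-D Gaussian, N(x; mu, var)."""
    return np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Made-up class models (mean, variance) and priors for one scalar feature
models = {"voiced": (2.0, 0.5), "unvoiced": (-1.0, 1.0)}
priors = {"voiced": 0.6, "unvoiced": 0.4}

x = 1.3   # feature value of one speech frame
# Bayes rule comparison: decide "voiced" if p(x|voiced)P(voiced) > p(x|unvoiced)P(unvoiced)
scores = {c: gauss(x, *models[c]) * priors[c] for c in models}
print(max(scores, key=scores.get), scores)
```

With more classes (adding silence on the next slide), the same comparison becomes an argmax over the per-class p(x|class) P(class) scores.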
5
Multi-class classification
The class that is correct with the highest probability is given by:
ω* = argmax_ω P(ω|x) = argmax_ω p(x|ω) P(ω)
silence unvoiced voiced
But we do not know the true distribution, …
6
Estimation of parameters
… we only see some training examples. unvoiced voiced silence
7
Estimation of parameters
… we only see some training examples. Let’s decide for some parametric model (e.g. Gaussian distribution) and estimate its parameters from the data. unvoiced voiced silence
8
Maximum Likelihood Estimation
In the next part, we will use ML estimation of model parameters:
Θ̂_class^(ML) = argmax_Θ Π_{∀x_i ∈ class} p(x_i|Θ)
This allows us to estimate the parameters Θ of each class individually, given the data for that class. Therefore, for convenience, we can omit the class identities in the following equations. The models we are going to examine are:
Single Gaussian
Gaussian Mixture Model (GMM)
Hidden Markov Model
We want to solve three fundamental problems:
Evaluation of the model (computing the likelihood of features given the model)
Training the model (finding ML estimates of its parameters)
Finding the most likely values of hidden variables
9
Gaussian distribution (1 dimension)
Evaluation:
N(x; μ, σ²) = (1/√(2πσ²)) exp(−(x−μ)² / (2σ²))
ML estimates of parameters (Training):
μ̂ = (1/T) Σ_t x(t)
σ̂² = (1/T) Σ_t (x(t) − μ̂)²
No hidden variables.
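A short sketch of both operations for the 1-D Gaussian (evaluation in the log domain plus the ML estimates above); the training data here are synthetic.

```python
import numpy as np

def ml_gaussian(x):
    """ML estimates for a single 1-D Gaussian: sample mean and (biased) sample variance."""
    mu = np.mean(x)                  # mu_hat = (1/T) sum_t x(t)
    var = np.mean((x - mu) ** 2)     # sigma^2_hat = (1/T) sum_t (x(t) - mu_hat)^2
    return mu, var

def log_gaussian(x, mu, var):
    """Evaluation in the log domain, which avoids underflow on long sequences."""
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

x_train = np.random.randn(1000) * 2.0 + 3.0   # synthetic training data for one class
mu, var = ml_gaussian(x_train)
print(mu, var, log_gaussian(0.5, mu, var))
```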
10
Gaussian distribution (2 dimensions)
N(x; μ, Σ) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (x−μ)ᵀ Σ⁻¹ (x−μ)), where D is the feature dimensionality (here D = 2).
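A sketch of the corresponding multivariate evaluation, computing log N(x; μ, Σ) with numpy; the 2-D mean and covariance below are made-up values.

```python
import numpy as np

def log_mvn(x, mu, cov):
    """Log-density of a multivariate Gaussian N(x; mu, Sigma)."""
    d = len(mu)
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    maha = diff @ np.linalg.solve(cov, diff)       # (x - mu)^T Sigma^-1 (x - mu)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

mu = np.array([0.0, 1.0])
cov = np.array([[1.0, 0.3],                        # full covariance, made-up values
                [0.3, 2.0]])
print(log_mvn(np.array([0.5, 0.5]), mu, cov))
```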
11
Gaussian Mixture Model
Evaluation:
p(x|Θ) = Σ_c P_c N(x; μ_c, σ_c²), where Σ_c P_c = 1
12
Gaussian Mixture Model
Evaluation:
p(x|Θ) = Σ_c P_c N(x; μ_c, σ_c²)
We can see the sum above just as a function defining the shape of the probability density function, or we can see it as a more complicated generative probabilistic model, from which features are generated as follows:
One of the Gaussian components is first randomly selected according to the prior probabilities P_c.
A feature vector is then generated from the selected Gaussian distribution.
For the evaluation, however, we do not know which component generated the input vector (the identity of the component is a hidden variable). Therefore, we marginalize – we sum over all the components, weighting them by their prior probabilities.
Why do we want to complicate our lives with this concept?
It allows us to apply the EM algorithm for GMM training.
We will need this concept for HMMs.
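A minimal sketch of both views for a 1-D GMM: evaluation by marginalizing over the hidden component (log-sum-exp), and generation by first sampling a component according to P_c; the weights, means, and variances are illustrative values.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Marginalization over the hidden component:
    log p(x|Theta) = log sum_c P_c N(x; mu_c, sigma_c^2), computed with log-sum-exp."""
    log_comp = (np.log(weights)
                - 0.5 * np.log(2 * np.pi * variances)
                - (x - means) ** 2 / (2 * variances))
    m = np.max(log_comp)
    return m + np.log(np.sum(np.exp(log_comp - m)))

def gmm_sample(weights, means, variances, rng):
    """Generative view: first pick a component by its prior P_c, then sample from it."""
    c = rng.choice(len(weights), p=weights)
    return rng.normal(means[c], np.sqrt(variances[c]))

weights = np.array([0.3, 0.7])                     # priors P_c, must sum to 1
means = np.array([-2.0, 1.0])
variances = np.array([1.0, 0.5])
rng = np.random.default_rng(0)
print(gmm_log_likelihood(0.2, weights, means, variances),
      gmm_sample(weights, means, variances, rng))
```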
13
Training GMM – Viterbi training
An intuitive and approximate iterative algorithm for training GMM parameters:
Using the current model parameters, let the Gaussians classify the data as if the Gaussians were different classes (even though both the data and all the components correspond to one class modeled by the GMM).
Re-estimate the parameters of each Gaussian using the data associated with it in the previous step.
Repeat the previous two steps until the algorithm converges.
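A sketch of this hard-assignment training loop for a 1-D GMM; the variance floor and the handling of components that win no frames are added safeguards, not part of the slides.

```python
import numpy as np

def viterbi_train_gmm(x, weights, means, variances, n_iter=10):
    """Hard-assignment ("Viterbi") GMM training sketch for 1-D data.
    x: feature array of shape (T,); the other arguments are numpy float arrays."""
    weights, means, variances = weights.copy(), means.copy(), variances.copy()
    for _ in range(n_iter):
        # Step 1: classify every frame by its best-scoring Gaussian component
        log_comp = (np.log(weights)[:, None]
                    - 0.5 * np.log(2 * np.pi * variances)[:, None]
                    - (x[None, :] - means[:, None]) ** 2 / (2 * variances[:, None]))
        assign = np.argmax(log_comp, axis=0)
        # Step 2: re-estimate each component from the frames assigned to it
        for c in range(len(weights)):
            xc = x[assign == c]
            if len(xc) == 0:
                continue                         # skip components that won no frames
            weights[c] = len(xc) / len(x)
            means[c] = xc.mean()
            variances[c] = ((xc - means[c]) ** 2).mean() + 1e-6   # small variance floor
        weights = weights / weights.sum()
    return weights, means, variances
```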
14
Training GMM – EM algorithm
Expectation Maximization is a very general tool applicable in many cases where we deal with unobserved (hidden) data. Here, we only show the result of its application to the problem of re-estimating GMM parameters. It is guaranteed to increase the likelihood of the training data in every iteration; however, it is not guaranteed to find the global optimum. The algorithm is very similar to the Viterbi training presented above: instead of hard decisions, it uses "soft" posterior probabilities of the Gaussians (given the old model) as weights, and weighted averages are used to compute the new mean and variance estimates.
γ_c(t) = P_c N(x(t); μ̂_c^(old), σ̂_c^(2,old)) / Σ_{c'} P_{c'} N(x(t); μ̂_{c'}^(old), σ̂_{c'}^(2,old))
μ̂_c^(new) = Σ_{t=1}^T γ_c(t) x(t) / Σ_{t=1}^T γ_c(t)
σ̂_c^(2,new) = Σ_{t=1}^T γ_c(t) (x(t) − μ̂_c^(new))² / Σ_{t=1}^T γ_c(t)
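One EM iteration for a 1-D GMM following the formulas above, written as a numpy sketch; the variance floor is an added safeguard, not part of the slides.

```python
import numpy as np

def em_step_gmm(x, weights, means, variances):
    """One EM iteration for a 1-D GMM: soft posteriors gamma_c(t) replace the hard
    assignments of Viterbi training, and weighted averages give the new parameters."""
    # E-step: gamma_c(t) proportional to P_c * N(x(t); mu_c, var_c), normalized over c
    log_comp = (np.log(weights)[:, None]
                - 0.5 * np.log(2 * np.pi * variances)[:, None]
                - (x[None, :] - means[:, None]) ** 2 / (2 * variances[:, None]))
    log_px = np.logaddexp.reduce(log_comp, axis=0)     # log p(x(t)|Theta) per frame
    gamma = np.exp(log_comp - log_px)                  # shape (components, frames)
    # M-step: weighted re-estimation of weights, means, and variances
    occ = gamma.sum(axis=1)                            # soft counts per component
    new_weights = occ / len(x)
    new_means = gamma @ x / occ
    new_vars = (gamma * (x[None, :] - new_means[:, None]) ** 2).sum(axis=1) / occ + 1e-6
    return new_weights, new_means, new_vars, log_px.sum()   # last value: data log-likelihood
```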
15
Classifying stationary sequence
unvoiced voiced silence
Frame independence assumption:
P(X|class) = Π_{∀x_i ∈ X} p(x_i|class)
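A minimal sketch of classification under this assumption: the sequence log-likelihood of each class is the sum of per-frame log-likelihoods; the single-Gaussian class models and priors are made-up values.

```python
import numpy as np

def classify_sequence(X, class_models, class_priors):
    """log P(X|class) = sum over frames of log p(x_i|class) under frame independence;
    the winning class maximizes this plus the log prior."""
    def log_gauss(x, mu, var):
        return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)
    scores = {name: np.sum(log_gauss(X, mu, var)) + np.log(class_priors[name])
              for name, (mu, var) in class_models.items()}
    return max(scores, key=scores.get), scores

# Made-up single-Gaussian class models over a 1-D feature
models = {"voiced": (2.0, 0.5), "unvoiced": (-1.0, 1.0), "silence": (-3.0, 0.3)}
priors = {"voiced": 0.4, "unvoiced": 0.3, "silence": 0.3}
print(classify_sequence(np.array([1.8, 2.1, 1.5, 2.4]), models, priors))
```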
16
Modeling more general sequences: Hidden Markov Models
b1(x) b2(x) b3(x)
Generative model: for each frame, the model moves from one state to another according to a transition probability aij and generates a feature vector from the probability distribution bj(.) associated with the state that was entered. To evaluate such a model, we do not see which path through the states was taken. Let's start with evaluating the HMM for a particular state sequence.
17
a11 a22 a33 a12 a23 a34
b1(x) b2(x) b3(x)
P(X,S|Θ) = b1(x1) a11 b1(x2) a12 b2(x3) a23 b3(x4) a33 b3(x5)
18
Evaluating HMM for a particular state sequence
P(X,S|Θ) = b1(x1) a11 b1(x2) a12 b2(x3) a23 b3(x4) a33 b3(x5)
19
Evaluating HMM for a particular state sequence
The joint likelihood of the observed sequence X and the state sequence S
P(X,S|Θ) = b1(x1) a11 b1(x2) a12 b2(x3) a23 b3(x4) a33 b3(x5)
can be decomposed as follows: P(X,S|Θ) = P(X|S,Θ) P(S|Θ), where
P(S|Θ) = a11 a12 a23 a33 is the prior probability of the hidden variable – the state sequence S. For a GMM, the corresponding term was Pc.
P(X|S,Θ) = b1(x1) b1(x2) b2(x3) b3(x4) b3(x5) is the likelihood of the observed sequence X, given the state sequence S. For a GMM, the corresponding term was N(x; μc, σc²).
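A sketch of evaluating log P(X,S|Θ) for one fixed state sequence, following the decomposition above; the toy 3-state left-to-right HMM uses made-up transition probabilities and Gaussian emissions.

```python
import numpy as np

def log_joint(X, path, log_A, log_b):
    """log P(X,S|Theta) for one fixed state sequence `path` (a list of state indices):
    the product of emission terms b_s(x_t) and transition terms a_{s(t-1) s(t)},
    starting from the first emission as in the slide's formula."""
    ll = log_b(path[0], X[0])
    for t in range(1, len(X)):
        ll += log_A[path[t - 1], path[t]] + log_b(path[t], X[t])
    return ll

# Toy 3-state left-to-right HMM with 1-D Gaussian emissions (made-up numbers)
log_A = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.7, 0.3],
                         [0.0, 0.0, 1.0]]) + 1e-30)
means, variances = np.array([0.0, 2.0, -1.0]), np.array([1.0, 0.5, 1.0])
log_b = lambda s, x: -0.5 * (np.log(2 * np.pi * variances[s]) + (x - means[s]) ** 2 / variances[s])
X = np.array([0.1, 0.3, 2.1, -0.8, -1.2])
print(log_joint(X, [0, 0, 1, 2, 2], log_A, log_b))   # path matching b1 b1 b2 b3 b3
```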
26
Evaluating HMM (for any state sequence)
Since we do not know the underlying state sequence, we must marginalize – compute and sum likelihoods over all the possible paths:
P(X|Θ) = Σ_S P(X,S|Θ)
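A sketch of the forward (alpha) recursion that performs this marginalization without enumerating paths; the per-frame, per-state emission log-likelihoods are assumed to be precomputed in a (T × N) matrix.

```python
import numpy as np

def forward_log_likelihood(log_pi, log_A, log_B):
    """Forward (alpha) recursion in the log domain.
    log_pi[j]  : log probability of starting in state j
    log_A[i,j] : log transition probability a_ij
    log_B[t,j] : log emission likelihood log b_j(x_t), precomputed for every frame
    Returns log P(X|Theta) = log of the sum over all state sequences."""
    T, N = log_B.shape
    alpha = log_pi + log_B[0]
    for t in range(1, T):
        # alpha_t(j) = [sum_i alpha_{t-1}(i) a_ij] b_j(x_t), computed with log-sum-exp
        alpha = np.logaddexp.reduce(alpha[:, None] + log_A, axis=0) + log_B[t]
    return np.logaddexp.reduce(alpha)   # sum over the states reachable at the last frame
```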
32
Finding the best (Viterbi) paths
P*(X|Θ) = max_S P(X,S|Θ)
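The corresponding Viterbi sketch: the same recursion with max in place of sum, plus back-pointers to recover the best path; the emission log-likelihoods are again assumed precomputed.

```python
import numpy as np

def viterbi_decode(log_pi, log_A, log_B):
    """Same recursion as the forward algorithm but with max instead of sum,
    plus back-pointers so the best state sequence can be recovered.
    Returns (log P(X,S*|Theta), best path S*)."""
    T, N = log_B.shape
    delta = log_pi + log_B[0]
    backptr = np.zeros((T, N), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + log_A          # score of every predecessor/state pair
        backptr[t] = np.argmax(scores, axis=0)
        delta = np.max(scores, axis=0) + log_B[t]
    best = int(np.argmax(delta))
    path = [best]
    for t in range(T - 1, 0, -1):                # trace the back-pointers
        path.append(int(backptr[t, path[-1]]))
    return float(np.max(delta)), path[::-1]
```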
33
Training HMMs – Viterbi training
Similar to the approximate training we have already seen for GMMs:
1. For each training utterance, find the Viterbi path through the HMM, which associates feature frames with states.
2. Re-estimate the state distributions using the associated feature frames.
3. Repeat steps 1 and 2 until the algorithm converges.
34
Training HMMs using EM
γ_s(t) = α_s(t) β_s(t) / P(X|Θ_old)
μ̂_s^(new) = Σ_{t=1}^T γ_s(t) x(t) / Σ_{t=1}^T γ_s(t)
σ̂_s^(2,new) = Σ_{t=1}^T γ_s(t) (x(t) − μ̂_s^(new))² / Σ_{t=1}^T γ_s(t)
where α_s(t) and β_s(t) are the forward and backward probabilities, and γ_s(t) is the posterior probability of being in state s at time t – the state-level counterpart of the component posteriors used for GMM training.
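A sketch of how the state posteriors γ_s(t) can be obtained with the forward-backward recursions and then used for the mean re-estimation above (variances would follow the same weighted-average pattern); the emission log-likelihoods are assumed precomputed under the old model.

```python
import numpy as np

def state_posteriors_and_means(x, log_pi, log_A, log_B):
    """Compute gamma_s(t) = alpha_s(t) * beta_s(t) / P(X|Theta_old) with the
    forward-backward recursions and use it for the mean re-estimation above.
    x is the 1-D feature sequence; log_B[t,s] = log b_s(x_t) under the old model."""
    T, N = log_B.shape
    alpha = np.zeros((T, N))
    alpha[0] = log_pi + log_B[0]
    for t in range(1, T):                        # forward pass
        alpha[t] = np.logaddexp.reduce(alpha[t - 1][:, None] + log_A, axis=0) + log_B[t]
    beta = np.zeros((T, N))                      # log beta_T(s) = 0
    for t in range(T - 2, -1, -1):               # backward pass
        beta[t] = np.logaddexp.reduce(log_A + log_B[t + 1] + beta[t + 1], axis=1)
    log_px = np.logaddexp.reduce(alpha[-1])      # log P(X|Theta_old)
    gamma = np.exp(alpha + beta - log_px)        # shape (T, N); each row sums to 1
    occ = gamma.sum(axis=0)                      # soft state occupation counts
    new_means = (gamma * x[:, None]).sum(axis=0) / occ
    return gamma, new_means                      # variances would use the same weights
```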
35
Isolated word recognition
YES NO
p(X|YES) P(YES) > p(X|NO) P(NO)
36
Connected word recognition
YES sil sil NO
37
Phoneme based models
y eh s   y eh s
38
Using Language model - unigram
(Recognition network: from sil, the word models one = w ah n, two = t uw, three = th r iy are entered with unigram probabilities P(one), P(two), P(three), then return to sil.)
39
Using Language model - bigram
(Recognition network: the word models one = w ah n, two = t uw, three = th r iy, optionally separated by sil, are connected by bigram transition probabilities P(W2|W1).)
40
Other basic ASR topics not covered by this presentation
Context dependent models
Training phoneme based models
Feature extraction: delta parameters, de-correlation of features
Full-covariance vs. diagonal covariance modeling
Adaptation to speaker or acoustic condition
Language modeling: LM smoothing (back-off)
Discriminative training (MMI or MPE)
and so on