
Presented by: Fang-Hui Chu Discriminative Models for Speech Recognition M.J.F. Gales Cambridge University Engineering Department 2007.


1 Presented by: Fang-Hui Chu
Discriminative Models for Speech Recognition
M.J.F. Gales, Cambridge University Engineering Department, 2007

2 Outline
Introduction
Hidden Markov Models
Discriminative Training Criteria
  – Maximum Mutual Information
  – Minimum Classification Error
  – Minimum Bayes’ Risk
  – Techniques to improve generalization
Large Margin HMMs
Maximum Entropy Markov Models
Conditional Random Field
Dynamic Kernels
Conditional Augmented Models
Conclusions

3 Automatic Speech Recognition
The task of speech recognition is to determine the identity of a given observation sequence $O$ by assigning a recognized word sequence $W$ to it
The decision is to find the identity with the maximum a posteriori (MAP) probability: $\hat{W} = \arg\max_{W} P(W \mid O) = \arg\max_{W} p(O \mid W)\, P(W)$
  – The so-called Bayes decision (or minimum-error-rate) rule
  – $p(O \mid W)$ is the acoustic model (Gaussian) and $P(W)$ is the language model (multinomial)
  – [Equation labels on the slide: "is assumed to be given", "have to be estimated"]
A certain parametric representation of these distributions is needed
HMMs are widely adopted for acoustic modeling
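As a minimal sketch of the MAP decision rule, the snippet below picks the hypothesis with the highest combined acoustic and language model log score. The hypothesis list, the field names, and all scores are hypothetical values chosen for illustration, not output of a real recognizer.

```python
def map_decode(hypotheses):
    """Pick the word sequence W maximizing log p(O|W) + log P(W)."""
    best = max(hypotheses, key=lambda h: h["log_acoustic"] + h["log_lm"])
    return best["words"]


# Hypothetical scores for two competing hypotheses of one utterance.
hypotheses = [
    {"words": "recognize speech", "log_acoustic": -120.0, "log_lm": -4.0},
    {"words": "wreck a nice beach", "log_acoustic": -118.5, "log_lm": -9.0},
]

best_words = map_decode(hypotheses)
print(best_words)
```

In this toy case the second hypothesis has the better acoustic score, but the language model prior outweighs it.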

4 Acoustic Modeling (1/2)
In the development of an ASR system, acoustic modeling is an indispensable and crucial ingredient
The purpose of acoustic modeling is to provide a method to calculate the likelihood $p(O \mid W)$ of a speech utterance $O$ occurring given a word sequence $W$
In principle, the word sequence can be decomposed into a sequence of phone-like units (acoustic models)
  – Each unit is normally represented by an HMM and can be estimated from a corpus of training utterances
  – Traditionally, maximum likelihood (ML) training is employed for this estimation

5 Acoustic Modeling (2/2)
Besides ML training, the acoustic model can alternatively be trained with discriminative training criteria
  – MCE training, MMI training, MPE training, etc.
  – In MCE training, an approximation to the error rate on the training data is optimized
  – The MMI and MPE algorithms were developed in an attempt to correctly discriminate the recognition hypotheses for the best recognition results
However:
  – The underlying acoustic model is still generative, with the associated constraints on the state and transition probability distributions
  – Classification is based on Bayes’ decision rule

6 Introduction
Initially these discriminative criteria were applied to small vocabulary speech recognition tasks
A number of techniques were then developed to enable their use for LVCSR tasks
  – I-smoothing
  – Language model weakening
  – The use of lattices to compactly represent the denominator score
But the performance on LVCSR tasks is still not satisfactory for many speech-enabled applications
  – This has led to interest in discriminative (or direct) models for speech recognition, where the posterior of the word sequence given the observation, $P(W \mid O)$, is directly modeled

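7 Hidden Markov Models
HMMs are the standard acoustic model used in speech recognition
The likelihood function is $p(O \mid W; \lambda) = \sum_{q} \prod_{t=1}^{T} P(q_t \mid q_{t-1})\, p(o_t \mid q_t; \lambda)$, where the sum runs over all state sequences $q = q_1, \dots, q_T$ allowed for $W$
The standard training of HMMs is based on Maximum Likelihood training
  – This optimization is normally performed using Expectation Maximization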
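The HMM likelihood sums over every state sequence, which is normally computed with the forward recursion rather than by enumeration. The sketch below is a log-domain forward pass checked against brute-force enumeration on a tiny model. The two-state topology and all probabilities are hypothetical, and the per-frame emission scores are assumed to be precomputed (for example from Gaussian output densities).

```python
import math
from itertools import product


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def forward_log_likelihood(log_init, log_trans, log_emit):
    """Forward recursion: log_emit[t][j] is log p(o_t | state j)."""
    n = len(log_init)
    alpha = [log_init[j] + log_emit[0][j] for j in range(n)]
    for t in range(1, len(log_emit)):
        alpha = [
            logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
            + log_emit[t][j]
            for j in range(n)
        ]
    return logsumexp(alpha)


def brute_force_log_likelihood(log_init, log_trans, log_emit):
    """Same quantity by explicit enumeration of all state sequences."""
    n, num_frames = len(log_init), len(log_emit)
    path_scores = []
    for path in product(range(n), repeat=num_frames):
        score = log_init[path[0]] + log_emit[0][path[0]]
        for t in range(1, num_frames):
            score += log_trans[path[t - 1]][path[t]] + log_emit[t][path[t]]
        path_scores.append(score)
    return logsumexp(path_scores)


# Hypothetical two-state model and three frames of emission scores.
log_init = [math.log(0.6), math.log(0.4)]
log_trans = [[math.log(0.7), math.log(0.3)],
             [math.log(0.4), math.log(0.6)]]
log_emit = [[math.log(0.5), math.log(0.1)],
            [math.log(0.4), math.log(0.3)],
            [math.log(0.1), math.log(0.6)]]

forward_value = forward_log_likelihood(log_init, log_trans, log_emit)
brute_value = brute_force_log_likelihood(log_init, log_trans, log_emit)
print(forward_value, brute_value)
```

The forward pass costs time linear in the number of frames, while the enumeration grows exponentially and is only usable as a check on toy inputs.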

8 Discriminative Training Criteria
The discriminative training criteria are more closely linked to minimizing the error rate than to maximizing the likelihood of generating the training data
Three main forms of discriminative training have been examined
  – Maximum Mutual Information (MMI)
  – Minimum Classification Error (MCE)
  – Minimum Bayes’ Risk (MBR), which includes Minimum Phone Error (MPE)

9 Discriminative Training Criteria
Maximum Mutual Information:
  – Maximizes the mutual information between the observed sequences and the models
Minimum Classification Error:
  – Based on a smooth function of the difference between the log-likelihood of the correct sequence and those of all other competing word sequences
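To make the two criteria concrete, the following sketch evaluates them on an N-best list. It assumes each hypothesis already has a joint log score log p(O|W) + log P(W), approximates the sum over all word sequences by the listed hypotheses, and uses one common sigmoid form of the MCE loss with a hypothetical smoothing parameter `rho`. All scores are made up for illustration.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a list of log values."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_posterior_of_reference(joint_scores, ref):
    """log P(W_ref | O), normalized over the listed hypotheses only."""
    return joint_scores[ref] - logsumexp(list(joint_scores.values()))


def mmi_objective(utterances):
    """MMI: sum of reference log posteriors over the training utterances."""
    return sum(log_posterior_of_reference(s, ref) for s, ref in utterances)


def mce_loss(joint_scores, ref, rho=1.0):
    """Sigmoid of (log-sum of competitor scores minus reference score)."""
    competitors = [v for w, v in joint_scores.items() if w != ref]
    d = logsumexp(competitors) - joint_scores[ref]
    return 1.0 / (1.0 + math.exp(-rho * d))


# Hypothetical N-best lists: (joint log scores, reference transcription).
utterances = [
    ({"a b": -10.0, "a c": -12.0, "d b": -11.0}, "a b"),  # reference wins
    ({"x": -20.0, "y": -19.0}, "x"),                      # reference loses
]

print(mmi_objective(utterances))
print([mce_loss(scores, ref) for scores, ref in utterances])
```

The MCE loss stays below 0.5 when the reference outscores its competitors and rises above 0.5 when it does not, which is what makes it a smooth stand-in for the error count.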

10 Discriminative Training Criteria
Minimum Bayes’ Risk:
  – Rather than trying to model the correct distribution, the expected loss during inference is minimized
  – A number of loss functions can be used:
      1/0 loss: equivalent to a sentence-level loss function
      Word: the loss function is directly related to minimizing the expected Word Error Rate (WER)
      Phone: a phone-level loss, as in Minimum Phone Error (MPE)
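The expected loss that MBR minimizes can be sketched on an N-best list as the posterior-weighted average of a loss against the reference. The example below compares the 1/0 loss with a word-level loss based on Levenshtein distance. The hypotheses and scores are hypothetical, and the posterior is normalized over the listed hypotheses only, which is an approximation.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def posteriors(joint_scores):
    """Normalize joint log scores into P(W | O) over the listed hypotheses."""
    m = max(joint_scores.values())
    unnorm = {w: math.exp(s - m) for w, s in joint_scores.items()}
    z = sum(unnorm.values())
    return {w: u / z for w, u in unnorm.items()}


def expected_loss(joint_scores, ref, loss):
    """Posterior-weighted loss of every hypothesis against the reference."""
    post = posteriors(joint_scores)
    return sum(p * loss(w, ref) for w, p in post.items())


def zero_one_loss(hyp, ref):
    """Sentence-level loss: 1 if the hypothesis differs at all."""
    return float(hyp != ref)


def word_loss(hyp, ref):
    """Word-level loss: number of word edits needed to reach the reference."""
    return float(edit_distance(ref.split(), hyp.split()))


# Hypothetical joint log scores for one utterance.
scores = {"the cat sat": -10.0, "the cat sad": -11.0, "a cat sat down": -12.0}
reference = "the cat sat"

sentence_risk = expected_loss(scores, reference, zero_one_loss)
word_risk = expected_loss(scores, reference, word_loss)
print(sentence_risk, word_risk)
```

The word-level risk penalizes the hypothesis with two word errors more heavily than the one with a single error, which the 1/0 loss cannot distinguish.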

11 Large Margin HMMs
The simplest form of large margin training criterion can be expressed as maximizing the minimum margin over the training data [Li et al. 2005]
  – This aims to maximize the minimum distance between the log-posterior of the correct label and those of all the incorrect labels
The criterion has properties related to both the MMI and MCE criteria
  – A log-posterior cost function is used, as in the MMI criterion
  – The denominator term does not include an element from the correct label, in a similar fashion to the MCE criterion
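A minimal sketch of the quantity being maximized, under the assumption that the margin of an utterance is the log-posterior of the correct label minus that of its best competitor. Because both posteriors share the same normalizer, the difference can be taken directly on joint log scores. The function names and the scores are hypothetical.

```python
def margin(joint_scores, ref):
    """Log-posterior gap between the reference and its closest competitor."""
    best_competitor = max(v for w, v in joint_scores.items() if w != ref)
    return joint_scores[ref] - best_competitor


def large_margin_objective(utterances):
    """Smallest margin over the training set; training would maximize this."""
    return min(margin(scores, ref) for scores, ref in utterances)


# Hypothetical N-best lists: (joint log scores, reference transcription).
utterances = [
    ({"a b": -10.0, "a c": -12.0, "d b": -11.0}, "a b"),
    ({"x": -20.0, "y": -19.0}, "x"),
]

print([margin(scores, ref) for scores, ref in utterances])
print(large_margin_objective(utterances))
```

Only the worst-separated utterance determines the objective, so a single misrecognized training example (negative margin) dominates it.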

12 Large Margin HMMs
A couple of variants of large margin training
  – Soft margin training [Jinyu Li et al. 2006]
  – Large margin GMM [F. Sha and L.K. Saul 2007]
The size of the margin is specified in terms of a loss function between the two sets of sequences

13 Direct Models
Direct modeling attempts to model the posterior probability directly
There are many potential advantages as well as challenges for direct modeling
  – The direct model can potentially make decoding simpler
  – The direct model allows for the potential combination of multiple sources of data in a unified fashion
      Asynchronous and overlapping features can be incorporated formally
      It will be possible to take advantage of supra-segmental features like prosodic features, acoustic phonetic features, speaker style, rate of speech, channel differences
  – However, joint estimation would require a large amount of parallel speech and text data (a challenge for data collection)

14 Direct Models
The relationship between observations and states is reversed
  – Separate transition and observation probabilities are replaced with one function, $P(s_t \mid s_{t-1}, o_t)$
  – Directly modeling this function makes direct computation of the posterior possible
The model can also be conditioned flexibly on a variety of contextual features
  – Any computable property of the observation sequence can be used as a feature
  – The number of features at each time frame need not be the same
Assumption:

15 Maximum Entropy Markov Models
Recently, McCallum et al. (ICML 2000) modeled sequential processes using a direct model similar to the HMM in graphical structure, with exponential models for the transition-observation probabilities
  – Called the Maximum Entropy Markov Model (MEMM)
Maximum Entropy (ME) modeling is used to model the conditional distributions
  – ME modeling is based on the principle of avoiding unnecessary assumptions
  – The principle states that the modeled probability distribution should be consistent with the given collection of facts about it and otherwise be as uniform as possible

16 Maximum Entropy Markov Models
The mathematical interpretation of this principle results in a constrained optimization problem
  – Maximize the entropy of a conditional distribution, subject to given constraints
  – Constraints represent the known facts about the model, taken from statistics of the training data
Definition 1:
Definition 2:

17 Maximum Entropy Markov Models
These definitions allow us to introduce the constraints of the model
The expected value of each feature is taken with respect to the model
Using Lagrange multipliers for constrained optimization, the desired probability distribution is the one that maximizes the resulting Lagrangian function
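For orientation, the standard maximum entropy formulation can be sketched as follows. The notation is an assumption and may differ from the original slides: $f_i(o,s)$ are feature functions, $\tilde{p}$ denotes empirical distributions from the training data, and $\lambda_i$ are the Lagrange multipliers.

```latex
% Constraint: the model expectation of each feature matches its empirical value
E_{p}[f_i] = \sum_{o,s} \tilde{p}(o)\, p(s \mid o)\, f_i(o,s)
           = \sum_{o,s} \tilde{p}(o,s)\, f_i(o,s) = E_{\tilde{p}}[f_i]

% Conditional entropy to be maximized
H(p) = -\sum_{o,s} \tilde{p}(o)\, p(s \mid o) \log p(s \mid o)

% Lagrangian of the constrained problem
\Lambda(p, \lambda) = H(p) + \sum_i \lambda_i \left( E_{p}[f_i] - E_{\tilde{p}}[f_i] \right)

% Stationary point: an exponential (log-linear) model
p_{\lambda}(s \mid o) = \frac{1}{Z_{\lambda}(o)} \exp\Big( \sum_i \lambda_i f_i(o,s) \Big),
\qquad
Z_{\lambda}(o) = \sum_{s'} \exp\Big( \sum_i \lambda_i f_i(o,s') \Big)
```

In an MEMM the conditioning context $o$ would also carry the previous state, so the same form yields a per-step distribution over the next state.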

18 Maximum Entropy Markov Models
Finally, the solution of the objective function is given by the exponential model
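As a rough sketch of such an exponential model in an MEMM, the code below computes a locally normalized next-state distribution from weighted feature functions and chains it into a sequence posterior. The two states, the three binary features, the weights, and the scalar observations are all hypothetical, and the weights are fixed by hand rather than trained.

```python
import math
from itertools import product


def memm_transition(prev_state, obs, states, features, weights):
    """P(s | prev_state, obs) proportional to exp(sum_i w_i * f_i(prev, obs, s))."""
    scores = {
        s: sum(w * f(prev_state, obs, s) for f, w in zip(features, weights))
        for s in states
    }
    m = max(scores.values())
    unnorm = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(unnorm.values())
    return {s: u / z for s, u in unnorm.items()}


def sequence_log_posterior(state_seq, obs_seq, start, states, features, weights):
    """log P(state_seq | obs_seq) as a product of per-step conditionals."""
    logp, prev = 0.0, start
    for s, o in zip(state_seq, obs_seq):
        logp += math.log(memm_transition(prev, o, states, features, weights)[s])
        prev = s
    return logp


# Hypothetical binary features over (previous state, observation, state).
def f_loud_speech(prev, obs, s):
    return 1.0 if obs > 0.5 and s == "speech" else 0.0


def f_quiet_sil(prev, obs, s):
    return 1.0 if obs <= 0.5 and s == "sil" else 0.0


def f_self_loop(prev, obs, s):
    return 1.0 if prev == s else 0.0


states = ["sil", "speech"]
features = [f_loud_speech, f_quiet_sil, f_self_loop]
weights = [2.0, 1.5, 0.8]      # hand-set for illustration, not trained
obs_seq = [0.1, 0.9, 0.7]      # e.g. a normalized energy per frame

first_step = memm_transition("sil", obs_seq[0], states, features, weights)
total_probability = sum(
    math.exp(sequence_log_posterior(seq, obs_seq, "sil", states, features, weights))
    for seq in product(states, repeat=len(obs_seq))
)
print(first_step, total_probability)
```

Because every step is normalized on its own, the posteriors of all state sequences sum to one without any global normalization, which is the property that lets the posterior be computed directly.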

19 Reference
[SAP06] Jeff Kuo and Yuqing Gao, “Maximum Entropy Direct Models for Speech Recognition”

