
1 Discriminative Training and Acoustic Modeling for Automatic Speech Recognition - Chap. 4 Discriminative Training. Wolfgang Macherey. Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the academic degree of Doktor der Naturwissenschaften. Presented by Yueng-Tien, Lo

2 Outline: A General View of Discriminative Training Criteria – Extended Unifying Approach – Smoothed Error Minimizing Training Criteria

3 A General View of Discriminative Training Criteria A unifying approach for a class of discriminative training criteria was presented that allows for optimizing several objective functions – among them the Maximum Mutual Information criterion and the Minimum Classification Error criterion – within a single framework. The approach is extended such that it also captures other, more recently proposed criteria.

4 A General View of Discriminative Training Criteria A class of discriminative training criteria can then be defined in terms of a smoothing function f and a gain function G that allows for rating sentence hypotheses W based on the spoken word string W_r; G reflects an error metric such as the Levenshtein distance or the sentence error.
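A plausible reconstruction of the criterion, with the scaling exponent written here as β (an assumption; the thesis may use slightly different notation):
F(\theta) = \frac{1}{R} \sum_{r=1}^{R} f\!\left[ \frac{\sum_{W \in M_r} p_\theta^{\beta}(X_r \mid W)\, p^{\beta}(W)\, G(W, W_r)}{\sum_{V \in M_r} p_\theta^{\beta}(X_r \mid V)\, p^{\beta}(V)} \right]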

5 A General View of Discriminative Training Criteria The fraction inside the brackets is called the local discriminative criterion.

6 Extended Unifying Approach The choice of alternative word sequences contained in the set M_r, together with the smoothing function f, the weighting exponent, and the gain function G, determines the particular criterion in use.

7 Extended Unifying Approach
– Maximum Likelihood (ML) Criterion
– Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion
– Diversity Index
– Jeffreys' Criterion
– Chernoff Affinity

8 Maximum Likelihood (ML) Criterion Although not a discriminative criterion in the strict sense, the ML objective function is contained as a special case in the extended unifying approach. The ML estimator is consistent, which means that for any increasing and representative set of training samples the parameter estimates converge toward the true model parameters.
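For reference, the ML criterion in this setting is simply the log-likelihood of the spoken word sequences (a standard form, not quoted from the slide):
F_{ML}(\theta) = \sum_{r=1}^{R} \log p_\theta(X_r \mid W_r)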

9 Maximum Likelihood (ML) Criterion However, for automatic speech recognition the model assumptions are typically not correct, and therefore the ML estimator will return the true parameters of a wrong model in the limiting case of an infinite amount of training data. In contrast to discriminative training criteria, which concentrate on enhancing class separability by taking class-extraneous data into account, the ML estimator optimizes each class region individually.

10 Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion A particular choice of the smoothing and gain functions yields the MMI criterion, which is defined as the sum over the logarithms of the class posterior probabilities of the spoken word sequences W_r for each training utterance r, given the corresponding acoustic observations X_r:
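The standard form of the MMI criterion (scaling exponents omitted here):
F_{MMI}(\theta) = \sum_{r=1}^{R} \log p_\theta(W_r \mid X_r) = \sum_{r=1}^{R} \log \frac{p_\theta(X_r \mid W_r)\, p(W_r)}{\sum_{W \in M_r} p_\theta(X_r \mid W)\, p(W)}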

11 Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion The MMI criterion is equivalent to the Shannon Entropy which, in the case of given class priors, is also known as the Equivocation.

12 Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion An approximation to the MMI criterion is given by the Corrective Training (CT) criterion. Here, the sum over all competing word sequences in the denominator is replaced with the best recognized sentence hypothesis.
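A sketch of the CT criterion as described, with \hat{W}_r denoting the best recognized hypothesis (notation assumed):
F_{CT}(\theta) = \sum_{r=1}^{R} \log \frac{p_\theta(X_r \mid W_r)\, p(W_r)}{p_\theta(X_r \mid \hat{W}_r)\, p(\hat{W}_r)}, \qquad \hat{W}_r = \operatorname{argmax}_{W \in M_r}\, p_\theta(X_r \mid W)\, p(W)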

13 Diversity Index The Diversity Index of a given degree measures the divergence of a probability distribution from the uniform distribution. While a diversity index closer to the maximum at 0 means a larger divergence from the uniform distribution, smaller values indicate that all classes tend to be nearly equally likely.

14 Diversity Index The Diversity Index applies a corresponding weighting function, which results in the following expression for the discriminative training criterion:
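One parameterization consistent with the limits stated on the next slide, with the degree written here as α (an assumption; the thesis may use an equivalent but differently scaled form):
F_{\alpha}(\theta) = \sum_{r=1}^{R} \frac{p_\theta(W_r \mid X_r)^{\alpha} - 1}{\alpha}
For α approaching 0 the summands tend to \log p_\theta(W_r \mid X_r), recovering the MMI/Shannon case, while α = 1 gives a Gini-type index.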

15 Diversity Index Two well-known diversity indices, the Shannon Entropy (which is equivalent to the MMI criterion for constant class priors) and the Gini Index, are special cases of the Diversity Index. The Shannon Entropy results from the continuous limit as the degree approaches 0, while the Gini Index follows from setting the degree to 1.

16 Diversity Index As noted above, the MMI criterion is equivalent to the Shannon Entropy which, in the case of given class priors, is also known as the Equivocation.

17 Jeffreys' Criterion The Jeffreys criterion, which is also known as Jeffreys' divergence, is closely related to the Kullback-Leibler distance [Kullback & Leibler 51] and was first proposed in [Jeffreys 46]. The smoothing function is not lower-bounded, which means that the increase in the objective function will be large if the parameters are estimated such that no training utterance has a vanishingly small posterior probability.
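For context, the classical Jeffreys divergence between two distributions P and Q, i.e. the symmetrized Kullback-Leibler distance (the slide's training criterion itself is not reproduced here):
J(P, Q) = \sum_{x} \bigl( P(x) - Q(x) \bigr) \log \frac{P(x)}{Q(x)} = D_{KL}(P \,\|\, Q) + D_{KL}(Q \,\|\, P)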

18 Chernoff Affinity The Chernoff Affinity was suggested as a generalization of Bhattacharyya's measure of affinity. It employs a smoothing function with a free parameter, which leads to the corresponding training criterion.
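For context, the classical Bhattacharyya coefficient and its Chernoff generalization with parameter α in (0, 1) (the slide's training criterion itself is not reproduced here):
BC(P, Q) = \sum_{x} \sqrt{P(x)\, Q(x)}, \qquad C_{\alpha}(P, Q) = \sum_{x} P(x)^{\alpha}\, Q(x)^{1-\alpha}
Setting α = 1/2 recovers the Bhattacharyya coefficient.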


20 Smoothed Error Minimizing Training Criteria Error minimizing training criteria such as the MCE, the MWE, and the MPE criterion aim at minimizing the expectation of an error-related loss function on the training data. Let L denote any such loss function. Then the objective is to determine a parameter set that minimizes the total costs due to classification errors:
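A sketch of the objective described above, with \hat{W}_r denoting the hypothesis chosen by the decision rule (notation assumed):
\hat{\theta} = \operatorname{argmin}_{\theta} \sum_{r=1}^{R} L\bigl( W_r, \hat{W}_r(X_r; \theta) \bigr), \qquad \hat{W}_r(X_r; \theta) = \operatorname{argmax}_{W \in M_r}\, p_\theta(W \mid X_r)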

21 The optimization problem The objective function includes an “argmin” operation, which prevents the computation of a gradient. The objective function has many local optima, which an optimization algorithm must handle. The loss function L itself is typically a non-continuous step function and therefore not differentiable.

22 Smoothed Error Minimizing Training Criteria A remedy that makes this class of error minimizing training criteria amenable to gradient-based optimization methods is to replace Eq. (4.17) with a smoothed expression; discriminative criteria like the MCE, the MWE, and the MPE criterion then differ only with respect to the choice of the loss function L.
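A common smoothed form of this replacement, in which the hard decision is replaced by an expectation over the hypothesis posteriors (scaling exponents omitted; the exact substitute for Eq. (4.17) may differ):
F(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_\theta(W \mid X_r)\, L(W, W_r)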

23 Smoothed Error Minimizing Training Criteria While the MCE criterion typically applies a smoothed sentence error loss function, both the MWE and the MPE criterion are based on approximations of the word or phoneme error rate.

24 Smoothed Error Minimizing Training Criteria
– Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion
– Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion
– Minimum Squared Error (MSE) Criterion

25 Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion The MCE criterion aims at minimizing the expectation of a smoothed sentence error on training data. According to Bayes’ decision rule, the probability of making a classification error in utterance r is given by:
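A plausible form of this local error probability, expressed with the model's (scaled) posterior of the spoken word string:
p_r(\mathrm{error}) = 1 - p_\theta(W_r \mid X_r)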

26 Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion Smoothing the local error probability with a sigmoid function and summing over all training utterances yields the MCE criterion. Similar to the CT criterion, the Falsifying Training (FT) criterion derives from the MCE criterion in the limiting case of an infinitely steep sigmoid.
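A sketch of a sigmoid-smoothed sentence error of the kind described, with slope parameter ϱ and misclassification measure d_r (sign and weighting conventions may differ from the thesis):
F_{MCE}(\theta) = \sum_{r=1}^{R} \frac{1}{1 + \exp\bigl(-2\varrho\, d_r(\theta)\bigr)}, \qquad d_r(\theta) = \log \frac{\sum_{W \in M_r \setminus \{W_r\}} p_\theta(X_r \mid W)\, p(W)}{p_\theta(X_r \mid W_r)\, p(W_r)}
As ϱ goes to infinity the sigmoid approaches a step function, so only misrecognized utterances contribute, which corresponds to the FT criterion.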

27 Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion The objective of the Minimum Word Error (MWE) criterion, as well as of the closely related Minimum Phone Error (MPE) criterion, is to maximize the expectation of an approximation to the word or phoneme accuracy on the training data. After an efficient lattice-based training scheme was found and successfully implemented in [Povey & Woodland 02], the criteria have received increasing interest.

28 Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion Both criteria compute the average transcription accuracy over all sentence hypotheses considered for discrimination. Each hypothesis W is weighted by its posterior probability, scaled by a factor in the log-space:
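A standard formulation of the MWE/MPE objective, with A(W, W_r) denoting the raw word or phoneme accuracy of W against the reference and κ the scaling factor (symbols assumed):
F_{MPE}(\theta) = \sum_{r=1}^{R} \sum_{W \in M_r} p_{\theta,\kappa}(W \mid X_r)\, A(W, W_r), \qquad p_{\theta,\kappa}(W \mid X_r) = \frac{p_\theta^{\kappa}(X_r \mid W)\, p^{\kappa}(W)}{\sum_{V \in M_r} p_\theta^{\kappa}(X_r \mid V)\, p^{\kappa}(V)}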

29 A class of discriminative training criteria covered by the extended unifying approach.

