1
Discriminative Training Approaches for Continuous Speech Recognition Jen-Wei Kuo, Shih-Hung Liu, Berlin Chen, Hsin-min Wang Speaker : 郭人瑋 Main reference: D. Povey. “Discriminative Training for Large Vocabulary Speech Recognition,” Ph.D Dissertation, Peterhouse, University of Cambridge, July 2004
2
2 Statistical Speech Recognition – In this presentation, the language model is assumed to be given in advance, while the acoustic model needs to be estimated – HMMs (hidden Markov models) are widely adopted for acoustic modeling [block diagram: Speech → Feature Extraction → Acoustic Match → Linguistic Decoding → Recognized Sentence]
3
3 Expected Risk Let W = {W_1, ..., W_M} be a finite set of possible word sequences for a given observed utterance O – Assume that the true word sequence W_k is also in W Let a_i be the action of classifying the observation sequence O as the word sequence W_i – Let l(a_i, W_j) be the loss incurred when we take such an action (and the true word sequence is actually W_j) Therefore, the (expected) risk for a specific action a_i is R(a_i | O) = Sum_j l(a_i, W_j) P(W_j | O) [Duda et al. 2000]
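A small numeric illustration (not from the slides; the posteriors and the zero-one loss below are made-up values) of how the expected risk of each action is computed:

# Expected risk R(a_i | O) = sum_j loss(W_i, W_j) * P(W_j | O), hypothetical values
posteriors = {"w1": 0.5, "w2": 0.3, "w3": 0.2}   # assumed P(W_j | O)

def loss(w_i, w_j):
    # zero-one loss at the string level
    return 0.0 if w_i == w_j else 1.0

def expected_risk(w_i, posteriors):
    return sum(loss(w_i, w_j) * p for w_j, p in posteriors.items())

for w in posteriors:
    print(w, expected_risk(w, posteriors))   # w1 -> 0.5, w2 -> 0.7, w3 -> 0.8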
4
4 Decoding: Minimum Expected Risk (1/2) In speech recognition, we can take the action with the minimum (expected) risk: W* = arg min_{W_i} R(a_i | O) If the zero-one loss function is adopted (string-level error), i.e., l(a_i, W_j) = 0 if i = j and 1 otherwise – Then R(a_i | O) = Sum_{j != i} P(W_j | O) = 1 - P(W_i | O)
5
5 Decoding: Minimum Expected Risk (2/2) – Thus W* = arg max_{W_i} P(W_i | O): select the word sequence with the maximum posterior probability (MAP decoding) The string-edit (Levenshtein) distance can also be used as the loss function – Takes individual word errors into consideration – E.g., Minimum Bayes Risk (MBR) search/decoding [Goel et al. 2004], word error minimization [Mangu et al. 2000]
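A minimal sketch of the two decoding rules over an N-best list; the hypotheses and posteriors are invented for illustration. MAP minimizes the expected string error, while the Levenshtein-loss decoder approximates MBR / word error minimization:

# MAP vs. MBR decoding over a hypothetical N-best list of (W_i, P(W_i|O)) pairs
nbest = [("a b c", 0.40), ("a b d", 0.35), ("a x d", 0.25)]

def levenshtein(a, b):
    a, b = a.split(), b.split()
    d = [[i + j if i * j == 0 else 0 for j in range(len(b) + 1)] for i in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d[i][j] = min(d[i-1][j] + 1, d[i][j-1] + 1,
                          d[i-1][j-1] + (a[i-1] != b[j-1]))
    return d[-1][-1]

map_hyp = max(nbest, key=lambda h: h[1])[0]                      # zero-one (string) loss
mbr_hyp = min(nbest, key=lambda h: sum(p * levenshtein(h[0], w)  # Levenshtein (word) loss
                                       for w, p in nbest))[0]
print("MAP:", map_hyp, " MBR:", mbr_hyp)   # here the two rules pick different hypotheses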
6
6 Training: Minimum Overall Expected Risk (1/2) In training, we should minimize the overall (expected) loss of the actions taken on the training utterances, R(lambda) = Integral R(a_{W_O} | O) p(O) dO – W_O is the true word sequence of the observation O – The integral extends over the whole observation-sequence space However, when only a limited number of training observation sequences O_1, ..., O_R is available, the overall risk can be approximated by R(lambda) ≈ Sum_{r=1..R} R(a_{W_r} | O_r), where W_r is the true word sequence of O_r
7
7 Training: Minimum Overall Expected Risk (2/2) Assume the distribution over the training observations to be uniform – The overall risk can then be further expressed as R(lambda) ∝ Sum_r Sum_j l(a_{W_r}, W_j) P_lambda(W_j | O_r) If the zero-one loss function is adopted – Then R(lambda) ∝ Sum_r (1 - P_lambda(W_r | O_r))
8
8 Training: Minimum Error Rate Minimum Error Rate (MER) estimation: lambda* = arg min_lambda Sum_r (1 - P_lambda(W_r | O_r)) = arg max_lambda Sum_r P_lambda(W_r | O_r) – MER is therefore equivalent to MAP (maximizing the overall posterior of the true transcriptions)
9
9 Training: Maximum Likelihood (1/2) The objective function of Maximum Likelihood (ML) estimation can be obtained if Jensen's inequality is further applied – Instead of maximizing the overall log-posterior of all training utterances directly, an upper bound on the corresponding loss is minimized – With a uniform language model, the terms independent of the acoustic model parameters lambda drop out, leaving the overall log-likelihood Sum_r log p_lambda(O_r | W_r) to be maximized [Schlüter 2000]
10
10 Training: Maximum Likelihood (2/2) On the other hand, discriminative training approaches attempt to optimize the correctness of the model set by formulating an objective function that in some way penalizes model parameters that are liable to confuse correct and incorrect answers MLE can thus be considered a derivation from the overall log-posterior
11
11 Training: Maximum Mutual Information (1/3) The objective function can be defined as the sum of the pointwise mutual information of all training utterances and their associated true word sequences; with the language model fixed, maximizing it is equivalent to maximizing F_MMI(lambda) = Sum_r log [ p_lambda(O_r | W_r) P(W_r) / Sum_j p_lambda(O_r | W_j) P(W_j) ] The maximum mutual information (MMI) estimation tries to find a new parameter set lambda that maximizes the above objective function [Bahl et al. 1986]
12
12 Training: Maximum Mutual Information (2/3) An alternative derivation is based on the overall expected-risk criterion – It is equivalent to the maximization of the overall log-posterior of all training utterances, Sum_r log P_lambda(W_r | O_r), once the terms independent of the acoustic model parameters lambda are dropped (the language model probabilities are included in both numerator and denominator)
13
13 Training: Maximum Mutual Information (3/3) When we maximize the MMIE objective function – Not only can the probability of the true word sequence (the numerator, as in the MLE objective function) be increased, but the probabilities of the other possible word sequences (the denominator) can also be decreased – Thus, MMIE attempts to make the correct hypothesis more probable, while at the same time making the incorrect hypotheses less probable MMIE can also be considered a derivation from the overall log-posterior
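To make the numerator/denominator structure concrete, here is a hedged sketch that evaluates the MMI objective for a single utterance from hypothetical per-hypothesis acoustic log-likelihoods and language-model probabilities (an N-best list standing in for a full lattice):

import math

# Hypothetical N-best scores for one utterance; "ref" is the true transcription.
hyps = {"ref": {"acoustic": -100.0, "lm": 0.5},
        "alt1": {"acoustic": -101.0, "lm": 0.3},
        "alt2": {"acoustic": -103.0, "lm": 0.2}}

def mmi_objective(hyps, ref="ref", kappa=1.0 / 12):
    # F_MMI = log [ p(O|W_ref)^kappa P(W_ref) / sum_j p(O|W_j)^kappa P(W_j) ]
    # kappa is an acoustic scale commonly used to flatten over-sharp likelihoods
    num = kappa * hyps[ref]["acoustic"] + math.log(hyps[ref]["lm"])
    den = [kappa * h["acoustic"] + math.log(h["lm"]) for h in hyps.values()]
    m = max(den)
    log_den = m + math.log(sum(math.exp(d - m) for d in den))   # log-sum-exp
    return num - log_den

print(mmi_objective(hyps))   # the per-utterance log-posterior of the reference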
14
14 Training: Minimum Classification Error (1/2) The misclassification measure is defined as d_r(O_r; lambda) = -g_r(O_r; lambda) + (1/eta) log [ (1/(M-1)) Sum_{j != r} exp(eta * g_j(O_r; lambda)) ], where g_j(O_r; lambda) = log p_lambda(O_r | W_j) P(W_j) is the (log-score) discriminant of hypothesis W_j Minimization of the overall misclassification measure is similar to MMIE when the language model is assumed to be uniformly distributed [Chou 2000]
15
15 Training: Minimum Classification Error (2/2) A sigmoid (loss) function is embedded to smooth the misclassification measure: l(d_r) = 1 / (1 + exp(-gamma * d_r + theta)) Let eta = 1, gamma = 1 and theta = 0 (and drop the 1/(M-1) normalization); then the loss becomes l(d_r) = Sum_{j != r} P_lambda(W_j | O_r) = 1 - P_lambda(W_r | O_r) Minimization of the overall loss therefore directly minimizes the (classification) error rate, so MCE can be regarded as a derivation from MER
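A brief sketch of the MCE quantities under the reconstruction above; the log-scores are hypothetical, and with eta = gamma = 1, theta = 0 the loss equals 1 - P(W_ref | O) over this hypothesis set:

import math

# Hypothetical log-scores g_j = log p(O|W_j) + log P(W_j) for one utterance
scores = {"ref": -3.0, "alt1": -3.5, "alt2": -4.0}

def mce_loss(scores, ref="ref", eta=1.0, gamma=1.0, theta=0.0):
    competitors = [v for k, v in scores.items() if k != ref]
    # soft maximum of the competing scores (eta controls how "hard" the max is);
    # the 1/(M-1) normalization is omitted so that with eta=gamma=1, theta=0
    # the loss equals 1 - P(W_ref | O)
    anti = (1.0 / eta) * math.log(sum(math.exp(eta * g) for g in competitors))
    d = -scores[ref] + anti                              # misclassification measure
    return 1.0 / (1.0 + math.exp(-gamma * d + theta))    # smoothed (sigmoid) loss

print(mce_loss(scores))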
16
16 Training: Minimum Phone Error The objective function of Minimum Phone Error (MPE) is directly derived from the overall expected-risk criterion – Replace the loss function with the so-called accuracy function A(W, W_r): F_MPE(lambda) = Sum_r Sum_j P_lambda(W_j | O_r) A(W_j, W_r) MPE tries to maximize the expected (phone or word) accuracy over all possible word sequences (generated by the recognizer) for the training utterances [Povey 2004]
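A hedged sketch of the MPE criterion for one utterance: the expected phone accuracy is the posterior-weighted average of per-hypothesis raw accuracies (here over a made-up N-best list rather than a lattice):

import math

# Hypothetical hypotheses with scaled log-scores and raw phone accuracies A(W, W_ref)
hyps = [{"score": -10.0, "acc": 5.0},   # e.g., 5 of 5 reference phones correct
        {"score": -10.5, "acc": 4.0},
        {"score": -12.0, "acc": 2.0}]

def expected_accuracy(hyps):
    m = max(h["score"] for h in hyps)
    post = [math.exp(h["score"] - m) for h in hyps]   # unnormalized posteriors
    z = sum(post)
    # per-utterance MPE contribution: sum_W P(W|O) * A(W, W_ref)
    return sum(p / z * h["acc"] for p, h in zip(post, hyps))

print(expected_accuracy(hyps))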
17
17 Objective Function Optimization The objective functions have the "latent variable" problem (the hidden state sequences), so they cannot be directly optimized; iterative optimization is used instead – Gradient-based approaches, e.g., MCE – Expectation-Maximization (EM) with a strong-sense auxiliary function, e.g., MLE – Weak-sense auxiliary functions, e.g., MMIE, MPE
18
18 Three Steps for EM Step 1. Draw a lower bound – Use Jensen's inequality Step 2. Find the best lower bound (auxiliary function) – Let the lower bound touch the objective function at the current guess Step 3. Maximize the auxiliary function – Obtain the new guess – Go to Step 2 until convergence [Minka 1998]
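As an illustration of the three steps (not part of the slides), here is a toy EM run for a simple latent-variable model: observations drawn from a mixture of two fixed unit-variance Gaussians with an unknown mixing weight. The E-step builds the lower bound that touches the objective at the current guess (the posterior responsibilities), and the M-step maximizes it in closed form:

import math, random

random.seed(0)
# Toy data from the mixture 0.7*N(0,1) + 0.3*N(4,1); only the weight w is estimated
data = [random.gauss(0, 1) if random.random() < 0.7 else random.gauss(4, 1)
        for _ in range(2000)]

def normal(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

w = 0.5                                   # initial guess for the mixing weight
for it in range(20):
    # Steps 1+2: lower bound touching the objective at the current guess
    #            -> posterior responsibility of component 1 for each point
    resp = [w * normal(x, 0) / (w * normal(x, 0) + (1 - w) * normal(x, 4))
            for x in data]
    # Step 3: maximize the auxiliary (Q) function -> closed-form update of w
    w = sum(resp) / len(resp)
print("estimated weight:", round(w, 3))   # approaches 0.7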
19
19 Step 1. Draw a lower bound (1/3) [figure: the objective function and the current guess]
20
20 Step 1. Draw a lower bound (2/3) [figure: the objective function and a lower-bound function]
21
21 Step 1. Draw a lower bound (3/3) Apply Jensen's inequality: log p(O | lambda) = log Sum_q p(O, q | lambda) = log Sum_q f(q) [ p(O, q | lambda) / f(q) ] >= Sum_q f(q) log [ p(O, q | lambda) / f(q) ] for any distribution f(q) over the hidden variable q; this is the lower-bound function of log p(O | lambda)
22
22 Step 2. Find the best lower bound (1/4) [figure: the objective function and the lower-bound function]
23
23 Step 2. Find the best lower bound (2/4) – Let the lower bound touch the objective function at the current guess lambda_old – Find the best f(q) at lambda = lambda_old (maximize the lower bound with respect to f, subject to Sum_q f(q) = 1)
24
24 Step 2. Find the best lower bound (3/4) Differentiate the lower bound with respect to f(q) (with a Lagrange multiplier for the constraint Sum_q f(q) = 1) and set the derivative to zero; this gives f(q) = p(q | O, lambda_old)
25
25 Step 2. Find the best lower bound (4/4) Substituting f(q) = p(q | O, lambda_old) back into the lower bound gives, up to a term independent of lambda, the familiar Q function: Q(lambda, lambda_old) = Sum_q p(q | O, lambda_old) log p(O, q | lambda)
26
26 Step 3. Maximize the auxiliary function (1/3) [figure: the auxiliary function]
27
27 Step 3. Maximize the auxiliary function (2/3) [figure: the objective function]
28
28 Step 3. Maximize the auxiliary function (3/3) [figure: the objective function]
29
29 Step 2. Find the best lower bound [figure: the objective function and the auxiliary function]
30
30 Step 3. Maximize the auxiliary function [figure: the objective function]
31
31 Strong-sense Auxiliary Function G(lambda, lambda_old) is said to be a strong-sense auxiliary function for F(lambda) around lambda_old iff, for all lambda, F(lambda) - F(lambda_old) >= G(lambda, lambda_old) - G(lambda_old, lambda_old) (so any increase of the auxiliary function guarantees at least as large an increase of the objective function) [Povey et al. 2003]
32
32 Weak-sense Auxiliary Function (1/5) G(lambda, lambda_old) is said to be a weak-sense auxiliary function for F(lambda) around lambda_old iff dG(lambda, lambda_old)/dlambda at lambda = lambda_old equals dF(lambda)/dlambda at lambda = lambda_old (the two functions have the same gradient at the current parameters)
33
33 Weak-sense Auxiliary Function (2/5) [figure: the objective function and a weak-sense auxiliary function]
34
34 Weak-sense Auxiliary Function (3/5) [figure: the objective function and a weak-sense auxiliary function]
35
35 Weak-sense Auxiliary Function (4/5) [figure: the objective function]
36
36 Weak-sense Auxiliary Function (5/5) H(lambda, lambda_old) is said to be a smooth(ing) function around lambda_old iff its gradient with respect to lambda is zero at lambda = lambda_old, i.e., it has a maximum there; adding it to a weak-sense auxiliary function preserves the weak-sense property while it can – Speed up convergence – Provide more stable estimates
37
37 Smooth Function (1/2) [figure: the objective function and a smoothing function]
38
38 Smooth Function (2/2) [figure: the objective function] The resulting (smoothed) function is also a weak-sense auxiliary function
39
39 MPE: Discrimination The MPE objective function is less sensitive to portions of the training data that are poorly transcribed A (word) lattice structure can be used here to approximate the set of all possible word sequences of each training utterance – Training statistics can be efficiently computed via such a structure
40
40 MPE: Auxiliary Function (1/2) The weak-sense auxiliary function for MPE model updating can be defined as G(lambda, lambda_old) = Sum_r Sum_q gamma_q^MPE log p_lambda(O_r(q) | q), summed over the phone arcs q of each utterance's lattice – gamma_q^MPE is a scalar value (a constant) calculated for each phone arc q, and can be either positive or negative (because of the accuracy function) – The auxiliary function can therefore be decomposed into arcs with positive contributions (the so-called numerator) and arcs with negative contributions (the so-called denominator) The per-arc log-likelihoods still have the "latent variable" problem
41
41 MPE: Auxiliary Function (2/2) The auxiliary function can be modified by considering the normal (EM, strong-sense) auxiliary function for each log p_lambda(O_r(q) | q) – The smoothing term is not added yet here The key quantity (statistics value) required in MPE training is the differential of F_MPE with respect to log p_lambda(O_r(q) | q), which can be termed gamma_q^MPE
42
42 MPE: Statistics Accumulation (1/2) The objective function can be expressed as (for a specific phone arc q) F_MPE(lambda) = Sum_r [ Sum_{W: q in W} p_lambda(O_r | W) P(W) A(W, W_r) + Sum_{W: q not in W} p_lambda(O_r | W) P(W) A(W, W_r) ] / Sum_{W'} p_lambda(O_r | W') P(W') The differential can then be expressed as gamma_q^MPE = gamma_q (c(q) - c_avg)
43
43 MPE: Statistics Accumulation (2/2) In gamma_q^MPE = gamma_q (c(q) - c_avg): – c(q): the average accuracy of the sentences passing through the arc q – gamma_q: the (posterior) likelihood of the arc q – c_avg: the average accuracy of all the sentences in the word graph
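The relation gamma_q^MPE = gamma_q (c(q) - c_avg) can be checked on a toy lattice by brute force over complete paths; the paths, scores, and accuracies below are invented for illustration, and in practice these quantities come from the lattice Forward-Backward pass described on the following slides:

import math

# Toy "lattice" given as full paths: (list of arc ids, scaled log-likelihood, raw accuracy)
paths = [(["q1", "q2"], -3.0, 2.0),
         (["q1", "q3"], -3.2, 1.0),
         (["q4", "q3"], -4.0, 0.0)]

weights = [math.exp(s) for _, s, _ in paths]
z = sum(weights)
c_avg = sum(w * a for w, (_, _, a) in zip(weights, paths)) / z    # avg accuracy, all paths

arcs = {q for p, _, _ in paths for q in p}
for q in sorted(arcs):
    through = [(w, a) for w, (p, _, a) in zip(weights, paths) if q in p]
    gamma_q = sum(w for w, _ in through) / z                           # arc occupancy
    c_q = sum(w * a for w, a in through) / sum(w for w, _ in through)  # avg acc through q
    gamma_mpe = gamma_q * (c_q - c_avg)                                # MPE differential
    print(q, round(gamma_q, 3), round(c_q, 3), round(gamma_mpe, 3))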
44
44 MPE: Accuracy Function (1/4) c(q) and c_avg can be calculated approximately using the word graph and the Forward-Backward algorithm Note that the exact accuracy function is expressed as the sum of phone-level accuracies over all phones of a hypothesis – However, such an accuracy requires a full alignment between the true and all possible word sequences, which is computationally expensive
45
45 MPE: Accuracy Function (2/4) An approximated phone accuracy is defined as PhoneAcc(q) = max_z { -1 + 2 e(q, z) if z and q are the same phone; -1 + e(q, z) otherwise } – e(q, z): the ratio of the portion of the reference phone z that is overlapped by q 1. Assume the true word sequence has no pronunciation variation 2. Phone accuracy can be obtained by a simple local search 3. Context-independent phones can be used for accuracy calculation
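A sketch of this overlap-based approximation; the reference segmentation and time spans (in frames) are invented for illustration, and the local search simply keeps the best-scoring reference phone:

# Approximate phone accuracy via time overlap, toy reference segmentation
ref = [("sil", 0, 10), ("w", 10, 30), ("ih", 30, 45)]    # (phone, start, end) in frames

def phone_accuracy(hyp_phone, hyp_start, hyp_end, ref):
    best = float("-inf")
    for z, zs, ze in ref:
        overlap = max(0, min(hyp_end, ze) - max(hyp_start, zs))
        e = overlap / float(ze - zs)            # fraction of z overlapped by the hypothesis
        score = -1 + 2 * e if z == hyp_phone else -1 + e
        best = max(best, score)                 # simple local search over reference phones
    return best

print(phone_accuracy("w", 12, 28, ref))         # correct label, large overlap -> near +1
print(phone_accuracy("eh", 12, 28, ref))        # wrong label -> at most 0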
46
46 MPE: Accuracy Function (3/4) Forward-Backward algorithm for statistics calculation – Use the "phone graph" as the vehicle Forward pass: for each phone arc q with start time 0 ... end; for t = 1 to T-1: for each phone arc q with start time t: for each phone arc r with end time t-1 that can connect to q ... end; for each phone arc r with end time t-1 that can connect to q ... end; end; end
47
47 MPE: Accuracy Function (4/4) Backward pass: for each phone arc q with end time T-1 ... end; for t = T-2 down to 0: for each phone arc q with end time t: for each phone arc r with start time t+1 that q can connect to ... end; for each phone arc r with start time t+1 that q can connect to ... end; end; end Finally: for each phone arc q ... end
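Since the update equations on these two slides appeared only as figures, here is a rough Python sketch (not the slides' exact algorithm) of such a Forward-Backward pass over a tiny phone graph: it propagates path likelihoods (alpha, beta) and running average accuracies, then combines them into the arc occupancy gamma_q, the average accuracy c(q) of the sentences through q, and c_avg. The arc names, likelihoods, and accuracies are invented, and a real implementation would work in the log domain:

# Toy phone graph: arcs with start/end times, linear-domain (already scaled)
# likelihoods, and approximate phone accuracies PhoneAcc(q).
arcs = {"a": dict(t0=0, t1=1, like=0.6, acc=1.0),
        "b": dict(t0=0, t1=1, like=0.4, acc=0.0),
        "c": dict(t0=1, t1=2, like=0.7, acc=1.0),
        "d": dict(t0=1, t1=2, like=0.3, acc=-0.5)}
pred = {q: [r for r in arcs if arcs[r]["t1"] == arcs[q]["t0"]] for q in arcs}
succ = {q: [s for s in arcs if arcs[s]["t0"] == arcs[q]["t1"]] for q in arcs}

alpha, aacc, beta, bacc = {}, {}, {}, {}
for q in sorted(arcs, key=lambda q: arcs[q]["t0"]):          # forward pass
    if not pred[q]:
        alpha[q], aacc[q] = arcs[q]["like"], arcs[q]["acc"]
    else:
        tot = sum(alpha[r] for r in pred[q])
        alpha[q] = arcs[q]["like"] * tot
        aacc[q] = sum(alpha[r] * aacc[r] for r in pred[q]) / tot + arcs[q]["acc"]
for q in sorted(arcs, key=lambda q: -arcs[q]["t1"]):         # backward pass
    if not succ[q]:
        beta[q], bacc[q] = 1.0, 0.0
    else:
        tot = sum(arcs[s]["like"] * beta[s] for s in succ[q])
        beta[q] = tot
        bacc[q] = sum(arcs[s]["like"] * beta[s] * (bacc[s] + arcs[s]["acc"])
                      for s in succ[q]) / tot

finals = [q for q in arcs if not succ[q]]
Z = sum(alpha[f] for f in finals)                            # total graph likelihood
c_avg = sum(alpha[f] * aacc[f] for f in finals) / Z          # avg accuracy of all sentences
for q in arcs:
    gamma = alpha[q] * beta[q] / Z                           # arc occupancy gamma_q
    c_q = aacc[q] + bacc[q]                                  # avg accuracy through q
    print(q, round(gamma, 3), round(c_q, 3), round(gamma * (c_q - c_avg), 3))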
48
48 MPE: Smoothing Function The smoothing function can be defined, for each Gaussian m, as D_m times the ML (EM) auxiliary function of that Gaussian evaluated on statistics generated by the old parameters – The old model parameters (mu_m, sigma_m^2) are used here as the hyper-parameters – It has its maximum value at lambda = lambda_old
49
49 MPE: Final Auxiliary Function (1/2) [figure: the final auxiliary function combines the weak-sense auxiliary function, the strong-sense (EM) auxiliary function, and the smoothing function, and remains a weak-sense auxiliary function]
50
50 MPE: Final Auxiliary Function (2/2) [derivation outline: weak-sense auxiliary function → strong-sense auxiliary function applied to each term → still weak-sense → add the smoothing function]
51
51 MPE: Model Update (1/2) Based on the final auxiliary function, we have the following update formulas (diagonal covariance matrices; theta(O^2) denotes the accumulated second-order, correlation-type statistics): mu_m_new = [ theta_m^num(O) - theta_m^den(O) + D_m mu_m_old ] / [ gamma_m^num - gamma_m^den + D_m ] and sigma_m^2_new = [ theta_m^num(O^2) - theta_m^den(O^2) + D_m (sigma_m^2_old + mu_m_old^2) ] / [ gamma_m^num - gamma_m^den + D_m ] - mu_m_new^2
52
52 MPE: Model Update (2/2) Two sets of statistics (numerator and denominator) are accumulated separately: the numerator statistics (gamma^num, theta^num(O), theta^num(O^2)) from the arcs with positive gamma_q^MPE, and the denominator statistics from the arcs with negative gamma_q^MPE, each weighted by |gamma_q^MPE| and by the within-arc Gaussian occupancies
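A hedged sketch of the extended-Baum-Welch-style update that these formulas describe, for a single one-dimensional Gaussian with diagonal covariance; the accumulated statistics and the constant D below are invented numbers, and real implementations accumulate them over all arcs and frames:

# MPE/EBW update for one Gaussian dimension (diagonal covariance case).
# Statistics: occupancy, sum of observations, sum of squared observations.
num = dict(gamma=20.0, x=10.0, x2=30.0)    # from arcs with positive gamma_q^MPE
den = dict(gamma=15.0, x=12.0, x2=40.0)    # from arcs with negative gamma_q^MPE
mu_old, var_old, D = 0.4, 1.5, 25.0        # current parameters and smoothing constant

denom = num["gamma"] - den["gamma"] + D
mu_new = (num["x"] - den["x"] + D * mu_old) / denom
var_new = ((num["x2"] - den["x2"] + D * (var_old + mu_old ** 2)) / denom
           - mu_new ** 2)
print(round(mu_new, 4), round(var_new, 4))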
53
53 MPE: Setting Constants (1/2) The mean and variance update formulas rely on the proper setting of the per-Gaussian smoothing constant D_m – If D_m is too large, the step size is small and convergence is slow – If D_m is too small, the algorithm may become unstable – D_m also needs to be large enough to make all the updated variances positive [figure: curves A, B, C illustrating the effect of different D values]
54
54 MPE: Setting Constants (2/2) Previous work [Povey 2004] used a value of that was twice the minimum positive value needed to insure all variance updates were positive
55
55 MPE: I-Smoothing I-smoothing increases the weight of the numerator counts depending on the amount of data available for each Gaussian This is done by multiplying the numerator terms (gamma_m^num, theta_m^num(O), theta_m^num(O^2)) in the update formulas by (gamma_m^num + tau) / gamma_m^num – tau can be set empirically – This emphasizes the positive contributions (arcs with higher accuracy)
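A sketch of the I-smoothing step under the reconstruction above: the numerator statistics are scaled by (gamma_num + tau)/gamma_num before the update, which pulls Gaussians with little data toward their numerator-only (ML-like) estimates. The tau value and the statistics are illustrative only:

def i_smooth(num, tau):
    # multiply numerator occupancy and sums by (gamma_num + tau) / gamma_num
    scale = (num["gamma"] + tau) / num["gamma"]
    return {k: v * scale for k, v in num.items()}

num = dict(gamma=8.0, x=3.2, x2=2.5)       # a Gaussian with little training data
print(i_smooth(num, tau=50.0))             # counts boosted before the EBW update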
56
56 Preliminary Experimental Results (1/3) Experiments conducted on the MATBN (TV broadcast news) corpus (field reporters) – Training: 34,672 utterances (25.5 hrs) – Testing: 292 utterances (1.45 hrs, outside testing) – Metric: Chinese character error rate (CER)

CER (%)          ML_itr10   MPE_itr10   ML_itr150   MPE_itr10
MFCC(CMS)        29.48      26.52       28.72       24.44
MFCC(CN)         26.60      25.05       26.96       24.54
HLDA+MLLT(CN)    23.64      20.92       23.78       20.79

12.83% relative improvement; 11.51% relative improvement
57
57 Preliminary Experimental Results (2/3) Another experiment conducted on the MATBN (TV broadcast news) corpus (field reporters) – Discriminative feature: HLDA+MLLT(CN) – Discriminative training: MPE – Total of 34,672 utterances (24.5 hrs), with 10-fold cross validation A relative improvement of 26.2% was finally achieved (HLDA+MLLT(CN) + MPE)
58
58 Preliminary Experimental Results (3/3) Corpus segmentation conducted on the TIMIT corpus – 50 phone categories – Training: 4,546 utterances (3.8 hrs) – Testing: 1,646 utterances (1.4 hrs) Minimum Error Length (MEL) [figure: lattice generated by forced alignment, with competing phone hypotheses such as w, eh, ih, dh, sil]
61
61 Conclusions & Future Work MPE/MWE (or MMI) based discriminative training approaches have shown their effectiveness in Chinese continuous speech recognition Future work: – Joint training of feature transformation, acoustic models, and language models – Unsupervised training – More in-depth investigation and analysis – Exploration of alternative accuracy/error functions
62
62 References D. Povey, P. C. Woodland, M. J. F. Gales, "Discriminative MAP for Acoustic Model Adaptation," in Proc. ICASSP 2003. R. Schlüter, W. Macherey, B. Muller, H. Ney, "Comparison of Discriminative Training Criteria and Optimization Methods for Speech Recognition," Speech Communication 34, 2001. K. Vertanen, "An Overview of Discriminative Training for Speech Recognition." V. Goel, S. Kumar, W. Byrne, "Segmental Minimum Bayes-Risk Decoding for Automatic Speech Recognition," IEEE Transactions on Speech and Audio Processing, May 2004. J.-W. Kuo, "An Initial Study on Minimum Phone Error Discriminative Learning of Acoustic Models for Mandarin Large Vocabulary Continuous Speech Recognition," Master's Thesis, NTNU, 2005. J.-W. Kuo, B. Chen, "Minimum Word Error Based Discriminative Training of Language Models," in Proc. Eurospeech 2005.
63
63 References L. Mangu, E. Brill, A. Stolcke, "Finding Consensus in Speech Recognition: Word Error Minimization and Other Applications of Confusion Networks," Computer Speech and Language 14, 2000. R. O. Duda, P. E. Hart, D. G. Stork, Pattern Classification, Second Edition. New York: John Wiley & Sons, 2000. R. Schlüter, "Investigations on Discriminative Training Criteria," Ph.D. Dissertation, RWTH Aachen University of Technology, September 2000. L. R. Bahl, P. F. Brown, P. V. de Souza, R. L. Mercer, "Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition," in Proc. ICASSP 1986. W. Chou, "Discriminant-Function-Based Minimum Recognition Error Rate Pattern-Recognition Approach to Speech Recognition," Proceedings of the IEEE, Vol. 88, No. 8, 2000. T. Minka, "Expectation-Maximization as Lower Bound Maximization," 1998. http://research.microsoft.com/~minka/papers/em.html
64
Thank You !
65
65 Appendix: MMI vs. MPE CER (%): ML 27.88, MMI 25.62, MPE 24.51 CER relative reduction: MMI 8%, MPE 12% [bar chart]