Discriminative Training and Acoustic Modeling for Automatic Speech Recognition - Chap. 4 Discriminative Training. Wolfgang Macherey. Dissertation approved by the Faculty of Mathematics, Computer Science and Natural Sciences of RWTH Aachen University for the degree of Doctor of Natural Sciences. Presented by Yueng-Tien, Lo

Outline
A General View of Discriminative Training Criteria
– Extended Unifying Approach
– Smoothed Error Minimizing Training Criteria

A General View of Discriminative Training Criteria A unifying approach for a class of discriminative training criteria was presented that allows for optimizing several objective functions – among them the Maximum Mutual Information criterion and the Minimum Classification Error criterion – within a single framework. The approach is extended such that it also captures other criteria proposed more recently.

A General View of Discriminative Training Criteria A class of discriminative training criteria can then be defined as sketched below. G(W, W_r) is a gain function that allows for rating sentence hypotheses W based on the spoken word string W_r; G reflects an error metric such as the Levenshtein distance or the sentence error.
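The defining equation did not survive the transcript. A plausible reconstruction, following the unifying-approach literature and the definitions on this slide (the exponent γ and the set symbol M_r are assumed notation, not copied from the original), is:

F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} f\!\left( \sum_{W \in \mathcal{M}_r} \left[ \frac{p_\Lambda(X_r, W)^{\gamma}}{\sum_{V \in \mathcal{M}_r} p_\Lambda(X_r, V)^{\gamma}} \right] G(W, W_r) \right)

where f is a smoothing function, M_r is the set of word sequences considered for discrimination in utterance r, and the bracketed fraction is a scaled sentence posterior.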

A General View of Discriminative Training Criteria The fraction inside the brackets is called the local discriminative criterion.

Extended Unifying Approach The choice of alternative word sequences contained in the set M_r, together with the smoothing function f, the weighting exponent, and the gain function G, determines the particular criterion in use.

Extended Unifying Approach
– Maximum Likelihood (ML) Criterion
– Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion
– Diversity Index
– Jeffreys' Criterion
– Chernoff Affinity

Maximum Likelihood (ML) Criterion Although not a discriminative criterion in a strict sense, the ML objective function is contained as a special case in the extended unifying approach. The ML estimator is consistent, which means that for any increasing and representative set of training samples the estimated parameters converge toward the true model parameters.

Maximum Likelihood (ML) Criterion However, for automatic speech recognition, the model assumptions are typically not correct, and therefore the ML estimator will return the true parameters of a wrong model in the limiting case of an infinite amount of training data In contrast to discriminative training criteria, which concentrate on enhancing class separability by taking class extraneous data into account, the ML estimator optimizes each class region individually

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion With a suitable choice of the smoothing and gain functions, the unifying approach yields the MMI criterion, which is defined as the sum over the logarithms of the class posterior probabilities of the spoken word sequences W_r for each training utterance r, given the corresponding acoustic observations X_r:
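The formula did not survive the transcript; the verbal description corresponds to the standard MMI criterion (a sketch, restricted to the hypothesis set M_r used for discrimination):

F_{\text{MMI}}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p(W_r)\, p_\Lambda(X_r \mid W_r)}{\sum_{W \in \mathcal{M}_r} p(W)\, p_\Lambda(X_r \mid W)}

The fraction is the class posterior p_Λ(W_r | X_r) of the spoken word sequence given the acoustic observations.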

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion The MMI criterion is equivalent to the Shannon Entropy which, in the case of given class priors, is also known as the Equivocation.

Maximum Mutual Information (MMI) and Corrective Training (CT) Criterion An approximation to the MMI criterion is given by the Corrective Training (CT) criterion. Here, the sum over all competing word sequences in the denominator is replaced with the best recognized sentence hypothesis.
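A sketch of the CT criterion under that description, writing Ŵ_r for the best recognized sentence hypothesis of utterance r (the exact form in the thesis may differ):

F_{\text{CT}}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \log \frac{p(W_r)\, p_\Lambda(X_r \mid W_r)}{p(\hat{W}_r)\, p_\Lambda(X_r \mid \hat{W}_r)}, \qquad \hat{W}_r = \arg\max_{W}\; p(W)\, p_\Lambda(X_r \mid W)

Under this sketch, correctly recognized utterances contribute zero, so training effectively concentrates on misrecognized utterances.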

Diversity Index The Diversity Index of degree α measures the divergence of a probability distribution from the uniform distribution. While a diversity index close to its maximum at 0 indicates a large divergence from the uniform distribution, smaller values indicate that all classes tend to be nearly equally likely.

Diversity Index The Diversity Index applies a weighting function parameterized by the degree α, which results in the following expression for the discriminative training criterion:
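The weighting function and the resulting criterion were lost with the slide images. A reconstruction that is consistent with the limits quoted on the next slide (Shannon Entropy as α approaches 0, Gini Index at α = 1), writing α for the degree parameter, would be:

f_\alpha(x) = \frac{x^{\alpha} - 1}{\alpha}, \qquad F_\alpha(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} f_\alpha\big( p_\Lambda(W_r \mid X_r) \big)

Treat this as a sketch only; the thesis may scale or offset the weighting function differently.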

Diversity Index Two well-known diversity indices, the Shannon Entropy (which is equivalent to the MMI criterion for constant class priors) and the Gini Index, are special cases of the Diversity Index. The Shannon Entropy results from the continuous limit as α approaches 0, while the Gini Index follows from setting α = 1:
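In terms of the weighting function sketched above, the two special cases read:

\lim_{\alpha \to 0} \frac{x^{\alpha} - 1}{\alpha} = \log x \quad \text{(Shannon Entropy, i.e. the MMI case)}, \qquad f_1(x) = x - 1 \quad \text{(Gini Index case)}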

Diversity Index The MMI criterion is equivalent to the Shannon Entropy which, in the case of given class priors, is also known as the Equivocation.

Jeffreys’ Criterion The Jeffreys criterion, which is also known as Jeffreys’ divergence, is closely related to the Kullback-Leibler distance [Kullback & Leibler 51] and was first proposed in [Jeffreys 46]. The smoothing function is not lower-bounded, which means that the increase in the objective function will be large if the parameters are estimated such that no training utterance has a vanishingly small posterior probability.

Chernoff Affinity The Chernoff Affinity was suggested as a generalization of Bhattacharyya’s measure of affinity. It employs a smoothing function with a free parameter, which leads to the following training criterion:
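The smoothing function itself was lost in the transcript. A sketch that matches the verbal description, writing s for the free parameter with s in (0, 1) (s = 1/2 would recover a Bhattacharyya-style affinity between the model posterior and the empirical distribution), could be:

f_s(x) = x^{s}, \qquad F_s(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \big( p_\Lambda(W_r \mid X_r) \big)^{s}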


Smoothed Error Minimizing Training Criteria Error minimizing training criteria such as the MCE, the MWE, and the MPE criterion aim at minimizing the expectation of an error-related loss function on the training data. Let L denote any such loss function. Then the objective is to determine a parameter set that minimizes the total costs due to classification errors:
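The objective after the colon was lost. A sketch of what it plausibly looks like, with the Bayes decision rule embedded in the loss (Ŵ_Λ denotes the recognized hypothesis; the notation is ours, not the thesis'):

\hat{\Lambda} = \arg\min_{\Lambda} \sum_{r=1}^{R} L\big(W_r,\, \hat{W}_\Lambda(X_r)\big), \qquad \hat{W}_\Lambda(X_r) = \arg\max_{W}\; p_\Lambda(W \mid X_r)

The embedded decision rule, an argmin/argmax over word sequences, is what blocks gradient computation, as the next slide notes.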

The optimization problem
– The objective function includes an “argmin” operation, which prevents the computation of a gradient.
– The objective function has many local optima; an optimization algorithm must handle this.
– The loss function L itself is typically a non-continuous step function and therefore not differentiable.

Smoothed Error Minimizing Training Criteria A remedy to make this class of error-minimizing training criteria amenable to gradient-based optimization methods is to replace Eq. (4.17) with the expression sketched below. Discriminative criteria like the MCE, the MWE, and the MPE criterion differ only with respect to the choice of the loss function L.
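The replacement expression is missing from the transcript; the usual smoothed form, which weights the loss of every competing hypothesis by its posterior probability, is assumed here:

F(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \sum_{W \in \mathcal{M}_r} p_\Lambda(W \mid X_r)\; L(W, W_r)

This expected loss is differentiable in Λ because the hard decision rule has been replaced by posterior weights.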

Smoothed Error Minimizing Training Criteria While the MCE criterion typically applies a smoothed sentence error loss function, both the MWE and the MPE criterion are based on approximations of the word or phoneme error rate.

Smoothed Error Minimizing Training Criteria
– Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion
– Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion
– Minimum Squared Error (MSE) Criterion

Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion The MCE criterion aims at minimizing the expectation of a smoothed sentence error on training data. According to Bayes’ decision rule, the probability of making a classification error in utterance r is given by:
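The expression after the colon was lost; in the framework above it is presumably the local error probability, i.e. one minus the sentence posterior of the spoken word sequence:

E_r(\Lambda) = 1 - p_\Lambda(W_r \mid X_r) = 1 - \frac{p(W_r)\, p_\Lambda(X_r \mid W_r)}{\sum_{W \in \mathcal{M}_r} p(W)\, p_\Lambda(X_r \mid W)}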

Minimum Classification Error (MCE) and Falsifying Training (FT) Criterion Smoothing the local error probability with a sigmoid function and carrying out the sum over all training utterances yields the MCE criterion. Similar to the CT criterion, the Falsifying Training (FT) criterion derives from the MCE criterion in the limiting case of an infinite smoothing parameter.
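A sketch of the resulting criterion in a standard MCE parameterization (the slope ϱ and the exponent γ are assumed symbols; the thesis' exact form may differ):

F_{\text{MCE}}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \left[ 1 + \left( \frac{p(W_r)\, p_\Lambda(X_r \mid W_r)^{\gamma}}{\sum_{W \in \mathcal{M}_r \setminus \{W_r\}} p(W)\, p_\Lambda(X_r \mid W)^{\gamma}} \right)^{2\varrho} \right]^{-1}

The outer bracket implements the sigmoid smoothing of the local error; Falsifying Training corresponds to the limiting case mentioned on the slide, where effectively only the strongest competing hypothesis remains relevant, mirroring the relation between MMI and CT.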

Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion The objective of the Minimum Word Error (MWE) criterion, as well as of its closely related Minimum Phone Error (MPE) criterion, is to maximize the expectation of an approximation to the word or phoneme accuracy on the training data. After an efficient lattice-based training scheme was found and successfully implemented in [Povey & Woodland 02], these criteria have received increasing interest.

Minimum Word Error (MWE) and Minimum Phone Error (MPE) Criterion Both criteria compute the average transcription accuracy over all sentence hypotheses considered for discrimination. The weight of each hypothesis is defined as its posterior probability, scaled by a factor in the log-space (see the sketch below).
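Both formulas were lost with the slide images. A sketch consistent with the description, writing A(W, W_r) for the approximate word or phoneme accuracy of hypothesis W against the reference W_r and γ for the log-space scaling factor (our notation):

F_{\text{MWE/MPE}}(\Lambda) = \frac{1}{R} \sum_{r=1}^{R} \sum_{W \in \mathcal{M}_r} p_{\Lambda,\gamma}(W \mid X_r)\; A(W, W_r), \qquad p_{\Lambda,\gamma}(W \mid X_r) = \frac{\big( p(W)\, p_\Lambda(X_r \mid W) \big)^{\gamma}}{\sum_{V \in \mathcal{M}_r} \big( p(V)\, p_\Lambda(X_r \mid V) \big)^{\gamma}}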

A class of discriminative training criteria covered by the extended unifying approach.