Bayes Risk Minimization using Metric Loss Functions R. Schlüter, T. Scharrenbach, V. Steinbiss, H. Ney. Presented by Fang-Hui Chu

2 Outline
Introduction
Bayes Decision Theory
Analysis of Bayes Decision Rule with General Loss Functions
– Loss-Independence of the Bayes Decision Rule for Large Posterior Probabilities
– Dominance of Maximum Posterior Probability
– Upper Bounds of Risk Difference
– Minimum Risk with Zero Posterior Probability
– Pruning and General Loss Functions
Experiments
Conclusion

3 Introduction (1/3)
In speech recognition, the standard evaluation measure is the word error rate (WER).
The standard decision rule, on the other hand, is realized with a sentence-error-based cost function in the Bayes decision rule.
Due to the complexity of the Levenshtein alignment, using the number of word errors as the cost function of the Bayes decision rule was prohibitive for a long time.
With the increase in computing power and with algorithmic improvements, word-error-minimizing decision rules have become practical.

4 Introduction (2/3)
A number of approaches have been presented which investigate efficient approximate realizations of word-error-minimizing Bayes decision rules, at different levels:
– the search space [1]
– the summation space for the expected loss calculation [2]
– the loss function [3]
– [Eurospeech 97] A. Stolcke: "Explicit Word Error Rate Minimization in N-Best List Rescoring" [1][2]
– [Computer Speech and Language 2000] V. Goel: "Minimum Bayes Risk Automatic Speech Recognition" [1][2]
– [Eurospeech 99] L. Mangu: "Finding Consensus Among Words: Lattice-Based Word Error Minimization" [1][2]
– [ICASSP 01] F. Wessel: "Explicit Word Error Minimization using Word Hypothesis Posterior Probabilities" [3]

5 Introduction (3/3)
Word-error-minimizing Bayes decision rules are often called "Minimum Bayes Risk" decision rules, which is somewhat misleading:
– the important difference from the standard approach lies in the cost function used.
This work presents a number of properties of the Bayes decision rule, mainly concerning the relation between using a 0-1 loss function (e.g. sentence errors) and a general loss function (such as phoneme/character/word errors) in speech recognition and machine translation.

6 Bayes Decision Theory (1/2)
In its general form, i.e. using a general loss function L, the Bayes decision rule selects the class minimizing the expected loss:
    c_L(x) = argmin_c R_L(c | x)
with the expected loss, or Bayes risk, for class c:
    R_L(c | x) = Σ_{c'} P(c' | x) · L(c, c')

7 Bayes Decision Theory (2/2)
In particular, using the 0-1 loss function, the Bayes decision rule reduces to finding the class which maximizes the class posterior probability:
    c_max(x) = argmax_c P(c | x)
For simplicity, the observation (x-) dependence of the posterior probability is dropped in the following:
– c_max denotes the class which maximizes the class posterior probability
– c_L denotes the class which minimizes the Bayes risk for loss function L
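To make the two decision rules concrete, here is a minimal sketch (not code from the paper) of Bayes risk minimization over an explicit hypothesis set such as an N-best list; the hypothesis strings, posterior values, and function names are illustrative assumptions.

```python
# Minimal sketch of the two decision rules over an explicit hypothesis set
# (e.g. an N-best list). Hypotheses, posteriors and the loss are illustrative
# assumptions, not taken from the paper.

def bayes_risk(c, posteriors, loss):
    """Expected loss (Bayes risk) of deciding for class c."""
    return sum(p * loss(c, c_prime) for c_prime, p in posteriors.items())

def mbr_decision(posteriors, loss):
    """c_L: the class minimizing the Bayes risk for the given loss function."""
    return min(posteriors, key=lambda c: bayes_risk(c, posteriors, loss))

def map_decision(posteriors):
    """c_max: the class maximizing the posterior (Bayes rule with 0-1 loss)."""
    return max(posteriors, key=posteriors.get)

if __name__ == "__main__":
    posteriors = {"a b c": 0.4, "a b d": 0.35, "a e d": 0.25}
    zero_one = lambda c1, c2: 0.0 if c1 == c2 else 1.0
    # With the 0-1 loss, minimizing the Bayes risk is the same as maximizing
    # the posterior probability.
    assert mbr_decision(posteriors, zero_one) == map_decision(posteriors)
```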

8 Analysis of Bayes Decision Rule with General Loss Functions
A metric loss function is non-negative, symmetric, and fulfils the triangle inequality; it is zero if and only if both arguments are equal.
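An example of such a metric loss is the word-level Levenshtein (edit) distance used for counting word errors. A minimal sketch of the standard dynamic-programming computation (not code from the paper):

```python
def levenshtein(ref, hyp):
    """Word-level edit distance: non-negative, symmetric, zero iff the word
    sequences are equal, and it satisfies the triangle inequality."""
    ref, hyp = ref.split(), hyp.split()
    d = list(range(len(hyp) + 1))            # distances for the empty ref prefix
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i                 # prev holds d[i-1][j-1]
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,            # deletion
                                   d[j - 1] + 1,        # insertion
                                   prev + (r != h))     # substitution / match
    return d[len(hyp)]

assert levenshtein("d e d", "d g d") == 1
assert levenshtein("d e d b", "d g d") == 2
```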

9 Loss-Independence of the Bayes Decision Rule for Large Posterior Probabilities
Assume a maximum posterior probability of at least ½ and a metric loss. Then the posterior-maximizing class c_max also minimizes the Bayes risk.
Proof: consider the difference between the Bayes risk for any class c and the Bayes risk for class c_max; by the triangle inequality this difference is ≥ 0 (see the reconstruction below).
A broad estimate can be calculated which shows where word error minimization can be expected to result in different decisions than sentence error minimization, i.e. posterior probability maximization.
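The derivation itself appears only as an image in the slides; the following is a reconstruction of the standard argument under the assumption P(c_max | x) ≥ ½ implied by the slide title and the next slide, not a verbatim copy of the paper's equations.

```latex
% Reconstruction of the omitted derivation (standard argument); assumes
% P(c_max | x) >= 1/2 and a metric (symmetric, triangle-inequality) loss L.
\begin{align*}
R_L(c \mid x) - R_L(c_{\max} \mid x)
  &= \sum_{c'} P(c' \mid x)\,\bigl[ L(c, c') - L(c_{\max}, c') \bigr] \\
  &= P(c_{\max} \mid x)\, L(c, c_{\max})
     + \sum_{c' \neq c_{\max}} P(c' \mid x)\,\bigl[ L(c, c') - L(c_{\max}, c') \bigr] \\
  &\geq P(c_{\max} \mid x)\, L(c, c_{\max})
     - \sum_{c' \neq c_{\max}} P(c' \mid x)\, L(c, c_{\max})
     \qquad \text{(triangle inequality)} \\
  &= \bigl( 2\, P(c_{\max} \mid x) - 1 \bigr)\, L(c, c_{\max}) \;\geq\; 0
     \quad \text{if } P(c_{\max} \mid x) \geq \tfrac{1}{2}.
\end{align*}
```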

10 Dominance of Maximum Posterior Probability (1/4)
Now assume the maximum posterior probability is less than ½ and the loss function is a metric loss.
Then maximizing the posterior probability also minimizes the Bayes risk if a set C of classes can be found for which the following requirements are met:
As a special case, condition (3) is fulfilled if the following inequality is fulfilled:

11 Dominance of Maximum Posterior Probability (2/4)
Proof: consider again the difference between the Bayes risk for any class c and the Bayes risk for class c_max; applying the triangle inequality together with conditions (2) and (3) shows that this difference is ≥ 0 (see the illustration on the next slide).

12 Dominance of Maximum Posterior Probability (3/4)
[Figure: illustration of the set of classes C, the posterior-maximizing class c_max, and classes c' and c'']

13 Dominance of Maximum Posterior Probability (4/4)
Therefore, the efficiency of Bayes risk minimization using a general (non-0-1) loss function can be improved by using Ineq. (1), and Ineq. (4) together with Ineq. (5), to shortlist those test samples for which a difference to the 0-1 loss function can be expected.

14 Upper Bounds of Risk Difference (1/5)
In the case of a maximum posterior probability and a metric loss, the following upper bound (Ineq. (6)) can be derived for the difference between the Bayes risk for class c_max and the Bayes risk for any class c:

15 Upper Bounds of Risk Difference (2/5)
Replacing c by the class c_L which minimizes the Bayes risk in Ineq. (6) yields Ineq. (7).
For the following specific choice, Ineq. (7) can be shown to be tight (to be proved later).

16 Upper Bounds of Risk Difference (3/5)
Another upper bound can be found using the following definition of the set of classes, with:

17 Upper Bounds of Risk Difference (4/5)
For one case the following inequality can be derived; for the other case we can apply Ineq. (6) to obtain the same inequality (derivation on the next slide).

18

19 Upper Bounds of Risk Difference (5/5)
Bounds for the difference between the Bayes risk of the posterior-maximizing class and the minimum Bayes risk have been derived; they can serve as cost estimates for Bayes risk minimization approaches.

20 Minimum Risk with Zero Posterior Probability
The Bayes risk for the four strings:
Provided all posterior probabilities are less than ½, i.e. p_i < ½, the loss for the string "ded" will be minimal even though its posterior probability is zero.
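The slide's own strings and posterior values are not reproduced in this transcript; the following sketch therefore uses made-up hypotheses to show the same effect, namely that a word sequence with zero posterior probability can attain the minimum Bayes risk under the Levenshtein loss. It reuses bayes_risk() and levenshtein() from the sketches above.

```python
# Made-up example (not the slide's): three hypotheses, each with posterior
# 1/3 < 1/2; the unseen string "a b c" combines the per-position majorities.
posteriors = {"a b z": 1/3, "a y c": 1/3, "x b c": 1/3}

for cand in list(posteriors) + ["a b c"]:          # "a b c" has posterior 0
    print(cand, bayes_risk(cand, posteriors, levenshtein))

# Every hypothesis in the list has risk 4/3, while the zero-posterior string
# "a b c" has risk 1.0 and therefore minimizes the Bayes risk.
```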

21 Pruning and General Loss Functions
In the word error minimization approach (which uses A* search and cost estimates based on partial hypotheses), the search space and the summation space for the expected loss calculation are the same.
If we prune the worst hypothesis (with respect to the Bayes risk), "dgd", at some stage of the word error minimization search, then the result would be "dedb" instead (example posterior probabilities: 2/9, 3/9, 4/9).
This shows that pruning not only reduces the search space but can also alter the decision for the remaining search space.
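The strings "ded", "dedb" and "dgd" come from the slide, but the assignment of the posterior values cannot be recovered from this transcript; with made-up posteriors the pruning effect described above can be reproduced (reusing mbr_decision() and levenshtein() from the sketches above).

```python
# Strings from the slide; the posterior values are made-up assumptions.
full = {"d e d": 0.3, "d e d b": 0.4, "d g d": 0.3}

print(mbr_decision(full, levenshtein))          # -> "d e d"

# "d g d" has the largest Bayes risk; prune it from the (shared) search and
# summation space and the decision changes:
pruned = {h: p for h, p in full.items() if h != "d g d"}
print(mbr_decision(pruned, levenshtein))        # -> "d e d b"
```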

22 Experimental Setup (1/3)
Speech recognition experiments are performed on the WSJ0 corpus with a vocabulary of 5k words.
The baseline recognition system:
– uses 1500 generalized triphone states plus one silence state
– uses Gaussian mixture distributions with a total of about 147k densities and a single pooled variance vector
– 33-dimensional observation vectors are obtained by LDA applied to 5 consecutive vectors of 12 MFCCs, their first derivatives, and the second derivative of the energy, at a 10 ms frame shift
Training corpus: 15 h of speech; a trigram language model was used.
The baseline WER is 3.97%.

23 Experiments (2/3)
The experiments for word error minimization were performed using N-best lists [5] with N = , chosen to ensure proper normalization of the posterior probabilities.
The search for the minimum Bayes risk using the Levenshtein loss function was always started by calculating the risk of the posterior-maximizing word sequence first, which served as an initial risk pruning threshold.
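A minimal sketch of this procedure (the early-abort detail is an assumption about how the initial threshold is used, not the paper's implementation): the risk of the posterior-maximizing hypothesis is computed first, and the risk summation for any other candidate is abandoned as soon as it exceeds the current best risk, which is sound because metric loss terms are non-negative.

```python
def nbest_mbr(posteriors, loss):
    """MBR rescoring of an N-best list with an initial risk pruning threshold.
    posteriors: dict mapping hypothesis -> (normalized) posterior probability.
    loss: e.g. the levenshtein() sketch above."""
    def risk(c, bound=float("inf")):
        r = 0.0
        for h, p in posteriors.items():
            r += p * loss(c, h)
            if r >= bound:              # partial sums only grow: abandon early
                return float("inf")
        return r

    c_max = max(posteriors, key=posteriors.get)
    best, best_risk = c_max, risk(c_max)        # initial pruning threshold
    for c in posteriors:
        r = risk(c, best_risk)
        if r < best_risk:
            best, best_risk = c, r
    return best, best_risk
```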

24 Experiments (3/3)

25 Conclusion
In this work, fundamental properties of the Bayes decision rule using general loss functions are derived analytically and verified experimentally for automatic speech recognition.