Bayes Risk Minimization using Metric Loss Functions
R. Schlüter, T. Scharrenbach, V. Steinbiss, H. Ney
Presented by Fang-Hui Chu
Outline
Introduction
Bayes Decision Theory
Analysis of Bayes Decision Rule with General Loss Functions
–Loss-Independence of the Bayes Decision Rule for Large Posterior Probabilities
–Dominance of Maximum Posterior Probability
–Upper Bounds of Risk Difference
–Minimum Risk with Zero Posterior Probability
–Pruning and General Loss Functions
Experiments
Conclusion
Introduction (1/3)
In speech recognition, the standard evaluation measure is the word error rate (WER).
The standard decision rule, on the other hand, is realized using a sentence-error-based cost function in the Bayes decision rule.
Due to the complexity of the Levenshtein alignment, using the number of word errors as the cost function of the Bayes decision rule was prohibitive for a long time.
With the increase in computing power and with algorithmic improvements, word error minimizing decision rules have become more realistic.
Introduction (2/3)
A number of approaches have been presented that investigate efficient approximate realizations of word error minimizing Bayes decision rules.
They differ in the level at which the approximation is made: the search space [1], the summation space for the expected loss calculation [2], or the loss function [3].
–[Eurospeech 97] A. Stolcke: "Explicit Word Error Minimization in N-Best List Rescoring" [1][2]
–[Computer Speech and Language 2000] V. Goel: "Minimum Bayes-Risk Automatic Speech Recognition" [1][2]
–[Eurospeech 99] L. Mangu: "Finding Consensus Among Words: Lattice-Based Word Error Minimization" [1][2]
–[ICASSP 01] F. Wessel: "Explicit Word Error Minimization Using Word Hypothesis Posterior Probabilities" [3]
Introduction (3/3)
Word error minimizing Bayes decision rules are often called "Minimum Bayes Risk", which is somewhat misleading.
–The important difference from the standard approach lies in the cost function used.
This work presents a number of properties of the Bayes decision rule.
–Mainly concerning the relation between a 0-1 loss function (e.g. sentence errors) and a general loss function (such as phoneme/character/word errors) in speech recognition and machine translation.
Bayes Decision Theory (1/2)
In its general form, i.e. with a general loss function, the Bayes decision rule selects the class that minimizes the expected loss, where the expected loss, or Bayes risk, of a class c is its loss against every competing class weighted by the class posterior probabilities (the slide gives both as equations, reconstructed below).
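In standard notation (a reconstruction; x denotes the observation, c and c' classes, L the loss function, and p(c'|x) the class posterior), the two equations read:

```latex
c^{*}(x) \;=\; \operatorname*{argmin}_{c}\; R(c \mid x),
\qquad
R(c \mid x) \;=\; \sum_{c'} L(c, c')\, p(c' \mid x)
```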
Bayes Decision Theory (2/2)
In particular, with the 0-1 loss function the Bayes decision rule reduces to finding the class that maximizes the class posterior probability.
For simplicity, the observation (x-) dependence of the posterior probability is dropped in the following:
–c_max denotes the class that maximizes the class posterior probability
–c_L denotes the class that minimizes the Bayes risk for loss function L
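The 0-1 special case, again in reconstructed standard notation (not copied from the slide):

```latex
L_{0\text{-}1}(c, c') \;=\; 1 - \delta(c, c')
\quad\Longrightarrow\quad
c^{*}(x) \;=\; \operatorname*{argmin}_{c} \sum_{c'} \bigl(1 - \delta(c, c')\bigr)\, p(c' \mid x)
         \;=\; \operatorname*{argmax}_{c}\; p(c \mid x) \;=\; c_{\max}
```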
Analysis of Bayes Decision Rule with General Loss Functions
A metric loss function is non-negative, symmetric, and fulfils the triangle inequality; it is zero if and only if both arguments are equal.
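Restated in symbols (notation assumed, not taken from the slide):

```latex
L(c, c') \ge 0, \qquad
L(c, c') = L(c', c), \qquad
L(c, c'') \le L(c, c') + L(c', c''), \qquad
L(c, c') = 0 \;\Longleftrightarrow\; c = c'
```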
Loss-Independence of the Bayes Decision Rule for Large Posterior Probabilities
Assume the maximum posterior probability is at least ½ and the loss is a metric loss. Then the posterior maximizing class c_max also minimizes the Bayes risk.
Proof: consider the difference between the Bayes risk of an arbitrary class and the Bayes risk of class c_max; by the triangle inequality this difference is >= 0 (the derivation is given as an equation image on the slide; a reconstruction is sketched below).
From this, a broad estimate can be calculated which shows where word error minimization can be expected to result in different decisions than sentence error minimization or posterior probability maximization, respectively.
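A sketch of the argument, assuming the condition p(c_max) >= 1/2 and writing R(c) for the Bayes risk of class c (a reconstruction, not copied from the slide):

```latex
R(c) - R(c_{\max})
  \;=\; \sum_{c'} p(c')\,\bigl[\,L(c, c') - L(c_{\max}, c')\,\bigr]
  \;\ge\; p(c_{\max})\,L(c, c_{\max}) \;-\; p(c)\,L(c, c_{\max})
      \;-\; \sum_{c' \neq c,\, c_{\max}} p(c')\,L(c, c_{\max})
  \;=\; L(c, c_{\max})\,\bigl[\,2\,p(c_{\max}) - 1\,\bigr] \;\ge\; 0
```

Here the c' = c_max term contributes p(c_max) L(c, c_max), the c' = c term contributes -p(c) L(c, c_max), and the remaining terms are bounded via the triangle inequality L(c, c') - L(c_max, c') >= -L(c, c_max).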
Dominance of Maximum Posterior Probability (1/4)
Now assume that the maximum posterior probability is less than ½ and that the loss function is a metric loss.
Then the class maximizing the posterior probability still minimizes the Bayes risk if a set C of classes can be found that meets the requirements given on the slide (conditions shown as equations, not reproduced here).
As a special case, condition (3) is fulfilled if a further, simpler inequality holds.
Dominance of Maximum Posterior Probability (2/4)
Proof: consider again the difference between the Bayes risk of an arbitrary class and the Bayes risk of class c_max. Using the triangle inequality together with conditions (2) and (3), this difference can again be shown to be >= 0 (the derivation is given as an equation image; the construction of the set C is illustrated on the next slide).
Dominance of Maximum Posterior Probability (3/4)
(Figure: illustration of the class c_max, the set C, and two further classes c' and c''; image not reproduced here.)
Dominance of Maximum Posterior Probability (4/4)
Therefore, the efficiency of Bayes risk minimization with a general, non-0-1 loss function can be improved by using Ineq. (1), and Ineq. (4) together with Ineq. (5), to shortlist those test samples for which a decision different from that of the 0-1 loss function can be expected (a sketch of this shortlisting follows below).
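A minimal sketch of the shortlisting idea, implementing only the simple p(c_max) >= 1/2 test from the loss-independence result; the refinements via Ineq. (4) and (5) are not reproduced here, and the function and variable names are illustrative rather than taken from the paper:

```python
def needs_full_mbr(posteriors, threshold=0.5):
    """Return True if loss-based (MBR) decoding could differ from the MAP decision.

    posteriors: dict mapping candidate hypotheses to posterior probabilities,
    assumed to be normalized over the candidate set. For a metric loss, a
    maximum posterior of at least 1/2 guarantees that the MAP hypothesis
    already minimizes the Bayes risk, so the loss-based search can be skipped.
    """
    return max(posteriors.values()) < threshold


# Example: only samples failing the test are passed on to the MBR decoder.
samples = [
    {"a b c": 0.6, "a b d": 0.3, "a c d": 0.1},    # MAP decision is already minimum risk
    {"a b c": 0.4, "a b d": 0.35, "a c d": 0.25},  # MBR may decide differently
]
shortlist = [s for s in samples if needs_full_mbr(s)]
print(len(shortlist))  # -> 1
```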
Upper Bounds of Risk Difference (1/5)
Under a condition on the maximum posterior probability (shown as an equation on the slide) and for a metric loss, an upper bound can be derived for the difference between the Bayes risk of class c_max and the Bayes risk of an arbitrary class (Ineq. (6), not reproduced here).
Upper Bounds of Risk Difference (2/5)
Replacing the arbitrary class in Ineq. (6) by the class that minimizes the Bayes risk gives Ineq. (7).
For a specific choice (given as equations on the slide), Ineq. (7) can be shown to be tight; the presenter notes that this tightness still needs to be proved.
Upper Bounds of Risk Difference (3/5)
Another upper bound can be found using a particular definition of a set of classes, given on the slide together with an auxiliary quantity ε (equations not reproduced here).
Upper Bounds of Risk Difference (4/5)
For one case (condition shown as an equation), an inequality involving ε can be derived; for the complementary case, Ineq. (6) can be applied to obtain the same inequality. The derivation occupies the following slide as an equation image and is not reproduced here.
Upper Bounds of Risk Difference (5/5)
In summary, bounds for the difference between the Bayes risk of the posterior maximizing class and the minimum Bayes risk are derived; these bounds can serve as cost estimates for Bayes risk minimization approaches.
Minimum Risk with Zero Posterior Probability
The slide tabulates the Bayes risk of four example strings (table not reproduced here). Provided all posterior probabilities are less than ½, i.e. p_i < ½, the risk of the string "ded" is minimal even though its posterior probability is zero.
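The original table is an image; an analogous example (strings and probabilities chosen here for illustration, not taken from the slide) shows the same effect with the Levenshtein distance as loss function:

```latex
p(\texttt{aed}) = p(\texttt{dbd}) = p(\texttt{dec}) = \tfrac{1}{3}, \qquad p(\texttt{ded}) = 0:
\qquad
R(\texttt{ded}) = \tfrac{1}{3}(1+1+1) = 1
\;<\;
R(\texttt{aed}) = R(\texttt{dbd}) = R(\texttt{dec}) = \tfrac{1}{3}(0+2+2) = \tfrac{4}{3}
```

Each competitor is at Levenshtein distance 1 from "ded" but at distance 2 from the other two competitors, so the zero-posterior string "ded" attains the minimum Bayes risk.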
Pruning and General Loss Functions
In the word error minimization approach (which uses A* search with cost estimates based on partial hypotheses), the search space and the summation space for the expected loss calculation are the same.
If the worst hypothesis with respect to the Bayes risk, "dgd", is pruned at some stage of the word error minimization search, the result would be "dedb" instead (the risk values 2/9, 3/9, and 4/9 on the slide refer to an example table that is not reproduced here).
This shows that pruning not only reduces the search space but can also alter the decision for the remaining search space.
Experimental Setup (1/3)
Speech recognition experiments are performed on the WSJ0 corpus with a vocabulary of 5k words.
The baseline recognition system
–uses 1500 generalized triphone states plus one silence state
–uses Gaussian mixture distributions with a total of about 147k densities and a single pooled variance vector
33-dimensional observation vectors are obtained by LDA applied to 5 consecutive feature vectors
–12 MFCCs, their first derivatives, and the second derivative of the energy, at a 10 ms frame shift
Training corpus: 15 h of speech; a trigram language model is used.
The baseline WER is 3.97%.
Experiments (2/3)
The word error minimization experiments were performed on N-best lists [5] with N = 10000 to ensure proper normalization of the posterior probabilities.
The search for the minimum Bayes risk with the Levenshtein loss function was always started by computing the risk of the posterior maximizing word sequence first, which served as the initial risk pruning threshold (see the sketch below).
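A minimal sketch of this N-best MBR search, assuming a plain list of (hypothesis, posterior) pairs; the Levenshtein implementation and the early-abort pruning are illustrative and not taken from the paper:

```python
def levenshtein(a, b):
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (x != y)))   # substitution
        prev = cur
    return prev[-1]


def mbr_decode(nbest):
    """nbest: list of (hypothesis, posterior) with hypotheses as word tuples,
    posteriors assumed normalized over the list.
    Returns the hypothesis with minimum expected Levenshtein loss."""
    # The risk of the posterior-maximizing hypothesis is computed first and
    # serves as the initial pruning threshold.
    best_hyp, _ = max(nbest, key=lambda hp: hp[1])
    best_risk = sum(p * levenshtein(best_hyp, hyp) for hyp, p in nbest)

    for cand, _ in nbest:
        risk = 0.0
        for hyp, p in nbest:
            risk += p * levenshtein(cand, hyp)
            if risk >= best_risk:   # partial sum already exceeds the threshold
                break
        if risk < best_risk:
            best_hyp, best_risk = cand, risk
    return best_hyp


nbest = [(("a", "b", "c"), 0.4), (("a", "b", "d"), 0.35), (("a", "c", "d"), 0.25)]
print(" ".join(mbr_decode(nbest)))  # -> a b d (differs from the MAP hypothesis a b c)
```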
Experiments (3/3)
(Results table shown as an image on the slide; not reproduced here.)
Conclusion
In this work, fundamental properties of the Bayes decision rule with general loss functions are derived analytically and verified experimentally for automatic speech recognition.