Presentation transcript:

1 Minimum Bayes-risk Methods in Automatic Speech Recognition Vaibhava Goel and William Byrne, IBM; Johns Hopkins University. © 2003 by CRC Press LLC. 2005/4/26, presented by 士弘.

2 Outline
Introduction
Minimum Bayes-Risk Classification Framework
-- Likelihood ratio based hypothesis testing
-- Maximum a-posteriori probability classification
Practical MBR Procedures for ASR
-- MBR recognition with N-best lists
-- MBR recognition with lattices
-- A* search under general loss functions
-- Single stack search under Levenshtein loss function
-- Prefix tree search under Levenshtein loss function
Segmental MBR Procedures
-- Segmental voting
-- ROVER (recognizer output voting for error reduction)
-- e-ROVER
Experimental Results
Summary

3 Introduction Maximum likelihood techniques that underlie the training and decision processes of most current ASR systems are not sensitive to application-specific goals. A promising approach towards the construction of speech recognizers that are tuned for specific tasks is known as Minimum Bayes-risk automatic speech recognition.

4 Introduction The MBR framework assumes that a quantitative measure of recognition performance is known and that recognition should be a decision process that attempts to minimize the expected error under this measure. The three components of this decision process are:
1. the given error measure
2. the space of possible decisions
3. a probability distribution that allows the measurement of expected error.

5 Minimum Bayes-Risk Classification Framework
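The framework equations on slides 5-8 were presented as images and are not in this transcript. As a hedged reconstruction in standard notation, with A the acoustic evidence, W and W' word strings in the hypothesis space \mathcal{W}, l(W, W') the loss function, and P(W | A) the posterior distribution, the MBR classifier is

\delta_{MBR}(A) = \operatorname*{arg\,min}_{W' \in \mathcal{W}} \sum_{W \in \mathcal{W}} l(W, W') \, P(W \mid A)

i.e., it selects the hypothesis with minimum expected loss under the model distribution.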

6

7

8

9 Likelihood ratio based hypothesis testing
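The derivation on this slide was not captured. As a standard reconstruction (my notation, not necessarily the slide's): for two hypotheses H_0 and H_1, with c_{10} the cost of deciding H_1 when H_0 is true, c_{01} the cost of deciding H_0 when H_1 is true, and zero cost for correct decisions, the MBR rule reduces to a likelihood ratio test,

\frac{P(A \mid H_1)}{P(A \mid H_0)} \;\underset{H_0}{\overset{H_1}{\gtrless}}\; \frac{c_{10}\, P(H_0)}{c_{01}\, P(H_1)}

so thresholded likelihood ratio testing is itself an instance of MBR classification.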

10 Maximum a-posteriori probability classification
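The equation for this slide is likewise missing. Under the 0/1 loss, l(W, W') = 0 if W = W' and 1 otherwise, the expected loss of deciding W' is 1 - P(W' | A), so the MBR rule reduces to the familiar MAP decision rule

\hat{W} = \operatorname*{arg\,max}_{W} P(W \mid A) = \operatorname*{arg\,max}_{W} P(A \mid W)\, P(W)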

11 Practical MBR Procedures for ASR The MBR recognizer is difficult to implement for three reasons:
1. The evidence and hypothesis spaces tend to be quite large.
2. The problem of large spaces is worsened by the fact that an ASR system often has to process many consecutive utterances.
3. While there are efficient DP techniques for the MAP recognizer, such methods are not yet available for an MBR recognizer under an arbitrary loss function.

12 Practical MBR Procedures for ASR How is it implemented?
– Two implementations:
  N-best list rescoring procedure
  Search over a recognition lattice
– Segment long acoustic data into sentence or phrase length utterances.
– Restrict the evidence and hypothesis spaces to manageable sets of word strings.

13 MBR recognition with N-best lists
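The formulas for N-best rescoring were not captured. The sketch below, assuming that the evidence and hypothesis spaces are both restricted to the same N-best list with normalized posteriors, illustrates MBR N-best rescoring under the Levenshtein loss; the function names and data layout are illustrative, not the chapter's code.

```python
# Minimal sketch of N-best MBR rescoring (assumed data layout, not the chapter's code).
# Each N-best entry is (word_list, posterior); posteriors are assumed to sum to one.

def levenshtein(a, b):
    """Word-level edit distance between two word lists."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def mbr_rescore(nbest):
    """Return the N-best entry with minimum expected Levenshtein loss."""
    best, best_risk = None, float("inf")
    for hyp, _ in nbest:                       # candidate decision W'
        risk = sum(p * levenshtein(ref, hyp)   # expected loss over evidence W
                   for ref, p in nbest)
        if risk < best_risk:
            best, best_risk = hyp, risk
    return best, best_risk

# Example usage with a toy 3-best list.
nbest = [("a b c".split(), 0.4), ("a b d".split(), 0.35), ("a b".split(), 0.25)]
print(mbr_rescore(nbest))
```

Because every candidate is scored against every evidence string, the cost grows quadratically with the N-best list size, which is one reason the lattice-based procedures on the following slides are of interest.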

14 MBR recognition with lattices

15 MBR recognition with lattices

16 MBR recognition with lattices

17 MBR recognition with lattices

18 MBR Recognition with Lattices
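The lattice equations were also images. Schematically, when both the evidence and hypothesis spaces are restricted to the word strings in a lattice \mathcal{L}, the MBR hypothesis is

\hat{W} = \operatorname*{arg\,min}_{W' \in \mathcal{L}} \sum_{W \in \mathcal{L}} l(W, W') \, P(W \mid A), \qquad P(W \mid A) = \frac{p(W, A)}{\sum_{W'' \in \mathcal{L}} p(W'', A)}

where the path scores p(W, A) are obtained from the acoustic and language model scores on the lattice arcs. Exact enumeration of all lattice paths is infeasible for large lattices, which motivates the A* search described on the next slides.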

19 A* search under general loss functions

20 A* search under general loss functions Two cost functions are required for the search.
– The first cost function is associated with each hypothesis, whether partial or complete. Its value is a lower bound on the expected loss that can be obtained by extending the hypothesis through the lattice to completion.

21 A* search under general loss functions
– The second cost function is only associated with complete hypotheses. It is an over-estimate of the expected loss of a complete hypothesis.
– Hypotheses are kept in a stack (priority queue) sorted by cost C, with the smallest-cost hypothesis at the top.
– At every iteration the hypothesis at the top of the stack is extended.

22 A* search under general loss functions When does the search terminate?
1. When there is a complete hypothesis at the top of the stack, its second cost is computed; if this over-estimate is smaller than the under-estimate cost C of the next stack hypothesis, the search stops and that hypothesis is returned.
2. When there is no partial hypothesis left in the stack.
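Slides 19 through 28 give the cost functions only as images, so the following is a minimal, hedged sketch of the single-stack control flow just described; the under-estimate and over-estimate are left as caller-supplied functions, and all names and data structures are assumptions rather than the chapter's implementation.

```python
import heapq
import itertools

def astar_mbr_search(arcs, start, is_final, underestimate, overestimate):
    """A*-style single-stack search following the two-cost scheme on the slides.

    Assumed (illustrative) interface:
      arcs[node]          -> list of (word, next_node) lattice arcs
      is_final(node)      -> True if a hypothesis ending at `node` is complete
      underestimate(hyp)  -> lower bound on the expected loss of any completion of hyp
      overestimate(hyp)   -> upper bound on the expected loss of a complete hyp
    A hypothesis is a (words, node) pair.
    """
    tie = itertools.count()          # tie-breaker so heapq never compares hypotheses
    start_hyp = ((), start)
    stack = [(underestimate(start_hyp), next(tie), start_hyp)]

    while stack:
        cost, _, (words, node) = heapq.heappop(stack)
        if is_final(node):
            # Termination test 1: the over-estimate of this complete hypothesis
            # must not exceed the under-estimate of the next-best stack entry.
            if not stack or overestimate((words, node)) <= stack[0][0]:
                return words, cost
            heapq.heappush(stack, (cost, next(tie), (words, node)))
            # Otherwise extend the best *partial* hypothesis instead.
            kept = []
            while stack and is_final(stack[0][2][1]):
                kept.append(heapq.heappop(stack))
            if not stack:
                # Termination test 2: no partial hypothesis left in the stack.
                for item in kept:
                    heapq.heappush(stack, item)
                best = heapq.heappop(stack)
                return best[2][0], best[0]
            cost, _, (words, node) = heapq.heappop(stack)
            for item in kept:
                heapq.heappush(stack, item)
        # Extend the chosen partial hypothesis through every outgoing lattice arc.
        for word, nxt in arcs.get(node, []):
            hyp = (words + (word,), nxt)
            heapq.heappush(stack, (underestimate(hyp), next(tie), hyp))

    return None, float("inf")
```

The efficiency of such a search depends entirely on how tight the supplied under-estimate and over-estimate are, which is the subject of the Levenshtein-specific cost functions on the next slides.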

23 Single stack search under Levenshtein loss function We now present usable cost functions for the Levenshtein distance. These costs are not unique, and the efficiency of the search depends on the quality of both the under-estimate and the over-estimate. The Levenshtein loss function is not sensitive to word time boundaries; therefore, word time boundaries are summed over during the search.

24 Single stack search under Levenshtein loss function

25 Single stack search under Levenshtein loss function

26 Single stack search under Levenshtein loss function

27 Single stack search under Levenshtein loss function

28 Single stack search under Levenshtein loss function

29 Prefix tree search under Levenshtein loss function

30 Prefix tree search under Levenshtein loss function

31 Prefix tree search under Levenshtein loss function

32 Prefix tree search under Levenshtein loss function

33 Prefix tree search under Levenshtein loss function The single stack search and the prefix tree search have the disadvantage that the costs of partial hypotheses of different lengths are compared. This is acceptable under the search formulation, but it is not a good comparison for use in pruning, since it favors short hypotheses and is therefore sub-optimal. How can this be addressed?
– Use a multistack implementation that maintains a separate stack for each hypothesis length.
– This has been found to have better pruning characteristics in practice.

34 Segmental MBR Procedures Segmental MBR (SMBR): utterance level recognition is decomposed into a sequence of simpler MBR recognition problems. The lattices or N-best lists are segmented into sets of words. Advantages:
– The segmentation can be performed to identify high confidence regions within the evidence space.
– Within these regions we can produce reliable word hypotheses.
– SMBR then concentrates its effort on the low confidence regions.

35 Segmental MBR Procedures Definition (diagram on slide): the evidence and hypothesis spaces are segmented into segment sets, and the output word sequence is reconstructed from the per-segment decisions.

36 Segmental MBR Procedures Assume that the utterance level loss can be found from the losses over the segment sets, as sketched below.
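The assumed decomposition was an image on the slide; presumably it is the additive form

l(W, W') = \sum_{i=1}^{N} l_i(W_i, W'_i)

where W = W_1 W_2 \cdots W_N and W' = W'_1 W'_2 \cdots W'_N are the segmentations of the evidence and hypothesis word strings, and l_i is the loss on the i-th segment set.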

37 Segmental MBR Procedures Therefore, under this assumption on the segment-set losses, utterance level MBR recognition becomes a sequence of smaller MBR recognition problems. In practice it may be difficult to segment the evidence and hypothesis spaces exactly, so the segmentation actually used defines an utterance level induced loss function l_I. The overall performance under the desired loss function l depends on how well l_I approximates l.

38 Segmental Voting Segmental voting is a special case of segmental MBR recognition. Suppose each evidence and hypothesis segment set contains at most one word, and there is a 0/1 loss function on segment sets. The utterance level induced loss for segmental voting then follows as sketched below.
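The induced loss equation was not captured; with a 0/1 loss on each single-word segment set it is presumably the count of disagreeing segments,

l_I(W, W') = \sum_{i=1}^{N} \mathbf{1}\bigl[ W_i \neq W'_i \bigr]

so minimizing the expected induced loss amounts to a posterior-weighted vote within each segment set.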

39 Segmental Voting We will now describe two versions of segmental MBR recognition used in state-of-the-art ASR systems. Both these procedures attempt to reduce the word error rate (WER) and thus are based on the Levenshtein loss function.

40 ROVER Recognizer Output Voting for Error Reduction (ROVER) is an N-best list segmental voting procedure. It combines the hypotheses from multiple independent recognizers under the Levenshtein loss.

41 ROVER The word strings of N_e are arranged in a word transition network (WTN) that represents an approximate simultaneous alignment of these hypotheses.

42 ROVER The utterance level induced loss in ROVER is derived from the WTN alignment. This loss is similar to the Levenshtein distance between strings W and W', computed when their alignment is constrained to the one specified by the WTN.
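As a rough illustration (not the ROVER implementation itself), the sketch below performs only the voting step over an already-built WTN, represented here as a list of correspondence sets; each set maps a candidate word (or None for a skipped word) to its vote count, which in practice may be weighted by word confidences.

```python
# Illustrative voting step over a pre-aligned word transition network (WTN).
# The WTN is assumed to be a list of correspondence sets, one per alignment slot;
# each set is a dict mapping a word (or None for "no word") to its vote count.

def rover_vote(wtn):
    """Pick the highest-vote word in every correspondence set; drop empty (None) slots."""
    output = []
    for corr_set in wtn:
        word = max(corr_set, key=corr_set.get)   # plurality vote in this slot
        if word is not None:
            output.append(word)
    return output

# Three systems aligned into four slots (None marks a deletion in a system).
wtn = [
    {"the": 3},
    {"quick": 2, "quack": 1},
    {"brown": 1, None: 2},
    {"fox": 3},
]
print(rover_vote(wtn))   # ['the', 'quick', 'fox']
```

Building the WTN itself requires the approximate simultaneous alignment mentioned on the previous slide.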

43 e-ROVER Extended-ROVER (e-ROVER). The utterance level loss function of e-ROVER is built as follows.
– Start with the initial WTN.
– Merge two consecutive correspondence sets into an expanded set.
– Let the loss function on the expanded set be the Levenshtein distance.
– The loss function on correspondence sets that were not expanded remains the 0/1 loss.

44 e-ROVER

45 e-ROVER The utterance level induced loss in e-ROVER is defined analogously to the ROVER induced loss, with the Levenshtein distance applied within each expanded set. It follows from the definition of the Levenshtein distance that this induced loss approximates the true Levenshtein loss at least as closely as the ROVER induced loss.

46 e-ROVER There are two consequences of joining correspondence sets:
– After the joining operation, the loss function on the expanded set is no longer the 0/1 loss but is instead the Levenshtein distance.
– The size of the expanded set grows exponentially with the number of joining operations, making the computation of the induced loss progressively more difficult.
Therefore, it is important to choose the sets to be joined carefully, so as to yield the maximum gain in Levenshtein distance approximation with the minimum number of combined correspondence sets.

47 Parameter Tuning within the MBR Classification Rule The joint distribution used in the MBR recognizers is derived by combining probabilities from the acoustic and language models. It is customary in ASR to use two tuning parameters in the computation of the joint probability.
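The parameterized joint distribution on the slide was not captured. A common ASR parameterization, given here as an assumption rather than the chapter's exact form, scales the language model and flattens the overall distribution with two exponents \kappa_1 and \kappa_2:

P_{\kappa}(W, A) \;\propto\; \bigl( P(A \mid W)\, P(W)^{\kappa_1} \bigr)^{\kappa_2}

Here \kappa_1 plays the role of the usual language model scale and \kappa_2 controls how sharp the resulting posterior P_{\kappa}(W \mid A) is; both are typically tuned on held-out data.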

48 Parameter Tuning within the MBR Classification Rule

49 Utterance Level MBR Word and Keyword Recognition

50 ROVER and e-ROVER for Multilingual ASR

51 Summary We have described automatic speech recognition algorithms that attempt to minimize the average misrecognition cost under a task specific loss function. These recognizers, although generally more computationally complex than the more widely used MAP algorithms, can be efficiently implemented using an N-best list rescoring procedure or as an A* search over recognition lattices. Segmental MBR is described as a special case of MBR recognition that results from the segmentation of the recognition search space. The segmentation is done with the assumption that the induced loss function is a good approximation to the original, desired loss function.

52 Conclusion Hard to implement!