Bootstrap Estimates For Confidence Intervals In ASR Performance Evaluation Presented by Patty Liu

2 Introduction (1/2) The most popular performance measure in automatic speech recognition is the word error rate

W = (Σ_{i=1..s} e_i) / (Σ_{i=1..s} n_i)

where n_i is the number of words in sentence i and e_i is the edit distance, or Levenshtein distance, between the recognizer output and the reference transcription of sentence i. The edit distance is the minimum number of insert, substitute and delete operations necessary to transform one sentence into the other.
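As a concrete reading of these definitions, here is a minimal Python sketch of the edit distance and the corpus-level word error rate; the function names and the whitespace tokenization are our own choices, not from the talk:

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two word sequences: the minimum number
    of insert, substitute and delete operations turning one into the other."""
    d = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        prev, d[0] = d[0], i
        for j, r in enumerate(ref, 1):
            prev, d[j] = d[j], min(d[j] + 1,          # skip a hypothesis word
                                   d[j - 1] + 1,      # skip a reference word
                                   prev + (h != r))   # substitution (or match)
    return d[len(ref)]

def word_error_rate(hyps, refs):
    """W = (sum of per-sentence edit distances e_i) / (total reference words)."""
    errors = sum(edit_distance(h.split(), r.split()) for h, r in zip(hyps, refs))
    words = sum(len(r.split()) for r in refs)
    return errors / words
```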

3 Introduction (2/2) The word error rate is an attractive metric, because it is intuitive, it corresponds well with application scenarios and (unlike sentence error rate) it is sensitive to small changes. On the downside it is not very amenable to statistical analysis:
- W is really a rate (number of errors per spoken word), and not a probability (chance of misrecognizing a word).
- Moreover, error events do not occur independently.
Standard significance tests for recognition results do exist, yet hardly any publication reports figures from such tests. Instead it is common to report only the absolute and relative change in word error rate.

4 Motivation What error rate do we have to expect when changing to a different test set? How reliable is an observed improvement of a system?

5 Bootstrap (1/5) The bootstrap is a computer-based method for assigning measures of accuracy to statistical estimates. The core idea is to create replications of a statistic by random sampling from the data set with replacement (so-called Monte Carlo estimates). We assume that the test corpus can be divided into s segments for which the recognition result is independent and the number of errors can thus be evaluated independently. For speaker-independent CSR it seems appropriate to choose the set of all utterances of one speaker as a segment.

6 Bootstrap (2/5) For each sentence i = 1…s we record the number of words n_i and the number of errors e_i:

X = {(n_1, e_1), …, (n_s, e_s)}

The following procedure is repeated B times (typically B is on the order of 1000): For b = 1…B we randomly select with replacement s pairs from X, to generate a bootstrap sample

X*_b = {(n*_{b,1}, e*_{b,1}), …, (n*_{b,s}, e*_{b,s})}

The sample will contain several of the original sentences multiple times, while others are missing.
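A minimal sketch of this resampling step, assuming the per-segment word counts n and error counts e have already been collected (names are ours, not the paper's):

```python
import random

def bootstrap_replications(n, e, B=1000):
    """Return B bootstrap replications W*_b of the word error rate,
    given per-segment word counts n[i] and error counts e[i]."""
    s = len(n)
    reps = []
    for _ in range(B):
        idx = [random.randrange(s) for _ in range(s)]  # draw s segments with replacement
        reps.append(sum(e[i] for i in idx) / sum(n[i] for i in idx))
    return reps
```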

7 Bootstrap (3/5)

8 Bootstrap (4/5) Then we calculate the word error rate on this sample:

W*_b = (Σ_i e*_{b,i}) / (Σ_i n*_{b,i})

The W*_b are called bootstrap replications of W. They can be thought of as samples of the word error rate from an ensemble of virtual test sets. Their mean is the bootstrap estimate of the word error rate:

W_boot = (1/B) Σ_{b=1..B} W*_b

The uncertainty of W_boot can be quantified by the standard error, which has the following bootstrap estimate:

se_boot = sqrt( (1/(B−1)) Σ_{b=1..B} (W*_b − W_boot)² )

9 Bootstrap (5/5) For large s the distribution of W* is approximately Gaussian. In this case the true word error rate lies with 90% probability in the interval W_boot ± 1.645 · se_boot. Even when s is small, we can use the table of replications W*_b to determine percentiles which in turn can serve as confidence intervals. For a chosen error threshold α, let W*_lo be the (α·B)-th smallest value in the sorted list, and W*_hi the (α·B)-th largest. The interval [W*_lo, W*_hi] contains the true value of W with probability 1 − 2α. This is the bootstrap-t confidence interval. For example: With B = 1000 and α = 0.05, we sort the list of W*_b and use the values at positions 50 and 950.
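Continuing the sketch above, the standard error and the percentile interval can be read off the sorted replications; the index arithmetic follows the slide's example (positions 50 and 950 for B = 1000, α = 0.05):

```python
import math

def bootstrap_interval(reps, alpha=0.05):
    """Bootstrap estimate W_boot, its standard error, and the (1 - 2*alpha)
    percentile confidence interval, from a list of replications W*_b."""
    B = len(reps)
    w_boot = sum(reps) / B
    se_boot = math.sqrt(sum((w - w_boot) ** 2 for w in reps) / (B - 1))
    ordered = sorted(reps)
    lo = ordered[int(alpha * B) - 1]        # position 50 (1-based) for B=1000, alpha=0.05
    hi = ordered[int((1 - alpha) * B) - 1]  # position 950
    return w_boot, se_boot, (lo, hi)
```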

10 EXAMPLE: NAB (North American Business News) ERROR RATES (table of error rates; annotations: narrower confidence intervals, smaller standard error)

11 EXAMPLE: NAB ERROR RATES (figure: Cboot and Wboot with 90% and 5% percentile marks; annotation: narrower confidence intervals)

12 Comparing Systems (1/2) Competing algorithms are usually tested on the same data. Given two recognition systems A and B with word error counts e^A_i and e^B_i, the (absolute) difference in word error rate is

ΔW = W^A − W^B = (Σ_i (e^A_i − e^B_i)) / (Σ_i n_i)

We can apply the same bootstrap technique to the quantity ΔW as we did to W. The crucial point is that we calculate the difference in the number of errors of the two systems on identical bootstrap samples. The important consequence is that ΔW* has much lower variance than the W* of either system.

13 Comparing Systems (2/2) In addition to the two-tailed confidence interval, we may be more interested in whether system B is a real improvement over system A. To this end we define

poi = (1/B) Σ_{b=1..B} Θ(ΔW*_b)

where Θ(x) is the step function, which is one for x > 0 and zero otherwise. So the poi is the relative number of bootstrap samples which favor system B. We call this measure "probability of improvement" (poi).
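A sketch of the paired comparison: the same bootstrap indices are used for both systems, and poi is the fraction of replications in which B makes fewer errors. The variable names are illustrative.

```python
import random

def probability_of_improvement(n, e_a, e_b, B=1000):
    """n[i]: words in segment i; e_a[i], e_b[i]: errors of systems A and B.
    Returns poi, the fraction of bootstrap samples favoring system B."""
    s = len(n)
    wins = 0
    for _ in range(B):
        idx = [random.randrange(s) for _ in range(s)]   # identical sample for A and B
        delta = sum(e_a[i] - e_b[i] for i in idx)       # > 0 means B made fewer errors
        wins += delta > 0                               # step function Theta(dW*_b)
    return wins / B
```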

14 Example: System Comparison The system used for the examples described earlier now plays the role of system B, while a second system with slightly different acoustic models is system A. System B is apparently better by 0.3% to 0.4% absolute in terms of word error rate. The probability of improvement, ranging between 82% and 95%, indicates that we can be moderately confident that this reflects a real superiority of system B, but we should not be too surprised if a fourth test set were favorable to system A. The notable advantage of this differential analysis is that the standard error of ΔW is approximately one third of the standard error of W.

15 Example: System Comparison

16 Example: System Comparison (figure)

17 Conclusion We would like to emphasize that what we propose is not a new metric for performance evaluation, but a refined analysis of an established metric (word error rate). The proposed method seems attractive, because it is easy to use, it makes no assumption about the distribution of errors, results are directly related to word error rate, and the “probability of improvement” provides an intuitive figure of significance.

Open Vocabulary Speech Recognition with Flat Hybrid Models

19 Introduction (1/3) Large vocabulary speech recognition systems operate with a fixed, large but finite vocabulary. Systems operating with a fixed vocabulary are bound to encounter so-called out-of-vocabulary (OOV) words. These are problematic for a number of reasons:
- An OOV word will never be recognized (even if the user repeats it), but will be substituted by some in-vocabulary word.
- Neighboring words are also often misrecognized.
- Later processing stages (e.g. translation, understanding, document retrieval) cannot recover from OOV errors.
- OOV words are often content words.

20 Introduction (2/3) The decision rule used by a large vocabulary speech recognition system is the Bayes rule

ŵ = argmax_w p(w) · p(x | w)

with the following knowledge sources:
- acoustic model p(x | w): relates acoustic features x to phoneme sequences, typically an HMM (vocabulary independent)
- pronunciation lexicon: assigns one (or more) phoneme string(s) to each word
- language model p(w): assigns probabilities to sentences from a finite set of words

21 Introduction (3/3) For open vocabulary recognition, we propose to conceptually abandon words in favor of individual letters. Unlike words, the set of different letters G in a writing system is finite. Concerning the link to the acoustic realization, the set of phonemes can also be considered finite for a given language. These considerations suggest the following model:
- acoustic model
- pronunciation model: provides a pronunciation for any string of letters
- sub-lexical language model: assigns probabilities to character strings

22 Introduction (3/3) Alternatively, the pronunciation model and sub-lexical language model can be combined into a single joint "graphonemic" model.

23 Grapheme-to-Phoneme Conversion (1/3) Obviously this approach to open-vocabulary recognition is strongly connected to grapheme-to-phoneme conversion (G2P), where we seek the most likely pronunciation φ for a given orthographic form g:

φ̂(g) = argmax_φ p(g, φ)

The underlying assumption of this model is that, for each word, its orthographic form and its pronunciation are generated by a common sequence of graphonemic units. Each unit is a pair of a letter sequence and a phoneme sequence of possibly different length. We refer to such a unit as a "graphone". (Various other names have been suggested: grapheme-phoneme joint multigram, graphoneme, grapheme-to-phoneme correspondence (GPC), chunk.)

24 Grapheme-to-Phoneme Conversion (2/3) The joint probability distribution is thus reduced to a probability distribution over graphone sequences q_1 … q_K, which we model using a standard M-gram:

p(g, φ) = Σ over graphone sequences segmenting (g, φ) of Π_{j=1..K} p(q_j | q_{j−M+1}, …, q_{j−1})

The complexity of this model depends on two parameters: the range M of the M-gram model and the allowed size of the graphones. We allow the number of letters and phonemes in a graphone to vary between zero and an upper limit L.
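To make the model concrete, here is a toy illustration (not the paper's trained model) with M = 2: one segmentation of a word into graphones is scored as a product of conditional graphone probabilities. The segmentation and the probability values below are invented.

```python
import math

# each graphone q is a (letter sequence, phoneme sequence) pair
segmentation = [("s", "s"), ("pea", "p iy"), ("king", "k ih ng")]  # "speaking" (hypothetical)

bigram = {  # hypothetical p(q_j | q_{j-1}) values
    ("<s>", ("s", "s")): 0.1,
    (("s", "s"), ("pea", "p iy")): 0.05,
    (("pea", "p iy"), ("king", "k ih ng")): 0.2,
    (("king", "k ih ng"), "</s>"): 0.3,
}

def log_prob(graphones, bigram, floor=1e-6):
    """log p(g, phi) for ONE segmentation; the full model sums (or maximizes)
    over all segmentations of the word into graphones."""
    lp, prev = 0.0, "<s>"
    for q in list(graphones) + ["</s>"]:
        lp += math.log(bigram.get((prev, q), floor))  # unseen transitions get a tiny floor
        prev = q
    return lp

print(math.exp(log_prob(segmentation, bigram)))  # joint probability of the example word
```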

25 Grapheme-to-Phoneme Conversion (3/3) We were able to verify that shorter units in combination with longer-range M-gram modeling yield the best results for the grapheme-to-phoneme task.

26 Models for Open Vocabulary ASR (1/2) We combine the lexical entries with the (sub-lexical) graphones derived from grapheme-to-phoneme conversion to form a unified set of recognition units. From the perspective of OOV detection the sub-lexical units Q have been called "fragments" or "fillers". By treating words and fragments uniformly, the decision rule becomes

û = argmax_u p(u) · p(x | u)

where u is a sequence of mixed units (words and fragments). The sequence model p(u) can be characterized as "hybrid" because it contains mixed M-grams containing both words and fragments. It can also be characterized as "flat", as opposed to structured approaches that predict and model OOV words with different models.

27 Models for Open Vocabulary ASR (2/2) A shortcoming of this model is that it leaves undetermined where word boundaries (i.e. blanks) should be placed. The heuristic used in this study, sketched below, is to compose any consecutive sub-lexical units into a single word and to treat all lexical units as individual words.
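A minimal sketch of this heuristic; the is_fragment predicate (e.g. membership in the graphone inventory) is assumed to be given:

```python
def insert_word_boundaries(units, is_fragment):
    """Compose runs of consecutive sub-lexical units into single words;
    lexical units become words by themselves."""
    words, buf = [], []
    for u in units:
        if is_fragment(u):
            buf.append(u)                    # keep accumulating sub-lexical units
        else:
            if buf:
                words.append("".join(buf))   # flush a run of fragments as one word
                buf = []
            words.append(u)                  # a lexical unit is a word on its own
    if buf:
        words.append("".join(buf))
    return words

# e.g. insert_word_boundaries(["the", "ab", "ro", "mo", "witz"],
#                             lambda u: u in {"ab", "ro", "mo", "witz"})
# -> ["the", "abromowitz"]
```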

28 Experiments We have three different established vocabularies with 5, 20 and 64 thousand words, each corresponding to the most frequent words in the language model training corpus. For each baseline pronunciation dictionary a grapheme-to-phoneme model was trained with different length constraints, using EM training with an M-gram length of 3. The recognition vocabulary was then augmented with all graphones inferred by this procedure. For quantitative analysis, we evaluated both word error rate (WER) and letter error rate (LER). Letter error rate is more favorable with respect to almost-correct words and corresponds with the correction effort in dictation applications.

29 Experiments

30 Experiments--Bootstrap Analysis of OOV Impact We are particularly interested in the effect of OOV words on the recognition error rate. This effect could be studied by varying the system's vocabulary. However, changing the recognition system in this way might introduce secondary effects such as increased confusability between vocabulary entries. Alternatively we can alter the test set. By extending the bootstrap technique, we create an ensemble of virtual test corpora with varying numbers of OOV words, and the respective WERs. This distribution allows us to study the correlation between OOV rate and word error rate without changing the recognition system.

31 Experiments This procedure is detailed in the following: For each sentence i = 1…s we record the number of words n_i, the number of OOV words o_i and the number of recognition errors e_i:

X = {(n_1, o_1, e_1), …, (n_s, o_s, e_s)}

For b = 1…B (typically B is on the order of 1000) we randomly select with replacement s tuples from X to generate a bootstrap sample X*_b. The OOV rate and word error rate on this sample are

OOV*_b = (Σ_i o*_{b,i}) / (Σ_i n*_{b,i}),  WER*_b = (Σ_i e*_{b,i}) / (Σ_i n*_{b,i})
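A sketch of this extended bootstrap, reusing the resampling pattern from the first part of the talk; each replication yields a paired (OOV rate, WER) point:

```python
import random

def oov_wer_replications(X, B=1000):
    """X: list of per-sentence tuples (n_i, o_i, e_i). Returns B paired
    (OOV*_b, WER*_b) points."""
    s, points = len(X), []
    for _ in range(B):
        sample = [X[random.randrange(s)] for _ in range(s)]
        words = sum(n for n, o, e in sample)
        points.append((sum(o for n, o, e in sample) / words,   # OOV*_b
                       sum(e for n, o, e in sample) / words))  # WER*_b
    return points
```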

32 Experiments The bootstrap replications OOV*_b and WER*_b can be visualized by a scatter plot. We quantify the observed linear relation between OOV rate and WER by a linear least squares fit. The slope of the fitted line reflects the number of word errors per OOV word. For this reason we call this quantity "OOV impact".
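The slope can be obtained with an ordinary degree-1 least squares fit; a minimal sketch using numpy:

```python
import numpy as np

def oov_impact(points):
    """Slope of the least squares line through the (OOV rate, WER) scatter,
    i.e. the estimated number of word errors per OOV word."""
    oov = np.array([p[0] for p in points])
    wer = np.array([p[1] for p in points])
    slope, intercept = np.polyfit(oov, wer, 1)  # degree-1 least squares fit
    return slope
```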

33 Discussion (1/2)

34 Discussion (2/2) Obviously the improvement in error rate depends strongly on the OOV rate. It is interesting to compare the OOV impact factor (word errors per OOV word): The baseline systems have values between 1.7 and 2, supporting the common wisdom that each OOV word causes two word errors. Concerning the optimal choice of fragment size L, we note that there are two counteracting effects:
- Larger L values increase the size of the graphone inventory, which in turn causes data sparseness problems, leading to worse grapheme-to-phoneme performance.
- Smaller values for L cause the unit inventory to contain many very short words with high probabilities, leading to spurious insertions in the recognition result.
The present experiments suggest that the best trade-off is at L = 4.

35 Conclusion We have shown that we can significantly improve a well optimized state-of-the-art recognition system by using a simple flat hybrid sub-lexical model. The improvement was observed over a wide range of out-of-vocabulary rates. Even for very low OOV rates, no deterioration occurred. We found that using fragments of up to four letters or phonemes yielded optimal recognition results, while using non-trivial chunks is detrimental to grapheme-to-phoneme conversion.

OPEN-VOCABULARY SPOKEN TERM DETECTION USING GRAPHONE-BASED HYBRID RECOGNITION SYSTEMS Murat Akbacak, Dimitra Vergyri, Andreas Stolcke Speech Technology and Research Laboratory SRI International, Menlo Park, CA 94025, USA

37 Introduction (1/3) Recently, NIST defined a new task, spoken term detection (STD), in which the goal is to locate a specified term rapidly and accurately in large heterogeneous audio archives, to be used ultimately as input to more sophisticated audio search systems. The evaluation metric has two important characteristics: (1) Missing a term is penalized more heavily than having a false alarm for that term, (2) Detection results are averaged over all query terms rather than over their occurrences, i.e., the performance metric considers the contribution of each term equally.

38 Introduction (2/3) Results of the NIST 2006 STD evaluation have shown that systems based on word recognition have an accuracy advantage over systems based on sub-word recognition (although they typically pay a price in run time). Yet, word recognition systems are usually based on a fixed vocabulary, resulting in a word-based index that does not allow text-based searching for OOV words. To retrieve OOVs, as well as misrecognized IV words, audio search based on sub-word units (such as syllables and phone N-grams) has been employed in many systems. During recognition, shorter units are more robust to errors and word variants than longer units, but longer units capture more discriminative information and are less susceptible to false matches during retrieval.

39 Introduction (3/3) In order to move toward solutions that address the problem of misrecognition (both IV and OOV) during audio search, previous studies have employed fusion methods to recover from ASR errors during retrieval. Here, we propose a hybrid STD system that uses words and sub-word units together in the recognition vocabulary. The ASR vocabulary is augmented by graphone units. We extract from ASR lattices a hybrid index, which is then converted into a regular word index by a post-processing step that joins graphones into words. It is important to represent ASR lattices with only words (with an expanded vocabulary) rather than with words and sub-word units, since the lattices might serve as input to other information processing algorithms, such as named entity tagging or information extraction, which assume a word-based representation.

40 The STD Task ─Data The test data consists of audio waveforms, a list of regions to be searched, and a list of query terms. For expedience we focus in this study on English and the genre with the highest OOV rate, broadcast news (BN).

41 The STD Task ─Evaluation Metric Basic detection performance will be characterized in the usual way via standard detection error tradeoff (DET) curves of miss probability P_miss(term, θ) versus false alarm probability P_FA(term, θ):

P_miss(term, θ) = 1 − N_correct(term, θ) / N_true(term)
P_FA(term, θ) = N_spurious(term, θ) / N_NT(term)

where:
θ: detection threshold
N_correct(term, θ): the number of correct (true) detections of term with a score greater than or equal to θ
N_true(term): the true number of occurrences of term in the corpus
N_spurious(term, θ): the number of spurious (incorrect) detections of term with a score greater than or equal to θ
N_NT(term): the number of opportunities for incorrect detection of term in the corpus (= "Non-Target" term trials)
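A sketch of one per-term DET point under these definitions; the score lists and counts are assumed to come from scoring the index against the reference:

```python
def det_point(scores_true_hits, scores_false_hits, n_true, n_nt, theta):
    """P_miss and P_FA for one term at detection threshold theta.
    scores_true_hits / scores_false_hits: detection scores of correct and
    spurious candidate occurrences of the term."""
    n_correct = sum(s >= theta for s in scores_true_hits)
    n_spurious = sum(s >= theta for s in scores_false_hits)
    p_miss = 1.0 - n_correct / n_true
    p_fa = n_spurious / n_nt
    return p_miss, p_fa
```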

42 STD System Description (1/3)

43 STD System Description (2/3) I. Indexing step First, audio input is run through the STT system, which produces word or word + graphone recognition hypotheses and lattices. These are converted into a candidate term index with times and detection scores (posteriors). When hybrid recognition (word + graphone) is employed, graphones in the resulting index are combined into words. To be able to do this, we keep word start/end information with a tag in the graphone representation (e.g., "[[.]", "[.]]", "[.]" indicate a graphone at the beginning, at the end, or in the middle of a word, respectively); see the sketch below.
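A sketch of the re-joining step; the tag conventions follow the slide ("[[x]" word-initial, "[x]]" word-final, "[x]" word-internal), though the exact bracket syntax here is illustrative:

```python
def join_graphones(tokens):
    """Combine boundary-tagged graphones back into words for the index."""
    words, buf = [], []
    for t in tokens:
        unit = t.strip("[]")
        if t.startswith("[[") and t.endswith("]]"):  # single-graphone word
            words.append(unit)
        elif t.startswith("[["):                     # word-initial graphone
            buf = [unit]
        elif t.endswith("]]"):                       # word-final graphone: flush
            buf.append(unit)
            words.append("".join(buf))
            buf = []
        elif t.startswith("["):                      # word-internal graphone
            buf.append(unit)
        else:                                        # ordinary word token
            words.append(t)
    return words

print(join_graphones(["[[abro]", "[mo]", "[witz]]"]))  # ['abromowitz']
```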

44 STD System Description (3/3) II. Searching step During the retrieval step, the search terms are first looked up in the candidate term list, and then a decision function is applied to accept or reject each candidate based on its detection score.

45 STD System Description ─Speech-to-Text System I. Recognition Engine The STT system used for this task was a sped-up version of the STT systems used in the NIST evaluation for 2004 Rich Transcription (RT-04). STT uses SRI's Decipher(TM) speaker-independent continuous speech recognition system, which is based on continuous-density, state-clustered hidden Markov models (HMMs), with a vocabulary optimized for the BN genre.

46 STD System Description ─Speech-to-Text System II. Graphone-based Hybrid Recognition To compensate for OOV words during retrieval, we used an approach that uses sub-word units (graphones) to model OOV words. The underlying assumption of this model is that, for each word, its orthographic form and its pronunciation are generated by a common sequence of graphonemic units. Each graphone is a pair of a letter sequence and a phoneme sequence of possibly different lengths.

47 STD System Description ─Speech-to-Text System In our experiments, we used 50K words (excluding the 10K most frequent ones in our vocabulary) to train the graphone module, with maximum window length, M, set to 4. A hybrid word + graphone LM was estimated and used for recognition. Following is an example of an OOV word modeled by graphones: abromowitz: [[abro] [mo] [witz]] where graphones are represented by their grapheme strings enclosed in brackets, and “[[” and “]]” tags are used to mark word boundary information that is later used to join graphones back into words for indexing.

48 STD System Description ─N-gram Indexing Since the lattice structure provides additional information about where the correct hypothesis could appear, several studies have used the whole hypothesized word lattice to obtain the searchable index, in order to avoid misses (which have a higher cost in the evaluation score than false alarms). We used the lattice-tool in SRILM (version 1.5.1) to extract the list of all word/graphone N-grams (up to N=5 for the word-only (W) STD system, N=8 for the hybrid (W+G) STD system). The term posterior for each N-gram is computed as the forward-backward combined score (acoustic, language, and prosodic scores were used) through all the lattice paths that share the N-gram nodes.

49 STD System Description ─Term Retrieval The term retrieval was implemented using the Unix join command, which concatenates the lines of the sorted term list and the index file for the terms common to both. The computational cost of this simple retrieval mechanism depends only on the size of the index. Each putative retrieved term is marked with a hard decision (YES/NO). Our decision-making module relies on the posterior probabilities generated by the STT system. One of two techniques was employed during the decision-making process.
- The first one determines a global threshold for the posterior probability (GL-TH) by maximizing the ATWV score, which for this task was found to be 0.4 and … for the word-based and hybrid systems, respectively.
- An alternative strategy can be formulated that computes a term-specific threshold (TERM-TH), which has a simple analytical solution.

50 STD System Description ─Term Retrieval Based on decision theory, the optimal threshold for each candidate should satisfy

p · V > (1 − p) · C

where p is the candidate's posterior probability, V is the value of a correct detection and C is the cost of a false alarm. For the ATWV metric we have V = 1/N_true(term) and C = β/N_NT(term), which gives the term-specific threshold θ(term) = β · N_true(term) / (N_NT(term) + β · N_true(term)). Since the number of true occurrences of the term is unknown, we approximate it for the calculation of the optimal θ by the sum of the posterior probabilities of the term in the corpus.
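A sketch of the term-specific threshold under the reconstruction above, with N_true approximated by the sum of the term's posteriors; the default beta is the NIST 2006 STD trade-off constant (999.9), an assumption on our part:

```python
def term_threshold(posteriors, n_nt, beta=999.9):
    """TERM-TH: posterior threshold for one term. posteriors: detection
    scores of all candidate occurrences of the term; n_nt: non-target trials."""
    n_true = sum(posteriors)                  # expected number of true occurrences
    return beta * n_true / (n_nt + beta * n_true)

# a candidate is marked YES iff its posterior exceeds the term's threshold
```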

51 Experiment Results (1/2) In the hybrid (W+G) STD system, approximately 15K graphones are added to the recognition vocabulary. On average every OOV word results in two errors (one for the word itself, and one for a neighboring word because of incorrect context in the language model scoring). The hybrid system recovers much of this loss: for example, going from a 60K vocabulary to a 20K vocabulary leads to a 3.2% increase in WER, and the hybrid system brings this number down to 1.3%.

52 Experiment Results (2/2) An interesting observation is that even for IV terms the hybrid (W+G) STD yields better performance than the word-only (W) STD system. This is because hybrid recognition improves both IV-word and OOV-word recognition, resulting in better retrieval performance for IV and OOV words at the same time.