HUMAN AND SYSTEMS ENGINEERING: Confidence Measures Based on Word Posteriors and Word Graphs
Sridhar Raghavan and Joseph Picone
URL: www.isip.msstate.edu/publications/seminars/msstate/2005/confidence/

Page 1 of 21 Confidence measure using word posteriors: Abstract

There is a strong need for determining the confidence of a word hypothesis in an LVCSR system, because conventional Viterbi decoding generates only the overall one-best sequence, while the performance of a speech recognition system is measured by word error rate rather than sentence error rate. A good estimate of the confidence level is the word posterior probability. The word posteriors can be computed from a word graph, and a forward-backward algorithm can be used to compute the link posteriors.

Page 2 of 21 Confidence measure using word posteriors: Foundation

The equation for computing the posterior of a word is given by Wessel et al. The idea is to sum up the posterior probabilities of all word hypothesis sequences that contain the word 'w' with the same start and end times. (Figure: word hypotheses containing the word w plotted along the time axis.)
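A sketch of this formula, using the notation of Wessel et al., with [w; s, t] denoting word w hypothesized with start time s and end time t, and x_1^T the acoustic observation sequence:

$$
P([w; s, t] \mid x_1^T) \;=\; \sum_{W:\; [w; s, t] \in W} P(W \mid x_1^T),
$$

where the sum runs over all word sequences W in the graph that contain the hypothesis [w; s, t].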

Page 3 of 21 Confidence measure using word posteriors: Foundation, continued

We cannot compute the above posterior directly, so we decompose it into a likelihood and a prior using Bayes' rule. The value in the numerator then has to be computed using the forward-backward algorithm, and the denominator is simply the sum of the numerator over all words 'w' occurring in the same time instant. In the example node N below, there are 6 different ways to reach the node and 2 different ways to leave it, so we need both the forward probability and the backward probability to obtain a good estimate of the probability of passing through N; this is where the forward-backward algorithm comes into the picture. (Figure: a node N with six incoming and two outgoing links.)
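A hedged rendering of this decomposition:

$$
P([w; s, t] \mid x_1^T)
\;=\;
\frac{\displaystyle \sum_{W:\; [w; s, t] \in W} p(x_1^T \mid W)\, P(W)}
     {\displaystyle \sum_{v} \sum_{W':\; [v; s, t] \in W'} p(x_1^T \mid W')\, P(W')},
$$

where the numerator sums acoustic likelihood times language model prior over all paths through [w; s, t] (the quantity obtained with the forward-backward algorithm), and the denominator repeats the same sum for every word v occupying that time span.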

Page 4 of 21 Confidence measure using word posteriors: What exactly is a word posterior from a word graph?

A word posterior is a probability computed by considering a word's acoustic score, its language model score, and its presence in a particular path through the word graph. An example of a word graph is given below; note that the nodes are the start and stop times and the links are the words. Every link holds an acoustic score and a language model probability. The goal is to determine the link posterior probabilities. (Figure: example word graph for the utterance "Sil This is a test sentence Sil", with alternative links such as "the", "guest", "quest", and "sense"; the fractions on the links, e.g. 3/6, 2/6, 4/6, 1/6, 5/6, are the likelihoods.)
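As a concrete picture of this structure, here is a minimal Python sketch of a word graph whose links carry an acoustic score and a language model probability; the class, method and field names are illustrative assumptions, not taken from the slides or from any ISIP tool.

```python
from dataclasses import dataclass, field

@dataclass
class Link:
    """A word hypothesis connecting two nodes (its start and end times)."""
    start_node: int   # node index at the word's start time
    end_node: int     # node index at the word's end time
    word: str
    acoustic: float   # acoustic likelihood of the word over this time span
    lm: float         # language model probability of the word

@dataclass
class WordGraph:
    """Nodes are time points; links are the scored word hypotheses."""
    num_nodes: int
    links: list = field(default_factory=list)

    def add_link(self, start_node, end_node, word, acoustic, lm):
        self.links.append(Link(start_node, end_node, word, acoustic, lm))

    def incoming(self, node):
        """Links that end at the given node."""
        return [l for l in self.links if l.end_node == node]

    def outgoing(self, node):
        """Links that start at the given node."""
        return [l for l in self.links if l.start_node == node]
```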

Page 5 of 21 Confidence measure using word posteriors: Example

Let us consider the example word graph shown below; the values on the links are the likelihoods. (Figure: the same word graph as on the previous slide, with a link likelihood such as 3/6, 2/6, 4/6 or 1/6 attached to each word.)

Page 6 of 21 Confidence measure using word posteriors: Forward-backward algorithm

We use the forward-backward algorithm to determine the link probabilities. The equations used to compute the alphas and betas for an HMM are as follows. Computing alphas, Step 1, Initialization: in a conventional HMM forward-backward algorithm we would initialize each state's alpha with its initial probability times its emission probability. We need a slightly modified version of this equation for processing a word graph: the emission probability becomes the acoustic score, and the initial probability is taken as 1, since we always begin with a silence.
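A hedged rendering of the two initialization steps referred to here, the standard HMM forward initialization and the word-graph version in which the initial probability is fixed at 1:

$$
\alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N \qquad \text{(conventional HMM)},
$$
$$
\alpha(n_{\text{start}}) = 1 \qquad \text{(word graph, since every path begins with silence)}.
$$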

Page 7 of 21 Confidence measure using word posteriors: Forward-backward algorithm, continued

The α for the first node in the word graph is computed with this initialization. Step 2, Induction: this step is the main reason we use the forward-backward algorithm for computing such probabilities; the alpha values computed in the previous step are used to compute the alphas of the succeeding nodes. Note: unlike in HMMs, where we move from left to right at fixed intervals of time, here we move from one word's start time to the start time of the next closest word.
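A hedged rendering of the induction step on the word graph, by analogy with the HMM recursion $\alpha_{t+1}(j) = \big[\sum_i \alpha_t(i)\, a_{ij}\big]\, b_j(o_{t+1})$, where the language model probability plays the role of the transition $a_{ij}$ and the acoustic score plays the role of the emission:

$$
\alpha(n) \;=\; \sum_{\text{links } (m \to n)} \alpha(m)\; P_{\text{lm}}(w_{m \to n})\; p_{\text{ac}}(w_{m \to n}).
$$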

Page 8 of 21 Confidence measure using word posteriors: Forward-backward algorithm, continued

Let us follow the computation of the alphas from node 2 onwards; the alpha for node 1 was computed in the previous step during initialization. The alphas for nodes 2, 3 and 4 are computed in turn, and the calculation continues in this manner for all the remaining nodes. The forward-backward calculation on word graphs is similar to the calculation used on HMMs, but in word graphs the transition matrix is populated by the language model probabilities and the emission probability corresponds to the acoustic score. (Figure: the first few links of the word graph, "Sil", "this", "is", with node alphas such as α = 1, α = 0.5 and α = 1.675E-03.)

Page 9 of 21 Confidence measure using word posteriors: Forward-backward algorithm, continued

Once we have computed the alphas with the forward algorithm, we begin the beta computation using the backward algorithm. The backward algorithm is similar to the forward algorithm, but we start from the last node and proceed from right to left, again in two steps: Step 1, Initialization, and Step 2, Induction.
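A hedged rendering of the corresponding backward recursion, initialized at the final node and run from right to left:

$$
\beta(n_{\text{end}}) = 1,
\qquad
\beta(m) \;=\; \sum_{\text{links } (m \to n)} \beta(n)\; P_{\text{lm}}(w_{m \to n})\; p_{\text{ac}}(w_{m \to n}).
$$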

Page 10 of 21 Confidence measure using word posteriors: Forward-backward algorithm, continued

Let us follow the computation of the beta values from node 14 backwards, through nodes 13 and 12. (Figure: the last few links of the word graph, "sentence", "sense", "Sil", with node betas such as β = 1 at the final node, β = 0.833, β = 5.55E-3 and β = 1.66E-3.)

Page 11 of 21 Confidence measure using word posteriors: Forward-backward algorithm, continued

Continuing with node 11, we obtain the beta values for all the nodes in a similar manner, down to node 1. We can then compute the probability on each link (between two nodes) as follows: let us call this link probability Γ; then Γ(t-1, t) is computed as the product α(t-1) · β(t) · a_ij, where a_ij is the transition (language model) probability. These values give the un-normalized posterior probabilities of the word on the link, considering all possible paths through the link.
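Putting the three steps together, the following is a minimal Python sketch of the forward pass, the backward pass, and the un-normalized link posteriors, written against the illustrative WordGraph structure sketched earlier; the function and variable names are assumptions, not from the slides. It assumes node indices follow topological (time) order, with node 0 the start node and the last node the end node. The link's own acoustic and language model scores are included once explicitly so that every complete path through a link is counted exactly once; the slides write Γ as α·β·a_ij and fold the link likelihood in later, on the word-spotting example.

```python
def forward_backward_link_posteriors(graph):
    """Un-normalized link posteriors (Gamma) on a word graph.

    Assumes node 0 is the start node, node graph.num_nodes - 1 is the end
    node, and node indices follow topological (time) order.
    """
    last = graph.num_nodes - 1
    alpha = [0.0] * graph.num_nodes
    beta = [0.0] * graph.num_nodes
    alpha[0] = 1.0    # initial probability taken as 1 (paths begin with silence)
    beta[last] = 1.0  # backward initialization at the final node

    # Forward pass: sum the scores of all paths arriving at each node.
    for n in range(1, graph.num_nodes):
        alpha[n] = sum(alpha[l.start_node] * l.acoustic * l.lm
                       for l in graph.incoming(n))

    # Backward pass: sum the scores of all paths leaving each node.
    for n in range(last - 1, -1, -1):
        beta[n] = sum(beta[l.end_node] * l.acoustic * l.lm
                      for l in graph.outgoing(n))

    # Gamma: every complete path through a link, counted exactly once.
    gamma = {}
    for l in graph.links:
        gamma[(l.start_node, l.end_node, l.word)] = (
            alpha[l.start_node] * l.acoustic * l.lm * beta[l.end_node])
    return alpha, beta, gamma
```

Normalizing each Γ by the sum of the Γ values of all links covering the same time instant then yields the word posteriors described on the earlier slides.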

Page 12 of 21 Confidence measure using word posteriors: Word graph showing the computed alphas and betas

This is the word graph with every node annotated with its corresponding alpha and beta value. The assumption here is that the probability of occurrence of any word is 1/100, i.e. we have 100 words in a loop grammar. (Figure: the full word graph with the alpha and beta of each node, e.g. α = 1, β = 2.8843E-14 at the first node and α = 2.88E-14, β = 1 at the last node.)

Page 13 of 21 Confidence measure using word posteriors: Link probabilities calculated from the alphas and betas

The word graph below shows the links with their corresponding link posterior probabilities (not yet normalized). By choosing the links with the maximum posterior probability we can be certain that we have included the most probable words in the final sequence. (Figure: the word graph with a Γ value on each link, e.g. Γ = 5.74E-12, Γ = 2.87E-14, Γ = 4.288E-12, Γ = 8.421E-12.)

Page 14 of 21 Confidence measure using word posteriors: Some alternate approaches

The paper by F. Wessel ("Confidence Measures for Large Vocabulary Continuous Speech Recognition") describes alternate techniques to compute the posterior, because the drawback of the approach described above is that the lattice has to be very deep to accommodate enough links at the same time instant. To overcome this problem one can use a soft time margin instead of a hard margin, which is achieved by considering words that overlap to a certain degree. But by doing this, the author states, the normalization no longer works, since the probabilities are not summed within the same time frame and hence total more than unity. The author therefore suggests an approach in which the posteriors are computed frame by frame so that normalization is possible; the normalization is done by dividing each value by the sum of the posterior probabilities of all the words in that specific time frame. In the end it was found that the frame-by-frame normalization did not perform significantly better than the overlapping-time-marks approach. Instead of using the probabilities directly, one can use the logarithms of these probabilities so that the multiplications are converted to additions; also, we can directly use the acoustic and language model scores from the ASR's output lattice.
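As a sketch of the log-domain bookkeeping mentioned here (illustrative, not from the slides): sums of probabilities become stable log-additions, and products become sums of the log scores taken directly from the lattice.

```python
import math

def log_add(a, b):
    """Stable log(exp(a) + exp(b)); float('-inf') represents probability zero."""
    if a == float("-inf"):
        return b
    if b == float("-inf"):
        return a
    m = max(a, b)
    return m + math.log1p(math.exp(min(a, b) - m))

# One forward update in the log domain: each incoming link contributes the
# log-alpha of its start node plus its log acoustic and log LM scores.
# The numbers below are arbitrary illustrative values, not from the slides.
incoming_scores = [-1433.0 + (-12.5) + 0.0, -1450.0 + (-10.2) + 0.0]
log_alpha = float("-inf")
for score in incoming_scores:
    log_alpha = log_add(log_alpha, score)
```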

Page 15 of 21 Confidence measure using word posteriors: Using it in a real application

Using the algorithm in a real application:
* We need to perform word spotting without using a language model, i.e. we can only use a loop grammar.
* In order to spot the word of interest we construct a loop grammar with just this one word.
* The final one-best hypothesis will then consist of a sequence of the same word repeated N times, so the challenge is to determine which of these N words actually corresponds to the word of interest.
* This is achieved by computing the link posterior probabilities and selecting the one with the maximum value.

Page 16 of 21 Confidence measure using word posteriors: 1-best output from the word spotter

The recognizer puts out the following output: !SENT_START BIG BIG BIG !SENT_END. We have to determine which of the three instances of the word actually exists.

Page 17 of 21 Confidence measure using word posteriors: Lattice from one of the utterances

For this example we have to spot the word "BIG" in an utterance that consists of three words ("BIG TIED GOD"). All the links in the output lattice contain the word "BIG". The values on the links are the acoustic likelihoods in the log domain, so the forward-backward computation just involves adding these numbers in a systematic manner. (Figure: the output lattice between sent_start and sent_end, every link labelled "BIG" with its log acoustic likelihood.)

Page 18 of 21 Confidence measure using word posteriors: Alphas and betas for the lattice

The initial probability at both the start and end nodes is 1, so its logarithmic value is 0. The language model probability of the word is also 1, since it is the only word in the loop grammar. (Figure: the lattice between sent_start and sent_end annotated with log-domain alphas and betas, e.g. α = 0, α = -1433, α = -2528, α = -6761 at successive nodes, and β = -5917, β = -1861, β = 0 towards the end.)

Page 19 of 21 Confidence measure using word posteriors: Link posterior calculation

It is observed that we obtain greater discrimination between confidence levels if, in addition to the corresponding alphas and betas, we also multiply the final probability by the likelihood of the link itself; in this example we add the likelihood, since we are working in the log domain. (Figure: the lattice between sent_start and sent_end with the computed link posterior Γ on each link.)
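A hedged rendering of the score used here, with everything in the log domain so that the products become sums, and with the link's own log-likelihood added in for the extra discrimination noted above:

$$
\log \Gamma(i, j) \;=\; \alpha_{\log}(i) + \beta_{\log}(j) + \log a_{ij} + \log p_{\text{ac}}(w_{i \to j}),
$$

where $\log a_{ij} = 0$ in this example, since the language model probability of the single loop-grammar word is 1.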

Page 20 of 21 Confidence measure using word posteriors: Inference from the link posteriors

Links 1 to 5 correspond to the first word instance, while links 5 to 6 and 6 to 7 correspond to the second and third word instances respectively. It is very clear from the link posterior values that the first instance of the word "BIG" has a much higher probability than the other two.

Page 21 of 21 Confidence measure using word posteriors: References

F. Wessel, R. Schlüter, K. Macherey and H. Ney, "Confidence Measures for Large Vocabulary Continuous Speech Recognition," IEEE Transactions on Speech and Audio Processing, vol. 9, no. 3, March 2001.
Wessel, Macherey and Schlüter, "Using Word Probabilities as Confidence Measures," Proc. ICASSP '97.
G. Evermann and P. C. Woodland, "Large Vocabulary Decoding and Confidence Estimation using Word Posterior Probabilities," Proc. ICASSP 2000, Istanbul.
X. Huang, A. Acero and H.-W. Hon, Spoken Language Processing: A Guide to Theory, Algorithm, and System Development, Prentice Hall, 2001.
J. Deller et al., Discrete-Time Processing of Speech Signals, MacMillan Publishing Co., 2000.