1
Speech Processing Speech Recognition
August 24, 2005
2
LING 138/238, SYMBSYS 138: Intro to Computer Speech and Language Processing
Lecture 10: Speech Recognition (II). October 28, 2004. Dan Jurafsky
3
Outline for ASR
Acoustic Phonetics
ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system: Feature Extraction, Acoustic Model, Lexicon/Pronunciation Model, Decoder, Language Model
Evaluation
4
Speech Recognition Architecture
Speech Waveform → Spectral Feature Vectors → Phone Likelihoods P(o|q) → Words
1. Feature Extraction (Signal Processing)
2. Acoustic Model: Phone Likelihood Estimation (Gaussians or Neural Networks)
3. HMM Lexicon
4. Language Model (N-gram Grammar)
5. Decoder (Viterbi or Stack Decoder)
5
Noisy Channel Model
In speech recognition you observe an acoustic signal (A = a1,…,an) and you want to determine the most likely sequence of words (W = w1,…,wn): P(W | A).
Problem: A and W are too specific for reliable counts on observed data, and are very unlikely to occur in unseen data.
6
Noisy Channel Model
Assume that the acoustic signal (A) is already segmented with respect to word boundaries. P(W | A) could then be computed word by word, e.g. as a product of per-word terms P(wi | ai).
Problem: finding the most likely word corresponding to an acoustic representation depends on the context.
E.g., /'pre-z&ns/ could mean “presents” or “presence” depending on the context.
7
Noisy Channel Model
Given a candidate sequence W we need to compute P(W) and combine it with the acoustic evidence P(A | W). Applying Bayes’ rule:
P(W | A) = P(A | W) P(W) / P(A)
The denominator P(A) can be dropped, because it is constant for all W.
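As a toy illustration, here is a minimal sketch of this decision rule in Python; the scoring functions acoustic_log_likelihood and lm_log_prob are hypothetical placeholders for the acoustic model and language model discussed below.

```python
import math

# A minimal sketch of the noisy-channel decision rule. The two scoring
# functions are hypothetical placeholders returning log P(A|W) and log P(W).
def best_word_sequence(A, candidate_sentences,
                       acoustic_log_likelihood, lm_log_prob):
    """Return argmax_W P(A|W) * P(W), working in log space; P(A) is ignored
    because it is the same for every candidate W."""
    best_W, best_score = None, -math.inf
    for W in candidate_sentences:
        score = acoustic_log_likelihood(A, W) + lm_log_prob(W)
        if score > best_score:
            best_W, best_score = W, score
    return best_W
```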
8
Noisy Channel in a Picture
9
ASR Lexicon: Markov Models for pronunciation
10
The Hidden Markov model
11
Formal definition of HMM
States: a set of states Q = q1, q2, …, qN
Transition probabilities: a set of probabilities A = a01, a02, …, an1, …, ann, where each aij represents P(j | i), the probability of moving from state i to state j
Observation likelihoods: a set of likelihoods B = bi(ot), the probability that state i generated observation ot
Special non-emitting initial and final states
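To keep the later sketches concrete, here is a minimal Python container for this definition; the field names are illustrative, not part of the lecture.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class HMM:
    """A discrete-observation HMM lambda = (A, B, pi)."""
    states: list      # Q = [q1, ..., qN]
    A: np.ndarray     # A[i, j] = P(next state j | current state i), shape (N, N)
    B: np.ndarray     # B[i, k] = P(observation symbol k | state i), shape (N, V)
    pi: np.ndarray    # pi[i] = P(start in state i), shape (N,)
```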
12
Pieces of the HMM
Observation likelihoods (‘b’), p(o|q), represent the acoustics of each phone and are computed by the Gaussians (“Acoustic Model”, or AM).
Transition probabilities represent the probability of different pronunciations (different sequences of phones).
States correspond to phones.
13
Pieces of the HMM
Actually, states usually correspond to triphones.
CHEESE (phones): ch iy z
CHEESE (triphones): #-ch+iy, ch-iy+z, iy-z+#
In fact, each triphone has 3 states for the beginning, middle, and end of the triphone.
14
A real HMM
15
HMMs: what’s the point again?
The HMM is used to compute P(O|W), i.e. the likelihood of the acoustic sequence O given a string of words W, as part of our generative model.
We do this for every possible sentence of English and then pick the most likely one.
How? Decoding.
16
The Three Basic Problems for HMMs
Problem 1 (Evaluation): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B, π), how do we efficiently compute P(O | λ), the probability of the observation sequence given the model?
Problem 2 (Decoding): Given the observation sequence O = (o1 o2 … oT) and an HMM model λ = (A, B, π), how do we choose a corresponding state sequence Q = (q1, q2, …, qT) that is optimal in some sense (i.e. best explains the observations)?
Problem 3 (Learning): How do we adjust the model parameters λ = (A, B, π) to maximize P(O | λ)?
17
The Evaluation Problem
Computing the likelihood of the observation sequence: why is this hard?
Imagine the HMM for “need” above, with subphones n0 n1 n2 iy4 iy5 d6 d7 d8, and that each state has a loopback.
Given a series of 350 ms of speech (35 observations), there are very many possible alignments of states to observations.
We would have to sum over all of these alignments to compute P(O|W).
18
Given a Word string W, compute p(O|W)
Sum over all possible sequences of states:
P(O|W) = ΣQ P(O, Q | W)
19
Summary: Computing the observation likelihood P(O|λ)
Why can’t we do an explicit sum over all paths? Because it’s intractable: O(N^T).
What do we do instead? The Forward Algorithm, which is O(N²T): it uses dynamic programming to compute P(O|λ).
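A minimal sketch of the Forward algorithm, assuming the hypothetical HMM container above and integer-coded observations:

```python
import numpy as np

def forward_likelihood(hmm, observations):
    """Return P(O | lambda) in O(N^2 T) time via dynamic programming."""
    N = len(hmm.pi)
    T = len(observations)
    alpha = np.zeros((T, N))
    alpha[0] = hmm.pi * hmm.B[:, observations[0]]           # initialization
    for t in range(1, T):
        # alpha[t, j] = sum_i alpha[t-1, i] * a_ij * b_j(o_t)
        alpha[t] = (alpha[t - 1] @ hmm.A) * hmm.B[:, observations[t]]
    return alpha[-1].sum()                                   # termination
```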
20
The Decoding Problem
Given observations O = (o1 o2 … oT) and an HMM λ = (A, B, π), how do we choose the best state sequence Q = (q1, q2, …, qT)?
The forward algorithm computes P(O|W).
We could find the best W by running the forward algorithm for each W in the language L and picking the W maximizing P(O|W).
But we can’t do this, since the number of sentences is exponential in the length of the utterance.
Instead:
Viterbi Decoding: a dynamic programming modification of the forward algorithm.
A* Decoding: search the space of all possible sentences, using the forward algorithm as a subroutine.
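For comparison, a minimal sketch of Viterbi decoding under the same assumptions: it replaces the Forward algorithm’s sum with a max and keeps backpointers so the best state sequence can be recovered.

```python
import numpy as np

def viterbi_decode(hmm, observations):
    """Return the most likely state sequence and its probability."""
    N, T = len(hmm.pi), len(observations)
    delta = np.zeros((T, N))           # best path probability ending in each state
    psi = np.zeros((T, N), dtype=int)  # backpointers
    delta[0] = hmm.pi * hmm.B[:, observations[0]]
    for t in range(1, T):
        # scores[i, j] = delta[t-1, i] * a_ij
        scores = delta[t - 1][:, None] * hmm.A
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * hmm.B[:, observations[t]]
    # Trace back the best path from the best final state.
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return list(reversed(path)), float(delta[-1].max())
```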
21
Viterbi: the intuition
22
Viterbi: Search
23
Viterbi: Word Internal
24
Viterbi: Between words
25
Language Modeling
The noisy channel model expects P(W), the probability of the sentence.
We saw this was also used in the decoding process, as the probability of transitioning from one word to another.
The model that computes P(W) is called the language model.
26
The Chain Rule
Recall the definition of conditional probabilities:
P(A | B) = P(A, B) / P(B)
Rewriting: P(A, B) = P(A | B) P(B)
Or: P(A, B) = P(B | A) P(A)
27
The Chain Rule more generally
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
“The big red dog was”:
P(The) * P(big|the) * P(red|the big) * P(dog|the big red) * P(was|the big red dog)
Better: P(The | <Beginning of sentence>), written as P(The | <S>)
28
General case
The word sequence from position 1 to n is w1^n = w1 w2 … wn.
So the probability of a sequence is:
P(w1^n) = P(w1) P(w2|w1) P(w3|w1^2) … P(wn|w1^(n-1)) = ∏k=1..n P(wk | w1^(k-1))
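A minimal sketch of this chain-rule computation with a bigram (first-order Markov) approximation, where P(wk | w1^(k-1)) is approximated by P(wk | wk-1) estimated from counts; the unsmoothed estimates and sentence markers are purely illustrative.

```python
from collections import defaultdict
import math

def train_bigram_lm(sentences):
    """sentences: list of word lists. Return unigram and bigram counts."""
    unigrams, bigrams = defaultdict(int), defaultdict(int)
    for words in sentences:
        words = ["<s>"] + words + ["</s>"]
        for w1, w2 in zip(words, words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    return unigrams, bigrams

def sentence_log_prob(words, unigrams, bigrams):
    """log P(W) under the bigram approximation (no smoothing)."""
    words = ["<s>"] + words + ["</s>"]
    logp = 0.0
    for w1, w2 in zip(words, words[1:]):
        if bigrams[(w1, w2)] == 0:       # unseen bigram: probability zero
            return float("-inf")
        logp += math.log(bigrams[(w1, w2)] / unigrams[w1])
    return logp
```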
29
Learning: setting all the parameters in an ASR system
Given:
a training set: wavefiles & word transcripts for each sentence
a hand-built HMM lexicon
initial acoustic models stolen from another recognizer
1. Train a LM on the word transcripts + other data.
2. For each sentence, create one big HMM by combining all the HMM-words together.
3. Use the Viterbi algorithm to align the HMM against the data, resulting in a phone labeling of the speech.
4. Train new Gaussian acoustic models.
5. Iterate (go to 2).
30
Word Error Rate
WER = 100 * (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
Alignment example:
REF: portable ****  PHONE  UPSTAIRS last night so
HYP: portable FORM  OF     STORES   last night so
Eval:         I     S      S
WER = 100 * (1 + 2 + 0) / 6 = 50%
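A minimal sketch of the WER computation via a Levenshtein alignment of reference and hypothesis words (insertions, deletions, and substitutions all cost 1, as in the example above):

```python
def word_error_rate(ref, hyp):
    """Return WER in percent for whitespace-tokenized ref and hyp strings."""
    ref, hyp = ref.split(), hyp.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                       # i deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                       # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # match or substitution
    return 100.0 * d[len(ref)][len(hyp)] / len(ref)

# word_error_rate("portable phone upstairs last night so",
#                 "portable form of stores last night so")  -> 50.0
```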
31
Decoding
The decoder combines evidence from:
the likelihood P(A | W), which can be approximated as a product of per-word acoustic likelihoods, and
the prior P(W), which can be approximated by an N-gram language model.
32
Search Space
Given a word-segmented acoustic sequence, list all candidates, then compute the most likely path:
'bot      ik-'spen-siv   'pre-z&ns
boat      excessive      presidents
bald      expensive      presence
bold      expressive     presents
bought    inactive       press
33
Problem 3: Learning
Unfortunately, there is no known way to analytically find a global maximum, i.e., a model λ̂ such that P(O | λ̂) ≥ P(O | λ) for every model λ.
But it is possible to find a local maximum: given an initial model λ, we can always find a model λ̂ such that P(O | λ̂) ≥ P(O | λ).
34
Parameter Re-estimation
Use the forward-backward (or Baum-Welch) algorithm, which is a hill-climbing algorithm.
Using an initial parameter instantiation, the forward-backward algorithm iteratively re-estimates the parameters, improving the probability that the given observations are generated by the new parameters.
35
Parameter Re-estimation
Three parameters need to be re-estimated:
Initial state distribution: πi
Transition probabilities: ai,j
Emission probabilities: bi(ot)
36
Re-estimating Transition Probabilities
What’s the probability of being in state si at time t and going to state sj, given the current model and parameters? In standard notation this is ξt(i, j) = P(qt = si, qt+1 = sj | O, λ).
37
Re-estimating Transition Probabilities
Using the forward and backward probabilities, this quantity can be computed as:
ξt(i, j) = αt(i) aij bj(ot+1) βt+1(j) / P(O | λ)
38
Re-estimating Transition Probabilities
The intuition behind the re-estimation equation for transition probabilities is:
âij = (expected number of transitions from state si to state sj) / (expected number of transitions from state si)
Formally:
âij = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 Σk ξt(i, k)
39
Re-estimating Transition Probabilities
Defining γt(i) = Σj ξt(i, j) as the probability of being in state si at time t, given the complete observation O, we can say:
âij = Σt=1..T-1 ξt(i, j) / Σt=1..T-1 γt(i)
40
Review of Probabilities
Forward probability αt(i): the probability of being in state si, given the partial observation o1,…,ot
Backward probability βt(i): the probability of being in state si, given the partial observation ot+1,…,oT
Transition probability ξt(i, j): the probability of going from state si to state sj, given the complete observation o1,…,oT
State probability γt(i): the probability of being in state si, given the complete observation o1,…,oT
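A minimal sketch of how these quantities fit together, assuming alpha and beta arrays of shape (T, N) from forward and backward passes over the hypothetical HMM container used earlier:

```python
import numpy as np

def state_and_transition_posteriors(hmm, observations, alpha, beta):
    """Return gamma[t, i] = P(q_t = s_i | O) and
    xi[t, i, j] = P(q_t = s_i, q_t+1 = s_j | O)."""
    T, N = alpha.shape
    likelihood = alpha[-1].sum()                 # P(O | lambda)
    gamma = (alpha * beta) / likelihood          # shape (T, N)
    xi = np.zeros((T - 1, N, N))
    for t in range(T - 1):
        # xi[t, i, j] = alpha_t(i) * a_ij * b_j(o_t+1) * beta_t+1(j) / P(O)
        xi[t] = (alpha[t][:, None] * hmm.A *
                 hmm.B[:, observations[t + 1]][None, :] *
                 beta[t + 1][None, :]) / likelihood
    return gamma, xi
```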
41
Re-estimating Initial State Probabilities
Initial state distribution: πi is the probability that si is a start state.
Re-estimation is easy: π̂i = the expected frequency of being in state si at time 1.
Formally: π̂i = γ1(i)
42
Re-estimation of Emission Probabilities
Emission probabilities are re-estimated as:
b̂i(k) = (expected number of times in state si observing symbol vk) / (expected number of times in state si)
Formally:
b̂i(k) = Σt=1..T δ(ot, vk) γt(i) / Σt=1..T γt(i)
where δ(ot, vk) = 1 if ot = vk and 0 otherwise. Note that this δ is the Kronecker delta function and is not related to the δ in the discussion of the Viterbi algorithm!
43
The Updated Model
Coming from λ = (A, B, π) we get to λ̂ = (Â, B̂, π̂) by the following update rules:
π̂i = γ1(i)
âij = Σt ξt(i, j) / Σt γt(i)
b̂i(k) = Σt δ(ot, vk) γt(i) / Σt γt(i)
44
Expectation Maximization
The forward-backward algorithm is an instance of the more general EM algorithm.
The E step: compute the forward and backward probabilities for a given model.
The M step: re-estimate the model parameters.
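A minimal sketch of the M step for a single observation sequence, reusing the hypothetical gamma/xi posteriors sketched above; the update formulas follow the re-estimation slides, and the forward/backward passes themselves are assumed given.

```python
import numpy as np

def baum_welch_update(hmm, observations, gamma, xi):
    """M step: return re-estimated (pi, A, B) from the E-step posteriors."""
    T, N = gamma.shape
    V = hmm.B.shape[1]
    new_pi = gamma[0]                                          # pi_i = gamma_1(i)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # a_ij update
    new_B = np.zeros((N, V))
    for k in range(V):
        mask = (np.asarray(observations) == k)                 # delta(o_t, v_k)
        new_B[:, k] = gamma[mask].sum(axis=0) / gamma.sum(axis=0)
    return new_pi, new_A, new_B
```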
45
Summary
ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system:
Feature Extraction: 39 “MFCC” features
Acoustic Model: Gaussians for computing p(o|q)
Lexicon/Pronunciation Model: HMM
Next step: Decoding, i.e. how to combine these to compute words from speech!
46
Acoustic Modeling
Given a 39-dimensional vector corresponding to the observation of one frame oi, and given a phone q we want to detect, compute p(oi|q).
Most popular method: GMM (Gaussian mixture models)
Other methods: MLP (multi-layer perceptron)
47
Acoustic Modeling: MLP computes p(q|o)
48
Gaussian Mixture Models
Also called “fully-continuous HMMs”.
P(o|q) is computed by a Gaussian:
p(o|q) = (1 / (σ√(2π))) exp(−(o − μ)² / (2σ²))
49
Gaussians for Acoustic Modeling
A Gaussian is parameterized by a mean and a variance.
[Figure: Gaussians with different means; P(o|q) is highest at the mean and low for o very far from the mean.]
50
Training Gaussians
A (single) Gaussian is characterized by a mean and a variance.
Imagine that we had some training data in which each phone was labeled.
We could just compute the mean and variance from the data:
μi = (1/T) Σt=1..T ot
σi² = (1/T) Σt=1..T (ot − μi)²
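A minimal sketch of this training step and of scoring a frame with the resulting (univariate) Gaussian; the labeled-frame arrays are illustrative placeholders.

```python
import numpy as np

def train_gaussian(frames):
    """frames: array of scalar observations labeled with one phone."""
    mu = np.mean(frames)
    var = np.var(frames)                  # sigma^2
    return mu, var

def gaussian_log_likelihood(o, mu, var):
    """log p(o | q) for a single univariate Gaussian."""
    return -0.5 * (np.log(2 * np.pi * var) + (o - mu) ** 2 / var)
```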
51
But we need 39 Gaussians, not 1!
The observation o is really a vector of length 39, so we need a vector of Gaussians: 39 means and 39 variances, one per feature dimension.
52
Actually, a mixture of Gaussians
Each phone is modeled by a weighted sum of different Gaussians, and is hence able to model complex facts about the data.
[Figure: mixture densities for Phone A and Phone B.]
53
Gaussian acoustic modeling
Summary: each phone is represented by a GMM parameterized by:
M mixture weights
M mean vectors
M covariance matrices
Usually we assume the covariance matrix is diagonal, i.e. we just keep a separate variance for each cepstral feature.
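A minimal sketch of the diagonal-covariance GMM likelihood just described; the weights, means, and variances are illustrative parameters, and a log-sum-exp is used for numerical stability.

```python
import numpy as np

def gmm_log_likelihood(o, weights, means, variances):
    """Return log p(o | q) for one 39-dim frame o under a diagonal GMM.
    weights: (M,), means: (M, 39), variances: (M, 39)."""
    o = np.asarray(o)
    # Per-mixture diagonal-Gaussian log densities, summed over the 39 dims.
    log_comp = -0.5 * np.sum(
        np.log(2 * np.pi * variances) + (o - means) ** 2 / variances, axis=1)
    # log sum_m w_m N(o; mu_m, Sigma_m), computed stably in log space.
    weighted = np.log(weights) + log_comp
    m = weighted.max()
    return m + np.log(np.exp(weighted - m).sum())
```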
54
Summary
ASR Architecture
The Noisy Channel Model
Five easy pieces of an ASR system:
Feature Extraction: 39 “MFCC” features
Acoustic Model: Gaussians for computing p(o|q)
Lexicon/Pronunciation Model: HMM
Next time: Decoding, i.e. how to combine these to compute words from speech!
55
Outline
Computing Word Error Rate
Goal of search: how to combine AM and LM
Viterbi search
Review and adding in the LM
Beam search
Silence models
A* Search
Fast match
Tree structured lexicons
N-Best and multipass search
N-best
Word lattice and word graph
Forward-Backward search (not related to F-B training)
56
Evaluation
How do we evaluate recognizers? Word error rate.
57
Word Error Rate
WER = 100 * (Insertions + Substitutions + Deletions) / (Total Words in Correct Transcript)
Alignment example:
REF: portable ****  PHONE  UPSTAIRS last night so
HYP: portable FORM  OF     STORES   last night so
Eval:         I     S      S
WER = 100 * (1 + 2 + 0) / 6 = 50%
58
NIST sctk-1.3 scoring software: Computing WER with sclite
sclite aligns a hypothesized text (HYP, from the recognizer) with a correct or reference text (REF, human transcribed):
id: (2347-b-013) Scores: (#C #S #D #I)
REF: was an engineer SO I   i was always with **** **** MEN  UM   and they
HYP: was an engineer ** AND i was always with THEM THEY ALL  THAT and they
Eval:                D  S                     I    I    S    S
59
More on sclite: SYSTEM SUMMARY PERCENTAGES by SPEAKER
[Table: per-speaker summary for /csrnab.hyp, with columns SPKR, # Snt, # Wrd, Corr, Sub, Del, Ins, Err, S.Err, and rows for speakers 4t0, 4t1, 4t2 plus Sum/Avg, Mean, S.D., and Median.]
60
Sclite output for error analysis
CONFUSION PAIRS  Total (972)  With >= 1 occurances (972)
1: (%hesitation) ==> on
2: the ==> that
3: but ==> that
4: a ==> the
5: four ==> for
6: in ==> and
7: there ==> that
8: (%hesitation) ==> and
9: (%hesitation) ==> the
10: (a-) ==> i
11: and ==> i
12: and ==> in
13: are ==> there
14: as ==> is
15: have ==> that
16: is ==> this
61
Sclite output for error analysis
17: it ==> that
18: mouse ==> most
19: was ==> is
20: was ==> this
21: you ==> we
22: (%hesitation) ==> it
23: (%hesitation) ==> that
24: (%hesitation) ==> to
25: (%hesitation) ==> yeah
26: a ==> all
27: a ==> know
28: a ==> you
29: along ==> well
30: and ==> it
31: and ==> we
32: and ==> you
33: are ==> i
34: are ==> were
62
Summary on WER
WER is clearly better than metrics such as perplexity.
But should we be more concerned with meaning (“semantic error rate”)? A good idea, but hard to agree on. It has been applied in dialogue systems, where the desired semantic output is clearer.
Recent research: modify training to directly minimize WER instead of maximizing likelihood.
63
Part II: Search
64
What we are searching for
Given the Acoustic Model (AM) and Language Model (LM):
Ŵ = argmax_W P(O|W) · P(W)    (1)
where P(O|W) is the AM (likelihood) and P(W) is the LM (prior).
65
Combining Acoustic and Language Models
We don’t actually use equation (1).
The AM underestimates the acoustic probability. Why? Bad independence assumptions. Intuition: we compute (independent) AM probability estimates every 10 ms, but the LM only every word.
The AM and LM have vastly different dynamic ranges.
66
Language Model Scaling Factor
Solution: add a language model weight (also called the language weight LW, or language model scaling factor LMSF):
Ŵ = argmax_W P(O|W) P(W)^LMSF
The value is determined empirically and is positive (why?). For Sphinx and similar systems it generally lies in a standard empirical range.
67
Word Insertion Penalty
But the LM probability P(W) also functions as a penalty for inserting words.
Intuition: when a uniform language model (every word has an equal probability) is used, the LM probability is a 1/N penalty multiplier taken for each word.
If the penalty is large, the decoder will prefer fewer, longer words.
If the penalty is small, the decoder will prefer more, shorter words.
When tuning the LM weight to balance the AM, this penalty is affected as a side effect, so we add a separate word insertion penalty to offset it.
68
Log domain
We do everything in the log domain, so the final equation is:
log P(O|W) + LMSF · log P(W) + N · log(WIP)
where N is the number of words in the hypothesis and WIP is the word insertion penalty.
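A minimal sketch of this combined score; am_log_likelihood and lm_log_prob are hypothetical helpers, and the LMSF and WIP values are illustrative, not the tuned values of any particular system.

```python
import math

def combined_score(O, W, am_log_likelihood, lm_log_prob,
                   lmsf=10.0, wip=0.7):
    """log P(O|W) + LMSF * log P(W) + N * log(WIP) for word sequence W."""
    return (am_log_likelihood(O, W)
            + lmsf * lm_log_prob(W)
            + len(W) * math.log(wip))
```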
69
Language Model Scaling Factor
As the LMSF is increased:
More deletion errors (since it increases the penalty for transitioning between words)
Fewer insertion errors
Need a wider search beam (since path scores become larger in magnitude)
Less influence of the acoustic model observation probabilities
(Text from Bryan Pellom’s slides)
70
Word Insertion Penalty
Controls the trade-off between insertion and deletion errors.
As the penalty becomes larger (more negative):
More deletion errors
Fewer insertion errors
Acts as a model of the effect of length on probability, but probably not a good model (the geometric assumption is probably bad for short sentences).
(Text augmented from Bryan Pellom’s slides)
71
Part III: More on Viterbi
72
Adding LM probabilities to Viterbi: (1) Uniform LM
Visualizing the search space for 2 words. (Figure from Huang et al., page 611.)
73
Viterbi trellis with 2 words and uniform LM
Null transition from the end-state of each word to the start-state of all (both) words. (Figure from Huang et al., page 612.)
74
Viterbi for 2 word continuous recognition
Viterbi search: computations are done time-synchronously from left to right, i.e. each cell for time t is computed before proceeding to time t+1. (Text from Kjell Elenius course slides; figure from Huang, page 612.)
75
Search space for unigram LM
(Figure from Huang et al., page 617.)
76
Search space with bigrams
(Figure from Huang et al., page 618.)
77
Silences
Each word HMM has an optional silence at the end.
Model for the word “two” with two final states.
78
Reminder: Viterbi approximation
Correct equation:
P(O|W) = ΣQ P(O, Q | W)
We approximate this as:
P(O|W) ≈ maxQ P(O, Q | W)
This is often called “the Viterbi approximation”: the most likely word sequence is approximated by the most likely state sequence.
79
Speeding things up
Viterbi is O(N²T), where N is the total number of HMM states and T is the length of the utterance.
This is too large for real-time search.
A ton of work in ASR search is just to make search faster:
Beam search (pruning)
Fast match
Tree-based lexicons
80
Beam search
Instead of retaining all candidates (cells) at every time frame, use a threshold T to keep only a subset:
At each time t, identify the state with the lowest cost Dmin.
Each state with cost > Dmin + T is discarded (“pruned”) before moving on to time t+1.
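A minimal sketch of this pruning step, assuming costs are negative log probabilities (lower is better) and an illustrative beam width:

```python
def prune_beam(active, beam_width=200.0):
    """active: dict mapping state -> cost at the current frame.
    Keep only states whose cost is within beam_width of the best state."""
    d_min = min(active.values())
    return {state: cost for state, cost in active.items()
            if cost <= d_min + beam_width}
```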
81
Viterbi Beam search
The most common and powerful search algorithm for LVCSR.
Note: what makes this possible is time-synchronous search: we are comparing paths of equal length.
For two different word sequences W1 and W2, we are comparing P(W1|O0..t) and P(W2|O0..t), based on the same partial observation sequence O0..t, so the denominator is the same and can be ignored.
Time-asynchronous search (A*) is harder.
82
Viterbi Beam Search
Empirically, a beam size of 5-10% of the search space suffices.
Thus 90-95% of HMM states don’t have to be considered at each time t. Vast savings in time.
83
Part IV: A* Search
84
A* Decoding Intuition:
If we had good heuristics for guiding decoding, we could do depth-first (best-first) search and not waste all our time computing all those paths at every time step, as Viterbi does.
A* decoding, also called stack decoding, is an attempt to do that.
A* also does not make the Viterbi assumption: it uses the actual forward probability rather than the Viterbi approximation.
85
Reminder: A* search
A search algorithm is “admissible” if it can guarantee to find an optimal solution if one exists.
Heuristic search functions rank nodes in the search space by f(N), the goodness of each node N in a search tree, computed as:
f(N) = g(N) + h(N)
where
g(N) = the distance of the partial path already traveled from the root S to node N
h(N) = a heuristic estimate of the remaining distance from node N to the goal node G.
86
Reminder: A* search
If the heuristic function h(N) estimating the remaining distance from N to the goal node G is an underestimate of the true distance, best-first search is admissible, and is called A* search.
87
A* search for speech
The search space is the set of possible sentences.
The forward algorithm can tell us the cost of the current path so far, g(.).
We need an estimate of the cost from the current node to the end, h(.).
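A minimal sketch of stack (A*) decoding along these lines; path_cost_so_far plays the role of g(.), estimate_remaining the role of h(.), and next_word_candidates a fast-match style expansion. All three helpers, and the tuple-of-words path representation, are assumptions for illustration.

```python
import heapq

def stack_decode(O, start, next_word_candidates,
                 path_cost_so_far, estimate_remaining, is_complete):
    """Best-first search over partial sentences; lower f = g + h is better.
    start is an empty tuple of words. With an admissible h, the first
    complete hypothesis popped off the stack is the best one."""
    frontier = [(0.0, start)]                    # priority queue ordered by f
    while frontier:
        f, words = heapq.heappop(frontier)
        if is_complete(words):
            return words
        for w in next_word_candidates(words):
            new_words = words + (w,)
            g = path_cost_so_far(O, new_words)   # forward-algorithm prefix cost
            h = estimate_remaining(O, new_words) # heuristic for the rest
            heapq.heappush(frontier, (g + h, new_words))
    return None
```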
88
A* Decoding (2)
89
Stack decoding (A*) algorithm
90
A* Decoding (2)
91
A* Decoding (cont.)
92
A* Decoding (cont.)
93
Making A* work: h(.)
If h(.) is zero, we get breadth-first search.
Stupid estimates of h(.): the amount of time left in the utterance.
Slightly smarter: estimate the expected cost-per-frame for the remaining path and multiply that by the remaining time. This can be computed from the training set (the average acoustic cost for a frame in the training set).
Later: in multi-pass decoding, we can use the backwards algorithm to estimate h* for any hypothesis!
94
A*: When to extend new words
Stack decoding is asynchronous.
We need to detect when a phone/word ends, so the search can extend to the next phone/word.
If we had a cost measure of how well the input matches the HMM state sequence so far, we could look for this cost measure slowly going down, and then sharply going up as we start to see the start of the next word.
We can’t use the forward algorithm because we can’t compare hypotheses of different lengths.
We can do various length normalizations to get a normalized cost.
95
Fast match
Efficiency: we don’t want to expand every single next word to see if it’s good; we need a quick heuristic for deciding which sets of words are good expansions.
“Fast match” is the name for this class of heuristics.
We can do some simple approximate matching to find words whose initial phones seem to match the upcoming input.
96
Part V: Tree structured lexicons
97
Tree structured lexicon
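A minimal sketch of a tree-structured (prefix-tree) lexicon over phone strings, so that words sharing initial phones share computation during search; the toy pronunciations are illustrative.

```python
def build_lexicon_tree(lexicon):
    """lexicon: dict word -> list of phones. Returns a nested dict of phones;
    the special key '#' at a node lists the words that end there."""
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault("#", []).append(word)
    return root

# Example: "presents" and "presence" share the prefix p r eh z ax n.
tree = build_lexicon_tree({
    "presents": ["p", "r", "eh", "z", "ax", "n", "t", "s"],
    "presence": ["p", "r", "eh", "z", "ax", "n", "s"],
})
```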
98
Part VI: N-best and multipass search
99
N-best and multipass search algorithms
The ideal search strategy would use every available knowledge source (KS).
But it is often difficult or expensive to integrate a very complex KS into the first-pass search.
For example, parsers as a language model have long-distance dependencies that violate dynamic programming assumptions.
Other knowledge sources might not be left-to-right (knowledge of following words can help predict preceding words).
For this reason (and others we will see) we use multipass search algorithms.
100
Multipass Search
101
Some definitions
N-best list: instead of the single best sentence (word string), return an ordered list of N sentence hypotheses.
Word lattice: a compact representation of word hypotheses and their times and scores.
Word graph: an FSA representation of the lattice in which times are represented by the topology.
102
N-best list (From Huang et al., page 664)
103
Word lattice
Encodes:
the word
starting/ending time(s) of the word
the acoustic score of the word
More compact than an N-best list: an utterance with 10 words and 2 hypotheses per word yields 2^10 = 1024 different sentences, but a lattice with only 20 different word hypotheses.
(From Huang et al., page 665)
104
Word Graph (From Huang et al., page 665)
105
Converting word lattice to word graph
A word lattice can have a range of possible end frames for a word.
Create an edge from (wi, ti) to (wj, tj) if tj - 1 is one of the end-times of wi.
(Bryan Pellom’s algorithm and figure, from his slides)
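A minimal sketch of this edge-creation rule, assuming lattice entries are (word, start_frame, set_of_end_frames) tuples, an illustrative representation rather than the format from the slides:

```python
def lattice_to_word_graph(lattice):
    """Return edges ((wi, ti), (wj, tj)) whenever wj can start right after wi ends."""
    edges = []
    for wi, ti, ends_i in lattice:
        for wj, tj, _ in lattice:
            if tj - 1 in ends_i:           # wj starts one frame after an end of wi
                edges.append(((wi, ti), (wj, tj)))
    return edges
```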
106
Lattices
Some researchers are careful to distinguish between word graphs and word lattices, but we’ll follow convention in using “lattice” to mean both word graphs and word lattices.
Two facts about lattices:
Density: the number of word hypotheses or word arcs per uttered word.
Lattice error rate (also called the “lower bound error rate”): the lowest word error rate for any word sequence in the lattice.
The lattice error rate is the “oracle” error rate, the best possible error rate you could get from rescoring the lattice; we can use it as a bound on what lattice rescoring can achieve.
107
Computing N-best lists
In the worst case, an admissible algorithm for finding the N most likely hypotheses is exponential in the length of the utterance (S. Young, “Generating Multiple Solutions from Connected Word DP Recognition Algorithms”, Proc. of the Institute of Acoustics, 6:4).
For example, if the AM and LM scores were nearly identical for all word sequences, we would have to consider all permutations of word sequences for the whole sentence (all with the same scores).
But of course if this were true, we couldn’t do ASR at all!
108
Computing N-best lists
Instead, various non-admissible algorithms:
(Viterbi) Exact N-best
(Viterbi) Word-Dependent N-best
109
A* N-best
A* (stack decoding) is best-first search, so we can just keep generating results until it finds N complete paths. This is the N-best list. But it is inefficient.
110
Exact N-best for time-synchronous Viterbi
Due to Schwartz and Chow; also called “sentence-dependent N-best”.
Idea: maintain separate records for paths with distinct histories.
History: the whole word sequence up to the current time t and word w.
When 2 or more paths come to the same state at the same time, merge paths with the same history and sum their probabilities; otherwise, retain only the N best paths for each state.
111
Exact N-best for time-synchronous Viterbi
Efficiency: a typical HMM state has 2 or 3 predecessor states within the word HMM, so for each time frame and state we need to compare/merge 2 or 3 sets of N paths into N new paths.
At the end of the search, the N paths in the final state of the trellis are reordered to get the N-best word sequences.
Complexity is O(N); this is too slow for practical systems.
112
Forward-Backward Search
It is useful to know how well a given partial path will do in the rest of the speech, but we can’t do this in a one-pass search.
Two-pass strategy: Forward-Backward Search.
113
Forward-Backward Search
First perform a forward search, computing partial forward scores for each state.
Then do a second-pass search backwards, from the last frame of speech back to the first, using those scores as:
a heuristic estimate of the h* function for A* search, or
a fast-match score for the remaining path.
Details: the forward pass must be fast (Viterbi with a simplified AM and LM); the backward pass can be A* or Viterbi.
114
Forward-Backward Search
Forward pass: at each time t, record the score of the final state of each word ending.
The set of words whose final states are active (surviving in the beam) at time t is recorded; the score of the final state of each word w in this set is αt(w): the sum of the cost of matching the utterance up to time t given the most likely word sequence ending in word w, plus the cost of the LM score for that word sequence.
At the end of the forward search, the best cost is αT.
Backward pass: run in reverse (backward), considering the last frame T as the beginning one. Both the AM and the LM need to be reversed. Usually an A* search.
115
Forward-Backward Search: Backward pass, at each time t
The best path is removed from the stack and a list of possible one-word extensions is generated.
Suppose the best path at time t is ph_wj, where wj is the first word of this partial path (the last word expanded in the backward search); the current score of path ph_wj is βt(ph_wj).
We want to extend to the next word wi. Two questions:
1. Find an h* heuristic for estimating the future input stream: use the forward score αt(wi). So the new score for the word is αt(wi) + βt(ph_wj).
2. Find the best crossing time t between wi and wj: t* = argmin_t [αt(wi) + βt(ph_wj)].
116
One-pass vs. multipass
Potential problems with multipass:
Can’t use it for real-time (you need the end of the sentence), but successive passes can be kept really fast.
Each pass can introduce inadmissible pruning (but one-pass does the same with beam pruning and fast match).
Why multipass:
Very expensive KSs (NL parsing, higher-order n-grams, etc.)
Spoken language understanding: N-best is a perfect interface.
Research: N-best lists are very powerful offline tools for algorithm development.
N-best lists are needed for discriminative training (MMIE, MCE) to get rival hypotheses.
117
Summary
Computing Word Error Rate
Goal of search: how to combine AM and LM
Viterbi search
Review and adding in the LM
Beam search
Silence models
A* Search
Fast match
Tree structured lexicons
N-Best and multipass search
N-best
Word lattice and word graph
Forward-Backward search (not related to F-B training)