2 Outline The speech recognition problem Search algorithms A time-synchronous Viterbi-based decoder Performance analysis Summary
3 The speech recognition problem Many fundamentals of the speech communication process are still not clearly understood and defy rigorous mathematical description. It is important that the system be able to handle a large vocabulary and be independent of speaker and language characteristics such as accents, speaking styles, disfluencies, syntax, and grammar.
4 Search algorithms (cont) Typical search algorithms 1. Viterbi search 2. Stack decoders 3. Multi-pass search 4. Forward-backward search
5 Search algorithms (cont) Viterbi search : A class of breadth-first search techniques Time-synchronous Viterbi beam search is used to reduce the search space The main problem is that the state-level information cannot be merged readily to reduce the number of required computations
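The time-synchronous beam search named on this slide can be sketched roughly as follows. This is a minimal illustration on a toy HMM, not the decoder described in these slides. The names `viterbi_beam` and `prune`, and the dense-matrix inputs, are assumptions made for the example.

```python
import math

def prune(hyps, beam):
    """Keep hypotheses whose score is within `beam` of the frame's best score."""
    best = max(score for score, _ in hyps.values())
    return {s: h for s, h in hyps.items() if h[0] >= best - beam}

def viterbi_beam(obs_loglik, log_trans, log_init, beam):
    """Time-synchronous Viterbi search with beam pruning (illustrative only).

    obs_loglik[t][s]: log-likelihood of frame t under state s
    log_trans[p][s]:  log transition probability from state p to state s
    log_init[s]:      log initial probability of state s
    Returns (best_log_score, best_state_sequence).
    """
    n_states = len(log_init)
    # Active hypotheses for the current frame: state -> (score, state path).
    active = {s: (log_init[s] + obs_loglik[0][s], [s]) for s in range(n_states)}
    active = prune(active, beam)
    for t in range(1, len(obs_loglik)):
        new = {}
        # All surviving hypotheses are advanced by exactly one frame.
        for p, (score, path) in active.items():
            for s in range(n_states):
                cand = score + log_trans[p][s] + obs_loglik[t][s]
                # Viterbi recombination: keep only the best path into each state.
                if s not in new or cand > new[s][0]:
                    new[s] = (cand, path + [s])
        active = prune(new, beam)
    best_state = max(active, key=lambda s: active[s][0])
    return active[best_state]
```

Passing `beam=math.inf` disables pruning and gives the exact Viterbi path. A small beam trades accuracy for fewer active states per frame.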
6 Search algorithms (cont) Stack decoders : A class of depth-first search techniques Need to normalize the score of a path based on the number of frames of data it spans Suffer from problems of speed, size, accuracy and robustness for large vocabulary spontaneous speech applications
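The need for score normalization can be illustrated with a small priority stack. Partial paths in a stack decoder span different numbers of frames, so raw log scores would always favor the shorter path. The per-frame average used here is only one simple normalization, chosen for the sketch. The helper names `push` and `pop_best` are hypothetical.

```python
import heapq

def push(stack, words, log_score, n_frames):
    """Push a partial hypothesis, ordered by its per-frame average log score."""
    normalized = log_score / n_frames
    # heapq is a min-heap, so the key is negated to pop the best hypothesis first.
    heapq.heappush(stack, (-normalized, words, log_score, n_frames))

def pop_best(stack):
    """Pop the hypothesis with the highest normalized score."""
    _, words, log_score, n_frames = heapq.heappop(stack)
    return words, log_score, n_frames
```

With this ordering, a 20-frame path scoring -30.0 (average -1.5) is expanded before a 5-frame path scoring -10.0 (average -2.0), even though its raw score is lower.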
7 Search algorithms (cont) Multi-pass search : Computationally inexpensive acoustic and language models are initially used to produce a list of likely word hypotheses (e.g. bigram language model and context-independent phones) These hypotheses are then refined using more detailed and computationally demanding models (e.g. trigram and cross-word triphones)
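The second pass of a multi-pass search can be sketched as rescoring a first-pass hypothesis list. The function name `rescore_nbest`, the `detailed_lm` callback, and the `lm_weight` parameter are assumptions made for illustration. Each first-pass entry is assumed to carry only its acoustic log score.

```python
def rescore_nbest(nbest, detailed_lm, lm_weight=1.0):
    """Re-rank first-pass hypotheses with a more detailed language model.

    nbest:       list of (word_tuple, acoustic_log_score) from the cheap first pass
    detailed_lm: function mapping a word tuple to a log probability (e.g. a trigram)
    Returns the hypotheses sorted from best to worst combined score.
    """
    rescored = [(words, acoustic + lm_weight * detailed_lm(words))
                for words, acoustic in nbest]
    return sorted(rescored, key=lambda h: h[1], reverse=True)
```

The expensive model is applied only to the few hypotheses that survived the first pass, which is where the saving comes from.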
8 Search algorithms (cont) Multi-pass search : Figure 1 : An example of the N-best list of hypotheses generated for a simple utterance
9 Search algorithms (cont) Multi-pass search : Figure 2 : The resulting word graph from Figure 1
10 Search algorithms (cont) Forward-backward search : An approximate time-synchronous search in the forward direction is used to facilitate a more complex and expensive search in the backward direction. The forward pass can be made highly efficient, even at the cost of being suboptimal.
11 A time-synchronous Viterbi-based decoder Complexity of search Search space organization Search space reduction
12 Complexity of search Lexicon Language model Network decoding N-gram decoding Acoustic model Context-independent model Context-dependent model (word-internal) Cross-word model
13 Complexity of search (cont) Network decoding The network is either a grammar that defines the structure of the language in terms of words, or a word graph generated by a previous recognition pass. Instances of the same triphone cannot be merged into a single path if they correspond to different nodes in the network. The complexity of the search and the memory requirements are directly proportional to the size of the expanded network.
14 Complexity of search (cont) Network decoding Figure 3 : An example of network decoding using word-internal context-dependent models.
15 Complexity of search (cont) N-gram decoding : N-gram language models typically contain only a subset of the possible N-grams; the likelihood of the remaining word sequences can be estimated using a back-off model. Ex : bigram
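A back-off lookup for the bigram case can be sketched as below. This follows the usual back-off scheme: use the bigram if it is stored, otherwise add the history's back-off weight to the unigram log probability. The dictionary layout and the name `backoff_bigram_logprob` are assumptions made for the example, not the decoder's actual data structures.

```python
def backoff_bigram_logprob(prev, word, bigram_lp, unigram_lp, backoff_lw):
    """Log P(word | prev) under a back-off bigram model (illustrative layout).

    bigram_lp:  {(prev, word): log prob} for the bigrams kept in the model
    unigram_lp: {word: log prob}
    backoff_lw: {prev: log back-off weight}; a missing entry is treated as 0.0
    """
    if (prev, word) in bigram_lp:
        return bigram_lp[(prev, word)]
    # Unseen bigram: fall back to the unigram, scaled by the history's weight.
    return backoff_lw.get(prev, 0.0) + unigram_lp[word]
```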
16 Complexity of search (cont) N-gram decoding Paths with very different origins can be merged later in time if they have the same current instance, which is now defined by the phone model and the N-gram history. Ex : for bigram One way to implement the language model is to cache the N-gram scores of all the active words in memory, and leave the rest of the language model on disk
17 Complexity of search (cont) Cross-word acoustic models Figure 4 : A small part of the expanded network from Figure 3 using cross-word triphones.
18 Complexity of search (cont) Cross-word acoustic models Figure 5 : An overview of the relative complexity of the search problem that shows the impact of various types of acoustic and language models.
19 Search space organization Lexical trees Figure 6 : An example lexical tree used in the decoder. The dark circles represent starts and ends of words; the word identity is unknown until a word-end lexical node is reached.
20 Search space organization (cont) Language model lookahead : Delaying the application of the LM score until the word end allows undesirable growth in the complexity of the search To overcome this problem, each node internal to a word stores the maximum LM score of all the words covered by that lexical node
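The lookahead idea can be sketched by building a lexical tree in which every node records the best LM score among the words reachable through it. For simplicity this sketch assumes history-independent (unigram) LM scores and a hypothetical three-word lexicon. In a real decoder the lookahead score would depend on the N-gram history. Homophones are not handled here.

```python
import math

def new_node():
    return {"children": {}, "word": None, "lookahead": -math.inf}

def build_lexical_tree(lexicon, lm_logprob):
    """Build a phone-prefix tree whose nodes store the max LM score beneath them.

    lexicon:    {word: [phone, ...]} (hypothetical pronunciations)
    lm_logprob: {word: log prob} (unigram scores, assumed for simplicity)
    """
    root = new_node()
    for word, phones in lexicon.items():
        node = root
        node["lookahead"] = max(node["lookahead"], lm_logprob[word])
        for phone in phones:
            node = node["children"].setdefault(phone, new_node())
            # Every node on the word's path is covered by this word.
            node["lookahead"] = max(node["lookahead"], lm_logprob[word])
        node["word"] = word  # word identity is only known at the word-end node
    return root
```

During the search, a path sitting at an internal node can then be scored with that node's `lookahead` value, so unlikely words can be pruned before their word end is reached.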
21 Search space organization (cont) Acoustic evaluation The likelihood score evaluated is stored locally with the state information and reused whenever that state is revisited in that frame
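The per-frame reuse of state likelihoods can be sketched as a small cache that is cleared whenever the frame changes. The class name `FrameScoreCache` and the `score_fn(state_id, features)` callback are assumptions made for this illustration.

```python
class FrameScoreCache:
    """Evaluate each state's likelihood at most once per frame (illustrative)."""

    def __init__(self, score_fn):
        self.score_fn = score_fn   # assumed signature: (state_id, features) -> log-likelihood
        self.frame = None
        self.cache = {}

    def score(self, frame_index, state_id, features):
        if frame_index != self.frame:
            # New frame: scores stored for the previous frame are no longer valid.
            self.frame = frame_index
            self.cache = {}
        if state_id not in self.cache:
            self.cache[state_id] = self.score_fn(state_id, features)
        return self.cache[state_id]
```

This pays off when the same acoustic state is shared by many active paths, since each of them revisits the state within the same frame.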
22 Search space reduction Pruning Path merging Word graph compaction
23 Search space reduction (cont) Pruning : to identify low-scoring partial paths that have a very low probability of getting any better, and stop propagating them further. Some commonly used heuristics are: Setting pruning beams based on the hypothesis score Limiting the total number of model instances active at a given time (maximum active phone model instance pruning) Setting an upper bound on the number of words allowed to end at a given frame (maximum active word-end pruning)
24 Search space reduction (cont) Setting pruning beams based on the hypothesis score State level : Phone level : Word level :
25 Search space reduction (cont) Setting pruning beams based on the hypothesis score : The identity of a word is known with a much higher likelihood at the end of the word than at its beginning. It is beneficial to curb the fan-out caused by the language model's list of possible next words. The word-level threshold is usually tighter than the state- and phone-level beams
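Beam pruning with separate thresholds per level can be sketched as below. The rule assumed here is the common one: a hypothesis survives if its score is within the beam of the best score at the same level. The function name, the dictionary layout, and the beam values in the usage note are all hypothetical.

```python
import math

def beam_prune(hyps, beams):
    """Drop hypotheses that fall outside the beam for their level.

    hyps:  list of {"level": "state" | "phone" | "word", "score": log score}
    beams: {level: beam width in log-score units}
    """
    best = {}
    for h in hyps:
        best[h["level"]] = max(best.get(h["level"], -math.inf), h["score"])
    # A hypothesis is kept if it is within its level's beam of that level's best.
    return [h for h in hyps if h["score"] >= best[h["level"]] - beams[h["level"]]]
```

For example, with `beams = {"state": 100.0, "phone": 100.0, "word": 20.0}`, a word-end hypothesis 50 below the best word end is dropped, while a state-level hypothesis with the same deficit is kept.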
26 Search space reduction (cont) Setting pruning beams based on the hypothesis score Figure 7 : Effect of beam widths on the recognition accuracy and complexity of the search.
27 Search space reduction (cont) Maximum active phone model instance (MAPMI) pruning : Each partial path is identified by: the current node in the lexical tree, the identity of the phone model being evaluated, and the last completely evaluated word defined by that path. The goal is to limit the number of these partial paths (instances) active at any time
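Capping the number of active instances can be sketched as keeping the top-scoring ones. The instance fields mirror the three identifiers listed on the slide. The function name `mapmi_prune` and the dictionary layout are assumptions made for the example.

```python
import heapq

def mapmi_prune(instances, max_active):
    """Keep at most `max_active` phone model instances, best score first.

    Each instance is a dict with keys "node", "phone", "last_word" and "score".
    """
    return heapq.nlargest(max_active, instances, key=lambda inst: inst["score"])
```

The same top-N selection could be applied to word ends at a frame, which is the idea behind maximum active word-end pruning.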
28 Search space reduction (cont) Maximum active phone model instance (MAPMI) pruning : Figure 8 : Effect of MAPMI pruning on memory usage, illustrated on a 68-frame utterance from the SWB corpus for word graph generation.
29 Search space reduction (cont) maximum active phone model instance (MAPMI) pruning : Figure 9 : Effect of MAPMI pruning on the recognition accuracy and complexity of the search, on a subset of the SWB corpus
30 Search space reduction (cont) Maximum active word-end pruning : Propagate only the few word ends that are associated with the highest-likelihood path scores.
31 Search space reduction (cont) Path merging : By sharing the evaluation of similar parts of different hypotheses, the decoder can reduce the computational load. Take the word level for example : if more than one active path leads to the end of a word, then only the best path among them is propagated further.
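Word-level merging can be sketched as keeping one survivor per word end. This sketch assumes that paths are mergeable only when they share both the word and the LM history, in line with the N-gram decoding slide. The function name and field names are hypothetical.

```python
def merge_word_ends(paths):
    """Keep only the best-scoring path for each (word, LM history) pair.

    Each path is a dict with keys "word", "history" (a tuple) and "score".
    """
    best = {}
    for p in paths:
        key = (p["word"], p["history"])
        if key not in best or p["score"] > best[key]["score"]:
            best[key] = p
    return list(best.values())
```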
32 Search space reduction (cont) Path merging Lexical tree : automatically ensures that all the partial hypotheses represented here have identical linguistic context During word graph generation : only one path is propagated, but multiple path histories are preserved through a sorted backpointer list.
33 Search space reduction (cont) Word graph compaction : a word graph often contains multiple instances of the same word sequence, each with a different alignment with respect to time.
34 Search space reduction (cont) Word graph compaction : a word graph : Figure 10 : An illustration of the word graph
35 Search space reduction (cont) Word graph compaction : the word graph after removing the time stamps Figure 11 : An illustration of the word graph compaction from Figure 10
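The effect of dropping time stamps can be illustrated on explicit paths rather than on the graph itself. This is a simplification: actual compaction operates on the graph's nodes and arcs. Keeping the best score for each word sequence is an assumption of this sketch, and the name `compact_paths` is hypothetical.

```python
def compact_paths(paths):
    """Collapse paths that differ only in their time alignment.

    paths: list of (arcs, score), where arcs = [(word, start_frame, end_frame), ...]
    Returns {word_sequence: best score among its alignments}.
    """
    best = {}
    for arcs, score in paths:
        # Dropping the frame indices leaves only the word sequence.
        words = tuple(word for word, _start, _end in arcs)
        if words not in best or score > best[words]:
            best[words] = score
    return best
```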
36 Performance analysis Several experiments were run on two different corpora : the OGI Alphadigits and SWITCHBOARD, using both word-internal and cross-word triphone models. Hardware : 333MHz Pentium II processor with 512MB of memory.
37 Performance analysis (cont) OGI Alphadigits corpus (OGI-AD) : A database of telephone speech collected from approximately 3000 subjects. The vocabulary consisted of the letters of the alphabet as well as the digits 0 through 9. Each subject spoke a list of either 19 or 29 alphanumeric strings.
38 Performance analysis (cont) OGI-AD : Language model : Figure 12 : The language model for the Alphadigits corpus is a fully connected grammar.
39 Performance analysis (cont) OGI-AD results Table 1. An analysis of performance on the OGI-AD task for network decoding.
40 Performance analysis (cont) Memory varies with the length of the utterance Figure 13 : Memory and run-time for word graph rescoring as a function of utterance length. Word-internal models are used here.
41 Performance analysis (cont) Memory varies with the length of the utterance Figure 14 : Memory and run-time for word graph rescoring as a function of utterance length. Cross-word models are used here.
42 Performance analysis (cont) Memory varies with the length of the utterance Figure 15 : Memory and run-time for word graph generation as a function of utterance length. Word-internal models are used here.
43 Performance analysis (cont) Switchboard (SWB) corpus : Consists of recognition of spontaneous conversational speech collected over standard telephone lines. Decoding with cross-word acoustic models is a challenge on this task.
44 Performance analysis (cont) SWB is currently one of the most challenging benchmarks for recognition systems. Reasons are : Acoustics : a variety of transducers and noisy channels Language model Pronunciation variation
45 Performance analysis (cont) Switchboard (SWB) result : Table 2. An analysis of performance on the LDC-SWB task for rescoring word graphs generated using a bigram language model.
46 Switchboard (SWB) result : Table 3. Summary of the decoder performance on the LDC-SWB task for word graph generation using a bigram language model.
47 Summary Future directions in search can be summarized in one word : real-time More intelligent pruning algorithms Multi-pass systems Fast-matching strategies within the acoustic model Vector quantization-like approaches