The Use of Context in Large Vocabulary Speech Recognition Julian James Odell March 1995 Dissertation submitted to the University of Cambridge for the degree of Doctor of Philosophy Presenter: Hsu-Ting Wei
3 Contents
4 Contents (cont.)
5 Introduction The use of context dependent models introduces two major problems: –1. Sparse and uneven training data –2. The need for an efficient decoding strategy which incorporates context dependencies both within words and across word boundaries
6 Introduction (cont.) About problem 1 (ch3) –Construct robust and accurate recognizers using decision tree based clustering techniques –Linguistic knowledge is used –The approach allows the construction of models which are dependent upon contextual effects occurring across word boundaries About problem 2 (ch4~) –The thesis presents a new decoder design which is capable of using these models efficiently –The decoder can generate a lattice of word hypotheses with little computational overhead.
7 Ch3. Context dependency in speech 3.1 Contextual Variation –In order to maximize the accuracy of HMM based speech recognition systems, it is necessary to carefully tailor their architecture to ensure that they exploit the strengths of HMMs while minimizing the effects of their weaknesses. Signal parameterisation Model structure –Ensure that the between class variance of the models is higher than the within class variance
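The variance criterion on this slide can be made concrete with a small numeric sketch. It is not taken from the thesis: the function name `variance_ratio`, the one-dimensional features, and the toy numbers are all assumptions chosen for illustration. It computes the ratio of between class variance to within class variance, which should be well above 1 for classes that are easy to separate.

```python
from statistics import mean, pvariance


def variance_ratio(classes):
    """Return between-class variance divided by within-class variance.

    `classes` maps a class label to a list of 1-D feature values.
    Both variances are weighted by the number of samples per class.
    """
    all_values = [x for values in classes.values() for x in values]
    grand_mean = mean(all_values)
    n = len(all_values)
    # Average spread of each class around its own mean.
    within = sum(len(v) * pvariance(v) for v in classes.values()) / n
    # Spread of the class means around the overall mean.
    between = sum(
        len(v) * (mean(v) - grand_mean) ** 2 for v in classes.values()
    ) / n
    return between / within


# Hypothetical toy data: two tight, well separated classes ...
separated = {"a": [1.0, 2.0, 3.0], "b": [7.0, 8.0, 9.0]}
# ... and two broad, heavily overlapping classes.
overlapping = {"a": [1.0, 5.0, 9.0], "b": [2.0, 6.0, 10.0]}

print(variance_ratio(separated))
print(variance_ratio(overlapping))
```

Splitting a phone model by context aims at the first situation: each context dependent class is tighter than the pooled context independent class.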
8 Ch3. Context dependency in speech (cont.) Most of the variability inherent in speech is due to contextual effects: –Session effects Speaker effects –Major source of variation Environmental effects –Controlled by minimizing the background noise and ensuring that the same microphone is used –Local effects Utterance –Co-articulation, stress, emphasis By taking these contextual effects into account, the variability can be reduced and the accuracy of the models increased.
9 Ch3. Context dependency in speech (cont.) Session effects –A speaker dependent (SD) system is significantly more accurate than a similar speaker independent (SI) system. –Speaker effects Gender and age Dialect Style –To make an SI system approximate an SD system, we can: Operate recognizers in parallel Adapt the recognizer to match the new speaker
10 Ch3. Context dependency in speech (cont.) Session effects (cont.) –Operating recognizers in parallel Disadvantage: –The computational load appears to rise linearly with the number of systems Advantage: –One system tends to dominate quickly, so the computational load is high for only the first few seconds of speech [Figure labels: Speaker type, answer]
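The trade-off on this slide can be illustrated with a toy simulation. The slide does not say how the dominated systems are switched off, so the beam-style pruning rule below is an assumption, as are the function name `run_parallel`, the speaker-type labels, and the per-frame scores. The sketch only shows why the load is high for the first frames and then falls.

```python
def run_parallel(frame_scores, beam):
    """Run several recognizers side by side and drop the ones that fall behind.

    frame_scores: dict mapping a recognizer name to its per-frame
                  log-likelihoods (hypothetical numbers, not real decoder output).
    beam:         a recognizer is dropped once its accumulated score falls
                  more than `beam` below the current best (assumed rule).
    Returns the surviving best recognizer and the number of active
    recognizers after each frame.
    """
    n_frames = len(next(iter(frame_scores.values())))
    totals = {name: 0.0 for name in frame_scores}
    active = set(frame_scores)
    active_counts = []
    for t in range(n_frames):
        # Only recognizers that are still active cost computation.
        for name in active:
            totals[name] += frame_scores[name][t]
        best = max(totals[name] for name in active)
        active = {name for name in active if totals[name] >= best - beam}
        active_counts.append(len(active))
    winner = max(active, key=lambda name: totals[name])
    return winner, active_counts


# Hypothetical scores: the "male" system matches the speaker better.
scores = {
    "male": [-1.0, -1.0, -1.0, -1.0],
    "female": [-3.0, -3.0, -3.0, -3.0],
}
winner, counts = run_parallel(scores, beam=3.0)
print(winner, counts)
```

With these numbers both systems run for the first frame only, after which a single system carries the rest of the utterance.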
11 Ch3. Context dependency in speech (cont.) Session effects (cont.) –Adapting the recognizer to match the new speaker Problem: There is insufficient data to update the model –It is possible to make use of both techniques and initially use parallel systems to choose the speaker characteristics, then, once enough data is available, adapt the chosen system to better match the speaker. MAP, MLLR
12 Ch3. Context dependency in speech (cont.) Local effects –Co-articulation means that the acoustic realization of a phone in a particular phonetic context is more consistent than the same phone occurring in a variety of contexts. – Ex: ”We were away with William in Sea World.” w iy w er… s iy w er
13 Ch3. Context dependency in speech (cont.) Local effects –Context Dependent Phonetic Models IN LIMSI –45 monophone contexts (Festival CMU: 41) »STEAK = sil s t ey k sil –2071 biphone contexts (Festival CMU: 1364) »STEAK = sil sil-s s-t t-ey ey-k sil –95221 triphone contexts »STEAK = sil sil-s+t s-t+ey t-ey+k ey-k+sil sil –Word Boundaries Word Internal Context Dependency (Intra-word) –STEAK AND CHIPS = sil s+t s-t+ey t-ey+k ey-k ae+n ae-n+d n-d ch+ih ch-ih+p ih-p+s p-s sil Cross Word Context Dependency (Inter-word) =>can increase accuracy –STEAK AND CHIPS = sil sil-s+t s-t+ey t-ey+k ey-k+ae k-ae+n ae-n+d n-d+ch d-ch+ih ch-ih+p ih-p+s p-s+sil sil
14 English dictionary Festlex CMU - Lexicon (American English) for Festival Speech System –40 distinct phones ("hello" nil (((hh ax l) 0) ((ow) 1))) ... ("world" nil (((w er l d) 1))) ...
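The expansions on this slide can be reproduced with a short sketch. The function names are hypothetical and the code is only an illustration of the `left-phone+right` naming used above, not the tool actually used to build the model lists. It assumes every utterance is padded with `sil` at both ends, as in the examples.

```python
def left_biphones(phones):
    """Left-context biphones for one word, padded with sil."""
    flat = ["sil"] + list(phones)
    models = [f"{flat[i - 1]}-{flat[i]}" for i in range(1, len(flat))]
    return ["sil"] + models + ["sil"]


def word_internal_triphones(words):
    """Triphones whose context never crosses a word boundary."""
    models = ["sil"]
    for word in words:
        for i, phone in enumerate(word):
            name = phone
            if i > 0:  # left context exists inside the word
                name = f"{word[i - 1]}-{name}"
            if i < len(word) - 1:  # right context exists inside the word
                name = f"{name}+{word[i + 1]}"
            models.append(name)
    models.append("sil")
    return models


def cross_word_triphones(words):
    """Triphones whose context extends across word boundaries."""
    flat = ["sil"] + [p for word in words for p in word] + ["sil"]
    models = [
        f"{flat[i - 1]}-{flat[i]}+{flat[i + 1]}"
        for i in range(1, len(flat) - 1)
    ]
    return ["sil"] + models + ["sil"]


STEAK = ["s", "t", "ey", "k"]
AND = ["ae", "n", "d"]
CHIPS = ["ch", "ih", "p", "s"]

print(" ".join(left_biphones(STEAK)))
print(" ".join(word_internal_triphones([STEAK, AND, CHIPS])))
print(" ".join(cross_word_triphones([STEAK, AND, CHIPS])))
```

The three printed lines match the STEAK biphone example and the two STEAK AND CHIPS examples on the slide. The cross-word version replaces the boundary biphones (`ey-k`, `ae+n`, ...) with full triphones (`ey-k+ae`, `k-ae+n`, ...), which is where the extra models and the extra decoding difficulty come from.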
15 English dictionary (cont.) The LIMSI dictionary phones set (1993) –45 phones
16 Linguistic knowledge (cont.) Nasal Fricative Liquid General questions
17 Linguistic knowledge (cont.) Vowel questions
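18 Linguistic knowledge (cont.) Consonant questions Fortis (consonants articulated with strong effort) Lenis (consonants articulated with less effort) Tongue-tip (apical) sounds Strident Syllabic Fricative Affricate
19 Linguistic knowledge (cont.) Questions which are used in HTK <= State tying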
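To show what a single phonetic question does during state tying, here is a minimal sketch. The phone set `NASALS`, the helper names, and the model list are assumptions for illustration; the real question sets are the ones listed on the preceding slides. The sketch covers only the yes/no partition of one question and does not cover how the tree chooses which question to ask at each node.

```python
def left_context(model):
    """Left-context phone of a name such as 's-t+ey', or None if absent."""
    return model.split("-", 1)[0] if "-" in model else None


def right_context(model):
    """Right-context phone of a name such as 's-t+ey', or None if absent."""
    return model.split("+", 1)[1] if "+" in model else None


def split_by_question(models, phone_set, side="left"):
    """Partition models by the question 'is the <side> context in phone_set?'."""
    get = left_context if side == "left" else right_context
    yes = [m for m in models if get(m) in phone_set]
    no = [m for m in models if get(m) not in phone_set]
    return yes, no


# Assumed illustrative phone class, not the thesis question set.
NASALS = {"m", "n", "ng"}

models = ["n-d+ch", "ae-n+d", "k-ae+n", "s-t+ey", "m-ae+n"]

left_yes, left_no = split_by_question(models, NASALS, side="left")
right_yes, right_no = split_by_question(models, NASALS, side="right")
print("Left context nasal? ", left_yes, left_no)
print("Right context nasal?", right_yes, right_no)
```

Models that answer a sequence of questions the same way end up in the same leaf and share one set of state parameters. An unseen context can be assigned to a leaf in the same way, which is what addresses the sparse training data problem.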
20 Ch4.Decoding This chapter describes several decoding techniques suitable for recognition of continuous speech using HMMs. It is concerned with the use of cross word context dependent acoustic and long span language models. Ideal decoder –4.2 Time-Synchronous decoding Token passing Beam pruning N-Best decoding Limitations Back-Off implementation –4.3 Best First Decoding A* Decoding The stack decoder for speech recognition –4.4 A Hybrid approach
21 Ch4.Decoding (cont.) 4.1 Requirements –Ideal decoder : It should find the most likely grammatical hypothesis for an unknown utterance Acoustic model likelihood Language model likelihood
22 Ch4.Decoding (cont.) 4.1 Requirements (cont.) –The ideal decoder would have the following characteristics Efficiency: Ensure that the system does not lag behind the speaker. Accuracy: Find the most likely grammatical sequence of words for each utterance. Scalability: (?) The computation required by the decoder would also increase less than linearly with the size of the vocabulary. Versatility: Allow a variety of constraints and knowledge sources to be incorporated directly into the search without compromising its efficiency. (n-gram language models + cross-word context dependent models)
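The requirement above is usually written as the standard Bayes decision rule. The symbols are assumptions introduced here, not notation from the slides: $O$ is the sequence of acoustic observations and $W$ ranges over the word sequences allowed by the grammar.

```latex
\hat{W} = \arg\max_{W} P(W \mid O)
        = \arg\max_{W} \frac{p(O \mid W)\, P(W)}{p(O)}
        = \arg\max_{W} \underbrace{p(O \mid W)}_{\text{acoustic model}} \;
                       \underbrace{P(W)}_{\text{language model}}
```

The denominator $p(O)$ does not depend on $W$, so it can be dropped from the maximization. This leaves the two likelihoods named on the slide.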
23 Conclusion Implement HTK right biphone task and triphone task