Automatic Speech Recognition

1 Automatic Speech Recognition
ILVB-2006 Tutorial

2 The Noisy Channel Model
Automatic speech recognition (ASR) is the process by which an acoustic speech signal is converted into a sequence of words [Rabiner et al., 1993].
The noisy channel model [Lee et al., 1996]: the acoustic input is treated as a noisy version of a source sentence.
Diagram: Source sentence → Noisy channel → Noisy sentence → Decoder → Guess at original sentence (example sentence: 버스 정류장이 어디에 있나요? "Where is the bus stop?")

3 The Noisy Channel Model
What is the most likely sentence out of all sentences in the language L given some acoustic input O?
Treat the acoustic input as a sequence of individual observations: O = o_1, o_2, o_3, ..., o_t.
Define a sentence as a sequence of words: W = w_1, w_2, w_3, ..., w_n.
Bayes rule: P(W|O) = P(O|W) P(W) / P(O).
Golden rule: W* = argmax_{W in L} P(W|O) = argmax_{W in L} P(O|W) P(W), since P(O) does not depend on W.
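
A minimal sketch of this decision rule in log space; acoustic_logprob(O, W) and lm_logprob(W) are hypothetical scoring functions standing in for the acoustic and language models, and the explicit loop over candidate sentences is only illustrative (real decoders search a network instead):

```python
import math

def decode(observations, candidate_sentences, acoustic_logprob, lm_logprob):
    """Pick the sentence W maximizing log P(O|W) + log P(W).

    acoustic_logprob(O, W) and lm_logprob(W) are assumed scoring functions;
    they are placeholders, not part of any particular toolkit.
    """
    best_sentence, best_score = None, -math.inf
    for W in candidate_sentences:
        score = acoustic_logprob(observations, W) + lm_logprob(W)
        if score > best_score:
            best_sentence, best_score = W, score
    return best_sentence, best_score
```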

4 Speech Recognition Architecture Meets Noisy Channel
Decoding pipeline: speech signal → Feature extraction → Decoding over a search network → word sequence (example: 버스 정류장이 어디에 있나요? "Where is the bus stop?").
Network construction combines three knowledge sources: the acoustic model, the pronunciation model, and the language model.
Training side: the acoustic model is estimated from a speech DB (HMM estimation), the pronunciation model is built by G2P, and the language model is estimated from text corpora (LM estimation).

5 Feature Extraction
Mel-frequency cepstral coefficients (MFCCs) are a popular choice [Paliwal, 1992].
Frame size: 25 ms; frame rate: 10 ms.
39 features per 10 ms frame:
- Absolute: log frame energy (1) and MFCCs (12)
- Delta: first-order derivatives of the 13 absolute coefficients
- Delta-Delta: second-order derivatives of the 13 absolute coefficients
Processing pipeline: x(n) → preemphasis / Hamming window → FFT (fast Fourier transform) → mel-scale filter bank → log|·| → DCT (discrete cosine transform) → 12-dimensional MFCC.
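
A sketch of the 39-dimensional feature computation using librosa, assuming a 16 kHz mono recording; the file name is illustrative, and librosa handles the windowing, mel filter bank, and DCT internally:

```python
import numpy as np
import librosa

# Load a waveform (path is a placeholder) at 16 kHz and apply preemphasis.
y, sr = librosa.load("utterance.wav", sr=16000)
y = librosa.effects.preemphasis(y)

# 13 absolute coefficients per frame: 25 ms window, 10 ms frame rate.
# librosa's 0th coefficient plays the role of the log-energy term here.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                            n_fft=int(0.025 * sr),
                            hop_length=int(0.010 * sr))

delta = librosa.feature.delta(mfcc)            # first-order derivatives
delta2 = librosa.feature.delta(mfcc, order=2)  # second-order derivatives

features = np.vstack([mfcc, delta, delta2])    # shape: (39, n_frames)
```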

6 Acoustic Model
Provides P(O|Q) = P(features | phone).
Modeling units [Bahl et al., 1986]:
- Context-independent: phoneme
- Context-dependent: diphone, triphone, quinphone (pL-p+pR denotes phone p with left context pL and right context pR)
Typical acoustic model [Juang et al., 1986]:
- Continuous-density hidden Markov model
- Output distribution b_j(x): Gaussian mixture
- HMM topology: 3-state left-to-right model for each phone, 1 state for silence or pause
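
A sketch of one state's Gaussian-mixture output density b_j(x) with diagonal covariances; the mixture parameters below are random placeholders, not trained values:

```python
import numpy as np
from scipy.stats import multivariate_normal

class GMMState:
    """Output density b_j(x) of one HMM state: a weighted sum of Gaussians."""

    def __init__(self, weights, means, diag_covs):
        self.weights = np.asarray(weights)      # mixture weights, sum to 1
        self.means = np.asarray(means)          # (n_mix, dim)
        self.diag_covs = np.asarray(diag_covs)  # (n_mix, dim) diagonal covariances

    def log_prob(self, x):
        # log b_j(x) = log sum_m c_m N(x; mu_m, Sigma_m), via log-sum-exp
        comps = [np.log(w) + multivariate_normal.logpdf(x, mean=m, cov=np.diag(c))
                 for w, m, c in zip(self.weights, self.means, self.diag_covs)]
        return np.logaddexp.reduce(comps)

# Example: a 2-component mixture over 39-dimensional MFCC features.
rng = np.random.default_rng(0)
state = GMMState(weights=[0.6, 0.4],
                 means=rng.normal(size=(2, 39)),
                 diag_covs=np.ones((2, 39)))
print(state.log_prob(rng.normal(size=39)))
```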

7 Pronunciation Model
Provides P(Q|W) = P(phone | word).
Word lexicon [Hazen et al., 2002]:
- Maps legal phone sequences into words according to phonotactic rules
- G2P (grapheme-to-phoneme): generates a word lexicon automatically
- Several words may have multiple pronunciations
Example: "tomato"
- P([t ow m ey t ow] | tomato) = P([t ow m aa t ow] | tomato) = 0.1
- P([t ah m ey t ow] | tomato) = P([t ah m aa t ow] | tomato) = 0.4
- Pronunciation network: [t] branches to [ow] (0.2) or [ah] (0.8), then [m], then [ey] (0.5) or [aa] (0.5), then [t] [ow]
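
The tomato entry from the slide, written as a weighted lexicon (a dict from word to (phone sequence, probability) pairs); the lookup helper is just an illustration of how P(Q|W) would be read off such a table:

```python
# Weighted pronunciation lexicon: P(phone sequence | word).
lexicon = {
    "tomato": [
        (["t", "ow", "m", "ey", "t", "ow"], 0.1),
        (["t", "ow", "m", "aa", "t", "ow"], 0.1),
        (["t", "ah", "m", "ey", "t", "ow"], 0.4),
        (["t", "ah", "m", "aa", "t", "ow"], 0.4),
    ],
}

def pronunciation_prob(word, phones):
    """Return P(phones | word), or 0.0 if the sequence is not in the lexicon."""
    for seq, prob in lexicon.get(word, []):
        if seq == phones:
            return prob
    return 0.0

# The four variants of a word should sum to 1.
assert abs(sum(p for _, p in lexicon["tomato"]) - 1.0) < 1e-9
print(pronunciation_prob("tomato", ["t", "ah", "m", "ey", "t", "ow"]))  # 0.4
```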

8 Training
Training process [Lee et al., 1996]:
- Speech DB → Feature extraction → Baum-Welch re-estimation → converged? If no, re-estimate again; if yes, the HMM set is final (see the sketch after this slide).
Network for training:
- Sentence HMM: concatenation of word HMMs for the transcript (e.g., "ONE TWO THREE ONE TWO THREE ONE")
- Word HMM: concatenation of phone HMMs (e.g., ONE → W AH N)
- Phone HMM: 3-state left-to-right model (states 1, 2, 3)
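
A sketch of the training loop's control flow only; extract_features, build_training_network, and baum_welch_reestimate are hypothetical helpers standing in for the feature extraction, the sentence-HMM construction, and the re-estimation step described above:

```python
def train_acoustic_model(hmm, speech_db, extract_features,
                         build_training_network, baum_welch_reestimate,
                         tol=1e-4, max_iter=50):
    """Iterate Baum-Welch re-estimation until the log-likelihood converges.

    All callables are assumed interfaces: extract_features turns a waveform
    into a feature matrix, build_training_network concatenates word/phone
    HMMs into a sentence HMM for the utterance transcript, and
    baum_welch_reestimate returns (updated_hmm, total_log_likelihood).
    """
    prev_ll = float("-inf")
    for iteration in range(max_iter):
        data = [(extract_features(wav), build_training_network(hmm, transcript))
                for wav, transcript in speech_db]
        hmm, ll = baum_welch_reestimate(hmm, data)
        print(f"iteration {iteration}: log-likelihood {ll:.2f}")
        if ll - prev_ll < tol:   # converged?
            break
        prev_ll = ll
    return hmm
```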

9 Language Model
Provides P(W), the probability of the word sequence [Beaujard et al., 1999].
We saw that this is also used in the decoding process as the probability of transitioning from one word to another.
Word sequence: W = w_1, w_2, w_3, ..., w_n.
By the chain rule, P(W) = ∏_i P(w_i | w_1, ..., w_{i-1}); the problem is that we cannot reliably estimate these conditional word probabilities for all words and all history lengths in a given language.
n-gram language model:
- Uses only the previous n-1 words to represent the history: P(w_i | w_1, ..., w_{i-1}) ≈ P(w_i | w_{i-n+1}, ..., w_{i-1})
- Bigrams are easily incorporated in a Viterbi search
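
A sketch of maximum-likelihood bigram estimation from a toy corpus (no smoothing; the two example sentences are placeholders):

```python
from collections import Counter

# Toy corpus with sentence-boundary markers.
corpus = [
    ["<s>", "one", "two", "three", "</s>"],
    ["<s>", "one", "three", "two", "</s>"],
]

unigram_counts = Counter()
bigram_counts = Counter()
for sent in corpus:
    for prev, cur in zip(sent[:-1], sent[1:]):
        unigram_counts[prev] += 1
        bigram_counts[(prev, cur)] += 1

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate P(cur | prev); 0.0 for unseen histories."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, cur)] / unigram_counts[prev]

print(bigram_prob("one", "two"))   # 0.5
print(bigram_prob("<s>", "one"))   # 1.0
```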

10 Language Model Example
Finite-state network (FSN): the words 세시 (three o'clock), 네시 (four o'clock), 서울 (Seoul), 부산 (Busan), 대구 (Daegu), 대전 (Daejeon), 에서 (from), 출발 (departing), 도착 (arriving), 하는 (that), 기차 (train), 버스 (bus) connected by the allowed transitions.
Context-free grammar (CFG):
$time = 세시 | 네시;
$city = 서울 | 부산 | 대구 | 대전;
$trans = 기차 | 버스;
$sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
Bigram (example probabilities):
P(에서|서울) = 0.2, P(세시|에서) = 0.5, P(출발|세시) = 1.0, P(하는|출발) = 0.5, P(출발|서울) = 0.5, P(도착|대구) = 0.9
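
A small sketch that scores a word sequence with the bigram values listed on this slide (the Korean words are kept as data tokens; any pair not listed simply gets probability 0.0):

```python
# Bigram probabilities from the slide, stored as (previous, next) -> P(next | previous).
bigram = {
    ("서울", "에서"): 0.2,
    ("에서", "세시"): 0.5,
    ("세시", "출발"): 1.0,
    ("출발", "하는"): 0.5,
    ("서울", "출발"): 0.5,
    ("대구", "도착"): 0.9,
}

def sentence_prob(words):
    """Product of bigram probabilities over consecutive word pairs."""
    prob = 1.0
    for prev, cur in zip(words[:-1], words[1:]):
        prob *= bigram.get((prev, cur), 0.0)
    return prob

# "서울 에서 세시 출발 하는" -> 0.2 * 0.5 * 1.0 * 0.5 = 0.05
print(sentence_prob(["서울", "에서", "세시", "출발", "하는"]))
```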

11 Network Construction
Expanding every word to the state level, we get a search network [Demuynck et al., 1997].
The acoustic, pronunciation, and language models are combined: each word (the slide's example uses the Korean digits 일, 이, 삼, 사 with phones such as I-L and S-A-M) is expanded into its phone HMM states; intra-word arcs connect the states within a word, while between-word arcs carry the language model probabilities P(일|x), P(이|x), P(삼|x), P(사|x) at word transitions, from the start node to the end node. A sketch of this expansion follows.
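
A sketch of the expansion: each word is mapped to its phones, each phone to three HMM states, and the state chains are joined by intra-word arcs plus LM-weighted between-word arcs. The lexicon, bigram values, and state-naming scheme are placeholders:

```python
def build_search_network(lexicon, bigram, states_per_phone=3):
    """Expand words to phone states and add LM-weighted word transitions.

    lexicon: dict word -> phone list; bigram: dict (prev, next) -> P(next | prev).
    Returns a list of (from_state, to_state, weight) arcs and each word's
    (first_state, last_state) pair.
    """
    arcs, word_bounds = [], {}
    for word, phones in lexicon.items():
        chain = [f"{word}/{i}:{p}/{s}"
                 for i, p in enumerate(phones)
                 for s in range(states_per_phone)]
        for a, b in zip(chain[:-1], chain[1:]):
            arcs.append((a, b, 1.0))                  # intra-word arcs
        word_bounds[word] = (chain[0], chain[-1])
    for (prev, nxt), p in bigram.items():
        if prev in word_bounds and nxt in word_bounds:
            arcs.append((word_bounds[prev][1], word_bounds[nxt][0], p))  # between-word arcs
    return arcs, word_bounds

lexicon = {"il": ["i", "l"], "sam": ["s", "a", "m"]}  # placeholder lexicon
bigram = {("il", "sam"): 0.3}                         # placeholder LM
arcs, bounds = build_search_network(lexicon, bigram)
print(len(arcs), bounds["il"])
```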

12 Decoding
Find W* = argmax_W P(O|W) P(W).
Viterbi search: dynamic programming.
Token passing algorithm [Young et al., 1989]:
- Initialize each legal start state with a token carrying a null history and the likelihood that it is a start state.
- For each frame a_k:
  - For each token t in state s with probability P(t) and history H:
    - For each state r: add a new token to r with probability P(t) · P_{s,r} · P_r(a_k) and history s.H
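
A sketch of this recursion in log space; the transition table and the per-state emission scorer are assumed inputs rather than any particular toolkit's API, and histories grow as tuples of visited states:

```python
import math
from collections import namedtuple

Token = namedtuple("Token", ["logprob", "history"])

def token_passing(frames, states, start_states, trans_logprob, emit_logprob):
    """Propagate tokens frame by frame, keeping the best token per state.

    trans_logprob[(s, r)] holds log P_{s,r}; emit_logprob(r, frame) returns
    log P_r(a_k). Both are placeholder interfaces.
    """
    # Initialization: a token with a null history in every legal start state.
    tokens = {s: Token(0.0, ()) for s in start_states}
    for frame in frames:
        new_tokens = {}
        for s, tok in tokens.items():
            for r in states:
                lp = trans_logprob.get((s, r))
                if lp is None:
                    continue
                cand = Token(tok.logprob + lp + emit_logprob(r, frame),
                             tok.history + (s,))
                # Viterbi assumption: keep only the best token in each state.
                if r not in new_tokens or cand.logprob > new_tokens[r].logprob:
                    new_tokens[r] = cand
        tokens = new_tokens
    return tokens
```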

13 Decoding
Pruning [Young et al., 1996]:
- The entire search space for Viterbi search is much too large.
- The solution is to prune tokens on paths whose score is too low. Typical methods:
  - Histogram pruning: keep at most n total hypotheses
  - Beam pruning: keep only hypotheses whose score is within a fraction (the beam) of the best score
N-best hypotheses and word graphs:
- Keep multiple tokens and return the n best paths/scores
- Can produce a packed word graph (lattice)
Multiple-pass decoding:
- Perform multiple passes, applying successively more fine-grained language models
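
A sketch of the two pruning rules applied to the active tokens from the previous sketch (any dict of objects with a logprob attribute works); the beam margin and hypothesis cap are illustrative defaults:

```python
def prune(tokens, beam=10.0, max_tokens=1000):
    """Histogram + beam pruning over a {state: token} dictionary.

    beam is a log-probability margin below the best score; max_tokens caps
    the number of surviving hypotheses.
    """
    if not tokens:
        return tokens
    best = max(tok.logprob for tok in tokens.values())
    # Beam pruning: drop tokens too far below the best score.
    survivors = {s: tok for s, tok in tokens.items() if tok.logprob >= best - beam}
    # Histogram pruning: keep only the max_tokens best of what remains.
    ranked = sorted(survivors.items(), key=lambda kv: kv[1].logprob, reverse=True)
    return dict(ranked[:max_tokens])
```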

14 Large Vocabulary Continuous Speech Recognition (LVCSR)
Decoding continuous speech over a large vocabulary is computationally complex because of the huge potential search space.
Weighted finite-state transducers (WFST) [Mohri et al., 2002]:
- Efficient in time and space
- The search network is built by combining and optimizing a cascade of transducers: state:HMM, HMM:phone, phone:word, word:sentence
Dynamic decoding:
- On-demand network construction
- Much lower memory requirements
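
A toy illustration of chaining two levels of the cascade by composition; real systems use a WFST toolkit operating on automata states and arcs, so this dict-based version only shows the idea of composing mappings and multiplying their weights (all entries are placeholders):

```python
def compose(t1, t2):
    """Compose two 'table' transducers: dicts mapping an input tuple to a
    list of (output tuple, weight) pairs. This is a toy stand-in for WFST
    composition, not an implementation of it.
    """
    result = {}
    for a, pairs1 in t1.items():
        for b, w1 in pairs1:
            for c, w2 in t2.get(b, []):
                result.setdefault(a, []).append((c, w1 * w2))
    return result

# phone sequence -> word (placeholder pronunciation weights)
phone_to_word = {("w", "ah", "n"): [(("one",), 1.0)]}
# word -> word sequence continuation (placeholder bigram weights)
word_to_sentence = {("one",): [(("one", "two"), 0.6), (("one", "three"), 0.4)]}

print(compose(phone_to_word, word_to_sentence))
```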

15 References (1/2)
L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, "Maximum mutual information estimation of hidden Markov model parameters for speech recognition," Proc. ICASSP, 1986, pp. 49–52.
C. Beaujard and M. Jardino, "Language modeling based on automatic word concatenations," Proc. European Conference on Speech Communication and Technology (Eurospeech), 1999, vol. 4.
K. Demuynck, J. Duchateau, and D. V. Compernolle, "A static lexicon network representation for cross-word context dependent phones," Proc. 5th European Conference on Speech Communication and Technology, 1997, pp. 143–146.
T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, "Pronunciation modeling using a finite-state transducer representation," Proc. ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, 2002, pp. 99–104.
M. Mohri, F. Pereira, and M. Riley, "Weighted finite-state transducers in speech recognition," Computer Speech and Language, vol. 16, no. 1, 2002, pp. 69–88.

16 References (2/2)
B. H. Juang, S. E. Levinson, and M. M. Sondhi, "Maximum likelihood estimation for multivariate mixture observations of Markov chains," IEEE Transactions on Information Theory, vol. 32, no. 2, 1986, pp. 307–309.
C. H. Lee, F. K. Soong, and K. K. Paliwal (eds.), Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers, 1996.
K. K. Paliwal, "Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer," Digital Signal Processing, vol. 2, 1992, pp. 157–173.
L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, 1989, pp. 257–286.
L. R. Rabiner and B. H. Juang, Fundamentals of Speech Recognition, Prentice-Hall, 1993.
S. J. Young, N. H. Russell, and J. H. S. Thornton, "Token passing: a simple conceptual model for connected speech recognition systems," Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department, 1989.
S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, The HTK Book, Entropic Cambridge Research Lab., Cambridge, UK, 1996.

