Automatic Speech Recognition


Automatic Speech Recognition ILVB-2006 Tutorial

The Noisy Channel Model
- Automatic speech recognition (ASR) is the process by which an acoustic speech signal is converted into a sequence of words [Rabiner et al., 1993].
- The noisy channel model [Lee et al., 1996]: the acoustic input is treated as a noisy version of a source sentence.
- Source sentence → Noisy Channel → Noisy sentence → Decoder → Guess at original sentence
- Example: the source sentence "버스 정류장이 어디에 있나요?" ("Where is the bus stop?") passes through the channel, and the decoder's guess should recover the same sentence.

The Noisy Channel Model
- What is the most likely sentence out of all sentences in the language L, given some acoustic input O?
- Treat the acoustic input as a sequence of individual observations: O = o1, o2, o3, …, ot
- Define a sentence as a sequence of words: W = w1, w2, w3, …, wn
- Bayes' rule: P(W|O) = P(O|W) P(W) / P(O)
- Golden rule: Ŵ = argmax_W P(W|O) = argmax_W P(O|W) P(W)
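The golden rule can be illustrated as a toy rescoring step in log space. A minimal sketch; the candidate sentences and every log-score below are invented for illustration, not taken from any model:

```python
# Hypothetical log-scores for three candidate sentences.
log_p_o_given_w = {"where is the bus stop": -12.1,   # acoustic model log P(O|W)
                   "wear is the bus stop": -11.8,
                   "where is the best op": -13.0}
log_p_w = {"where is the bus stop": -8.5,            # language model log P(W)
           "wear is the bus stop": -14.2,
           "where is the best op": -16.0}

def decode(candidates):
    """Golden rule in log space: argmax_W [log P(O|W) + log P(W)]."""
    return max(candidates, key=lambda w: log_p_o_given_w[w] + log_p_w[w])

best = decode(log_p_o_given_w)
```

Note how the language model overrules the slightly better acoustic score of the second candidate, which is exactly the role P(W) plays in the noisy channel formulation.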

Speech Recognition Architecture Meets Noisy Channel
[Block diagram: speech signals pass through Feature Extraction and Decoding to produce a word sequence (e.g., "버스 정류장이 어디에 있나요?" / "Where is the bus stop?"). Decoding searches a network built by Network Construction from three knowledge sources: an Acoustic Model (HMM Estimation from a Speech DB), a Pronunciation Model (G2P), and a Language Model (LM Estimation from Text Corpora).]

Feature Extraction
- Mel-Frequency Cepstrum Coefficients (MFCCs) are a popular choice [Paliwal, 1992].
- Frame size: 25 ms / frame rate: 10 ms (overlapping frames a1, a2, a3, …).
- 39 features per 10 ms frame:
  - Absolute: log frame energy (1) and MFCCs (12)
  - Delta: first-order derivatives of the 13 absolute coefficients
  - Delta-Delta: second-order derivatives of the 13 absolute coefficients
- Pipeline: x(n) → Preemphasis / Hamming window → FFT (Fast Fourier Transform) → Mel-scale filter bank → log|·| → DCT (Discrete Cosine Transform) → MFCC (12 dimensions)
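The pipeline above can be sketched end to end for a single frame. A minimal sketch: the 0.97 preemphasis factor and the 26 mel filters are common defaults assumed here, not values from the slides.

```python
import numpy as np

def mfcc_frame(frame, sample_rate=16000, n_filters=26, n_ceps=12):
    """One frame through the pipeline: preemphasis -> Hamming window ->
    FFT -> mel filter bank -> log -> DCT -> 12 MFCCs."""
    nfft = len(frame)
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    spectrum = np.abs(np.fft.rfft(emphasized * np.hamming(nfft)))
    # Triangular filters spaced evenly on the mel scale.
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv_mel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = inv_mel(np.linspace(0.0, mel(sample_rate / 2.0), n_filters + 2))
    bins = np.floor((nfft + 1) * edges / sample_rate).astype(int)
    fbank = np.zeros((n_filters, len(spectrum)))
    for i in range(n_filters):
        lo, mid, hi = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, lo:mid] = (np.arange(lo, mid) - lo) / max(mid - lo, 1)
        fbank[i, mid:hi] = (hi - np.arange(mid, hi)) / max(hi - mid, 1)
    log_energies = np.log(fbank @ spectrum + 1e-10)
    # DCT-II decorrelates the log filter-bank energies; keep coeffs 1..12.
    k = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), k + 0.5) / n_filters)
    return dct @ log_energies

frame = np.random.randn(400)   # one 25 ms frame at 16 kHz
coeffs = mfcc_frame(frame)     # 12 cepstral coefficients
```

In a full front end this function would run every 10 ms, and the deltas and delta-deltas would be computed across neighboring frames to reach the 39-dimensional vector.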

Acoustic Model
- Provides P(O|Q) = P(features|phone); the emission density of state j is written bj(x).
- Modeling units [Bahl et al., 1986]:
  - Context-independent: phoneme
  - Context-dependent: diphone, triphone, quinphone (pL-p+pR denotes a triphone p with left context pL and right context pR)
- Typical acoustic model [Juang et al., 1986]:
  - Continuous-density hidden Markov model
  - Distribution: Gaussian mixture
  - HMM topology: 3-state left-to-right model for each phone, 1 state for silence or pause
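The Gaussian-mixture emission density bj(x) can be sketched directly. A minimal sketch with a diagonal covariance and a toy 2-D, 2-component mixture; real systems use 39-dimensional vectors and many components per state.

```python
import math

def gmm_log_density(x, weights, means, variances):
    """log b_j(x): diagonal-covariance Gaussian mixture emission
    density for one feature vector x."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            log_p += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
        component_logs.append(log_p)
    m = max(component_logs)  # log-sum-exp for numerical stability
    return m + math.log(sum(math.exp(c - m) for c in component_logs))

# Toy mixture: the first component sits at the origin, the second far away.
ll = gmm_log_density([0.0, 0.0],
                     weights=[0.6, 0.4],
                     means=[[0.0, 0.0], [3.0, 3.0]],
                     variances=[[1.0, 1.0], [1.0, 1.0]])
```

Working in log probabilities, as here, is what makes long products over frames numerically feasible during decoding.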

Pronunciation Model
- Provides P(Q|W) = P(phone|word).
- Word lexicon [Hazen et al., 2002]: maps legal phone sequences into words according to phonotactic rules.
- G2P (grapheme to phoneme): generates a word lexicon automatically.
- A word may have multiple pronunciations. Example for "tomato":
  P([towmeytow]|tomato) = P([towmaatow]|tomato) = 0.1
  P([tahmeytow]|tomato) = P([tahmaatow]|tomato) = 0.4
- [Pronunciation network: [t] branches to [ow] (0.2) or [ah] (0.8), then [m], then [ey] (0.5) or [aa] (0.5), then [t] [ow] (1.0).]
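The four pronunciation probabilities above fall out of the two branch points in the network. A minimal sketch of that lookup:

```python
# Branch probabilities for "tomato", read off the pronunciation network:
# first vowel [ow] (0.2) or [ah] (0.8); second vowel [ey] (0.5) or [aa] (0.5).
first_vowel = {"ow": 0.2, "ah": 0.8}
second_vowel = {"ey": 0.5, "aa": 0.5}

def p_pronunciation(phones):
    """P(phone sequence | 'tomato') for sequences [t, V1, m, V2, t, ow]."""
    if len(phones) != 6:
        return 0.0
    t1, v1, m, v2, t2, ow = phones
    if (t1, m, t2, ow) != ("t", "m", "t", "ow"):
        return 0.0
    return first_vowel.get(v1, 0.0) * second_vowel.get(v2, 0.0)

p = p_pronunciation(["t", "ah", "m", "ey", "t", "ow"])   # 0.8 * 0.5 = 0.4
```

Multiplying arc probabilities along a path is exactly how the decoder scores a pronunciation variant inside the search network.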

Training
- Training process [Lee et al., 1996]: extract features from the Speech DB, run Baum-Welch re-estimation over a network built for training, and repeat until the HMM parameters converge.
- Network for training: each transcribed sentence (e.g., "ONE TWO THREE") is expanded into a sentence HMM, each word into a word HMM (e.g., ONE → W AH N), and each phone into a multi-state phone HMM (states 1, 2, 3).
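At the heart of Baum-Welch is the forward pass, which computes the data likelihood that each re-estimation step is guaranteed not to decrease. A minimal sketch with a toy 2-state left-to-right model; all numbers are illustrative.

```python
import math

def forward(obs_logprobs, log_trans, log_init):
    """Forward pass of an HMM: total log-likelihood of the observation
    sequence. obs_logprobs[t][j] = log b_j(o_t)."""
    def logsumexp(xs):
        m = max(xs)
        if m == float("-inf"):
            return m
        return m + math.log(sum(math.exp(x - m) for x in xs))

    n = len(log_init)
    alpha = [log_init[j] + obs_logprobs[0][j] for j in range(n)]
    for t in range(1, len(obs_logprobs)):
        alpha = [logsumexp([alpha[i] + log_trans[i][j] for i in range(n)])
                 + obs_logprobs[t][j] for j in range(n)]
    return logsumexp(alpha)

NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]                     # always start in state 0
log_trans = [[math.log(0.5), math.log(0.5)],  # 0 -> 0 / 0 -> 1
             [NEG_INF, 0.0]]                  # state 1 loops on itself
log_obs = [[math.log(0.9), math.log(0.1)],
           [math.log(0.2), math.log(0.8)]]
ll = forward(log_obs, log_trans, log_init)
```

Full Baum-Welch pairs this with a backward pass to get state occupancy counts, from which the transition and emission parameters are re-estimated.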

Language Model
- Provides P(W), the probability of the sentence [Beaujard et al., 1999]. This is also used in the decoding process as the probability of transitioning from one word to another.
- Word sequence: W = w1, w2, w3, …, wn
- The problem: we cannot reliably estimate the conditional word probabilities for all words and all sequence lengths in a given language.
- n-gram language model: uses the previous n-1 words to represent the history. Bigrams are easily incorporated into a Viterbi search.
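Bigram estimation is just relative-frequency counting over word pairs. A minimal sketch with maximum-likelihood estimates and no smoothing; the tiny corpus is invented for illustration.

```python
from collections import Counter

def train_bigram(sentences):
    """Maximum-likelihood bigram estimates P(w2|w1) = c(w1,w2)/c(w1),
    with an <s> sentence-start and </s> sentence-end marker."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        words = ["<s>"] + sent.split() + ["</s>"]
        unigrams.update(words[:-1])
        bigrams.update(zip(words[:-1], words[1:]))
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

p = train_bigram(["the bus stops here", "the bus is late"])
prob = p("the", "bus")   # c(the, bus) / c(the) = 2 / 2 = 1.0
```

A real system would add smoothing (e.g., backoff) precisely because, as the slide notes, most word pairs are never observed in any finite corpus.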

Language Model Example
- Finite State Network (FSN): a word graph over times {세시 (three o'clock), 네시 (four o'clock)}, cities {서울 (Seoul), 부산 (Busan), 대구 (Daegu), 대전 (Daejeon)}, transport {기차 (train), 버스 (bus)}, and the connectives 에서 (from), 출발 (departing), 도착 (arriving), 하는 (that).
- Context Free Grammar (CFG):
  $time = 세시|네시;
  $city = 서울|부산|대구|대전;
  $trans = 기차|버스;
  $sent = $city (에서 $time 출발 | 출발 $city 도착) 하는 $trans
- Bigram: P(에서|서울)=0.2, P(세시|에서)=0.5, P(출발|세시)=1.0, P(하는|출발)=0.5, P(출발|서울)=0.5, P(도착|대구)=0.9, …

Network Construction
- Expanding every word to the state level yields a search network [Demuynck et al., 1997].
- [Diagram: the words 일, 이, 삼, 사 (one, two, three, four) are expanded through the pronunciation model into phone sequences over {I, L, S, A, M}, and each phone into HMM states. The language model supplies word-transition probabilities P(일|x), P(이|x), P(삼|x), P(사|x) on between-word arcs; intra-word arcs connect states within a word, giving one search network from start to end.]
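The word-to-state expansion can be sketched as nested lookups. A minimal sketch; the lexicon entries and the 3-states-per-phone layout are assumptions for illustration, not the slide's exact network.

```python
# Hypothetical lexicon mapping the slide's words to phone strings.
lexicon = {"일": ["I", "L"], "이": ["I"], "삼": ["S", "A", "M"], "사": ["S", "A"]}

def expand_word(word, states_per_phone=3):
    """Expand one word through the pronunciation model down to a linear
    list of HMM state labels, as done when building the search network."""
    return [f"{phone}_{s}"
            for phone in lexicon[word]
            for s in range(states_per_phone)]

states = expand_word("삼")   # S_0 S_1 S_2 A_0 A_1 A_2 M_0 M_1 M_2
```

In the full network these word-level chains are then joined by between-word arcs weighted with the language-model transition probabilities.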

Decoding
- Find Ŵ = argmax_W P(O|W) P(W)
- Viterbi search: dynamic programming.
- Token Passing Algorithm [Young et al., 1989]:
  - Initialize all states with a token carrying a null history and the likelihood that the state is a start state.
  - For each frame ak:
    - For each token t in state s with probability P(t) and history H:
      - For each state r: add a new token to r with probability P(t) P(s,r) Pr(ak) and history s.H.
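The steps above can be sketched in log space, keeping only the best token per state at each frame. A minimal sketch with a toy 2-state model; all probabilities are illustrative.

```python
import math

def token_passing(obs_logprobs, log_trans, log_init):
    """Viterbi decoding by token passing: each state holds the best
    token (log score, state history) seen so far.
    obs_logprobs[t][j] = log b_j(o_t)."""
    n = len(log_init)
    tokens = [(log_init[j] + obs_logprobs[0][j], [j]) for j in range(n)]
    for t in range(1, len(obs_logprobs)):
        new_tokens = []
        for r in range(n):
            # Best predecessor token extended into state r.
            score, hist = max((tokens[s][0] + log_trans[s][r], tokens[s][1])
                              for s in range(n))
            new_tokens.append((score + obs_logprobs[t][r], hist + [r]))
        tokens = new_tokens
    return max(tokens)   # best final token: (log score, state sequence)

NEG_INF = float("-inf")
log_init = [0.0, NEG_INF]
log_trans = [[math.log(0.6), math.log(0.4)], [NEG_INF, 0.0]]
log_obs = [[math.log(0.9), math.log(0.1)],
           [math.log(0.2), math.log(0.8)]]
score, path = token_passing(log_obs, log_trans, log_init)
```

Keeping only the single best token per state is what makes this a Viterbi (max) search rather than the forward (sum) computation used in training.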

Decoding
- Pruning [Young et al., 1996]:
  - The entire search space for Viterbi search is much too large.
  - The solution is to prune tokens on paths whose score is too low. Typical methods:
    - Histogram pruning: keep at most n total hypotheses.
    - Beam pruning: keep only hypotheses whose score is within a fraction of the best score.
- N-best hypotheses and word graphs: keep multiple tokens and return the n-best paths/scores; can produce a packed word graph (lattice).
- Multiple-pass decoding: perform multiple passes, applying successively more fine-grained language models.
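Both pruning methods compose naturally: apply the beam first, then cap the survivor count. A minimal sketch; the beam width, cap, and token scores are invented for illustration.

```python
def prune(tokens, beam_logwidth=10.0, max_hypotheses=3):
    """Beam + histogram pruning over (log score, history) tokens:
    drop tokens more than beam_logwidth below the best score, then
    keep at most max_hypotheses of the rest."""
    best = max(score for score, _ in tokens)
    in_beam = [t for t in tokens if t[0] >= best - beam_logwidth]
    return sorted(in_beam, reverse=True)[:max_hypotheses]

tokens = [(-5.0, ["a"]), (-7.0, ["b"]), (-30.0, ["c"]),
          (-9.0, ["d"]), (-12.0, ["e"])]
survivors = prune(tokens)   # beam drops -30.0; histogram keeps the best 3
```

Tightening either threshold trades accuracy (risk of pruning the eventually-best path) for speed, which is the central tuning knob of a practical decoder.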

Large Vocabulary Continuous Speech Recognition (LVCSR)
- Decoding continuous speech over a large vocabulary is computationally complex because of the huge potential search space.
- Weighted finite-state transducers (WFST) [Mohri et al., 2002]: efficient in time and space. Four transducers — state : HMM, HMM : phone, phone : word, and word : sentence — are combined and optimized into a single search network.
- Dynamic decoding: on-demand network construction with much lower memory requirements.
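The combination step is transducer composition: the output of one layer feeds the input of the next, and path weights add. A minimal sketch restricted to deterministic, epsilon-free machines; the symbols, states, and weights are invented for illustration (real toolkits handle epsilons, non-determinism, and optimization).

```python
def compose(t1, t2):
    """Compose two weighted transducers in the tropical semiring
    (path weights add). Each transducer is a dict mapping
    (state, input_symbol) -> (output_symbol, next_state, weight)."""
    composed = {}
    for (s1, a), (b, n1, w1) in t1.items():
        for (s2, b2), (c, n2, w2) in t2.items():
            if b2 == b:  # t1's output symbol feeds t2's input
                composed[((s1, s2), a)] = (c, (n1, n2), w1 + w2)
    return composed

# Toy phone:word and word:sentence layers.
phone_to_word = {(0, "W AH N"): ("ONE", 0, 1.0), (0, "T UW"): ("TWO", 0, 2.0)}
word_to_sent = {(0, "ONE"): ("ONE", 0, 0.5), (0, "TWO"): ("TWO", 0, 0.7)}
net = compose(phone_to_word, word_to_sent)
```

Composing all four layers offline, then determinizing and minimizing the result, is what gives the WFST approach its time and space efficiency.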

References (1/2)
L. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, 1986. Maximum mutual information estimation of hidden Markov model parameters for speech recognition, Proceedings of ICASSP, pp. 49–52.
C. Beaujard and M. Jardino, 1999. Language modeling based on automatic word concatenations, Proceedings of the 6th European Conference on Speech Communication and Technology, vol. 4, pp. 1563–1566.
K. Demuynck, J. Duchateau, and D. V. Compernolle, 1997. A static lexicon network representation for cross-word context dependent phones, Proceedings of the 5th European Conference on Speech Communication and Technology, pp. 143–146.
T. J. Hazen, I. L. Hetherington, H. Shu, and K. Livescu, 2002. Pronunciation modeling using a finite-state transducer representation, Proceedings of the ISCA Workshop on Pronunciation Modeling and Lexicon Adaptation, pp. 99–104.
M. Mohri, F. Pereira, and M. Riley, 2002. Weighted finite-state transducers in speech recognition, Computer Speech and Language, vol. 16, no. 1, pp. 69–88.

References (2/2)
B. H. Juang, S. E. Levinson, and M. M. Sondhi, 1986. Maximum likelihood estimation for multivariate mixture observations of Markov chains, IEEE Transactions on Information Theory, vol. 32, no. 2, pp. 307–309.
C. H. Lee, F. K. Soong, and K. K. Paliwal, 1996. Automatic Speech and Speaker Recognition: Advanced Topics, Kluwer Academic Publishers.
K. K. Paliwal, 1992. Dimensionality reduction of the enhanced feature set for the HMM-based speech recognizer, Digital Signal Processing, vol. 2, pp. 157–173.
L. R. Rabiner, 1989. A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286.
L. R. Rabiner and B. H. Juang, 1993. Fundamentals of Speech Recognition, Prentice-Hall.
S. J. Young, N. H. Russell, and J. H. S. Thornton, 1989. Token passing: a simple conceptual model for connected speech recognition systems, Technical Report CUED/F-INFENG/TR.38, Cambridge University Engineering Department.
S. Young, J. Jansen, J. Odell, D. Ollason, and P. Woodland, 1996. The HTK Book, Entropic Cambridge Research Lab., Cambridge, UK.