Speech Recognition H02A6 2018-2019
CHAPTER 1 Introduction
Speech Recognition
Goal = Extracting Information from Spoken Language
A speech signal carries several kinds of information:
– WHAT: content, transcription
– WHO: speaker, sex, accent
– WHERE: room acoustics, reverberation, background noise
– HOW: intonation, mood, stress, speaking rate
Speech Recognition -- What is it?
– transcribing text literally … as you want to see in the transcript of a court session
– creating a document from speech … correctly transcribing the text input despite poor and flawed pronunciation, and smoothly disambiguating it from markup commands
– deriving commands from a spoken input … as needed for voice dialing on a telephone
What often seems so easy for humans is not a trivial task for machines:
– there is no one-to-one match between acoustics and speech sounds
– there is no one-to-one match between speech and text
– there is no trivial alignment between speech and text
"The style of written prose is not that of spoken oratory, …" Aristotle, Rhetoric, Book 3, part 12, 350 BC.
Human Speech Recognition
Machine = human brain
– Crafted over millions of years of evolution
– Trainable
– Language independent
Learning / Adaptation
– Complex multi-staged learning: listening, babble, sounds, words, sentences
– A couple of years of real-life data
– Language dependent
Speech Recognition by Machines
"Inspired" by Human Speech Recognition
– Airplanes don't flap wings
Machinery
– Computers
– Machine Learning algorithms for sequence-to-sequence recognition: DTW, HMMs, DNNs, Viterbi, CTC, …
Data
– Large speech + text databases, from which the acoustic model, vocabulary and language model are obtained
Speech Recognition Architecture
[Diagram: speech signal → Feature Extraction → feature vectors → Recognizer → word sequence]
– Feature Extraction: spectral analysis and feature engineering, with algorithms inspired by the human auditory system
– Recognizer: sequence-to-sequence machine learning algorithms, driven by an acoustic model and a language model
– Training (AM + LM): from speech & text data
Speech Recognition Architecture (ASR SYSTEM)
[Block diagram: speech signal → Feature Extraction → feature vectors → Recognizer → word sequence; the Recognizer draws on an acoustic model, a lexicon and a language model]
– AM TRAINING: the acoustic model is trained from annotated speech data (feature-extracted speech plus transcriptions)
– LM TRAINING: the language model is trained from text data
ASR Components (1): ACOUSTIC MODEL
[Figure: spectrogram of the word COMPUTERS (time axis 0.10–0.40 sec) with the aligned phone labels K | AH | M | P | Y | UW | T | ER | S]
ACOUSTIC MODEL = a computational model (GMM, DNN, …) of the input-output relationship, trained from example data
– INPUT: a sliding window over the spectrogram
– OUTPUT: label scores
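As a concrete toy illustration of such an input-output mapping (not the course's reference model), the sketch below scores a small phone set for every frame with a one-hidden-layer network over a sliding spectrogram window. The phone set, window size, network shape and random weights are placeholder assumptions made only for the example.

```python
import numpy as np

# Toy acoustic model: a one-hidden-layer network that maps a sliding window of
# spectrogram frames to scores over a small phone set.  The phone set, the
# window size and the random weights are placeholder assumptions.

PHONES = ["K", "AH", "M", "P", "Y", "UW", "T", "ER", "S", "sil"]
N_BINS, WINDOW = 40, 11                      # 40 spectral bins, 11-frame context
IN_DIM, HID_DIM = N_BINS * WINDOW, 64

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(IN_DIM, HID_DIM))
W2 = rng.normal(scale=0.1, size=(HID_DIM, len(PHONES)))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def label_scores(spectrogram):
    """spectrogram: (n_frames, N_BINS) -> (n_frames, n_phones) label scores."""
    half = WINDOW // 2
    padded = np.pad(spectrogram, ((half, half), (0, 0)), mode="edge")
    out = []
    for t in range(spectrogram.shape[0]):
        window = padded[t:t + WINDOW].reshape(-1)    # stack the context window
        out.append(softmax(np.tanh(window @ W1) @ W2))
    return np.array(out)

# 40 frames (~0.4 s at a 10 ms frame shift) of fake spectral features
print(label_scores(rng.random((40, N_BINS))).shape)   # (40, 10)
```

In a real system the weights are of course trained from example data (GMM or DNN training), but the input-output shape is exactly this: one score per label per frame.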
ASR Components (2): LEXICON
The lexicon gives the pronunciation of each word in the language.
– The Roman alphabet was not designed for the English language.
– Due to the different dynamics of written language (spelling reforms) and spoken language (uncontrolled natural evolution), discrepancies exist between spoken and written forms.
– The lexicon includes pronunciation variants (tomato) and homonyms (desert).

Example entries (ARPABET):
ABOUT        AH B AW T
ACTION       AE K SH AH N
CHEESE       CH IY Z
COMPUTER     K AH M P Y UW T ER
DESERT       D EH Z ER T
DESERT(1)    D IH Z ER T
DIVE         D AY V
GOOD         G UH D
HIDDEN       HH IH D AH N
JOY          JH OY
LANGUAGE     L AE NG G W AH JH
LANGUAGE(1)  L AE NG G W IH JH
LEARNING     L ER N IH NG
MACHINE      M AH SH IY N
MARKOV       M AA R K AO F
MODEL        M AA D AH L
MOTHER       M AH DH ER
NETWORKS     N EH T W ER K S
NEURAL       N UH R AH L
NEURAL(1)    N Y UH R AH L
PLEASURE     P L EH ZH ER
RECOGNITION  R EH K AH G N IH SH AH N
SHOUT        SH AW T
SPEECH       S P IY CH
SPOON        S P UW N
THING        TH IH NG
TOMATO       T AH M EY T OW
TOMATO(1)    T AH M AA T OW
TOMATOE      T AH M EY T OW
TOMATOE(1)   T AH M AA T OW
YES          Y EH S
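To make the idea concrete, a lexicon with variants can be held in a simple word-to-pronunciations mapping. The sketch below reuses a few entries from the list above; the dict-of-lists layout and the helper name `pronunciations` are illustrative choices, not a prescribed format.

```python
# Toy pronunciation lexicon with variants, using ARPABET symbols.
# Entries copied from the list above; the data layout is an assumption.
LEXICON = {
    "COMPUTER": [["K", "AH", "M", "P", "Y", "UW", "T", "ER"]],
    "TOMATO":   [["T", "AH", "M", "EY", "T", "OW"],
                 ["T", "AH", "M", "AA", "T", "OW"]],   # pronunciation variants
    "DESERT":   [["D", "EH", "Z", "ER", "T"],
                 ["D", "IH", "Z", "ER", "T"]],         # noun vs. verb reading
}

def pronunciations(word):
    """All pronunciation variants of a word, or [] if it is out of vocabulary."""
    return LEXICON.get(word.upper(), [])

print(pronunciations("tomato"))     # two variants
print(pronunciations("xylophone"))  # [] -> out-of-vocabulary word
```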
Phonetic Alphabet

ARPABET  IPA  EXAMPLE   TRANSCRIPTION  PHONE CLASS
AA       ɑ    balm      B AA L M       VOWEL
AE       æ    bat       B AE T         VOWEL
AH       ʌ    butt      B AH T         VOWEL
AO       ɔ    bought    B AO T         VOWEL
AW       aʊ   bout      B AW T         VOWEL
AY       aɪ   bite      B AY T         VOWEL
B        b    buy       B AY           STOP
CH       tʃ   church    CH ER CH       AFFRICATE
D        d    die       D AY           STOP
DH       ð    thy       DH AY          FRICATIVE
EH       ɛ    bet       B EH T         VOWEL
ER       ɝ    bird      B ER D         VOWEL
EY       eɪ   bait      B EY T         VOWEL
F        f    fight     F AY T         FRICATIVE
G        ɡ    guy       G AY           STOP
HH       h    high      HH AY          SEMIVOWEL
IH       ɪ    bit       B IH T         VOWEL
IY       i    beat      B IY T         VOWEL
JH       dʒ   jive      JH AY V        AFFRICATE
K        k    kite      K AY T         STOP
L        l    lie       L AY           LIQUID
M        m    my        M AY           NASAL
N        n    nigh      N AY           NASAL
NG       ŋ    sing      S IH NG        NASAL
OW       oʊ   boat      B OW T         VOWEL
OY       ɔɪ   boy       B OY           VOWEL
P        p    pie       P AY           STOP
R        ɹ    rye       R AY           LIQUID
S        s    sigh      S AY           FRICATIVE
SH       ʃ    shy       SH AY          FRICATIVE
T        t    tie       T AY           STOP
TH       θ    thigh     TH AY          FRICATIVE
UH       ʊ    book      B UH K         VOWEL
UW       u    boot      B UW T         VOWEL
V        v    vie       V AY           FRICATIVE
W        w    wise      W AY Z         SEMIVOWEL
Y        j    yacht     Y AA T         SEMIVOWEL
Z        z    zoo       Z UW           FRICATIVE
ZH       ʒ    pleasure  P L EH ZH ER   FRICATIVE

CMU LEXICON
– a reference lexicon used in many (US English) speech recognition benchmarks
IPA (International Phonetic Alphabet)
– allows for a representation of spoken language
– allows for different levels of detail
ARPABET
– a simplified ASCII rendering of the IPA
– in the CMU DICT 39 phonetic symbols are used; 3 levels of stress may be added to the vowels (optional)
– additional non-speech symbols may be used to transcribe silence, coughs, other non-speech events, …
This is not a course on phonetics … we use our intuition in these matters.
ASR Components (3): LANGUAGE MODEL
A statistical language model uses word sequence probabilities to predict the next possible words
– it allows for disambiguating homonyms on the basis of the context
– more importantly, it drastically improves speech recognition accuracy in general by favoring probable text over unlikely text when the acoustic model is uncertain

context   next word   probability
I         want        .10
I         would       .10
I         don't       .03
…
I want    to          .48
I want    a           .10
I want    that        .05
I want    you         .01
…
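The table above is in effect a conditional probability table P(next word | context). The sketch below wraps the same illustrative numbers in a toy scoring function; the floor probability for unseen continuations and the helper names are added assumptions, not part of the slide.

```python
# Toy statistical language model: a table of P(next word | context) using the
# illustrative values from the slide.  The floor probability is an assumption.
import math

NEXT_WORD_PROBS = {
    "i":      {"want": 0.10, "would": 0.10, "don't": 0.03},
    "i want": {"to": 0.48, "a": 0.10, "that": 0.05, "you": 0.01},
}
FLOOR = 1e-4   # assumed back-off for unseen continuations

def next_word_prob(context, word):
    return NEXT_WORD_PROBS.get(context.lower(), {}).get(word.lower(), FLOOR)

def score(words):
    """Log probability of a word sequence under this toy model (first word unscored)."""
    logp = 0.0
    for i in range(1, len(words)):
        context = " ".join(words[max(0, i - 2):i])   # up to two preceding words
        logp += math.log(next_word_prob(context, words[i]))
    return logp

print(score(["I", "want", "to"]) > score(["I", "want", "you"]))   # True
```

During recognition this score is combined with the acoustic score, so that likely word sequences win whenever the acoustics alone are ambiguous.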
ASR Components (4): RECOGNIZER
The speech recognizer (search engine)
1. takes the stream of acoustic evidence vectors (computed by the AM) as input (typically one every 10 msec)
2. finds (searches for) the most likely word sequence given the acoustic evidence, subject to the linguistic (and/or application) knowledge available in
– the lexicon
– the language model
Speech Recognition in a Bayesian Framework
[Diagram: the Search Engine combines the Acoustic Model + Lexicon with the Language Model]
– X_{1:T} = feature vector sequence
– W = word sequence
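The formula on the original slide did not survive the text extraction; the block below restates the standard Bayesian decision rule the diagram refers to, using the notation defined above.

```latex
\hat{W} = \arg\max_{W} P(W \mid X_{1:T})
        = \arg\max_{W} \, p(X_{1:T} \mid W)\, P(W)
```

Here p(X_{1:T} | W) is supplied by the acoustic model plus lexicon and P(W) by the language model; the denominator p(X_{1:T}) does not depend on W, so the search engine can ignore it.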
Speech Recognition is a "Sequence" Problem
– X_{1:T} = sequence of feature vectors: the input is a sequence of feature vectors of length T
– U_{1:N} = sequence of units (forming the words): a sequence of units of a priori unknown length N [restrictions may apply]
– W_{1:M} = sequence of recognized words
– the segmentation of the input stream w.r.t. the unit stream is a priori unknown
STEP 1: ACOUSTIC MODEL
(static) Pattern matching of feature vectors
STEP 2: Sequence Matching
Finding the most plausible transcription of audio data
[Figure: local phone evidence over time (labels such as Y, K, L, M, N, AH, P, …), with the best path spelling out K AH M P …]
– An acoustic model provides local (in time) evidence for each phone
– By dynamic programming, adding acoustic, phonological, lexical and linguistic constraints, you can find the most plausible transcription of unsegmented acoustic data
SPEECH RECOGNITION is a sequence-to-sequence problem: converting a sequence of continuous acoustic vectors into a (much) shorter sequence of discrete words
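A minimal sketch of the dynamic-programming idea is given below: it aligns per-frame phone log-scores to one candidate phone sequence, letting each phone span one or more frames, and returns the best alignment score. The toy frame scores and the restriction to a single fixed phone sequence (rather than a full lexicon and language-model search) are simplifying assumptions for illustration only.

```python
import numpy as np

# Dynamic-programming sequence matching: align per-frame phone log-scores
# (as produced by an acoustic model) to a phone sequence taken from the
# lexicon, letting each phone span one or more consecutive frames.

def align(log_scores, phone_ids):
    """log_scores: (T, n_phones) array; phone_ids: phone indices in order (length N).
    Returns the best total log-score over all monotone frame-to-phone alignments."""
    T, N = log_scores.shape[0], len(phone_ids)
    dp = np.full((T, N), -np.inf)
    dp[0, 0] = log_scores[0, phone_ids[0]]
    for t in range(1, T):
        for n in range(N):
            stay = dp[t - 1, n]                               # remain in the same phone
            advance = dp[t - 1, n - 1] if n > 0 else -np.inf  # enter the next phone
            dp[t, n] = max(stay, advance) + log_scores[t, phone_ids[n]]
    return dp[T - 1, N - 1]        # all phones visited, in order, by the last frame

# Toy example: 6 frames of posteriors over {K, AH, M}; early frames favor K,
# middle frames AH, late frames M, so the ordering K-AH-M wins over M-AH-K.
posteriors = np.array([[.8, .1, .1], [.7, .2, .1], [.2, .7, .1],
                       [.1, .8, .1], [.1, .2, .7], [.1, .1, .8]])
scores = np.log(posteriors)
print(align(scores, [0, 1, 2]) > align(scores, [2, 1, 0]))   # True
```

A full recognizer runs the same kind of recursion over all word sequences allowed by the lexicon and language model, which is exactly the Viterbi search discussed later in the course.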
Content
INTRODUCTION
PART I: SPEECH ANALYSIS (L1-L2, lab 1-2)
– The human auditory system & auditory perception
– Speech Features: Formants, Pitch, Spectral Envelope, Cepstrum
PART II: ACOUSTIC MODELING (L3-L7, lab 3-5)
– Pattern Matching (Bayesian Classification) applied to Speech Features
– Algorithms for sequence-to-sequence recognition: DTW, HMMs
– Context-Dependent Modeling, Decision Trees
PART IV: ASR SYSTEMS (L8)
– Knowledge sources: Lexicon & Language Models
– Search Engine
PART V: ADVANCED MODELING TECHNIQUES (L9-L10, lab 6)
– Issues in Acoustic Modeling (Adaptation, Noise, …)
– Deep Neural Nets for AM
– Deep Neural Nets for LM
APPLICATIONS, INDUSTRIAL OUTLOOK
PREREQUISITES
A healthy mathematical mind
– Basic mathematical concepts and notation: matrix algebra, log, exp, …
– Basic statistical concepts: mean, variance, normal & binomial distributions
– Notions of information theory: information expressed in bits
A self-test is available on TOLEDO. If the test is difficult for you, please revisit your old math course notes before asking us any questions!
EXAM
– oral with written preparation
– 4 or 5 questions on different sections of the course, exercise oriented
– OPEN BOOK
RULES
– limited to 2 hrs (open book is not intended as 'study time')
– you can bring any WRITTEN material: NO COMMUNICATION DEVICES, NO LIVING BEINGS (physical or virtual)!!
– you may use a basic calculator: all calculations can be done by hand, and don't blame your calculator for using the wrong formula
Practical Information H02A6 (2018-2019)
4 ECTS – 1st semester
Lectures
– Fri 14:00-16:00, ESAT – AULA L
– Dirk Van Compernolle, ESAT 01.05, compi@esat.kuleuven.be
Exercises
– 6 sessions x 2.5 hrs
– Lyan Verwimp, lyan.verwimp@esat.kuleuven.be
Course Notes
– distributed in class
TOLEDO
– General messages – exceptions to the standard class schedule
– Video recordings of a similar course
– Addenda to the course notes
– Exercises: info & solutions
Tentative Class Schedule 2018

LECTURES (ELEC AULA L)
WK  LECT  DATE        TOPIC
1   1     28/09/2018  Introduction, The Auditory Scene, Architecture of ASR systems
2   2     05/10/2018  Spectrogram, Source-Filter Model
3   3     12/10/2018  A Bayesian Framework for pattern recognition. Speech Features: Energy, Pitch, Formants
4   4     19/10/2018  Frame Based Recognition: Gaussian Mixture Models, EM Algorithm. Speech Features: MEL FBANK, MFCC
5   5     26/10/2018  Recognition of Sequence Data. Levenshtein distance. Dynamic Time Warping.
6   -     02/11/2018  No Class
7   6     09/11/2018  Hidden Markov Models for speech recognition: concept, Viterbi & Forward Pass algorithms, Viterbi training
8   7     16/11/2018  Multistate, context-dependent modeling, decision trees (incl. phonetics)
9   8     23/11/2018  ASR Architecture, language modeling and search
10  9     30/11/2018  Neural Nets for AM in Speech Recognition
11  10    07/12/2018  Neural Nets for LM in Speech Recognition
12  11    14/12/2018  Systems, Demos, Benchmarks, Industrial Outlook
13  -     21/12/2018  Reserve (Q&A)

EXERCISE SESSIONS
WK  SESSION  TOPIC                                      GROUP A (ELEC 02.53, Tuesdays)  GROUP B (ELEC 00.60, Wednesdays)
3   Ex1      Auditory Demonstrations                    09/10 - 13:30                   10/10 - 16:00
5   Ex2      Spectrograms                               23/10 - 13:30                   24/10 - 16:00
7   Ex3      Pattern Matching, Frame Based Recognition  06/11 - 13:30                   07/11 - 16:00
9   Ex4      Trellis Processing                         20/11 - 13:30                   21/11 - 16:00
11  Ex5      Training of HMMs, Decision trees           04/12 - 13:30                   05/12 - 16:00
12  Ex6      RNN language modeling                      11/12 - 13:30                   12/12 - 16:00
13  -        Reserve                                    18/12 - 13:30                   19/12 - 16:00