A brief overview of Speech Recognition and Spoken Language Processing Advanced NLP Guest Lecture August 31 Andrew Rosenberg
Speech and NLP: Communication in Natural Language
Text:
–Carefully prepared
–Grammatical
–Machine readable (with typos, and sometimes OCR or handwriting issues)
Speech and NLP: Communication in Natural Language
Speech:
–Spontaneous
–Less grammatical
–Machine readable, with > 10% error when using speech recognition
The traditional view
[Diagram: a Text Processing System (e.g. a Named Entity Recognizer), shown in Training and in Application. Data: Text Documents]
The simplest approach
[Diagram: the same Training and Application systems. Data: Transcribed Documents, Text Documents]
Speech is errorful text
[Diagram: the same Training and Application systems. Data: Transcribed Documents]
Speech signal can be used
[Diagram: the same Training and Application systems. Data: Transcribed Documents]
Hybrid speech signal and text
[Diagram: the same Training and Application systems. Data: Transcribed Documents, Text Documents]
Speech Recognition
Standard HMM speech recognition:
–Front End
–Acoustic Model
–Pronunciation Model
–Language Model
–Decoding
Speech Recognition
[Diagram: Front End → Acoustic Feature Vector → Acoustic Model → Phone Likelihoods → Pronunciation Model → Word Likelihoods → Language Model → Word Sequence]
Speech Recognition
–Front End: Convert sounds into a sequence of observation vectors
–Language Model: Calculate the probability of a sequence of words
–Pronunciation Model: The probability of a pronunciation given a word
–Acoustic Model: The probability of a set of observations given a phone label
Front End
How do we convert a waveform into a useful representation? We are looking for a vector of numbers that describes the acoustic content. Assuming 22 kHz, 16-bit sound, modeling the raw samples directly is not feasible... yet.
Discrete Cosine Transform
Every wave can be decomposed into component sine and cosine waves. The Fast Fourier Transform (FFT) is used to do this efficiently.
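The decomposition into component waves can be sketched with a naive discrete Fourier transform. This is an illustration only: the test signal and the function name `dft` are made up for the example, and a real front end would call an FFT routine, which computes the same result much faster.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) discrete Fourier transform; an FFT gives the same values faster."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Hypothetical test signal: a sine with 4 cycles plus a weaker sine with 10 cycles.
N = 64
signal = [
    math.sin(2 * math.pi * 4 * t / N) + 0.5 * math.sin(2 * math.pi * 10 * t / N)
    for t in range(N)
]

spectrum = dft(signal)
# Keep the first half of the bins; the second half mirrors it for real input.
magnitudes = [abs(c) for c in spectrum[: N // 2]]
# The two strongest bins are the frequencies of the two component sines.
peaks = sorted(sorted(range(N // 2), key=lambda k: magnitudes[k], reverse=True)[:2])
print(peaks)  # [4, 10]
```

The magnitude at each bin measures how much of that frequency is present, which is the information a spectrogram displays over time.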
Overlapping frames
Spectrograms allow for visual inspection of spectral information. We are looking for a compact, numerical representation.
[Figure label: 10ms]
Single Frame of FFT
[Figure: Australian male /i:/ from “heed”. FFT analysis window 12.8ms]
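A minimal sketch of cutting a waveform into overlapping frames. The 10 ms shift matches the figure label; the 25 ms frame length, the 22,000 Hz sample rate, and the name `frame_signal` are assumptions chosen for the example.

```python
def frame_signal(samples, sample_rate, frame_ms=25.0, shift_ms=10.0):
    """Split samples into fixed-length frames that start every shift_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    shift = int(sample_rate * shift_ms / 1000)
    frames = []
    start = 0
    # Stop when a full frame no longer fits; trailing samples are dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += shift
    return frames

# One second of placeholder "audio" at an assumed 22,000 samples per second.
sample_rate = 22000
samples = list(range(sample_rate))
frames = frame_signal(samples, sample_rate)
print(len(frames), len(frames[0]))  # 98 550
```

Because the shift is shorter than the frame, consecutive frames share samples, so each frame sees enough signal for a stable spectrum while the sequence still tracks fast changes.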
Example Spectrogram 15
“Standard” Representation
Mel Frequency Cepstral Coefficients (MFCC)
[Pipeline: Pre-Emphasis → window → FFT → Mel-Filter Bank → log → FFT⁻¹ → Deltas, with energy computed alongside]
Output per frame (39 values):
–12 MFCC, 12 ∆ MFCC, 12 ∆∆ MFCC
–1 energy, 1 ∆ energy, 1 ∆∆ energy
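The delta stage can be sketched as follows. This assumes a simple two-point difference with repeated edge frames; real toolkits usually use a regression over a wider window. The static features here are placeholder numbers, not real MFCCs.

```python
def deltas(frames):
    """Per-dimension difference (x[t+1] - x[t-1]) / 2, repeating the edge frames."""
    n = len(frames)
    out = []
    for t in range(n):
        prev_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev_frame, next_frame)])
    return out

def add_deltas(static):
    """Append delta and delta-delta features to each static frame."""
    d1 = deltas(static)
    d2 = deltas(d1)
    return [s + a + b for s, a, b in zip(static, d1, d2)]

# Placeholder static features: 5 frames of 13 values (12 MFCC + 1 energy).
static = [[float(t * (i + 1)) for i in range(13)] for t in range(5)]
features = add_deltas(static)
print(len(features[0]))  # 39
```

Each 13-dimensional static frame becomes a 39-dimensional vector, matching the 12 + 12 + 12 + 1 + 1 + 1 layout on the slide.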
Language Model
What is the probability of a sequence of words? Assume you have a vocabulary of V words. How many possible sequences of N words are there? (V^N.)
General Language Modeling
Any probability calculation can be used here, e.g. class-based language models or recurrent neural networks.
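A small sketch of why sequence probabilities cannot simply be tabulated, and how the chain rule scores a sequence one word at a time. The uniform model and the vocabulary size of 10,000 are placeholders for illustration, not a proposed language model.

```python
import math

def num_sequences(vocab_size, length):
    """Number of distinct word sequences of the given length: V ** N."""
    return vocab_size ** length

def sequence_log_prob(words, cond_prob):
    """Chain rule: log P(w_1..w_n) = sum over i of log P(w_i | w_1..w_{i-1})."""
    total = 0.0
    for i, word in enumerate(words):
        total += math.log(cond_prob(word, tuple(words[:i])))
    return total

def uniform(word, history, vocab_size=10_000):
    """Placeholder model: every word is equally likely, whatever the history."""
    return 1.0 / vocab_size

print(num_sequences(10_000, 5))  # 100000000000000000000
log_p = sequence_log_prob(["catch", "the", "orange", "line"], uniform)
```

Any function with the signature of `uniform` can be plugged in, which is the sense in which any probability calculation can serve as the language model.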
Pronunciation Modeling
Identify the likelihood of a phone sequence given a word sequence. There are many simplifying assumptions in pronunciation modeling.
1. The pronunciation of each word is independent of the previous and following words.
Dictionary as Pronunciation Model
Assume each word has a single pronunciation.
I       AY
CAT     K AE T
THE     DH AH
HAD     H AE D
ABSURD  AH B S ER D
YOU     Y UH D
Weighted Dictionary as Pronunciation Model
Allow multiple pronunciations and weight each by its likelihood.
I    AY     .4
I    IH     .6
THE  DH AH  .7
THE  DH IY  .3
YOU  Y UH   .5
YOU  Y UW   .5
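A weighted dictionary can be sketched as a lookup table from words to weighted phone sequences. The entries below are the ones on the slide; the names `PRON_DICT` and `pronunciations` are made up for the example, and the probabilities are multiplied across words under the independence assumption stated earlier.

```python
import math
from itertools import product

# word -> list of (phone sequence, probability), taken from the slide.
PRON_DICT = {
    "I": [(("AY",), 0.4), (("IH",), 0.6)],
    "THE": [(("DH", "AH"), 0.7), (("DH", "IY"), 0.3)],
    "YOU": [(("Y", "UH"), 0.5), (("Y", "UW"), 0.5)],
}

def pronunciations(words):
    """Yield (phones, probability) for every combination of per-word pronunciations."""
    options = [PRON_DICT[w] for w in words]
    for combo in product(*options):
        phones = tuple(p for pron, _ in combo for p in pron)
        prob = math.prod(weight for _, weight in combo)
        yield phones, prob

candidates = list(pronunciations(["I", "THE"]))
best = max(candidates, key=lambda item: item[1])
print(best[0])  # ('IH', 'DH', 'AH')
```

The probabilities of all candidate phone sequences for a word sequence sum to one, because each word's pronunciation weights do.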
Grapheme to Phoneme conversion
What about words that you have never seen before? What if you don’t think you’ve seen every possible pronunciation? How do you pronounce “McKayla” or “Zoomba”? Try to learn the phonetics of the language.
Letter to Sound Rules
Manually written rules that convert one or more letters to one or more sounds.
T -> /t/
H -> /h/
TH -> /dh/
E -> /e/
These rules can get complicated based on the surrounding context.
–K is silent when word-initial and followed by N.
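A minimal sketch of applying such rules, using longest match first so that TH wins over T followed by H. The rules for K and N are not on the slide and are added only so the examples run; the function name `letters_to_sounds` is also made up here.

```python
RULES = {
    "TH": ["dh"],
    "T": ["t"],
    "H": ["h"],
    "E": ["e"],
    # Assumed rules, not from the slide, so that the examples below are covered.
    "K": ["k"],
    "N": ["n"],
}

def letters_to_sounds(word):
    """Convert a word to phones with longest-match rules plus one context rule."""
    word = word.upper()
    phones = []
    i = 0
    while i < len(word):
        # Context rule: K is silent when word-initial and followed by N.
        if i == 0 and word.startswith("KN"):
            i += 1
            continue
        # Try the two-letter rule before the one-letter rule.
        for length in (2, 1):
            chunk = word[i:i + length]
            if len(chunk) == length and chunk in RULES:
                phones.extend(RULES[chunk])
                i += length
                break
        else:
            raise ValueError(f"no rule for letter {word[i]!r}")
    return phones

print(letters_to_sounds("the"))   # ['dh', 'e']
print(letters_to_sounds("knee"))  # ['n', 'e', 'e']
```

Even this toy version shows why hand-written rules grow quickly: every exception needs its own context check ahead of the general table.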
Acoustic Modeling
Hidden Markov model.
–Used to model the relationship between two sequences.
Hidden Markov model
In a Hidden Markov Model the state sequence is unobserved. Only an observation sequence is available.
[Diagram: hidden states q1 → q2 → q3, each emitting an observation x1, x2, x3]
Hidden Markov model
–Observations are MFCC vectors
–States are phone labels
–Each state (phone) has an associated GMM modeling the MFCC likelihood
[Diagram: hidden states q1 → q2 → q3, each emitting an observation x1, x2, x3]
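Recovering the hidden phone sequence from the observations can be sketched with Viterbi decoding. To keep the example short, each state emits a single number from one Gaussian rather than a 39-dimensional MFCC vector from a GMM; the phones, means, and transition probabilities are invented for illustration.

```python
import math

NEG_INF = float("-inf")

def gaussian_log_pdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def viterbi(observations, states, log_init, log_trans, log_emit):
    """Return the most likely state sequence for the observations."""
    # best[t][s]: log probability of the best path that ends in state s at time t.
    best = [{s: log_init[s] + log_emit(s, observations[0]) for s in states}]
    back = []
    for x in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit(s, x)
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy left-to-right model for the word "cat" (all numbers are assumptions).
states = ["k", "ae", "t"]
means = {"k": 0.0, "ae": 5.0, "t": 10.0}
log_init = {"k": 0.0, "ae": NEG_INF, "t": NEG_INF}
log_trans = {p: {s: NEG_INF for s in states} for p in states}
for i, p in enumerate(states):
    log_trans[p][p] = math.log(0.5)                  # stay in the same phone
    if i + 1 < len(states):
        log_trans[p][states[i + 1]] = math.log(0.5)  # move to the next phone

def log_emit(state, x):
    return gaussian_log_pdf(x, means[state], 1.0)

path = viterbi([0.1, -0.2, 4.8, 5.3, 9.9, 10.2], states, log_init, log_trans, log_emit)
print(path)  # ['k', 'k', 'ae', 'ae', 't', 't']
```

Working in log probabilities avoids underflow on long utterances, and forbidden transitions are simply given a log probability of negative infinity.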
Training acoustic models
TIMIT
–close, manual phonetic transcription
–2342 sentences
Extract MFCC vectors from each frame within each phone. For each phone, train a GMM using Expectation Maximization. These GMMs are the Acoustic Model.
–It is common to use 8 or 16 Gaussian mixture components.
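The training recipe above can be sketched with a small Expectation Maximization loop. This is a simplification under stated assumptions: features are single numbers instead of MFCC vectors, each GMM has 2 components instead of 8 or 16, and the per-phone data is invented. The names `train_gmm` and `frames_by_phone` are hypothetical.

```python
import math

def gaussian_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)

def train_gmm(data, k, iterations=50, var_floor=1e-3):
    """Fit a 1-D Gaussian mixture with EM; returns (weights, means, variances)."""
    data = sorted(data)
    n = len(data)
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    weights = [1.0 / k] * k
    # Deterministic start: spread the initial means across the sorted data.
    means = [data[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [max(overall_var, var_floor)] * k
    for _ in range(iterations):
        # E-step: responsibility of each component for each data point.
        resp = []
        for x in data:
            joint = [w * gaussian_pdf(x, m, v)
                     for w, m, v in zip(weights, means, variances)]
            total = sum(joint)
            resp.append([p / total for p in joint])
        # M-step: re-estimate parameters from the weighted data.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var, var_floor)
    return weights, means, variances

# Invented per-phone feature values, standing in for MFCC frames grouped by phone label.
frames_by_phone = {
    "aa": [-1.0, -0.5, 0.0, 0.5, 1.0, 9.0, 9.5, 10.0, 10.5, 11.0],
    "s": [20.0, 20.5, 21.0, 29.0, 29.5, 30.0],
}
# One GMM per phone: together they form the acoustic model.
acoustic_model = {phone: train_gmm(frames, k=2)
                  for phone, frames in frames_by_phone.items()}
```

The variance floor is a common safeguard: without it a component that locks onto a single data point can shrink its variance toward zero.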
Gaussian Mixture Model 31
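HMM Topology for Training
Rather than having one GMM per phone, it is common for acoustic models to represent each phone with several HMM states, typically 3 emitting states (beginning, middle, and end of the phone), each with its own GMM.
[Diagram: HMM states S1, S2, S3, S4, S5 for the phone /r/]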
Speech in Natural Language Processing
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
Speech in Natural Language Processing
Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line... ) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.
Spoken Language Processing
[Diagram: Speech Recognition → NLP system (IR, IE, QA, Summarization, Topic Modeling)]
Spoken Language Processing
[Diagram: ASR transcript → NLP system (IR, IE, QA, Summarization, Topic Modeling)]
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
Dealing with Speech Errors
[Diagram: ASR transcript → Robust NLP system (IR, IE, QA, Summarization, Topic Modeling)]
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
Automatic Speech Recognition Assumption
ASR produces a “transcript” of Speech.
ALSO FROM NORTH STATION I THINK THE ORANGE LINE RUNS BY THERE TOO SO YOU CAN ALSO CATCH THE ORANGE LINE AND THEN INSTEAD OF TRANSFERRING UM I YOU KNOW THE MAP IS REALLY OBVIOUS ABOUT THIS BUT INSTEAD OF TRANSFERRING AT PARK STREET YOU CAN TRANSFER AT UH WHAT’S THE STATION NAME DOWNTOWN CROSSING UM AND THAT’LL GET YOU BACK TO THE RED LINE JUST AS EASILY
Automatic Speech Recognition Assumption
ASR produces a “transcript” of Speech.
“Rich Transcription”:
Also, from the North Station... (I think the Orange Line runs by there too so you can also catch the Orange Line... ) And then instead of transferring (um I- you know, the map is really obvious about this but) Instead of transferring at Park Street, you can transfer at (uh what’s the station name) Downtown Crossing and (um) that’ll get you back to the Red Line just as easily.
Speech as Noisy Text
[Diagram: Speech Recognition → Robust NLP system (IR, IE, QA, Summarization, Topic Modeling)]
–Decrease WER (word error rate)
–Increase Robustness
Other directions for improvement
[Diagram: Speech Recognition → Robust NLP system (IR, IE, QA, Summarization, Topic Modeling)]
–Prosodic Analysis
–Use Lattices or N-Best lists
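Word error rate (WER) is conventionally the word-level edit distance between the reference transcript and the recognizer output, divided by the number of reference words. A minimal sketch, with an invented recognition error on a phrase from the example transcript:

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1] / len(ref)

reference = "you can transfer at downtown crossing"
hypothesis = "you can transfer at down town crossing"  # invented ASR output
wer = word_error_rate(reference, hypothesis)
print(round(wer, 3))  # 0.333
```

Here one substitution ("downtown" to "down") plus one insertion ("town") gives 2 errors over 6 reference words. Because insertions count, WER can exceed 100%.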
Processing Speech
Processing speech is difficult
–There are errors in transcripts.
–It is not grammatical.
–The style (genre) of speech is different from the available (text) training data.
Processing speech is easy
–Speaker information
–Intention (sarcasm, certainty, emotion, etc.)
–Segmentation