Automatic Speech Recognition Introduction Readings: Jurafsky & Martin HLT Survey Chapter 1
The Human Dialogue System
Computer Dialogue Systems
[Diagram: components Audition, Automatic Speech Recognition, Natural Language Understanding, Dialogue Management, Planning, Natural Language Generation, Text-to-speech]
Representations passed between components: signal → words → logical form → words → signal
Computer Dialogue Systems
[Diagram: components Audition, ASR, NLU, Dialogue Mgmt., Planning, NLG, Text-to-speech]
Representations passed between components: signal → words → logical form → words → signal
Parameters of ASR Capabilities
Different types of tasks with different difficulties
– Speaking mode (isolated words/continuous speech)
– Speaking style (read/spontaneous)
– Enrollment (speaker-independent/dependent)
– Vocabulary (small < 20 words/large > 20k words)
– Language model (finite state/context sensitive)
– Perplexity (small < 10/large > 100)
– Signal-to-noise ratio (high > 30 dB/low < 10 dB)
– Transducer (high quality microphone/telephone)
The Noisy Channel Model
[Diagram: message → noisy channel → message]
Message + Channel = Signal
Decoding model: find Message* = argmax_Message P(Message|Signal)
But how do we represent each of these things?
ASR using HMMs
Try to solve P(Message|Signal) by breaking the problem up into separate components
Most common method: Hidden Markov Models
– Assume that a message is composed of words
– Assume that words are composed of sub-word parts (phones)
– Assume that phones have some sort of acoustic realization
– Use probabilistic models for matching acoustics to phones to words
HMMs: The Traditional View
[Figure: words "go home" over a backbone of phones g, o, h, o, m, with lines connecting the phones to acoustic observations x_0 … x_9]
Markov model backbone composed of phones (hidden because we don’t know correspondences)
Acoustic observations
Each line represents a probability estimate (more later)
HMMs: The Traditional View
[Figure: words "go home" over a backbone of phones g, o, h, o, m, with lines connecting the phones to acoustic observations x_0 … x_9]
Markov model backbone composed of phones (hidden because we don’t know correspondences)
Acoustic observations
Even with the same word hypothesis, we can have different alignments. We also have to search over all word hypotheses.
HMMs as Dynamic Bayesian Networks
[Figure: words "go home"; state chain q_0=g, q_1=o, q_2=o, q_3=o, q_4=h, q_5=o, q_6=o, q_7=o, q_8=m, q_9=m, with acoustic observations x_0 … x_9]
Markov model backbone composed of phones
Acoustic observations
HMMs as Dynamic Bayesian Networks
[Figure: words "go home"; state chain q_0=g, q_1=o, q_2=o, q_3=o, q_4=h, q_5=o, q_6=o, q_7=o, q_8=m, q_9=m, with acoustic observations x_0 … x_9]
Markov model backbone composed of phones
ASR: What is the best assignment to q_0 … q_9 given x_0 … x_9?
Hidden Markov Models & DBNs
[Figure: DBN representation shown alongside Markov model representation]
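Parts of an ASR System
[Diagram: Feature Calculation, Acoustic Modeling, Pronunciation Modeling, and Language Modeling connected to a SEARCH component, with output "The cat chased the dog"]
Pronunciation Modeling (dictionary entries): cat: / dog: dog / mail: mAl / the: D&, DE / …
Language Modeling (word-pair entries): cat dog: / cat the: / the cat: / the dog: / the mail: / …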
Parts of an ASR System
– Feature Calculation: produces acoustics (x_t)
– Acoustic Modeling: maps acoustics to phones
– Pronunciation Modeling: maps phones to words (dictionary entries: cat: / dog: dog / mail: mAl / the: D&, DE / …)
– Language Modeling: strings words together (word-pair entries: cat dog: / cat the: / the cat: / the dog: / the mail: / …)
Feature calculation
[Figure: time-frequency plot; axes: Frequency, Time]
Find energy at each time step in each frequency channel
Feature calculation
[Figure: time-frequency plot; axes: Frequency, Time]
Take the inverse Discrete Fourier Transform to decorrelate frequencies
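The two feature-calculation steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not the front end of any particular recognizer: the frame length, hop size, Hamming window, log compression before the inverse transform, and the number of coefficients kept are all assumptions made for the example, and the function names are hypothetical.

```python
import numpy as np

def frame_signal(signal, frame_len=400, hop=160):
    """Slice a 1-D signal into overlapping frames (one row per time step)."""
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return signal[idx]

def log_spectrum(signal, frame_len=400, hop=160):
    """Log energy at each time step in each frequency channel."""
    frames = frame_signal(signal, frame_len, hop) * np.hamming(frame_len)
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small floor avoids log(0)

def cepstrum(log_spec, n_coeffs=13):
    """Inverse DFT of the log spectrum; keep the first few coefficients."""
    return np.fft.irfft(log_spec, axis=1)[:, :n_coeffs]

# Toy input (an assumption for the example): one second of a 440 Hz tone at 16 kHz.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
signal = np.sin(2 * np.pi * 440 * t)

log_spec = log_spectrum(signal)   # shape: (time steps, frequency channels)
feats = cepstrum(log_spec)        # shape: (time steps, n_coeffs)
```

Each row of `feats` is one feature vector x_t. Practical front ends usually add a perceptually motivated filterbank before this step; that is left out here to keep the sketch close to the slides.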
Feature calculation
[Figure: input signal and output sequence of feature vectors]
Robust Speech Recognition
Different schemes have been developed for dealing with noise and reverberation
– Additive noise: reduce effects of particular frequencies
– Convolutional noise: remove effects of linear filters (cepstral mean subtraction)
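Cepstral mean subtraction works because a fixed linear filter multiplies the spectrum, which turns into a constant added to every frame in the log-spectral and cepstral domains. Subtracting the per-utterance mean of each coefficient removes that constant. The sketch below uses random numbers as stand-in cepstra (an assumption for illustration); the function name is hypothetical.

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.

    cepstra: array of shape (n_frames, n_coeffs).
    """
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(0)
clean = rng.normal(size=(200, 13))          # stand-in for clean cepstra
channel_offset = rng.normal(size=(1, 13))   # fixed filter = constant cepstral offset
filtered = clean + channel_offset

normalized_clean = cepstral_mean_subtraction(clean)
normalized_filtered = cepstral_mean_subtraction(filtered)
```

After normalization the filtered and clean utterances yield the same features, so the channel no longer affects recognition.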
Now what?
[Figure: sequence of feature vectors … → "That you …" ???]
Machine Learning!
[Figure: sequence of feature vectors … → "That you …"]
Pattern recognition with HMMs
Hidden Markov Models (again!)
P(acoustics_t | state_t): acoustic model
P(state_{t+1} | state_t): pronunciation/language models
Acoustic Model
[Figure: feature vectors labeled with phones dh, a, a, t]
Assume that you can label each vector with a phonetic label
Collect all of the examples of a phone together and build a Gaussian model (or some other statistical model, e.g. neural networks)
[Figure: Gaussian N_a fitted to the vectors labeled a, giving P(X|state=a)]
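A minimal sketch of the Gaussian acoustic model described above: group the labeled feature vectors by phone, then fit one Gaussian per phone. The diagonal covariance, the toy two-dimensional data, and the function names are assumptions made for the example.

```python
import numpy as np

def fit_phone_gaussians(features, labels):
    """Fit a diagonal-covariance Gaussian (mean, variance) for each phone label."""
    labels = np.array(labels)
    models = {}
    for phone in sorted(set(labels.tolist())):
        x = features[labels == phone]
        models[phone] = (x.mean(axis=0), x.var(axis=0) + 1e-6)  # variance floor
    return models

def log_likelihood(x, mean, var):
    """log P(x | state) under a diagonal Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

# Toy labeled data (hypothetical): 50 two-dimensional vectors per phone.
rng = np.random.default_rng(0)
centers = {"dh": 0.0, "a": 3.0, "t": -3.0}
labels = [phone for phone in centers for _ in range(50)]
features = np.vstack([rng.normal(c, 0.5, size=(50, 2)) for c in centers.values()])

models = fit_phone_gaussians(features, labels)

def most_likely_phone(x):
    """Phone whose Gaussian gives the highest likelihood for vector x."""
    return max(models, key=lambda phone: log_likelihood(x, *models[phone]))
```

Here `log_likelihood(x, *models["a"])` plays the role of log P(X|state=a).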
Building up the Markov Model
Start with a model for each phone
Typically, we use 3 states per phone to give a minimum duration constraint, but ignore that here…
[Figure: phone state a with transition probabilities p and 1-p; copies of the state chained in sequence]
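To see why three states per phone impose a minimum duration, the sketch below builds a left-to-right transition matrix and tracks the probability of having left the phone after each frame. It assumes p is the self-loop probability and 1-p the probability of moving on (the figure does not say which is which), and the function names are hypothetical.

```python
import numpy as np

def left_to_right(n_states, p_stay):
    """Left-to-right transition matrix with one extra absorbing exit state."""
    A = np.zeros((n_states + 1, n_states + 1))
    for i in range(n_states):
        A[i, i] = p_stay            # assumed: p is the self-loop probability
        A[i, i + 1] = 1.0 - p_stay  # assumed: 1-p moves to the next state
    A[-1, -1] = 1.0
    return A

def exit_prob_by_frame(A, n_frames):
    """Probability that the exit state has been reached after 1..n_frames frames."""
    dist = np.zeros(A.shape[0])
    dist[0] = 1.0                   # always start in the first state
    probs = []
    for _ in range(n_frames):
        dist = dist @ A
        probs.append(float(dist[-1]))
    return probs

one_state = exit_prob_by_frame(left_to_right(1, 0.5), 3)
three_state = exit_prob_by_frame(left_to_right(3, 0.5), 3)
```

With one state the phone can end after a single frame; with three states the exit probability stays at zero until the third frame.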
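Building up the Markov Model
Pronunciation model gives connections between phones and words
[Figure: phone chain dh → a → t with transition probabilities p_dh, 1-p_dh, p_a, 1-p_a, p_t, 1-p_t]
Multiple pronunciations:
[Figure: pronunciation network with alternative phone paths; phone labels ow, t, m, ah, ow, ey, ah, t]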
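Building up the Markov Model
Language model gives connections between words (e.g., bigram grammar)
[Figure: word model dh a t ("that") branching to h iy ("he") with probability p(he|that) and to y uw ("you") with probability p(you|that)]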
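The bigram probabilities that connect word models, such as p(he|that) and p(you|that), can be estimated by counting word pairs. The sketch below uses unsmoothed relative frequencies on a made-up four-sentence corpus; both the corpus and the function name are assumptions for illustration.

```python
from collections import Counter

def bigram_probs(sentences):
    """Relative-frequency estimate of p(next | previous) from whitespace-split text."""
    history_counts, pair_counts = Counter(), Counter()
    for sentence in sentences:
        words = sentence.split()
        history_counts.update(words[:-1])          # words that have a successor
        pair_counts.update(zip(words, words[1:]))  # adjacent word pairs
    return {(prev, nxt): count / history_counts[prev]
            for (prev, nxt), count in pair_counts.items()}

# Hypothetical toy corpus.
corpus = ["that he should", "that you should", "that you know", "that you said"]
probs = bigram_probs(corpus)
p_he_given_that = probs[("that", "he")]
p_you_given_that = probs[("that", "you")]
```

Real language models smooth these estimates so that unseen word pairs do not get probability zero.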
ASR as Bayesian Inference
[Figure: states q_1^{w_1}, q_2^{w_1}, q_3^{w_1} with observations x_1, x_2, x_3; word network th a t branching to h iy with p(he|that) and to y uw with p(you|that); further phone labels h iy, sh uh d]
argmax_W P(W|X)
= argmax_W P(X|W) P(W) / P(X)
= argmax_W P(X|W) P(W)
= argmax_W Σ_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X,Q|W) P(W)
≈ argmax_W max_Q P(X|Q) P(Q|W) P(W)
ASR Probability Models
Three probability models
– P(X|Q): acoustic model
– P(Q|W): duration/transition/pronunciation model
– P(W): language model
Language/pronunciation models inferred from prior knowledge
Other models learned from data (how?)
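Parts of an ASR System
[Diagram: Feature Calculation, Acoustic Modeling, Pronunciation Modeling, and Language Modeling connected to a SEARCH component, with output "The cat chased the dog"]
– Acoustic Modeling: P(X|Q)
– Pronunciation Modeling: P(Q|W) (dictionary entries: cat: / dog: dog / mail: mAl / the: D&, DE / …)
– Language Modeling: P(W) (word-pair entries: cat dog: / cat the: / the cat: / the dog: / the mail: / …)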
EM for ASR: The Forward-Backward Algorithm
Determine “state occupancy” probabilities
– I.e., assign each data vector to a state
Calculate new transition probabilities, new means & standard deviations (emission probabilities) using assignments
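A compact sketch of the two steps above: the forward-backward pass yields state occupancy probabilities (a soft assignment of each vector to each state), and the means are then re-estimated as occupancy-weighted averages. The two-state left-to-right model, unit-variance Gaussians, and four-frame data are toy assumptions. No scaling is applied, so this version would underflow on long utterances. Function names are hypothetical.

```python
import numpy as np

def state_occupancy(pi, A, B):
    """Forward-backward pass.

    pi: (S,) initial state probabilities; A: (S, S) transition matrix;
    B: (T, S) with B[t, s] = P(x_t | state s).
    Returns gamma of shape (T, S): P(state at time t = s | all observations).
    """
    T, S = B.shape
    alpha = np.zeros((T, S))
    beta = np.ones((T, S))
    alpha[0] = pi * B[0]
    for t in range(1, T):                      # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[t]
    for t in range(T - 2, -1, -1):             # backward recursion
        beta[t] = A @ (B[t + 1] * beta[t + 1])
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

def reestimate_means(gamma, X):
    """Occupancy-weighted mean of the data for each state; X has shape (T, D)."""
    return (gamma.T @ X) / gamma.sum(axis=0)[:, None]

# Toy setup (assumed): 2 states, 1-D observations, unit-variance Gaussians.
X = np.array([[0.0], [0.0], [5.0], [5.0]])
means = np.array([1.0, 4.0])
pi = np.array([1.0, 0.0])
A = np.array([[0.5, 0.5],
              [0.0, 1.0]])
B = np.exp(-0.5 * (X - means[None, :]) ** 2) / np.sqrt(2 * np.pi)

gamma = state_occupancy(pi, A, B)
new_means = reestimate_means(gamma, X)
```

Repeating the two steps with the updated means (and likewise updated variances and transitions) gives the EM training loop.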
Search
When trying to find W* = argmax_W P(W|X), need to look at (in theory)
– All possible word sequences W
– All possible segmentations/alignments of W and X
Generally, this is done by searching the space of W
– Viterbi search: dynamic programming approach that looks for the most likely path
– A* search: alternative method that keeps a stack of hypotheses around
If |W| is large, pruning becomes important
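A minimal sketch of Viterbi search over HMM states, working in the log domain. It finds the single most likely state path for a toy two-state left-to-right model; the model, the data, and the function name are assumptions for illustration, and no pruning is applied.

```python
import numpy as np

def viterbi(log_pi, log_A, log_B):
    """Most likely state path.

    log_pi: (S,) log initial probabilities; log_A: (S, S) log transitions;
    log_B: (T, S) with log_B[t, s] = log P(x_t | state s).
    Returns (path as a list of state indices, log score of that path).
    """
    T, S = log_B.shape
    score = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A    # cand[i, j]: best path ending in i, then i -> j
        back[t] = cand.argmax(axis=0)    # best predecessor for each state j
        score = cand.max(axis=0) + log_B[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):        # follow back-pointers from the end
        path.append(int(back[t, path[-1]]))
    return path[::-1], float(score.max())

# Toy setup (assumed): 2 states with means 1 and 4, observations 0, 0, 5, 5.
x = np.array([0.0, 0.0, 5.0, 5.0])
means = np.array([1.0, 4.0])
log_B = -0.5 * (x[:, None] - means[None, :]) ** 2   # Gaussian log-likelihood up to a constant
with np.errstate(divide="ignore"):                  # log(0) = -inf marks forbidden moves
    log_pi = np.log(np.array([1.0, 0.0]))
    log_A = np.log(np.array([[0.5, 0.5],
                             [0.0, 1.0]]))

best_path, best_score = viterbi(log_pi, log_A, log_B)
```

A pruned (beam) variant would drop states whose `score` falls too far below the current best at each time step.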
How to train an ASR system
Have a speech corpus at hand
– Should have word (and preferably phone) transcriptions
– Divide into training, development, and test sets
Develop models of prior knowledge
– Pronunciation dictionary
– Grammar
Train acoustic models
– Possibly realigning corpus phonetically
How to train an ASR system
Test on your development data (baseline)
** Think real hard
Figure out some neat new modification
Retrain system component
Test on your development data
Lather, rinse, repeat **
Then, at the end of the project, test on the test data.
Judging the quality of a system
Usually, ASR performance is judged by the word error rate
ErrorRate = 100 * (Subs + Ins + Dels) / Nwords
REF: I WANT TO  GO HOME ***
REC: * WANT TWO GO HOME NOW
SC:  D C    S   C  C    I
100 * (1S + 1I + 1D) / 5 = 60%
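The word error rate can be computed with a standard edit-distance dynamic program over words. The sketch below counts the minimum number of substitutions, insertions, and deletions and divides by the number of reference words; the function name is hypothetical, and it returns only the total rather than the separate S/I/D counts.

```python
def word_error_rate(ref, hyp):
    """100 * (Subs + Ins + Dels) / Nwords, using a minimum-edit alignment."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j recognized words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # delete all i reference words
    for j in range(len(h) + 1):
        d[0][j] = j                      # insert all j recognized words
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            substitute = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            delete = d[i - 1][j] + 1
            insert = d[i][j - 1] + 1
            d[i][j] = min(substitute, delete, insert)
    return 100.0 * d[len(r)][len(h)] / len(r)

wer = word_error_rate("I WANT TO GO HOME", "WANT TWO GO HOME NOW")
```

For the example on the slide this gives 60.0. Because insertions count as errors, the rate can exceed 100%.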
Judging the quality of a system
Usually, ASR performance is judged by the word error rate
This assumes that all errors are equal
– Also, a bit of a mismatch between optimization criterion and error measurement
Other (task-specific) measures sometimes used
– Task completion
– Concept error rate