Automatic Speech Recognition
The Malevolent HAL
We're Not Quite There Yet (and Lucky for Us). But what is an error?
The Model: Speech is variable; an acoustic utterance will never match any model exactly. Conclusion: speech recognition is a special case of Bayesian inference.
Goal of a Probabilistic Noisy Channel Architecture
What is the most likely sequence of words W, out of all word sequences in the language L, given some acoustic input O?
O is a sequence of observations, O = o1, o2, o3, ..., ot, where each oi is a floating-point value representing ~10 ms worth of energy of that slice of O.
W = w1, w2, w3, ..., wn, where each wi is a word in L.
ASR as a Conditional Probability
We want the W that maximizes P(W|O). To compute it, we have to invoke you-know-who.
Bayes' Rule lets us transform
Ŵ = argmax over W in L of P(W|O)
into
Ŵ = argmax over W in L of P(O|W) P(W) / P(O)
In fact, since P(O) is the same for every candidate W, it suffices to compute
Ŵ = argmax over W in L of P(O|W) P(W)
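To make the argmax concrete, here is a minimal sketch of noisy-channel decoding over a toy candidate list, assuming the acoustic log-likelihoods log P(O|W) and language-model log-priors log P(W) have already been computed elsewhere; the candidate word sequences and their scores are invented for illustration.

```python
import math

# Hypothetical candidates with illustrative scores:
# log P(O|W) from an acoustic model, log P(W) from a language model.
candidates = {
    ("recognize", "speech"):         {"log_p_O_given_W": -120.0, "log_p_W": -7.2},
    ("wreck", "a", "nice", "beach"): {"log_p_O_given_W": -118.5, "log_p_W": -11.9},
}

def noisy_channel_decode(candidates):
    """Pick the W maximizing P(O|W) * P(W); P(O) is the same for all W, so it is dropped.
    Scores are combined in log space to avoid underflow."""
    best_w, best_score = None, -math.inf
    for w, scores in candidates.items():
        score = scores["log_p_O_given_W"] + scores["log_p_W"]  # log P(O|W) + log P(W)
        if score > best_score:
            best_w, best_score = w, score
    return best_w

print(noisy_channel_decode(candidates))   # -> ('recognize', 'speech')
```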
New Terms
Feature Extraction: the acoustic waveform is sampled into frames (10, 15, or 20 ms) and transformed into a vector of 39 features.
Acoustic Model, or Phone Recognition (likelihoods): computes the likelihood of the observed features given linguistic units (words, phones, triphones), p(O|W). The output is a sequence of probabilities, one for each time frame, containing the likelihoods that each linguistic unit generated the acoustic feature vector.
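A rough sketch of just the framing step, using a synthetic waveform and two toy per-frame features (log energy and zero-crossing rate) in place of a real 39-feature front end; the 20 ms window and 10 ms step are simply one common parameter choice, not something fixed by the model.

```python
import numpy as np

def frame_signal(signal, sample_rate, frame_ms=20, step_ms=10):
    """Slice a 1-D waveform into overlapping frames (20 ms window, 10 ms step)."""
    frame_len = int(sample_rate * frame_ms / 1000)
    step_len = int(sample_rate * step_ms / 1000)
    n_frames = 1 + max(0, (len(signal) - frame_len) // step_len)
    return np.stack([signal[i * step_len : i * step_len + frame_len]
                     for i in range(n_frames)])

def toy_features(frames):
    """Stand-in for real feature extraction: log energy and zero-crossing rate
    per frame. A real front end would emit a 39-dimensional vector instead."""
    energy = np.log(np.sum(frames ** 2, axis=1) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(frames), axis=1)) > 0, axis=1)
    return np.column_stack([energy, zcr])

sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
waveform = np.sin(2 * np.pi * 440 * t)      # 1 s synthetic tone
frames = frame_signal(waveform, sr)         # ~100 frames at a 10 ms step
print(frames.shape, toy_features(frames).shape)
```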
New Wine in Old Bottles
Language Modeling (priors): bigrams/trigrams/quadrigrams of words in a lexicon.
Lexicon: a list of words, with a pronunciation for each word expressed as a phone sequence.
Decoding (Viterbi): combines the acoustic model, language model, and lexicon to produce the most probable sequence of words (see the sketch below).
Training: filling in the HMM lattice using the Baum-Welch (forward-backward) algorithm.
Observations: acoustic signals, i.e. information about the waveform at that point in time.
Hidden states: phones/triphones.
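As a sketch of the decoding step, here is a bare-bones Viterbi search over hidden phone states with a toy HMM; the phone set, transition, emission, and start probabilities are invented for illustration, and a real decoder would also fold in the lexicon and the n-gram language model.

```python
import math

# Toy HMM: hidden states are phones, observations are quantized acoustic symbols.
states = ["s", "ih", "k"]
trans = {                       # P(next_state | state), illustrative numbers
    "s":  {"s": 0.5, "ih": 0.5, "k": 0.0},
    "ih": {"s": 0.0, "ih": 0.5, "k": 0.5},
    "k":  {"s": 0.0, "ih": 0.0, "k": 1.0},
}
emit = {                        # P(observation | state), illustrative numbers
    "s":  {"hiss": 0.8, "voiced": 0.1, "burst": 0.1},
    "ih": {"hiss": 0.1, "voiced": 0.8, "burst": 0.1},
    "k":  {"hiss": 0.2, "voiced": 0.1, "burst": 0.7},
}
start = {"s": 1.0, "ih": 0.0, "k": 0.0}

def viterbi(obs):
    """Return the most probable hidden phone sequence for the observations."""
    V = [{s: (math.log(start[s] + 1e-12) + math.log(emit[s][obs[0]] + 1e-12), [s])
          for s in states}]
    for o in obs[1:]:
        row = {}
        for s in states:
            best_prev = max(states,
                            key=lambda p: V[-1][p][0] + math.log(trans[p][s] + 1e-12))
            score = (V[-1][best_prev][0]
                     + math.log(trans[best_prev][s] + 1e-12)
                     + math.log(emit[s][o] + 1e-12))
            row[s] = (score, V[-1][best_prev][1] + [s])
        V.append(row)
    return max(V[-1].values(), key=lambda t: t[0])[1]

print(viterbi(["hiss", "hiss", "voiced", "voiced", "burst"]))
# -> ['s', 's', 'ih', 'ih', 'k']
```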
The architecture
It's More Complex: Phones can extend beyond 1 s; that's 100 frames. But they're not acoustically identical over that span.
[ay k], ~0.45 s. Notice that F2 rises and F1 falls, and the difference between the silence and release parts of [k].
Conclusion: Phones are non-homogeneous over time. They are modeled with beginning, middle, and end subphone states, e.g. six ([s ih k s]).
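A tiny sketch of that split, expanding the phone string for six into beginning/middle/end subphone states; the `_b/_m/_e` naming is just an illustrative convention.

```python
def expand_to_subphones(phones):
    """Replace each phone with three HMM states: beginning, middle, end."""
    return [f"{p}_{part}" for p in phones for part in ("b", "m", "e")]

print(expand_to_subphones(["s", "ih", "k", "s"]))
# ['s_b', 's_m', 's_e', 'ih_b', 'ih_m', 'ih_e', 'k_b', 'k_m', 'k_e', 's_b', 's_m', 's_e']
```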
The Formal Model: B = bi(ot) = p(ot|qi), the probability of a feature vector ot being generated by subphone state i.
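One common way to model bi(ot) is a Gaussian (or mixture of Gaussians) per subphone state; the sketch below uses a single diagonal-covariance Gaussian, and the means, variances, and observation vector are invented for illustration.

```python
import numpy as np

def gaussian_log_likelihood(o_t, mean, var):
    """log b_i(o_t) = log p(o_t | q_i) for a diagonal-covariance Gaussian state."""
    return float(-0.5 * np.sum(np.log(2 * np.pi * var) + (o_t - mean) ** 2 / var))

# Illustrative 39-dimensional state parameters and one observed feature vector.
rng = np.random.default_rng(0)
mean_i = rng.normal(size=39)      # state i's mean vector
var_i = np.full(39, 0.5)          # state i's (diagonal) variances
o_t = rng.normal(size=39)         # one frame's feature vector

print(gaussian_log_likelihood(o_t, mean_i, var_i))
```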