1
Automatic Speech Recognition
2
The Malevolent Hal
3
We’re Not Quite There Yet (and Lucky for Us)
But what is an error?
4
The Model
Speech is variable.
An acoustic utterance will never match any model exactly.
Conclusion: speech recognition is a special case of Bayesian inference.
5
Goal of a Probabilistic Noisy Channel Architecture
What is the most likely sequence of words W out of all word sequences in a language L, given some acoustic input O?
Where O is a sequence of observations, O = o1, o2, o3, ..., ot
Each oi is a floating-point value representing ~10 ms worth of energy of that slice of O
And W = w1, w2, w3, ..., wn
Each wi is a word in L
6
ASR as a Conditional Probability
Ŵ = argmax_{W ∈ L} P(W | O)
We have to invoke you-know-whom
7
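Bayes Rule
Lets us transform P(W | O)
To P(O | W) P(W) / P(O)
In fact, because P(O) is the same for every candidate W: Ŵ = argmax_{W ∈ L} P(O | W) P(W)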
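The noisy-channel decision rule can be illustrated with a minimal sketch: pick the word sequence W that maximizes P(O | W) P(W), working in log space so the product becomes a sum. The function name `best_hypothesis`, the candidate sentences, and all probability values below are invented for illustration and do not come from the slides.

```python
import math

def best_hypothesis(candidates):
    """Return the word sequence W maximizing log P(O|W) + log P(W).

    `candidates` maps each word sequence to a pair
    (log acoustic likelihood, log language-model prior).
    P(O) is omitted because it is identical for every candidate.
    """
    return max(candidates, key=lambda w: candidates[w][0] + candidates[w][1])

# Hypothetical scores, invented for this example only.
candidates = {
    "recognize speech": (math.log(0.0020), math.log(0.0100)),
    "wreck a nice beach": (math.log(0.0025), math.log(0.0001)),
}

best = best_hypothesis(candidates)
print(best)
```

In this toy case the second candidate fits the acoustics slightly better, but its much lower prior means the first candidate wins overall.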
8
New Terms
Feature Extraction
Acoustic waveform is sampled into frames (10, 15, 20 ms)
Each frame is transformed into a vector of 39 features
Acoustic Model or Phone Recognition (likelihoods)
Computes the likelihood of the observed features given linguistic units (words, phones, triphones): p(O|W)
Output is a sequence of probability vectors, one for each time frame
Each vector contains the likelihoods that each linguistic unit generated the acoustic feature vector
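As a rough sketch of the framing step only, the code below cuts a waveform into non-overlapping 10 ms frames and computes one log-energy value per frame. The 16 kHz sample rate, the synthetic sine-wave input, and the helper names `frame_signal` and `log_energy` are assumptions made for the example. The transformation into 39 features is not implemented here, and real front ends usually use overlapping, windowed frames.

```python
import math

def frame_signal(samples, sample_rate, frame_ms=10):
    """Split samples into consecutive, non-overlapping frames of frame_ms milliseconds."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len  # drop any incomplete final frame
    return [samples[i * frame_len:(i + 1) * frame_len] for i in range(n_frames)]

def log_energy(frame, floor=1e-10):
    """Log of the summed squared amplitude; the floor avoids log(0) on silence."""
    return math.log(sum(x * x for x in frame) + floor)

# Assumed input: one second of a 440 Hz sine wave sampled at 16 kHz.
sample_rate = 16000
samples = [math.sin(2 * math.pi * 440 * n / sample_rate) for n in range(sample_rate)]

frames = frame_signal(samples, sample_rate, frame_ms=10)
energies = [log_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

One second of audio at 10 ms per frame yields 100 frames of 160 samples each.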
9
New Wine in Old Bottles
Language Modeling (priors)
Bigrams/trigrams/quadrigrams of words in a lexicon
Lexicon
A list of words with a pronunciation for each word expressed as a phone sequence
Decoding (Viterbi)
Combines the acoustic model, language model and lexicon to produce the most probable sequence of words
Training
Filling in the HMM lattice using the Baum-Welch (forward-backward) algorithm
Observations: acoustic signals, information about the waveform at that point in time
Hidden states: phones/triphones
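To make the decoding step concrete, here is a minimal Viterbi sketch over a toy HMM whose hidden states are two phones. It is not a full decoder: it does not combine a language model or lexicon, and it uses discrete observation symbols instead of feature vectors. The states, symbols, and every probability are invented for illustration.

```python
import math

def lg(p):
    """Log probability, with log(0) represented as negative infinity."""
    return math.log(p) if p > 0 else float("-inf")

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (most probable state path, its log probability)."""
    # best[t][s]: log probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

# Hypothetical left-to-right model for [ay k]; all numbers are made up.
states = ["ay", "k"]
log_start = {"ay": lg(1.0), "k": lg(0.0)}
log_trans = {"ay": {"ay": lg(0.6), "k": lg(0.4)},
             "k": {"ay": lg(0.0), "k": lg(1.0)}}
log_emit = {"ay": {"vowel": lg(0.9), "silence": lg(0.05), "burst": lg(0.05)},
            "k": {"vowel": lg(0.1), "silence": lg(0.5), "burst": lg(0.4)}}

path, score = viterbi(["vowel", "vowel", "silence", "burst"],
                      states, log_start, log_trans, log_emit)
print(path)
```

Log probabilities are used so that long observation sequences do not underflow.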
10
The architecture
11
It’s More Complex
Phones can extend beyond 1 s
At 10 ms per frame, that’s 100 frames
But they’re not acoustically identical
12
[ay k], ~0.45 s
Notice that F2 rises and F1 falls, and note the difference between the silence and release parts of [k]
13
Conclusion
Phones are non-homogeneous over time
Each phone is modeled with beginning, middle, and end subphones
Example: six ([s ih k s])
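The beginning/middle/end idea can be sketched as expanding a lexicon pronunciation into a sequence of subphone states. The naming scheme (`s_b`, `s_m`, `s_e`) and the helper `expand_to_subphones` are hypothetical conventions chosen for this example.

```python
def expand_to_subphones(phones, parts=("b", "m", "e")):
    """Expand each phone into beginning, middle, and end subphone states."""
    return [f"{ph}_{part}" for ph in phones for part in parts]

# Lexicon entry taken from the slide: six -> [s ih k s]
lexicon = {"six": ["s", "ih", "k", "s"]}

states = expand_to_subphones(lexicon["six"])
print(states)
```

Four phones with three subphones each give twelve HMM states for the word.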
14
The Formal Model
B = b_i(o_t) = p(o_t | q_i)
The probability of a feature vector o_t being generated by subphone state i
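One possible way to compute b_i(o_t) is sketched below, assuming each subphone state is modeled by a single Gaussian with a diagonal covariance. That modeling choice is an assumption for illustration; the slide does not say which density is used. The state names, means, and variances are invented, and the vectors have 2 dimensions rather than 39 to keep the example short.

```python
import math

def log_gaussian_diag(o, mean, var):
    """Log density of feature vector o under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(o, mean, var)
    )

# Hypothetical subphone states, each with an assumed (mean, variance) pair.
state_params = {
    "ih_b": ([0.0, 0.0], [1.0, 1.0]),
    "ih_m": ([2.0, 2.0], [1.0, 1.0]),
}

o_t = [0.1, -0.2]  # one made-up feature vector
log_b = {state: log_gaussian_diag(o_t, mean, var)
         for state, (mean, var) in state_params.items()}
most_likely_state = max(log_b, key=log_b.get)
print(most_likely_state)
```

The state whose mean lies closest to the observed vector receives the highest likelihood.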