Automatic Speech Recognition

1 Automatic Speech Recognition

2 The Malevolent Hal

3 We’re Not Quite There Yet (and Lucky for Us)
But what is an error?

4 The Model
Speech is variable. An acoustic utterance will never match any model exactly. Conclusion: speech recognition is a special case of Bayesian inference.

5 Goal Of a Probabilistic Noisy Channel Architecture
What is the most likely sequence of words $W$, out of all word sequences in a language $L$, given some acoustic input $O$?
$O$ is a sequence of observations, $O = o_1, o_2, o_3, \ldots, o_t$, where each $o_i$ is a floating-point value representing ~10 ms worth of energy in that slice of $O$.
$W$ is a sequence of words, $W = w_1, w_2, w_3, \ldots, w_n$, where each $w_i$ is a word in $L$.

6 ASR as a Conditional Probability
$\hat{W} = \arg\max_{W \in L} P(W \mid O)$
To compute this, we have to invoke you-know-whom.

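7 Bayes’ Rule
Lets us transform $P(W \mid O)$ into $\dfrac{P(O \mid W)\,P(W)}{P(O)}$.
In fact, because $P(O)$ is the same for every candidate $W$, it can be dropped from the maximization: $\hat{W} = \arg\max_{W \in L} P(O \mid W)\,P(W)$.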
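
The decision rule that Bayes' rule leads to, picking the word sequence W that maximizes P(O | W) P(W), can be sketched with a toy example. The two candidate transcriptions and all probabilities below are made-up illustrative values, not figures from the slides, and the `decode` helper is a hypothetical name. Scores are added in log space, which is a common way to avoid numerical underflow.

```python
import math

# Hypothetical toy scores; none of these numbers come from the slides.
# log P(O | W): acoustic likelihood of the observations under each candidate.
acoustic_loglik = {
    "recognize speech": math.log(0.0010),
    "wreck a nice beach": math.log(0.0012),
}
# log P(W): language-model prior for each candidate word sequence.
lm_logprior = {
    "recognize speech": math.log(0.010),
    "wreck a nice beach": math.log(0.001),
}

def decode(candidates):
    """Return argmax over W of log P(O|W) + log P(W).

    P(O) is identical for every candidate, so it is left out.
    """
    return max(candidates, key=lambda w: acoustic_loglik[w] + lm_logprior[w])

best = decode(list(acoustic_loglik))
acoustic_only = max(acoustic_loglik, key=acoustic_loglik.get)
print(best)           # choice using likelihood and prior together
print(acoustic_only)  # choice using the acoustic likelihood alone
```

With these assumed numbers the acoustic score alone slightly prefers one candidate, while the prior tips the combined decision toward the other, which is the point of weighting the likelihood by P(W).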

8 New Terms
Feature Extraction
The acoustic waveform is sampled into frames (10, 15, 20 ms)
Each frame is transformed into a vector of 39 features
Acoustic Model or Phone Recognition (likelihoods)
Computes the likelihood of the observed features given linguistic units (words, phones, triphones): $p(O \mid W)$
Output is a sequence of probabilities, one set for each time frame, containing the likelihoods that each linguistic unit generated the acoustic feature vector
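
The framing step of feature extraction can be illustrated with a minimal sketch. It cuts a signal into non-overlapping 10 ms frames and computes one log-energy value per frame; it does not compute the 39-dimensional feature vector mentioned above. The 16 kHz sample rate, the synthetic tone used as input, and the function name `frame_log_energy` are assumptions made for illustration.

```python
import math

def frame_log_energy(samples, sample_rate=16000, frame_ms=10):
    """Split samples into non-overlapping frames; return log energy per frame."""
    frame_len = sample_rate * frame_ms // 1000  # samples per frame
    energies = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame)
        energies.append(math.log(energy + 1e-10))  # small floor avoids log(0)
    return energies

# One second of a synthetic 440 Hz tone (assumed input, not real speech).
sr = 16000
signal = [math.sin(2 * math.pi * 440 * n / sr) for n in range(sr)]
features = frame_log_energy(signal, sr, frame_ms=10)
print(len(features))  # number of 10 ms frames in one second of audio
```

Real front ends typically use overlapping, windowed frames; the non-overlapping version here is chosen only to keep the arithmetic easy to follow.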

9 New Wine in Old Bottles
Language Modeling (priors)
Bigrams/trigrams/quadrigrams of words in a lexicon
Lexicon
A list of words with a pronunciation for each word expressed as a phone sequence
Decoding (Viterbi)
Combines the acoustic model, language model and lexicon to produce the most probable sequence of words
Training
Filling in the HMM lattice using the Baum-Welch (forward-backward) algorithm
Observations: acoustic signals, information about the waveform at that point in time
Hidden states: phones/triphones
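
As a rough illustration of Viterbi decoding over hidden phone states, here is a minimal sketch on a two-state toy HMM. It is not a full decoder: it leaves out the lexicon and language model, and it uses three discrete observation symbols in place of acoustic feature vectors. All state names, symbols, and probabilities are hypothetical.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most probable hidden-state path for the observations."""
    # best[t][s]: log-probability of the best path ending in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + log_trans[p][s]) for p in states),
                key=lambda item: item[1],
            )
            best[t][s] = score + log_emit[s][observations[t]]
            back[t][s] = prev
    # Trace back from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

def logs(d):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in d.items()}

# Hypothetical toy model: two phone states and three observation symbols.
states = ["ay", "k"]
log_start = logs({"ay": 0.9, "k": 0.1})
log_trans = {"ay": logs({"ay": 0.7, "k": 0.3}), "k": logs({"ay": 0.1, "k": 0.9})}
log_emit = {
    "ay": logs({"voiced": 0.8, "silence": 0.1, "burst": 0.1}),
    "k": logs({"voiced": 0.1, "silence": 0.5, "burst": 0.4}),
}

path = viterbi(["voiced", "voiced", "silence", "burst"],
               states, log_start, log_trans, log_emit)
print(path)
```

Baum-Welch training would estimate the transition and emission tables from data; here they are simply written down by hand.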

10 The architecture

11 It’s More Complex
Phones can extend beyond 1 s. At 10 ms per frame, that’s 100 frames, but they’re not acoustically identical.

12 [ay k] ~0.45 s
Notice that F2 rises and F1 falls, and note the difference between the silence and release parts of [k].

13 Conclusion
Phones are non-homogeneous over time
Modeled with beginning, middle, and end subphones, e.g. six ([s ih k s])
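
The beginning/middle/end idea can be made concrete with a small sketch that expands a word's phone sequence into a left-to-right list of subphone states. The one-entry lexicon, the `_beg`/`_mid`/`_end` naming scheme, and the function name are assumptions made for illustration.

```python
# Hypothetical one-entry lexicon using the slide's example pronunciation.
LEXICON = {"six": ["s", "ih", "k", "s"]}
SUBPHONES = ("beg", "mid", "end")  # assumed labels for the three subphone states

def expand_to_subphones(word, lexicon=LEXICON):
    """Map a word to its left-to-right sequence of subphone state names."""
    return [f"{phone}_{part}" for phone in lexicon[word] for part in SUBPHONES]

subphone_states = expand_to_subphones("six")
print(subphone_states)
```

Each of the four phones contributes three states, so the word is modeled by twelve HMM states rather than four.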

14 The Formal Model
$B = b_i(o_t) = p(o_t \mid q_i)$: the probability of a feature vector $o_t$ being generated by subphone state $i$
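
The slide does not say which distribution is used for b_i(o_t). One common choice, assumed here purely for illustration, is a Gaussian with diagonal covariance per state. The sketch below evaluates log b_i(o_t) for two hypothetical subphone states with made-up means and variances, using 2-dimensional vectors in place of 39-dimensional ones.

```python
import math

def log_gaussian_emission(o_t, mean, var):
    """log b_i(o_t) under a diagonal-covariance Gaussian (assumed form)."""
    log_p = 0.0
    for x, m, v in zip(o_t, mean, var):
        log_p += -0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
    return log_p

# Hypothetical parameters for two subphone states.
state_params = {
    "s_beg": {"mean": [0.0, 0.0], "var": [1.0, 1.0]},
    "ih_mid": {"mean": [3.0, -2.0], "var": [1.0, 1.0]},
}

o_t = [2.8, -1.9]  # one made-up feature vector
scores = {name: log_gaussian_emission(o_t, p["mean"], p["var"])
          for name, p in state_params.items()}
most_likely_state = max(scores, key=scores.get)
print(most_likely_state)
```

These per-state log-likelihoods are the quantities a Viterbi decoder would consume as emission scores at each time frame.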
