Automatic Speech Recognition


1 Automatic Speech Recognition

2 The Malevolent HAL

3 We’re Not Quite There Yet (and Lucky for Us)
But what is an error?

4 The Model
Speech is variable. An acoustic utterance will never match any model exactly.
Conclusion: speech recognition is a special case of Bayesian inference.

5 Goal of a Probabilistic Noisy Channel Architecture
What is the most likely sequence of words W, out of all word sequences in a language L, given some acoustic input O?
O is a sequence of observations O = o1, o2, o3, ..., ot, where each oi is a floating-point value representing ~10 ms worth of energy in that slice of O.
W is a sequence of words W = w1, w2, w3, ..., wn, where each wi is a word in L.

6 ASR as a Conditional Probability
We want the W that maximizes P(W|O), and to compute it we have to invoke you-know-whom.

7 Bayes’ Rule
Lets us transform
Ŵ = argmax over W in L of P(W|O)
to
Ŵ = argmax over W in L of P(O|W) P(W) / P(O)
and in fact, since P(O) is the same for every candidate W,
Ŵ = argmax over W in L of P(O|W) P(W)
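To make the decision rule concrete, here is a minimal Python sketch of picking the word sequence that maximizes P(O|W) · P(W), worked in log space. The candidate sentences and all of the scores are made-up toy values, not the output of any real acoustic or language model.

```python
# Toy illustration of the noisy-channel decision rule:
# W_hat = argmax_W P(O|W) * P(W), done in log space to avoid underflow.
# All candidates and scores below are invented for illustration only.

# Hypothetical acoustic-model log-likelihoods log P(O|W)
acoustic_loglik = {
    "recognize speech": -120.5,
    "wreck a nice beach": -118.9,   # slightly better acoustic match
    "recognise peach": -125.0,
}

# Hypothetical language-model log-priors log P(W)
lm_logprob = {
    "recognize speech": -4.2,
    "wreck a nice beach": -9.7,     # much less likely a priori
    "recognise peach": -8.1,
}

def decode(candidates):
    """Return the word sequence maximizing log P(O|W) + log P(W)."""
    return max(candidates, key=lambda w: acoustic_loglik[w] + lm_logprob[w])

print(decode(acoustic_loglik))      # -> "recognize speech"
```

Note how the language-model prior overrides the slightly better acoustic score of the implausible sentence, which is exactly why the prior term matters.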

8 New Terms
Feature Extraction
The acoustic waveform is sampled into frames (10, 15, or 20 ms) and transformed into a vector of 39 features.
Acoustic Model or Phone Recognition (likelihoods)
Computes the likelihood of the observed features given linguistic units (words, phones, triphones): p(O|W).
The output is a sequence of probabilities, one for each time frame, containing the likelihoods that each linguistic unit generated the acoustic feature vector.
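One common way to get the 39-dimensional feature vectors is 13 MFCCs plus their deltas and delta-deltas. Here is a minimal sketch using the librosa library, assuming 16 kHz audio, a 25 ms analysis window, and a 10 ms frame shift; the filename is a placeholder, and none of these choices come from the slides themselves.

```python
# Minimal sketch: 39-dimensional features = 13 MFCCs + deltas + delta-deltas.
import numpy as np
import librosa

# "utterance.wav" is a placeholder; any mono speech file would do.
y, sr = librosa.load("utterance.wav", sr=16000)

mfcc = librosa.feature.mfcc(
    y=y, sr=sr, n_mfcc=13,
    n_fft=int(0.025 * sr),        # 25 ms analysis window
    hop_length=int(0.010 * sr),   # 10 ms frame shift
)
delta = librosa.feature.delta(mfcc)             # first derivatives
delta2 = librosa.feature.delta(mfcc, order=2)   # second derivatives

features = np.vstack([mfcc, delta, delta2])     # shape: (39, num_frames)
print(features.shape)
```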

9 New Wine in Old Bottles
Language Modeling (priors)
Bigrams/trigrams/quadrigrams of words in a lexicon.
Lexicon
A list of words with a pronunciation for each word, expressed as a phone sequence.
Decoding (Viterbi)
Combines the acoustic model, language model, and lexicon to produce the most probable sequence of words.
Training
Filling in the HMM lattice using the Baum-Welch (forward-backward) algorithm.
Observations: acoustic signals, information about the waveform at that point in time.
Hidden states: phones/triphones.
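To make the decoding step concrete, here is a minimal log-domain Viterbi sketch over a tiny HMM. The states, initial and transition probabilities, and per-frame emission scores are toy values invented for illustration, not a trained model; estimating such parameters from data is what Baum-Welch does.

```python
# Toy log-domain Viterbi over a 3-state HMM with 5 observation frames.
import numpy as np

states = ["s", "ih", "k"]                  # hidden states (phones)
log_init = np.log([0.8, 0.1, 0.1])         # initial state probabilities
log_trans = np.log([                       # log_trans[i][j] = log P(state_j | state_i)
    [0.6, 0.3, 0.1],
    [0.1, 0.6, 0.3],
    [0.1, 0.1, 0.8],
])
log_emit = np.log(np.array([               # log P(o_t | state_i), toy emission scores
    [0.7, 0.6, 0.2, 0.1, 0.1],
    [0.2, 0.3, 0.7, 0.6, 0.2],
    [0.1, 0.1, 0.1, 0.3, 0.7],
]))

T = log_emit.shape[1]
viterbi = np.full((len(states), T), -np.inf)
backptr = np.zeros((len(states), T), dtype=int)
viterbi[:, 0] = log_init + log_emit[:, 0]

for t in range(1, T):
    for j in range(len(states)):
        # Best previous state for reaching state j at time t
        scores = viterbi[:, t - 1] + log_trans[:, j]
        backptr[j, t] = int(np.argmax(scores))
        viterbi[j, t] = scores[backptr[j, t]] + log_emit[j, t]

# Trace back the most probable state sequence
path = [int(np.argmax(viterbi[:, T - 1]))]
for t in range(T - 1, 0, -1):
    path.append(backptr[path[-1], t])
path.reverse()
print([states[i] for i in path])
```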

10 The Architecture

11 It’s More Complex
Phones can extend beyond 1 s; that’s 100 frames.
But they’re not acoustically identical.

12 [ay k] ~0.45 s
Notice that F2 rises and F1 falls, and note the difference between the silence and release parts of [k].

13 Conclusion
Phones are non-homogeneous over time.
They are modeled with beginning, middle, and end subphone states, e.g., "six" ([s ih k s]).
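As a small illustrative sketch, a phone sequence can be expanded into beginning/middle/end subphone states like this; the state-naming scheme (_b, _m, _e suffixes) is made up for illustration.

```python
# Expand each phone into three subphone states: beginning, middle, end.
def expand_to_subphones(phones):
    """Return 3-state (begin, mid, end) subphone names for a phone sequence."""
    return [f"{p}_{part}" for p in phones for part in ("b", "m", "e")]

print(expand_to_subphones(["s", "ih", "k", "s"]))
# ['s_b', 's_m', 's_e', 'ih_b', 'ih_m', 'ih_e', 'k_b', 'k_m', 'k_e', 's_b', 's_m', 's_e']
```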

14 The Formal Model
B = bi(ot) = p(ot|qi)
The probability of a feature vector ot being generated by subphone state i.
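As a concrete (and simplified) illustration, here is bi(ot) computed as the log-likelihood of a feature vector under a single diagonal-covariance Gaussian per state. Real systems use Gaussian mixtures or neural networks for this term, and the means, variances, and feature values below are toy numbers.

```python
# Simplified observation likelihood b_i(o_t) = p(o_t | q_i):
# a single diagonal-covariance Gaussian per subphone state, in log space.
import numpy as np

def log_b(o_t, mean, var):
    """Log-likelihood of feature vector o_t under a diagonal Gaussian for state q_i."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (o_t - mean) ** 2 / var)

o_t = np.array([0.3, -1.2, 0.8])          # one toy 3-dimensional feature vector
state_mean = np.array([0.0, -1.0, 1.0])   # state q_i's mean vector (toy)
state_var = np.array([1.0, 0.5, 2.0])     # state q_i's per-dimension variances (toy)

print(log_b(o_t, state_mean, state_var))
```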

