1
Automatic Speech Recognition (ASR): A Brief Overview
2
Radio Rex – 1920’s “ASR”: a toy dog that popped out of its house in response to acoustic energy around 500 Hz (roughly the vowel in “Rex”)
3
Statistical ASR
i_best = argmax_i P(M_i | X) = argmax_i P(X | M_i) P(M_i)
(1st term: acoustic model; 2nd term: language model)
P(X | M_i) ≈ P(X | Q_i)   [Viterbi approximation]
where Q_i is the best state sequence in M_i, approximated by a product of local likelihoods (Markov and conditional-independence assumptions).
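To make the argmax concrete, a minimal sketch (not from the slides): two hypothetical word sequences are scored by adding acoustic and language-model log-probabilities; all scores are made-up placeholders for real model outputs.

```python
import math

# Two hypothetical word-sequence hypotheses M_i with made-up scores; in a
# real system these come from the acoustic model P(X|M_i) and the
# language model P(M_i).
hypotheses = {
    "recognize speech": {"acoustic": -42.0, "lm": math.log(1e-4)},
    "wreck a nice beach": {"acoustic": -41.5, "lm": math.log(1e-7)},
}

def total_logprob(scores):
    # log P(X|M_i) + log P(M_i): the product becomes a sum in the log domain.
    return scores["acoustic"] + scores["lm"]

best = max(hypotheses, key=lambda m: total_logprob(hypotheses[m]))
print(best)  # "recognize speech": the LM term outweighs the small acoustic gap
```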
4
Automatic Speech Recognition
Speech Production/Collection → Pre-processing → Feature Extraction → Hypothesis Generation → Cost Estimation → Decoding
5
Simplified Model of Speech Production
A periodic source (vocal-fold vibration) or a random source (turbulence) supplies the fine spectral structure; filters (vocal tract, nasal tract, radiation) shape the spectral envelope.
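A toy numpy/scipy sketch of the source-filter idea: an impulse train (periodic source) or white noise (random source) passed through an all-pole filter standing in for the vocal tract. The filter coefficients are arbitrary illustrative values, not a fitted vocal-tract model.

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000                      # sample rate (Hz)
n = fs // 10                   # 100 ms of signal

# Periodic source: impulse train at 100 Hz (fine spectral structure).
periodic = np.zeros(n)
periodic[:: fs // 100] = 1.0

# Random source: white noise (turbulence).
random_src = np.random.randn(n)

# All-pole "vocal tract" filter; pole coefficients are illustrative only
# (they shape the spectral envelope).
a = [1.0, -1.3, 0.8]
voiced = lfilter([1.0], a, periodic)
unvoiced = lfilter([1.0], a, random_src)
```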
6
Pre-processing
Speech → Room Acoustics → Microphone → Linear Filtering → Sampling & Digitization
Issues: noise and reverberation, and their effect on modeling
7
Framewise Analysis of Speech
The waveform is cut into short overlapping frames; frame 1 yields feature vector X_1, frame 2 yields X_2, and so on.
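A minimal framing sketch (the conventional 25 ms window with a 10 ms hop at 16 kHz; the function name and defaults are ours, not the slides’):

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Slice a waveform into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n_frames = 1 + max(0, len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.random.randn(16000)       # 1 s of fake speech at 16 kHz
frames = frame_signal(x)
print(frames.shape)              # (98, 400): roughly one frame every 10 ms
```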
8
Feature Extraction
Spectral analysis → auditory model / orthogonalization (cepstrum)
Issues: design for discrimination; insensitivity to scaling and simple distortions
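A bare-bones real cepstrum per frame, as one concrete instance of the “orthogonalize” step: window, magnitude spectrum, log, inverse transform. Keeping 13 coefficients is a conventional choice, not prescribed by the slide.

```python
import numpy as np

def real_cepstrum(frame, n_ceps=13):
    # Window, magnitude spectrum, log, inverse transform -> cepstrum.
    windowed = frame * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed)) + 1e-10   # avoid log(0)
    return np.fft.irfft(np.log(spectrum))[:n_ceps]     # keep low "quefrencies"

frame = np.random.randn(400)
print(real_cepstrum(frame).shape)   # (13,)
```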
9
Representations are Important
Fed the raw speech waveform, a network reaches 23% frame accuracy; fed PLP features, the same network reaches 70%.
10
Mel Frequency Scale
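The slide’s figure is lost, but the mel mapping itself is standard; one common variant:

```python
import math

def hz_to_mel(f):
    # Common formula: roughly linear below ~1 kHz, logarithmic above.
    return 2595.0 * math.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

print(round(hz_to_mel(1000)))   # ~1000 mel near 1 kHz, by construction
```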
11
Spectral vs Temporal Processing
Spectral processing: analysis across frequency within a frame (e.g., cepstral analysis).
Temporal processing: processing across time (e.g., mean removal).
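Temporal processing in its simplest form, cepstral mean subtraction along the time axis (the array shapes are illustrative):

```python
import numpy as np

ceps = np.random.randn(98, 13)                 # fake (frames x cepstra) matrix
cms = ceps - ceps.mean(axis=0, keepdims=True)  # subtract per-coefficient mean
# Removes slowly varying channel effects: a fixed linear filter shows up
# as a constant offset in the log/cepstral domain.
print(abs(cms.mean(axis=0)).max())             # ~0: each coefficient is zero-mean
```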
12
Hypothesis Generation
Issue: models of language and task. Compare “cat”, “dog”, “a dog is not a cat”, and the unlikely word order “a cat not is a dog”.
13
Cost Estimation
Distances, or -log probabilities from:
- discrete distributions
- Gaussians and Gaussian mixtures
- neural networks
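As one example of a “distance from a probability”, the negative log-likelihood of a diagonal-covariance Gaussian (a hypothetical single-Gaussian state model):

```python
import numpy as np

def neg_log_gaussian(x, mean, var):
    """-log N(x; mean, diag(var)): a distance-like cost from a Gaussian."""
    return 0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

x = np.zeros(13)
print(neg_log_gaussian(x, np.zeros(13), np.ones(13)))  # ~11.9 for 13 dims
```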
14
Nonlinear Time Normalization
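The slide title refers to dynamic time warping; a compact dynamic-programming sketch with Euclidean local costs (symmetric steps; real systems add slope constraints):

```python
import numpy as np

def dtw(a, b):
    """Dynamic time warping cost between feature sequences a (m,d) and b (n,d)."""
    m, n = len(a), len(b)
    D = np.full((m + 1, n + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[m, n]

print(dtw(np.random.randn(50, 13), np.random.randn(60, 13)))
```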
15
Decoding
16
Pronunciation Models
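A toy pronunciation lexicon as a plain mapping from words to phone-sequence variants; the phone symbols are ARPAbet-like stand-ins, not the deck’s own inventory:

```python
# Hypothetical lexicon: each word maps to one or more pronunciation
# variants (real systems use dictionaries such as CMUdict).
lexicon = {
    "zero":  [["z", "ih", "r", "ow"]],
    "two":   [["t", "uw"]],
    "three": [["th", "r", "iy"]],
    "the":   [["dh", "ah"], ["dh", "iy"]],  # reduced vs. full vowel
}
print(lexicon["the"])  # a word may carry several variants
```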
17
Language Models
Choose the most likely words: those with the largest product P(acoustics | words) P(words).
P(words) = P(words | history):
- bigram: history is the previous word
- trigram: history is the previous 2 words
- n-gram: history is the previous n-1 words
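A minimal maximum-likelihood bigram estimator over the toy phrases from the hypothesis-generation slide (no smoothing, so unseen bigrams get probability zero):

```python
from collections import Counter

corpus = "a dog is not a cat a cat is not a dog".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, history):
    # P(word | history) by maximum likelihood (no smoothing, for clarity).
    return bigrams[(history, word)] / unigrams[history]

print(p_bigram("dog", "a"))   # 2 of the 4 bigrams after "a" are "a dog" -> 0.5
```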
18
ASR System Architecture
Speech Signal → Signal Processing (cepstrum) → Acoustic Probability Estimator (HMM state likelihoods) → Decoder → Recognized Words (“zero”, “three”, “two”).
The decoder draws on a Pronunciation Lexicon and a Language Model. Example state probabilities: “z” = 0.81, “th” = 0.15, “t” = 0.03.
19
HMMs for Speech
- Math from Baum and others, 1966–1972
- Applied to speech by Baker in the original CMU Dragon system (1974)
- Developed by IBM (Baker, Jelinek, Bahl, Mercer, …), 1970–1993
- Extended by others in the mid-1980s
20
Hidden Markov model (graphical form)
States q_1, q_2, q_3, q_4 form a chain; each state q_t emits an observation x_t.
21
Hidden Markov Model (state machine form)
States q_1, q_2, q_3 with emission probabilities P(x | q_i) and transition probabilities such as P(q_2 | q_1) and P(q_3 | q_2).
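A compact Viterbi sketch over log-domain parameters; the variable names (log_A, log_B, log_pi) are ours:

```python
import numpy as np

def viterbi(log_A, log_B, log_pi):
    """Best state path. log_A[i, j] = log P(q_j | q_i) (transitions),
    log_B[t, j] = log P(x_t | q_j) (emissions), log_pi[j] = log P(q_j) at t=0."""
    T, S = log_B.shape
    delta = log_pi + log_B[0]            # best score ending in each state
    back = np.zeros((T, S), dtype=int)   # backpointers
    for t in range(1, T):
        scores = delta[:, None] + log_A  # (previous state, next state)
        back[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + log_B[t]
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Toy 3-state left-to-right model with random emission scores:
rng = np.random.default_rng(0)
log_A = np.log(np.array([[0.6, 0.4, 0.0],
                         [0.0, 0.6, 0.4],
                         [0.0, 0.0, 1.0]]) + 1e-12)
log_pi = np.log(np.array([1.0, 1e-12, 1e-12]))
print(viterbi(log_A, rng.normal(size=(10, 3)), log_pi))
```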
22
Markov model
For a two-state sequence: P(x_1, x_2, q_1, q_2) = P(q_1) P(x_1 | q_1) P(q_2 | q_1) P(x_2 | q_2)
23
HMM Training Steps
- Initialize estimators and models
- Estimate “hidden” variable probabilities
- Choose estimator parameters to maximize model likelihoods
- Assess and repeat the steps as necessary
This is a special case of Expectation-Maximization (EM); a toy sketch follows.
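A hedged sketch of the loop’s shape, using hard Viterbi alignments in place of the soft posteriors of full Baum-Welch (so this is “Viterbi training”, a common simplification, not the exact EM of the slide). It reuses the viterbi function from the previous sketch.

```python
import numpy as np
# Assumes the viterbi() function from the previous sketch is in scope.

def viterbi_train(frames, log_A, log_pi, means, variances, n_iter=5):
    """Hard-EM sketch: alternate Viterbi alignment (E-like step) and
    Gaussian re-estimation (M step). Full Baum-Welch would use soft
    state posteriors instead of a single best path."""
    for _ in range(n_iter):
        # E-like step: assign each frame a "hidden" state via the best path.
        log_B = np.stack(
            [-0.5 * np.sum(np.log(2 * np.pi * variances[s])
                           + (frames - means[s]) ** 2 / variances[s], axis=1)
             for s in range(len(means))], axis=1)
        path = np.array(viterbi(log_A, log_B, log_pi))
        # M step: re-fit each state's diagonal Gaussian to its frames.
        for s in range(len(means)):
            assigned = frames[path == s]
            if len(assigned):
                means[s] = assigned.mean(axis=0)
                variances[s] = assigned.var(axis=0) + 1e-3  # variance floor
    return means, variances, path
```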
24
Progress in 3 Decades
- From digits to 60,000 words
- From single speakers to many
- From isolated words to continuous speech
- From no products to many products, some systems actually saving LOTS of money
25
Real Uses
- Telephone: phone company services (collect versus credit card)
- Telephone: call centers for query information (e.g., stock quotes, parcel tracking)
- Dictation products: continuous recognition, speaker dependent/adaptive
26
But:
- Still <97% on “yes” over the telephone
- An unexpected rate of speech doubles or triples the error rate
- An unexpected accent hurts badly
- Performance on unrestricted speech is around 70% (even with good acoustics)
- Don’t know when we know
- Few advances in basic understanding
27
Why is ASR Hard?
- Natural speech is continuous
- Natural speech has disfluencies
- Natural speech is variable over: global rate, local rate, pronunciation within a speaker, pronunciation across speakers, phonemes in different contexts
28
Why is ASR Hard? (continued)
- Large vocabularies are confusable
- Out-of-vocabulary words are inevitable
- Recorded speech is variable over: room acoustics, channel characteristics, background noise
- Large training times are not practical
- User expectations are for performance equal to or greater than human
29
ASR Dimensions
- Speaker dependent vs. independent
- Isolated words, continuous speech, keyword spotting
- Lexicon size and difficulty
- Task constraints, perplexity
- Adverse or easy conditions
- Natural or read speech
30
Telephone Speech
- Limited bandwidth (e.g., “f” vs. “s” become hard to distinguish)
- Large speaker variability
- Large noise variability
- Channel distortion
- Different handset microphones
- Mobile and hands-free acoustics
31
Hot Research Problems
- Speech in noise
- Multilingual conversational speech (EARS)
- Portable (e.g., cellular) ASR
- Question answering
- Understanding meetings, or at least browsing them
32
Hot Research Approaches
- New (multiple) features and models
- New statistical dependencies
- Multiple time scales
- Multiple (larger) sound units
- Dynamic/robust pronunciation models
- Long-range language models
- Incorporating prosody
- Incorporating meaning
- Non-speech modalities
- Understanding confidence
33
Multi-frame analysis
Incorporate multiple frames as a single observation. Approaches:
- LDA (the most common)
- neural networks
- Bayesian networks (graphical models, including Buried Markov Models)
34
Linear Discriminant Analysis (LDA)
Stack all variables for several frames into one vector x = [x_1, …, x_5] and apply a linear transformation y = Wx (components y_1, y_2, …) chosen to maximize the ratio of between-class variance to within-class variance.
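A sketch using scikit-learn’s LDA on randomly generated stand-in data (five stacked 13-dimensional frames, hypothetical phone-class labels):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Fake data: 1000 stacked 5-frame windows of 13-dim features (65 dims)
# with a phone-class label per window (10 hypothetical classes).
X = np.random.randn(1000, 5 * 13)
y = np.random.randint(0, 10, size=1000)

lda = LinearDiscriminantAnalysis(n_components=2)  # keep 2 discriminants
Y = lda.fit_transform(X, y)   # maximizes between-class / within-class variance
print(Y.shape)                # (1000, 2)
```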
35
Multi-layer perceptron
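The slide presumably shows the MLP used as a phone-posterior estimator in hybrid HMM/ANN systems; a minimal numpy forward pass, with random weights standing in for trained ones:

```python
import numpy as np

def mlp_posteriors(x, W1, b1, W2, b2):
    """One hidden layer + softmax: maps a feature vector (with context)
    to per-phone posterior probabilities P(q | x)."""
    h = np.tanh(x @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()

rng = np.random.default_rng(0)
x = rng.normal(size=9 * 13)          # 9 frames of context x 13 cepstra
post = mlp_posteriors(x, rng.normal(size=(117, 200)), np.zeros(200),
                      rng.normal(size=(200, 40)), np.zeros(40))
print(post.sum())                    # 1.0: a distribution over 40 phones
```

In hybrid systems, the posteriors P(q | x) divided by the priors P(q) serve as scaled HMM state likelihoods.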
36
Buried Markov Models
37
Multi-stream analysis
- Multi-band systems
- Multiple temporal properties
- Multiple data-driven temporal filters
38
Multi-band analysis
39
Temporally distinct features
40
Combining streams
41
Another novel approach: articulator dynamics
- A natural representation of context
- The production apparatus has mass and inertia
- Difficult to model accurately
- Can be approximated with simple dynamics (see the sketch below)
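A toy illustration of “simple dynamics”: a critically damped second-order system whose position chases a piecewise-constant phone target (all constants are illustrative):

```python
import numpy as np

def target_dynamics(targets, zeta=1.0, omega=40.0, dt=0.001):
    """Second-order system: position chases a piecewise-constant target,
    like a massy articulator (values are illustrative, not fitted)."""
    pos, vel, out = 0.0, 0.0, []
    for tgt in targets:
        acc = omega**2 * (tgt - pos) - 2 * zeta * omega * vel
        vel += acc * dt
        pos += vel * dt
        out.append(pos)
    return np.array(out)

# Two phone "targets": the trajectory is smooth and undershoots when the
# target switches quickly (coarticulation-like behavior).
traj = target_dynamics([1.0] * 100 + [-1.0] * 100)
```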
42
Hidden Dynamic Models “We hold these truths to be self-evident: that speech is produced by an underlying dynamic system, that it is endowed by its production system with certain inherent dynamic qualities, among these are compactness, continuity, and the pursuit of target values for each phone class, that to exploit these characteristics Hidden Dynamic Models are instituted among men. We … solemnly publish and declare, that these phone classes are and of right ought to be free and context-independent states … And for the support of this declaration, with a firm reliance on the acoustic theory of speech production, we mutually pledge our lives, our fortunes, and our sacred honor.” John Bridle and Li Deng, 1998 Hopkins Spoken Language Workshop, with apologies to Thomas Jefferson… (See http://www.clsp.jhu.edu/ws98/projects/dynamic/)
43
Hidden Dynamic Models (block diagram)
Target values, selected by a segmentation-driven target switch, feed a filter modeling the slow articulator dynamics; a neural network maps the resulting trajectory to the observed speech pattern.
44
Sources of Optimism
- Comparatively new research lines
- Many examples of improvements
- Moore’s Law → much more processing
- Points toward joint development of the front end and the statistical components
45
Summary
- 2002 ASR is based on 50+ years of research
- Core algorithms matured into systems over 10–30 years
- Deeply difficult, but tasks can be chosen that are easier in SOME dimension
- Much more yet to do