1
Open Problems in Speech Recognition
Nelson Morgan, EECS and ICSI
2
ICSI and EECS
International Computer Science Institute
Nonprofit, closely affiliated with UCB-EECS:
- faculty (e.g., Morgan, Feldman)
- Board (Berlekamp, Karp, Malik)
- students (PhD, MS)
Focus areas in speech, language, theory, internet research; CITRIS involvement
3
A working speech recognizer (circa 1920)
4
A working speech recognizer (circa 2002)
5
Current Applications
- Toys
- Telephone queries (operator/touch tone replacement)
- Voice dialing (for cell phones)
- Dictation (esp. for specific domains)
6
Major Reasons for Success
- Late 60’s statistical methodology (HMMs, developed for cryptography) applied to speech in 70’s and 80’s
- Moore’s Law + engineering refinements to HMM training/recognition (1986-now)
- Normalization approaches (mean norms, RASTA filtering, vocal tract length approx)
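The "mean norms" item can be illustrated with per-utterance mean normalization. This is a minimal sketch, assuming log-spectral or cepstral features in which a fixed channel (for example, a different phone line) shows up as a constant additive offset per dimension. The function name `mean_normalize` and the toy feature values are hypothetical and do not come from the slides.

```python
def mean_normalize(frames):
    """Subtract the per-utterance mean from every feature dimension.

    frames: list of equal-length lists of floats (one list per time frame).
    """
    if not frames:
        return []
    n_frames = len(frames)
    n_dims = len(frames[0])
    # Mean of each feature dimension over the whole utterance.
    means = [sum(f[d] for f in frames) / n_frames for d in range(n_dims)]
    return [[f[d] - means[d] for d in range(n_dims)] for f in frames]


# Toy utterance (made-up numbers) and the same utterance seen through a
# channel that adds a constant offset to each dimension.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]
channel_offset = [0.5, -1.5]
shifted = [[x + o for x, o in zip(f, channel_offset)] for f in clean]

norm_clean = mean_normalize(clean)
norm_shifted = mean_normalize(shifted)
```

After normalization the two versions coincide, which is why a constant channel offset no longer affects the recognizer. RASTA filtering pursues a related goal with a band-pass filter over time rather than a whole-utterance mean; it is not implemented here.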
7
Two examples of things that helped
- RASTA: 2% digit error -> 60% for different phone system; down to 3% using RASTA; now used for voice dialing in millions of cell phones
- Vocal tract length normalization: 1 parameter for each speaker, significant effect on errors; now used in all large research systems
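The "1 parameter for each speaker" in vocal tract length normalization can be pictured as a single warping factor applied to the frequency axis. The slide does not say which warping function is used; the sketch below assumes a piecewise-linear warp, a common choice. The function name `vtln_warp`, the 4000 Hz band edge, and the 0.8 cutoff fraction are illustrative assumptions.

```python
def vtln_warp(freq_hz, alpha, f_max=4000.0, cutoff_fraction=0.8):
    """Piecewise-linear frequency warp controlled by one factor, alpha.

    Below the cutoff, frequencies are scaled by alpha. Above it, a second
    linear segment brings the warped axis back so that f_max maps to f_max.
    """
    if not 0.0 <= freq_hz <= f_max:
        raise ValueError("freq_hz must lie in [0, f_max]")
    f_cut = cutoff_fraction * f_max
    if alpha <= 0.0 or alpha * f_cut >= f_max:
        raise ValueError("alpha is out of range for this cutoff")
    if freq_hz <= f_cut:
        return alpha * freq_hz
    # Upper segment: straight line from (f_cut, alpha * f_cut) to (f_max, f_max).
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (freq_hz - f_cut)


# alpha = 1.0 leaves the axis unchanged; alpha = 1.1 stretches the low band.
warped = [vtln_warp(f, 1.1) for f in (0.0, 1000.0, 3200.0, 4000.0)]
```

In practice the factor is typically chosen per speaker from a small grid of values, keeping the one that best fits the acoustic models; that search is omitted here.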
8
Major Technical Challenges
- Speaker variability for fluent/conversational speech (pronunciation, rate, overlaps): 25-40% error on conversations
- Acoustic variability for general environments (noise, reverb, talker movement): 3-10% error on read digits (vs <1% in clean conditions)
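The slide does not define its error percentages; assuming they are word error rates, the usual recognition metric, they could be computed as in the sketch below. It counts substitutions, deletions, and insertions with a word-level edit distance and divides by the reference length. The function name `word_error_rate` and the example strings are hypothetical.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("call the office now", "call an office")
```

Here one substitution and one deletion against a four-word reference give a rate of 0.5. Because insertions are counted too, the rate can exceed 100%.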
9
Modern ASR Systems
From 50,000 ft, all ASR systems are the same:
- compute local spectral envelope
- determine likelihoods of speech sounds
- search for most likely HMMs
Spectral envelope distorted by many things
- Alternatives often are bad fits to the statistical models
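The last step, searching for the most likely HMM state sequence, is commonly done with Viterbi decoding. The sketch below is a toy illustration on a two-state model rather than a real recognizer: the states, the observation symbols, and all probabilities are invented for the example.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log probability."""
    if not observations:
        raise ValueError("need at least one observation")
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Best predecessor of state s at time t.
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = (best[t - 1][prev] + log_trans[prev][s]
                          + log_emit[s][observations[t]])
            back[t][s] = prev
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    # Follow the back-pointers from the final frame to the first.
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


def to_log(table):
    """Convert a dict (possibly nested) of probabilities to log probabilities."""
    return {k: to_log(v) if isinstance(v, dict) else math.log(v)
            for k, v in table.items()}


# Invented toy model: silence vs. speech, observed through frame energy.
states = ["sil", "speech"]
log_start = to_log({"sil": 0.8, "speech": 0.2})
log_trans = to_log({"sil": {"sil": 0.7, "speech": 0.3},
                    "speech": {"sil": 0.2, "speech": 0.8}})
log_emit = to_log({"sil": {"low": 0.9, "high": 0.1},
                   "speech": {"low": 0.2, "high": 0.8}})

path, log_prob = viterbi(["low", "high", "high"], states,
                         log_start, log_trans, log_emit)
```

Log probabilities are used so that long utterances do not underflow. A real decoder would also apply lexicon and grammar constraints and prune the search, none of which is shown here.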
10
ASR in Brief
[Figure: block diagram. Speech -> Signal Processing -> Phonetic Probability Estimator -> Decoder (word search) -> Words; Pronunciation Lexicon and Grammar feed the Decoder]
11
ASR is half-deaf
- Phonetic classification very poor
- Success due to constraints (domain, speaker, noise-canceling mic, etc.)
- These constraints can mask the underlying weakness of the technology
12
Rethinking Acoustic Processing for ASR
- Escape dependence on spectral envelope
- Use multiple front ends across time/freq
- Modify statistical models to accommodate new front ends
- Design optimal combination schemes for multiple models
13
The DARPA (IAO) “EARS” Program
New 5-year program to radically reduce errors in conversational speech-to-text
Two components:
- Rich Transcription (large reductions in error rate, improvements in readability and portability to new languages)
- Novel Approaches (radical changes)
14
EARS: Effective Affordable Reusable Speech-to-text
Rich Transcription: 4 teams
- SRI/ICSI/UW
- BBN/U.Pitt/UW/LIMSI
- Cambridge U.
- IBM
Novel Approaches: 2 teams
- ICSI/SRI/UW/OGI/Columbia/IDIAP
- Microsoft
15
Novel Approach 1: Pushing the Envelope (aside)
Problem: Spectral envelope is a fragile information carrier
Solution: Probabilities from multiple time-frequency patches
[Figure: OLD vs PROPOSED, along a time axis. OLD: a 10 ms frame gives the estimate of sound identity. PROPOSED: patches spanning up to 1 s give the i-th, k-th, and n-th estimates, which information fusion combines into the estimate of sound identity]
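The slide does not say how the per-patch estimates are fused. One simple option, shown here only as an assumed illustration, is a weighted log-linear combination (a normalized weighted geometric mean) of the class probabilities from each patch. The function name `fuse_posteriors`, the phone labels, and the numbers are hypothetical.

```python
import math


def fuse_posteriors(estimates, weights=None, floor=1e-12):
    """Combine per-patch class probabilities by weighted log-linear fusion.

    estimates: list of dicts mapping class label -> probability.
    weights: optional list of non-negative stream weights (default: equal).
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    if weights is None:
        weights = [1.0 / len(estimates)] * len(estimates)
    classes = list(estimates[0].keys())
    # Weighted sum of log probabilities; the floor avoids log(0).
    log_scores = {
        c: sum(w * math.log(max(e[c], floor))
               for e, w in zip(estimates, weights))
        for c in classes
    }
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(log_scores.values())
    unnormalized = {c: math.exp(v - top) for c, v in log_scores.items()}
    total = sum(unnormalized.values())
    return {c: v / total for c, v in unnormalized.items()}


# Three made-up estimates of the same sound from different patches.
patch_estimates = [
    {"ah": 0.6, "iy": 0.4},
    {"ah": 0.7, "iy": 0.3},
    {"ah": 0.4, "iy": 0.6},
]
fused = fuse_posteriors(patch_estimates)
```

With equal weights, two patches favoring "ah" outweigh one favoring "iy". Choosing the weights, or replacing this rule with a trained combiner, is the combination-scheme design problem that remains open.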
16
Novel Approach 2: Beyond Frames…
Problem: Features & models interact, new features may require different models
Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm
[Figure: OLD vs PROPOSED. OLD: short-term features with a conventional HMM. PROPOSED: advanced features with a multi-rate / dynamic scale classifier]
17
Other speech-to-text projects
- Dialog systems: DARPA Communicator/Symphony, German SmartKom
- Noise/reverberation for cell phone, military environments: DARPA SPINE program, various European projects (EU, ETSI)
- Recognition/retrieval/summarization for multiparty meetings: Swiss IM2, EU m4, ICSI/UW/SRI/Columbia NSF-ITR
18
Resource generation from Berkeley researchers
- gmtk: a new graphical model toolkit specialized for speech (extension of 2 PhD theses, Bilmes [UW] and Zweig [IBM])
- Publicly available speech/neural network software (RASTA, speech neural network training system)
- Soon: a “meeting data” corpus
19
Campus interaction
Within EECS (CIS):
- Feldman (also ICSI), NLU
- Jordan and Russell, machine learning
Linguists:
- Ohala, phonology
- Fillmore (ICSI), semantic lexicography
20
Natural Speech + Language Projects at ICSI/EECS
- Berkeley Restaurant Project (BeRP): online stochastic context free grammar probabilities with natural mixed initiative
- SmartKom: tourist information query system w/American pronunciations of German place names
21
Summary
- Progress in speech recognition research led to working systems in particular domains
- Performance still severely limited for conversational speech, noisy/reverberant conditions
- We and others are working to transcend these limitations with novel approaches