Open Problems in Speech Recognition Nelson Morgan, EECS and ICSI.

ICSI and EECS
- International Computer Science Institute
- Nonprofit, closely affiliated with UCB-EECS:
  - faculty (e.g., Morgan, Feldman)
  - Board (Berlekamp, Karp, Malik)
  - students (PhD, MS)
- Focus areas in speech, language, theory, internet research; CITRIS involvement

A working speech recognizer (circa 1920)

A working speech recognizer (circa 2002)

Current Applications
- Toys
- Telephone queries (operator/touch tone replacement)
- Voice dialing (for cell phones)
- Dictation (esp. for specific domains)

Major Reasons for Success
- Late 60’s statistical methodology (HMMs, developed for cryptography) applied to speech in 70’s and 80’s
- Moore’s Law + engineering refinements to HMM training/recognition (1986-now)
- Normalization approaches (mean norms, RASTA filtering, vocal tract length approx)
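The "mean norms" item can be illustrated with a minimal sketch. It assumes the features are log-spectral or cepstral vectors, where a fixed linear channel (a different microphone or phone line) shows up as a constant added to every frame. The function name `mean_normalize` and the toy numbers are illustrative, not taken from the slides.

```python
def mean_normalize(frames):
    """Subtract the per-coefficient mean over time from a list of feature frames."""
    n_frames = len(frames)
    n_coeffs = len(frames[0])
    means = [sum(f[k] for f in frames) / n_frames for k in range(n_coeffs)]
    return [[f[k] - means[k] for k in range(n_coeffs)] for f in frames]


# Toy utterance: three frames of two (hypothetical) log-domain coefficients.
clean = [[1.0, 2.0], [3.0, 0.0], [2.0, 4.0]]

# The same utterance through a different channel: a constant offset per coefficient.
channel = [0.5, -1.5]
shifted = [[x + c for x, c in zip(frame, channel)] for frame in clean]

norm_clean = mean_normalize(clean)
norm_shifted = mean_normalize(shifted)
```

After normalization the two versions agree, so a recognizer trained on one channel sees the same features on the other. The sketch normalizes over a whole utterance; it says nothing about how the mean is estimated in a running system.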

Two examples of things that helped
- RASTA: 2% digit error -> 60% for different phone system; down to 3% using RASTA; now used for voice dialing in millions of cell phones
- Vocal tract length normalization: 1 parameter for each speaker, significant effect on errors; now used in all large research systems
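A rough sketch of the RASTA idea: band-pass filter the time trajectory of each log-spectral band, so that slowly varying components such as a fixed phone channel are suppressed while speech-rate modulations pass. The coefficients below follow the commonly cited RASTA filter, with a pole of 0.98 (some implementations use 0.94). These values, the 100 frames/s rate and the 4 Hz test signal are assumptions, not details from the slides. A full RASTA-PLP front end has further processing stages that are omitted here.

```python
import math


def rasta_filter(trajectory, pole=0.98):
    """Band-pass filter one log-spectral band over time.

    y[t] = pole * y[t-1] + 0.1 * (2*x[t] + x[t-1] - x[t-3] - 2*x[t-4])
    The numerator taps sum to zero, so a constant input is eventually removed.
    """
    numerator = [0.2, 0.1, 0.0, -0.1, -0.2]
    out = []
    prev_y = 0.0
    for t in range(len(trajectory)):
        fir = 0.0
        for k, b in enumerate(numerator):
            if t - k >= 0:  # samples before the start are treated as zero
                fir += b * trajectory[t - k]
        prev_y = pole * prev_y + fir
        out.append(prev_y)
    return out


# Toy band trajectory: a 4 Hz modulation sampled at an assumed 100 frames/s.
n_frames = 600
speech = [math.sin(2 * math.pi * 4.0 * t / 100.0) for t in range(n_frames)]

# Same band through a different channel: a constant log-domain offset.
offset = 3.0
other_channel = [x + offset for x in speech]

a = rasta_filter(speech)
b = rasta_filter(other_channel)

# Once the start-up transient has decayed, the two outputs nearly coincide,
# while the 4 Hz modulation itself is still present.
tail_gap = max(abs(x - y) for x, y in zip(a[-50:], b[-50:]))
passband_peak = max(abs(x) for x in a[-50:])
```

Unlike subtracting an utterance-level mean, this filter works frame by frame and needs no look-ahead, at the cost of a start-up transient of a few hundred frames with this pole value.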

Major Technical Challenges
- Speaker variability for fluent/conversational speech (pronunciation, rate, overlaps) -> 25-40% error on conversations
- Acoustic variability for general environments (noise, reverb, talker movement) -> 3-10% error on read digits (vs <1% in clean conditions)
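The slides do not say how these error percentages are computed. Assuming they are word error rates in the usual sense (substitutions, deletions and insertions divided by the number of reference words), a minimal sketch of the calculation looks like this. The function name and the example sentences are made up for illustration.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        curr = [i] + [0] * len(hyp)
        for j in range(1, len(hyp) + 1):
            substitution = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[len(hyp)] / len(ref)


# One deleted word ("the") and one substituted word ("nine" -> "five")
# out of five reference words.
wer = word_error_rate("call the office at nine", "call office at five")
```

Here `wer` is 0.4, i.e. 40% error, which is the upper end of the range quoted for conversational speech.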

Modern ASR Systems
- From 50,000 ft, all ASR systems are the same:
  - compute local spectral envelope
  - determine likelihoods of speech sounds
  - search for most likely HMMs
- Spectral envelope distorted by many things
- Alternatives often are bad fits to the statistical models
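The last step, searching for the most likely HMM state sequence given per-frame likelihoods, is commonly done with the Viterbi algorithm. The sketch below is a toy version over two hypothetical sound classes with made-up probabilities. A real recognizer searches over word sequences constrained by a lexicon and grammar, which is not shown.

```python
import math


def viterbi(log_init, log_trans, log_obs):
    """Return the most likely state sequence.

    log_init[s]     : log probability of starting in state s
    log_trans[r][s] : log probability of moving from state r to state s
    log_obs[t][s]   : log likelihood of frame t under state s
    """
    n_states = len(log_init)
    score = [log_init[s] + log_obs[0][s] for s in range(n_states)]
    backptr = []
    for t in range(1, len(log_obs)):
        new_score, ptr = [], []
        for s in range(n_states):
            # Best predecessor for state s at frame t.
            best_r = max(range(n_states), key=lambda r: score[r] + log_trans[r][s])
            ptr.append(best_r)
            new_score.append(score[best_r] + log_trans[best_r][s] + log_obs[t][s])
        score = new_score
        backptr.append(ptr)
    # Trace back from the best final state.
    state = max(range(n_states), key=lambda s: score[s])
    path = [state]
    for ptr in reversed(backptr):
        state = ptr[state]
        path.append(state)
    path.reverse()
    return path


def logs(row):
    return [math.log(p) for p in row]


# Two states that tend to persist from frame to frame.
log_init = logs([0.5, 0.5])
log_trans = [logs([0.9, 0.1]), logs([0.1, 0.9])]

# Per-frame likelihoods; frame 2 weakly favors state 1 (a "noisy" frame).
frame_likelihoods = [
    [0.9, 0.1], [0.8, 0.2], [0.4, 0.6], [0.9, 0.1],
    [0.2, 0.8], [0.1, 0.9], [0.1, 0.9],
]
log_obs = [logs(row) for row in frame_likelihoods]

framewise = [max(range(2), key=lambda s: row[s]) for row in frame_likelihoods]
best_path = viterbi(log_init, log_trans, log_obs)
```

Picking the best class per frame gives `[0, 0, 1, 0, 1, 1, 1]`, while the Viterbi path is `[0, 0, 0, 0, 1, 1, 1]`: the transition model overrides the single weakly misleading frame.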

ASR in Brief
[Block diagram: Speech -> Signal Processing -> Phonetic Probability Estimator -> Decoder (word search) -> Words; the Pronunciation Lexicon and Grammar feed the Decoder]

ASR is half-deaf
- Phonetic classification very poor
- Success due to constraints (domain, speaker, noise-canceling mic, etc.)
- These constraints can mask the underlying weakness of the technology

Rethinking Acoustic Processing for ASR
- Escape dependence on spectral envelope
- Use multiple front ends across time/freq
- Modify statistical models to accommodate new front ends
- Design optimal combination schemes for multiple models

The DARPA (IAO) “EARS” Program
- New 5 year program to radically reduce errors in conversational speech-to-text
- Two components:
  - Rich Transcription (large reductions in error rate, improvements in readability and portability to new languages)
  - Novel Approaches (radical changes)

EARS: Effective Affordable Reusable Speech-to-text
- Rich Transcription: 4 teams
  - SRI/ICSI/UW
  - BBN/U.Pitt/UW/LIMSI
  - Cambridge U.
  - IBM
- Novel Approaches: 2 teams
  - ICSI/SRI/UW/OGI/Columbia/IDIAP
  - Microsoft

Novel Approach 1: Pushing the Envelope (aside)
- Problem: Spectral envelope is a fragile information carrier
- Solution: Probabilities from multiple time-frequency patches
- OLD: one 10 ms frame -> estimate of sound identity
- PROPOSED: i-th, k-th, ..., n-th estimates from patches spanning up to 1 s -> information fusion -> estimate of sound identity
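The slide does not say how the per-patch estimates are fused. One common choice, used here purely as an assumption, is a weighted geometric mean of the class posteriors from each stream (a weighted sum of log probabilities, renormalized). The class labels, stream values and the name `fuse_posteriors` are hypothetical.

```python
import math


def fuse_posteriors(stream_posteriors, weights=None, floor=1e-10):
    """Combine per-stream class posteriors with a weighted geometric mean."""
    n_streams = len(stream_posteriors)
    n_classes = len(stream_posteriors[0])
    if weights is None:
        weights = [1.0 / n_streams] * n_streams
    # Weighted sum of log posteriors per class; the floor avoids log(0).
    log_scores = [
        sum(w * math.log(max(p[c], floor)) for w, p in zip(weights, stream_posteriors))
        for c in range(n_classes)
    ]
    # Subtract the peak before exponentiating for numerical stability.
    peak = max(log_scores)
    unnormalized = [math.exp(s - peak) for s in log_scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


classes = ["ah", "iy", "s"]

# Posteriors from three hypothetical time-frequency patches for one instant.
streams = [
    [0.5, 0.4, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.7, 0.1],
]

fused = fuse_posteriors(streams)
best_class = classes[max(range(len(classes)), key=lambda c: fused[c])]
```

In this toy case the first stream alone prefers "ah", but the fused estimate prefers "iy". The weights could reflect stream reliability; how to choose them is part of the open design question the slides raise.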

Novel Approach 2: Beyond Frames…
- Problem: Features & models interact, new features may require different models
- Solution: Advanced features require advanced models, not limited by fixed-frame-rate paradigm
- OLD: short-term features -> conventional HMM
- PROPOSED: advanced features -> multi-rate / dynamic scale classifier

Other speech-to-text projects
- Dialog systems: DARPA Communicator/Symphony, German SmartKom
- Noise/reverberation for cell phone, military environments: DARPA SPINE program, various European projects (EU, ETSI)
- Recognition/retrieval/summarization for multiparty meetings: Swiss IM2, EU m4, ICSI/UW/SRI/Columbia NSF-ITR

Resource generation from Berkeley researchers
- gmtk - a new graphical model toolkit specialized for speech (extension of 2 PhD theses, Bilmes [UW] and Zweig [IBM])
- Publicly available speech/neural network software (RASTA, speech neural network training system)
- Soon: a “meeting data” corpus

Campus interaction
- Within EECS (CIS):
  - Feldman (also ICSI), NLU
  - Jordan and Russell, machine learning
- Linguists:
  - Ohala, phonology
  - Fillmore (ICSI), semantic lexicography

Natural Speech + Language Projects at ICSI/EECS
- Berkeley Restaurant Project (BeRP) - online stochastic context free grammar probabilities with natural mixed initiative
- SmartKom - tourist information query system w/American pronunciations of German place names

Summary
- Progress in speech recognition research led to working systems in particular domains
- Performance still severely limited for conversational speech, noisy/reverberant conditions
- We and others are working to transcend these limitations with novel approaches
