IBM ASR Workshop Paris, France 18-20 Sept 2000 Towards Superhuman Speech Recognition Mukund Padmanabhan and Michael Picheny Human Language Technologies.


1 IBM ASR Workshop, Paris, France, 18-20 Sept 2000
Towards Superhuman Speech Recognition
Mukund Padmanabhan and Michael Picheny
Human Language Technologies Group, IBM Thomas J. Watson Research Center
Special thanks to: Stan Chen, Satya Dharanipragada, Geoff Zweig and members of the Telephony Speech Algorithms Group

2 Common UI Folklore
- “Except when interacting with video games, a user does not take very well to surprises” (Human-Computer Interaction; Dix, Finlay, Abowd and Beale)
- “Golden Rule #3: Make the interface consistent” (Elements of User Interface Design; Mandel)
- “Computer users usually seek predictable responses and are discouraged if they must engage in clarification dialogs frequently” (Designing the User Interface; Shneiderman)

3 Speech Recognition Progress

4 Human Performance (Lippmann, 1997)

5 Problem Categorization
Per task: speech style; addressee; bandwidth; error rate
- Dictation (WSJ): well formed; computer; full BW; 7%
- Broadcast News: varied, primarily well formed; audience; mixed, primarily full BW; 12%
- DARPA Communicator: spontaneous; computer; telephone BW; 16%
- SWB: spontaneous; person; telephone BW; 20-30%
- Voicemail: spontaneous; person; telephone BW; 30%
- Meetings: spontaneous; people; far-field; 55%

6 Domain Dependence
Test set \ Training data   Transaction   Switchboard   Voicemail
YP                         4.39          6.44          8.55
Digits                     1.34          1.86          2.36
Switchboard                --            39            57
Voicemail                  --            47            36
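
The cost of a domain mismatch in the Switchboard and Voicemail figures on this slide can be worked out directly. The sketch below is only illustrative: it assumes the figures are error rates in percent, that rows are test sets and columns are training data, and the helper name `relative_increase` is hypothetical.

```python
# Figures copied from the slide. Assumed reading: RATES[test_set][training_data],
# with values taken to be error rates in percent.
RATES = {
    "Switchboard": {"Switchboard": 39.0, "Voicemail": 57.0},
    "Voicemail": {"Switchboard": 47.0, "Voicemail": 36.0},
}


def relative_increase(test_set: str, mismatched_train: str) -> float:
    """Relative rise in error rate when training data does not match the test set."""
    matched = RATES[test_set][test_set]
    mismatched = RATES[test_set][mismatched_train]
    return (mismatched - matched) / matched


# Switchboard test, Voicemail-trained: (57 - 39) / 39
swb_penalty = relative_increase("Switchboard", "Voicemail")
# Voicemail test, Switchboard-trained: (47 - 36) / 36
vm_penalty = relative_increase("Voicemail", "Switchboard")

print(f"Switchboard test: {swb_penalty:.0%} worse with Voicemail training data")
print(f"Voicemail test:   {vm_penalty:.0%} worse with Switchboard training data")
```

Under that reading, training on the other conversational corpus raises the error rate by roughly a third to a half relative to matched training.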

7 Observations
1. spontaneous speech: largest effect on WER (Switchboard, Voicemail, Meetings, real-world speech)
2. multi-environment speech sources (16K, 8K, far-field microphone, noisy...)
3. multi-domain speech sources (dictation, travel, call center, small vocab, broadcast news)
4. domain-dependence of performance
Focus areas
- Improve spontaneous speech models
  1. Articulatory modeling
  2. Prosodic features
  3. Segmental graphical models
  4. Joint parameter estimation
  5. Speaker separation for multi-speaker speech
  6. Data collection for "meeting speech"
- Multi-environment
  1. Non-linear feature space transformation
  2. Hidden observations
- Multi-domain
  1. Multistyle training
  2. Domain independent LM
Objective: Develop a speech recognition system that mimics human performance (independent of environment and domain, working as well for spontaneous as for carefully enunciated speech)

12 30% Improvement; no initial decoding

16 A Language Model that Works Well on Many Domains
- Different (static) language models work best on different domains
- Use dynamic adaptation to make a generic LM act like a domain-specific LM
  - Generic LM: linear interpolation of a collection of domain-specific LMs (SWB, BN, digit/date grammar, etc.)
  - Adapt by dynamically adjusting interpolation weights
- Want to be able to adapt quickly
  - At the word/sentence level, not at the document level
- Example: “Um, yeah. Well, anyway, I’ll be arriving at four twenty two p.m. on flight fifty six. Say hi to mom. Oh, and don’t forget to buy IBM at one forty-four.”
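
The generic LM described on this slide can be sketched as a weighted mixture of component models. The Python below is a minimal illustration, not the presenters' system: it assumes toy unigram components with made-up probabilities (real components would be n-gram models or grammars), and the names `DOMAIN_LMS` and `interpolated_prob` are hypothetical.

```python
from typing import Dict

# Hypothetical toy unigram LMs standing in for domain-specific models.
# The probabilities are invented for illustration only.
DOMAIN_LMS: Dict[str, Dict[str, float]] = {
    "swb": {"yeah": 0.5, "four": 0.1, "flight": 0.1, "buy": 0.3},
    "digit": {"yeah": 0.05, "four": 0.8, "flight": 0.05, "buy": 0.1},
}


def interpolated_prob(word: str, weights: Dict[str, float],
                      lms: Dict[str, Dict[str, float]] = DOMAIN_LMS,
                      floor: float = 1e-6) -> float:
    """Linear interpolation: P(word) = sum over domains d of weights[d] * P_d(word)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    # Unseen words get a small floor probability (an assumption of this sketch).
    return sum(w * lms[d].get(word, floor) for d, w in weights.items())


# Static, generic weights versus weights shifted toward the digit grammar,
# as might happen partway through "four twenty two p.m.".
generic = {"swb": 0.5, "digit": 0.5}
adapted = {"swb": 0.1, "digit": 0.9}

p_generic = interpolated_prob("four", generic)  # 0.5 * 0.1 + 0.5 * 0.8
p_adapted = interpolated_prob("four", adapted)  # 0.1 * 0.1 + 0.9 * 0.8
print(p_generic, p_adapted)
```

Shifting the weights changes only the mixture, not the component models, which is what makes word-level or sentence-level adaptation cheap.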

17 Adapting Language Model Interpolation Weights
- Simply re-estimate weights to maximize likelihood of adaptation data (like dynamic deleted interpolation)
  - Can be quite slow because a lot of evidence has to be accumulated
- Add a hidden variable to the model that tracks which domain LM is currently being used (Bayesian adaptation)
  - Rate of adaptation can be fast and can depend on context, and the model can be trained on domain-labelled data
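
The two strategies on this slide can be contrasted in a small sketch. It assumes toy unigram domain LMs with invented probabilities, uses standard EM for the maximum-likelihood weights, and models the hidden domain variable as a simple forward filter with a fixed `stay` probability. The slide does not specify any of these details, so all names and numbers here are hypothetical.

```python
from typing import Dict, List

# Hypothetical toy unigram domain LMs (illustrative numbers only).
LMS: Dict[str, Dict[str, float]] = {
    "swb": {"um": 0.4, "yeah": 0.4, "four": 0.1, "twenty": 0.1},
    "digit": {"um": 0.05, "yeah": 0.05, "four": 0.45, "twenty": 0.45},
}
DOMAINS: List[str] = list(LMS)


def em_reestimate(words: List[str], weights: Dict[str, float],
                  iterations: int = 10) -> Dict[str, float]:
    """Maximum-likelihood interpolation weights on adaptation data, via EM."""
    w = dict(weights)
    for _ in range(iterations):
        counts = {d: 0.0 for d in DOMAINS}
        for word in words:
            joint = {d: w[d] * LMS[d][word] for d in DOMAINS}
            total = sum(joint.values())
            for d in DOMAINS:
                counts[d] += joint[d] / total  # E-step: responsibility of domain d
        w = {d: counts[d] / len(words) for d in DOMAINS}  # M-step: normalize
    return w


def bayes_track(words: List[str], prior: Dict[str, float],
                stay: float = 0.8) -> List[Dict[str, float]]:
    """Forward filter over a hidden domain variable.

    `stay` is an assumed probability of remaining in the same domain from one
    word to the next; the rest is split evenly among the other domains.
    """
    switch = (1.0 - stay) / (len(DOMAINS) - 1)
    belief = dict(prior)
    history = []
    for word in words:
        # Predict: the domain may have changed since the previous word.
        predicted = {
            d: sum(belief[p] * (stay if p == d else switch) for p in DOMAINS)
            for d in DOMAINS
        }
        # Update: reweight by how well each domain LM explains the word.
        joint = {d: predicted[d] * LMS[d][word] for d in DOMAINS}
        total = sum(joint.values())
        belief = {d: joint[d] / total for d in DOMAINS}
        history.append(belief)
    return history


words = ["um", "yeah", "four", "twenty"]
uniform = {"swb": 0.5, "digit": 0.5}
ml_weights = em_reestimate(words, uniform)  # one weight set for the whole utterance
beliefs = bayes_track(words, uniform)       # one posterior per word
print(ml_weights)
print(beliefs[1], beliefs[-1])
```

In this toy run the EM weights summarize the whole utterance with a single mixture, while the filtered posterior favors the conversational model after "um, yeah" and swings toward the digit model within two words of the numbers starting.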

20 Other Factors Driving Progress

21 What Types of Data Do We Need?
Condition      Targets                    Currently Available (U.S.)
Total Amount   5000 hours speech          1000 hours speech
               10 GB LM data              1 GB LM data
Styles         Imperatives                C&C tasks
               Queries                    ATIS/DC
               Fluent conversation        SWB/BN/Meetings
               Declamatories              WSJ/Voicemail/BN
Environments   High bandwidth/High SNR    WSJ/BN
               Low bandwidth/High SNR     SWB/Voicemail
               Low SNR                    Meetings
Domains        Low perplexity             Digits, spelling
               Medium perplexity          DC/ATIS
               High perplexity            SWB/VM/WSJ/BN

22 Some Concrete Suggestions
- Target: 5000 hours of transcribed spontaneous speech
  - 2000 Hours/year 50000 hours/year (25)
  - 5000 hours of speech: cost ~ $1M
- Test data: mixture of current and new sources
  - Switchboard, Voicemail, BN, DC, OGI SPEECON, Meetings
- Sources of new data:
  - Supergirl, by David Odell. Script - Revised Screenplay, Word Document
  - Superman: The Motion Picture, by Mario Puzo. Early Draft Script
  - Superman: The Motion Picture, by Mario Puzo. Shooting Script
  - Superman II, directed by Richard Donner. Script - Early Version
  - Superman II, directed by Richard Lester. Script - Later Version
  - Superman II. Shooting Script
  - Superman IV: The Quest for Peace, by Christopher Reeve. Script
  - Superman: The Man Of Steel, by Alex Ford & J Ellison. Script - Unproduced
  - Superman Lives, by Kevin Smith. Script - Unproduced
  - Superman Lives, by Dan Gilroy. Script synopsis - Unproduced

23 Conclusions
- Speech recognition performance is not adequate
- Human performance figures suggest that we still have enormous room for improvement
- Presented several new algorithms to attack the problem aggressively
- Suggested training and test methodology to drive research
- Communal participation is critical to push ahead

