1 Development of the Embedded Speech Recognition Interface done for AIBO ICSI Presentation January 2003.

2 Xavier Menendez-Pidal, jointly with: Gustavo Hernandez-Abrego, Lei Duan, Lex Olorenshaw, Honda Hitoshi, Helmut Lucke
Spoken Language Technology, SONY NSCA
3300 Zanker Rd MS/SJ1B5, San Jose CA
E-mail: xavier@slt.sel.sony.com

3 ABSTRACT
This presentation highlights three key techniques used in the embedded isolated-command recognition system developed for AIBO:
- Robust broadband HMMs
- Small context-dependent HMMs
- Efficient confidence measure (task independent)

4 Sony’s AIBO entertainment robot

5 General AIBO ASR Overview and Features
End-point Detection + Feature Extraction
- Noise attenuation: NSS
- Channel normalization: CMS, CMV, or DB eq.
ASR + CM for Speech Verification
- ASR based on PLUs
- Engine based on Viterbi with beam search
- Lexicon: ~100 to 300 dictionary entries
Acoustic models
- 3 states / 1 Gaussian per state CHMM triphones
- Training data: clean speech + speech mixed with noises + artificially reverberated speech
AIBO Dialogue Manager
Other sensors:
- Vision
- Tact
AIBO:
- Activity
- Personality, mood

6 HMM Training Strategies 1
TRAINING OBJECTIVES:
- Obtain a robust recognizer in noisy far-field conditions. We simulate noisy matched conditions by:
  - Mixing clean speech with the expected noises at the target SNR
  - Artificially reverberating the training corpus using the frequency-response filter of the expected far-field room environments (0.5 ~ 1.5 m)
- Obtain an accurate recognizer in near-field, high-SNR conditions.
- The recognizer should be close to real time.
A tradeoff is obtained by training on both matched noisy conditions and clean speech: the “Broadband HMM”.
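The two simulation steps on this slide (reverberate, then mix with noise at a target SNR) can be sketched as follows. This is a minimal illustration in plain Python, not the authors' tooling: the function names, the time-domain FIR form of the room response, and the order of the two steps are assumptions made for the example.

```python
import math


def convolve(signal, impulse_response):
    """Causal FIR filtering; the output has the same length as the input."""
    out = [0.0] * len(signal)
    for n in range(len(signal)):
        acc = 0.0
        for k, h in enumerate(impulse_response):
            if n - k < 0:
                break
            acc += h * signal[n - k]
        out[n] = acc
    return out


def power(x):
    """Mean power of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so speech power / noise power equals snr_db, then add it."""
    gain = math.sqrt(power(speech) / (power(noise) * 10 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]


def simulate_far_field(clean, room_response, noise, snr_db):
    """Hypothetical helper: reverberate clean speech, then add noise at snr_db."""
    reverberated = convolve(clean, room_response)
    return mix_at_snr(reverberated, noise, snr_db)
```

Calling `simulate_far_field` once per (room response, noise) pair would yield the N matched-condition copies of the training corpus that the slide describes.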

7 Robust “Broadband” HMMs
- Room_Response_1 * Speech + Noise_1 -> HMM accumulators (noise + reverberation 1)
- ...
- Room_Response_N * Speech + Noise_N -> HMM accumulators (noise + reverberation N)
- Clean Speech -> ~N clean accumulators
- All accumulators -> Final Broadband HMM
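The pooling of accumulators shown in this diagram can be illustrated with a small sketch. It assumes the accumulators are the usual Gaussian sufficient statistics (count, sum, sum of squares) and shows a single scalar Gaussian only; the class and function names are hypothetical, and any weighting between clean and noisy accumulators is left out.

```python
from dataclasses import dataclass


@dataclass
class GaussianAccumulator:
    """Sufficient statistics for one scalar Gaussian (one state, one feature)."""
    count: float = 0.0
    sum_x: float = 0.0
    sum_x2: float = 0.0

    def add(self, x, weight=1.0):
        """Accumulate one observation with an occupancy weight."""
        self.count += weight
        self.sum_x += weight * x
        self.sum_x2 += weight * x * x


def pool(accumulators):
    """Sum the statistics collected under each training condition."""
    total = GaussianAccumulator()
    for acc in accumulators:
        total.count += acc.count
        total.sum_x += acc.sum_x
        total.sum_x2 += acc.sum_x2
    return total


def estimate(acc, var_floor=1e-4):
    """Maximum-likelihood mean and (floored) variance from pooled statistics."""
    mean = acc.sum_x / acc.count
    var = max(acc.sum_x2 / acc.count - mean * mean, var_floor)
    return mean, var
```

Because the statistics are additive, each condition can be processed separately and the final model estimated once from the pooled totals.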

8 Embedded ASR System Specification
- HMMs with small memory size: < 500 Kb
- CPU-efficient ASR: the CPU can compute a maximum of 300 Gaussians per frame
- Compressed front-end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC
- Vocabulary can be easily modified: phone-based approach

9 Monophone vs Triphone
                   Monophone 1   Monophone 2   Triphone
# of Gaussians     4             2             1
# of States        ~120          ~120          ~1500
Beam               200~600       150~full      300
Memory (Kb)        90            45            500
Ave. Word Acc.     95.5          83.6~86       97.2

10 CM computation
CM Generator -> CM > Thres?
- yes: perform AIBO action
- no: reject or ask for confirmation
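The decision step on this slide amounts to a single threshold comparison. A minimal sketch, with hypothetical names and return values chosen for the example:

```python
def decide(confidence, threshold):
    """Accept the recognized command only if its confidence exceeds the threshold."""
    if confidence > threshold:
        return "perform_action"
    return "reject_or_confirm"
```

How the "no" branch chooses between rejecting and asking for confirmation is not specified on the slide, so the sketch leaves it as one outcome.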

11 Recognition process
SPEECH -> N-best Recognizer (using AM and Vocabulary) -> Hypo 1, Hypo 2, ..., Hypo N

12 CM Formulation 1
[Equations not preserved in the transcript; the slide text was:]
- Likelihood score ratio
- Approximation with the N-best
- Used in combination with a test for in-vocabulary errors, a confidence measure is built

13 CM Formulation 2
[Equation not preserved in the transcript; its annotations were:]
- Pseudo-filler score
- Pseudo-background score
- Confidence value in [0, 1]
- Number of hypotheses in the list
- i-th score in the N-best list
- Symbols shown: S_1, S_average, S_N
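The slide's formula did not survive in the transcript, so the sketch below is not the authors' confidence measure. It only illustrates how the three quantities named on the slide (the best score S_1, the average score, and the N-th score S_N) can be read off an N-best list, and shows one assumed normalization that happens to fall in [0, 1]. The function names and the specific ratio are hypothetical.

```python
def nbest_statistics(scores):
    """Return (S_1, S_average, S_N) for a list of N-best scores (higher is better)."""
    ranked = sorted(scores, reverse=True)
    s1 = ranked[0]
    s_n = ranked[-1]
    s_avg = sum(ranked) / len(ranked)
    return s1, s_avg, s_n


def illustrative_confidence(scores):
    """Assumed ratio (S_1 - S_average) / (S_1 - S_N); NOT the slide's formula."""
    s1, s_avg, s_n = nbest_statistics(scores)
    spread = s1 - s_n
    if spread == 0:
        # All hypotheses score the same: nothing separates the best from the rest.
        return 0.0
    return (s1 - s_avg) / spread
```

Since S_N <= S_average <= S_1, this ratio always lies between 0 and 1, and it grows as the best hypothesis pulls away from the rest of the list.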

14 CM thresholds for several AMs and AIBO life

15 CM thresholds for different vocabularies

16 Conclusions
- Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions.
- HMMs with context-dependent units (triphones or biphones) and 1 Gaussian per state are computationally less expensive, more accurate, and more robust to noise than monophones.
- The CM presented is very simple to compute, yet effective at separating correct results from incorrect ones and OOVs.
- CMs are robust to changes in the vocabulary and architecture of the recognizer.
- Due to its simplicity and stability, the CM looks appealing for real-life command applications.

17 References
- H. Lucke, H. Honda, K. Minamino, A. Hiroe, H. Mori, H. Ogawa, Y. Asano, H. Kishi, “Development of a Spontaneous Speech Recognition engine for an Entertainment Robot”, ISCA IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo, 2003.
- G. Hernández-Ábrego, X. Menéndez-Pidal, T. Kemp, K. Minamino, H. Lucke, “Automatic Set-up for Spontaneous Speech Recognition Engines Based on Merit Optimization”, ICASSP-2003, Hong Kong.
- X. Menéndez-Pidal, L. Duan, J. Lu, B. Dukes, M. Emonts, G. Hernández-Ábrego, L. Olorenshaw, “Efficient phone-base Recognition Engines for Chinese and English Isolated command applications”, International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, August 2002.
- G. Hernández-Ábrego, X. Menéndez-Pidal, L. Olorenshaw, “Robust and Efficient Confidence measure for Isolated command application”, in Proceedings of Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, December 2001.

