1 Development of the Embedded Speech Recognition Interface for AIBO. ICSI Presentation, January 2003.


1 Development of the Embedded Speech Recognition Interface for AIBO. ICSI Presentation, January 2003

2 Xavier Menendez-Pidal, jointly with Gustavo Hernandez-Abrego, Lei Duan, Lex Olorenshaw, Hitoshi Honda, and Helmut Lucke. Spoken Language Technology, Sony NSCA, 3300 Zanker Rd MS/SJ1B5, San Jose, CA

3 ABSTRACT
This presentation highlights three key techniques used in the embedded isolated-command recognition system developed for AIBO:
- Robust broadband HMMs
- Small context-dependent HMMs
- An efficient, task-independent confidence measure

4 Sony’s AIBO entertainment robot

5 General AIBO ASR Overview and Features
End-point detection + feature extraction:
- Noise attenuation: NSS
- Channel normalization: CMS, CMV, or DB equalization
ASR + CM for speech verification:
- ASR based on PLUs (phone-like units)
- Engine based on Viterbi decoding with beam search
- Lexicon of ~100 to 300 dictionary entries
- Triphone CHMMs with 3 states / 1 Gaussian per state, trained on clean speech, speech mixed with noises, and artificially reverberated speech
AIBO dialogue manager:
- Other sensors: vision, touch
- AIBO state: activity, personality, mood
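One of the channel-normalization options listed above, cepstral mean subtraction (CMS), can be sketched in a few lines. This is a minimal illustration of the general technique, not the AIBO front-end itself:

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Subtract the per-coefficient mean over the utterance.

    A stationary channel adds a constant offset to cepstral features,
    so removing the utterance-level mean normalizes the channel.
    features: (num_frames, num_coeffs) array of e.g. MFCCs.
    """
    features = np.asarray(features, dtype=float)
    return features - features.mean(axis=0, keepdims=True)
```

CMV (cepstral mean and variance normalization) extends this by also dividing each coefficient by its utterance-level standard deviation.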

6 HMM Training Strategies 1
TRAINING OBJECTIVES:
Obtain a robust recognizer in noisy, far-field conditions. We SIMULATE matched noisy conditions by:
- Mixing clean speech with the expected noises at target SNRs
- Artificially reverberating the training corpus using the frequency-response filters of expected far-field room environments (0.5 ~ 1.5 m)
Obtain an accurate recognizer in near-field, high-SNR conditions.
The recognizer should run close to real time.
A tradeoff is obtained by training on both matched noisy conditions and clean speech: the "broadband HMM".
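The first step above, mixing clean speech with noise at a target SNR, can be sketched as follows. This is a simplified illustration; the slide does not describe the actual corpus-preparation code:

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` scaled so the speech-to-noise power
    ratio equals `snr_db`. Both are 1-D float arrays; `noise` must
    be at least as long as `speech`."""
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)[:len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain g such that 10*log10(p_speech / (g^2 * p_noise)) == snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Reverberation, the second step, would instead convolve the clean speech with the measured room impulse response before (or in addition to) adding noise.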

7 Robust "Broadband" HMMs
Training data streams 1 ... N are built as Room_Response_n * Speech + Noise_n, alongside the clean speech itself. Each stream produces its own set of HMM accumulators (Noise + Reverberation 1 ... N, plus clean accumulators), and all accumulators are combined to estimate the final broadband HMM.
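The idea of pooling per-condition accumulators into one model can be illustrated for a single one-dimensional Gaussian. This is a toy sketch using standard sufficient statistics (occupancy, first and second moments); the real training pools full Baum-Welch accumulators across states and mixtures:

```python
def pool_gaussian_accumulators(accs):
    """Pool sufficient statistics from several training conditions.

    Each accumulator is a tuple (occupancy, sum_x, sum_x2) collected
    on one condition (clean, noise 1, ..., noise N). Summing the
    statistics and re-estimating yields a single 'broadband' Gaussian
    that covers all conditions at once.
    """
    occ = sum(a[0] for a in accs)
    sum_x = sum(a[1] for a in accs)
    sum_x2 = sum(a[2] for a in accs)
    mean = sum_x / occ
    var = sum_x2 / occ - mean ** 2
    return mean, var
```

Because only the accumulators are pooled, each noisy/reverberant copy of the corpus can be processed independently before the final re-estimation.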

8 Embedded ASR System Specification
- HMMs with small memory size: < 500 Kb
- CPU-efficient ASR: the CPU can calculate a maximum of 300 Gaussians per frame
- Compressed front-end, 20 features: 6 MFCC + 7 delta-MFCC + 7 delta2-MFCC
- Vocabulary can be easily modified: phone-based approach
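Delta features like the 7 delta-MFCCs above are typically computed as a regression over neighboring frames; applying the same operation to the deltas gives the delta2 (acceleration) features. A minimal sketch of the standard regression formula (the window size here is an assumption, not taken from the slide):

```python
def delta(frames, window=2):
    """Regression delta coefficients for one cepstral coefficient
    over time. frames: list of floats. Edges are handled by
    clamping the frame index to the sequence boundaries."""
    denom = 2 * sum(d * d for d in range(1, window + 1))
    n = len(frames)
    out = []
    for t in range(n):
        num = sum(
            d * (frames[min(t + d, n - 1)] - frames[max(t - d, 0)])
            for d in range(1, window + 1)
        )
        out.append(num / denom)
    return out
```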

9 Monophone vs Triphone

                  Monophone 1   Monophone 2   Triphone
# of Gaussians    4             2             1
# of States       ~120          ~120          ~1500
Beam              200~600       150~full      300
Memory (Kb)
Ave. Word Acc     ~86                         97.2

10 CM Computation
The CM generator computes a confidence measure for each recognition result. If CM > threshold, AIBO performs the action; if not, the result is rejected or a confirmation is requested.

11 Recognition Process
Speech is passed to an N-best recognizer, which uses the acoustic models (AM) and the vocabulary to output a ranked list of hypotheses: Hypo 1, Hypo 2, ..., Hypo N.

12 CM Formulation 1
The CM starts from a likelihood score ratio, which is approximated using the N-best list. Used in combination with a test for in-vocabulary errors, a confidence measure is built.

13 CM Formulation 2
Quantities involved: a pseudo-filler score, a pseudo-background score, the best score S_1, the average score S_average, the lowest score S_N, the number of hypotheses in the list, and the i-th score in the N-best list. The resulting confidence value lies in [0, 1].
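The equations on these two slides did not survive transcription, so the normalization below is an illustrative assumption rather than the slide's exact formula. It shows how a confidence value in [0, 1] can be built purely from N-best scores, contrasting the best score with the list average (a pseudo-filler score) and the worst score (a pseudo-background score):

```python
def confidence(nbest_scores, eps=1e-9):
    """Illustrative N-best confidence measure (assumed form).

    nbest_scores: log-scores sorted best-first. The best score S_1 is
    compared against the list average (pseudo-filler) and the worst
    score S_N (pseudo-background); the result is clamped to [0, 1].
    """
    s1 = nbest_scores[0]
    s_avg = sum(nbest_scores) / len(nbest_scores)
    s_n = nbest_scores[-1]
    if abs(s1 - s_n) < eps:  # degenerate list: all scores equal
        return 0.0
    cm = (s1 - s_avg) / (s1 - s_n)
    return min(max(cm, 0.0), 1.0)
```

A value near 1 means the best hypothesis clearly dominates the list; a value near 0 means the alternatives score almost as well, which is when the dialogue manager would reject or ask for confirmation.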

14 CM thresholds for several AMs and AIBO life

15 CM thresholds for different vocabularies

16 Conclusions
- Broadband HMMs provide a convenient tradeoff between noise robustness and accuracy in quiet conditions.
- HMMs with context-dependent units (triphones or biphones) and 1 Gaussian/state are computationally less expensive and more accurate than monophones, and more robust to noise.
- The CM presented is very simple to compute, yet effective at separating correct results from incorrect ones and OOVs.
- CMs are robust to changes in the vocabulary and architecture of the recognizer.
- Due to its simplicity and stability, the CM looks appealing for real-life command applications.

17 References
- H. Lucke, H. Honda, K. Minamino, A. Hiroe, H. Mori, H. Ogawa, Y. Asano, H. Kishi, "Development of a Spontaneous Speech Recognition Engine for an Entertainment Robot", ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition (SSPR), Tokyo.
- G. Hernández Ábrego, X. Menéndez-Pidal, Thomas Kemp, K. Minamino, H. Lucke, "Automatic Set-up for Spontaneous Speech Recognition Engines Based on Merit Optimization", ICASSP-2003, Hong Kong.
- Xavier Menéndez-Pidal, Lei Duan, Jingwen Lu, Beatriz Dukes, Michael Emonts, Gustavo Hernández-Ábrego, Lex Olorenshaw, "Efficient Phone-based Recognition Engines for Chinese and English Isolated Command Applications", International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, August 2002.
- G. Hernández Ábrego, X. Menéndez-Pidal, L. Olorenshaw, "Robust and Efficient Confidence Measure for Isolated Command Application", Proceedings of the Automatic Speech Recognition and Understanding Workshop, Madonna di Campiglio, Trento, Italy, December 2001.