PERFORMANCE ANALYSIS OF AURORA LARGE VOCABULARY BASELINE SYSTEM

Naveen Parihar and Joseph Picone, Center for Advanced Vehicular Systems, Mississippi State University
D. Pearce, Speech and MultiModal Group, Motorola Labs, UK
H. G. Hirsch, Dept. of Elec. Eng. and Computer Science, Niederrhein University, Germany

Page 1 of 15: Abstract

In this paper, we present the design and analysis of the baseline recognition system used for the ETSI Aurora large vocabulary (ALV) evaluation. The experimental paradigm is presented along with the results of a number of experiments designed to minimize the computational requirements of the system. The ALV baseline system achieved a WER of 14.0% on the standard 5K Wall Street Journal task, and required 4 xRT for training and 15 xRT for decoding (on an 800 MHz Pentium processor). It is shown that increasing the sampling frequency from 8 kHz to 16 kHz improves performance significantly only for the noisy test conditions. Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions. Use of the DSR standard VQ-based compression algorithm did not result in a significant degradation. Model mismatch and microphone mismatch resulted in relative increases in WER of 300% and 200%, respectively.

Page 2 of 15: Motivation

- The ALV goal was at least a 25% relative improvement over the baseline MFCC front end
- Develop a generic baseline LVCSR system with no front-end-specific tuning
- Benchmark the baseline MFCC front end using the generic LVCSR system on six focus conditions: sampling frequency reduction, utterance detection, feature-vector compression, model mismatch, microphone variation, and additive noise

Page 3 of 15: ALV Baseline System Development

Standard context-dependent cross-word HMM-based system:
- Acoustic models: state-tied 16-mixture cross-word triphones
- Language model: WSJ0 5K bigram
- Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
- Lexicon: based on CMUlex
- Performance: 8.3% WER at 85 xRT

[Flowchart: Training Data feeding Monophone Modeling, CD-Triphone Modeling, State-Tying, and Mixture Modeling (16) stages]
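The search component above is a Viterbi one-best decoder. A real LVCSR decoder works over lexical trees in the log domain with beam pruning; purely as an illustration of the core recursion, here is a minimal Viterbi pass over a hypothetical two-state HMM (the "sil"/"sp" states, transition probabilities, and quantized energy observations are all invented for the example):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    """One-best state path through a discrete-observation HMM (probability domain)."""
    # V[t][s] = (best probability of reaching state s at time t, predecessor state)
    V = [{s: (start_p[s] * emit_p[s][obs[0]], None) for s in states}]
    for t in range(1, len(obs)):
        V.append({})
        for s in states:
            prob, prev = max((V[t - 1][p][0] * trans_p[p][s] * emit_p[s][obs[t]], p)
                             for p in states)
            V[t][s] = (prob, prev)
    # Backtrace from the most likely final state.
    best = max(states, key=lambda s: V[-1][s][0])
    path = [best]
    for t in range(len(obs) - 1, 0, -1):
        path.append(V[t][path[-1]][1])
    return path[::-1]

# Hypothetical two-state model: "sil" (silence) vs. "sp" (speech),
# observing quantized frame energy ("low"/"high").
states = ["sil", "sp"]
start_p = {"sil": 0.8, "sp": 0.2}
trans_p = {"sil": {"sil": 0.7, "sp": 0.3}, "sp": {"sil": 0.2, "sp": 0.8}}
emit_p = {"sil": {"low": 0.9, "high": 0.1}, "sp": {"low": 0.2, "high": 0.8}}
obs = ["low", "low", "high", "high"]
print(viterbi(obs, states, start_p, trans_p, emit_p))  # ['sil', 'sil', 'sp', 'sp']
```

Production decoders replace the product of probabilities with a sum of log-probabilities to avoid underflow over thousands of frames.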

Page 4 of 15: ETSI WI007 Front End

The baseline HMM system used an ETSI standard MFCC-based front end:
- Zero-mean debiasing
- 10 ms frame duration
- 25 ms Hamming window
- Absolute energy
- 12 cepstral coefficients
- First and second derivatives

[Block diagram: Input Speech, Zero-mean and Pre-emphasis, Fourier Transform Analysis, Cepstral Analysis, Energy]
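The steps listed above can be sketched in a few lines of numpy. This is not the WI007 reference implementation; in particular it omits pre-emphasis and the mel filterbank, so the "cepstra" here come from a plain log spectrum. It only shows how the 10 ms / 25 ms framing, Hamming windowing, log energy, 12 cepstra, and the two derivative streams combine into 39-dimensional feature vectors:

```python
import numpy as np

def front_end(x, fs=8000):
    """Toy MFCC-style analysis: log energy + 12 cepstra per 10 ms frame."""
    x = x - x.mean()                          # zero-mean debiasing
    shift, win = fs // 100, fs * 25 // 1000   # 10 ms shift, 25 ms window
    n = 1 + (len(x) - win) // shift
    ham = np.hamming(win)
    feats = []
    for i in range(n):
        f = x[i * shift: i * shift + win].astype(float)
        e = np.log(max(np.sum(f * f), 1e-10))        # absolute (log) frame energy
        spec = np.abs(np.fft.rfft(f * ham, 256))     # Fourier transform analysis
        logspec = np.log(np.maximum(spec, 1e-10))
        m = len(logspec)
        # 12 cepstral coefficients via a DCT-II of the log spectrum
        cep = [np.sum(logspec * np.cos(np.pi * k * (np.arange(m) + 0.5) / m))
               for k in range(1, 13)]
        feats.append([e] + cep)
    s = np.array(feats)                 # static features: (n_frames, 13)
    d = np.gradient(s, axis=0)          # first derivatives
    a = np.gradient(d, axis=0)          # second derivatives
    return np.hstack([s, d, a])         # 39-dimensional vectors

x = np.sin(2 * np.pi * 440 * np.arange(8000) / 8000)  # 1 s of a 440 Hz tone at 8 kHz
print(front_end(x).shape)  # (98, 39)
```

One second of 8 kHz audio yields 98 frames: the first 25 ms window plus one new frame every 10 ms thereafter.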

Page 5 of 15: Real-Time Reduction

Factor                WER      Relative Degrad.
Baseline system       8.3%     N/A
Terminal filtering    8.4%     1%
ETSI front end        9.6%     14%
Beam adj. (15 xRT)    11.8%    23%
16 to 4 mixtures      14.1%    20%
50% reduction         14.9%    6%
Endpointing           14.0%    -6%

- Derived from the ISIP WSJ0 system (with CMS)
- Aurora-4 database terminal filtering resulted in marginal degradation
- The ETSI WI007 front end is 14% worse (no CMS)
- ALV baseline system performance: 14.0% WER
- Real-time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
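The "Relative Degrad." column appears to be the step-to-step relative change in WER as each speed-up is applied cumulatively, each row measured against the row above it. A quick check of that reading (my interpretation, not stated explicitly on the slide):

```python
def rel_change(prev_wer, new_wer):
    """Relative WER change (%) of one table row versus the previous row."""
    return 100.0 * (new_wer - prev_wer) / prev_wer

print(round(rel_change(8.3, 8.4)))    # 1   (terminal filtering)
print(round(rel_change(8.4, 9.6)))    # 14  (ETSI front end)
print(round(rel_change(9.6, 11.8)))   # 23  (beam adjustment)
print(round(rel_change(14.9, 14.0)))  # -6  (endpointing)
```

The 16-to-4-mixtures and 50%-reduction rows come out near 19% and 5% with this formula, so the slide's 20% and 6% figures appear to be loosely rounded.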

Page 6 of 15: Aurora-4 Database

Acoustic Training: derived from the 5000-word WSJ0 task
- TrS1 (clean) and TrS2 (multi-condition)
- Clean plus 6 noise conditions
- Randomly chosen SNR between 10 and 20 dB
- 2 microphone conditions (Sennheiser and secondary)
- 2 sampling frequencies: 16 kHz and 8 kHz
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and Evaluation Sets: derived from the WSJ0 Evaluation and Development sets
- 14 test sets each
- 7 recorded on the Sennheiser microphone; 7 on a secondary microphone
- Clean plus 6 noise conditions
- Randomly chosen SNR between 5 and 15 dB
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
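Both sets add noise at a randomly chosen SNR. A generic sketch of how noise is scaled to hit a target SNR (this is the standard recipe, not the exact Aurora-4 tooling, which additionally applies the G.712/P.341 channel filters):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(P_speech / P_noise) == snr_db, then add it."""
    p_s = np.mean(speech ** 2)
    p_n = np.mean(noise ** 2)
    scale = np.sqrt(p_s / (p_n * 10 ** (snr_db / 10)))
    return speech + scale * noise, scale

rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 300 * np.arange(16000) / 16000)  # 1 s tone at 16 kHz
noise = rng.standard_normal(16000)
noisy, scale = mix_at_snr(speech, noise, 10)

# Verify the achieved SNR matches the 10 dB target.
achieved = 10 * np.log10(np.mean(speech ** 2) / np.mean((scale * noise) ** 2))
print(round(achieved, 6))  # 10.0
```

Drawing `snr_db` uniformly from [10, 20] for training (or [5, 15] for test) reproduces the randomized condition described above.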

Page 7 of 15: Sampling Frequency Reduction

- Perfectly-matched condition (TrS1 and TS1): no significant degradation
- Mismatched conditions (TrS1 and TS2-TS14): no clear trend
- Matched conditions (TrS2 and TS1-TS14): significant degradation on noisy conditions recorded on the Sennheiser mic. (TS3-TS8)

[Bar chart: WER at 16 kHz vs. 8 kHz across test sets TS2-TS8]
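Going from 16 kHz to 8 kHz requires low-pass filtering before decimation so that content above the new 4 kHz Nyquist limit does not alias. A naive windowed-sinc sketch of that step (the evaluation itself used G.712 filtering at 8 kHz; this half-band filter is only illustrative):

```python
import numpy as np

def downsample_by_2(x, taps=63):
    """Half-band low-pass (windowed sinc, cutoff fs/4), then keep every 2nd sample."""
    n = np.arange(taps) - (taps - 1) / 2
    h = 0.5 * np.sinc(0.5 * n) * np.hamming(taps)  # ideal LPF truncated by a Hamming window
    y = np.convolve(x, h, mode="same")
    return y[::2]

fs = 16000
t = np.arange(fs) / fs
low = np.sin(2 * np.pi * 1000 * t)    # 1 kHz: inside the new band, survives
high = np.sin(2 * np.pi * 7000 * t)   # 7 kHz: above the new 4 kHz Nyquist, removed
out = downsample_by_2(low + high)
print(len(out))  # 8000
```

After filtering, the 7 kHz component is attenuated by roughly 50 dB, while the 1 kHz component passes essentially unchanged, which is exactly the behavior a decimator needs.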

Page 8 of 15: Utterance Detection

- Perfectly-matched condition (TrS1 and TS1): no significant improvement
- Mismatched conditions (TrS1 and TS2-TS14): significant improvement due to reduction in insertions
- Matched conditions (TrS2 and TS1-TS14): no significant improvement

                    W/O Endpointing          With Endpointing
Test Set            Sub.   Del.   Ins.       Sub.   Del.   Ins.
TS2 (Senn., Car)    41.4%  3.6%   20.1%      40.0%  3.6%   13.0%
TS9 (Sec., Car)     54.4%  12.3%  15.1%      49.1%  15.1%  10.1%
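Substitution, deletion, and insertion counts like those above come from a minimum-edit-distance alignment between the reference and hypothesis word strings. A standard dynamic-programming sketch with unit edit costs (scoring tools such as NIST's sclite perform a similar alignment, typically with tuned costs; the example sentences are made up):

```python
def wer_counts(ref, hyp):
    """Align hyp to ref with unit edit costs; return (sub, del, ins) counts."""
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        d[i][0] = i                     # deleting all of ref[:i]
    for j in range(H + 1):
        d[0][j] = j                     # inserting all of hyp[:j]
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            d[i][j] = min(d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),  # match/sub
                          d[i - 1][j] + 1,                               # deletion
                          d[i][j - 1] + 1)                               # insertion
    # Backtrace to classify each edit.
    i, j, sub, dele, ins = R, H, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            sub += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            dele, i = dele + 1, i - 1
        else:
            ins, j = ins + 1, j - 1
    return sub, dele, ins

ref = "show me the closest restaurant".split()
hyp = "show me a the closest rest runt".split()
s, dl, ins = wer_counts(ref, hyp)
print(s, dl, ins)                          # 1 0 2
print(100.0 * (s + dl + ins) / len(ref))   # 60.0  (WER in percent)
```

WER is the sum of the three counts divided by the number of reference words, which is why reducing insertions via endpointing lowers WER directly.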

Page 9 of 15: Feature-Vector Compression

- Sampling-frequency-specific codebooks: 8 kHz and 16 kHz
- Perfectly-matched condition (TrS1 and TS1): no significant degradation
- Mismatched conditions (TrS1 and TS2-TS14): no significant degradation
- Matched conditions (TrS2 and TS1-TS14): significant degradation on a few matched conditions (TS3, 8, 9, 10, 12 at 16 kHz sampling and TS7, 12 at 8 kHz sampling)
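The DSR standard compresses features with split vector quantization: only codebook indices are transmitted, and the receiver reconstructs each feature from its codeword. A minimal nearest-neighbor VQ sketch of that encode/decode round trip, using a made-up 3-entry codebook rather than the standard's actual tables:

```python
import numpy as np

def vq_encode(vectors, codebook):
    """Index of the nearest codeword (Euclidean distance) for each feature vector."""
    dists = ((vectors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def vq_decode(indices, codebook):
    """Reconstruct (quantized) features from the transmitted indices."""
    return codebook[indices]

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])  # hypothetical codebook
feats = np.array([[0.1, -0.1], [0.9, 1.2], [1.8, 0.2]])    # hypothetical feature pairs
idx = vq_encode(feats, codebook)
print(idx.tolist())  # [0, 1, 2]
recon = vq_decode(idx, codebook)
```

In the split-VQ scheme, cepstral coefficients are grouped into small sub-vectors and each group is quantized against its own codebook, which keeps codebooks small while bounding the quantization error per coefficient.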

Page 10 of 15: Model Mismatch

- Perfectly-matched condition (TrS1 and TS1): best performance
- Mismatched conditions (TrS1 and TS2-TS14): significant degradations
- Matched conditions (TrS2 and TS1-TS14): better than mismatched conditions

[Bar chart: WER for TrS1 vs. TrS2 models on TS1 (Clean) and TS2-TS7]

Page 11 of 15: Microphone Variation

- Train on the Sennheiser mic.; evaluate on the secondary mic.
- Perfectly-matched condition (TrS1 and TS1): optimal performance
- Mismatched condition (TrS1 and TS8): significant degradation
- Matched conditions: less severe degradation when samples of the secondary microphone are seen during training

[Bar chart: WER for TrS1 vs. TrS2 on the Sennheiser and secondary microphones]

Page 12 of 15: Additive Noise

- Mismatched conditions: performance degrades on noisy conditions when systems are trained only on clean data
- Matched conditions: exposing systems to noise and microphone variations (TrS2) improves performance

[Bar chart: WER for TrS1 vs. TrS2 on TS1 (Clean) and TS2-TS7]

Page 13 of 15: Summary and Conclusions

- Presented a WSJ0-based LVCSR system that runs at 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
- Reduction in benchmarking time from 1034 to 203 days
- Increasing the sampling frequency from 8 kHz to 16 kHz results in significant improvement only on matched noisy test conditions
- Utterance detection resulted in significant improvements only on the noisy conditions for the mismatched training conditions
- VQ-based compression is robust in the DSR environment
- Exposing models to different noise and microphone conditions improves speech recognition performance in adverse conditions

Page 14 of 15: Available Resources

- Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end
- Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
- ETSI DSR Website: reports and front-end standards

Page 15 of 15: Brief Bibliography

- N. Parihar, Performance Analysis of Advanced Front Ends, M.S. Dissertation, Mississippi State University, December 2003.
- N. Parihar and J. Picone, "An Analysis of the Aurora Large Vocabulary Evaluation," Eurospeech 2003, Geneva, Switzerland, September 2003.
- N. Parihar and J. Picone, "DSR Front End LVCSR Evaluation - AU/384/02," Aurora Working Group, European Telecommunications Standards Institute, December 6, 2002.
- D. Pearce, "Overview of Evaluation Criteria for Advanced Distributed Speech Recognition," ETSI STQ-Aurora DSR Working Group, October 2001.
- G. Hirsch, "Experimental Framework for the Performance Evaluation of Speech Recognition Front-ends in a Large Vocabulary Task," ETSI STQ-Aurora DSR Working Group, December 2002.
- "ETSI ES v1.1.2 Distributed Speech Recognition; Front-end Feature Extraction Algorithm; Compression Algorithm," ETSI, April 2000.