
Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
Authors: Naveen Parihar and Joseph Picone
Inst. for Signal and Info. Processing, Dept. of Electrical and Computer Eng., Mississippi State University
Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi
Tel: Fax: URL: gsa/

INTRODUCTION: BLOCK DIAGRAM APPROACH

Core components:
- Transduction
- Feature extraction
- Acoustic modeling (hidden Markov models)
- Language modeling (statistical N-grams)
- Search (Viterbi beam)
- Knowledge sources
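
For reference, the search component combines the acoustic and language knowledge sources through the standard Bayesian decoding rule (the textbook formulation, not anything specific to this evaluation):

```latex
\hat{W} = \arg\max_{W} P(W \mid A) = \arg\max_{W} P(A \mid W)\, P(W)
```

where $A$ is the acoustic observation sequence, $P(A \mid W)$ is the acoustic model (HMMs), and $P(W)$ is the language model (N-grams).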

INTRODUCTION: AURORA EVALUATION OVERVIEW

- WSJ 5K (closed task) with seven (digitally added) noise conditions
- Common ASR system
- Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)
- Client/server applications
- Goals: evaluate robustness in noisy environments and propose a standard for LVCSR applications

Performance summary (WER):

Site        | Clean | Noise (Sennheiser) | Noise (Multiple Mics)
Base (TS1)  | 15%   | 59%                | 75%
Base (TS2)  | 19%   | 33%                | 50%
QIO (TS2)   | 17%   | 26%                | 41%
MFA (TS2)   | 15%   | 26%                | 40%

INTRODUCTION: MOTIVATION

The Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end.

ALV evaluation results (WER):

MFCC: overall 50.3% (8 kHz: 49.6%, 16 kHz: 51.0%)
    8 kHz:  TS1 58.1%, TS2 41.0%
    16 kHz: TS1 62.2%, TS2 39.8%
QIO: overall 37.5% (8 kHz: 38.4%, 16 kHz: 36.5%)
    8 kHz:  TS1 43.2%, TS2 33.6%
    16 kHz: TS1 40.7%, TS2 32.4%
MFA: overall 34.5% (8 kHz: 34.5%, 16 kHz: 34.4%)
    8 kHz:  TS1 37.5%, TS2 31.4%
    16 kHz: TS1 37.2%, TS2 31.5%

These results use a generic baseline LVCSR system with no front-end-specific tuning.

Is the 31% relative improvement (34.5% vs. 50.3%) operationally significant? Would front-end-specific tuning change the rankings?

EVALUATION PARADIGM: THE AURORA-4 DATABASE

Acoustic training:
- Derived from the 5000-word WSJ0 task
- TS1 (clean) and TS2 (multi-condition)
- Clean plus 6 noise conditions
- Randomly chosen SNR between 10 and 20 dB
- 2 microphone conditions (Sennheiser and secondary)
- 2 sample frequencies: 16 kHz and 8 kHz
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and evaluation sets:
- Derived from the WSJ0 evaluation and development sets
- 14 test sets each: 7 recorded on the Sennheiser microphone, 7 on the secondary microphone
- Clean plus 6 noise conditions
- Randomly chosen SNR between 5 and 15 dB
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
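
To make the noise-mixing step concrete, here is a minimal Python sketch of mixing a noise recording into clean speech at a randomly chosen target SNR. This is an assumption about the general procedure, not the actual Aurora-4 construction script; in particular, the G.712/P.341 channel filtering is omitted:

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db, rng):
    """Mix a random segment of `noise` into `speech` at the given SNR (dB)."""
    # Pick a noise segment the same length as the speech.
    start = rng.integers(0, len(noise) - len(speech) + 1)
    noise = noise[start:start + len(speech)]
    # Scale the noise so that 10*log10(P_speech / P_noise) equals snr_db.
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Training data uses SNRs drawn between 10 and 20 dB (test data: 5 to 15 dB).
rng = np.random.default_rng(0)
speech = rng.standard_normal(16000)    # stand-ins for real 16 kHz audio
noise = rng.standard_normal(160000)
noisy = add_noise_at_snr(speech, noise, snr_db=rng.uniform(10, 20), rng=rng)
```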

EVALUATION PARADIGM: BASELINE LVCSR SYSTEM

Standard context-dependent cross-word HMM-based system:
- Acoustic models: state-tied 4-mixture cross-word triphones
- Language model: WSJ0 5K bigram
- Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
- Lexicon: based on CMUlex
- Run time: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium

Training flow: Training Data → Monophone Modeling → State-Tying → CD-Triphone Modeling → Mixture Modeling (2, 4)

EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END

- Zero-mean debiasing
- 10 ms frame duration
- 25 ms Hamming window
- Absolute energy
- 12 cepstral coefficients
- First and second derivatives

Processing pipeline: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Cepstral Analysis (with an energy term computed in parallel)
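
To make the framing and cepstral steps concrete, here is a simplified NumPy sketch of this style of MFCC extraction (an illustration, not the WI007 reference code; the pre-emphasis constant 0.97, the 23-channel mel filterbank, and the 512-point FFT are assumptions):

```python
import numpy as np

def mfcc_like(signal, fs=16000, frame_ms=10, win_ms=25, n_ceps=12, n_mels=23):
    """Toy MFCC extraction: pre-emphasis, 25 ms Hamming windows every 10 ms,
    mel filterbank, log compression, DCT -> 12 cepstra + log energy."""
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    hop, win = fs * frame_ms // 1000, fs * win_ms // 1000
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] for i in range(n_frames)])
    log_energy = np.log(np.maximum((frames ** 2).sum(axis=1), 1e-10))
    spec = np.abs(np.fft.rfft(frames * np.hamming(win), n=512)) ** 2

    # Triangular mel filterbank (23 channels assumed).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    bins = np.floor(inv(np.linspace(0.0, mel(fs / 2), n_mels + 2)) * 512 / fs).astype(int)
    fbank = np.zeros((n_mels, spec.shape[1]))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    logmel = np.log(np.maximum(spec @ fbank.T, 1e-10))

    # DCT-II keeps coefficients 1..12; log energy is appended as a 13th feature.
    k = np.arange(1, n_ceps + 1)[:, None]
    dct = np.cos(np.pi * k * (np.arange(n_mels) + 0.5) / n_mels)
    return np.hstack([logmel @ dct.T, log_energy[:, None]])
    # First and second derivatives would be appended downstream.
```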

FRONT END PROPOSALS: QIO FRONT END

- 10 ms frame duration
- 25 ms analysis window
- 15 RASTA-like filtered cepstral coefficients
- MLP-based VAD
- Mean and variance normalization
- First and second derivatives

Processing pipeline: Input Speech → Fourier Transform → Mel-scale Filter Bank → RASTA → DCT → Mean/Variance Normalization (with an MLP-based VAD running in parallel)
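
Mean and variance normalization is simple to state precisely. A minimal sketch, assuming per-utterance statistics computed over the frames the VAD flags as speech (how the QIO VAD output is consumed here is an assumption):

```python
import numpy as np

def mean_variance_normalize(features, speech_mask=None):
    """Normalize each feature dimension to zero mean and unit variance.
    Statistics come from VAD-flagged speech frames when a mask is given."""
    stats = features if speech_mask is None else features[speech_mask]
    mu = stats.mean(axis=0)
    sigma = stats.std(axis=0) + 1e-8   # guard against division by zero
    return (features - mu) / sigma
```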

FRONT END PROPOSALS: MFA FRONT END

- 10 ms frame duration
- 25 ms analysis window
- Mel-warped Wiener-filter-based noise reduction
- Energy-based VADNest
- Waveform processing to enhance SNR
- Weighted log-energy
- 12 cepstral coefficients
- Blind equalization (cepstral domain)
- VAD based on acceleration of various energy-based measures
- First and second derivatives

Processing pipeline: Input Speech → Noise Reduction (driven by VADNest) → Waveform Processing → Cepstral Analysis → Blind Equalization → Feature Processing (with VAD)
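
The heart of the noise-reduction stage is a Wiener gain applied per frequency bin. Here is a minimal single-pass sketch, assuming the noise spectrum is estimated from frames the VAD marks as non-speech; the actual MFA front end uses a more elaborate two-stage, mel-warped, time-smoothed variant:

```python
import numpy as np

def wiener_denoise(noisy_spec, noise_psd, floor=0.1):
    """Apply a per-bin Wiener gain H = SNR / (1 + SNR) to power spectra.

    noisy_spec: (n_frames, n_bins) noisy power spectra
    noise_psd:  (n_bins,) noise power estimate from non-speech frames
    """
    snr = np.maximum(noisy_spec / noise_psd - 1.0, 0.0)  # crude a priori SNR
    gain = np.maximum(snr / (1.0 + snr), floor)          # Wiener gain with a spectral floor
    return gain * noisy_spec

# The noise PSD would come from frames the energy-based VADNest marks as non-speech:
# noise_psd = noisy_spec[~speech_mask].mean(axis=0)
```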

EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING

Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors. Tuning parameters:
- State-tying thresholds: mitigate training-data sparsity by sharing state distributions among phonetically similar states
- Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ); see the sketch after this list
- Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
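
The language model scale and word insertion penalty enter the decoder's path score in the standard way (the textbook formulation, not taken verbatim from the paper):

```latex
\log \text{score}(W) = \log P(A \mid W) + s \cdot \log P(W) + p \cdot |W|
```

where $s$ is the language model scale, $p$ is the word insertion penalty, and $|W|$ is the number of words in the hypothesis.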

EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING

- QIO FE: 7.5% relative improvement
- MFA FE: 9.4% relative improvement
- Ranking is still the same (14.9% vs. 12.5%)!

[Table: baseline vs. tuned settings per front end. Columns: FE, Condition, # of Tied States, State-Tying Thresholds (Split, Merge, Occupancy), LM Scale, Word Insertion Penalty, WER. Rows: QIO Base, QIO Tuned, MFA Base, MFA Tuned; the numeric values are not recoverable.]

EXPERIMENTAL RESULTS: COMPARISON OF TUNING

Front End  | Train Set | Tuned | Avg. WER over 14 Test Sets
QIO        | TS1       | No    | 43.1%
QIO        | TS2       | No    | 38.1%
QIO        | Avg.      | No    | 38.4%
QIO        | TS1       | Yes   | 45.7%
QIO        | TS2       | Yes   | 35.3%
QIO        | Avg.      | Yes   | 40.5%
MFA        | TS1       | No    | 37.5%
MFA        | TS2       | No    | 31.8%
MFA        | Avg.      | No    | 34.7%
MFA        | TS1       | Yes   | 37.0%
MFA        | TS2       | Yes   | 31.1%
MFA        | Avg.      | Yes   | 34.1%

- Same ranking: the relative performance gap increased from 9.6% to 15.8%
- On TS1, the MFA FE is significantly better on all 14 test sets (MAPSSWE, p = 0.1%)
- On TS2, the MFA FE is significantly better only on test sets 5 and 14

EXPERIMENTAL RESULTS: MICROPHONE VARIATION

- Train on the Sennheiser microphone; evaluate on the secondary microphone
- Matched conditions result in optimal performance
- Significant degradation for all front ends under mismatched conditions
- Both QIO and MFA provide improved robustness relative to the MFCC baseline

[Chart: WER for the ETSI, QIO, and MFA front ends on the Sennheiser vs. secondary microphone]

EXPERIMENTAL RESULTS: ADDITIVE NOISE

- Performance degrades on noise conditions when systems are trained only on clean data; both QIO and MFA deliver improved performance
- Exposing systems to noise and microphone variations (TS2) improves performance

[Charts: WER for the ETSI, QIO, and MFA front ends on test sets TS2 through TS7, under clean-only and multi-condition (TS2) training]

SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?

- Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative over the ETSI baseline
- WER is still high (~35%) while human benchmarks report error rates around 1%, so the improvement in performance is not operationally significant
- Front-end-specific parameter tuning did not significantly change overall performance (MFA still outperforms QIO)
- Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline

APPENDIX: AVAILABLE RESOURCES

- Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
- ETSI DSR Website: reports and front end standards
- Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and a performance summary of the baseline MFCC front end