
1 Performance Analysis of Advanced Front Ends on the Aurora Large Vocabulary Evaluation
Authors: Naveen Parihar and Joseph Picone
Institute for Signal and Information Processing, Department of Electrical and Computer Engineering, Mississippi State University
Contact Information: Box 9571, Mississippi State University, Mississippi State, Mississippi 39762
Tel: 662-325-8335  Fax: 662-325-2298
URL: http://www.isip.msstate.edu/publications/seminars/msstate_misc/2004/gsa/
Email: {parihar,picone}@isip.msstate.edu

2 INTRODUCTION: BLOCK DIAGRAM APPROACH
Core components:
- Transduction
- Feature extraction
- Acoustic modeling (hidden Markov models)
- Language modeling (statistical N-grams)
- Search (Viterbi beam; a toy sketch follows)
- Knowledge sources
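To make the search component concrete, here is a self-contained toy sketch of Viterbi decoding with beam pruning over a generic HMM. Real decoders operate over lexical trees with cross-word context, as described later; all names here are illustrative, not the ISIP toolkit's API.

```python
import numpy as np

def viterbi_beam(log_obs, log_trans, log_init, beam=10.0):
    """Toy Viterbi beam search over an HMM.

    log_obs: (T, S) per-frame state log-likelihoods;
    log_trans: (S, S) transition log-probs; log_init: (S,).
    States scoring more than `beam` below the frame-best are
    pruned, which is the essence of beam search.
    """
    T, S = log_obs.shape
    score = log_init + log_obs[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_trans        # rows: previous state
        back[t] = np.argmax(cand, axis=0)        # best predecessor per state
        score = cand[back[t], np.arange(S)] + log_obs[t]
        score[score < score.max() - beam] = -np.inf  # beam pruning
    # Trace back the best state sequence
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```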

3 INTRODUCTION: AURORA EVALUATION OVERVIEW
- WSJ 5K (closed task) with seven (digitally added) noise conditions
- Common ASR system
- Two participants: QIO (Qualcomm, ICSI, OGI) and MFA (Motorola, France Telecom, Alcatel)
- Client/server applications
- Goals: evaluate robustness in noisy environments; propose a standard for LVCSR applications

Performance Summary (WER):

  Site        Clean   Noise (Sennheiser)   Noise (Multi-mic)
  Base (TS1)  15%     59%                  75%
  Base (TS2)  19%     33%                  50%
  QIO (TS2)   17%     26%                  41%
  MFA (TS2)   15%     26%                  40%

4 INTRODUCTION: MOTIVATION
The Aurora Large Vocabulary (ALV) evaluation goal was at least a 25% relative improvement over the baseline MFCC front end.

ALV Evaluation Results (WER):

  Front End   Overall   8 kHz (TS1 / TS2)        16 kHz (TS1 / TS2)
  MFCC        50.3%     49.6% (58.1% / 41.0%)    51.0% (62.2% / 39.8%)
  QIO         37.5%     38.4% (43.2% / 33.6%)    36.5% (40.7% / 32.4%)
  MFA         34.5%     34.5% (37.5% / 31.4%)    34.4% (37.2% / 31.5%)

These results used a generic baseline LVCSR system with no front end specific tuning. Two questions follow: Is the 31% relative improvement (34.5% vs. 50.3%) operationally significant? Would front end specific tuning change the rankings?
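The goal and gap figures above are simple relative WER reductions; a quick arithmetic check in Python, using the numbers from the table:

```python
def relative_improvement(baseline_wer, new_wer):
    """Relative WER reduction, as a percentage of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

print(relative_improvement(50.3, 34.5))  # ~31.4%: MFA vs. MFCC baseline
print(relative_improvement(50.3, 37.5))  # ~25.4%: QIO vs. MFCC baseline
```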

5 EVALUATION PARADIGM: THE AURORA-4 DATABASE
Acoustic Training:
- Derived from the 5000-word WSJ0 task
- TS1 (clean) and TS2 (multi-condition)
- Clean plus 6 noise conditions
- Randomly chosen SNR between 10 and 20 dB
- 2 microphone conditions (Sennheiser and secondary)
- 2 sample frequencies: 16 kHz and 8 kHz
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz

Development and Evaluation Sets:
- Derived from the WSJ0 Evaluation and Development sets
- 14 test sets for each: 7 recorded on the Sennheiser mic, 7 on the secondary mic
- Clean plus 6 noise conditions
- Randomly chosen SNR between 5 and 15 dB
- G.712 filtering at 8 kHz and P.341 filtering at 16 kHz
A sketch of the SNR-controlled noise mixing appears below.
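A minimal numpy sketch of the kind of noise mixing used to build such multi-condition sets: scale a noise segment so the mixture hits a randomly chosen SNR (10 to 20 dB for training here). The G.712/P.341 channel filtering is omitted, and the function is an illustration, not the actual database-generation code.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db_range=(10.0, 20.0), rng=None):
    """Mix noise into speech at a random SNR drawn from snr_db_range."""
    rng = rng or np.random.default_rng()
    snr_db = rng.uniform(*snr_db_range)
    noise = noise[: len(speech)]          # assume the noise clip is long enough
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain making 10*log10(p_speech / (gain^2 * p_noise)) equal snr_db
    gain = np.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10.0)))
    return speech + gain * noise
```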

6 EVALUATION PARADIGM: BASELINE LVCSR SYSTEM
Standard context-dependent cross-word HMM-based system:
- Acoustic models: state-tied 4-mixture cross-word triphones
- Language model: WSJ0 5K bigram
- Search: Viterbi one-best using lexical trees for N-gram cross-word decoding
- Lexicon: based on CMUlex
- Runtime: 4 xRT for training and 15 xRT for decoding on an 800 MHz Pentium
A sketch of the per-state mixture likelihood appears below.

[Training flow diagram: Training Data → Monophone Modeling → CD-Triphone Modeling → State-Tying → CD-Triphone Modeling → Mixture Modeling (2, 4)]
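As an illustration of the acoustic model's innermost computation, here is a minimal sketch of the log-likelihood of one state-tied mixture state, assuming diagonal-covariance Gaussians (with 4-mixture models as in the baseline, M = 4). This is a generic textbook formulation, not the toolkit's actual code.

```python
import numpy as np
from scipy.special import logsumexp

def state_log_likelihood(x, weights, means, variances):
    """log p(x | state) for a diagonal-covariance GMM state.

    x: (D,) feature vector; weights: (M,); means, variances: (M, D).
    """
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_exp = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    return logsumexp(np.log(weights) + log_norm + log_exp)
```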

7 EVALUATION PARADIGM: WI007 ETSI MFCC FRONT END
- Zero-mean debiasing
- 10 ms frame duration
- 25 ms Hamming window
- Absolute energy
- 12 cepstral coefficients
- First and second derivatives
An approximate sketch of this front end appears below.

[Block diagram: Input Speech → Zero-mean and Pre-emphasis → Fourier Transform Analysis → Cepstral Analysis, with Energy computed in parallel]
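A rough functional equivalent of this front end, sketched with librosa. The actual WI007 standard differs in details (exact mel filterbank, and it uses the absolute log-energy rather than c0), so treat this as an approximation under those assumptions.

```python
import numpy as np
import librosa

def mfcc_front_end(wav, sr=16000):
    """Approximate WI007-style features: cepstra + energy term + deltas."""
    wav = wav - np.mean(wav)                             # zero-mean debiasing
    wav = np.append(wav[0], wav[1:] - 0.97 * wav[:-1])   # pre-emphasis
    hop, win = int(0.010 * sr), int(0.025 * sr)          # 10 ms hop, 25 ms window
    mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=13,
                                hop_length=hop, win_length=win,
                                window="hamming")        # c0..c12; c0 stands in for energy
    d1 = librosa.feature.delta(mfcc)                     # first derivatives
    d2 = librosa.feature.delta(mfcc, order=2)            # second derivatives
    return np.vstack([mfcc, d1, d2])                     # (39, n_frames)
```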

8 FRONT END PROPOSALS: QIO FRONT END
- 10 ms frame duration
- 25 ms analysis window
- 15 RASTA-like filtered cepstral coefficients
- MLP-based VAD
- Mean and variance normalization
- First and second derivatives
Sketches of the RASTA filter and mean/variance normalization appear below.

[Block diagram: Input Speech → Fourier Transform → Mel-scale Filter Bank → RASTA → DCT → Mean/Variance Normalization, with an MLP-based VAD branch]
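A minimal sketch of two of these stages: a RASTA band-pass filter applied along time to each coefficient trajectory, and per-utterance mean/variance normalization. The filter coefficients follow the widely used Hermansky-Morgan formulation, which may differ from the QIO proposal's exact "RASTA-like" filter; the MLP-based VAD is omitted.

```python
import numpy as np
from scipy.signal import lfilter

def rasta_filter(cepstra):
    """Classic RASTA band-pass filter along time.

    cepstra: (n_coeffs, n_frames). Transfer function (up to delay):
    H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1).
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
    a = np.array([1.0, -0.98])
    return lfilter(b, a, cepstra, axis=1)

def mean_variance_normalize(feats, eps=1e-8):
    """Per-utterance mean subtraction and variance normalization."""
    mu = feats.mean(axis=1, keepdims=True)
    sigma = feats.std(axis=1, keepdims=True)
    return (feats - mu) / (sigma + eps)
```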

9 FRONT END PROPOSALS: MFA FRONT END
- 10 ms frame duration
- 25 ms analysis window
- Mel-warped Wiener-filter-based noise reduction
- Energy-based VADNest
- Waveform processing to enhance SNR
- Weighted log-energy
- 12 cepstral coefficients
- Blind equalization (cepstral domain)
- VAD based on acceleration of various energy-based measures
- First and second derivatives
A simplified sketch of the Wiener gain appears below.

[Block diagram: Input Speech → Noise Reduction → Waveform Processing → Cepstral Analysis → Blind Equalization → Feature Processing, with VADNest and VAD branches]
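A simplified sketch of the Wiener-style noise reduction idea, done here in the linear-frequency STFT domain; the MFA proposal applies a mel-warped, multi-stage version. Estimating the noise spectrum from leading frames and the gain floor value are illustrative assumptions.

```python
import numpy as np

def wiener_denoise(stft, n_noise_frames=10, gain_floor=0.1):
    """Apply a per-bin Wiener gain G = SNR / (1 + SNR) to an STFT.

    stft: complex array (n_bins, n_frames). Noise power is estimated
    from the first n_noise_frames, assumed speech-free.
    """
    power = np.abs(stft) ** 2
    noise_power = power[:, :n_noise_frames].mean(axis=1, keepdims=True)
    # Crude a priori SNR estimate from the a posteriori SNR
    snr = np.maximum(power / (noise_power + 1e-12) - 1.0, 0.0)
    gain = np.maximum(snr / (1.0 + snr), gain_floor)
    return gain * stft
```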

10 EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
Pruning beams (word, phone, and state) were opened during the tuning process to eliminate search errors. Tuning parameters:
- State-tying thresholds: address the sparsity of training data by sharing state distributions among phonetically similar states
- Language model scale: controls the influence of the language model relative to the acoustic models (more relevant for WSJ)
- Word insertion penalty: balances insertions and deletions (always a concern in noisy environments)
A sketch of how the last two parameters enter the decoder's score follows.
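To show where the last two knobs enter, here is a sketch of how a decoder typically combines scores per hypothesis; sign conventions and units vary across toolkits, so treat this as illustrative. The default values mirror the baseline settings in the table on the next slide.

```python
def hypothesis_score(acoustic_logprob, lm_logprob, n_words,
                     lm_scale=18.0, word_ins_penalty=10.0):
    """Combined log-score of a hypothesis during search.

    Raising lm_scale weights the language model more heavily relative
    to the acoustic model; the word insertion penalty discourages
    spurious short-word insertions.
    """
    return acoustic_logprob + lm_scale * lm_logprob - word_ins_penalty * n_words
```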

11 EXPERIMENTAL RESULTS: FRONT END SPECIFIC TUNING
- QIO FE: 7.5% relative improvement
- MFA FE: 9.4% relative improvement
- Ranking is still the same (14.9% vs. 12.5%)!

  FE    Cond.   # of Tied States   State-Tying Thresholds   LM Scale   Word Ins. Pen.   WER
  QIO   Base    3209               165, 840                 18         10               16.1%
  QIO   Tuned   3512               125, 750                 20         10               14.9%
  MFA   Base    3208               165, 840                 18         10               13.8%
  MFA   Tuned   4254               100, 600                 18         5                12.5%

12 EXPERIMENTAL RESULTS: COMPARISON OF TUNING

  Front End   Train Set   Tuning   Average WER over 14 Test Sets
  QIO         1           No       43.1%
  QIO         2           No       38.1%
  QIO         Avg.        No       38.4%
  QIO         1           Yes      45.7%
  QIO         2           Yes      35.3%
  QIO         Avg.        Yes      40.5%
  MFA         1           No       37.5%
  MFA         2           No       31.8%
  MFA         Avg.        No       34.7%
  MFA         1           Yes      37.0%
  MFA         2           Yes      31.1%
  MFA         Avg.        Yes      34.1%

- Same ranking: the relative performance gap increased from 9.6% to 15.8%
- On TS1, the MFA FE is significantly better on all 14 test sets (MAPSSWE, p = 0.1%)
- On TS2, the MFA FE is significantly better only on test sets 5 and 14
A rough sketch of the matched-pairs idea behind these significance claims appears below.
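The MAPSSWE test used here is, at heart, a matched-pairs z-test on per-segment error-count differences between two systems on the same segments. A rough sketch of that core idea follows; NIST's sc_stats tool implements the full test, and this simplification omits its segment-construction rules.

```python
import numpy as np
from scipy.stats import norm

def matched_pairs_z_test(errors_a, errors_b):
    """Matched-pairs test on per-segment word-error counts.

    errors_a, errors_b: error counts per segment for systems A and B
    on identical segments. Returns (z statistic, two-sided p-value).
    """
    d = np.asarray(errors_a, float) - np.asarray(errors_b, float)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    p = 2.0 * (1.0 - norm.cdf(abs(z)))
    return z, p
```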

13 EXPERIMENTAL RESULTS: MICROPHONE VARIATION
- Train on the Sennheiser mic; evaluate on the secondary mic
- Matched conditions result in optimal performance
- Significant degradation for all front ends under mismatched conditions
- Both QIO and MFA provide improved robustness relative to the MFCC baseline

[Bar chart: WER (%) on Sennheiser vs. secondary microphone for the ETSI, QIO, and MFA front ends]

14 EXPERIMENTAL RESULTS: ADDITIVE NOISE
- Performance degrades on noise conditions when systems are trained only on clean data; both QIO and MFA deliver improved performance
  [Bar chart: WER (%) on TS2-TS7 for the ETSI, QIO, and MFA front ends, clean training]
- Exposing systems to noise and microphone variations (TS2) improves performance
  [Bar chart: WER (%) on TS2-TS7 for the ETSI, QIO, and MFA front ends, multi-condition training]

15 SUMMARY AND CONCLUSIONS: WHAT HAVE WE LEARNED?
- Both QIO and MFA front ends achieved the ALV evaluation goal of improving performance by at least 25% relative to the ETSI baseline
- WER is still high (~35%) while human benchmarks report error rates near 1%, so the improvement is not operationally significant
- Front end specific parameter tuning did not significantly change overall performance (MFA still outperforms QIO)
- Both QIO and MFA front ends handle convolutional and additive noise better than the ETSI baseline

16 APPENDIX: AVAILABLE RESOURCES
- Speech Recognition Toolkits: compare front ends to standard approaches using a state-of-the-art ASR toolkit
- ETSI DSR Website: reports and front end standards
- Aurora Project Website: recognition toolkit, multi-CPU scripts, database definitions, publications, and performance summary of the baseline MFCC front end

