The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments — MITRE / MS State - ISIP — Burhan Necioglu, Bryan George, George Shuttic, The MITRE Corporation


The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments
MITRE / MS State - ISIP
Burhan Necioglu, Bryan George, George Shuttic (The MITRE Corporation)
Ramasubramanian Sundaram, Joe Picone (Mississippi State U., Inst. for Signal & Information Processing)

INTRODUCTION
• Collaboration between The MITRE Corporation and the Mississippi State Institute for Signal and Information Processing (ISIP)
  – Primary goal: evaluate the impact of noise pre-processing developed for other DoD applications
• MITRE:
  – Focus on robust speech recognition using noise-reduction techniques, including the effects of tactical communications links
  – Distributed information access systems for military applications (DARPA Communicator)
• Mississippi State:
  – Focus on stable, practical, advanced LVCSR technology
  – Open-source large-vocabulary speech recognition tools
  – Training, education, and dissemination of information related to all aspects of speech research
• The ISIP-STT system utilized a combination of technologies from both organizations

OVERVIEW OF THE SYSTEM
• Standard MFCC front-end with side-based CMS
• Acoustic modeling:
  – Left-right model topology
  – Skip states for special models such as silence
  – Continuous-density mixture-Gaussian HMMs
  – Both Baum-Welch and Viterbi training supported
  – Phonetic decision-tree-based state-tying
• Hierarchical-search Viterbi decoder
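Side-based CMS subtracts the mean cepstrum computed over an entire conversation side, removing stationary convolutional channel effects. A minimal numpy sketch (the function name is illustrative, not the ISIP-STT API):

```python
import numpy as np

def cepstral_mean_subtraction(features):
    """Side-based cepstral mean subtraction (CMS).

    features: (num_frames, num_ceps) array of cepstral coefficients
    for one conversation side. Subtracting the per-dimension mean
    over all frames of the side removes stationary channel effects.
    """
    return features - features.mean(axis=0, keepdims=True)
```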

ISIP-STT ACOUSTIC MODELING
• Left-right model topology
• Skip states for special models such as silence
• Continuous-density mixture-Gaussian HMMs
• Both Baum-Welch and Viterbi training supported
• Phonetic decision-tree-based state-tying
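The left-right topology with optional skip states can be illustrated by the structure of the transition matrix; this is a generic sketch of the standard topology, not the actual ISIP-STT parameterization (uniform probabilities are placeholders for trained values):

```python
import numpy as np

def left_right_transitions(num_states, allow_skip=False):
    """Build a left-to-right HMM transition matrix.

    Each state allows a self-loop and a step to the next state;
    with allow_skip=True a state may also skip one state ahead,
    as used for special models such as silence. Rows are
    normalized to sum to 1 (uniform placeholder probabilities).
    """
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        A[i, i] = 1.0                      # self-loop
        if i + 1 < num_states:
            A[i, i + 1] = 1.0              # forward step
        if allow_skip and i + 2 < num_states:
            A[i, i + 2] = 1.0              # skip transition
    return A / A.sum(axis=1, keepdims=True)
```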

STATE-TYING: MOTIVATION
• Context-dependent models for better performance
• Increased parameter count
• Need to reduce computation without degrading performance

DECISION TREES
• Why decision trees?
  – Both data- and knowledge-driven
  – Capable of handling unseen contexts
• Estimation criteria
  – Maximum-likelihood-based approach
  – Multiple stopping criteria

TREE BUILDING
• Splitting rule
  – Maximize the likelihood of the data given the tree
• Stopping rule
  – Likelihood increase less than a threshold
  – Minimum state occupancy at each node
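The splitting and stopping rules above can be sketched with the usual single-Gaussian likelihood approximation; this is a generic illustration under a diagonal-covariance assumption, not the exact ISIP criterion:

```python
import numpy as np

def node_log_likelihood(frames):
    """Approximate log-likelihood of frames under one
    diagonal-covariance Gaussian fit to those frames."""
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8          # diagonal covariance, floored
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(frames, mask):
    """Likelihood gain from splitting a node by a phonetic question.

    mask marks frames whose context answers the question "yes".
    A split is accepted only if the gain exceeds a threshold and
    both children satisfy a minimum occupancy count (stopping rules).
    """
    yes, no = frames[mask], frames[~mask]
    return (node_log_likelihood(yes) + node_log_likelihood(no)
            - node_log_likelihood(frames))
```

A question that cleanly separates two acoustic clusters yields a large positive gain, so the tree greedily picks the best question at each node.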

FEATURES AND PERFORMANCE
• Batch processing
• Real-time performance of the training process during its various stages

DECODER: OVERVIEW
• Algorithmic features:
  – Single-pass decoding
  – Hierarchical Viterbi search
  – Dynamic network expansion
• Functional features:
  – Cross-word context-dependent acoustic models
  – Word-graph rescoring, forced alignments, N-gram decoding
• Structural features:
  – Word-graph compaction
  – Multiple pronunciations
  – Memory management
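The state-level core of any Viterbi decoder is the dynamic-programming recursion below; this is a textbook sketch, not the hierarchical ISIP decoder itself, and it assumes the model starts in state 0 (as in a left-right topology):

```python
import numpy as np

def viterbi(log_A, log_B):
    """Single-pass Viterbi search over one HMM.

    log_A: (S, S) log transition matrix.
    log_B: (T, S) per-frame log observation likelihoods.
    Returns the best state sequence as a list of state indices.
    """
    T, S = log_B.shape
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    score[0, 0] = log_B[0, 0]            # left-right model starts in state 0
    for t in range(1, T):
        for j in range(S):
            prev = score[t - 1] + log_A[:, j]
            back[t, j] = np.argmax(prev)
            score[t, j] = prev[back[t, j]] + log_B[t, j]
    path = [int(np.argmax(score[-1]))]   # backtrace from best final state
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]
```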

WORD GRAPH COMPACTION
• Timing information ignored in word-graph rescoring
• Merge duplicate arcs, but preserve all original sentence hypotheses
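Once timing is dropped from node identities, compaction reduces to collapsing arcs that connect the same nodes with the same word; a minimal sketch (arc representation and score-combination rule are illustrative assumptions):

```python
def compact_word_graph(arcs):
    """Merge duplicate arcs in a word graph.

    arcs: list of (start_node, end_node, word, score) tuples with
    timing already removed from node identities. Arcs connecting the
    same nodes with the same word collapse into one, keeping the best
    score, so every original sentence hypothesis (the word sequence
    along any path) is preserved.
    """
    best = {}
    for start, end, word, score in arcs:
        key = (start, end, word)
        if key not in best or score > best[key]:
            best[key] = score
    return [(s, e, w, sc) for (s, e, w), sc in best.items()]
```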

LEXICAL TREES
• Lexical pronunciation trees
• Required for a compact representation of the lexicon
• Results in delayed LM application
• Single tree copy needed in N-gram decoding mode
• Beam pruning: separate beam at each level of the search hierarchy
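A lexical pronunciation tree is a phone prefix tree: words sharing a pronunciation prefix share nodes, so their acoustics are searched once, and the word identity (hence the exact LM score) is only known at a leaf, which is why LM application is delayed. A minimal sketch using nested dicts (the `'#words'` leaf marker is an illustrative convention):

```python
def build_lexical_tree(lexicon):
    """Build a lexical pronunciation tree (phone prefix tree).

    lexicon: {word: [phone, ...]}. Shared pronunciation prefixes
    share tree nodes; completed words are recorded under the
    '#words' key of the node reached by their last phone.
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for phone in phones:
            node = node.setdefault(phone, {})
        node.setdefault('#words', []).append(word)  # leaf marker
    return root
```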

DYNAMIC CONTEXT GENERATION
• Lexical pronunciation trees composed of CI models only
• A context-dependent lexical tree is not practical
• Expansion on the fly reduces memory requirements significantly

EVALUATION SYSTEM - NOISE PREPROCESSING
• Uses the Harsh Environment Noise Pre-Processor (HENPP) front-end to remove noise from input speech
• HENPP developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox, Malah et al., ICASSP 1999)
• Multiplicative spectral processing: minimal distortion, eliminates "doodley-doos" (aka "musical noise")
• "Minimum statistics" noise adaptation: handles quasi-stationary additive noise (random and stochastic) without assumptions
• Limitations:
  – Not designed to address transient noise
  – Noise adaptation sensitive to "push-to-talk" effects
• Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR (comparing LPC-10, MELP, and MELP/HENPP)

EVALUATION SYSTEM - DATA AND TRAINING
• 10 hours of SPINE data used for training (no DRT words)
• 100 frames per second, 25 msec Hamming window
• 12 base FFT-derived mel cepstra with side-based CMS, plus log-energy
• Delta and acceleration coefficients
• 44-phone set to cover the SPINE data
• 909 models, 2725 states
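The framing configuration above (100 frames/second with a 25 msec window implies a 10 msec shift) can be sketched as follows; this is a generic windowing illustration, with the 16 kHz sample rate as an assumption:

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_rate=100, window_ms=25):
    """Split a waveform into overlapping Hamming-windowed frames.

    100 frames/second with a 25 ms window gives a 10 ms frame shift,
    the configuration listed on this slide. Each frame is multiplied
    by a Hamming window before spectral analysis.
    """
    shift = sample_rate // frame_rate                 # 10 ms hop
    length = sample_rate * window_ms // 1000          # 25 ms window
    window = np.hamming(length)
    n_frames = 1 + max(0, (len(signal) - length) // shift)
    frames = np.stack([signal[i * shift : i * shift + length]
                       for i in range(n_frames)])
    return frames * window
```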

EVALUATION SYSTEM - LM AND LEXICON
• 5226 words in the SPINE lexicon, provided by CMU
• CMU language model
• Bigram LM obtained by discarding the trigrams
• LM size: 5226 unigrams, bigrams

EVALUATION SYSTEM - DECODING
• Single-stage decoding using word-internal acoustic models and a bigram LM

RESULTS AND ANALYSIS
• Lattice generation / lattice rescoring will improve results
• Informal analysis of evaluation data and results:
  – Negative correlation between recognition performance and SNR

RESULTS AND ANALYSIS (cont.)
• Clean speech example: "B" side of spine_eval_033 (281 total words)
• Low-SNR example: "A" side of spine_eval_021 (115 total words)

RESULTS AND ANALYSIS (cont.)
• HENPP was designed for human listening purposes
  – Optimized to raise DRT scores in the presence of noise and coding
  – DRT scores and WER tend to be poorly correlated; minor perceptual distortions often have a magnified adverse effect on speech recognizers
• Need to retune the HENPP
  – The algorithm is very effective for robust recognition of noisy speech at low SNRs
  – Too aggressive when applied to clean speech: some information is lost
  – Minor adjustments should preserve noisy-speech performance and boost clean-speech performance

ISSUES
• Decoding slow on this task
  – 100x real time (on a 600 MHz Pentium)
  – A newer version of the ISIP-STT decoder will be faster
  – Had to use a bigram LM in the allowed time frame
• Large amount of eval data
  – With slow decoding, this seriously limited experiments
• The devil is in the details:
  – Certain training data problematic (e.g., "Noise field is up")
  – Automatic segmentation (having the eval segmentations would help)

CONCLUSIONS
• MITRE / MS State-ISIP system: a standard recognition approach using an advanced noise-preprocessing front end
• Time limitation: could only officially report on the baseline system
• Performed an initial experiment with noise preprocessing (AT&T HENPP)
  – Overall word error rate did not improve
  – Informal analysis suggests that for low-SNR conversations, noise pre-processing does help
  – Difficulty with high-SNR conversations
• There is potential for improvement with application-specific tuning of the HENPP
• The approach is very promising for coded speech in commercial and military environments

BACKUP SLIDES

PRUNING STRATEGIES
• Separate beam at each level of the search hierarchy
• Maximum Active Phone Model Instance (MAPMI) pruning
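Both strategies above can be sketched in one function: a score beam discards hypotheses too far below the best, and a count cap in the spirit of MAPMI bounds how many instances stay active. This is a generic illustration, with threshold values as placeholder assumptions:

```python
def beam_prune(hypotheses, beam=200.0, max_active=1000):
    """Prune active hypotheses with a score beam and a count cap.

    hypotheses: {hyp_id: log_score}. The beam keeps only hypotheses
    within `beam` of the best log score; the cap then keeps at most
    `max_active` of the best survivors, bounding memory and compute
    in the spirit of MAPMI pruning.
    """
    if not hypotheses:
        return {}
    best = max(hypotheses.values())
    survivors = {h: s for h, s in hypotheses.items() if s >= best - beam}
    top = sorted(survivors.items(), key=lambda kv: kv[1], reverse=True)
    return dict(top[:max_active])
```

In a real decoder a separate beam width would be applied at each level of the search hierarchy (state, phone, word), as the slide notes.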