The 2000 NRL Evaluation for Recognition of Speech in Noisy Environments
MITRE / MS State - ISIP
Burhan Necioglu, Bryan George, George Shuttic (The MITRE Corporation)
Ramasubramanian Sundaram, Joe Picone (Mississippi State University, Institute for Signal and Information Processing)
INTRODUCTION
• Collaboration between The MITRE Corporation and the Mississippi State Institute for Signal and Information Processing (ISIP)
  – Primary goal: evaluate the impact of noise pre-processing developed for other DoD applications
• MITRE:
  – Focus on robust speech recognition using noise reduction techniques, including the effects of tactical communications links
  – Distributed information access systems for military applications (DARPA Communicator)
• Mississippi State:
  – Focus on stable, practical, advanced LVCSR technology
  – Open-source large vocabulary speech recognition tools
  – Training, education, and dissemination of information related to all aspects of speech research
• The ISIP-STT system combined technologies from both organizations
OVERVIEW OF THE SYSTEM
• Standard MFCC front-end with side-based CMS
• Acoustic modeling:
  – Left-to-right model topology
  – Skip states for special models such as silence
  – Continuous-density Gaussian mixture HMMs
  – Both Baum-Welch and Viterbi training supported
  – Phonetic decision-tree-based state tying
• Hierarchical-search Viterbi decoder
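The side-based CMS step in the front-end can be sketched as follows. This is a generic illustration, not the ISIP-STT code; `cepstra` stands for a hypothetical (frames × coefficients) matrix computed over one conversation side.

```python
import numpy as np

def side_based_cms(cepstra):
    """Cepstral mean subtraction (CMS) over one conversation side.

    cepstra: (num_frames, num_coeffs) array of mel-cepstral vectors.
    The mean of each coefficient is estimated over the entire side and
    subtracted, removing stationary convolutional channel effects.
    """
    mean = cepstra.mean(axis=0, keepdims=True)
    return cepstra - mean

# Toy example: a side with a constant channel offset of +5 in every
# coefficient; CMS removes the offset.
side_a = np.random.default_rng(0).normal(size=(200, 12)) + 5.0
normalized = side_based_cms(side_a)
```

Because the mean is taken per side rather than per utterance, short utterances still get a robust channel estimate from the rest of that side's speech.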
ISIP-STT ACOUSTIC MODELING
• Left-to-right model topology
• Skip states for special models such as silence
• Continuous-density Gaussian mixture HMMs
• Both Baum-Welch and Viterbi training supported
• Phonetic decision-tree-based state tying
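A left-to-right topology with skip states can be illustrated by the shape of the transition matrix; the probabilities below are made up for illustration, not trained values.

```python
import numpy as np

def left_to_right_transitions(num_states, p_stay=0.6, p_skip=0.1):
    """Build a left-to-right HMM transition matrix with skip arcs.

    Each state may stay, advance one state, or skip a state; the skip
    arc is what lets short models (e.g. silence) be traversed quickly.
    No backward transitions exist, so the matrix is upper triangular.
    """
    A = np.zeros((num_states, num_states))
    for i in range(num_states):
        A[i, i] = p_stay
        if i + 1 < num_states:
            A[i, i + 1] = 1.0 - p_stay - (p_skip if i + 2 < num_states else 0.0)
        if i + 2 < num_states:
            A[i, i + 2] = p_skip
        A[i] /= A[i].sum()      # renormalize (handles the final state)
    return A

A = left_to_right_transitions(3)
```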
STATE TYING: MOTIVATION
• Context-dependent models give better performance
• ...but greatly increase the parameter count
• Need to reduce computation without degrading performance
DECISION TREES
• Why decision trees?
  – Both data- and knowledge-driven
  – Capable of handling unseen contexts
• Estimation criteria:
  – Maximum-likelihood-based approach
  – Multiple stopping criteria
TREE BUILDING
• Splitting rule:
  – Maximize the likelihood of the data given the tree
• Stopping rules:
  – Likelihood increase less than a threshold
  – Minimum state occupancy at each node
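The splitting rule above can be sketched under the common simplification of a single diagonal Gaussian per node; this is a generic illustration of the maximum-likelihood criterion, not necessarily the exact ISIP implementation.

```python
import numpy as np

def node_log_likelihood(x):
    """Log-likelihood of data x (N, d) under its own ML diagonal Gaussian."""
    n, d = x.shape
    var = x.var(axis=0) + 1e-8          # variance floor avoids log(0)
    return -0.5 * n * (d * np.log(2 * np.pi) + np.log(var).sum() + d)

def split_gain(x, answers):
    """Likelihood gain of splitting node data x by a yes/no question.

    answers: boolean array, True where the phonetic question holds.
    The split is kept only if this gain exceeds a threshold and both
    children meet a minimum occupancy -- the stopping rules above.
    """
    parent = node_log_likelihood(x)
    yes, no = x[answers], x[~answers]
    return node_log_likelihood(yes) + node_log_likelihood(no) - parent

# Two synthetic clusters; a question that separates them gives a large gain.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, (100, 3)), rng.normal(4, 1, (100, 3))])
answers = np.arange(200) < 100
gain = split_gain(x, answers)
```

Because each child fits its own ML Gaussian, the gain is never negative; the threshold and occupancy checks are what stop the tree from growing indefinitely.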
FEATURES AND PERFORMANCE
• Batch processing
• Real-time performance of the training process at various stages
DECODER: OVERVIEW
• Algorithmic features:
  – Single-pass decoding
  – Hierarchical Viterbi search
  – Dynamic network expansion
• Functional features:
  – Cross-word context-dependent acoustic models
  – Word-graph rescoring, forced alignments, N-gram decoding
• Structural features:
  – Word-graph compaction
  – Multiple pronunciations
  – Memory management
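The core of any Viterbi decoder is the dynamic-programming recursion below; this minimal single-state-space sketch omits the hierarchy, dynamic network expansion, and pruning that the real decoder adds.

```python
import numpy as np

def viterbi(log_A, log_pi, log_B):
    """Single-pass Viterbi search: best state sequence for one utterance.

    log_A:  (S, S) log transition matrix
    log_pi: (S,)   log initial-state probabilities
    log_B:  (T, S) log observation likelihoods per frame
    Returns the maximum-likelihood state path (length T).
    """
    T, S = log_B.shape
    score = log_pi + log_B[0]
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        cand = score[:, None] + log_A          # (prev_state, next_state)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_B[t]
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):              # backtrace
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Tiny 2-state example: observations favor state 0, then state 1.
log_A = np.log(np.array([[0.7, 0.3], [0.3, 0.7]]))
log_pi = np.log(np.array([0.5, 0.5]))
log_B = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
path = viterbi(log_A, log_pi, log_B)
```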
WORD GRAPH COMPACTION
• Timing information is ignored in word-graph rescoring
• Merge duplicate arcs, but preserve all original sentence hypotheses
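Since rescoring ignores times, arcs can be keyed on (start node, end node, word) alone; the sketch below illustrates that merge and checks that the set of sentence hypotheses is unchanged. The graph structure is a hypothetical simplification.

```python
def compact(arcs):
    """Merge duplicate word-graph arcs, ignoring timing.

    arcs: list of (from_node, to_node, word, start_time, end_time).
    Arcs that differ only in timing are merged, so every original
    sentence hypothesis (word sequence along a path) survives.
    """
    seen, out = set(), []
    for frm, to, word, t0, t1 in arcs:
        key = (frm, to, word)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

def hypotheses(arcs, start, end, prefix=()):
    """Enumerate word sequences along all start -> end paths."""
    if start == end:
        return {prefix}
    result = set()
    for frm, to, word, *_ in arcs:
        if frm == start:
            result |= hypotheses(arcs, to, end, prefix + (word,))
    return result

arcs = [
    (0, 1, "go", 0.0, 0.30), (0, 1, "go", 0.0, 0.35),   # duplicates in time
    (1, 2, "north", 0.3, 0.8), (1, 2, "left", 0.3, 0.7),
]
compacted = [(f, t, w, None, None) for f, t, w in compact(arcs)]
```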
LEXICAL TREES
• Lexical pronunciation trees
• Required for a compact representation of the lexicon
• Results in delayed LM application
• A single tree copy is needed in N-gram decoding mode
• Beam pruning: separate beam at each level of the search hierarchy
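A lexical tree is a prefix tree over pronunciations; the sketch below, with a hypothetical three-word mini-lexicon, shows why it is compact and why LM application is delayed (the word is only known at a leaf).

```python
def build_lexical_tree(lexicon):
    """Build a pronunciation prefix tree (lexical tree).

    lexicon: dict word -> tuple of phones. Shared pronunciation
    prefixes share tree nodes, which makes the representation
    compact -- and means the word identity (hence the LM score)
    is only known once a leaf is reached.
    """
    root = {}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node.setdefault(ph, {})
        node["<word>"] = word               # leaf marker
    return root

def count_nodes(node):
    return sum(1 + count_nodes(v) for k, v in node.items() if k != "<word>")

# Hypothetical mini-lexicon; all three words share the prefix "s t aa".
lexicon = {
    "star":  ("s", "t", "aa", "r"),
    "start": ("s", "t", "aa", "r", "t"),
    "stop":  ("s", "t", "aa", "p"),
}
tree = build_lexical_tree(lexicon)
total_phones = sum(len(p) for p in lexicon.values())
tree_nodes = count_nodes(tree)
```

Here 13 phone instances collapse into 6 tree nodes; over a 5226-word lexicon the sharing is far greater.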
DYNAMIC CONTEXT GENERATION
• Lexical pronunciation trees are composed of context-independent (CI) models only
• A context-dependent lexical tree is not practical
• Expanding contexts on the fly reduces memory requirements significantly
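On-the-fly expansion amounts to deriving context-dependent labels from the CI phone sequence as the search proceeds, instead of storing them in the tree. The sketch below uses the common left-center+right triphone notation as an assumption, along with a hypothetical `sil` boundary phone.

```python
def expand_contexts(phones, boundary="sil"):
    """Expand a context-independent phone sequence into triphone
    labels of the form left-center+right, generated on the fly.

    Storing a context-dependent lexical tree would multiply its size
    by the number of contexts; generating labels during the search
    keeps only the compact CI tree in memory.
    """
    padded = [boundary] + list(phones) + [boundary]
    return [f"{padded[i-1]}-{padded[i]}+{padded[i+1]}"
            for i in range(1, len(padded) - 1)]

triphones = expand_contexts(["s", "t", "aa", "p"])
```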
EVALUATION SYSTEM - NOISE PREPROCESSING
• Using the Harsh Environment Noise Pre-Processor (HENPP) front-end to remove noise from the input speech
• HENPP developed by AT&T to address background noise effects in DoD speech coding environments (see Accardi and Cox; Malah et al., ICASSP 1999)
• Multiplicative spectral processing: minimal distortion, eliminates “doodley-doos” (a.k.a. “musical noise”)
• “Minimum statistics” noise adaptation: handles quasi-stationary additive noise (random and stochastic) without explicit assumptions
• Limitations:
  – Not designed to address transient noise
  – Noise adaptation sensitive to “push-to-talk” effects
• Integrated 2.4 kbps MELP/HENPP demonstrated successfully in low- to moderate-perplexity ASR (figure: comparison of LPC-10, MELP, and MELP/HENPP)
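HENPP itself is an AT&T algorithm whose details are not reproduced here; the sketch below is only a generic multiplicative spectral-suppression stand-in, with a crude sliding-minimum noise estimate in the spirit of "minimum statistics" and a gain floor to avoid musical noise. Window length and floor are arbitrary illustrative values.

```python
import numpy as np

def suppress(frames_mag, win=8, floor=0.1):
    """Generic multiplicative spectral noise suppression (NOT HENPP).

    frames_mag: (T, F) magnitude spectra. The per-bin noise level is
    estimated as the minimum over a sliding window of recent frames
    (a crude stand-in for minimum statistics), and each bin is scaled
    by a Wiener-like gain clamped to a floor so that residual noise
    stays broadband rather than "musical".
    """
    T, F = frames_mag.shape
    out = np.empty_like(frames_mag)
    for t in range(T):
        lo = max(0, t - win + 1)
        noise = frames_mag[lo:t + 1].min(axis=0)
        snr_term = (noise / np.maximum(frames_mag[t], 1e-12)) ** 2
        gain = np.maximum(1.0 - snr_term, floor)
        out[t] = gain * frames_mag[t]
    return out

rng = np.random.default_rng(2)
noisy = np.abs(rng.normal(0.0, 1.0, size=(50, 16))) + 0.5
cleaned = suppress(noisy)
```

Since the gain never exceeds 1, the output spectrum is attenuated everywhere; bins near the noise floor are cut hardest.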
EVALUATION SYSTEM - DATA AND TRAINING
• 10 hours of SPINE data used for training (no DRT words)
• 100 frames per second, 25 ms Hamming window
• 12 base FFT-derived mel cepstra with side-based CMS and log-energy
• Delta and acceleration coefficients
• 44-phone set to cover the SPINE data
• 909 models, 2725 states
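The framing parameters above imply a 10 ms frame shift (100 frames per second) with a 25 ms window, i.e. 15 ms of overlap. The sketch below assumes a 16 kHz sample rate purely for illustration.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=25.0, rate_hz=100.0):
    """Slice a waveform into overlapping Hamming-windowed frames.

    100 frames/s with a 25 ms window means a 10 ms shift, so adjacent
    frames overlap by 15 ms. The 16 kHz rate is an assumption here.
    """
    frame_len = int(sample_rate * frame_ms / 1000.0)   # 400 samples
    shift = int(sample_rate / rate_hz)                 # 160 samples
    window = np.hamming(frame_len)
    num = 1 + (len(signal) - frame_len) // shift
    return np.stack([signal[i * shift:i * shift + frame_len] * window
                     for i in range(num)])

one_second = np.zeros(16000)
frames = frame_signal(one_second)
```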
EVALUATION SYSTEM - LM AND LEXICON
• 5226 words in the SPINE lexicon, provided by CMU
• CMU language model
• Bigram LM obtained by discarding the trigrams
• LM size: 5226 unigrams, 12511 bigrams
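"Throwing away the trigrams" of an ARPA-format LM can be sketched as dropping the 3-gram section and its header count. This is an illustrative simplification: the surviving back-off weights are left untouched, so the result is an approximation rather than a properly re-estimated bigram LM, and the toy LM text below is invented.

```python
def drop_trigrams(arpa_text):
    """Reduce an ARPA-format trigram LM to a bigram LM by discarding
    the 3-gram section (and its count line in the data header)."""
    out, skipping = [], False
    for line in arpa_text.splitlines():
        if line.startswith("ngram 3="):
            continue
        if line.strip() == "\\3-grams:":
            skipping = True
            continue
        if line.strip() == "\\end\\":
            skipping = False
        if not skipping:
            out.append(line)
    return "\n".join(out)

toy_lm = """\\data\\
ngram 1=2
ngram 2=1
ngram 3=1

\\1-grams:
-0.3 hello -0.2
-0.3 world -0.2

\\2-grams:
-0.1 hello world

\\3-grams:
-0.05 hello world hello

\\end\\"""
bigram_lm = drop_trigrams(toy_lm)
```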
EVALUATION SYSTEM - DECODING
• Single-stage decoding using word-internal acoustic models and a bigram LM
RESULTS AND ANALYSIS
• Lattice generation and lattice rescoring will improve results
• Informal analysis of the evaluation data and results:
  – Negative correlation between recognition performance and SNR
RESULTS AND ANALYSIS (cont.)
• Clean speech example: “B” side of spine_eval_033 (281 total words)
• Low-SNR example: “A” side of spine_eval_021 (115 total words)
RESULTS AND ANALYSIS (cont.)
• HENPP was designed for human listening
  – Optimized to raise DRT scores in the presence of noise and coding
  – DRT scores and WER tend to be poorly correlated; minor perceptual distortions often have a magnified adverse effect on speech recognizers
• Need to retune the HENPP
  – The algorithm is very effective for robust recognition of noisy speech at low SNRs
  – Too aggressive when applied to clean speech: some information is lost
  – Minor adjustments should preserve noisy-speech performance and boost clean-speech performance
ISSUES
• Decoding was slow on this task
  – 100x real-time (on a 600 MHz Pentium)
  – A newer version of the ISIP-STT decoder will be faster
  – Had to use a bigram LM in the allowed time frame
• Large amount of eval data
  – Combined with slow decoding, this seriously limited experiments
• The devil is in the details:
  – Certain training data problematic (e.g., “Noise field is up”)
  – Automatic segmentation (having the eval segmentations would help)
CONCLUSIONS
• MITRE / MS State-ISIP system: a standard recognition approach with an advanced noise-preprocessing front end
• Time limitation: could only officially report on the baseline system
• Performed an initial experiment with noise preprocessing (AT&T HENPP)
  – Overall word error rate did not improve
  – Informal analysis suggests that noise pre-processing does help for low-SNR conversations
  – Difficulty with high-SNR conversations
• There is potential for improvement with application-specific tuning of HENPP
• The approach is very promising for coded speech in commercial and military environments
BACKUP SLIDES
PRUNING STRATEGIES
• Separate beam at each level of the search hierarchy
• Maximum Active Phone Model Instance (MAPMI) pruning
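MAPMI pruning caps the number of active phone-model instances per frame regardless of the beam width; a minimal sketch, with hypothetical (instance id, log score) pairs:

```python
import heapq

def mapmi_prune(instances, max_active):
    """Maximum Active Phone Model Instance (MAPMI) pruning.

    Beam pruning alone can let the number of active phone-model
    instances explode on noisy frames; MAPMI additionally keeps only
    the max_active best-scoring instances each frame.
    """
    if len(instances) <= max_active:
        return instances
    return heapq.nlargest(max_active, instances, key=lambda inst: inst[1])

# Hypothetical (instance_id, log_score) pairs for one frame.
active = [("t-aa+p.s2", -120.0), ("s-t+aa.s1", -95.5),
          ("aa-p+sil.s0", -240.3), ("sil-s+t.s1", -101.2)]
survivors = mapmi_prune(active, max_active=2)
```

Unlike a beam, this bound makes per-frame decoding cost predictable, at the risk of pruning the correct path on hard frames.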