Dec. 4-5, 2003, EARS STT Workshop
Broadcast News Training Experiments
Anand Venkataraman, Dimitra Vergyri, Wen Wang, Ramana Rao Gadde, Martin Graciarena, Horacio Franco, Jing Zheng and Andreas Stolcke
Speech Technology & Research Laboratory, SRI International, Menlo Park, CA

Goals
Assess the effect of TDT-4 data (not previously used) on the SRI BN system.
Explore alternatives for using closed-caption (CC) transcripts for acoustic and LM training.
Specifically, investigate an algorithm for “repairing” inaccuracies in CC transcripts.
Run an initial test of the voicing feature front end (originally developed for CTS) on BN.

Talk Overview
BN training on TDT-4 CC data
–Generation of raw transcripts
–Waveform segmentation
–Transcript repair with FlexAlign
–FlexAlign output for LM training
–Effect of amount of training data
–Comparison with CUED TDT-4 transcripts
Ongoing effort on voicing features for BN acoustic modeling

TDT-4 Training: Generation of Waveform Segments and Reference Transcripts
References were assumed to be delimited by the opening and closing TEXT tags in the LDC transcripts.
The speech signal was cut using the time marks extracted from the tags surrounding the TEXT elements.
Long waveforms were identified and recut at progressively shorter pauses until all waveforms were 30 s or shorter.
Forced alignment used PTM acoustic models that did not require speaker-level normalization.
The alignment was “flexible” forced alignment (FlexAlign), described on the next slide.

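The recutting step can be illustrated with a minimal sketch. The slides do not say how pauses were detected or which pause lengths were tried, so the function name `recut`, the pause list, and the `min_pause_steps` schedule are all assumptions made for illustration. The idea shown is only the one stated above: pieces longer than 30 s are split at long pauses first, and shorter pauses are used only where a piece is still too long.

```python
def recut(start, end, pauses, max_len=30.0, min_pause_steps=(1.0, 0.5, 0.2, 0.1)):
    """Split the waveform (start, end) at pauses until pieces are <= max_len.

    pauses: list of (pause_start, pause_end) times in seconds (assumed given).
    min_pause_steps: hypothetical schedule of minimum pause durations,
    tried from longest to shortest.
    """
    pieces = [(start, end)]
    for min_pause in min_pause_steps:
        next_pieces = []
        for s, e in pieces:
            if e - s <= max_len:
                # Already short enough: keep as is.
                next_pieces.append((s, e))
                continue
            # Cut at the midpoint of every sufficiently long pause
            # that lies strictly inside this piece.
            cuts = sorted((ps + pe) / 2.0 for ps, pe in pauses
                          if pe - ps >= min_pause and s < ps and pe < e)
            bounds = [s] + cuts + [e]
            next_pieces.extend(zip(bounds[:-1], bounds[1:]))
        pieces = next_pieces
    # Pieces may still exceed max_len if no usable pause exists.
    return pieces


# Toy example: a 70 s waveform with three pauses of different lengths.
example_pauses = [(20.0, 21.2), (45.0, 45.3), (60.0, 60.6)]
example_pieces = recut(0.0, 70.0, example_pauses)
```

In this toy case the 1.2 s pause is used first, then the 0.6 s pause, and finally the 0.3 s pause, which is needed only because one piece is still longer than 30 s.
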
FlexAlign
Special lattices were generated for each segment.
Each word was preceded by an optional pause and an optional nonlexical word model.
The goal was to simultaneously delete noisy or mistranscribed text and insert disfluencies.

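As a rough illustration of the lattice topology described above, the sketch below builds a linear graph in which every transcript word is preceded by an optional pause arc and an optional nonlexical-word arc. The arc representation, the labels `<pause>` and `<nonlex>`, and the function names are assumptions, not the actual SRI lattice format. The slides do not state how deletions of mistranscribed text are realized, so that part is left out.

```python
def flex_lattice(words, pause="<pause>", nonlex="<nonlex>"):
    """Return (arcs, final_state) for a linear lattice over `words`.

    Each arc is (source_state, target_state, label); a label of None
    marks a skip arc, which is what makes the preceding unit optional.
    """
    arcs = []
    state = 0
    for word in words:
        arcs.append((state, state + 1, pause))       # optional pause ...
        arcs.append((state, state + 1, None))        # ... or skip it
        arcs.append((state + 1, state + 2, nonlex))  # optional nonlexical word ...
        arcs.append((state + 1, state + 2, None))    # ... or skip it
        arcs.append((state + 2, state + 3, word))    # the transcript word itself
        state += 3
    return arcs, state


def label_sequences(arcs, final, state=0):
    """Enumerate all label sequences from `state` to `final` (tiny lattices only)."""
    if state == final:
        return [[]]
    sequences = []
    for src, dst, label in arcs:
        if src == state:
            for tail in label_sequences(arcs, final, dst):
                sequences.append(([label] if label is not None else []) + tail)
    return sequences


example_arcs, example_final = flex_lattice(["hello", "world"])
example_sequences = label_sequences(example_arcs, example_final)
```

For a two-word segment this yields 16 paths, ranging from the plain transcript to one with a pause and a nonlexical word before each word. A recognizer run over such a lattice chooses among these paths.
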
Optional Nonlexical Word
Transition probabilities were approximated by the unigram relative frequencies in the 96/97 BN acoustic training corpus.

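A unigram relative frequency is simply a token's count divided by the total token count. The sketch below shows that computation on a toy token list; the corpus contents, the set of nonlexical tokens, and the function name are hypothetical and stand in for the 96/97 BN training transcripts.

```python
from collections import Counter


def unigram_relative_frequencies(tokens, nonlexical):
    """Relative frequency (count / total tokens) of each nonlexical token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {word: counts[word] / total for word in nonlexical}


# Toy corpus, for illustration only.
toy_tokens = "uh the news uh um today the".split()
toy_probs = unigram_relative_frequencies(toy_tokens, ["uh", "um"])
```

Here `toy_probs["uh"]` is 2/7 and `toy_probs["um"]` is 1/7. Values of this kind would then serve as the probabilities of entering the optional nonlexical-word arcs.
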
Training Procedure
Final references were the output of the recognizer on the FlexAlign lattices.
WER with respect to the original CC transcripts: 5.0% (Sub 0.4, Ins 4.4, Del 0.3).
Standard acoustic models were built using Viterbi training on these transcripts.

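The Sub/Ins/Del breakdown above is the usual word-level edit-distance decomposition. The following is a minimal sketch of how such a breakdown can be computed, with the original CC transcript as reference and the FlexAlign output as hypothesis. It is not the scoring tool used for the slides, and the example sentences are invented.

```python
def wer_breakdown(ref, hyp):
    """Word error rate and its Sub/Ins/Del parts, in percent of reference words."""
    n, m = len(ref), len(hyp)
    # cost[i][j] = (edits, subs, ins, dels) for aligning ref[:i] with hyp[:j]
    cost = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, n + 1):
        cost[i][0] = (i, 0, 0, i)  # delete all reference words
    for j in range(1, m + 1):
        cost[0][j] = (j, 0, j, 0)  # insert all hypothesis words
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            e, s, a, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (e, s, a, d)          # match, no cost
            else:
                diag = (e + 1, s + 1, a, d)  # substitution
            e, s, a, d = cost[i][j - 1]
            ins = (e + 1, s, a + 1, d)       # extra hypothesis word
            e, s, a, d = cost[i - 1][j]
            dele = (e + 1, s, a, d + 1)      # missing reference word
            cost[i][j] = min(diag, ins, dele)  # fewest total edits wins
    edits, subs, inss, dels = cost[n][m]
    return {"wer": 100.0 * edits / n, "sub": 100.0 * subs / n,
            "ins": 100.0 * inss / n, "del": 100.0 * dels / n}


# Invented example: the repaired transcript adds two filled pauses.
cc_ref = "the president said today".split()
flex_hyp = "uh the president uh said today".split()
example_wer = wer_breakdown(cc_ref, flex_hyp)
```

In the example both differences are insertions, which mirrors the pattern on the slide, where insertions (4.4) dominate the 5.0% difference.
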
Does FlexAlignment Help LM Training?
[Table: columns Original, Subset, FlexAlign, Mixture; rows PPL(TDT4), PPL(Eval03), and a row labeled “grams”; cell values missing.]
“Subset”: random selection of original CC references to match the token count of the FlexAlign transcripts.
Note: the only disfluency in the test data was “uh”.

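The “Subset” condition controls for training-set size. One simple way to build such a subset is sketched below: shuffle the original references and add them until the token count reaches the target. The function name, the fixed seed, and the toy data are assumptions; the slides do not describe the actual selection procedure beyond “random”.

```python
import random


def matched_subset(references, target_tokens, seed=0):
    """Randomly pick references until their token count reaches target_tokens."""
    rng = random.Random(seed)  # fixed seed so the selection is repeatable
    order = list(references)
    rng.shuffle(order)
    subset, n_tokens = [], 0
    for ref in order:
        if n_tokens >= target_tokens:
            break
        subset.append(ref)
        n_tokens += len(ref.split())
    return subset


# Toy data, for illustration only.
toy_refs = ["good evening", "here is the news", "more after the break",
            "thank you", "we will be right back"]
toy_subset = matched_subset(toy_refs, target_tokens=8)
toy_subset_tokens = sum(len(r.split()) for r in toy_subset)
```

The subset overshoots the target by less than one reference, which is negligible at corpus scale.
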
An Accidental Experiment
What happens if we train on only a subset of the data? Is the performance proportionately worse?
[Table: columns Baseline, Subset TDT4, All TDT4; rows Hours, Speakers, WER; cell values missing.]

Comparison with CUED TDT-4 Training Transcripts
CUED TDT-4 transcripts were generated by an STT system with a biased LM (trained on TDT-4 CC).
The CUED transcripts used here were generated from CU word time information and SRI waveform segments.
They sometimes have “holes” where one of our waveform segments spans more than one CUED waveform (probably due to ad removal).
WER with respect to CC transcriptions:
–Originals: 18.2% (Sub 7.7, Ins 3.2, Del 7.2)
–Flex-align: 19.5% (Sub 10.1, Ins 3.8, Del 5.6)
A fairer comparison would use CUED transcripts with CUED segments for training the acoustic models, so take these results with a grain of salt!

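Mapping time-stamped words onto a different segmentation can be sketched as follows. The tuple layout, the midpoint rule, and the function name are assumptions; the slides only say that CU word times and SRI segments were combined. The toy example also shows how a “hole” arises when a segment covers a stretch for which no words are available.

```python
def words_for_segment(timed_words, seg_start, seg_end):
    """Return the tokens whose time midpoint lies in [seg_start, seg_end).

    timed_words: list of (start_sec, end_sec, token), assumed sorted by time.
    """
    return [token for start, end, token in timed_words
            if seg_start <= (start + end) / 2.0 < seg_end]


# Toy word times with nothing between 0.9 s and 12.0 s (a "hole").
toy_timed_words = [(0.0, 0.4, "good"), (0.4, 0.9, "evening"), (12.0, 12.5, "tonight")]
toy_segment_words = words_for_segment(toy_timed_words, 0.0, 15.0)
```

The resulting transcript for the 15 s segment is “good evening tonight”, although about 11 s of audio in the middle has no words assigned to it.
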
Results of First Decoding Pass
[Table: columns Experiments, Train Hours, WER; rows Baseline, TDT4-subset, TDT4-full, CUED transcripts; cell values missing.]

Multi-pass System Results
[Table: columns First pass WER, Multi-pass WER; rows Baseline, TDT, Rel. improvement; WER values missing. Rel. improvement: 8.8% (first pass), 6.2% (multi-pass).]
The multi-pass system used a new decoding strategy (described in a later talk).
But: MFC instead of PLP, and no SAT normalization in training (to save time).

Voicing Features
Test voicing features developed for the CTS system on BN STT (cf. Martigny talk).
–There, we obtained a 2% relative error reduction across stages.
Use the peak of the autocorrelation and the entropy of the higher-order cepstrum.
Use a window of 5 frames of the two voicing features.
Juxtapose MFCC plus deltas and double deltas with the window of voicing features.
Apply dimensionality reduction with HLDA.
The final feature vector has 39 dimensions.

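The feature assembly above can be sketched in a few lines. The sketch assumes a 39-dimensional MFCC+delta+double-delta vector per frame (for example 13 base coefficients, which the slides do not state) and repeats edge frames at utterance boundaries, which is also an assumption. It only builds the stacked vector; the HLDA projection back to 39 dimensions needs a trained transform and is not shown.

```python
def stack_voicing(mfcc_frames, voicing_frames, window=5):
    """Append a window of voicing features to each MFCC frame.

    mfcc_frames: list of per-frame MFCC+delta+double-delta vectors.
    voicing_frames: list of per-frame [autocorr_peak, cepstral_entropy] pairs.
    """
    half = window // 2
    n = len(mfcc_frames)
    stacked = []
    for t in range(n):
        context = []
        for k in range(t - half, t + half + 1):
            k = min(max(k, 0), n - 1)  # repeat the first/last frame at the edges
            context.extend(voicing_frames[k])
        stacked.append(list(mfcc_frames[t]) + context)
    return stacked


# Dummy features: 8 frames, 39 MFCC-based values and 2 voicing values each.
dummy_mfcc = [[0.0] * 39 for _ in range(8)]
dummy_voicing = [[float(t), float(t) + 0.5] for t in range(8)]
dummy_stacked = stack_voicing(dummy_mfcc, dummy_voicing)
```

Under these assumptions each stacked vector has 39 + 5 × 2 = 49 dimensions, which HLDA would then reduce to the final 39.
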
Voicing Features Results
Front end       WER
MFCC            22.6
MFCC+voicing    23.3
TDT-4 devtest set (first-pass results).
Used parameters equivalent to those optimized for the CTS system.
Need to investigate (reoptimize) front-end parameters for the higher bandwidth.
It is not clear what effect background music has on voicing features in BN.
Possible software issues.
With higher-bandwidth features, voicing features may be more redundant.

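Results in this talk are quoted as relative WER changes. As a small worked example using the two WERs on this slide, the sketch below computes the relative change of the MFCC+voicing front end against the MFCC baseline; the function name is illustrative.

```python
def relative_change(baseline_wer, new_wer):
    """Relative WER change in percent; positive means the new system is worse."""
    return 100.0 * (new_wer - baseline_wer) / baseline_wer


voicing_change = relative_change(22.6, 23.3)  # about 3.1% relative increase
```

A relative improvement, as reported for the TDT-4 training results, is the same quantity with the sign flipped.
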
Summary
Developed a CC transcript “repair” algorithm based on flexible alignment.
Training on “repaired” TDT-4 transcripts gives 8.8% (1st pass) to 6.2% (multi-pass) relative improvement over Hub-4 training.
Accidental result: leaving out 1/3 of the new data reduces the improvement only marginally.
Transcript “repair” is not suitable for LM training (yet).
No improvement from voicing features (yet); need to investigate parameters.

Future Work
Redo the comparison with alternative transcripts more carefully.
Investigate data filtering (e.g., based on reject word occurrences in FlexAlign output).
Add the rest of the data!
Further investigate the use of voicing features.