1 Data Sampling for Acoustic Model Training Özgür Çetin, International Computer Science Institute Andreas Stolcke, SRI / International Computer Science Institute Barbara Peskin, International Computer Science Institute
2 Overview Introduction Sampling Criteria Experiments Summary
3 Data Sampling Select a subset of data for acoustic model training A variety of scenarios where sampling can be useful: –May reduce transcription costs if data are untranscribed, e.g. Broadcast News –May filter out bad data w/ transcription/alignment errors –May reduce training/decoding costs for a target performance –Could train multiple systems on different subsets of data, e.g. for cross-system adaptation –May improve accuracy in cross-domain tasks, e.g. CTS acoustic models for meetings recognition
4 Data Sampling (contd.) Key assumptions –Maximum likelihood training –Transcribed data –Utterance-by-utterance data selection Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data Comparison metric: word error rate (WER) Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important
5 Experimental Paradigm Train: Data sampled from male Fisher data (778 hrs – whatever was available in Spring ‘04) Test: 2004 NIST development set BBN + LDC segmentations Decision-tree tied triphones – an automatic mechanism to control model complexity SRI Decipher recognition system –Not the standard system; runs fast and involves only one acoustic model
6 Experimental Paradigm (contd.) Training –Viterbi-style maximum likelihood training –Cross-word models; 128 mixtures per tied state Decoding –Phone-loop MLLR –Decoding and lattice generation –Lattice rescoring w/ a 4-gram LM –Expansion of lattices w/ a 3-gram LM –N-best decoding from expanded lattices –N-best rescoring w/ a 4-gram LM + duration models –Confusion network decoding of final hypothesis
7 Sampling Criteria Random sampling Likelihood-derived criteria Accuracy-based criteria Context coverage
8 Random Sampling Select an arbitrary subset of available data Very simple; doesn’t introduce any systematic variations Ideal for experimentation w/ small amounts of training data Data statistics –Average utterance length: 3.77 secs –Average silence% per utterance: 20%
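As a rough illustration of this setup, the sketch below draws nested random subsets by duration budget; the utterance list and budgets are hypothetical stand-ins, not the actual Fisher partitioning scripts.

```python
import random

def random_hierarchical_subsets(utterances, budgets_hours, seed=0):
    """Draw nested random subsets of utterances, one per duration budget.

    utterances: list of (utt_id, duration_seconds) pairs
    budgets_hours: increasing training-set sizes, e.g. [16, 32, 64, 128]
    Returns a dict mapping budget -> list of utt_ids; smaller subsets are
    contained in the larger ones (hierarchical subsets).
    """
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)

    subsets, selected, total_sec, idx = {}, [], 0.0, 0
    for budget in sorted(budgets_hours):
        limit = budget * 3600.0
        while idx < len(shuffled) and total_sec < limit:
            utt_id, dur = shuffled[idx]
            selected.append(utt_id)
            total_sec += dur
            idx += 1
        subsets[budget] = list(selected)
    return subsets
```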
9 Results: Random Sampling WER for random, hierarchical subsets of training data Based on a single random sample Incremental gains under our ML training paradigm
10 Likelihood-based Criteria Select utterances according to an utterance-level acoustic likelihood score: score = utterance likelihood / number of frames Pros –Very simple; readily computed –Utterances w/ low scores tend to indicate transcription errors or long utterances; high scores tend to indicate long silences Cons –Likelihood has no direct relevance to accuracy –May need additional normalization to deal w/ silence Can argue for selecting utterances w/ low, high, and average likelihood scores
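A minimal sketch of the per-frame-normalized likelihood score and of the "average likelihood" selection regime; it assumes per-frame log-likelihoods (e.g. from a forced alignment) and utterance durations are already available, and is not the actual system code.

```python
import numpy as np

def likelihood_scores(frame_loglikes):
    """Per-utterance score = total acoustic log-likelihood / number of frames.

    frame_loglikes: dict utt_id -> 1-D array of per-frame log-likelihoods
    (assumed to come from a forced alignment of the reference transcript).
    """
    return {u: float(np.sum(ll)) / len(ll) for u, ll in frame_loglikes.items()}

def select_average_likelihood(scores, durations, budget_hours):
    """Pick the utterances whose scores are closest to the mean score,
    up to the requested amount of data (the 'average' selection regime)."""
    mean = np.mean(list(scores.values()))
    ranked = sorted(scores, key=lambda u: abs(scores[u] - mean))
    selected, total_sec = [], 0.0
    for u in ranked:
        if total_sec >= budget_hours * 3600.0:
            break
        selected.append(u)
        total_sec += durations[u]
    return selected
```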
11 Normalized Likelihood (Speech + Non-Speech) Per-frame utterance likelihoods on male Fisher data [Figure: PDF of per-utterance likelihood scores] Unimodal distribution, simplifying selection regimes Select utterances w/ low, high, and average likelihoods High-likelihood utterances tend to have a lot of silence Use likelihood only from speech frames
12 Normalized Likelihood (Speech) Use likelihood only from speech frames [Figure: PDF of speech-only likelihood scores] More concentrated, shifted towards lower likelihoods
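A small illustration of restricting the normalization to speech frames, assuming a frame-level speech/non-speech labeling from the alignment is available (hypothetical inputs, not Decipher output formats).

```python
import numpy as np

def speech_only_score(frame_loglikes, speech_mask):
    """Normalized likelihood computed from speech frames only.

    frame_loglikes: per-frame log-likelihoods for one utterance
    speech_mask:    boolean array, True where the alignment labels the
                    frame as speech (non-silence)
    """
    speech_ll = np.asarray(frame_loglikes)[np.asarray(speech_mask)]
    if speech_ll.size == 0:
        return float("-inf")  # all-silence utterance; effectively excluded
    return float(speech_ll.mean())
```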
13 Results: Likelihood-based Sampling [Figures: WER vs. amount of training data, w/ speech + non-speech and w/ speech only] Selecting utterances w/ average likelihood scores performs the best No benefit over random sampling if non-speech frames contribute to the likelihoods; 0.5% absolute improvement over random sampling for 256 hours of data if non-speech frames are excluded
14 Accuracy-based Criteria Select utterances based on their recognition difficulty Word and phone error rates, or lattice entropy Pros –Directed towards the final objective (WER) –Straightforward to calculate, at some additional cost Cons –Accuracy seems to be highly concentrated (across utterances) Focus on average phone accuracy per utterance
15 Phone Accuracy Average phone accuracy per utterance, after a monotonic transformation f(x) = log(1 - x) (to spread the distribution) [Figure: PDF of transformed phone-accuracy scores]
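The transform and a difficulty ranking could look roughly like the sketch below; the per-utterance accuracy values are assumed to come from a phone recognition pass scored against the reference, which is not shown here.

```python
import math

def spread_accuracy(phone_accuracy):
    """Monotonic transform f(x) = log(1 - x), used to spread the
    per-utterance phone accuracies, which cluster near 1.0."""
    eps = 1e-6  # guard against log(0) for perfectly recognized utterances
    return math.log(max(1.0 - phone_accuracy, eps))

def rank_by_difficulty(utt_accuracies, easiest_first=True):
    """Rank utterances by transformed phone accuracy; easiest_first=False
    puts the most difficult (lowest-accuracy) utterances first."""
    return sorted(utt_accuracies,
                  key=lambda u: spread_accuracy(utt_accuracies[u]),
                  reverse=not easiest_first)
```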
16 Results: Accuracy-based Sampling For small amounts of training data (< 128 hours), utterances w/ low phone recognition accuracy perform better At larger amounts of data, training on more difficult utterances seems to be more advantageous, and looks promising to perform better than random sampling
17 Triphone Coverage Under a generative modeling paradigm (e.g. HMMs) and ML estimation, one might argue that it is sufficient to observe enough prototypes of each triphone to accurately estimate its distribution Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent triphones Greedy utterance selection to maximize the entropy of the triphone count distribution
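A naive sketch of the greedy, entropy-maximizing selection described above; the data structures are hypothetical and the quadratic search is kept for clarity rather than efficiency.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy of the (normalized) triphone count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def greedy_triphone_selection(utt_triphones, budget):
    """Greedily add the utterance that most increases the entropy of the
    pooled triphone counts, favoring utterances w/ infrequent triphones.

    utt_triphones: dict utt_id -> list of triphone labels in that utterance
    budget: number of utterances to select (a duration budget works the same way)
    """
    selected, pool = [], Counter()
    remaining = set(utt_triphones)
    while remaining and len(selected) < budget:
        best_u, best_h = None, float("-inf")
        for u in remaining:
            h = entropy(pool + Counter(utt_triphones[u]))
            if h > best_h:
                best_u, best_h = u, h
        selected.append(best_u)
        pool.update(utt_triphones[best_u])
        remaining.remove(best_u)
    return selected
```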
18 Results: Triphone Coverage-based Sampling For small amounts of data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria No advantage (even some degradation) as compared to random sampling May need many examples of frequent triphones to get a better coverage of non-contextual variations, e.g. speakers
19 Summary Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and those based on likelihood, accuracy, and triphone coverage) Found that likelihood-based selection after removing silences performs the best and slightly improves over random sampling (0.5% abs.) No significant performance improvement Caveat: –Our accuracy-based and triphone coverage-based selection criteria are rather simplistic
20 Future Work Tasks where the data quality is more important –Untranscribed data –Discriminative training More sophisticated accuracy and context-coverage criteria, e.g. lattice entropy/confidence Data selection for cross-domain tasks, e.g. CTS data for Meetings recognition Speaker-level data selection –Could be useful for cross-adaptation methods