Data Sampling for Acoustic Model Training
Özgür Çetin, International Computer Science Institute
Andreas Stolcke, SRI / International Computer Science Institute
Barbara Peskin, International Computer Science Institute
Overview
Introduction
Sampling Criteria
Experiments
Summary
Data Sampling
Select a subset of data for acoustic model training
A variety of scenarios where sampling can be useful:
–May reduce transcription costs if data are untranscribed, e.g., Broadcast News
–May filter out bad data w/ transcription/alignment errors
–May reduce training/decoding costs for a target performance level
–Could train multiple systems on different subsets of data, e.g., for cross-system adaptation
–May improve accuracy in cross-domain tasks, e.g., CTS acoustic models for meetings recognition
Data Sampling (contd.)
Key assumptions:
–Maximum-likelihood training
–Transcribed data
–Utterance-by-utterance data selection
Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data
Comparison metric: word error rate (WER)
Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important
Experimental Paradigm
Train: data sampled from male Fisher data (778 hrs – whatever was available in Spring ’04)
Test: 2004 NIST development set, BBN + LDC segmentations
Decision-tree tied triphones – an automatic mechanism to control model complexity
SRI Decipher recognition system
–Not the standard system; runs fast and involves only one acoustic model
Experimental Paradigm (contd.)
Training
–Viterbi-style maximum-likelihood training
–Cross-word models; 128 mixtures per tied state
Decoding
–Phone-loop MLLR
–Decoding and lattice generation
–Lattice rescoring w/ a 4-gram LM
–Expansion of lattices w/ a 3-gram LM
–N-best decoding from expanded lattices
–N-best rescoring w/ a 4-gram LM + duration models
–Confusion-network decoding of the final hypothesis
Sampling Criteria
Random sampling
Likelihood-derived criteria
Accuracy-based criteria
Context coverage
Random Sampling
Select an arbitrary subset of the available data
Very simple; doesn’t introduce any systematic variations
Ideal for experimentation w/ small amounts of training data
Data statistics:
–Average utterance length: 3.77 secs
–Average silence percentage per utterance: 20%
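As a concrete illustration (not the authors' actual tooling), here is a minimal Python sketch of utterance-level random sampling up to an hour budget; the list-of-durations interface, function name, and budget logic are assumptions of mine:

```python
import random

def random_sample(durations_secs, target_hours, seed=0):
    """Return indices of an arbitrary utterance subset totalling
    roughly target_hours of audio (silence included)."""
    rng = random.Random(seed)
    order = list(range(len(durations_secs)))
    rng.shuffle(order)
    selected, total = [], 0.0
    for i in order:
        if total >= target_hours * 3600.0:
            break
        selected.append(i)
        total += durations_secs[i]
    return selected
```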
Results: Random Sampling
WER for random, hierarchical subsets of the training data
Based on a single random sample
Incremental gains under our ML training paradigm
Likelihood-based Criteria
Select utterances according to an utterance-level acoustic likelihood score:
score = utterance likelihood / number of frames
Pros
–Very simple; readily computed
–Utterances w/ low and high scores tend to indicate transcription errors/long utterances and long silences, respectively
Cons
–Likelihood has no direct relevance to accuracy
–May need additional normalization to deal w/ silence
Can argue for selecting utterances w/ low, high, or average likelihood scores
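A minimal sketch of this criterion, assuming per-utterance acoustic log-likelihoods and frame counts are available (e.g., from a forced alignment); the selection helper implements the "average score" regime, which performs best in the results below, and all names and the hour-budget interface are hypothetical:

```python
import numpy as np

def per_frame_score(utt_loglik, num_frames):
    """score = utterance acoustic log-likelihood / number of frames"""
    return utt_loglik / num_frames

def average_likelihood_select(scores, durations_secs, target_hours):
    """Keep the utterances whose per-frame scores lie closest to the
    corpus mean, up to the hour budget."""
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(np.abs(scores - scores.mean()))
    selected, total = [], 0.0
    for i in order:
        if total >= target_hours * 3600.0:
            break
        selected.append(int(i))
        total += durations_secs[i]
    return selected
```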
Normalized Likelihood (Speech + Non-Speech)
Per-frame utterance likelihoods on male Fisher data
Unimodal distribution, simplifying selection regimes
Select utterances w/ low, high, and average likelihoods
[Figure: PDF of per-frame likelihood scores]
High-likelihood utterances tend to have a lot of silence
Use likelihood only from speech frames
Normalized Likelihood (Speech Only)
Use likelihood only from speech frames
More concentrated, shifted towards lower likelihoods
[Figure: PDF of per-frame likelihood scores, speech frames only]
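A sketch of the speech-only normalization, assuming a frame-level speech/silence labeling (e.g., from a forced alignment) is available; the mask interface and the all-silence fallback are my own additions:

```python
import numpy as np

def speech_only_score(frame_logliks, is_speech):
    """Per-frame likelihood score computed over speech frames only,
    so long silences no longer inflate an utterance's score."""
    frame_logliks = np.asarray(frame_logliks, dtype=float)
    is_speech = np.asarray(is_speech, dtype=bool)
    if not is_speech.any():
        return float("-inf")  # all-silence utterance; rank it last
    return float(frame_logliks[is_speech].mean())
```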
Results: Likelihood-based Sampling
[Figures: WER curves w/ speech + non-speech vs. w/ speech only]
Selecting utterances w/ average likelihood scores performs best
No benefit over random sampling if likelihoods from non-speech frames contribute
0.5% absolute improvement over random sampling for 256 hours of data if non-speech frames are excluded
Accuracy-based Criteria
Select utterances based on their recognition difficulty
Word and phone error rates, or lattice entropy
Pros
–Directed towards the final objective (WER)
–Straightforward to calculate, w/ additional cost
Cons
–Accuracy seems to be highly concentrated (across utterances)
Focus on average phone accuracy per utterance
Phone Accuracy
Average phone accuracy per utterance, after a monotonic transformation to spread the distribution:
f(x) = log(1 − x)
[Figure: PDF of transformed phone-accuracy scores]
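A sketch of the transform and a difficulty-ordered selection built on it; the eps guard against log(0) and the hour-budget interface are additions of mine, not from the slides:

```python
import numpy as np

def spread_accuracy(phone_acc, eps=1e-6):
    """f(x) = log(1 - x): phone accuracies cluster near 1.0, and this
    monotonic transform spreads them out. eps avoids log(0) for
    perfectly recognized utterances."""
    x = np.clip(np.asarray(phone_acc, dtype=float), 0.0, 1.0 - eps)
    return np.log(1.0 - x)

def select_by_difficulty(phone_acc, durations_secs, target_hours,
                         hardest_first=True):
    """Rank utterances by transformed accuracy (scores near 0 are the
    least accurate, i.e., hardest) and fill the hour budget."""
    scores = spread_accuracy(phone_acc)
    order = np.argsort(scores)  # most negative first = most accurate
    if hardest_first:
        order = order[::-1]
    selected, total = [], 0.0
    for i in order:
        if total >= target_hours * 3600.0:
            break
        selected.append(int(i))
        total += durations_secs[i]
    return selected
```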
Results: Accuracy-based Sampling
For small amounts of training data (< 128 hours), utterances w/ low phone-recognition accuracy perform better
At larger amounts of data, training on more difficult utterances seems more advantageous (promising to perform better than random sampling)
Triphone Coverage
Under a generative modeling paradigm (e.g., HMMs) and ML estimation, one might argue that it suffices to estimate a distribution accurately once enough prototypes have been seen
Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent triphones
Greedy utterance selection to maximize the entropy of the triphone count distribution (a sketch follows below)
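A minimal sketch of such a greedy selector, assuming each utterance's triphone labels are available from an alignment; this naive version re-scores every candidate each round (quadratic in the number of utterances), whereas a practical implementation would update entropies incrementally. All names are hypothetical:

```python
import numpy as np
from collections import Counter

def count_entropy(counts):
    """Entropy of the normalized triphone count distribution."""
    c = np.array([v for v in counts.values() if v > 0], dtype=float)
    if c.size == 0:
        return 0.0
    p = c / c.sum()
    return float(-(p * np.log(p)).sum())

def greedy_coverage_select(utt_triphones, durations_secs, target_hours):
    """Greedily add the utterance whose triphones most increase the
    entropy of the pooled counts, until the hour budget is met."""
    counts = Counter()
    selected, total = [], 0.0
    remaining = set(range(len(utt_triphones)))
    while remaining and total < target_hours * 3600.0:
        best_i = max(
            remaining,
            key=lambda i: count_entropy(counts + Counter(utt_triphones[i])))
        counts += Counter(utt_triphones[best_i])
        selected.append(best_i)
        total += durations_secs[best_i]
        remaining.remove(best_i)
    return selected
```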
Results: Triphone Coverage-based Sampling
For small amounts of data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria
No advantage (even some degradation) as compared to random sampling
May need many examples of frequent triphones to get better coverage of non-contextual variations, e.g., speakers
Summary
Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and criteria based on likelihood, accuracy, and triphone coverage)
Found that likelihood-based selection after removing silences performs best and slightly improves over random sampling (0.5% abs.)
Otherwise, no significant performance improvements
Caveat:
–Our accuracy-based and triphone coverage-based selection criteria are rather simplistic
Future Work
Tasks where data quality is more important:
–Untranscribed data
–Discriminative training
More sophisticated accuracy and context-coverage criteria, e.g., lattice entropy/confidence
Data selection for cross-domain tasks, e.g., CTS data for meetings recognition
Speaker-level data selection
–Could be useful for cross-adaptation methods