
1 International Computer Science Institute. Data Sampling for Acoustic Model Training. Özgür Çetin, International Computer Science Institute; Andreas Stolcke, SRI / International Computer Science Institute; Barbara Peskin, International Computer Science Institute

2 Overview  Introduction  Sampling Criteria  Experiments  Summary

3 Data Sampling  Select a subset of data for acoustic model training  A variety of scenarios where sampling can be useful: –May reduce transcription costs if data are untranscribed, e.g. Broadcast News –May filter out bad data w/ transcription/alignment errors –May reduce training/decoding costs for target performance –Could train multiple systems on different subsets of data, e.g. for cross-system adaptation –May improve accuracy in cross-domain tasks, e.g. CTS acoustic models for Meetings recognition

4 Data Sampling (contd.)  Key assumptions –Maximum likelihood training –Transcribed data –Utterance-by-utterance data selection  Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data  Comparison metric: word error rate (WER)  Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important

5 Experimental Paradigm  Train: Data sampled from male Fisher data (778 hrs, whatever was available in Spring '04)  Test: 2004 NIST development set  BBN + LDC segmentations  Decision-tree tied triphones, an automatic mechanism to control model complexity  SRI Decipher recognition system –Not the standard system; runs fast and involves only one acoustic model

6 Experimental Paradigm (contd.)  Training –Viterbi-style maximum likelihood training –Cross-word models; 128 mixtures per tied state  Decoding –Phone-loop MLLR –Decoding and lattice generation –Lattice rescoring w/ a 4-gram LM –Expansion of lattices w/ a 3-gram LM –N-best decoding from expanded lattices –N-best rescoring w/ a 4-gram LM + duration models –Confusion network decoding of final hypothesis

7 Sampling Criteria  Random sampling  Likelihood-derived criteria  Accuracy-based criteria  Context coverage

8 Random Sampling  Select an arbitrary subset of available data  Very simple; doesn’t introduce any systematic variations  Ideal for experimentation w/ small amounts of training data  Data statistics –Average utterance length: 3.77 secs –Average silence% per utterance: 20%
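As an illustrative sketch (not the authors' actual tooling), random utterance-level selection up to a target number of hours might look like the following; the (utterance_id, duration_seconds) input format is a hypothetical:

```python
import random

def random_sample(utterances, target_hours, seed=0):
    """Randomly select utterances until a target duration is reached.

    utterances: list of (utterance_id, duration_seconds) pairs
    (a hypothetical format, not the original system's).
    """
    rng = random.Random(seed)  # fixed seed for reproducible subsets
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    selected, total = [], 0.0
    target_secs = target_hours * 3600.0
    for utt_id, duration in shuffled:
        if total >= target_secs:
            break
        selected.append(utt_id)
        total += duration
    return selected
```

Reusing one seed across targets also yields the hierarchical (nested) subsets used in the random-sampling results: the 64-hour subset is contained in the 128-hour one, and so on.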

9 Results: Random Sampling  WER for random, hierarchical subsets of training data  Based on a single random sample  Incremental gains under our ML training paradigm

10 Likelihood-based Criteria  Select utterances according to utterance-level acoustic likelihood score: score = utterance likelihood / number of frames  Pros –Very simple; readily computed –Utterances w/ low and high scores tend to indicate transcription errors, long utterances, and long silences  Cons –Likelihood has no direct relevance to accuracy –May need additional normalization to deal w/ silence  Can argue for selecting utterances w/ low, high, and average likelihood scores
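The per-frame score above, and the selection of an "average"-likelihood band, can be sketched as follows; the tuple format and the quantile bounds are hypothetical illustrations, not the original implementation:

```python
import numpy as np

def select_by_likelihood(utts, band=(0.25, 0.75)):
    """Keep utterances whose per-frame likelihood score falls in a
    middle quantile band (the 'average' scores).

    utts: list of (utt_id, total_loglik, n_frames) tuples -- a
    hypothetical format; the band bounds are illustrative, not tuned.
    """
    # score = utterance likelihood / number of frames
    scores = np.array([ll / n for _, ll, n in utts])
    lo, hi = np.quantile(scores, band)
    return [u[0] for u, s in zip(utts, scores) if lo <= s <= hi]
```

Passing a band such as (0.0, 0.25) or (0.75, 1.0) would instead select the low- or high-likelihood utterances for comparison.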

11 Normalized Likelihood (Speech + Non-Speech)  Per-frame utterance likelihoods on male Fisher data  Unimodal distribution, simplifying selection regimes  Select utterances w/ low, high, and average likelihoods [Figure: PDF of per-frame likelihood scores]  High-likelihood utterances tend to have a lot of silence  Use likelihood only from speech frames

12 Normalized Likelihood (Speech)  Use likelihood only from speech frames  More concentrated, shifted towards lower likelihoods [Figure: PDF of speech-only likelihood scores]
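Restricting the normalization to speech frames can be sketched with a frame-level speech mask, e.g. from a forced alignment; the function name and input format are hypothetical:

```python
def speech_only_score(frame_logliks, is_speech):
    """Per-frame likelihood computed over speech frames only, so that
    long silences do not inflate the utterance score.

    is_speech: frame-level boolean mask, e.g. from a forced alignment
    (a hypothetical input format).
    """
    speech = [ll for ll, s in zip(frame_logliks, is_speech) if s]
    if not speech:
        return float("-inf")  # all-silence utterance: rank it last
    return sum(speech) / len(speech)
```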

13 Results: Likelihood-based Sampling [Figure: WER curves, w/ speech + non-speech vs. w/ speech only]  Selecting utterances w/ average likelihood scores performs the best  No benefit over random sampling if likelihoods from non-speech frames contribute  0.5% absolute improvement over random sampling for 256 hours of data, if non-speech frames are excluded

14 Accuracy-based Criteria  Select utterances based on their recognition difficulty  Word and phone error rates, or lattice entropy  Pros –Directed towards the final objective (WER) –Straightforward to calculate w/ additional cost  Cons –Accuracy seems to be highly concentrated (across utt.’s)  Focus on average phone accuracy per utterance

15 Phone Accuracy  Average phone accuracy per utterance, after a monotonic transformation (to spread the distribution): f(x) = log(1 - x) [Figure: PDF of transformed accuracy scores]
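A minimal sketch of the transform f(x) = log(1 - x); the clipping epsilon is an assumption added here to guard against utterances with perfect phone accuracy:

```python
import math

def spread_accuracy(acc, eps=1e-6):
    """Monotonic transform f(x) = log(1 - x) that spreads a
    phone-accuracy distribution concentrated near 1.0.

    eps is an assumed floor to avoid log(0) when acc == 1.
    """
    return math.log(max(1.0 - acc, eps))
```

Because the transform is monotonically decreasing, easier (high-accuracy) utterances map to large negative values while the scores near 1.0 get pulled apart.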

16 Results: Accuracy-based Sampling  For small amounts of training data (< 128 hours), utterances w/ low phone recognition accuracy perform better  At larger amounts of data, training on more difficult utterances seems to be more advantageous (promising to perform better than random sampling)

17 Triphone Coverage  Under a generative modeling paradigm (e.g. HMMs) and ML estimation, one might argue that seeing enough prototypes of each distribution suffices to estimate it accurately  Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent triphones  Greedy utterance selection to maximize the entropy of the triphone count distribution
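The greedy entropy-maximizing selection can be sketched as follows; this simplified version budgets by utterance count rather than hours, and all names and formats are hypothetical:

```python
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy (in nats) of a Counter of triphone counts."""
    total = sum(counts.values())
    return -sum((c / total) * math.log(c / total) for c in counts.values())

def greedy_coverage(utts, budget):
    """Greedily add the utterance that most increases the entropy of
    the pooled triphone count distribution.

    utts: list of (utt_id, list_of_triphones) pairs -- a hypothetical
    format; budgeting by utterance count is a simplification of
    selecting up to a target number of hours.
    """
    selected, pool = [], Counter()
    remaining = dict(utts)
    for _ in range(min(budget, len(remaining))):
        best_id, best_h = None, float("-inf")
        for uid, tris in remaining.items():
            # entropy of the pool if this utterance were added
            h = entropy(pool + Counter(tris))
            if h > best_h:
                best_id, best_h = uid, h
        selected.append(best_id)
        pool += Counter(remaining.pop(best_id))
    return selected
```

Maximizing entropy flattens the triphone count distribution, which is exactly what favors utterances containing infrequent triphones.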

18 Results: Triphone Coverage-based Sampling  For small amounts of data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria  No advantage (even some degradation) as compared to random sampling  May need many examples from frequent triphones to get a better coverage of non-contextual variations, e.g. speakers

19 Summary  Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and criteria based on likelihood, accuracy, and triphone coverage)  Found that likelihood-based selection after removing silences performs the best and slightly improves over random sampling (0.5% abs.)  No significant performance improvement overall  Caveat: –Our accuracy-based and triphone coverage-based selection criteria are rather simplistic

20 Future Work  Tasks where the data quality is more important –Untranscribed data –Discriminative training  More sophisticated accuracy and context-coverage criteria, e.g. lattice entropy/confidence  Data selection for cross-domain tasks, e.g. CTS data for Meetings recognition  Speaker-level data selection –Could be useful for cross-adaptation methods

