1 Data Sampling for Acoustic Model Training Özgür Çetin, International Computer Science Institute Andreas Stolcke, SRI / International Computer Science Institute Barbara Peskin, International Computer Science Institute
2 Overview Introduction Sampling Criteria Experiments Summary
3 Data Sampling Select a subset of data for acoustic model training A variety of scenarios where sampling can be useful: –May reduce transcription costs if data are untranscribed, e.g. Broadcast News –May filter out bad data w/ transcription/alignment errors –May reduce training/decoding costs for a target performance –Could train multiple systems on different subsets of data, e.g. for cross-system adaptation –May improve accuracy in cross-domain tasks, e.g. CTS acoustic models for meetings recognition
4 Data Sampling (contd.) Key assumptions –Maximum likelihood training –Transcribed data –Utterance-by-utterance data selection Investigate the utility of various sampling criteria for CTS acoustic models (trained on Fisher) at different amounts of training data Comparison metric: word error rate (WER) Ultimate goals are tasks w/ unsupervised learning and discriminative training, where data quality is arguably much more important
5 Experimental Paradigm Train: Data sampled from male Fisher data (778 hrs – whatever was available in Spring ‘04) Test: 2004 NIST development set BBN + LDC segmentations Decision-tree tied triphones – an automatic mechanism to control model complexity SRI Decipher recognition system –Not the standard system; runs fast and involves only one acoustic model
6 Experimental Paradigm (contd.) Training –Viterbi-style maximum likelihood training –Cross-word models; 128 mixtures per tied state Decoding –Phone-loop MLLR –Decoding and lattice generation –Lattice rescoring w/ a 4-gram LM –Expansion of lattices w/ a 3-gram LM –N-best decoding from expanded lattices –N-best rescoring w/ a 4-gram LM + duration models –Confusion network decoding of final hypothesis
7 Sampling Criteria Random sampling Likelihood-derived criteria Accuracy-based criteria Context coverage
8 Random Sampling Select an arbitrary subset of available data Very simple; doesn’t introduce any systematic variations Ideal for experimentation w/ small amounts of training data Data statistics –Average utterance length: 3.77 secs –Average silence% per utterance: 20%
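As a rough illustration of this setup, the sketch below draws nested random subsets by duration budget; the utterance list and budgets are hypothetical stand-ins, not the actual Fisher partitioning scripts.

```python
import random

def random_hierarchical_subsets(utterances, budgets_hours, seed=0):
    """Draw nested random subsets of utterances, one per duration budget.

    utterances: list of (utt_id, duration_seconds) pairs
    budgets_hours: increasing training-set sizes, e.g. [16, 32, 64, 128]
    Returns a dict mapping budget -> list of utt_ids; smaller subsets are
    contained in the larger ones (hierarchical subsets).
    """
    rng = random.Random(seed)
    shuffled = utterances[:]
    rng.shuffle(shuffled)

    subsets, selected, total_sec, idx = {}, [], 0.0, 0
    for budget in sorted(budgets_hours):
        limit = budget * 3600.0
        while idx < len(shuffled) and total_sec < limit:
            utt_id, dur = shuffled[idx]
            selected.append(utt_id)
            total_sec += dur
            idx += 1
        subsets[budget] = list(selected)
    return subsets
```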
9 Results: Random Sampling WER for random, hierarchical subsets of training data Based on a single random sample Incremental gains under our ML training paradigm
10 Likelihood-based Criteria Select utterances according to an utterance-level acoustic likelihood score: score = utterance likelihood / number of frames Pros –Very simple; readily computed –Utterances w/ low scores tend to indicate transcription errors or long utterances; high scores tend to indicate long silences Cons –Likelihood has no direct relevance to accuracy –May need additional normalization to deal w/ silence Can argue for selecting utterances w/ low, high, and average likelihood scores
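A minimal sketch of the per-frame-normalized likelihood score and of the "average likelihood" selection regime; it assumes per-frame log-likelihoods (e.g. from a forced alignment) and utterance durations are already available, and is not the actual system code.

```python
import numpy as np

def likelihood_scores(frame_loglikes):
    """Per-utterance score = total acoustic log-likelihood / number of frames.

    frame_loglikes: dict utt_id -> 1-D array of per-frame log-likelihoods
    (assumed to come from a forced alignment of the reference transcript).
    """
    return {u: float(np.sum(ll)) / len(ll) for u, ll in frame_loglikes.items()}

def select_average_likelihood(scores, durations, budget_hours):
    """Pick the utterances whose scores are closest to the mean score,
    up to the requested amount of data (the 'average' selection regime)."""
    mean = np.mean(list(scores.values()))
    ranked = sorted(scores, key=lambda u: abs(scores[u] - mean))
    selected, total_sec = [], 0.0
    for u in ranked:
        if total_sec >= budget_hours * 3600.0:
            break
        selected.append(u)
        total_sec += durations[u]
    return selected
```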
11 Normalized Likelihood (Speech + Non-Speech) Per-frame utterance likelihoods on male Fisher data [Figure: PDF of per-utterance likelihood scores] Unimodal distribution, simplifying selection regimes Select utterances w/ low, high, and average likelihoods High-likelihood utterances tend to have a lot of silence Use likelihood only from speech frames
12 Normalized Likelihood (Speech) Use likelihood only from speech frames [Figure: PDF of speech-only likelihood scores] More concentrated, shifted towards lower likelihoods
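A small illustration of restricting the normalization to speech frames, assuming a frame-level speech/non-speech labeling from the alignment is available (hypothetical inputs, not Decipher output formats).

```python
import numpy as np

def speech_only_score(frame_loglikes, speech_mask):
    """Normalized likelihood computed from speech frames only.

    frame_loglikes: per-frame log-likelihoods for one utterance
    speech_mask:    boolean array, True where the alignment labels the
                    frame as speech (non-silence)
    """
    speech_ll = np.asarray(frame_loglikes)[np.asarray(speech_mask)]
    if speech_ll.size == 0:
        return float("-inf")  # all-silence utterance; effectively excluded
    return float(speech_ll.mean())
```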
13 Results: Likelihood-based Sampling [Figures: WER vs. amount of training data, w/ speech + non-speech and w/ speech only] Selecting utterances w/ average likelihood scores performs the best No benefit over random sampling if non-speech frames contribute to the likelihoods; 0.5% absolute improvement over random sampling for 256 hours of data if non-speech frames are excluded
14 Accuracy-based Criteria Select utterances based on their recognition difficulty Word and phone error rates, or lattice entropy Pros –Directed towards the final objective (WER) –Straightforward to calculate, at some additional cost Cons –Accuracy seems to be highly concentrated (across utterances) Focus on average phone accuracy per utterance
15 Phone Accuracy Average phone accuracy per utterance, after a monotonic transformation f(x) = log(1 - x) (to spread the distribution) [Figure: PDF of transformed phone-accuracy scores]
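The transform and a difficulty ranking could look roughly like the sketch below; the per-utterance accuracy values are assumed to come from a phone recognition pass scored against the reference, which is not shown here.

```python
import math

def spread_accuracy(phone_accuracy):
    """Monotonic transform f(x) = log(1 - x), used to spread the
    per-utterance phone accuracies, which cluster near 1.0."""
    eps = 1e-6  # guard against log(0) for perfectly recognized utterances
    return math.log(max(1.0 - phone_accuracy, eps))

def rank_by_difficulty(utt_accuracies, easiest_first=True):
    """Rank utterances by transformed phone accuracy; easiest_first=False
    puts the most difficult (lowest-accuracy) utterances first."""
    return sorted(utt_accuracies,
                  key=lambda u: spread_accuracy(utt_accuracies[u]),
                  reverse=not easiest_first)
```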
16 Results: Accuracy-based Sampling For small amounts of training data (< 128 hours), utterances w/ low phone recognition accuracy perform better At larger amounts of data, training on more difficult utterances seems to be more advantageous, and looks promising to perform better than random sampling
17 Triphone Coverage Under a generative modeling paradigm (e.g. HMMs) and ML estimation, one might argue that it is sufficient to observe enough prototypes of each triphone to accurately estimate its distribution Frequent triphones will be selected anyway, so tailor the sampling towards utterances w/ infrequent triphones Greedy utterance selection to maximize the entropy of the triphone count distribution
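A naive sketch of the greedy, entropy-maximizing selection described above; the data structures are hypothetical and the quadratic search is kept for clarity rather than efficiency.

```python
import math
from collections import Counter

def entropy(counts):
    """Entropy of the (normalized) triphone count distribution."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total)
                for c in counts.values() if c > 0)

def greedy_triphone_selection(utt_triphones, budget):
    """Greedily add the utterance that most increases the entropy of the
    pooled triphone counts, favoring utterances w/ infrequent triphones.

    utt_triphones: dict utt_id -> list of triphone labels in that utterance
    budget: number of utterances to select (a duration budget works the same way)
    """
    selected, pool = [], Counter()
    remaining = set(utt_triphones)
    while remaining and len(selected) < budget:
        best_u, best_h = None, float("-inf")
        for u in remaining:
            h = entropy(pool + Counter(utt_triphones[u]))
            if h > best_h:
                best_u, best_h = u, h
        selected.append(best_u)
        pool.update(utt_triphones[best_u])
        remaining.remove(best_u)
    return selected
```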
18 Results: Triphone Coverage-based Sampling For small amounts of data, selecting utterances to maximize triphone coverage performs similarly to the likelihood-based sampling criteria No advantage (even some degradation) as compared to random sampling May need many examples of frequent triphones to get a better coverage of non-contextual variations, e.g. speakers
19 Summary Compared a variety of acoustic data selection criteria for labeled data and ML training (random sampling, and those based on likelihood, accuracy, and triphone coverage) Found that likelihood-based selection after removing silences performs the best and slightly improves over random sampling (0.5% abs.) No significant performance improvement Caveat: –Our accuracy-based and triphone coverage-based selection criteria are rather simplistic
20 Future Work Tasks where the data quality is more important –Untranscribed data –Discriminative training More sophisticated accuracy and context-coverage criteria, e.g. lattice entropy/confidence Data selection for cross-domain tasks, e.g. CTS data for Meetings recognition Speaker-level data selection –Could be useful for cross-adaptation methods