Slide 1: Broadcast News Training Experiments
Anand Venkataraman, Dimitra Vergyri, Wen Wang, Ramana Rao Gadde, Martin Graciarena, Horacio Franco, Jing Zheng, and Andreas Stolcke
Speech Technology & Research Laboratory, SRI International, Menlo Park, CA
EARS STT Workshop, Dec. 4-5, 2003

Slide 2: Goals
- Assess the effect of TDT-4 data (not previously used) on the SRI BN system
- Explore alternatives for using closed-caption (CC) transcripts for acoustic and LM training
  – Specifically, investigate an algorithm for "repairing" inaccuracies in CC transcripts
- Initial test of the voicing-feature front end (originally developed for CTS) on BN

Slide 3: Talk Overview
- BN training on TDT-4 CC data
  – Generation of raw transcripts
  – Waveform segmentation
  – Transcript repair with FlexAlign
  – FlexAlign output for LM training
  – Effect of amount of training data
  – Comparison with CUED TDT-4 transcripts
- Ongoing effort on voicing features for BN acoustic modeling

Slide 4: TDT-4 Training: Generation of Waveform Segments and Reference Transcripts
- References were assumed to be delimited by <TEXT> and </TEXT> tags in the LDC transcripts.
- The speech signal was cut using the time marks extracted from the tags surrounding the TEXT elements.
- Long waveforms were identified and recut at progressively shorter pauses until all waveforms were 30 s or shorter (see the sketch below).
- Used PTM acoustic models for forced alignment, which didn't require speaker-level normalizations.
- Used "flexible" forced alignment (see next slide).
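The recutting step amounts to a recursive split on detected pauses, relaxing the minimum pause length until every piece fits. A minimal sketch of that idea; the segment representation, thresholds, and pause detector are assumptions, not the SRI implementation:

```python
# Recut segments longer than 30 s at progressively shorter pauses.
# A segment is a (start, end) pair in seconds; `pauses` is a list of
# (start, end) silence intervals from some pause detector (assumed
# given; the slides do not say how pauses were located).

MAX_LEN = 30.0
THRESHOLDS = [1.0, 0.5, 0.25, 0.1]  # illustrative minimum pause lengths

def recut(seg, pauses):
    start, end = seg
    if end - start <= MAX_LEN:
        return [seg]
    for min_pause in THRESHOLDS:  # try progressively shorter pauses
        cuts = [(p0 + p1) / 2.0 for (p0, p1) in pauses
                if start < p0 and p1 < end and (p1 - p0) >= min_pause]
        if cuts:
            bounds = [start] + sorted(cuts) + [end]
            pieces = zip(bounds[:-1], bounds[1:])
            return [s for piece in pieces for s in recut(piece, pauses)]
    return [seg]  # no usable pause; keep the long segment as-is

# recut((0.0, 75.0), [(20.0, 20.6), (48.0, 49.2)])
# -> [(0.0, 20.3), (20.3, 48.6), (48.6, 75.0)]
```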

Slide 5: FlexAlign
- Special lattices were generated for each segment (see the sketch below).
- Each word was preceded by an optional pause and an optional nonlexical word model.
- Goal: simultaneously delete noisy or mistranscribed text and insert disfluencies.
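A minimal sketch of what such a lattice might look like, here as a simple word network with epsilon (skip) arcs; the representation, token names, and the deletable-word arc are illustrative inferences, since the slides do not give the actual lattice format:

```python
# Build a flexible-alignment network for one CC segment: before each
# transcript word, allow an optional pause and an optional nonlexical
# word; the transcript word itself is skippable, so mistranscribed CC
# text can be deleted while disfluencies are inserted.

def flex_network(words, pause="<pause>", filler="<filler>"):
    """Return a list of (src_state, dst_state, label) arcs;
    label None denotes an epsilon (skip) arc."""
    arcs, s = [], 0
    for w in words:
        arcs += [(s, s + 1, pause), (s, s + 1, None)]           # optional pause
        arcs += [(s + 1, s + 2, filler), (s + 1, s + 2, None)]  # optional filler
        arcs += [(s + 2, s + 3, w), (s + 2, s + 3, None)]       # word, deletable
        s += 3
    return arcs

arcs = flex_network("cc transcript text here".split())
```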

Slide 6: Optional Nonlexical Word
Transition probabilities were approximated by the unigram relative frequencies in the 96/97 BN acoustic training corpus (a toy illustration follows).
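The estimate is a plain relative frequency; a toy illustration, where the token list stands in for the 96/97 BN training transcripts:

```python
from collections import Counter

# Unigram relative-frequency estimate for a nonlexical word, e.g. "uh":
tokens = "so uh we uh saw the results uh today and uh left".split()
counts = Counter(tokens)
p_uh = counts["uh"] / len(tokens)  # 4/12 ~ 0.33 in this toy corpus
```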

Slide 7: Training Procedure
- Final references were the output of the recognizer on the FlexAlign lattices.
- WER w.r.t. the original CC transcripts: 5.0% (Sub 0.4, Ins 4.4, Del 0.3).
- Standard acoustic models were built using Viterbi training on these transcripts.
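For reference, the quoted breakdown follows the standard WER definition (the component rates are rounded, so they need not sum exactly to the total):

    WER = (Sub + Ins + Del) / N × 100%

where N is the number of tokens in the original CC transcripts. Insertions dominate, presumably because FlexAlign mostly adds pauses and fillers rather than altering the CC words.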

Slide 8: Does FlexAlignment Help LM Training?
[Table: PPL on TDT4 and Eval03 held-out data, plus n-gram counts by order, for the Original, Subset, FlexAlign, and Mixture LMs; the numeric values did not survive transcription.]
- "Subset": a random selection of the original CC references, matched to the token count of the FlexAlign transcripts.
- Note: the only disfluency in the test data was "uh".
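The "Mixture" column presumably denotes an interpolated LM; the usual construction is linear interpolation, sketched below (the weight lam is illustrative, since the slides do not give it):

```python
import math

def mixture_logprob(p_orig, p_flex, lam=0.5):
    """Word log-probability under a two-way linear LM mixture."""
    return math.log(lam * p_orig + (1.0 - lam) * p_flex)

def perplexity(logprobs):
    """PPL = exp(-mean log p), the quantity reported in the table."""
    return math.exp(-sum(logprobs) / len(logprobs))
```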

Slide 9: An Accidental Experiment
- What happens if we train on only a subset of the data? Is the performance proportionately worse?
[Table: Hours, Speakers, and WER for the Baseline, Subset TDT4, and All TDT4 conditions; the numeric values did not survive transcription.]

Slide 10: Comparison with CUED TDT-4 Training Transcripts
- CUED TDT-4 transcripts were generated by an STT system with a biased LM (trained on TDT-4 CC).
- CUED transcripts were generated from CU word-time information and SRI waveform segments.
- CUED transcripts sometimes have "holes" where our waveform segments span more than one of the CUED waveforms (probably due to ad removal).
- WER w.r.t. CC transcriptions:
  – Originals: 18.2% (Sub 7.7, Ins 3.2, Del 7.2)
  – FlexAlign: 19.5% (Sub 10.1, Ins 3.8, Del 5.6)
- A fairer comparison ought to use CUED transcripts with CUED segments for training the acoustic models, so take these results with a grain of salt!

Slide 11: Results of First Decoding Pass
[Table: training hours and WER for the Baseline, TDT4-subset, TDT4-full, and CUED-transcripts conditions; the numeric values did not survive transcription.]

Slide 12: Multi-pass System Results
[Table: first-pass and multi-pass WER for the Baseline and +TDT4 systems; the relative improvement was 8.8% (first pass) and 6.2% (multi-pass); the absolute WERs did not survive transcription.]
- The multi-pass system used the new decoding strategy (described in a later talk), but with MFCC instead of PLP and without SAT normalization in training (to save time).
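The relative-improvement figures follow the usual definition, applied separately to each column:

    rel. improvement = (WER_Baseline − WER_+TDT4) / WER_Baseline × 100%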

Slide 13: Voicing Features
- Test the voicing features developed for the CTS system on BN STT (cf. Martigny talk).
  – There, we obtained a 2% relative error reduction across stages.
- Features: peak of the autocorrelation and entropy of the higher-order cepstrum (see the sketch below).
- Use a window of 5 frames of the two voicing features.
- Juxtapose MFCCs plus deltas and double deltas with the window of voicing features.
- Apply dimensionality reduction with HLDA; the final feature vector has 39 dimensions.
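A rough sketch of the two per-frame voicing measures; the sampling rate, pitch range, normalization, and exact "higher-order" cepstral band are guesses, not the SRI front end:

```python
import numpy as np

def voicing_features(frame, fs=16000, f0_lo=60.0, f0_hi=400.0):
    """Return (autocorrelation peak, higher-order cepstrum entropy)."""
    x = frame - frame.mean()
    lag_lo, lag_hi = int(fs / f0_hi), int(fs / f0_lo)  # pitch lag range
    # 1) Peak of the energy-normalized autocorrelation in the pitch range.
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac_peak = (ac[lag_lo:lag_hi] / (ac[0] + 1e-12)).max()
    # 2) Entropy of the higher-order (pitch-range) real cepstrum: low
    #    entropy suggests a single strong pitch peak, i.e. voicing.
    spec = np.abs(np.fft.rfft(x)) + 1e-12
    cep = np.abs(np.fft.irfft(np.log(spec)))[lag_lo:lag_hi]
    p = cep / cep.sum()
    entropy = -np.sum(p * np.log(p + 1e-12))
    return ac_peak, entropy
```

Per the slides, the two values are then stacked over a 5-frame window, appended to the MFCC + delta + double-delta vector, and reduced to 39 dimensions with HLDA.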

Slide 14: Voicing Features Results

Frontend       WER
MFCC           22.6
MFCC+voicing   23.3

(TDT-4 devtest set, first-pass results)

- Used parameters equivalent to those optimized for the CTS system.
- Need to investigate (reoptimize) front-end parameters for the higher bandwidth.
- It is not clear what effect background music has on voicing features in BN.
- Possible software issues.
- With higher-bandwidth features, the voicing features may be more redundant.

Slide 15: Summary
- Developed a CC transcript "repair" algorithm based on flexible alignment.
- Training on "repaired" TDT-4 transcripts gives an 8.8% (first-pass) to 6.2% (multi-pass) relative improvement over Hub-4-only training.
- Accidental result: leaving out one third of the new data reduces the improvement only marginally.
- Transcript "repair" is not suitable for LM training (yet).
- No improvement from voicing features (yet); need to investigate parameters.

Slide 16: Future Work
- Redo the comparison with alternative transcripts more carefully.
- Investigate data filtering, e.g., based on reject-word occurrences in FlexAlign output (see the sketch below).
- Add the rest of the data!
- Further investigate the use of voicing features.
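For the filtering idea, one plausible form; the token names and threshold are assumptions, not from the slides:

```python
# Keep a segment only if the fraction of reject/filler tokens in its
# FlexAlign output is small; "<rej>"/"<filler>" and the 20% threshold
# are illustrative.
REJECT_TOKENS = {"<rej>", "<filler>"}

def keep_segment(flex_words, max_reject_frac=0.2):
    n_rej = sum(w in REJECT_TOKENS for w in flex_words)
    return n_rej <= max_reject_frac * max(len(flex_words), 1)
```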