Rapid and Accurate Spoken Term Detection. David R. H. Miller, BBN Technologies, 14 December 2006.


14-Dec-06 Rapid and Accurate Spoken Term Detection

Overview of Talk
BBN English system description
Evaluation results
Development experiments
BBN explored STD across languages, but with limited evaluation resources we chose to field systems only in CTS for each language.

BBN Evaluation Team
Core Team: Chia-lin Kao, Owen Kimball, Michael Kleber, David Miller
Additional assistance: Thomas Colthurst, Herb Gish, Steve Lowe, Rich Schwartz

BBN System Overview
[Diagram: audio → Byblos STT → lattices and phonetic transcripts → indexer → index → detector → scored detection lists → decider → final output with YES/NO decisions. Other inputs shown: search terms, ATWV cost parameters. Stage labels: indexing, searching.]

BBN System Overview: STT
[Diagram: the system overview pipeline (Byblos STT, indexer, detector, decider), repeated.]

Primary STT configuration
STT generates a lattice of hypotheses and a phonetic transcript for each input audio file
[number missing]-hour EARS RT04 CTS acoustic model training corpus
946M words of language model training data
14.9% WER on STD Dev06 CTS data

Primary STT English Architecture
[Diagram. Processing blocks: Segmentation + Feature Extraction; Forward-Backward Decoding; Lattice Rescoring; Speaker Adaptation; Forward-Backward Decoding; Lattice Rescoring.
Data labels: Waveform; RDLT Features; N-best Hypothesis; Trigram Lattice; Adaptation Parameters; Trigram Lattice; Final Lattice; Final 1-best.
Speaker-independent models: Fw SI STM AM, bigram LM; Bw SI SCTM AM, approx. trigram LM; SI crossword SCTM AM, trigram LM.
Adapted models: Fw HLDA-SAT STM AM, bigram LM; Bw HLDA-SAT SCTM AM, approx. trigram LM; HLDA-SAT crossword SCTM AM, trigram LM.]
System described in detail in B. Zhang et al., "Discriminatively trained region dependent feature transforms for speech recognition", Proc. ICASSP 2006, Toulouse, France.

BBN System Overview: Indexer
[Diagram: the system overview pipeline (Byblos STT, indexer, detector, decider), repeated.]

Indexer
Indexer precomputes single-word detection records from lattices.
  – Stores as hashed sorted lists for fast lookup.
Computes fraction of likelihood that flows over each arc.
  – Uses forward-backward algorithm.
  – Optimistic posterior: ignores possibility true word is missing from lattice.
Clusters detections with same word, close times, summing their scores.
[Diagram: example lattice with arcs labeled WHICH [a=-205 l=-5], WITCH [a=-200 l=-4], WITCH [a=-203 l=-4], CAT [a=-170 l=-2], CUT [a=-175 l=-3], IS [a=-18 l=-2], THAT [a=-92 l=-3], A [a=-12 l=-2].]
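The arc-posterior and clustering steps on the Indexer slide can be sketched as follows. This is a minimal illustration, not BBN's implementation: the lattice topology, arc times, `ACOUSTIC_SCALE`, and the `max_shift` clustering tolerance are all assumptions, and only the word labels and the `a`/`l` scores come from the slide (the `A` arc is left out because its position in the lattice is unknown).

```python
import math
from collections import defaultdict

ACOUSTIC_SCALE = 0.1  # hypothetical scaling of acoustic scores; not given in the slides

# (src_node, dst_node, word, begin_sec, end_sec, acoustic_logscore, lm_logscore)
# Scores are from the slide; topology and times are invented for illustration.
# Nodes are numbered in topological order (src < dst on every arc).
ARCS = [
    (0, 1, "WHICH", 0.00, 0.30, -205, -5),
    (0, 1, "WITCH", 0.00, 0.30, -200, -4),
    (0, 2, "WITCH", 0.00, 0.32, -203, -4),
    (1, 3, "CAT",   0.30, 0.60, -170, -2),
    (2, 3, "CUT",   0.32, 0.60, -175, -3),
    (3, 4, "IS",    0.60, 0.70,  -18, -2),
    (4, 5, "THAT",  0.70, 0.95,  -92, -3),
]

def log_add(x, y):
    """Return log(exp(x) + exp(y)) without overflow."""
    if x == -math.inf:
        return y
    if y == -math.inf:
        return x
    m = max(x, y)
    return m + math.log(math.exp(x - m) + math.exp(y - m))

def arc_score(arc):
    return ACOUSTIC_SCALE * arc[5] + arc[6]

def arc_posteriors(arcs, start, end):
    """Fraction of total path likelihood flowing over each arc (forward-backward)."""
    alpha = defaultdict(lambda: -math.inf, {start: 0.0})
    for arc in sorted(arcs, key=lambda a: a[0]):  # forward pass
        alpha[arc[1]] = log_add(alpha[arc[1]], alpha[arc[0]] + arc_score(arc))
    beta = defaultdict(lambda: -math.inf, {end: 0.0})
    for arc in sorted(arcs, key=lambda a: a[0], reverse=True):  # backward pass
        beta[arc[0]] = log_add(beta[arc[0]], arc_score(arc) + beta[arc[1]])
    total = alpha[end]
    return [math.exp(alpha[a[0]] + arc_score(a) + beta[a[1]] - total) for a in arcs]

def cluster_detections(detections, max_shift=0.1):
    """Merge detections of the same word whose begin times are close, summing scores."""
    clusters = []
    for word, begin, duration, post in sorted(detections):
        last = clusters[-1] if clusters else None
        if last and last["word"] == word and begin - last["begin"] <= max_shift:
            last["score"] += post
            last["end"] = max(last["end"], begin + duration)
        else:
            clusters.append({"word": word, "begin": begin,
                             "end": begin + duration, "score": post})
    return clusters

posteriors = arc_posteriors(ARCS, start=0, end=5)
clusters = cluster_detections(
    [(a[2], a[3], a[4] - a[3], p) for a, p in zip(ARCS, posteriors)]
)
for c in clusters:
    print(f'{c["word"]:6s} b={c["begin"]:.2f} score={c["score"]:.3f}')
```

The two WITCH arcs end up in one detection record whose score is the sum of their posteriors. Because the posterior is normalized over paths in the lattice only, it is "optimistic" in the slide's sense: a word that never made it into the lattice gets no probability mass.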

Index Structure
[Diagram: the index holds the phonetic transcripts and a word table (CAT, WITCH, WHICH, …) with lists of detection records such as:
file9: b=39.1 d=0.3 p=0.83
file3: b=25.2 d=0.1 p=0.77
file5: b=173.8 d=0.2 p=0.52
…]
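One way to picture such an index is a hash table from word to a sorted list of detection records. The sketch below is an assumption-laden illustration: the `Detection` fields mirror the `b`, `d`, `p` values on the slide, but the sort key (posterior, descending), the helper names `build_index` and `lookup`, and the assignment of the three sample records to the word CAT are all hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple

class Detection(NamedTuple):
    file: str
    begin: float      # "b" on the slide, seconds
    duration: float   # "d" on the slide, seconds
    posterior: float  # "p" on the slide

def build_index(records):
    """records: iterable of (word, Detection). Returns word -> list sorted by posterior."""
    index = defaultdict(list)
    for word, det in records:
        index[word.upper()].append(det)
    for dets in index.values():
        dets.sort(key=lambda d: d.posterior, reverse=True)  # assumed sort order
    return dict(index)

def lookup(index, word):
    """Single-word retrieval: one hash lookup, records already sorted."""
    return index.get(word.upper(), [])

# Sample records from the slide, attached to CAT purely for illustration.
index = build_index([
    ("CAT", Detection("file5", 173.8, 0.2, 0.52)),
    ("CAT", Detection("file9", 39.1, 0.3, 0.83)),
    ("CAT", Detection("file3", 25.2, 0.1, 0.77)),
])
for det in lookup(index, "cat"):
    print(det)
```

Keeping each list pre-sorted means a single-word query needs no work at search time beyond the lookup itself.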

BBN System Overview: Detector
[Diagram: the system overview pipeline (Byblos STT, indexer, detector, decider), repeated.]

Detector
Detector generates a sorted, scored list of candidate detection records for each search term supplied.
For single-word IV terms, performs trivial retrieval from index.
For multi-word IV terms, looks for acceptable sequences of single-word detections.
  – Component detections must satisfy adjacency timing constraints.
  – Assigns minimum component score to the multi-word detection.
OOV not a significant factor in English CTS; see Levantine talk.
[Table: candidates for term "bombing". Columns: Audio File, Begin, Duration, Score. Rows: fsh_60262_exA, fsh_61228_exA, fsh_60844_exA, fsh_60650_exA, fsh_61228_exA. Numeric values not preserved.]
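The multi-word logic described on the Detector slide can be sketched as a chain of single-word lookups. The slides do not give the actual timing constraints, so `max_gap` and `max_overlap` below are hypothetical parameters, and the toy index contents are invented; only the "ordered, adjacent in time, minimum component score" rule comes from the slide.

```python
from typing import NamedTuple

class Detection(NamedTuple):
    file: str
    begin: float
    duration: float
    posterior: float

def find_phrase(index, phrase, max_gap=0.5, max_overlap=0.1):
    """Return (file, begin, end, score) candidates for a multi-word term.

    A candidate is a sequence of single-word detections in the same file where
    each word starts within [-max_overlap, max_gap] seconds of the previous
    word's end. The phrase score is the minimum component posterior.
    """
    words = phrase.upper().split()
    partial = [(d.file, d.begin, d.begin + d.duration, d.posterior)
               for d in index.get(words[0], [])]
    for word in words[1:]:
        extended = []
        for file, begin, end, score in partial:
            for d in index.get(word, []):
                gap = d.begin - end
                if d.file == file and -max_overlap <= gap <= max_gap:
                    extended.append((file, begin, d.begin + d.duration,
                                     min(score, d.posterior)))
        partial = extended
    return sorted(partial, key=lambda c: c[3], reverse=True)

# Invented single-word index for illustration.
toy_index = {
    "NEW": [Detection("file1", 10.0, 0.3, 0.9)],
    "YORK": [
        Detection("file1", 10.35, 0.4, 0.6),  # adjacent to NEW in file1
        Detection("file1", 50.0, 0.4, 0.8),   # same file, too far away
        Detection("file2", 10.3, 0.4, 0.7),   # right time, wrong file
    ],
}
hits = find_phrase(toy_index, "new york")
print(hits)
```

A single-word term falls through the loop untouched, which matches the slide's "trivial retrieval" case.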

BBN System Overview: Decider
[Diagram: the system overview pipeline (Byblos STT, indexer, detector, decider), repeated.]

Decider
Decider picks and applies a score threshold for each list to make YES/NO decisions.
  – Processes each list of candidates independently
  – Processes all detection records in a list jointly
  – Aims to maximize ATWV metric
[Table: candidates for term "bombing". Columns: Audio File, Begin, Duration, Score, YES/NO. Rows: fsh_60262_exA, fsh_61228_exA, fsh_60844_exA, fsh_60650_exA, fsh_61228_exA, each with "?" in the YES/NO column. Numeric values not preserved.]

Primary Evaluation Metric
"Actual Term Weighted Value" (ATWV) is the primary metric

Understanding ATWV
Perfect ATWV = 1.0
Mute detector has ATWV = 0.0
Negative ATWV is possible.
Motivated by application-based costs:
All search terms are weighted equally
False alarm cost is almost constant, but miss cost varies by term.
  – Missing an instance of a rare term is expensive.
  – Missing an instance of a frequent term is cheap.
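The properties listed above follow from the term-weighted value as defined in the NIST 2006 STD evaluation plan, whose formula is not preserved in this transcript: TWV = 1 − average over terms of [P_miss(term) + β · P_FA(term)], with P_miss = 1 − N_correct/N_true, P_FA = N_FA/(T − N_true), T the seconds of speech, and β = 999.9. The sketch below computes it from per-term counts; the function name, the dictionary layout, and the three-hour corpus size are illustrative assumptions.

```python
BETA = 999.9  # NIST STD 2006 setting: (C/V) * (1/Pr_term - 1), C/V = 0.1, Pr_term = 1e-4

def term_weighted_value(counts, speech_seconds, beta=BETA):
    """counts: term -> (n_true, n_correct, n_false_alarm).

    Terms with no true occurrence are skipped, since P_miss is undefined for them.
    """
    losses = []
    for n_true, n_correct, n_fa in counts.values():
        if n_true == 0:
            continue
        p_miss = 1 - n_correct / n_true
        p_fa = n_fa / (speech_seconds - n_true)
        losses.append(p_miss + beta * p_fa)
    return 1 - sum(losses) / len(losses)

T = 3 * 3600  # hypothetical corpus size in seconds

perfect = term_weighted_value({"rare": (2, 2, 0), "common": (100, 100, 0)}, T)
mute = term_weighted_value({"rare": (2, 0, 0), "common": (100, 0, 0)}, T)
noisy = term_weighted_value({"rare": (2, 2, 40), "common": (100, 100, 40)}, T)

# One miss of a rare term versus one miss of a frequent term.
miss_rare = term_weighted_value({"rare": (2, 1, 0), "common": (100, 100, 0)}, T)
miss_common = term_weighted_value({"rare": (2, 2, 0), "common": (100, 99, 0)}, T)
print(perfect, mute, noisy, miss_rare, miss_common)
```

With two terms, missing one of the two rare instances costs 0.25 of TWV while missing one of the hundred frequent instances costs 0.005, which is the asymmetry the slide describes. Enough false alarms push the value below zero.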

Decider Theory
Given unbiased, independent posterior probabilities on detections and known constant value/cost on outcome, optimal decision threshold θ satisfies [equation not preserved]
In ATWV metric, if N_true(term) > 0 [equation not preserved]
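The equations on this slide are not preserved. As a hedged reconstruction rather than the slide's own notation, the threshold can be derived from the NIST TWV definition: a correct detection of a term is worth 1/N_true, a false alarm costs β/(T − N_true), and the common factor of one over the number of terms cancels. Here p is the posterior of a detection, T is the seconds of speech, and β = 999.9; these symbols are assumptions taken from the NIST evaluation plan.

```latex
\begin{align*}
\text{say YES}
  &\iff \frac{p}{N_{\mathrm{true}}}
        - \frac{\beta\,(1-p)}{T - N_{\mathrm{true}}} > 0 \\
  &\iff p > \theta
   = \frac{\beta\,N_{\mathrm{true}}}{T - N_{\mathrm{true}} + \beta\,N_{\mathrm{true}}}
   = \frac{N_{\mathrm{true}}}{T/\beta + \frac{\beta-1}{\beta}\,N_{\mathrm{true}}}
\end{align*}
```

The threshold grows with N_true, so frequent terms need more confident detections than rare ones before a YES decision pays off.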

Decider Approximations
N_true(term) unknown, and detection scores biased.
For each term, estimate from detections D_i: [equation not preserved]
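Since the estimation formula is missing from the transcript, the sketch below makes an explicit assumption: N_true is estimated as the sum of the detection posteriors for the term, optionally multiplied by a hypothetical bias-correction factor `scale`. The threshold is the one that maximizes expected TWV under the NIST definition with β = 999.9, namely N_true / (T/β + (β − 1)/β · N_true). The function name and the corpus size are illustrative.

```python
def decide(posteriors, speech_seconds, beta=999.9, scale=1.0):
    """Pick a term-specific threshold and make YES/NO decisions.

    posteriors: scores of all candidate detections for one term.
    scale: hypothetical correction for biased scores (1.0 = trust the scores).
    """
    n_true_est = scale * sum(posteriors)  # assumed estimator of N_true
    threshold = n_true_est / (speech_seconds / beta
                              + (beta - 1) / beta * n_true_est)
    return threshold, [p >= threshold for p in posteriors]

T = 3 * 3600  # hypothetical corpus size in seconds

# A rare term: few candidates, low estimated count, permissive threshold.
rare_threshold, rare_decisions = decide([0.9, 0.3, 0.05], T)

# A frequent term: many candidates, high estimated count, strict threshold.
frequent_threshold, frequent_decisions = decide([0.6] * 50, T)

print(rare_threshold, rare_decisions)
print(frequent_threshold, sum(frequent_decisions))
```

Each candidate list is processed independently, but all records in a list contribute to the threshold through the count estimate, which matches the "independently" and "jointly" bullets on the Decider slide.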

STD Evaluation English Results
English CTS Results
[Results table not preserved.]

NIST English DET curves
[Figure: DET curves, not preserved.]

Effect of STT Error Rate
STT WER has strong effect on ATWV:
[Table. Columns: System, WER, Dev06 ATWV, DryRun06 ATWV. Rows: BBN primary, BBN contrast. Values not preserved.]
Loss of 2.5 WER caused ATWV to drop
  – Magnified effect because changes in lattice word posteriors don't show up in WER
WER affected by scoring conventions.
  – Contraction, hyphenation normalization
  – Rigorous match definition for this eval causes WER to increase by 0.5

Importance of Lattice Output
Searching lattices is more accurate than searching 1-best transcripts
[Table. Columns: Dev06 (1-best, lattices), DryRun06 (1-best, lattices). Rows: primary, contrast. Values not preserved.]
Lattice searching reduces P_miss
  – 8-fold increase in number of candidate detections from STT
Improves estimate of N_true for decisions
  – Holds P_FA down

Effect of Multi-word Detection Logic
Exact detection of multi-word search terms is possible:
  – Store full lattice
  – Search for words on adjacent edges
  – Use fw-bw to get true posterior probability
Approximate multi-word detection:
  – Store only individual words, forget topology
  – Search for words ordered & close in time
  – Pr(phrase) = min Pr(words in phrase)
Effect of Approximate Multi-word Detection:
  Search time: decreased by 99.5%
  Index size: decreased by 97%
  ATWV: increased by 0.01

BBN STD Summary
Accurate detection (83% of perfect ATWV)
Fast search time
Small index size
Configurable indexing speed
  – Fast index speed maintains good accuracy.
Encapsulated decision logic
  – Easy to tailor for cost metrics other than ATWV

Contrast STT configuration
2300hrs/800hrs/1500hrs AM training data (complementary MPE).
Same LM training data as primary system
Somewhat smaller model than primary
18.1% WER on STD Dev06 CTS data
  – compared to 14.9% for primary

Contrast STT English Architecture
[Diagram. Processing blocks: Segmentation + Feature Extraction; Forward-Backward Decoding; Speaker Adaptation; Lattice Rescoring.
Data labels: Waveform; Cepstra + Energy; 1-best Hypothesis; Adaptation Parameters; Cepstra + Energy; Trigram Lattice; Final Result.
Models: Fw SI STM AM, bigram LM; Bw SI SCTM AM, approx. trigram LM; HLDA-SAT crossword SCTM AM, trigram LM.]
Architecture same as S. Matsoukas et al., "The 2004 BBN 1xRT Recognition Systems for English Broadcast News and Conversational Telephone Speech", Proc. Interspeech 2005, Lisboa, Portugal.