A Study of Some Factors Impacting SuperARV Language Modeling
Wen Wang (1), Andreas Stolcke (1), Mary P. Harper (2)
1. Speech Technology & Research Laboratory, SRI International
2. School of Electrical and Computer Engineering, Purdue University
EARS STT Workshop, March 24, 2005

Slide 2: Motivation
- The RT-03 SuperARV LM gave excellent results using a backoff N-gram approximation [ICASSP'04 paper].
- The N-gram backoff approximation of the RT-04 SuperARV LM did not generalize to the RT-04 evaluation test set:
  – Dev04: achieved a 1.0% absolute WER reduction over the baseline LM
  – Eval04: no gain in WER (in fact, a small loss)
- The RT-04 SARV LM was developed under considerable time pressure:
  – The training procedure is very time consuming (weeks to months) due to syntactic parsing of the training data
  – There was no time to examine all design choices in combination
- Goal: reexamine all design decisions in detail.

Slide 3: What Changed?
RT-04 SARV training differed from RT-03 in two aspects:
- Retrained the Charniak parser on a combination of the Switchboard Penn Treebank and the Wall Street Journal Penn Treebank; the 2003 parser was trained on the WSJ Treebank only.
- Built a SuperARV LM with additional modifiee lexical feature constraints (Standard+ model); the 2003 LM was a SuperARV LM without these additional constraints (Standard model).
These changes had given improvements at various points, but were not tested in complete systems on new Fisher data.

Slide 4: Plan of Attack
- Revisit the changes to the training procedure
  – Check the effect on old and new data sets and systems
- Revisit the backoff N-gram approximation
  – Did we just get lucky in 2003?
  – Evaluate the full SuperARV LM in N-best rescoring
  – Find better approximations
- Start the investigation by going back to the 2003 LM, then move to the current system.
- Validate the training software (and document and release it).
- This is work in progress; we are holding out on eval04 testing (to avoid implicit tuning).

Slide 5: Perplexity of RT-03 LMs
- Trained on the RT-03 LM training data.
- LM types tested:
  – "Word": word backoff 4-gram, KN smoothed
  – "SARV N-gram": N-gram approximation to the standard SuperARV LM
  – "SARV Standard": full SuperARV (without additional constraints)
- Findings: the full model's gains are smaller on dev04, and the N-gram approximation breaks down.
[Table: perplexity on the dev/eval test sets for the Word, SARV N-gram, and SARV Standard LMs; numeric values not preserved in this transcript.]
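To make the perplexity comparisons above concrete, here is a minimal sketch (not from the original slides) of how test-set perplexity is computed from per-word LM log probabilities; the toy scores below are illustrative, not values from any of the models in the talk.

```python
import math

def perplexity(log10_probs):
    """Perplexity from per-word log10 probabilities, as reported by
    typical LM toolkits: PPL = 10 ** (-(sum of log10 probs) / N)."""
    n = len(log10_probs)
    return 10 ** (-sum(log10_probs) / n)

# Toy example: per-word log10 probabilities assigned by some LM to a test set.
example_scores = [-1.2, -0.8, -2.1, -1.5, -0.9]
print(f"perplexity = {perplexity(example_scores):.1f}")
```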

Slide 6: N-best Rescoring with the Full SuperARV LM
- Evaluated the full Standard SARV LM in final N-best rescoring.
- Based on the PLP subsystem of the RT-03 CTS system.
- Full SARV rescoring is expensive, so we tried increasingly longer N-best lists:
  – Top-50
  – Top-500
  – Top-2000 (the maximum used in the eval system)
- Early passes (including MLLR) use the baseline LM, so gains will be limited.
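A minimal sketch of the kind of N-best rescoring described on this slide: each hypothesis's acoustic score is combined with a new LM score and a word insertion penalty, and the list is re-ranked. The weights, list format, and the stand-in LM are illustrative assumptions, not the actual SRI rescoring tools.

```python
from typing import Callable, List, Tuple

def rescore_nbest(
    hyps: List[Tuple[float, List[str]]],       # (acoustic log score, word sequence)
    lm_logprob: Callable[[List[str]], float],  # full-sentence LM log probability
    lm_weight: float = 8.0,                    # language model scale factor (assumed)
    wip: float = 0.0,                          # word insertion penalty (assumed)
) -> List[Tuple[float, List[str]]]:
    """Re-rank an N-best list by combined acoustic + LM score."""
    rescored = []
    for ac_score, words in hyps:
        total = ac_score + lm_weight * lm_logprob(words) + wip * len(words)
        rescored.append((total, words))
    # Best (highest combined log score) hypothesis first.
    return sorted(rescored, key=lambda x: x[0], reverse=True)

# Toy usage with a stand-in LM that simply penalizes length.
toy_lm = lambda words: -0.5 * len(words)
nbest = [(-120.0, "so i think that's right".split()),
         (-121.5, "so i think that is right".split())]
print(rescore_nbest(nbest, toy_lm)[0][1])
```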

Slide 7: RT-03 LM N-best Rescoring Results
- Standard SuperARV reduces WER on eval02 and eval03.
- No gain on dev04.
- Identical gains on eval03-SWB and eval03-Fisher.
- The SuperARV gain increases with a larger hypothesis space.
[Table: WER for Word vs. SARV Standard rescoring at Top-50, Top-500, and Top-2000 on eval02, eval03, and dev04; numeric values not preserved in this transcript.]

Slide 8: Adding Modifiee Constraints
- The constraints enforced by a Constraint Dependency Grammar (on which SuperARV is based) can be enhanced by using modifiee information in unary and binary constraints.
- We expected this information to improve the SuperARV LM.
- In RT-04 development, we explored using only the modifiee's lexical category in the LM, adding it to the SuperARV tag structure. This reduced perplexity and WER in early experiments.
- But: the additional tag constraints could have hurt LM generalization!
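Purely as an illustration of the Standard vs. Standard+ distinction, here is a hypothetical sketch of a SuperARV-style tag with an optional modifiee lexical category field; the field names and values are assumptions for exposition and do not reproduce the actual SuperARV encoding.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class SuperARVTag:
    """Hypothetical, simplified stand-in for a SuperARV tag.

    A real SuperARV bundles a word's lexical category with its dependency
    constraints; the Standard+ variant in the talk additionally records the
    lexical category of each modifiee.
    """
    lex_category: str                            # e.g. "noun", "verb"
    dependency_roles: Tuple[str, ...]            # abstracted constraint labels
    modifiee_lex_category: Optional[str] = None  # present only in Standard+

# Standard tag: no modifiee information.
standard = SuperARVTag("noun", ("subj-of-verb",))
# Standard+ tag: same constraints plus the modifiee's lexical category,
# which makes the tag inventory larger and the model more constrained.
standard_plus = SuperARVTag("noun", ("subj-of-verb",), modifiee_lex_category="verb")
print(standard != standard_plus)  # finer-grained tags -> sparser statistics
```

The finer-grained tag set directly reflects the trade-off the slide raises: richer constraints for word prediction, but sparser statistics on mismatched data such as Fisher.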

Slide 9: Perplexity with Modifiee Constraints
- Trained a SuperARV LM augmented with modifiee lexical features on the RT-03 LM data (the "Standard+" model).
- The Standard+ model reduces perplexity on the eval02 and eval03 test sets (relative to Standard).
- But not on the Fisher (dev04) test set!
[Table: perplexity for the Word N-gram, SARV N-gram, SARV Standard, and SARV Standard+ LMs on the dev/eval test sets; numeric values not preserved in this transcript.]

Slide 10: N-best Rescoring with Modifiee Constraints
- The WER reductions are consistent with the perplexity results.
- No improvement on dev04.
[Table: WER at Top-50 and Top-500 for the Word N-gram, SARV Standard, and SARV Standard+ LMs on the eval and dev04 test sets; numeric values not preserved in this transcript.]

Slide 11: In-domain vs. Out-of-domain Parser Training
- SuperARVs are collected from CDG parses, which are obtained by transforming CFG parses.
- The CFG parses are generated with existing state-of-the-art parsers.
- In 2003: CTS data was parsed with a parser trained on the Wall Street Journal Treebank (out-of-domain parser).
- In 2004: obtained a trainable version of the Charniak parser and retrained it on a combination of the Switchboard Treebank and the WSJ Treebank (in-domain parser).
  – Expected improved consistency and accuracy of the parse structures.
  – However, there were bugs in that retraining; these were fixed for the current experiment.
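As a rough illustration of the data flow this slide describes, here is a sketch of the SuperARV training-data pipeline (CFG parse, conversion to CDG, tag extraction). The component names (`charniak_parse`, `cfg_to_cdg`, `extract_superarvs`) are placeholders for the actual SRI/Purdue tools, which are not part of this transcript; the toy stubs exist only to show the shape of the pipeline.

```python
from typing import Iterable, List

def build_superarv_training_data(
    sentences: Iterable[str],
    charniak_parse,      # placeholder: sentence -> CFG parse tree
    cfg_to_cdg,          # placeholder: CFG tree -> CDG parse (dependency constraints)
    extract_superarvs,   # placeholder: CDG parse -> list of (word, SuperARV tag)
) -> List[List[tuple]]:
    """Sketch of the pipeline: CFG parse -> CDG parse -> SuperARV tag sequence.

    The retrained in-domain parser plugs in as `charniak_parse`; everything
    downstream stays the same, which is why parser quality and consistency
    directly affect the tag inventory and the LM.
    """
    tagged_corpus = []
    for sent in sentences:
        cfg_tree = charniak_parse(sent)
        cdg_parse = cfg_to_cdg(cfg_tree)
        tagged_corpus.append(extract_superarvs(cdg_parse))
    return tagged_corpus

# Toy usage with stub components, just to exercise the data flow.
corpus = build_superarv_training_data(
    ["yeah that sounds good"],
    charniak_parse=lambda s: ("S", s.split()),
    cfg_to_cdg=lambda tree: [(w, "dep") for w in tree[1]],
    extract_superarvs=lambda cdg: [(w, f"ARV:{d}") for w, d in cdg],
)
print(corpus[0])
```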

Slide 12: Rescoring Results with the In-domain Parser
- Reparsed the RT-03 LM training data with the in-domain parser.
- Retrained the Standard SuperARV model ("Standard-retrained").
- N-best rescoring system as before.
- In-domain parsing helps.
- Also: the number of distinct SuperARV tags was reduced by retraining (improved parser consistency).
[Table: Top-500 rescoring WER (%) on an eval test set for the Word N-gram, SARV Standard, SARV Standard+, SARV Standard-retrained, and SARV Standard-retrained+ models; numeric values not preserved in this transcript.]

Slide 13: Summary So Far
- Prior design decisions have been validated:
  – Adding modifiee constraints helps the LM on matched data.
  – Reparsing with the retrained in-domain parser improves LM quality.
- Now: reexamine the approximation used in decoding (work in progress).

Slide 14: N-best Rescoring with the RT-04 Full Standard+ Model
- The RT-04 model is the "Standard+" model (includes modifiee constraints).
- RT-04 had been built with the in-domain parser.
- Caveat: the old parser runs suffered from some (not catastrophic) bugs; we still need to reparse the RT-04 LM training data (significantly more data than RT-03).
- Improved WER, but smaller gains than on the older test sets.
- Gains improve with more hypotheses.
- This suggests the need for a better approximation to enable use of SuperARV in search.
[Table: WER at Top-50 and Top-500 for the Word N-gram and SARV Standard+ LMs on a dev test set; numeric values not preserved in this transcript.]

Slide 15: Original N-gram Approximation Algorithm
Algorithm description:
1. For each N-gram observed in the training data (whose SuperARV tag information is known), calculate its probability using the Standard or Standard+ SuperARV LM, generating a new LM after renormalization.
2. For each of these N-grams w1...wn (with tags t1...tn):
   a. Extract the short-SuperARV sequence (a subset of the components of a SuperARV) from t1...tn, denoted st1...stn.
   b. Using the lexicon constructed after training, find the list of word sequences sharing the same short-SuperARV sequence st1...stn.
   c. From this list, select N-grams that do not occur in the training data and that, when added, reduce the perplexity on a held-out test set or increase it by less than a threshold.
3. The resulting LM can be pruned to make its size comparable to a word-based LM.
Problems:
- If the held-out set is small, the algorithm overfits.
- If the held-out set is large, the algorithm is slow.
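A compact sketch of the augmentation step (step 2 above), written against hypothetical interfaces (`tags_for`, `short_tag`, `lexicon.word_sequences_for`, `perplexity_on_heldout`); it shows the control flow of the selection criterion, not the actual SRI implementation.

```python
def augment_with_sarv_ngrams(train_ngrams, tags_for, short_tag, lexicon,
                             perplexity_on_heldout, max_ppl_increase=0.0):
    """Sketch of step 2 of the original approximation: propose unseen N-grams
    that share a short-SuperARV sequence with a training N-gram, and keep a
    candidate only if it does not hurt held-out perplexity by more than a
    threshold. All helper interfaces are hypothetical placeholders."""
    accepted = set(train_ngrams)
    baseline_ppl = perplexity_on_heldout(accepted)
    for ngram in train_ngrams:
        short_seq = tuple(short_tag(t) for t in tags_for(ngram))    # st1...stn
        for candidate in lexicon.word_sequences_for(short_seq):     # same short-SARV tags
            if candidate in accepted:
                continue                                            # already in the LM
            trial_ppl = perplexity_on_heldout(accepted | {candidate})
            if trial_ppl <= baseline_ppl + max_ppl_increase:
                accepted.add(candidate)
                baseline_ppl = trial_ppl
    return accepted
```

Recomputing held-out perplexity for every candidate is exactly what makes the procedure slow on a large held-out set and prone to overfitting on a small one, as the slide notes.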

Slide 16: Revised N-gram Approximation for SuperARV LMs
Idea: build a test-set-specific N-gram LM that approximates the SuperARV LM [suggested by Dimitra Vergyri] and includes all N-grams that "matter" to the decoder.
Method:
- Step 1: perform first-pass decoding on the test set with a word-based language model and generate HTK lattices.
- Step 2: extract N-grams from the HTK lattices; prune based on posterior counts.
- Step 3: compute conditional probabilities for these N-grams using a standard SuperARV language model.
- Step 4: compute backoff weights from the conditional probabilities.
- Step 5: apply the resulting N-gram LM in all subsequent decoding passes (using standard tools).
Some approximations remain:
  – due to the pruning in Step 2
  – from using only N-gram context, not the full sentence prefix
Drawback: Step 3 takes significant compute time (currently around 10x real time, not yet optimized for speed).
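A minimal sketch of Steps 2-4, assuming the lattice N-grams and their posterior counts have already been extracted into Python data structures and that a callable `sarv_logprob(word, history)` wraps the full SuperARV model. The backoff-weight formula shown is the standard ARPA/Katz-style renormalization, which is an assumption about what "compute backoff weights" means here, not a description of the actual tooling.

```python
import math
from collections import defaultdict

def build_approx_lm(lattice_ngrams, posterior_count, sarv_logprob,
                    min_posterior=0.05):
    """Sketch of Steps 2-4 of the revised approximation.

    lattice_ngrams : iterable of word tuples seen in the first-pass lattices
    posterior_count: dict mapping ngram -> total posterior count in the lattices
    sarv_logprob   : callable (word, history_tuple) -> log10 P(word | history)
                     under the full SuperARV LM (hypothetical interface)
    Returns explicit log10 probabilities plus per-history backoff weights in
    the usual ARPA-style backoff scheme.
    """
    # Step 2: keep only N-grams with enough posterior mass to matter.
    kept = [ng for ng in lattice_ngrams if posterior_count[ng] >= min_posterior]

    # Step 3: score the kept N-grams with the full SuperARV LM.
    logprob = {ng: sarv_logprob(ng[-1], ng[:-1]) for ng in kept}

    # Step 4: backoff weights so each history still sums to one:
    # alpha(h) = (1 - sum_{w kept after h} P(w|h)) /
    #            (1 - sum_{w kept after h} P(w|h')), h' = shortened history.
    # A real implementation would also guard against covered mass >= 1.
    by_history = defaultdict(list)
    for ng in kept:
        by_history[ng[:-1]].append(ng[-1])
    backoff = {}
    for hist, words in by_history.items():
        covered = sum(10 ** logprob[hist + (w,)] for w in words)
        covered_lower = sum(10 ** sarv_logprob(w, hist[1:]) for w in words)
        backoff[hist] = math.log10((1.0 - covered) / (1.0 - covered_lower))
    return logprob, backoff
```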

Slide 17: Lattice N-gram Approximation Experiment
- Based on the RT-03 Standard SuperARV LM.
- Extracted N-grams from the first-pass HTK lattices.
- Pruned N-grams whose total posterior count fell below a threshold (value not preserved in this transcript); left with 3.6M N-grams on a 6-hour test set.
- RT-02/03 experiment:
  – uses the 2003 acoustic models
  – 2000-best rescoring (1st pass)
- Dev-04 experiment:
  – uses the 2004 acoustic models
  – lattice rescoring (1st pass)

Slide 18: Lattice N-gram Approximation Results
- 1.2% absolute WER gain on the old (matched) test sets.
- Small 0.2% gain on the Fisher (mismatched) test set.
- Recall: previously there was no Fisher gain with N-best rescoring.
- Better exploitation of the full hypothesis space yields results.
[Table: WER for the Word N-gram vs. SARV lattice N-gram LMs on two eval sets and the dev04 set; numeric values not preserved in this transcript.]

Slide 19: Conclusions and Future Work
- There is a tradeoff between the generality and the selectivity of a SuperARV model, much as was observed in our past CDG grammar induction experiments.
  – Making a model more constrained may reduce its generality.
  – Modifiee lexical features are helpful for strengthening constraints for word prediction, but they may need more or better matched data.
  – We need a better understanding of the interaction between this knowledge source and the characteristics of the training data, e.g., the Fisher domain.
- For a structured model like the SuperARV model, it is beneficial to improve the quality of the training syntactic structures, e.g., by making them less errorful and more consistent.
  – We observed an LM win from better parses (using the retrained parser).
  – Further gains can be expected from advances in parse accuracy.

Slide 20: Conclusions and Future Work (Cont.)
- The old N-gram approximation was flawed.
- The new N-gram approximation looks promising, but also needs more work:
  – tests using the full system
  – the rescoring algorithm needs speeding up
- Still to do: reparse the current CTS LM training set.
- Longer term: we plan to investigate how conversational speech phenomena (sentence fragments, disfluencies) can be modeled better in the SuperARV framework.