1 Using a Large LM  Nicolae Duta, Richard Schwartz  EARS Technical Workshop, September 5-6, 2003, Martigny, Switzerland

2 “There is no data like more data” -- Bob Mercer, Arden House, 1995
Corollary: More data only helps if you don’t throw it away.

3 Ngram Pruning
 When we train an n-gram LM on a large corpus, most of the observed n-grams occur only a small number of times.
 We typically discard these, for two reasons:
  – We assume they are not statistically significant
  – We don’t want to use an LM with 700M distinct 4-grams!
 Questions:
  – Does the fact that an n-gram occurred one time provide useful information?
  – Is it practical to use a really large LM?
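
To make the count-cutoff pruning described above concrete, here is a minimal sketch (not taken from the talk): all n-grams of a given order are counted, and any n-gram observed no more than a chosen cutoff number of times is discarded. The toy corpus, the cutoff value, and the function names are illustrative assumptions only.

from collections import Counter

def count_ngrams(sentences, order):
    # Count all n-grams of the given order over an iterable of token lists.
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

def apply_cutoff(counts, cutoff):
    # Count-cutoff pruning: keep only n-grams observed more than `cutoff` times
    # (toolkits differ on whether the threshold is inclusive).
    return {ngram: c for ngram, c in counts.items() if c > cutoff}

# Toy usage: with a cutoff of 6, every 4-gram in this tiny corpus is discarded.
corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "cat", "sat", "on", "the", "rug"]]
fourgram_counts = count_ngrams(corpus, order=4)
pruned = apply_cutoff(fourgram_counts, cutoff=6)
print(len(fourgram_counts), "observed 4-grams,", len(pruned), "kept after the cutoff")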

4 N-gram Hit Rate
 We find the n-gram “hit rate” to be a useful diagnostic.
 The hit rate is the percentage of n-gram tokens in the reference transcription that are explicitly in the LM.
 We find that the probability that a word is recognized is affected significantly by whether the corresponding n-gram is in the LM, because if it is not, the LM probability (from backing off) is significantly lower.
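
The hit-rate diagnostic defined above is straightforward to compute. The following is a minimal sketch, assuming the LM's explicit n-grams have already been extracted into a set (for an ARPA-format LM this would come from the corresponding n-gram section); the example trigram set and reference sentence are made up.

def ngram_hit_rate(reference_tokens, lm_ngrams, order):
    # Percentage of n-gram tokens in the reference transcription that are
    # explicitly present in the LM (i.e., need no back-off at this order).
    total = 0
    hits = 0
    for i in range(len(reference_tokens) - order + 1):
        total += 1
        if tuple(reference_tokens[i:i + order]) in lm_ngrams:
            hits += 1
    return 100.0 * hits / total if total else 0.0

# Toy usage with a hypothetical trigram set.
lm_trigrams = {("the", "cat", "sat"), ("cat", "sat", "on")}
reference = ["the", "cat", "sat", "on", "the", "mat"]
print("trigram hit rate: %.0f%%" % ngram_hit_rate(reference, lm_trigrams, 3))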

5 Experimental Results
 English broadcast news test (H4Dev03)

   LM Order | Cutoffs [4g, 3g] | LM size [4g, 3g] | Hit Rates [4g, 3g] | Perplexity | WER
   3        | [inf, 6]         | [0, 36M]         | [0, 76%]           | –          | –
   3        | [inf, 0]         | [0, 305M]        | [0, 84%]           | –          | –
   4        | [6, 6]           | [40M, 36M]       | [49%, 76%]         | –          | –
   4        | [0, 0]           | [710M, 305M]     | [61%, 84%]         | –          | –

 Cutoff of 6 for trigram loses 0.5% absolute
 4-gram with cutoff of 6 gains 0.5%
 4-gram cutoff of 6 loses 0.3% (relative to keeping all 4-grams)

6 Batch Implementation
 Count ALL n-grams. Store in (4) sorted files.
  – Total size is about 8.5 GB.
 Do first-pass recognition using the ‘normal’ LM
  – Produce n-best (or lattice)
  – Make a sorted list of all recognized n-grams of all orders in the hypotheses in the test set
 Make one pass through the count file (sketched below)
  – Extract only those counts needed (in the hypotheses)
  – Accumulate the total count and number of unique transitions for each state
 Create a mini-LM
 Apply the mini-LM to the n-best (or lattice) as rescoring.
 Total process requires 15 minutes for a 3-hour test.
  – < 0.1 xRT
  – Could be implemented to be fast on a short test file.
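
A minimal sketch of the single pass over the count file (not the authors' actual tooling): the n-grams recognized in the first-pass hypotheses are loaded into a set, the sorted count file is streamed once, and only the needed counts are kept while accumulating, for each history (state) that occurs in the hypotheses, the total count over all continuations and the number of unique transitions required to build the mini-LM. The line format (words, a tab, then a count), the file name, and the variable names are assumptions.

from collections import defaultdict

def filter_counts(count_file, needed_ngrams):
    # One pass through a count file whose lines look like "w1 w2 ... wn<TAB>count".
    # Keeps only the n-grams seen in the first-pass hypotheses and accumulates,
    # for each history appearing in the hypotheses, the total count over all
    # continuations and the number of unique transitions (needed for the mini-LM).
    needed_histories = {ngram[:-1] for ngram in needed_ngrams}
    kept = {}
    history_total = defaultdict(int)
    history_unique = defaultdict(int)
    with open(count_file, encoding="utf-8") as f:
        for line in f:
            ngram_text, count_text = line.rstrip().split("\t")
            ngram = tuple(ngram_text.split())
            count = int(count_text)
            history = ngram[:-1]
            if history in needed_histories:
                history_total[history] += count
                history_unique[history] += 1
                if ngram in needed_ngrams:
                    kept[ngram] = count
    return kept, history_total, history_unique

# Hypothetical usage: `needed` would be collected from the n-best/lattice hypotheses.
# needed = {("the", "cat", "sat", "on"), ("cat", "sat", "on", "the")}
# kept, totals, unique = filter_counts("counts.4grams.sorted", needed)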

7 Discussion
 Even a single observed token of an n-gram tells you that it is possible.
  – It is important to know the difference between n-grams that are unobserved because they are rare and those that are impossible. [If we could really know this, we would have much better results.]
 The gain from keeping all n-grams is significant (0.5% for 3-grams, 0.3% for 4-grams).
 Small question:
  – Would this result hold for other back-off methods?