Part 5: Language Model (CSE717, Spring 2008, CUBS, University at Buffalo)

Examples of Good & Bad Language Models: excerpt from Herman, a comic strip by Jim Unger

What's a Language Model? A language model is a probability distribution over word sequences, e.g. P("And nothing but the truth") >> P("And nuts sing on the roof") ≈ 0

What's a language model for? Speech recognition, handwriting recognition, spelling correction, optical character recognition, machine translation (and anyone doing statistical modeling)

The Equation Find the word sequence W that maximizes P(W | O): W* = argmax_W P(W | O) = argmax_W P(O | W) P(W), where P(W) is the language model. The observation O can be image features (handwriting recognition), acoustics (speech recognition), a word sequence in another language (MT), etc.

How Language Models Work It is hard to compute P("And nothing but the truth") directly, so we decompose the probability with the chain rule: P("and nothing but the truth") = P("and") × P("nothing | and") × P("but | and nothing") × P("the | and nothing but") × P("truth | and nothing but the")

The Trigram Approximation Assume each word depends only on the previous two words: P("the | and nothing but") ≈ P("the | nothing but"), P("truth | and nothing but the") ≈ P("truth | but the")
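
To make this concrete, here is a minimal Python sketch (not from the slides) of how the chain-rule product is truncated to a two-word history; `trigram_prob` is a hypothetical estimator for P(w | u v), and the `<s>` padding tokens are an illustrative convention.

```python
# Minimal sketch: sentence probability under the trigram approximation.
# `trigram_prob(w, u, v)` is a hypothetical function returning P(w | u v).
def sentence_prob(words, trigram_prob):
    prob = 1.0
    padded = ["<s>", "<s>"] + list(words)   # pad so every word has two predecessors
    for i in range(2, len(padded)):
        prob *= trigram_prob(padded[i], padded[i - 2], padded[i - 1])
    return prob
```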

How to Find Probabilities? Count from real text: Pr("the" | "nothing but") ≈ c("nothing but the") / c("nothing but")
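
A small sketch of this counting estimate, assuming a whitespace-tokenized corpus; the tiny corpus and function names are illustrative only.

```python
from collections import Counter

corpus = "and nothing but the truth and nothing but the truth".split()
trigram_counts = Counter(zip(corpus, corpus[1:], corpus[2:]))
bigram_counts = Counter(zip(corpus, corpus[1:]))

def mle_trigram_prob(z, x, y):
    # Pr(z | x y) ~= c(x y z) / c(x y); 0.0 when the context was never seen
    if bigram_counts[(x, y)] == 0:
        return 0.0
    return trigram_counts[(x, y, z)] / bigram_counts[(x, y)]

print(mle_trigram_prob("the", "nothing", "but"))   # 1.0 on this toy corpus
```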

Evaluation How can you tell a good language model from a bad one? Run a speech recognizer (or your application of choice) and calculate the word error rate. Drawbacks: this is slow and specific to your recognizer.

Perplexity: An Example Data: "the whole truth and nothing but the truth". Lexicon: L = {the, whole, truth, and, nothing, but}. Model 1: unigram, Pr(L1) = … = Pr(L6) = 1/6. Model 2: unigram, Pr("the") = Pr("truth") = 1/4, Pr("whole") = Pr("and") = Pr("nothing") = Pr("but") = 1/8.
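
Working the example through the usual definition, perplexity = 2^H with H = -(1/N) Σ log2 P(w_i), gives 6.0 for Model 1 and about 5.66 for Model 2 on these eight words. A quick check in Python (illustrative code, not from the slides):

```python
import math

data = "the whole truth and nothing but the truth".split()
model1 = {w: 1 / 6 for w in ["the", "whole", "truth", "and", "nothing", "but"]}
model2 = {"the": 1 / 4, "truth": 1 / 4,
          "whole": 1 / 8, "and": 1 / 8, "nothing": 1 / 8, "but": 1 / 8}

def perplexity(model, words):
    # PP = 2 ** ( -(1/N) * sum_i log2 P(w_i) )
    log_sum = sum(math.log2(model[w]) for w in words)
    return 2 ** (-log_sum / len(words))

print(perplexity(model1, data))   # 6.0
print(perplexity(model2, data))   # ~5.66: Model 2 assigns the data higher probability
```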

Perplexity: Is Lower Better? Remarkable fact: the "true" model for the data has the lowest possible perplexity, so the lower the perplexity, the closer we are to the true model. Perplexity correlates well with the error rate of the recognition task; it correlates better when both models are trained on the same data, and not as well when the training data changes.

Smoothing The maximum-likelihood estimate is terrible on test data: if C(xyz) = 0, i.e. the trigram never occurred in training, its estimated probability is 0. P("sing" | "nuts") = 0 leads to infinite perplexity!

Smoothing: Add One Add-one smoothing: P(z | xy) = (C(xyz) + 1) / (C(xy) + V), where V is the vocabulary size. Add-delta smoothing: P(z | xy) = (C(xyz) + δ) / (C(xy) + δV). Simple add-one smoothing does not perform well in practice: the probability of rarely seen events is over-estimated.
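
A minimal sketch of both estimators, reusing the trigram/bigram counts from the counting example above; V is the vocabulary size and delta = 1 recovers add-one (Laplace) smoothing.

```python
def add_delta_prob(z, x, y, trigram_counts, bigram_counts, V, delta=1.0):
    # P(z | x y) = (C(xyz) + delta) / (C(xy) + delta * V); delta = 1 is add-one
    return (trigram_counts[(x, y, z)] + delta) / (bigram_counts[(x, y)] + delta * V)
```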

Smoothing: Simple Interpolation Interpolate the trigram, bigram, and unigram estimates, P_interp(z | xy) = λ3 P(z | xy) + λ2 P(z | y) + λ1 P(z) with λ1 + λ2 + λ3 = 1, choosing the weights for the best combination (e.g. on held-out data). Almost good enough.
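
A sketch of fixed-weight interpolation; the component estimators and the lambda values (assumed to sum to 1 and to be tuned on held-out data) are illustrative assumptions, not values from the slides.

```python
def interpolated_prob(z, x, y, p_tri, p_bi, p_uni, lambdas=(0.6, 0.3, 0.1)):
    # P(z | x y) = l3 * P_tri(z | x y) + l2 * P_bi(z | y) + l1 * P_uni(z)
    l3, l2, l1 = lambdas
    return l3 * p_tri(z, x, y) + l2 * p_bi(z, y) + l1 * p_uni(z)
```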

Smoothing: Redistribution of Probability Mass (Backing Off) [Katz87] Discounting takes some probability mass away from the n-grams seen in training; the discounted probability mass is then redistributed to unseen events through the (n-1)-gram (back-off) distribution.
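
A much-simplified sketch of this discount-and-redistribute idea (not the full Katz scheme): seen trigrams keep a discounted estimate, and whatever mass the discount frees is spread over unseen words in proportion to the bigram model. `discount_fn` stands in for whichever discounting rule is used (linear, absolute, Good-Turing, ...).

```python
def backoff_distribution(x, y, vocab, trigram_counts, bigram_counts,
                         bigram_prob, discount_fn):
    # Assumes C(x y) > 0 and that at least one word in vocab is unseen after (x, y).
    c_xy = bigram_counts[(x, y)]
    probs, seen = {}, set()
    for z in vocab:
        c = trigram_counts[(x, y, z)]
        if c > 0:
            probs[z] = discount_fn(c) / c_xy          # discounted trigram estimate
            seen.add(z)
    freed = 1.0 - sum(probs.values())                 # discounted probability mass
    unseen_mass = sum(bigram_prob(z, y) for z in vocab if z not in seen)
    for z in vocab:
        if z not in seen:                             # redistribute via (n-1)-gram
            probs[z] = freed * bigram_prob(z, y) / unseen_mass
    return probs
```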

Linear Discount Every seen estimate is scaled down by a constant factor: P(z | xy) = (1 − α) C(xyz) / C(xy) for C(xyz) > 0, and the freed mass α is redistributed. The factor α can be determined by the relative frequency of singletons, i.e., events observed exactly once in the data [Ney95].
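
A hedged sketch of the seen-event part of linear discounting, with alpha set to the relative frequency of singletons as suggested in [Ney95]; the freed mass alpha would then be redistributed as in the back-off sketch above.

```python
def linear_discounted_prob(z, x, y, trigram_counts, bigram_counts):
    c_xy = bigram_counts[(x, y)]
    singletons = sum(1 for (a, b, _), c in trigram_counts.items()
                     if (a, b) == (x, y) and c == 1)
    alpha = singletons / c_xy          # relative frequency of singleton events
    return (1.0 - alpha) * trigram_counts[(x, y, z)] / c_xy
```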

More General Formulation Drawback of linear discounting: the counts of frequently observed events are modified the most, which goes against the "law of large numbers". Generalization: make the discount a function of the history y, determined by cross-validation; this requires more data, and the computation is expensive.

Absolute Discounting The discount is an absolute value: a fixed amount D is subtracted from each observed count, so P(z | xy) = (C(xyz) − D) / C(xy) for C(xyz) > 0. Works pretty well, and is easier than linear discounting.
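
A minimal sketch of absolute discounting in its interpolated form: a fixed D is subtracted from every non-zero count and the freed mass goes to the bigram back-off distribution. D = 0.75 is only an illustrative assumption, not a value from the slides.

```python
def absolute_discount_prob(z, x, y, trigram_counts, bigram_counts,
                           bigram_prob, D=0.75):
    c_xy = bigram_counts[(x, y)]
    if c_xy == 0:
        return bigram_prob(z, y)               # no trigram evidence: back off fully
    # Number of distinct words ever observed after the context (x, y).
    n_types = sum(1 for (a, b, _) in trigram_counts if (a, b) == (x, y))
    backoff_weight = D * n_types / c_xy        # total mass freed by the discount
    return (max(trigram_counts[(x, y, z)] - D, 0) / c_xy
            + backoff_weight * bigram_prob(z, y))
```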

References
[1] Katz, S., "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 1987.
[2] Ney, H., Essen, U., Kneser, R., "On the estimation of 'small' probabilities by leaving-one-out," IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12), 1995.
[3] Goodman, J., "A Tutorial on Language Modeling: The State of the Art in Language Modeling," research.microsoft.com/~joshuago/lm-tutorial-public.ppt