9/22/1999 JHU CS 600.465 / Jan Hajič
Introduction to Natural Language Processing (600.465)
LM Smoothing (The EM Algorithm)
Dr. Jan Hajič, CS Dept., Johns Hopkins Univ.

The Zero Problem
"Raw" n-gram language model estimate: necessarily, some zeros.
– Many of them: a trigram model → 2.16 × 10^14 parameters, but data ~ 10^9 words.
– Which are the true zeros? In the optimal situation, even the least frequent trigram would be seen several times, in order to distinguish its probability from that of the other trigrams. That optimal situation cannot happen, unfortunately (open question: how much data would we need?) → we don't know.
– We must eliminate the zeros.
Two kinds of zeros: p(w|h) = 0, or even p(h) = 0!

Why do we need Nonzero Probs?
To avoid infinite cross entropy:
– happens when an event found in the test data has never been seen in the training data; H(p) = ∞ prevents comparing data with ≥ 0 "errors".
To make the system more robust:
– low-count estimates: they typically happen for "detailed" but relatively rare appearances;
– high-count estimates: reliable but less "detailed".

Eliminating the Zero Probabilities: Smoothing
Get a new p'(w) (same Ω): almost p(w), but with no zeros.
Discount w for (some) p(w) > 0: new p'(w) < p(w); Σ_{w ∈ discounted} (p(w) - p'(w)) = D.
Distribute D to all w with p(w) = 0: new p'(w) > p(w)
– possibly also to other w with low p(w).
For some w (possibly): p'(w) = p(w).
Make sure Σ_{w ∈ Ω} p'(w) = 1.
There are many ways of smoothing.

Smoothing by Adding 1
Simplest but not really usable:
– Predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + 1) / (c(h) + |V|)
  for non-conditional distributions: p'(w) = (c(w) + 1) / (|T| + |V|)
– Problem if |V| > c(h) (as is often the case; it may even be >> c(h)!)
Example: Training data: what is it what is small ?   |T| = 8
V = { what, is, it, small, ?, , flying, birds, are, a, bird, . }, |V| = 12
p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≅ .001
p(it is flying.) = .125 × .25 × 0² = 0
p'(it) = .1, p'(what) = .15, p'(.) = .05
p'(what is it?) = .15² × .1² ≅ .0002
p'(it is flying.) = .1 × .15 × .05² ≅ .00004
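As a quick illustration (not from the original slides), a minimal Python sketch of the add-1 estimate on the toy corpus above; the helper name add_one_prob and the choice of Python are our own assumptions:

```python
from collections import Counter

train = "what is it what is small ?".split()
c, T, V_size = Counter(train), 8, 12   # counts, |T|, |V| as given on the slide

def add_one_prob(w, counts, T, V_size):
    """Add-1 (Laplace) estimate p'(w) = (c(w) + 1) / (|T| + |V|)."""
    return (counts[w] + 1) / (T + V_size)

for w in ("it", "what", "."):
    print(w, round(add_one_prob(w, c, T, V_size), 3))
# -> it 0.1, what 0.15, . 0.05, matching the slide's numbers
```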

Adding less than 1
Equally simple:
– Predicting words w from a vocabulary V, training data T:
  p'(w|h) = (c(h,w) + λ) / (c(h) + λ|V|),  λ < 1
  for non-conditional distributions: p'(w) = (c(w) + λ) / (|T| + λ|V|)
Example: Training data: what is it what is small ?   |T| = 8
V = { what, is, it, small, ?, , flying, birds, are, a, bird, . }, |V| = 12
p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≅ .001
p(it is flying.) = .125 × .25 × 0² = 0
Use λ = .1:
p'(it) ≅ .12, p'(what) ≅ .23, p'(.) ≅ .01
p'(what is it?) = .23² × .12² ≅ .0007
p'(it is flying.) = .12 × .23 × .01² ≅ .000003
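The same idea with λ as a parameter; again this is a sketch of ours rather than the lecture's code (add_lambda_prob is a hypothetical helper, and λ = 1 recovers add-1):

```python
from collections import Counter

train = "what is it what is small ?".split()
c, T, V_size = Counter(train), 8, 12   # counts, |T|, |V| as on the slide

def add_lambda_prob(w, counts, T, V_size, lam):
    """Add-lambda estimate p'(w) = (c(w) + lam) / (|T| + lam * |V|)."""
    return (counts[w] + lam) / (T + lam * V_size)

for w in ("it", "what", "."):
    print(w, round(add_lambda_prob(w, c, T, V_size, lam=0.1), 2))
# -> it 0.12, what 0.23, . 0.01, as on the slide; lam=1.0 gives the add-1 values
```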

Good-Turing
Suitable for estimation from large data:
– Similar idea: discount/boost the relative-frequency estimate:
  p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w))),
  where N(c) is the count of words with count c (count-of-counts);
  specifically, for c(w) = 0 (unseen words), p_r(w) = N(1) / (|T| × N(0)).
– Good for small counts (< 5-10, where N(c) is high).
– Variants (see Manning & Schütze).
– Normalization! (so that we have Σ_w p'(w) = 1)

Good-Turing: An Example
Remember: p_r(w) = (c(w) + 1) × N(c(w) + 1) / (|T| × N(c(w)))
Training data: what is it what is small ?   |T| = 8
V = { what, is, it, small, ?, , flying, birds, are, a, bird, . }, |V| = 12
p(it) = .125, p(what) = .25, p(.) = 0
p(what is it?) = .25² × .125² ≅ .001
p(it is flying.) = .125 × .25 × 0² = 0
Raw reestimation (N(0) = 6, N(1) = 4, N(2) = 2, N(i) = 0 for i > 2):
p_r(it) = (1+1) × N(2) / (8 × N(1)) = 2 × 2 / (8 × 4) = .125
p_r(what) = (2+1) × N(3) / (8 × N(2)) = 3 × 0 / (8 × 2) = 0: keep the original p(what)
p_r(.) = (0+1) × N(1) / (8 × N(0)) = 1 × 4 / (8 × 6) ≅ .083
Normalize (divide by 1.5 = Σ_{w ∈ V} p_r(w)) and compute:
p'(it) ≅ .08, p'(what) ≅ .17, p'(.) ≅ .06
p'(what is it?) = .17² × .08² ≅ .0002
p'(it is flying.) = .08 × .17 × .06² ≅ .00004
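A small Python sketch (our illustration, not the lecture's code) of the raw Good-Turing reestimate plus the normalization step; the "<extra>" token is a hypothetical placeholder for the 12th vocabulary item that is unreadable in the transcript, given count 1 so that |T| = 8 and the count-of-counts match the slide:

```python
from collections import Counter

def good_turing(counts, vocab, T):
    """Raw Good-Turing reestimate p_r(w) = (c+1) * N(c+1) / (|T| * N(c)),
    keeping the relative-frequency estimate where N(c+1) = 0, then normalizing."""
    N = Counter(counts.get(w, 0) for w in vocab)   # count-of-counts N(c)
    p_r = {}
    for w in vocab:
        c = counts.get(w, 0)
        if N[c + 1] > 0:
            p_r[w] = (c + 1) * N[c + 1] / (T * N[c])
        else:                      # e.g. c(what) = 2, N(3) = 0: keep original estimate
            p_r[w] = c / T
    Z = sum(p_r.values())          # 1.5 in the slide's example
    return {w: p / Z for w, p in p_r.items()}

# Slide's toy data; "<extra>" is our placeholder for the missing vocabulary item.
counts = Counter("what is it what is small ?".split()) + Counter(["<extra>"])
vocab = list(counts) + ["flying", "birds", "are", "a", "bird", "."]
p = good_turing(counts, vocab, T=8)
print(round(p["it"], 2), round(p["what"], 2), round(p["."], 2))   # 0.08 0.17 0.06
```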

Smoothing by Combination: Linear Interpolation
Combine what? Distributions of various levels of detail vs. reliability.
n-gram models: use the (n-1)-gram, (n-2)-gram, ..., uniform distribution (reliability goes up as detail goes down).
Simplest possible combination:
– sum the probabilities and normalize:
  p(0|0) = .8, p(1|0) = .2, p(0|1) = 1, p(1|1) = 0, p(0) = .4, p(1) = .6:
  p'(0|0) = .6, p'(1|0) = .4, p'(0|1) = .7, p'(1|1) = .3

Typical n-gram LM Smoothing
Weight in less detailed distributions using λ = (λ_0, λ_1, λ_2, λ_3):
  p'(w_i | w_{i-2}, w_{i-1}) = λ_3 p_3(w_i | w_{i-2}, w_{i-1}) + λ_2 p_2(w_i | w_{i-1}) + λ_1 p_1(w_i) + λ_0 / |V|
Normalize: λ_i > 0, Σ_{i=0..n} λ_i = 1 is sufficient (λ_0 = 1 - Σ_{i=1..n} λ_i)  (n = 3)
Estimation using MLE:
– fix the p_3, p_2, p_1 and |V| parameters as estimated from the training data;
– then find the {λ_i} which minimizes the cross entropy (maximizes the probability of the data):
  -(1/|D|) Σ_{i=1..|D|} log_2(p'(w_i | h_i))
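To make the objective concrete, here is one possible Python rendering of the interpolated probability and the cross-entropy criterion (a sketch of ours, under the assumption that p3, p2, p1 are supplied as callables; all names are our own):

```python
import math

def interp_prob(w, h, lambdas, p3, p2, p1, V_size):
    """p'(w|h) = l3*p3(w|h) + l2*p2(w|h[-1:]) + l1*p1(w) + l0/|V|.
    h is the history tuple (w_{i-2}, w_{i-1}); p3, p2, p1 are callables."""
    l0, l1, l2, l3 = lambdas
    return l3 * p3(w, h) + l2 * p2(w, h[-1:]) + l1 * p1(w) + l0 / V_size

def cross_entropy(data, lambdas, p3, p2, p1, V_size):
    """-(1/|D|) * sum_i log2 p'(w_i | h_i) over (history, word) pairs."""
    return -sum(math.log2(interp_prob(w, h, lambdas, p3, p2, p1, V_size))
                for h, w in data) / len(data)
```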

Held-out Data
What data to use?
– Try the training data T: but we will always get λ_3 = 1. Why? (Let p_iT be an i-gram distribution estimated using relative frequencies from T.)
  Minimizing H_T(p') over the vector λ, with p' = λ_3 p_3T + λ_2 p_2T + λ_1 p_1T + λ_0/|V|:
  – remember: H_T(p') = H(p_3T) + D(p_3T || p'); p_3T is fixed, so H(p_3T) is fixed (and the best achievable);
  – which p' minimizes H_T(p')? Obviously, a p' for which D(p_3T || p') = 0
  – ... and that's p_3T (because D(p||p) = 0, as we know)
  – ... and certainly p' = p_3T if λ_3 = 1 (maybe in some other cases, too)
    (p' = 1 × p_3T + 0 × p_2T + 0 × p_1T + 0/|V|)
– Thus: do not use the training data for the estimation of λ!
  We must hold out part of the training data (heldout data, H);
  ... call the remaining data the (true/raw) training data, T;
  the test data S (e.g., for comparison purposes): still different data!

The Formulas
Repeat: minimizing -(1/|H|) Σ_{i=1..|H|} log_2(p'(w_i | h_i)) over λ, where
  p'(w_i | h_i) = p'(w_i | w_{i-2}, w_{i-1}) = λ_3 p_3(w_i | w_{i-2}, w_{i-1}) + λ_2 p_2(w_i | w_{i-1}) + λ_1 p_1(w_i) + λ_0/|V|
"Expected counts (of lambdas)": for j = 0..3,
  c(λ_j) = Σ_{i=1..|H|} (λ_j p_j(w_i | h_i) / p'(w_i | h_i))
"Next" (the new lambdas): for j = 0..3,
  λ_j,next = c(λ_j) / Σ_{k=0..3} c(λ_k)

The (Smoothing) EM Algorithm
1. Start with some λ such that λ_j > 0 for all j ∈ 0..3.
2. Compute the "expected counts" for each λ_j.
3. Compute the new set of λ_j, using the "next" formula.
4. Start over at step 2, unless a termination condition is met.
Termination condition: convergence of λ.
– Simply set an ε, and finish if |λ_j - λ_j,next| < ε for each j (step 3).
Guaranteed to converge: follows from Jensen's inequality, plus a technical proof.
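A compact Python sketch of steps 1-4 (our own code, not the lecture's; it assumes the component distributions are passed in as callables p[j](w, h), with p[0] returning the uniform 1/|V| term):

```python
def em_lambdas(heldout, p, n_comp=4, eps=1e-4):
    """EM for interpolation weights.
    heldout: list of (history, word) pairs from H.
    p: list of n_comp callables, p[j](w, h) -> probability; p[0] is assumed
       to be the uniform 1/|V| component.  A sketch, not the original code."""
    lambdas = [1.0 / n_comp] * n_comp                 # step 1: all lambdas > 0
    while True:
        c = [0.0] * n_comp
        for h, w in heldout:                          # step 2: expected counts
            p_interp = sum(lambdas[j] * p[j](w, h) for j in range(n_comp))
            for j in range(n_comp):
                c[j] += lambdas[j] * p[j](w, h) / p_interp
        total = sum(c)
        nxt = [cj / total for cj in c]                # step 3: "next" formula
        if all(abs(nxt[j] - lambdas[j]) < eps for j in range(n_comp)):
            return nxt                                # step 4: converged
        lambdas = nxt
```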

Remark on Linear Interpolation Smoothing
"Bucketed" smoothing:
– Use several vectors of λ instead of one, based on (the frequency of) the history: λ(h).
  E.g., for h = (micrograms, per) we will have λ(h) = (.999, .0009, .00009, .00001) (because "cubic" is the only word to follow...).
– Actually: not a separate set for each history, but rather a set for "similar" histories (a "bucket"): λ(b(h)), where b: V² → N (in the case of trigrams);
  b classifies histories according to their reliability (~ frequency).

Bucketed Smoothing: The Algorithm
First, determine the bucketing function b (use the heldout data!):
– decide in advance that you want, e.g., 1000 buckets;
– compute the total frequency of histories that should go into one bucket (f_max(b));
– gradually fill your buckets starting from the most frequent bigrams, so that the sum of frequencies does not exceed f_max(b) (you might end up with slightly more than 1000 buckets).
Divide your heldout data according to the buckets.
Apply the previous algorithm to each bucket and its data (see the sketch below).
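One possible reading of this procedure in Python (a sketch; the target of about 1000 buckets and the greedy fill follow the slide, while everything else, including the name build_buckets, is an assumption of ours):

```python
def build_buckets(history_freq, n_buckets=1000):
    """Greedily group bigram histories into buckets of roughly equal total frequency.
    history_freq: dict mapping history (w_{i-2}, w_{i-1}) -> its frequency.
    Returns a dict history -> bucket id.  A sketch of the slide's procedure."""
    f_max = sum(history_freq.values()) / n_buckets   # target mass per bucket
    b, bucket_id, mass = {}, 0, 0.0
    # fill buckets starting from the most frequent histories
    for h, f in sorted(history_freq.items(), key=lambda kv: -kv[1]):
        if mass + f > f_max and mass > 0:            # start a new bucket
            bucket_id, mass = bucket_id + 1, 0.0
        b[h] = bucket_id
        mass += f
    return b   # may yield slightly more than n_buckets buckets
```

Each bucket's histories then share one λ vector, estimated by running the EM sketch above on that bucket's portion of the heldout data.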

Simple Example
Raw distribution (unigram only; smooth with the uniform distribution):
  p(a) = .25, p(b) = .5, p(α) = 1/64 for α ∈ {c..r}, = 0 for the rest: s, t, u, v, w, x, y, z
Heldout data: baby; use one set of λ (λ_1: unigram, λ_0: uniform).
Start with λ_1 = .5:
  p'(b) = .5 × .5 + .5 / 26 ≅ .27
  p'(a) = .5 × .25 + .5 / 26 ≅ .14
  p'(y) = .5 × 0 + .5 / 26 ≅ .02
  c(λ_1) = .5×.5/.27 + .5×.25/.14 + .5×.5/.27 + .5×0/.02 = 2.72
  c(λ_0) = .5×.04/.27 + .5×.04/.14 + .5×.04/.27 + .5×.04/.02 = 1.28
Normalize: λ_1,next = .68, λ_0,next = .32.
Repeat from step 2 (recompute p' first for efficient computation, then the c(λ_i), ...).
Finish when the new lambdas are almost equal to the old ones (say, < 0.01 difference).
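Plugging this toy example into the em_lambdas sketch given earlier (our code; the unigram table and the 1/26 uniform component are the assumed inputs) reproduces the same first step and then runs to convergence:

```python
# Unigram distribution from the slide: p(a)=.25, p(b)=.5, p(c..r)=1/64, rest 0.
p1 = {"a": 0.25, "b": 0.5, **{chr(o): 1 / 64 for o in range(ord("c"), ord("s"))}}
components = [lambda w, h: 1 / 26,             # lambda_0: uniform over 26 letters
              lambda w, h: p1.get(w, 0.0)]     # lambda_1: unigram from the slide
heldout = [(None, ch) for ch in "baby"]        # histories are irrelevant for unigrams
print(em_lambdas(heldout, components, n_comp=2))
# first iteration gives ~(.32, .68) as on the slide; converges to roughly (.28, .72)
```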

Some More Technical Hints
Set V = {all words from the training data}. You may also consider V = T ∪ H, but it does not make the coding in any way simpler (in fact, harder). But: you must never use the test data for your vocabulary!
Prepend two "words" in front of all data: this avoids beginning-of-data problems; call these index -1 and 0: then the formulas hold exactly.
When c_n(w) = 0: assign 0 probability to p_n(w|h) where c_{n-1}(h) > 0, but a uniform probability (1/|V|) to those p_n(w|h) where c_{n-1}(h) = 0. [This must be done both when working on the heldout data during EM and when computing the cross entropy on the test data!]
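The last hint might translate into something like the following (our sketch; the count-table arguments and the name p_ngram are assumptions):

```python
def p_ngram(w, h, ngram_counts, history_counts, V_size):
    """Relative-frequency p_n(w | h) with the slide's convention:
    0 when the history was seen but the n-gram was not,
    and 1/|V| when the history itself is unseen (c_{n-1}(h) = 0)."""
    c_h = history_counts.get(h, 0)
    if c_h == 0:
        return 1.0 / V_size
    return ngram_counts.get((h, w), 0) / c_h
```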