Ngram models and the Sparsity problem John Goldsmith November 2002.

The task Find a probability distribution for the current word in a text (utterance, etc.), given what the last n words have been (n = 0, 1, 2, 3). Why this is reasonable. What the problems are.

Why this is reasonable The last few words tell us a lot about the next word: collocations; prediction of the current category (e.g., "the" is followed by nouns or adjectives); semantic domain.

Reminder about applications Speech recognition Handwriting recognition POS tagging

Problem of sparsity Words are very rare events (even if we're not aware of that), so what feel like perfectly common sequences of words may be too rare to actually occur in our training corpus.

What’s the next word? in a ____ with a ____ the last ____ shot a _____ open the ____ over my ____ President Bill ____ keep tabs ____

borrowed from Henke, based on Manning and Schütze Example: Corpus: five Jane Austen novels, N = 617,091 words, V = 14,585 unique words. Task: predict the next word of the trigram "inferior to ________" from the test data, Persuasion: "[In person, she was] inferior to both [sisters.]"

borrowed from Henke, based on Manning and Schütze Instances in the Training Corpus: “inferior to ________”

Maximum Likelihood Estimate: borrowed from Henke, based on Manning and Schütze
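In the standard form (as in Manning and Schütze), the maximum-likelihood estimate divides the count of the full n-gram by the count of its context; for the "inferior to" example:

    P_{ML}(w_n \mid w_1 \cdots w_{n-1}) \;=\; \frac{C(w_1 \cdots w_n)}{C(w_1 \cdots w_{n-1})},
    \qquad\text{e.g.}\quad
    P_{ML}(\text{both} \mid \text{inferior to}) \;=\; \frac{C(\text{inferior to both})}{C(\text{inferior to})}.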

Maximum Likelihood Distribution = D_ML: probability is assigned in exact proportion to the n-gram counts in the training corpus. Anything not found in the training corpus gets probability 0.

borrowed from Henke, based on Manning and Schütze Actual Probability Distribution:

Conundrum Do we stick very tightly to the "Maximum Likelihood" model, assigning zero probability to sequences not seen in the training corpus? Answer: we simply cannot; the results are just too bad.

Smoothing We need, therefore, some "smoothing" procedure which adds some of the probability mass to unseen n-grams, and must therefore take away some of the probability mass from observed n-grams.

Discounting, back-off, and deleted interpolation These words all go with “smoothing”. “Smoothing” describes the general problem we face: getting probability mass to the great unseen. “Discounting” describes who we take probability mass away from, and how much….

“Back-off” and “deleted interpolation” are the two standard ways of redistributing the probability mass taken away by discounting.

Back-off and deleted interpolation for a given context: What is the probability of the words {w_i} following a given context, e.g., "in the __" (e.g., pocket)? Words that were found in this context get a probability a bit less than their maximum-likelihood estimate, and with back-off, the held-back probability mass is distributed over words in the shorter context "the __". And how?

Probability mass is distributed over "the WORD" pretty much in proportion to how often each word appears in the context "the ___". But even there, we hold back some of the probability mass, and assign it to all words independently of context.
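To make the redistribution concrete, here is a minimal Python sketch of such a back-off chain (a simplification, not full Katz back-off: the absolute discount of 0.1, the add-one floor at the unigram level, and the function names are illustrative assumptions, and vocab is assumed to contain every word that occurs in the training tokens):

    from collections import Counter, defaultdict

    def train_counts(tokens, max_order=3):
        """counts[k] maps a context tuple of length k to a Counter of following words."""
        counts = {k: defaultdict(Counter) for k in range(max_order)}
        for k in range(max_order):                    # k = context length (0, 1, 2)
            for i in range(k, len(tokens)):
                counts[k][tuple(tokens[i - k:i])][tokens[i]] += 1
        return counts

    def backoff_prob(word, context, counts, vocab, discount=0.1):
        """P(word | context): discounted count if seen, otherwise back off to a shorter context."""
        context = tuple(context)
        if not context:                               # unigram base case, with an add-one floor
            uni = counts[0][()]
            return (uni[word] + 1) / (sum(uni.values()) + len(vocab))
        ctx = counts[len(context)][context]
        total = sum(ctx.values())
        if total == 0:                                # context itself never seen: just back off
            return backoff_prob(word, context[1:], counts, vocab, discount)
        if ctx[word] > 0:                             # seen here: a bit less than the ML estimate
            return (ctx[word] - discount) / total
        # unseen in this context: spread the held-back mass in proportion to the shorter context
        held_back = discount * len(ctx) / total
        lower = lambda w: backoff_prob(w, context[1:], counts, vocab, discount)
        unseen = [w for w in vocab if w not in ctx]
        return held_back * lower(word) / sum(lower(w) for w in unseen)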

Deleted Interpolation Is linear: for any word in context (e.g., pocket after in the), we choose three λs and take its probability to be the weighted average of the trigram, bigram, and unigram models: λ1 P(pocket|in the) + λ2 P(pocket|the) + λ3 P(pocket). If we fixed the λs, we would only need to insist that they sum to 1.0. But…

We don't fix them: we allow them to vary, depending on the context ("in the"); then we need to do some fancier calculations (Expectation-Maximization).
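A minimal Python sketch of deleted interpolation with fixed weights (the λ values 0.6/0.3/0.1, the function names, and corpus_tokens are illustrative assumptions; in the EM-trained version just described, the λs would also be allowed to depend on the context):

    from collections import Counter

    def ngram_counts(tokens, n):
        """Counter over the n-grams (tuples of length n) of a token list."""
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def interpolated_prob(word, context, tokens, lambdas=(0.6, 0.3, 0.1)):
        """P(word | two-word context) as a fixed-weight average of trigram, bigram, unigram."""
        u, v = context
        l3, l2, l1 = lambdas                  # trigram, bigram, unigram weights; must sum to 1.0
        uni, bi, tri = (ngram_counts(tokens, n) for n in (1, 2, 3))
        p_tri = tri[(u, v, word)] / bi[(u, v)] if bi[(u, v)] else 0.0
        p_bi = bi[(v, word)] / uni[(v,)] if uni[(v,)] else 0.0
        p_uni = uni[(word,)] / len(tokens)
        return l3 * p_tri + l2 * p_bi + l1 * p_uni

    # e.g. interpolated_prob("pocket", ("in", "the"), corpus_tokens)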

General ideas about discounting Three closely related ideas that are widely used.

“Sum of counts” method of creating a distribution You can always get a distribution from a set of counts by dividing each count by the total count of the set. "Bins": the name for the different preceding n-grams that we keep track of. Each bin gets a probability, and the probabilities must sum to 1.0.
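For example (a toy Python illustration; the bigram bins and their counts are made up):

    from collections import Counter

    bin_counts = Counter({("in", "the"): 312, ("of", "the"): 270, ("on", "a"): 41})  # made-up counts
    total = sum(bin_counts.values())
    distribution = {b: c / total for b, c in bin_counts.items()}
    assert abs(sum(distribution.values()) - 1.0) < 1e-12  # the bin probabilities sum to 1.0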

Zero knowledge Suppose we give a count of 1 to every possible bin in our model. If our model is a bigram model, we give a count of 1 to each of the V² conceivable bigrams (V if unigram, V³ if trigram, etc.). Admittedly, this model assumes zero knowledge of the language…. We get a distribution by assigning probability 1/V² to each bin. Call this distribution D_N.

Too much knowledge Give each bin exactly the number of counts that it earns from the training corpus. If we are making a bigram model, then there are V² bins, and those bigrams that do not appear in the training corpus get a count of 0. We get the Maximum Likelihood distribution by dividing by the total count = N.

Laplace ("Adding one") Add the bin counts from the Zero-knowledge case (1 for each bin, V² of them in the bigram case) and the bin counts from the Too-much-knowledge case (the counts in the training corpus). Divide by the total number of counts = V² + N. Formula: each bin gets probability (count in corpus + 1) / (V² + N).

Lidstone's Law Choose a number λ, between 0 and 1, for the count in the No-Knowledge distribution. Then the count in each bin is (count in corpus + λ), and we assign it probability (count in corpus + λ) / (N + λV²) (where the number of bins is V², because we're considering a bigram model). If λ = 1 this is Laplace; if λ = 0.5, this is the Jeffreys-Perks Law; if λ → 0, this approaches Maximum Likelihood.
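As a one-function Python sketch (the function name and the default λ = 0.5 are illustrative, not from the slides):

    def lidstone_prob(count, N, V, lam=0.5):
        """Lidstone's law for one bin of a bigram model with V**2 bins:
        (count + lam) / (N + lam * V**2).
        lam = 1 gives Laplace, lam = 0.5 gives Jeffreys-Perks, and as lam -> 0
        this approaches the maximum-likelihood estimate count / N."""
        return (count + lam) / (N + lam * V ** 2)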

Another way to say this… We can also think of Laplace as a weighted average of two distributions, the No-Knowledge distribution and the Maximum Likelihood distribution…

2. Averaging distributions Remember this: If you take weighted averages of distributions of this form: λ * distribution D_1 + (1-λ) * distribution D_2, the result is a distribution: all the numbers sum to 1.0. This means that you split the probability mass between the two distributions (in proportion λ to 1-λ), then divide up those smaller portions exactly according to D_1 and D_2.
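The check that the result is a distribution is one line of algebra:

    \sum_x \bigl[\lambda\,D_1(x) + (1-\lambda)\,D_2(x)\bigr]
      \;=\; \lambda\sum_x D_1(x) + (1-\lambda)\sum_x D_2(x)
      \;=\; \lambda + (1-\lambda) \;=\; 1.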

“Adding 1” (Laplace) Is it clear that the Laplace probability (count in corpus + 1) / (N + V²)…

…is a special case of λ·D_N + (1-λ)·D_ML, where λ = V²/(V²+N)? How big is λ? If V = 50,000, then V² = 2,500,000,000. This means that if our corpus is two and a half billion words, we are still reserving half of our probability mass for zero knowledge – that's too much: λ = V²/(V²+N) = 2,500,000,000/5,000,000,000 = 0.5.
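Written out, the decomposition is an exact identity (C here is the count of a single bin):

    \lambda\cdot\frac{1}{V^2} \;+\; (1-\lambda)\cdot\frac{C}{N}
      \;=\; \frac{V^2}{V^2+N}\cdot\frac{1}{V^2} \;+\; \frac{N}{V^2+N}\cdot\frac{C}{N}
      \;=\; \frac{C+1}{V^2+N},
    \qquad \lambda = \frac{V^2}{V^2+N},

which is exactly the Laplace probability of that bin.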

Good-Turing discounting The central problem is assigning probability mass to unseen examples, especially unseen bigrams (or trigrams), based on a known vocabulary. Good-Turing estimation says that a good estimate for the total probability of unseen n-grams is the fraction of the corpus made up of n-grams seen exactly once: N1/N.

Intuition behind Turing's idea Suppose you want to know, in general, the likelihood that the next word you see will be a word of some given frequency i, as far as the corpus that you've observed so far is concerned. Consider the inverted problem: you've seen a corpus so far, with a bunch of words with various frequencies….

We usually think of creating a corpus as being like consecutive selection of words from a dictionary, with a (stationary) word probability distribution. Suppose, instead, that corpus creation consists of: First, selection of a (multi-)set of N words in an unordered fashion; and then Second, imposing an ordering on them by consecutively picking words to be the last word, second-to-last word, etc.:

First: Put N words (some different, some the same) in a bag. They’re an unordered set (multiset, really).

Now Select what will be the last word of the corpus: Pick it out, label it word #N. The bag now has N-1 words in it.

Continue: Take out a word, declare it to be word #N-1. Repeat till you get to the first word…

We now have a sequence of moments that illustrate the creation of the corpus (though we did it backwards in time). At each moment, we know what words were in the bag, and we know what word just got removed from it (or rather, what word is just about to be removed from it, from the point of view of normal time)… Now, back to thinking about Good-Turing from the normal, usual point of view…

Thinking forward, you want to create a corpus which is one word smaller, so you randomly delete a word from your corpus. What's the probability that you (randomly) choose a word of frequency 1? 2? 27? Let's say there are N1 words of frequency 1, N2 words of frequency 2, etc. Then Σ_i i·N_i = total length of corpus = N, and the probability of removing a word of frequency i is i·N_i / N.

So the probability of choosing a word that occurred once is N1/N – that is, the number of words that occurred once, divided by the total length of the corpus.

So we take the probability mass assigned empirically to n-grams seen once, and assign it to all the unseen n-grams. We know how many there are: if the vocabulary is of size V, then there are Vⁿ possible n-grams; if we have seen T distinct n-grams, then each unseen n-gram gets probability (N1/N) / (Vⁿ - T).

So unseen n-grams got all of the probability mass that had been earned by the n-grams seen once. The n-grams seen once will in turn grab all of the probability mass earned by n-grams seen twice, distributed uniformly among them: each once-seen n-gram gets (2·N2/N) / N1.

So n-grams seen twice will take all the probability mass earned by n-grams seen three times… and we stop this foolishness around the time when observed frequencies are reliable, around 10 times. [Slide diagram: the counts of n-grams seen 1x, 2x, 3x, 4x, 5x are shifted down one frequency class to give the model's predicted probabilities, with the mass from the once-seen class covering all unseen n-grams.]
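A small Python sketch of this count-shifting (the cutoff of 10 and the names are illustrative; above the cutoff the observed counts are kept as-is, as the slide suggests):

    from collections import Counter

    def good_turing_counts(ngram_counts, cutoff=10):
        """Adjusted count r* = (r + 1) * N_{r+1} / N_r for each count r below the cutoff,
        where N_r is the number of distinct n-grams seen exactly r times. Also returns N_1,
        the count mass that is freed up for the unseen n-grams."""
        N_r = Counter(ngram_counts.values())          # r -> number of n-grams seen r times
        adjusted = {}
        for ngram, r in ngram_counts.items():
            if r < cutoff and N_r[r + 1] > 0:
                adjusted[ngram] = (r + 1) * N_r[r + 1] / N_r[r]
            else:
                adjusted[ngram] = r                    # frequent counts are trusted as-is
        return adjusted, N_r[1]

    # Probabilities: a seen n-gram gets adjusted_count / N, and the total mass N_1 / N is
    # split uniformly among the V**n - T unseen n-grams.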

The End (if we ignore Witten-Bell)

Witten-Bell discounting Let’s try to estimate the probability of all of the unseen N-grams of English, given a corpus. First guess: the probability of hitting a new word in a corpus is roughly equal to the number of new words encountered in the observed corpus divided by the number of tokens. (Likewise for bigrams, n-grams). prob = #distinct words/#words ?

That over-estimates, because at the beginning, almost every word looks new and unseen! So we must either decrease the numerator or increase the denominator. Witten-Bell: Suppose we have a data structure keeping track of seen words. As we read a corpus, with each word, we ask: have you seen this before? If it says, No, we say, Add it to your memory (that's a separate function). The probability of new words is estimated by the proportion of calls to this data structure which are "Add" calls.

Estimate prob(unseen word) as K = T/(T+N), where T is the number of distinct words seen and N is the number of tokens. And then distribute K uniformly over unseen unigrams (that's hard…) or n-grams, and reduce the probability given to seen n-grams.

Therefore, the estimated real probability of seeing one of the N-grams we have already seen is N/(T+N), and the estimate of seeing a new N-gram at any moment is T/(T+N). So we want to distribute T/(T+N) over the unseen N-grams.
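A minimal Python sketch of the unigram case (tokens, vocab, and the function name are illustrative; T plays the role of the number of "Add" calls, and vocab is assumed to contain every word that occurs in tokens):

    from collections import Counter

    def witten_bell_unigram(tokens, vocab):
        """Seen words share N/(N+T) in proportion to their counts; the remaining T/(N+T)
        is split uniformly over the words of vocab that never occurred in tokens."""
        counts = Counter(tokens)
        N, T = len(tokens), len(counts)               # tokens read, and distinct types ("Add" calls)
        unseen = [w for w in vocab if w not in counts]
        probs = {}
        for w in vocab:
            if w in counts:
                probs[w] = counts[w] / (N + T)
            else:
                probs[w] = (T / (N + T)) / len(unseen)
        return probs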