
Presentation on theme: "N-gram model limitations" — Presentation transcript:

1 N-gram model limitations Q: What do we do about N-grams which were not in our training corpus? A: We distribute some probability mass from seen N-grams to previously unseen N-grams. This leads to another question: how do we do this?

2 Unsmoothed bigrams Recall that we use unigram and bigram counts to compute bigram probabilities:
– P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
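
As a concrete illustration (not from the slides), here is a minimal Python sketch of this maximum-likelihood estimate. The function and variable names are mine, and the toy counts are taken from the Berkeley Restaurant Project tables shown later in the deck.

```python
from collections import Counter

def bigram_prob(w_prev, w, unigram_counts, bigram_counts):
    # Unsmoothed MLE estimate: P(w | w_prev) = C(w_prev w) / C(w_prev)
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

# Toy counts matching the Berkeley Restaurant Project tables below:
unigram_counts = Counter({"to": 3256, "eat": 938})
bigram_counts = Counter({("to", "eat"): 860, ("eat", "to"): 2})

print(bigram_prob("to", "eat", unigram_counts, bigram_counts))  # 860/3256 ≈ 0.26
print(bigram_prob("eat", "to", unigram_counts, bigram_counts))  # 2/938 ≈ 0.0021
```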

3 How many bigrams are in a text? Suppose a text has N words; how many bigram tokens does it contain? At most N: we assume a start-of-sentence marker appearing before the first word, so that the word in initial position also gets a bigram probability. Example (5 words):
– words: w1 w2 w3 w4 w5
– bigrams: <s> w1, w1 w2, w2 w3, w3 w4, w4 w5
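
A minimal sketch of this extraction, assuming the start marker is written "<s>" (my notation, not necessarily the slides'):

```python
def extract_bigrams(words, start="<s>"):
    # Pair each word with its predecessor; the start marker gives the
    # initial word a left context, so N words yield N bigram tokens.
    padded = [start] + list(words)
    return list(zip(padded[:-1], padded[1:]))

print(extract_bigrams(["w1", "w2", "w3", "w4", "w5"]))
# [('<s>', 'w1'), ('w1', 'w2'), ('w2', 'w3'), ('w3', 'w4'), ('w4', 'w5')]
```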

4 How many possible bigrams are there? With a vocabulary of N words, there are N² possible bigrams.

5 Example description Berkeley Restaurant Project corpus:
– approximately 10,000 sentences
– 1616 word types
– tables will show counts or probabilities for 7 word types, carefully chosen so that the 7-by-7 matrix is not too sparse
– notice that many counts in the first table are zero (25 zeros out of 49 entries)

6 Unsmoothed N-grams Bigram counts (figure 6.4 from text); rows are the first word w_{n-1}, columns the second word w_n:

             I    want    to    eat  Chinese  food  lunch
  I          8    1087     0     13     0       0      0
  want       3       0   786      0     6       8      6
  to         3       0    10    860     3       0     12
  eat        0       0     2      0    19       2     52
  Chinese    2       0     0      0     0     120      1
  food      19       0    17      0     0       0      0
  lunch      4       0     0      0     0       1      0

7 Computing probabilities Recall the formula (we normalize by unigram counts):
– P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
Unigram counts are:

  WORD            I   want    to   eat  Chinese  food  lunch
  UNIGRAM COUNT 3437   1215  3256   938    213   1506    459

  p( eat | to ) = c( to eat ) / c( to )  = 860 / 3256 = .26
  p( to | eat ) = c( eat to ) / c( eat ) =   2 /  938 = .0021

8 Unsmoothed N-grams Bigram probabilities p(w_n | w_{n-1}) (figure 6.5 from text); rows are w_{n-1}, columns w_n:

             I        want     to       eat      Chinese  food     lunch
  I          .0023    .32      0        .0038    0        0        0
  want       .0025    0        .64      0        .0049    .0066    .0049
  to         .00092   0        .0031    .26      .00092   0        .0037
  eat        0        0        .0021    0        .02      .0021    .055
  Chinese    .0094    0        0        0        0        .56      .0047
  food       .013     0        .011     0        0        0        0
  lunch      .0087    0        0        0        0        .0022    0

9 What do zeros mean? Just because a bigram has a zero count or a zero probability does not mean that it cannot occur – it just means it didn’t occur in the training corpus. So we arrive back at our question: what do we do with bigrams that have zero counts when we encounter them?

10 Let’s rephrase the question How can we ensure that none of the possible bigrams have zero counts/probabilities? The process of spreading the probability mass around to all possible bigrams is called smoothing. We start with a very simple method, called add-one smoothing.

11 Add-one smoothing Basic idea: add one to actual counts, across the board. This ensures that there are no unigrams or bigrams with zero counts. Typically this adds too much probability mass to non-occurring bigrams.

12 Add-one smoothing: computing the probabilities Unadjusted probabilities:
– P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
Adjusted probabilities:
– P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]
Notes
– V is the total number of word types in the vocabulary.
– In the numerator we add one to the count of each bigram, just as we do with the unigram counts.
– In the denominator we add V, since we are adding one more bigram token of the form w_{n-1} w for each w in our vocabulary.
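
A short sketch of the adjusted estimate in Python (function and variable names are mine, not the text's); with V = 1616 it reproduces the add-one probabilities worked out on the following slides.

```python
from collections import Counter

def add_one_prob(w_prev, w, unigram_counts, bigram_counts, V):
    # Add-one (Laplace) smoothing: P*(w | w_prev) = (C(w_prev w) + 1) / (C(w_prev) + V)
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

V = 1616  # word types in the Berkeley Restaurant Project vocabulary
unigram_counts = Counter({"to": 3256, "eat": 938, "lunch": 459, "want": 1215})
bigram_counts = Counter({("to", "eat"): 860, ("eat", "to"): 2})  # unseen bigrams default to 0

print(add_one_prob("to", "eat", unigram_counts, bigram_counts, V))     # 861/4872 ≈ 0.18
print(add_one_prob("lunch", "eat", unigram_counts, bigram_counts, V))  # 1/2075 ≈ 0.00048
```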

13 A simple approach to smoothing: Add-one smoothing Add-one smoothed bigram counts (figure 6.6 from text); each count from figure 6.4 is incremented by one:

             I    want    to    eat  Chinese  food  lunch
  I          9    1088     1     14     1       1      1
  want       4       1   787      1     7       9      7
  to         4       1    11    861     4       1     13
  eat        1       1     3      1    20       3     53
  Chinese    3       1     1      1     1     121      2
  food      20       1    18      1     1       1      1
  lunch      5       1     1      1     1       2      1

14 Back to the probabilities Recall the formula for the adjusted probabilities:
– P*(w_n | w_{n-1}) = [ C(w_{n-1} w_n) + 1 ] / [ C(w_{n-1}) + V ]
Unigram counts (adjusted by adding V = 1616):

  WORD            I   want    to   eat  Chinese  food  lunch
  UNIGRAM COUNT 5053   2931  4872  2554   1829   3122   2075

  p( eat | to )    = c( to eat )    / c( to )    = 861 / 4872 = .18     (was .26)
  p( to | eat )    = c( eat to )    / c( eat )   =   3 / 2554 = .0012   (was .0021)
  p( eat | lunch ) = c( lunch eat ) / c( lunch ) =   1 / 2075 = .00048  (was 0)
  p( eat | want )  = c( want eat )  / c( want )  =   1 / 2931 = .00034  (was 0)

15 A simple approach to smoothing: Add-one smoothing Add-one smoothed bigram probabilities (figure 6.7 from text); rows are w_{n-1}, columns w_n:

             I        want     to       eat      Chinese  food     lunch
  I          .0018    .22      .0002    .0028    .0002    .0002    .0002
  want       .0014    .00035   .28      .00035   .0025    .0032    .0025
  to         .00082   .00021   .0023    .18      .00082   .00021   .0027
  eat        .00039   .00039   .0012    .00039   .0078    .0012    .021
  Chinese    .0016    .00055   .00055   .00055   .00055   .066     .0011
  food       .0064    .00032   .0058    .00032   .00032   .00032   .00032
  lunch      .0024    .00048   .00048   .00048   .00048   .00096   .00048

16 Discounting We define the discount to be the ratio of new and old counts (in our case smoothed and unsmoothed counts). Discounts for add-one smoothing for this example:

  WORD      UNIGRAM COUNT   ADD-ONE COUNT        ADD-ONE DISCOUNT
  I             3437        3437+1616 = 5053     3437/5053 = 0.68
  want          1215        1215+1616 = 2931     1215/2931 = 0.42
  to            3256        3256+1616 = 4872     3256/4872 = 0.69
  eat            938         938+1616 = 2554      938/2554 = 0.37
  Chinese        213         213+1616 = 1829      213/1829 = 0.12
  food          1506        1506+1616 = 3122     1506/3122 = 0.48
  lunch          459         459+1616 = 2075      459/2075 = 0.22
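
The table's arithmetic can be reproduced with a few lines of Python (a sketch; the counts are the ones listed above):

```python
V = 1616  # vocabulary size (word types)
unigram_counts = {"I": 3437, "want": 1215, "to": 3256, "eat": 938,
                  "Chinese": 213, "food": 1506, "lunch": 459}

for word, c in unigram_counts.items():
    # Add-one discount as defined above: original count over add-one-adjusted count
    print(f"{word:8s} {c}/{c + V} = {c / (c + V):.2f}")
```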

17 What does the discount tell us? The discount tells us where the probability mass is coming from. "Looking at the discount … shows us how strikingly the counts for each prefix-word have been reduced; the bigrams starting with Chinese were discounted by a factor of 8!" [p. 209]

18 Witten-Bell discounting Another approach to smoothing. Basic idea: “Use the count of things you’ve seen once to help estimate the count of things you’ve never seen.” [p. 211] “How can we compute the probability of seeing an N-gram for the first time? By counting the number of times we saw N-grams for the first time in our training corpus. This is very simple to produce since the count of ‘first-time’ N-grams is just the number of N-gram types we saw in the data (since we had to see each type for the first time exactly once).” [p. 211]

19 How much probability mass can we reassign? Total probability mass assigned to all (as yet) unseen n-grams is T / [ T + N ], where –T is the total number of observed types (not vocabulary size) –N is the number of tokens “We can think of our training corpus as a series of events; one event for each token and one event for each new type.” [p. 211] Formula above estimates “the probability of a new type event occurring.” [p. 211]
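>
For a toy corpus, the reserved probability mass T / (T + N) can be computed directly (a sketch with made-up data, not from the text):

```python
from collections import Counter

tokens = "to eat or not to eat".split()
bigrams = list(zip(tokens[:-1], tokens[1:]))

N = len(bigrams)            # number of bigram tokens (one event per token)
T = len(Counter(bigrams))   # number of bigram types (one "first-time" event per type)

print(T / (T + N))          # probability mass reserved for unseen bigrams: 4/9 ≈ 0.44
```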

20 Distribution of probability mass This probability mass is distributed evenly amongst the unseen n-grams. With Z = the number of zero-count n-grams:
  p_i* = [ T / (N + T) ] / Z = T / [ Z (N + T) ]

21 Discounting (unigram case) This probability mass has to come from somewhere. Recall the unsmoothed probability is
  p_i = c_i / N    (N is the number of tokens)
The smoothed (discounted) probability for non-zero-count unigrams is
  p_i* = c_i / (N + T)
We can also give smoothed counts (Z is the number of unigram types with zero counts):
– For zero-count unigrams: c_i* = (T / Z) · N / (N + T)
– For non-zero-count unigrams: c_i* = c_i · N / (N + T)
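
A minimal sketch of these unigram formulas in Python (names and toy data are mine; the vocabulary must include the zero-count types so Z can be computed):

```python
from collections import Counter

def witten_bell_unigram(counts, vocab):
    # counts: observed token counts; vocab: full vocabulary (including unseen words)
    N = sum(counts.values())   # number of tokens
    T = len(counts)            # number of observed types
    Z = len(vocab) - T         # number of zero-count types
    probs = {}
    for w in vocab:
        c = counts.get(w, 0)
        if c > 0:
            probs[w] = c / (N + T)          # discounted probability for seen unigrams
        else:
            probs[w] = T / (Z * (N + T))    # evenly shared mass for unseen unigrams
    return probs

p = witten_bell_unigram(Counter({"a": 3, "b": 1}), vocab={"a", "b", "c", "d"})
print(p, sum(p.values()))  # the probabilities sum to 1
```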

22 Discounting (bigram case) Total probability mass being redistributed (conditioned on the first word of the bigram):
  Σ_{i : c(w_x w_i) = 0} p*(w_i | w_x) = T(w_x) / ( N(w_x) + T(w_x) )
Distribute this probability mass to the unseen bigrams. Z(w_x) is the number of bigram types with first word w_x that have zero counts. The smoothed (discounted) probabilities are:
  p*(w_i | w_x) = T(w_x) / [ Z(w_x) ( N(w_x) + T(w_x) ) ]    if c(w_x w_i) = 0
  p*(w_i | w_x) = c(w_x w_i) / ( c(w_x) + T(w_x) )           if c(w_x w_i) > 0
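
And a corresponding sketch for the bigram case, conditioning on the first word w_x (again with hypothetical names and toy data):

```python
from collections import Counter, defaultdict

def witten_bell_bigram(bigram_counts, vocab):
    # Group bigram counts by their first word w_x.
    followers = defaultdict(Counter)
    for (w_x, w_i), c in bigram_counts.items():
        followers[w_x][w_i] = c

    def prob(w_i, w_x):
        N = sum(followers[w_x].values())  # N(w_x): bigram tokens starting with w_x
        T = len(followers[w_x])           # T(w_x): distinct types seen after w_x
        Z = len(vocab) - T                # Z(w_x): types never seen after w_x
        c = followers[w_x][w_i]
        if c > 0:
            return c / (N + T)            # discounted probability for seen bigrams
        return T / (Z * (N + T))          # shared mass for each unseen bigram
    return prob

vocab = {"I", "want", "to", "eat"}
p = witten_bell_bigram(Counter({("I", "want"): 3, ("I", "eat"): 1}), vocab)
print(p("want", "I"), p("to", "I"))  # seen vs. unseen follower of "I"
```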

23 Witten-Bell: discounted counts Witten-Bell smoothed (discounted) bigram counts (figure 6.9 from text); rows are w_{n-1}, columns w_n:

             I      want    to     eat    Chinese  food   lunch
  I            8    1060    .062     13    .062    .062   .062
  want         3    .046    740    .046      6       8      6
  to           3    .085     10     827      3     .085    12
  eat        .075   .075      2    .075     17       2     46
  Chinese      2    .012    .012   .012    .012     109      1
  food        18    .059     16    .059    .059    .059   .059
  lunch        4    .026    .026   .026    .026      1    .026

Notice that counts which were 0 unsmoothed are less than 1 smoothed; contrast this with add-one smoothing.

24 Discount comparison The table shows discounts for add-one and Witten-Bell smoothing for this example:

  WORD                   I     want    to    eat   Chinese  food   lunch
  ADD-ONE DISCOUNT      .68    .42    .69    .37    .12     .48    .22
  WITTEN-BELL DISCOUNT  .97    .94    .96    .88    .91     .94    .91

25 Training sets and test sets The corpus is divided into a training set and a test set. Test items must not appear in the training set, or they will receive artificially high probabilities. We can use this to evaluate different systems:
– train two different systems on the same training set
– compare the performance of the two systems on the same test set

