
1 Smoothing Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004

2 The Sparse Data Problem
Maximum likelihood estimation works fine for n-grams that occur frequently in the training corpus.
Problem 1: Low-frequency n-grams. If n-gram x occurs twice and n-gram y occurs once, is x really twice as likely as y?
Problem 2: Zero counts. If n-gram y does not occur in the training data, does that mean it should have probability zero?

3 The Sparse Data Problem
Data sparseness is a serious and frequently occurring problem.
Under maximum likelihood estimation, the probability of a sequence is zero if it contains any unseen n-gram.

4 Smoothing = Redistributing Probability Mass
Smoothing takes some probability mass away from seen n-grams and redistributes it to unseen (or rare) n-grams.

5 Add-One Smoothing
The simplest smoothing technique: for all n-grams, including unseen n-grams, add one to their counts.
Un-smoothed probability and add-one probability: see the formulas on the next slide.

6 Add-One Smoothing
Un-smoothed probability: P(w_n | w_{n-1}) = C(w_{n-1} w_n) / C(w_{n-1})
Add-one probability: P_{+1}(w_n | w_{n-1}) = [C(w_{n-1} w_n) + 1] / [C(w_{n-1}) + V], where V is the vocabulary size

7 Add-One Smoothing
Adjusted counts: c_i' = (c_i + 1) · N / (N + V), where c_i = c(w_{i-1}, w_i) and N is the total count of bigrams starting with w_{i-1}
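The following is a minimal Python sketch of the add-one formulas above, assuming a toy corpus; the corpus and function names are my own, not the lecture's.

```python
# A minimal sketch of add-one (Laplace) smoothing for bigrams.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))
V = len(unigram_counts)  # vocabulary size

def p_mle(w_prev, w):
    """Un-smoothed (maximum likelihood) bigram probability."""
    if unigram_counts[w_prev] == 0:
        return 0.0
    return bigram_counts[(w_prev, w)] / unigram_counts[w_prev]

def p_add_one(w_prev, w):
    """Add-one probability: [C(w_prev w) + 1] / [C(w_prev) + V]."""
    return (bigram_counts[(w_prev, w)] + 1) / (unigram_counts[w_prev] + V)

print(p_mle("the", "cat"), p_add_one("the", "cat"))
print(p_mle("cat", "on"), p_add_one("cat", "on"))  # unseen bigram: 0 vs. > 0
```

Note how the unseen bigram "cat on" moves from probability zero to a non-zero value, while seen bigrams lose some mass.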

8 Add-One Smoothing
Pro: Very simple technique
Cons:
Too much probability mass is shifted towards unseen n-grams
The probability of frequent n-grams is underestimated
The probability of rare (or unseen) n-grams is overestimated
All unseen n-grams are smoothed in the same way
Using a smaller added count does not solve this problem in principle

9 Witten-Bell Discounting
Probability mass is shifted around depending on the context of words.
If P(w_i | w_{i-1}, …, w_{i-m}) = 0, then the smoothed probability P_WB(w_i | w_{i-1}, …, w_{i-m}) is higher if the sequence w_{i-1}, …, w_{i-m} occurs with many different words w_i.

10 Witten-Bell Smoothing
Let's consider bigrams:
T(w_{i-1}) is the number of different words (types) that occur to the right of w_{i-1}
N(w_{i-1}) is the number of all word occurrences (tokens) to the right of w_{i-1}
Z(w_{i-1}) is the number of bigrams in the current data set starting with w_{i-1} that do not occur in the training data

11 Witten-Bell Smoothing
If c(w_{i-1}, w_i) = 0:  P_WB(w_i | w_{i-1}) = T(w_{i-1}) / [Z(w_{i-1}) · (N(w_{i-1}) + T(w_{i-1}))]
If c(w_{i-1}, w_i) > 0:  P_WB(w_i | w_{i-1}) = c(w_{i-1}, w_i) / (N(w_{i-1}) + T(w_{i-1}))

12 Witten-Bell Smoothing
Adjusted counts, with T, N, Z as defined above for the context w_{i-1} and c_i = c(w_{i-1}, w_i):
c_i' = (T/Z) · N/(N + T)  if c_i = 0
c_i' = c_i · N/(N + T)    otherwise
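Below is a minimal Python sketch of these bigram formulas, using the T/N/Z definitions from slide 10; the toy corpus and variable names are my own, so treat it as illustrative rather than a reference implementation.

```python
# A minimal sketch of Witten-Bell smoothing for bigrams.
from collections import Counter, defaultdict

corpus = "the cat sat on the mat the cat ate".split()
vocab = set(corpus)

followers = defaultdict(Counter)  # followers[w_prev][w] = c(w_prev, w)
for w_prev, w in zip(corpus, corpus[1:]):
    followers[w_prev][w] += 1

def p_wb(w_prev, w):
    ctx = followers.get(w_prev, Counter())
    T = len(ctx)            # word types seen to the right of w_prev
    N = sum(ctx.values())   # tokens seen to the right of w_prev
    Z = len(vocab) - T      # word types never seen to the right of w_prev
    if T == 0:
        return 0.0          # unseen context: still zero (the limitation noted on slide 13)
    c = ctx[w]
    if c > 0:
        return c / (N + T)            # seen bigram: discounted MLE
    return T / (Z * (N + T))          # unseen bigram: share of the reserved mass

print(p_wb("the", "cat"))  # seen bigram
print(p_wb("cat", "on"))   # unseen bigram gets a small non-zero probability
```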

13 Witten-Bell Smoothing
Witten-Bell smoothing is more conservative when subtracting probability mass.
It gives rather good estimates.
Problem: if w_{i-1} and w_i did not occur in the training data, the smoothed probability is still zero.

14 Backoff Smoothing
Deleted interpolation:
If the n-gram w_{i-n}, …, w_i is not in the training data, use w_{i-(n-1)}, …, w_i.
More generally, combine evidence from different n-grams:
P_interp(w_i | w_{i-n}, …, w_{i-1}) = λ · P(w_i | w_{i-n}, …, w_{i-1}) + (1 − λ) · P_interp(w_i | w_{i-(n-1)}, …, w_{i-1})
where λ is the 'confidence' weight for the longer n-gram.
Compute the λ parameters from held-out data.
The λs can be n-gram specific.
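A minimal Python sketch of two-level deleted interpolation (a bigram model mixed with a unigram model) follows; the fixed weight lam is a hypothetical value, whereas in practice the lambdas are estimated on held-out data.

```python
# A minimal sketch of deleted interpolation: mix the bigram MLE estimate
# with the unigram MLE estimate using a fixed weight lam.
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
N = len(corpus)

def p_interp(w_prev, w, lam=0.7):
    p_bi = bigrams[(w_prev, w)] / unigrams[w_prev] if unigrams[w_prev] else 0.0
    p_uni = unigrams[w] / N
    return lam * p_bi + (1 - lam) * p_uni

print(p_interp("cat", "the"))  # unseen bigram still receives unigram probability mass
```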

15 Other Smoothing Approaches
Good-Turing Discounting: re-estimates the amount of probability mass for zero (or low count) n-grams by looking at n-grams with higher counts
Kneser-Ney Smoothing: similar to Witten-Bell smoothing but considers the number of word types preceding a word
Katz Backoff Smoothing: reverts to shorter n-gram contexts if the count for the current n-gram is lower than some threshold
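As a small illustration of the first idea, here is a sketch of the Good-Turing count re-estimate c* = (c + 1) · N_{c+1} / N_c, where N_c is the number of n-gram types seen exactly c times. The fallback for empty count-of-counts buckets is my own simplification; real implementations smooth the N_c values instead.

```python
# A minimal sketch of Good-Turing discounting applied to raw bigram counts.
from collections import Counter

bigram_counts = Counter({("the", "cat"): 2, ("cat", "sat"): 1, ("sat", "on"): 1,
                         ("on", "the"): 1, ("the", "mat"): 1, ("mat", "the"): 1,
                         ("cat", "ate"): 1})

count_of_counts = Counter(bigram_counts.values())  # N_c: how many types have count c

def good_turing(c):
    n_c, n_c1 = count_of_counts[c], count_of_counts[c + 1]
    if n_c == 0 or n_c1 == 0:
        return float(c)  # fall back to the raw count where the estimate is undefined
    return (c + 1) * n_c1 / n_c

print(good_turing(1))  # discounted count for bigrams seen exactly once
```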

