Smoothing Mengqiu Wang 09 April 2010 (Thanks to Anish Johnson, Nate Chambers, Bill MacCartney, Jenny Finkel, and Sushant Prakash for these materials)
Format and content of sections Mix of theoretical and practical topics, targeted at PAs Emphasis on simple examples worked out in detail
Outline for today Information theory: intuitions and examples entropy, joint entropy, conditional entropy, mutual information relative entropy (KL divergence), cross entropy, perplexity Smoothing: examples, proofs, implementation absolute discounting example how to prove you have a proper probability distribution Good-Turing smoothing tricks smoothing and conditional distributions Java implementation representing huge models efficiently Some tips on the programming assignments
Example Toy language with two words: "A" and "B". Want to build predictive model from: "A B, B A." "A B, B A A A!" "B A, A B!" "A!" "A, A, A." "" (Can say nothing)
A Unigram Model Let's omit punctuation and put a stop symbol after each utterance: A B B A . A B B A A A . B A A B . A . A A A . . Let C(x) be the observed count of unigram x Let o(x) be the observed frequency of unigram x A multinomial probability distribution with 2 free parameters Event space {A, B, .} o(x): MLE estimates of the unknown true parameters
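A minimal Java sketch of how C(x) and o(x) could be computed for this toy corpus. The class and method names (UnigramMle, counts, frequencies) are illustrative and not part of the course code.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch: MLE unigram estimates for the toy corpus.
final class UnigramMle {
    static final String CORPUS = "A B B A . A B B A A A . B A A B . A . A A A . .";

    // C(x): observed count of each unigram token.
    static Map<String, Integer> counts(String corpus) {
        Map<String, Integer> c = new TreeMap<>();
        for (String tok : corpus.trim().split("\\s+")) {
            c.merge(tok, 1, Integer::sum);
        }
        return c;
    }

    // o(x) = C(x) / C: the relative frequency, which is the MLE.
    static Map<String, Double> frequencies(Map<String, Integer> counts) {
        int total = 0;
        for (int n : counts.values()) {
            total += n;
        }
        Map<String, Double> o = new TreeMap<>();
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            o.put(e.getKey(), e.getValue() / (double) total);
        }
        return o;
    }

    public static void main(String[] args) {
        System.out.println(frequencies(counts(CORPUS)));
    }
}
```

With 24 tokens in total (12 A, 6 B, 6 stops), this gives o(A) = 1/2 and o(B) = o(.) = 1/4. Only two of the three values are free, since they must sum to 1.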
A Bigram Model (1/2) This time we'll put a stop symbol before and after each utterance: . A B B A . . A B B A A A . . B A A B . . A . . A A A . . . Let C(x, y) be the observed count of bigram xy Let o(x, y) be the observed frequency of bigram xy multinomial probability distribution with 8 free parameters Event space {A, B, .} × {A, B, .} MLE estimates of the unknown true parameters
A Bigram Model (2/2) Marginal distributions o(x) and o(y). Conditional distributions o(y | x) and o(x | y): o(y | x) = o(x, y) / o(x) and o(x | y) = o(x, y) / o(y). For example, o(B | A) = o(A, B) / o(A) = (1/8) / (1/2) = 1/4.
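The bigram counts and the conditional o(y | x) can be sketched in Java as follows. This is only an illustration: the names BigramMle, bigramCounts and conditional are assumptions, and each utterance is padded with a stop symbol on both sides so that no bigram spans two utterances.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: bigram counts and conditional MLE for the toy corpus.
final class BigramMle {
    static final String[] UTTERANCES = {
        "A B B A", "A B B A A A", "B A A B", "A", "A A A", ""
    };

    // C(x, y), keyed by "x y", with a stop symbol before and after each utterance.
    static Map<String, Integer> bigramCounts(String[] utterances) {
        Map<String, Integer> c = new HashMap<>();
        for (String u : utterances) {
            String[] toks = (". " + u + " .").trim().split("\\s+");
            for (int i = 0; i + 1 < toks.length; i++) {
                c.merge(toks[i] + " " + toks[i + 1], 1, Integer::sum);
            }
        }
        return c;
    }

    // o(y | x) = o(x, y) / o(x) = C(x, y) / (sum over y' of C(x, y')).
    // Returns NaN if x was never observed as a first element.
    static double conditional(Map<String, Integer> c, String x, String y) {
        int joint = c.getOrDefault(x + " " + y, 0);
        int marginal = 0;
        for (Map.Entry<String, Integer> e : c.entrySet()) {
            if (e.getKey().startsWith(x + " ")) {
                marginal += e.getValue();
            }
        }
        return joint / (double) marginal;
    }

    public static void main(String[] args) {
        Map<String, Integer> c = bigramCounts(UTTERANCES);
        System.out.println("o(B | A) = " + conditional(c, "A", "B"));
    }
}
```

The six utterances yield 24 bigram tokens over 9 distinct bigram types, with C(A, B) = 3 and 12 bigrams starting with A, which reproduces o(B | A) = 1/4.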
Entropy (1/2) If X is a random variable whose distribution is p (which we write "X ~ p"), then we can define the entropy of X, H(X), as follows: H(X) = -∑x p(x) lg p(x). Note that entropy is (the negative of) the expected value of the log probability of x. Log probabilities come up a lot in statistical modeling. Since probabilities are always ≤ 1, log probabilities are always ≤ 0. Therefore, entropy is always ≥ 0. Let's calculate the entropy of our observed unigram distribution, o(x): H(o) = -(1/2 lg 1/2 + 1/4 lg 1/4 + 1/4 lg 1/4) = 1/2 + 1/2 + 1/2 = 1.5 bits.
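A minimal Java sketch of this entropy calculation, assuming the distribution is passed as an array of probabilities. The name EntropyDemo is illustrative.

```java
// Illustrative sketch: entropy in bits of a discrete distribution.
final class EntropyDemo {
    // H(X) = - sum_x p(x) lg p(x); terms with p(x) = 0 contribute 0.
    static double entropy(double[] p) {
        double h = 0.0;
        for (double px : p) {
            if (px > 0.0) {
                h -= px * Math.log(px) / Math.log(2.0);
            }
        }
        return h;
    }

    public static void main(String[] args) {
        double[] o = {0.5, 0.25, 0.25}; // o(A), o(B), o(.)
        System.out.println("H(o) = " + entropy(o) + " bits");
    }
}
```

For the observed unigram distribution this is 1.5 bits. A point mass such as {1, 0, 0} gives 0 bits, and the uniform distribution over three events gives lg 3, about 1.585 bits.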
Entropy (2/2) What does entropy mean, intuitively? It is helpful to think of it as a measure of the evenness or uniformity of the distribution. Lowest entropy? What parameters achieve that minimum? Highest entropy? What parameters achieve that maximum? What if m(x) = 0 for some x? By definition 0 lg 0 = 0, so events with probability 0 do not affect the entropy calculation.
Joint Entropy (1/2) If random variables X and Y have joint distribution p(x, y), then we can define the joint entropy of X and Y, H(X, Y), as follows: H(X, Y) = -∑x ∑y p(x, y) lg p(x, y). For our observed bigram distribution, o(x, y), the joint entropy works out to about 2.95 bits. What happens when X and Y are independent? Then H(X, Y) = H(X) + H(Y), because p(x, y) = p(x) p(y).
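A short Java sketch of joint entropy over a table of joint probabilities, with a helper that builds the joint distribution of two independent variables so the additivity property can be checked numerically. All names here are illustrative.

```java
// Illustrative sketch: joint entropy in bits, and the independent case.
final class JointEntropy {
    static double lg(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    // H(X, Y) = - sum_x sum_y p(x, y) lg p(x, y), with 0 lg 0 = 0.
    static double jointEntropy(double[][] p) {
        double h = 0.0;
        for (double[] row : p) {
            for (double pxy : row) {
                if (pxy > 0.0) {
                    h -= pxy * lg(pxy);
                }
            }
        }
        return h;
    }

    // Entropy of a single distribution, treated as a one-row table.
    static double entropy(double[] p) {
        return jointEntropy(new double[][] {p});
    }

    // Joint distribution of independent X and Y: p(x, y) = p(x) p(y).
    static double[][] product(double[] px, double[] py) {
        double[][] p = new double[px.length][py.length];
        for (int i = 0; i < px.length; i++) {
            for (int j = 0; j < py.length; j++) {
                p[i][j] = px[i] * py[j];
            }
        }
        return p;
    }

    public static void main(String[] args) {
        double[] px = {0.5, 0.25, 0.25};
        double[] py = {0.5, 0.5};
        System.out.println(jointEntropy(product(px, py)) + " vs " + (entropy(px) + entropy(py)));
    }
}
```

In this example H(X) = 1.5 bits and H(Y) = 1 bit, so the joint entropy of the product distribution is 2.5 bits.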
Joint Entropy (2/2) Try fiddling with the parameters of the following joint distribution m(x, y), and observe what happens to the joint entropy:
Conditional Entropy (1/2) If random variables X and Y have joint distribution p(x, y), then we can define the conditional entropy of Y given X, H(Y | X), as follows: H(Y | X) = ∑x p(x) H(Y | X = x) = -∑x p(x) ∑y p(y | x) lg p(y | x). Conditional entropy tells us how much uncertainty about Y remains once we know X. If X and Y are independent, H(Y | X) = H(Y); we can also see this from the formula, since then p(y | x) = p(y).
Conditional Entropy (2/2) From the previous slide: H(Y | X) = ∑x p(x) [ -∑y p(y | x) lg p(y | x) ]. An alternative way to compute H(Y | X): H(Y | X) = H(X, Y) - H(X).
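Both routes to H(Y | X) can be compared numerically. The Java sketch below takes a table of joint counts with rows indexed by x and columns by y; the names are illustrative. The counts in main are the observed bigram counts from the stop-padded toy corpus, with rows and columns ordered A, B, stop.

```java
// Illustrative sketch: conditional entropy H(Y | X) computed two ways.
final class ConditionalEntropy {
    // One entropy term: -p lg p, with 0 lg 0 = 0.
    static double term(double p) {
        return p > 0.0 ? -p * Math.log(p) / Math.log(2.0) : 0.0;
    }

    static double total(int[][] counts) {
        double t = 0.0;
        for (int[] row : counts) {
            for (int c : row) {
                t += c;
            }
        }
        return t;
    }

    // Direct: H(Y | X) = sum_x p(x) [ - sum_y p(y | x) lg p(y | x) ].
    static double direct(int[][] counts) {
        double t = total(counts);
        double h = 0.0;
        for (int[] row : counts) {
            double rowSum = 0.0;
            for (int c : row) {
                rowSum += c;
            }
            if (rowSum == 0.0) {
                continue; // p(x) = 0 contributes nothing
            }
            double inner = 0.0;
            for (int c : row) {
                inner += term(c / rowSum);
            }
            h += (rowSum / t) * inner;
        }
        return h;
    }

    // Chain rule: H(Y | X) = H(X, Y) - H(X).
    static double viaChainRule(int[][] counts) {
        double t = total(counts);
        double hxy = 0.0;
        double hx = 0.0;
        for (int[] row : counts) {
            double rowSum = 0.0;
            for (int c : row) {
                rowSum += c;
                hxy += term(c / t);
            }
            hx += term(rowSum / t);
        }
        return hxy - hx;
    }

    public static void main(String[] args) {
        int[][] bigrams = {{5, 3, 4}, {3, 2, 1}, {4, 1, 1}};
        System.out.println(direct(bigrams) + " " + viaChainRule(bigrams));
    }
}
```

For these counts both methods give roughly 1.455 bits, which is slightly less than the 1.5 bits of the unigram distribution: knowing the previous symbol removes a little uncertainty about the next one.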
Relative entropy (1/3) (KL divergence) If we have two probability distributions, p(x) and q(x), we can define the relative entropy (aka KL divergence) between p and q, D(p || q), as follows: D(p || q) = ∑x p(x) lg (p(x) / q(x)). We define 0 lg 0 = 0 and p lg (p / 0) = ∞. Relative entropy measures how much two probability distributions differ. It is asymmetric: in general D(p || q) ≠ D(q || p). Identical distributions have zero relative entropy. Non-identical distributions have positive relative entropy. It is never negative.
Relative entropy (2/3) (KL divergence) Suppose we want to compare our observed unigram distribution o(x) to some arbitrary model distribution m(x). What is the relative entropy between them? Try fiddling with the parameters of m(x) and see what it does to the KL divergence. What parameters minimize the divergence? Maximize?
Relative entropy (3/3) (KL divergence) What happens when o(x) = 0? (lg 0 is normally undefined!) Events that cannot happen (according to o(x)) do not contribute to the KL divergence between o(x) and any other distribution. What happens when m(x) = 0? (division by 0!) If an event x was observed (o(x) > 0) but your model says it can't happen (m(x) = 0), then your model is infinitely surprised: D(o || m) = ∞. Cross entropy: H(p, q) = -∑x p(x) lg q(x). Relative entropy is the gap between cross entropy and entropy: D(p || q) = H(p, q) - H(p). Why is D(p || q) ≥ 0? Can you prove it? Hint: Gibbs' inequality.
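The zero-handling conventions and the identity D(p || q) = H(p, q) - H(p) are easy to check in code. The following Java sketch is illustrative only; the names are assumptions, and the uniform model m in main is an arbitrary choice for the example.

```java
// Illustrative sketch: KL divergence and cross entropy in bits.
final class KlDemo {
    static double lg(double v) {
        return Math.log(v) / Math.log(2.0);
    }

    // D(p || q) = sum_x p(x) lg (p(x) / q(x)),
    // with 0 lg 0 = 0 and p lg (p / 0) = +infinity.
    static double kl(double[] p, double[] q) {
        double d = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0) {
                continue; // impossible under p: contributes nothing
            }
            if (q[i] == 0.0) {
                return Double.POSITIVE_INFINITY; // observed under p, impossible under q
            }
            d += p[i] * lg(p[i] / q[i]);
        }
        return d;
    }

    // H(p, q) = - sum_x p(x) lg q(x), with the same conventions.
    static double crossEntropy(double[] p, double[] q) {
        double h = 0.0;
        for (int i = 0; i < p.length; i++) {
            if (p[i] == 0.0) {
                continue;
            }
            if (q[i] == 0.0) {
                return Double.POSITIVE_INFINITY;
            }
            h -= p[i] * lg(q[i]);
        }
        return h;
    }

    public static void main(String[] args) {
        double[] o = {0.5, 0.25, 0.25};
        double[] m = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        System.out.println("D(o || m) = " + kl(o, m));
        System.out.println("H(o, m)   = " + crossEntropy(o, m));
    }
}
```

With a uniform m, the cross entropy H(o, m) is lg 3 and D(o || m) is lg 3 - 1.5, about 0.085 bits. Setting m = o drives the divergence to 0, and setting any m(x) to 0 where o(x) > 0 makes it infinite.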
Smoothing: Absolute Discounting Idea: reduce counts of observed event types by a fixed amount δ, and reallocate the count mass to unobserved event types. Absolute discounting is simple and gives quite good results. Terminology: x - an event (a type) (e.g., bigram); C - count of all observations (tokens) (e.g., training size); C(x) - count of observations (tokens) of type x (e.g., bigram counts); V - # of event types: |{x}| (e.g., size of vocabulary); N_r - # of event types observed r times: |{x: C(x) = r}|; δ - a number between 0 and 1 (e.g., 0.75).
Absolute Discounting (Cont.) For seen types, we deduct δ from the count mass: P_ad(x) = (C(x) - δ) / C if C(x) > 0. How much count mass did we harvest by doing this? We took δ from each of V - N_0 types, so we have (V - N_0)δ to redistribute among the N_0 unseen types. So each unseen type will get a count mass of (V - N_0)δ / N_0: P_ad(x) = (V - N_0)δ / (N_0 C) if C(x) = 0. To see how this works, let's go back to our original example and look at bigram counts. To bring unseens into the picture, let's suppose we have another word, C, giving rise to 7 new unseen bigrams:
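As a rough illustration of the two cases, here is a minimal Java sketch that applies absolute discounting to a flat array of type counts. The array in main holds the nine observed bigram counts from the stop-padded toy corpus followed by seven zeros for the unseen bigrams; the names and the choice δ = 0.75 are assumptions made for the example.

```java
// Illustrative sketch: absolute discounting over an array of type counts.
final class AbsoluteDiscounting {
    // P_ad(x) = (C(x) - delta) / C              if C(x) > 0
    //         = (V - N_0) * delta / (N_0 * C)   if C(x) = 0
    // Assumes at least one unseen type; otherwise the harvested mass is lost.
    static double[] smooth(int[] counts, double delta) {
        int total = 0;
        int unseen = 0;
        for (int c : counts) {
            total += c;
            if (c == 0) {
                unseen++;
            }
        }
        int seen = counts.length - unseen; // V - N_0
        double[] p = new double[counts.length];
        for (int i = 0; i < counts.length; i++) {
            p[i] = counts[i] > 0
                ? (counts[i] - delta) / total
                : seen * delta / (unseen * (double) total);
        }
        return p;
    }

    public static void main(String[] args) {
        int[] counts = {5, 3, 4, 3, 2, 1, 4, 1, 1, 0, 0, 0, 0, 0, 0, 0};
        double sum = 0.0;
        for (double p : smooth(counts, 0.75)) {
            sum += p;
        }
        System.out.println("total probability = " + sum);
    }
}
```

With these counts (C = 24, V = 16, N_0 = 7), a bigram seen once gets (1 - 0.75) / 24, about 0.010, while each unseen bigram gets 9 × 0.75 / (7 × 24), about 0.040. An unseen type therefore ends up more probable than a type that was actually observed once.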
Absolute Discounting (Cont.) The probabilities add up to 1, but how do we prove this in general? Also, how do you choose a good value for δ? 0.75 is often recommended, but you can also use held-out data to tune this parameter. Look down the column of P_ad probabilities. Anything troubling to you?
Proper Probability Distributions To prove that a function p is a probability distribution, you must show: 1. ∀x p(x) ≥ 0 2. ∑x p(x) = 1 The first is generally trivial; the second can be more challenging. A proof for absolute discounting will illustrate the general idea:
∑x P_ad(x) = ∑{x:C(x)>0} P_ad(x) + ∑{x:C(x)=0} P_ad(x)
= ∑{x:C(x)>0} (C(x) - δ)/C [V - N_0 terms] + ∑{x:C(x)=0} (V - N_0)δ/(N_0 C) [N_0 terms]
= [∑{x:C(x)>0} C(x)]/C - (V - N_0)δ/C + (V - N_0)δ/C
= ∑x C(x)/C = C/C = 1
Good-Turing Smoothing We redistribute the count mass of types observed r+1 times evenly among types observed r times. Then, estimate P(x) as r*/C, where r* is an adjusted count for types observed r times. We want r* such that: r* N_r = (r+1) N_{r+1}, so r* = (r+1) N_{r+1} / N_r => P_GT(x) = ((r+1) N_{r+1} / N_r) / C
Good-Turing (Cont.) To see how this works, let's go back to our example and look at bigram counts. To make the example more interesting, we'll assume we've also seen the sentence, "C C C C C C C C C C C", giving us another 12 bigram observations, as summarized in the following table of counts:
Good-Turing (Cont.) But now the probabilities do not add up to 1. One problem is that for high values of r (i.e., for high-frequency bigrams), N_{r+1} is quite likely to be 0. One way to address this is to use the Good-Turing estimates only for frequencies r < k for some constant cutoff k. Above this, the MLE estimates are used.
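The failure to sum to 1 can be reproduced with a few lines of Java. The sketch below uses the raw N_r values with no cutoff and no curve fitting; the names are illustrative, and the counts in main are the 16 bigram counts over {A, B, C, .} after adding the all-C sentence (36 bigram tokens in total).

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: Good-Turing with raw (unsmoothed) counts of counts.
final class GoodTuringRaw {
    // N_r: number of event types observed exactly r times.
    static Map<Integer, Integer> countOfCounts(int[] counts) {
        Map<Integer, Integer> n = new HashMap<>();
        for (int c : counts) {
            n.merge(c, 1, Integer::sum);
        }
        return n;
    }

    // r* = (r + 1) * N_{r+1} / N_r. Only call this for an r that occurs (N_r > 0).
    static double adjustedCount(Map<Integer, Integer> n, int r) {
        int next = n.getOrDefault(r + 1, 0);
        int current = n.get(r);
        return (r + 1) * (double) next / current;
    }

    // Sum over all types of P_GT(x) = r* / C.
    static double totalMass(int[] counts) {
        Map<Integer, Integer> n = countOfCounts(counts);
        int total = 0;
        for (int c : counts) {
            total += c;
        }
        double mass = 0.0;
        for (int c : counts) {
            mass += adjustedCount(n, c) / total;
        }
        return mass;
    }

    public static void main(String[] args) {
        int[] counts = {5, 3, 0, 4, 3, 2, 0, 1, 0, 0, 10, 1, 4, 1, 1, 1};
        System.out.println("total probability = " + totalMass(counts));
    }
}
```

Here N_0 = 4, N_1 = 5, N_2 = 1, N_3 = 2, N_4 = 2, N_5 = 1 and N_10 = 1. The types seen 5 and 10 times get r* = 0 because N_6 and N_11 are both 0, so their mass vanishes and the total comes to 26/36 rather than 1.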
Good-Turing (Cont.) Thus a better way to define r* is: r* = (r+1) E[N_{r+1}] / E[N_r], where E[N_r] means the expected number of event types (bigrams) observed r times. So we fit some function S to the observed values (r, N_r). Gale & Sampson (1995) suggest using a power curve N_r = a r^b, with b < -1. Fit using linear regression in logspace: log N_r = log a + b log r. The observed distribution of N_r, when transformed into logspace, looks like this:
Good-Turing (Cont.) Here is a graph of the line fit in logspace and the resulting power curve: The fit is poor, but it gives us smoothed values for N_r. Now, for r > 0, we use S(r) to generate the adjusted counts r*: r* = (r+1) S(r+1) / S(r)
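The log-space fit is ordinary least squares on the points (log r, log N_r). A minimal Java sketch follows, assuming the observed pairs are supplied as parallel arrays with r > 0 and N_r > 0; the names PowerLawFit, fit, smoothed and adjustedCount are illustrative.

```java
// Illustrative sketch: fit N_r = a * r^b by linear regression in log space.
final class PowerLawFit {
    // Least-squares fit of log N_r = log a + b log r. Returns {a, b}.
    // Needs at least two distinct values of r.
    static double[] fit(int[] r, int[] nr) {
        int k = r.length;
        double sx = 0.0, sy = 0.0, sxx = 0.0, sxy = 0.0;
        for (int i = 0; i < k; i++) {
            double x = Math.log(r[i]);
            double y = Math.log(nr[i]);
            sx += x;
            sy += y;
            sxx += x * x;
            sxy += x * y;
        }
        double b = (k * sxy - sx * sy) / (k * sxx - sx * sx);
        double a = Math.exp((sy - b * sx) / k);
        return new double[] {a, b};
    }

    // S(r) = a * r^b: the smoothed stand-in for N_r.
    static double smoothed(double[] ab, double r) {
        return ab[0] * Math.pow(r, ab[1]);
    }

    // r* = (r + 1) * S(r + 1) / S(r), for r > 0.
    static double adjustedCount(double[] ab, int r) {
        return (r + 1) * smoothed(ab, r + 1) / smoothed(ab, r);
    }

    public static void main(String[] args) {
        // Observed (r, N_r) pairs from the toy bigram example, r > 0 only.
        int[] r = {1, 2, 3, 4, 5, 10};
        int[] nr = {5, 1, 2, 2, 1, 1};
        double[] ab = fit(r, nr);
        System.out.println("a = " + ab[0] + ", b = " + ab[1]);
        System.out.println("r*(1) = " + adjustedCount(ab, 1));
    }
}
```

Because S(r) is positive for every r > 0, the adjusted counts never collapse to 0 the way the raw N_{r+1} / N_r ratio does. The resulting estimates still need to be renormalized if a proper distribution is required.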
Smoothing and Conditional Probabilities Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution: P(x | y) = P(x, y) / P(y) What if you want to use smoothing? WRONG: Smooth joint P(x, y) and marginal P(y) independently, then combine: P'''(x | y) = P'(x, y) / P''(y). Correct: Smooth P(x | y) independently.
Smoothing and Conditional Probabilities (Cont.) The problem with the wrong approach is that the joint and the marginal are smoothed separately, so it makes no sense to divide the results. The right way to compute the smoothed conditional probability distribution P(x | y) is: 1. From the joint P(x, y), compute a smoothed joint P'(x, y). 2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y). 3. Divide them: let P'(x | y) = P'(x, y) / P'(y).
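The three steps can be sketched in Java as follows, using absolute discounting as the smoothing method. The names are illustrative, and the count table in main is made up for the example (it is not the safari data).

```java
// Illustrative sketch: smooth the joint, derive the marginal from it, then divide.
final class SmoothedConditional {
    // Step 1: absolute discounting over all (x, y) cells of the count table.
    // Assumes at least one cell has count 0.
    static double[][] smoothJoint(int[][] counts, double delta) {
        int total = 0, cells = 0, unseen = 0;
        for (int[] row : counts) {
            for (int c : row) {
                total += c;
                cells++;
                if (c == 0) {
                    unseen++;
                }
            }
        }
        double unseenProb = (cells - unseen) * delta / (unseen * (double) total);
        double[][] p = new double[counts.length][];
        for (int x = 0; x < counts.length; x++) {
            p[x] = new double[counts[x].length];
            for (int y = 0; y < counts[x].length; y++) {
                p[x][y] = counts[x][y] > 0 ? (counts[x][y] - delta) / total : unseenProb;
            }
        }
        return p;
    }

    // Step 2: P'(y) is obtained by summing the smoothed joint over x.
    static double[] marginalY(double[][] joint) {
        double[] py = new double[joint[0].length];
        for (double[] row : joint) {
            for (int y = 0; y < row.length; y++) {
                py[y] += row[y];
            }
        }
        return py;
    }

    // Step 3: P'(x | y) = P'(x, y) / P'(y).
    static double conditional(double[][] joint, int x, int y) {
        return joint[x][y] / marginalY(joint)[y];
    }

    public static void main(String[] args) {
        int[][] counts = {{3, 0}, {1, 2}, {0, 4}}; // hypothetical counts, rows = x, columns = y
        double[][] joint = smoothJoint(counts, 0.75);
        for (int y = 0; y < 2; y++) {
            double sum = 0.0;
            for (int x = 0; x < 3; x++) {
                sum += conditional(joint, x, y);
            }
            System.out.println("sum over x of P'(x | y=" + y + ") = " + sum);
        }
    }
}
```

Because the marginal is derived from the same smoothed joint, the conditionals for each fixed y sum to 1 by construction. That guarantee is lost when the marginal is smoothed separately.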
Smoothing and Conditional Probabilities (Cont.) Suppose we're on safari, and we count the animals we observe by species (x) and gender (y): From these counts, we can easily compute unsmoothed, empirical joint and marginal distributions:
Smoothing and Conditional Probabilities (Cont.) Now suppose we want to use absolute discounting, with δ = 0.75. The adjusted counts are: Now from these counts, we can compute smoothed joint and marginal distributions: Now, since both P'(x, y) and P'(y) result from the same smoothing operation, it's OK to divide them:
Java Bug fix!! In the generateWord() function, the cumulative sum should divide by total + 1 rather than total:
    public String generateWord() {
        double sample = Math.random();
        double sum = 0.0;
        for (String word : wordCounter.keySet()) {
            // One extra count of probability mass is reserved for unknown words.
            sum += wordCounter.getCount(word) / (total + 1.0);
            if (sum > sample) {
                return word;
            }
        }
        return "*UNKNOWN*"; // reached with the small probability mass reserved for unknowns
    }
Java You want to train a trigram model on 10 million words, but you keep running out of memory. Don't use a V × V × V matrix! CounterMap? Good, but needs to be elaborated. Intern your Strings! Raise the JVM's maximum heap size: -mx2000m
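One way to avoid the dense V × V × V array is a nested hash map that stores only the trigrams actually observed. The Java sketch below uses plain java.util maps as a stand-in; it is not the course's CounterMap, and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: sparse trigram counts, using memory only for observed trigrams.
final class SparseTrigramCounts {
    // Context "w1 w2" -> (w3 -> count).
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    void add(String w1, String w2, String w3) {
        // Interning w3 lets the many inner maps share one String object per word.
        counts.computeIfAbsent(w1 + " " + w2, k -> new HashMap<>())
              .merge(w3.intern(), 1, Integer::sum);
    }

    int get(String w1, String w2, String w3) {
        Map<String, Integer> inner = counts.get(w1 + " " + w2);
        return inner == null ? 0 : inner.getOrDefault(w3, 0);
    }

    int contexts() {
        return counts.size();
    }

    public static void main(String[] args) {
        SparseTrigramCounts t = new SparseTrigramCounts();
        String[] toks = ". . A B B A .".split(" ");
        for (int i = 0; i + 2 < toks.length; i++) {
            t.add(toks[i], toks[i + 1], toks[i + 2]);
        }
        System.out.println(t.get("A", "B", "B") + " observation(s) of A B B");
    }
}
```

Memory then grows with the number of distinct trigrams seen in training, not with the cube of the vocabulary size. Joining the two context words with a space assumes that no token itself contains a space.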
Programming Assignment Tips Objectives: The assignments are intentionally open-ended Think of the assignment as an investigation Think of your report as a mini research paper Implement the basics: bigram and trigram models, smoothing, interpolation Choose one or more additions, e.g.: Fancy smoothing: Katz, Witten-Bell, Kneser-Ney, gapped bigram Compare smoothing joint vs. smoothing conditionals Crude spelling model for unknown words Trade-off between memory and performance Explain what you did, why you did it, and what you found out
Programming Assignment Tips Development Use a very small dataset during development (esp. debugging). Use validation data to tune hyperparameters. Investigate the learning curve (performance as a function of training size). Investigate the variance of your model results. Report Be concise! 6 pages is plenty. Prove that your actual, implemented distributions are proper (concisely!). Include a graph or two. Error analysis: discuss examples that your model gets wrong.