Smoothing
Pi-Chuan Chang, Paul Baumstarck
April 11, 2008
(Thanks to Bill MacCartney, Jenny Finkel, and Sushant Prakash for these materials)

Format and content of sections
Mix of theoretical and practical topics, targeted at the PAs (programming assignments)
Emphasis on simple examples worked out in detail

Outline for today
Information theory: intuitions and examples
  entropy, joint entropy, conditional entropy, mutual information
  relative entropy (KL divergence), cross entropy, perplexity
Smoothing: examples, proofs, implementation
  absolute discounting example
  how to prove you have a proper probability distribution
  Good-Turing smoothing tricks
  smoothing and conditional distributions
Java implementation
  representing huge models efficiently
Some tips on the programming assignments

Example
Toy language with two words: "A" and "B". Want to build predictive model from:
"A B, B A."
"A B, B A A A!"
"B A, A B!"
"A!"
"A, A, A."
"" (Can say nothing)

A Unigram Model
Let's omit punctuation and put a stop symbol after each utterance:
A B B A . A B B A A A . B A A B . A . A A A . .
Let C(x) be the observed count of unigram x
Let o(x) be the observed frequency of unigram x
A multinomial probability distribution with 2 free parameters
Event space {A, B, .}
o(x): MLE estimates of the unknown true parameters
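A minimal Java sketch (not the assignment's starter code; the class and variable names are made up for illustration) of computing the unigram counts and MLE estimates o(x) = C(x) / C for this toy corpus:

import java.util.HashMap;
import java.util.Map;

public class UnigramMLE {
  public static void main(String[] args) {
    // The 24 tokens of the toy corpus, with a stop symbol after each utterance.
    String[] tokens = "A B B A . A B B A A A . B A A B . A . A A A . .".split(" ");
    Map<String, Integer> counts = new HashMap<>();
    for (String t : tokens) counts.merge(t, 1, Integer::sum);     // C(x)
    int total = tokens.length;                                    // C = 24
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      double o = (double) e.getValue() / total;                   // o(x) = C(x) / C
      System.out.printf("o(%s) = %d/%d = %.3f%n", e.getKey(), e.getValue(), total, o);
    }
  }
}

On the corpus above this gives C(A) = 12, C(B) = 6, C(.) = 6, so o(A) = 1/2 and o(B) = o(.) = 1/4.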

A Bigram Model (1/2)
This time we'll put a stop symbol before and after each utterance:
. A B B A . . A B B A A A . . B A A B . . A . . A A A . . .
Let C(x, y) be the observed count of bigram xy
Let o(x, y) be the observed frequency of bigram xy
A multinomial probability distribution with 8 free parameters
Event space {A, B, .} × {A, B, .}
MLE estimates of the unknown true parameters
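For reference, counting the bigrams by hand in the six padded utterances above gives 24 bigram tokens: C(. A) = 4, C(. B) = 1, C(. .) = 1, C(A A) = 5, C(A B) = 3, C(A .) = 4, C(B A) = 3, C(B B) = 2, C(B .) = 1. So, for example, o(A, B) = 3/24 = 1/8.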

A Bigram Model (2/2)
Marginal distributions o(x) and o(y)
Conditional distributions o(y | x) and o(x | y)?
o(y | x) = o(x, y) / o(x)
o(x | y) = o(x, y) / o(y)
Example: o(B | A) = o(A, B) / o(A) = (1/8) / (1/2) = 1/4

Entropy (1/2)
If X is a random variable whose distribution is p (which we write "X ~ p"), then we can define the entropy of X, H(X), as follows:
H(X) = - ∑x p(x) lg p(x)
Note that entropy is (the negative of) the expected value of the log probability of x. Log probabilities come up a lot in statistical modeling. Since probabilities are always ≤ 1, log probabilities are always ≤ 0. Therefore, entropy is always ≥ 0.
Let's calculate the entropy of our observed unigram distribution, o(x):
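Worked out by hand for the observed unigram distribution (o(A) = 1/2, o(B) = 1/4, o(.) = 1/4):
H(o) = -(1/2 lg 1/2 + 1/4 lg 1/4 + 1/4 lg 1/4) = 1/2 + 1/2 + 1/2 = 1.5 bits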

Entropy (2/2)
What does entropy mean, intuitively? Helpful to think of it as a measure of the evenness or uniformity of the distribution.
Lowest entropy? What parameters achieve that minimum?
Highest entropy? What parameters achieve that maximum?
What if m(x) = 0 for some x? By definition 0 lg 0 = 0, so that events with probability 0 do not affect the entropy calculation.
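(For the record: with three outcomes as in our example, the lowest entropy is 0 bits, achieved by putting all the probability on a single outcome; the highest is lg 3 ≈ 1.58 bits, achieved by the uniform distribution.)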

Joint Entropy (1/2)
If random variables X and Y have joint distribution p(x, y), then we can define the joint entropy of X and Y, H(X, Y), as follows:
H(X, Y) = - ∑x ∑y p(x, y) lg p(x, y)
Let's calculate the joint entropy of our observed bigram distribution, o(x, y):
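Worked out by hand from the bigram counts listed earlier (each count divided by 24), the observed joint entropy is H(X, Y) = - ∑x,y o(x, y) lg o(x, y) ≈ 2.95 bits.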

Joint Entropy (2/2) Try fiddling with the parameters of the following joint distribution m(x, y), and observe what happens to the joint entropy:

Conditional Entropy (1/2)
If random variables X and Y have joint distribution p(x, y), then we can define the conditional entropy of Y given X, H(Y | X), as follows:
H(Y | X) = ∑x p(x) H(Y | X = x) = ∑x p(x) [ - ∑y p(y | x) lg p(y | x) ]

Conditional Entropy (2/2)
From the previous slide: H(Y | X) = ∑x p(x) [ - ∑y p(y | x) lg p(y | x) ]
An alternative way of computing H(Y | X): H(Y | X) = H(X, Y) - H(X)
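As a hand-computed sanity check on the toy example: H(X, Y) ≈ 2.95 bits and H(X) = 1.5 bits, so H(Y | X) = H(X, Y) - H(X) ≈ 1.45 bits.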

Relative entropy (1/3) (KL divergence)
If we have two probability distributions, p(x) and q(x), we can define the relative entropy (aka KL divergence) between p and q, D(p || q), as follows:
D(p || q) = ∑x p(x) lg ( p(x) / q(x) )
We define 0 lg 0 = 0, and p lg (p/0) = ∞ for p > 0.
Relative entropy measures how much two probability distributions differ.
It is asymmetric: D(p || q) ≠ D(q || p) in general.
It is never negative: identical distributions have zero relative entropy; non-identical distributions have positive relative entropy.

Relative entropy (2/3) (KL divergence) Suppose we want to compare our observed unigram distribution o(x) to some arbitrary model distribution m(x). What is the relative entropy between them? Try fiddling with the parameters of m(x) and see what it does to the KL divergence. What parameters minimize the divergence? Maximize?
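For instance (worked by hand), if m(x) is the uniform distribution (m(A) = m(B) = m(.) = 1/3), then D(o || m) = 1/2 lg(3/2) + 1/4 lg(3/4) + 1/4 lg(3/4) ≈ 0.085 bits, and the divergence drops to exactly 0 when m = o.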

Relative entropy (3/3) (KL divergence)
What happens when o(x) = 0? (lg 0 is normally undefined!) Events that cannot happen (according to o(x)) do not contribute to KL divergence between o(x) and any other distribution.
What happens when m(x) = 0? (division by 0!) If an event x was observed (o(x) > 0) but your model says it can't (m(x) = 0), then your model is infinitely surprised: D(o || m) = ∞.
Why is D(p || q) ≥ 0? Can you prove it?
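(Hint: since lg is concave, Jensen's inequality gives -D(p || q) = ∑x p(x) lg ( q(x) / p(x) ) ≤ lg ∑x p(x) · q(x)/p(x) = lg ∑x q(x) ≤ lg 1 = 0, where the sums run over x with p(x) > 0.)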

Smoothing: Absolute Discounting
Idea: reduce counts of observed event types by a fixed amount δ, and reallocate the count mass to unobserved event types. Absolute discounting is simple and gives quite good results.
Terminology:
x - an event (a type) (e.g., bigram)
C - count of all observations (tokens) (e.g., training size)
C(x) - count of observations (tokens) of type x (e.g., bigram counts)
V - # of event types: |{x}| (e.g., size of vocabulary)
Nr - # of event types observed r times: |{x : C(x) = r}|
δ - a number between 0 and 1 (e.g., 0.75)

Absolute Discounting (Cont.)
For seen types, we deduct δ from the count mass:
Pad(x) = (C(x) - δ) / C   if C(x) > 0
How much count mass did we harvest by doing this? We took δ from each of the V - N0 seen types, so we have δ(V - N0) to redistribute among the N0 unseen types. So each unseen type will get a count mass of δ(V - N0) / N0:
Pad(x) = δ(V - N0) / (N0 C)   if C(x) = 0
To see how this works, let's go back to our original example and look at bigram counts. To bring unseens into the picture, let's suppose we have another word, C, giving rise to 7 new unseen bigrams:

Absolute Discounting (Cont.)
The probabilities add up to 1, but how do we prove this in general?
Also, how do you choose a good value for δ? 0.75 is often recommended, but you can also use held-out data to tune this parameter.
Look down the column of Pad probabilities. Anything troubling to you?

Proper Probability Distributions
To prove that a function p is a probability distribution, you must show:
1. ∀x p(x) ≥ 0
2. ∑x p(x) = 1
The first is generally trivial; the second can be more challenging. A proof for absolute discounting will illustrate the general idea:
∑x Pad(x) = ∑{x:C(x)>0} Pad(x) + ∑{x:C(x)=0} Pad(x)
= ∑{x:C(x)>0} (C(x) - δ)/C + ∑{x:C(x)=0} δ(V - N0)/(N0 C)    [V - N0 terms, then N0 terms]
= [∑{x:C(x)>0} C(x)]/C - δ(V - N0)/C + δ(V - N0)/C
= ∑x C(x)/C = C/C = 1
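A minimal Java sketch of the two formulas above, assuming a fixed, known event space of size V and a discount delta; it is only an illustration, not the assignment's API:

import java.util.Map;

public class AbsoluteDiscounting {
  private final Map<String, Integer> counts;  // C(x) for the seen event types only
  private final double delta;                 // the discount, e.g. 0.75
  private final int totalTokens;              // C
  private final int vocabSize;                // V = total number of event types
  private final int numUnseen;                // N0 = V minus the number of seen types

  public AbsoluteDiscounting(Map<String, Integer> counts, int vocabSize, double delta) {
    this.counts = counts;
    this.vocabSize = vocabSize;
    this.delta = delta;
    this.totalTokens = counts.values().stream().mapToInt(Integer::intValue).sum();
    this.numUnseen = vocabSize - counts.size();
  }

  public double prob(String x) {
    Integer c = counts.get(x);
    if (c != null && c > 0) {
      return (c - delta) / totalTokens;               // Pad(x) = (C(x) - delta) / C
    }
    // Redistribute the harvested mass delta*(V - N0) evenly over the N0 unseen types.
    return delta * (vocabSize - numUnseen) / (numUnseen * (double) totalTokens);
  }
}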

Good-Turing Smoothing
We redistribute the count mass of types observed r+1 times evenly among the types observed r times. Then we estimate P(x) as r*/C, where r* is an adjusted count for types observed r times. We want r* such that:
r* Nr = (r+1) Nr+1
r* = (r+1) Nr+1 / Nr
=> PGT(x) = ((r+1) Nr+1 / Nr) / C

Good-Turing (Cont.) To see how this works, let's go back to our example and look at bigram counts. To make the example more interesting, we'll assume we've also seen the sentence, "C C C C C C C C C C C", giving us another 12 bigram observations, as summarized in the following table of counts:

Good-Turing (Cont.)
But now the probabilities do not add up to 1. One problem is that for high values of r (i.e., for high-frequency bigrams), Nr+1 is quite likely to be 0. One way to address this is to use the Good-Turing estimates only for frequencies r < k, for some constant cutoff k. Above this, the MLE estimates are used.
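A minimal Java sketch of Good-Turing adjusted counts with such a cutoff k; the count-of-counts map and the method names are illustrative, not the assignment's API:

import java.util.Map;

public class GoodTuringCounts {
  // Adjusted count r* for a type observed r >= 1 times; countOfCounts.get(r) = Nr.
  static double adjustedCount(int r, Map<Integer, Integer> countOfCounts, int k) {
    if (r >= k) return r;                                    // above the cutoff, keep the MLE count
    Integer nr = countOfCounts.get(r);
    Integer nrPlus1 = countOfCounts.get(r + 1);
    if (nr == null || nrPlus1 == null || nr == 0) return r;  // nothing reliable to smooth with
    return (r + 1) * (double) nrPlus1 / nr;                  // r* = (r+1) Nr+1 / Nr
  }

  // Total probability mass reserved for all unseen types: N1 / C.
  static double unseenMass(Map<Integer, Integer> countOfCounts, int totalTokens) {
    return countOfCounts.getOrDefault(1, 0) / (double) totalTokens;
  }
}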

Good-Turing (Cont.)
Thus a better way to define r* is:
r* = (r+1) E[Nr+1] / E[Nr]
where E[Nr] means the expected number of event types (bigrams) observed r times. So we fit some function S to the observed values (r, Nr). Gale & Sampson (1995) suggest using a power curve Nr = a r^b, with b < -1, fit using linear regression in logspace:
log Nr = log a + b log r
The observed distribution of Nr, when transformed into logspace, looks like this:

Good-Turing (Cont.)
Here is a graph of the line fit in logspace and the resulting power curve:
The fit is poor, but it gives us smoothed values for Nr. Now, for r > 0, we use S(r) to generate the adjusted counts r*:
r* = (r+1) S(r+1) / S(r)
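A minimal Java sketch of that logspace fit by ordinary least squares; the (r, Nr) values below are hypothetical placeholders, not the numbers from the example:

public class PowerCurveFit {
  public static void main(String[] args) {
    int[] r  = {1, 2, 3, 4, 5};
    int[] nr = {120, 40, 24, 13, 15};    // hypothetical count-of-count values Nr

    int n = r.length;
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
      double x = Math.log(r[i]);         // log r
      double y = Math.log(nr[i]);        // log Nr
      sx += x; sy += y; sxx += x * x; sxy += x * y;
    }
    double b = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope; expect b < -1
    double a = Math.exp((sy - b * sx) / n);                // exp(log a)

    for (int i = 1; i <= 5; i++) {
      double s = a * Math.pow(i, b);                       // smoothed S(r) = a r^b
      double rStar = (i + 1) * a * Math.pow(i + 1, b) / s; // r* = (r+1) S(r+1) / S(r)
      System.out.printf("S(%d) = %.2f, r* = %.3f%n", i, s, rStar);
    }
  }
}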

Smoothing and Conditional Probabilities
Some people have the wrong idea about how to combine smoothing with conditional probability distributions. You know that a conditional distribution can be computed as the ratio of a joint distribution and a marginal distribution:
P(x | y) = P(x, y) / P(y)
What if you want to use smoothing?
WRONG: 1. Smooth the joint P(x, y) to get P'(x, y). 2. Independently smooth the marginal P(y) to get P''(y). 3. Combine: P'''(x | y) = P'(x, y) / P''(y).
(Smoothing each conditional distribution P(x | y) directly, as its own distribution, is also a legitimate option.)

Smoothing and Conditional Probabilities (Cont.)
The problem is that steps 1 and 2 do smoothing separately, so it makes no sense to divide the results. The right way to compute the smoothed conditional probability distribution P(x | y) is:
1. From the joint P(x, y), compute a smoothed joint P'(x, y).
2. From the smoothed joint P'(x, y), compute a smoothed marginal P'(y).
3. Divide them: let P'(x | y) = P'(x, y) / P'(y).
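A minimal Java sketch of those three steps, assuming the smoothed joint P'(x, y) has already been computed (by whatever smoothing method you like) and is stored keyed first by y, then by x; the names are illustrative only:

import java.util.Map;

public class SmoothedConditional {
  // smoothedJoint.get(y).get(x) = P'(x, y), already smoothed over the full event space;
  // assumes y appears as a key in the smoothed joint.
  static double conditional(Map<String, Map<String, Double>> smoothedJoint, String x, String y) {
    Map<String, Double> row = smoothedJoint.get(y);
    double marginalY = 0.0;                         // step 2: P'(y) = sum over x of P'(x, y)
    for (double p : row.values()) marginalY += p;
    return row.getOrDefault(x, 0.0) / marginalY;    // step 3: P'(x | y) = P'(x, y) / P'(y)
  }
}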

Smoothing and Conditional Probabilities (Cont.) Suppose we're on safari, and we count the animals we observe by species (x) and gender (y): From these counts, we can easily compute unsmoothed, empirical joint and marginal distributions:

Smoothing and Conditional Probabilities (Cont.)
Now suppose we want to use absolute discounting, with δ = 0.75. The adjusted counts are:
Now from these counts, we can compute smoothed joint and marginal distributions:
Now, since both P'(x, y) and P'(y) result from the same smoothing operation, it's OK to divide them:

Java
Any confusion about roulette wheel sampling in generateWord()?
You want to train a trigram model on 10 million words, but you keep running out of memory:
  Don't use a V × V × V matrix!
  CounterMap? Good, but needs to be elaborated.
  Intern your Strings!
  Virtual memory: -mx2000m
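For anyone unsure about the roulette wheel idea, here is a minimal generic sketch; generateWord() in the starter code does something along these lines, but this is an illustration, not the actual assignment code:

import java.util.Map;
import java.util.Random;

public class RouletteWheel {
  // Sample one outcome from a discrete distribution whose probabilities sum to ~1.
  static String sample(Map<String, Double> distribution, Random rng) {
    double spin = rng.nextDouble();       // uniform in [0, 1)
    double cumulative = 0.0;
    String last = null;
    for (Map.Entry<String, Double> e : distribution.entrySet()) {
      cumulative += e.getValue();
      last = e.getKey();
      if (spin < cumulative) return e.getKey();
    }
    return last;                          // guard against floating-point rounding
  }
}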

Programming Assignment Tips
Objectives:
The assignments are intentionally open-ended.
Think of the assignment as an investigation.
Think of your report as a mini research paper.
Implement the basics: bigram and trigram models, smoothing, interpolation.
Choose one or more additions, e.g.:
  Fancy smoothing: Katz, Witten-Bell, Kneser-Ney, gapped bigram
  Compare smoothing the joint vs. smoothing conditionals
  Crude spelling model for unknown words
  Trade-off between memory and performance
Explain what you did, why you did it, and what you found out.

Programming Assignment Tips
Development:
  Use a very small dataset during development (esp. debugging).
  Use validation data to tune hyperparameters.
  Investigate the learning curve (performance as a function of training size).
  Investigate the variance of your model results.
Report:
  Be concise! 6 pages is plenty.
  Prove that your actual, implemented distributions are proper (concisely!).
  Include a graph or two.
  Error analysis: discuss examples that your model gets wrong.