Introduction to N-grams [Many of the slides were originally created by Prof. Dan Jurafsky from Stanford.]
Who wrote this? “You are uniformly charming!” cried he, with a smile of associating and now and then I bowed and they perceived a chaise and four to wish for. This is written by a machine. It is a random sentence generated from a Jane Austen trigram model.
Assign a probability to a phrase Speech Recognition P(I saw a van) >> P(eyes awe of an) Spell Correction The office is about fifteen minuets from my house P(about fifteen minutes from) > P(about fifteen minuets from) Machine Translation P(high winds tonite) > P(large winds tonite)
Predict the next word Please turn your homework ____ … (think of autocomplete in the Google search bar).
Counts and probabilities P(its water is so transparent that) P(its water is so transparent that the) P(the|its water is so transparent that)
Probabilistic Language Modeling Compute the probability of a sentence or sequence of words: P(W) = P(w1w2w3w4w5…wn) Related task: probability of an upcoming word: P(w5|w1w2w3w4) A model that computes either of these: P(W) or P(wn|w1w2…wn-1) is called a language model.
The Chain Rule P(“its water is so transparent”) = P(its) × P(water|its) × P(is|its water) × P(so|its water is) × P(transparent|its water is so)
Unigram model Some automatically generated sentences from a unigram model: fifth an of futures the an incorporated a a the inflation most dollars quarter in is mass thrift did eighty said hard 'm july bullish that or limited the
Bigram model Condition on only the previous word: texaco rose one in this issue is pursuing growth in a boiler house said mr. gurria mexico 's motion control proposal without permission from five hundred fifty five yen outside new car parking lot of the agreement reached this would be a record november
N-gram models We can extend to trigrams, 4-grams, 5-grams In general this is an insufficient model of language because language has long-distance dependencies: “The computer which I had just put into the machine room on the fifth floor ___.” Predict the last word in the above sentence.
Bigram probabilities Use the Maximum Likelihood Estimate (MLE): P(wi | wi-1) = count(wi-1, wi) / count(wi-1).
Corpus from Dr. Seuss A corpus is a collection of written texts. Here is a mini-corpus of three sentences. <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s> Each sentence starts with the special token <s> and ends with </s>.
Calculate bigram probabilities <s> I am Sam </s> <s> Sam I am </s> <s> I do not like green eggs and ham </s>
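These bigram probabilities can be computed with a few lines of Python. This is a minimal sketch, assuming whitespace tokenization of the mini-corpus above:

```python
from collections import Counter

# Mini-corpus from the Dr. Seuss slide; <s> and </s> mark sentence boundaries.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigrams = Counter()
bigrams = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

def p(word, prev):
    """MLE bigram probability: P(word | prev) = c(prev, word) / c(prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

print(p("I", "<s>"))   # 2/3: "I" starts 2 of the 3 sentences
print(p("am", "I"))    # 2/3
print(p("Sam", "am"))  # 1/2
```

Duplicates in the Counter encode the counts, so the MLE is just a ratio of two lookups.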
Berkeley Restaurant Project sentences can you tell me about any good cantonese restaurants close by mid priced thai food is what i’m looking for tell me about chez panisse can you give me a listing of the kinds of food that are available i’m looking for a good place to eat breakfast when is caffe venezia open during the day
Unigram counts There are 9222 sentences in the corpus.
Bigram counts
Bigram probabilities Normalize each bigram count by the unigram count of the first word. Example results: 5/2533 = 0.002, 9/2533 = 0.0036, 211/2417 = 0.087. Sparsity: lots of zeros.
Bigram estimates of sentence probabilities P(<s> I want english food </s>) = P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food) = .000031
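The product above can be checked numerically. Only P(i|<s>) = .25 and P(english|want) = .0011 appear on these slides; the remaining bigram probabilities below are assumed fill-ins chosen to be consistent with the quoted result:

```python
import math

# Bigram probabilities for the example sentence. Only the first and third
# values appear on the slides; the others are assumed for illustration.
bigram_p = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}

tokens = ["<s>", "i", "want", "english", "food", "</s>"]
prob = math.prod(bigram_p[pair] for pair in zip(tokens, tokens[1:]))
print(f"{prob:.6f}")  # 0.000031
```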
Knowledge in the bigrams P(english|want) = .0011 P(chinese|want) = .0065 P(to|want) = .66 verb want followed by to + infinitive P(eat | to) = .28 to + infinitive P(food | to) = 0 P(want | spend) = 0 P (i | <s>) = .25
Computational efficiency issues Store log probabilities, not the raw probabilities: this avoids underflow, and adding log probabilities is faster than multiplying raw ones.
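A small demonstration of why log probabilities matter: multiplying many small probabilities underflows to zero in floating point, while summing their logs stays well within range.

```python
import math

# 100 bigrams, each with probability 1e-5 (an illustrative value).
probs = [1e-5] * 100

product = 1.0
for p in probs:
    product *= p
print(product)                # 0.0 -- the true value 1e-500 underflowed

log_prob = sum(math.log(p) for p in probs)
print(log_prob)               # about -1151.3, perfectly representable
```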
Google 4-gram counts serve as the incoming 92 serve as the incubator 99 serve as the independent 794 serve as the index 223 serve as the indication 72 serve as the indicator 120 serve as the indicators 45 serve as the indispensable 111 serve as the indispensible 40 serve as the individual 234
Complete the sentence I always order pizza with cheese and ____. mushrooms 0.1 pepperoni 0.1 anchovies 0.01 …. fried rice 0.0001 and 1e-100 I always order pizza with cheese and ____. The current president of the US is ____. I saw a ____. Is the unigram model good at this guessing game? Which sentence perplexes you the most?
Perplexity How hard is the task of recognizing digits {0,1,2,3,4,5,6,7,8,9}? If each digit is equally likely, perplexity = 10. How hard is recognizing 30,000 names at Microsoft? If each name is equally likely, perplexity = 30,000. Minimizing perplexity is the same as maximizing probability.
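These two numbers drop out of the standard definition, perplexity = exp(-(1/N) Σ log p_i), the inverse probability of the test set normalized by its length. A quick sketch:

```python
import math

def perplexity(probs):
    """Perplexity: inverse probability of the test set, normalized by length."""
    n = len(probs)
    return math.exp(-sum(math.log(p) for p in probs) / n)

# Uniform model over the 10 digits: every prediction has probability 1/10.
digit_probs = [1 / 10] * 50
print(perplexity(digit_probs))        # 10.0

# Uniform model over 30,000 names.
name_probs = [1 / 30_000] * 50
print(round(perplexity(name_probs)))  # 30000
```

For a uniform distribution over k outcomes, the perplexity is exactly k, matching the intuition of an "effective branching factor".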
Lower perplexity = better model Training on 38 million words, testing on 1.5 million words (WSJ): Unigram perplexity 962, Bigram 170, Trigram 109.
Shannon Visualization Method Choose the first bigram (<s>, w) according to its probability Now choose next bigram (w, x) according to its probability And so on until we choose </s> Then string the words together <s> I I want want to to eat eat Chinese Chinese food food </s> I want to eat Chinese food
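The sampling procedure above can be sketched in a few lines, reusing the Dr. Seuss mini-corpus (whitespace tokenization assumed):

```python
import random
from collections import defaultdict

# Successor table built from the Dr. Seuss mini-corpus; duplicates in each
# list encode the bigram counts, so random.choice samples by bigram probability.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

successors = defaultdict(list)
for sentence in corpus:
    tokens = sentence.split()
    for prev, nxt in zip(tokens, tokens[1:]):
        successors[prev].append(nxt)

def generate(seed=0):
    """Shannon visualization: walk from <s>, sampling bigrams until </s>."""
    random.seed(seed)
    word, out = "<s>", []
    while True:
        word = random.choice(successors[word])
        if word == "</s>":
            return " ".join(out)
        out.append(word)

print(generate())
```

Every generated sentence is stitched together from observed bigrams, so each word is drawn from the corpus vocabulary even though the whole sentence may never have occurred.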
Shakespeare lines generated by N-grams
Shakespeare as corpus N = 884,647 tokens, vocabulary size V = 29,066. Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams, so 99.96% of the possible bigrams were never seen. Quadrigrams: what's coming out looks like Shakespeare because it is Shakespeare.
Wall Street Journal
Training set and test set Use Shakespeare corpus to train the language model. Test the model on sentences from WSJ. Your natural language processor will run into big trouble.
Problem with zeros Training set: denied the allegations, denied the reports, denied the claims, denied the request. Test set: denied the offer, denied the loan. P("offer" | denied the) = 0, so the test set gets probability 0 and perplexity = ∞.
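A tiny sketch of the failure mode, using only the continuations of "denied the" seen in the training set above:

```python
from collections import Counter

# Words observed after "denied the" in the training set.
train_continuations = ["allegations", "reports", "claims", "request"]
counts = Counter(train_continuations)
total = sum(counts.values())

def p(word):
    """MLE estimate of P(word | denied the)."""
    return counts[word] / total

print(p("allegations"))  # 0.25
print(p("offer"))        # 0.0 -- unseen in training, so any test sentence
                         # containing it gets probability 0
```

One unseen bigram zeroes out the entire sentence probability, which is exactly why smoothing is needed.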
Bigram counts
Berkeley Restaurant Corpus: Add 1 counts
Laplace smoothing MLE estimate: P_MLE(wi | wi-1) = c(wi-1, wi) / c(wi-1). Pretend we saw each word one more time than we did. Add-1 estimate: P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V), where V is the vocabulary size.
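Both estimates are easy to compare side by side on the Dr. Seuss mini-corpus. A minimal sketch, counting <s> and </s> as vocabulary items:

```python
from collections import Counter

# Add-1 (Laplace) smoothing:
#   P_add1(w | prev) = (c(prev, w) + 1) / (c(prev) + V)
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]
unigrams, bigrams = Counter(), Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigrams.update(tokens)
    bigrams.update(zip(tokens, tokens[1:]))

V = len(unigrams)  # vocabulary size (12 here, counting <s> and </s>)

def p_mle(word, prev):
    return bigrams[(prev, word)] / unigrams[prev]

def p_add1(word, prev):
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + V)

print(p_mle("Sam", "am"), p_add1("Sam", "am"))  # 0.5 vs 2/14 ~ 0.143
print(p_mle("ham", "am"), p_add1("ham", "am"))  # 0.0 vs 1/14 ~ 0.071: no more zeros
```

Note how add-1 shaves probability off seen bigrams and spreads it over unseen ones.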
Deal with zeros by smoothing When we have sparse statistics, steal probability mass to generalize better. Before smoothing, P(w | denied the): allegations 3/7, reports 2/7, claims 1/7, request 1/7 (7 total); everything else (attack, man, outcome, …) gets zero. After smoothing: allegations 2.5/7, reports 1.5/7, claims 0.5/7, request 0.5/7, other 2/7.
Laplace-smoothed bigram probabilities
Reconstituted smoothed counts
Compare the raw count with the smoothed count C(want to) went from 608 to 238, and P(to|want) from .66 to .26! Define the discount d = c*/c; d for "chinese food" is .10, a 10x reduction!
Backoff and Interpolation Backoff: use the trigram if it has enough evidence; otherwise back off to the bigram, and then to the unigram. Interpolation: use all three by mixing them. In practice, interpolation works better.
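Simple linear interpolation is a weighted sum P_hat(w | u, v) = λ3 P(w | u, v) + λ2 P(w | v) + λ1 P(w), with the λs summing to 1. A sketch with illustrative numbers (not from a real corpus):

```python
# Linear interpolation of unigram, bigram, and trigram estimates.
# The lambdas and probabilities below are illustrative, not tuned.
lambdas = (0.1, 0.3, 0.6)               # lambda1, lambda2, lambda3; sum to 1
p_uni, p_bi, p_tri = 0.001, 0.04, 0.0   # trigram unseen, so its MLE is 0

p_hat = lambdas[0] * p_uni + lambdas[1] * p_bi + lambdas[2] * p_tri
print(p_hat)   # ~0.0121: nonzero even though the trigram was never seen
```

Because the unigram and bigram terms are always nonzero for in-vocabulary words, the interpolated estimate never hits exactly zero.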
How to set the lambdas? Split the data into Training Data, Held-Out Data, and Test Data. Use the training data to find the N-gram probabilities. Use the held-out data to find the λs: choose the λs that maximize the probability of the held-out data. After the system is set, test it on the future test data.
How to deal with words not in the vocabulary? If we know all the words in advance, the vocabulary V is fixed: a closed vocabulary task. Often we don't know this: Out Of Vocabulary (OOV) words, an open vocabulary task. Use a special unknown word token <UNK>: reduce V to V' by throwing out all the unimportant, low-count words; at the text normalization phase, change any training word not in V' to <UNK>; calculate the probability of <UNK> as if it is a normal word; at testing time, use the <UNK> probability for any word not in V'.
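The normalization step can be sketched as follows. The toy sentence and the count threshold of 2 are assumptions for illustration:

```python
from collections import Counter

# Open-vocabulary handling: keep only words seen at least MIN_COUNT times,
# and map everything else to <UNK> (threshold chosen for illustration).
MIN_COUNT = 2

train = "the cat sat on the mat the cat ran".split()
counts = Counter(train)
vocab = {w for w, c in counts.items() if c >= MIN_COUNT}  # {'the', 'cat'}

def normalize(tokens, vocab):
    """Replace any token outside the reduced vocabulary V' with <UNK>."""
    return [w if w in vocab else "<UNK>" for w in tokens]

print(normalize(train, vocab))               # rare training words become <UNK>
print(normalize("the dog sat".split(), vocab))  # unseen and rare words -> <UNK>
```

At test time, an entirely new word like "dog" gets the <UNK> probability instead of crashing the model with a zero.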
Conclusion N-gram technology is underpinned by sound probability theory and statistics, simplified by the Markov assumption. Calculations involve simple counting and division. It is useful in spelling correction, machine translation, imitating Shakespeare, machine-written poetry, and more.