Estimating N-gram Probabilities Language Modeling
Dan Jurafsky Estimating bigram probabilities
An example
Small corpus:
I am Sam
Sam I am
I do not like green eggs and ham
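The bigram estimates for this small corpus can be reproduced by counting. The sketch below estimates each probability as the count of the word pair divided by the count of the first word. It assumes each sentence is wrapped in `<s>` and `</s>` boundary markers, which the flattened corpus text does not show; the helper name `bigram_prob` is chosen here for illustration.

```python
from collections import Counter

# The three sentences of the small corpus. The <s> and </s> boundary
# markers are an assumption added for this sketch.
corpus = [
    "<s> I am Sam </s>",
    "<s> Sam I am </s>",
    "<s> I do not like green eggs and ham </s>",
]

unigram_counts = Counter()
bigram_counts = Counter()
for sentence in corpus:
    tokens = sentence.split()
    unigram_counts.update(tokens)
    bigram_counts.update(zip(tokens, tokens[1:]))


def bigram_prob(prev, word):
    """Estimate P(word | prev) as count(prev word) / count(prev)."""
    if unigram_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / unigram_counts[prev]


print(bigram_prob("<s>", "I"))   # 2 of the 3 sentences start with "I"
print(bigram_prob("I", "am"))    # "I" occurs 3 times, "I am" twice
print(bigram_prob("am", "Sam"))  # "am" occurs twice, "am Sam" once
```

Unseen pairs such as ("Sam", "ham") get probability 0 under this estimate.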
More examples: Berkeley Restaurant Project sentences
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i’m looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i’m looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Table 1 shows the bigram counts from the Berkeley Restaurant Project for a matrix over a sample of seven words. Note that the majority of the values are zero.
Raw bigram probabilities
Table 2 shows the unigram counts.
Table 3 shows the bigram probabilities after normalization (dividing each row of bigram counts by the unigram count of that row's word).
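In symbols, the normalization step can be written as follows. The notation C(·) for a count in the training corpus is a choice made for this sketch, not something the slides define.

```latex
P(w_i \mid w_{i-1}) = \frac{C(w_{i-1}\, w_i)}{C(w_{i-1})}
```

Each row of the bigram count table corresponds to one preceding word w_{i-1}, so every cell in that row is divided by the same unigram count.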
Bigram estimates of sentence probabilities
Here are a few other useful probabilities:
P(i|<s>) = 0.25
P(english|want) = 0.0011
P(food|english) = 0.5
P(</s>|food) = 0.68
Compute the bigram probability of the sentence using the values above and Table 3:
P(<s> i want english food </s>) = P(i|<s>) × P(want|i) × P(english|want) × P(food|english) × P(</s>|food)
= 0.25 × 0.33 × 0.0011 × 0.5 × 0.68
= 0.000031
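The product can be checked numerically. This is a minimal sketch: the dictionary holds only the five bigram probabilities quoted on the slides (P(english|want) = 0.0011 is the value given on the "What kinds of knowledge?" slide, and 0.33 is the P(want|i) factor in the product), and the function name `sentence_probability` is hypothetical.

```python
import math

# Bigram probabilities quoted on the slides, keyed as (previous, word).
bigram_probs = {
    ("<s>", "i"): 0.25,
    ("i", "want"): 0.33,
    ("want", "english"): 0.0011,
    ("english", "food"): 0.5,
    ("food", "</s>"): 0.68,
}


def sentence_probability(tokens, probs):
    """Multiply the bigram probabilities of each adjacent token pair."""
    return math.prod(probs[pair] for pair in zip(tokens, tokens[1:]))


tokens = ["<s>", "i", "want", "english", "food", "</s>"]
p = sentence_probability(tokens, bigram_probs)
print(f"{p:.6f}")  # 0.000031
```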
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat|to) = .28
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25
These probabilities encode:
Facts about the world: for example, P(chinese|want) is higher than P(english|want).
Facts about grammar: the verb “want” takes an infinitive (to + verb), so P(to|want) is high.
Facts about grammar: two verbs in a row are not allowed in English, so P(want|spend) = 0.
Practical Issues
Probabilities are (by definition) less than or equal to 1, so the more probabilities we multiply together, the smaller the product becomes. Multiplying enough N-gram probabilities together results in numerical underflow.
We therefore do everything in log space. Log probabilities are not as small as raw probabilities, and multiplication becomes addition, which is also faster than multiplying:
log(p1 × p2 × p3 × p4) = log p1 + log p2 + log p3 + log p4
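The underflow problem is easy to reproduce. The sketch below uses 400 made-up probabilities of 0.001 each, chosen only to force underflow in double precision; the values are hypothetical, not taken from any corpus.

```python
import math

# 400 hypothetical N-gram probabilities of 0.001 each (made-up values).
probs = [0.001] * 400

# Multiplying raw probabilities: 1e-1200 is far below the smallest
# positive double, so the product collapses to exactly 0.0.
raw_product = 1.0
for p in probs:
    raw_product *= p

# Adding log probabilities: the sum stays comfortably representable.
log_prob = sum(math.log(p) for p in probs)

print(raw_product)  # 0.0
print(log_prob)     # roughly -2763.1
```

Two models can still be compared on `log_prob`, because the logarithm preserves ordering.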
Language Modeling Toolkits SRILM
Google N-Gram Release, August 2006 …
Google N-Gram Release
Google Book N-grams
Evaluation and Perplexity Language Modeling
Evaluation: How good is our model?
Does our language model prefer good sentences to bad ones?
It should assign higher probability to “real” or “frequently observed” sentences than to “ungrammatical” or “rarely observed” sentences.
There are two types of evaluation:
Intrinsic evaluation
Extrinsic evaluation
An Intrinsic Evaluation of N-gram models
The first way to compare N-gram models A and B. (An N-gram model uses a training set, or training corpus, such as the Google N-gram corpus, to compute its probabilities.)
1. Train each N-gram model (A and B) on the training set.
2. Test each model's performance on a test set that it has not seen: data that is different from the training set and totally unused.
3. Use an evaluation metric that tells us how well each model does on the test set.
Extrinsic evaluation of N-gram models
Another way to compare N-gram models A and B:
1. Put each N-gram model (A and B) in a task, such as a spelling corrector, speech recognizer, or MT system.
2. Run the task and get an accuracy for model A and for model B. How?
Count by hand how many misspelled words are corrected properly (spelling corrector).
See which model gives the more accurate transcription (speech recognizer).
Count by hand how many words are translated correctly (MT).
3. Compare the accuracy of model A with that of model B.
Difficulty of extrinsic evaluation of N-gram models
Extrinsic evaluation is time-consuming; it can take days or weeks.
So we sometimes use intrinsic evaluation instead.
But intrinsic evaluation gives a bad approximation if the test data is part of the training set.
To avoid this, choose test data that is as large as possible, not part of the training set, and unseen (unused). For example, we can divide a large corpus into a training set and a test set.
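Dividing a corpus into training and test portions can be sketched as below. The 10% test fraction, the fixed random seed, and the function name `split_corpus` are all assumptions made for illustration; the slides do not prescribe a particular split.

```python
import random


def split_corpus(sentences, test_fraction=0.1, seed=0):
    """Shuffle sentences and hold out a fraction as an unseen test set."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    # The first n_test sentences become the test set; the rest is training.
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for a large corpus (hypothetical data).
sentences = [f"sentence {i}" for i in range(100)]
train, test = split_corpus(sentences)
print(len(train), len(test))  # 90 10
```

Because each sentence lands in exactly one of the two lists, no test sentence is used for training.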
More about Intrinsic evaluation
In intrinsic evaluation, in practice we don’t use raw probability as our metric for evaluating language models, but a variant called perplexity.
The perplexity of a language model on a test set (sometimes called PP for short) is the inverse probability of the test set, normalized by the number of words.
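Perplexity
Perplexity is the inverse probability of the test set W = w_1 w_2 ... w_N, normalized by the number of words:
\[ PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\frac{1}{P(w_1 w_2 \dots w_N)}} \]
We can use the chain rule to expand the probability of W:
\[ PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_1 \dots w_{i-1})}} \]
If we are computing the perplexity of W with a bigram language model we get:
\[ PP(W) = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i \mid w_{i-1})}} \]
Minimizing perplexity is the same as maximizing probability.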
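A bigram perplexity computation might look like the sketch below. It works in log space and then exponentiates, so the inverse probability normalized by the number of words is obtained without underflow. The callable `bigram_prob(prev, word)` is a hypothetical interface, and the uniform toy model is made up for illustration.

```python
import math


def perplexity(tokens, bigram_prob):
    """Perplexity of a token sequence under a bigram model.

    bigram_prob(prev, word) is a hypothetical callable returning
    P(word | prev). N counts every predicted token, that is,
    everything after the initial <s>.
    """
    pairs = list(zip(tokens, tokens[1:]))
    log_prob = sum(math.log(bigram_prob(prev, word)) for prev, word in pairs)
    # exp(-log P / N) equals P ** (-1 / N)
    return math.exp(-log_prob / len(pairs))


def uniform(prev, word):
    """Toy model (made up): every bigram has probability 0.25."""
    return 0.25


tokens = ["<s>", "a", "b", "c", "</s>"]
print(perplexity(tokens, uniform))  # approximately 4.0
```

A bigram with probability 0 makes `math.log` raise a `ValueError`, which mirrors the fact that perplexity is undefined when the model assigns zero probability to the test set.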
Perplexity as branching factor
Suppose a sentence consists of random digits.
What is the perplexity of this sentence according to a model that assigns P = 1/10 to each digit?
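Applying the definition (inverse probability of the sequence, normalized by the number of words) to a string of N digits gives the answer directly. The derivation assumes the model treats each digit independently, so the probability of the whole string is (1/10)^N.

```latex
\begin{align*}
PP(W) &= P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} \\
      &= \left( \left(\tfrac{1}{10}\right)^{N} \right)^{-\frac{1}{N}} \\
      &= \left(\tfrac{1}{10}\right)^{-1} \\
      &= 10
\end{align*}
```

The perplexity equals the number of equally likely choices at each position, which is why it can be read as a branching factor.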
Lower perplexity = better model
Example: we trained unigram, bigram, and trigram grammars on 38 million words (the training set) from the Wall Street Journal, using a 19,979-word vocabulary. We then computed the perplexity of each of these models on a test set of 1.5 million words.
N-gram Order    Unigram    Bigram    Trigram
Perplexity