LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26
2 Administrivia
Reminder
– Homework 4 due tonight
– Questions? After class
3 Last Time
Background: a general introduction to probability concepts
– sample space and events
– permutations and combinations
– rule of counting
– event probability / conditional probability
– uncertainty and entropy
– statistical experiment: outcomes
4 Today's Topic
Statistical methods are widely used in language processing
– apply probability theory to language: N-grams
Reading
– textbook chapter 6: N-grams
5 N-grams: Unigrams
Introduction
– given a corpus of text, the n-grams are the sequences of n consecutive words that occur in the corpus
Example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N = 1 (8 distinct unigrams)
– the 3
– sat 2
– on 2
– cat 1
– that 1
– sofa 1
– also 1
– mat 1
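As a quick illustration (my own sketch, not from the slides), the unigram counts above can be reproduced in a few lines of Python:

from collections import Counter

sentence = "the cat that sat on the sofa also sat on the mat"
tokens = sentence.split()          # 12 word tokens

unigram_counts = Counter(tokens)   # frequency of each distinct word
for word, freq in unigram_counts.most_common():
    print(word, freq)              # the 3, sat 2, on 2, cat 1, ...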
6 N-grams: Bigrams
Example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N = 2 (9 distinct bigrams, each 2 words long)
– sat on 2
– on the 2
– the cat 1
– cat that 1
– that sat 1
– the sofa 1
– sofa also 1
– also sat 1
– the mat 1
7 N-grams: Trigrams
Example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N = 3 (9 distinct trigrams, each 3 words long)
– sat on the 2
– the cat that 1
– cat that sat 1
– that sat on 1
– on the sofa 1
– the sofa also 1
– sofa also sat 1
– also sat on 1
– on the mat 1
Note
– most language models stop here; some stop at quadrigrams
– going further gives too many n-grams with low frequencies
8 N-grams: Quadrigrams
Example (12-word sentence)
– the cat that sat on the sofa also sat on the mat
N = 4 (9 distinct quadrigrams, each 4 words long)
– the cat that sat 1
– cat that sat on 1
– that sat on the 1
– sat on the sofa 1
– on the sofa also 1
– the sofa also sat 1
– sofa also sat on 1
– also sat on the 1
– sat on the mat 1
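The bigram, trigram, and quadrigram counts on the preceding slides can be reproduced with a small generic counter; this is my own sketch, and the helper name ngram_counts is invented for the example:

from collections import Counter

def ngram_counts(tokens, n):
    # Count the n-grams (tuples of n consecutive tokens) in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

tokens = "the cat that sat on the sofa also sat on the mat".split()
for n in (2, 3, 4):
    counts = ngram_counts(tokens, n)
    print(n, len(counts))                      # 9 distinct n-grams for n = 2, 3, 4
    for gram, freq in counts.most_common(2):
        print("  ", " ".join(gram), freq)      # e.g. "sat on 2", "on the 2" for n = 2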
9 N-grams: frequency curves
– a family of curves sorted by decreasing frequency f: one curve each for unigrams, bigrams, trigrams, quadrigrams, ...
[figure: family of n-gram frequency curves]
10 N-grams: the word as a unit
We count words, but what counts as a word?
– punctuation
  a useful surface cue; also <s> = beginning of a sentence, used as a dummy word
  part-of-speech taggers include punctuation as words (why?)
– capitalization
  They vs. they: same token or not?
– wordform vs. lemma
  cats vs. cat: same token or not?
– disfluencies
  part of spoken language: er, um, main- mainly
  speech recognition systems have to cope with them
11 N-grams: Word
What counts as a word?
– punctuation
  a useful surface cue; also <s> = beginning of a sentence, used as a dummy word
  part-of-speech taggers include punctuation as words (why?)
[figure: punctuation tags from the Penn Treebank tagset]
12 Language Models and N-grams
Brown corpus (1 million words):
  word      f(w)      p(w)
  the       69,971    0.070
  rabbit    11        0.000011
Given a word sequence
– w1 w2 w3 ... wn
– the probability of seeing wi depends on what we have seen before
– recall conditional probability, introduced last time
Example (section 6.2)
– "Just then, the white ___": rabbit or the?
– expectation is p(rabbit|white) > p(the|white)
– but p(the) > p(rabbit)
13 Language Models and N-grams
Given a word sequence
– w1 w2 w3 ... wn
Chain rule: how to compute the probability of a sequence of words
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
Note
– it is not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences
14 Language Models and N-grams
Given a word sequence
– w1 w2 w3 ... wn
Bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite-length history
– 1st order Markov model
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Note
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
15 Language Models and N-grams
Trigram approximation
– 2nd order Markov model
– just look at the preceding two words only
– p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
Note
– p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)
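To make the bigram approximation concrete, here is a minimal sketch of my own (not from the slides) that scores a word sequence as p(w1) times the product of p(wi|wi-1); the probability values are toy numbers invented for the example:

unigram_p = {"the": 0.07, "cat": 0.001, "sat": 0.002}          # toy p(w)
bigram_p = {("the", "cat"): 0.01, ("cat", "sat"): 0.05}        # toy p(w|w_prev)

def bigram_sentence_prob(words):
    prob = unigram_p.get(words[0], 0.0)          # p(w1)
    for prev, cur in zip(words, words[1:]):      # p(wi | wi-1) for i = 2..n
        prob *= bigram_p.get((prev, cur), 0.0)   # zero if the bigram was never seen
    return prob

print(bigram_sentence_prob(["the", "cat", "sat"]))   # 0.07 * 0.01 * 0.05 = 3.5e-05

The trigram (2nd order) version is the same idea, with p(wi|wi-2 wi-1) looked up from a table keyed on the two preceding words.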
16 Language Models and N-grams
Example: bigram language model from section 6.2
– <s> = start of sentence
– p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
– using p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
[figure 6.2]
17 Language Models and N-grams
Example: bigram language model from section 6.2
– <s> = start of sentence
– p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
[figure 6.3]
– p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
– = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
– = 0.0000081 (differs slightly from the textbook)
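The multiplication on this slide can be checked directly; a small sketch of mine using the six probability values quoted above:

from math import prod

# p(I|<s>), p(want|I), p(to|want), p(eat|to), p(British|eat), p(food|British)
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]

print(prod(probs))   # ~8.1e-06, i.e. 0.0000081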
18 Language Models and N-grams
Estimating from corpora: how to compute bigram probabilities
– p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w), where w ranges over all words
– since Σw f(wn-1 w) = f(wn-1), the unigram frequency of wn-1,
– p(wn|wn-1) = f(wn-1 wn) / f(wn-1)   (relative frequency)
Note
– the technique of estimating (true) probabilities using a relative-frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
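A minimal MLE sketch (mine, not from the slides), using the 12-word example sentence from the earlier slides as a toy training corpus:

from collections import Counter

tokens = "the cat that sat on the sofa also sat on the mat".split()
unigram_f = Counter(tokens)
bigram_f = Counter(zip(tokens, tokens[1:]))

def p_mle(w, w_prev):
    # Maximum likelihood estimate: p(w | w_prev) = f(w_prev w) / f(w_prev)
    return bigram_f[(w_prev, w)] / unigram_f[w_prev]

print(p_mle("on", "sat"))    # f(sat on)/f(sat) = 2/2 = 1.0
print(p_mle("cat", "the"))   # f(the cat)/f(the) = 1/3 ≈ 0.33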
19 Language Models and N-grams
Example
– p(I want to eat British food) = p(I|<s>) p(want|I) p(to|want) p(eat|to) p(British|eat) p(food|British)
– = 0.25 * 0.32 * 0.65 * 0.26 * 0.001 * 0.60
– = 0.0000081
In practice, calculations are done in log space
– p(I want to eat British food) = 0.0000081, a tiny number
– use logprob (log2 probability)
– actually the sum of the (negative) log2s of the probabilities
Question
– why sum negative logs of probabilities?
Answer (part 1)
– computer floating-point storage has a limited range: about 5.0×10^-324 to 1.7×10^308 for a 64-bit double
– danger of underflow
20 Language Models and N-grams
Question
– why sum negative logs of probabilities?
Answer (part 2)
– if A = BC, then log(A) = log(B) + log(C)
– probabilities lie in the range (0, 1]
– note: we want probabilities to be non-zero, since log(0) = -∞
– logs of probabilities are therefore negative (up to 0)
– take the negative to make them positive
[figure: log function, region of interest (0, 1]]
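A small sketch of my own showing the same sentence probability computed as a sum of negative log2 probabilities, which avoids underflow when many small terms are multiplied:

from math import log2

# bigram probabilities quoted on slide 17
probs = [0.25, 0.32, 0.65, 0.26, 0.001, 0.60]

neg_logprob = sum(-log2(p) for p in probs)   # sum of negative log2 probabilities
print(neg_logprob)                           # ~16.9 (smaller = more probable)
print(2 ** -neg_logprob)                     # back to ~0.0000081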
21 Motivation for smoothing
Smoothing: avoid zero probabilities
Consider what happens when any individual probability component is zero
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
– multiplication law: 0 × X = 0
– very brittle!
– even in a large corpus, many n-grams will have zero frequency, particularly so for larger n
22 Language Models and N-grams
Example
[figure: unigram frequencies, bigram frequencies f(wn-1 wn), and bigram probabilities shown as a sparse matrix]
– zeros render probabilities unusable (we'll need to add fudge factors, i.e. do smoothing)
23 Smoothing and N-grams
A sparse dataset means zeros are a problem
– zero probabilities are a problem
  p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)   (bigram model)
  one zero and the whole product is zero
– zero frequencies are a problem
  p(wn|wn-1) = f(wn-1 wn) / f(wn-1)   (relative frequency)
  the bigram f(wn-1 wn) doesn't exist in the dataset
Smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value
– we'll look at two ways here (just one of them today)
24 Smoothing and N-grams
Add-One smoothing
– add 1 to all frequency counts
– simple, and no more zeros (but there are better methods)
Unigram
– p(w) = f(w)/N   (before Add-One; N = size of corpus)
– p(w) = (f(w)+1)/(N+V)   (with Add-One; V = number of distinct words in the corpus)
– f*(w) = (f(w)+1) * N/(N+V)   (adjusted count with Add-One)
  N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
Bigram
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1)   (before Add-One)
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)   (after Add-One)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)   (after Add-One)
– must rescale so that the total probability mass stays at 1
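A minimal sketch (mine, not from the course materials) of the Add-One bigram formulas above, again using the example sentence from the earlier slides as a toy corpus:

from collections import Counter

tokens = "the cat that sat on the sofa also sat on the mat".split()
unigram_f = Counter(tokens)
bigram_f = Counter(zip(tokens, tokens[1:]))
V = len(unigram_f)   # number of distinct words (here 8)

def p_addone(w, w_prev):
    # Add-One smoothed bigram probability: (f(w_prev w) + 1) / (f(w_prev) + V)
    return (bigram_f[(w_prev, w)] + 1) / (unigram_f[w_prev] + V)

def f_star(w, w_prev):
    # Adjusted bigram count: (f(w_prev w) + 1) * f(w_prev) / (f(w_prev) + V)
    return (bigram_f[(w_prev, w)] + 1) * unigram_f[w_prev] / (unigram_f[w_prev] + V)

print(p_addone("on", "sat"))    # (2+1)/(2+8) = 0.3, was 1.0 unsmoothed
print(p_addone("mat", "sat"))   # (0+1)/(2+8) = 0.1, was 0.0 unsmoothed
print(f_star("on", "sat"))      # 3 * 2/10 = 0.6, down from the raw count of 2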
25 Smoothing and N-grams
Add-One smoothing
– add 1 to all frequency counts
Bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
Frequencies [figure 6.4: original counts; figure 6.8: Add-One adjusted counts]
Remarks: perturbation problem
– Add-One causes large changes in some frequencies due to the relative size of V (1616)
– e.g. want to: 786 becomes 338
26 Smoothing and N-grams
Add-One smoothing
– add 1 to all frequency counts
Bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
Probabilities [figure 6.5: original probabilities; figure 6.7: Add-One smoothed probabilities]
Remarks: perturbation problem
– similar changes in probabilities
27 Smoothing and N-grams
Excel spreadsheet available
– addone.xls