LING/C SC 581: Advanced Computational Linguistics
Lecture Notes, Jan 22nd
Today's Topics
– Minimum Edit Distance Homework
– Corpora: frequency information
– tregex
Minimum Edit Distance Homework
Background:
– “… about 20% of the time “Britney Spears” is misspelled when people search for it on Google”
Software for generating misspellings:
– “If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.”
Minimum Edit Distance Homework Top six misspellings Design a minimum edit algorithm that ranks these misspellings (as accurately as possible): – e.g. ED(brittany) < ED(britany)
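As a starting point for the ranking task, here is a minimal sketch of a weighted minimum edit distance in Python. The function name, the default weights (insert 1, delete 1, substitute 2) and the choice to compare against the lowercase first name only are assumptions for illustration, not the required design.

```python
def edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Weighted minimum edit distance via dynamic programming."""
    n, m = len(source), len(target)
    # dist[i][j] = cheapest way to turn source[:i] into target[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = dist[i - 1][0] + del_cost
    for j in range(1, m + 1):
        dist[0][j] = dist[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            dist[i][j] = min(
                dist[i - 1][j] + del_cost,                       # delete
                dist[i][j - 1] + ins_cost,                       # insert
                dist[i - 1][j - 1] + (0 if same else sub_cost),  # match/substitute
            )
    return dist[n][m]


# The two misspellings named on the slide, ranked against the correct first name.
candidates = ["britany", "brittany"]
ranked = sorted(candidates, key=lambda w: edit_distance("britney", w))
```

With these default weights, ED(britany) = 2 and ED(brittany) = 3, which is the opposite of the target ordering ED(brittany) < ED(britany). That gap is what the choice of weights has to close.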
Minimum Edit Distance Homework
Submit your homework in PDF
– how many you got right
– explain your criteria, e.g. the weights chosen
You should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well
Due to me before next Thursday's class…
– put your name and 581 at the top of your submission
Part 2
Corpora: frequency information
Unlabeled corpus: just words (easy to find)
Labeled corpus: various kinds … (progressively harder to create or obtain)
– POS information
– Information about phrases
– Word sense or semantic role labeling
Language Models and N-grams
Given a word sequence
– $w_1 w_2 w_3 \ldots w_n$
Chain rule
– how to compute the probability of a sequence of words
– $p(w_1 w_2) = p(w_1)\, p(w_2 \mid w_1)$
– $p(w_1 w_2 w_3) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2)$
– ...
– $p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$
Note
– It’s not easy to collect (meaningful) statistics on $p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$ for all possible word sequences
Language Models and N-grams
Given a word sequence
– $w_1 w_2 w_3 \ldots w_n$
Bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov Assumption: finite length history
– 1st order Markov Model
– exact: $p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-3} w_{n-2} w_{n-1})$
– approximation: $p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$
Note
– $p(w_n \mid w_{n-1})$ is a lot easier to collect data for (and thus estimate well) than $p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$
Language Models and N-grams
Trigram approximation
– 2nd order Markov Model
– just look at the preceding two words only
– exact: $p(w_1 w_2 w_3 w_4 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2)\, p(w_4 \mid w_1 w_2 w_3) \cdots p(w_n \mid w_1 \ldots w_{n-3} w_{n-2} w_{n-1})$
– approximation: $p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2)\, p(w_4 \mid w_2 w_3) \cdots p(w_n \mid w_{n-2} w_{n-1})$
Note
– $p(w_n \mid w_{n-2} w_{n-1})$ is a lot easier to estimate well than $p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$, but harder than $p(w_n \mid w_{n-1})$
Language Models and N-grams
Estimating from corpora
– how to compute bigram probabilities
– $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n) / \sum_w f(w_{n-1} w)$, where $w$ is any word
– since $\sum_w f(w_{n-1} w) = f(w_{n-1})$, where $f(w_{n-1})$ is the unigram frequency for $w_{n-1}$:
– $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})$ (relative frequency)
Note:
– The technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
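The relative-frequency estimate can be sketched in a few lines of Python. The function name and the toy token list are made up for illustration. Each word is counted as a history only when something follows it, which equals its unigram frequency except for the corpus-final token.

```python
from collections import Counter


def bigram_mle(tokens):
    """Return p(w_n | w_{n-1}) estimated by relative frequency (MLE)."""
    bigram_freq = Counter(zip(tokens, tokens[1:]))
    # Count each word as a history; this equals its unigram frequency
    # except for the very last token, which has no following word.
    history_freq = Counter(tokens[:-1])
    return {
        (prev, word): count / history_freq[prev]
        for (prev, word), count in bigram_freq.items()
    }


# Toy corpus (hypothetical), just to exercise the function.
tokens = "the cat sat on the mat the cat ran".split()
probs = bigram_mle(tokens)
```

Here "the" occurs three times as a history and is followed by "cat" twice, so `probs[("the", "cat")]` is 2/3. The estimates for any one history sum to 1.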
Motivation for smoothing
Smoothing: avoid zero probability estimates
Consider what happens when any individual probability component is zero
– arithmetic multiplication law: $0 \times X = 0$
– very brittle!
Even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
– particularly so for larger n-grams
$p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$
Language Models and N-grams
Example:
[tables: unigram frequencies; bigram frequencies and bigram probabilities, indexed by $w_{n-1}$ and $w_n$]
– sparse matrix: zeros render probabilities unusable (we’ll need to add fudge factors, i.e. do smoothing)
Smoothing and N-grams
A sparse dataset means zeros are a problem
– Zero probabilities are a problem
  $p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$ (bigram model)
  one zero and the whole product is zero
– Zero frequencies are a problem
  $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})$ (relative frequency)
  the bigram $w_{n-1} w_n$ doesn’t exist in the dataset
Smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value
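To see the brittleness concretely, here is a small Python sketch of the bigram product. The probability tables are invented toy values, not estimates from any corpus, and the function name is hypothetical.

```python
def sequence_probability(words, unigram_p, bigram_p):
    """Bigram model: p(w_1) times the product of p(w_i | w_{i-1})."""
    prob = unigram_p.get(words[0], 0.0)
    for prev, word in zip(words, words[1:]):
        # An unseen bigram contributes 0 and wipes out the whole product.
        prob *= bigram_p.get((prev, word), 0.0)
    return prob


# Invented toy probabilities for illustration only.
unigram_p = {"the": 0.5, "cat": 0.25, "dog": 0.25}
bigram_p = {("the", "cat"): 0.5, ("the", "dog"): 0.5, ("cat", "sat"): 1.0}

seen = sequence_probability(["the", "cat", "sat"], unigram_p, bigram_p)
unseen = sequence_probability(["the", "dog", "sat"], unigram_p, bigram_p)
```

`seen` is 0.5 × 0.5 × 1.0 = 0.25, while `unseen` is 0 purely because the bigram "dog sat" is missing from the table.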
Smoothing and N-grams
Add-One Smoothing (4.5.1 Laplace Smoothing)
– add 1 to all frequency counts
– simple and no more zeros (but there are better methods)
Unigram
– $p(w) = f(w)/N$ (before Add-One), where $N$ = size of corpus
– $p(w) = (f(w)+1)/(N+V)$ (with Add-One), where $V$ = number of distinct words in corpus
– $f^*(w) = (f(w)+1) \cdot N/(N+V)$ (with Add-One)
– $N/(N+V)$ is a normalization factor adjusting for the effective increase in the corpus size caused by Add-One
Bigram
– $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n)/f(w_{n-1})$ (before Add-One)
– $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)$ (after Add-One)
– $f^*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) \cdot f(w_{n-1})/(f(w_{n-1})+V)$ (after Add-One)
– must rescale so that total probability mass stays at 1
Smoothing and N-grams
Add-One Smoothing
– add 1 to all frequency counts
Bigram frequencies
– $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)$
– $f^*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) \cdot f(w_{n-1})/(f(w_{n-1})+V)$
Remarks: perturbation problem
– add-one causes large changes in some frequencies due to the relative size of $V$ (1616)
– e.g. “want to”: 786 → 338
[frequency tables: figure 6.4 and figure 6.8]
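The size of the perturbation can be checked numerically. Below is a minimal Python sketch of the two bigram Add-One formulas. The function names are hypothetical, and the history count f(want) = 1215 is an assumption taken from the textbook's unigram table, since it is not given on the slide.

```python
def add_one_probability(bigram_count, history_count, vocab_size):
    """p(w_n | w_{n-1}) = (f(w_{n-1} w_n) + 1) / (f(w_{n-1}) + V)"""
    return (bigram_count + 1) / (history_count + vocab_size)


def add_one_adjusted_count(bigram_count, history_count, vocab_size):
    """f*(w_{n-1} w_n) = (f(w_{n-1} w_n) + 1) * f(w_{n-1}) / (f(w_{n-1}) + V)"""
    return (bigram_count + 1) * history_count / (history_count + vocab_size)


V = 1616        # vocabulary size from the slide
F_WANT = 1215   # assumed unigram count for "want" (textbook table, not on the slide)
F_WANT_TO = 786 # bigram count for "want to" from the slide

adjusted = add_one_adjusted_count(F_WANT_TO, F_WANT, V)
```

Under that assumption, `adjusted` rounds to 338, matching the drop from 786 shown on the slide: V is larger than f(want), so more than half of the probability mass for this history moves to other bigrams.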
Smoothing and N-grams
Add-One Smoothing
– add 1 to all frequency counts
Bigram probabilities
– $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)$
– $f^*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1) \cdot f(w_{n-1})/(f(w_{n-1})+V)$
Remarks: perturbation problem
– similar changes in probabilities
[probability tables: figure 6.5 and figure 6.7]
Smoothing and N-grams
Let’s illustrate the problem
– take the bigram case: $w_{n-1} w_n$
– $p(w_n \mid w_{n-1}) = f(w_{n-1} w_n)/f(w_{n-1})$
– suppose there are cases $w_{n-1} w^{zero}_1, \ldots, w_{n-1} w^{zero}_m$ that don’t occur in the corpus
[diagram: the probability mass $f(w_{n-1})$ is divided among the seen bigrams $f(w_{n-1} w_n)$; $f(w_{n-1} w^{zero}_1) = 0$, ..., $f(w_{n-1} w^{zero}_m) = 0$]
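Smoothing and N-grams
Add-one
– “give everyone 1”
[diagram: the probability mass $f(w_{n-1})$ is now shared among $f(w_{n-1} w_n)+1$ for the seen bigrams and $f(w_{n-1} w^{zero}_1) = 1$, ..., $f(w_{n-1} w^{zero}_m) = 1$ for the unseen ones]
– $V = |\{w_i\}|$
– redistribution of probability mass: $p(w_n \mid w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)$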
Smoothing and N-grams
Good-Turing Discounting (4.5.2)
– $N_c$ = number of things (= n-grams) that occur $c$ times in the corpus
– $N$ = total number of things seen
– Formula: smoothed $c$ for $N_c$ given by $c^* = (c+1)\, N_{c+1}/N_c$
– Idea: use frequency of things seen once to estimate frequency of things we haven’t seen yet
– estimate $N_0$ in terms of $N_1$ … and so on
– but if $N_c = 0$, smooth that first using something like $\log(N_c) = a + b \log(c)$
– Formula: $P^*(\text{things with zero freq}) = N_1/N$
– smaller impact than Add-One
Textbook Example:
– Fishing in lake with 8 species: bass, carp, catfish, eel, perch, salmon, trout, whitefish
– Sample data (6 out of 8 species): 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
– P(unseen new fish, i.e. bass or catfish) $= N_1/N = 3/18 = 0.17$
– P(next fish=trout) $= 1/18$ (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula…)
– revised count for trout: $c^*(\text{trout}) = 2 \cdot N_2/N_1 = 2(1/3) = 0.67$ (discounted from 1)
– revised P(next fish=trout) $= 0.67/18 = 0.037$
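The fishing example can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch deliberately does not implement the log-linear smoothing of $N_c$ mentioned above: where $N_{c+1} = 0$ it simply returns `None` for the revised count.

```python
from collections import Counter


def good_turing(counts):
    """Return (P(unseen), revised counts c*) from raw counts."""
    total = sum(counts.values())             # N
    freq_of_freq = Counter(counts.values())  # N_c
    p_unseen = freq_of_freq[1] / total       # N_1 / N
    revised = {}
    for item, c in counts.items():
        if freq_of_freq[c + 1] > 0:
            # c* = (c + 1) * N_{c+1} / N_c
            revised[item] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # N_{c+1} = 0: the N_c values would need smoothing first (not done here).
            revised[item] = None
    return p_unseen, revised


# Sample data from the slide.
catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
p_unseen, revised = good_turing(catch)
```

`p_unseen` is 3/18 and `revised["trout"]` is 2/3, matching the slide. The `None` entries for carp and perch show why the unsmoothed formula is not enough on its own for the higher counts.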
Language Models and N-grams
N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
– N-gram models can also incorporate word classes, e.g. POS labels when available
Language Models and N-grams
N-gram models
– data is easy to obtain: any unlabeled corpus will do
– they’re technically easy to compute: count frequencies and apply the smoothing formula
– but just how good are these n-gram language models?
– and what can they show us about language?
Language Models and N-grams approximating Shakespeare – generate random sentences using n-grams – Corpus: Complete Works of Shakespeare Unigram (pick random, unconnected words) Bigram
Language Models and N-grams
Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
Trigram
Quadrigram
Remarks: dataset size problem
– training set is small: 884,647 words, 29,066 different words
– $29{,}066^2 = 844{,}832{,}356$ possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, which means the program can’t be very innovative for higher n
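A random sentence generator of this kind can be sketched as follows for the bigram case. The function names are hypothetical and a single line of Hamlet stands in for the full corpus. Sampling from a list of observed successors (with repeats) is one simple way to pick the next word in proportion to bigram frequency.

```python
import random
from collections import defaultdict


def build_successors(tokens):
    """Map each word to the list of words observed right after it."""
    successors = defaultdict(list)
    for prev, word in zip(tokens, tokens[1:]):
        successors[prev].append(word)  # repeats preserve bigram frequencies
    return successors


def generate(successors, start, max_words, rng):
    """Walk the bigram table from `start`, sampling one successor at a time."""
    words = [start]
    while len(words) < max_words and successors.get(words[-1]):
        words.append(rng.choice(successors[words[-1]]))
    return words


# Toy stand-in for the Shakespeare corpus.
tokens = "to be or not to be that is the question".split()
successors = build_successors(tokens)
sentence = generate(successors, "to", 8, random.Random(0))
```

Even in this toy case most words have exactly one possible continuation, which is the limited-choice effect described in the remarks above: the generator can only replay stretches of its training text.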
Language Models and N-grams A limitation: – produces ungrammatical sequences Treebank: – potential to be a better language model – Structural information: contains frequency information about syntactic rules – we should be able to generate sequences that are closer to English …
Language Models and N-grams Aside:
Part 3
tregex
I assume everyone has:
1. Installed Penn Treebank v3
2. Downloaded and installed tregex
Trees in the Penn Treebank Notation: LISP S-expression Directory: TREEBANK_3/parsed/mrg/
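To make the S-expression notation concrete, here is a minimal Python sketch that reads one bracketed tree into nested lists of the form `[label, child, ...]`. The function name and the example tree are made up for illustration. The sketch assumes every bracket has a label, so any extra unlabeled outer bracket around a tree in the `.mrg` files would need to be stripped first.

```python
def parse_tree(text):
    """Parse one bracketed tree into nested lists: [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def parse():
        nonlocal pos
        pos += 1                  # skip "("
        node = [tokens[pos]]      # node label, e.g. S, NP, DT
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse())      # nested constituent
            else:
                node.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1                  # skip ")"
        return node

    return parse()


# Hypothetical example tree, not taken from the Treebank.
tree = parse_tree("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
```

The result is `["S", ["NP", ["DT", "the"], ["NN", "cat"]], ["VP", ["VBD", "sat"]]]`: each list starts with its label, followed by its children.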
tregex Search Example: << dominates, < immediately dominates
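The difference between the two relations can be sketched in Python over a tree written as nested lists `[label, child, ...]`. This only illustrates what `<` and `<<` mean; it is not how tregex is implemented. The function names and the example tree are hypothetical, labels are compared by exact string match, and words are not treated as nodes.

```python
def children(node):
    """Labeled child nodes (words are skipped in this simplification)."""
    return [c for c in node[1:] if isinstance(c, list)]


def immediately_dominates(node, label):
    """A < B: some child of node is labeled `label`."""
    return any(child[0] == label for child in children(node))


def dominates(node, label):
    """A << B: some descendant of node is labeled `label`."""
    return any(
        child[0] == label or dominates(child, label)
        for child in children(node)
    )


def find(tree, test):
    """Collect every node in the tree for which test(node) is true."""
    found = [tree] if test(tree) else []
    for child in children(tree):
        found.extend(find(child, test))
    return found


# Hypothetical example tree.
tree = ["S", ["NP", ["DT", "the"], ["NN", "cat"]], ["VP", ["VBD", "sat"]]]
s_over_nn = find(tree, lambda n: n[0] == "S" and dominates(n, "NN"))
s_directly_over_nn = find(tree, lambda n: n[0] == "S" and immediately_dominates(n, "NN"))
```

The analogue of `S << NN` finds the S node, because NN sits inside the NP below S. The analogue of `S < NN` finds nothing, because NN is not a child of S.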
tregex Help
tregex Help
tregex Help: tregex expression syntax is non-standard wrt bracketing S < VP S < NP
tregex Help: tregex boolean syntax is also non-standard
tregex Help
tregex Help
tregex
Pattern:
– <, $+ (/,/ $+ $+ /,/=comma))) <- =comma)
Key:
– <, first child
– $+ immediate left sister
– <- last child
– =comma: both occurrences refer to the same node
tregex Help
tregex
Different results from: < /^WH.*-([0-9]+)$/#1%index << < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
tregex Example: WHADVP also possible (not just WHNP)
Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide
Treebank Guides Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages): tagguid2.pdf: addendum, see POS tag ‘TO’
Treebank Guides Parsing guide 1, prsguid1.pdf (318 pages): prsguid2.pdf: addendum for the Switchboard corpus