LING/C SC 581: Advanced Computational Linguistics – Lecture Notes, Jan 22nd

1 LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 22nd

2 Today's Topics
– Minimum Edit Distance Homework
– Corpora: frequency information
– tregex

3 Minimum Edit Distance Homework
Background:
– … about 20% of the time "Britney Spears" is misspelled when people search for it on Google
Software for generating misspellings:
– If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.
– http://www.geneffects.com/typopositive/

4 Minimum Edit Distance Homework
http://www.google.com/jobs/archive/britney.html
Top six misspellings
Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible):
– e.g. ED(brittany) < ED(britany)

5 Minimum Edit Distance Homework
Submit your homework in PDF:
– how many you got right
– explain your criteria, e.g. the weights chosen
You should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well.
Due by email to me before next Thursday's class…
– put your name and 581 at the top of your submission
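For reference, here is a minimal sketch of the standard dynamic-programming (Wagner-Fischer) edit distance in Python. The per-operation costs are illustrative defaults, not the weights the homework asks for; choosing and justifying those weights is the actual exercise.

def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    # classic minimum edit distance via dynamic programming
    # d[i][j] = cost of transforming source[:i] into target[:j]
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = d[i - 1][0] + del_cost
    for j in range(1, m + 1):
        d[0][j] = d[0][j - 1] + ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,      # deletion
                          d[i][j - 1] + ins_cost,      # insertion
                          d[i - 1][j - 1] + sub)       # substitution or match
    return d[n][m]

for misspelling in ["brittany", "britany"]:
    print(misspelling, min_edit_distance(misspelling, "britney"))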

6 Part 2 – Corpora: frequency information
Unlabeled corpus: just words – easy to find
Labeled corpus: various kinds – progressively harder to create or obtain
– POS information
– Information about phrases
– Word sense or semantic role labeling

7 Language Models and N-grams
Given a word sequence:
– w1 w2 w3 ... wn
Chain rule – how to compute the probability of a sequence of words:
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
Note:
– it's not easy to collect (meaningful) statistics on p(wn|wn-1 wn-2 ... w1) for all possible word sequences

8 Language Models and N-grams
Given a word sequence:
– w1 w2 w3 ... wn
Bigram approximation:
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite length history
– 1st order Markov model
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-3 wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
Note:
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
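To make the bigram approximation concrete, here is a minimal sketch that scores a sentence under a bigram model, assuming unigram_prob and bigram_prob are dictionaries of precomputed probabilities (the names are illustrative; the estimation sketch below shows one way to build them):

import math

def sentence_logprob(words, unigram_prob, bigram_prob):
    # p(w1 ... wn) ≈ p(w1) * product of p(wi | wi-1); use log space to avoid underflow
    logp = math.log(unigram_prob[words[0]])
    for prev, cur in zip(words, words[1:]):
        logp += math.log(bigram_prob[(prev, cur)])
    return logp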

9 Language Models and N-grams
Trigram approximation:
– 2nd order Markov model
– just look at the preceding two words only
– p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-3 wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
Note:
– p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)

10 Language Models and N-grams
Estimating from corpora – how to compute bigram probabilities:
– p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w), where w is any word
– since Σw f(wn-1 w) = f(wn-1), the unigram frequency for wn-1:
– p(wn|wn-1) = f(wn-1 wn) / f(wn-1) (relative frequency)
Note:
– the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
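A minimal sketch of MLE bigram estimation over a toy corpus (the corpus and variable names are illustrative, not from the lecture):

from collections import Counter

def mle_bigram_probs(sentences):
    # p(wn | wn-1) = f(wn-1 wn) / f(wn-1), counted over the training corpus
    unigrams = Counter()
    bigrams = Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = [["i", "want", "chinese", "food"], ["i", "want", "to", "eat"]]
probs = mle_bigram_probs(corpus)
print(probs[("i", "want")])   # 1.0 in this toy corpus: every "i" is followed by "want"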

11 Motivation for smoothing
Smoothing: avoid zero probability estimates
Consider what happens when any individual probability component is zero:
– arithmetic multiplication law: 0 × X = 0
– very brittle!
Even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency:
– particularly so for larger n-grams
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

12 Language Models and N-grams
Example (tables of unigram frequencies, bigram frequencies f(wn-1 wn), and bigram probabilities not reproduced here)
– sparse matrix: zeros render probabilities unusable (we'll need to add fudge factors, i.e. do smoothing)

13 Smoothing and N-grams
A sparse dataset means zeros are a problem:
– Zero probabilities are a problem:
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1) (bigram model)
one zero and the whole product is zero
– Zero frequencies are a problem:
p(wn|wn-1) = f(wn-1 wn) / f(wn-1) (relative frequency)
the bigram f(wn-1 wn) doesn't exist in the dataset
Smoothing refers to ways of assigning zero-probability n-grams a non-zero value.

14 Smoothing and N-grams
Add-One Smoothing (4.5.1 Laplace Smoothing):
– add 1 to all frequency counts
– simple and no more zeros (but there are better methods)
Unigram:
– p(w) = f(w)/N (before Add-One), where N = size of corpus
– p(w) = (f(w)+1)/(N+V) (with Add-One), where V = number of distinct words in corpus
– f*(w) = (f(w)+1)*N/(N+V) (with Add-One)
– N/(N+V) is a normalization factor adjusting for the effective increase in the corpus size caused by Add-One
Bigram:
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1) (before Add-One)
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V) (after Add-One)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V) (after Add-One)
– must rescale so that total probability mass stays at 1
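As a sketch, the Add-One bigram estimate is a one-line change to the MLE version above (bigram_counts, unigram_counts and vocab_size are assumed to come from counting code like the earlier example):

def add_one_bigram_prob(bigram_counts, unigram_counts, vocab_size, w_prev, w):
    # Laplace (Add-One) smoothing: p(w | w_prev) = (f(w_prev w) + 1) / (f(w_prev) + V)
    return (bigram_counts.get((w_prev, w), 0) + 1) / (unigram_counts.get(w_prev, 0) + vocab_size)

Unseen bigrams now get a small non-zero probability instead of zero, at the cost of perturbing the counts of seen bigrams, as the next two slides show.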

15 Smoothing and N-grams
Add-One Smoothing – add 1 to all frequency counts
Bigram:
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)
Frequencies (figures 6.4 and 6.8, not reproduced here)
Remarks: perturbation problem
– add-one causes large changes in some frequencies due to the relative size of V (1616)
– e.g. want to: 786 → 338

16 Smoothing and N-grams
Add-One Smoothing – add 1 to all frequency counts
Bigram:
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1)*f(wn-1)/(f(wn-1)+V)
Probabilities (figures 6.5 and 6.7, not reproduced here)
Remarks: perturbation problem – similar changes in probabilities

17 Smoothing and N-grams
Let's illustrate the problem – take the bigram case:
– wn-1 wn
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1)
– suppose there are cases wn-1 wzero1 … wn-1 wzerom that don't occur in the corpus
(diagram: the probability mass f(wn-1) is spread over the seen bigram counts f(wn-1 wn), while f(wn-1 wzero1) = 0, …, f(wn-1 wzerom) = 0)

18 Smoothing and N-grams
Add-one – "give everyone 1"
(diagram: the probability mass f(wn-1) is now shared among f(wn-1 wn)+1, with f(wn-1 wzero1) = 1, …, f(wn-1 wzerom) = 1)

19 Smoothing and N-grams
Add-one – "give everyone 1"
(same diagram as the previous slide, with V = |{wi}|)
Redistribution of probability mass:
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)

20 Smoothing and N-grams
Good-Turing Discounting (4.5.2):
– Nc = number of things (= n-grams) that occur c times in the corpus
– N = total number of things seen
– Formula: smoothed count c* for Nc given by c* = (c+1) Nc+1 / Nc
– Idea: use the frequency of things seen once to estimate the frequency of things we haven't seen yet
– estimate N0 in terms of N1 …
– and so on, but if Nc = 0, smooth that first using something like log(Nc) = a + b log(c)
– Formula: P*(things with zero frequency) = N1/N
– smaller impact than Add-One
Textbook example:
– fishing in a lake with 8 species: bass, carp, catfish, eel, perch, salmon, trout, whitefish
– sample data (6 out of 8 species seen): 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel (N = 18)
– P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17
– P(next fish = trout) = 1/18 (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula…)
– revised count for trout: c*(trout) = 2·N2/N1 = 2(1/3) = 0.67 (discounted from 1)
– revised P(next fish = trout) = 0.67/18 = 0.037
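A short sketch that reproduces the fishing numbers (just arithmetic over the sample counts, nothing lecture-specific):

from collections import Counter

# observed catch: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())                 # 18 fish seen
Nc = Counter(counts.values())            # Nc[c] = number of species seen c times

p_unseen = Nc[1] / N                     # N1/N = 3/18 ≈ 0.17
c_star_trout = (1 + 1) * Nc[2] / Nc[1]   # c* = (c+1) * Nc+1 / Nc = 2 * (1/3) ≈ 0.67
p_trout = c_star_trout / N               # ≈ 0.037
print(p_unseen, c_star_trout, p_trout)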

21 Language Models and N-grams
N-gram models + smoothing:
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
– N-gram models can also incorporate word classes, e.g. POS labels, when available

22 Language Models and N-grams
N-gram models:
– data is easy to obtain: any unlabeled corpus will do
– they're technically easy to compute: count frequencies and apply the smoothing formula
– but just how good are these n-gram language models?
– and what can they show us about language?

23 Language Models and N-grams
Approximating Shakespeare:
– generate random sentences using n-grams
– corpus: Complete Works of Shakespeare
Unigram (pick random, unconnected words) and bigram samples (not reproduced here)

24 Language Models and N-grams
Approximating Shakespeare:
– generate random sentences using n-grams
– corpus: Complete Works of Shakespeare
Trigram and quadrigram samples (not reproduced here)
Remarks: dataset size problem
– the training set is small: 884,647 words, 29,066 different words
– 29,066² = 844,832,356 possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, which means the program can't be very innovative for higher n
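For illustration, a minimal bigram random-sentence generator in Python (corpus reading and tokenization are not shown; variable names are illustrative):

import random
from collections import defaultdict

def build_bigram_table(tokens):
    # map each word to the list of words observed to follow it
    table = defaultdict(list)
    for prev, cur in zip(tokens, tokens[1:]):
        table[prev].append(cur)
    return table

def generate(table, start, length=15):
    # sample word-by-word from the observed continuations of the previous word
    words = [start]
    for _ in range(length):
        continuations = table.get(words[-1])
        if not continuations:
            break
        words.append(random.choice(continuations))
    return " ".join(words)

# usage, assuming tokens is a list of words from the Shakespeare corpus:
# table = build_bigram_table(tokens)
# print(generate(table, "the"))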

25 Language Models and N-grams
A limitation:
– produces ungrammatical sequences
Treebank:
– potential to be a better language model
– structural information: contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English …

26 Language Models and N-grams Aside: http://hemispheresmagazine.com/contests/2004/intro.htm

27 Part 3 – tregex
I assume everyone has:
1. Installed Penn Treebank v3
2. Downloaded and installed tregex

28 Trees in the Penn Treebank
Notation: LISP S-expression
Directory: TREEBANK_3/parsed/mrg/

29 tregex Search Example: << dominates, < immediately dominates
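As a hedged illustration (assuming the standard Stanford tregex distribution with stanford-tregex.jar in the current directory, and a treebank file path that exists on your machine; the exact invocation may vary by version), a pattern can be run against treebank files from the command line along these lines:

java -cp stanford-tregex.jar edu.stanford.nlp.trees.tregex.TregexPattern 'S << NP' TREEBANK_3/parsed/mrg/wsj/00/wsj_0001.mrg

Swapping << for < restricts matches to S nodes that immediately dominate an NP.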

30 tregex Help

31 tregex Help

32 tregex Help: tregex expression syntax is non-standard with respect to bracketing
(help screenshot not reproduced; example patterns shown: S < VP, S < NP)

33 tregex Help: tregex boolean syntax is also non-standard

34 tregex Help

35 tregex Help

36 tregex
Pattern:
– (@NP <, (@NP $+ (/,/ $+ (@NP $+ /,/=comma))) <- =comma)
Key:
– <, first child
– $+ immediate left sister
– <- last child
– =comma names the same node in both places where it appears
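To illustrate with an invented example tree (not taken from the Treebank), the pattern picks out appositive NPs of the shape

(NP (NP (DT the) (NN dog)) (, ,) (NP (DT a) (NN poodle)) (, ,))

i.e. an NP whose first child is an NP, immediately followed by a comma, then another NP, then a comma (named =comma), where that second comma is also the last child of the outer NP.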

37 tregex Help

38 tregex

39 Different results from: – @SBAR < /^WH.*-([0-9]+)$/#1%index << (@NP < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))
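Reading the pattern: the #1%index variable groups link the two captured digit sequences, so the SBAR must immediately dominate a WH phrase bearing some index n and dominate (anywhere below it) an NP containing an empty element *T*-n with that same index. A schematic, simplified tree of the kind it would match (invented for illustration):

(SBAR (WHNP-1 (WP who)) (S (NP-SBJ (-NONE- *T*-1)) (VP (VBD slept))))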

40 tregex Example: WHADVP also possible (not just WHNP)

41 Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide

42 Treebank Guides
Parts-of-speech (POS) Tagging Guide, tagguid1.pdf (34 pages)
tagguid2.pdf: addendum, see POS tag 'TO'

43 Treebank Guides
Parsing guide 1, prsguid1.pdf (318 pages)
prsguid2.pdf: addendum for the Switchboard corpus

