LING/C SC 581: Advanced Computational Linguistics Lecture Notes Feb 6th
Administrivia
The Homework Pipeline:
Homework 2 graded
Homework 4 not back yet… soon
Homework 5 due Weds by midnight
No classes next week: I'm out of town on business
No new homework assigned this week
Today's Topics Homework 4 review
Homework 4 Review: Question 1
Construct a WSJ text corpus that excludes both words tagged as -NONE- and punctuation words (defined previously)
Show your Python console.
How many words in the corpus?
How many distinct words?
Plot the cumulative frequency distribution graph
How many top words do you need to account for 50% of the corpus?
Homework 4 Review: Question 1
import nltk
from nltk.corpus import ptb

# POS tags to drop: empty elements, brackets, symbols and punctuation
excluded = {'-NONE-', '-LRB-', '-RRB-', 'SYM', ':', '.', ',', '``', "''"}
tokens = [word for word, tag in ptb.tagged_words(categories=['news']) if tag not in excluded]
text = nltk.Text(tokens)  # define text before its length is used below
words = set(tokens)
print('Tokens: {}; #Words: {}'.format(len(text), len(words)))
# Tokens: 1037490; #Words: 49184
len(words)
# 49184
# distinct words / tokens is a ratio, not a percentage
print('Lexical diversity: {:.3f}'.format(len(words) / len(text)))
# Lexical diversity: 0.047
dist = nltk.FreqDist(text)
print(dist)
# <FreqDist with 49184 samples and 1037490 outcomes>
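Homework 4 Review: Question 1
# (word, frequency) pairs, most frequent first
# (named ranked rather than list, to avoid shadowing the built-in)
ranked = sorted(dist.items(), key=lambda t: t[1], reverse=True)
half = len(text) / 2.0
total = 0
index = 0
# add up frequencies from the top until half the corpus is covered
while total < half:
    total += ranked[index][1]
    index += 1
print('No of words: {}; total: {}'.format(index, total))
# No of words: 217; total: 518763
# (half the corpus: 1037490 / 2 = 518745)
Homework 4 Review: Question 1
# list the top words that together cover 50% of the corpus
print('{:12s} {:5s}'.format('Word', 'Freq'))
for word, freq in ranked[:index]:
    print('{:12s} {:5d}'.format(word, freq))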
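The 50% calculation above can also be sketched without NLTK. The following is a minimal illustration on a toy token list, not the homework solution itself; the function name words_needed and the toy sentence are hypothetical.

```python
from collections import Counter
from itertools import accumulate


def words_needed(tokens, fraction=0.5):
    """Return how many of the most frequent word types cover `fraction` of the tokens."""
    # Frequencies sorted from most to least frequent.
    counts = [freq for _, freq in Counter(tokens).most_common()]
    target = fraction * len(tokens)
    # Walk the running (cumulative) total until it reaches the target.
    for rank, running_total in enumerate(accumulate(counts), start=1):
        if running_total >= target:
            return rank
    return len(counts)


# Toy data (hypothetical): 11 tokens, 5 distinct words.
toy_tokens = "the cat saw the dog and the dog saw the cat".split()
print(words_needed(toy_tokens))
```

For the plot itself, NLTK's FreqDist has a plot method that accepts cumulative=True, which should be enough for the graph the question asks for.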
Homework 4 Review: Question 2
With case folding:
tokens = [word.lower() for word, tag in ptb.tagged_words(categories=['news']) if tag not in excluded]
Results:
Tokens: 1037490; #Words: 43746
Lexical diversity: 0.042
No of words: 176; total: 518944 (1037490 / 2 = 518745)
Colorless green ideas
Examples (Chomsky 1957):
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Chomsky (1957): . . . It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally `remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not.
Idea: (1) is syntactically valid, (2) is word salad
One piece of supporting evidence:
(1) is pronounced with normal intonation
(2) is pronounced like a list of words …
Background: Language Models and N-grams
Given a word sequence $w_1 w_2 w_3 \ldots w_n$
Chain rule: how to compute the probability of a sequence of words
$p(w_1 w_2) = p(w_1)\, p(w_2 \mid w_1)$
$p(w_1 w_2 w_3) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2)$
...
$p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$
Note: it's not easy to collect (meaningful) statistics on $p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$ for all possible word sequences
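As a worked instance, the chain rule applied to Chomsky's sentence (1) expands as follows. Nothing here is estimated from data; it only spells out which conditional probabilities would be needed.

```latex
\begin{align*}
p(\text{colorless green ideas sleep furiously})
  &= p(\text{colorless}) \\
  &\quad \times p(\text{green} \mid \text{colorless}) \\
  &\quad \times p(\text{ideas} \mid \text{colorless green}) \\
  &\quad \times p(\text{sleep} \mid \text{colorless green ideas}) \\
  &\quad \times p(\text{furiously} \mid \text{colorless green ideas sleep})
\end{align*}
```

The last factor conditions on a four-word history, which illustrates why statistics for long histories are hard to collect.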
Background: Language Models and N-grams
Given a word sequence $w_1 w_2 w_3 \ldots w_n$
Bigram approximation: just look at the previous word only (not all the preceding words)
Markov Assumption: finite length history
1st order Markov Model
Chain rule (exact):
$p(w_1 w_2 w_3 \ldots w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1 w_2) \cdots p(w_n \mid w_1 \ldots w_{n-3} w_{n-2} w_{n-1})$
Bigram approximation:
$p(w_1 w_2 w_3 \ldots w_n) \approx p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_2) \cdots p(w_n \mid w_{n-1})$
Note: $p(w_n \mid w_{n-1})$ is a lot easier to collect data for (and thus estimate well) than $p(w_n \mid w_1 \ldots w_{n-2} w_{n-1})$
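The bigram approximation can be sketched in a few lines of Python. This is a minimal illustration with unsmoothed maximum likelihood estimates on a made-up three-sentence corpus. The function names, the toy corpus, and the use of a start symbol <s> (so that p(w1) is estimated as p(w1 | <s>)) are all assumptions for the example, not part of the lecture.

```python
from collections import Counter


def train_bigram(sentences):
    """Collect unigram and bigram counts; <s> marks the sentence start."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def sentence_prob(sentence, unigrams, bigrams):
    """Multiply MLE estimates p(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})."""
    words = ["<s>"] + sentence.split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0  # unseen history: no MLE estimate, treated as 0 here
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob


# Toy corpus (hypothetical), far too small for real estimates.
toy_corpus = ["new ideas appear", "new ideas sleep", "green ideas appear"]
unigrams, bigrams = train_bigram(toy_corpus)
print(sentence_prob("new ideas sleep", unigrams, bigrams))
print(sentence_prob("sleep ideas new", unigrams, bigrams))
```

Without smoothing, any sentence containing an unseen bigram gets probability zero, so a practical model would need some form of smoothing to compare two sentences whose bigrams never occur in the training data.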
Colorless green ideas
Sentences:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Statistical Experiment (Pereira 2002): bigram language model over adjacent word pairs $w_{i-1} w_i$
Part-of-Speech (POS) Tag Sequence
Chomsky's example:
colorless      green  ideas  sleep   furiously
JJ             JJ     NNS    VBP     RB           (POS Tags)
Similar but grammatical example (LSLT pg. 146):
revolutionary  new    ideas  appear  infrequently
JJ             JJ     NNS    VBP     RB
Stanford Parser Stanford Parser: a probabilistic PS parser trained on the Penn Treebank
Penn Treebank (PTB) Corpus: word frequencies
(blank cells were lost or empty in the source table)
Word           POS    Frequency
colorless
green          NNP    33
               JJ     19
               NN     5
ideas          NNS    32
sleep          VB     4
               VBP    2
                      1
furiously      RB

Word           POS    Frequency
revolutionary  JJ     6
               NNP    2
               NN
new                   1795
                      1459
               NNPS   1
ideas          NNS    32
appear         VB     55
               VBP    41
infrequently
Stanford Parser
Structure of NPs: colorless green ideas / revolutionary new ideas
Phrase             Frequency
[NP JJ JJ NNS]     1073
[NP NNP JJ NNS]    61
An experiment
Examples:
(1) colorless green ideas sleep furiously
(2) furiously sleep ideas green colorless
Question: Is (1) even the most likely permutation of these particular five words?
Parsing Data All 5! (=120) permutations of colorless green ideas sleep furiously .
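Generating the candidate sentences is straightforward. The sketch below only enumerates the 120 word orders; scoring each one with a parser is not shown, and the variable names are illustrative.

```python
from itertools import permutations

words = "colorless green ideas sleep furiously".split()

# Every ordering of the five words, each followed by the sentence-final period token.
candidates = [" ".join(order) + " ." for order in permutations(words)]

print(len(candidates))  # 5! orderings
print(candidates[0])
```

Because the five words are distinct, all orderings are distinct, and both of Chomsky's sentences (1) and (2) appear among them.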
Parsing Data
The winning sentence was: furiously ideas sleep colorless green .
(after training on sections 02-21, approx. 40,000 sentences)
Parse notes:
sleep selects for an ADJP object with 2 heads
the adverb (RB) furiously modifies the noun
Parsing Data
The next two highest scoring permutations were:
Furiously green ideas sleep colorless .
Green ideas sleep furiously colorless .
Parse notes:
sleep takes NP object
sleep takes ADJP object
Parsing Data
Pereira (2002) compared Chomsky's original minimal pair:
colorless green ideas sleep furiously
furiously sleep ideas green colorless
Ranked #23 and #36 respectively out of the 120 permutations
Parsing Data
But the graph (next slide) shows how arbitrary these rankings are when trained on randomly chosen sections covering 14K-31K sentences
Example: #36 furiously sleep ideas green colorless outranks #23 colorless green ideas sleep furiously (and the top 3) over much of the training space
Example: Chomsky's original sentence #23 colorless green ideas sleep furiously outranks both the top 3 and #36 just briefly at one data point
Sentence Rank vs. Amount of Training Data Best three sentences
Sentence Rank vs. Amount of Training Data #23 colorless green ideas sleep furiously #36 furiously sleep ideas green colorless