LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 20th

Today's Topics
1. LR(k) grammars contd.; Homework 1
– due before midnight on Tuesday the 26th (i.e. before the next lecture)
– one PDF file writeup: to
2. N-gram models and "Colorless green ideas sleep furiously"

Recap: dotted rule notation
– the "dot" is used to track the progress of a parse through a phrase structure rule
– examples:
  vp --> v . np   means we've seen v and are predicting an np
  np --> . dt nn  means we're predicting a dt (followed by nn)
  vp --> vp pp .  means we've completed the RHS of a vp

Recap: Parse State
– a state is a set of dotted rules; the set encodes the state of the parse
– the set of dotted rules = the name of the state
kernel:
  vp --> v . np
  vp --> v .
completion (of the predicted np):
  np --> . dt nn
  np --> . nnp
  np --> . np sbar

Recap: Shift and Reduce Actions
two main actions:
– Shift: move a word from the input onto the stack
  Example: np --> . dt nn
– Reduce: build a new constituent
  Example: np --> dt nn .

Recap: LR State Machine
Built by advancing the dot over terminals and nonterminals.
Start state 0:
– ss --> . s $
– complete (close) this state
Shift action (LHS --> . POS …):
1. move the word with that POS tag from the input queue onto the stack
2. goto the new state indicated by (top-of-stack state × POS)
Reduce action (LHS --> RHS .):
1. pop |RHS| items off the stack
2. wrap them as [LHS .. RHS ..] and put the result back onto the stack
3. goto the new state indicated by (top-of-stack state × LHS)
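
The "complete this state" (closure) step can be sketched in Prolog. This is only an illustration, not the code in lr0.pl: the item(LHS, Before, After) representation of a dotted rule LHS --> Before . After and the predicate names are assumptions made for this sketch; it reuses rule/2 and nonT/1 from grammar0.pl.

% closure(+Items, -Closure): repeatedly add NT --> . RHS for every
% nonterminal NT that some item predicts (dot immediately before NT),
% until nothing new can be added.
closure(Items, Closure) :-
    setof(item(NT, [], RHS), predicted(Items, NT, RHS), New),
    subtract(New, Items, Added),
    Added \== [],
    !,
    append(Items, Added, Items1),
    closure(Items1, Closure).
closure(Items, Items).

% predicted(+Items, -NT, -RHS): NT is a nonterminal right after a dot
% in some item, and NT --> RHS is a grammar rule.
predicted(Items, NT, RHS) :-
    member(item(_, _, [NT|_]), Items),
    nonT(NT),
    rule(NT, RHS).

% ?- closure([item(ss, [], [s, $])], State0).
% builds the full item set for start state 0.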

LR State Machine Example
(diagram; the states are listed here, and each transition in the diagram is labeled with the terminal or nonterminal the dot advances over, e.g. state 0 goes to state 1 on s, and state 1 goes to state 13 on $)
State 0:  ss --> . s $ ; s --> . np vp ; np --> . np pp ; np --> . n ; np --> . d n
State 1:  ss --> s . $
State 2:  np --> d . n
State 3:  np --> n .
State 4:  s --> np . vp ; np --> np . pp ; vp --> . v np ; vp --> . v ; vp --> . vp pp ; pp --> . p np
State 5:  np --> np pp .
State 6:  pp --> p . np ; np --> . np pp ; np --> . n ; np --> . d n
State 7:  vp --> v . np ; vp --> v . ; np --> . np pp ; np --> . n ; np --> . d n
State 8:  s --> np vp . ; vp --> vp . pp ; pp --> . p np
State 9:  vp --> vp pp .
State 10: vp --> v np . ; np --> np . pp ; pp --> . p np
State 11: pp --> p np . ; np --> np . pp ; pp --> . p np
State 12: np --> d n .
State 13: ss --> s $ .

Prolog Code
Files on webpage:
1. grammar0.pl
2. lr0.pl
3. parse.pl
4. lr1.pl
5. parse1.pl

LR(k) in the Chomsky Hierarchy
Definition: a grammar is said to be LR(k) for some k = 0, 1, 2, … if the LR state machine for that grammar is unambiguous
– i.e. there are no conflicts, only one possible action at each point…
(diagram: nested sets, with Regular Languages (RL) inside LR(0), inside LR(1), inside the Context-Free Languages)

LR(k) in the Chomsky Hierarchy
If there is ambiguity, we can still use the LR machine:
1. pick one action, and use backtracking for the alternative actions, or
2. run the actions in parallel

grammar0.pl

rule(ss,[s,$]).
rule(s,[np,vp]).
rule(np,[dt,nn]).
rule(np,[nnp]).
rule(np,[np,pp]).
rule(vp,[vbd,np]).
rule(vp,[vbz]).
rule(vp,[vp,pp]).
rule(pp,[in,np]).

lexicon(the,dt).    lexicon(a,dt).
lexicon(man,nn).    lexicon(boy,nn).
lexicon(limp,nn).   lexicon(telescope,nn).
lexicon(john,nnp).
lexicon(saw,vbd).   lexicon(runs,vbz).
lexicon(with,in).

grammar0.pl (continued)

nonT(ss). nonT(s). nonT(np). nonT(vp). nonT(pp).
term(nnp). term(nn).
term(vbd). term(vbz).
term(in).  term(dt).
term($).
start(ss).
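
Once grammar0.pl is consulted, its facts can be queried directly. For example (answers as SWI-Prolog would print them for the facts above):

?- rule(np, RHS).
RHS = [dt, nn] ;
RHS = [nnp] ;
RHS = [np, pp].

?- lexicon(saw, Tag).
Tag = vbd.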

Some useful Prolog primitives:
– tell(Filename): redirect output to Filename
– told: close the file and stop redirecting output
Example:
– tell('machine.pl'), goal, told.
– means: run goal and capture all of its output in a file called machine.pl
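
The same idiom can be packaged as a predicate. This wrapper is only a sketch (save_machine/1 is not one of the course files' predicates):

% Run Goal and capture everything it writes into 'machine.pl'.
save_machine(Goal) :-
    tell('machine.pl'),
    ( call(Goal) -> true ; true ),   % keep going so the file gets closed
    told.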

lr0.pl Example:

lr0.pl

action(State#, CStack, Input, ParseStack, CStack', Input', ParseStack')
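
To make the seven argument positions concrete, here is a hypothetical sketch of one shift clause and one reduce clause in that format. The state numbers, the goto states, and the control-stack layout (category symbols and state numbers pushed in pairs, current state on top and also passed as the first argument) are assumptions for illustration only; the clauses actually generated into machine.pl by lr0.pl may be organized differently.

% Shift: in a state whose kernel is np --> dt . nn, consume the next
% word if it is tagged nn, push it on the parse stack, and push nn plus
% the (assumed) goto state 12 on the control stack.
action(2, CStack, [Word|Input], PStack,
       [12, nn|CStack], Input, [Word|PStack]) :-
    lexicon(Word, nn).

% Reduce: a state that has completed np --> dt nn . pops |RHS| = 2
% symbol/state pairs off the control stack, wraps the two parse-stack
% items as an np, and pushes np with its (assumed) goto state 3.
action(12, [_, _, _, _|CStack], Input, [NN, DT|PStack],
       [3, np|CStack], Input, [np(DT, NN)|PStack]).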

parse.pl

lr0.pl

lr1.pl and parse1.pl: similar code for LR(1) – 1 symbol of lookahead

parse1.pl

Homework 1 Question 1: – How many states are built for the LR(0) and the LR(1) machines?

Homework 1
Question 2:
– Examine the action predicate built by LR(0)
– Assume there is no possible conflict between two shift actions, e.g. shift dt or nnp
– Is grammar0.pl LR(0)? Explain.
Question 3:
– Is grammar0.pl LR(1)? Explain.

Homework 1
Question 4:
– run the sentence: John saw the boy with the telescope
– on both the LR(0) and LR(1) machines
– How many states are visited to parse the sentence completely in the two machines?
– Is the LR(1) machine any more efficient than the LR(0) machine?

Homework 1
Question 5:
– run the sentence: John saw the boy with a limp with the telescope
– on both the LR(0) and LR(1) machines
– How many parses are obtained?
– How many states are visited to parse the sentence completely in the two machines?

Homework 1 Question 6: – Compare these two states in the LR(1) machine: Can we merge these two states? Explain why or why not. How could you test your answer?

Break …

Language Models and N-grams
given a word sequence
– w_1 w_2 w_3 ... w_n
chain rule – how to compute the probability of a sequence of words:
– p(w_1 w_2) = p(w_1) p(w_2|w_1)
– p(w_1 w_2 w_3) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2)
– ...
– p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-2} w_{n-1})
note
– it's not easy to collect (meaningful) statistics on p(w_n|w_1 ... w_{n-2} w_{n-1}) for all possible word sequences
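
In product notation the chain rule above reads (with the i = 1 factor being just p(w_1)):

p(w_1 \dots w_n) = \prod_{i=1}^{n} p(w_i \mid w_1 \dots w_{i-1})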

Language Models and N-grams
given a word sequence
– w_1 w_2 w_3 ... w_n
bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite length history
– 1st order Markov model
– p(w_1 w_2 w_3 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) ... p(w_n|w_1 ... w_{n-2} w_{n-1})
– p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})
note
– p(w_n|w_{n-1}) is a lot easier to collect data for (and thus estimate well) than p(w_n|w_1 ... w_{n-2} w_{n-1})

Language Models and N-grams
trigram approximation
– 2nd order Markov model
– just look at the preceding two words only
– p(w_1 w_2 w_3 w_4 ... w_n) = p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) p(w_4|w_1 w_2 w_3) ... p(w_n|w_1 ... w_{n-2} w_{n-1})
– p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_1 w_2) p(w_4|w_2 w_3) ... p(w_n|w_{n-2} w_{n-1})
note
– p(w_n|w_{n-2} w_{n-1}) is a lot easier to estimate well than p(w_n|w_1 ... w_{n-2} w_{n-1}), but harder than p(w_n|w_{n-1})

Language Models and N-grams
estimating from corpora – how to compute bigram probabilities:
– p(w_n|w_{n-1}) = f(w_{n-1} w_n) / Σ_w f(w_{n-1} w)    (w ranges over any word)
– since Σ_w f(w_{n-1} w) = f(w_{n-1}), the unigram frequency of w_{n-1}:
– p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})    (relative frequency)
note
– the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
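
A minimal Prolog sketch of this relative-frequency (MLE) estimate. The unigram_count/2 and bigram_count/3 facts are not part of the course files; the two example counts are made up:

% made-up example counts
unigram_count(the, 1000).
bigram_count(the, man, 20).

% p(Wn | Wprev) = f(Wprev Wn) / f(Wprev)
mle_bigram_prob(Wprev, Wn, P) :-
    bigram_count(Wprev, Wn, FBigram),
    unigram_count(Wprev, FUnigram),
    FUnigram > 0,
    P is FBigram / FUnigram.

% ?- mle_bigram_prob(the, man, P).
% P = 0.02.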

Motivation for smoothing
Smoothing: avoid zero probability estimates.
Consider what happens when any individual probability component is zero:
– arithmetic multiplication law: 0 × X = 0
– very brittle!
Even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
– particularly so for larger n-grams
p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})

Language Models and N-grams
Example (tables of unigram frequencies, bigram frequencies w_{n-1} × w_n, and bigram probabilities):
– sparse matrix: zeros render the probabilities unusable
– (we'll need to add fudge factors, i.e. do smoothing)

Smoothing and N-grams
a sparse dataset means zeros are a problem:
– zero probabilities are a problem
  p(w_1 w_2 w_3 ... w_n) ≈ p(w_1) p(w_2|w_1) p(w_3|w_2) ... p(w_n|w_{n-1})    (bigram model)
  one zero and the whole product is zero
– zero frequencies are a problem
  p(w_n|w_{n-1}) = f(w_{n-1} w_n) / f(w_{n-1})    (relative frequency)
  the bigram f(w_{n-1} w_n) doesn't exist in the dataset
smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value

Smoothing and N-grams
Add-One smoothing (4.5.1 Laplace Smoothing)
– add 1 to all frequency counts
– simple, and no more zeros (but there are better methods)
unigram
– p(w) = f(w)/N    (before Add-One; N = size of corpus)
– p(w) = (f(w)+1)/(N+V)    (with Add-One; V = number of distinct words in the corpus)
– f*(w) = (f(w)+1)*N/(N+V)    (with Add-One)
– N/(N+V) is a normalization factor adjusting for the effective increase in corpus size caused by Add-One
bigram
– p(w_n|w_{n-1}) = f(w_{n-1} w_n)/f(w_{n-1})    (before Add-One)
– p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)    (after Add-One)
– f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1)*f(w_{n-1})/(f(w_{n-1})+V)    (after Add-One)
– must rescale so that the total probability mass stays at 1
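
A matching Prolog sketch of the Add-One bigram estimate, reusing the hypothetical unigram_count/2 and bigram_count/3 facts from the earlier sketch; unseen bigrams simply default to a count of 0:

% p(Wn | Wprev) = (f(Wprev Wn) + 1) / (f(Wprev) + V),  V = vocabulary size
addone_bigram_prob(Wprev, Wn, V, P) :-
    ( bigram_count(Wprev, Wn, FBigram) -> true ; FBigram = 0 ),
    ( unigram_count(Wprev, FUnigram)   -> true ; FUnigram = 0 ),
    P is (FBigram + 1) / (FUnigram + V).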

Smoothing and N-grams
Add-One smoothing – add 1 to all frequency counts
bigram
– p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)
– f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1)*f(w_{n-1})/(f(w_{n-1})+V)
(adjusted frequency tables: textbook figures 6.4 and 6.8)
Remarks: the perturbation problem
– Add-One causes large changes in some frequencies due to the relative size of V (1616)
– e.g. the count for "want to" goes from 786 to 338

Smoothing and N-grams
Add-One smoothing – add 1 to all frequency counts
bigram
– p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)
– f*(w_{n-1} w_n) = (f(w_{n-1} w_n)+1)*f(w_{n-1})/(f(w_{n-1})+V)
(probability tables: textbook figures 6.5 and 6.7)
Remarks: the perturbation problem – similar large changes in the probabilities

Smoothing and N-grams
let's illustrate the problem – take the bigram case:
– w_{n-1} w_n
– p(w_n|w_{n-1}) = f(w_{n-1} w_n)/f(w_{n-1})
– suppose there are bigrams w_{n-1} w0_1, ..., w_{n-1} w0_m that don't occur in the corpus:
  f(w_{n-1} w0_1) = 0, ..., f(w_{n-1} w0_m) = 0
(diagram: the probability mass f(w_{n-1}) is divided among the seen bigram counts f(w_{n-1} w_n), with nothing left for the unseen bigrams)

Smoothing and N-grams
add-one – "give everyone 1"
(diagram: each seen bigram count becomes f(w_{n-1} w_n)+1, and each unseen bigram gets f(w_{n-1} w0_1) = 1, ..., f(w_{n-1} w0_m) = 1)

Smoothing and N-grams
add-one – "give everyone 1"
– V = |{w_i}|, the vocabulary size
– redistribution of probability mass:
  p(w_n|w_{n-1}) = (f(w_{n-1} w_n)+1)/(f(w_{n-1})+V)

Smoothing and N-grams
Good-Turing discounting (4.5.2)
– N_c = number of things (= n-grams) that occur c times in the corpus
– N = total number of things seen
– formula: the smoothed count c* for things occurring c times is given by c* = (c+1)*N_{c+1}/N_c
– idea: use the frequency of things seen once to estimate the frequency of things we haven't seen yet; estimate N_0 in terms of N_1, and so on
– but if N_c = 0, smooth that first using something like log(N_c) = a + b*log(c)
– formula: P*(things with zero frequency) = N_1/N
– smaller impact than Add-One
Textbook example:
– fishing in a lake with 8 species: bass, carp, catfish, eel, perch, salmon, trout, whitefish
– sample data (6 out of the 8 species seen): 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
– P(unseen new fish, i.e. bass or catfish) = N_1/N = 3/18 = 0.17
– P(next fish = trout) = 1/18 (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula…)
– revised count for trout: c*(trout) = 2*N_2/N_1 = 2*(1/3) = 0.67 (discounted from 1)
– revised P(next fish = trout) = 0.67/18 = 0.037
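
A small Prolog sketch of the Good-Turing reassignment, using the fishing counts above as facts. The nc/2 predicate name is chosen for this sketch, and there is no regression fall-back: a missing N_{c+1} is simply treated as 0.

% N_c for the sample: 3 species seen once, 1 seen twice,
% 1 seen three times, 1 seen ten times.
nc(1, 3).  nc(2, 1).  nc(3, 1).  nc(10, 1).

% c* = (c+1) * N_{c+1} / N_c
gt_count(C, CStar) :-
    nc(C, Nc),
    C1 is C + 1,
    ( nc(C1, Nc1) -> true ; Nc1 = 0 ),
    CStar is C1 * Nc1 / Nc.

% ?- gt_count(1, CStar).
% CStar = 0.6666666666666666.   % the revised count for trout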

Language Models and N-grams
N-gram models
– data is easy to obtain: any unlabeled corpus will do
– they're technically easy to compute: count frequencies and apply the smoothing formula
– but just how good are these n-gram language models?
– and what can they show us about language?

Language Models and N-grams
approximating Shakespeare
– generate random sentences using n-grams
– corpus: Complete Works of Shakespeare
– unigram (pick random, unconnected words); bigram (example sentences shown on slide)
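
A rough sketch of how such a random-sentence generator can work (this is not the program behind the Shakespeare examples): pick each next word among the attested continuations of the previous word. The bigram_count/3 facts and the '<s>' / '</s>' sentence markers are assumptions carried over from the earlier sketches; a faithful generator would sample continuations in proportion to their bigram probabilities rather than uniformly.

:- use_module(library(random)).

% generate(+Prev, -Words): extend the sentence one word at a time,
% choosing uniformly among words ever seen after Prev in the corpus.
generate(Prev, [Word|Rest]) :-
    findall(W, bigram_count(Prev, W, _), Continuations),
    Continuations \== [],
    random_member(Word, Continuations),
    (   Word == '</s>'
    ->  Rest = []
    ;   generate(Word, Rest)
    ).

% ?- generate('<s>', Sentence).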

Language Models and N-grams
approximating Shakespeare
– generate random sentences using n-grams
– corpus: Complete Works of Shakespeare
– trigram; quadrigram (example sentences shown on slide)
Remarks: the dataset size problem
– the training set is small: 884,647 words, 29,066 different words
– 29,066^2 = 844,832,356 possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, so the program can't be very innovative for higher n

Language Models and N-grams
a limitation:
– produces ungrammatical sequences
Treebank:
– potential to be a better language model
– structural information: contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to "English"…

Language Models and N-grams Aside:

Language Models and N-grams
N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability

Colorless green ideas
examples
– (1) colorless green ideas sleep furiously
– (2) furiously sleep ideas green colorless
Chomsky (1957):
– "... It is fair to assume that neither sentence (1) nor (2) (nor indeed any part of these sentences) has ever occurred in an English discourse. Hence, in any statistical model for grammaticalness, these sentences will be ruled out on identical grounds as equally 'remote' from English. Yet (1), though nonsensical, is grammatical, while (2) is not."
idea
– (1) is syntactically valid, (2) is word salad
Statistical experiment (Pereira 2002)

Colorless green ideas
examples
– (1) colorless green ideas sleep furiously
– (2) furiously sleep ideas green colorless
Statistical experiment (Pereira 2002)
– a bigram language model over p(w_i | w_{i-1}) (model equation and results shown on slide)

Interesting things to Google
example
– colorless green ideas sleep furiously
second hit (shown on slide)

Interesting things to Google
example
– colorless green ideas sleep furiously
first hit – compositional semantics:
– a green idea is, according to well-established usage of the word "green", one that is new and untried.
– again, a colorless idea is one without vividness, dull and unexciting.
– so it follows that a colorless green idea is a new, untried idea that is without vividness, dull and unexciting.
– to sleep is, among other things, to be in a state of dormancy or inactivity, or in a state of unconsciousness.
– to sleep furiously may seem a puzzling turn of phrase, but one reflects that the mind in sleep often indeed moves furiously, with ideas and images flickering in and out.