LING/C SC 581: Advanced Computational Linguistics Lecture Notes Jan 22nd


Today's Topics
– Minimum Edit Distance homework
– Corpora: frequency information
– tregex

Minimum Edit Distance Homework
Background:
– … about 20% of the time "Britney Spears" is misspelled when people search for it on Google
Software for generating misspellings:
– If a person running a Britney Spears web site wants to get the maximum exposure, it would be in their best interests to include at least a few misspellings.

Minimum Edit Distance Homework
Top six misspellings
Design a minimum edit distance algorithm that ranks these misspellings (as accurately as possible):
– e.g. ED(brittany) < ED(britany)
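A minimal sketch of a weighted minimum edit distance in Python, as a starting point for the homework; the per-operation costs are illustrative placeholders, and only the two misspellings named on the slide are used here (substitute the full top-six list yourself).

```python
def min_edit_distance(source, target, ins_cost=1, del_cost=1, sub_cost=2):
    """Dynamic-programming (Wagner-Fischer) edit distance with tunable costs."""
    n, m = len(source), len(target)
    # d[i][j] = cost of transforming source[:i] into target[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i * del_cost
    for j in range(1, m + 1):
        d[0][j] = j * ins_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if source[i - 1] == target[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + del_cost,      # deletion
                          d[i][j - 1] + ins_cost,      # insertion
                          d[i - 1][j - 1] + sub)       # substitution (or match)
    return d[n][m]

# Score candidate misspellings against the correct spelling; tune the costs
# so the resulting ranking matches the homework's target ordering.
correct = "britney"
for cand in ["brittany", "britany"]:
    print(cand, min_edit_distance(cand, correct))
```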

Minimum Edit Distance Homework
Submit your homework in PDF:
– how many you got right
– explain your criteria, e.g. the weights chosen
You should submit your modified Excel spreadsheet or code (e.g. Python, Perl, Java) as well.
Due to me before next Thursday's class…
– put your name and 581 at the top of your submission

Part 2 Corpora: frequency information
Unlabeled corpus: just words (easy to find)
Labeled corpus: various kinds, progressively harder to create or obtain …
– POS information
– Information about phrases
– Word sense or semantic role labeling

Language Models and N-grams
given a word sequence
– w1 w2 w3 ... wn
chain rule
– how to compute the probability of a sequence of words
– p(w1 w2) = p(w1) p(w2|w1)
– p(w1 w2 w3) = p(w1) p(w2|w1) p(w3|w1 w2)
– ...
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-2 wn-1)
note
– it's not easy to collect (meaningful) statistics on p(wn|w1 ... wn-2 wn-1) for all possible word sequences

Language Models and N-grams
Given a word sequence
– w1 w2 w3 ... wn
Bigram approximation
– just look at the previous word only (not all the preceding words)
– Markov assumption: finite length history
– 1st order Markov model
– p(w1 w2 w3 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) ... p(wn|w1 ... wn-3 wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)
note
– p(wn|wn-1) is a lot easier to collect data for (and thus estimate well) than p(wn|w1 ... wn-2 wn-1)
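A minimal sketch of the bigram approximation in Python; the unigram and bigram probabilities below are invented purely for illustration (in practice they come from corpus counts, as on the following slides).

```python
# Toy probabilities, invented for illustration.
unigram_p = {"i": 0.1, "want": 0.05, "food": 0.02}
bigram_p = {("i", "want"): 0.33, ("want", "food"): 0.05}

def sentence_prob_bigram(words):
    """p(w1 ... wn) ~= p(w1) * p(w2|w1) * ... * p(wn|wn-1) under the 1st order Markov assumption."""
    prob = unigram_p.get(words[0], 0.0)
    for prev, cur in zip(words, words[1:]):
        prob *= bigram_p.get((prev, cur), 0.0)  # an unseen bigram zeroes the whole product (see smoothing)
    return prob

print(sentence_prob_bigram(["i", "want", "food"]))  # 0.1 * 0.33 * 0.05
```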

Language Models and N-grams
Trigram approximation
– 2nd order Markov model
– just look at the preceding two words only
– p(w1 w2 w3 w4 ... wn) = p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w1 w2 w3) ... p(wn|w1 ... wn-3 wn-2 wn-1)
– p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w1 w2) p(w4|w2 w3) ... p(wn|wn-2 wn-1)
note
– p(wn|wn-2 wn-1) is a lot easier to estimate well than p(wn|w1 ... wn-2 wn-1), but harder than p(wn|wn-1)

Language Models and N-grams
estimating from corpora
– how to compute bigram probabilities
– p(wn|wn-1) = f(wn-1 wn) / Σw f(wn-1 w), where w ranges over all words
– since Σw f(wn-1 w) = f(wn-1), the unigram frequency for wn-1:
– p(wn|wn-1) = f(wn-1 wn) / f(wn-1) (relative frequency)
Note:
– the technique of estimating (true) probabilities using a relative frequency measure over a training corpus is known as maximum likelihood estimation (MLE)
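A minimal sketch of MLE bigram estimation by relative frequency; the toy corpus string is made up for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()

unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))

def p_mle(w, w_prev):
    """MLE estimate: p(w | w_prev) = f(w_prev w) / f(w_prev)."""
    if unigram_f[w_prev] == 0:
        return 0.0
    return bigram_f[(w_prev, w)] / unigram_f[w_prev]

print(p_mle("cat", "the"))  # f(the cat) / f(the) = 2/3
print(p_mle("dog", "the"))  # unseen bigram -> 0.0, which smoothing will address
```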

Motivation for smoothing
Smoothing: avoid zero probability estimates
Consider what happens when any individual probability component is zero:
– arithmetic multiplication law: 0 × X = 0
– very brittle!
even in a very large corpus, many possible n-grams over the vocabulary space will have zero frequency
– particularly so for larger n-grams
p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1)

Language Models and N-grams
Example: unigram frequencies, bigram frequencies, and bigram probabilities (tables shown on the slide)
– sparse matrix: zeros render probabilities unusable (we'll need to add fudge factors, i.e. do smoothing)

Smoothing and N-grams
sparse dataset means zeros are a problem
– zero probabilities are a problem
  p(w1 w2 w3 ... wn) ≈ p(w1) p(w2|w1) p(w3|w2) ... p(wn|wn-1) (bigram model)
  one zero and the whole product is zero
– zero frequencies are a problem
  p(wn|wn-1) = f(wn-1 wn) / f(wn-1) (relative frequency)
  the bigram f(wn-1 wn) doesn't exist in the dataset
smoothing
– refers to ways of assigning zero-probability n-grams a non-zero value

Smoothing and N-grams
Add-One Smoothing (4.5.1 Laplace Smoothing)
– add 1 to all frequency counts
– simple and no more zeros (but there are better methods)
unigram
– p(w) = f(w)/N (before Add-One), where N = size of corpus
– p(w) = (f(w)+1)/(N+V) (with Add-One), where V = number of distinct words in corpus
– f*(w) = (f(w)+1) * N/(N+V) (with Add-One)
  N/(N+V) is a normalization factor adjusting for the effective increase in the corpus size caused by Add-One
bigram
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1) (before Add-One)
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V) (after Add-One)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V) (after Add-One)
  must rescale so that the total probability mass stays at 1
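A minimal sketch of add-one (Laplace) smoothing for bigrams, reusing the same toy corpus as the MLE sketch above; V is taken as the number of distinct words in that corpus.

```python
from collections import Counter

corpus = "the cat sat on the mat the cat ate".split()
unigram_f = Counter(corpus)
bigram_f = Counter(zip(corpus, corpus[1:]))
V = len(unigram_f)  # number of distinct words in the corpus

def p_add_one(w, w_prev):
    """Add-one smoothed bigram probability: (f(w_prev w) + 1) / (f(w_prev) + V)."""
    return (bigram_f[(w_prev, w)] + 1) / (unigram_f[w_prev] + V)

def adjusted_count(w, w_prev):
    """Reconstituted count f*: (f(w_prev w) + 1) * f(w_prev) / (f(w_prev) + V)."""
    return (bigram_f[(w_prev, w)] + 1) * unigram_f[w_prev] / (unigram_f[w_prev] + V)

print(p_add_one("dog", "the"))       # unseen bigram now gets a small non-zero probability
print(adjusted_count("cat", "the"))  # seen count 2 is discounted to pay for the unseen mass
```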

Smoothing and N-grams
Add-One Smoothing
– add 1 to all frequency counts
bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
frequencies (textbook figures 6.4 and 6.8)
Remarks: perturbation problem
– add-one causes large changes in some frequencies due to the relative size of V (1616)
– e.g. f(want to): 786 → 338

Smoothing and N-grams
Add-One Smoothing
– add 1 to all frequency counts
bigram
– p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
– f*(wn-1 wn) = (f(wn-1 wn)+1) * f(wn-1)/(f(wn-1)+V)
probabilities (textbook figures 6.5 and 6.7)
Remarks: perturbation problem
– similar large changes in the probabilities

Smoothing and N-grams
let's illustrate the problem
– take the bigram case: wn-1 wn
– p(wn|wn-1) = f(wn-1 wn)/f(wn-1)
– suppose there are cases wn-1 wzero1 ... wn-1 wzerom that don't occur in the corpus
(diagram: the probability mass f(wn-1) is divided among the seen bigram counts f(wn-1 wn); the unseen bigrams with f(wn-1 wzero1) = ... = f(wn-1 wzerom) = 0 get none)

Smoothing and N-grams
add-one: "give everyone 1"
(diagram: each seen bigram count f(wn-1 wn) becomes f(wn-1 wn)+1, and each unseen bigram count f(wn-1 wzeroi) becomes 1)

Smoothing and N-grams
add-one: "give everyone 1"
– V = |{wi}|, the vocabulary size
– redistribution of probability mass: p(wn|wn-1) = (f(wn-1 wn)+1)/(f(wn-1)+V)
(diagram: the denominator grows from f(wn-1) to f(wn-1)+V, so probability mass is shifted from seen to unseen bigrams)

Smoothing and N-grams
Good-Turing Discounting (4.5.2)
– Nc = number of things (= n-grams) that occur c times in the corpus
– N = total number of things seen
– Formula: smoothed count c* for count c given by c* = (c+1)Nc+1/Nc
– Idea: use the frequency of things seen once to estimate the frequency of things we haven't seen yet, i.e. estimate N0 in terms of N1, and so on
– but if Nc = 0, smooth that first using something like log(Nc) = a + b log(c)
– Formula: P*(things with zero frequency) = N1/N
– smaller impact than Add-One
Textbook Example:
– Fishing in a lake with 8 species: bass, carp, catfish, eel, perch, salmon, trout, whitefish
– Sample data (6 out of 8 species): 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
– P(unseen new fish, i.e. bass or catfish) = N1/N = 3/18 = 0.17
– P(next fish = trout) = 1/18 (but we have reassigned probability mass, so we need to recalculate this from the smoothing formula…)
– revised count for trout: c*(trout) = 2 * N2/N1 = 2(1/3) = 0.67 (discounted from 1)
– revised P(next fish = trout) = 0.67/18 = 0.037
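A minimal sketch that reproduces the fishing example's Good-Turing numbers from the raw counts given on the slide.

```python
from collections import Counter

# Observed sample: 10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel
counts = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
N = sum(counts.values())       # 18 fish seen in total
Nc = Counter(counts.values())  # Nc[c] = number of species seen exactly c times

# Probability mass reserved for unseen species (bass, catfish): N1 / N
p_unseen_total = Nc[1] / N     # 3/18 ~= 0.17

def good_turing_count(c):
    """Discounted count c* = (c+1) * N_{c+1} / N_c (only valid where N_c and N_{c+1} > 0)."""
    return (c + 1) * Nc[c + 1] / Nc[c]

c_star_trout = good_turing_count(1)                     # 2 * N2/N1 = 2 * (1/3) ~= 0.67
print(p_unseen_total, c_star_trout, c_star_trout / N)   # 0.166..., 0.666..., 0.037
```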

Language Models and N-grams
N-gram models + smoothing
– one consequence of smoothing is that every possible concatenation or sequence of words has a non-zero probability
– N-gram models can also incorporate word classes, e.g. POS labels, when available

Language Models and N-grams
N-gram models
– data is easy to obtain: any unlabeled corpus will do
– they're technically easy to compute: count frequencies and apply the smoothing formula
– but just how good are these n-gram language models?
– and what can they show us about language?

Language Models and N-grams
approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
– Unigram (pick random, unconnected words) and Bigram examples shown on the slide

Language Models and N-grams
Approximating Shakespeare
– generate random sentences using n-grams
– Corpus: Complete Works of Shakespeare
– Trigram and Quadrigram examples shown on the slide
Remarks: dataset size problem
– the training set is small: 884,647 words, 29,066 different words
– 29,066² = 844,832,356 possible bigrams
– for the random sentence generator, this means very limited choices for possible continuations, which means the program can't be very innovative for higher n
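A minimal sketch of the kind of random sentence generator behind these examples: sample the next word in proportion to MLE bigram counts over a tokenized corpus. The short token list here is a placeholder standing in for the real Shakespeare corpus.

```python
import random
from collections import Counter, defaultdict

# Placeholder for a real tokenized corpus (e.g. the Complete Works of Shakespeare).
tokens = "to be or not to be that is the question".split()

# Collect bigram continuation counts: next_counts[w][w'] = f(w w')
next_counts = defaultdict(Counter)
for prev, cur in zip(tokens, tokens[1:]):
    next_counts[prev][cur] += 1

def generate(start, length=10):
    """Generate a word sequence by repeatedly sampling w' with probability f(w w') / f(w)."""
    words = [start]
    for _ in range(length - 1):
        choices = next_counts[words[-1]]
        if not choices:  # dead end: no observed continuation
            break
        ws, fs = zip(*choices.items())
        words.append(random.choices(ws, weights=fs)[0])
    return " ".join(words)

print(generate("to"))
```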

Language Models and N-grams
A limitation:
– produces ungrammatical sequences
Treebank:
– potential to be a better language model
– structural information: contains frequency information about syntactic rules
– we should be able to generate sequences that are closer to English …

Language Models and N-grams Aside:

Part 3 tregex
I assume everyone has:
1. Installed Penn Treebank v3
2. Downloaded and installed tregex

Trees in the Penn Treebank
– Notation: LISP S-expression
– Directory: TREEBANK_3/parsed/mrg/

tregex Search Example: << dominates, < immediately dominates

tregex Help

tregex Help

tregex Help: tregex expression syntax is non-standard wrt bracketing
– examples from the slide: S < VP, S < NP

tregex Help: tregex boolean syntax is also non-standard

tregex Help

tregex Help

tregex Pattern:
– <, $+ (/,/ $+ $+ /,/=comma))) <- =comma)
Key:
– <, first child
– $+ immediate left sister
– <- last child
– =comma refers to the same node in both places

tregex Help

tregex

Different results from: < /^WH.*-([0-9]+)$/#1%index << < (/^-NONE-/ < /^\*T\*-([0-9]+)$/#1%index))

tregex Example: WHADVP also possible (not just WHNP)

Treebank Guides
1. Tagging Guide
2. Arpa94 paper
3. Parse Guide

Treebank Guides
– Parts-of-speech (POS) Tagging Guide: tagguid1.pdf (34 pages)
– tagguid2.pdf: addendum, see POS tag 'TO'

Treebank Guides
– Parsing guide 1: prsguid1.pdf (318 pages)
– prsguid2.pdf: addendum for the Switchboard corpus