600.465 - Intro to NLP - J. Eisner: Smoothing

Presentation transcript:

Intro to NLP - J. Eisner 1  Smoothing

Intro to NLP - J. Eisner 2  Parameter Estimation
p(x1 = h, x2 = o, x3 = r, x4 = s, x5 = e, x6 = s, …)
  ≈ p(h | BOS, BOS) * p(o | BOS, h) * p(r | h, o) * p(s | o, r) * p(e | r, s) * p(s | s, e) * …
These conditional probabilities are the trigram model's parameters; the values of those parameters, as naively estimated from the Brown corpus, are count ratios: 4470/…, …, …/21250.
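To make the factorization concrete, here is a minimal Python sketch of a letter-trigram model estimated by raw relative frequencies (no smoothing yet). The toy training string, function names, and resulting numbers are invented for illustration and will not match the Brown-corpus counts above.

```python
from collections import defaultdict

def train_trigrams(chars):
    """Count letter trigrams and their bigram histories, with BOS padding."""
    trigram_counts = defaultdict(int)   # count of (x, y, z)
    history_counts = defaultdict(int)   # count of history (x, y)
    padded = ["BOS", "BOS"] + list(chars)
    for x, y, z in zip(padded, padded[1:], padded[2:]):
        trigram_counts[(x, y, z)] += 1
        history_counts[(x, y)] += 1
    return trigram_counts, history_counts

def naive_trigram_prob(word, trigram_counts, history_counts):
    """Unsmoothed estimate: product of count(xyz) / count(xy) over the word."""
    padded = ["BOS", "BOS"] + list(word)
    p = 1.0
    for x, y, z in zip(padded, padded[1:], padded[2:]):
        history = history_counts[(x, y)]
        if history == 0:
            return 0.0                  # unseen history: the naive estimate is 0
        p *= trigram_counts[(x, y, z)] / history
    return p

tri, hist = train_trigrams("horses horseshoe shoehorn")   # toy "corpus"
print(naive_trigram_prob("horses", tri, hist))
```

The zero returned for any sequence containing an unseen trigram is exactly the problem the following slides address.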

Intro to NLP - J. Eisner 3  How to Estimate?
- p(z | xy) = ?
- Suppose our training data includes … xya … … xyd … … xyd … but never xyz.
- Should we conclude that p(a | xy) = 1/3?  p(d | xy) = 2/3?  p(z | xy) = 0/3?
- NO! The absence of xyz might just be bad luck.

Intro to NLP - J. Eisner 4  Smoothing the Estimates
- Should we conclude p(a | xy) = 1/3? (reduce this)  p(d | xy) = 2/3? (reduce this)  p(z | xy) = 0/3? (increase this)
- Discount the positive counts somewhat.
- Reallocate that probability to the zeroes.
- Especially if the denominator is small: 1/3 is probably too high, 100/300 is probably about right.
- Especially if the numerator is small: 1/300 is probably too high, 100/300 is probably about right.

Intro to NLP - J. Eisner 5  Add-One Smoothing
            count   unsmoothed   count+1   add-one estimate
  xya         1        1/3          2          2/29
  xyb         0        0/3          1          1/29
  xyc         0        0/3          1          1/29
  xyd         2        2/3          3          3/29
  xye         0        0/3          1          1/29
  …
  xyz         0        0/3          1          1/29
  Total xy    3        3/3         29         29/29

Intro to NLP - J. Eisner 6  Add-One Smoothing  (300 observations instead of 3: better data, less smoothing)
            count   unsmoothed   count+1   add-one estimate
  xya       100      100/300      101        101/326
  xyb         0        0/300        1          1/326
  xyc         0        0/300        1          1/326
  xyd       200      200/300      201        201/326
  xye         0        0/300        1          1/326
  …
  xyz         0        0/300        1          1/326
  Total xy  300      300/300      326        326/326
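A minimal Python sketch of add-one (Laplace) smoothing that reproduces the fractions in the two tables above. The 26-letter vocabulary and the counts for context xy are taken from the tables; Fraction is used only so the output matches the slides' fractions exactly.

```python
import string
from fractions import Fraction

def add_one(counts, vocab):
    """Add-one smoothing: p(z | xy) = (count(xyz) + 1) / (count(xy) + V)."""
    total = sum(counts.values())
    V = len(vocab)
    return {z: Fraction(counts.get(z, 0) + 1, total + V) for z in vocab}

vocab = string.ascii_lowercase                # 26 letter types
probs = add_one({"a": 1, "d": 2}, vocab)      # the 3-observation table
print(probs["a"], probs["d"], probs["z"])     # 2/29 3/29 1/29

big = add_one({"a": 100, "d": 200}, vocab)    # the 300-observation table
print(big["a"], big["d"], big["z"])           # 101/326 201/326 1/326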

Intro to NLP - J. Eisner 7  Add-One Smoothing  (same table as on slide 5)  Now suppose we're dealing with 20,000 word types, not 26 letters.

Intro to NLP - J. Eisner 8  Add-One Smoothing  As we see more word types, smoothed estimates keep falling.
                    count   unsmoothed   count+1   add-one estimate
  see the abacus      1        1/3          2          2/20003
  see the abbot       0        0/3          1          1/20003
  see the abduct      0        0/3          1          1/20003
  see the above       2        2/3          3          3/20003
  see the Abram       0        0/3          1          1/20003
  …
  see the zygote      0        0/3          1          1/20003
  Total               3        3/3      20003      20003/20003

Intro to NLP - J. Eisner 9  Add-Lambda Smoothing
- Suppose we're dealing with a vocab of 20,000 words.
- As we get more and more training data, we see more and more words that need probability, and the probabilities of existing words keep dropping instead of converging.
- This can't be right: eventually they drop too low.
- So instead of adding 1 to all counts, add λ = 0.01.
- This gives much less probability to those extra events.
- But how to pick the best value for λ (for the size of our training corpus)?
- Try lots of values on a simulated test set: "held-out data".
- Or even better, 10-fold cross-validation (a.k.a. "jackknifing"):
  - Divide the data into 10 subsets.
  - To evaluate a given λ, measure performance on each subset when the other 9 are used for training.
  - Average performance over the 10 subsets tells us how good that λ is.
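A sketch of add-λ smoothing with λ chosen by held-out log-likelihood. A single held-out set is used here as a simplified stand-in for the 10-fold cross-validation described above, and for simplicity it smooths a unigram model rather than a per-context trigram model; the toy sentences, candidate λ values, and assumed vocabulary size of 20,000 are invented.

```python
import math
from collections import Counter

def add_lambda_prob(z, counts, vocab_size, lam):
    """Add-lambda smoothing: p(z) = (count(z) + lam) / (N + lam * V)."""
    N = sum(counts.values())
    return (counts.get(z, 0) + lam) / (N + lam * vocab_size)

def heldout_log_likelihood(train_tokens, heldout_tokens, vocab_size, lam):
    counts = Counter(train_tokens)
    return sum(math.log(add_lambda_prob(z, counts, vocab_size, lam))
               for z in heldout_tokens)

train = "the cat sat on the mat the cat slept".split()      # toy training data
heldout = "the dog sat on the mat".split()                   # toy held-out data
V = 20000                                                    # assumed vocab size

best_lam = max([1.0, 0.1, 0.01, 0.001],
               key=lambda lam: heldout_log_likelihood(train, heldout, V, lam))
print("best lambda:", best_lam)
```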

Intro to NLP - J. Eisner 10  Terminology
- Word type = distinct vocabulary item.
- Word token = occurrence of that type.
- A dictionary is a list of types (once each).
- A corpus is a list of tokens (each type has many tokens).
- Example: counts after context xy are a 100, b 0, c 0, d 200, e 0, …, z 0 (Total xy = 300). That is 26 types and 300 tokens: 100 tokens of type a, 200 tokens of type d, and 0 tokens of each other type.
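A quick illustration of the type/token distinction in Python; the example sentence and counts are invented.

```python
from collections import Counter

tokens = "to be or not to be".split()        # 6 tokens
type_counts = Counter(tokens)                # 4 types: 'to', 'be', 'or', 'not'
print(len(tokens), "tokens,", len(type_counts), "types")
print(type_counts)                           # Counter({'to': 2, 'be': 2, 'or': 1, 'not': 1})
```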

Intro to NLP - J. Eisner 11  Always treat zeroes the same?  a 150 0, both 18 0, candy 0 1, donuts 0 2, every 50, versus 0, farina 0 0, grapes 0 1, his 38 0, ice cream 0 7, …  (… types, 300 tokens)  0/300: which zero would you expect is really rare?

Intro to NLP - J. Eisner 12  Always treat zeroes the same?  (same counts as on the previous slide)  Determiners: a closed class.

Intro to NLP - J. Eisner 13  Always treat zeroes the same?  (same counts as above)  (Food) nouns: an open class.

Intro to NLP - J. Eisner 14  Good-Turing Smoothing
- Intuition: we can judge the rate of novel events by the rate of singletons.
- Let N_r = # of word types with r training tokens (e.g., N_0 = number of unobserved words, N_1 = number of singletons).
- Let N = Σ_r r·N_r = total # of training tokens.

Intro to NLP - J. Eisner 15  Good-Turing Smoothing
- Let N_r = # of word types with r training tokens, and N = Σ_r r·N_r = total # of training tokens.
- Naïve estimate: if x has r tokens, p(x) = ?  Answer: r/N.
- Total naïve probability of all word types with r tokens?  Answer: N_r·r / N.
- Good-Turing estimate of this total probability: defined as N_{r+1}·(r+1) / N.
- So the proportion of novel words in test data is estimated by the proportion of singletons in training data.
- The proportion in test data of the N_1 singletons is estimated by the proportion of the N_2 doubletons in training data. Etc.
- So what is the Good-Turing estimate of p(x)?
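A sketch of the plain Good-Turing calculation. The slide leaves the per-word estimate as a question; the usual answer, assumed here, is p(x) = (r+1)·N_{r+1} / (N_r·N) for a word x seen r times, with total mass N_1/N reserved for novel words. Real implementations first smooth the N_r counts (e.g., Simple Good-Turing), which this toy version skips by falling back to the naïve estimate when N_{r+1} = 0; the token list is invented.

```python
from collections import Counter

def good_turing(tokens):
    """Plain (unsmoothed) Good-Turing estimates over a list of training tokens."""
    counts = Counter(tokens)
    N = sum(counts.values())              # total training tokens
    Nr = Counter(counts.values())         # N_r = number of types seen exactly r times
    probs = {}
    for word, r in counts.items():
        if Nr[r + 1] > 0:
            probs[word] = (r + 1) * Nr[r + 1] / (Nr[r] * N)
        else:
            probs[word] = r / N           # no (r+1)-count types: fall back to naive r/N
    novel_mass = Nr[1] / N                # total probability reserved for unseen words
    return probs, novel_mass

probs, novel_mass = good_turing("a a a b b c d e f g".split())
print(novel_mass)        # 5 singletons / 10 tokens = 0.5 held for novel words
print(probs["c"])        # a singleton: 2 * N_2 / (N_1 * N) = 2*1/(5*10) = 0.04
```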

Intro to NLP - J. Eisner 16  Use the backoff, Luke!
- Why are we treating all novel events as the same?
- p(zygote | see the) vs. p(baby | see the): suppose both trigrams have zero count.
- baby beats zygote as a unigram; the baby beats the zygote as a bigram; so see the baby should beat see the zygote?
- As always for backoff: the lower-order probabilities (unigram, bigram) aren't quite what we want, but we do have enough data to estimate them, and they're better than nothing.

Intro to NLP - J. Eisner 17  Smoothing + backoff
- Basic smoothing (e.g., add-λ or Good-Turing): holds out some probability mass for novel events (e.g., Good-Turing gives them total mass N_1/N), divided up evenly among the novel events.
- Backoff smoothing: holds out the same amount of probability mass for novel events, but divides it up unevenly, in proportion to the backoff probability.
- For p(z | xy): the novel events are types z that were never observed after xy; the backoff probability for p(z | xy) is p(z | y), which in turn backs off to p(z)!
- Note: how much mass should we hold out for novel events in context xy? That depends on whether the position following xy is an open class. Usually there's not enough data to tell, though, so aggregate with other contexts (all contexts? similar contexts?).
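A simplified sketch of backoff in the spirit described above, not any particular published scheme (e.g., Katz). The fixed held_out_mass of 0.2 is a placeholder for a Good-Turing-style estimate of the novel-event mass, and the toy counts and unigram backoff distribution are invented; a real model would back off from p(z | xy) to a smoothed p(z | y) and then to p(z).

```python
from collections import Counter, defaultdict

def backoff_prob(z, xy, trigram_counts, backoff_dist, held_out_mass=0.2):
    """p(z | xy): discounted relative frequency for seen z, with the held-out mass
    divided among unseen z in proportion to the backoff distribution."""
    seen = trigram_counts[xy]                    # Counter of z's observed after xy
    total = sum(seen.values())
    if total == 0:
        return backoff_dist(z)                   # context xy never seen: back off fully
    if seen[z] > 0:
        return (1 - held_out_mass) * seen[z] / total
    unseen_backoff_mass = 1.0 - sum(backoff_dist(w) for w in seen)
    return held_out_mass * backoff_dist(z) / unseen_backoff_mass

# Toy usage with a unigram distribution as the backoff distribution.
trigrams = defaultdict(Counter)
for x, y, z in [("see", "the", "baby"), ("see", "the", "baby"), ("see", "the", "dog")]:
    trigrams[(x, y)][z] += 1
unigrams = Counter(["baby", "dog", "zygote", "cat"])

def unigram_prob(w):
    return unigrams[w] / sum(unigrams.values())

print(backoff_prob("baby", ("see", "the"), trigrams, unigram_prob))    # seen, discounted
print(backoff_prob("zygote", ("see", "the"), trigrams, unigram_prob))  # unseen, backed off
```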

Intro to NLP - J. Eisner 18  Deleted Interpolation
- We can do even simpler stuff: estimate p(z | xy) as a weighted average of the naïve MLE estimates of p(z | xy), p(z | y), and p(z).
- The weights can depend on the context xy: if a lot of data are available for the context, trust p(z | xy) more, since it is well observed; if there are not many singletons in the context, trust p(z | xy) more, since the context looks closed-class.
- Learn the weights on held-out data with jackknifing.
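A sketch of the interpolated estimate. The weights below are fixed placeholders; as the slide notes, deleted interpolation would make them depend on the context xy and learn them on held-out data (e.g., by jackknifing), which this toy version omits. The toy token stream and function names are invented.

```python
from collections import Counter, defaultdict

def interpolated_prob(z, x, y, tri, bi, uni, weights=(0.6, 0.3, 0.1)):
    """p(z | xy) = w3*p_MLE(z | xy) + w2*p_MLE(z | y) + w1*p_MLE(z)."""
    w3, w2, w1 = weights
    p_tri = tri[(x, y)][z] / sum(tri[(x, y)].values()) if tri[(x, y)] else 0.0
    p_bi = bi[y][z] / sum(bi[y].values()) if bi[y] else 0.0
    p_uni = uni[z] / sum(uni.values())
    return w3 * p_tri + w2 * p_bi + w1 * p_uni

# Train the three naive MLE models from a toy token stream.
tokens = "see the baby see the dog hug the baby".split()
tri, bi, uni = defaultdict(Counter), defaultdict(Counter), Counter(tokens)
for x, y, z in zip(tokens, tokens[1:], tokens[2:]):
    tri[(x, y)][z] += 1
for y, z in zip(tokens, tokens[1:]):
    bi[y][z] += 1

print(interpolated_prob("baby", "see", "the", tri, bi, uni))
# "dog" was never seen after "hug the", but the bigram and unigram terms keep it nonzero:
print(interpolated_prob("dog", "hug", "the", tri, bi, uni))
```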