LANGUAGE MODELING: SMOOTHING
David Kauchak
CS159 – Fall 2014
some slides adapted from Jason Eisner

Admin
Assignment 2 out:
- bigram language modeling
- Java
- can work with partners (anyone looking for a partner?)
- 2a: due Thursday
- 2b: due Tuesday
- style/commenting (JavaDoc)
Some advice:
- start now!
- spend 1-2 hours working out an example by hand (you can check your answers with me)
- use a HashMap

Admin
Assignment submission: submit on time!
Our first quiz (when?):
- in-class (~30 min.)
- topics: corpus analysis, regular expressions, probability, language modeling
- open book/notes: we'll try it out for this one, but better to assume closed book (30 minutes goes by fast!)
- 7.5% of your grade

Admin
Lab next class: meet in Edmunds 105, 2:45-4pm

Today: smoothing techniques

Today
Take-home ideas:
- The key idea of smoothing is to redistribute probability mass to handle rarely seen (or never seen) events, while still always maintaining a true probability distribution
- There are lots of ways of smoothing data
- Smoothing should take into account features of your data!

Smoothing
What if our test set contains the following sentence, but one of the trigrams never occurred in our training data?

P(I think today is a good day to be me) =
P(I | <s>) x P(think | I) x P(today | I think) x P(is | think today) x P(a | today is) x P(good | is a) x …

If any of these has never been seen before, prob = 0!

Smoothing
P(I think today is a good day to be me) =
P(I | <s>) x P(think | I) x P(today | I think) x P(is | think today) x P(a | today is) x P(good | is a) x …

These probability estimates may be inaccurate. Smoothing can help reduce some of the noise.

The general smoothing problem

                 count   prob   modified prob
see the abacus     1     1/3        ??
see the abbot      0     0/3        ??
see the abduct     0     0/3        ??
see the above      2     2/3        ??
see the Abram      0     0/3        ??
…                  …      …         ??
see the zygote     0     0/3        ??
Total              3     3/3        ??

Add-lambda smoothing
A large dictionary makes novel events too probable.
Add λ = 0.01 to all counts (the denominator 3 + 0.01·V = 203 implies a vocabulary of V = 20,000 words):

                 count   prob   smoothed prob
see the abacus     1     1/3      1.01/203
see the abbot      0     0/3      0.01/203
see the abduct     0     0/3      0.01/203
see the above      2     2/3      2.01/203
see the Abram      0     0/3      0.01/203
…                  …      …       0.01/203
see the zygote     0     0/3      0.01/203
Total              3     3/3       203/203

Add-lambda smoothing

                 count   prob   smoothed prob
see the abacus     1     1/3      1.01/203
see the abbot      0     0/3      0.01/203
see the abduct     0     0/3      0.01/203
see the above      2     2/3      2.01/203
see the Abram      0     0/3      0.01/203
…                  …      …       0.01/203
see the zygote     0     0/3      0.01/203
Total              3     3/3       203/203

How should we pick lambda?
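The computation in the table above can be sketched in Java (the assignment language). This is a minimal illustration, not the assignment's API; the 20,000-word vocabulary size is an assumption chosen so the denominator comes out to 3 + 0.01 × 20,000 = 203, matching the table.

```java
import java.util.HashMap;
import java.util.Map;

// Add-lambda smoothing sketch (hypothetical names, not the assignment's API):
// p(w | context) = (count(context, w) + lambda) / (count(context) + lambda * V)
public class AddLambda {
    public static double prob(Map<String, Integer> counts, String word,
                              int contextCount, double lambda, int vocabSize) {
        int c = counts.getOrDefault(word, 0);   // 0 for unseen words
        return (c + lambda) / (contextCount + lambda * vocabSize);
    }

    public static void main(String[] args) {
        // counts of words following "see the", as in the slide's example
        Map<String, Integer> counts = new HashMap<>();
        counts.put("abacus", 1);
        counts.put("above", 2);
        int total = 3;          // count("see the")
        double lambda = 0.01;
        int v = 20000;          // assumed vocabulary size -> denominator 203
        System.out.println(prob(counts, "abacus", total, lambda, v)); // 1.01/203
        System.out.println(prob(counts, "zygote", total, lambda, v)); // 0.01/203
    }
}
```

Note that unseen words like "zygote" never need a stored entry; their smoothed probability falls out of the same formula with a zero count.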

Setting smoothing parameters
Idea 1: split the data into Training and Test, try many values of λ, and report the one that gets the best results?
Is this fair/appropriate?

Setting smoothing parameters
Idea 2: split off a development ("dev") set:
- collect counts from 80% of the training data
- pick the λ that gets the best results on the remaining 20% (the dev set)
- now use that λ to get smoothed counts from all 100% of the training data…
- …and report results of that final model on the test data
Problems? Ideas?

Vocabulary
n-gram language modeling assumes we have a fixed vocabulary. Why?
Whether implicit or explicit, an n-gram language model is defined over a finite, fixed vocabulary.
What happens when we encounter a word not in our vocabulary (out of vocabulary, OOV)?
- If we don't do anything, prob = 0
- Smoothing doesn't really help us with this!

Vocabulary
To make this explicit, smoothing helps us with all entries in our vocabulary:
see the abacus
see the abbot
see the abduct
see the above
see the Abram
…
see the zygote

Vocabulary
…and with the words in our vocabulary themselves (a, able, about, account, acid, across, …, young, zebra, …): words with counts of 1 or 0 still get nonzero smoothed counts.
How can we have words in our vocabulary we've never seen before?

Vocabulary
Choosing a vocabulary: ideas?
- Grab a list of English words from somewhere
- Use all of the words in your training data
- Use some of the words in your training data, for example, all those that occur more than k times
Benefits/drawbacks?
- Ideally your vocabulary should represent the words you're likely to see
- Too many words: you end up washing out your probability estimates (and getting poor estimates)
- Too few: lots of out-of-vocabulary words

Vocabulary
No matter your chosen vocabulary, you're still going to have out-of-vocabulary (OOV) words. How can we deal with this?
- Ignore words we've never seen before: somewhat unsatisfying, though it can work depending on the application. The probability is then dependent on how many in-vocabulary words are seen in a sentence/text.
- Use a special symbol for OOV words and estimate the probability of out-of-vocabulary words.

Out of vocabulary
Add an extra word to your vocabulary to denote OOV (e.g. <UNK> or <OOV>).
Replace all words in your training corpus not in the vocabulary with <UNK>.
You'll get bigrams, trigrams, etc. with <UNK>:
- p(<UNK> | "I am")
- p(fast | "I <UNK>")
During testing, similarly replace all OOV words with <UNK>.

Choosing a vocabulary
A common approach (and the one we'll use for the assignment):
- Replace the first occurrence of each word in the data set by <UNK>
- Estimate probabilities normally
The vocabulary then is all words that occurred two or more times.
This also discounts all word counts by 1 and gives that probability mass to <UNK>.
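The first-occurrence trick can be sketched as follows; `replaceFirstOccurrences` is a hypothetical helper, not part of the assignment's starter code.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Replace the first time we see each word with <UNK>, so <UNK> soaks up
// exactly one count per word type and every vocabulary word keeps count - 1.
public class UnkReplacer {
    public static List<String> replaceFirstOccurrences(List<String> tokens) {
        Set<String> seen = new HashSet<>();
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (seen.add(t)) {          // add() is true only the first time
                out.add("<UNK>");
            } else {
                out.add(t);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> toks = List.of("the", "cat", "the", "dog", "the");
        System.out.println(replaceFirstOccurrences(toks));
        // [<UNK>, <UNK>, the, <UNK>, the]
    }
}
```

Words that occur only once ("cat", "dog" above) vanish from the vocabulary entirely, which is exactly the "two or more times" rule from the slide.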

Storing the table

                 count   prob   smoothed prob
see the abacus     1     1/3      1.01/203
see the abbot      0     0/3      0.01/203
see the abduct     0     0/3      0.01/203
see the above      2     2/3      2.01/203
see the Abram      0     0/3      0.01/203
…                  …      …       0.01/203
see the zygote     0     0/3      0.01/203
Total              3     3/3       203/203

How are we storing this table? Should we store all entries?

Storing the table
Hashtable (e.g. HashMap):
- fast retrieval
- fairly good memory usage
Only store entries for things we've seen:
- for example, we don't store all |V|^3 trigrams
For trigrams we can:
- store one hashtable with bigrams as keys, or
- store a hashtable of hashtables (I'm recommending this)
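A hashtable-of-hashtables trigram count table might look like the sketch below; the class and method names are made up for illustration.

```java
import java.util.HashMap;

// Trigram counts stored as a HashMap of HashMaps, keyed by the bigram
// context, so only observed entries are stored (never |V|^3 cells).
public class TrigramTable {
    private final HashMap<String, HashMap<String, Integer>> counts = new HashMap<>();

    public void add(String w1, String w2, String w3) {
        String context = w1 + " " + w2;
        // create the inner map on first use, then bump the count for w3
        counts.computeIfAbsent(context, k -> new HashMap<>())
              .merge(w3, 1, Integer::sum);
    }

    public int count(String w1, String w2, String w3) {
        HashMap<String, Integer> inner = counts.get(w1 + " " + w2);
        return inner == null ? 0 : inner.getOrDefault(w3, 0);
    }

    public static void main(String[] args) {
        TrigramTable t = new TrigramTable();
        t.add("see", "the", "cat");
        t.add("see", "the", "cat");
        t.add("see", "the", "dog");
        System.out.println(t.count("see", "the", "cat")); // 2
        System.out.println(t.count("see", "the", "emu")); // 0
    }
}
```

The inner map for a context also gives you, for free, the number of distinct continuation types (`inner.size()`), which several of the smoothing methods below need.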

Storing the table: add-lambda smoothing
For those we've seen before:

Unsmoothed (MLE):  p(z|xy) = C(xyz) / C(xy)                e.g. see the above: 2/3
Add-lambda:        p(z|xy) = (C(xyz) + λ) / (C(xy) + ??)   e.g. see the above: 2.01/203

What value do we need here to make sure it stays a probability distribution?

Storing the table: add-lambda smoothing
For those we've seen before:

Unsmoothed (MLE):  p(z|xy) = C(xyz) / C(xy)                e.g. see the above: 2/3
Add-lambda:        p(z|xy) = (C(xyz) + λ) / (C(xy) + λV)   e.g. see the above: 2.01/203

For each word in the vocabulary, we pretend we've seen it λ times more (V = vocabulary size).

Storing the table: add-lambda smoothing
For those we've seen before: p(z|ab) = (C(abz) + λ) / (C(ab) + λV)
Unseen n-grams: p(z|ab) = ? We don't need to store these; λ / (C(ab) + λV) can be computed on the fly.

Problems with frequency-based smoothing
The following bigrams have never been seen:
p( X | ate )    p( X | San )
Which would add-lambda pick as most likely? Which would you pick?

Witten-Bell discounting
Some words are more likely to be followed by new words than others:
San → Diego, Francisco, Luis, Jose, Marcos
ate → food, apples, bananas, hamburgers, a lot, for, two, grapes, …

Witten-Bell discounting
Probability mass is shifted around, depending on the context of words.
If P(w_i | w_{i-1}, …, w_{i-m}) = 0, then the smoothed probability P_WB(w_i | w_{i-1}, …, w_{i-m}) is higher if the sequence w_{i-1}, …, w_{i-m} occurs with many different words w_k.

Witten-Bell Smoothing  For bigrams  T(w i-1 ) is the number of different words (types) that occur to the right of w i-1  N(w i-1 ) is the number of times w i-1 occurred  Z(w i-1 ) is the number of bigrams in the current data set starting with w i-1 that do not occur in the training data

Witten-Bell Smoothing  if c(w i-1,w i ) > 0 # times we saw the bigram # times w i-1 occurred + # of types to the right of w i-1

Witten-Bell Smoothing  If c(w i-1,w i ) = 0

Problems with frequency-based smoothing
The following trigrams have never been seen:
p( cumquat | see the )    p( zygote | see the )    p( car | see the )
Which would add-lambda pick as most likely? Witten-Bell? Which would you pick?

Better smoothing approaches
Utilize information in lower-order models.
Interpolation:
- combine probabilities of lower-order models in some linear combination
Backoff:
- combine the probabilities by "backing off" to lower models only when we don't have enough information
- use the higher-order model only when its count exceeds some threshold k; often k = 0 (or 1)

Smoothing: simple interpolation

P(z|xy) ≈ λ P_MLE(z|xy) + μ P_MLE(z|y) + (1 - λ - μ) P_MLE(z)

- Trigram is very context-specific, very noisy
- Unigram is context-independent, smooth
- Interpolate trigram, bigram, and unigram for the best combination
How should we determine λ and μ?
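The interpolation itself is a one-liner; in the sketch below the weights and component probabilities are made-up values (in practice λ and μ are tuned on held-out data, as the next slide discusses).

```java
// Simple linear interpolation sketch: pTri, pBi, pUni stand for the MLE
// estimates p(z|xy), p(z|y), p(z); lambda and mu are the mixture weights.
public class Interpolation {
    public static double prob(double pTri, double pBi, double pUni,
                              double lambda, double mu) {
        // weights are nonnegative with lambda + mu <= 1, so the result
        // is still a proper probability
        return lambda * pTri + mu * pBi + (1 - lambda - mu) * pUni;
    }

    public static void main(String[] args) {
        // an unseen trigram (pTri = 0) still gets mass from the lower orders
        System.out.println(prob(0.0, 0.1, 0.02, 0.6, 0.3));
    }
}
```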

Smoothing: finding parameter values
Just like we talked about before, split the training data into training and development sets.
Try lots of different values for λ and μ on the held-out data, and pick the best.
Two approaches for finding these efficiently:
- EM (expectation maximization)
- "Powell search" – see Numerical Recipes in C

Backoff models: absolute discounting
Subtract some absolute number D from each of the counts (e.g. D = 0.75):

P_abs(z|xy) = (C(xyz) - D) / C(xy)    if C(xyz) > 0
P_abs(z|xy) = α(xy) P_abs(z|y)        otherwise

- How will this affect rare words?
- How will this affect common words?

Backoff models: absolute discounting
Subtract some absolute number from each of the counts (e.g. 0.75):
- it will have a large effect on low counts (rare words)
- it will have a small effect on large counts (common words)

Backoff models: absolute discounting

P_abs(z|xy) = (C(xyz) - D) / C(xy)    if C(xyz) > 0
P_abs(z|xy) = α(xy) P_abs(z|y)        otherwise

What is α(xy)?

Backoff models: absolute discounting
Before discounting, the trigram model p(z|xy) gives all of its probability mass to seen trigrams (xyz occurred). After discounting, the seen trigrams keep most of that mass, and the reserved mass goes to the bigram model p(z|y)* for the unseen trigrams (xyz didn't occur).
(*for z where xyz didn't occur)

Backoff models: absolute discounting

see the dog      1
see the cat      2
see the banana   4
see the man      1
see the woman    1
see the car      1

p( cat | see the ) = ?
p( puppy | see the ) = ?

Backoff models: absolute discounting

see the dog      1
see the cat      2
see the banana   4
see the man      1
see the woman    1
see the car      1

p( cat | see the ) = ? With D = 0.75: (2 - 0.75) / 10 = 0.125

Backoff models: absolute discounting

see the dog      1
see the cat      2
see the banana   4
see the man      1
see the woman    1
see the car      1

p( puppy | see the ) = ?   α(see the) = ?
How much probability mass did we reserve/discount for the bigram model?

Backoff models: absolute discounting

see the dog      1
see the cat      2
see the banana   4
see the man      1
see the woman    1
see the car      1

p( puppy | see the ) = ?   α(see the) = ?
reserved mass = (# of types starting with "see the" × D) / count("see the") = (6 × 0.75) / 10 = 0.45
For each of the unique trigrams, we subtracted D/count("see the") from the probability distribution.

Backoff models: absolute discounting

see the dog      1
see the cat      2
see the banana   4
see the man      1
see the woman    1
see the car      1

p( puppy | see the ) = ?   α(see the) = ?
reserved mass = (# of types starting with "see the" × D) / count("see the")
Distribute this probability mass to all bigrams that we are backing off to.
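The "see the" numbers can be checked with a small sketch; D = 0.75 is the assumed discount (any 0 < D < 1 works the same way), and the class is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Absolute discounting on the slide's "see the ___" counts.
public class AbsoluteDiscounting {
    public static final double D = 0.75; // assumed discount

    // discounted probability for a seen trigram: (count - D) / total
    public static double seenProb(int count, int total) {
        return (count - D) / total;
    }

    // probability mass reserved for the backed-off bigram model:
    // (# of seen continuation types) * D / total
    public static double reservedMass(int numTypes, int total) {
        return numTypes * D / total;
    }

    public static void main(String[] args) {
        Map<String, Integer> seeThe = new HashMap<>();
        seeThe.put("dog", 1); seeThe.put("cat", 2); seeThe.put("banana", 4);
        seeThe.put("man", 1); seeThe.put("woman", 1); seeThe.put("car", 1);
        int total = 10; // count("see the")
        System.out.println(seenProb(seeThe.get("cat"), total));  // (2 - 0.75)/10 = 0.125
        System.out.println(reservedMass(seeThe.size(), total));  // 6 * 0.75/10 = 0.45
    }
}
```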

Calculating α
We have some number of bigrams we're going to back off to, i.e. those X where C(see the X) = 0, that is, unseen trigrams starting with "see the".
When we back off, for each of these, we'll be including their probability in the model: P(X | the).
α is the normalizing constant so that the sum of these probabilities equals the reserved probability mass.

Calculating α
We can calculate α two ways:
- based on those we haven't seen:
  α(see the) = reserved_mass(see the) / Σ_{X: C(see the X) = 0} p(X | the)
- or, more often, based on those we do see:
  α(see the) = reserved_mass(see the) / (1 - Σ_{X: C(see the X) > 0} p(X | the))

Calculating α in general: trigrams
1. Calculate the reserved mass:
   reserved_mass(A B) = (# of types starting with bigram "A B" × D) / count("A B")
2. Calculate the sum of the backed-off probability. For bigram "A B":
   1 - the sum of the bigram probabilities of those trigrams that we saw starting with "A B"
   (either way of calculating this sum is fine; in practice this one is easier)
3. Calculate α:
   α(A B) = reserved_mass(A B) / (the sum from step 2)

Calculating α in general: bigrams
1. Calculate the reserved mass:
   reserved_mass(A) = (# of types starting with unigram "A" × D) / count("A")
2. Calculate the sum of the backed-off probability. For unigram "A":
   1 - the sum of the unigram probabilities of those bigrams that we saw starting with word A
   (either way of calculating this sum is fine; in practice this one is easier)
3. Calculate α:
   α(A) = reserved_mass(A) / (the sum from step 2)
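The final normalization step might be sketched as below, using the "based on those we do see" form; the reserved mass and the seen-continuation sum are assumed inputs here, not values from the slides.

```java
// Computing the backoff normalizer alpha from the reserved mass:
// alpha = reserved_mass / (1 - sum of the lower-order probabilities
// of the continuations we did see in the higher-order model).
public class AlphaCalc {
    public static double alpha(double reservedMass, double sumSeenLowerOrder) {
        return reservedMass / (1 - sumSeenLowerOrder);
    }

    public static void main(String[] args) {
        // e.g. reserved mass 0.45 (the "see the" example) and an assumed
        // sum of p(z|the) over the seen continuations z of 0.2
        System.out.println(alpha(0.45, 0.2)); // 0.45 / 0.8 = 0.5625
    }
}
```

Dividing by (1 - sum over seen) rather than summing over the unseen continuations is the practical choice: the seen set is small and already in our tables, while the unseen set is nearly the whole vocabulary.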

Calculating backoff models in practice
Store the αs in another table:
- if it's a trigram backed off to a bigram, it's a table keyed by the bigrams
- if it's a bigram backed off to a unigram, it's a table keyed by the unigrams
Compute the αs during training:
- after calculating all of the probabilities of seen unigrams/bigrams/trigrams
- go back through and calculate the αs (you should have all of the information you need)
During testing, it should then be easy to apply the backoff model with the αs pre-calculated.

Backoff models: absolute discounting

the Dow Jones   10
the Dow rose     5
the Dow fell     5

p( jumped | the Dow ) = ?
What is the reserved mass?
reserved mass = (# of types starting with "the Dow" × D) / count("the Dow") = 3D/20

Backoff models: absolute discounting

reserved_mass(bigram) = (# of types starting with bigram × D) / count(bigram)

Two nice attributes:
- it decreases if we've seen the bigram more: we should be more confident that an unseen trigram is no good
- it increases if the bigram tends to be followed by lots of other words: we will be more likely to see an unseen trigram