Albert Gatt Corpora and Statistical Methods – Lecture 7.

Smoothing (aka discounting) techniques Part 2

Overview: smoothing methods (simple smoothing, Witten-Bell and Good-Turing estimation, held-out estimation and cross-validation); combining several n-gram models (back-off models).

Rationale behind smoothing. Sample frequencies assign some probability P to seen events and probability 0 to unseen events (including "grammatical" zeroes). Smoothing adjusts the sample frequencies so that they approximate the real population frequencies, in which events unseen in our sample also have non-zero probability. The result: lower probabilities for seen events (discounting), with the left-over probability mass distributed over unseen events (smoothing).

Laplace’s law, Lidstone’s law and the Jeffreys-Perks law

Instances in the training corpus: “inferior to ________” (bar chart of observed frequencies F(w) for the words filling the blank).

Maximum Likelihood Estimate (bar chart of F(w)): unknown words are assigned 0% probability mass.

Actual probability distribution (bar chart of F(w)): words unseen in training have non-zero probabilities in the real distribution.

Laplace’s Law (add-one smoothing): bar chart of F(w) after adding one to every count.

Laplace’s Law (bar chart of F(w)): NB, this method ends up assigning most of the probability mass to unseen events.

Generalisation: Lidstone’s Law

P(x) = (C(x) + λ) / (N + λV)

where P(x) is the probability of a specific n-gram, C(x) is the count of n-gram x in the training data, N is the total number of n-grams in the training data, V is the number of “bins” (possible n-grams), and λ is a small positive number.
M.L.E.: λ = 0; Laplace’s Law: λ = 1 (add-one smoothing); Jeffreys-Perks Law: λ = ½.
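The three settings of λ can be compared directly in code. Below is a minimal sketch of Lidstone smoothing for unigrams; the function name, the toy corpus and the assumed vocabulary size V = 10,000 are illustrative choices, not part of the lecture.

```python
# A minimal sketch of Lidstone's law: P(x) = (C(x) + lambda) / (N + lambda * V).
from collections import Counter

def lidstone_prob(ngram, counts, total, bins, lam=0.5):
    """lam = 0 -> M.L.E.; lam = 1 -> Laplace (add-one); lam = 0.5 -> Jeffreys-Perks."""
    return (counts.get(ngram, 0) + lam) / (total + lam * bins)

# Toy usage: unigram counts over a tiny corpus (illustrative assumption)
tokens = "the cat sat on the mat".split()
counts = Counter(tokens)
N = sum(counts.values())          # total n-grams in the training data
V = 10_000                        # assumed number of "bins" (possible unigrams)

print(lidstone_prob("the", counts, N, V, lam=1.0))   # seen word, add-one
print(lidstone_prob("dog", counts, N, V, lam=1.0))   # unseen word gets non-zero mass
```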

Jeffreys-Perks Law (bar chart of F(w) with λ = ½).

Objections to Lidstone’s Law: it needs an a priori way to determine λ; it predicts all unseen events to be equally likely; and it gives probability estimates that are linear in the M.L.E. frequency.

Witten-Bell discounting

Main intuition: a zero-frequency event can be thought of as an event which hasn’t happened (yet). The probability of it happening can be estimated from the probability of something happening for the first time. The count of things seen only once can be used to estimate the count of things never seen.

Witten-Bell method. 1. Let T = the number of times we saw an event for the first time = the number of different n-gram types (bins) actually attested. NB: T is the number of types actually attested, unlike V, the number of possible types in add-one smoothing. 2. Estimate the total probability mass of unseen n-grams as T / (N + T), where N is the number of n-gram tokens. Each token is an event and each new type is an event, so this ratio is the M.L.E. of the probability of a new-type event occurring (“being seen for the first time”). This is the total probability mass to be distributed among all zero-count n-grams (unseens).

Witten-Bell method. 3. Divide the total probability mass among all the zero-count n-grams; it can be distributed equally, so each of the Z unseen n-grams gets probability T / (Z(N + T)). 4. Remove this probability mass from the non-zero n-grams (discounting): a seen n-gram with count c gets probability c / (N + T).
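As a concrete illustration of steps 1–4, here is a minimal sketch of Witten-Bell discounting for unigrams; the function name, the toy corpus and the assumed vocabulary size are illustrative, not from the slides.

```python
# A minimal sketch of Witten-Bell discounting for unigrams (steps 1-4 above).
from collections import Counter

def witten_bell_unigram(tokens, vocab_size):
    counts = Counter(tokens)
    N = sum(counts.values())       # number of tokens (events)
    T = len(counts)                # number of attested types (first-time events)
    Z = vocab_size - T             # number of unseen types (zero-count bins)

    unseen_mass = T / (N + T)      # total probability mass reserved for unseens
    p_unseen = unseen_mass / Z     # distributed equally among the Z unseen types

    def prob(w):
        c = counts.get(w, 0)
        return c / (N + T) if c > 0 else p_unseen   # discounted seen / smoothed unseen
    return prob

# Toy usage (illustrative corpus and vocabulary size)
p = witten_bell_unigram("the cat sat on the mat".split(), vocab_size=10_000)
print(p("the"), p("dog"))
```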

Witten-Bell vs. add-one: if we work with unigrams, Witten-Bell and add-one smoothing give very similar results. The difference emerges with n-grams for n > 1. Main idea: estimate the probability of an unseen bigram from the probability of seeing a bigram starting with w1 for the first time.

Witten-Bell with bigrams: generalised total probability mass estimate. The total probability mass reserved for unseen bigrams starting with some word wx is T(wx) / (N(wx) + T(wx)), where T(wx) is the number of bigram types beginning with wx and N(wx) is the number of bigram tokens beginning with wx.

Witten-Bell with bigrams: non-zero bigrams get discounted as before, but again conditioning on the history: a seen bigram (wx, wi) with count c(wx, wi) gets probability c(wx, wi) / (N(wx) + T(wx)). Note: Witten-Bell won’t assign the same probability mass to all unseen n-grams; the amount assigned depends on the first word of the bigram (the first n-1 words of the n-gram).
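A minimal sketch of this conditional (bigram) version follows, under the same illustrative assumptions (names, toy corpus, vocabulary size, and a uniform fallback for an unseen history word); it shows how the reserved mass depends on the history word wx.

```python
# A minimal sketch of conditional Witten-Bell for bigrams: the reserved mass
# T(wx) / (N(wx) + T(wx)) depends on the history wx, so different unseen
# bigrams can receive different probabilities. vocab_size and the uniform
# fallback for an unseen history are assumptions for illustration.
from collections import Counter, defaultdict

def witten_bell_bigram(tokens, vocab_size):
    followers = defaultdict(Counter)              # followers[wx][wi] = c(wx, wi)
    for wx, wi in zip(tokens, tokens[1:]):
        followers[wx][wi] += 1

    def prob(wi, wx):
        c = followers[wx]
        N = sum(c.values())                       # bigram tokens beginning with wx
        T = len(c)                                # bigram types beginning with wx
        if N == 0:                                # unseen history: uniform fallback (assumption)
            return 1 / vocab_size
        Z = vocab_size - T                        # unseen continuations of wx
        if c[wi] > 0:
            return c[wi] / (N + T)                # discounted seen bigram
        return T / (Z * (N + T))                  # share of the mass reserved for unseens of wx
    return prob

# Toy usage: "the dog" is unseen, but gets history-dependent probability mass
p = witten_bell_bigram("the cat sat on the mat".split(), vocab_size=10_000)
print(p("cat", "the"), p("dog", "the"))
```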