CPSC 503 Computational Linguistics

Slides:

Advertisements

Similar presentations

Chapter 6: Statistical Inference: n-gram Models over Sparse Data

Advertisements

1 CS 388: Natural Language Processing: N-Gram Language Models Raymond J. Mooney University of Texas at Austin.

Ling 570 Day 6: HMM POS Taggers 1. Overview Open Questions HMM POS Tagging Review Viterbi algorithm Training and Smoothing HMM Implementation Details.

Language Models Naama Kraus (Modified by Amit Gross) Slides are based on Introduction to Information Retrieval Book by Manning, Raghavan and Schütze.

N-gram model limitations Important question was asked in class: what do we do about N-grams which were not in our training corpus? Answer given: we distribute.

Albert Gatt Corpora and Statistical Methods – Lecture 7.

SI485i : NLP Set 4 Smoothing Language Models Fall 2012 : Chambers.

6/9/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini.

6/9/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini.

September BASIC TECHNIQUES IN STATISTICAL NLP Word prediction n-grams smoothing.

N-Gram Language Models CMSC 723: Computational Linguistics I ― Session #9 Jimmy Lin The iSchool University of Maryland Wednesday, October 28, 2009.

Ngram models and the Sparsity problem John Goldsmith November 2002.

Smoothing Bonnie Dorr Christof Monz CMSC 723: Introduction to Computational Linguistics Lecture 5 October 6, 2004.

N-gram model limitations Q: What do we do about N-grams which were not in our training corpus? A: We distribute some probability mass from seen N-grams.

CS 4705 Lecture 15 Corpus Linguistics III. Training and Testing Probabilities come from a training corpus, which is used to design the model. –overly.

CPSC 322, Lecture 31Slide 1 Probability and Time: Markov Models Computer Science cpsc322, Lecture 31 (Textbook Chpt 6.5) March, 25, 2009.

Introduction to Language Models Evaluation in information retrieval Lecture 4.

LING 438/538 Computational Linguistics Sandiway Fong Lecture 18: 10/26.

SI485i : NLP Set 3 Language Models Fall 2012 : Chambers.

1 Advanced Smoothing, Evaluation of Language Models.

8/27/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 5 Giuseppe Carenini.

1 LIN6932 Spring 2007 LIN6932: Topics in Computational Linguistics Hana Filip Lecture 5: N-grams.

BİL711 Natural Language Processing1 Statistical Language Processing In the solution of some problems in the natural language processing, statistical techniques.

6. N-GRAMs 부산대학교 인공지능연구실 최성자. 2 Word prediction “I’d like to make a collect …” Call, telephone, or person-to-person -Spelling error detection -Augmentative.

Chapter 6: Statistical Inference: n-gram Models over Sparse Data

Statistical NLP: Lecture 8 Statistical Inference: n-gram Models over Sparse Data (Ch 6)

Chapter 6: N-GRAMS Heshaam Faili University of Tehran.

Chapter6. Statistical Inference : n-gram Model over Sparse Data 이 동 훈 Foundations of Statistic Natural Language Processing.

1 Introduction to Natural Language Processing ( ) LM Smoothing (The EM Algorithm) AI-lab

Lecture 4 Ngrams Smoothing

Ngram models and the Sparcity problem. The task Find a probability distribution for the current word in a text (utterance, etc.), given what the last.

12/6/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 5 Giuseppe Carenini.

12/7/2015CPSC503 Winter CPSC 503 Computational Linguistics Lecture 4 Giuseppe Carenini.

Estimating N-gram Probabilities Language Modeling.

Natural Language Processing Statistical Inference: n-grams

2/29/2016CPSC503 Winter CPSC 503 Computational Linguistics Lecture 5 Giuseppe Carenini.

N-Gram Model Formulas Word sequences Chain rule of probability Bigram approximation N-gram approximation.

Probabilistic Pronunciation + N-gram Models CMSC Natural Language Processing April 15, 2003.

Language Modeling Part II: Smoothing Techniques Niranjan Balasubramanian Slide Credits: Chris Manning, Dan Jurafsky, Mausam.

Intro to NLP - J. Eisner1 Smoothing Intro to NLP - J. Eisner2 Parameter Estimation p(x 1 = h, x 2 = o, x 3 = r, x 4 = s, x 5 = e,

Hidden Markov Models BMI/CS 576

N-Grams Chapter 4 Part 2.

CSC 594 Topics in AI – Natural Language Processing

CSCI 5832 Natural Language Processing

N-Grams and Corpus Linguistics

CSCI 5832 Natural Language Processing

CPSC 503 Computational Linguistics

Probability and Time: Markov Models

CSCI 5832 Natural Language Processing

Probability and Time: Markov Models

CPSC 503 Computational Linguistics

Honors Statistics From Randomness to Probability

N-Gram Model Formulas Word sequences Chain rule of probability

Probability and Time: Markov Models

CSCE 771 Natural Language Processing

CSCI 5832 Natural Language Processing

CSCE 771 Natural Language Processing

Lecture 10: Speech Recognition (II) October 28, 2004 Dan Jurafsky

CPSC 503 Computational Linguistics

Presented by Wen-Hung Tsai Speech Lab, CSIE, NTNU 2005/07/13

Chapter 6: Statistical Inference: n-gram Models over Sparse Data

CSCE 771 Natural Language Processing

Probability and Time: Markov Models

CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics

CPSC 503 Computational Linguistics

Conceptual grounding Nisheeth 26th March 2019.

CS249: Neural Language Model

Presentation transcript:

CPSC 503 Computational Linguistics n-grams Lecture 8 Giuseppe Carenini 11/24/2018 CPSC503 Spring 2004

Today 6/2 More Introductions Summary so far n-grams Key ideas Estimate Smoothing techniques 11/24/2018 CPSC503 Spring 2004

Introductions Your Name Previous experience in NLP? Why am I interested in NLP? Are you thinking of NLP as your main research area? If not, what else do you want to specialize in…. Anything else………… 11/24/2018 CPSC503 Spring 2004

Course checkpoint NLP The study of formalisms, models and algorithms to allow computers to perform useful tasks involving knowledge about human languages. Are we doing this? 11/24/2018 CPSC503 Spring 2004

Spelling: multiple errors (1) Given O, for each wi compute: mei=min-edit distance(wi,O) if mei<k save corresponding edit operations in EdOpi word prior Probability of the errors generating O from w1 (2) For all the wi compute: Any doubts about argmax? (3) Sort and display top-n to user 11/24/2018 CPSC503 Spring 2004

Spelling: the problem(s) Correction Detection Find the most likely correct word funn -> funny, funnel... Non-word isolated …in this context trust funn a lot of funn Non-word context Real-word isolated ?! Check if it is in the lexicon Find candidates and select the most likely it was funne - trust fun Is it an impossible (or very unlikely) word in this context? .. a wild big. Find the most likely sub word in this context Real-word context 11/24/2018 CPSC503 Spring 2004

Real Word Spelling Errors Collect a set of common sets of confusions: C={C1 .. Cn} e.g.,{(Their/they’re/there), (To/too/two), (Weather/whether), (lave, have)..} Whenever c’  Ci is encountered Compute the probability of the sentence in which it appears Substitute all cCi (c ≠ c’) and compute the probability of the resulting sentence Choose the higher one Mental confusions Their/they’re/there To/too/two Weather/whether Typos that result in real words Lave for Have Similar process for non-word errors 11/24/2018 CPSC503 Spring 2004

Transition Up to this point we’ve mostly been discussing words in isolation Now we’re switching to sequences of words And we’re going to worry about assigning probabilities to sequences of words 11/24/2018 CPSC503 Spring 2004

Only Spelling? AB Assign a probability to a sentence Part-of-speech tagging Word-sense disambiguation Probabilistic Parsing Predict the next word Speech recognition Hand-writing recognition Augmentative communication for the disabled AB Why would you want to assign a probability to a sentence or… Why would you want to predict the next word… 11/24/2018 CPSC503 Spring 2004

Apply Chain Rule Chain Rule: Applied to a word sequence from position 1 to n: Most sentences/sequences will not appear or appear only once  Standard Solution: decompose in a set of probabilities that are easier to estimate So the probability of a sequence is 11/24/2018 CPSC503 Spring 2004

Example Sequence “The big red dog barks” P(The big red dog barks)= P(big|the) * P(red|the big)* P(dog|the big red)* P(barks|the big red dog) Note - P(The) is better expressed as: P(The|<Beginning of sentence>) written as P(The|<S>) 11/24/2018 CPSC503 Spring 2004

Not a satisfying solution  Even for small n (e.g., 6) we would need a far too large corpus to estimate: Markov Assumption: the entire prefix history isn’t necessary. unigram bigram trigram That doesn’t help since its unlikely we’ll ever gather the right statistics for the prefixes. Markov Assumption Assume that the entire prefix history isn’t necessary. In other words, an event doesn’t depend on all of its history, just a fixed length near history So for each component in the product replace each with its with the approximation (assuming a prefix of N) 11/24/2018 CPSC503 Spring 2004

N-Grams unigram bigram trigram 11/24/2018 CPSC503 Spring 2004

2-Grams <s>The big red dog barks P(The big red dog barks)= P(The|<S>) * P(big|the) * P(red|big)* P(dog|red)* P(barks|dog) 11/24/2018 CPSC503 Spring 2004

Estimates for N-Grams bigram ..in general 11/24/2018 Probability of wN-1 to be before something else ..in general 11/24/2018 CPSC503 Spring 2004

Berkeley Restaurant Project (1994) Table: Counts Berkley Restaurant Project 1994 ~10000 sentences Notice that majority of values is 0 But the full matrix would be even more sparse Corpus: ~10,000 sentences 11/24/2018 CPSC503 Spring 2004

BERP Table: 11/24/2018 CPSC503 Spring 2004

BERP Table Comparison Counts Prob. 1? 11/24/2018 CPSC503 Spring 2004

Some Observations What’s being captured here? application syntax P(want|I) = .32 P(to|want) = .65 P(eat|to) = .26 P(food|Chinese) = .56 P(lunch|eat) = .055 application syntax I want : Because the application is a question answering systems Want to: Want often subcategorize for an infinitive To eat: the systems is about eating out Chinese food: application/cultural Eat lunch: semantics application application semantics 11/24/2018 CPSC503 Spring 2004

Some Observations Speech-based restaurant consultant! I I I want I want I want to The food I want is P(I | I) P(want | I) P(I | food) Speech-based restaurant consultant! 11/24/2018 CPSC503 Spring 2004

Generation Choose N-Grams according to their probabilities and string them together You can generate the most likely sentences…. I want -> want to -> to eat -> eat lunch I want -> want to -> to eat -> eat Chinese -> Chinese food 11/24/2018 CPSC503 Spring 2004

Two problems with applying: to 0 for unseen This is qualitatively wrong as a prior model because it would never allow n-gram, While clearly some of the unseen ngrams will appear in other texts 11/24/2018 CPSC503 Spring 2004

Problem (1) We may need to multiply may very small numbers (underflows!) Easy Solution: Convert probabilities to logs and then do additions. To get the real probability (if you need it) go back to the antilog. 11/24/2018 CPSC503 Spring 2004

Problem (2) Solutions: The probability matrix for n-grams is sparse How can we assign a probability to a sequence where one of the component n-grams has a value of zero? Solutions: Add-one smoothing Witten-Bell Discounting Good-Turing Back off and Deleted Interpolation Let’s assume we’re using N-grams How can we assign a probability to a sequence where one of the component n-grams has a value of zero Assume all the words are known and have been seen Go to a lower order n-gram Back off from bigrams to unigrams Replace the zero with something else SOME USEFUL OBSERVATIONS A small number of events occur with high frequency You can collect reliable statistics on these events with relatively small samples A large number of events occur with small frequency You might have to wait a long time to gather statistics on the low frequency events Some zeroes are really zeroes Meaning that they represent events that can’t or shouldn’t occur On the other hand, some zeroes aren’t really zeroes They represent low frequency events that simply didn’t occur in the corpus 11/24/2018 CPSC503 Spring 2004

Add-One Make the zero counts 1. Rationale: If you had seen these “rare” events chances are you would only have seen them once. unigram They’re just events you haven’t seen yet. What would have been the count that would have generated P*? discount 11/24/2018 CPSC503 Spring 2004

Add-One: Bigram …… 11/24/2018 CPSC503 Spring 2004 Sum up for all the wN and it must add to 1 11/24/2018 CPSC503 Spring 2004

BERP Original vs. Add-one smoothed Counts 6 0.18 Factor of 8? 19 11/24/2018 CPSC503 Spring 2004

Add-One Smoothed Problems An excessive amount of probability mass is moved to the zero counts When compared empirically with MLE or other smoothing techniques, it performs poorly -> Not used in practice Detailed discussion Gale and Church 1994 Reduce the problem by adding smaller values 0.5 … 0.2 11/24/2018 CPSC503 Spring 2004

Witten-Bell Think about the occurrence of an unseen item (word, bigram, etc) as an event. Let’s start with unigrams: N tokens, T types and Z zero-count words. How many times was an as yet unseen type encountered? The probability of such an event can be measured in a corpus by just looking at how often it happens. Just take the single word case first. Assume a corpus of N tokens and T types. How many times was an as yet unseen type encountered? divide equally 11/24/2018 CPSC503 Spring 2004

Witten-Bell (unigram) Probability mass of unseen events Probability mass left for seen events C* as usual just multiply the probability times N We have to discount all seen 11/24/2018 CPSC503 Spring 2004

Witten-Bell (bigram) unigram bigram Zero-count prob. mass Non-zero-count prob. Our counts are now conditioned on the first word in the biagram 11/24/2018 CPSC503 Spring 2004

Original BERP vs. Witten-Bell Smoothed Counts 11/24/2018 CPSC503 Spring 2004

Next Time Read N-grams sensitivity to training corpus (pag 202-206) Comparing Probabilistic Models Intro Markov Chains and Hidden Markov Models I’ll leave material in the reading room on Mon. 11/24/2018 CPSC503 Spring 2004