N-gram Models - Computational Linguistics Course


N-gram Models
Computational Linguistics Course
Instructor: Professor Cercone
Presenter: Razieh Niazi

Outline
Human Word Prediction
Language Models
N-grams
N-gram Language Models
Markov Chains
Training and Testing
Example
Reducing the Number of Parameters in N-grams
Maximum Likelihood Estimation (MLE)
The Sparse Data Problem
Smoothing and Backoff
Zipf's Law
Evaluation
Applications

Next Word Prediction
From The Globe and Mail:
The Ontario Brain…
The Ontario Brain Institute…
The Ontario Brain Institute, announced by…
The Ontario Brain Institute, announced by Premier…
The Ontario Brain Institute, announced by Premier Dalton McGuinty…
The Ontario Brain Institute, announced by Premier Dalton McGuinty on Monday, will provide…
The Ontario Brain Institute, announced by Premier Dalton McGuinty on Monday, will provide $15-million over three years for collaborative research on neuroscience.

Human Word Prediction
We may have the ability to predict future words in an utterance. How?
Domain knowledge: red blood
Syntactic knowledge: the… <adj | noun>
Lexical knowledge: baked <potato>

Applications
Why do we want to predict a word, given some preceding words?
Applications:
Spelling correction, handwriting recognition, statistical machine translation: rank the likelihood of sequences containing various alternative hypotheses, e.g. Theatre owners say popcorn/unicorn sales have doubled...
Text generation, machine translation: assess the likelihood/goodness of a sentence, e.g. The doctor recommended a cat scan.
Others…

Language Models
A language model (LM) is a probability distribution over entire sentences or texts.
N-grams: unigrams, bigrams, trigrams, …
In a simple n-gram language model, the probability of a word is conditioned on some number of previous words.

In other words, using the previous N-1 words in a sequence, we want to predict the next word:
Sue swallowed the large green ____.
A. frog  B. mountain  C. car  D. pill

What is an N-gram?
An n-gram is a subsequence of n items from a given sequence.
Unigram: n-gram of size 1
Bigram: n-gram of size 2
Trigram: n-gram of size 3
Items can be phonemes, syllables, letters, words, or anything else, depending on the application.

Example
the dog smelled like a skunk
Bigrams: "# the", "the dog", "dog smelled", "smelled like", "like a", "a skunk", "skunk #"
Trigrams: "# the dog", "the dog smelled", "dog smelled like", "smelled like a", "like a skunk", "a skunk #"
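
A minimal sketch of n-gram extraction in Python (the single "#" boundary pad and the helper name ngrams are illustrative assumptions, not from the slides):

from collections import namedtuple  # not required; standard library only

def ngrams(tokens, n, pad="#"):
    """Return the n-grams (as tuples) of a token sequence,
    padded with one boundary symbol at each end."""
    padded = [pad] + tokens + [pad]
    return [tuple(padded[i:i + n]) for i in range(len(padded) - n + 1)]

tokens = "the dog smelled like a skunk".split()
print(ngrams(tokens, 2))  # 7 bigrams, from ('#', 'the') to ('skunk', '#')
print(ngrams(tokens, 3))  # 6 trigrams, from ('#', 'the', 'dog') to ('a', 'skunk', '#')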

How to predict the next word?
Assume a language has T word types in its lexicon; how likely is word x to follow word y?
Solutions:
Estimate the likelihood of x occurring in new text, based on its general frequency of occurrence estimated from a corpus (popcorn is more likely to occur than unicorn).
Condition the likelihood of x occurring on the context of previous words (bigrams, trigrams, …): mythical unicorn is more likely than mythical popcorn.

Statistical View
The task of predicting the next word can be stated as attempting to estimate the probability function P(wn | w1, …, wn-1).
In other words, we use a classification of the previous history words to predict the next word: on the basis of having looked at a lot of text, we know which words tend to follow other words.

N-Gram Models
Model sequences using the statistical properties of n-grams.
Idea (Shannon): given a sequence of letters, what is the likelihood of the next letter?
From training data, derive a probability distribution for the next letter given a history of size n.

An N-gram Model is a Markov Chain
Given a set of states S = {s1, s2, ..., sr}:
The process starts in a random state based on an initial probability distribution.
It then changes states in sequence based on a probability distribution of the next state given the previous state.
Like a deterministic finite automaton, but instead of reading input, we move between states stochastically.
Such a model could generate the sequence {A, C, D, B, C} of length 5 with probability P(A) P(C|A) P(D|C) P(B|D) P(C|B).

Markov assumption: the probability of the next word depends only on the previous k words.
Common n-grams: unigram P(wn), bigram P(wn | wn-1), trigram P(wn | wn-2, wn-1).
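
The slide's approximation formulas are not reproduced in the transcript; a standard reconstruction (with w0 taken as a sentence-start marker) is:

\[
P(w_1 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1}) \quad \text{(bigram)}
\]
\[
P(w_1 \ldots w_n) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-2}, w_{i-1}) \quad \text{(trigram)}
\]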

Using N-Grams
For n-gram models we want P(w1, w2, ..., wn).
By the chain rule we can decompose a joint probability, e.g. P(w1, w2, w3), as follows:
P(w1, w2, ..., wn) = P(w1) P(w2 | w1) P(w3 | w1, w2) … P(wn | w1, ..., wn-1)

Example
Compute the product of the component conditional probabilities:
P(the mythical unicorn) = P(the) * P(mythical | the) * P(unicorn | the mythical)
How do we calculate these probabilities?

Training and Testing
N-gram probabilities come from a training corpus.
An overly narrow corpus: probabilities don't generalize.
An overly general corpus: probabilities don't reflect the task or domain.
A separate test corpus is used to evaluate the model.

A Simple Bigram Example
Estimate the likelihood of the sentence: I want to eat Chinese food.
P(I want to eat Chinese food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(Chinese | eat) P(food | Chinese) P(<end> | food)
What do we need to calculate these likelihoods? Bigram probabilities for each word-pair sequence in the sentence, calculated from a large corpus.

Early Bigram Probabilities from BERP
The Berkeley Restaurant Project (BeRP) is a testbed for a speech recognition system developed by the International Computer Science Institute in Berkeley, CA.
http://www.icsi.berkeley.edu/real/berp.html

P(I want to eat British food) = P(I | <start>) P(want | I) P(to | want) P(eat | to) P(British | eat) P(food | British) = .25 * .32 * .65 * .26 * .001 * .60 ≈ .0000081
N-gram models can be trained by counting and normalization.

Early BERP Bigram Counts (rows = first word, columns = second word):

            I     want   to    eat   Chinese  food   lunch
I           8     1087   0     13    0        0      0
want        3     0      786   0     6        8      6
to          3     0      10    860   3        0      12
eat         0     0      2     0     19       2      52
Chinese     2     0      0     0     0        120    1
food        19    0      17    0     0        0      0
lunch       4     0      0     0     0        1      0

Early BERP Bigram Probabilities
Normalization: divide each row's counts by the appropriate unigram count for wn-1.
Computing the bigram probability of I I: P(I | I) = C(I, I) / C(I) = 8 / 3437 = .0023
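
A minimal sketch of bigram training by counting and normalization, plus sentence scoring (the toy corpus, the <start>/<end> markers, and the function names are illustrative assumptions):

from collections import Counter

def train_bigram_model(sentences):
    """Estimate P(wn | wn-1) by maximum likelihood:
    count each bigram and divide by the count of its first word."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<start>"] + sent.split() + ["<end>"]
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

def sentence_probability(model, sentence):
    """Multiply the bigram probabilities along the sentence (0 if a bigram is unseen)."""
    tokens = ["<start>"] + sentence.split() + ["<end>"]
    prob = 1.0
    for pair in zip(tokens, tokens[1:]):
        prob *= model.get(pair, 0.0)
    return prob

model = train_bigram_model(["I want to eat Chinese food", "I want to eat British food"])
print(sentence_probability(model, "I want to eat Chinese food"))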

N-gram Parameters
Is a high-order n-gram needed?
P(Most biologists and folklore specialists believe that in fact the mythical unicorn horns derived from the narwhal)
Issues:
The longer the sequence, the less likely we are to find it in a training corpus.
The larger the n, the more parameters there are to estimate.

Computing the Number of Parameters
Assume that a speaker stays within a vocabulary of 20,000 words; we want to see the estimates for the number of parameters.
For this reason, n-gram systems currently use bigrams or trigrams (and often make do with a smaller vocabulary).
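
The slide's table of estimates is not reproduced in the transcript; as a reconstruction of the standard arithmetic, an n-gram model over V = 20,000 word types has on the order of V^n parameters (one conditional probability per possible n-gram):

\[
\text{bigram: } V^2 = 20{,}000^2 = 4 \times 10^{8}
\qquad
\text{trigram: } V^3 = 20{,}000^3 = 8 \times 10^{12}
\qquad
\text{4-gram: } V^4 = 1.6 \times 10^{17}
\]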

How to Reduce the Number of Parameters?
Reducing the number of parameters:
Stemming
Semantic classes
In practice, it's hard to beat a trigram model for language modeling.

Calculating N-gram Probabilities
Given a certain number of pieces of training data, the second goal is then finding out how to derive a good probability estimate for the target feature based on these data.
For our running example of n-grams, we will be interested in P(wn | w1, …, wn-1) and the prediction task.

Maximum Likelihood Estimation
Given:
C(w1 w2 … wn): the frequency of w1 w2 … wn in the training text
N: the total number of training n-grams
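
The MLE formulas themselves are not shown in the transcript; the standard estimates, in terms of the counts defined above, are:

\[
P_{\mathrm{MLE}}(w_1 \ldots w_n) = \frac{C(w_1 \ldots w_n)}{N},
\qquad
P_{\mathrm{MLE}}(w_n \mid w_1 \ldots w_{n-1}) = \frac{C(w_1 \ldots w_n)}{C(w_1 \ldots w_{n-1})}
\]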

In a certain corpus, the authors found 10 training instances of the words comes across; of those, 8 times they were followed by as, once by more, and once by a.
Given the two words comes across, we may then estimate:
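
The resulting MLE estimates (worked out from the counts above):

\[
P(\textit{as} \mid \textit{comes across}) = \tfrac{8}{10} = 0.8,
\quad
P(\textit{more} \mid \textit{comes across}) = \tfrac{1}{10} = 0.1,
\quad
P(\textit{a} \mid \textit{comes across}) = \tfrac{1}{10} = 0.1,
\]

and P(w | comes across) = 0 for every other word w.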

Sparse Data Problem
MLE is in general unsuitable for statistical inference in NLP.
The problem is the sparseness of our data (even with a large corpus): the vast majority of words are very uncommon, and longer n-grams involving them are rarer still.
The MLE assigns a zero probability to unseen events. This is bad, because the probability of a whole sequence, computed by multiplying the probabilities of its subparts, will then be zero.

Solution
How do we handle unseen n-grams?
Smoothing
Backoff

Zipf’s Law Given the frequency f of a word and its rank r in the list of words ordered by their frequencies:
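
A standard statement of the relation (the slide's own equation is not reproduced in the transcript):

\[
f \propto \frac{1}{r},
\qquad \text{equivalently} \qquad
f \cdot r \approx \text{const}
\]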

Smoothing
With MLE, a missing k-gram means zero probability, and any longer n-gram that contains that k-gram will also have zero probability.
Words follow a Zipfian distribution: no matter how big the training set is, there will always be a lot of rare events that are not covered.
Discounting/smoothing techniques allocate some probability mass to the missing n-grams.
These methods are referred to as discounting; the resulting process is called smoothing, presumably because a distribution without zeroes is smoother than one with zeroes.

Laplace Smoothing
Add 1 to all frequency counts. Let V be the vocabulary size.
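
The slide's unigram and bigram formulas are not shown in the transcript; the standard add-one estimates (with N the number of training tokens) are:

\[
\text{unigram: } P_{\mathrm{Lap}}(w) = \frac{C(w) + 1}{N + V},
\qquad
\text{bigram: } P_{\mathrm{Lap}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + 1}{C(w_{n-1}) + V}
\]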

Unigram Smoothing Example
Tiny corpus of 1-gram training data, V = 4, N = 20 (N: the total number of training unigrams):

Word      True Ct   Unigram Prob   New Ct   Adjusted Prob
eat       10        .5             11       .46
British   4         .2             5        .21
food      6         .3             7        .29
happily   0         .0             1        .04
Total     20        1.0            ~20      1.0

Problem with Laplace Smoothing
Problem: it gives too much probability mass to unseen n-grams.
For sparse sets of data over large vocabularies, such as n-grams, Laplace's law actually gives far too much of the probability space to unseen events.
Can we smooth more usefully?

Extensions to Laplace's Law
Lidstone's Law: add-λ smoothing.
Need to choose λ.
It works better than add-one, but still works horribly.
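
The formula is not reproduced on the slide; Lidstone's estimate (with 0 < λ < 1 and V the vocabulary size) is commonly written as:

\[
P_{\mathrm{Lid}}(w_n \mid w_{n-1}) = \frac{C(w_{n-1} w_n) + \lambda}{C(w_{n-1}) + \lambda V}
\]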

Other Estimators
Given the n-gram hierarchy P3(w3 | w1, w2), P2(w3 | w2), P1(w3):
Back off to a lower-order n-gram: backoff estimation.
When the count for an n-gram is 0, back off to the count for the (n-1)-gram.

Backoff Estimator
Backoff is an alternative to smoothing, e.g. for trigrams:
Try to fill any gap in n-grams at level n by looking 'back' to level n-1.
Example: if a particular trigram "three years before" has zero frequency, maybe the bigram "years before" has a non-zero count.
At worst, we back off all the way to unigrams, and we smooth if we are still losing.
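
A minimal sketch of this back-off idea in Python. Because it scores by plain relative frequency with no back-off weights, it is closer to "stupid backoff" than to a properly normalized backoff model; the toy counts and function name are assumptions for illustration:

from collections import Counter

def backoff_prob(w3, w2, w1, tri_counts, bi_counts, uni_counts, total_words):
    """Score P(w3 | w1, w2) by relative frequency, backing off from
    trigram to bigram to unigram whenever the higher-order count is zero."""
    if tri_counts[(w1, w2, w3)] > 0 and bi_counts[(w1, w2)] > 0:
        return tri_counts[(w1, w2, w3)] / bi_counts[(w1, w2)]
    if bi_counts[(w2, w3)] > 0 and uni_counts[w2] > 0:
        return bi_counts[(w2, w3)] / uni_counts[w2]
    # fall back to an add-one-smoothed unigram estimate
    return (uni_counts[w3] + 1) / (total_words + len(uni_counts))

# Toy counts for illustration only
uni_counts = Counter({"three": 5, "years": 4, "before": 3, "after": 2})
bi_counts = Counter({("three", "years"): 3, ("years", "before"): 2})
tri_counts = Counter()  # "three years before" was never seen
print(backoff_prob("before", "years", "three", tri_counts, bi_counts, uni_counts, 14))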

Google N-grams
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html

Evaluation
How do you estimate how well your language model fits a corpus once you're done?
Extrinsic: measured through the performance of an application that uses the model.
Intrinsic: measured directly on the model itself.

Extrinsic Evaluation
The language model is embedded in a wider application and is measured through the performance of that application.
Slow, specific to the application, and expensive.

Intrinsic Evaluation
The language model is evaluated directly using some measure: perplexity, entropy.

Perplexity
At each choice point in a grammar or LM, what is the average number of choices that can be made, weighted by their probabilities of occurrence?
Equivalently: how much probability does a grammar or language model (LM) assign to the sentences of a corpus, compared to another LM?
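
The slide does not show a formula; perplexity is conventionally defined from the model probability of a test corpus W = w1 … wN as:

\[
PP(W) = P(w_1 w_2 \ldots w_N)^{-\frac{1}{N}}
\]

so a lower perplexity corresponds to a higher probability assigned to the corpus.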

Minimizing perplexity is the same as maximizing probability.
Higher probability means lower perplexity.
The more information the model has, the lower the perplexity.
Lower perplexity means a better model: the lower the perplexity, the closer we are to the true model.

Example in terms of word-level perplexity:
The perplexity of a text with respect to the word "ice" represents the number of words which could follow it.
There are some words which may follow "ice" (cream, age, and cube), and similarly, some words which would most likely not follow "ice" (nuclear, peripheral, and happily).
Since the number of possible words which could follow "ice" is relatively high, its perplexity is high. Why?

Applications
Using n-gram models in:
Lexical disambiguation
Spam filtering
Detection of new malicious code
Automating authorship attribution
Statistical machine translation

Automating Authorship Attribution
Problem: identifying the author of an anonymous text, or a text whose authorship is in doubt; plagiarism detection.
Common approaches determine authorship using stylistic analysis, based on lexical measures representing the richness of the author's vocabulary and the frequency of common word use.
Two steps: extraction of style markers, then a classification procedure.
Style marker extraction is usually accomplished by some form of non-trivial NLP analysis, such as tagging, parsing, and morphological analysis. A classifier is then constructed, usually by performing a non-trivial feature selection step that employs mutual information or chi-square testing to determine relevant features.

Disadvantages of this approach:
Techniques used for style marker extraction are almost always language dependent (word-level analysis depends on morphological features, and many Asian languages such as Chinese and Japanese do not have word boundaries).
Feature selection is not a trivial process; it involves setting thresholds to eliminate uninformative features.
The analysis is performed at the word level.
New approach: a byte-level n-gram author profile of an author's writing, which involves choosing the optimal set of n-grams to include in the profile and calculating the similarity between two profiles.
Although rare features contribute less than common features, they can still have an important cumulative effect.
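
A minimal sketch of the byte-level n-gram profile idea (the profile size, n, and the dissimilarity measure below are illustrative choices, not necessarily those used by Keselj et al. 2003; the training texts are placeholders):

from collections import Counter

def ngram_profile(text, n=3, size=200):
    """Build an author profile: the `size` most frequent byte n-grams
    of the text, mapped to their normalized frequencies."""
    data = text.encode("utf-8")
    counts = Counter(data[i:i + n] for i in range(len(data) - n + 1))
    total = sum(counts.values())
    return {g: c / total for g, c in counts.most_common(size)}

def dissimilarity(p1, p2):
    """A simple relative distance between two profiles: sum of squared
    normalized frequency differences over the union of their n-grams."""
    d = 0.0
    for g in set(p1) | set(p2):
        f1, f2 = p1.get(g, 0.0), p2.get(g, 0.0)
        d += ((f1 - f2) / ((f1 + f2) / 2)) ** 2
    return d

# Attribute an anonymous text to the candidate author with the closest profile.
known = {"author_A": "sample training text by author A", "author_B": "other sample text by author B"}
profiles = {a: ngram_profile(t) for a, t in known.items()}
target = ngram_profile("disputed text of unknown authorship")
print(min(profiles, key=lambda a: dissimilarity(profiles[a], target)))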

References
Manning, C. D., and Schütze, H. (1999). Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA.
Julia Hirschberg. N-Grams and Corpus Linguistics. PowerPoint slides, www.cs.columbia.edu/~julia/courses/CS4705/ngrams.ppt (accessed November 14, 2010).
Julia Hockenmaier. Probability Theory, N-grams and Perplexity. PowerPoint slides, http://www.cs.uiuc.edu/class/fa08/cs498jh/Slides/Lecture3.pdf (accessed November 14, 2010).
Probability and Language Models. www.cis.upenn.edu/~cis530/slides-2010/Lecture2.ppt (accessed November 14, 2010).
Vlado Keselj, Fuchun Peng, Nick Cercone, and Calvin Thomas (2003). N-gram-based Author Profiles for Authorship Attribution. In Proceedings of the Conference Pacific Association for Computational Linguistics (PACLING'03), Dalhousie University, Halifax, Nova Scotia, Canada, pp. 255-264, August 2003.

Thank You
Questions?