Statistical Language Models

Statistical Language Models (Lecture for CS410 Intro Text Info Systems) Jan. 31, 2007 ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign

What is a Statistical LM? A probability distribution over word sequences, e.g., p(“Today is Wednesday”) ≈ 0.001, p(“Today Wednesday is”) ≈ 0.0000000000001, p(“The eigenvalue is positive”) ≈ 0.00001. Context-dependent! Can also be regarded as a probabilistic mechanism for “generating” text, thus also called a “generative” model.

Why is a LM Useful? It provides a principled way to quantify the uncertainties associated with natural language and allows us to answer questions like: Given that we see “John” and “feels”, how likely is the next word to be “happy” as opposed to “habit”? (speech recognition) Given that we observe “baseball” three times and “game” once in a news article, how likely is the article to be about sports? (text categorization, information retrieval) Given that a user is interested in sports news, how likely is the user to use “baseball” in a query? (information retrieval)

Source-Channel Framework (Communication System): a Transmitter (encoder) sends X through a Noisy Channel, which delivers Y to a Receiver (decoder) that outputs X’ to the Destination. The source is characterized by P(X), the channel by P(Y|X), and the decoder must compute P(X|Y). When X is text, p(X) is a language model.
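
The decoding rule implied by this framework (standard noisy-channel reasoning; the slide itself does not spell it out) is to pick the X that is most probable given the observed Y:

    X* = argmax_X P(X|Y) = argmax_X P(Y|X) P(X)

i.e., the channel model P(Y|X) and the language model P(X) are combined via Bayes’ rule, and P(Y) can be dropped because it does not depend on X.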

Speech Recognition: given acoustic signal A, find the word sequence W. The Speaker produces words W; the noisy channel (speech production and acoustics) turns them into the acoustic signal A; the Recognizer outputs the recognized words by computing P(W|A) from P(W) (the language model) and P(A|W) (the acoustic model).

Machine Translation: given a Chinese sentence C, find its English translation E. The English speaker’s sentence E (English words) passes through the noisy channel to become the Chinese words C; the Translator produces the English translation by computing P(E|C) from P(E) (the English language model) and P(C|E) (the English-to-Chinese translation model).

Spelling/OCR Error Correction: given corrupted text E, find the original text O. The original text O passes through the noisy channel (typing or OCR errors) to become the “erroneous” words E; the Corrector produces the corrected text by computing P(O|E) from P(O) (the “normal” language model) and P(E|O) (the spelling/OCR error model).

Basic Issues: (1) Define the probabilistic model (events, random variables, joint/conditional probabilities), so that P(w1 w2 ... wn) = f(θ1, θ2, …, θm). (2) Estimate the model parameters: tune the model to best fit the data and our prior knowledge; θi = ? (3) Apply the model to a particular task: many applications.

The Simplest Language Model (Unigram Model): generate a piece of text by generating each word INDEPENDENTLY, thus p(w1 w2 ... wn) = p(w1)p(w2)…p(wn). Parameters: {p(wi)}, with p(w1)+…+p(wN) = 1 (N is the vocabulary size). This is essentially a multinomial distribution over words, and a piece of text can be regarded as a sample drawn according to this word distribution.
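
As a minimal sketch (not from the slides; the toy probabilities below are made up), a unigram model scores a word sequence by multiplying per-word probabilities, usually in log space to avoid underflow:

    import math

    def unigram_log_prob(words, p):
        # p maps word -> probability; assumes every word has non-zero probability
        # (guaranteeing that is exactly what smoothing, later in the lecture, is about)
        return sum(math.log(p[w]) for w in words)

    p = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02, "the": 0.67}
    print(unigram_log_prob(["text", "mining"], p))  # log(0.2) + log(0.1)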

Text Generation with a Unigram LM: sampling from a (unigram) language model p(w|θ) generates a document. Topic 1 (Text mining), e.g., text 0.2, mining 0.1, association 0.01, clustering 0.02, …, food 0.00001, yields a text mining paper; Topic 2 (Health), e.g., food 0.25, nutrition 0.1, healthy 0.05, diet 0.02, …, yields a food nutrition paper.
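
A sketch of the sampling step, using the Topic 1 probabilities shown on the slide (the function and its defaults are my own; random.choices treats the probabilities as relative weights):

    import random

    def sample_document(p, length, seed=0):
        # Draw `length` words independently from the unigram distribution p (word -> prob).
        rng = random.Random(seed)
        words, probs = zip(*p.items())
        return rng.choices(words, weights=probs, k=length)

    topic_text_mining = {"text": 0.2, "mining": 0.1, "association": 0.01, "clustering": 0.02}
    print(" ".join(sample_document(topic_text_mining, 10)))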

Estimation of a Unigram LM: estimate the language model p(w|θ) = ? from a document, e.g., a “text mining paper” with 100 words in total and counts text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1. The estimates are the relative frequencies: p(text) = 10/100, p(mining) = 5/100, p(association) = p(database) = 3/100, …, p(query) = 1/100.
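
A minimal sketch of this relative-frequency estimate, using a short stand-in document rather than the 100-word paper from the slide:

    from collections import Counter

    doc = "text mining and text data mining for text".split()
    counts = Counter(doc)
    p_ml = {w: c / len(doc) for w, c in counts.items()}  # e.g. p(text) = 3/8
    print(p_ml)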

Maximum Likelihood Estimate. Data: a document d with counts c(w1), …, c(wN), and length |d|. Model: a multinomial (unigram) model M with parameters {p(wi)}. Likelihood: p(d|M). Maximum likelihood estimator: M = argmax_M p(d|M). We tune the p(wi) to maximize the log-likelihood l(d|M), using the Lagrange multiplier approach and setting the partial derivatives to zero to obtain the ML estimate.
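
The derivation the slide alludes to (standard, written in the document’s notation): maximize the log-likelihood

    l(d|M) = sum_i c(wi) log p(wi)   subject to   sum_i p(wi) = 1.

Introducing a Lagrange multiplier λ and setting the partial derivative with respect to each p(wi) to zero gives c(wi)/p(wi) + λ = 0, so p(wi) ∝ c(wi); the sum-to-one constraint then yields the ML estimate

    p(wi) = c(wi) / |d|,

i.e., exactly the relative frequencies of the previous slide.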

Empirical distribution of words: there are stable, language-independent patterns in how people use natural languages. A few words occur very frequently; most occur rarely. E.g., in news articles, the top 4 words account for 10~15% of word occurrences and the top 50 words for 35~40%. The most frequent word in one corpus may be rare in another.

Zipf’s Law: rank * frequency ≈ constant. [Figure: word frequency plotted against word rank (by frequency); the most frequent words (stop words) form the biggest data structure, the mid-frequency words are the most useful (Luhn 57), and the tail raises the question: is “too rare” a problem?] A generalized Zipf’s law is applicable in many domains.
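
The generalized form alluded to on the slide is commonly written in the Zipf-Mandelbrot style (stated here from the standard literature, not copied from the slide):

    frequency(rank) ≈ C / (rank + B)^α

with α close to 1 for English text; the simple law above is the special case B = 0, α = 1.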

Problem with the ML Estimator: what if a word doesn’t appear in the text? In general, what probability should we give a word that has not been observed? If we want to assign non-zero probabilities to such words, we’ll have to discount the probabilities of observed words. This is what “smoothing” is about…

Language Model Smoothing (Illustration): [Figure: P(w) plotted over words w, comparing the maximum likelihood estimate with a smoothed LM.]

How to Smooth? All smoothing methods try to discount the probability of words seen in a document and re-allocate the extra counts so that unseen words will have a non-zero count. Method 1 (Additive smoothing): add a constant δ to the counts of each word; with δ = 1 this is “add one” (Laplace) smoothing, p(w|d) = (c(w,d) + 1) / (|d| + |V|), where c(w,d) is the count of w in d, |V| is the vocabulary size, and |d| is the length of d (total counts). Problems?
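
A minimal sketch of additive smoothing in code (the vocabulary size and document here are made-up stand-ins):

    from collections import Counter

    def additive_smoothed_prob(word, doc_counts, doc_len, vocab_size, delta=1.0):
        # p(w|d) = (c(w,d) + delta) / (|d| + delta * |V|); delta = 1 is Laplace "add one"
        return (doc_counts.get(word, 0) + delta) / (doc_len + delta * vocab_size)

    doc = "text mining text data mining text".split()
    counts, V = Counter(doc), 10000  # assume a 10,000-word vocabulary
    print(additive_smoothed_prob("text", counts, len(doc), V))       # seen word
    print(additive_smoothed_prob("retrieval", counts, len(doc), V))  # unseen word, now non-zero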

How to Smooth? (cont.) Should all unseen words get equal probabilities? We can use a reference model to discriminate among unseen words: combine a discounted ML estimate for the words seen in the document with a reference language model p(w|REF) for the unseen words.
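
The general scheme behind this idea (the usual formulation in the retrieval literature, e.g., Zhai & Lafferty; the slide’s own formula is not in the transcript) is

    p(w|d) = p_seen(w|d)        if w is seen in d
           = α_d * p(w|REF)     otherwise

where p_seen(w|d) is a discounted ML estimate and the coefficient α_d is chosen so that all probabilities sum to one.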

Other Smoothing Methods. Method 2 (Absolute discounting): subtract a constant δ from the count of each seen word and redistribute the freed-up mass according to p(w|REF); the amount redistributed depends on the number of unique words in the document. Method 3 (Linear interpolation, Jelinek-Mercer): “shrink” the ML estimate uniformly toward p(w|REF), controlled by an interpolation parameter λ.
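
Written out (standard formulations, supplied here because the slide’s equations did not survive the transcript):

    Absolute discounting:  p(w|d) = ( max(c(w,d) - δ, 0) + δ * |d|_u * p(w|REF) ) / |d|,
                           where |d|_u is the number of unique words in d;

    Jelinek-Mercer:        p(w|d) = (1 - λ) * c(w,d)/|d| + λ * p(w|REF),   0 ≤ λ ≤ 1.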

Other Smoothing Methods (cont.) Method 4 (Dirichlet prior/Bayesian): assume μ·p(w|REF) pseudo counts for each word, where μ is the smoothing parameter. Method 5 (Good-Turing): assume the total count of unseen events to be n1 (the number of singletons), and adjust the counts of seen events in the same way.
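
The corresponding formulas (again standard versions, restored because the slide’s equations are missing):

    Dirichlet prior:  p(w|d) = ( c(w,d) + μ * p(w|REF) ) / ( |d| + μ ),   μ > 0;

    Good-Turing:      a word occurring r times gets the adjusted count r* = (r+1) * n_{r+1} / n_r,
                      where n_r is the number of words occurring exactly r times, so the total
                      probability mass reserved for unseen words is n_1 / N (N = total word count).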

Dirichlet Prior Smoothing. ML estimator: M = argmax_M p(d|M). Bayesian estimator: first consider the posterior p(M|d) = p(d|M)p(M)/p(d), then take the mean or mode of the posterior distribution. Here p(d|M) is the sampling distribution (of the data) and P(M) = p(θ1, …, θN) is our prior on the model parameters. A conjugate prior can be interpreted as “extra”/“pseudo” data, and the Dirichlet distribution is a conjugate prior for the multinomial sampling distribution, with “extra”/“pseudo” word counts αi = μ·p(wi|REF).
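
Concretely (a standard conjugacy statement, filled in because the slide’s formulas are not in the transcript): the Dirichlet prior is

    p(θ1, …, θN) ∝ prod_i θi^(αi - 1),   with pseudo counts αi = μ * p(wi|REF),

and, being conjugate to the multinomial, it combines with the observed document simply by adding counts: the posterior is again Dirichlet, with parameters αi + c(wi,d).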

Dirichlet Prior Smoothing (cont.) The posterior distribution of the parameters is again Dirichlet, and the predictive distribution is the same as its mean, which yields the Dirichlet prior smoothing formula.
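
Taking the posterior mean gives the predictive probability p(w|d) = (c(w,d) + μ·p(w|REF)) / (|d| + μ), i.e., exactly the Dirichlet prior smoothing formula above. A minimal sketch in code (function and variable names are my own; the reference probabilities are toy values):

    from collections import Counter

    def dirichlet_smoothed_prob(word, doc_counts, doc_len, p_ref, mu=2000.0):
        # p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)
        return (doc_counts.get(word, 0) + mu * p_ref.get(word, 0.0)) / (doc_len + mu)

    doc = "text mining text data mining text".split()
    counts = Counter(doc)
    p_ref = {"text": 0.001, "mining": 0.0005, "retrieval": 0.0002}  # toy reference model
    print(dirichlet_smoothed_prob("text", counts, len(doc), p_ref))
    print(dirichlet_smoothed_prob("retrieval", counts, len(doc), p_ref))  # unseen in d, still non-zero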

So, which method is the best? It depends on the data and the task! Many other sophisticated smoothing methods have been proposed… Cross validation is generally used to choose the best method and/or set the smoothing parameters… For retrieval, Dirichlet prior performs well… Smoothing will be discussed further in the course…

What You Should Know: what a statistical language model is; what a unigram language model is; Zipf’s law; what smoothing is and why smoothing is necessary; the formula for Dirichlet prior smoothing; and that more advanced smoothing methods exist (you don’t need to know the details).