Language Models Hongning Wang CS@UVa

Notion of Relevance
How has relevance been formalized? (a map of retrieval models)
- Relevance as similarity between Rep(q) and Rep(d): vector space model (Salton et al., 75); prob. distr. representation (Wong & Yao, 89)
- Relevance as probability of relevance, P(r=1|q,d), r ∈ {0,1}:
  - Generative models, P(d|q) or P(q|d):
    - Document generation: classical prob. model (Robertson & Sparck Jones, 76)
    - Query generation: LM approach (Ponte & Croft, 98; Lafferty & Zhai, 01a)
  - Regression model (Fox 83)
- Relevance via probabilistic inference: prob. concept space model (Wong & Yao, 95); inference network model (Turtle & Croft, 91)

Query generation models
$p(d|q) \propto p(q|d)\,p(d)$, where $p(q|d)$ is the query likelihood and $p(d)$ is the document prior. Assuming a uniform document prior, we have $p(d|q) \propto p(q|d)$.
Now, the question is how to compute $p(q|d)$. Generally this involves two steps: (1) estimate a language model $\theta_d$ based on document $d$; (2) compute the query likelihood $p(q|\theta_d)$ according to the estimated model.
Language models — we will cover them now!

What is a statistical LM?
A model specifying a probability distribution over word sequences:
p("Today is Wednesday") ≈ 0.001
p("Today Wednesday is") ≈ 0.0000000000001
p("The eigenvalue is positive") ≈ 0.00001
It can be regarded as a probabilistic mechanism for "generating" text, and is thus also called a "generative" model.

Why is a LM useful?
It provides a principled way to quantify the uncertainties associated with natural language, and allows us to answer questions like:
- Given that we see "John" and "feels", how likely will we see "happy" as opposed to "habit" as the next word? (speech recognition)
- Given that we observe "baseball" three times and "game" once in a news article, how likely is it about "sports"? (text categorization, information retrieval)
- Given that a user is interested in sports news, how likely would the user use "baseball" in a query? (information retrieval)

Recap: the BM25 formula
$\text{score}(q,d) = \sum_{w \in q\cap d} \ln\frac{N - df(w) + 0.5}{df(w) + 0.5} \cdot \frac{(k_1+1)\,c(w,d)}{k_1\left((1-b) + b\frac{|d|}{avdl}\right) + c(w,d)} \cdot \frac{(k_2+1)\,c(w,q)}{k_2 + c(w,q)}$
A closer look: the first two factors form a TF-IDF component for the document; the last factor is a TF component for the query. $b$ is usually set in $[0.75, 1.2]$, $k_1$ in $[1.2, 2.0]$, and $k_2$ in $(0, 1000]$. It is a vector space model with a TF-IDF schema!

Recap: query generation models
$p(d|q) \propto p(q|d)\,p(d)$, where $p(q|d)$ is the query likelihood and $p(d)$ is the document prior. Assuming a uniform document prior, we have $p(d|q) \propto p(q|d)$.
Now, the question is how to compute $p(q|d)$. Generally this involves two steps: (1) estimate a language model $\theta_d$ based on document $d$; (2) compute the query likelihood $p(q|\theta_d)$ according to the estimated model.
Language models — we will cover them now!

Source-Channel framework [Shannon 48]
Source X → Transmitter (encoder) → Noisy Channel p(Y|X) → Y → Receiver (decoder) → X' → Destination
Decoding: $X' = \arg\max_X p(X|Y) = \arg\max_X p(Y|X)\,p(X)$ (Bayes rule). When X is text, p(X) is a language model.
Many examples:
- Speech recognition: X = word sequence, Y = speech signal
- Machine translation: X = English sentence, Y = Chinese sentence
- OCR error correction: X = correct word, Y = erroneous word
- Information retrieval: X = document, Y = query
- Summarization: X = summary, Y = document

Language model for text
Probability distribution over word sequences, via the chain rule (from conditional probabilities to the joint probability of a sentence):
$p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_1,w_2) \cdots p(w_n|w_1, w_2, \ldots, w_{n-1})$
Complexity: $O(V^{n^*})$, where $n^*$ is the maximum document length — we need independence assumptions!
How large is this? Webster's Third New International Dictionary has 475,000 main headwords, and the average English sentence is 14.3 words long; a rough estimate gives $O(475000^{14})$ parameters, i.e., $475000^{14} \times 8\,\text{bytes} / 1024^4 \approx 3.38 \times 10^{66}$ TB.

Unigram language model
Generate a piece of text by generating each word independently:
$p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2) \cdots p(w_n)$, s.t. $\{p(w_i)\}_{i=1}^{N}$ with $\sum_i p(w_i) = 1$ and $p(w_i) \ge 0$.
Essentially a multinomial distribution over the vocabulary — the simplest and most popular choice!
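
To make this concrete, here is a minimal sketch (not from the slides) of estimating a unigram LM from a toy document by MLE and scoring a word sequence under the independence assumption; the function names and toy text are illustrative only.

```python
from collections import Counter

def unigram_lm(text):
    """Estimate a unigram LM (a multinomial over the vocabulary) by MLE."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}  # sums to 1, all >= 0

def sequence_prob(lm, words):
    """p(w1 w2 ... wn) = p(w1) p(w2) ... p(wn) under word independence."""
    p = 1.0
    for w in words:
        p *= lm.get(w, 0.0)  # unseen words get probability 0 under MLE
    return p

doc = "text mining and data mining share many text processing steps"
lm = unigram_lm(doc)
print(sequence_prob(lm, ["text", "mining"]))                # 0.2 * 0.2 = 0.04
print(sequence_prob(lm, ["data", "mining", "algorithms"]))  # 0.0: "algorithms" unseen
```

The zero probability in the last call previews the smoothing problem discussed below.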

More sophisticated LMs
N-gram language models: in general, $p(w_1 w_2 \ldots w_n) = p(w_1)\,p(w_2|w_1) \cdots p(w_n|w_1,\ldots,w_{n-1})$; an N-gram model conditions only on the past N−1 words. E.g., a bigram model: $p(w_1 \ldots w_n) = p(w_1)\,p(w_2|w_1)\,p(w_3|w_2) \cdots p(w_n|w_{n-1})$.
Remote-dependence language models (e.g., maximum entropy model)
Structured language models (e.g., probabilistic context-free grammar)

Why just unigram models?
Difficulty in moving toward more complex models:
- They involve more parameters, so they need more data to estimate.
- They increase the computational complexity significantly, both in time and space.
- Capturing word order or structure may not add much value for "topical inference".
But using more sophisticated models can still be expected to improve performance...

Language models for IR [Ponte & Croft SIGIR'98]
Imagine each document has its own language model. A text mining paper's LM assigns high probabilities to words like "text", "mining", "association", "clustering"; a food nutrition paper's LM favors "food", "nutrition", "healthy", "diet". Given the query "data mining algorithms": which model would most likely have generated this query?

Ranking docs by query likelihood
Estimate a document LM for each of d1, d2, ..., dN; given query q, rank the documents by their query likelihoods p(q|d1), p(q|d2), ..., p(q|dN). Justification: the probabilistic ranking principle (PRP).

Retrieval as language model estimation
Document ranking based on query likelihood: for $q = w_1 w_2 \ldots w_m$ under the unigram assumption, $\log p(q|d) = \sum_{i} \log p(w_i|d)$. The retrieval problem thus reduces to estimating the document language model $p(w_i|d)$. Common approach: maximum likelihood estimation (MLE).

Recap: maximum likelihood estimation
Data: a document d with counts c(w_1), ..., c(w_N). Model: multinomial distribution $p(W|\theta)$ with parameters $\theta_i = p(w_i)$. Maximum likelihood estimator: $\hat\theta = \arg\max_\theta p(W|\theta)$.
$p(W|\theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$, so $\log p(W|\theta) = \sum_{i=1}^{N} c(w_i)\log\theta_i$.
Using the Lagrange multiplier approach to enforce the probability constraint, we tune $\theta_i$ to maximize
$L(W,\theta) = \sum_{i=1}^{N} c(w_i)\log\theta_i + \lambda\left(\sum_{i=1}^{N}\theta_i - 1\right)$.
Setting partial derivatives to zero: $\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$.
Since $\sum_{i=1}^{N}\theta_i = 1$ (requirement from probability), $\lambda = -\sum_{i=1}^{N} c(w_i)$, which gives the ML estimate
$\hat\theta_i = \frac{c(w_i)}{\sum_{j=1}^{N} c(w_j)}$.

Problem with MLE: unseen events
A large collection of Yelp reviews contains 440K tokens, but:
- only 30,000 unique words occurred;
- only 0.04% of all possible bigrams occurred.
This means any word/N-gram that does not occur in the collection gets zero probability under MLE — so the model says no future query/document can contain those unseen words/N-grams!
[Figure: word frequency vs. word rank in Wikipedia (Nov 27, 2006), following Zipf's law $f(k; s, N) \propto 1/k^s$]

Problem with MLE
What probability should we give a word that has not been observed in the document? Zero probability makes the query log-likelihood log 0 = −∞. If we want to assign non-zero probabilities to such words, we will have to discount the probabilities of the observed words. This is so-called "smoothing".

Illustration of language model smoothing
[Figure: P(w|d) vs. w — compared with the maximum likelihood estimate, the smoothed LM discounts the probabilities of the seen words and assigns nonzero probabilities to the unseen words.]

General idea of smoothing
All smoothing methods try to:
- discount the probability of words seen in a document;
- re-allocate the extra counts so that unseen words get a non-zero count.

Smoothing methods
Method 1: Additive smoothing — add a constant to the count of each word ("add one", Laplace smoothing):
$p(w|d) = \frac{c(w,d) + 1}{|d| + |V|}$
where $c(w,d)$ is the count of w in d, $|d|$ is the length of d (total counts), and $|V|$ is the vocabulary size.
Problems? Hint: are all words equally important?
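
A minimal sketch of the formula above (assumptions: the document is represented as a word→count dict; names are illustrative, not the lecture's reference implementation):

```python
def add_one_prob(w, doc_counts, doc_len, vocab_size):
    """Additive (Laplace) smoothing: p(w|d) = (c(w,d) + 1) / (|d| + |V|)."""
    return (doc_counts.get(w, 0) + 1) / (doc_len + vocab_size)

# e.g., a 10-word document over a 5,000-word vocabulary:
counts = {"text": 2, "mining": 2}
print(add_one_prob("mining", counts, 10, 5000))      # seen word: 3/5010
print(add_one_prob("algorithms", counts, 10, 5000))  # unseen word: now 1/5010, not 0
```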

Add one smoothing for bigrams
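
The bigram count table from the original slide did not survive extraction; as an illustrative stand-in (toy corpus and names are assumptions), the same idea applied to bigram conditionals looks like this:

```python
from collections import Counter

tokens = "i want to eat i want food".split()
vocab = sorted(set(tokens))
V = len(vocab)

bigram_counts = Counter(zip(tokens, tokens[1:]))
history_counts = Counter(tokens[:-1])  # counts of bigram histories c(w_prev)

def add_one_bigram(w_prev, w):
    """p(w | w_prev) = (c(w_prev, w) + 1) / (c(w_prev) + |V|)"""
    return (bigram_counts[(w_prev, w)] + 1) / (history_counts[w_prev] + V)

print(add_one_bigram("i", "want"))  # seen bigram: (2+1)/(2+5)
print(add_one_bigram("i", "food"))  # unseen bigram: (0+1)/(2+5), no longer zero
```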

After smoothing
Every event now has a non-zero count, but add-one smoothing gives too much probability mass to the unseen events.

Refine the idea of smoothing
Should all unseen words get equal probabilities? We can use a reference model to discriminate among unseen words:
$p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w|REF) & \text{otherwise} \end{cases}$
where $p_{seen}(w|d)$ is the discounted ML estimate, $p(w|REF)$ is the reference language model, and $\alpha_d$ makes the probabilities sum to one.

Smoothing methods
Method 2: Absolute discounting — subtract a constant $\delta$ from the count of each seen word and re-allocate the mass via the reference model:
$p(w|d) = \frac{\max(c(w,d)-\delta,\, 0) + \delta\,|d|_u\, p(w|REF)}{|d|}$
where $|d|_u$ is the number of unique words in $d$.
Problems? Hint: varied document length?
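
A sketch of the formula above under the same conventions as before (`doc_counts` is a word→count dict, `ref_lm` a word→probability dict; both names and the default δ are illustrative):

```python
def abs_discount_prob(w, doc_counts, doc_len, ref_lm, delta=0.7):
    """p(w|d) = (max(c(w,d) - delta, 0) + delta * |d|_u * p(w|REF)) / |d|"""
    n_unique = len(doc_counts)  # |d|_u: number of unique words in d
    discounted = max(doc_counts.get(w, 0) - delta, 0)
    return (discounted + delta * n_unique * ref_lm.get(w, 0.0)) / doc_len
```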

Smoothing methods
Method 3: Linear interpolation (Jelinek-Mercer) — "shrink" the ML estimate uniformly toward p(w|REF):
$p(w|d) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, p(w|REF)$, with parameter $\lambda \in [0,1]$.
Problems? Hint: what is missing?
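
The same idea as a one-liner sketch (conventions as above; λ = 0.1 is just an illustrative default):

```python
def jm_prob(w, doc_counts, doc_len, ref_lm, lam=0.1):
    """Jelinek-Mercer: p(w|d) = (1 - lambda) * c(w,d)/|d| + lambda * p(w|REF)"""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * ref_lm.get(w, 0.0)
```

Note that λ here does not depend on the document, which is exactly the "what is missing?" hint above.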

Smoothing methods
Method 4: Dirichlet prior / Bayesian — assume $\mu$ pseudo counts distributed according to p(w|REF):
$p(w|d) = \frac{c(w,d) + \mu\, p(w|REF)}{|d| + \mu}$, with parameter $\mu > 0$.
Problems?
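
A sketch of Dirichlet prior smoothing under the same conventions (μ = 2000 is a commonly used magnitude in the IR literature, but here it is just an illustrative default):

```python
def dirichlet_prob(w, doc_counts, doc_len, ref_lm, mu=2000):
    """Dirichlet prior: p(w|d) = (c(w,d) + mu * p(w|REF)) / (|d| + mu)"""
    return (doc_counts.get(w, 0) + mu * ref_lm.get(w, 0.0)) / (doc_len + mu)
```

Unlike Jelinek-Mercer, the effective interpolation weight $\mu/(|d|+\mu)$ shrinks as the document gets longer, so longer documents trust their own counts more.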

Dirichlet prior smoothing
Bayesian estimator — posterior of the LM: $p(\theta|d) \propto p(d|\theta)\,p(\theta)$, the likelihood of the doc given the model times a prior over models.
A conjugate prior keeps the posterior in the same form as the prior, so the prior can be interpreted as "extra"/"pseudo" data. The Dirichlet distribution is a conjugate prior for the multinomial distribution, with "extra"/"pseudo" word counts $\alpha_i = \mu\, p(w_i|REF)$.

Some background knowledge
Conjugate prior: the posterior distribution is in the same family as the prior (Gaussian → Gaussian, Beta → Binomial, Dirichlet → Multinomial).
Dirichlet distribution: a continuous distribution; a sample from it is a parameter vector of a multinomial distribution.

Dirichlet prior smoothing (cont.)
Posterior distribution of the parameters: $p(\theta|d) = \text{Dir}\big(\theta \mid c(w_1,d)+\mu p(w_1|REF), \ldots, c(w_N,d)+\mu p(w_N|REF)\big)$.
The predictive distribution is the same as the posterior mean:
$p(w_i|d) = \frac{c(w_i,d) + \mu\, p(w_i|REF)}{|d| + \mu}$ — this is Dirichlet prior smoothing.

Recap: ranking docs by query likelihood
Estimate a document LM for each of d1, d2, ..., dN; given query q, rank the documents by their query likelihoods p(q|d1), p(q|d2), ..., p(q|dN). Justification: the probabilistic ranking principle (PRP).

Recap: illustration of language model smoothing
[Figure: P(w|d) vs. w — compared with the maximum likelihood estimate, the smoothed LM discounts the probabilities of the seen words and assigns nonzero probabilities to the unseen words.]

Recap: refine the idea of smoothing
Should all unseen words get equal probabilities? We can use a reference model to discriminate among unseen words:
$p(w|d) = \begin{cases} p_{seen}(w|d) & \text{if } w \text{ is seen in } d \\ \alpha_d\, p(w|REF) & \text{otherwise} \end{cases}$
where $p_{seen}(w|d)$ is the discounted ML estimate and $p(w|REF)$ is the reference language model.

Recap: smoothing methods
Method 2: Absolute discounting — subtract a constant $\delta$ from the count of each seen word:
$p(w|d) = \frac{\max(c(w,d)-\delta,\, 0) + \delta\,|d|_u\, p(w|REF)}{|d|}$, where $|d|_u$ is the number of unique words in $d$.
Problems? Hint: varied document length?

Recap: smoothing methods
Method 3: Linear interpolation (Jelinek-Mercer) — "shrink" the ML estimate uniformly toward p(w|REF):
$p(w|d) = (1-\lambda)\,\frac{c(w,d)}{|d|} + \lambda\, p(w|REF)$, with parameter $\lambda \in [0,1]$.
Problems? Hint: what is missing?

Recap: smoothing methods
Method 4: Dirichlet prior / Bayesian — assume $\mu$ pseudo counts distributed according to p(w|REF):
$p(w|d) = \frac{c(w,d) + \mu\, p(w|REF)}{|d| + \mu}$, with parameter $\mu > 0$.
Problems?

Estimating μ using leave-one-out [Zhai & Lafferty 02]
Hold out each word occurrence in turn and measure how well the smoothed LM built from the remaining words predicts it: for each $w_i$ in $d$, compute $p(w_i \mid d - w_i)$. The leave-one-out log-likelihood of the collection C is
$\ell_{-1}(\mu|C) = \sum_{d\in C}\,\sum_{w\in d} c(w,d)\,\log\frac{c(w,d)-1+\mu\,p(w|C)}{|d|-1+\mu}$
and the estimator picks $\hat\mu = \arg\max_\mu\, \ell_{-1}(\mu|C)$.
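
The cited paper optimizes this objective directly; as an illustrative alternative, this sketch picks μ by a simple grid search over the leave-one-out log-likelihood (the toy collection and the grid values are assumptions):

```python
import math
from collections import Counter

def loo_log_likelihood(docs, ref_lm, mu):
    """Leave-one-out log-likelihood of the collection for a given mu."""
    ll = 0.0
    for doc in docs:
        counts, d_len = Counter(doc), len(doc)
        for w, c in counts.items():
            # remove one occurrence of w, then predict it with the smoothed LM
            ll += c * math.log((c - 1 + mu * ref_lm[w]) / (d_len - 1 + mu))
    return ll

docs = [d.split() for d in ["data mining and text mining", "food and healthy diet"]]
all_tokens = [w for d in docs for w in d]
ref_lm = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}
best_mu = max([0.1, 0.5, 1, 2, 5, 10, 50],
              key=lambda mu: loo_log_likelihood(docs, ref_lm, mu))
print(best_mu)
```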

Understanding smoothing
Query = "the algorithms for data mining" ("algorithms", "data", "mining" are the topical words):

              the     algorithms  for     data    mining
pML(w|d1):    0.04    0.001       0.02    0.002   0.003
pML(w|d2):    0.02    0.001       0.01    0.003   0.004

p(q|d1) = 4.8×10⁻¹² and p(q|d2) = 2.4×10⁻¹², even though p("algorithms"|d1) = p("algorithms"|d2), p("data"|d1) < p("data"|d2), and p("mining"|d1) < p("mining"|d2). Intuitively, d2 should have a higher score, but p(q|d1) > p(q|d2)... So we should make p("the") and p("for") less different across docs, and smoothing helps to achieve exactly this:

                    the     algorithms  for     data      mining
p(w|REF):           0.2     0.00001     0.2     0.00001   0.00001
smoothed p(w|d1):   0.184   0.000109    0.182   0.000209  0.000309
smoothed p(w|d2):   0.182   0.000109    0.181   0.000309  0.000409

Now p(q|d1) = 2.35×10⁻¹³ < p(q|d2) = 4.53×10⁻¹³.
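
The smoothed values in the table are consistent with Jelinek-Mercer smoothing at λ = 0.9 (the λ value is inferred, not stated on the slide); this sketch reproduces the slide's numbers:

```python
query = ["the", "algorithms", "for", "data", "mining"]
p_d1 = {"the": 0.04, "algorithms": 0.001, "for": 0.02, "data": 0.002, "mining": 0.003}
p_d2 = {"the": 0.02, "algorithms": 0.001, "for": 0.01, "data": 0.003, "mining": 0.004}
p_ref = {"the": 0.2, "algorithms": 1e-5, "for": 0.2, "data": 1e-5, "mining": 1e-5}

def query_likelihood(p_doc, lam=0.0):
    """Product over query words of the (optionally JM-smoothed) word probability."""
    score = 1.0
    for w in query:
        score *= (1 - lam) * p_doc[w] + lam * p_ref[w]
    return score

print(query_likelihood(p_d1), query_likelihood(p_d2))            # 4.8e-12 > 2.4e-12
print(query_likelihood(p_d1, 0.9), query_likelihood(p_d2, 0.9))  # 2.35e-13 < 4.53e-13
```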

Two-stage smoothing [Zhai & Lafferty 02]
$p(w|d) = (1-\lambda)\,\frac{c(w,d) + \mu\, p(w|C)}{|d| + \mu} + \lambda\, p(w|U)$
Stage 1 (Dirichlet prior, Bayesian, parameter $\mu$, collection LM $p(w|C)$): explains unseen words. Stage 2 (2-component mixture, parameter $\lambda$, user background model $p(w|U)$): explains noise in the query.
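
A sketch of the two-stage formula as written above, with the collection LM and the user background model passed in as plain word→probability dicts (names and defaults are illustrative):

```python
def two_stage_prob(w, doc_counts, doc_len, coll_lm, user_lm, mu=2000, lam=0.1):
    """Stage 1: Dirichlet prior with p(w|C); Stage 2: mixture with p(w|U)."""
    stage1 = (doc_counts.get(w, 0) + mu * coll_lm.get(w, 0.0)) / (doc_len + mu)
    return (1 - lam) * stage1 + lam * user_lm.get(w, 0.0)
```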

Smoothing & TF-IDF weighting
Retrieval formula using the general smoothing scheme, with $p(w|d) = p_{seen}(w|d)$ for words seen in $d$ and $\alpha_d\,p(w|C)$ otherwise ($p(w|C)$ is the reference language model):
$\log p(q|d) = \sum_{w\in q} c(w,q)\,\log p(w|d)$
Key rewriting step (where did we see it before?): split the sum into words seen in $d$ (with $c(w,q) > 0$) and the rest:
$\log p(q|d) = \sum_{w\in q\cap d} c(w,q)\,\log\frac{p_{seen}(w|d)}{\alpha_d\,p(w|C)} \;+\; |q|\log\alpha_d \;+\; \sum_{w\in q} c(w,q)\,\log p(w|C)$
Similar rewritings are very common when using probabilistic models for IR...

Understanding smoothing
Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain the rewritten form above, whose terms have familiar roles: the first sum acts as TF weighting with an IDF-like effect (dividing by $p(w|C)$ demotes common words); $|q|\log\alpha_d$ provides doc-length normalization (a longer doc is expected to have a smaller $\alpha_d$); the last sum is independent of the document and can be ignored for ranking. Smoothing with $p(w|C)$ ≈ TF-IDF weighting + doc-length normalization!
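
Putting the pieces together: a minimal end-to-end sketch that ranks a toy collection by query log-likelihood under Dirichlet smoothing. The corpus, query, μ = 10, and the 1e-9 fallback for out-of-collection words are illustrative choices, not the lecture's reference system.

```python
import math
from collections import Counter

docs = {
    "d1": "text mining algorithms for data mining applications".split(),
    "d2": "healthy food and nutrition for a balanced diet".split(),
}
all_tokens = [w for d in docs.values() for w in d]
coll_lm = {w: c / len(all_tokens) for w, c in Counter(all_tokens).items()}

def log_query_likelihood(query, doc, mu=10):
    """log p(q|d) with Dirichlet-smoothed unigram probabilities."""
    counts, d_len = Counter(doc), len(doc)
    ll = 0.0
    for w in query.split():
        p = (counts.get(w, 0) + mu * coll_lm.get(w, 1e-9)) / (d_len + mu)
        ll += math.log(p)
    return ll

query = "data mining algorithms"
ranking = sorted(docs, key=lambda d: log_query_likelihood(query, docs[d]), reverse=True)
print(ranking)  # d1 first: it matches the topical query words
```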

What you should know
- How to estimate a language model
- The general idea and different ways of smoothing
- The effect of smoothing

Today's reading
Introduction to Information Retrieval, Chapter 12: Language models for information retrieval