Language Models for TR Rong Jin

Department of Computer Science and Engineering Michigan State University

What is a Statistical LM?
A probability distribution over word sequences:
p(“Today is Wednesday”) ≈ 0.001
p(“Today Wednesday is”) ≈ …
p(“The eigenvalue is positive”) ≈ …
Context-dependent!
Can also be regarded as a probabilistic mechanism for “generating” text, and is thus also called a “generative” model.

Why is a LM Useful?
Provides a principled way to quantify the uncertainties associated with natural language.
Allows us to answer questions like:
Given that we see “John” and “feels”, how likely are we to see “happy” as opposed to “habit” as the next word? (speech recognition)
Given that we observe “baseball” three times and “game” once in a news article, how likely is it about “sports”? (text categorization, information retrieval)
Given that a user is interested in sports news, how likely would the user use “baseball” in a query? (information retrieval)

The Simplest Language Model (Unigram Model)
Generate a piece of text by generating each word INDEPENDENTLY.
Thus, p(w1 w2 ... wn) = p(w1)p(w2)…p(wn)
Parameters: {p(wi)}, with p(w1)+…+p(wN) = 1 (N is the vocabulary size)
Essentially a multinomial distribution over words.
A piece of text can be regarded as a sample drawn according to this word distribution.
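The independence assumption can be sketched in a few lines of Python. The vocabulary and probabilities below are made up for illustration (they are not from the slides), and `sequence_log_prob` is a hypothetical helper name.

```python
import math

# Hypothetical toy unigram model; the probabilities are illustrative and sum to 1.
unigram = {"today": 0.4, "is": 0.3, "wednesday": 0.2, "sunny": 0.1}

def sequence_log_prob(words, model):
    """log p(w1 ... wn) = sum_i log p(wi), since words are generated independently."""
    return sum(math.log(model[w]) for w in words)

# Probabilities of two orderings of the same three words.
p_ordered = math.exp(sequence_log_prob(["today", "is", "wednesday"], unigram))
p_shuffled = math.exp(sequence_log_prob(["today", "wednesday", "is"], unigram))
```

Because each word is generated independently, both orderings receive the same probability: a unigram model cannot tell a grammatical word order from a shuffled one.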

Text Generation with Unigram LM
(Unigram) Language Model θ, p(w|θ) → Sampling → Document
Topic 1 (Text mining): … text 0.2, mining 0.1, association 0.01, clustering 0.02, … food … → Text mining paper
Topic 2 (Health): … food 0.25, nutrition 0.1, healthy 0.05, diet 0.02 … → Food nutrition paper
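A minimal sketch of the sampling step, using the four word probabilities listed for Topic 2. The slide elides the rest of the vocabulary, so the remaining probability mass is lumped into a hypothetical `<other>` token here; `sample_document` and the fixed seed are likewise assumptions for illustration.

```python
import random

# Word probabilities listed on the slide for Topic 2 (Health).
topic_health = {"food": 0.25, "nutrition": 0.1, "healthy": 0.05, "diet": 0.02}

def sample_document(model, length, seed=0):
    """Draw `length` words independently from a unigram model."""
    rng = random.Random(seed)
    words = list(model)
    weights = [model[w] for w in words]
    # Assumption: mass not listed on the slide goes to a placeholder token.
    remainder = 1.0 - sum(weights)
    if remainder > 0:
        words.append("<other>")
        weights.append(remainder)
    return rng.choices(words, weights=weights, k=length)

doc = sample_document(topic_health, length=20)
```

With a different topic model (for example the text mining one), the same procedure would tend to produce a different mix of words, which is what lets the model explain where a document came from.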

Estimation of Unigram LM
(Unigram) Language Model θ, p(w|θ) = ? ← Estimation ← Document
Document: a “text mining paper” (total #words = 100)
Word counts: text 10, mining 5, association 3, database 3, algorithm 2, …, query 1, efficient 1
Estimates: p(text|θ) = 10/100, p(mining|θ) = 5/100, p(association|θ) = 3/100, p(database|θ) = 3/100, …, p(query|θ) = 1/100
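The estimates on this slide are relative frequencies (count divided by document length). A small sketch follows; the counts are the ones shown on the slide, while the `<other>` padding that brings the total to 100 words stands in for the elided words and is an assumption.

```python
from collections import Counter

def estimate_unigram(tokens):
    """Relative-frequency estimate: p(w) = count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

# Counts shown on the slide; the remaining words are elided there.
slide_counts = {"text": 10, "mining": 5, "association": 3, "database": 3,
                "algorithm": 2, "query": 1, "efficient": 1}
tokens = [w for w, c in slide_counts.items() for _ in range(c)]
# Assumption: pad with a placeholder token so the document has 100 words.
tokens += ["<other>"] * (100 - len(tokens))

model = estimate_unigram(tokens)
```

Any word that does not occur in the document gets no entry at all, that is, an estimated probability of zero.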

Language Models for Retrieval (Ponte & Croft 98)
Query = “data mining algorithms”
Document: Text mining paper → Language model: … text ?, mining ?, association ?, clustering ?, …
Document: Food nutrition paper → Language model: … food ?, nutrition ?, healthy ?, diet ?, …
Which model would most likely have generated this query?

Ranking Docs by Query Likelihood
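Each document di has its own doc LM θdi: d1 → θd1, d2 → θd2, …, dN → θdN
Query likelihood of q under each doc LM: p(q|θd1), p(q|θd2), …, p(q|θdN)
Documents are ranked by their query likelihood.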

But, where is the relevance?
And, what’s good about this approach?

The Notion of Relevance
Similarity: (Rep(q), Rep(d))
  Different rep & similarity
    Vector space model (Salton et al., 75)
    Prob. distr. model (Wong & Yao, 89)
    …
Probability of relevance: P(r=1|q,d), r ∈ {0,1}
  Regression model (Fox 83)
  Generative model
    Doc generation: Classical prob. model (Robertson & Sparck Jones, 76)
    Query generation: LM approach (Ponte & Croft, 98), (Lafferty & Zhai, 01a)
Probabilistic inference: P(d→q) or P(q→d)
  Different inference system
    Prob. concept space model (Wong & Yao, 95)
    Inference network model (Turtle & Croft, 91)

Query likelihood p(q| d)
Query Generation Query likelihood p(q| d) Document prior Assuming uniform prior, we have Now, the question is how to compute ? Generally involves two steps: (1) estimate a language model based on D (2) compute the query likelihood according to the estimated model

Retrieval as Language Model Estimation
Document ranking based on query likelihood: log p(q|d) = Σi log p(wi|d), where q = w1 w2 … wn
p(wi|d) is the document language model.
Retrieval problem ≈ Estimation of p(wi|d)
Smoothing is an important issue, and distinguishes different approaches.
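To see why smoothing matters, here is a sketch that ranks two toy documents by log query likelihood using plain relative-frequency document models. The document texts and helper names are invented for illustration; only the query comes from the slides.

```python
import math
from collections import Counter

def ml_model(tokens):
    """Unsmoothed document model: p(w|d) = c(w,d) / |d|."""
    counts = Counter(tokens)
    return {w: c / len(tokens) for w, c in counts.items()}

def query_log_likelihood(query, doc_model):
    """log p(q|d) = sum_i log p(wi|d); -inf if any query word is unseen in d."""
    score = 0.0
    for w in query:
        p = doc_model.get(w, 0.0)
        if p == 0.0:
            return float("-inf")
        score += math.log(p)
    return score

# Hypothetical toy documents.
docs = {
    "text_mining": "text mining data mining algorithms for text clustering".split(),
    "nutrition": "food nutrition healthy diet food".split(),
}
query = "data mining algorithms".split()

scores = {name: query_log_likelihood(query, ml_model(toks))
          for name, toks in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Without smoothing, a single unseen query word drives the whole score to minus infinity, so every document missing even one query term ties at the bottom regardless of how well it matches the rest.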

A General Smoothing Scheme
All smoothing methods try to
  discount the probability of words seen in a doc
  re-allocate the extra probability so that unseen words will have a non-zero probability
Most use a reference model (collection language model) to discriminate unseen words:
p(w|d) = ps(w|d) if w is seen in d (discounted ML estimate)
p(w|d) = αd p(w|C) otherwise (collection language model)

Smoothing & TF-IDF Weighting
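Plugging the general smoothing scheme into the query likelihood retrieval formula, we obtain
log p(q|d) = Σ{wi in q and in d} log [ ps(wi|d) / (αd p(wi|C)) ] + n log αd + Σi log p(wi|C)
First term: TF weighting (through ps(wi|d)) and IDF weighting (through 1/p(wi|C))
Second term: doc length normalization (a long doc is expected to have a smaller αd)
Third term: ignore for ranking
Smoothing with p(w|C) ≈ TF-IDF + length norm.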
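The intermediate steps can be written out as follows. This is a sketch under the usual notation, where p_s(w|d) denotes the discounted estimate for words seen in d, α_d the document-dependent weight given to the collection model for unseen words, and n the query length; the symbol names are assumptions where the slide's formula was lost.

```latex
\begin{align*}
\log p(q \mid d)
  &= \sum_{i=1}^{n} \log p(w_i \mid d) \\
  &= \sum_{i:\, w_i \in d} \log p_s(w_i \mid d)
   + \sum_{i:\, w_i \notin d} \log \alpha_d\, p(w_i \mid C) \\
  &= \sum_{i:\, w_i \in d} \log \frac{p_s(w_i \mid d)}{\alpha_d\, p(w_i \mid C)}
   + n \log \alpha_d
   + \sum_{i=1}^{n} \log p(w_i \mid C)
\end{align*}
```

The last step adds and subtracts the collection-model term for the seen words, so that the sum over matched words is the only part that needs the document's own counts.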

Three Smoothing Methods (Zhai & Lafferty 01)
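Simplified Jelinek-Mercer: shrink uniformly toward p(w|C)
  p(w|d) = (1-λ) c(w,d)/|d| + λ p(w|C)
Dirichlet prior (Bayesian): assume pseudo counts μ p(w|C)
  p(w|d) = (c(w,d) + μ p(w|C)) / (|d| + μ)
Absolute discounting: subtract a constant δ
  p(w|d) = (max(c(w,d) - δ, 0) + δ |d|u p(w|C)) / |d|, where |d|u is the number of unique words in d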
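The three methods can be sketched side by side, following the standard formulas from Zhai & Lafferty. The function names, default parameter values, toy document, and toy collection model below are all assumptions chosen for illustration.

```python
from collections import Counter

def jelinek_mercer(w, doc_counts, doc_len, p_coll, lam=0.5):
    """Interpolate the ML estimate with the collection model using weight lam."""
    return (1 - lam) * doc_counts.get(w, 0) / doc_len + lam * p_coll[w]

def dirichlet(w, doc_counts, doc_len, p_coll, mu=2000.0):
    """Add mu * p(w|C) pseudo counts to every word."""
    return (doc_counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)

def absolute_discount(w, doc_counts, doc_len, p_coll, delta=0.7):
    """Subtract delta from each seen count; give the freed mass to p(w|C)."""
    n_unique = len(doc_counts)  # number of distinct words in the document
    discounted = max(doc_counts.get(w, 0) - delta, 0.0)
    return (discounted + delta * n_unique * p_coll[w]) / doc_len

# Hypothetical toy document and collection model (probabilities sum to 1).
doc = "text mining text clustering".split()
doc_counts = Counter(doc)
p_coll = {"text": 0.3, "mining": 0.2, "clustering": 0.1, "food": 0.4}

# "food" never occurs in the document but still gets non-zero probability.
p_food = {
    "jm": jelinek_mercer("food", doc_counts, len(doc), p_coll),
    "dirichlet": dirichlet("food", doc_counts, len(doc), p_coll),
    "abs_discount": absolute_discount("food", doc_counts, len(doc), p_coll),
}
```

Each method still yields a proper distribution over the vocabulary; they differ in how much mass is taken from seen words and whether that amount depends on document length.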

Comparison of Three Methods

The Need of Query-Modeling (Dual-Role of Smoothing)
Keyword queries Verbose queries

Two-stage Smoothing
P(w|d) = (1-λ) (c(w,d) + μ p(w|C)) / (|d| + μ) + λ p(w|U)
Stage-1 (μ): explain unseen words, Dirichlet prior (Bayesian)
Stage-2 (λ): explain noise in query, 2-component mixture
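A minimal sketch of the two-stage formula: Dirichlet smoothing with the collection model first, then interpolation with a query background model p(w|U). The function name, parameter defaults, and toy distributions are assumptions for illustration.

```python
from collections import Counter

def two_stage(w, doc_counts, doc_len, p_coll, p_user, mu=1000.0, lam=0.3):
    """Stage 1: Dirichlet prior with p(w|C). Stage 2: mix with p(w|U)."""
    stage1 = (doc_counts.get(w, 0) + mu * p_coll[w]) / (doc_len + mu)
    return (1 - lam) * stage1 + lam * p_user[w]

# Hypothetical toy document and background models (each sums to 1).
doc = "text mining text clustering".split()
doc_counts = Counter(doc)
p_coll = {"text": 0.3, "mining": 0.2, "clustering": 0.1, "food": 0.4}
p_user = {"text": 0.25, "mining": 0.25, "clustering": 0.25, "food": 0.25}

p_text = two_stage("text", doc_counts, len(doc), p_coll, p_user)
```

Setting `lam=0` recovers plain Dirichlet smoothing, which makes the role of the second stage easy to isolate when experimenting.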

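Estimating μ using leave-one-out
Leave each word wi out of its document in turn and predict it from the rest: P(w1|d-w1), P(w2|d-w2), ..., P(wn|d-wn)
Leave-one-out log-likelihood: l-1(μ|C) = Σd Σ{w in d} c(w,d) log [ (c(w,d) - 1 + μ p(w|C)) / (|d| - 1 + μ) ]
Maximum likelihood estimator: choose the μ that maximizes l-1(μ|C), found with Newton’s method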
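The leave-one-out procedure can be sketched as below, assuming the standard objective in which each occurrence of a word is predicted from its document with that one occurrence removed, under Dirichlet smoothing. The toy corpus, function names, starting value, and the clamp that keeps μ positive are assumptions added here, not details from the slides.

```python
import math
from collections import Counter

def loo_log_likelihood(mu, docs, p_coll):
    """Sum over docs and words of c(w,d) * log((c-1 + mu*p(w|C)) / (|d|-1 + mu))."""
    total = 0.0
    for counts in docs:
        doc_len = sum(counts.values())
        for w, c in counts.items():
            total += c * math.log((c - 1 + mu * p_coll[w]) / (doc_len - 1 + mu))
    return total

def loo_gradient(mu, docs, p_coll):
    """First derivative of the leave-one-out log-likelihood with respect to mu."""
    g = 0.0
    for counts in docs:
        doc_len = sum(counts.values())
        for w, c in counts.items():
            p = p_coll[w]
            g += c * (p / (c - 1 + mu * p) - 1.0 / (doc_len - 1 + mu))
    return g

def loo_hessian(mu, docs, p_coll):
    """Second derivative with respect to mu."""
    h = 0.0
    for counts in docs:
        doc_len = sum(counts.values())
        for w, c in counts.items():
            p = p_coll[w]
            h += c * (-(p / (c - 1 + mu * p)) ** 2 + 1.0 / (doc_len - 1 + mu) ** 2)
    return h

def estimate_mu(docs, p_coll, mu0=1.0, iters=50, tol=1e-8):
    """Newton iterations mu <- mu - g/h, clamped to stay positive (a safeguard)."""
    mu = mu0
    for _ in range(iters):
        g = loo_gradient(mu, docs, p_coll)
        h = loo_hessian(mu, docs, p_coll)
        if h == 0.0:
            break
        new_mu = max(mu - g / h, 1e-6)
        if abs(new_mu - mu) < tol:
            mu = new_mu
            break
        mu = new_mu
    return mu

# Hypothetical toy corpus; the collection model is its relative frequencies.
raw_docs = ["text mining text clustering",
            "food nutrition healthy diet food",
            "data mining algorithms text"]
docs = [Counter(d.split()) for d in raw_docs]
collection = Counter()
for counts in docs:
    collection.update(counts)
n_tokens = sum(collection.values())
p_coll = {w: c / n_tokens for w, c in collection.items()}

mu_hat = estimate_mu(docs, p_coll)
```

On a corpus this small the estimate is not meaningful and plain Newton steps are not guaranteed to converge; the sketch is only meant to show the shape of the computation.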

Automatic 2-stage results ≈ Optimal 1-stage results
Average precision (3 DB’s + 4 query types, 150 topics)

Acknowledgement
Many thanks to ChengXiang Zhai, who generously shared his slides on the language modeling approach to information retrieval.
