
Language Models for Information Retrieval


1 Language Models for Information Retrieval
Andy Luong and Nikita Sudan

2 Outline
Language Model
Types of Language Models
Query Likelihood Model
Smoothing
Evaluation
Comparison with other approaches

3 Language Model A language model is a function that puts a probability measure over strings drawn from some vocabulary.

4 Language Models
Score with P(q|Md), the probability of the query under the document's language model, instead of P(R=1|q,d), the probability of relevance given query and document. A traditional generative model of a language, of the kind familiar from formal language theory, can be used either to recognize or to generate strings. The full set of strings that can be generated is called the language of the automaton.

5 Example Doc1: “frog said that toad likes frog” Doc2: “toad likes frog”
        frog   said   that   toad   likes   STOP
M1      1/3    1/6    1/6    1/6    1/6     .2
M2      1/3    0      0      1/3    1/3     .2

6 Example Continued
q = "frog likes toad"
P(q | M1) = (1/3)*(1/6)*(1/6)*0.8*0.8*0.2 ≈ .0012
P(q | M2) = (1/3)*(1/3)*(1/3)*0.8*0.8*0.2 ≈ .0047
P(q | M1) < P(q | M2)
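To make the computation concrete, here is a minimal Python sketch of slides 5-6, assuming whitespace tokenization and the fixed STOP probability of 0.2 used above (the function names are ours, not from the slides):

```python
from collections import Counter

def unigram_model(doc):
    """MLE unigram model: P(t | M) = count(t) / document length."""
    tokens = doc.split()
    return {t: c / len(tokens) for t, c in Counter(tokens).items()}

def query_likelihood(query, model, p_stop=0.2):
    """P(q | M): each term's probability, times 'continue' (1 - p_stop)
    between terms and 'stop' (p_stop) after the last term."""
    tokens = query.split()
    p = 1.0
    for i, t in enumerate(tokens):
        p *= model.get(t, 0.0)  # unseen term => 0 (smoothing comes later)
        p *= p_stop if i == len(tokens) - 1 else 1 - p_stop
    return p

m1 = unigram_model("frog said that toad likes frog")
m2 = unigram_model("toad likes frog")
q = "frog likes toad"
print(query_likelihood(q, m1))  # (1/3)(1/6)(1/6)(0.8)(0.8)(0.2) ≈ 0.0012
print(query_likelihood(q, m2))  # (1/3)(1/3)(1/3)(0.8)(0.8)(0.2) ≈ 0.0047
```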

7 Types of Language Models
Chain rule:
P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t1 t2) P(t4 | t1 t2 t3)
Unigram LM:
P(t1 t2 t3 t4) = P(t1) P(t2) P(t3) P(t4)
Bigram LM:
P(t1 t2 t3 t4) = P(t1) P(t2 | t1) P(t3 | t2) P(t4 | t3)
For IR purposes, we use unigrams. Why?
Structure is not as important
A single document makes it difficult to gather sufficient training data
Sparseness outweighs complexity
(A sketch contrasting unigram and bigram estimates follows this list.)
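As a sketch of how the factorizations differ in practice, the fragment below estimates both a unigram and a bigram model from Doc1 by MLE. This is a simplification: a real bigram model would need start/end markers and smoothing.

```python
from collections import Counter

tokens = "frog said that toad likes frog".split()
unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))
n = len(tokens)

def p_unigram(query):
    # P(t1..tk) = P(t1) * P(t2) * ... * P(tk)
    p = 1.0
    for t in query.split():
        p *= unigrams[t] / n
    return p

def p_bigram(query):
    # P(t1..tk) = P(t1) * P(t2 | t1) * ... * P(tk | tk-1)
    q = query.split()
    p = unigrams[q[0]] / n
    for prev, cur in zip(q, q[1:]):
        p *= bigrams[(prev, cur)] / unigrams[prev]
    return p

print(p_unigram("toad likes"))  # (1/6)(1/6) ≈ 0.028
print(p_bigram("toad likes"))   # (1/6)(1/1) ≈ 0.167: "likes" always follows "toad"
```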

8 Multinomial distribution
Order constraint vs. frequency: bag of words == unigram model, so only term frequencies matter, not order. Under a multinomial model, the probability of generating document d is
P(d) = [ Ld! / (tf(t1,d)! tf(t2,d)! ... tf(tM,d)!) ] P(t1)^tf(t1,d) P(t2)^tf(t2,d) ... P(tM)^tf(tM,d)
where M is the size of the term vocabulary. Why is the multinomial coefficient not important? Because order doesn't matter: the coefficient is constant for a given bag of words and can be dropped for ranking. Pretend d is only a representative sample of text drawn from a model distribution; we then estimate a language model from this sample, and use that model to calculate the probability of observing any word sequence, such as the query.

9 Query Likelihood Model
Construct a language model for each document. Our goal is to rank documents by P(d|q), where the probability of a document is interpreted as the likelihood that it is relevant to the query. By Bayes' rule, P(d|q) = P(q|d) P(d) / P(q). P(q) is the same for all documents and can thus be ignored. Similarly, P(d) is treated as uniform across all d, so it can also be ignored; however, a genuine prior could include criteria like authority, length, genre, newness, and the number of people who have previously read the document. Documents are therefore ranked by the probability that the query would be observed as a random sample from the respective document model. Treat each document as a separate class.

10 Query Likelihood Model
Infer an LM for each document
Estimate P(q | Md(i))
Rank documents by these probabilities (an end-to-end sketch follows this list)
Intuition about users: a user knows that certain terms will appear in a document of interest, and will pose a query that distinguishes those documents from the rest of the collection.
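The three steps above fit in a few lines. The sketch below ranks the two example documents with unsmoothed MLE models; the zeros this produces for unseen terms are what the smoothing of slide 12 addresses.

```python
from collections import Counter

docs = {"Doc1": "frog said that toad likes frog",
        "Doc2": "toad likes frog"}

# Step 1: infer an LM for each document
models = {d: {t: c / len(text.split())
              for t, c in Counter(text.split()).items()}
          for d, text in docs.items()}

# Step 2: estimate P(q | Md)
def score(query, model):
    p = 1.0
    for t in query.split():
        p *= model.get(t, 0.0)   # unsmoothed: an unseen term zeroes the score
    return p

# Step 3: rank documents by these probabilities
q = "frog likes toad"
for d in sorted(docs, key=lambda d: score(q, models[d]), reverse=True):
    print(d, score(q, models[d]))   # Doc2 ≈ 0.037, Doc1 ≈ 0.0093
```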

11 MLE
With maximum likelihood estimation, each term's probability is its relative frequency in the document:
P(t | Md) = tf(t,d) / Ld
P(q | Md) = product over t in q of tf(t,d) / Ld
where tf(t,d) is the number of occurrences of term t in d and Ld is the length of d in tokens.

12 Smoothing
Basic intuition: a new or previously unseen word in the document gives P(t | Md) = 0, and any zero factor makes P(q | Md) = 0.
Why else should we smooth? Smoothing also tempers the weight placed on very uncommon (but observed) words, not just on zero probabilities.

13 Smoothing Continued
What should we do? Add 1, 1/2, or epsilon to every count, or use collection information.
Non-occurring term probability bound: an unseen term should get a small but non-zero probability, e.g. tied to its frequency in the collection.
Linear interpolation language model (Jelinek-Mercer smoothing):
P(t | d) = (1 − λ) P(t | Md) + λ P(t | Mc)
where Mc is the collection language model. A large λ means more emphasis on the collection, i.e. more smoothing. We can also imagine varying λ with document size: a small document may need more smoothing, while an infinitely long document would need none, since its MLE would already be perfect.

14 Example Doc1: “frog said that toad likes frog” Doc2: “toad likes frog”
        frog   said   that   toad   likes
M1      1/3    1/6    1/6    1/6    1/6
M2      1/3    0      0      1/3    1/3
C       1/3    1/9    1/9    2/9    2/9

(C is the collection model built from Doc1 + Doc2, 9 tokens in total.)

15 Example Continued q = “frog said” λ = ½
P(q | M1) = [(1/3 + 1/3)*(1/2)] * [(1/6 + 1/9)*(1/2)] = 5/108 ≈ .046
P(q | M2) = [(1/3 + 1/3)*(1/2)] * [(0 + 1/9)*(1/2)] = 1/54 ≈ .019
P(q | M1) > P(q | M2)
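A minimal sketch of the interpolated score from slides 13-15, with λ = 1/2 as above (the helper names are ours):

```python
from collections import Counter

def mle(text):
    toks = text.split()
    return {t: c / len(toks) for t, c in Counter(toks).items()}

doc1 = "frog said that toad likes frog"
doc2 = "toad likes frog"
m1, m2 = mle(doc1), mle(doc2)
mc = mle(doc1 + " " + doc2)          # collection model C over Doc1 + Doc2

def smoothed_likelihood(query, md, lam=0.5):
    """P(q | d) = prod_t [ (1 - lam) * P(t | Md) + lam * P(t | Mc) ]."""
    p = 1.0
    for t in query.split():
        p *= (1 - lam) * md.get(t, 0.0) + lam * mc.get(t, 0.0)
    return p

q = "frog said"
print(smoothed_likelihood(q, m1))  # 5/108 ≈ 0.046
print(smoothed_likelihood(q, m2))  # 1/54 ≈ 0.019
```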

16 Evaluation
Precision = |relevant documents ∩ retrieved documents| / |retrieved documents|
Recall = |relevant documents ∩ retrieved documents| / |relevant documents|
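Both measures are simple set operations; a minimal sketch (the document IDs and judgments are made up for illustration):

```python
relevant = {"d1", "d3", "d5", "d7"}     # hypothetical judged-relevant docs
retrieved = {"d1", "d2", "d3", "d4"}    # hypothetical system output

hits = relevant & retrieved             # relevant ∩ retrieved
precision = len(hits) / len(retrieved)  # 2/4 = 0.5
recall = len(hits) / len(relevant)      # 2/4 = 0.5
print(precision, recall)
```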

17 Tf-Idf
The importance of a term increases proportionally to the number of times it appears in the document, but is offset by the frequency of the word in the corpus.
tf: the numerator is the number of occurrences of the term in the document; the denominator is the total number of term occurrences in the document.
idf: the numerator is the total number of documents; the denominator is the number of documents in which the term occurs (usually inside a logarithm).
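Putting the two ratios together, here is a sketch of one common tf-idf variant, assuming length-normalized tf and a log idf (several other weightings exist):

```python
import math
from collections import Counter

docs = ["frog said that toad likes frog",
        "toad likes frog"]
tokenized = [d.split() for d in docs]
N = len(docs)

def tf(term, tokens):
    # occurrences of the term / total term occurrences in the document
    return tokens.count(term) / len(tokens)

def idf(term):
    # log(total documents / documents containing the term)
    df = sum(1 for toks in tokenized if term in toks)
    return math.log(N / df) if df else 0.0

print(tf("frog", tokenized[0]) * idf("frog"))  # 1/3 * log(2/2) = 0: in every doc
print(tf("said", tokenized[0]) * idf("said"))  # 1/6 * log(2/1) ≈ 0.116
```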

18 Ponte and Croft’s Experiments
They use a multivariate Bernoulli model (each term is either present or absent in a document) instead of a mixture of two multinomials. Evaluated on TREC topics over TREC disks 2 and 3.

19 Pros and Cons “Mathematically precise, conceptually simple, computationally tractable and intuitively appealing.” However, relevance is not explicitly captured. The LM approach assumes that documents and expressions of information needs are objects of the same type, and assesses their match by importing the tools and methods of language modeling from speech and natural language processing.

20 Query vs. Document Model
Why is query likelihood more appealing than document likelihood? Because there is far more data available in the document than in the query, so document models can be estimated more reliably. Three variants: (a) query likelihood, (b) document likelihood, (c) model comparison.

21 Kullback-Leibler Divergence
Model comparison via KL divergence has been shown to give better results than query or document likelihood alone:
R(d; q) = KL(Md || Mq) = sum over t of P(t | Md) log [ P(t | Md) / P(t | Mq) ]
This measures how bad Mq is at modeling Md, i.e. the risk incurred by using Mq in its place; a large divergence means the models don't agree. Caveats: KL divergence is asymmetric, and its scores are not comparable across queries.
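A sketch of the divergence computation. Both models are interpolated with a collection model first so every log is defined; using the concatenated texts as a stand-in collection and λ = 1/2 are our assumptions, not from the slides.

```python
import math
from collections import Counter

def mle(text):
    toks = text.split()
    return {t: c / len(toks) for t, c in Counter(toks).items()}

doc = "frog said that toad likes frog"
query = "frog likes toad"
collection = mle(doc + " " + query)  # stand-in collection model (assumption)
vocab = set(collection)

def smooth(model, lam=0.5):
    # interpolate with the collection model so no term has zero probability
    return {t: (1 - lam) * model.get(t, 0.0) + lam * collection[t] for t in vocab}

md, mq = smooth(mle(doc)), smooth(mle(query))

# KL(Md || Mq): how badly Mq models the document model Md
kl = sum(md[t] * math.log(md[t] / mq[t]) for t in vocab)
print(kl)  # > 0; asymmetric, so KL(Mq || Md) would generally differ
```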

22 Thank you.

23 Questions?

