Published by Collin Peters. Modified over 8 years ago.
1
Probabilistic Models for Ranking
Some of these slides are based on the Stanford IR course slides at http://www.stanford.edu/class/cs276/
2
An Ideal Retrieval Model
Ideally, a retrieval model should:
–Give a set of assumptions
–Present a method of ranked retrieval
–Prove that, under the given assumptions, the proposed ranking method achieves better effectiveness than any other approach
Does vector space ranking do this?
3
Probability Ranking Principle (Robertson 1977)
“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data”
4
Probability Ranking Principle
Ranking by probability of relevance can be shown to maximize precision at any given rank (e.g., precision at 10, precision at 20, etc.)
Problem: How do we estimate the probability of relevance?
5
Information Retrieval as Classification
Assume that relevance is binary
–A document is either relevant or non-relevant
If we can compute the probability of a document being classified as relevant, then classify D as relevant if P(R | D) > P(NR | D) (the Bayes decision rule), where
–P(R | D) is the probability of the document being relevant, given the document representation
–P(NR | D) is the probability of the document being non-relevant, given the document representation
6
Computing Probabilities
It is not clear how to compute P(R | D) directly
Given information about the relevant set, we may be able to compute P(D | R)
–If we know how often specific words appear in the relevant set, then given a new document we can calculate P(D | R)
Example:
–Probability of “huji” in the relevant set is 0.02
–Probability of “cs” in the relevant set is 0.03
–Given a new document containing “huji” and “cs”, what is the probability of it being relevant?
7
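Classifying Relevant Documents
Use Bayes’ rule:
$$P(R \mid D) = \frac{P(D \mid R)\,P(R)}{P(D)}$$
Then, P(R | D) > P(NR | D) iff
$$P(D \mid R)\,P(R) > P(D \mid NR)\,P(NR)$$
iff
$$\frac{P(D \mid R)}{P(D \mid NR)} > \frac{P(NR)}{P(R)}$$
The right-hand side does not depend on D, so we can rank by the likelihood ratio $P(D \mid R) / P(D \mid NR)$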
8
Calculating the Likelihood Ratio (Binary Independence Model)
Represent documents as sets of words
–$D = (d_1, \ldots, d_n)$ where $d_i = 1$ if term i is in the document, and $d_i = 0$ otherwise
Define the relevant and non-relevant sets using word probabilities
–Assume term independence
9
Calculating the Likelihood Ratio (cont)
Let $p_i$ be the probability that term $t_i$ appears in a document from the relevant set
Let $s_i$ be the probability that term $t_i$ appears in a document from the non-relevant set
–Is $p_i + s_i = 1$ necessarily?
10
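Calculating the Likelihood Ratio (cont)
Under term independence, with $p_i$ and $s_i$ the probabilities that term i appears in a relevant and a non-relevant document respectively:
$$\frac{P(D \mid R)}{P(D \mid NR)} = \prod_{i:\, d_i = 1} \frac{p_i}{s_i} \cdot \prod_{i:\, d_i = 0} \frac{1 - p_i}{1 - s_i} = \prod_{i:\, d_i = 1} \frac{p_i (1 - s_i)}{s_i (1 - p_i)} \cdot \prod_{i} \frac{1 - p_i}{1 - s_i}$$
The right-hand product runs over all terms, so it is constant for all documents
Ranking by the likelihood ratio is therefore the same as ranking by
$$\prod_{i:\, d_i = 1} \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$
Equivalently, ranking by
$$\sum_{i:\, d_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$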
11
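Calculating the Likelihood Ratio (cont)
Where did the query go?
–Assume that terms not in the query have the same probability in relevant and non-relevant documents, i.e., $p_i = s_i$, so their log term is zero
–The sum is then over terms that appear in both the query and the document:
$$\sum_{i:\, d_i = q_i = 1} \log \frac{p_i (1 - s_i)}{s_i (1 - p_i)}$$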
12
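Calculating the Likelihood Ratio (cont)
How do we estimate $p_i$ and $s_i$ given no additional information?
–Choose $p_i$ as a constant, say 0.5
–Estimate $s_i$ from term occurrence in the whole collection, since most documents are not relevant: $s_i \approx n_i / N$, where $n_i$ is the number of documents containing $t_i$ and N is the total number of documents
–With these choices, the weight of a matching term becomes
$$\log \frac{0.5\,(1 - n_i/N)}{(n_i/N)\,(1 - 0.5)} = \log \frac{N - n_i}{n_i}$$
which behaves like an inverse document frequency weight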
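The estimates above can be turned into a small scoring routine. The following is only a sketch under the stated assumptions (constant p_i = 0.5 and s_i = n_i / N); the function name `bim_score` and the toy document frequencies are made up for illustration and are not part of the slides.

```python
import math

def bim_score(query_terms, doc_terms, doc_freq, num_docs, p=0.5):
    """Binary Independence Model score with constant p_i and s_i = n_i / N.

    query_terms, doc_terms: iterables of terms (treated as sets).
    doc_freq: hypothetical mapping term -> number of documents containing it.
    """
    score = 0.0
    # Only terms present in both the query and the document contribute.
    for term in set(query_terms) & set(doc_terms):
        s = doc_freq[term] / num_docs
        # The log-odds weight is undefined when s is 0 or 1, so skip those terms.
        if 0.0 < s < 1.0:
            score += math.log(p * (1 - s) / (s * (1 - p)))
    return score

# Toy numbers (assumed): 1000 documents, "huji" in 10 of them, "cs" in 100.
toy_doc_freq = {"huji": 10, "cs": 100}
toy_score = bim_score(["huji", "cs"], ["huji", "cs", "dept"], toy_doc_freq, 1000)
```

With p = 0.5 each matching term contributes log((N - n_i) / n_i), so rare terms count for more than common ones.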
13
Does this Work Well?
Not really
–It does not take term frequencies into consideration
–It does not do length normalization
But it is the basis for one of the most effective ranking algorithms, called BM25
14
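BM25 Ranking
$$\mathrm{score}(D, Q) = \sum_{i=1}^{n} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}, \qquad \mathrm{IDF}(q_i) = \log \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$$
Where:
–$q_i$ is the i-th query term
–$f(q_i, D)$ is the term frequency of $q_i$ in D
–$n(q_i)$ is the number of documents containing $q_i$
–N is the number of documents
–$k_1$ is a free parameter, usually $k_1 \in [1.2, 2.0]$
–b is a free parameter, usually b = 0.75
–|D| is the number of words in D
–avgdl is the average document length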
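As a rough illustration of how these quantities combine, here is a minimal sketch of a BM25 scorer. It assumes the commonly used Okapi form, with IDF(q_i) = log((N - n(q_i) + 0.5) / (n(q_i) + 0.5)); the function name, argument names, and toy numbers are illustrative assumptions rather than anything given on the slide.

```python
import math

def bm25_score(query_terms, doc_tokens, doc_freq, num_docs, avgdl, k1=1.2, b=0.75):
    """Sketch of Okapi BM25 for one document.

    doc_tokens: list of tokens in the document D.
    doc_freq: hypothetical mapping term -> n(q_i), number of documents containing it.
    """
    doc_len = len(doc_tokens)
    score = 0.0
    for q in query_terms:
        f = doc_tokens.count(q)  # term frequency f(q_i, D)
        if f == 0:
            continue  # terms absent from the document contribute nothing
        n = doc_freq.get(q, 0)
        idf = math.log((num_docs - n + 0.5) / (n + 0.5))
        # Length normalization: longer-than-average documents are penalized.
        norm = f + k1 * (1 - b + b * doc_len / avgdl)
        score += idf * f * (k1 + 1) / norm
    return score

# Toy numbers (assumed): 10 documents, "frog" appears in 2 of them.
toy_bm25 = bm25_score(["frog"], ["frog", "toad", "frog"], {"frog": 2},
                      num_docs=10, avgdl=3.0)
```

Note that this IDF variant becomes negative for terms occurring in more than half of the documents; practical systems often clamp or modify it.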
15
Ranking with Language Models
16
How Do We Search?
When we search, we envision a document that satisfies our information need
We then formulate a query by “guessing” which words are likely to be representative of that document
Idea: A document is a good match if it is likely to “generate” the query
–Formalize the notion of generation with language models
17
Language Models: Overview
A statistical language model assigns a probability to a sequence of m words by means of a probability distribution
–A language model is associated with each document in a collection
–Given a query Q as input, retrieved documents are ranked by the probability that the document's language model would generate the terms of the query
18
What is a deterministic language model?
We can view a finite state automaton as a deterministic language model.
Can generate: “I wish”, “I wish I wish”, “I wish I wish I wish”, ...
Cannot generate: “wish I wish” or “I wish I”.
Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.
19
What is a deterministic language model?
A document corresponds to a specific deterministic automaton, e.g.,
[Figure: linear automaton frog → says → frog → likes → toad]
corresponds to the document “frog says frog likes toad”.
When writing a document, there are many different ways to state the same information.
We would like a way to model the set of documents that could have been written to express a given idea.
20
Probabilistic Language Model
We now associate each node with a probability distribution over generating different terms
We also give each node a probability of terminating, so every node is possibly a final state
[Figure: probabilistic automaton over the terms frog, says, likes, toad, with generation and termination probabilities on its nodes and edges]
21
Probabilistic Language Model
The automaton on the previous slide is a rather complicated way to describe an infinite set of documents, each with a probability
In general, a language model provides a probability distribution over strings from some vocabulary V such that
$$\sum_{s \in V^*} P(s) = 1$$
22
Probabilistic Language Model
For ranking, it is often sufficient to consider only unigram models, in which the probability of a word is independent of the previous words
In a unigram model, we are given a distribution over terms such that
$$\sum_{t \in V} P(t) = 1$$
Together with a probability of stopping, this gives a probability for any sequence of words, under a bag-of-words model.
23
A probabilistic language model
This is a one-state probabilistic finite-state automaton, called a unigram language model
STOP is not a word, but a special symbol indicating that the automaton stops.
Example: string = “frog said that toad likes frog STOP”
P(string) = (0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01) · (0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.2)
We will usually omit the second factor (the continue and stop probabilities): we are interested in comparing the likelihood of a query under different document models, and this factor is the same for all of them, assuming a constant stop probability
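The arithmetic of this example can be sketched in a few lines. The term probabilities below are only the ones that appear in the example product (the full distribution of the model is not reproduced here), and the function follows the slide's own arithmetic: one continue factor of 0.8 per generated word, then one stop factor of 0.2. The names are hypothetical.

```python
import math

def string_probability(tokens, term_probs, stop_prob):
    """Probability of a token sequence under a one-state unigram model.

    Follows the arithmetic of the example: a (1 - stop_prob) factor for each
    generated word, followed by a single stop_prob factor for STOP.
    """
    word_part = 1.0
    for t in tokens:
        # Terms missing from the (partial) table get probability 0.
        word_part *= term_probs.get(t, 0.0)
    stop_part = (1 - stop_prob) ** len(tokens) * stop_prob
    return word_part * stop_part

# Probabilities read off the example product (partial table, assumed).
example_probs = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02}
example_tokens = "frog said that toad likes frog".split()
p_string = string_probability(example_tokens, example_probs, stop_prob=0.2)
```

Because the continue and stop factors depend only on the length of the string, they cancel when the same query is compared across document models, which is why they are usually dropped.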
25
Each document has its own language model
Example: query = “frog said that toad likes frog STOP”
$P(\text{query} \mid M_{d_1}) = 0.01 \cdot 0.03 \cdot 0.04 \cdot 0.01 \cdot 0.02 \cdot 0.01 = 0.000000000024 = 2.4 \cdot 10^{-11}$
$P(\text{query} \mid M_{d_2}) = 0.01 \cdot 0.03 \cdot 0.05 \cdot 0.02 \cdot 0.02 \cdot 0.01 = 0.00000000006 = 6 \cdot 10^{-11}$
$P(\text{query} \mid M_{d_1}) < P(\text{query} \mid M_{d_2})$: $d_2$ is “more relevant” to this string
26
Using language models in IR
Each document is treated as (the basis for) a language model.
Given a query q, rank documents based on P(d | q):
$$P(d \mid q) = \frac{P(q \mid d)\,P(d)}{P(q)}$$
P(q) is the same for all documents, so ignore it
P(d) is the prior, often treated as the same for all d
–But we can give a higher prior to “high-quality” documents, e.g., those with high PageRank (not done here).
P(q | d) is the probability of q given d.
So, with a uniform prior, ranking documents by P(q | d) is equivalent to ranking them by P(d | q).
27
What next?
In the LM approach to IR, we attempt to model the query generation process.
We then rank documents by the probability that the document's language model would generate the query.
Equivalently, we rank documents by the probability that the query would be observed as a random sample from the respective document model.
That is, we rank according to P(q | d).
Next: how do we compute P(q | d)?
28
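How to compute P(q | d)
We will make a conditional independence assumption:
$$P(q \mid M_d) = P(\langle t_1, \ldots, t_{|q|} \rangle \mid M_d) = \prod_{1 \le k \le |q|} P(t_k \mid M_d)$$
(|q|: length of q; $t_k$: the token occurring at position k in q)
This is equivalent to:
$$P(q \mid M_d) = \prod_{\text{distinct term } t \text{ in } q} P(t \mid M_d)^{\mathrm{tf}_{t,q}}$$
$\mathrm{tf}_{t,q}$: term frequency (number of occurrences) of t in q
This is the multinomial model (omitting the constant factor)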
29
Parameter estimation
Missing piece: Where do the parameters $P(t \mid M_d)$ come from?
Start with maximum likelihood estimates:
$$\hat{P}(t \mid M_d) = \frac{\mathrm{tf}_{t,d}}{|d|}$$
(|d|: length of d; $\mathrm{tf}_{t,d}$: number of occurrences of t in d)
Intuitively, this is the probability distribution that makes d most likely.
30
Parameter estimation: Example
Start with maximum likelihood estimates (|d|: length of d; $\mathrm{tf}_{t,d}$: number of occurrences of t in d)
d = “I love love love to learn Hebrew in Hebrew University”
$P(\text{love} \mid M_d) = 3 / 10 = 0.3$
$P(\text{Hebrew} \mid M_d) = 2 / 10 = 0.2$
$P(t \mid M_d) = 0.1$ for all other terms t in d
$P(t' \mid M_d) = 0.0$ for all terms t’ not in d
What is the likelihood of the following queries?
P(“Hebrew University” | d) = ?
P(“Hebrew University Jerusalem” | d) = ?
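One way to check the two queries is to compute the maximum likelihood model directly from the document. This is a small sketch, assuming whitespace tokenization and case-sensitive matching; the helper names `mle_model` and `query_likelihood` are made up for illustration.

```python
import math
from collections import Counter

def mle_model(doc_tokens):
    """Maximum likelihood unigram model: P(t | M_d) = tf_{t,d} / |d|."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {t: c / total for t, c in counts.items()}

def query_likelihood(query_tokens, model):
    """Product of per-token probabilities; unseen terms get probability 0."""
    p = 1.0
    for t in query_tokens:
        p *= model.get(t, 0.0)
    return p

doc = "I love love love to learn Hebrew in Hebrew University".split()
model = mle_model(doc)

p_two = query_likelihood("Hebrew University".split(), model)              # 0.2 * 0.1
p_three = query_likelihood("Hebrew University Jerusalem".split(), model)  # contains an unseen term
```

Here `p_two` is 0.2 · 0.1 = 0.02 (up to floating-point rounding), while `p_three` is exactly 0 because “Jerusalem” never occurs in d.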
31
Parameter estimation
What happens if a document does not contain one of the query words?
What will P(q | d) be? Is this good?
Do you see other problems with the current formulation?
32
Parameter estimation
A single t with $P(t \mid M_d) = 0$ will make $P(q \mid M_d) = \prod_k P(t_k \mid M_d)$ zero.
This gives a single term “veto power”, i.e., conjunctive semantics for the query.
For example, for the query “Michael Jackson top hits”, a document about “top songs” (but not using the word “hits”) would have $P(\text{hits} \mid M_d) = 0$, and therefore $P(q \mid M_d) = 0$. Bad :~(
We need to smooth the estimates to avoid zeros.
33
Smoothing
Key intuition: A nonoccurring term is possible (even though it didn’t occur), but no more likely than would be expected by chance in the collection.
Notation: $M_c$: the collection model; $\mathrm{cf}_t$: the number of occurrences of t in the collection; $T = \sum_t \mathrm{cf}_t$: the total number of tokens in the collection.
We will use $\hat{P}(t \mid M_c) = \mathrm{cf}_t / T$ to “smooth” P(t | d) away from zero.
Smoothing is also good for other reasons. Why do you think it is helpful?
34
Mixture model
How do we combine $P(t \mid M_d)$ and $P(t \mid M_c)$?
One simple solution is to use a mixture model:
$$P(t \mid d) = \lambda P(t \mid M_d) + (1 - \lambda) P(t \mid M_c)$$
This mixes the probability from the document with the general collection frequency of the word.
How does our choice of λ affect the results?
–High value of λ: “conjunctive-like” search that tends to retrieve documents containing all query words.
–Low value of λ: more disjunctive, suitable for long queries
Correctly setting λ is very important for good performance.
35
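Mixture model: Summary
$$P(q \mid d) \propto \prod_{1 \le k \le |q|} \left( \lambda P(t_k \mid M_d) + (1 - \lambda) P(t_k \mid M_c) \right)$$
What we model: The user has a document in mind and generates the query from this document.
The equation represents the probability that the document the user had in mind was in fact this one.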
36
Example
Collection: documents $d_1$ and $d_2$
$d_1$: “Jackson was one of the most talented entertainers of all time”
$d_2$: “Michael Jackson anointed himself King of Pop”
Query q: “Michael Jackson”
Use the mixture model with λ = 1/2
Calculate $P(q \mid d_1)$ and $P(q \mid d_2)$
Which ranks higher?
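The exercise can be worked through with a short script. This is a sketch under explicit assumptions: whitespace tokenization, case-sensitive matching, the collection model estimated as cf_t / T over the two documents, and the mixture P(t | d) = λ P(t | M_d) + (1 − λ) P(t | M_c). The function name is hypothetical.

```python
import math
from collections import Counter

def smoothed_query_likelihood(query, doc, collection, lam=0.5):
    """P(q | d) under a linear mixture of document and collection models."""
    doc_counts = Counter(doc)
    coll_counts = Counter(collection)
    p = 1.0
    for t in query:
        p_doc = doc_counts[t] / len(doc)                # tf_{t,d} / |d|
        p_coll = coll_counts[t] / len(collection)       # cf_t / T
        p *= lam * p_doc + (1 - lam) * p_coll
    return p

d1 = "Jackson was one of the most talented entertainers of all time".split()
d2 = "Michael Jackson anointed himself King of Pop".split()
collection = d1 + d2            # T = 11 + 7 = 18 tokens
query = "Michael Jackson".split()

p_d1 = smoothed_query_likelihood(query, d1, collection)  # (1/36) * (10/99)
p_d2 = smoothed_query_likelihood(query, d2, collection)  # (25/252) * (8/63)
```

Working the fractions by hand gives P(q | d_1) = 5/1782 ≈ 0.003 and P(q | d_2) = 50/3969 ≈ 0.013, so d_2 ranks higher. Note that d_1 still receives a nonzero score even though it does not contain “Michael”.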
37
LMs vs. vector space model
LMs have some things in common with vector space models. How are they the same, and how are they different?
How do term frequency and inverse document frequency come into play in language models?
How would you define a bigram language model?
What are the advantages and disadvantages of a bigram language model compared with a unigram language model?