Probabilistic Models for Ranking

Some of these slides are based on Stanford IR Course slides at

An Ideal Retrieval Model

Ideally, a retrieval model should:
– Give a set of assumptions
– Present a method of ranked retrieval
– Prove that under the given assumptions the proposed method of ranking will achieve better effectiveness than any other approach

Does vector space ranking do this?

Probability Ranking Principle (Robertson 1977)

“If a reference retrieval system’s response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, the overall effectiveness of the system to its user will be the best that is obtainable on the basis of those data”

Probability Ranking Principle

Ranking by probability of relevance can be shown to maximize precision at any given rank (e.g., precision at 10, precision at 20, etc.).

Problem: How do we estimate the probability of relevance?

Information Retrieval as Classification (The Bayes Decision Rule)

Assume that relevance is binary:
– A document is either relevant or non-relevant

If we can compute the probability of a document being relevant, then classify D as relevant if P(R|D) > P(NR|D), where:
– P(R|D) is the probability of the document being relevant, given the document representation
– P(NR|D) is the probability of the document being non-relevant, given the document representation

Computing Probabilities

It is not clear how to compute P(R|D) directly. Given information about the relevant set, we may be able to compute P(D|R) instead:
– If we know how often specific words appear in the relevant set, then given a new document we can calculate P(D|R)

Example:
– The probability of “huji” in the relevant set is 0.02
– The probability of “cs” in the relevant set is 0.03
– Given a new document containing “huji” and “cs”, what is the probability of it being relevant? Assuming term independence, P(D|R) = 0.02 · 0.03 = 0.0006.
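A minimal sketch of this computation in Python, assuming term independence; the term probabilities and function names are illustrative, not from the original slides:

```python
# Sketch: P(D|R) under term independence, with made-up term
# probabilities estimated from a (hypothetical) known relevant set.

# P(term appears in a document | document is relevant)
p_term_given_relevant = {"huji": 0.02, "cs": 0.03}

def doc_likelihood_given_relevant(doc_terms):
    """P(D|R) as a product of per-term probabilities (independence assumption)."""
    likelihood = 1.0
    for term in doc_terms:
        likelihood *= p_term_given_relevant.get(term, 0.0)
    return likelihood

print(doc_likelihood_given_relevant(["huji", "cs"]))  # 0.02 * 0.03 = 0.0006
```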

Classifying Relevant Documents

Use Bayes’ rule:
P(R|D) = P(D|R) · P(R) / P(D)
P(NR|D) = P(D|NR) · P(NR) / P(D)

Then, P(R|D) > P(NR|D) iff
P(D|R) · P(R) > P(D|NR) · P(NR)
iff
P(D|R) / P(D|NR) > P(NR) / P(R)

Since the right-hand side does not depend on the document, we can rank by the likelihood ratio P(D|R) / P(D|NR).

Calculating the Likelihood Ratio (Binary Independence Model)

Represent documents as sets of words:
– D = (d_1, …, d_n), where d_i = 1 if term t_i is in the document and d_i = 0 otherwise

Define the relevant and non-relevant sets using word probabilities:
– Assume term independence

Calculating the Likelihood Ratio (cont.)

Let p_i be the probability that term t_i appears in a document from the relevant set.
Let s_i be the probability that term t_i appears in a document from the non-relevant set.
– Is p_i + s_i = 1 necessarily?

Calculating the Likelihood Ratio (cont.)

Under term independence:
P(D|R) / P(D|NR) = ∏_{i: d_i=1} (p_i / s_i) · ∏_{i: d_i=0} ((1 − p_i) / (1 − s_i))

This can be rewritten as:
∏_{i: d_i=1} [p_i (1 − s_i)] / [s_i (1 − p_i)] · ∏_{all i} (1 − p_i) / (1 − s_i)

The right-hand product is constant for all documents, so ranking by the likelihood ratio is the same as ranking by
∏_{i: d_i=1} [p_i (1 − s_i)] / [s_i (1 − p_i)]

Equivalently (taking logs), ranking by
Σ_{i: d_i=1} log([p_i (1 − s_i)] / [s_i (1 − p_i)])

Calculating the Likelihood Ratio (cont.)

Where did the query go?
– Assume that terms not appearing in the query have the same probability in relevant and non-relevant documents (p_i = s_i for such terms), so their factors cancel
– The sum is therefore only over terms appearing in both the query and the document

Calculating the Likelihood Ratio (cont.)

How do we estimate p_i and s_i given no additional information?
– Choose p_i as a constant, say 0.5
– Estimate s_i using term occurrence in the whole collection, since most documents are not relevant: s_i ≈ n_i / N, where n_i is the number of documents containing t_i and N is the total number of documents
– With these estimates, the per-term weight becomes log((1 − s_i) / s_i) = log((N − n_i) / n_i), which closely resembles the classic IDF weight
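A minimal sketch of this scoring scheme, assuming p_i = 0.5 and s_i = n_i / N as above (with p_i = 0.5, the p_i factor cancels out of the per-term weight); the corpus statistics here are made up for illustration:

```python
import math

def bim_score(doc_terms, query_terms, doc_freq, num_docs):
    """Binary Independence Model score: sum of log odds over terms
    present in both the query and the document, with p_i = 0.5 and
    s_i = n_i / N."""
    score = 0.0
    for term in set(query_terms) & set(doc_terms):
        n_i = doc_freq.get(term, 0)
        if 0 < n_i < num_docs:
            s_i = n_i / num_docs
            score += math.log((1 - s_i) / s_i)  # = log((N - n_i) / n_i)
    return score

# Usage: doc_freq maps each term to the number of documents containing it.
doc_freq = {"huji": 5, "cs": 50}
print(bim_score({"huji", "cs", "rank"}, ["huji", "cs"], doc_freq, 1000))
```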

Does this Work Well?

Not really:
– It does not take term frequencies into consideration
– It does not do length normalization

But it is the basis for one of the most effective ranking algorithms, called BM25.

BM25 Ranking

score(D, Q) = Σ_i IDF(q_i) · [f(q_i, D) · (k_1 + 1)] / [f(q_i, D) + k_1 · (1 − b + b · |D| / avgdl)]

where IDF(q_i) = log[(N − n(q_i) + 0.5) / (n(q_i) + 0.5)] and:
– q_i is the i-th query term
– f(q_i, D) is the term frequency of q_i in D
– n(q_i) is the number of documents containing q_i
– N is the number of documents
– k_1 is a parameter, usually k_1 ∈ [1.2, 2.0]
– b is a free parameter, usually b = 0.75
– |D| is the number of words in D
– avgdl is the average document length
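A minimal, self-contained sketch of BM25 scoring following the formula above; the toy corpus and function name are illustrative:

```python
import math

def bm25_score(query, doc, corpus, k1=1.5, b=0.75):
    """Score one document against a query with Okapi BM25.
    `doc` is a list of tokens; `corpus` is a list of such documents."""
    N = len(corpus)
    avgdl = sum(len(d) for d in corpus) / N
    score = 0.0
    for term in query:
        n_q = sum(1 for d in corpus if term in d)      # document frequency
        if n_q == 0:
            continue
        idf = math.log((N - n_q + 0.5) / (n_q + 0.5))  # BM25's IDF variant
        tf = doc.count(term)                           # term frequency in doc
        denom = tf + k1 * (1 - b + b * len(doc) / avgdl)
        score += idf * tf * (k1 + 1) / denom
    return score

corpus = [
    "frog says frog likes toad".split(),
    "toad likes the pond".split(),
    "the frog sat on a log".split(),
]
for d in corpus:
    print(bm25_score(["frog", "toad"], d, corpus), d)
```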

Ranking with Language Models

How Do We Search?

When we search, we envision in our mind a document that satisfies our information need. We then formulate a query by “guessing” what words are likely to be representative of that document.

Idea: A document is a good match if it is likely to “generate” the query.
– Formalize the notion of generation with language models

Language Models: Overview

A statistical language model assigns a probability to a sequence of m words by means of a probability distribution.
– A language model is associated with each document in a collection.
– Given a query Q as input, retrieved documents are ranked based on the probability that the document’s language model would generate the terms of the query.

What is a deterministic language model?

We can view a finite state automaton as a deterministic language model.
– Can generate: “I wish I wish I wish I wish …”
– Cannot generate: “wish I wish” or “I wish I”

Our basic model: each document was generated by a different automaton like this, except that these automata are probabilistic.

What is a deterministic language model? (cont.)

A document corresponds to a specific deterministic automaton, e.g., the chain frog → says → frog → likes → toad, which corresponds to the document “frog says frog likes toad”.

When writing a document, there are many different ways to state the same information. We would like a way to model the set of documents that could have been written to formulate the given idea.

Probabilistic Language Model

We now associate each node with a probability distribution over generating different terms. We also give a probability of terminating, so every node is possibly a final state.
(The figure shows the same frog → says → frog → likes → toad automaton as before, now with probabilities attached.)

Probabilistic Language Model (cont.)

The automaton on the previous slide is a rather complicated way to describe an infinite set of documents, each with a probability.

In general, a language model provides a probability distribution over strings s from some vocabulary V such that
Σ_s P(s) = 1

Probabilistic Language Model (cont.)

For ranking, it is often sufficient to consider only unigram models, in which the probability of a word is independent of previous words.

In a unigram model, we are provided with a distribution over terms such that
Σ_{t ∈ V} P(t) = 1

Given a probability of stopping, this provides us with a probability over word sequences, using a bag-of-words model.

A probabilistic language model

This is a one-state probabilistic finite-state automaton, called a unigram language model. STOP is not a word, but a special symbol indicating that the automaton stops.

Example: string = “frog said that toad likes frog STOP”
P(string) = (0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01) · (0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.2)

A probabilistic language model (cont.)

We will usually omit the stop-probability factor (0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.8 · 0.2), since we are interested in comparing the likelihood of the same query under different document models, and this factor is constant, assuming a constant stop probability.

There are different language models for each document

Example: query = “frog said that toad likes frog STOP”
P(query|M_d1) = 0.01 · 0.03 · 0.04 · 0.01 · 0.02 · 0.01 = 2.4 · 10^-11
P(query|M_d2) = 0.01 · 0.03 · 0.05 · 0.02 · 0.02 · 0.01 = 6 · 10^-11
P(query|M_d1) < P(query|M_d2): d_2 is “more relevant” to this string
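A quick sketch verifying these two likelihoods, assuming the factors above correspond to the query terms in order (so the two models differ only on “that” and “toad”); the stop probability is omitted as discussed:

```python
# Per-term probabilities under each document's unigram model
# (values as given on the slide; stop probability omitted).
model_d1 = {"frog": 0.01, "said": 0.03, "that": 0.04, "toad": 0.01, "likes": 0.02}
model_d2 = {"frog": 0.01, "said": 0.03, "that": 0.05, "toad": 0.02, "likes": 0.02}

query = "frog said that toad likes frog".split()

def likelihood(model, terms):
    """P(query|M_d) as a product of unigram probabilities."""
    prob = 1.0
    for t in terms:
        prob *= model[t]
    return prob

print(likelihood(model_d1, query))  # 2.4e-11
print(likelihood(model_d2, query))  # 6.0e-11
```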

Using language models in IR

 Each document is treated as (the basis for) a language model.
 Given a query q, rank documents based on P(d|q):
P(d|q) = P(q|d) · P(d) / P(q)
 P(q) is the same for all documents, so we can ignore it.
 P(d) is the prior, often treated as the same for all d.
  But we can give a prior to “high-quality” documents, e.g., those with high PageRank (not done here).
 P(q|d) is the probability of q given d.
 So to rank documents according to relevance to q, ranking according to P(q|d) and P(d|q) is equivalent.

What next?

 In the LM approach to IR, we attempt to model the query generation process.
 Then we rank documents by the probability that a document’s language model would generate the query.
 Equivalently, we rank documents by the probability that the query would be observed as a random sample from the respective document model.
 That is, we rank according to P(q|d).
 Next: how do we compute P(q|d)?

How to compute P(q|d)

 We make a conditional independence assumption (|q|: length of q; t_k: the token occurring at position k in q):
P(q|M_d) = ∏_{1 ≤ k ≤ |q|} P(t_k|M_d)
 This is equivalent to:
P(q|M_d) = ∏_{distinct t in q} P(t|M_d)^tf_{t,q}
 tf_{t,q}: term frequency (# occurrences) of t in q
 This is a multinomial model (omitting the constant multinomial coefficient).

Parameter estimation

 Missing piece: Where do the parameters P(t|M_d) come from?
 Start with maximum likelihood estimates (|d|: length of d; tf_{t,d}: # occurrences of t in d):
P(t|M_d) = tf_{t,d} / |d|
 Intuitively, this is the probability distribution that makes d most likely.

Parameter estimation: Example

 Maximum likelihood estimates: P(t|M_d) = tf_{t,d} / |d|
 d = “I love love love to learn Hebrew in Hebrew University”
 P(love|M_d) = 3/10 = 0.3
 P(Hebrew|M_d) = 2/10 = 0.2
 P(t|M_d) = 0.1 for all other terms t in d
 P(t’|M_d) = 0.0 for all terms t’ not in d
 What is the likelihood of the following queries?
  P(“Hebrew University”|d) = ?
  P(“Hebrew University Jerusalem”|d) = ?
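A minimal sketch of these maximum likelihood estimates; running it answers the two questions above and previews the problem discussed on the next slides:

```python
from collections import Counter

doc = "I love love love to learn Hebrew in Hebrew University".split()
counts = Counter(doc)

def p_mle(term):
    """Maximum likelihood estimate P(t|M_d) = tf_{t,d} / |d|."""
    return counts[term] / len(doc)

def query_likelihood(query):
    """P(q|d) as a product of unigram MLE probabilities."""
    prob = 1.0
    for term in query.split():
        prob *= p_mle(term)
    return prob

print(query_likelihood("Hebrew University"))            # 0.2 * 0.1 = 0.02
print(query_likelihood("Hebrew University Jerusalem"))  # ... * 0.0 = 0.0 -- the zero problem
```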

Parameter estimation (cont.)

 What happens if a document does not contain one of the query words? What will P(q|d) be?
 Is this good?
 Do you see other problems with the current formulation?

Parameter estimation (cont.)

 A single term t with P(t|M_d) = 0 will make P(q|d) zero.
 This gives a single term “veto power” and imposes conjunctive semantics on the query.
 For example, for the query “Michael Jackson top hits”, a document about “top songs” (but not using the word “hits”) would have P(hits|M_d) = 0 and therefore P(q|d) = 0. Bad :~(
 We need to smooth the estimates to avoid zeros.

Smoothing

 Key intuition: A non-occurring term is possible (even though it didn’t occur), but no more likely than would be expected by chance in the collection.
 Notation: M_c: the collection model; cf_t: the number of occurrences of t in the collection; T: the total number of tokens in the collection.
 We will use P(t|M_c) = cf_t / T to “smooth” P(t|d) away from zero.
 Smoothing is also good for other reasons. Why do you think it is helpful?

Mixture model

 How do we combine P(t|M_d) and P(t|M_c)?
 One simple solution is to use a mixture model:
P(t|d) = λ · P(t|M_d) + (1 − λ) · P(t|M_c)
 This mixes the probability from the document with the general collection frequency of the word.
 How does our choice of λ affect the results?
  A high value of λ gives “conjunctive-like” search: it tends to retrieve documents containing all query words.
  A low value of λ is more disjunctive, suitable for long queries.
 Correctly setting λ is very important for good performance.

Mixture model: Summary

 What we model: The user has a document in mind and generates the query from this document.
 The equation
P(q|d) ∝ P(d) · ∏_{t ∈ q} (λ · P(t|M_d) + (1 − λ) · P(t|M_c))
represents the probability that the document the user had in mind was in fact this one.

Example

 Collection: documents d_1 and d_2
 d_1: “Jackson was one of the most talented entertainers of all time”
 d_2: “Michael Jackson anointed himself King of Pop”
 Query q: “Michael Jackson”
 Use the mixture model with λ = 1/2
 Calculate P(q|d_1) and P(q|d_2). Which ranks higher?
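A minimal sketch working through this example with the λ = 1/2 mixture; it computes both query likelihoods and shows which document ranks higher:

```python
from collections import Counter

docs = {
    "d1": "Jackson was one of the most talented entertainers of all time".split(),
    "d2": "Michael Jackson anointed himself King of Pop".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts = Counter(collection)

def smoothed_query_likelihood(query, doc, lam=0.5):
    """P(q|d) = prod over query terms of lam*P(t|M_d) + (1-lam)*P(t|M_c)."""
    doc_counts = Counter(doc)
    prob = 1.0
    for t in query.split():
        p_doc = doc_counts[t] / len(doc)           # P(t|M_d), MLE
        p_coll = coll_counts[t] / len(collection)  # P(t|M_c)
        prob *= lam * p_doc + (1 - lam) * p_coll
    return prob

for name, doc in docs.items():
    print(name, smoothed_query_likelihood("Michael Jackson", doc))
# d1 ~ 0.0028, d2 ~ 0.0126 -> d2 ranks higher
```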

LMs vs. the vector space model

 LMs have some things in common with vector space models. How are they the same/different?
 How do term frequency and inverse document frequency come into play in language models?
 How would you define a bigram language model?
 What are the advantages/disadvantages of a bigram language model w.r.t. a unigram model?