Probabilistic Ranking Principle

Hongning Wang, CS@UVa
CS 6501: Information Retrieval

Probability of Relevance
A map of IR models by their notion of relevance:
- Relevance as similarity between representations (Rep(q), Rep(d)): different rep & similarity functions give the vector space model (Salton et al., 75) and the probabilistic distribution model (Wong & Yao, 89)
- Relevance as probability of relevance, P(r=1|q,d) with r in {0,1}: generative models, via document generation (classical probabilistic model, Robertson & Sparck Jones, 76) or query generation (LM approach, Ponte & Croft, 98; Lafferty & Zhai, 01a), and the regression model (Fox, 83). Today's lecture.
- Relevance as probabilistic inference, P(d -> q) or P(q -> d): different inference systems give the inference network model (Turtle & Croft, 91) and the probabilistic concept space model (Wong & Yao, 95)

Basic concepts in probability
- Random experiment: an experiment with uncertain outcome, e.g., tossing a coin, picking a word from text
- Sample space (S): all possible outcomes of an experiment; e.g., tossing 2 fair coins, S = {HH, HT, TH, TT}
- Event (E): E is a subset of S; E happens iff the outcome is in E, e.g., E = {HH} (all heads), E = {HH, TT} (same face); impossible event ({}), certain event (S)
- Probability of an event: 0 <= P(E) <= 1

Essential probability concepts
- Probability of events
  - Mutually exclusive events: $P(A \cup B) = P(A) + P(B)$
  - General events: $P(A \cup B) = P(A) + P(B) - P(A \cap B)$
  - Independent events: $P(A \cap B) = P(A)\,P(B)$; the joint probability $P(A \cap B)$ is often written simply as $P(A, B)$

Essential probability concepts
- Conditional probability: $P(B|A) = P(A, B) / P(A)$
  - Bayes' rule: $P(B|A) = P(A|B)\,P(B) / P(A)$
  - For independent events, $P(B|A) = P(B)$
- Total probability: if $A_1, \ldots, A_n$ form a non-overlapping partition of S, then
  $P(B) = P(B, A_1) + \ldots + P(B, A_n)$
  $P(A_i|B) = \frac{P(B|A_i)\,P(A_i)}{P(B|A_1)\,P(A_1) + \ldots + P(B|A_n)\,P(A_n)} \propto P(B|A_i)\,P(A_i)$
  This allows us to compute $P(A_i|B)$ based on $P(B|A_i)$
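To make this concrete, here is a minimal Python sketch of total probability and Bayes' rule; the prior and likelihood values are made up for illustration:

```python
# Prior over a partition A_1, ..., A_n of the sample space (hypothetical)
prior = [0.5, 0.3, 0.2]          # P(A_i)
likelihood = [0.9, 0.5, 0.1]     # P(B | A_i)

# Total probability: P(B) = sum_i P(B | A_i) P(A_i)
p_b = sum(l * p for l, p in zip(likelihood, prior))

# Bayes' rule: P(A_i | B) = P(B | A_i) P(A_i) / P(B)
posterior = [l * p / p_b for l, p in zip(likelihood, prior)]

print(p_b)        # 0.62
print(posterior)  # approximately [0.726, 0.242, 0.032]
```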

Interpretation of Bayes' rule
- Hypothesis space: $H = \{H_1, \ldots, H_n\}$; evidence: $E$
- $P(H_i|E) = \frac{P(E|H_i)\,P(H_i)}{P(E)}$: the posterior probability of $H_i$ is the likelihood of the data/evidence given $H_i$ times the prior probability of $H_i$, normalized by $P(E)$
- If we want to pick the most likely hypothesis H*, we can drop $P(E)$

Theoretical justification of ranking
As stated by William Cooper, ranking by probability of relevance leads to optimal retrieval performance:
"If a reference retrieval system's response to each request is a ranking of the documents in the collections in order of decreasing probability of usefulness to the user who submitted the request, where the probabilities are estimated as accurately as possible on the basis of whatever data have been made available to the system for this purpose, then the overall effectiveness of the system to its users will be the best that is obtainable on the basis of that data."

Justification from decision theory
- Two types of loss:
  - Loss(retrieved | non-relevant) = $a_1$
  - Loss(not retrieved | relevant) = $a_2$
- Let $\phi(d_i, q)$ be the probability of $d_i$ being relevant to $q$
- Expected loss for the decision of including $d_i$ in the final results:
  - Retrieve: $(1 - \phi(d_i, q))\, a_1$
  - Not retrieve: $\phi(d_i, q)\, a_2$
- Your decision criterion?

Justification from decision theory
- We make the decision as follows:
  - If $(1 - \phi(d_i, q))\, a_1 < \phi(d_i, q)\, a_2$, retrieve $d_i$; otherwise, do not retrieve $d_i$
  - Equivalently, check whether $\phi(d_i, q) > \frac{a_1}{a_1 + a_2}$
- Ranking documents in descending order of $\phi(d_i, q)$ minimizes the expected loss
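A small Python sketch of this decision rule; the loss values and relevance probabilities are hypothetical:

```python
a1 = 1.0   # loss of retrieving a non-relevant document
a2 = 4.0   # loss of missing a relevant document

threshold = a1 / (a1 + a2)   # retrieve iff phi(d, q) > 0.2 here

docs = {"d1": 0.7, "d2": 0.15, "d3": 0.35}   # phi(d_i, q)

# Retrieve documents above the threshold, ranked by descending phi
retrieved = sorted(
    (d for d, phi in docs.items() if phi > threshold),
    key=lambda d: docs[d],
    reverse=True,
)
print(retrieved)  # ['d1', 'd3']
```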

According to the PRP, what we need is:
- A relevance measure function F(q,d) such that for all q, d1, d2: F(q,d1) > F(q,d2) iff p(Rel|q,d1) > p(Rel|q,d2)
- Assumptions: independent relevance; sequential browsing
- Most existing research on IR models so far has fallen into this line of thinking... (Limitations?)

Probability of relevance
- Three random variables: query Q, document D, relevance R in {0,1}
- Goal: rank D based on P(R=1|Q,D)
  - Compute P(R=1|Q,D)
  - Actually, one only needs to compare P(R=1|Q,D1) with P(R=1|Q,D2), i.e., rank the documents
- There are several different ways to define P(R=1|Q,D)

Conditional models for P(R=1|Q,D)
- Basic idea: relevance depends on how well a query matches a document
  - P(R=1|Q,D) = g(Rep(Q,D)|θ), where g is a functional form with parameter θ
  - Rep(Q,D): a feature representation of the query-doc pair, e.g., # matched terms, highest IDF of a matched term, docLen
- Use training data (with known relevance judgments) to estimate the parameter θ, then apply the model to rank new documents
- Will be covered in more detail under learning-to-rank

Generative models for P(R=1|Q,D)
- Basic idea: compute the odds Odds(R=1|Q,D) using Bayes' rule:
  $O(R=1|Q,D) = \frac{P(R=1|Q,D)}{P(R=0|Q,D)} = \frac{P(Q,D|R=1)}{P(Q,D|R=0)} \cdot \frac{P(R=1)}{P(R=0)}$
  where the prior odds factor is constant across documents and thus ignored for ranking
- Assumption: relevance is a binary variable
- Variants:
  - Document "generation": P(Q,D|R) = P(D|Q,R)P(Q|R)
  - Query "generation": P(Q,D|R) = P(Q|D,R)P(D|R)

Document generation model
- Rank by $O(R=1|Q,D) \propto \frac{P(D|Q,R=1)}{P(D|Q,R=0)}$; the query-generation factor P(Q|R) is ignored for ranking
- Assume independent attributes A1...Ak (why?)
- Let D = d1...dk, where di in {0,1} is the value of attribute Ai (similarly Q = q1...qk)
- [Example on slide: a toy binary term-document matrix for Doc1 and Doc2 over the vocabulary {information, retrieval, retrieved, is, helpful, for, you, everyone}]
- Model of relevant docs for Q vs. model of non-relevant docs for Q:

  document              relevant (R=1)   non-relevant (R=0)
  term present (Ai=1)        pi                 ui
  term absent  (Ai=0)       1-pi               1-ui

Document generation model
- Split the odds ratio into terms that occur in the doc and terms that do not (with pi and ui as in the table above):
  $O(R=1|Q,D) \propto \prod_{i:\, d_i=1} \frac{p_i}{u_i} \cdot \prod_{i:\, d_i=0} \frac{1-p_i}{1-u_i}$
- Important trick: assume terms not occurring in the query are equally likely to occur in relevant and non-relevant documents, i.e., $p_t = u_t$ for every term t with $q_t = 0$
- Then only query terms contribute, and up to a document-independent constant the ranking score is
  $\sum_{i:\, q_i = d_i = 1} \log \frac{p_i\,(1-u_i)}{u_i\,(1-p_i)}$
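The log-odds sum is easy to compute once pi and ui are given. A minimal Python sketch with hypothetical parameter values:

```python
import math

def bim_score(query_terms, doc_terms, p, u):
    """Sum of log-odds weights over query terms present in the document."""
    score = 0.0
    for t in query_terms & doc_terms:
        score += math.log(p[t] * (1 - u[t]) / (u[t] * (1 - p[t])))
    return score

# Hypothetical per-term parameters: p[t] = P(t in doc | relevant),
# u[t] = P(t in doc | non-relevant)
p = {"information": 0.8, "retrieval": 0.7}
u = {"information": 0.3, "retrieval": 0.2}

doc1 = {"information", "retrieval", "is", "helpful"}
doc2 = {"information", "everyone"}

print(bim_score({"information", "retrieval"}, doc1, p, u))  # ~4.47, ranked higher
print(bim_score({"information", "retrieval"}, doc2, p, u))  # ~2.23
```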

Robertson-Sparck Jones model (Robertson & Sparck Jones 76) (RSJ model)
- Two parameters for each term Ai:
  - pi = P(Ai=1|Q,R=1): probability that term Ai occurs in a relevant doc
  - ui = P(Ai=1|Q,R=0): probability that term Ai occurs in a non-relevant doc
- How to estimate these parameters? Suppose we have relevance judgments; the standard smoothed estimates are
  $\hat{p}_i = \frac{r_i + 0.5}{R + 1}, \qquad \hat{u}_i = \frac{n_i - r_i + 0.5}{N - R + 1}$
  where N is the number of judged documents, R the number judged relevant, $n_i$ the number containing $A_i$, and $r_i$ the number of relevant documents containing $A_i$
- Per-query estimation!
- The "+0.5" and "+1" can be justified by Bayesian estimation as priors
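A small sketch of these smoothed estimates in Python, with hypothetical judgment counts:

```python
def rsj_estimates(n, r, N, R):
    """Smoothed RSJ estimates; the "+0.5"/"+1" terms act as Bayesian priors."""
    p = (r + 0.5) / (R + 1)            # P(term occurs | relevant)
    u = (n - r + 0.5) / (N - R + 1)    # P(term occurs | non-relevant)
    return p, u

# Hypothetical: 1000 judged docs, 10 relevant; the term appears in 50 docs,
# 8 of which are relevant
p, u = rsj_estimates(n=50, r=8, N=1000, R=10)
print(p, u)  # ~0.773, ~0.043
```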

Parameter estimation
- General setting: given a (hypothesized & probabilistic) model that governs the random experiment
- The model gives a probability $p(D|\theta)$ of any data, which depends on the parameter $\theta$
- Now, given actual sample data X = {x1,...,xn}, what can we say about the value of $\theta$?
- Intuitively, take our best guess of $\theta$; "best" means "best explaining/fitting the data"
- Generally an optimization problem

Maximum likelihood vs. Bayesian
- Maximum likelihood (ML) estimation: "best" means "the data likelihood reaches its maximum"
  $\hat{\theta} = \arg\max_\theta P(X|\theta)$ (the frequentist's point of view)
  - Issue: small sample size
- Bayesian estimation: "best" means being consistent with our "prior" knowledge while explaining the data well
  - A.k.a. maximum a posteriori (MAP) estimation: $\hat{\theta} = \arg\max_\theta P(\theta|X) = \arg\max_\theta P(X|\theta)\,P(\theta)$ (the Bayesian's point of view)
  - Issue: how to define the prior?
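To see the difference on a tiny example, here is a sketch contrasting the ML and MAP estimates of a coin's heads probability, assuming a Beta prior (for a Beta prior, the MAP estimate is the posterior mode in closed form); the counts and prior are hypothetical:

```python
heads, tails = 3, 1   # small sample: 4 tosses
a, b = 5, 5           # Beta(a, b) prior pseudo-counts encoding a "fair coin" belief

theta_ml = heads / (heads + tails)                          # argmax P(X | theta)
theta_map = (heads + a - 1) / (heads + tails + a + b - 2)   # argmax P(theta | X)

print(theta_ml)   # 0.75  -- follows the small sample
print(theta_map)  # ~0.583 -- pulled toward the prior mode 0.5
```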

Illustration of Bayesian estimation
[Figure: for data $X = (x_1, \ldots, x_N)$, the posterior $p(\theta|X) \propto p(X|\theta)\,p(\theta)$ combines the likelihood $p(X|\theta)$ with the prior $p(\theta)$; the prior mode, the posterior mode, and the ML estimate $\theta_{ml}$ are marked, with the posterior mode lying between the prior mode and the ML estimate]

Maximum likelihood estimation
- Data: a document d with counts $c(w_1), \ldots, c(w_N)$
- Model: multinomial distribution $p(W|\theta)$ with parameters $\theta_i = p(w_i)$
- Maximum likelihood estimator: $\hat{\theta} = \arg\max_\theta p(W|\theta)$

$p(W|\theta) = \binom{N}{c(w_1), \ldots, c(w_N)} \prod_{i=1}^{N} \theta_i^{c(w_i)} \propto \prod_{i=1}^{N} \theta_i^{c(w_i)}$

$\log p(W|\theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i$

Using the Lagrange multiplier approach, we tune $\theta_i$ to maximize

$L(W, \theta) = \sum_{i=1}^{N} c(w_i) \log \theta_i + \lambda \left( \sum_{i=1}^{N} \theta_i - 1 \right)$

Setting the partial derivatives to zero:

$\frac{\partial L}{\partial \theta_i} = \frac{c(w_i)}{\theta_i} + \lambda = 0 \;\Rightarrow\; \theta_i = -\frac{c(w_i)}{\lambda}$

Since probability requires $\sum_{i=1}^{N} \theta_i = 1$, we get $\lambda = -\sum_{i=1}^{N} c(w_i)$, so the ML estimate is

$\theta_i = \frac{c(w_i)}{\sum_{i=1}^{N} c(w_i)}$
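The resulting estimator is just relative word frequency. A minimal sketch:

```python
from collections import Counter

# ML estimate of a unigram language model: theta_i = c(w_i) / sum_j c(w_j)
doc = "information retrieval is about retrieval of information".split()

counts = Counter(doc)
total = sum(counts.values())

theta = {w: c / total for w, c in counts.items()}
print(theta["retrieval"])  # 2/7 ~ 0.286
```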


RSJ model without relevance info (Croft & Harper 79)
- Suppose we do not have relevance judgments:
  - Assume $p_i$ to be a constant (so the $p_i$ factors drop out of the ranking)
  - Estimate $u_i$ by assuming all documents to be non-relevant: $u_i \approx n_i / N$
  - N: # documents in the collection; $n_i$: # documents in which term $A_i$ occurs
- The term weight then becomes approximately $\log \frac{N - n_i + 0.5}{n_i + 0.5}$, an IDF-like weight
- Reminder: the IDF-weighted Boolean model?
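A small sketch of this IDF-like weight, with hypothetical collection statistics:

```python
import math

def idf_like_weight(n_i, N):
    """RSJ weight with constant p_i and u_i ~ n_i / N (smoothed)."""
    return math.log((N - n_i + 0.5) / (n_i + 0.5))

N = 1_000_000
print(idf_like_weight(100, N))      # rare term: large weight (~9.2)
print(idf_like_weight(500_000, N))  # term in half the collection: weight 0
```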

RSJ model: summary
- The most important classic probabilistic IR model
- Uses only term presence/absence, thus also referred to as the Binary Independence Model; essentially Naive Bayes for doc ranking
- Designed for short catalog records
- Without relevance judgments, the model parameters must be estimated in an ad hoc way
- Performance isn't as good as tuned vector space models

Improving RSJ: adding TF
- Let D = d1...dk, where di is the frequency count of term Ai
- Eliteness: whether the term is about the concept asked for in the query
- 2-Poisson mixture model for TF: a term's frequency is drawn from one Poisson distribution if the document is elite for the term and another if it is not, of the form
  $P(TF_i = tf) = \pi \frac{\lambda^{tf} e^{-\lambda}}{tf!} + (1 - \pi) \frac{\mu^{tf} e^{-\mu}}{tf!}$
- Many more parameters to estimate! (How many exactly?)
- Compounded with document length!

BM25/Okapi approximation (Robertson et al. 94)
- Idea: approximate p(R=1|Q,D) with a simpler function that approximates the 2-Poisson mixture model
- Observations:
  - log O(R=1|Q,D) is a sum of weights of terms occurring in both the query and the document
  - The term weight Wi = 0 if TFi = 0
  - Wi increases monotonically with TFi
  - Wi has an asymptotic limit
- A simple function with all these properties is $\frac{TF_i}{k_1 + TF_i}$ (applied on top of the RSJ term weight)
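A quick numerical check of the saturation behavior of this transformation:

```python
# BM25 TF saturation tf / (k1 + tf): zero at tf = 0, monotonically
# increasing, asymptote at 1
k1 = 1.2
for tf in [0, 1, 2, 5, 10, 100]:
    print(tf, round(tf / (k1 + tf), 3))
# 0 0.0, 1 0.455, 2 0.625, 5 0.806, 10 0.893, 100 0.988
```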

Adding doc length
- Motivation: the 2-Poisson model assumes equal document length
- Implementation: penalize long documents via pivoted document length normalization, replacing $k_1$ with
  $K = k_1 \left( (1 - b) + b \cdot \frac{|D|}{avdl} \right)$
  where |D| is the document length and avdl the average document length in the collection

Adding query TF
- Motivation: natural symmetry between document and query
- Implementation: a similar TF transformation as for document TF, $\frac{(k_2 + 1)\, QTF_i}{k_2 + QTF_i}$
- The final formula is called BM25, achieving top TREC performance (BM: best match)

The BM25 formula
The classic Okapi scoring function (Robertson et al. 94) has the form
$score(Q,D) = \sum_{i \in Q} \log \frac{(r_i + 0.5)\,(N - R - n_i + r_i + 0.5)}{(n_i - r_i + 0.5)\,(R - r_i + 0.5)} \cdot \frac{(k_1 + 1)\, TF_i}{K + TF_i} \cdot \frac{(k_2 + 1)\, QTF_i}{k_2 + QTF_i}$
with $K = k_1\left((1-b) + b \cdot \frac{|D|}{avdl}\right)$. The middle factor is the "Okapi TF/BM25 TF" transformation; the log factor becomes IDF when no relevance info is available (set $r_i = R = 0$).

The BM25 formula: a closer look
- The log factor and the document TF factor together form a TF-IDF component for the document; the last factor is a TF component for the query
- $k_1$ is usually set in [1.2, 2.0]; $b$ must lie in [0, 1] and is usually set around 0.75; $k_2$ is usually set in (0, 1000]
- Essentially a vector space model with a TF-IDF weighting scheme!
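Putting the pieces together, here is a minimal sketch of BM25 scoring without relevance information (so the RSJ factor reduces to the IDF-like weight); the function name and the toy statistics are illustrative, and the parameter defaults follow the ranges above:

```python
import math

def bm25_score(query, doc, df, N, avdl, k1=1.2, b=0.75, k2=100):
    """Score one document (a list of terms) against a query (a list of terms)."""
    dl = len(doc)
    K = k1 * ((1 - b) + b * dl / avdl)   # pivoted length normalization
    score = 0.0
    for term in set(query):
        tf = doc.count(term)
        if tf == 0:
            continue
        qtf = query.count(term)
        idf = math.log((N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * ((k1 + 1) * tf / (K + tf)) * ((k2 + 1) * qtf / (k2 + qtf))
    return score

df = {"information": 300, "retrieval": 50}   # hypothetical document frequencies
query = ["information", "retrieval"]
doc = ["information", "retrieval", "is", "about", "finding", "information"]

print(bm25_score(query, doc, df, N=1000, avdl=8.0))
```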

Extensions of "doc generation" models
- Capture term dependence [van Rijsbergen & Harper 78]
- Alternative ways to incorporate TF [Croft 83, Kalt 96]
- Feature/term selection for feedback [Okapi's TREC reports]
- Estimating the relevance model based on pseudo feedback [Lavrenko & Croft 01] (to be covered later)

Query generation models
- Under query "generation", the odds decompose into a query likelihood times a document prior:
  $O(R=1|Q,D) \propto P(Q|D, R=1) \cdot \frac{P(D|R=1)}{P(D|R=0)}$
- Assuming a uniform document prior, we can rank documents by the query likelihood $p(q|\theta_d)$ alone
- Now, the question is how to compute $p(q|\theta_d)$. Generally this involves two steps:
  1. Estimate a language model $\theta_d$ based on D
  2. Compute the query likelihood according to the estimated model
- Language models: we will cover them in the next lecture!
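A minimal sketch of query-likelihood scoring with an unsmoothed (maximum likelihood) unigram model; the zero-probability problem it exposes is exactly why smoothing, covered next lecture, is needed:

```python
from collections import Counter

def query_likelihood(query, doc):
    """p(q | theta_d) under an unsmoothed unigram model estimated from doc."""
    counts = Counter(doc)
    dl = len(doc)
    score = 1.0
    for w in query:
        score *= counts[w] / dl   # zero if any query word is unseen in doc
    return score

doc = "information retrieval is about retrieval of information".split()
print(query_likelihood(["information", "retrieval"], doc))  # (2/7)*(2/7)
print(query_likelihood(["information", "ranking"], doc))    # 0.0 -- needs smoothing
```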

What you should know
- Essential concepts in probability
- Justification of ranking by relevance
- Derivation of the RSJ model
- Maximum likelihood estimation
- The BM25 formula