Less is More: Probabilistic Models for Retrieving Fewer Relevant Documents
Harr Chen and David R. Karger, MIT CSAIL
SIGIR 2006
Presented 4/30/2007

Abstract
Probability Ranking Principle (PRP)
– Rank documents in decreasing order of probability of relevance.
Propose a greedy algorithm that approximately optimizes the following objectives:
– %no metric: the percentage of queries for which no relevant documents are retrieved.
– The diversity of results.

Introduction
Probability Ranking Principle
– Rule of thumb, commonly treated as "optimal".
TREC robust track
– %no metric
– Relevant to tasks such as question answering and finding a homepage.
Diversity
– For example, the query "Trojan horse" is ambiguous.
– A PRP-based method may commit to the single "most likely" interpretation.
Greedy algorithm
– Fill each position in the ranking by assuming that all previous documents in the ranking are not relevant.

Introduction (Cont.)
Other measures
– Search length (SL)
– Reciprocal rank (RR)
– Instance recall: the number of distinct subtopics covered in a given result set.
Retrieving for Diversity
– Diversity arises automatically as a consequence of the objective function.

Related Work
Algorithms
– Zhai and Lafferty: a risk minimization framework
– Bookstein: a sequential learning retrieval system
Diversity
– Zhai et al.: novelty and redundancy
– Clustering is an approach to quickly cover a diverse range of query interpretations.

Evaluation Metrics
MSL (mean search length)
MRR (mean reciprocal rank)
%no
– k-call at n: 1 if at least k of the top n docs returned by the system for the given query are deemed relevant; otherwise 0.
– Mean 1-call: one minus the %no metric.
– n-call at n: perfect precision.
Instance recall at rank n
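Stated compactly (my notation, not copied from the slide): write r_1, ..., r_n for the binary relevance of the top n results for a query; then
\[
  k\text{-call at } n \;=\; \mathbf{1}\!\left[\,\textstyle\sum_{i=1}^{n} r_i \ge k\,\right],
\]
so mean 1-call over a set of queries is 1 - %no, and n-call at n is 1 exactly when the top n results are all relevant (perfect precision).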

Bayesian Retrieval
Standard Bayesian Information Retrieval
– The documents in a corpus should be ranked by Pr[r|d].
– By a monotonic transformation, this is equivalent to ranking by the document likelihoods under the relevant and irrelevant models.
– Focus is on the objective function, so a Naïve Bayes framework with multinomial models (θ_i) is used as the family of distributions.
– Determine the parameters (training).
– Dirichlet prior: a prior probability distribution over the parameters (θ_i).
– Estimate the probability of the document under the relevant distribution's parameters (i.e., Pr[d|r]).
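A minimal sketch of the kind of document likelihood such a model computes: a multinomial with a Dirichlet prior, which reduces to Dirichlet-smoothed term probabilities. This illustrates the general technique only; the function, parameter names, and the value of mu are assumptions of mine, not the authors' implementation.

import math
from collections import Counter

def log_doc_likelihood(doc_terms, class_term_counts, class_total, prior_probs, mu=2000.0):
    """Log Pr[d | theta] for a multinomial model whose parameters are the
    posterior-mean (Dirichlet-smoothed) estimates:
        Pr[w | theta] = (count_class(w) + mu * prior(w)) / (total_class + mu)

    doc_terms:          list of tokens in the document d
    class_term_counts:  Counter of term counts observed for the class (e.g. relevant docs)
    class_total:        total number of term occurrences observed for the class
    prior_probs:        background term distribution, e.g. corpus term frequencies
    mu:                 strength of the Dirichlet prior (a tuned weight)
    """
    log_p = 0.0
    for w, tf in Counter(doc_terms).items():
        p_w = (class_term_counts.get(w, 0) + mu * prior_probs.get(w, 1e-9)) / (class_total + mu)
        log_p += tf * math.log(p_w)
    return log_p

With one such model fit to relevant examples and one to irrelevant examples, documents can be ranked by comparing the two log-likelihoods.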

Objective Function
Consider optimizing for the k-call at n metric.
– k=1: the probability that at least one of the first n relevance variables is true.
– For arbitrary k: the probability that at least k of the top n docs are relevant.
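Written as probabilities over the relevance variables of a candidate result set d_1, ..., d_n (my phrasing of the objective described above):
\[
  \Pr[\,r_1 \lor \cdots \lor r_n \mid d_1,\dots,d_n\,] \;=\; 1 - \Pr[\,\lnot r_1 \land \cdots \land \lnot r_n \mid d_1,\dots,d_n\,]
\]
for k = 1, and in general the objective is \(\Pr\big[\sum_{i=1}^{n} r_i \ge k \,\big|\, d_1,\dots,d_n\big]\).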

Optimization Methods
NP-hard problem
– Perfectly optimizing the k-call objective means choosing the best set of n docs out of a corpus of m docs, and the number of candidate sets grows combinatorially with m and n.
Greedy algorithm (approximately optimizes it)
– Successively select each result of the result set:
1. Select the first result by applying the conventional PRP.
2. For the ith result, hold results 1 through i-1 fixed at their already selected values, and consider all remaining corpus documents as candidates for document i.
3. Pick the document with the highest k-call score as the ith result.
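A minimal sketch of this greedy loop, assuming a hypothetical scoring function score_with_context that returns the model's conditional k-call probability for a candidate document given the documents already selected (the names and structure are mine, not the authors' code):

def greedy_rerank(corpus_docs, n, score_with_context):
    """Greedily build a ranking of n documents, one position at a time."""
    selected = []
    remaining = list(corpus_docs)
    for _ in range(min(n, len(remaining))):
        # Score every remaining document in the context of the documents
        # already placed at earlier ranks, and take the best one.
        best = max(remaining, key=lambda d: score_with_context(d, selected))
        selected.append(best)
        remaining.remove(best)
    return selected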

Applying the Greedy Approach
k=1
– First, choose the doc d_0 maximizing Pr[r_0 | d_0].
– Then choose d_1 maximizing the probability that d_1 is relevant, conditioned on d_0 being irrelevant.
– Choose d_2 the same way, conditioned on d_0 and d_1 being irrelevant.
– In general, select the d_i that maximizes this conditional relevance probability (the selection criterion is written out below).
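Based on the paper's description of the 1-greedy step, the successive selection criteria can be written as (a reconstruction in my notation):
\[
  d_1 = \arg\max_{d} \Pr[\,r_1 \mid d_0, d,\ \lnot r_0\,], \qquad
  d_2 = \arg\max_{d} \Pr[\,r_2 \mid d_0, d_1, d,\ \lnot r_0, \lnot r_1\,],
\]
and in general
\[
  d_i = \arg\max_{d} \Pr[\,r_i \mid d_0,\dots,d_{i-1}, d,\ \lnot r_0,\dots,\lnot r_{i-1}\,].
\]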

Applying the Greedy Approach (Cont.)
k=n (perfect precision)
– Select the ith document conditioned on all previously selected documents being relevant (a reconstruction of the criterion is given below).
1<k<n
– The objective is to maximize the probability of having at least k relevant docs in the top n.
– This paper focuses on the k=1 and k=n cases.
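By analogy with the 1-greedy criterion above, and following the paper's description of the perfect-precision case, the selection step can be written as (a reconstruction in my notation):
\[
  d_i = \arg\max_{d} \Pr[\,r_i \mid d_0,\dots,d_{i-1}, d,\ r_0,\dots,r_{i-1}\,],
\]
i.e., the same greedy step with the previously chosen documents assumed relevant rather than irrelevant.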

Optimizing for Other Metrics
Optimizing 1-call
– Choose greedily, conditioned on no previous document being relevant.
– This also minimizes expected search length and maximizes expected reciprocal rank.
– It also optimizes the instance recall metric, which measures the number of distinct subtopics retrieved: if a query has t subtopics, instance recall at rank n is the fraction of the t subtopics covered by the top n documents.
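In symbols (my formalization of the standard definition): if subtopics(d_i) denotes the set of subtopics covered by document d_i, then
\[
  \text{instance recall at } n \;=\; \frac{\big|\bigcup_{i=1}^{n} \mathrm{subtopics}(d_i)\big|}{t}.
\]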

Google Examples
Two ambiguous queries: “Trojan horse” and “virus”
– Used the titles, summaries, and snippets of Google’s results to form a corpus of 1,000 docs for each query.

Experiments
Methods
– 1-greedy, 10-greedy, and conventional PRP
Datasets
– Ad hoc topics from TREC-1, TREC-2, and TREC-3, used to set the weight parameters of the model appropriately.
– TREC 2004 robust track
– TREC-6, 7, and 8 interactive track
– TREC-4 and TREC-6 ad hoc tracks

Tuning the Weights
Key weights
– For the proposed model, the key weights are the strengths of the relevant-distribution and irrelevant-distribution priors relative to the strength of the docs.
TRECs 1, 2, and 3
– About 724,000 docs and 150 topics.
– Used for tuning the weights.

Robust Track Experiments
TREC 2004 robust track
– 249 topics in total, about 528,000 docs.
– 50 topics were selected by TREC as being “difficult” queries.

Instance Retrieval Experiments
TREC-6, 7, and 8 interactive track
– Tests the performance on diversity.
– 20 topics in total, with between 7 and 56 aspects each, and about 210,000 docs.
– Zhai et al.’s LM approach is better for aspect retrieval.

Multiple Annotator Experiments
TREC-4 and TREC-6
– Multiple independent annotators were asked to make relevance judgments for the same topics over the same corpus.
– TREC-4 had three annotators; TREC-6 had two.

Query Analysis
A specific example: topic 100
– The topic description is examined in detail.

Conclusions and Future Work
Conclusions
– Identified that the PRP is not always optimal, and gave an approach to directly optimize other desired objectives.
– The approach is algorithmically feasible.
Future work
– Other objective functions.
– More sophisticated techniques, such as local search algorithms.
– The likelihood of relevance for collections of docs: Two-Poisson model; language model.