Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR 2006.

Presentation transcript:

Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR 2006

Outline Introduction Previous Work Intuition and Theory Experimental Setup and Results Discussion Conclusion

Introduction Information retrieval system evaluation requires test collections: –corpora of documents, sets of topics, and relevance judgments Stable, fine-grained evaluation metrics take both precision and recall into account, and therefore require large sets of judgments. –At best inefficient, at worst infeasible

Introduction The TREC conferences –Goal: building test collections that are reusable –Pooling process: top results from many system runs are judged Reusability is not always a major concern: –TREC-style topics may not suit a specific task –Dynamic collections such as the web

Outline Introduction Previous Work Intuition and Theory Experimental Setup and Results Discussion Conclusion

Previous Work The pooling method has been shown to be sufficient for research purposes. [Soboroff, 2001] –random assignment of relevance to documents in a pool gives a decent ranking of systems [Sanderson, 2004] –systems can be ranked reliably from judgments obtained from a single system or from iterated relevance feedback runs [Carterette, 2005] –proposes an algorithm that achieves high rank correlation with a very small set of judgments

Outline Introduction Previous Work Intuition and Theory Experimental Setup and Results Discussion Conclusion

Average Precision Let x_i be a Boolean indicator of the relevance of document i
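The AP formula on this slide is an image that the transcript does not capture. A standard pairwise form consistent with the definition of x_i (a sketch, assuming every relevant document appears in the ranking; r(i) denotes the rank of document i, and the sum runs over unordered pairs of documents, including i = j) is

\[ AP \;=\; \frac{1}{\sum_k x_k} \sum_{i \le j} \frac{x_i x_j}{\max\!\big(r(i), r(j)\big)}, \]

which makes AP a quadratic function of the relevance indicators; this is the property the rest of the talk exploits.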

Intuition Let S be the set of documents judged relevant, and suppose ΔAP > 0 (the stopping condition). Intuitively, we want to increase the left-hand side by finding relevant documents and decrease the right-hand side by finding irrelevant documents.
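The inequality on this slide is likewise an image not captured in the transcript. A sketch consistent with the pairwise AP form above: for two ranked lists with rank functions r_1 and r_2 evaluated on the same judgments, the normalizer cancels from the sign, so

\[ \Delta AP > 0 \;\Longleftrightarrow\; \sum_{i \le j} c_{ij}\, x_i x_j > 0, \qquad c_{ij} \;=\; \frac{1}{\max\!\big(r_1(i), r_1(j)\big)} \;-\; \frac{1}{\max\!\big(r_2(i), r_2(j)\big)}, \]

and the slide's left- and right-hand sides can be read as the judged (documents in S) and still-unjudged portions of this sum.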

An Optimal Algorithm THEOREM 1: If p_i = p for all i, the set S maximizes

AP is Normally Distributed Given a set of relevance judgments, we use the normal cumulative distribution function (cdf) to find P(ΔAP ≤ 0), the confidence that ΔAP ≤ 0. Figure 1: We simulated two ranked lists of 100 documents. Setting p_i = 0.5, we randomly generated 5000 sets of relevance judgments and calculated ΔAP for each set. The Anderson-Darling goodness-of-fit test concludes that we cannot reject the hypothesis that the sample came from a normal distribution.
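The plot for Figure 1 is not reproduced in the transcript, but the simulation it describes is easy to replicate. A minimal sketch, not the authors' code (the random seed, document ids, and the use of SciPy's Anderson-Darling routine are choices made here for illustration):

import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_docs, n_trials = 100, 5000

# Two hypothetical ranked lists over the same 100 documents.
rank1 = rng.permutation(n_docs)
rank2 = rng.permutation(n_docs)

def average_precision(ranking, rel):
    """AP of a ranking (array of doc ids) under a Boolean relevance vector."""
    rels = rel[ranking].astype(float)            # relevance in rank order
    if rels.sum() == 0:
        return 0.0
    prec_at_k = np.cumsum(rels) / np.arange(1, len(rels) + 1)
    return float((prec_at_k * rels).sum() / rels.sum())

delta_ap = np.empty(n_trials)
for t in range(n_trials):
    rel = rng.random(n_docs) < 0.5               # x_i ~ Bernoulli(p_i = 0.5)
    delta_ap[t] = average_precision(rank1, rel) - average_precision(rank2, rel)

# Anderson-Darling goodness-of-fit test against the normal distribution.
print(stats.anderson(delta_ap, dist='norm'))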

Application to MAP Because topics are independent, if each per-topic AP is normally distributed then MAP, the mean of the per-topic APs, is normally distributed as well. Each (topic, document) pair is treated as a unique “document”.
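A sketch of the step being used here (assuming the per-topic AP values are independent and approximately normal; T is the number of topics):

\[ MAP = \frac{1}{T}\sum_{t=1}^{T} AP_t, \qquad E[MAP] = \frac{1}{T}\sum_{t} E[AP_t], \qquad \mathrm{Var}(MAP) = \frac{1}{T^2}\sum_{t} \mathrm{Var}(AP_t), \]

so MAP is itself approximately normal, and the confidence P(ΔMAP ≤ 0) can again be read off the normal cdf.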

Outline Introduction Introduction Previous Work Previous Work Intuition and Theory Intuition and Theory Experimental Setup and Results Experimental Setup and Results Discussion Discussion Conclusion Conclusion

Outline of the Experiment 1) We ran eight retrieval systems on a set of baseline topics for which we had full sets of judgments. 2) Six annotators were asked to develop 60 new topics; these were run on the same eight systems. 3) The annotators then judged documents selected by the algorithm.

The Baseline Baseline topics –Used to estimate system performance –The 2005 Robust/HARD track topics and ad hoc topics 301 through 450 Corpora –AQUAINT for the Robust topics, 1 million articles –TREC disks 4&5 for the ad hoc topics, 50,000 articles Retrieval systems –Six freely available retrieval systems: Indri, Lemur, Lucene, mg, Smart, and Zettair

Experiment Results 2200 relevance judgments were obtained in 2.5 hours –About 4.5 per system per topic on average –About 2.5 per minute per annotator (the corresponding rate at TREC is 2.2) The systems are ranked by the expected value of MAP: where p_i = 1 if document i has been judged relevant, 0 if irrelevant, and 0.5 otherwise.
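The expected-MAP formula on this slide is an image that the transcript omits. A sketch consistent with the pairwise AP form above, treating the x_i as independent Bernoulli(p_i) variables so that E[x_i x_j] = p_i p_j for i ≠ j and E[x_i^2] = p_i, and approximating the normalizer by its expectation, is

\[ E[AP] \;\approx\; \frac{1}{\sum_k p_k}\left( \sum_i \frac{p_i}{r(i)} \;+\; \sum_{i < j} \frac{p_i p_j}{\max\!\big(r(i), r(j)\big)} \right), \]

with expected MAP (εMAP) the mean of these per-topic expectations.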

Experiment Results Table 1: True MAPs of eight systems over 200 baseline topics, and expected MAP, with 95% confidence intervals, over 60 new topics. Horizontal lines indicate “bin” divisions determined by statistical significance.

Experiment Results Figure 2: Confidence increases as more judgments are made.

Outline Introduction Previous Work Intuition and Theory Experimental Setup and Results Discussion Conclusion

Discussion Simulations using the Robust 2005 topics and NIST judgments evaluate the performance of the algorithm. Several questions are explored: –To what degree are the results dependent on the algorithm rather than the evaluation metric? –How many judgments are required to differentiate a single pair of ranked lists with 95% confidence? –How does confidence vary as more judgments are made? –Are test collections produced by our algorithm reusable?

Comparing εMAP and MAP Simulation: after a number of documents have been judged, εMAP and MAP are calculated for all systems and the resulting rankings are compared with the true ranking by Kendall's tau correlation.
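For reference, a minimal illustration of this comparison step, not the authors' code (the per-system metric values below are hypothetical; the rankings they induce are what is compared):

from scipy.stats import kendalltau

true_map      = [0.31, 0.28, 0.27, 0.22, 0.19, 0.18, 0.15, 0.10]   # hypothetical
estimated_map = [0.30, 0.29, 0.25, 0.24, 0.18, 0.19, 0.14, 0.11]   # hypothetical

tau, p_value = kendalltau(true_map, estimated_map)
print(f"Kendall's tau = {tau:.3f} (p = {p_value:.3f})")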

How Many Judgments? The number of judgments that must be made in comparing two systems depends on how similar they are. Fig. 4: Absolute difference in true AP for Robust 2005 topics vs. number of judgments to 95% confidence for pairs of ranked lists for individual topics.

Confidence over Time Incremental pooling: all documents in a pool of depth k will be judged. The pool of depth 10 contains 2228 documents and 569 of them are not in the algorithmic set.
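For reference, depth-k pooling (the procedure this slide compares against) simply takes the union of the top k documents from every run. A minimal sketch with hypothetical run data:

def depth_k_pool(runs, k=10):
    """runs: dict mapping run name -> list of doc ids in ranked order."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {
    "systemA": ["d3", "d7", "d1", "d9"],
    "systemB": ["d7", "d2", "d3", "d5"],
}
print(sorted(depth_k_pool(runs, k=2)))   # -> ['d2', 'd3', 'd7']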

Reusability of Test Collections One of the eight systems is removed when building the test collection. All eight systems are then ranked by εMAP, setting p_i to the ratio of relevant documents in the test collection. Table 4: Reusability of test collections. The 8th system is always placed in the correct spot or swapped with the next (statistically indistinguishable) system.

Outline Introduction Previous Work Intuition and Theory Experimental Setup and Results Discussion Conclusion

Conclusion A new perspective on average precision leads to an algorithm for selecting which documents to judge so that retrieval systems can be evaluated in minimal time. After only six hours of annotation time, we had achieved a ranking with 90% confidence. One direction for future work is extending the analysis to other evaluation metrics for different tasks. Another is estimating the probabilities of relevance.