Philosophy of IR Evaluation Ellen Voorhees

Evaluation: how well does the system meet the information need?
–System evaluation: how good are the document rankings?
–User-based evaluation: how satisfied is the user?

Why do system evaluation?
Allows sufficient control of variables to increase the power of comparative experiments
–laboratory tests less expensive
–laboratory tests more diagnostic
–laboratory tests necessarily an abstraction
It works!
–numerous examples of techniques developed in the laboratory that improve performance in operational settings

Cranfield Tradition
Laboratory testing of retrieval systems first done in the Cranfield II experiment (1963)
–fixed document and query sets
–evaluation based on relevance judgments
–relevance abstracted to topical similarity
Test collections
–set of documents
–set of questions
–relevance judgments
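The relevance-judgment component of a TREC-style test collection is usually distributed as a "qrels" file with one judgment per line: topic id, an iteration field (typically ignored), document id, and a relevance label. A minimal sketch of loading such a file; the file name is hypothetical, not from the talk.

```python
from collections import defaultdict

def load_qrels(path):
    """Read TREC-style qrels lines of the form: 'topic iteration docno relevance'."""
    qrels = defaultdict(dict)                  # topic -> {docno: relevance}
    with open(path) as f:
        for line in f:
            topic, _iteration, docno, rel = line.split()
            qrels[topic][docno] = int(rel)     # 0 = not relevant, >0 = relevant
    return qrels

# Hypothetical usage:
# qrels = load_qrels("qrels.trec8.adhoc")
```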

Cranfield Tradition Assumptions
Relevance can be approximated by topical similarity
–relevance of one doc is independent of others
–all relevant documents equally desirable
–user information need doesn’t change
Single set of judgments is representative of the user population
Complete judgments (i.e., recall is knowable)
[Binary judgments]

The Case Against the Cranfield Tradition
Relevance judgments
–vary too much to be the basis of evaluation
–topical similarity is not utility
–a static set of judgments cannot reflect the user’s changing information need
Recall is unknowable
Results on test collections are not representative of operational retrieval systems

Response to Criticism
The goal in the Cranfield tradition is to compare systems
–gives relative scores of evaluation measures, not absolute
–differences in relevance judgments matter only if relative measures based on those judgments change
Realism is a concern
–historically the concern has been collection size
–for TREC and similar collections, the bigger concern is realism of the topic statement

Using Pooling to Create Large Test Collections
Assessors create topics.
A variety of different systems retrieve the top 1000 documents for each topic.
Pools of the unique documents from all submissions are formed, which the assessors judge for relevance.
Systems are evaluated using the relevance judgments.
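A minimal sketch of the pool-forming step; the `runs` structure is an assumption for illustration, and the default depth of 100 reflects the traditional TREC practice of pooling the top 100 documents of each run even though 1000 are submitted.

```python
def build_pools(runs, depth=100):
    """Form judging pools: per topic, the union of the top-`depth` documents
    contributed by every submitted run. Duplicates collapse, so pool size
    depends on how much the runs agree with one another."""
    pools = {}                                   # topic -> set of docnos to judge
    for run in runs:                             # run: {topic: [docno, ...] in rank order}
        for topic, ranked in run.items():
            pools.setdefault(topic, set()).update(ranked[:depth])
    return pools

# Every docno in pools[topic] is shown to the assessor for that topic;
# documents outside the pool are treated as not relevant.
```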

Documents
Must be representative of the real task of interest
–genre
–diversity (subjects, style, vocabulary)
–amount
–full text vs. abstract

Topics
Distinguish between the statement of user need (topic) & the system data structure (query)
–topic gives criteria for relevance
–allows for different query construction techniques
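For illustration only (the topic text below is invented, not a real TREC topic): two of the many query construction strategies a topic statement permits, using the title field alone or the title plus description.

```python
def title_query(topic):
    # Short, keyword-style query built from the title field only.
    return topic["title"].lower().split()

def title_desc_query(topic):
    # Longer automatic query built from the title and description fields.
    return (topic["title"] + " " + topic["description"]).lower().split()

# Invented example topic in the spirit of a TREC topic statement:
topic = {
    "title": "organic farming subsidies",
    "description": "Find documents that discuss government subsidies "
                   "for organic farming and their economic effects.",
    "narrative": "Relevant documents describe specific subsidy programs; general "
                 "discussions of organic farming are not relevant.",
}

print(title_query(topic))        # ['organic', 'farming', 'subsidies']
print(title_desc_query(topic))   # title words plus description words
```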

Creating Relevance Judgments
[Diagram: the top 100 documents from each submitted run (RUN A, RUN B, ...) for a topic (401) are merged into a pool of alphabetized unique docnos for the assessor to judge.]

Test Collection Reliability Recap
Test collections are abstractions of operational retrieval settings used to explore the relative merits of different retrieval strategies
Test collections are reliable if they predict the relative worth of different approaches
Two dimensions to explore
–inconsistency: differences in relevance judgments caused by using different assessors
–incompleteness: violation of the assumption that all documents are judged for all test queries

Inconsistency
Most frequently cited “problem” of test collections
–undeniably true that relevance is highly subjective; judgments vary by assessor and for the same assessor over time…
–…but no evidence that these differences affect comparative evaluation of systems

Experiment
–Given three independent sets of judgments for each of 48 TREC-4 topics
–Rank the TREC-4 runs by mean average precision as evaluated using different combinations of judgments
–Compute the correlation among the run rankings
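A sketch of the measures involved, reusing the data structures assumed in the earlier snippets: average precision per topic, mean average precision (MAP) per run, and a simple Kendall's tau between the run rankings produced by two different qrels variants (a rank correlation commonly reported in this kind of study).

```python
from itertools import combinations

def average_precision(ranked, relevant):
    """AP for one topic: sum of precision at each relevant retrieved document,
    divided by the total number of relevant documents for the topic."""
    hits, precision_sum = 0, 0.0
    for i, docno in enumerate(ranked, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / i
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(run, qrels):
    """MAP: mean of per-topic AP over all topics in the qrels."""
    aps = [average_precision(run.get(t, []),
                             {d for d, r in judged.items() if r > 0})
           for t, judged in qrels.items()]
    return sum(aps) / len(aps)

def kendall_tau(scores_a, scores_b):
    """Kendall's tau between two rankings of the same runs; each argument maps
    run name -> MAP under one qrels variant. Ties count as neither concordant
    nor discordant."""
    runs = list(scores_a)
    concordant = discordant = 0
    for r1, r2 in combinations(runs, 2):
        product = (scores_a[r1] - scores_a[r2]) * (scores_b[r1] - scores_b[r2])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(runs) * (len(runs) - 1) // 2
    return (concordant - discordant) / n_pairs
```

Ranking the runs by MAP under each qrels variant and computing tau between every pair of variants reproduces the kind of correlation analysis the slide describes.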

Average Precision by Qrel [chart]

Effect of Different Judgments
Similar highly-correlated results found using
–different query sets
–different evaluation measures
–different groups of assessors
–single opinion vs. group opinion judgments
Conclusion: comparative results are stable despite the idiosyncratic nature of relevance judgments

Incompleteness
Relatively new concern regarding test collection quality
–early test collections were small enough to have complete judgments
–current collections can have only a small portion examined for relevance for each query; the portion judged is usually selected by pooling

Incompleteness
Study by Zobel [SIGIR-98]:
–quality of relevance judgments does depend on pool depth and diversity
–TREC judgments not complete: additional relevant documents distributed roughly uniformly across systems but highly skewed across topics
–TREC ad hoc collections not biased against systems that do not contribute to the pools

Uniques Effect on Evaluation [chart]

Uniques Effect on Evaluation: Automatic Only [chart]
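The "uniques" analysis behind the two charts above can be approximated as follows: for a given participating group, find the relevant documents that only that group's runs contributed to the pools, remove them from the qrels, and re-score the group's runs against the reduced qrels; a large score change would indicate the collection is biased toward pool contributors. This is a sketch under the data structures assumed in the earlier snippets, not the exact TREC procedure.

```python
def unique_relevant(group_runs, other_runs, qrels, depth=100):
    """Per topic, the relevant documents that only `group_runs` contributed
    to the judging pools."""
    uniques = {}
    for topic, judged in qrels.items():
        mine, theirs = set(), set()
        for run in group_runs:
            mine.update(run.get(topic, [])[:depth])
        for run in other_runs:
            theirs.update(run.get(topic, [])[:depth])
        uniques[topic] = {d for d, rel in judged.items()
                          if rel > 0 and d in mine and d not in theirs}
    return uniques

def without_uniques(qrels, uniques):
    """Qrels with the group's uniquely contributed relevant documents removed,
    simulating a collection built without that group's runs."""
    return {t: {d: r for d, r in judged.items() if d not in uniques.get(t, set())}
            for t, judged in qrels.items()}

# Hypothetical usage: compare mean_average_precision(run, qrels) with
# mean_average_precision(run, without_uniques(qrels, uniques)) for each run
# in the group; small differences indicate little pool bias.
```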

Incompleteness
Adequate pool depth (and diversity) is important to building reliable test collections
With such controls, large test collections are viable laboratory tools
For test collections, bias is much worse than incompleteness
–smaller, fair judgment sets always preferable to larger, potentially-biased sets
–need to carefully evaluate the effects of new pool-building paradigms with respect to the bias introduced

Cross-language Collections
More difficult to build a cross-language collection than a monolingual collection
–consistency is harder to obtain: multiple assessors per topic (one per language); must take care when comparing different language evaluations (e.g., a cross-language run to a monolingual baseline)
–pooling is harder to coordinate: need large, diverse pools for all languages; retrieval results are not balanced across languages; haven’t tended to get recall-oriented manual runs in cross-language tasks

Cranfield Tradition
Test collections are abstractions, but laboratory tests are useful nonetheless
–evaluation technology is predictive (i.e., results transfer to operational settings)
–relevance judgments by different assessors almost always produce the same comparative results
–adequate pools allow unbiased evaluation of unjudged runs

Cranfield Tradition
Note the emphasis on comparative!!
–the absolute score of some effectiveness measure is not meaningful:
  the absolute score changes when the assessor changes;
  query variability is not accounted for;
  the impact of collection size and generality is not accounted for;
  the theoretical maximum of 1.0 for both recall & precision is not obtainable by humans
–evaluation results are only comparable when they are from the same collection:
  a subset of a collection is a different collection;
  direct comparison of scores from two different TREC collections (e.g., scores from TRECs 7 & 8) is invalid