Evaluating Evaluation Measure Stability Authors: Chris Buckley, Ellen M. Voorhees Presenters: Burcu Dal, Esra Akbaş.

Similar presentations
Accurately Interpreting Clickthrough Data as Implicit Feedback Joachims, Granka, Pan, Hembrooke, Gay Paper Presentation: Vinay Goel 10/27/05.

Introduction to Information Retrieval
How to Pick Up Women…. Using IR Strategies By Mike Wooldridge May 9, 2006.
Retrieval Evaluation J. H. Wang Mar. 18, Outline Chap. 3, Retrieval Evaluation –Retrieval Performance Evaluation –Reference Collections.
Query Chains: Learning to Rank from Implicit Feedback Paper Authors: Filip Radlinski Thorsten Joachims Presented By: Steven Carr.
Introduction to Information Retrieval (Part 2) By Evren Ermis.
Precision and Recall.
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
Learning Techniques for Information Retrieval Perceptron algorithm Least mean.
Modern Information Retrieval
Modern Information Retrieval Chapter 5 Query Operations.
Retrieval Evaluation: Precision and Recall. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity.
©2007 H5 Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval Robert S. Bauer, Teresa Jade
Information Retrieval Ch Information retrieval Goal: Finding documents Search engines on the world wide web IR system characters Document collection.
Evaluating the Performance of IR Sytems
Query Reformulation: User Relevance Feedback. Introduction Difficulty of formulating user queries –Users have insufficient knowledge of the collection.
Lessons Learned from Information Retrieval Chris Buckley Sabir Research
Retrieval Evaluation. Introduction Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.
Evaluation CSC4170 Web Intelligence and Social Computing Tutorial 5 Tutor: Tom Chao Zhou
Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
Learning Techniques for Information Retrieval We cover 1.Perceptron algorithm 2.Least mean square algorithm 3.Chapter 5.2 User relevance feedback (pp )
WXGB6106 INFORMATION RETRIEVAL Week 3 RETRIEVAL EVALUATION.
ISP 433/633 Week 6 IR Evaluation. Why Evaluate? Determine if the system is desirable Make comparative assessments.
Estimate the Number of Relevant Images Using Two-Order Markov Chain Presented by: WANG Xiaoling Supervisor: Clement LEUNG.
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
LIS618 lecture 11 i/r performance evaluation Thomas Krichel
Social Statistics S519: Evaluation of Information Systems.
Evaluation David Kauchak cs458 Fall 2012 adapted from:
Evaluation David Kauchak cs160 Fall 2009 adapted from:
Philosophy of IR Evaluation Ellen Voorhees. NIST Evaluation: How well does system meet information need? System evaluation: how good are document rankings?
IR Evaluation Evaluate what? –user satisfaction on specific task –speed –presentation (interface) issue –etc. My focus today: –comparative performance.
A Comparative Study of Search Result Diversification Methods Wei Zheng and Hui Fang University of Delaware, Newark DE 19716, USA
UOS 1 Ontology Based Personalized Search Zhang Tao The University of Seoul.
Evaluation INST 734 Module 5 Doug Oard. Agenda Evaluation fundamentals  Test collections: evaluating sets Test collections: evaluating rankings Interleaving.
Advanced information retrieval Chapter. 02: Modeling (Set Theoretic Models) – Fuzzy model.
June 5, 2006University of Trento1 Latent Semantic Indexing for the Routing Problem Doctorate course “Web Information Retrieval” PhD Student Irina Veredina.
University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.
Lecture 1: Overview of IR Maya Ramanath. Who hasn’t used Google? Why did Google return these results first ? Can we improve on it? Is this a good result.
Evaluation INST 734 Module 5 Doug Oard. Agenda Evaluation fundamentals Test collections: evaluating sets  Test collections: evaluating rankings Interleaving.
WIRED Week 3 Syllabus Update (next week) Readings Overview - Quick Review of Last Week’s IR Models (if time) - Evaluating IR Systems - Understanding Queries.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Chapter 23: Probabilistic Language Models April 13, 2004.
Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.
Latent Semantic Indexing and Probabilistic (Bayesian) Information Retrieval.
© 2004 Chris Staff CSAW’04 University of Malta of 15 Expanding Query Terms in Context Chris Staff and Robert Muscat Department of.
Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment.
Ranking Related Entities Components and Analyses CIKM’10 Advisor: Jia Ling, Koh Speaker: Yu Cheng, Hsieh.
Performance Measurement. 2 Testing Environment.
Performance Measures. Why to Conduct Performance Evaluation? 2 n Evaluation is the key to building effective & efficient IR (information retrieval) systems.
Advantages of Query Biased Summaries in Information Retrieval by A. Tombros and M. Sanderson Presenters: Omer Erdil Albayrak Bilge Koroglu.
Web Search and Text Mining Lecture 5. Outline Review of VSM More on LSI through SVD Term relatedness Probabilistic LSI.
Jen-Tzung Chien, Meng-Sung Wu Minimum Rank Error Language Modeling.
1 Evaluating High Accuracy Retrieval Techniques Chirag Shah,W. Bruce Croft Center for Intelligent Information Retrieval Department of Computer Science.
Active Feedback in Ad Hoc IR Xuehua Shen, ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
Topic by Topic Performance of Information Retrieval Systems Walter Liggett National Institute of Standards and Technology TREC-7 (1999)
Evaluation of Information Retrieval Systems Xiangming Mu.
Information Retrieval Quality of a Search Engine.
Information Retrieval and Web Search IR models: Vector Space Model Term Weighting Approaches Instructor: Rada Mihalcea.
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information Retrieval Chengxiang Zhai, John Lafferty School of Computer Science Carnegie.
Knowledge and Information Retrieval Dr Nicholas Gibbins 32/4037.
Collection Fusion in Carrot2
Walid Magdy Gareth Jones
7CCSMWAL Algorithmic Issues in the WWW
IR Theory: Evaluation Methods
Evaluation of IR Performance
Retrieval Evaluation - Reference Collections
Retrieval Performance Evaluation - Measures
Precision and Recall.
Presentation transcript:

Evaluating Evaluation Measure Stability Authors: Chris Buckley, Ellen M. Voorhees Presenters: Burcu Dal, Esra Akbaş

Retrieval System Evaluation  Experiments on the accuracies of evaluation measures  Requirements for acceptable experiments:  Reasonable number of requests.  Reasonable evaluation measure.  Reasonable notion of difference. A test collection consists of a set of documents, a set of topics, and a set of relevance judgments.

Retrieval System Evaluation (2)
- Each retrieval strategy produces a ranked list of documents for each topic.
- The list is ordered by decreasing estimated likelihood of relevance.
- The effectiveness of a strategy is computed as a function of the ranks at which the relevant documents are retrieved (see the sketch below).
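A small hedged sketch (hypothetical names, not code from the paper): a run can be stored as a mapping from topic to ranked document ids, and the measures that follow only need the 1-based ranks at which relevant documents occur.

```python
def relevant_ranks(ranking, relevant):
    """Return the 1-based ranks at which relevant documents appear in a ranked list."""
    return [rank for rank, doc_id in enumerate(ranking, start=1) if doc_id in relevant]


run = {"t1": ["d7", "d2", "d9", "d4"]}     # topic_id -> doc_ids, best first
qrels = {"t1": {"d2", "d4"}}               # topic_id -> relevant doc_ids
print(relevant_ranks(run["t1"], qrels["t1"]))   # -> [2, 4]
```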

IR Measures
- Prec(λ): precision after λ documents have been retrieved.
- Recall(1000): recall after 1000 documents have been retrieved.
- Prec at .5 Recall: precision at the point where half of the relevant documents have been retrieved.
- R-Prec: precision after R documents, where R is the number of relevant documents for the topic.
- Average Precision: the mean of the precision values obtained at the rank of each retrieved relevant document.
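Below is a minimal sketch of these measures for a single topic, using their standard definitions (illustrative code, not the authors' implementation); `ranking` is the ranked list of doc ids and `relevant` is the set of judged-relevant doc ids. The interpolation needed for Prec at .5 Recall is omitted for brevity.

```python
def precision_at(ranking, relevant, k):
    """Prec(k): fraction of the top k retrieved documents that are relevant."""
    return sum(1 for d in ranking[:k] if d in relevant) / k


def recall_at(ranking, relevant, k=1000):
    """Recall(k): fraction of all relevant documents found in the top k."""
    return sum(1 for d in ranking[:k] if d in relevant) / len(relevant)


def r_precision(ranking, relevant):
    """R-Prec: precision after R documents, where R is the number of relevant documents."""
    return precision_at(ranking, relevant, len(relevant))


def average_precision(ranking, relevant):
    """Mean of the precision values at each rank where a relevant document occurs;
    relevant documents that are never retrieved contribute a precision of 0."""
    hits, precision_sum = 0, 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0
```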

Computing the error rate
- Goal: quantify the error rate associated with deciding that one retrieval method is better than another.
- The error rate is estimated from an experiment that fixes:
  - a particular number of topics,
  - a specific evaluation measure,
  - a particular fuzziness value.

- Select an evaluation measure and a fuzziness value.
- Pick a query set and evaluate each of the nine retrieval methods on it.
- For each pair of methods, decide whether the first is better than, worse than, or equal to the second with respect to the fuzziness value (a sketch follows).
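The sketch below shows one plausible reading of that comparison (assumed details on my part: the comparison is made on the mean score over the topic set, and a relative difference smaller than the fuzziness value counts as a tie; the paper's exact rule may differ).

```python
from statistics import mean


def compare(scores_a, scores_b, topics, fuzziness=0.05):
    """Return '>', '<' or '=' for method A vs. method B over the given topics,
    treating a relative difference within `fuzziness` as a tie."""
    mean_a = mean(scores_a[t] for t in topics)
    mean_b = mean(scores_b[t] for t in topics)
    if abs(mean_a - mean_b) <= fuzziness * max(mean_a, mean_b):
        return "="
    return ">" if mean_a > mean_b else "<"
```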

Figure 1: Counts of the number of times the retrieval method of the row was better than, worse than, or equal to the method of the column. Counts were computed using a fuzziness factor of 5% and the original 21 query sets.

- |A > B| is the number of times method A is better than method B in an entry.
- The number of times methods are deemed equivalent reflects the power of a measure to discriminate among systems.
- The proportion of ties is therefore reported alongside the error rate (see the sketch below).
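A sketch of turning the better/worse/equal counts of Figure 1 into these two summary statistics: the error rate treats the less frequent of the two strict outcomes as the wrong decision; normalising by the total number of comparisons (including ties) is an assumption on my part, and the example numbers are purely hypothetical.

```python
def error_rate(a_better, b_better, ties):
    """The less frequent strict outcome is treated as the wrong decision."""
    total = a_better + b_better + ties
    return min(a_better, b_better) / total if total else 0.0


def proportion_of_ties(a_better, b_better, ties):
    total = a_better + b_better + ties
    return ties / total if total else 0.0


# Hypothetical entry: over 21 query sets, A beat B 15 times, lost 3 and tied 3.
print(error_rate(15, 3, 3))           # ~0.14
print(proportion_of_ties(15, 3, 3))   # ~0.14
```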

Average error rate and average proportion of ties for different evaluation measures.

Varying topic set size
- Investigate how changing the number of topics used in a test affects the error rate of the evaluation measures.
- Topic set sizes of 5, 10, 15, 20, 25, 30, 40, and 50 are examined.
- 100 trials are run for each topic set size (a sketch of the sampling loop follows).
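A sketch of that experiment loop is shown below. It reuses the `compare` helper sketched earlier and assumes the topic set for each trial is drawn at random (an assumption on my part); `runs_scores` is a hypothetical mapping from method name to per-topic scores.

```python
import random
from itertools import combinations


def error_rate_for_size(all_topics, runs_scores, size, fuzziness=0.05, trials=100):
    """Estimate the error rate from `trials` random topic sets of a given size.
    `all_topics` is a list of topic ids; `runs_scores` maps each method name to a
    {topic_id: score} dict; `compare` is the pairwise helper sketched earlier."""
    counts = {}                                   # (A, B) -> [A better, B better, ties]
    for _ in range(trials):
        topics = random.sample(all_topics, size)
        for a, b in combinations(sorted(runs_scores), 2):
            outcome = compare(runs_scores[a], runs_scores[b], topics, fuzziness)
            tally = counts.setdefault((a, b), [0, 0, 0])
            tally[{">": 0, "<": 1, "=": 2}[outcome]] += 1
    wrong = sum(min(c[0], c[1]) for c in counts.values())
    total = sum(sum(c) for c in counts.values())
    return wrong / total if total else 0.0
```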

Varying fuzziness values
- Larger fuzziness values decrease the error rate but also decrease the discrimination power of the measure (illustrated below).
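A quick, purely hypothetical illustration of that trade-off using the `compare` sketch from above: as the fuzziness value grows, a narrow win turns into a tie, so fewer wrong decisions are possible but fewer distinctions are made.

```python
scores_a = {"t1": 0.42, "t2": 0.31, "t3": 0.55}   # hypothetical per-topic scores
scores_b = {"t1": 0.40, "t2": 0.33, "t3": 0.50}
for f in (0.00, 0.05, 0.10, 0.20):
    print(f"fuzziness={f:.2f}: A {compare(scores_a, scores_b, list(scores_a), f)} B")
# prints '>' at fuzziness 0.00, '=' at the larger fuzziness values
```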

The effect of fuzziness value on average error rate.

Conclusion  Error rate depends on  Topic set size  Query size  Fuzziness value  Evaluation measure

 Thanks