Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation. Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment.

Presentation transcript:

Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment CLEF’10: Conference on Multilingual and Multimodal Information Access Evaluation September 20-23, Padua, Italy

2 Outline
1. Motivation: A tale about two TREC participants
2. Context: IRS effectiveness evaluation; Issue: Tie-breaking bias effects
3. Contribution: Reordering strategies
4. Experiments: Impact of the tie-breaking bias
5. Conclusion and Future Works

3 Outline (section divider; same outline as slide 2)

4 A tale about two TREC participants (1/2)
Topic 031, "satellite launch contracts", has 5 relevant documents. Chris and Ellen submitted runs with one single difference, yet Chris turns out unlucky and Ellen lucky. Why such a huge difference?

5 A tale about two TREC participants (2/2)
After 15 days of hard work, the only difference between Chris's and Ellen's runs is the name of one document.

6 Outline (section divider; same outline as slide 2)

7 Measuring the effectiveness of IRSs
User-centered vs. system-focused evaluation [Spärck Jones & Willett, 1997]
Evaluation campaigns:
- 1958: Cranfield (UK)
- 1992: TREC, Text REtrieval Conference (USA)
- 1999: NTCIR, NII Test Collection for IR Systems (Japan)
- 2001: CLEF, Cross-Language Evaluation Forum (Europe)
- ...
"Cranfield" methodology [Voorhees, 2007]: a task, a test collection (corpus, topics, qrels), and measures such as MAP computed with trec_eval.

8 Runs are reordered prior to their evaluation
Qrels = ⟨qid, iter, docno, rel⟩
Run = ⟨qid, iter, docno, rank, sim, run_id⟩
Reordering by trec_eval: qid asc, sim desc, docno desc.
Effectiveness measure = f(intrinsic_quality, luck), e.g., MAP, MRR...
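The conventional reordering can be illustrated with a minimal Python sketch of the sort keys listed above. This is not the official trec_eval source; the run representation and the example docnos are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class RunLine:
    qid: str     # topic identifier
    docno: str   # document identifier
    sim: float   # similarity score reported by the system

def conventional_reorder(run: list[RunLine]) -> list[RunLine]:
    # Python's sort is stable, so the least significant key is applied first:
    # qid asc, sim desc, and ties on sim broken by docno desc.
    run = sorted(run, key=lambda r: r.docno, reverse=True)  # tie-break: docno desc
    run = sorted(run, key=lambda r: r.sim, reverse=True)    # sim desc
    return sorted(run, key=lambda r: r.qid)                 # qid asc

# Two documents sharing the same sim end up ordered by docno alone: a WSJ document
# precedes an AP document regardless of relevance (hypothetical docnos below).
tied = [RunLine("031", "AP890705-0042", 0.8), RunLine("031", "WSJ870324-0001", 0.8)]
print([r.docno for r in conventional_reorder(tied)])  # ['WSJ870324-0001', 'AP890705-0042']
```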

9 Outline (section divider; same outline as slide 2)

10 Consequences of run reordering
Measures of effectiveness for an IRS s, all sensitive to document rank:
- RR(s, t): 1/rank of the 1st relevant document, for topic t
- P(s, t, d): precision at document d, for topic t
- AP(s, t): average precision for topic t
- MAP(s): mean average precision
Tie-breaking bias: is the Wall Street Journal collection more relevant than Associated Press?
- Problem 1, comparing 2 systems: AP(s1, t) vs. AP(s2, t)
- Problem 2, comparing 2 topics: AP(s, t1) vs. AP(s, t2)
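A brief sketch of how the rank-based measures above are computed may help; it follows the standard TREC-style definitions rather than any particular implementation, and the relevance judgments are passed in as a plain set for simplicity.

```python
def reciprocal_rank(ranking: list[str], relevant: set[str]) -> float:
    # RR(s, t): 1 / rank of the first relevant document, 0 if none is retrieved.
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            return 1.0 / rank
    return 0.0

def average_precision(ranking: list[str], relevant: set[str]) -> float:
    # AP(s, t): mean of the precision values observed at each relevant document.
    hits, precision_sum = 0, 0.0
    for rank, docno in enumerate(ranking, start=1):
        if docno in relevant:
            hits += 1
            precision_sum += hits / rank      # P(s, t, d) at this relevant document
    return precision_sum / len(relevant) if relevant else 0.0

# MAP(s) is the mean of AP(s, t) over all topics. Swapping two tied documents
# changes the ranks used above, which is exactly how tie-breaking leaks into
# RR, AP, and MAP.
```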

11 Alternative unbiased reordering strategies
- Conventional reordering (TREC), ties sorted Z to A: qid asc, sim desc, docno desc
- Realistic reordering, relevant docs last among ties: qid asc, sim desc, rel asc, docno desc
- Optimistic reordering, relevant docs first among ties: qid asc, sim desc, rel desc, docno desc
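As a rough sketch of the three strategies, assuming each run line has been joined with its qrels judgment beforehand (rel = 1 for judged relevant, 0 otherwise), only the handling of documents tied on sim differs:

```python
def reorder(run, qrels, strategy="conventional"):
    """run: list of (qid, docno, sim) tuples; qrels: dict mapping (qid, docno) -> rel."""
    def rel(r):
        return qrels.get((r[0], r[1]), 0)

    # Stable sorts, applied from least to most significant key.
    run = sorted(run, key=lambda r: r[1], reverse=True)   # docno desc
    if strategy == "realistic":
        run = sorted(run, key=rel)                        # rel asc: relevant docs last within a tie
    elif strategy == "optimistic":
        run = sorted(run, key=rel, reverse=True)          # rel desc: relevant docs first within a tie
    run = sorted(run, key=lambda r: r[2], reverse=True)   # sim desc
    return sorted(run, key=lambda r: r[0])                # qid asc
```

Computing AP after the realistic and optimistic reorderings then brackets the conventional value: AP_realistic <= AP_conventional <= AP_optimistic.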

12 Outline (section divider; same outline as slide 2)

13 Effect of the tie-breaking bias
Study of 4 TREC tasks (adhoc, routing, filtering, web): 22 editions, 1360 runs, gigabytes of data from trec.nist.gov.
Assessing the effect of tie-breaking:
- Proportion of document ties: how frequent is the bias?
- Effect on measure values: top 3 observed differences, observed difference in %, significance of the observed difference assessed with Student's t-test (paired, one-sided)
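The significance test named on the slide, a paired one-sided Student's t-test over per-topic scores, might be run along these lines; the variable names are illustrative, and SciPy's two-sided p-value is halved by hand to obtain the one-sided value.

```python
from scipy import stats

def paired_one_sided_ttest(ap_conventional, ap_alternative):
    # H1: the alternative reordering yields higher AP than the conventional one.
    t, p_two_sided = stats.ttest_rel(ap_alternative, ap_conventional)
    p_one_sided = p_two_sided / 2 if t > 0 else 1 - p_two_sided / 2
    return t, p_one_sided
```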

14 Ties demographics
89.6% of the runs contain ties; ties are present all along the result lists.

15 Proportion of tied documents in submitted runs
On average, 25.2% of a result list consists of tied documents, and a tied group contains 10.6 documents on average.
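The two statistics on this slide could be recomputed from a run with a sketch along these lines; the per-topic lists of scores are an assumed input format.

```python
from itertools import groupby

def tie_statistics(sims_per_topic: dict[str, list[float]]):
    # Returns (share of documents belonging to a tied group, mean tied-group size).
    tied_docs, total_docs, group_sizes = 0, 0, []
    for sims in sims_per_topic.values():
        total_docs += len(sims)
        for _, group in groupby(sorted(sims, reverse=True)):
            size = len(list(group))
            if size > 1:               # a tied group holds at least 2 docs with equal sim
                tied_docs += size
                group_sizes.append(size)
    share = tied_docs / total_docs if total_docs else 0.0
    mean_size = sum(group_sizes) / len(group_sizes) if group_sizes else 0.0
    return share, mean_size
```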

16 Effect on Reciprocal Rank (RR)

17 Effect on Average Precision (AP)

18 Effect on Mean Average Precision (MAP)
The difference between system rankings computed on MAP is not significant (Kendall's τ).
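The ranking comparison mentioned here, Kendall's τ between the system orderings induced by MAP under two reordering strategies, might be computed as follows; the dictionaries keyed by system identifier are an assumed representation.

```python
from scipy.stats import kendalltau

def ranking_correlation(map_a: dict[str, float], map_b: dict[str, float]):
    # Compare the system rankings induced by two sets of MAP values.
    systems = sorted(map_a)
    tau, p_value = kendalltau([map_a[s] for s in systems],
                              [map_b[s] for s in systems])
    return tau, p_value
```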

19 What we learnt: beware of tie-breaking for AP
Small effect on MAP, larger effect on AP.
Measure bounds: AP_realistic ≤ AP_conventional ≤ AP_optimistic.
Failure analysis for the ranking process: the error bar between the bounds reflects an element of chance, i.e., potential for improvement (illustrated on run padre1, adhoc'94).

20 Related works in IR evaluation [Voorhees, 2007]
- Topics reliability: [Buckley & Voorhees, 2000] 25 topics; [Voorhees & Buckley, 2002] error rate; [Voorhees, 2009] n collections
- Qrels reliability: [Voorhees, 1998] quality; [Al-Maskari et al., 2008] TREC vs. non-TREC
- Measures reliability: [Buckley & Voorhees, 2000] MAP; [Sakai, 2008] 'system bias'; [Moffat & Zobel, 2008] new measures; [Raghavan et al., 1989] Precall; [McSherry & Najork, 2008] tied scores
- Pooling reliability: [Zobel, 1998] approximation; [Sanderson & Joho, 2004] manual; [Buckley et al., 2007] size adaptation; [Cabanac et al., 2010] tie-breaking bias

21 Outline (section divider; same outline as slide 2)

22 Conclusions and future works
Context: IR evaluation, i.e., TREC and other campaigns based on trec_eval.
Contributions:
- Measure = f(intrinsic_quality, luck): the tie-breaking bias
- Measure bounds: realistic ≤ conventional ≤ optimistic
- Study of the tie-breaking bias effect, comparing conventional and realistic reorderings for RR, AP and MAP: strong correlation, yet significant differences; no difference in system rankings (based on MAP)
Future works:
- Study of other / more recent evaluation campaigns
- Reordering-free measures
- Finer-grained analyses: finding vs. ranking

Thank you CLEF’10: Conference on Multilingual and Multimodal Information Access Evaluation September 20-23, Padua, Italy

24 'Stuffing' phenomenon
What is the rationale behind an IRS retrieving non-relevant documents with sim = 0? An undue score increase? The effect of this issue is minimized with the realistic reordering strategy, since relevant documents are queued at the bottom of each tied group.