Evaluation INST 734 Module 5 Doug Oard. Agenda Evaluation fundamentals Test collections: evaluating sets  Test collections: evaluating rankings Interleaving.

Similar presentations
Introduction to Information Retrieval

Super Awesome Presentation Dandre Allison Devin Adair.
Retrieval Evaluation J. H. Wang Mar. 18, Outline Chap. 3, Retrieval Evaluation –Retrieval Performance Evaluation –Reference Collections.
1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.
Evaluation LBSC 796/INFM 718R Session 5, October 7, 2007 Douglas W. Oard.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Precision and Recall.
Evaluating Search Engine
Evaluation in Information Retrieval Speaker: Ruihua Song Web Data Management Group, MSR Asia.
Deliverable #3: Document and Passage Retrieval Ling 573 NLP Systems and Applications May 10, 2011.
Evaluating Evaluation Measure Stability Authors: Chris Buckley, Ellen M. Voorhees Presenters: Burcu Dal, Esra Akbaş.
Information Retrieval Ling573 NLP Systems and Applications April 26, 2011.
Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.
©2007 H5 Simultaneous Achievement of high Precision and high Recall through Socio-Technical Information Retrieval Robert S. Bauer, Teresa Jade
Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.
ISP 433/633 Week 6 IR Evaluation. Why Evaluate? Determine if the system is desirable Make comparative assessments.
The Relevance Model  A distribution over terms, given information need I, (Lavrenko and Croft 2001). For term r, P(I) can be dropped w/o affecting the.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning and Pandu Nayak Evaluation.
Evaluation of Image Retrieval Results Relevant: images which meet user’s information need Irrelevant: images which don’t meet user’s information need Query:
Retrieval Evaluation Hongning Wang
Evaluation David Kauchak cs458 Fall 2012 adapted from:
Evaluation David Kauchak cs160 Fall 2009 adapted from:
Minimal Test Collections for Retrieval Evaluation B. Carterette, J. Allan, R. Sitaraman University of Massachusetts Amherst SIGIR2006.
Philosophy of IR Evaluation Ellen Voorhees. NIST Evaluation: How well does system meet information need? System evaluation: how good are document rankings?
IR Evaluation Evaluate what? –user satisfaction on specific task –speed –presentation (interface) issue –etc. My focus today: –comparative performance.
Introduction to Information Retrieval Introduction to Information Retrieval CS276 Information Retrieval and Web Search Chris Manning, Pandu Nayak and Prabhakar.
Information Retrieval: Problem Formulation & Evaluation ChengXiang Zhai Department of Computer Science University of Illinois, Urbana-Champaign.
1 Formal Models for Expert Finding on DBLP Bibliography Data Presented by: Hongbo Deng Co-worked with: Irwin King and Michael R. Lyu Department of Computer.
Evaluating Search Engines in chapter 8 of the book Search Engines Information Retrieval in Practice Hongfei Yan.
A Comparison of Statistical Significance Tests for Information Retrieval Evaluation CIKM´07, November 2007.
Evaluation INST 734 Module 5 Doug Oard. Agenda Evaluation fundamentals  Test collections: evaluating sets Test collections: evaluating rankings Interleaving.
CSCI 5417 Information Retrieval Systems Jim Martin Lecture 7 9/13/2011.
IR System Evaluation Farhad Oroumchian. IR System Evaluation System-centered strategy –Given documents, queries, and relevance judgments –Try several.
LANGUAGE MODELS FOR RELEVANCE FEEDBACK Lee Won Hee.
Chapter 8 Evaluating Search Engines. Evaluation is key to building effective and efficient search engines. Measurement usually carried out.
Tie-Breaking Bias: Effect of an Uncontrolled Parameter on Information Retrieval Evaluation Guillaume Cabanac, Gilles Hubert, Mohand Boughanem, Claude Chrisment.
Lecture 3: Retrieval Evaluation Maya Ramanath. Benchmarking IR Systems Result Quality Data Collection – Ex: Archives of the NYTimes Query set – Provided.
C. Watters, cs6403: Evaluation Measures. Evaluation? Effectiveness? For whom? For what? Efficiency? Time? Computational Cost? Cost of missed.
Performance Measures. Why to Conduct Performance Evaluation? Evaluation is the key to building effective & efficient IR (information retrieval) systems.
Evaluation LBSC 796/CMSC 828o Session 8, March 15, 2004 Douglas W. Oard.
Chapter 8: Evaluation Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008.
Evaluation LBSC 708A/CMSC 838L Session 6, October 16, 2001 Philip Resnik and Douglas W. Oard.
Evaluation. The major goal of IR is to search document relevant to a user query. The evaluation of the performance of IR systems relies on the notion.
Predicting User Interests from Contextual Information R. W. White, P. Bailey, L. Chen Microsoft (SIGIR 2009) Presenter : Jae-won Lee.
Crowdsourcing Blog Track Top News Judgments at TREC Richard McCreadie, Craig Macdonald, Iadh Ounis.
Introduction to Information Retrieval Introduction to Information Retrieval Lecture 10 Evaluation.
Search Engines Information Retrieval in Practice All slides ©Addison Wesley, 2008 Annotations by Michael L. Nelson.
Relevant Document Distribution Estimation Method for Resource Selection Luo Si and Jamie Callan School of Computer Science Carnegie Mellon University
Enterprise Track: Thread-based Retrieval Yejun Wu and Douglas W. Oard Goal: Explore -- document expansion.
Sampath Jayarathna Cal Poly Pomona
Evaluation Anisio Lacerda.
Walid Magdy Gareth Jones
Modern Retrieval Evaluations
Evaluation of IR Systems
Lecture 10 Evaluation.
Evaluation.
Modern Information Retrieval
IR Theory: Evaluation Methods
Evaluation of IR Performance
Lecture 6 Evaluation.
Feature Selection for Ranking
Cumulated Gain-Based Evaluation of IR Techniques
INF 141: Information Retrieval
Learning to Rank with Ties
Retrieval Performance Evaluation - Measures
Presentation transcript:

Evaluation INST 734 Module 5 Doug Oard

Agenda
Evaluation fundamentals
Test collections: evaluating sets
➤ Test collections: evaluating rankings
Interleaving
User studies

Which is the Best Rank Order?
(Figure: six candidate rankings, labeled A through F, with the relevant documents marked; which ordering should we prefer?)

Measuring Precision and Recall
Assume there are a total of 14 relevant documents, and let's evaluate a system that finds 6 of those 14 in the top 20, with the relevant documents retrieved at ranks 1, 5, 6, 8, 11, and 16.
Precision at ranks 1-10: 1/1, 1/2, 1/3, 1/4, 2/5, 3/6, 3/7, 4/8, 4/9, 4/10 = 0.4
Precision at ranks 11-20: 5/11, 5/12, 5/13, 5/14, 5/15, 6/16, 6/17, 6/18, 6/19, 6/20 = 0.3
Recall: 1/14 at ranks 1-4, 2/14 at rank 5, 3/14 at ranks 6-7, 4/14 at ranks 8-10, 5/14 at ranks 11-15, 6/14 at ranks 16-20
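To make the arithmetic concrete, here is a minimal Python sketch (an illustration, not part of the original slides) of precision and recall at a rank cutoff; the function name is invented, and the list of relevant ranks simply restates the example above.

def precision_recall_at_k(relevant_ranks, total_relevant, k):
    # Precision and recall after examining the top k results.
    retrieved_relevant = sum(1 for r in relevant_ranks if r <= k)
    return retrieved_relevant / k, retrieved_relevant / total_relevant

relevant_ranks = [1, 5, 6, 8, 11, 16]   # ranks of the 6 retrieved relevant docs
print(precision_recall_at_k(relevant_ranks, 14, 10))   # (0.4, 0.2857...)
print(precision_recall_at_k(relevant_ranks, 14, 20))   # (0.3, 0.4285...)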

Uninterpolated Average Precision
– Average of precision at each retrieved relevant doc
– Relevant docs not retrieved contribute zero to score
For the same example, the precision values at the six retrieved relevant documents (ranks 1, 5, 6, 8, 11, and 16) are 1/1, 2/5, 3/6, 4/8, 5/11, and 6/16; the 8 relevant documents not retrieved contribute eight zeros, so AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16) / 14 ≈ 0.23
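A matching sketch of uninterpolated average precision (again illustrative, not the course's code): dividing the sum of precisions by the total number of relevant documents is what makes unretrieved relevant documents contribute zero.

def average_precision(relevant_ranks, total_relevant):
    # relevant_ranks: 1-based ranks at which relevant documents were retrieved.
    precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / total_relevant

print(round(average_precision([1, 5, 6, 8, 11, 16], 14), 2))   # 0.23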

Some Topics are Easier Than Others (Ellen Voorhees, 1999)

Mean Average Precision (MAP)
Average precision is computed per topic, then averaged across topics. For example, with three topics:
– Topic 1: relevant documents at ranks 3 and 7 → precision 1/3 = 0.33 and 2/7 = 0.29 → AP = 0.31
– Topic 2: relevant documents at ranks 1, 2, 5, and 9 → precision 1/1 = 1.00, 2/2 = 1.00, 3/5 = 0.60, and 4/9 = 0.44 → AP = 0.76
– Topic 3: relevant documents at ranks 2, 4, 5, and 8 → precision 1/2 = 0.50, 2/4 = 0.50, 3/5 = 0.60, and 4/8 = 0.50 → AP = 0.53
MAP = (0.31 + 0.76 + 0.53) / 3 = 0.53
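MAP is then just the mean of per-topic average precision. The sketch below reproduces the three-topic example (the topic data restates the slide; the function names are illustrative, and it assumes every relevant document was retrieved, so the divisor is the number of relevant ranks).

def average_precision(relevant_ranks, total_relevant):
    return sum((i + 1) / r for i, r in enumerate(sorted(relevant_ranks))) / total_relevant

def mean_average_precision(topics):
    # topics: list of (relevant_ranks, total_relevant) pairs, one per topic.
    return sum(average_precision(ranks, n) for ranks, n in topics) / len(topics)

topics = [([3, 7], 2),          # AP ≈ 0.31
          ([1, 2, 5, 9], 4),    # AP ≈ 0.76
          ([2, 4, 5, 8], 4)]    # AP ≈ 0.53
print(round(mean_average_precision(topics), 2))   # 0.53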

Visualizing Mean Average Precision
(Figure: per-topic average precision plotted against topic.)

What MAP Hides
(Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999)

Some Other Evaluation Measures
– Mean Reciprocal Rank (MRR)
– Geometric Mean Average Precision (GMAP)
– Normalized Discounted Cumulative Gain (NDCG)
– Binary Preference (BPref)
– Inferred AP (infAP)
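As a rough illustration (these formulas are not spelled out on the slide, and conventions differ, e.g. in the exact NDCG discount), here are minimal Python sketches of two of these measures: MRR and a common log2-discounted form of NDCG.

import math

def mean_reciprocal_rank(first_relevant_ranks):
    # first_relevant_ranks: rank of the first relevant result for each query.
    return sum(1.0 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

def ndcg(gains, k):
    # gains: graded relevance of the retrieved documents, in ranked order.
    def dcg(gs):
        return sum(g / math.log2(i + 2) for i, g in enumerate(gs[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(round(mean_reciprocal_rank([1, 3, 2]), 2))   # (1 + 1/3 + 1/2) / 3 ≈ 0.61
print(round(ndcg([3, 2, 0, 1], k=4), 3))           # ≈ 0.985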

Relevance Judgment Strategies
Exhaustive assessment
– Usually impractical
Known-item queries
– Limited to MRR, requires hundreds of queries
Search-guided assessment
– Hard to quantify risks to completeness
Sampled judgments
– Good when relevant documents are common
Pooled assessment
– Requires cooperative evaluation

Pooled Assessment Methodology
Systems submit top 1000 documents per topic
Top 100 documents for each are judged
– Single pool, without duplicates, arbitrary order
– Judged by the person that wrote the query
Treat unevaluated documents as not relevant
Compute MAP down to 1000 documents
– Average in misses at 1000 as zero
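A rough sketch of that methodology (function and variable names invented for illustration): build a single de-duplicated pool from the top of each submitted run for a topic, have it judged, then score each run down to 1000 documents with unjudged documents counted as not relevant.

def build_pool(runs, depth=100):
    # runs: {system_name: ranked list of doc_ids for one topic}.
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])   # single pool, duplicates removed
    return pool                        # presented to the assessor in arbitrary order

def pooled_average_precision(ranking, judged_relevant, cutoff=1000):
    # Unjudged documents count as not relevant; misses average in as zero.
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking[:cutoff], start=1):
        if doc in judged_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(judged_relevant) if judged_relevant else 0.0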

Some Lessons From TREC
Incomplete judgments are useful
– If sample is unbiased with respect to systems tested
– Additional relevant documents are highly skewed across topics
Different relevance judgments change absolute score
– But rarely change comparative advantages when averaged
Evaluation technology is predictive
– Results transfer to operational settings
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

Recap: “Batch” Evaluation
Evaluation measures focus on relevance
– Users also want utility and understandability
Goal is typically to compare systems
– Values may vary, but relative differences are stable
Mean values obscure important phenomena
– Statistical significance tests address generalizability
– Failure analysis case studies can help you improve
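For the significance-testing point, one widely used option is a paired randomization (sign-flip) test over per-topic average precision; the sketch below is illustrative, with made-up scores, and is not part of the original module.

import random

def randomization_test(ap_a, ap_b, trials=10000, seed=0):
    # ap_a, ap_b: per-topic AP for systems A and B, paired by topic.
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(ap_a, ap_b)]
    observed = abs(sum(diffs)) / len(diffs)
    extreme = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped)) / len(diffs) >= observed:
            extreme += 1
    return extreme / trials   # estimated two-sided p-value

p = randomization_test([0.31, 0.76, 0.53, 0.40], [0.28, 0.70, 0.55, 0.35])
print(p)   # the smaller p is, the less likely the MAP difference arose by chance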

Agenda
Evaluation fundamentals
Test collections: evaluating sets
Test collections: evaluating rankings
➤ Interleaving
User studies