Developments in Evaluation of Search Engines

Developments in Evaluation of Search Engines Mark Sanderson, University of Sheffield

Evaluation in IR Use a test collection: a set of documents, topics, and relevance judgements.
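
A test collection can be pictured as three aligned pieces of data. The sketch below is purely illustrative: the ids and texts are invented (only the "nuclear waste dumping" topic echoes an example used later in the talk).

```python
# Illustrative only: a tiny in-memory test collection with the three
# parts named on the slide. Ids and texts are invented for the example.
documents = {
    "d1": "Plans for a nuclear waste dump in Utah were announced ...",
    "d2": "An unrelated news story about something else entirely ...",
}

topics = {
    "t1": "nuclear waste dumping",   # topic id -> statement of the information need
}

qrels = {                            # relevance judgements: (topic, doc) -> 0/1
    ("t1", "d1"): 1,
    ("t1", "d2"): 0,
}
```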

How to get lots of judgements? Do you check all documents for all topics? In the old days, yes. But this doesn't scale.

To form larger test collections Get your relevance judgements from pools. How does that work?

Pooling – many participants [Diagram: several runs (Run 1–Run 7) retrieve from the collection, and their top-ranked documents are merged into a single pool for judging.]
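
A minimal sketch of how such a pool might be built, assuming TREC-style run files (whitespace-separated "topic Q0 docid rank score tag" lines, written in rank order). The file names, function name, and default depth are illustrative, not from the talk.

```python
# Sketch: build a judging pool from the top-`depth` documents of each run.
# Assumes TREC-style run files with lines already in rank order per topic.
from collections import defaultdict

def build_pool(run_files, depth=100):
    """Union of the top-`depth` documents from every run, per topic."""
    pool = defaultdict(set)            # topic -> set of docids to judge
    for path in run_files:
        ranked = defaultdict(list)     # topic -> docids in rank order
        with open(path) as f:
            for line in f:
                topic, _q0, docid, _rank, _score, _tag = line.split()
                ranked[topic].append(docid)
        for topic, docids in ranked.items():
            pool[topic].update(docids[:depth])
    return pool

# e.g. pool = build_pool(["run1.txt", "run2.txt"], depth=100)
# sum(len(d) for d in pool.values()) documents would then need judging.
```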

Classic pool formation 10-100 runs; judge 1-2K documents per topic, i.e. 10-20 hours per topic (roughly half a minute to a minute per document). Across 50 topics, that is too much effort for one person.

Look at the two problem areas Pooling requires many participants; relevance assessment requires many person-hours.

Query pooling Don't have multiple runs from participating groups; instead, have one person create multiple queries for the same topic.

Query pooling First proposed by Cormack, G.V., Palmer, R.P., Clarke, C.L.A. (1998): Efficient Construction of Large Test Collections, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 282-289. Confirmed by Forming test collections with no system pooling, M. Sanderson, H. Joho, in the 27th ACM SIGIR conference, 2004.

Query pooling [Diagram: for the topic "nuclear waste dumping", one assessor issues several query variants against the collection – "radioactive waste", "radioactive waste storage", "hazardous waste", "nuclear waste storage", "Utah nuclear waste", "waste dump" – and the top-ranked results are merged into a pool.]
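
In code, query pooling for a single topic might look like the sketch below; search(query, k) is a hypothetical function returning the top-k docids from whatever retrieval system the assessor uses, and is not something named in the talk.

```python
# Sketch of query pooling for one topic: union the top-ranked documents
# returned for each query variant, then judge that pool.
variant_queries = [
    "radioactive waste", "radioactive waste storage", "hazardous waste",
    "nuclear waste storage", "utah nuclear waste", "waste dump",
]

def query_pool(queries, search, depth=100):
    pool = set()
    for q in queries:
        pool.update(search(q, depth))   # top-`depth` docids for this variant
    return pool                         # the single assessor judges this pool
```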

Another approach Maybe your assessors can read very fast, but can't search very well. Form the different queries with relevance feedback.

Query pooling, relevance feedback [Diagram: starting from the topic "nuclear waste dumping", successive feedback rounds (Feedback 1–6) each retrieve further documents from the collection to add to the pool.]
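
A rough sketch of the feedback variant, again assuming hypothetical callbacks: here search(query, k) returns (docid, text) pairs and judge(docid, text) records the assessor's decision. The term-expansion step is a deliberately naive stand-in for whatever relevance feedback method is actually used; none of these names come from the talk.

```python
# Naive sketch: grow the pool over several feedback rounds, expanding the
# query with terms from documents the assessor has judged relevant.
def feedback_pool(seed_query, search, judge, rounds=6, depth=10):
    pool = {}                                   # docid -> judgement (True/False)
    relevant_texts = []
    query_terms = seed_query.lower().split()
    for _ in range(rounds):
        for docid, text in search(" ".join(query_terms), depth):
            if docid not in pool:
                pool[docid] = judge(docid, text)   # assessor reads the document
                if pool[docid]:
                    relevant_texts.append(text)
        # crude expansion: append terms seen in relevant documents, keep 20
        new_terms = [t for text in relevant_texts for t in text.lower().split()]
        query_terms = list(dict.fromkeys(query_terms + new_terms))[:20]
    return pool
```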

Relevance feedback Use relevance feedback to form queries: Soboroff, I., Robertson, S. (2003): Building a filtering test collection for TREC 2002, in Proceedings of the ACM SIGIR conference.

Both options save time With query pooling: about 2 hours per topic. With system pooling: 10-20 hours per topic.

Notice, we didn't get everything. How much was missed? Attempts to estimate: Zobel, ACM SIGIR 1998; Manmatha, ACM SIGIR 2001. [Plot: estimated probability of relevance P(r) falling with rank.]

Do missing rels matter? For conventional IR testing, no – we are not interested in such things; we just want to know whether A>B, A=B, or A<B.
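
For that comparative question, a standard measure such as average precision over the pooled judgements is enough. A minimal sketch follows, with unjudged documents treated as non-relevant (the usual convention); the function name and arguments are illustrative.

```python
# Average precision of one ranked list against a set of judged-relevant
# docids; unjudged documents are simply treated as non-relevant.
def average_precision(ranking, relevant_docids):
    hits, total = 0, 0.0
    for rank, docid in enumerate(ranking, start=1):
        if docid in relevant_docids:
            hits += 1
            total += hits / rank
    return total / len(relevant_docids) if relevant_docids else 0.0

# To decide A > B, A = B or A < B, compare the two systems' scores
# averaged over all topics (ideally with a significance test).
```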

Not good enough? 1-2 hours per topic is still a lot of work, and there are hints that 50 topics are too few (hence the Million Query task of TREC). What can we do?

Test collections are Reproducible and reusable; they encourage collaboration and cross-comparison, tell you if your new idea works, and help you publish your work.

How do you do this? Focus on reducing the number of relevance assessments.

Simple approach TREC/CLEF judge down to the top 100 (sometimes 50). Judge down to the top 10 instead: far fewer documents, 11%-14% of the relevance assessor effort compared to top 100 (see the sketch below).
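
Reusing the build_pool sketch from earlier, the relative effort of the two depths can be checked directly on a set of runs; the run file names below are placeholders.

```python
# Compare judging effort at pool depths 10 and 100.
# Assumes the build_pool sketch defined earlier in these notes.
runs = ["run1.txt", "run2.txt", "run3.txt"]   # placeholder file names

pool_10 = build_pool(runs, depth=10)
pool_100 = build_pool(runs, depth=100)

docs_10 = sum(len(d) for d in pool_10.values())
docs_100 = sum(len(d) for d in pool_100.values())
print(f"depth-10 pool is {100 * docs_10 / docs_100:.1f}% of the depth-100 pool")
```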

Impact of saving Save a lot of time; lose a little in measurement accuracy.

Use the time saved To work on more topics; measurement accuracy improves. M. Sanderson, J. Zobel (2005): Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability, in Proceedings of the 28th ACM SIGIR conference.

Questions? m.sanderson@shef.ac.uk dis.shef.ac.uk/mark