Developments in Evaluation of Search Engines Mark Sanderson, University of Sheffield
Evaluation in IR. Use a test collection: a set of documents, topics, and relevance judgements.
How to get lots of judgements? Do you check all documents against all topics? In the old days, yes. But this doesn't scale.
To form larger test collections, get your relevance judgements from pools. How does that work?
Pooling – many participants. [Figure: runs 1–7 from different participating systems are each searched against the collection; their top-ranked documents are merged into the judging pool.]
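To make the mechanics concrete, here is a minimal sketch of system pooling, assuming each run is simply a ranked list of document IDs for one topic; `build_pool`, the toy runs, and the depth value are illustrative, not TREC's actual tooling.

```python
# Minimal sketch of system pooling: the pool for a topic is the union of the
# top `depth` documents from every participating run, which is then judged.

def build_pool(runs, depth=100):
    """runs: list of ranked doc-ID lists for a single topic."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])   # take the top `depth` documents from this run
    return pool

# Toy runs from three hypothetical systems:
run_a = ["d3", "d7", "d1", "d9"]
run_b = ["d7", "d2", "d3", "d5"]
run_c = ["d8", "d3", "d6", "d4"]
print(build_pool([run_a, run_b, run_c], depth=3))   # documents sent to the assessor
```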
Classic pool formation: 10-100 runs; judge 1-2K documents per topic; 10-20 hours per topic. With 50 topics, that is too much effort for one person.
Look at the two problem areas: pooling requires many participants; relevance assessment requires many person-hours.
Query pooling: don't gather multiple runs from participating groups; instead, have one person create multiple queries for each topic.
Query pooling. First proposed by Cormack, G.V., Palmer, C.R., Clarke, C.L.A. (1998): Efficient Construction of Large Test Collections, in Proceedings of the 21st annual international ACM SIGIR conference on Research and development in information retrieval: 282-289. Confirmed by Sanderson, M., Joho, H. (2004): Forming test collections with no system pooling, in Proceedings of the 27th ACM SIGIR conference.
Query pooling. [Figure: for the topic 'nuclear waste dumping', one assessor issues several query variants against the collection: 'radioactive waste', 'radioactive waste storage', 'hazardous waste', 'nuclear waste storage', 'Utah nuclear waste', 'waste dump'.]
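A sketch of the same idea with query pooling: the union is taken over one assessor's query variants rather than over participants' runs. Here `search(query, k)` is an assumed retrieval function returning ranked document IDs; the variants are the ones from the slide.

```python
# Sketch of query pooling: one assessor writes several query variants for a topic,
# and the pool is the union of the top `depth` results of each variant.

def query_pool(search, query_variants, depth=100):
    pool = set()
    for query in query_variants:
        pool.update(search(query, depth))   # top `depth` results for this variant
    return pool

variants = [
    "nuclear waste dumping",
    "radioactive waste",
    "radioactive waste storage",
    "hazardous waste",
    "nuclear waste storage",
    "Utah nuclear waste",
    "waste dump",
]
# pool = query_pool(my_engine.search, variants, depth=100)   # my_engine is hypothetical
```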
Another approach: maybe your assessors can read very fast, but can't search very well. Form different queries with relevance feedback.
Query pooling, relevance feedback. [Figure: starting from the query 'nuclear waste dumping', feedback rounds 1–6 each retrieve a fresh set of documents from the collection.]
Relevance feedback: use relevance feedback to form the queries. Soboroff, I., Robertson, S. (2003): Building a filtering test collection for TREC 2002, in Proceedings of the ACM SIGIR conference.
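A rough sketch of how feedback-driven pooling could be looped, under simple assumptions: `search`, `judge` (the assessor's yes/no), and `fetch_text` are hypothetical helpers, and the term-expansion step is a crude frequency count rather than the weighting used by Soboroff and Robertson.

```python
# Sketch of relevance-feedback pooling: after each round the assessor's judged-relevant
# documents supply expansion terms for the next query. The loop structure, not the
# weighting scheme, is the point.
from collections import Counter

def top_terms(docs_text, n=5):
    # crude expansion: most frequent words in the relevant documents
    counts = Counter(w.lower() for text in docs_text for w in text.split())
    return [term for term, _ in counts.most_common(n)]

def feedback_pool(search, judge, fetch_text, seed_query, rounds=6, depth=20):
    pool, query = set(), seed_query
    for _ in range(rounds):
        retrieved = search(query, depth)                   # ranked doc IDs for current query
        pool.update(retrieved)
        relevant = [d for d in retrieved if judge(d)]      # assessor marks relevant docs
        if not relevant:
            break
        expansion = top_terms([fetch_text(d) for d in relevant])
        query = seed_query + " " + " ".join(expansion)     # expanded query for next round
    return pool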
Both options save time. With query pooling: 2 hours per topic. With system pooling: 10-20 hours per topic?
Notice, we didn't get everything. How much was missed? Attempts to estimate: Zobel, ACM SIGIR 1998; Manmatha, ACM SIGIR 2001. [Figure: probability of relevance P(r), starting at 1, plotted against rank.]
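One crude way to get a feel for such estimates, only in the spirit of this extrapolation and not the specific models of Zobel or Manmatha: treat the count of newly found relevant documents in successive depth bands as a roughly geometric series and project its tail.

```python
# Rough sketch: if the number of new relevant documents found in each successive
# depth band (ranks 1-10, 11-20, ...) decays roughly geometrically, the tail of
# that series estimates how many relevant documents the pool never reached.

def estimate_missed(rel_per_band):
    """rel_per_band: relevant docs found in each successive depth band."""
    ratios = [b / a for a, b in zip(rel_per_band, rel_per_band[1:]) if a > 0]
    if not ratios:
        return 0.0
    r = sum(ratios) / len(ratios)            # average decay ratio between bands
    if r >= 1.0:
        return float("inf")                  # no sign of decay; cannot extrapolate
    last = rel_per_band[-1]
    return last * r / (1.0 - r)              # sum of the remaining geometric tail

print(estimate_missed([40, 22, 12, 7, 4]))   # toy counts per depth band
```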
Do missing rels matter? For conventional IR testing, no – we're not interested in such things. We just want to know whether A>B, A=B, or A<B.
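Since the question reduces to comparing two systems over the same topics, a paired significance test on per-topic scores is the usual check; the sketch below uses toy average-precision values and SciPy's paired t-test.

```python
# Sketch of the A-versus-B question: given per-topic scores (e.g. average precision)
# for two systems over the same topics, a paired test indicates whether the observed
# difference is likely to be real. Scores here are toy numbers.
from scipy import stats

scores_a = [0.42, 0.31, 0.55, 0.27, 0.49, 0.38]   # system A, one score per topic
scores_b = [0.36, 0.33, 0.48, 0.22, 0.45, 0.30]   # system B, same topics

t_stat, p_value = stats.ttest_rel(scores_a, scores_b)   # paired t-test across topics
mean_diff = sum(a - b for a, b in zip(scores_a, scores_b)) / len(scores_a)
print(f"A - B = {mean_diff:+.3f}, p = {p_value:.3f}")
```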
Not good enough? 1-2 hours per topic is still a lot of work, and there are hints that 50 topics are too few (the Million Query track of TREC). What can we do?
Test collections are reproducible and reusable; they encourage collaboration and cross-comparison, tell you if your new idea works, and help you publish your work.
How do you do this? Focus on reducing the number of relevance assessments.
Simple approach. TREC/CLEF judge down to top 100 (sometimes 50). Judge down to top 10 instead: far fewer documents, 11%-14% of the relevance assessor effort compared to top 100.
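The saving can be checked on your own runs by comparing unique pool sizes at the two depths; this sketch assumes runs are ranked document-ID lists grouped by topic, and the 0.11-0.14 figure is what one would hope to see, not a guarantee.

```python
# Quick check of the effort saving: compare the number of unique documents to judge
# when pooling to depth 10 versus depth 100.

def pool_size(runs, depth):
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:depth])
    return len(pool)

def effort_ratio(runs_by_topic, shallow=10, deep=100):
    shallow_total = sum(pool_size(runs, shallow) for runs in runs_by_topic)
    deep_total = sum(pool_size(runs, deep) for runs in runs_by_topic)
    return shallow_total / deep_total        # fraction of assessor effort needed

# ratio = effort_ratio(my_runs_by_topic)     # my_runs_by_topic is hypothetical
```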
Impact of the saving: save a lot of time, lose a little measurement accuracy.
Use the time saved to work on more topics; measurement accuracy improves. Sanderson, M., Zobel, J. (2005): Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability, in Proceedings of the 28th ACM SIGIR conference.
Questions? m.sanderson@shef.ac.uk dis.shef.ac.uk/mark