1
Developments in Evaluation of Search Engines
Mark Sanderson, University of Sheffield
2
Evaluation in IR
Use a test collection: a set of documents, topics, and relevance judgements.
3
How to get lots of judgements?
Do you check all documents for all topics? In the old days, yes. But this doesn't scale.
4
To form larger test collections
Get your relevance judgements from pools. How does that work?
5
Pooling – many participants
[Diagram: the document collection searched by seven participant runs (Run 1 through Run 7).]
6
Classic pool formation
Pool the documents retrieved by the runs. Judge 1-2K documents per topic, 10-20 hours per topic. With 50 topics, that is too much effort for one person.
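As an illustration of classic pool formation, here is a minimal sketch (not TREC's actual tooling; the run format and pool depth are assumptions): the pool for a topic is the union of the top-ranked documents from every submitted run.

```python
def form_pool(runs, depth=100):
    """Classic system pooling: the union of the top-`depth` documents
    from every submitted run for one topic.

    `runs` is assumed to be a dict mapping run names to ranked lists
    of document IDs (best first)."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:depth])
    return pool

# Hypothetical example with three tiny runs for one topic.
runs = {
    "run1": ["d3", "d7", "d1", "d9"],
    "run2": ["d7", "d2", "d3", "d5"],
    "run3": ["d8", "d3", "d6", "d7"],
}
pool = form_pool(runs, depth=3)
print(sorted(pool))  # every document that some run ranked in its top 3
```

Only the pooled documents are judged; anything outside the pool is treated as non-relevant.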
7
Look at the two problem areas
Pooling requires many participants. Relevance assessment requires many person-hours.
8
Query pooling
Don't have multiple runs from different groups? Have one person create multiple queries instead.
9
Query pooling: proposed and confirmed
First proposed by: Cormack, G.V., Palmer, C.R., Clarke, C.L.A. (1998). Efficient Construction of Large Test Collections. In Proceedings of the 21st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
Confirmed by: Sanderson, M., Joho, H. (2004). Forming Test Collections with No System Pooling. In Proceedings of the 27th Annual International ACM SIGIR Conference.
10
Query pooling
[Diagram: one assessor's query variants run against the collection: "Nuclear waste dumping", "Radioactive waste", "Radioactive waste storage", "Hazardous waste", "Nuclear waste storage", "Utah nuclear waste", "Waste dump".]
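A sketch of query pooling under the same assumptions (the `search` function here is hypothetical, standing in for whatever retrieval system the assessor uses): one person writes several query variants for the topic, and the pool is the union of the top-ranked documents returned for each variant.

```python
def query_pool(search, query_variants, depth=100):
    """Query pooling: one person writes several query variants for a
    topic; the pool is the union of the top-`depth` documents
    retrieved for each variant by a single system.

    `search(query, k)` is a hypothetical function returning a ranked
    list of document IDs."""
    pool = set()
    for query in query_variants:
        pool.update(search(query, depth))
    return pool

# Query variants from the nuclear-waste example on the slide.
variants = [
    "Nuclear waste dumping", "Radioactive waste",
    "Radioactive waste storage", "Hazardous waste",
    "Nuclear waste storage", "Utah nuclear waste", "Waste dump",
]
# pool = query_pool(my_search_engine, variants, depth=100)
```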
11
Another approach
Maybe your assessors can read very fast but can't search very well. Form different queries with relevance feedback.
12
Query pooling, relevance feedback
[Diagram: the initial query "Nuclear waste dumping" run against the collection, followed by six rounds of relevance feedback (Feedback 1 through Feedback 6).]
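A sketch of the relevance-feedback variant (again with hypothetical `search`, `expand`, and `judge` interfaces; this is not the exact setup of the Soboroff & Robertson paper cited on the next slide): the assessor judges the top results, the documents judged relevant feed an expanded query, and each iteration's results are added to the pool.

```python
def feedback_pool(search, expand, judge, seed_query, iterations=6, depth=20):
    """Build a pool via relevance feedback.

    search(query, k)           -> ranked doc IDs (hypothetical)
    expand(query, relevant)    -> new query biased towards the documents
                                  judged relevant so far (hypothetical)
    judge(doc_id)              -> True if the assessor marks it relevant
    """
    pool, relevant = set(), set()
    query = seed_query
    for _ in range(iterations):
        results = search(query, depth)
        pool.update(results)
        relevant.update(d for d in results if judge(d))
        query = expand(query, relevant)   # next feedback round
    return pool, relevant
```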
13
Relevance feedback
Use relevance feedback to form queries.
Soboroff, I., Robertson, S. (2003). Building a Filtering Test Collection for TREC 2002. In Proceedings of the ACM SIGIR Conference.
14
Both options save time
With query pooling: 2 hours per topic. With system pooling: 10-20 hours per topic?
15
Notice: we didn't get everything
How much was missed? Attempts to estimate: Zobel, ACM SIGIR 1998; Manmatha, ACM SIGIR 2001.
[Figure: probability of relevance P(r) plotted against rank.]
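One illustrative way to put a number on what was missed (a rough sketch only, not the estimation method of either cited paper; the input format is an assumption): fit a decaying trend to the observed probability of relevance at each judged rank, then extrapolate it beyond the pool depth.

```python
import math

def estimate_missed(rel_at_rank, judged_depth, max_rank=1000):
    """Crude extrapolation: fit log P(relevant | rank) = a + b*rank by
    least squares over the judged ranks, then sum the fitted curve from
    judged_depth+1 to max_rank as an estimate of missed relevant documents.

    rel_at_rank[r-1] is the observed fraction of rank-r documents that
    were judged relevant (assumed input format)."""
    xs = list(range(1, judged_depth + 1))
    ys = [math.log(max(p, 1e-6)) for p in rel_at_rank[:judged_depth]]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    # Expected number of relevant documents below the judged depth.
    return sum(math.exp(a + b * r) for r in range(judged_depth + 1, max_rank + 1))
```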
16
Do missing rels matter?
For conventional IR testing? No – not interested in such things. We just want to know whether A>B, A=B, or A<B.
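To make that point concrete, here is a minimal sketch (hypothetical data structures; unjudged documents are conventionally counted as non-relevant): for a comparative test we only need the relative ordering of the two systems' scores.

```python
def average_precision(ranking, qrels):
    """Average precision for one topic. `qrels` is the set of judged
    relevant documents; unjudged documents count as non-relevant."""
    hits, total = 0, 0.0
    for i, doc in enumerate(ranking, start=1):
        if doc in qrels:
            hits += 1
            total += hits / i
    return total / max(len(qrels), 1)

def compare(run_a, run_b, qrels_by_topic):
    """Report whether A>B, A=B, or A<B by mean average precision.
    run_a and run_b map topic IDs to ranked document lists (assumed)."""
    map_a = sum(average_precision(run_a[t], q)
                for t, q in qrels_by_topic.items()) / len(qrels_by_topic)
    map_b = sum(average_precision(run_b[t], q)
                for t, q in qrels_by_topic.items()) / len(qrels_by_topic)
    return "A>B" if map_a > map_b else ("A<B" if map_a < map_b else "A=B")
```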
17
Not good enough?
1-2 hours per topic is still a lot of work, and there are hints that 50 topics are too few (the Million Query track of TREC). What can we do?
18
Test collections
Reproducible and reusable. They encourage collaboration and cross-comparison, tell you if your new idea works, and help you publish your work.
19
How do you do this?
Focus on reducing the number of relevance assessments.
20
Simple approach
TREC/CLEF judge down to the top 100 (sometimes 50). Judging down to the top 10 instead means far fewer documents: 11%-14% of the relevance assessor effort compared to top 100.
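A sketch of the effort saving (reusing the `form_pool` helper from the earlier sketch; the 11%-14% figure comes from the slide, the toy data is made up): compare the pool sizes produced at depth 10 and depth 100.

```python
# Assuming `runs` maps run names to ranked document lists, as before,
# and form_pool() is the helper defined in the earlier sketch.
def effort_ratio(runs, shallow=10, deep=100):
    """Fraction of assessor effort needed when judging to `shallow`
    instead of `deep`, measured by pool size."""
    pool_shallow = form_pool(runs, depth=shallow)
    pool_deep = form_pool(runs, depth=deep)
    return len(pool_shallow) / len(pool_deep)

# With real TREC/CLEF runs this ratio is the roughly 0.11-0.14
# reported on the slide.
```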
21
Impact of the saving
Save a lot of time; lose a little in measurement accuracy.
22
Use the time saved
To work on more topics: measurement accuracy improves.
Sanderson, M., Zobel, J. (2005). Information Retrieval System Evaluation: Effort, Sensitivity, and Reliability. In Proceedings of the 28th Annual International ACM SIGIR Conference.
23
Questions? dis.shef.ac.uk/mark