Minimal Test Collections for Retrieval Evaluation. B. Carterette, J. Allan, R. Sitaraman, University of Massachusetts Amherst. SIGIR 2006.
Outline: Introduction, Previous Work, Intuition and Theory, Experimental Setup and Results, Discussion, Conclusion
Introduction
Information retrieval system evaluation requires test collections:
– corpora of documents, sets of topics, and relevance judgments
Stable, fine-grained evaluation metrics take both precision and recall into account, and require large sets of judgments.
– At best inefficient, at worst infeasible
Introduction
The TREC conferences
– Goal: building test collections that are reusable
– Pooling process: top results from many system runs are judged
Reusability is not always a major concern:
– TREC-style topics may not suit a specific task
– Dynamic collections such as the web
Outline: Introduction, Previous Work, Intuition and Theory, Experimental Setup and Results, Discussion, Conclusion
Previous Work
The pooling method has been shown to be sufficient for research purposes.
– [Soboroff, 2001]: random assignment of relevance to documents in a pool gives a decent ranking of systems
– [Sanderson, 2004]: systems can be ranked reliably from judgments obtained from a single system or from iterated relevance feedback runs
– [Carterette, 2005]: an algorithm that achieves high rank correlation with a very small set of judgments
Outline: Introduction, Previous Work, Intuition and Theory, Experimental Setup and Results, Discussion, Conclusion
Average Precision
Let x_i be a Boolean (0/1) indicator of the relevance of document i.
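A sketch of average precision in the pairwise (quadratic) form that the later slides rely on, writing r_i for the rank of document i; the exact notation on the slide may differ:

```latex
% Average precision over a ranked list: x_i \in \{0,1\}, r_i = rank of document i.
% The pair sum runs over unordered pairs of documents, including i = j.
\[
\mathrm{AP}
  = \frac{1}{\sum_i x_i} \sum_i \frac{x_i}{r_i} \sum_{j:\, r_j \le r_i} x_j
  = \frac{1}{\sum_i x_i} \sum_{i \le j} a_{ij}\, x_i x_j,
\qquad a_{ij} = \frac{1}{\max(r_i, r_j)}.
\]
```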
Intuition
Let S be the set of documents judged relevant, and suppose ΔAP > 0 (the stopping condition).
Intuitively, we want to increase the LHS by finding relevant documents and decrease the RHS by finding irrelevant documents.
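A sketch of the inequality behind this intuition, using the pairwise form above and writing b_{ij} for the second list's coefficients; this is an inference from the setup, not a verbatim slide equation:

```latex
% Delta AP as a quadratic form in the relevance indicators, with a_{ij} (b_{ij})
% the rank coefficients of the first (second) ranked list.
\[
\Delta\mathrm{AP} = \mathrm{AP}_1 - \mathrm{AP}_2
  = \frac{1}{\sum_i x_i} \sum_{i \le j} c_{ij}\, x_i x_j,
\qquad c_{ij} = a_{ij} - b_{ij},
\]
% so the stopping condition Delta AP > 0 reads
\[
\sum_{i \le j:\; c_{ij} > 0} c_{ij}\, x_i x_j
  \;>\;
\sum_{i \le j:\; c_{ij} < 0} \lvert c_{ij} \rvert\, x_i x_j.
\]
```

Finding relevant documents adds mass to the positive-coefficient side, while confirming documents irrelevant (x_i = 0) removes potential mass from the negative-coefficient side.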
An Optimal Algorithm
Theorem 1: If p_i = p for all i, the set S maximizes the expectation of the pairwise sum that determines the sign of ΔAP.
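Below is a minimal Python sketch of a greedy selection rule in this spirit: repeatedly pick the unjudged document whose expected contribution to ΔAP, under a uniform prior p, has the largest magnitude. The weighting, the function name greedy_order, and the toy runs are illustrative assumptions, not the paper's exact Theorem 1 objective.

```python
def greedy_order(run1, run2, prior=0.5, budget=10):
    """Greedily order unjudged documents by the magnitude of their expected
    contribution to Delta AP = AP(run1) - AP(run2). Illustrative sketch only.

    run1, run2: lists of doc ids in rank order (rank = index + 1).
    """
    docs = list(dict.fromkeys(run1 + run2))      # union of both runs, stable order
    unranked = len(docs) + 1                     # nominal rank for unretrieved documents
    r1 = {d: run1.index(d) + 1 if d in run1 else unranked for d in docs}
    r2 = {d: run2.index(d) + 1 if d in run2 else unranked for d in docs}

    def c(i, j):                                 # pairwise coefficient c_ij = a_ij - b_ij
        return 1.0 / max(r1[i], r1[j]) - 1.0 / max(r2[i], r2[j])

    judged_rel, unjudged, order = set(), set(docs), []
    for _ in range(min(budget, len(docs))):
        def impact(i):                           # |expected contribution of doc i to Delta AP|
            w = c(i, i) + sum(c(i, j) for j in judged_rel)
            w += prior * sum(c(i, j) for j in unjudged if j != i)
            return abs(w)

        pick = max(unjudged, key=impact)
        order.append(pick)
        unjudged.remove(pick)
        judged_rel.add(pick)                     # sketch only: pretend the assessor said "relevant"
    return order


# Toy example: two runs over a handful of documents.
print(greedy_order(["d1", "d2", "d3", "d4"], ["d3", "d1", "d5", "d6"], budget=4))
```

In a real loop the assessor's actual judgment, not the optimistic assumption in the loop body, would drive the update.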
AP is Normally Distributed
Given a set of relevance judgments, we use the normal cumulative distribution function (CDF) to find P(ΔAP ≤ 0), the confidence that ΔAP ≤ 0.
Figure 1: We simulated two ranked lists of 100 documents. Setting p_i = 0.5, we randomly generated 5000 sets of relevance judgments and calculated ΔAP for each set. The Anderson-Darling goodness-of-fit test cannot reject the hypothesis that the sample came from a normal distribution.
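A small sketch of the confidence computation under the normal approximation, assuming an estimated mean and variance for ΔAP are already available; the function name and numbers are illustrative:

```python
from scipy.stats import norm

def confidence_delta_ap_nonpositive(mean_delta_ap, var_delta_ap):
    """P(Delta AP <= 0) under the normal approximation: the normal CDF at zero."""
    return norm.cdf(0.0, loc=mean_delta_ap, scale=var_delta_ap ** 0.5)

# e.g. an estimated Delta AP of -0.02 with standard deviation 0.01
print(confidence_delta_ap_nonpositive(-0.02, 0.01 ** 2))   # ~0.977
```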
Application to MAP
Because topics are independent, if each per-topic AP is (approximately) normally distributed, then MAP is as well.
Each (topic, document) pair is treated as a unique "document".
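A sketch of the step being made, under the topic-independence assumption:

```latex
% MAP over T independent topics: a normal approximation for each per-topic AP
% yields a normal approximation for their mean.
\[
\mathrm{MAP} = \frac{1}{T}\sum_{t=1}^{T} \mathrm{AP}_t,
\qquad
\mathrm{AP}_t \sim N(\mu_t, \sigma_t^2)
\;\Rightarrow\;
\mathrm{MAP} \sim N\!\Bigl(\tfrac{1}{T}\textstyle\sum_t \mu_t,\; \tfrac{1}{T^2}\textstyle\sum_t \sigma_t^2\Bigr).
\]
```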
Outline: Introduction, Previous Work, Intuition and Theory, Experimental Setup and Results, Discussion, Conclusion
Outline of the Experiment
1) We ran eight retrieval systems on a set of baseline topics for which we had full sets of judgments.
2) Six annotators developed 60 new topics, which were run on the same eight systems.
3) The annotators then judged documents selected by the algorithm.
The Baseline
Baseline topics
– Used to estimate system performance
– The 2005 Robust/HARD track topics and ad hoc topics 301 through 450
Corpora
– AQUAINT for the Robust topics (about 1 million articles)
– TREC Disks 4 & 5 for the ad hoc topics (50,000 articles)
Retrieval systems
– Six freely available retrieval systems: Indri, Lemur, Lucene, mg, Smart, and Zettair
Experiment Results
2200 relevance judgments were obtained in 2.5 hours:
– about 4.5 per system per topic on average
– about 2.5 per minute per annotator (the corresponding rate at TREC is 2.2)
The systems are ranked by the expected value of MAP, where p_i = 1 if document i has been judged relevant, 0 if judged irrelevant, and 0.5 otherwise.
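A hedged sketch of the expectation being ranked on, reusing the pairwise coefficients a_{ij} from the AP form above and treating the x_i as independent Bernoulli(p_i) variables; the paper's exact normalization may differ:

```latex
% Expected AP for one topic with independent Bernoulli(p_i) relevance variables;
% E[MAP] averages the per-topic expectations. The normalization is an approximation.
\[
E[\mathrm{AP}] \approx \frac{1}{\sum_i p_i}
  \Bigl( \sum_i a_{ii}\, p_i + \sum_{i < j} a_{ij}\, p_i\, p_j \Bigr),
\qquad
E[\mathrm{MAP}] = \frac{1}{T}\sum_{t=1}^{T} E[\mathrm{AP}_t].
\]
```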
Experiment Results
Table 1: True MAPs of eight systems over 200 baseline topics, and expected MAP, with 95% confidence intervals, over 60 new topics. Horizontal lines indicate "bin" divisions determined by statistical significance.
Experiment Results
Figure 2: Confidence increases as more judgments are made.
Outline: Introduction, Previous Work, Intuition and Theory, Experimental Setup and Results, Discussion, Conclusion
Discussion
Simulations using the Robust 2005 topics and NIST judgments evaluate the performance of the algorithm.
Several questions are explored:
– To what degree are the results dependent on the algorithm rather than the evaluation metric?
– How many judgments are required to differentiate a single pair of ranked lists with 95% confidence?
– How does confidence vary as more judgments are made?
– Are test collections produced by our algorithm reusable?
Comparing εMAP and MAP
Simulation: after a number of documents have been judged, εMAP and MAP are calculated for all systems, and the resulting rankings are compared with the true ranking by Kendall's tau correlation.
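A small sketch of that comparison step using SciPy's Kendall's tau; the system names and scores are made up for illustration:

```python
from scipy.stats import kendalltau

# Hypothetical per-system scores: the "true" MAP over full judgments,
# and the estimated eMAP over a small judgment set.
true_map  = {"sysA": 0.31, "sysB": 0.28, "sysC": 0.22, "sysD": 0.19}
estimated = {"sysA": 0.27, "sysB": 0.29, "sysC": 0.20, "sysD": 0.15}

systems = sorted(true_map)                      # fixed system order
tau, p_value = kendalltau([true_map[s] for s in systems],
                          [estimated[s] for s in systems])
print(f"Kendall's tau = {tau:.2f} (p = {p_value:.3f})")
```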
How Many Judgments?
The number of judgments needed to compare two systems depends on how similar they are.
Figure 4: Absolute difference in true AP for Robust 2005 topics vs. number of judgments needed to reach 95% confidence, for pairs of ranked lists on individual topics.
Confidence over Time
Incremental pooling: all documents in a pool of depth k are judged.
The pool of depth 10 contains 2228 documents; 569 of them are not in the algorithmic set.
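For reference, a depth-k pool is just the union of the top k documents from every run; a minimal sketch with illustrative run names:

```python
def depth_k_pool(runs, k=10):
    """Union of the top-k documents from each run (a standard TREC-style pool)."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

runs = {
    "run1": ["d3", "d1", "d7", "d2"],
    "run2": ["d1", "d5", "d3", "d9"],
}
print(sorted(depth_k_pool(runs, k=2)))   # ['d1', 'd3', 'd5']
```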
Reusability of Test Collections
One of the eight systems is held out while the test collection is built.
All eight systems are then ranked by εMAP, with p_i for documents without judgments set to the ratio of relevant documents in the test collection.
Table 4: Reusability of test collections. The 8th (held-out) system is always placed in the correct spot or swapped with the next, statistically indistinguishable, system.
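A tiny sketch of that default, assuming judgments are stored as a dict mapping document ids to True/False, with unjudged documents absent; the names are illustrative:

```python
def relevance_prior(judgments):
    """Default p_i for unjudged documents: the fraction of judged documents
    that turned out to be relevant in the (held-out-system) test collection."""
    judged = list(judgments.values())
    return sum(judged) / len(judged) if judged else 0.5

def p_i(doc, judgments, default):
    """p_i is 1/0 for judged documents, the collection-level ratio otherwise."""
    if doc in judgments:
        return 1.0 if judgments[doc] else 0.0
    return default

judgments = {"d1": True, "d2": False, "d3": True}
prior = relevance_prior(judgments)       # 2/3
print(p_i("d1", judgments, prior), p_i("d9", judgments, prior))
```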
Outline: Introduction, Previous Work, Intuition and Theory, Experimental Setup and Results, Discussion, Conclusion
Conclusion
A new perspective on average precision leads to an algorithm for selecting the documents to judge so that retrieval systems can be evaluated in minimal time.
After only six hours of annotation time, we had achieved a ranking with 90% confidence.
One direction for future work is extending the analysis to other evaluation metrics and tasks.
Another is estimating probabilities of relevance.