Evaluation INST 734 Module 5 Doug Oard
Agenda Evaluation fundamentals Test collections: evaluating sets Test collections: evaluating rankings Interleaving User studies
Which is the Best Rank Order?
[Figure: six candidate ranked lists, labeled A through F, with the relevant documents marked in each]
Measuring Precision and Recall
Assume there are a total of 14 relevant documents
Let’s evaluate a system that finds 6 of those 14 in the top 20 (relevant documents retrieved at ranks 1, 5, 6, 8, 11, and 16):
Precision, hits 1-10: 1/1, 1/2, 1/3, 1/4, 2/5, 3/6, 3/7, 4/8, 4/9, 4/10
Precision, hits 11-20: 5/11, 5/12, 5/13, 5/14, 5/15, 6/16, 6/17, 6/18, 6/19, 6/20
Recall at each retrieved relevant document: 1/14, 2/14, 3/14, 4/14, 5/14, 6/14
For example, precision at rank 10 = 4/10 = 0.4
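The same numbers can be computed mechanically. A minimal sketch, assuming the ranking implied by the slide’s precision sequence (relevant documents at ranks 1, 5, 6, 8, 11, and 16, with 14 relevant documents in the collection):

```python
# Precision@k and Recall@k for the slide's example ranking.
relevant_ranks = {1, 5, 6, 8, 11, 16}  # ranks where relevant docs were retrieved
total_relevant = 14                    # relevant docs in the whole collection

def precision_at(k):
    hits = sum(1 for r in relevant_ranks if r <= k)
    return hits / k

def recall_at(k):
    hits = sum(1 for r in relevant_ranks if r <= k)
    return hits / total_relevant

print(precision_at(10), recall_at(10))  # 0.4 and ~0.29
print(precision_at(20), recall_at(20))  # 0.3 and ~0.43
```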
Uninterpolated Average Precision
–Average of precision at each retrieved relevant doc
–Relevant docs not retrieved contribute zero to score
Same ranking as before: precision at the six retrieved relevant documents is 1/1, 2/5, 3/6, 4/8, 5/11, 6/16
The 8 relevant documents not retrieved contribute eight zeros
AP = (1/1 + 2/5 + 3/6 + 4/8 + 5/11 + 6/16 + 8 × 0) / 14 ≈ 0.23
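A minimal sketch of the same calculation, assuming the example ranking above:

```python
# Uninterpolated average precision: average the precision at each retrieved
# relevant document, dividing by the TOTAL number of relevant documents (14),
# so the 8 relevant documents that were never retrieved contribute zeros.
relevant_ranks = [1, 5, 6, 8, 11, 16]  # ranks of the retrieved relevant docs
total_relevant = 14

precisions = [(i + 1) / rank for i, rank in enumerate(sorted(relevant_ranks))]
average_precision = sum(precisions) / total_relevant
print(average_precision)  # ~0.23
```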
Some Topics are Easier Than Others (Ellen Voorhees, 1999)
Mean Average Precision (MAP)
Topic 1: relevant docs at ranks 3 and 7 → 1/3=0.33, 2/7=0.29 → AP=0.31
Topic 2: relevant docs at ranks 1, 2, 5, and 9 → 1/1=1.00, 2/2=1.00, 3/5=0.60, 4/9=0.44 → AP=0.76
Topic 3: relevant docs at ranks 2, 4, 5, and 8 → 1/2=0.50, 2/4=0.50, 3/5=0.60, 4/8=0.50 → AP=0.53
MAP = (0.31 + 0.76 + 0.53) / 3 = 0.53
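A minimal sketch that reproduces these numbers, assuming every relevant document for each topic is retrieved (so AP divides by the number of relevant ranks):

```python
# Mean Average Precision over the slide's three topics.
# Each topic: (ranks of retrieved relevant docs, total relevant docs).
topics = [
    ([3, 7], 2),        # AP = (1/3 + 2/7) / 2             ≈ 0.31
    ([1, 2, 5, 9], 4),  # AP = (1/1 + 2/2 + 3/5 + 4/9) / 4 ≈ 0.76
    ([2, 4, 5, 8], 4),  # AP = (1/2 + 2/4 + 3/5 + 4/8) / 4 ≈ 0.53
]

def average_precision(relevant_ranks, total_relevant):
    precisions = [(i + 1) / r for i, r in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / total_relevant

aps = [average_precision(ranks, n_rel) for ranks, n_rel in topics]
print(sum(aps) / len(aps))  # MAP ≈ 0.53
```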
Visualizing Mean Average Precision
[Figure: Average Precision (y-axis) plotted per Topic (x-axis)]
What MAP Hides Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999
Some Other Evaluation Measures
Mean Reciprocal Rank (MRR)
Geometric Mean Average Precision (GMAP)
Normalized Discounted Cumulative Gain (NDCG)
Binary Preference (BPref)
Inferred AP (infAP)
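For concreteness, minimal sketches of two of these measures, MRR and NDCG, using common textbook definitions (reciprocal rank of the first relevant document; log-base-2 rank discount). Exact conventions, such as the gain mapping and discount, vary across implementations, and the example inputs below are made up for illustration:

```python
import math

def reciprocal_rank(relevant_ranks):
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    return 1.0 / min(relevant_ranks) if relevant_ranks else 0.0

def mean_reciprocal_rank(per_query_relevant_ranks):
    """Average reciprocal rank over a set of queries."""
    rrs = [reciprocal_rank(r) for r in per_query_relevant_ranks]
    return sum(rrs) / len(rrs)

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (descending-gain) ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

print(mean_reciprocal_rank([[3], [1], [2]]))  # (1/3 + 1/1 + 1/2) / 3 ≈ 0.61
print(ndcg([3, 2, 0, 1, 2]))                  # graded gains for one ranked list
```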
Relevance Judgment Strategies
Exhaustive assessment
–Usually impractical
Known-item queries
–Limited to MRR, requires hundreds of queries
Search-guided assessment
–Hard to quantify risks to completeness
Sampled judgments
–Good when relevant documents are common
Pooled assessment
–Requires cooperative evaluation
Pooled Assessment Methodology
Systems submit their top 1000 documents per topic
The top 100 documents from each run are judged
–Single pool, without duplicates, arbitrary order
–Judged by the person who wrote the query
Treat unevaluated documents as not relevant
Compute MAP down to 1000 documents
–Average in misses at 1000 as zero
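A sketch of this pooling step under the assumptions above: every run contributes its top 100 documents for a topic to a single de-duplicated pool for judging, and anything unjudged is scored as not relevant when computing AP down to depth 1000. The run and judgment structures here are illustrative, not an actual TREC file format:

```python
def build_pool(runs, topic, depth=100):
    """Union of the top `depth` documents from every run for one topic."""
    pool = set()
    for run in runs:                    # run: dict mapping topic -> ranked doc ids
        pool.update(run[topic][:depth])
    return pool                         # this pool is judged by the topic's author

def average_precision(ranking, judged_relevant, cutoff=1000):
    """AP down to `cutoff`; unjudged documents count as not relevant, and
    relevant documents missing from the top `cutoff` contribute zeros."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranking[:cutoff], start=1):
        if doc in judged_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(judged_relevant) if judged_relevant else 0.0
```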
Some Lessons From TREC
Incomplete judgments are useful
–If sample is unbiased with respect to systems tested
–Additional relevant documents are highly skewed across topics
Different relevance judgments change absolute score
–But rarely change comparative advantages when averaged
Evaluation technology is predictive
–Results transfer to operational settings
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999
Recap: “Batch” Evaluation
Evaluation measures focus on relevance
–Users also want utility and understandability
Goal is typically to compare systems
–Values may vary, but relative differences are stable
Mean values obscure important phenomena
–Statistical significance tests address generalizability
–Failure analysis case studies can help you improve
Agenda Evaluation fundamentals Test collections: evaluating sets Test collections: evaluating rankings Interleaving User studies