
1 Evaluation INST 734 Module 5 Doug Oard

2 Agenda
Evaluation fundamentals
Test collections: evaluating sets
 Test collections: evaluating rankings
Interleaving
User studies

3 Which is the Best Rank Order?
(Figure: six candidate rank orders, A through F, with the relevant documents marked in each.)

4 Measuring Precision and Recall
Assume there are a total of 14 relevant documents, and evaluate a system that finds 6 of those 14 in its top 20 results.
Precision at ranks 1-10: 1/1, 1/2, 1/3, 1/4, 2/5, 3/6, 3/7, 4/8, 4/9, 4/10
Precision at ranks 11-20: 5/11, 5/12, 5/13, 5/14, 5/15, 6/16, 6/17, 6/18, 6/19, 6/20
Recall after each of the 6 relevant documents retrieved: 1/14, 2/14, 3/14, 4/14, 5/14, 6/14
P@10 = 0.4
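A minimal sketch of how these numbers can be computed, not part of the original slides: the relevant documents are assumed to sit at ranks 1, 5, 6, 8, 11, and 16, which is what the precision sequence above implies, and the helper names are made up for illustration.

# Precision@k and recall@k from the ranks of retrieved relevant documents
# (assumed ranks 1, 5, 6, 8, 11, 16; 14 relevant documents in total).

def precision_at_k(relevant_ranks, k):
    """Fraction of the top k results that are relevant."""
    hits = sum(1 for r in relevant_ranks if r <= k)
    return hits / k

def recall_at_k(relevant_ranks, k, total_relevant):
    """Fraction of all relevant documents found in the top k."""
    hits = sum(1 for r in relevant_ranks if r <= k)
    return hits / total_relevant

relevant_ranks = [1, 5, 6, 8, 11, 16]   # assumed example ranking
TOTAL_RELEVANT = 14                      # given on the slide

print(precision_at_k(relevant_ranks, 10))                 # 0.4 (P@10)
print(recall_at_k(relevant_ranks, 20, TOTAL_RELEVANT))    # 6/14, roughly 0.43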

5 Uninterpolated Average Precision
Average of the precision at each retrieved relevant document; relevant documents that are not retrieved contribute zero to the score.
Continuing the example: the precision values at the six ranks where a relevant document appears are 1/1, 2/5, 3/6, 4/8, 5/11, and 6/16, and the 8 relevant documents not retrieved contribute eight zeros.
AP = 0.2307
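A sketch of the same calculation in code, using the relevant-document ranks assumed above; dividing by the full count of 14 relevant documents is what makes the eight unretrieved documents count as zeros.

# Uninterpolated average precision: average the precision at each
# retrieved relevant document, dividing by the total number of relevant
# documents so that unretrieved relevant documents contribute zero.

def average_precision(relevant_ranks, total_relevant):
    precisions = [(i + 1) / rank
                  for i, rank in enumerate(sorted(relevant_ranks))]
    return sum(precisions) / total_relevant

print(average_precision([1, 5, 6, 8, 11, 16], 14))   # roughly 0.2307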

6 Some Topics are Easier Than Others (Ellen Voorhees, 1999)

7 Mean Average Precision (MAP)
Topic 1: relevant documents retrieved at ranks 3 and 7, so the precisions are 1/3 = 0.33 and 2/7 = 0.29; AP = 0.31
Topic 2: relevant documents retrieved at ranks 1, 2, 5, and 9, so the precisions are 1/1 = 1.00, 2/2 = 1.00, 3/5 = 0.60, and 4/9 = 0.44; AP = 0.76
Topic 3: precisions at the retrieved relevant documents include 1/2 = 0.50, 3/5 = 0.60, and 4/8 = 0.50; AP = 0.53
MAP = (0.31 + 0.76 + 0.53) / 3 = 0.53
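A sketch of MAP as the mean of per-topic average precision. The rank lists below are hypothetical reconstructions chosen to reproduce the per-topic AP values on the slide; only the AP and MAP numbers themselves come from the slide.

# MAP = mean of per-topic average precision.

def average_precision(relevant_ranks, total_relevant):
    return sum((i + 1) / r
               for i, r in enumerate(sorted(relevant_ranks))) / total_relevant

topics = [
    ([3, 7], 2),        # AP roughly 0.31
    ([1, 2, 5, 9], 4),  # AP roughly 0.76
    ([2, 4, 5, 8], 4),  # AP roughly 0.53 (0.525 with this reconstruction)
]

aps = [average_precision(ranks, n_rel) for ranks, n_rel in topics]
print(aps)                  # roughly [0.31, 0.76, 0.53]
print(sum(aps) / len(aps))  # roughly 0.53 (MAP)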

8 Visualizing Mean Average Precision
(Figure: average precision per topic, plotted from 0.0 to 1.0 on the vertical axis, with topics along the horizontal axis.)

9 What MAP Hides (adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999)

10 Some Other Evaluation Measures
Mean Reciprocal Rank (MRR)
Geometric Mean Average Precision (GMAP)
Normalized Discounted Cumulative Gain (NDCG)
Binary Preference (BPref)
Inferred AP (infAP)
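A minimal sketch of two of these measures, reciprocal rank and NDCG, under the usual textbook definitions (log2 rank discount, graded gains supplied by the caller). This is an illustration with made-up judgments, not the exact formulation used in any particular evaluation.

import math

def reciprocal_rank(relevant_ranks):
    """1 / rank of the first relevant document (0 if none retrieved).
    MRR is the mean of this value over a set of queries."""
    return 1.0 / min(relevant_ranks) if relevant_ranks else 0.0

def dcg(gains):
    """Discounted cumulative gain with a log2 rank discount."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg(gains):
    """DCG normalized by the DCG of the ideal (sorted) gain ordering."""
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0

# Hypothetical graded judgments (2 = highly relevant, 1 = relevant, 0 = not)
print(reciprocal_rank([3, 7]))   # 1/3, roughly 0.33
print(ndcg([2, 0, 1, 0, 2]))     # a value between 0 and 1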

11 Relevance Judgment Strategies
Exhaustive assessment
– Usually impractical
Known-item queries
– Limited to MRR, requires hundreds of queries
Search-guided assessment
– Hard to quantify risks to completeness
Sampled judgments
– Good when relevant documents are common
Pooled assessment
– Requires cooperative evaluation

12 Pooled Assessment Methodology
Systems submit their top 1000 documents per topic
The top 100 documents from each run are judged
– Single pool, without duplicates, in arbitrary order
– Judged by the person who wrote the query
Unevaluated documents are treated as not relevant
MAP is computed down to 1000 documents
– Misses at 1000 are averaged in as zero
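A sketch of how a judgment pool might be assembled from submitted runs under the slide's parameters (top 100 of each run, merged without duplicates). The run format here, a dictionary mapping system name to a ranked list of document ids, is a hypothetical stand-in, not an official submission format.

# Build a single deduplicated judgment pool per topic from the top-k
# documents of every submitted run (hypothetical data structures).

def build_pool(runs_for_topic, depth=100):
    """runs_for_topic: dict mapping system name -> ranked list of doc ids."""
    pool = set()
    for ranking in runs_for_topic.values():
        pool.update(ranking[:depth])
    # Returned in an order unrelated to any single system's ranking
    # (here: sorted), so assessors are not biased toward one system.
    return sorted(pool)

runs = {
    "systemA": ["d12", "d7", "d3", "d99"],
    "systemB": ["d7", "d42", "d12", "d5"],
}
print(build_pool(runs, depth=3))   # ['d12', 'd3', 'd42', 'd7']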

13 Some Lessons From TREC
Incomplete judgments are useful
– If the sample is unbiased with respect to the systems tested
– Additional relevant documents are highly skewed across topics
Different relevance judgments change absolute scores
– But rarely change comparative advantages when averaged
Evaluation technology is predictive
– Results transfer to operational settings
Adapted from a presentation by Ellen Voorhees at the University of Maryland, March 29, 1999

14 Recap: "Batch" Evaluation
Evaluation measures focus on relevance
– Users also want utility and understandability
The goal is typically to compare systems
– Values may vary, but relative differences are stable
Mean values obscure important phenomena
– Statistical significance tests address generalizability
– Failure analysis case studies can help you improve
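Since the recap mentions statistical significance tests, here is a sketch of one common choice for comparing two systems on the same topics: a paired randomization (sign-flip) test over per-topic average precision differences. The per-topic AP arrays below are hypothetical.

import random

def paired_randomization_test(scores_a, scores_b, trials=10000, seed=0):
    """Two-sided p-value for the mean per-topic difference between systems."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs) / len(diffs))
    hits = 0
    for _ in range(trials):
        # Randomly flip the sign of each per-topic difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= observed:
            hits += 1
    return hits / trials

# Hypothetical per-topic AP scores for two systems over the same topics
ap_a = [0.31, 0.76, 0.53, 0.40, 0.22]
ap_b = [0.28, 0.70, 0.55, 0.35, 0.20]
print(paired_randomization_test(ap_a, ap_b))   # small p-value -> significant difference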

15 Agenda
Evaluation fundamentals
Test collections: evaluating sets
Test collections: evaluating rankings
 Interleaving
User studies

