1
Super Awesome Presentation
Dandre Allison, Devin Adair
2
Comparing the Sensitivity of Information Retrieval Metrics
Filip Radlinski, Microsoft, Cambridge, UK, filiprad@microsoft.com
Nick Craswell, Microsoft, Redmond, WA, USA, nickcr@microsoft.com
3
How do you evaluate Information Retrieval effectiveness?
Precision (P)
Mean Average Precision (MAP)
Normalized Discounted Cumulative Gain (NDCG)
4
Precision
For a given query, compute the fraction of the top 5 results that are relevant
Average over all queries
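A minimal sketch of P@5 in Python, assuming each query is represented by a ranked list of document ids plus a set of judged-relevant ids (these data structures and names are illustrative, not from the paper):

def precision_at_5(ranking, relevant):
    # Fraction of the top 5 results that are judged relevant for one query.
    top = ranking[:5]
    return sum(1 for doc in top if doc in relevant) / 5.0

def mean_precision_at_5(queries):
    # Average P@5 over all queries; `queries` holds (ranking, relevant_set) pairs.
    return sum(precision_at_5(r, rel) for r, rel in queries) / len(queries)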
5
Mean Average Precision
For a given query, find the precision up to the rank of each relevant document in the top 10
Sum those precisions and normalize by the number of known relevant documents
Average over all queries
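The same idea as a hedged Python sketch, using the same illustrative (ranking, relevant_set) representation as above:

def average_precision_at_10(ranking, relevant):
    # Precision at the rank of each relevant result in the top 10,
    # normalized by the number of known relevant documents for the query.
    hits = 0
    precision_sum = 0.0
    for rank, doc in enumerate(ranking[:10], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    # MAP: average the per-query AP@10 over all queries.
    return sum(average_precision_at_10(r, rel) for r, rel in queries) / len(queries)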
6
Normalized Discounted Cumulative Gain
Normalize the Discounted Cumulative Gain by the Ideal Discounted Cumulative Gain for a given query
Average over all queries
7
Normalized Discounted Cumulative Gain
Discounted Cumulative Gain
– Give more emphasis to relevant documents by using a 2^relevance gain
– Give more emphasis to earlier ranks by using a logarithmic reduction factor
– Sums over the top 5 results
Ideal Discounted Cumulative Gain
– Same as DCG but sorts the results by relevance
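Putting those pieces together, a minimal NDCG@5 sketch in Python, assuming `relevances` is the list of graded relevance labels of a ranking in result order (function names are illustrative):

import math

def dcg_at_5(relevances):
    # Gain of 2^rel - 1 emphasizes highly relevant documents;
    # the log2(rank + 1) discount emphasizes earlier ranks; sums over the top 5.
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:5], start=1))

def ndcg_at_5(relevances):
    # Normalize by the ideal DCG: the same labels sorted best-first.
    ideal = dcg_at_5(sorted(relevances, reverse=True))
    return dcg_at_5(relevances) / ideal if ideal > 0 else 0.0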
8
What’s the problem?
Sensitivity
– Might reject small but significant improvements
Bias
– Judges are removed from the search process
Fidelity
– Evaluation should reflect user success!
9
Alternative Evaluation
Use actual user searches
Judges become actual users
Evaluation becomes user success
10
Interleaving
System A Results + System B Results
Team-Draft Algorithm
11
Captain Ahab vs. Captain Barnacle
12
Captain Ahab vs. Captain Barnacle: Interleaved List
13
Crediting
Whichever ranker’s results receive the most distinct clicks is considered “better”
In case of a tie, the query is ignored
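A simplified sketch of team-draft interleaving plus click crediting in Python. This round-based version is an approximation of the paper’s algorithm (which tracks team sizes directly), and all names here are illustrative:

import random

def team_draft_interleave(list_a, list_b):
    # Each round, a coin flip decides which ranker drafts first; each ranker
    # then adds its highest-ranked result that is not already in the list.
    interleaved, team = [], {}
    def remaining(ranking):
        return [doc for doc in ranking if doc not in team]
    while remaining(list_a) or remaining(list_b):
        order = [("A", list_a), ("B", list_b)]
        if random.random() < 0.5:
            order.reverse()
        for name, ranking in order:
            pool = remaining(ranking)
            if pool:
                interleaved.append(pool[0])
                team[pool[0]] = name
    return interleaved, team

def credit(clicked_docs, team):
    # The team that contributed more distinct clicked results wins; ties are ignored.
    a = sum(1 for doc in set(clicked_docs) if team.get(doc) == "A")
    b = sum(1 for doc in set(clicked_docs) if team.get(doc) == "B")
    return "A" if a > b else "B" if b > a else None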
14
Retrieval System Pairs
Major improvements
– majorAB
– majorBC
– majorAC
Minor improvements
– minorE
– minorD
15
Evaluation
12,000 queries
– Sample n queries at a time, with replacement
Count the sampled queries where the rankers differ
– Ignores ties
Report the percent where the known-better ranker scores better
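A hedged Python sketch of the sampling idea, assuming per-query metric scores for a known-better and a known-worse ranker; this is an illustration of the procedure described above, not the paper’s exact protocol:

import random

def sensitivity(scores_better, scores_worse, n, trials=1000):
    # Estimate how often an evaluation prefers the known-better ranker when only
    # n queries (sampled with replacement) are available; ties are ignored.
    pairs = list(zip(scores_better, scores_worse))
    wins = ties = 0
    for _ in range(trials):
        sample = [random.choice(pairs) for _ in range(n)]
        mean_better = sum(b for b, _ in sample) / n
        mean_worse = sum(w for _, w in sample) / n
        if mean_better > mean_worse:
            wins += 1
        elif mean_better == mean_worse:
            ties += 1
    decided = trials - ties
    return wins / decided if decided else 0.0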
20
Interleaving Evaluation
23
Credit Assignment Alternatives
Shared top k
– Ignore?
– Lower clicks treated the same
Not all clicks are created equal
– log(rank)
– 1/rank
– Top
– Bottom
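One way to read the “not all clicks are created equal” alternatives is to weight each click by its rank before crediting a team. The sketch below is only illustrative; the scheme names, signature, and weighting details are assumptions, not the paper’s definitions:

import math

def weighted_team_credit(clicks, team, scheme="uniform"):
    # Credit each ranker with a weighted sum of its clicks instead of a raw count.
    # `clicks` is a list of (doc, rank) pairs for one interleaved list;
    # `team` maps doc -> "A" or "B", as in the interleaving sketch above.
    weight = {
        "uniform": lambda rank: 1.0,                        # every click counts the same
        "log":     lambda rank: 1.0 / math.log2(rank + 1),  # log(rank) discount
        "inverse": lambda rank: 1.0 / rank,                 # 1/rank discount
    }[scheme]
    totals = {"A": 0.0, "B": 0.0}
    for doc, rank in clicks:
        totals[team[doc]] += weight(rank)
    return totals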
24
Conclusions
Performance can be measured by:
– Judgment-based evaluation
– Usage-based evaluation
Surprise, surprise: small sample size is stupid
– (check out that alliteration)
Interleaving is transitive