Super Awesome Presentation
Dandre Allison, Devin Adair
Comparing the Sensitivity of Information Retrieval Metrics
Filip Radlinski (Microsoft, Cambridge, UK) and Nick Craswell (Microsoft, Redmond, WA, USA)
How do you evaluate Information Retrieval effectiveness?
– Precision (P)
– Mean Average Precision (MAP)
– Normalized Discounted Cumulative Gain (NDCG)
Precision
– For a given query, compute the fraction of the top 5 documents that are relevant
– Average over all queries
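A minimal sketch of P@5 as described above, assuming each query has a ranked list of document ids and a set of judged-relevant ids (the dict layout and function names are illustrative, not from the paper):

```python
def precision_at_k(ranked_docs, relevant, k=5):
    """Fraction of the top-k ranked documents that are judged relevant."""
    return sum(1 for doc in ranked_docs[:k] if doc in relevant) / k

def mean_precision_at_k(rankings, judgments, k=5):
    """Average P@k over all queries; rankings and judgments are dicts keyed by query id."""
    return sum(precision_at_k(rankings[q], judgments[q], k) for q in rankings) / len(rankings)
```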
Mean Average Precision
– For a given query, for each relevant document in the top 10, compute the precision at its rank
– Sum these precisions and normalize by the number of known relevant documents
– Average over all queries
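A sketch of MAP under the same assumed data layout; following the slide, the per-query sum is normalized by the number of known relevant documents:

```python
def average_precision(ranked_docs, relevant, k=10):
    """AP@k: for each relevant document in the top k, take the precision at its
    rank; sum these and normalize by the number of known relevant documents."""
    hits, precision_sum = 0, 0.0
    for rank, doc in enumerate(ranked_docs[:k], start=1):
        if doc in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgments, k=10):
    """Average AP@k over all queries."""
    return sum(average_precision(rankings[q], judgments[q], k) for q in rankings) / len(rankings)
```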
Normalized Discounted Cumulative Gain
– For a given query, normalize the Discounted Cumulative Gain by the Ideal Discounted Cumulative Gain
– Average over all queries
Normalized Discounted Cumulative Gain (cont.)
Discounted Cumulative Gain
– Gives more emphasis to relevant documents by using an exponential gain (2^relevance)
– Gives more emphasis to earlier ranks by using a logarithmic reduction factor
– Sums over the top 5
Ideal Discounted Cumulative Gain
– Same as DCG, but sorts the documents by relevance
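A sketch of NDCG@5 matching the description above, where relevances are the graded judgments of the top-ranked documents in rank order. The exact gain (2^rel − 1 here) and log base are assumptions, since variants differ:

```python
import math

def dcg_at_k(relevances, k=5):
    """DCG@k: exponential gain (2^rel - 1 here) discounted by a logarithm of the rank."""
    return sum((2 ** rel - 1) / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))

def ndcg_at_k(relevances, k=5):
    """Normalize DCG by the ideal DCG obtained from the relevance-sorted list."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```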
What’s the problem?
– Sensitivity: might reject small but significant improvements
– Bias: judges are removed from the search process
– Fidelity: evaluation should reflect user success!
Alternative Evaluation
– Use actual user searches
– Judges become actual users
– Evaluation becomes user success
Interleaving
– System A results + System B results
– Team-Draft algorithm (sketched in code after the example below)
[Figure: example result lists from two systems, "Captain Ahab" and "Captain Barnacle", and the interleaved list produced from them]
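A sketch of how the Team-Draft interleaving step might look, assuming each system's results are given as ranked lists of document ids; details such as tie-breaking and list length may differ from the paper's exact algorithm:

```python
import random

def team_draft_interleave(ranking_a, ranking_b, seed=None):
    """Team-Draft interleaving (sketch): in each round a coin flip decides which
    system picks first; each system then contributes its highest-ranked document
    not already in the interleaved list and marks it as its team's pick."""
    rng = random.Random(seed)
    interleaved, team_a, team_b = [], set(), set()
    all_docs = set(ranking_a) | set(ranking_b)
    while len(interleaved) < len(all_docs):
        order = [(ranking_a, team_a), (ranking_b, team_b)]
        if rng.random() < 0.5:
            order.reverse()
        for ranking, team in order:
            pick = next((d for d in ranking if d not in team_a and d not in team_b), None)
            if pick is not None:
                interleaved.append(pick)
                team.add(pick)
    return interleaved, team_a, team_b
```

For example, team_draft_interleave(["ahab1", "ahab2"], ["barnacle1", "ahab1"]) returns a merged list plus the two team sets used later for crediting.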
Crediting
– Whichever system's team receives the most distinct clicks is considered "better"
– In case of a tie, the query is ignored
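Continuing the sketch above, crediting could then be as simple as counting distinct clicked documents per team (clicked_docs is assumed to be the set of documents the user clicked on the interleaved list):

```python
def team_draft_credit(clicked_docs, team_a, team_b):
    """Credit each system with the number of distinct clicked documents on its
    team; the system with more distinct clicks is preferred."""
    credit_a = len(set(clicked_docs) & team_a)
    credit_b = len(set(clicked_docs) & team_b)
    if credit_a > credit_b:
        return "A"
    if credit_b > credit_a:
        return "B"
    return None  # tie (or no clicks): this query impression is ignored
```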
Retrieval System Pairs
Major improvements
– majorAB
– majorBC
– majorAC
Minor improvements
– minorE
– minorD
Evaluation
– 12,000 queries
– Sample n queries with replacement
– Count the sampled queries where the rankers differ (ignores ties)
– Report the percent where the better ranker scores better
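The sampling procedure, sketched under the assumption that we already have per-query scores for the known-better and known-worse ranker in two parallel lists (the interleaving variant would use per-query click credits instead); the paper's exact bookkeeping may differ:

```python
import random

def sensitivity(scores_better, scores_worse, sample_size, n_samples=1000, seed=0):
    """Draw queries with replacement, total each ranker's per-query scores on the
    sample, skip tied samples, and report how often the known-better ranker
    actually comes out ahead."""
    rng = random.Random(seed)
    n_queries = len(scores_better)
    wins = decided = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n_queries) for _ in range(sample_size)]
        total_better = sum(scores_better[i] for i in idx)
        total_worse = sum(scores_worse[i] for i in idx)
        if total_better != total_worse:
            decided += 1
            wins += total_better > total_worse
    return wins / decided if decided else 0.0
```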
Interleaving Evaluation
Credit Assignment Alternatives
Shared top k
– Ignore?
– Lower clicks treated the same
Not all clicks are created equal
– log(rank)
– 1/rank
– Top
– Bottom
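One possible reading of the rank-weighting alternatives above, as a per-click weight instead of a flat count; the scheme names and exact formulas here are assumptions, and the top/bottom variants would instead credit only the highest- or lowest-ranked click:

```python
import math

def click_weight(rank, scheme="uniform"):
    """Per-click credit weight: uniform treats every click equally, while the
    log and inverse schemes down-weight clicks lower in the interleaved list."""
    if scheme == "uniform":
        return 1.0
    if scheme == "log":
        return 1.0 / math.log2(rank + 1)
    if scheme == "inverse":
        return 1.0 / rank
    raise ValueError(f"unknown scheme: {scheme}")
```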
Conclusions
– Performance can be measured by judgment-based or usage-based methods
– Surprise, surprise: small sample size is stupid (check out that alliteration)
– Interleaving is transitive