Cumulated Gain-Based Evaluation of IR Techniques Liu bingbing
Motivation There are many different kinds of IR techniques, but which one is better? And how should these techniques be evaluated?
Outline Introduction Cumulated gain-based measurements Case study: comparison of some TREC-7 results at different relevance levels Discussion
Background Highly relevant documents should be identified and ranked first It’s necessary to develop measures to evaluate different IR techniques
Old measures Highly and marginally relevant documents are given equal credit IR documents are judged relevant or irrelevant Graded relevance judgments
New measures CG DCG nCG nDCG
Outline Introduction Cumulated gain-based measurements Case study: comparison of some TREC-7 results at different relevance levels Discussion
Principles Highly relevant documents are more important than marginally relevant ones Documents found late are less important
Relationship [Diagram relating G, CG, DCG, BV, and n(D)CG]
Direct Cumulated Gain (CG) For example, G' = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0, ...> gives CG' = <3, 5, 8, 8, 8, 9, 11, 13, 16, 16, ...>
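The example on this slide is a running sum over the gain vector. A minimal Python sketch, assuming the usual recursive reading CG[1] = G[1], CG[i] = CG[i-1] + G[i]; the function name `cumulated_gain` is illustrative and does not come from the slides:

```python
from itertools import accumulate

def cumulated_gain(gains):
    """Return the cumulated gain vector: the running sum of the gains."""
    return list(accumulate(gains))

# Gain vector G' from the slide (relevance levels 0..3 used directly as gains).
G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print(cumulated_gain(G))  # [3, 5, 8, 8, 8, 9, 11, 13, 16, 16]
```

Reading off position 7, for instance, says that a user who inspects the first seven documents has accumulated a gain of 11.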
Discounted Cumulated Gain (DCG) For example, G' = <3, 2, 3, 0, 0, 1, 2, 2, 3, 0, ...> gives DCG' = <3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, ...>
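The DCG values on this slide can be reproduced by dividing each gain by the logarithm of its rank. A minimal sketch, assuming logarithm base 2 and assuming that ranks below the base are left undiscounted (so the first document counts in full); the function name is illustrative:

```python
import math

def discounted_cumulated_gain(gains, base=2):
    """DCG vector: gains at ranks >= base are divided by log_base(rank)."""
    dcg, total = [], 0.0
    for rank, gain in enumerate(gains, start=1):
        if rank < base:
            total += gain                          # no discount before rank `base`
        else:
            total += gain / math.log(rank, base)   # later documents count less
        dcg.append(total)
    return dcg

G = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print([round(x, 2) for x in discounted_cumulated_gain(G)])
# [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
```

The third document has gain 3 but adds only 3 / log2(3), about 1.89, which is why DCG moves from 5 to 6.89 rather than to 8.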
Best possible Vectors Theoretically
A sample ideal gain vector (BV) CG_I' = <3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19, ...> DCG_I' = <3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 11.21, 11.53, 11.83, 11.83, 11.83, ...> (logarithm base = 2)
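The ideal gain vector itself is not printed on this slide, but it can be recovered by differencing the cumulated vector. The sketch below does that and recomputes the discounted values, assuming base 2 and no discount below rank 2; the helper names are illustrative. The recomputed DCG agrees with the slide to within 0.01 (a couple of positions round differently).

```python
import math

# Cumulated gain of the ideal ranking, copied from the slide.
CG_IDEAL = [3, 6, 9, 11, 13, 15, 16, 17, 18, 19, 19, 19, 19]

def gains_from_cg(cg):
    """Undo the running sum: G[1] = CG[1], G[i] = CG[i] - CG[i-1]."""
    return [cg[0]] + [b - a for a, b in zip(cg, cg[1:])]

def dcg(gains, base=2):
    """DCG vector with the discount applied from rank `base` onward."""
    out, total = [], 0.0
    for rank, gain in enumerate(gains, start=1):
        total += gain if rank < base else gain / math.log(rank, base)
        out.append(total)
    return out

ideal = gains_from_cg(CG_IDEAL)
print(ideal)                                  # [3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0]
print(ideal == sorted(ideal, reverse=True))   # True: best documents come first
print([round(x, 2) for x in dcg(ideal)])
```

The second print confirms the defining property of the best possible vector: gains appear in non-increasing order.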
Relative to the Ideal Measure: the Normalized (D)CG Measure norm-vect(V, I) = <v1/i1, v2/i2, ..., vk/ik> For example, nCG = norm-vect(CG, CG_I) and nDCG = norm-vect(DCG, DCG_I)
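Normalization is a position-by-position division of a run's vector by the ideal vector. A minimal sketch, assuming the sample CG vector and the sample ideal CG vector from the earlier slides describe the same query, and assuming every ideal entry is nonzero; `norm_vect` mirrors the slide's notation:

```python
def norm_vect(v, i):
    """Divide v by i position by position, over their common length."""
    return [a / b for a, b in zip(v, i)]

CG = [3, 5, 8, 8, 8, 9, 11, 13, 16, 16]          # sample run
CG_I = [3, 6, 9, 11, 13, 15, 16, 17, 18, 19]     # sample ideal, first ten ranks

nCG = norm_vect(CG, CG_I)
print([round(x, 2) for x in nCG])
# [1.0, 0.83, 0.89, 0.73, 0.62, 0.6, 0.69, 0.76, 0.89, 0.84]
```

Every value lies between 0 and 1, with 1 meaning the run matches the ideal ranking up to that rank. The same function applied to DCG and the ideal DCG yields nDCG.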
Comparison to Earlier Measures Average search length (ASL) estimates the average position of a relevant document Expected search length (ESL) is the average number of documents that must be examined to retrieve a given number of relevant documents ... Both either do not take the degree of document relevance into account or depend on the size of the retrieved list or …
The strengths of the new measures (CG, DCG, nCG, nDCG) Take the degree of document relevance into account Do not depend on the size of the recall base Do not depend on outliers Are obvious to interpret
In addition, DCG has further advantages Weights down the gain found later Models user persistence
Outline Introduction Cumulated gain-based measurements Case study: comparison of some TREC-7 results at different relevance levels Discussion
Data source TREC-7 50 queries from topic statements 51800 documents, or 1.9 GB of data We used result lists for 20 topics by five participants from the TREC-7 ad hoc manual track
Relevance judgments The new judgment is reliable New judgment is stricter
Cumulated gain (a) Binary weighting (b) Nonbinary weighting
Discounting gain
Normalized (D)CG Vectors and Statistical Testing
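About the case study For example, each of the documents D = 1, 2, ..., 10 has a gain G. Ordering the documents by decreasing gain gives 1, 6, 10, 3, 4, 2, 5, 8, 7, 9, so Ideal = <3, 3, 3, 2, 2, 1, 1, 1, 0, 0>. A run that returns documents 3, 1, 4, 2, 6, ... has the gain vector A = <2, 3, 2, 1, 3, …>.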
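The example on this slide can be traced in code. The per-document gains below are an assumption: they are reconstructed from the ideal ordering and the Ideal vector, since the slide's gain row did not survive. Only the first five documents of run A are given, so the sketch stops at rank 5.

```python
from itertools import accumulate

# Assumed gains per document ID, reconstructed from the ideal ordering.
GAIN = {1: 3, 6: 3, 10: 3, 3: 2, 4: 2, 2: 1, 5: 1, 8: 1, 7: 0, 9: 0}

def gain_vector(ranking, gain):
    """Replace each document ID in a ranked list by its gain."""
    return [gain[doc] for doc in ranking]

# Ideal ranking: documents sorted by decreasing gain (ties keep listed order).
ideal_order = sorted(GAIN, key=lambda doc: -GAIN[doc])
ideal = gain_vector(ideal_order, GAIN)
a = gain_vector([3, 1, 4, 2, 6], GAIN)            # known prefix of run A

cg_a = list(accumulate(a))
cg_ideal = list(accumulate(ideal))
ncg_a = [x / y for x, y in zip(cg_a, cg_ideal)]   # normalized CG of the prefix

print(ideal)                           # [3, 3, 3, 2, 2, 1, 1, 1, 0, 0]
print(a)                               # [2, 3, 2, 1, 3]
print([round(x, 2) for x in ncg_a])    # [0.67, 0.83, 0.78, 0.73, 0.85]
```

Run A starts with a gain-2 document where the ideal ranking has a gain-3 document, so its normalized score begins at 2/3 and never reaches 1 within the first five ranks.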
Outline Introduction Cumulated gain-based measurements Case study: comparison of some TREC-7 results at different relevance levels Discussion
Several parameters Last Rank Considered Gain Values Discounting Factor
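A rough sketch of how the three parameters enter a single nDCG score. Everything specific here is an assumption made for illustration: the weighting schemes `BINARY` and `GRADED`, the function names, and the reuse of the earlier sample vectors as the run (`RUN`) and as the recall base (`POOL`).

```python
import math

def dcg_at(gains, k, base=2):
    """DCG at rank k: no discount before rank `base`, then gain / log_base(rank)."""
    total = 0.0
    for rank, gain in enumerate(gains[:k], start=1):
        total += gain if rank < base else gain / math.log(rank, base)
    return total

def ndcg_at(run_levels, pool_levels, k, weights, base=2):
    """nDCG at rank k; `weights` maps relevance levels to gain values."""
    run = [weights[level] for level in run_levels]
    ideal = sorted((weights[level] for level in pool_levels), reverse=True)
    best = dcg_at(ideal, k, base)
    return dcg_at(run, k, base) / best if best > 0 else 0.0

RUN = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]              # relevance levels of a run
POOL = [3, 3, 3, 2, 2, 2, 1, 1, 1, 1, 0, 0, 0]    # levels in the recall base
BINARY = {0: 0, 1: 1, 2: 1, 3: 1}                 # hypothetical binary weighting
GRADED = {0: 0, 1: 1, 2: 2, 3: 3}                 # hypothetical graded weighting

for k in (5, 10):                                 # last rank considered
    for name, weights in (("binary", BINARY), ("graded", GRADED)):
        for base in (2, 10):                      # discounting factor
            score = ndcg_at(RUN, POOL, k, weights, base)
            print(k, name, base, round(score, 3))
```

A small base discounts steeply and models an impatient user; a large base leaves the early ranks undiscounted and models a more persistent one.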
Limitations Do not take order effects on relevance judgments or document overlap into account Deal with a single dimension only Cannot handle dynamic changes
Benefits Take the degree of document relevance into account Model user persistence