Evaluation in Information Retrieval
Speaker: Ruihua Song, Web Data Management Group, MSR Asia
Call for Papers of EVIA 2010
- Test collection formation, evaluation metrics, and evaluation environments
- Statistical issues in retrieval evaluation
- User studies and the evaluation of human-computer interaction in information retrieval (HCIR)
- Evaluation methods for multilingual, multimedia, or mobile information access
- Novel information access tasks and their evaluation
- Evaluation and assessment using implicit user feedback, crowdsourcing, living labs, or inferential methods
- Evaluation issues in industrial and enterprise retrieval systems
Outline
- Basics of IR evaluation
- Introduction to TREC (Text Retrieval Conference)
- One selected paper: "Select-the-Best-Ones: A new way to judge relative relevance"
Motivating Examples
Which set is better?
- S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- S3 = {r} vs. S4 = {r, r, n}
Which ranking list is better?
- L1 = <…> vs. L2 = <…>
- L3 = <…> vs. L4 = <…>
(r: relevant, n: non-relevant, h: highly relevant)
Precision & Recall
Precision is the fraction of the retrieved documents that are relevant.
Recall is the fraction of the relevant documents that have been retrieved.
[Venn diagram: R (relevant set), A (answer set), and their intersection Ra]
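In symbols, and consistent with the Venn diagram above (Ra denotes the relevant documents that were retrieved, i.e. the intersection of R and A), the standard definitions are:

```latex
\[
\mathrm{Precision} = \frac{|R_a|}{|A|}, \qquad
\mathrm{Recall} = \frac{|R_a|}{|R|}, \qquad
R_a = R \cap A .
\]
```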
Precision & Recall (cont.)
Assume there are 10 relevant documents in the judgments.
Example 1: S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- P1 = 3/5 = 0.6; R1 = 3/10 = 0.3
- P2 = 2/5 = 0.4; R2 = 2/10 = 0.2
- S1 > S2
Example 2: S3 = {r} vs. S4 = {r, r, n}
- P3 = 1/1 = 1; R3 = 1/10 = 0.1
- P4 = 2/3 = 0.667; R4 = 2/10 = 0.2
- ? (F1-measure)
Example 3: L1 = <…> vs. L2 = <…> ?
(r: relevant, n: non-relevant, h: highly relevant)
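The "?" in Example 2 can be resolved with the F1-measure hinted at on the slide. Assuming the standard definition (the harmonic mean of precision and recall), the arithmetic works out as follows:

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
F_1(S_3) = \frac{2 \cdot 1 \cdot 0.1}{1 + 0.1} \approx 0.182, \qquad
F_1(S_4) = \frac{2 \cdot 0.667 \cdot 0.2}{0.667 + 0.2} \approx 0.308 ,
\]
```

so S4 > S3 under this measure.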
Mean Average Precision (MAP)
Defined as the mean of Average Precision (AP) over a set of queries. For a single query, AP sums the precision at the rank of each relevant document and divides by the total number of relevant documents.
Example 3: L1 = <…> vs. L2 = <…> (L1 has its relevant documents at ranks 1-3, L2 at ranks 3-5)
- AP1 = (1/1 + 2/2 + 3/3)/10 = 0.3
- AP2 = (1/3 + 2/4 + 3/5)/10 ≈ 0.143
- L1 > L2
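A minimal Python sketch of this computation. The two five-element lists are illustrative placeholders consistent with the slide's arithmetic (relevant documents at ranks 1-3 vs. ranks 3-5, with 10 relevant documents in total), not the slide's original lists.

```python
def average_precision(ranking, total_relevant):
    """AP: sum of precision at each relevant rank, divided by the
    total number of relevant documents for the query."""
    hits, score = 0, 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank
    return score / total_relevant

# Placeholder lists: relevant at ranks 1-3 (L1) vs. ranks 3-5 (L2)
L1 = [1, 1, 1, 0, 0]
L2 = [0, 0, 1, 1, 1]
print(average_precision(L1, 10))  # (1/1 + 2/2 + 3/3) / 10 = 0.3
print(average_precision(L2, 10))  # (1/3 + 2/4 + 3/5) / 10 ≈ 0.143
```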
Other Metrics Based on Binary Judgments
P@10 (Precision at 10) is the fraction of relevant documents among the top 10 documents in the ranked list returned for a topic.
- e.g. if there are 3 relevant documents in the top 10 retrieved documents, P@10 = 0.3
MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank (RR) over a set of queries. RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic.
- e.g. if the first relevant document is ranked 4th, RR = 1/4 = 0.25
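A short Python sketch of both metrics. The example ranking is a hypothetical run chosen to match the numbers on the slide (3 relevant documents in the top 10, first relevant document at rank 4).

```python
def precision_at_k(ranking, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(ranking[:k]) / k

def reciprocal_rank(ranking):
    """Reciprocal of the rank of the first relevant document (0 if none)."""
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical run: 3 relevant in the top 10, first relevant at rank 4
run = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_k(run, 10), reciprocal_rank(run))  # 0.3 0.25
```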
Metrics Based on Graded Relevance
Example 4: L3 = <…> vs. L4 = <…> (r: relevant, n: non-relevant, h: highly relevant). Which ranking list is better?
Cumulated Gain based metrics: CG, DCG, and nDCG.
Two assumptions about the ranked result list:
- Highly relevant documents are more valuable than marginally relevant ones.
- The lower a relevant document is ranked (the greater its rank position), the less valuable it is for the user.
CG: Cumulated Gain
From graded-relevance judgments to gain vectors; the cumulated gain at rank i is the sum of the gains at ranks 1 through i.
Example 4: L3 = <…> vs. L4 = <…>
- G3 = <…>, G4 = <…>
- CG3 = <…>, CG4 = <…>
DCG: Discounted Cumulated Gain
A discounting function reduces the gain of documents that appear lower in the ranking.
Example 4: L3 = <…> vs. L4 = <…>
- G3 = <…>, G4 = <…>; DG3 = <…>, DG4 = <…>
- CG3 = <…>, CG4 = <…>; DCG3 = <…>, DCG4 = <…>
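The slide's gain and (D)CG vectors did not survive in this transcript, so the sketch below uses illustrative values: gains of 0/1/2 for n/r/h (an assumed scale, not necessarily the slide's) and one common discounting function, 1/log2(rank + 1). The lists L3 and L4 are hypothetical stand-ins.

```python
import math

GAIN = {"n": 0, "r": 1, "h": 2}  # assumed gain scale for non-relevant / relevant / highly relevant

def cumulated_gain(labels):
    """CG[i] = sum of the gains at ranks 1..i."""
    cg, total = [], 0
    for label in labels:
        total += GAIN[label]
        cg.append(total)
    return cg

def discounted_cumulated_gain(labels):
    """DCG[i] using the common 1/log2(rank + 1) discount."""
    dcg, total = [], 0.0
    for rank, label in enumerate(labels, start=1):
        total += GAIN[label] / math.log2(rank + 1)
        dcg.append(round(total, 3))
    return dcg

# Hypothetical graded lists (the slide's L3 and L4 are not recoverable here)
L3 = ["r", "h", "n", "r"]
L4 = ["h", "r", "r", "n"]
print(cumulated_gain(L3), cumulated_gain(L4))
print(discounted_cumulated_gain(L3), discounted_cumulated_gain(L4))
```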
nDCG: Normalized Discounted Cumulated Gain
The ideal (D)CG vector is obtained by ranking all judged documents in decreasing order of relevance.
Example 4: L3 = <…> vs. L4 = <…>
- L_ideal = <…>
- G_ideal = <…>; DG_ideal = <…>
- CG_ideal = <…>; DCG_ideal = <…>
nDCG: Normalized Discounted Cumulated Gain
The normalized (D)CG vector divides each (D)CG value by the corresponding value of the ideal vector.
Example 4: L3 = <…> vs. L4 = <…>
- DCG_ideal = <…>
- nDCG3 = <…>; nDCG4 = <…>
- L3 < L4
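A sketch of the normalization step under the same assumptions as the DCG example above (assumed 0/1/2 gains, 1/log2(rank + 1) discount, hypothetical lists): nDCG at a cutoff is the DCG of the run divided by the DCG of the ideal reordering of the same judged documents.

```python
import math

GAIN = {"n": 0, "r": 1, "h": 2}  # assumed gain scale

def dcg(labels):
    return sum(GAIN[x] / math.log2(rank + 1) for rank, x in enumerate(labels, start=1))

def ndcg(labels, k=None):
    """DCG of the list divided by the DCG of its ideal (relevance-sorted) ordering."""
    k = k or len(labels)
    ideal = sorted(labels, key=lambda x: GAIN[x], reverse=True)
    return dcg(labels[:k]) / dcg(ideal[:k])

# Hypothetical lists standing in for the slide's L3 and L4
L3 = ["r", "h", "n", "r"]
L4 = ["h", "r", "r", "n"]
print(round(ndcg(L3), 3), round(ndcg(L4), 3))
```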
Something Important
- Dealing with small data sets: cross-validation
- Significance testing: a paired, two-tailed t-test
Green < Yellow? Is the difference significant, or just caused by chance?
[Figure: two score distributions, with score on the horizontal axis and p(.) on the vertical axis]
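A minimal sketch of such a test using SciPy. The per-query scores below are hypothetical; scipy.stats.ttest_rel performs a paired, two-sided t-test on the per-query differences.

```python
import numpy as np
from scipy import stats

# Hypothetical per-query scores for two systems, paired by query
system_a = np.array([0.30, 0.14, 0.25, 0.40, 0.22, 0.31, 0.18, 0.27])
system_b = np.array([0.35, 0.16, 0.24, 0.45, 0.28, 0.33, 0.20, 0.30])

# Paired, two-tailed t-test
t_stat, p_value = stats.ttest_rel(system_a, system_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests the difference is unlikely to be due to chance
```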
Any questions?
Introduction of TREC
By Ruihua Song, Web Data Management Group, MSR Asia, March 30, 2010
Text Retrieval Conference (TREC)
Homepage: http://trec.nist.gov/
Goals:
- To encourage retrieval research based on large test collections
- To increase communication among industry, academia, and government
- To speed the transfer of technology from research labs into commercial products
- To increase the availability of appropriate evaluation techniques for use by industry and academia
Yearly Cycle of TREC
The TREC Tracks
TREC 2009 Tracks
- Blog track
- Chemical IR track
- Entity track
- Legal track
- "Million Query" track
- Relevance Feedback track
- Web track
Participants: 67 groups representing 19 different countries
TREC 2010 Schedule
- By Feb 18: submit your application to participate in TREC 2010
- Beginning March 2
- Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md., USA
What's new: the Session track
- To test whether systems can improve their performance for a given query by using a previous query
- To evaluate system performance over an entire query session instead of a single query
- Track web page: http://ir.cis.udel.edu/sessions
Why TREC
- To obtain public data sets (the most frequently used in IR papers); pooling makes judgments unbiased for participants
- To exchange ideas in emerging areas: a strong Program Committee and a healthy comparison of approaches
- To influence evaluation methodologies, by feedback or proposals
TREC 2009 Program Committee Ellen Voorhees, chair James Allan Chris Buckley Gord Cormack Sue Dumais Donna Harman Bill Hersh David Lewis Doug Oard John Prager Stephen Robertson Mark Sanderson Ian Soboroff Richard Tong
Any questions?
Select-the-Best-Ones: A New Way to Judge Relative Relevance
Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon
Information Processing and Management, 2010
Absolute Relevance Judgments
Relative Relevance Judgments
Problem formulation; connections between absolute (A) and relative (R) judgments:
- A can be transformed to R as follows: <…>
- R can be transformed to A, if the assessors assign a relevance grade to each set.
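The construction itself is not shown in this transcript. A minimal sketch of one natural A-to-R transformation (an interpretation, not necessarily the paper's exact procedure): group documents by their absolute grade, so that any document in a higher-grade group is preferred to any document in a lower-grade group.

```python
from collections import defaultdict

def absolute_to_relative(judgments):
    """Group documents by absolute grade and return the groups ordered from the
    highest grade to the lowest; a document in an earlier group is preferred
    to any document in a later group."""
    by_grade = defaultdict(list)
    for doc, grade in judgments.items():
        by_grade[grade].append(doc)
    return [sorted(by_grade[g]) for g in sorted(by_grade, reverse=True)]

# Hypothetical absolute judgments on a 0-2 scale (2 = highly relevant)
A = {"d1": 2, "d2": 1, "d3": 1, "d4": 0}
print(absolute_to_relative(A))  # [['d1'], ['d2', 'd3'], ['d4']]
```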
Quick-Sort: A Pairwise Strategy
[Figure: illustration of the quick-sort pairwise judging process, with documents compared against a pivot P and labeled B, S, or W]
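A minimal sketch of how such a pivot-based pass might work, assuming (from the slide's labels, not stated in the transcript) that B, S, and W stand for better than, the same as, and worse than the pivot.

```python
def quicksort_judge(docs, compare):
    """Order documents into preference groups by repeatedly picking a pivot
    and asking whether each remaining document is Better (B), the Same (S),
    or Worse (W) than the pivot."""
    if not docs:
        return []
    pivot, better, same, worse = docs[0], [], [docs[0]], []
    for d in docs[1:]:
        verdict = compare(d, pivot)
        (better if verdict == "B" else same if verdict == "S" else worse).append(d)
    return quicksort_judge(better, compare) + [same] + quicksort_judge(worse, compare)

# Hypothetical assessor backed by hidden relevance scores
scores = {"d1": 3, "d2": 1, "d3": 3, "d4": 2}
def assessor(a, b):
    return "B" if scores[a] > scores[b] else "S" if scores[a] == scores[b] else "W"

print(quicksort_judge(["d1", "d2", "d3", "d4"], assessor))
# [['d1', 'd3'], ['d4'], ['d2']]
```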
Select-the-Best-Ones: A Proposed New Strategy
[Figure: illustration of the Select-the-Best-Ones judging process; in each round some documents (B) are selected from the remaining pool (P)]
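A sketch of the idea suggested by the strategy's name (an interpretation, not the paper's exact algorithm): in each round the assessor selects the best documents from the remaining pool, and each selection becomes the next preference group.

```python
def select_the_best_ones(pool, pick_best):
    """Repeatedly ask the assessor to select the best documents from the
    remaining pool; each selection becomes the next preference group."""
    groups, remaining = [], list(pool)
    while remaining:
        best = pick_best(remaining)  # the set of equally-best documents
        groups.append(sorted(best))
        remaining = [d for d in remaining if d not in best]
    return groups

# Hypothetical assessor backed by hidden relevance scores
scores = {"d1": 3, "d2": 1, "d3": 3, "d4": 2}
def pick_best(docs):
    top = max(scores[d] for d in docs)
    return {d for d in docs if scores[d] == top}

print(select_the_best_ones(["d1", "d2", "d3", "d4"], pick_best))
# [['d1', 'd3'], ['d4'], ['d2']]
```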
User Study: Experiment Design
A Latin Square design was used to minimize possible practice effects and order effects:
- Each tool has been used to judge all three query sets;
- Each query has been judged by three subjects;
- Each subject has used every tool and judged every query, but the queries a subject judges with one tool do not overlap with those judged with another tool.
User Study: Experiment Design (cont.)
The 30 Chinese queries are divided into three balanced sets and cover both popular queries and long-tail queries.
Scene of the User Study
Basic Evaluation Results
- Efficiency
- Majority agreement
- Discriminative power
Further Analysis on Discriminative Power
- Three grades, 'Excellent', 'Good', and 'Fair', are split, while 'Perfect' and 'Bad' are not.
- More queries are affected in SBO than in QS, and the splitting is distributed more evenly in SBO.
Evaluation Experiment on Judgment Quality
Collecting experts' judgments:
- 5 experts, for 15 Chinese queries
- Partial orders
- Judged individually, then discussed as a group
Experimental results
Discussion
Absolute relevance judgment method:
- Fast and easy to implement
- Loses some useful order information
Quick-Sort method:
- Light cognitive load and scalable
- High complexity and unstable standard
Select-the-Best-Ones method:
- Efficient, with good discriminative power
- Heavy cognitive load and not scalable
Conclusion
We propose a new strategy called Select-the-Best-Ones (SBO) to address the problem of relative relevance judgment. A user study and an evaluation experiment show that the SBO method:
- Outperforms the absolute method in terms of agreement and discriminative power
- Dramatically improves efficiency over the pairwise relative method (the QS strategy)
- Reduces the number of discordant pairs by half compared to the QS method
Thank you! rsong@microsoft.com