Evaluation in Information Retrieval
Speaker: Ruihua Song, Web Data Management Group, MSR Asia
Call for Papers of EVIA 2010
- Test collection formation, evaluation metrics, and evaluation environments
- Statistical issues in retrieval evaluation
- User studies and the evaluation of human-computer interaction in information retrieval (HCIR)
- Evaluation methods for multilingual, multimedia, or mobile information access
- Novel information access tasks and their evaluation
- Evaluation and assessment using implicit user feedback, crowdsourcing, living labs, or inferential methods
- Evaluation issues in industrial and enterprise retrieval systems
Outline
- Basics of IR evaluation
- Introduction to TREC (Text Retrieval Conference)
- One selected paper: "Select-the-Best-Ones: A new way to judge relative relevance"
Motivating Examples
Which set is better?
- S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- S3 = {r} vs. S4 = {r, r, n}
Which ranking list is better?
- L1 = <…> vs. L2 = <…>
- L3 = <…> vs. L4 = <…>
(r: relevant, n: non-relevant, h: highly relevant)
Precision & Recall
Precision is the fraction of the retrieved documents that are relevant.
Recall is the fraction of the relevant documents that have been retrieved.
[Venn diagram: R (relevant set), A (answer set), and their intersection Ra]
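In symbols, and consistent with the Venn diagram above (Ra denotes the relevant documents that were retrieved, i.e. the intersection of R and A), the standard definitions are:

```latex
\[
\mathrm{Precision} = \frac{|R_a|}{|A|}, \qquad
\mathrm{Recall} = \frac{|R_a|}{|R|}, \qquad
R_a = R \cap A .
\]
```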
Precision & Recall (cont.)
Assume there are 10 relevant documents in the judgments.
Example 1: S1 = {r, r, r, n, n} vs. S2 = {r, r, n, n, n}
- P1 = 3/5 = 0.6; R1 = 3/10 = 0.3
- P2 = 2/5 = 0.4; R2 = 2/10 = 0.2
- S1 > S2
Example 2: S3 = {r} vs. S4 = {r, r, n}
- P3 = 1/1 = 1; R3 = 1/10 = 0.1
- P4 = 2/3 = 0.667; R4 = 2/10 = 0.2
- ? (F1-measure)
Example 3: L1 = <…> vs. L2 = <…> ?
(r: relevant, n: non-relevant, h: highly relevant)
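The "?" in Example 2 can be resolved with the F1-measure hinted at on the slide. Assuming the standard definition (the harmonic mean of precision and recall), the arithmetic works out as follows:

```latex
\[
F_1 = \frac{2PR}{P + R}, \qquad
F_1(S_3) = \frac{2 \cdot 1 \cdot 0.1}{1 + 0.1} \approx 0.182, \qquad
F_1(S_4) = \frac{2 \cdot 0.667 \cdot 0.2}{0.667 + 0.2} \approx 0.308 ,
\]
```

so S4 > S3 under this measure.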
Mean Average Precision (MAP)
Defined as the mean of Average Precision (AP) over a set of queries. For a single query, AP sums the precision at the rank of each relevant document and divides by the total number of relevant documents.
Example 3: L1 = <…> vs. L2 = <…> (L1 has its relevant documents at ranks 1-3, L2 at ranks 3-5)
- AP1 = (1/1 + 2/2 + 3/3)/10 = 0.3
- AP2 = (1/3 + 2/4 + 3/5)/10 ≈ 0.143
- L1 > L2
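A minimal Python sketch of this computation. The two five-element lists are illustrative placeholders consistent with the slide's arithmetic (relevant documents at ranks 1-3 vs. ranks 3-5, with 10 relevant documents in total), not the slide's original lists.

```python
def average_precision(ranking, total_relevant):
    """AP: sum of precision at each relevant rank, divided by the
    total number of relevant documents for the query."""
    hits, score = 0, 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            hits += 1
            score += hits / rank
    return score / total_relevant

# Placeholder lists: relevant at ranks 1-3 (L1) vs. ranks 3-5 (L2)
L1 = [1, 1, 1, 0, 0]
L2 = [0, 0, 1, 1, 1]
print(average_precision(L1, 10))  # (1/1 + 2/2 + 3/3) / 10 = 0.3
print(average_precision(L2, 10))  # (1/3 + 2/4 + 3/5) / 10 ≈ 0.143
```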
Other Metrics Based on Binary Judgments
P@10 (Precision at 10) is the fraction of relevant documents among the top 10 documents in the ranked list returned for a topic.
- e.g. if there are 3 relevant documents in the top 10 retrieved documents, P@10 = 0.3
MRR (Mean Reciprocal Rank) is the mean of the Reciprocal Rank (RR) over a set of queries. RR is the reciprocal of the rank of the first relevant document in the ranked list returned for a topic.
- e.g. if the first relevant document is ranked 4th, RR = 1/4 = 0.25
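A short Python sketch of both metrics. The example ranking is a hypothetical run chosen to match the numbers on the slide (3 relevant documents in the top 10, first relevant document at rank 4).

```python
def precision_at_k(ranking, k=10):
    """Fraction of the top-k retrieved documents that are relevant."""
    return sum(ranking[:k]) / k

def reciprocal_rank(ranking):
    """Reciprocal of the rank of the first relevant document (0 if none)."""
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical run: 3 relevant in the top 10, first relevant at rank 4
run = [0, 0, 0, 1, 0, 1, 0, 0, 1, 0]
print(precision_at_k(run, 10), reciprocal_rank(run))  # 0.3 0.25
```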
Metrics Based on Graded Relevance
Example 4: L3 = <…> vs. L4 = <…> (r: relevant, n: non-relevant, h: highly relevant). Which ranking list is better?
Cumulated Gain based metrics: CG, DCG, and nDCG.
Two assumptions about the ranked result list:
- Highly relevant documents are more valuable than marginally relevant ones.
- The lower a relevant document is ranked (the greater its rank position), the less valuable it is for the user.
CG: Cumulated Gain
From graded-relevance judgments to gain vectors; the cumulated gain at rank i is the sum of the gains at ranks 1 through i.
Example 4: L3 = <…> vs. L4 = <…>
- G3 = <…>, G4 = <…>
- CG3 = <…>, CG4 = <…>
DCG: Discounted Cumulated Gain
A discounting function reduces the gain of documents that appear lower in the ranking.
Example 4: L3 = <…> vs. L4 = <…>
- G3 = <…>, G4 = <…>; DG3 = <…>, DG4 = <…>
- CG3 = <…>, CG4 = <…>; DCG3 = <…>, DCG4 = <…>
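The slide's gain and (D)CG vectors did not survive in this transcript, so the sketch below uses illustrative values: gains of 0/1/2 for n/r/h (an assumed scale, not necessarily the slide's) and one common discounting function, 1/log2(rank + 1). The lists L3 and L4 are hypothetical stand-ins.

```python
import math

GAIN = {"n": 0, "r": 1, "h": 2}  # assumed gain scale for non-relevant / relevant / highly relevant

def cumulated_gain(labels):
    """CG[i] = sum of the gains at ranks 1..i."""
    cg, total = [], 0
    for label in labels:
        total += GAIN[label]
        cg.append(total)
    return cg

def discounted_cumulated_gain(labels):
    """DCG[i] using the common 1/log2(rank + 1) discount."""
    dcg, total = [], 0.0
    for rank, label in enumerate(labels, start=1):
        total += GAIN[label] / math.log2(rank + 1)
        dcg.append(round(total, 3))
    return dcg

# Hypothetical graded lists (the slide's L3 and L4 are not recoverable here)
L3 = ["r", "h", "n", "r"]
L4 = ["h", "r", "r", "n"]
print(cumulated_gain(L3), cumulated_gain(L4))
print(discounted_cumulated_gain(L3), discounted_cumulated_gain(L4))
```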
nDCG: Normalized Discounted Cumulated Gain
The ideal (D)CG vector is obtained by ranking all judged documents in decreasing order of relevance.
Example 4: L3 = <…> vs. L4 = <…>
- L_ideal = <…>
- G_ideal = <…>; DG_ideal = <…>
- CG_ideal = <…>; DCG_ideal = <…>
nDCG: Normalized Discounted Cumulated Gain
The normalized (D)CG vector divides each (D)CG value by the corresponding value of the ideal vector.
Example 4: L3 = <…> vs. L4 = <…>
- DCG_ideal = <…>
- nDCG3 = <…>; nDCG4 = <…>
- L3 < L4
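A sketch of the normalization step under the same assumptions as the DCG example above (assumed 0/1/2 gains, 1/log2(rank + 1) discount, hypothetical lists): nDCG at a cutoff is the DCG of the run divided by the DCG of the ideal reordering of the same judged documents.

```python
import math

GAIN = {"n": 0, "r": 1, "h": 2}  # assumed gain scale

def dcg(labels):
    return sum(GAIN[x] / math.log2(rank + 1) for rank, x in enumerate(labels, start=1))

def ndcg(labels, k=None):
    """DCG of the list divided by the DCG of its ideal (relevance-sorted) ordering."""
    k = k or len(labels)
    ideal = sorted(labels, key=lambda x: GAIN[x], reverse=True)
    return dcg(labels[:k]) / dcg(ideal[:k])

# Hypothetical lists standing in for the slide's L3 and L4
L3 = ["r", "h", "n", "r"]
L4 = ["h", "r", "r", "n"]
print(round(ndcg(L3), 3), round(ndcg(L4), 3))
```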
Something Important
- Dealing with small data sets: cross-validation
- Significance testing: a paired, two-tailed t-test
Green < Yellow? Is the difference significant, or just caused by chance?
[Figure: two score distributions, with score on the horizontal axis and p(.) on the vertical axis]
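A minimal sketch of such a test using SciPy. The per-query scores below are hypothetical; scipy.stats.ttest_rel performs a paired, two-sided t-test on the per-query differences.

```python
import numpy as np
from scipy import stats

# Hypothetical per-query scores for two systems, paired by query
system_a = np.array([0.30, 0.14, 0.25, 0.40, 0.22, 0.31, 0.18, 0.27])
system_b = np.array([0.35, 0.16, 0.24, 0.45, 0.28, 0.33, 0.20, 0.30])

# Paired, two-tailed t-test
t_stat, p_value = stats.ttest_rel(system_a, system_b)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
# A small p-value (e.g. < 0.05) suggests the difference is unlikely to be due to chance
```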
Any questions?
Introduction of TREC
By Ruihua Song, Web Data Management Group, MSR Asia, March 30, 2010
Text Retrieval Conference (TREC)
Homepage: http://trec.nist.gov/
Goals:
- To encourage retrieval research based on large test collections
- To increase communication among industry, academia, and government
- To speed the transfer of technology from research labs into commercial products
- To increase the availability of appropriate evaluation techniques for use by industry and academia
Yearly Cycle of TREC
The TREC Tracks
TREC 2009 Tracks
- Blog track
- Chemical IR track
- Entity track
- Legal track
- "Million Query" track
- Relevance Feedback track
- Web track
Participants: 67 groups representing 19 different countries
TREC 2010 Schedule
- By Feb 18: submit your application to participate in TREC 2010
- Beginning March 2
- Nov 16-19: TREC 2010 at NIST in Gaithersburg, Md., USA
What's new: the Session track
- To test whether systems can improve their performance for a given query by using a previous query
- To evaluate system performance over an entire query session instead of a single query
- Track web page: http://ir.cis.udel.edu/sessions
Why TREC
- To obtain public data sets (the most frequently used in IR papers); pooling makes judgments unbiased for participants
- To exchange ideas in emerging areas: a strong Program Committee and a healthy comparison of approaches
- To influence evaluation methodologies, by feedback or proposals
TREC 2009 Program Committee Ellen Voorhees, chair James Allan Chris Buckley Gord Cormack Sue Dumais Donna Harman Bill Hersh David Lewis Doug Oard John Prager Stephen Robertson Mark Sanderson Ian Soboroff Richard Tong
Any questions?
Select-the-Best-Ones: A New Way to Judge Relative Relevance
Ruihua Song, Qingwei Guo, Ruochi Zhang, Guomao Xin, Ji-Rong Wen, Yong Yu, Hsiao-Wuen Hon
Information Processing and Management, 2010
Absolute Relevance Judgments
Relative Relevance Judgments
Problem formulation; connections between absolute (A) and relative (R) judgments:
- A can be transformed to R as follows: <…>
- R can be transformed to A, if the assessors assign a relevance grade to each set.
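The construction itself is not shown in this transcript. A minimal sketch of one natural A-to-R transformation (an interpretation, not necessarily the paper's exact procedure): group documents by their absolute grade, so that any document in a higher-grade group is preferred to any document in a lower-grade group.

```python
from collections import defaultdict

def absolute_to_relative(judgments):
    """Group documents by absolute grade and return the groups ordered from the
    highest grade to the lowest; a document in an earlier group is preferred
    to any document in a later group."""
    by_grade = defaultdict(list)
    for doc, grade in judgments.items():
        by_grade[grade].append(doc)
    return [sorted(by_grade[g]) for g in sorted(by_grade, reverse=True)]

# Hypothetical absolute judgments on a 0-2 scale (2 = highly relevant)
A = {"d1": 2, "d2": 1, "d3": 1, "d4": 0}
print(absolute_to_relative(A))  # [['d1'], ['d2', 'd3'], ['d4']]
```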
Quick-Sort: A Pairwise Strategy
[Figure: illustration of the quick-sort pairwise judging process, with documents compared against a pivot P and labeled B, S, or W]
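A minimal sketch of how such a pivot-based pass might work, assuming (from the slide's labels, not stated in the transcript) that B, S, and W stand for better than, the same as, and worse than the pivot.

```python
def quicksort_judge(docs, compare):
    """Order documents into preference groups by repeatedly picking a pivot
    and asking whether each remaining document is Better (B), the Same (S),
    or Worse (W) than the pivot."""
    if not docs:
        return []
    pivot, better, same, worse = docs[0], [], [docs[0]], []
    for d in docs[1:]:
        verdict = compare(d, pivot)
        (better if verdict == "B" else same if verdict == "S" else worse).append(d)
    return quicksort_judge(better, compare) + [same] + quicksort_judge(worse, compare)

# Hypothetical assessor backed by hidden relevance scores
scores = {"d1": 3, "d2": 1, "d3": 3, "d4": 2}
def assessor(a, b):
    return "B" if scores[a] > scores[b] else "S" if scores[a] == scores[b] else "W"

print(quicksort_judge(["d1", "d2", "d3", "d4"], assessor))
# [['d1', 'd3'], ['d4'], ['d2']]
```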
Select-the-Best-Ones: A Proposed New Strategy
[Figure: illustration of the Select-the-Best-Ones judging process; in each round some documents (B) are selected from the remaining pool (P)]
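A sketch of the idea suggested by the strategy's name (an interpretation, not the paper's exact algorithm): in each round the assessor selects the best documents from the remaining pool, and each selection becomes the next preference group.

```python
def select_the_best_ones(pool, pick_best):
    """Repeatedly ask the assessor to select the best documents from the
    remaining pool; each selection becomes the next preference group."""
    groups, remaining = [], list(pool)
    while remaining:
        best = pick_best(remaining)  # the set of equally-best documents
        groups.append(sorted(best))
        remaining = [d for d in remaining if d not in best]
    return groups

# Hypothetical assessor backed by hidden relevance scores
scores = {"d1": 3, "d2": 1, "d3": 3, "d4": 2}
def pick_best(docs):
    top = max(scores[d] for d in docs)
    return {d for d in docs if scores[d] == top}

print(select_the_best_ones(["d1", "d2", "d3", "d4"], pick_best))
# [['d1', 'd3'], ['d4'], ['d2']]
```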
User Study: Experiment Design
A Latin Square design was used to minimize possible practice effects and order effects:
- Each tool has been used to judge all three query sets;
- Each query has been judged by three subjects;
- Each subject has used every tool and judged every query, but the queries a subject judges with one tool do not overlap with those judged with another tool.
User Study: Experiment Design (cont.)
The 30 Chinese queries are divided into three balanced sets and cover both popular queries and long-tail queries.
Scene of the User Study
Basic Evaluation Results
- Efficiency
- Majority agreement
- Discriminative power
Further Analysis on Discriminative Power
- Three grades, 'Excellent', 'Good', and 'Fair', are split, while 'Perfect' and 'Bad' are not.
- More queries are affected in SBO than in QS, and the splitting is distributed more evenly in SBO.
Evaluation Experiment on Judgment Quality
Collecting experts' judgments:
- 5 experts, for 15 Chinese queries
- Partial orders
- Judged individually, then discussed as a group
Experimental results
Discussion
Absolute relevance judgment method:
- Fast and easy to implement
- Loses some useful order information
Quick-Sort method:
- Light cognitive load and scalable
- High complexity and unstable standard
Select-the-Best-Ones method:
- Efficient, with good discriminative power
- Heavy cognitive load and not scalable
Conclusion
We propose a new strategy called Select-the-Best-Ones (SBO) to address the problem of relative relevance judgment. A user study and an evaluation experiment show that the SBO method:
- Outperforms the absolute method in terms of agreement and discriminative power
- Dramatically improves efficiency over the pairwise relative method (the QS strategy)
- Reduces the number of discordant pairs by half compared to the QS method
Thank you! rsong@microsoft.com