Retrieval Evaluation
Modern Information Retrieval, Chapter 3
Ricardo Baeza-Yates, Berthier Ribeiro-Neto

Outline
Introduction
Retrieval Performance Evaluation: recall and precision, alternative measures
Reference Collections: TREC collection, CACM and ISI collections, CF collection

Introduction
Types of evaluation:
Functional analysis and error analysis: verify that the system provides the specified functionalities and works properly, with no errors
Performance evaluation: response time and space required
Retrieval performance evaluation: how precise (relevant) the answer set is

Retrieval Performance Evaluation
In a batch setting, a query is run against a document collection:
R = the set of documents relevant to the query
A = the answer set retrieved by the IR system (sorted by relevance)
Ra = the relevant documents in the answer set (the intersection of R and A)
Recall = |Ra| / |R|
Precision = |Ra| / |A|

Precision and Recall
Precision: the ratio of the number of relevant documents retrieved to the total number of documents retrieved (the fraction of hits that are relevant)
Recall: the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection (the fraction of relevant documents that are hits)
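
As an illustration (not code from the book), a minimal Python sketch of these two definitions; the document-id sets are made up for the example.

```python
# Minimal sketch: set-based precision and recall for one query.
# 'relevant' and 'retrieved' are illustrative document-id sets.

def precision_recall(relevant, retrieved):
    """Return (precision, recall) for a retrieved answer set."""
    ra = relevant & retrieved                       # relevant documents in the answer set
    precision = len(ra) / len(retrieved) if retrieved else 0.0
    recall = len(ra) / len(relevant) if relevant else 0.0
    return precision, recall

relevant = {"d3", "d56", "d129"}
retrieved = {"d123", "d84", "d56", "d6", "d8"}
print(precision_recall(relevant, retrieved))        # (0.2, 0.333...)
```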

Precision and Recall
(Figure: four diagrams of the document space showing how the overlap between the retrieved set and the relevant set yields low precision/low recall, high precision/low recall, low precision/high recall, and high precision/high recall.)

Precision and Recall
Recall = |RA| / |R| and Precision = |RA| / |A|, where R is the set of relevant documents, A the answer set, and RA their intersection.
The user is not usually given the whole answer set A at once: the documents in A are sorted by degree of relevance (a ranking) which the user examines from the top.
Recall and precision therefore vary as the user proceeds with the examination of the answer set A (discussed further below).

Precision and Recall Trade-Off
Increasing the number of documents retrieved is likely to retrieve more of the relevant documents and thus increase recall, but it typically also retrieves more inappropriate documents and thus decreases precision.

Precision versus recall curve
Relevant documents for query q: Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}
Ranking generated for query q (relevant documents marked with *):
1. d123*  2. d84  3. d56*  4. d6  5. d8  6. d9*  7. d511  8. d129  9. d187  10. d25*  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 100% at R = 10% (d123 at rank 1), P = 66% at R = 20% (d56 at rank 3), P = 50% at R = 30% (d9 at rank 6), and so on.
The curve is usually plotted at 11 standard recall levels: 0%, 10%, ..., 100%
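
A short Python sketch (illustrative, not from the book) that reproduces the precision/recall points above from the ranking and the relevant set.

```python
# Sketch: (recall, precision) points observed while walking down a ranked list.
relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
           "d25", "d38", "d48", "d250", "d11", "d3"]

hits = 0
points = []                                  # one point after each new relevant document
for i, doc in enumerate(ranking, start=1):
    if doc in relevant:
        hits += 1
        points.append((hits / len(relevant), hits / i))

print(points)
# [(0.1, 1.0), (0.2, ~0.66), (0.3, 0.5), (0.4, 0.4), (0.5, ~0.33)]
```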

Precision versus recall curve for a single query (Figure 3.2 of the book).

Average Over Multiple Queries
P(r) = (1/Nq) Σ_{i=1..Nq} Pi(r)
where P(r) is the average precision at recall level r, Nq is the number of queries used, and Pi(r) is the precision at recall level r for the i-th query.
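
A small sketch of this averaging step; the per-query 11-point precision vectors below are invented for illustration only.

```python
# Sketch: average the precision of several queries at the 11 standard recall levels.
# 'per_query' holds illustrative 11-point precision vectors, one per query.
per_query = [
    [1.0, 1.0, 0.66, 0.5, 0.4, 0.33, 0.25, 0.2, 0.15, 0.1, 0.05],
    [1.0, 0.8, 0.6, 0.6, 0.5, 0.4, 0.3, 0.3, 0.2, 0.1, 0.1],
]
n_q = len(per_query)
avg = [sum(p[j] for p in per_query) / n_q for j in range(11)]   # P(r) at r = 0%, 10%, ..., 100%
print([round(v, 2) for v in avg])
```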

Interpolated precision
Consider a second query with only three relevant documents: Rq = {d3, d56, d129}
Ranking for this query (relevant documents marked with *):
1. d123  2. d84  3. d56*  4. d6  5. d8  6. d9  7. d511  8. d129*  9. d187  10. d25  11. d38  12. d48  13. d250  14. d11  15. d3*
P = 33% at R = 33%, P = 25% at R = 66%, P = 20% at R = 100%
Interpolation rule: P(rj) = max{ P(r) : rj ≤ r ≤ rj+1 }

Interpolated precision
Let rj, j ∈ {0, 1, 2, ..., 10}, be a reference to the j-th standard recall level (r0 = 0%, r1 = 10%, ..., r10 = 100%).
P(rj) = max{ P(r) : rj ≤ r ≤ rj+1 }
For the example above, the interpolated precision is 33% at the standard recall levels 0% through 30% (from the observed 33% at recall 33%), 25% at the levels 40% through 60% (from the observed 25% at recall 66%), and 20% at the levels 70% through 100%.
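
An illustrative Python sketch of 11-point interpolated precision. It uses the variant of the interpolation commonly used in practice, which takes, at each standard recall level, the maximum precision observed at any recall greater than or equal to that level; for this example it reproduces the values above.

```python
# Sketch: 11-point interpolated precision for one query.

def interpolated_11pt(relevant, ranking):
    n_rel = len(relevant)
    hits, observed = 0, []                       # observed (recall, precision) points
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            observed.append((hits / n_rel, hits / i))
    levels = [j / 10 for j in range(11)]         # 0.0, 0.1, ..., 1.0
    return [max((p for r, p in observed if r >= level), default=0.0)
            for level in levels]

relevant = {"d3", "d56", "d129"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(interpolated_11pt(relevant, ranking))
# [~0.33, ~0.33, ~0.33, ~0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```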

Average precision versus recall figures are used to compare the retrieval performance of several retrieval algorithms.

Single Value Summaries
Average precision versus recall figures compare retrieval algorithms over a set of example queries, but sometimes we need to examine the performance on individual queries.
In that case a single value summary is needed; the single value should be interpreted as a summary of the corresponding precision versus recall curve.

Single Value Summaries: Average Precision at Seen Relevant Documents
Average the precision figures obtained after each new relevant document is observed.
Example (the ranking of Figure 3.2): (1 + 0.66 + 0.5 + 0.4 + 0.33) / 5 ≈ 0.57
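
A minimal sketch of this measure, reusing the Figure 3.2 example; the exact result differs slightly from the slide's 0.57 because the slide rounds each term before averaging.

```python
# Sketch: average precision at seen relevant documents for a single ranking.

def avg_precision_at_seen_relevant(relevant, ranking):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)          # precision right after each new relevant doc
    return sum(precisions) / len(precisions) if precisions else 0.0

relevant = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129", "d187",
           "d25", "d38", "d48", "d250", "d11", "d3"]
print(round(avg_precision_at_seen_relevant(relevant, ranking), 2))   # 0.58
```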

Single Value Summaries: R-Precision
The precision at the R-th position in the ranking, where R is the total number of relevant documents for the current query (the size of Rq).
Figure 3.2: R = 10, R-precision = 4/10 = 0.4
Figure 3.3: R = 3, R-precision = 1/3 = 0.33
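
A short illustrative sketch of R-precision, using the Figure 3.3 example.

```python
# Sketch: R-precision, i.e. the precision after the first R ranked documents,
# where R is the number of relevant documents for the query.

def r_precision(relevant, ranking):
    r = len(relevant)
    top_r = ranking[:r]
    return sum(1 for doc in top_r if doc in relevant) / r

relevant = {"d3", "d56", "d129"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d11", "d3"]
print(round(r_precision(relevant, ranking), 2))   # 0.33
```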

Single Value Summaries: Precision Histograms
Use the R-precision measures for several queries to compare the retrieval performance of two algorithms A and B through visual inspection.
RPA/B(i) = RPA(i) - RPB(i), the difference in R-precision for the i-th query: a positive bar favours A, a negative bar favours B.
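
A small sketch of the histogram computation; the per-query R-precision values for the two runs are invented for illustration.

```python
# Sketch: R-precision difference "histogram" for two runs over the same queries.
rp_a = {1: 0.40, 2: 0.33, 3: 0.50, 4: 0.20}      # illustrative R-precision values, run A
rp_b = {1: 0.30, 2: 0.45, 3: 0.50, 4: 0.10}      # illustrative R-precision values, run B

for q in sorted(rp_a):
    diff = rp_a[q] - rp_b[q]                      # RPA/B(i): > 0 favours A, < 0 favours B
    symbol = "+" if diff > 0 else "-"
    print(f"query {q:2d}  {diff:+.2f}  {symbol * int(round(abs(diff) * 20))}")
```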

Summary Table Statistics
A single table summarizing statistics over all queries, such as: the number of queries used in the task, the total number of documents retrieved by all queries, the total number of relevant documents which were effectively retrieved when all queries are considered, and the total number of relevant documents which could have been retrieved by all queries, etc.

Precision and Recall: Drawbacks
Proper estimation of maximum recall requires knowledge of all the documents relevant to a query, which is seldom available.
Recall and precision capture different aspects of the retrieved set; measures which quantify the informativeness of the retrieval process might be more appropriate in some situations (e.g., interactive retrieval).
Recall and precision are easy to define only when a linear ordering of the retrieved documents is enforced.

Alternative Measures
The Harmonic Mean: F(j) = 2 / (1/r(j) + 1/P(j)), the harmonic mean of the recall r(j) and the precision P(j) at the j-th document in the ranking.
The E Measure: E(j) = 1 - (1 + b^2) / (b^2/r(j) + 1/P(j))
b = 1: E(j) = 1 - F(j)
b > 1: the user is more interested in precision
b < 1: the user is more interested in recall
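
A minimal sketch implementing the two formulas exactly as written above; the example recall and precision values are arbitrary.

```python
# Sketch: harmonic mean F and E measure at a position j of the ranking.

def f_measure(recall, precision):
    if recall == 0 or precision == 0:
        return 0.0
    return 2.0 / (1.0 / recall + 1.0 / precision)

def e_measure(recall, precision, b=1.0):
    if recall == 0 or precision == 0:
        return 1.0
    return 1.0 - (1.0 + b * b) / (b * b / recall + 1.0 / precision)

r, p = 0.5, 0.25                     # illustrative recall and precision at position j
print(f_measure(r, p))               # ~0.33
print(e_measure(r, p, b=1.0))        # ~0.67 = 1 - F
```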

User-Oriented Measures
Let U be the set of relevant documents known to the user, Rk the set of retrieved documents that are in U, and Ru the set of retrieved relevant documents previously unknown to the user.
Coverage = |Rk| / |U|: the fraction of the documents known to the user to be relevant which have actually been retrieved.
Novelty = |Ru| / (|Ru| + |Rk|): the fraction of the relevant documents retrieved which were previously unknown to the user.
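
A small sketch of coverage and novelty; the document-id sets are invented for the example.

```python
# Sketch: coverage and novelty from sets of document ids (illustrative data).
relevant_retrieved = {"d1", "d4", "d7", "d9"}      # relevant docs in the answer set
known_to_user = {"d1", "d4", "d12"}                # relevant docs the user already knew (U)

rk = relevant_retrieved & known_to_user            # known relevant docs that were retrieved
ru = relevant_retrieved - known_to_user            # retrieved relevant docs new to the user

coverage = len(rk) / len(known_to_user)            # |Rk| / |U|
novelty = len(ru) / (len(ru) + len(rk))            # |Ru| / (|Ru| + |Rk|)
print(coverage, novelty)                           # ~0.67, 0.5
```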

Reference Collections
Reference collections have been used throughout the years for the evaluation of IR systems.
Reference test collections covered here: TIPSTER/TREC (large), CACM and ISI (small), and Cystic Fibrosis (small, with extensive relevance judgements).

TREC (Text REtrieval Conference)
Research in IR has long lacked a solid formal framework as a basic foundation, as well as robust and consistent testbeds and benchmarks.

TREC (Text REtrieval Conference)
Initiated under the National Institute of Standards and Technology (NIST).
Goals: providing a large test collection, uniform scoring procedures, and a forum for comparing results.
1st TREC conference in 1992; 7th TREC conference in 1998.
The TREC collection comprises the test documents, example information requests (topics), and the relevant documents for each topic, plus a set of benchmark tasks.

The Document Collection The TREC collection has been growing steadily over the years. TREC-3: 2 gigabytes TREC-6: 5.8 gigabytes The collection can be obtained at a small fee It is distributed in 6 CDs of compressed text Documents from sub-collections are tagged with SGML to allow easy parsing.

TREC1-6 Documents

The Document Collection
Example: TREC document numbered WSJ880406-0090, tagged with SGML:
<doc>
<docno> WSJ880406-0090 </docno>
<hl> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </hl>
<author> Janet Guyon (WSJ Staff) </author>
<dateline> New York </dateline>
<text>
American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad…
</text>
</doc>

The Example Information Requests (Topics)
Example information requests can be used for testing a new ranking algorithm.
A request (topic) is a description of an information need in natural language.
Example: topic numbered 168
<top>
<num> Number: 168
<title> Topic: Financing AMTRAK
<desc> Description: …..
<nar> Narrative: A …..
</top>

The Relevant Documents for each Example Information Request
There are more than 350 topics.
The set of relevant documents for each example information request (topic) is obtained from a pool of possibly relevant documents.
Pooling method: the pool is created by taking the top k documents (e.g., k = 100) in the rankings generated by the various participating IR systems. The documents in the pool are then shown to human assessors, who ultimately decide on the relevance of each document.
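
A minimal sketch of the pooling step described above; the run names and document ids are invented for illustration.

```python
# Sketch of the pooling method: the union of the top-k documents from each
# system's ranking for a topic forms the pool handed to the human assessors.

def build_pool(runs, k=100):
    """runs: dict mapping system name -> ranked list of doc ids for one topic."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])                   # take the top-k documents of each run
    return pool

runs = {
    "systemA": ["d12", "d7", "d3", "d99"],
    "systemB": ["d7", "d45", "d12", "d2"],
}
print(sorted(build_pool(runs, k=3)))               # ['d12', 'd3', 'd45', 'd7']
```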

The (Benchmark) Tasks at the TREC Conferences
Ad hoc task: receive new requests and execute them on a pre-specified document collection.
Routing task: fixed requests, but a changing document collection. A first collection is used for training and tuning the retrieval algorithm, and a second collection for testing the tuned algorithm. Similar to filtering, but the results are still ranked.

Other (secondary) tasks (ref. p. 90 of the book): Chinese, filtering, interactive, NLP (natural language processing), cross languages, high precision, spoken document retrieval, and the query task (introduced at TREC-7).


Evaluation Measures at the TREC Conferences
At the TREC conferences, four basic types of evaluation measures are used (ref. p. 91 of the book): summary table statistics, recall-precision averages, document level averages, and average precision histograms.

The CACM Collection
A small collection of computer science literature (articles from the Communications of the ACM).
Besides the text of each document, it includes structured subfields:
word stems from the title and abstract sections
categories
direct references between articles: a list of pairs of documents [da, db]
bibliographic coupling connections: a list of triples [d1, d2, ncited]
number of co-citations for each pair of articles: [d1, d2, nciting]
It provides a unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns.

The ISI Collection
A test collection of 1,460 documents selected from a previous collection assembled at the Institute for Scientific Information (ISI).
Its main purpose is to support investigation of similarities based on terms and on cross-citation patterns.

CFC Collection
1,239 documents indexed with the term "cystic fibrosis" in the National Library of Medicine's MEDLINE database.
Each document record is composed of: MEDLINE accession number, author, title, source, major subjects, minor subjects, abstract, references, and citations.

CFC Collection
100 information requests with extensive relevance judgements: 4 separate relevance scores for each request.
Scores provided by human experts and by a medical bibliographer.
Each score: 0 (not relevant), 1 (marginally relevant), 2 (strongly relevant).

CFC Collection
A small but convenient collection for experimentation: the number of information requests is large relative to the collection size, and the relevance judgements are good.
For online access: http://www.dcc.ufmg.br/~ilmerio/cfc/servidor