
Retrieval Evaluation J. H. Wang Mar. 18, 2008

Outline Chap. 3, Retrieval Evaluation –Retrieval Performance Evaluation –Reference Collections

Introduction Types of evaluation –Functional analysis phase and error analysis phase –Performance evaluation Performance evaluation –Response time and space required Retrieval performance evaluation –The evaluation of how precise the answer set is

Retrieval Performance Evaluation Queries in batch mode vs. interactive sessions [Figure: the collection, the set of relevant documents |R|, the answer set |A| (sorted by relevance), and their intersection, the relevant documents in the answer set |Ra|] Recall = |Ra| / |R| Precision = |Ra| / |A|
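A minimal sketch of these two ratios, assuming the relevant set R and the answer set A are available as Python sets (the function name and document ids are illustrative, not from the chapter):

```python
def recall_precision(relevant, answer):
    """Recall = |Ra|/|R|, Precision = |Ra|/|A|, where Ra = relevant ∩ answer."""
    ra = relevant & answer                 # relevant documents that were retrieved
    recall = len(ra) / len(relevant)
    precision = len(ra) / len(answer)
    return recall, precision

# Toy usage: 10 relevant documents, 15 retrieved, 5 of them relevant
R = {f"d{i}" for i in range(1, 11)}
A = {f"d{i}" for i in range(6, 21)}
print(recall_precision(R, A))              # (0.5, 0.333...)
```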

Precision versus Recall Curve R_q = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123} Ranking for query q (relevant documents marked with *): 1. d123* 2. d84 3. d56* 4. d6 5. d8 6. d9* 7. d511 8. d129 9. d187 10. d25* 11. d38 12. d48 13. d250 14. d113 15. d3* P = 100% at R = 10%, P = 66% at R = 20%, P = 50% at R = 30% Usually based on 11 standard recall levels: 0%, 10%, ..., 100%
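The curve can be traced by sweeping down the ranking and recording a (recall, precision) point each time a relevant document is seen. A sketch using the R_q and ranking of the example above (the function name is illustrative):

```python
def precision_recall_points(ranking, relevant):
    """Return (recall, precision) each time a relevant document is observed."""
    points, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    return points

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
for r, p in precision_recall_points(ranking, Rq):
    print(f"P = {p:.0%} at R = {r:.0%}")   # 100% at 10%, 67% at 20%, 50% at 30%, ...
```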

Precision versus Recall Curve for a single query [Figure 3.2: precision at the 11 standard recall levels]

Average Over Multiple Queries P̄(r) = Σ_{i=1}^{N_q} P_i(r) / N_q, where P̄(r) is the average precision at recall level r, N_q is the number of queries used, and P_i(r) is the precision at recall level r for the i-th query
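A small sketch of the averaging step, assuming each query's interpolated precision values at the 11 standard recall levels are already available (the function name and the toy per-query values are illustrative):

```python
def average_precision_at_levels(per_query_precisions):
    """Average P_i(r) over queries at each of the 11 standard recall levels."""
    n_q = len(per_query_precisions)
    return [sum(p[j] for p in per_query_precisions) / n_q for j in range(11)]

# Two queries, 11 interpolated precision values each (recall 0%, 10%, ..., 100%)
q1 = [1.0, 1.0, 0.67, 0.5, 0.4, 0.33, 0.0, 0.0, 0.0, 0.0, 0.0]
q2 = [0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
print(average_precision_at_levels([q1, q2]))
```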

Interpolated Precision R_q = {d3, d56, d129} Ranking for query q (relevant documents marked with *): 1. d123 2. d84 3. d56* 4. d6 5. d8 6. d9 7. d511 8. d129* 9. d187 10. d25 11. d38 12. d48 13. d250 14. d113 15. d3* P = 33% at R = 33%, P = 25% at R = 66%, P = 20% at R = 100% P(r_j) = max_{r_j ≤ r ≤ r_{j+1}} P(r)

Interpolated Precision Let r_j, j ∈ {0, 1, 2, …, 10}, be a reference to the j-th standard recall level P(r_j) = max_{r_j ≤ r ≤ r_{j+1}} P(r) For the example above (Fig. 3.3): the interpolated precision is 33% at recall levels 0%–30%, 25% at recall levels 40%–60%, and 20% at recall levels 70%–100%
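A common way to implement the interpolation is to take, at each standard recall level, the maximum precision observed at any recall greater than or equal to that level. A minimal sketch under that reading (the function name is illustrative):

```python
def interpolate_11pt(points):
    """points: list of (recall, precision) pairs observed for one query.
    Returns interpolated precision at recall 0.0, 0.1, ..., 1.0, where each
    value is the maximum observed precision at recall >= that level."""
    levels = [j / 10 for j in range(11)]
    interp = []
    for r_j in levels:
        candidates = [p for r, p in points if r >= r_j]
        interp.append(max(candidates) if candidates else 0.0)
    return interp

# Figure 3.3 example: Rq = {d3, d56, d129}, relevant documents at ranks 3, 8, 15
points = [(1/3, 1/3), (2/3, 2/8), (3/3, 3/15)]
print([round(p, 2) for p in interpolate_11pt(points)])
# [0.33, 0.33, 0.33, 0.33, 0.25, 0.25, 0.25, 0.2, 0.2, 0.2, 0.2]
```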

Average Recall vs. Precision [Figure: average precision versus recall curve over a set of queries]

Single Value Summaries Average precision versus recall –Compares retrieval algorithms over a set of example queries Sometimes we need to compare performance on individual queries –Averaging precision over many queries might disguise important anomalies in the retrieval algorithms –We might be interested in whether one algorithm outperforms the other for each query Need a single value summary –The single value should be interpreted as a summary of the corresponding precision versus recall curve

Single Value Summaries Average Precision at Seen Relevant Documents –Averaging the precision figures obtained after each new relevant document is observed –Example: Figure 3.2, (1 + 0.66 + 0.5 + 0.4 + 0.33)/5 ≈ 0.57 –This measure favors systems which retrieve relevant documents quickly (i.e., early in the ranking) R-Precision –The precision at the R-th position in the ranking –R: the total number of relevant documents for the current query (the number of documents in R_q) –Fig. 3.2: R = 10, value = 0.4 –Fig. 3.3: R = 3, value = 0.33
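A sketch of both single-value summaries for the ranking of the Figure 3.2 example above (document ids as reconstructed earlier; function names are illustrative):

```python
def avg_precision_at_seen_relevant(ranking, relevant):
    """Mean of the precision values observed at each seen relevant document
    (the slide divides by the number of relevant documents actually retrieved)."""
    precisions, hits = [], 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def r_precision(ranking, relevant):
    """Precision after the first R = |relevant| documents in the ranking."""
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r

Rq = {"d3", "d5", "d9", "d25", "d39", "d44", "d56", "d71", "d89", "d123"}
ranking = ["d123", "d84", "d56", "d6", "d8", "d9", "d511", "d129",
           "d187", "d25", "d38", "d48", "d250", "d113", "d3"]
print(round(avg_precision_at_seen_relevant(ranking, Rq), 2))  # 0.58 exact (0.57 with rounded terms)
print(r_precision(ranking, Rq))                               # 0.4
```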

Precision Histograms Use R-precision measures to compare the retrieval history of two algorithms through visual inspection RP_{A/B}(i) = RP_A(i) − RP_B(i)
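A minimal sketch of the per-query difference that the histogram bars plot, assuming the per-query R-precision values of algorithms A and B are already computed (names and toy values are illustrative):

```python
def rp_difference(rp_a, rp_b):
    """RP_A/B(i) = RP_A(i) - RP_B(i) for each query i.
    Positive bars favor algorithm A, negative bars favor B."""
    return [a - b for a, b in zip(rp_a, rp_b)]

rp_a = [0.40, 0.33, 0.80, 0.10]   # R-precision of algorithm A on queries 1..4
rp_b = [0.30, 0.50, 0.80, 0.25]   # R-precision of algorithm B on the same queries
print([round(d, 2) for d in rp_difference(rp_a, rp_b)])   # [0.1, -0.17, 0.0, -0.15]
```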

Summary Table Statistics Single value measures can be stored in a table regarding the set of all queries –the number of queries –total number of documents retrieved by all queries –total number of relevant documents which were effectively retrieved when all queries are considered –total number of relevant documents which could have been retrieved by all queries –…

Precision and Recall Appropriateness Proper estimation of maximum recall for a query requires knowledge of all documents in the collection Recall and precision are related measures which capture different aspects of the documents Measures which quantify the informativeness of the retrieval process might be more appropriate Recall and precision are easy to define when a linear ordering of the retrieved documents is enforced

Alternative Measures The Harmonic Mean F(j) of recall and precision –Values in [0, 1] The E Measure –Lets the user specify the relative importance of recall and precision via a parameter b –b = 1: E(j) = 1 − F(j) –b > 1: more interested in precision –b < 1: more interested in recall
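A hedged sketch of these measures, assuming the textbook definitions F(j) = 2 / (1/r(j) + 1/P(j)) and E(j) = 1 − (1 + b²) / (b²/r(j) + 1/P(j)); the slide does not show the formulas, so treat them as assumptions, and the variable names are illustrative:

```python
def harmonic_mean_f(recall, precision):
    """F(j): harmonic mean of recall and precision, in [0, 1]."""
    if recall == 0 or precision == 0:
        return 0.0
    return 2 / (1 / recall + 1 / precision)

def e_measure(recall, precision, b=1.0):
    """E(j) = 1 - (1 + b^2) / (b^2 / recall + 1 / precision).
    With b = 1 this reduces to 1 - F(j)."""
    if recall == 0 or precision == 0:
        return 1.0
    return 1 - (1 + b**2) / (b**2 / recall + 1 / precision)

r, p = 0.5, 0.25
print(round(harmonic_mean_f(r, p), 3))   # 0.333
print(round(e_measure(r, p, b=1), 3))    # 0.667 = 1 - F
```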

User-Oriented Measure Assumption: different users might have a different interpretation of which document is relevant

User-Oriented Measure Coverage = |R_k| / |U|, Novelty = |R_u| / (|R_u| + |R_k|), where U is the set of relevant documents previously known to the user, R_k is the set of retrieved documents the user already knew to be relevant, and R_u is the set of relevant retrieved documents previously unknown to the user A high coverage ratio indicates that the system is finding most of the relevant documents that the user expected to see A high novelty ratio indicates that the system is revealing many new relevant documents which were previously unknown
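A sketch of the two ratios under the definitions above, splitting the relevant retrieved documents into previously known (R_k) and previously unknown (R_u) ones; the function name and document ids are illustrative:

```python
def coverage_novelty(known_relevant, retrieved_relevant):
    """known_relevant (U): relevant documents the user already knew about.
    retrieved_relevant: relevant documents returned by the system."""
    r_k = retrieved_relevant & known_relevant      # known relevant docs retrieved
    r_u = retrieved_relevant - known_relevant      # new relevant docs revealed
    coverage = len(r_k) / len(known_relevant)
    novelty = len(r_u) / (len(r_u) + len(r_k)) if (r_u or r_k) else 0.0
    return coverage, novelty

U = {"d1", "d2", "d3", "d4"}
retrieved_relevant = {"d2", "d3", "d7", "d9", "d11"}
print(coverage_novelty(U, retrieved_relevant))     # (0.5, 0.6)
```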

Other Measures Relative recall : the ratio between the number of relevant documents found and the number of relevant documents the user expected to find Recall effort : the ratio between the number of relevant documents the user expected to find and the number of documents examined Others: expected search length, satisfaction, frustration

Reference Collections Reference test collections for the evaluation of IR systems –TIPSTER/TREC: large size, allows thorough experimentation –CACM, ISI: of historical importance –Cystic Fibrosis: a small collection whose documents were extensively studied by specialists when generating the sets of relevant documents

Criticisms of IR Research Lacks a solid formal framework as a basic foundation –This criticism is difficult to dismiss due to the subjectiveness associated with the task of deciding on the relevance of a document Lacks robust and consistent testbeds and benchmarks –Early experimentation was based on relatively small test collections, and there were no widely accepted benchmarks –In the early 1990s, the TREC conference, led by Donna Harman at NIST, was established and dedicated to experimentation with a large test collection

TREC (Text REtrieval Conference) Initiated under the National Institute of Standards and Technology (NIST) Goals: –Providing a large test collection –Uniform scoring procedures –A forum for comparing results The 7th TREC conference was held in 1998 –Document collection: test collections, example information requests (topics), relevant documents –The benchmark tasks

The Document Collection Documents are tagged with SGML to allow easy parsing Example (WSJ document): headline "AT&T Unveils Services to Upgrade Phone Networks Under Global Plan", author Janet Guyon (WSJ Staff), dateline New York, text "American Telephone & Telegraph Co. introduced the first of a new generation of phone services with broad…"
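A hedged sketch of how such a tagged document can be parsed. The tag names (<DOC>, <DOCNO>, <HL>, <AUTHOR>, <DATELINE>, <TEXT>) are typical of TREC's WSJ collection but are assumptions here, and the DOCNO value is made up for illustration:

```python
import re

# A toy WSJ-style document; tag names are assumed, DOCNO is a placeholder.
raw = """<DOC>
<DOCNO> WSJ0000-0000 </DOCNO>
<HL> AT&T Unveils Services to Upgrade Phone Networks Under Global Plan </HL>
<AUTHOR> Janet Guyon (WSJ Staff) </AUTHOR>
<DATELINE> New York </DATELINE>
<TEXT> American Telephone & Telegraph Co. introduced the first of a new
generation of phone services with broad ... </TEXT>
</DOC>"""

def parse_fields(doc_text):
    """Extract the inner <TAG> ... </TAG> fields of one SGML-tagged document."""
    pattern = r"<(DOCNO|HL|AUTHOR|DATELINE|TEXT)>(.*?)</\1>"
    return {tag: " ".join(value.split())
            for tag, value in re.findall(pattern, doc_text, flags=re.S)}

fields = parse_fields(raw)
print(fields["HL"])      # the headline field
print(fields["DOCNO"])   # WSJ0000-0000
```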

TREC-1 to TREC-6 Documents [table of document sources and collection sizes not reproduced]

The Example Information Requests (Topics) Each request (topic) is a description of an information need in natural language Each topic is identified by a topic number Example: Number: 168, Topic: Financing AMTRAK, Description: …, Narrative: …

TREC ~ Topics

TREC ~ Relevance Assessment Relevance assessment –Pooling method –The documents in the pool are shown to human assessors who decide on their relevance Two assumptions –The vast majority of the relevant documents is collected in the assembled pool –Documents that are not in the pool can be considered not relevant

Pooling Method The set of relevant documents for each example information request is obtained from a pool of possible relevant documents –This pool is created by taking the top K documents (usually, K=100) in the rankings generated by the various participating retrieval systems
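A minimal sketch of pool construction, assuming each participating system contributes a ranked list for a given topic (the run names, document ids, and the small K are illustrative):

```python
def build_pool(runs, k=100):
    """runs: dict mapping system name -> ranked list of document ids for one topic.
    Returns the union of the top-k documents of every run; these are the
    documents that would be handed to the human assessors."""
    pool = set()
    for ranking in runs.values():
        pool.update(ranking[:k])
    return pool

runs = {
    "systemA": ["d7", "d2", "d9", "d4"],
    "systemB": ["d2", "d8", "d7", "d1"],
}
print(sorted(build_pool(runs, k=3)))   # ['d2', 'd7', 'd8', 'd9']
```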

The (Benchmark) Tasks at the TREC Conferences Ad hoc task –Receive new requests and execute them on a pre-specified document collection Routing task –Receive test information requests and two document collections –First collection: for training and tuning the retrieval algorithm –Second collection: for testing the tuned retrieval algorithm

Other Tracks: *Chinese, Filtering, Interactive, *NLP (natural language processing), Cross-language, High precision, Spoken document retrieval, Query (TREC-7) Others: Web, Terabyte, SPAM, Blog, Novelty, Question Answering, HARD, …

TREC ~ Evaluation

Evaluation Measures at the TREC Conferences Summary table statistics Recall-precision Document level averages* Average precision histogram

The CACM Collection A small collection of computer science literature (articles from the Communications of the ACM, 1958–1979) Text of 3,204 documents Structured subfields –Word stems from the title and abstract sections –Categories –Direct references between articles: a list of document pairs [d_a, d_b] –Bibliographic coupling connections: a list of triples [d_1, d_2, n_cited] –Number of co-citations for each pair of articles: [d_1, d_2, n_citing] A unique environment for testing retrieval algorithms which are based on information derived from cross-citing patterns

The CACM collection also includes a set of 52 test information requests –Example: "What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?" It also includes two Boolean query formulations and the set of relevant documents for each request Since the requests are fairly specific, the average number of relevant documents for each request is small (around 15) Precision and recall figures tend to be low

The ISI Collection The 1,460 documents in the ISI test collection were selected from a previous collection assembled by Small at ISI (Institute of Scientific Information) The documents selected were those most cited in a cross-citation study done by Small The main purpose is to support investigation of similarities based on terms and on cross-citation patterns

The Cystic Fibrosis (CF) Collection 1,239 documents indexed with the term "cystic fibrosis" (囊狀纖維化) in the MEDLINE database Information requests were generated by an expert in cystic fibrosis Relevance scores were provided by subject experts –0: non-relevant –1: marginal relevance –2: high relevance

Characteristics of the CF Collection Relevance scores were generated directly by human experts It includes a good number of information requests (relative to the collection size) –The respective query vectors present overlap among themselves –This allows experimentation with retrieval strategies which take advantage of past query sessions to improve retrieval performance

Trends and Research Issues Interactive user interfaces –A general belief: effective retrieval is highly dependent on obtaining proper feedback from the user –Deciding which evaluation measures are most appropriate in this scenario (e.g., the informativeness measure proposed in 1992) The proposal, study, and characterization of alternative measures to recall and precision