Evaluation of Image Retrieval Results Relevant: images that meet the user’s information need Irrelevant: images that do not meet the user’s information need Query: cat [Figure: example results for the query “cat”, partitioned into Relevant and Irrelevant images] 1

Accuracy Given a query, an engine classifies each image as “Relevant” or “Nonrelevant” The accuracy of an engine: the fraction of these classifications that are correct – (tp + tn) / ( tp + fp + fn + tn) Accuracy is a commonly used evaluation measure in machine learning classification work Why is this not a very useful evaluation measure in IR? 2
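As a small sketch (a hedged illustration, not from the slides; tp, fp, fn, tn are the usual confusion-matrix counts), accuracy follows directly from those counts, and a toy collection hints at the problem the next slide spells out:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/non-relevant classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Toy collection: 1,000,000 images, only 50 relevant to the query.
# An engine that returns nothing at all (tp = fp = 0) is still "almost perfectly accurate".
print(accuracy(tp=0, fp=0, fn=50, tn=999_950))  # 0.99995
```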

3 Why not just use accuracy? How to build a 99.9999% accurate search engine on a low budget: answer every query with “0 matching results found.” Since almost every image in a large collection is irrelevant to any given query, classifying everything as non-relevant is almost always correct, yet the engine is useless. People doing information retrieval want to find something, and have a certain tolerance for junk.

Unranked retrieval evaluation: Precision and Recall Precision: fraction of retrieved images that are relevant Recall: fraction of relevant images that are retrieved Precision: P = tp/(tp + fp) Recall: R = tp/(tp + fn) 4
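A minimal sketch of these set-based definitions, assuming the retrieved results and the relevant images are given as sets of image IDs (the IDs and function name are illustrative):

```python
def precision_recall(retrieved, relevant):
    """Unranked precision P = tp/(tp+fp) and recall R = tp/(tp+fn) over sets of image IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)                        # relevant images that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall

# 3 of the 4 retrieved images are relevant; 5 images are relevant in total.
print(precision_recall({"img1", "img2", "img3", "img7"},
                       {"img1", "img2", "img3", "img4", "img5"}))  # (0.75, 0.6)
```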

5 Precision/Recall You can get high recall (but low precision) by retrieving all images for all queries! Recall is a non-decreasing function of the number of images retrieved. In a good system, precision decreases as either the number of images retrieved or recall increases. This is not a theorem, but a result with strong empirical confirmation.

6 A combined measure: F Combined measure that assesses the precision/recall tradeoff is the F measure (weighted harmonic mean): F = 1 / (α·(1/P) + (1−α)·(1/R)) = (β² + 1)·P·R / (β²·P + R), where β² = (1−α)/α People usually use the balanced F1 measure – i.e., with β = 1 (equivalently α = ½), giving F1 = 2PR/(P + R) Harmonic mean is a conservative average – See C. J. van Rijsbergen, Information Retrieval
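A small sketch of the weighted harmonic mean in the β-form above (β = 1 recovers the balanced F1; names are illustrative):

```python
def f_measure(precision, recall, beta=1.0):
    """F = (beta^2 + 1) * P * R / (beta^2 * P + R); beta = 1 gives the balanced F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

print(f_measure(0.75, 0.6))  # F1 = 2PR/(P + R) ≈ 0.667
```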

7 Evaluating ranked results Evaluation of ranked results: – The system can return any number of results – By taking various numbers of the top returned images (levels of recall), the evaluator can produce a precision-recall curve
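One way to compute the points of such a curve, as a sketch: assume the system returns a ranked list of image IDs and the relevant IDs are known, and record (recall, precision) after every rank cutoff (the input layout is an assumption):

```python
def pr_curve_points(ranked, relevant):
    """(recall, precision) after each top-k cutoff, k = 1..len(ranked)."""
    relevant = set(relevant)
    points, hits = [], 0
    for k, img in enumerate(ranked, start=1):
        if img in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))
    return points

# Relevant images sit at ranks 1, 3 and 5 of this ranking.
print(pr_curve_points(["a", "x", "b", "y", "c"], {"a", "b", "c"}))
```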

8 A precision-recall curve

9 Averaging over queries A precision-recall graph for one query isn’t a very sensible thing to look at. You need to average performance over a whole set of queries. But there’s a technical issue: – Precision-recall calculations place some points on the graph – How do you determine a value (interpolate) between the points?

10 Interpolated precision Idea: if locally precision increases with increasing recall, then you should get to count that… So you take the max of the precisions at or to the right of each recall value
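A sketch of that interpolation rule applied to (recall, precision) points such as the ones produced by the earlier curve sketch (the input format is an assumption):

```python
def interpolate(points):
    """points: (recall, precision) pairs; replace each precision with the max
    precision at any recall >= that recall."""
    interpolated, best = [], 0.0
    for recall, precision in sorted(points, reverse=True):
        best = max(best, precision)
        interpolated.append((recall, best))
    return list(reversed(interpolated))

# The local dip at recall 0.5 is lifted to the best precision further right.
print(interpolate([(0.25, 1.0), (0.5, 0.4), (0.75, 0.5), (1.0, 0.3)]))
```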

11 A precision-recall curve

12 Summarizing a Ranking Graphs are good, but people want summary measures! 1. Precision and recall at a fixed retrieval level Precision-at-k: precision of the top k results Recall-at-k: recall of the top k results Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages But: averages badly and has an arbitrary parameter k
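A sketch of precision-at-k and recall-at-k under the same assumptions as before (ranked list of image IDs plus a known relevant set):

```python
def precision_at_k(ranked, relevant, k):
    """Precision of the top-k results."""
    relevant = set(relevant)
    return sum(1 for img in ranked[:k] if img in relevant) / k

def recall_at_k(ranked, relevant, k):
    """Recall of the top-k results."""
    relevant = set(relevant)
    return sum(1 for img in ranked[:k] if img in relevant) / len(relevant)

print(precision_at_k(["a", "x", "b", "y", "c"], {"a", "b", "c"}, k=3))  # 2/3
print(recall_at_k(["a", "x", "b", "y", "c"], {"a", "b", "c"}, k=3))     # 2/3
```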

Summarizing a Ranking

14 2. Calculating precision at standard recall levels, from 0.0 to 1.0 – 11-point interpolated average precision The standard measure in the early TREC competitions: you take the interpolated precision at 11 recall levels, from 0.0 to 1.0 in steps of 0.1 (the value at recall 0.0 is always interpolated!), and average them Evaluates performance at all recall levels Summarizing a Ranking
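A sketch of the 11-point measure over (recall, precision) points like those from the earlier curve sketch; the interpolated precision at each standard level is the max precision at any recall at or beyond it (0.0 if no point reaches it), matching the interpolation rule above:

```python
def eleven_point_average(points):
    """Average interpolated precision at recall levels 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / len(levels)
```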

15 A precision-recall curve

16 Typical 11-point precisions [Figure: SabIR/Cornell 8A1 11-point interpolated precision from TREC 8 (1999)]

17 3. Average precision (AP) – Averaging the precision values from the rank positions where a relevant image was retrieved – Avoids interpolation and the use of fixed recall levels – MAP for a query collection is the arithmetic average of per-query AP (macro-averaging: each query counts equally) Summarizing a Ranking
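A sketch of AP for a single query, under the same assumptions as before (ranked list of image IDs and a known relevant set); it divides by the total number of relevant images, so relevant images that are never retrieved count as zero:

```python
def average_precision(ranked, relevant):
    """Mean of the precision values at the ranks where a relevant image occurs."""
    relevant = set(relevant)
    hits, total = 0, 0.0
    for k, img in enumerate(ranked, start=1):
        if img in relevant:
            hits += 1
            total += hits / k          # precision at rank k
    return total / len(relevant) if relevant else 0.0

# Relevant images at ranks 1, 3 and 5: AP = (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
print(average_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"}))
```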

18 Average Precision

19 3. Mean average precision (MAP) – summarizes rankings from multiple queries by averaging average precision – most commonly used measure in research papers – assumes the user is interested in finding many relevant images for each query Summarizing a Ranking
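MAP is then just the arithmetic mean of the per-query AP values. A self-contained sketch (the inner helper repeats the AP logic from the previous sketch; the input layout is an assumption):

```python
def mean_average_precision(per_query_runs):
    """per_query_runs: list of (ranked_image_ids, relevant_image_ids) pairs, one per query."""
    def ap(ranked, relevant):
        relevant = set(relevant)
        hits, total = 0, 0.0
        for k, img in enumerate(ranked, start=1):
            if img in relevant:
                hits += 1
                total += hits / k
        return total / len(relevant) if relevant else 0.0

    return sum(ap(r, rel) for r, rel in per_query_runs) / len(per_query_runs)
```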

21 4. R-precision – If we have a known (though perhaps incomplete) set of relevant images of size Rel, then calculate the precision of the top Rel images returned – A perfect system could score 1.0 Summarizing a Ranking
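A sketch of R-precision under the same assumptions (a known relevant set Rel):

```python
def r_precision(ranked, relevant):
    """Precision of the top |Rel| results for a known relevant set Rel."""
    relevant = set(relevant)
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for img in ranked[:r] if img in relevant) / r
```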

22 Summarizing a Ranking for Multiple Relevance Levels Example: image queries judged on a graded scale (Excellent / Relevant / Irrelevant), with the key ideas behind the judgments: – Van Gogh paintings: a painting, good quality – Bridge: full bridge visible, visually pleasing; picture of one person versus a group or no person – Hugh Grant: picture of the head, clear image, good quality

23 5. NDCG: Normalized Discounted Cumulative Gain – Popular measure for evaluating web search and related tasks – Two assumptions: Highly relevant images are more useful than marginally relevant images The lower the ranked position of a relevant image, the less useful it is for the user, since it is less likely to be examined Summarizing a Ranking for Multiple Relevance Levels

24 5. DCG: Discounted Cumulative Gain – the total gain accumulated at a particular rank p: DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i) – Alternative formulation, with emphasis on retrieving highly relevant images: DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(i + 1) Summarizing a Ranking for Multiple Relevance Levels

25 5. DCG: Discounted Cumulative Gain – 10 ranked images judged on a 0–3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0 – discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0 – DCG (running total): 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61 Summarizing a Ranking for Multiple Relevance Levels

26 5. NDCG – DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking; this makes averaging easier for queries with different numbers of relevant images – Perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0 – Ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88 – NDCG values (divide actual by ideal): 1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88 NDCG ≤ 1 at any rank position Summarizing a Ranking for Multiple Relevance Levels
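A sketch of DCG and NDCG using the rel_1 + rel_i/log2(i) formulation above; it reproduces the worked numbers from the last two slides (the input is the list of graded relevance values in ranked order):

```python
import math

def dcg(relevances):
    """Discounted cumulative gain for graded relevance values in ranked order."""
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

print(round(dcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]), 2))   # ≈ 9.61
print(round(ndcg([3, 2, 3, 0, 0, 1, 2, 2, 3, 0]), 2))  # ≈ 0.88
```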

27 Variance For a test collection, it is usual that a system does poorly on some information needs (e.g., AP = 0.1) and excellently on others (e.g., AP = 0.7) Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query. That is, there are easy information needs and hard ones!

28 Significance Tests Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A? A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A) – the power of a test is the probability that the test will reject the null hypothesis correctly – increasing the number of queries in the experiment also increases the power of the test
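As a hedged sketch of such a test (assuming per-query effectiveness scores, e.g., average precision, for systems A and B on the same queries, and SciPy ≥ 1.6 for the one-sided alternative): a paired t-test is one common choice, with the Wilcoxon signed-rank and sign tests as frequently used alternatives.

```python
from scipy import stats

# Per-query average precision for systems A and B (illustrative numbers only).
ap_a = [0.32, 0.41, 0.10, 0.55, 0.28, 0.62, 0.47, 0.19]
ap_b = [0.39, 0.43, 0.18, 0.60, 0.26, 0.70, 0.52, 0.25]

# Paired, one-sided test: null hypothesis "no difference",
# alternative hypothesis "B is better than A".
t_stat, p_value = stats.ttest_rel(ap_b, ap_a, alternative="greater")
print(t_stat, p_value)  # reject the null if p_value is below the chosen significance level
```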