Evaluation of Image Retrieval Results
Relevant: images which meet the user’s information need
Irrelevant: images which don’t meet the user’s information need
Query: cat (figure: example results labelled Relevant / Irrelevant)
Accuracy
Given a query, an engine classifies each image as “relevant” or “non-relevant”.
The accuracy of an engine is the fraction of these classifications that are correct:
accuracy = (tp + tn) / (tp + fp + fn + tn)
Accuracy is a commonly used evaluation measure in machine learning classification work.
Why is this not a very useful evaluation measure in IR?
Why not just use accuracy?
How to build a 99.9999% accurate search engine on a low budget: answer every query with “0 matching results found.”
People doing information retrieval want to find something, and have a certain tolerance for junk.
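As a small illustration of the point above (the corpus size and relevance counts are made up for this sketch), an engine that retrieves nothing is almost always “correct” when relevant images are rare, yet it never finds anything:

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of relevant/non-relevant classification decisions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)

# Hypothetical corpus: 1,000,000 images, only 50 of them relevant to the query.
# An engine that returns nothing makes no false positives and misses all 50 relevant images.
print(accuracy(tp=0, fp=0, fn=50, tn=999_950))   # 0.99995 -- near-perfect accuracy, zero images found
```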
Unranked retrieval evaluation: Precision and Recall
Precision: fraction of retrieved images that are relevant
Recall: fraction of relevant images that are retrieved
P = tp / (tp + fp)
R = tp / (tp + fn)
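A minimal sketch of both definitions, assuming results and judgments are available as collections of image ids (the variable and function names here are illustrative, not from the slides):

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    retrieved -- image ids returned by the engine
    relevant  -- set of image ids judged relevant for the query
    """
    retrieved = set(retrieved)
    tp = len(retrieved & relevant)                            # relevant images that were retrieved
    precision = tp / len(retrieved) if retrieved else 0.0     # tp / (tp + fp)
    recall = tp / len(relevant) if relevant else 0.0          # tp / (tp + fn)
    return precision, recall

p, r = precision_recall(retrieved=["img1", "img4", "img7"], relevant={"img1", "img2", "img7"})
print(p, r)   # both 2/3: two of the three retrieved are relevant, two of the three relevant were retrieved
```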
Precision/Recall
You can get high recall (but low precision) by retrieving all images for all queries!
Recall is a non-decreasing function of the number of images retrieved.
In a good system, precision decreases as either the number of images retrieved or the recall increases.
This is not a theorem, but a result with strong empirical confirmation.
A combined measure: F
The combined measure that assesses the precision/recall tradeoff is the F measure, a weighted harmonic mean:
F = 1 / (α(1/P) + (1 − α)(1/R)) = (β² + 1)PR / (β²P + R), where β² = (1 − α)/α
People usually use the balanced F1 measure, i.e., with β = 1 (α = ½), giving F1 = 2PR / (P + R).
The harmonic mean is a conservative average; see C. J. van Rijsbergen, Information Retrieval.
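A small sketch of the weighted harmonic mean above; with the default β = 1 it reduces to the balanced F1:

```python
def f_measure(precision, recall, beta=1.0):
    """Weighted harmonic mean of precision and recall (F_beta); beta=1 gives balanced F1."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * precision * recall / (b2 * precision + recall)

# The harmonic mean is conservative: for P=0.9, R=0.1 the arithmetic mean is 0.5, but F1 is only 0.18.
print(f_measure(0.9, 0.1))            # 0.18
print(f_measure(0.9, 0.1, beta=2.0))  # ~0.12; beta > 1 weights recall more heavily
```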
Evaluating ranked results
– The system can return any number of results.
– By taking various numbers of the top returned images (levels of recall), the evaluator can produce a precision-recall curve.
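A sketch of how the curve’s points can be generated, assuming one ranked list of image ids and a set of relevant ids: after each rank k we record the (recall, precision) pair.

```python
def pr_curve_points(ranking, relevant):
    """Return the (recall, precision) pair after each rank position of a ranked result list."""
    hits = 0
    points = []
    for k, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
        points.append((hits / len(relevant), hits / k))   # (recall@k, precision@k)
    return points

ranking = ["a", "x", "b", "y", "c"]                 # "a", "b", "c" are the relevant images
print(pr_curve_points(ranking, relevant={"a", "b", "c"}))
# roughly [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)]
```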
A precision-recall curve (figure)
Averaging over queries
A precision-recall graph for one query isn’t a very sensible thing to look at; you need to average performance over a whole bunch of queries.
But there’s a technical issue:
– Precision-recall calculations place some points on the graph.
– How do you determine a value (interpolate) between the points?
Interpolated precision
Idea: if precision locally increases with increasing recall, then you should get to count that.
So take the maximum of the precisions at all recall levels to the right of (i.e., at or above) the given recall value.
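A sketch of that rule applied to the (recall, precision) points of a single ranking (the helper name is mine): the interpolated precision at recall r is the maximum precision at any recall ≥ r.

```python
def interpolate(points):
    """Replace each precision with the best precision at any equal-or-higher recall.

    points -- list of (recall, precision) pairs in ranking order
    """
    interpolated = []
    best = 0.0
    for recall, precision in reversed(points):   # sweep from high recall back to low recall
        best = max(best, precision)
        interpolated.append((recall, best))
    return list(reversed(interpolated))

points = [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)]
print(interpolate(points))
# [(0.33, 1.0), (0.33, 0.67), (0.67, 0.67), (0.67, 0.6), (1.0, 0.6)]
```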
A precision-recall curve (figure)
Summarizing a Ranking
Graphs are good, but people want summary measures!
1. Precision and recall at a fixed retrieval level
– Precision-at-k: precision of the top k results
– Recall-at-k: recall of the top k results
– Perhaps appropriate for most of web search: all people want are good matches on the first one or two results pages.
– But it averages badly and has an arbitrary parameter k.
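A sketch of the two fixed-cutoff measures, reusing the ranked-list-plus-judgments representation from the earlier sketches:

```python
def precision_at_k(ranking, relevant, k):
    """Precision of the top k results."""
    return sum(1 for image_id in ranking[:k] if image_id in relevant) / k

def recall_at_k(ranking, relevant, k):
    """Recall of the top k results."""
    return sum(1 for image_id in ranking[:k] if image_id in relevant) / len(relevant)

ranking = ["a", "x", "b", "y", "c"]
print(precision_at_k(ranking, {"a", "b", "c"}, k=3))   # 2/3 of the top 3 are relevant
print(recall_at_k(ranking, {"a", "b", "c"}, k=3))      # 2 of the 3 relevant images appear in the top 3
```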
Summarizing a Ranking
Summarizing a Ranking
2. Precision at standard recall levels, from 0.0 to 1.0: 11-point interpolated average precision
– The standard measure in the early TREC competitions: take the precision at 11 recall levels varying from 0 to 1 by tenths, using interpolation (the value for recall 0 is always interpolated), and average them.
– Evaluates performance at all recall levels.
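A sketch of the 11-point measure for one query, using the interpolation rule from the previous slide (the best precision at any recall at or above each standard level):

```python
def eleven_point_average(points):
    """Average interpolated precision at the recall levels 0.0, 0.1, ..., 1.0.

    points -- list of (recall, precision) pairs for one ranked result list
    """
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # interpolated precision at this level: best precision at any recall >= level
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return sum(interpolated) / len(levels)

points = [(0.33, 1.0), (0.33, 0.5), (0.67, 0.67), (0.67, 0.5), (1.0, 0.6)]
print(eleven_point_average(points))   # averages 1.0 (levels 0.0-0.3), 0.67 (0.4-0.6) and 0.6 (0.7-1.0)
```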
A precision-recall curve (figure)
Typical 11-point precisions: SabIR/Cornell 8A1 11-pt precision from TREC 8 (1999) (figure)
Summarizing a Ranking
3. Average precision (AP)
– Average the precision values at the rank positions where a relevant image was retrieved.
– Avoids interpolation and the use of fixed recall levels.
– MAP for a query collection is the arithmetic average (macro-averaging: each query counts equally).
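A sketch of average precision for a single ranking, assuming binary relevance judgments; dividing by the total number of relevant images (a common convention, assumed here) means relevant images that are never retrieved pull the score down:

```python
def average_precision(ranking, relevant):
    """Average of precision@k over the ranks k at which a relevant image appears."""
    hits = 0
    precision_sum = 0.0
    for k, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / k      # precision at this rank
    # Divide by the total number of relevant images (assumed convention), not just those retrieved.
    return precision_sum / len(relevant) if relevant else 0.0

print(average_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"}))
# (1/1 + 2/3 + 3/5) / 3 ≈ 0.76
```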
Average Precision (figure)
Summarizing a Ranking
3. Mean average precision (MAP)
– Summarizes rankings from multiple queries by averaging average precision.
– The most commonly used measure in research papers.
– Assumes the user is interested in finding many relevant images for each query.
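MAP is then just the macro-averaged mean of the per-query average precision values; a sketch assuming the per-query rankings and judgments are held in dictionaries keyed by query (the data layout is illustrative):

```python
def average_precision(ranking, relevant):
    """Per-query average precision (same definition as in the earlier sketch)."""
    hits, precision_sum = 0, 0.0
    for k, image_id in enumerate(ranking, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(rankings, judgments):
    """Arithmetic mean of average precision over all queries (each query counts equally).

    rankings  -- {query: ranked list of image ids}
    judgments -- {query: set of relevant image ids}
    """
    scores = [average_precision(rankings[q], judgments[q]) for q in rankings]
    return sum(scores) / len(scores)

rankings = {"cat": ["a", "x", "b"], "bridge": ["p", "q", "r"]}
judgments = {"cat": {"a", "b"}, "bridge": {"q"}}
print(mean_average_precision(rankings, judgments))   # mean of AP("cat") ≈ 0.83 and AP("bridge") = 0.5
```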
Summarizing a Ranking
4. R-precision
– If we have a known (though perhaps incomplete) set of relevant images of size Rel, calculate the precision of the top Rel images returned.
– A perfect system could score 1.0.
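A sketch, with Rel taken to be the size of the known relevant set:

```python
def r_precision(ranking, relevant):
    """Precision of the top |Rel| results, where Rel is the known set of relevant images."""
    rel = len(relevant)
    return sum(1 for image_id in ranking[:rel] if image_id in relevant) / rel

print(r_precision(["a", "x", "b", "y", "c"], {"a", "b", "c"}))   # 2 of the top 3 are relevant -> 2/3
```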
Summarizing a Ranking for Multiple Relevance Levels
Example queries judged on a graded scale (Excellent / Relevant / Irrelevant), with the key ideas behind the judgments:
– Van Gogh paintings: a painting, and good quality
– Bridge: can see the full bridge; visually pleasing; a picture of one person versus a group or no person
– Hugh Grant: picture of the head; clear image; good quality
Summarizing a Ranking for Multiple Relevance Levels
5. NDCG: Normalized Discounted Cumulative Gain
– A popular measure for evaluating web search and related tasks.
– Two assumptions:
  – Highly relevant images are more useful than marginally relevant images.
  – The lower the ranked position of a relevant image, the less useful it is for the user, since it is less likely to be examined.
Summarizing a Ranking for Multiple Relevance Levels
5. DCG: Discounted Cumulative Gain
– The total gain accumulated at a particular rank p:
  DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i)
– An alternative formulation puts more emphasis on retrieving highly relevant images:
  DCG_p = Σ_{i=1..p} (2^rel_i − 1) / log2(i + 1)
Summarizing a Ranking for Multiple Relevance Levels
5. DCG: Discounted Cumulative Gain
– 10 ranked images judged on a 0-3 relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0
– Discounted gain: 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0
– DCG at each rank: 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61
Summarizing a Ranking for Multiple Relevance Levels
5. NDCG
– DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking; this makes averaging easier across queries with different numbers of relevant images.
– Perfect ranking: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0
– Ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88
– NDCG values (actual DCG divided by ideal DCG): 1, 0.83, 0.87, 0.76, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88
– NDCG ≤ 1 at any rank position.
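A sketch that reproduces the worked example on the last two slides, using the first DCG formulation (gain discounted by log2 of the rank from position 2 onward); the ideal ranking is obtained by sorting the judged gains in decreasing order:

```python
import math

def dcg(gains):
    """DCG at every rank: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    total, values = 0.0, []
    for i, gain in enumerate(gains, start=1):
        total += gain if i == 1 else gain / math.log2(i)
        values.append(total)
    return values

def ndcg(gains):
    """DCG at each rank divided by the DCG of the perfect (sorted) ranking at that rank."""
    ideal = dcg(sorted(gains, reverse=True))
    return [actual / best for actual, best in zip(dcg(gains), ideal)]

gains = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]
print([round(v, 2) for v in dcg(gains)])    # [3.0, 5.0, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61]
print([round(v, 2) for v in ndcg(gains)])   # matches the slide values up to small rounding differences
```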
Variance
For a test collection, it is usual that a system does crummily on some information needs (e.g., MAP = 0.1) and excellently on others (e.g., MAP = 0.7).
Indeed, it is usually the case that the variance in performance of the same system across queries is much greater than the variance of different systems on the same query.
That is, there are easy information needs and hard ones!
Significance Tests
Given the results from a number of queries, how can we conclude that ranking algorithm B is better than algorithm A?
A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A).
– The power of a test is the probability that the test will reject the null hypothesis correctly.
– Increasing the number of queries in the experiment also increases the power of the test.
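The slide does not name a particular test, so as one hedged illustration here is a paired randomization (sign-flipping) test over per-query scores: under the null hypothesis the sign of each per-query difference is arbitrary, so we flip signs at random and see how often the mean difference is at least as large as the observed one. The per-query AP values below are made up.

```python
import random

def paired_randomization_test(scores_a, scores_b, trials=10_000, seed=0):
    """One-sided paired randomization test: is system B's mean per-query score higher than A's?

    Returns an estimated p-value: the probability of a mean difference at least as large
    as the observed one if the A/B labels were assigned to each query's pair at random.
    """
    rng = random.Random(seed)
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / len(diffs)
    at_least_as_extreme = 0
    for _ in range(trials):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if sum(flipped) / len(flipped) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# Hypothetical per-query average precision scores for two ranking algorithms.
ap_a = [0.10, 0.42, 0.31, 0.25, 0.55, 0.18, 0.47, 0.30]
ap_b = [0.19, 0.45, 0.40, 0.33, 0.58, 0.22, 0.51, 0.29]
print(paired_randomization_test(ap_a, ap_b))   # a small p-value favors rejecting the null hypothesis
```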