1
Evaluation
CSC4170 Web Intelligence and Social Computing
Tutorial 5
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk
2
Outline
- Evaluation of ranked retrieval results
  - Precision-recall curve
  - Interpolated precision
  - Precision at k
  - R-precision
- Assessing relevance
  - Pooling
  - Kappa statistic
- Q&A
3
Evaluation of ranked retrieval results
Precision-recall curve:
- Precision = #(relevant items retrieved) / #(retrieved items)
- Recall = #(relevant items retrieved) / #(relevant items)
- Precision and recall values can be plotted to give a precision-recall curve.
4
Evaluation of ranked retrieval results
Precision-recall curve: distinctive sawtooth shape (blue line):
- If the (k+1)th document retrieved is nonrelevant, recall is the same, but precision drops.
- If the (k+1)th document retrieved is relevant, both precision and recall increase, and the curve jags up and to the right.
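A minimal sketch of how the points of such a curve can be computed, assuming binary relevance judgments for a ranked list; the judgments and counts below are invented for illustration, not taken from the slides.

```python
# Sketch: (recall, precision) points along a ranked result list,
# given binary relevance judgments (1 = relevant, 0 = nonrelevant).

def precision_recall_points(judgments, total_relevant):
    """Return one (recall, precision) pair per rank k = 1, 2, ..."""
    points = []
    relevant_so_far = 0
    for k, rel in enumerate(judgments, start=1):
        relevant_so_far += rel
        points.append((relevant_so_far / total_relevant,   # recall
                       relevant_so_far / k))                # precision
    return points

if __name__ == "__main__":
    ranked = [1, 0, 1, 1, 0, 0, 1, 0]          # top 8 results of some run
    for recall, precision in precision_recall_points(ranked, total_relevant=6):
        print(f"recall={recall:.2f}  precision={precision:.2f}")
    # Precision drops at every 0 (same recall), which produces the sawtooth.
```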
5
Evaluation of ranked retrieval results
Precision-recall curve:
- The standard way to remove these jiggles is to use an interpolated precision.
- Interpolated precision (red line): the interpolated precision p_interp at a certain recall level r is defined as the highest precision found for any recall level r' >= r:
  p_interp(r) = max_{r' >= r} p(r')
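A sketch of that definition in code, assuming a list of (recall, precision) points like the one computed above; the example numbers are illustrative.

```python
# Sketch: interpolated precision at a given recall level, assuming
# (recall, precision) points such as those from precision_recall_points.

def interpolated_precision(points, r):
    """Highest precision found at any recall level r' >= r (0.0 if none)."""
    return max((p for recall, p in points if recall >= r), default=0.0)

points = [(0.2, 1.00), (0.4, 0.67), (0.6, 0.75), (0.8, 0.57), (1.0, 0.45)]
print(interpolated_precision(points, 0.5))   # -> 0.75
```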
6
Evaluation of ranked retrieval results
Eleven-point interpolated average precision:
- There is a desire to boil the interpolated precision down to a few numbers, or perhaps even a single number.
- Interpolated precision is measured at the 11 recall levels of 0.0, 0.1, 0.2, …, 1.0, and these values can then be averaged (per query, and across queries) to summarize a run.
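A sketch of the per-query eleven-point average under the same assumptions; it reuses the interpolated-precision idea above with invented numbers.

```python
# Sketch: eleven-point interpolated average precision for one query,
# given that query's (recall, precision) points (illustrative data).

def interp(points, r):
    """Highest precision at any recall >= r."""
    return max((p for recall, p in points if recall >= r), default=0.0)

def eleven_point_average(points):
    """Mean interpolated precision at recall 0.0, 0.1, ..., 1.0."""
    levels = [i / 10 for i in range(11)]
    return sum(interp(points, r) for r in levels) / len(levels)

points = [(0.2, 1.00), (0.4, 0.67), (0.6, 0.75), (0.8, 0.57), (1.0, 0.45)]
print(round(eleven_point_average(points), 3))   # -> 0.731
```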
7
Evaluation of ranked retrieval results
Precision at k:
- Web search: what matters is how many good results there are on the first page or the first three pages.
- Measure precision at fixed low levels of retrieval results, e.g. precision at 10.
- Advantage: it does not require any estimate of the size of the set of relevant documents.
- Disadvantage: it does not average well, because the total number of relevant documents for a query has a strong influence on precision at k. Why? Because |Rel|, the number of relevant documents, is different for each query.
R-precision:
- If there are |Rel| relevant documents for a query, we examine the top |Rel| results of a system and find that r of them are relevant; R-precision is then r / |Rel|.
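A sketch of both measures for a single query, again assuming binary relevance judgments for an invented ranked list.

```python
# Sketch: precision at k and R-precision for one query, given binary
# relevance judgments for the ranked list (illustrative data only).

def precision_at_k(judgments, k):
    """Fraction of the top k results that are relevant."""
    return sum(judgments[:k]) / k

def r_precision(judgments, num_relevant):
    """Precision at rank |Rel|, the query's total number of relevant documents."""
    return sum(judgments[:num_relevant]) / num_relevant

ranked = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
print(precision_at_k(ranked, 10))    # -> 0.5
print(r_precision(ranked, 6))        # top 6 hold 3 relevant -> 0.5
```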
8
Assessing relevance
- Given the information needs and documents, you need to collect relevance assessments: this is time-consuming and expensive.
- For large modern collections, relevance is assessed only for a subset of the documents for each query.
- Standard approach: pooling.
9
Assessing relevance
Pooling:
- Relevance is assessed over a subset of the collection that is formed from the top k documents returned by a number of different IR systems.
- E.g. the union of the top 20 results from Google, Yahoo!, Bing, …
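A sketch of the pooling step; the system names and document IDs are invented for illustration.

```python
# Sketch: build an assessment pool from the top-k results of several systems.

def build_pool(runs, k):
    """Union of the top-k document IDs from each system's ranked run."""
    pool = set()
    for ranked_docs in runs.values():
        pool.update(ranked_docs[:k])
    return pool

runs = {
    "system_A": ["d3", "d7", "d1", "d9"],
    "system_B": ["d7", "d2", "d3", "d8"],
    "system_C": ["d5", "d3", "d6", "d1"],
}
print(sorted(build_pool(runs, k=2)))   # -> ['d2', 'd3', 'd5', 'd7']
```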
10
Assessing relevance
Kappa statistic:
- A human is not a device that reliably reports a gold-standard judgment of the relevance of a document to a query, so we measure the agreement between judges.
- kappa = (P(A) - P(E)) / (1 - P(E))
- P(A): proportion of the times the judges agreed.
- P(E): proportion of the times they would be expected to agree by chance; it is usual to use marginal statistics to calculate the expected agreement.
11
Assessing relevance
12
Kappa statistic:
- value = 1 if the two judges always agree.
- value = 0 if they agree only at the rate given by chance.
- value < 0 if they are worse than random.
- More than two judges: calculate an average pairwise kappa value.
- A rule of thumb: above 0.8 is taken as good agreement; between 0.67 and 0.8 as fair agreement; below 0.67 the data provide a dubious basis for an evaluation.
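A sketch of the two-judge kappa computation with binary relevance labels, using marginal statistics pooled over both judges to estimate P(E), as described above; the judgments are invented for illustration.

```python
# Sketch: kappa for two judges with binary relevance labels (1 = relevant).
# P(E) is estimated from marginal statistics pooled over both judges.

def kappa(judge1, judge2):
    n = len(judge1)
    # P(A): observed proportion of agreement.
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # Pooled marginal probability of a "relevant" label.
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    # P(E): chance agreement (both say relevant, or both say nonrelevant).
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

judge1 = [1, 1, 0, 1, 0, 0, 1, 0]
judge2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(kappa(judge1, judge2))   # -> 0.5: below 0.67, a dubious level of agreement
```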
13
Questions?