
1 Evaluation
CSC4170 Web Intelligence and Social Computing, Tutorial 5
Tutor: Tom Chao Zhou
Email: czhou@cse.cuhk.edu.hk

2 Outline
- Evaluation of ranked retrieval results.
  - Precision-recall curve.
  - Interpolated precision.
  - Precision at k.
  - R-precision.
- Assessing relevance.
  - Pooling.
  - Kappa statistic.
- Q&A.

3 Evaluation of ranked retrieval results
Precision-recall curve:
- Precision = #(relevant items retrieved) / #(retrieved items)
- Recall = #(relevant items retrieved) / #(relevant items)
- Precision and recall values can be plotted to give a precision-recall curve.
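A minimal sketch of these two definitions in Python; the document ids, the retrieved/relevant sets, and the precision_recall helper are illustrative, not from the slides:

```python
# Sketch: set-based precision and recall for one query.

def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a set of retrieved doc ids."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                    # relevant items retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 5 retrieved documents are relevant, out of 6 relevant overall.
print(precision_recall({1, 2, 3, 4, 5}, {2, 3, 5, 8, 9, 10}))  # (0.6, 0.5)
```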

4 Evaluation of ranked retrieval results
Precision-recall curve has a distinctive sawtooth shape (blue line on the slide):
- If the (k+1)th document retrieved is nonrelevant, recall is the same but precision drops.
- If the (k+1)th document retrieved is relevant, both precision and recall increase, and the curve jags up and to the right.
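The sawtooth can be reproduced by computing precision and recall after each rank of a ranked list; the relevance pattern and the pr_points helper below are made-up illustrations:

```python
# Sketch: precision/recall after each rank k of a ranked result list.

def pr_points(ranking, total_relevant):
    """ranking: list of booleans, True = the document at that rank is relevant."""
    points, hits = [], 0
    for k, is_relevant in enumerate(ranking, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / k))   # (recall, precision)
    return points

ranking = [True, False, True, True, False]
for recall, precision in pr_points(ranking, total_relevant=4):
    print(f"recall={recall:.2f}  precision={precision:.2f}")
# Precision drops at ranks 2 and 5 (nonrelevant docs) and jumps back up
# at ranks 3 and 4: the sawtooth shape described above.
```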

5 Evaluation of ranked retrieval results
Precision-recall curve:
- The standard way to remove these jiggles is to use interpolated precision.
- Interpolated precision (red line on the slide): p_interp at a certain recall level r is defined as the highest precision found for any recall level r' >= r:
  p_interp(r) = max_{r' >= r} p(r')
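A small sketch of this definition; the (recall, precision) points are the hypothetical values from the previous example:

```python
# Sketch: interpolated precision p_interp(r) = max precision at any recall r' >= r.

def interpolated_precision(points, r):
    """points: list of (recall, precision) pairs for one query."""
    return max((p for rec, p in points if rec >= r), default=0.0)

points = [(0.25, 1.0), (0.25, 0.5), (0.5, 0.67), (0.75, 0.75), (0.75, 0.6)]
print(interpolated_precision(points, 0.5))   # 0.75: a later, higher precision wins
```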

6 Evaluation of ranked retrieval results
Eleven-point interpolated average precision:
- There is a desire to boil the interpolated precision down to a few numbers, or perhaps even a single number.
- Interpolated precision is measured at the 11 recall levels 0.0, 0.1, 0.2, ..., 1.0, and these values are then averaged (per query, and typically also over the queries in the test collection).
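A sketch of the eleven-point measure for a single query; the points list and helper names are the same hypothetical ones used above:

```python
# Sketch: eleven-point interpolated average precision for one query.

def interp(points, r):
    """Highest precision at any recall level >= r (0.0 if none)."""
    return max((p for rec, p in points if rec >= r), default=0.0)

def eleven_point_avg(points):
    levels = [i / 10 for i in range(11)]               # 0.0, 0.1, ..., 1.0
    return sum(interp(points, r) for r in levels) / len(levels)

points = [(0.25, 1.0), (0.25, 0.5), (0.5, 0.67), (0.75, 0.75), (0.75, 0.6)]
print(round(eleven_point_avg(points), 3))              # about 0.614 for this example
```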

7 Evaluation of ranked retrieval results
Precision at k:
- Web search: what matters is how many good results there are on the first page or the first three pages.
- Measure precision at fixed low levels of retrieved results, e.g. precision at 10.
- Advantage: does not require any estimate of the size of the set of relevant documents.
- Disadvantage: does not average well, because the total number of relevant documents for a query has a strong influence on precision at k. Why?
R-precision:
- If there are |Rel| relevant documents for a query, we examine the top |Rel| results of a system and find that r of them are relevant.
- R-precision = r / |Rel|.
- |Rel| differs from query to query, so the cutoff adapts to each query.
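A sketch of both measures for one query; the ranked list and the relevant set are invented for illustration:

```python
# Sketch: precision at k and R-precision for one ranked result list.

def precision_at_k(ranked_ids, relevant, k):
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant) / k

def r_precision(ranked_ids, relevant):
    R = len(relevant)                         # |Rel| for this query
    return precision_at_k(ranked_ids, relevant, R)

ranked_ids = [7, 3, 4, 1, 9, 8, 2, 6, 5, 10]
relevant = {3, 4, 5, 6}
print(precision_at_k(ranked_ids, relevant, 10))   # 4 relevant in the top 10 -> 0.4
print(r_precision(ranked_ids, relevant))          # 2 relevant in the top |Rel|=4 -> 0.5
```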

8 Assessing relevance
Given information needs and documents, you need to collect relevance assessments:
- Time-consuming.
- Expensive.
Large modern collections:
- Relevance is assessed only for a subset of the documents for each query.
- Standard approach: pooling.

9 Assessing relevance
Pooling:
- Relevance is assessed over a subset of the collection, formed from the top k documents returned by a number of different IR systems.
- E.g.: the top 20 results from Google, Yahoo!, Bing, ...
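A minimal sketch of how such a pool might be formed; the system names and rankings are made up:

```python
# Sketch of pooling: the assessment pool is the union of the top-k results
# returned by several systems; only documents in the pool get judged.

def build_pool(runs, k=20):
    """runs: dict mapping system name -> ranked list of doc ids."""
    pool = set()
    for ranked_ids in runs.values():
        pool.update(ranked_ids[:k])           # keep only the top k per system
    return pool

runs = {
    "system_A": [1, 2, 3, 4, 5],
    "system_B": [3, 4, 6, 7, 8],
    "system_C": [2, 9, 10, 1, 6],
}
print(sorted(build_pool(runs, k=3)))          # [1, 2, 3, 4, 6, 9, 10]
```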

10 Assessing relevance
Kappa statistic:
- A human is not a device that reliably reports a gold-standard judgment of relevance of a document to a query.
- Kappa measures the agreement between judges:
  kappa = (P(A) - P(E)) / (1 - P(E))
- P(A): proportion of the times the judges agreed.
- P(E): proportion of the times they would be expected to agree by chance; it is usual to use marginal statistics to calculate the expected agreement.
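A sketch of the computation for two judges, using marginals pooled over both judges to estimate P(E) (one common convention); the judgment vectors are invented:

```python
# Sketch: kappa between two judges (1 = relevant, 0 = nonrelevant).

def kappa(judge1, judge2):
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n    # P(A)
    # Marginal probability of a "relevant" judgment, pooled over both judges.
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2                     # P(E)
    return (p_agree - p_chance) / (1 - p_chance)

judge1 = [1, 1, 0, 1, 0, 0, 1, 0]
judge2 = [1, 0, 0, 1, 0, 1, 1, 0]
print(kappa(judge1, judge2))   # 0.5: P(A)=0.75, P(E)=0.5 for these judgments
```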

11 Assessing relevance

12 Kappa statistic:
- Value = 1 if the two judges always agree.
- Value = 0 if they agree only at the rate given by chance.
- Value < 0 if their agreement is worse than random.
More than two judges:
- Calculate an average pairwise kappa value.
A rule of thumb:
- Above 0.8 is taken as good agreement.
- Between 0.67 and 0.8 is taken as fair agreement.
- Below 0.67 is seen as data providing a dubious basis for an evaluation.

13 Questions?

