Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.

Chapter 8 Evaluating Search Engine

Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out in controlled laboratory experiments  Online testing can also be done n Effectiveness, efficiency, and cost are related  e.g., if we want a particular level of effectiveness and efficiency, this will determine the cost of the system configuration  Efficiency and cost targets may impact effectiveness & vice versa 2

Evaluation Corpus n Test collections consisting of documents, queries, and relevance judgments, e.g., 3

Test Collections 4 Bibliographic Records Full-text Documents Policy Descriptions Based on TREC topics

TREC Topic Example 5 Short Query Long Query Criteria for Relevance

Relevance Judgments n TREC judgments  Depend on task being evaluated, e.g., topical relevance  Generally binary, i.e., relevant vs. non-relevant  Agreement good because of “narrative”  Emphasize on high recall n Obtaining relevance judgments is an expensive, time- consuming process that requires manual effort  Who does it?  What are the instructions?  What is the level of agreement? 6

Effectiveness Measures A is set of relevant documents; B is set of retrieved documents 7 Relevant Non-Relevant Retrieved A  B A  B Not Retrieved A  B A  B Recall = | A  B | | A | | B | | A  B | Precision = Collection Relevant Set, A Retrieved Set, B A  BA  B A  BA  B Retrieved Relevant, A  B (true positive rate) (positive predictive value)

Classification Errors n Precision is used when probability that a positive result is correct is important n False Positive (Type I Error)  Non-relevant documents retrieved: |  A  B |  False positive/False alarm rate: n False Negative (Type II Error)  Relevant documents not retrieved: | A   B |  FN Ratio: 1 - Recall 8

F Measure n Harmonic mean of recall & precision, a single measure  Harmonic mean emphasizes the importance of small values, whereas the arithmetic mean is affected more by outliers that are unusually large n More general form  β is a parameter that determines relative importance of recall and precision  What if β = 1? 9

Ranking Effectiveness 10 (1/1 1/2 2/3 3/4 4/5 5/6 5/7 5/8 5/9 6/10) (1/6 1/6 2/6 3/6 4/6 5/6 5/6 5/6 5/6 6/6) (0/1 1/2 1/3 1/4 2/5 3/6 4/7 4/8 5/9 6/10) (0/6 1/6 1/6 1/6 2/6 3/6 4/6 4/6 5/6 6/6)

Summarizing a Ranking n Calculating recall and precision at fixed rank positions n Calculating precision at standard recall levels, from 0.0 to 1.0  Requires interpolation n Averaging the precision values from the rank positions where a relevant document was retrieved 11

Average Precision 12 Mean Average Precision: (0.78 + 0.52) / 2 = 0.65

Averaging n Mean Average Precision (MAP)  Summarize rankings from multiple queries by averaging average precision  Most commonly used measure in research papers  Assumes user is interested in finding many relevant documents for each query  Requires many relevance judgments in text collection n Recall-precision graphs are also useful summaries 13

MAP 14

Recall-Precision Graph 15 n Too much information could be lost using MAP n Recall-precision graphs provide more detail on the effectiveness of the ranking at different recall levels  Example. The recall-precision graphs for the two previous queries  Ranking of Query 1     Ranking of Query 2   

16 Precision and Recall n Precision versus Recall Curve  A visual display of the evaluation measure of a retrieval strategy such that documents are ranked accordingly n Example.  Let q be a query in the text reference collection  Let R q = { d 3, d 5, d 9, d 25, d 39, d 44, d 56, d 71, d 89, d 123 } be the set of 10 relevant documents for q  Assume the following ranking of retreived documents in the answer set of q: 1. d 123 6. d 9 11. d 38 2. d 84 7. d 511 12. d 48 3. d 56 8. d 129 13. d 250 4. d 6 9. d 187 14. d 113 5. d 8 10. d 25 15. d 3

17 n Example.  |R q | = 10, total number of relevant documents for q  Given ranking of retrieved documents in the answer set of q: 1. d 123 6. d 9 11. d 38 2. d 84 7. d 511 12. d 48 3. d 56 8. d 129 13. d 250 4. d 6 9. d 187 14. d 113 5. d 8 10. d 25 15. d 3 Precision Versus Recall Curve 0 20 40 60 80 100 20406080100 Precision Recall 30 50 10 An Interpolated Value 66% 50% 40% 33% 100%

18 n To evaluate the performance of a retrieval strategy over all test queries, compute the average of precision at each recall level: where P(r) = average precision at the r th recall level P i (r) = precision at the r th recall level for the i th query N q = number of queries used Precision and Recall P(r) = __  NqNq i=1 Pi(r)Pi(r) NqNq __  NqNq i=1 Pi(r)Pi(r) NqNq 1 =

19 n Example.  R q = {d 3, d 56, d 129 } & |R q | = 3, number of relevant docs for q  Given ranking of retreived documents in the answer set of q: 1. d 123 6. d 9 11. d 38 2. d 84 7. d 511 12. d 48 3. d 56 8. d 129 13. d 250 4. d 6 9. d 187 14. d 113 5. d 8 10. d 25 15. d 3 Precision Versus Recall Curve 0 20 40 60 80 100 20406080100 Precision Recall Interpolated precision at the 11 recall levels 10 305070 90 20% 33% 25%

20 Interpolation of Precision at Various Recall Levels n Let r i (0  i < 10) be a reference to the i th recall level P(r i ) = max r i ≤ r ≤ r i+1 P(r)  Interpolation of previous levels by using the next known level 0 20 40 60 80 100 20406080100 Precision Recall Interpolated precision at the 11 recall levels 10 305070 90

21 Precision Versus Recall Curve n Compare the retrieval performance of distinct retrieval strategies using precision versus recall figures n Average precision versus recall figures become a standard evaluation strategy for IR systems  Simple and intuitive  Qualify the overall answer set 0 20 40 60 80 100 120 20406080100 Precision Recall Average precision at each recall level for two distinct retrieval strategies

Focusing on Top Documents n Users tend to look at only the top part of the ranked result list to find relevant documents n Some search tasks have only one relevant doc (P@1) in mind ̶ the top-ranked doc is expected to be relevant  e.g., in question-answering system, navigation search, etc. n Precision at R (5, 10, or 20): ratio of top ranked relevant docs  Easy to compute, average, understand  Not sensitive to rank positions less than R, higher or lower n Recall not an appropriate measure in this case  Instead need to measure how well the search engine does at retrieving relevant documents at very high ranks 22

Focusing on Top Documents n Reciprocal Rank  Reciprocal of the rank at which the 1 st relevant doc retrieved  Very sensitive to rank position, e.g., d n, d r, d n, d n, d n (RR = ½), whereas d n, d n, d n, d n, d r (RR = 1/5)  Mean Reciprocal Rank (MRR) is the average of the reciprocal ranks over a set of queries, i.e., 23 Number of Queries Normalization factor Ranking position of the first relevant document

Discounted Cumulative Gain n Popular measure for evaluating web search and related tasks n Two assumptions:  Highly relevant documents are more useful than marginally relevant document  The lower the ranked position of a relevant document, the less useful it is for the user, since it is less likely to be examined 24

Discounted Cumulative Gain n Uses graded relevance (i.e., a value) as a measure of the usefulness, or gain, from examining a document n Gain is accumulated starting at the top of the ranking and may be reduced, or discounted, at lower ranks n Typical discount is 1 / log 2 (rank)  With base 2, the discount at rank 4 is , and at rank 8 it is  25

Discounted Cumulative Gain n DCG is the total gain accumulated at a particular rank p: where rel i is the graded relevance of the document at rank i, e.g., ranging from “Bad” to “Perfect” is 0  rel i  5, or 0 or 1 n Alternative formulation:  Used by some web search companies  Emphasis on retrieving highly relevant documents 26 log 2 i is the discount/reduction factor applied to the gain

DCG Example n Assume that there are 10 ranked documents judged on a 0-3 (“not relevant” to “highly relevant”) relevance scale: 3, 2, 3, 0, 0, 1, 2, 2, 3, 0 which represent the gain at each rank n The discounted gain is 3, 2/1, 3/1.59, 0, 0, 1/2.59, 2/2.81, 2/3, 3/3.17, 0 = 3, 2, 1.89, 0, 0, 0.39, 0.71, 0.67, 0.95, 0 n DCG at each rank is formed by accumulating the numbers 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61 27 (5) (10)

Normalized DCG n DCG numbers are averaged across a set of queries at specific rank values, similar to precision@p  e.g., DCG at rank 5 is 6.89 and at rank 10 is 9.61 n DCG values are often normalized by comparing the DCG at each rank with the DCG value for the perfect ranking NDCG p = DCG p / IDCG p  Makes averaging easier for queries with different numbers of relevant documents 28

NDCG Example n Perfect ranking for which each relevance level is on the scale 0-3: 3, 3, 3, 2, 2, 2, 1, 0, 0, 0 n Ideal DCG values: 3, 6, 7.89, 8.89, 9.75, 10.52, 10.88, 10.88, 10.88, 10.88 n Given the DCG values are 3, 5, 6.89, 6.89, 6.89, 7.28, 7.99, 8.66, 9.61, 9.61, the NDCG values: 1, 0.83, 0.87, 0.77, 0.71, 0.69, 0.73, 0.8, 0.88, 0.88  NDCG  1 at any rank position 29 (5) (10) (5) (10)

Significance Tests n Given the results from a number of queries, how can we conclude that (a new) ranking algorithm B is better than (the baseline) algorithm A? n A significance test, which allows us to quantify, enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (there is a difference)  The power of a test is the probability that the test will reject the null hypothesis correctly  Increasing the number of queries in the experiment also increases power (confidence in any judgment) of test  Type 1 Error: the null hypothesis is rejected when it is true  Type 2 Error: the null hypothesis is accepted when it is false 30

Significance Tests 31 Standard Procedure for comparing two retrieval systems:

t-Test n Assumption is that the difference between the effectiveness data values is a sample from a normal distribution n Null hypothesis is that the mean of the distribution of differences is zero n Test statistic for the paired t-test is  Example. Given that N = 10 such that A = and B =  QuickCalcs (http://www.graphpad.com/quickcalcs/ttest2.cfm) 32 Mean of the differences Standard derivation One-tailed P Value

Example: t-Test 33

Wilcoxon Signed-Ranks Test n Nonparametric test based on differences between effectiveness scores n Test statistic  To compute the signed-ranks, the differences are ordered by their absolute values (increasing), and then assigned rank values  Rank values are given the sign of the original difference  Sample website (http://scistatcalc.blogspot.com/2013/10/wilcoxon-signed- rank-test-calculator.html)http://scistatcalc.blogspot.com/2013/10/wilcoxon-signed- 34

Wilcoxon Test Example n 9 non-zero differences are (in order of absolute value): 2, 9, 10, 24, 25, 25, 41, 60, 70 n Signed-ranks: -1, +2, +3, -4, +5.5, +5.5, +7, +8, +9 n w (sum of the signed-ranks) = 35, p-value = 0.038 n Null hypothesis holds if  of ‘+’ ranks =  of the ‘-’ ranks 35

Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.

Similar presentations

Presentation on theme: "Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.

Similar presentations

Presentation on theme: "Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out."— Presentation transcript:

Similar presentations

About project

Feedback