Performance Measures
Why Conduct Performance Evaluation?
- Evaluation is the key to building effective and efficient IR (information retrieval) systems. Informally, effectiveness measures the ability of a system to find the right information, whereas efficiency measures how quickly retrieval can be carried out.
- Effectiveness, efficiency, and cost are related: efficiency and cost have a significant impact on effectiveness, and vice versa.
- Data for evaluation: online experiments versus benchmark test collections.
Online Experiments
- Actively involve users in gathering information to evaluate their use of, and preferences regarding, an IR system.
- Participants must be representative of the population under evaluation so that conclusions drawn from interactions with these users are valid.
- Sites such as Mechanical Turk can help recruit participants, but do they really yield representative samples?
Test Collections
- AP: Associated Press newswire documents from 1988-1990, with queries, topics, and relevance judgments.
- Yahoo! Answers Dataset: questions, answers, and metadata for answers and users.
- BookCrossing: books rated by users, with demographic information on the users.
- LETOR (Learning to Rank): a large set of query-URL pairs with ranking information.
- Yahoo! Search: query logs for entity search.
Effectiveness Measures
- Precision & Recall. Let |R| be the number of relevant documents in the collection, |A| the size of the retrieved set, and |RR| the number of relevant documents that are retrieved. Then Precision = |RR| / |A| and Recall = |RR| / |R|. Precision at K (P@K) is the precision computed over only the top K retrieved documents.
- False Positive (FP) versus False Negative (FN): FP = |A| - |RR|, a non-relevant doc that is retrieved; FN = |R| - |RR|, a relevant doc that is not retrieved.
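Below is a minimal Python sketch of these set-based measures; the slides contain no code, so the document IDs and the relevant/retrieved sets are hypothetical:

```python
# Set-based precision and recall, plus P@K over a ranked list.

def precision_recall(relevant: set, retrieved: list):
    rr = len(relevant & set(retrieved))        # |RR|: relevant docs retrieved
    precision = rr / len(retrieved)            # |RR| / |A|
    recall = rr / len(relevant)                # |RR| / |R|
    return precision, recall

def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    top_k = retrieved[:k]                      # only the top K results count
    return len(relevant & set(top_k)) / k

relevant = {"d1", "d3", "d7"}                  # |R| = 3 (hypothetical)
retrieved = ["d1", "d2", "d3", "d4", "d5"]     # |A| = 5, ranked best-first

p, r = precision_recall(relevant, retrieved)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")               # 0.40, 0.67
print(f"P@3 = {precision_at_k(relevant, retrieved, 3):.2f}")  # 0.67
```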
Effectiveness Measures
- F-Measure (F): the harmonic mean of precision P and recall R, F = 2PR / (P + R).
- Weighted variations are also often considered, e.g., F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 weights recall more heavily and beta < 1 weights precision more heavily.
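A small sketch of the F-measure and its weighted F_beta variant, using the standard formulas above; the precision/recall inputs are the hypothetical values from the previous sketch:

```python
# F-measure and its weighted (F-beta) generalization.

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)   # beta = 1 gives 2PR / (P + R)

print(f_measure(0.40, 0.67))                 # F1 ~ 0.50 (harmonic mean)
print(f_measure(0.40, 0.67, beta=2.0))       # F2 weights recall more heavily
```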
Effectiveness Measures
- Mean Average Precision (MAP): summarizes rankings from multiple tasks by taking, for each task, the average of the precision values at the ranks of its relevant documents (average precision), and then averaging these values across tasks.
- The most commonly used measure in research papers.
- Rewards finding many relevant resources for each task.
- Requires relevance judgments for the collection.
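A minimal sketch of average precision and MAP under the definition above; the 0/1 relevance flags for the two example queries are hypothetical:

```python
# Average Precision (AP) per query, then MAP across queries.

def average_precision(rels: list, num_relevant: int) -> float:
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):   # i is the rank position
        if rel:
            hits += 1
            total += hits / i                 # precision at each relevant rank
    return total / num_relevant

# Two queries: flags mark where relevant docs appear in each ranking.
q1 = average_precision([1, 0, 1, 0, 0], num_relevant=2)  # (1/1 + 2/3) / 2
q2 = average_precision([0, 1, 0, 0, 1], num_relevant=2)  # (1/2 + 2/5) / 2
print(f"MAP = {(q1 + q2) / 2:.3f}")                      # 0.642
```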
Does Precision Always Work?
[Figure: the ranked result lists of Retrieval System 1 and Retrieval System 2, with the relevant documents marked.]
- What is the precision of System 1? And of System 2? Are both systems equivalent in terms of performance?
Effectiveness Measures
- Normalized Discounted Cumulative Gain (nDCG) assumes that:
  - Highly relevant resources are more useful than marginally relevant resources; common grading ranges are (0..1) and (1..5).
  - The lower the ranked position of a relevant resource, the less useful it is for the user, since it is less likely to be examined.
- DCG at rank k: DCG_k = rel_1 + sum_{i=2}^{k} rel_i / log2(i), where rel_i is the graded relevance of the doc at rank i and log2(i) is the penalization (discount) factor.
- nDCG_k = DCG_k / IDCG_k, where the normalization factor IDCG_k is the DCG computed for the perfect ranking.
An Example of nDCG

Rank               1     2     3     4     5     6     7     8     9     10
Given rank (rel)   3     2     3     0     0     1     2     2     3     0
Discounted gain    3     2     1.89  0     0     0.39  0.71  0.67  0.95  0
DCG                3     5     6.89  6.89  6.89  7.28  7.99  8.66  9.61  9.61
Ideal rank (rel)   3     3     3     2     2     2     1     0     0     0
Ideal DCG (IDCG)   3     6     7.89  8.89  9.75  10.52 10.88 10.88 10.88 10.88

nDCG at rank 10 = DCG_10 / IDCG_10 = 9.61 / 10.88 ≈ 0.88
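The example can be reproduced with a short sketch that follows the discount convention used above (no discount at rank 1, rel_i / log2(i) afterwards):

```python
import math

# DCG_k = rel_1 + sum_{i=2..k} rel_i / log2(i), as in the worked example.

def dcg(rels: list) -> float:
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))

given = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]       # graded relevance by rank
ideal = sorted(given, reverse=True)          # perfect ranking: 3,3,3,2,2,2,1,0,0,0

print(f"DCG  = {dcg(given):.2f}")              # 9.61
print(f"IDCG = {dcg(ideal):.2f}")              # 10.88
print(f"nDCG = {dcg(given) / dcg(ideal):.2f}") # 0.88
```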
Effectiveness Measures
- Mean Reciprocal Rank (MRR): reflects the average number of resources a user has to scan through before identifying a relevant one.
- MRR = (1/|Q|) * sum_{i=1}^{|Q|} 1/rank_i, where |Q| is the number of tasks, rank_i is the ranking position of the first relevant (i.e., correct) resource for task i, and 1/|Q| is the normalization factor.
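A minimal sketch of MRR as defined above; the first-relevant ranks for the three example tasks are hypothetical:

```python
# MRR over the rank of the first relevant resource per task.

def mrr(first_relevant_ranks: list) -> float:
    return sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# Three tasks: first relevant resource found at ranks 1, 3, and 2.
print(f"MRR = {mrr([1, 3, 2]):.3f}")   # (1/1 + 1/3 + 1/2) / 3 ~ 0.611
```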
User-Oriented Measures
- Coverage: in a recommender system, the percentage of items in the collection that can ever be recommended.
- Diversity: the degree to which the result set is heterogeneous, i.e., a measure of dissimilarity among recommended items.
- Novelty: the fraction of retrieved relevant documents that are unknown (new) to the user.
- Serendipity: the degree to which retrieved results are surprising or unexpected to the user.
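These measures have no single fixed formula; the sketch below shows one plausible way to compute catalog coverage and intra-list diversity. The tag-based dissimilarity (1 minus Jaccard overlap) and all item data are illustrative assumptions, not something the slides specify:

```python
from itertools import combinations

def coverage(recommendable: set, catalog: set) -> float:
    """Percentage of catalog items that can ever be recommended."""
    return 100.0 * len(recommendable) / len(catalog)

def diversity(items: list, tags: dict) -> float:
    """Average pairwise dissimilarity among recommended items."""
    def dissim(a, b):                       # 1 - Jaccard overlap of tag sets
        return 1 - len(tags[a] & tags[b]) / len(tags[a] | tags[b])
    pairs = list(combinations(items, 2))
    return sum(dissim(a, b) for a, b in pairs) / len(pairs)

tags = {"i1": {"scifi", "space"}, "i2": {"scifi"}, "i3": {"romance"}}
print(f"Coverage:  {coverage({'i1', 'i2'}, set(tags)):.1f}%")  # 66.7%
print(f"Diversity: {diversity(['i1', 'i2', 'i3'], tags):.2f}") # 0.83
```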
Efficiency Metrics
- Scalability: as the number of users and the amount of data grow, how well can the retrieval system handle the workload?
- Overall response performance: real-time versus offline tasks.
- Query throughput: the number of queries processed per unit of time.
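Query throughput can be estimated by timing a batch of queries; in this sketch run_query is a hypothetical stand-in for a real retrieval call:

```python
import time

def run_query(q: str) -> list:
    # Dummy "search" standing in for a real retrieval system.
    return [doc for doc in ["a", "b", "c"] if q in doc]

queries = ["a", "b", "c"] * 1000
start = time.perf_counter()
for q in queries:
    run_query(q)
elapsed = time.perf_counter() - start
print(f"Throughput: {len(queries) / elapsed:.0f} queries/second")
```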
Significance Tests
- Given the results from a number of queries, how can we conclude that IR system/strategy B is better than IR system/strategy A?
- A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A).
- The power of a test is the probability that the test correctly rejects the null hypothesis; increasing the number of "trials" (queries) in the experiment also increases the power of the test.
- Common significance tests: the t-test, the Wilcoxon signed-rank test, and the sign test.
Significance Tests (on 2 IR Systems)
Example of a t-Test
- Paired t statistic: t = d_mean / (s_d / sqrt(n)), where d_mean is the mean of the differences d = B - A, s_d is the sample standard deviation of the differences, and n is the size of the sample.

Task    A     B     B - A
1       25    35    10
2       43    84    41
3       39    15    -24
4       75    75    0
5       43    68    25
6       15    85    70
7       20    80    60
8       52    50    -2
9       49    58    9
10      50    75    25

- Here d_mean = 21.4, s_d ≈ 29.1, and n = 10, giving t ≈ 2.33 and a one-tailed p-value ≈ 0.02.
- Decision rule: if the p-value is below the significance level (e.g., 0.05), reject the null hypothesis; otherwise accept (fail to reject) it. Since p = 0.02 < 0.05, we reject the null hypothesis here.
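The same numbers can be checked with SciPy's paired t-test (a sketch assuming SciPy >= 1.6, which added the one-sided alternative keyword):

```python
from scipy import stats

# Per-task scores from the table above.
A = [25, 43, 39, 75, 43, 15, 20, 52, 49, 50]
B = [35, 84, 15, 75, 68, 85, 80, 50, 58, 75]

# One-sided paired t-test for the alternative hypothesis "B > A".
t, p = stats.ttest_rel(B, A, alternative="greater")
print(f"t = {t:.2f}, one-tailed p = {p:.3f}")   # t ~ 2.33, p ~ 0.022

alpha = 0.05
print("Reject null hypothesis" if p < alpha else "Accept null hypothesis")
```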