Performance Measures
Why Conduct Performance Evaluation?
- Evaluation is the key to building effective and efficient IR (information retrieval) systems. Informally, effectiveness measures the ability of a system to find the right information, whereas efficiency measures how quickly retrieval can be carried out.
- Effectiveness, efficiency, and cost are related: efficiency and cost have a significant impact on effectiveness, and vice versa.
- Data for evaluation: online experiments versus benchmark test collections.
Online Experiments
- Actively involve users in gathering information to evaluate their use of, and preferences regarding, an IR system.
- Participants must be representative of the population under evaluation so that conclusions drawn from interactions with these users are valid.
- Sites such as Mechanical Turk can help recruit participants, but do they really yield representative samples?
Test Collections
- AP: Associated Press newswire documents from 1988-1990, with queries, topics, and relevance judgments.
- Yahoo! Answers Dataset: questions, answers, and metadata for answers and users.
- BookCrossing: books rated by users, with demographic information on the users.
- LETOR (Learning to Rank): a large set of query-URL pairs with ranking information.
- Yahoo! Search: query logs for entity search.
Effectiveness Measures
- Precision & Recall. Let |R| be the number of relevant documents in the collection, |A| the size of the retrieved set, and |RR| the number of relevant documents that are retrieved. Then Precision = |RR| / |A| and Recall = |RR| / |R|. Precision at K (P@K) is the precision computed over only the top K retrieved documents.
- False Positive (FP) versus False Negative (FN): FP = |A| - |RR|, a non-relevant doc that is retrieved; FN = |R| - |RR|, a relevant doc that is not retrieved.
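Below is a minimal Python sketch of these set-based measures; the slides contain no code, so the document IDs and the relevant/retrieved sets are hypothetical:

```python
# Set-based precision and recall, plus P@K over a ranked list.

def precision_recall(relevant: set, retrieved: list):
    rr = len(relevant & set(retrieved))        # |RR|: relevant docs retrieved
    precision = rr / len(retrieved)            # |RR| / |A|
    recall = rr / len(relevant)                # |RR| / |R|
    return precision, recall

def precision_at_k(relevant: set, retrieved: list, k: int) -> float:
    top_k = retrieved[:k]                      # only the top K results count
    return len(relevant & set(top_k)) / k

relevant = {"d1", "d3", "d7"}                  # |R| = 3 (hypothetical)
retrieved = ["d1", "d2", "d3", "d4", "d5"]     # |A| = 5, ranked best-first

p, r = precision_recall(relevant, retrieved)
print(f"Precision = {p:.2f}, Recall = {r:.2f}")               # 0.40, 0.67
print(f"P@3 = {precision_at_k(relevant, retrieved, 3):.2f}")  # 0.67
```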
Effectiveness Measures
- F-Measure (F): the harmonic mean of precision P and recall R, F = 2PR / (P + R).
- Weighted variations are also often considered, e.g., F_beta = (1 + beta^2) * P * R / (beta^2 * P + R), where beta > 1 weights recall more heavily and beta < 1 weights precision more heavily.
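A small sketch of the F-measure and its weighted F_beta variant, using the standard formulas above; the precision/recall inputs are the hypothetical values from the previous sketch:

```python
# F-measure and its weighted (F-beta) generalization.

def f_measure(p: float, r: float, beta: float = 1.0) -> float:
    if p == 0 and r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)   # beta = 1 gives 2PR / (P + R)

print(f_measure(0.40, 0.67))                 # F1 ~ 0.50 (harmonic mean)
print(f_measure(0.40, 0.67, beta=2.0))       # F2 weights recall more heavily
```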
Effectiveness Measures
- Mean Average Precision (MAP): summarizes rankings from multiple tasks by taking, for each task, the average of the precision values at the ranks of its relevant documents (average precision), and then averaging these values across tasks.
- The most commonly used measure in research papers.
- Rewards finding many relevant resources for each task.
- Requires relevance judgments for the collection.
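A minimal sketch of average precision and MAP under the definition above; the 0/1 relevance flags for the two example queries are hypothetical:

```python
# Average Precision (AP) per query, then MAP across queries.

def average_precision(rels: list, num_relevant: int) -> float:
    hits, total = 0, 0.0
    for i, rel in enumerate(rels, start=1):   # i is the rank position
        if rel:
            hits += 1
            total += hits / i                 # precision at each relevant rank
    return total / num_relevant

# Two queries: flags mark where relevant docs appear in each ranking.
q1 = average_precision([1, 0, 1, 0, 0], num_relevant=2)  # (1/1 + 2/3) / 2
q2 = average_precision([0, 1, 0, 0, 1], num_relevant=2)  # (1/2 + 2/5) / 2
print(f"MAP = {(q1 + q2) / 2:.3f}")                      # 0.642
```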
Does Precision Always Work?
[Figure: the ranked result lists of Retrieval System 1 and Retrieval System 2, with the relevant documents marked.]
- What is the precision of System 1? And of System 2? Are both systems equivalent in terms of performance?
Effectiveness Measures
- Normalized Discounted Cumulative Gain (nDCG) assumes that:
  - Highly relevant resources are more useful than marginally relevant resources; common grading ranges are (0..1) and (1..5).
  - The lower the ranked position of a relevant resource, the less useful it is for the user, since it is less likely to be examined.
- DCG at rank k: DCG_k = rel_1 + sum_{i=2}^{k} rel_i / log2(i), where rel_i is the graded relevance of the doc at rank i and log2(i) is the penalization (discount) factor.
- nDCG_k = DCG_k / IDCG_k, where the normalization factor IDCG_k is the DCG computed for the perfect ranking.
An Example of nDCG

Rank               1     2     3     4     5     6     7     8     9     10
Given rank (rel)   3     2     3     0     0     1     2     2     3     0
Discounted gain    3     2     1.89  0     0     0.39  0.71  0.67  0.95  0
DCG                3     5     6.89  6.89  6.89  7.28  7.99  8.66  9.61  9.61
Ideal rank (rel)   3     3     3     2     2     2     1     0     0     0
Ideal DCG (IDCG)   3     6     7.89  8.89  9.75  10.52 10.88 10.88 10.88 10.88

nDCG at rank 10 = DCG_10 / IDCG_10 = 9.61 / 10.88 ≈ 0.88
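The example can be reproduced with a short sketch that follows the discount convention used above (no discount at rank 1, rel_i / log2(i) afterwards):

```python
import math

# DCG_k = rel_1 + sum_{i=2..k} rel_i / log2(i), as in the worked example.

def dcg(rels: list) -> float:
    return sum(rel if i == 1 else rel / math.log2(i)
               for i, rel in enumerate(rels, start=1))

given = [3, 2, 3, 0, 0, 1, 2, 2, 3, 0]       # graded relevance by rank
ideal = sorted(given, reverse=True)          # perfect ranking: 3,3,3,2,2,2,1,0,0,0

print(f"DCG  = {dcg(given):.2f}")              # 9.61
print(f"IDCG = {dcg(ideal):.2f}")              # 10.88
print(f"nDCG = {dcg(given) / dcg(ideal):.2f}") # 0.88
```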
Effectiveness Measures
- Mean Reciprocal Rank (MRR): reflects the average number of resources a user has to scan through before identifying a relevant one.
- MRR = (1/|Q|) * sum_{i=1}^{|Q|} 1/rank_i, where |Q| is the number of tasks, rank_i is the ranking position of the first relevant (i.e., correct) resource for task i, and 1/|Q| is the normalization factor.
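A minimal sketch of MRR as defined above; the first-relevant ranks for the three example tasks are hypothetical:

```python
# MRR over the rank of the first relevant resource per task.

def mrr(first_relevant_ranks: list) -> float:
    return sum(1 / r for r in first_relevant_ranks) / len(first_relevant_ranks)

# Three tasks: first relevant resource found at ranks 1, 3, and 2.
print(f"MRR = {mrr([1, 3, 2]):.3f}")   # (1/1 + 1/3 + 1/2) / 3 ~ 0.611
```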
User-Oriented Measures
- Coverage: in a recommender system, the percentage of items in the collection that can ever be recommended.
- Diversity: the degree to which the result set is heterogeneous, i.e., a measure of dissimilarity among recommended items.
- Novelty: the fraction of retrieved relevant documents that are unknown (new) to the user.
- Serendipity: the degree to which retrieved results are surprising or unexpected to the user.
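These measures have no single fixed formula; the sketch below shows one plausible way to compute catalog coverage and intra-list diversity. The tag-based dissimilarity (1 minus Jaccard overlap) and all item data are illustrative assumptions, not something the slides specify:

```python
from itertools import combinations

def coverage(recommendable: set, catalog: set) -> float:
    """Percentage of catalog items that can ever be recommended."""
    return 100.0 * len(recommendable) / len(catalog)

def diversity(items: list, tags: dict) -> float:
    """Average pairwise dissimilarity among recommended items."""
    def dissim(a, b):                       # 1 - Jaccard overlap of tag sets
        return 1 - len(tags[a] & tags[b]) / len(tags[a] | tags[b])
    pairs = list(combinations(items, 2))
    return sum(dissim(a, b) for a, b in pairs) / len(pairs)

tags = {"i1": {"scifi", "space"}, "i2": {"scifi"}, "i3": {"romance"}}
print(f"Coverage:  {coverage({'i1', 'i2'}, set(tags)):.1f}%")  # 66.7%
print(f"Diversity: {diversity(['i1', 'i2', 'i3'], tags):.2f}") # 0.83
```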
Efficiency Metrics
- Scalability: as the number of users and the amount of data grow, how well can the retrieval system handle the workload?
- Overall response performance: real-time versus offline tasks.
- Query throughput: the number of queries processed per unit of time.
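Query throughput can be estimated by timing a batch of queries; in this sketch run_query is a hypothetical stand-in for a real retrieval call:

```python
import time

def run_query(q: str) -> list:
    # Dummy "search" standing in for a real retrieval system.
    return [doc for doc in ["a", "b", "c"] if q in doc]

queries = ["a", "b", "c"] * 1000
start = time.perf_counter()
for q in queries:
    run_query(q)
elapsed = time.perf_counter() - start
print(f"Throughput: {len(queries) / elapsed:.0f} queries/second")
```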
Significance Tests
- Given the results from a number of queries, how can we conclude that IR system/strategy B is better than IR system/strategy A?
- A significance test enables us to reject the null hypothesis (no difference) in favor of the alternative hypothesis (B is better than A).
- The power of a test is the probability that the test correctly rejects the null hypothesis; increasing the number of "trials" (queries) in the experiment also increases the power of the test.
- Common significance tests: the t-test, the Wilcoxon signed-rank test, and the sign test.
Significance Tests (on 2 IR Systems)
Example of a t-Test
- Paired t statistic: t = d_mean / (s_d / sqrt(n)), where d_mean is the mean of the differences d = B - A, s_d is the sample standard deviation of the differences, and n is the size of the sample.

Task    A     B     B - A
1       25    35    10
2       43    84    41
3       39    15    -24
4       75    75    0
5       43    68    25
6       15    85    70
7       20    80    60
8       52    50    -2
9       49    58    9
10      50    75    25

- Here d_mean = 21.4, s_d ≈ 29.1, and n = 10, giving t ≈ 2.33 and a one-tailed p-value ≈ 0.02.
- Decision rule: if the p-value is below the significance level (e.g., 0.05), reject the null hypothesis; otherwise accept (fail to reject) it. Since p = 0.02 < 0.05, we reject the null hypothesis here.
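The same numbers can be checked with SciPy's paired t-test (a sketch assuming SciPy >= 1.6, which added the one-sided alternative keyword):

```python
from scipy import stats

# Per-task scores from the table above.
A = [25, 43, 39, 75, 43, 15, 20, 52, 49, 50]
B = [35, 84, 15, 75, 68, 85, 80, 50, 58, 75]

# One-sided paired t-test for the alternative hypothesis "B > A".
t, p = stats.ttest_rel(B, A, alternative="greater")
print(f"t = {t:.2f}, one-tailed p = {p:.3f}")   # t ~ 2.33, p ~ 0.022

alpha = 0.05
print("Reject null hypothesis" if p < alpha else "Accept null hypothesis")
```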