
Evaluation

Evaluation The major goal of IR is to retrieve documents relevant to a user query. The evaluation of the performance of IR systems relies on the notion of relevance. What constitutes relevance?

Relevance Relevance is subjective in nature, i.e. it depends upon a specific user’s judgment. Given a query, the same document may be judged relevant by one user and non-relevant by another. Only the user can tell the true relevance; however, it is not possible to measure this “true relevance”. Most of the evaluation of IR systems so far has therefore been done on document test collections with known relevance judgments.

Another issue with relevance is the degree of relevance. Traditionally, relevance has been treated as a binary concept, i.e. a document is judged either relevant or not relevant, whereas relevance is really a continuous function (a document may be exactly what the user wants, or it may be only closely related).

Why System Evaluation? There are many retrieval models/algorithms/systems; which one is the best? What is the best choice for each component: ranking function (dot-product, cosine, …), term selection (stop word removal, stemming, …), term weighting (TF, TF-IDF, …)? How far down the ranked list will a user need to look to find some/all relevant documents?

Evaluation of IR Systems The evaluation of an IR system is the process of assessing how well a system meets the information needs of its users (Voorhees, 2001). Criteria for evaluation: - Coverage of the collection - Time lag - Presentation format - User effort - Precision - Recall

Of these criteria, recall and precision have most frequently been applied in measuring information retrieval. Both are related to the effectiveness of an IR system, i.e. its ability to retrieve relevant documents in response to a user query.

Effectiveness is purely a measure of the ability of the system to satisfy the user in terms of the relevance of documents retrieved. Aspects of effectiveness include: - whether the documents being returned are relevant to the user - whether they are presented in order of relevance - whether a significant number of the relevant documents in the collection are being returned to the user, etc.

Evaluation of IR Systems IR evaluation models can be broadly classified as system-driven and user-centered. System-driven models focus on measuring how well the system can rank documents; user-centered evaluation models attempt to measure the user’s satisfaction with the system. Related measures: - Time lag: the time that elapses between submission of the query and getting back the response - User effort: the effort made by the user in obtaining answers to her queries - Precision: the proportion of retrieved documents which are relevant - Recall: the proportion of relevant documents that are retrieved

Effectiveness measures The most commonly used measures of effectiveness are precision and recall. These measures are based on relevance judgments.

Evaluation of IR Systems The traditional goal of IR is to retrieve all and only the relevant documents in response to a query. “All” is measured by recall: the proportion of relevant documents in the collection which are retrieved, i.e. P(retrieved|relevant). “Only” is measured by precision: the proportion of retrieved documents which are relevant, i.e. P(relevant|retrieved). Both criteria capture the effectiveness of an IR system, i.e. its ability to retrieve relevant documents in response to a user query.

Precision vs. Recall [Figure: Venn diagram of all documents, the retrieved set, the relevant set, and their intersection (relevant retrieved).]

These definitions of precision and recall are based on binary relevance judgments, which means that every retrievable item is recognizably “relevant” or recognizably “not relevant”. Hence, for every search, each retrievable document is either (i) relevant or non-relevant and (ii) retrieved or not retrieved.
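As an illustration (not from the original slides), here is a minimal Python sketch that computes precision and recall from a set of retrieved document IDs and a set of relevant document IDs; the names are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall under binary relevance.

    retrieved: document IDs returned by the system
    relevant:  document IDs judged relevant for the query
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    relevant_retrieved = retrieved & relevant          # documents that are both retrieved and relevant
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 documents retrieved, 6 relevant in the collection, 3 of them retrieved
p, r = precision_recall({"d1", "d2", "d5", "d9"}, {"d1", "d2", "d5", "d7", "d8", "d10"})
print(p, r)   # 0.75 0.5
```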

Trade-off between Recall and Precision [Figure 8.14: precision (0 to 1) plotted against recall (0 to 1). One extreme returns relevant documents but misses many useful ones; the other returns most relevant documents but includes lots of junk; the ideal is a line of precision 1.0 at all recall levels.] There exists a trade-off between precision and recall, even though a high value of both at the same time is desirable. The trade-off is shown in figure 8.14: precision is high at low recall values, and as recall increases precision degrades. The ideal case of perfect retrieval requires that all relevant documents be retrieved before the first non-relevant document; this appears in the figure as a line parallel to the x-axis with a precision of 1.0 at all recall points. Recall is an additive process: once the highest recall (1.0) is achieved, it remains 1.0 for any subsequently retrieved document. We can always achieve 100% recall by retrieving all documents in the collection, but this defeats the very purpose of an information retrieval system.

Test collection approach The total number of relevant documents in a collection must be known in order for recall to be calculated; the amount of effort and time this requires on behalf of the user makes it almost impossible in most actual operating environments. To provide a framework for the evaluation of IR systems, a number of test collections have been developed (Cranfield, TREC, etc.). These document collections are accompanied by a set of queries and relevance judgments. Such test collections have made it possible for IR researchers to efficiently evaluate their experimental approaches and to compare the effectiveness of their systems with that of others.

IR test collections

Collection   Number of documents   Number of queries
Cranfield    1400                  225
CACM         3204                  64
CISI         1460                  112
LISA         6004                  35
TIME         423                   83
ADI          82                    35
MEDLINE      1033                  30
TREC-1       742,611               100

Fixed Recall Levels One way to evaluate is to look at average precision at fixed recall levels; this provides the information needed for precision/recall graphs. To evaluate the performance of an IR system, recall and precision are almost always used together. Another measure is to calculate precision at a particular cutoff; typical cutoffs are 5 documents, 10 documents, 15 documents, etc.

Document Cutoff Levels Another way to evaluate: fix the number of documents retrieved at several levels (top 5, top 10, top 20, top 50, top 100), measure precision at each of these levels, and take a (weighted) average over the results. This focuses on how well the system ranks the first k documents.
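As an illustrative sketch (hypothetical helper, not from the slides), precision at a document cutoff k can be computed from the ranked list and the relevance judgments:

```python
def precision_at_k(ranked_docs, relevant, k):
    """Precision@k: fraction of the top-k ranked documents that are relevant."""
    top_k = ranked_docs[:k]
    hits = sum(1 for doc in top_k if doc in relevant)
    return hits / k

ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]      # system ranking, best first
relevant = {"d3", "d1", "d2"}                       # judged relevant documents
print(precision_at_k(ranked, relevant, 5))          # 2 of the top 5 are relevant -> 0.4
```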

Computing Recall/Precision Points For a given query, produce the ranked list of retrievals. Mark each document in the ranked list that is relevant according to the gold standard. Compute a recall/precision pair for each position in the ranked list that contains a relevant document.

Computing Recall/Precision Points: An Example Let the total number of relevant docs = 6. Check each new recall point (the relevant documents appear at ranks 1, 2, 4, 6, and 13): - R=1/6=0.167; P=1/1=1.0 - R=2/6=0.333; P=2/2=1.0 - R=3/6=0.5; P=3/4=0.75 - R=4/6=0.667; P=4/6=0.667 - R=5/6=0.833; P=5/13=0.38 One relevant document is never retrieved, so 100% recall is never reached.
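The same computation as a small Python sketch (the ranks below are the ones implied by the example’s precision values; the function name is illustrative):

```python
def recall_precision_points(relevant_ranks, total_relevant):
    """Return (recall, precision) at each rank where a relevant document appears.

    relevant_ranks: 1-based ranks of the relevant documents that were retrieved
    total_relevant: total number of relevant documents in the collection
    """
    points = []
    for i, rank in enumerate(sorted(relevant_ranks), start=1):
        recall = i / total_relevant
        precision = i / rank
        points.append((recall, precision))
    return points

# Relevant documents found at ranks 1, 2, 4, 6 and 13; one of the 6 is never retrieved.
for r, p in recall_precision_points([1, 2, 4, 6, 13], total_relevant=6):
    print(f"R={r:.3f}  P={p:.3f}")
```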

Interpolating a Recall/Precision Curve Interpolate a precision value for each standard recall level rj ∈ {0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, i.e. r0 = 0.0, r1 = 0.1, …, r10 = 1.0. The interpolated precision at the j-th standard recall level is the maximum known precision at any recall level greater than or equal to the j-th level.

Example: Interpolated Precision

Precision at observed recall points:

Recall   Precision
0.25     1.0
0.4      0.67
0.55     0.8
0.8      0.6
1.0      0.5

Interpolated precision at the standard recall levels:

Recall   Interpolated precision
0.0      1.0
0.1      1.0
0.2      1.0
0.3      0.8
0.4      0.8
0.5      0.8
0.6      0.6
0.7      0.6
0.8      0.6
0.9      0.5
1.0      0.5

Interpolated average precision = 0.745
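A minimal Python sketch of the 11-point interpolation rule, applied to the observed points from this example (the helper name is hypothetical):

```python
def interpolate_11_point(observed):
    """observed: list of (recall, precision) pairs for one query, in any order.

    Returns the 11 standard recall levels 0.0, 0.1, ..., 1.0 and the interpolated
    precision at each, where interpolated P(r) = max precision observed at recall >= r.
    """
    levels = [round(0.1 * j, 1) for j in range(11)]
    interpolated = []
    for r in levels:
        candidates = [p for (rec, p) in observed if rec >= r]
        interpolated.append(max(candidates) if candidates else 0.0)
    return levels, interpolated

observed = [(0.25, 1.0), (0.4, 0.67), (0.55, 0.8), (0.8, 0.6), (1.0, 0.5)]
levels, interp = interpolate_11_point(observed)
print(list(zip(levels, interp)))
print(sum(interp) / len(interp))   # interpolated average precision, about 0.745
```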

Recall-Precision graph

Average Recall/Precision Curve Compute average precision at each standard recall level across all queries. Plot average precision/recall curves to evaluate overall system performance on a document/query corpus.
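For example, averaging the per-query interpolated curves could be sketched as follows (building on the hypothetical interpolate_11_point helper above; the query data is made up):

```python
def average_curve(per_query_curves):
    """per_query_curves: list of 11-value interpolated precision lists, one per query.

    Returns the mean precision at each of the 11 standard recall levels.
    """
    n = len(per_query_curves)
    return [sum(curve[j] for curve in per_query_curves) / n for j in range(11)]

# Two hypothetical queries' interpolated precision curves
q1 = [1.0, 1.0, 1.0, 0.8, 0.8, 0.8, 0.6, 0.6, 0.6, 0.5, 0.5]
q2 = [1.0, 0.9, 0.7, 0.7, 0.6, 0.5, 0.5, 0.4, 0.3, 0.2, 0.2]
print(average_curve([q1, q2]))
```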

Average Recall/Precision Curve Models compared (document weighting, query weighting):
1. doc = “atn”, query = “ntc”
2. doc = “atn”, query = “atc”
3. doc = “atc”, query = “atc”
4. doc = “atc”, query = “ntc”
5. doc = “ntc”, query = “ntc”
6. doc = “ltc”, query = “ltc”
7. doc = “nnn”, query = “nnn”
[Figure: average interpolated precision at the 11 standard recall levels for each of these weighting combinations.]

Problems with Precision/Recall
- The true recall value can’t be known except in small collections.
- Precision and recall are related; a combined measure is sometimes more appropriate.
- They assume batch mode; interactive IR is important and has different criteria for successful searches.
- They assume a strict rank ordering matters.

Other measures: R-Precision R-Precision is the precision after R documents have been retrieved, where R is the number of relevant documents for a topic. It de-emphasizes exact ranking of the retrieved relevant documents. The average is simply the mean R-Precision for individual topics in the run.
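A small Python sketch of R-Precision for one topic (the helper name and data are illustrative):

```python
def r_precision(ranked_docs, relevant):
    """R-Precision: precision after R documents, where R = number of relevant docs."""
    r = len(relevant)
    top_r = ranked_docs[:r]
    return sum(1 for doc in top_r if doc in relevant) / r

ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d3", "d1", "d2"}          # R = 3
print(r_precision(ranked, relevant))   # 2 of the top 3 are relevant -> 0.667
```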

Other measures: F-measure The F-measure takes into account both precision and recall. It is defined as the harmonic mean of recall and precision: F = 2PR / (P + R). Unlike the arithmetic mean, both precision and recall need to be high for the harmonic mean to be high. [Jane Reid, “A Task-Oriented Non-Interactive Evaluation Methodology for Information Retrieval Systems”, Information Retrieval, 2, 115–129, 2000]

E-measure The E-measure is a variant of the F-measure that allows weighting emphasis on precision over recall. It is defined with a parameter b that controls the trade-off between precision and recall: setting b to 1 gives equal weight to precision and recall (E = F); b > 1 weights precision more, whereas b < 1 gives more weight to recall.
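The slide’s formula for the E-measure did not survive the transcript, so the sketch below shows only the F-measure and a b-weighted harmonic mean chosen to be consistent with the description above (b = 1 reduces to F, larger b approaches precision). Note that some texts instead define E = 1 − F_b and parameterize b differently, so treat the weighted form as an assumption.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (the plain F-measure, b = 1)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def weighted_measure(precision, recall, b=1.0):
    """b-weighted harmonic mean of precision and recall; b = 1 gives F.

    ASSUMPTION: (1 + b^2) * P * R / (P + b^2 * R) so that larger b puts more
    weight on precision, matching the slide's wording; textbook conventions vary.
    """
    den = precision + b * b * recall
    return (1 + b * b) * precision * recall / den if den else 0.0

print(f_measure(0.75, 0.5))              # 0.6
print(weighted_measure(0.75, 0.5, b=2))  # emphasizes precision more than F does
```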

Normalized recall Normalized recall measures how close the set of retrieved documents is to an ideal retrieval, in which the NRrel relevant documents appear in the first NRrel positions. If the relevant documents are ranked 1, 2, 3, …, then the ideal rank (IR) is the average of these ranks: IR = (1/NRrel) * (1 + 2 + … + NRrel) = (NRrel + 1)/2.

Let the average rank (AR) over the set of relevant documents retrieved by the system be AR = (1/NRrel) * (Rank1 + Rank2 + … + RankNRrel), where Rankr represents the rank of the r-th relevant document in the system’s output.

The difference between AR and IR, given by AR - IR, is a measure of the effectiveness of the system. This difference ranges from 0 (for perfect retrieval) to (N - NRrel) (for worst-case retrieval), where N is the number of documents in the collection.

The expression AR - IR can be normalized by dividing it by (N - NRrel); subtracting the result from 1 gives the normalized recall (NR): NR = 1 - (AR - IR) / (N - NRrel). This measure ranges from 1 for the best case to 0 for the worst case.
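As an illustration, a minimal Python sketch of normalized recall using the definitions above (names and data are hypothetical):

```python
def normalized_recall(relevant_ranks, collection_size):
    """Normalized recall: 1 for perfect retrieval, 0 for the worst case.

    relevant_ranks:  1-based ranks at which the relevant documents were retrieved
    collection_size: N, the total number of documents in the collection
    """
    n_rel = len(relevant_ranks)
    ar = sum(relevant_ranks) / n_rel     # average actual rank (AR)
    ir = (n_rel + 1) / 2                 # average ideal rank over positions 1..n_rel (IR)
    return 1 - (ar - ir) / (collection_size - n_rel)

# Perfect retrieval: relevant docs at ranks 1..3 in a 10-document collection -> 1.0
print(normalized_recall([1, 2, 3], collection_size=10))
# Worst case: relevant docs at the bottom three ranks -> 0.0
print(normalized_recall([8, 9, 10], collection_size=10))
```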

Evaluation Problems Realistic IR is interactive, while traditional IR methods and measures are based on non-interactive situations. Evaluating interactive IR requires human subjects (there is no gold standard or benchmark). [Ref.: see Borlund, 2000 & 2003; Borlund & Ingwersen, 1997 for IIR evaluation]