Gertjan van Noord (2014). Zoekmachines, Lecture 5: Evaluation.

Retrieval performance
General performance of a system:
- speed
- security
- usability
- ...
Retrieval performance and evaluation:
- is the system presenting documents related to the query?
- is the user satisfied? is the information need met?

User queries
Same words, different intent:
- "Check my cash"
- "Cash my check"
Different words, same intent:
- "Can I open an interest-bearing account?"
- "open savings account"
- "start account for saving"
Gap between users' language and official terminology:
- "daylight lamp"
- "random reader", "edentifier", "log in machine", "card reader"

What is relevancy?
- What if you have to search through a 2-page document to find what you need? What if the document is 35 pages? Or 235?
- What if you need to click through once to get to the answer? Or 2 times? 3 times?
- Is relevancy a characteristic of a single result, or of a result set?
- What is the effect of an irrelevant result in an otherwise good result set?
Determining relevancy is complex!

Translation of the information need
Each information need has to be translated into the "language" of the IR system.
[Diagram: reality is represented by the documents; the information need is expressed as a query; relevance links the retrieved documents back to the information need.]

General retrieval evaluation: batch-mode (automatic) testing
Test set consisting of:
- a set of documents
- a set of queries
- a file with the relevant document numbers for each query (human evaluation!)
Experimental test sets (among others): ADI, CACM, Cranfield, TREC test sets.

Example CACM data files
query.text:
  .I 1
  .W
  What articles exist which deal with TSS (Time Sharing System), an operating system for IBM computers?
cacm.all:
  .I 1410
  .T
  Interarrival Statistics for Time Sharing Systems
  .W
  The optimization of time-shared system performance requires the description of the stochastic processes governing the user inputs and the program activity. This paper provides a statistical description of the user input process in the SDC-ARPA general-purpose […]
qrels.text:
  the file with the relevant document numbers for each query
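
As an illustration (not part of the original slides), here is a minimal Python sketch of how such SMART-format files could be read; it assumes records begin with a `.I <id>` line and that two-character tags such as `.T` and `.W` introduce fields, as in the excerpt above:

```python
def parse_smart(path):
    """Parse a SMART-format file (e.g. cacm.all or query.text) into
    a dict mapping record id -> {field tag -> text}."""
    records, current_id, field = {}, None, None
    with open(path, encoding="utf-8", errors="replace") as f:
        for line in f:
            line = line.rstrip("\n")
            if line.startswith(".I"):        # new record, e.g. ".I 1410"
                current_id = line.split()[1]
                records[current_id] = {}
                field = None
            elif line.startswith("."):       # new field tag, e.g. ".T" or ".W"
                field = line[:2]
                records[current_id][field] = ""
            elif field is not None:          # continuation of the current field
                records[current_id][field] += line + " "
    return records

# queries = parse_smart("query.text")
# docs = parse_smart("cacm.all")
```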

Basic performance measures: precision, recall, F-score
[Diagram: the collection split into four regions a, b, c, d according to relevant/not relevant and retrieved/not retrieved; see the contingency table on the next slide.]

Contingency table of results

           +Rel    -Rel
  +Ans      a       b      a+b
  -Ans      c       d      c+d
            a+c     b+d     N
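
Written out against the table above (these are the standard definitions; generality is included here because the exercise on the next slide asks for it):

```latex
\text{precision} = \frac{a}{a+b}, \qquad
\text{recall} = \frac{a}{a+c}, \qquad
\text{generality} = \frac{a+c}{N}
```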

Exercise
Test set: a database of 10,000 documents; 50 of them are relevant for query Q.
Result set for query Q: 100 documents, of which 20 are relevant.
- What is the recall? 20/50 = 0.4
- What is the precision? 20/100 = 0.2
- What is the generality? 50/10,000 = 0.005
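
A quick check of these numbers in Python (a throwaway sketch with the exercise values hard-coded):

```python
retrieved_relevant = 20    # relevant documents in the result set
retrieved = 100            # size of the result set
relevant = 50              # relevant documents in the database
collection_size = 10_000   # total number of documents

recall = retrieved_relevant / relevant       # 0.4
precision = retrieved_relevant / retrieved   # 0.2
generality = relevant / collection_size      # 0.005
print(recall, precision, generality)
```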

Harmonic mean F
It is difficult to compare systems if we have two numbers: if one system has P=0.4 and R=0.6 and another has P=0.35 and R=0.65, which one is better?
Combine both numbers into one: the harmonic mean.
Analogy: you go to school by bike, 10 kilometers each way. In the morning you bike 30 km/h, in the afternoon 20 km/h. What is your average speed?
Morning: 20 minutes; afternoon: 30 minutes. That is 50 minutes for 20 km, so 24 km/h!
Harmonic mean: (2 * v1 * v2) / (v1 + v2)
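
The bike calculation written out, and the same harmonic mean applied to precision and recall to give the F-score:

```latex
\bar{v} = \frac{\text{total distance}}{\text{total time}}
        = \frac{10 + 10}{\tfrac{10}{30} + \tfrac{10}{20}}
        = \frac{20}{5/6} = 24\ \text{km/h}
        = \frac{2 \cdot 30 \cdot 20}{30 + 20}, \qquad
F = \frac{2PR}{P+R}
```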

Harmonic mean F
- If precision is higher (recall being equal), the F-score is higher too.
- If recall is higher (precision being equal), the F-score is higher too.
- The F-score is only maximal when precision AND recall are both high.

Harmonic mean F
- What if P=0.1 and R=0.9?
- What if P=0.4 and R=0.6?
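
Plugging the two cases into F = 2PR/(P+R):

```latex
F(0.1, 0.9) = \frac{2 \cdot 0.1 \cdot 0.9}{0.1 + 0.9} = 0.18, \qquad
F(0.4, 0.6) = \frac{2 \cdot 0.4 \cdot 0.6}{0.4 + 0.6} = 0.48
```

So the balanced system scores far better, even though the arithmetic mean of P and R is 0.5 in both cases.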

Retrieval performance
The ideal system retrieves all relevant documents first.
[Graph: recall (up to 100%) plotted against the number of retrieved documents (up to N), with curves for the ideal, random, parabolic and perverse systems.]

Until now: ordering of results is ignored

Recall and precision at rank x
A set of 80 documents, 4 of them relevant for query Q.
Ordered answer set for Q: NRNNNNNRN..NRN…NRN……N (R = relevant, N = not relevant)

  rel. doc   rank   recall   precision
  d1            2     0.25        0.50
  d2            8     0.50        0.25
  d3           30     0.75        0.10
  d4           40     1.00        0.10

Is recall always rising? It is rising or staying equal.
Is precision always falling? It can fall or stay equal, but it can also rise: if d3 had rank 9, precision at d3 would be 3/9 = 0.33, higher than at d2.
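
A small sketch (not from the slides) that reproduces such a table from the ranks of the relevant documents; the ranks used below are those of the example above:

```python
def recall_precision_at_relevant_ranks(relevant_ranks, n_relevant):
    """For each relevant document (given by its rank in the result list),
    return (rank, recall, precision) measured at that rank."""
    rows = []
    for i, rank in enumerate(sorted(relevant_ranks), start=1):
        recall = i / n_relevant   # relevant documents seen so far / all relevant
        precision = i / rank      # relevant documents seen so far / documents seen so far
        rows.append((rank, recall, precision))
    return rows

for rank, r, p in recall_precision_at_relevant_ranks([2, 8, 30, 40], 4):
    print(f"rank {rank:3d}  recall {r:.2f}  precision {p:.2f}")
```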

Recall-precision graph
[Graph: precision (0 to 100%) plotted against recall (0 to 100%).]
Precision at different recall levels can be shown:
- for a single query
- averaged over a set of queries
- to compare systems

R/P graph: comparing 2 systems

Interpolation
Interpolated precision: if a higher recall level has a higher precision, that precision is used for the lower recall level as well. This removes the spikes from the recall-precision curve.
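
Formally, the interpolated precision at recall level r is the highest precision observed at any recall level at or above r:

```latex
P_{\text{interp}}(r) = \max_{r' \ge r} P(r')
```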

Single value summaries for ordered result lists (1)
- 11-point average precision: the average of the (interpolated) precision values at the 11 recall levels 0%, 10%, ..., 100%
- 3-point average precision: the same, at recall levels 20%, 50% and 80%

Single value summaries for ordered result lists (2)
- precision at a document cut-off value (n = 5, 10, 20, 50, ...); the usual measure for web retrieval (why?)
- recall at a document cut-off value
- R-precision: precision at rank R, where R = the total number of relevant documents for the query (why??)
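
A small sketch of the two cut-off measures (the function names and the example ranking are mine, reusing the 80-document example from the earlier slide):

```python
def precision_at_k(relevance, k):
    """Precision over the top-k results; relevance is a list of 0/1
    labels in rank order."""
    return sum(relevance[:k]) / k

def r_precision(relevance, total_relevant):
    """Precision at rank R, where R is the total number of relevant
    documents for the query."""
    return precision_at_k(relevance, total_relevant)

# Example: relevant documents at ranks 2, 8, 30, 40 in a list of 80.
ranking = [1 if rank in (2, 8, 30, 40) else 0 for rank in range(1, 81)]
print(precision_at_k(ranking, 10))   # 0.2
print(r_precision(ranking, 4))       # 0.25
```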

Single value summaries for ordered result lists (3)
- Average precision: the average of the (non-interpolated) precision values measured at the ranks of the relevant documents seen for a query
- MAP: the mean of the average precisions over a set of queries
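
A minimal sketch of (non-interpolated) average precision and MAP along the lines of these definitions (the function names are mine):

```python
def average_precision(relevance, total_relevant):
    """Non-interpolated average precision for one query.
    relevance: 0/1 labels in rank order.
    total_relevant: number of relevant documents for this query.
    The sum is divided by total_relevant (the TREC convention); when all
    relevant documents are retrieved, as in the slide's example, this is
    the same as dividing by the number of relevant documents seen."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / total_relevant

def mean_average_precision(runs):
    """runs: a list of (relevance, total_relevant) pairs, one per query."""
    return sum(average_precision(r, n) for r, n in runs) / len(runs)
```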

Example average precision

  doc   rank   recall   precision
  d1       2     0.25        0.50
  d2       8     0.50        0.25
  d3       9     0.75        0.33
  d4      40     1.00        0.10

Average precision = (0.50 + 0.25 + 0.33 + 0.10) / 4 = 1.18 / 4 = 0.295

What do these numbers really mean?
- How do we know the user was happy with the result? A click?
- What determines whether you click on a result or not? The snippet; term highlighting.
- What determines whether you click back and try another result?

Annotator agreement
To be sure that relevance judgements are reliable, more than one judge should rate each document.
A common measure for the agreement between judges is the kappa statistic.
Of course some agreement will always occur just by chance; this expected chance agreement is factored into kappa.

Kappa measure
kappa = (P(A) - P(E)) / (1 - P(E))
where P(A) is the proportion of cases in which the judges agree, and P(E) is the expected proportion of agreement by chance.
For more than 2 judges, kappa is calculated for each pair of judges and the outcomes are averaged.
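
A minimal sketch of kappa for two judges making binary relevance decisions; it estimates the chance agreement P(E) from pooled marginals, which is one common convention (the variable names and the toy data are mine):

```python
def kappa(judge1, judge2):
    """Kappa for two judges giving binary relevance labels
    (1 = relevant, 0 = not relevant) to the same documents."""
    assert len(judge1) == len(judge2)
    n = len(judge1)
    p_agree = sum(a == b for a, b in zip(judge1, judge2)) / n
    # Pool both judges' labels to estimate P(relevant) and P(not relevant).
    p_rel = (sum(judge1) + sum(judge2)) / (2 * n)
    p_chance = p_rel ** 2 + (1 - p_rel) ** 2
    return (p_agree - p_chance) / (1 - p_chance)

# Toy usage: two judges who disagree on 2 out of 10 documents.
j1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
j2 = [1, 1, 0, 0, 0, 0, 1, 1, 1, 0]
print(round(kappa(j1, j2), 2))   # prints 0.6
```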

Kappa example: see table 8.2 on page 152 of Manning, Raghavan and Schütze, Introduction to Information Retrieval.

Kappa interpretation
For binary relevance decisions, inter-judge agreement is generally not higher than "fair".
  kappa = 1       complete agreement
  kappa > 0.8     good agreement
  kappa > 0.67    fair agreement
  kappa < 0.67    dubious
  kappa = 0       agreement no better than chance
  kappa < 0       worse than chance