2011.03.02 - IS 240 – Spring 2011, Prof. Ray Larson, University of California, Berkeley, School of Information. Principles of Information Retrieval.

Presentation transcript:

SLIDE 1 IS 240 – Spring 2011
Prof. Ray Larson, University of California, Berkeley, School of Information
Principles of Information Retrieval
Lecture 12: Evaluation Cont.

SLIDE 2 IS 240 – Spring 2011
Overview
Evaluation of IR Systems
– Review
Blair and Maron
Calculating Precision vs. Recall
Using TREC_eval
Theoretical limits of precision and recall
Measures for very large-scale systems

SLIDE 3 IS 240 – Spring 2011
Overview
Evaluation of IR Systems
– Review
Blair and Maron
Calculating Precision vs. Recall
Using TREC_eval
Theoretical limits of precision and recall

SLIDE 4 IS 240 – Spring 2011
What to Evaluate?
What can be measured that reflects users' ability to use the system? (Cleverdon 66)
– Coverage of Information
– Form of Presentation
– Effort required/Ease of Use
– Time and Space Efficiency
– Recall: proportion of relevant material actually retrieved
– Precision: proportion of retrieved material actually relevant
Recall and Precision together measure effectiveness.

SLIDE 5 IS 240 – Spring 2011
Relevant vs. Retrieved
[Figure: Venn diagram of the Relevant and Retrieved sets within All docs]

SLIDE 6 IS 240 – Spring 2011
Precision vs. Recall
[Figure: Venn diagram of the Relevant and Retrieved sets within All docs]

SLIDE 7 IS 240 – Spring 2011
Relation to Contingency Table

                        Doc is Relevant    Doc is NOT relevant
Doc is retrieved               a                    b
Doc is NOT retrieved           c                    d

Accuracy: (a+d) / (a+b+c+d)
Precision: a/(a+b)
Recall: ?
Why don't we use Accuracy for IR?
– (Assuming a large collection)
– Most docs aren't relevant
– Most docs aren't retrieved
– Inflates the accuracy value
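A minimal sketch of the contingency-table measures, to make the accuracy problem concrete. The function name and the counts are hypothetical (they are not from the lecture); recall is computed as a/(a+c), the standard definition that matches "proportion of relevant material actually retrieved".

```python
def contingency_measures(a, b, c, d):
    """a: relevant and retrieved, b: non-relevant and retrieved,
    c: relevant and not retrieved, d: non-relevant and not retrieved."""
    accuracy = (a + d) / (a + b + c + d)
    precision = a / (a + b) if (a + b) else 0.0
    recall = a / (a + c) if (a + c) else 0.0
    return accuracy, precision, recall


# Hypothetical large collection: 1,000,000 docs, 100 relevant,
# 50 retrieved, of which 10 are relevant.
acc, prec, rec = contingency_measures(a=10, b=40, c=90, d=999_860)
print(f"accuracy={acc:.5f} precision={prec:.2f} recall={rec:.2f}")
```

With these made-up counts the search finds only a tenth of the relevant documents, yet accuracy stays close to 1 because cell d dominates the sum.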

SLIDE 8 IS 240 – Spring 2011
The E-Measure
Combine Precision and Recall into one number (van Rijsbergen 79)
[Equation: definition of E in terms of P, R and b]
P = precision
R = recall
b = measure of relative importance of P or R
For example, b = 0.5 means user is twice as interested in precision as recall
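The slide's own equation is not preserved in this transcript. As a hedged sketch, the code below uses the commonly cited form of van Rijsbergen's measure, E = 1 - (1 + b²)PR / (b²P + R); treating this as the formula the slide showed is an assumption, and the function name is illustrative only.

```python
def e_measure(precision, recall, b=1.0):
    """Commonly cited form of van Rijsbergen's E (assumed, not taken from
    the slide). Lower values indicate better effectiveness."""
    if precision == 0.0 or recall == 0.0:
        return 1.0
    b2 = b * b
    return 1.0 - ((1.0 + b2) * precision * recall) / (b2 * precision + recall)


print(e_measure(0.5, 0.5))          # equal weighting of P and R
print(e_measure(0.8, 0.4, b=0.5))   # b < 1 weights precision more heavily
print(e_measure(0.4, 0.8, b=0.5))   # same values swapped score worse
```

Under this form, b = 0.5 rewards the high-precision run over the high-recall run, which is consistent with the slide's description of b.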

SLIDE 9 IS 240 – Spring 2011
The F-Measure (new)
Another single measure that combines precision and recall.
[Equations: definition of F, its weighting parameters, and the "balanced" condition]
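The equations on this slide are not preserved in the transcript. The sketch below assumes the widely used textbook definitions: F = 1 / (α/P + (1 - α)/R), with β² = (1 - α)/α, so that F = (β² + 1)PR / (β²P + R), and the "balanced" case β = 1 (α = 1/2) gives F = 2PR / (P + R). Function names are illustrative.

```python
def f_measure(precision, recall, beta=1.0):
    """F in its beta form (assumed textbook definition)."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


def f_from_alpha(precision, recall, alpha=0.5):
    """Equivalent weighted harmonic mean, with beta^2 = (1 - alpha) / alpha."""
    if precision == 0.0 or recall == 0.0:
        return 0.0
    return 1.0 / (alpha / precision + (1.0 - alpha) / recall)


print(f_measure(0.5, 0.25))                # balanced F (beta = 1)
print(f_from_alpha(0.5, 0.25, alpha=0.5))  # same value via alpha = 1/2
```

Under these definitions F is the complement of the commonly cited E-measure (E = 1 - F) when b and β are equal.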

SLIDE 10 IS 240 – Spring 2011
TREC
Text REtrieval Conference/Competition
– Run by NIST (National Institute of Standards & Technology)
– 2000 was the 9th year - 10th TREC in November
Collection: 5 Gigabytes (5 CD-ROMs), >1.5 Million Docs
– Newswire & full text news (AP, WSJ, Ziff, FT, San Jose Mercury, LA Times)
– Government documents (Federal Register, Congressional Record)
– FBIS (Foreign Broadcast Information Service)
– US Patents

SLIDE 11 IS 240 – Spring 2011
Sample TREC queries (topics)
Number: 168
Topic: Financing AMTRAK
Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK).
Narrative: A relevant document must provide information on the government's responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

SLIDE 19 IS 240 – Spring 2011
TREC
Results differ each year
For the main (ad hoc) track:
– Best systems not statistically significantly different
– Small differences sometimes have big effects
    how good was the hyphenation model
    how was document length taken into account
– Systems were optimized for longer queries and all performed worse for shorter, more realistic queries
Ad hoc track suspended in TREC 9

SLIDE 20 IS 240 – Spring 2011
Overview
Evaluation of IR Systems
– Review
Blair and Maron
Calculating Precision vs. Recall
Using TREC_eval
Theoretical limits of precision and recall

SLIDE 21 IS 240 – Spring 2011
Blair and Maron 1985
A classic study of retrieval effectiveness
– earlier studies were on unrealistically small collections
Studied an archive of documents for a legal suit
– ~350,000 pages of text
– 40 queries
– focus on high recall
– Used IBM's STAIRS full-text system
Main Result:
– The system retrieved less than 20% of the relevant documents for a particular information need; lawyers thought they had 75%
But many queries had very high precision

SLIDE 22 IS 240 – Spring 2011
Blair and Maron, cont.
How they estimated recall
– generated partially random samples of unseen documents
– had users (unaware these were random) judge them for relevance
Other results:
– the two lawyers' searches had similar performance
– the lawyers' recall was not much different from the paralegals'

SLIDE 23 IS 240 – Spring 2011
Blair and Maron, cont.
Why recall was low
– users can't foresee exact words and phrases that will indicate relevant documents
    "accident" referred to by those responsible as: "event," "incident," "situation," "problem," …
    differing technical terminology
    slang, misspellings
– Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

SLIDE 24 IS 240 – Spring 2011
Overview
Evaluation of IR Systems
– Review
Blair and Maron
Calculating Precision vs. Recall
Using TREC_eval
Theoretical limits of precision and recall

SLIDE 25 IS 240 – Spring 2011
How Test Runs are Evaluated
Rq = {d3, d5, d9, d25, d39, d44, d56, d71, d89, d123}: 10 Relevant
Ranked results (* marks a relevant document):
1. d123 *
2. d84
3. d56 *
4. d6
5. d8
6. d9 *
7-9. (document ids not preserved; not marked relevant)
10. d25 *
11-14. (document ids not preserved; not marked relevant)
15. d3 *
First ranked doc is relevant, which is 10% of the total relevant. Therefore Precision at the 10% Recall level is 100%
Next Relevant gives us 66% Precision at 20% recall level
Etc….
Examples from Chapter 3 in Baeza-Yates
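The walk down the ranking can be sketched in a few lines. This is an illustrative sketch, not the lecture's or trec_eval's code: the function name is hypothetical, and the ranking is represented only by relevance flags (relevant documents at ranks 1, 3, 6, 10 and 15, with 10 relevant documents in total), since some document ids are not preserved.

```python
def precision_recall_points(relevance_flags, total_relevant):
    """Return (recall, precision) observed at each relevant document,
    walking the ranked list from the top."""
    points = []
    hits = 0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            points.append((hits / total_relevant, hits / rank))
    return points


# Relevant documents sit at ranks 1, 3, 6, 10 and 15 of the 15 results.
flags = [rank in {1, 3, 6, 10, 15} for rank in range(1, 16)]
for recall, precision in precision_recall_points(flags, total_relevant=10):
    print(f"recall={recall:.0%} precision={precision:.1%}")
```

Only five of the ten relevant documents are retrieved, so the observed recall never goes past 50% for this run.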

SLIDE 26 IS 240 – Spring 2011
Graphing for a Single Query
[Figure: precision (y-axis) plotted against recall (x-axis)]

SLIDE 27 IS 240 – Spring 2011
Averaging Multiple Queries

SLIDE 28 IS 240 – Spring 2011
Interpolation
Rq = {d3, d56, d129}
Ranked results (* marks a document relevant to this Rq):
1. d123
2. d84
3. d56 *
4. d6
5. d8
6. d9
7. (document id not preserved)
8. d129 *
9. (document id not preserved)
10. d25
11-14. (document ids not preserved)
15. d3 *
First relevant doc is d56, which gives recall and precision of 33.3%
Next relevant (d129) gives us 66% recall at 25% precision
Next (d3) gives us 100% recall with 20% precision
How do we figure out the precision at the 11 standard recall levels?

SLIDE 29 IS 240 – Spring 2011
Interpolation

SLIDE 30 IS 240 – Spring 2011
Interpolation
So, at recall levels 0%, 10%, 20%, and 30% the interpolated precision is 33.3%
At recall levels 40%, 50%, and 60% interpolated precision is 25%
And at recall levels 70%, 80%, 90% and 100%, interpolated precision is 20%
Giving graph…
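A short sketch of how these values can be obtained, assuming the usual interpolation rule: the interpolated precision at a recall level is the highest precision observed at any recall greater than or equal to that level. The function name is illustrative; the three observed points come from the example with Rq = {d3, d56, d129}, whose relevant documents appear at ranks 3, 8 and 15.

```python
def interpolated_precision(points, levels=None):
    """points: (recall, precision) pairs observed at each relevant document.
    Returns the interpolated precision at each requested recall level."""
    if levels is None:
        levels = [i / 10 for i in range(11)]  # the 11 standard recall levels
    result = []
    for level in levels:
        candidates = [p for r, p in points if r >= level]
        result.append(max(candidates) if candidates else 0.0)
    return result


# Observed points: relevant docs at ranks 3, 8 and 15, three relevant in total.
points = [(1 / 3, 1 / 3), (2 / 3, 2 / 8), (3 / 3, 3 / 15)]
for level, prec in zip(range(0, 101, 10), interpolated_precision(points)):
    print(f"{level:3d}% recall -> {prec:.1%}")
```

Averaging these 11 values across queries, level by level, gives the kind of averaged recall-precision curve used to compare systems.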

SLIDE 31 IS 240 – Spring 2011
Interpolation
[Figure: interpolated precision (y-axis) plotted against recall (x-axis)]

SLIDE 32 IS 240 – Spring 2011
Overview
Evaluation of IR Systems
– Review
Blair and Maron
Calculating Precision vs. Recall
Using TREC_eval
Theoretical limits of precision and recall

SLIDE 33 IS 240 – Spring 2011
Using TREC_EVAL
Developed from SMART evaluation programs for use in TREC
– trec_eval [-q] [-a] [-o] trec_qrel_file top_ranked_file
NOTE: Many other options in current version
Uses:
– List of top-ranked documents
    QID iter docno rank sim runid
    030 Q0 ZF… prise1 (sample row; remaining fields not preserved)
– QRELS file for collection
    QID docno rel
    FT… (sample rows; remaining fields not preserved)

SLIDE 34 IS 240 – Spring 2011
Running TREC_EVAL
Options
– -q gives evaluation for each query
– -a gives additional (non-TREC) measures
– -d gives the average document precision measure
– -o gives the "old style" display shown here

SLIDE 35 IS 240 – Spring 2011
Running TREC_EVAL
Output:
– Retrieved: number retrieved for query
– Relevant: number relevant in qrels file
– Rel_ret: Relevant items that were retrieved
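To make the three counts concrete, here is a small sketch that computes them from in-memory data. It is not trec_eval's implementation: the function name, the dictionary layout, and the toy document ids are all hypothetical, and file parsing is deliberately left out.

```python
def summary_counts(run, qrels):
    """run: query id -> ranked list of retrieved docnos.
    qrels: query id -> set of docnos judged relevant.
    Returns (Retrieved, Relevant, Rel_ret) summed over all queries."""
    retrieved = sum(len(docs) for docs in run.values())
    relevant = sum(len(rel) for rel in qrels.values())
    rel_ret = sum(
        len(set(docs) & qrels.get(qid, set())) for qid, docs in run.items()
    )
    return retrieved, relevant, rel_ret


# Hypothetical toy data for two queries.
run = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
qrels = {"q1": {"d1", "d9"}, "q2": {"d5"}}
print(summary_counts(run, qrels))
```

Rel_ret divided by Relevant is recall over the whole run, and Rel_ret divided by Retrieved is precision over the whole run.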

SLIDE 36 IS 240 – Spring 2011
Running TREC_EVAL - Output
Total number of documents over all queries
    Retrieved: (value not preserved)
    Relevant: 1583
    Rel_ret: 635
Interpolated Recall - Precision Averages:
    (values at the 11 recall levels not preserved)
Average precision (non-interpolated) for all rel docs (averaged over queries)
    (value not preserved)

SLIDE 37 IS 240 – Spring 2011
Plotting Output (using Gnuplot)

SLIDE 38 IS 240 – Spring 2011
Plotting Output (using Gnuplot)

SLIDE 39 IS 240 – Spring 2011
Gnuplot code
    set title "Individual Queries"
    set ylabel "Precision"
    set xlabel "Recall"
    set xrange [0:1]
    set yrange [0:1]
    set xtics 0,.5,1
    set ytics 0,.2,1
    set grid
    plot 'Group1/trec_top_file_1.txt.dat' title "Group1 trec_top_file_1" with lines 1
    pause -1 "hit return"
trec_top_file_1.txt.dat

SLIDE 40 IS 240 – Spring 2011
Overview
Evaluation of IR Systems
– Review
Blair and Maron
Calculating Precision vs. Recall
Using TREC_eval
Theoretical limits of precision and recall

SLIDE 41 IS 240 – Spring 2011
Problems with Precision/Recall
Can't know true recall value
– except in small collections
Precision/Recall are related
– A combined measure sometimes more appropriate (like F or MAP)
Assumes batch mode
– Interactive IR is important and has different criteria for successful searches
– We will touch on this in the UI section
Assumes a strict rank ordering matters

SLIDE 42 IS 240 – Spring 2011
Relationship between Precision and Recall
[Table: Doc is Relevant / Doc is NOT relevant against Doc is retrieved / Doc is NOT retrieved; cell contents not preserved]
Buckland & Gey, JASIS: Jan 1994

SLIDE 43 IS 240 – Spring 2011
Recall under various retrieval assumptions
Buckland & Gey, JASIS: Jan 1994
[Figure: recall (y-axis) against proportion of documents retrieved (x-axis); 1000 documents, 100 relevant; curves: Random, Perfect, Perverse, Tangent Parabolic Recall, Parabolic Recall]

SLIDE 44 IS 240 – Spring 2011
Precision under various assumptions
[Figure: precision (y-axis) against proportion of documents retrieved (x-axis); 1000 documents, 100 relevant; curves: Random, Perfect, Perverse, Tangent Parabolic Recall, Parabolic Recall]

SLIDE 45 IS 240 – Spring 2011
Recall-Precision
[Figure: precision (y-axis) against recall (x-axis); 1000 documents, 100 relevant; curves: Random, Perfect, Perverse, Tangent Parabolic Recall, Parabolic Recall]
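The three simplest curves in these figures can be reproduced with a short sketch, using the figures' setting of 1000 documents with 100 relevant. The modelling here is an assumption made for illustration: "perfect" retrieves every relevant document first, "perverse" retrieves every non-relevant document first, and "random" uses the expected number of relevant documents. The parabolic variants are omitted, and the function names are hypothetical.

```python
def expected_hits(n_retrieved, strategy, n_docs=1000, n_rel=100):
    """Relevant documents among the first n_retrieved under each assumption."""
    if strategy == "perfect":    # every relevant doc ranked first
        return min(n_retrieved, n_rel)
    if strategy == "perverse":   # every non-relevant doc ranked first
        return max(0, n_retrieved - (n_docs - n_rel))
    if strategy == "random":     # expected value under random ordering
        return n_retrieved * n_rel / n_docs
    raise ValueError(f"unknown strategy: {strategy}")


def recall_precision(n_retrieved, strategy, n_docs=1000, n_rel=100):
    hits = expected_hits(n_retrieved, strategy, n_docs, n_rel)
    recall = hits / n_rel
    precision = hits / n_retrieved if n_retrieved else 0.0
    return recall, precision


for strategy in ("perfect", "random", "perverse"):
    for n in (100, 500, 1000):
        r, p = recall_precision(n, strategy)
        print(f"{strategy:8s} retrieved={n:4d} recall={r:.2f} precision={p:.2f}")
```

All three strategies end at the same point once everything is retrieved: recall 1.0 and precision equal to the proportion of relevant documents in the collection.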

SLIDE 46 IS 240 – Spring 2011
CACM Query 25

SLIDE 47 IS 240 – Spring 2011
Relationship of Precision and Recall

SLIDE 48 IS 240 – Spring 2011
Measures for Large-Scale Eval
Typical user behavior in web search systems has shown a preference for high precision
Also graded scales of relevance seem more useful than just "yes/no"
Measures have been devised to help evaluate situations taking these into account

SLIDE 49 IS 240 – Spring 2011
Cumulative Gain measures
If we assume that highly relevant documents are more useful when appearing earlier in a result list (are ranked higher)
And, highly relevant documents are more useful than marginally relevant documents, which are in turn more useful than non-relevant documents
Then measures that take these factors into account would better reflect user needs

SLIDE 50 IS 240 – Spring 2011
Simple CG
Cumulative Gain is simply the sum of all the graded relevance values of the items in a ranked search result list
The CG at a particular rank p is
$CG_p = \sum_{i=1}^{p} rel_i$
Where i is the rank and $rel_i$ is the relevance score
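A minimal sketch of simple CG as described, summing graded relevance values down to rank p. The function name and the graded judgments (on an assumed 0 to 3 scale) are hypothetical.

```python
def cumulative_gain(relevances, p=None):
    """Sum of graded relevance values for the top p results
    (all results when p is None)."""
    if p is None:
        p = len(relevances)
    return sum(relevances[:p])


# Hypothetical graded judgments for a ranked list of six results.
rels = [3, 2, 3, 0, 1, 2]
print(cumulative_gain(rels, p=3))
print(cumulative_gain(rels))
# Reordering the list does not change the total, which is the
# limitation that discounting by rank is meant to address.
print(cumulative_gain(sorted(rels)) == cumulative_gain(rels))
```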

SLIDE 51 IS 240 – Spring 2011
Discounted Cumulative Gain
DCG measures the gain (usefulness) of a document based on its position in the result list
– The gain is accumulated (like simple CG) with the gain of each result discounted at lower ranks
The idea is that highly relevant docs appearing lower in the search result should be penalized in proportion to their position in the results

SLIDE 52 IS 240 – Spring 2011
Discounted Cumulative Gain
The gain is reduced logarithmically in proportion to the position (p) in the ranking
[Equation: DCG at rank p with a logarithmic discount]
Why logs? No real reason except smooth reduction.
Another formulation is:
[Equation: alternative DCG formulation]
Puts a stronger emphasis on high ranks
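The slide's two equations are not preserved in the transcript. The sketch below assumes two widely cited formulations: DCG_p = rel_1 + Σ_{i=2..p} rel_i / log2(i), and the alternative DCG_p = Σ_{i=1..p} (2^rel_i - 1) / log2(i + 1). Whether these match the slide exactly is an assumption; function names and the graded judgments are hypothetical.

```python
import math


def dcg(relevances, p=None):
    """Assumed first form: rank 1 is undiscounted, rank i >= 2 is divided
    by log2(i)."""
    rels = relevances if p is None else relevances[:p]
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(rels, start=1)
    )


def dcg_exponential(relevances, p=None):
    """Assumed alternative form: gain 2^rel - 1, discount log2(i + 1)."""
    rels = relevances if p is None else relevances[:p]
    return sum(
        (2 ** rel - 1) / math.log2(i + 1)
        for i, rel in enumerate(rels, start=1)
    )


rels = [3, 2, 3, 0, 1, 2]                   # hypothetical graded judgments
print(dcg(rels), dcg(list(reversed(rels))))  # same documents, worse ordering
print(dcg_exponential(rels))
```

Unlike simple CG, the same judgments in a worse order give a lower score, because gains at lower ranks are discounted.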

SLIDE 53 IS 240 – Spring 2011
Normalized DCG
Because search results lists vary in size depending on the query, comparing results across queries doesn't work with DCG alone
To do this DCG is normalized across the query set:
– First create an "ideal" result by sorting the result list by relevance score
– Use that ideal value to create a normalized DCG

SLIDE 54 IS 240 – Spring 2011
Normalized DCG
Using the ideal DCG (IDCG) at a given position and the observed DCG at the same position:
$nDCG_p = DCG_p / IDCG_p$
The nDCG values for all test queries can then be averaged to give a measure of the average performance of the system
If a system does perfect ranking, the IDCG and DCG will be the same, so nDCG will be 1 for each rank ($nDCG_p$ ranges from 0 to 1)
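A hedged sketch of the normalization and averaging steps. It follows the slides in building the ideal ordering by sorting the result list itself by relevance score; the DCG form used (rank 1 undiscounted, later ranks divided by log2 of the rank) is an assumed common formulation, and all names and judgments are hypothetical.

```python
import math


def dcg(relevances):
    """Assumed DCG form: rel_1 + sum over i >= 2 of rel_i / log2(i)."""
    return sum(
        rel if i == 1 else rel / math.log2(i)
        for i, rel in enumerate(relevances, start=1)
    )


def ndcg(relevances, p=None):
    """Observed DCG at rank p divided by the DCG of the ideal ordering."""
    p = len(relevances) if p is None else p
    ideal = sorted(relevances, reverse=True)
    idcg = dcg(ideal[:p])
    return dcg(relevances[:p]) / idcg if idcg > 0 else 0.0


def mean_ndcg(per_query_relevances, p=None):
    """Average nDCG over the test queries."""
    scores = [ndcg(rels, p) for rels in per_query_relevances]
    return sum(scores) / len(scores) if scores else 0.0


queries = [[3, 2, 3, 0, 1, 2], [3, 2, 1, 0]]  # hypothetical judgments
print([ndcg(q) for q in queries])
print(mean_ndcg(queries))
```

A stricter variant builds the ideal ordering from every judged document for the query rather than only those retrieved; that choice is left open here.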

SLIDE 55 IS 240 – Spring 2011
Next week
Issues in Evaluation
– Has there been improvement in systems?
IR Components: Relevance Feedback