2007.02.27 - SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

Slides:

Advertisements

Similar presentations

Evaluation of Information Retrieval Systems Thanks to Marti Hearst, Ray Larson.

Advertisements

Information Extraction Lecture 4 – Named Entity Recognition II CIS, LMU München Winter Semester Dr. Alexander Fraser, CIS.

1 Evaluation Rong Jin. 2 Evaluation  Evaluation is key to building effective and efficient search engines usually carried out in controlled experiments.

1 Retrieval Performance Evaluation Modern Information Retrieval by R. Baeza-Yates and B. Ribeiro-Neto Addison-Wesley, (Chapter 3)

Search Engines and Information Retrieval

Information Retrieval Review

- SLAYT 1 BBY 220 Re-evaluation of IR Systems Yaşar Tonta Hacettepe Üniversitesi yunus.hacettepe.edu.tr/~tonta/ BBY220 Bilgi Erişim.

Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.

SLIDE 1IS 202 – FALL 2002 Lecture 20: Evaluation Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00.

1 CS 430: Information Discovery Lecture 10 Cranfield and TREC.

Modern Information Retrieval

SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

SLIDE 1IS 240 – Spring 2010 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

SLIDE 1IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

INFO 624 Week 3 Retrieval System Evaluation

Retrieval Evaluation. Brief Review Evaluation of implementations in computer science often is in terms of time and space complexity. With large document.

SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

Information Access I Measurement and Evaluation GSLT, Göteborg, October 2003 Barbara Gawronska, Högskolan i Skövde.

Current Topics in Information Access: IR Background

Reference Collections: Task Characteristics. TREC Collection Text REtrieval Conference (TREC) –sponsored by NIST and DARPA (1992-?) Comparing approaches.

SLIDE 1IS 202 – FALL 2004 Lecture 10: IR Evaluation Workshop Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30.

SIMS 202 Information Organization and Retrieval Prof. Marti Hearst and Prof. Ray Larson UC Berkeley SIMS Tues/Thurs 9:30-11:00am Fall 2000.

SLIDE 1IS 202 – FALL 2004 Lecture 9: IR Evaluation Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00.

Indexing and Representation: The Vector Space Model Document represented by a vector of terms Document represented by a vector of terms Words (or word.

SLIDE 1IS 240 – Spring 2009 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00.

SLIDE 1IS 240 – Spring 2011 Prof. Ray Larson University of California, Berkeley School of Information Principles of Information Retrieval.

The Wharton School of the University of Pennsylvania OPIM 101 2/16/19981 The Information Retrieval Problem n The IR problem is very hard n Why? Many reasons,

Web Search – Summer Term 2006 II. Information Retrieval (Basics Cont.) (c) Wolfgang Hürst, Albert-Ludwigs-University.

WXGB6106 INFORMATION RETRIEVAL Week 3 RETRIEVAL EVALUATION.

ISP 433/633 Week 6 IR Evaluation. Why Evaluate? Determine if the system is desirable Make comparative assessments.

Evaluation.  Allan, Ballesteros, Croft, and/or Turtle Types of Evaluation Might evaluate several aspects Evaluation generally comparative –System A vs.

SLIDE 1IS 202 – FALL 2003 Lecture 20: Evaluation Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday 10:30 am - 12:00.

Evaluation Information retrieval Web. Purposes of Evaluation System Performance Evaluation efficiency of data structures and methods operational profile.

Evaluation of Image Retrieval Results Relevant: images which meet user’s information need Irrelevant: images which don’t meet user’s information need Query:

Search and Retrieval: Relevance and Evaluation Prof. Marti Hearst SIMS 202, Lecture 20.

Evaluation David Kauchak cs458 Fall 2012 adapted from:

Evaluation David Kauchak cs160 Fall 2009 adapted from:

Search Engines and Information Retrieval Chapter 1.

Information Retrieval and Web Search IR Evaluation and IR Standard Text Collections.

Philosophy of IR Evaluation Ellen Voorhees. NIST Evaluation: How well does system meet information need? System evaluation: how good are document rankings?

IR Evaluation Evaluate what? –user satisfaction on specific task –speed –presentation (interface) issue –etc. My focus today: –comparative performance.

Modern Information Retrieval: A Brief Overview By Amit Singhal Ranjan Dash.

University of Malta CSA3080: Lecture 6 © Chris Staff 1 of 20 CSA3080: Adaptive Hypertext Systems I Dr. Christopher Staff Department.

Chapter 8 Evaluating Search Engine. Evaluation n Evaluation is key to building effective and efficient search engines  Measurement usually carried out.

Basic Implementation and Evaluations Aj. Khuanlux MitsophonsiriCS.426 INFORMATION RETRIEVAL.

Measuring How Good Your Search Engine Is. *. Information System Evaluation l Before 1993 evaluations were done using a few small, well-known corpora of.

Performance Measurement. 2 Testing Environment.

1 CS 430 / INFO 430 Information Retrieval Lecture 8 Evaluation of Retrieval Effectiveness 1.

What Does the User Really Want ? Relevance, Precision and Recall.

1 CS 430: Information Discovery Lecture 8 Evaluation of Retrieval Effectiveness II.

1 What Makes a Query Difficult? David Carmel, Elad YomTov, Adam Darlow, Dan Pelleg IBM Haifa Research Labs SIGIR 2006.

Chapter. 3: Retrieval Evaluation 1/2/2016Dr. Almetwally Mostafa 1.

A Logistic Regression Approach to Distributed IR Ray R. Larson : School of Information Management & Systems, University of California, Berkeley --

Evaluation of Information Retrieval Systems Xiangming Mu.

Evaluation. The major goal of IR is to search document relevant to a user query. The evaluation of the performance of IR systems relies on the notion.

Information Retrieval Quality of a Search Engine.

Information Retrieval Lecture 3 Introduction to Information Retrieval (Manning et al. 2007) Chapter 8 For the MSc Computer Science Programme Dell Zhang.

SLIDE 1IS 202 – FALL 2002 Lecture 20: Web Search Issues and Algorithms Prof. Ray Larson & Prof. Marc Davis UC Berkeley SIMS Tuesday and Thursday.

Evaluation of Information Retrieval Systems

Modern Information Retrieval

IR Theory: Evaluation Methods

Evaluation of Information Retrieval Systems

Evaluation of Information Retrieval Systems

Retrieval Evaluation - Reference Collections

Retrieval Evaluation - Measures

Retrieval Performance Evaluation - Measures

Precision and Recall Reminder:

Presentation transcript:

SLIDE 1IS 240 – Spring 2007 Prof. Ray Larson University of California, Berkeley School of Information Tuesday and Thursday 10:30 am - 12:00 pm Spring Principles of Information Retrieval Lecture 12: Evaluation Intro

SLIDE 2IS 240 – Spring 2007 Today Mini-TREC Setup reports Evaluation of IR Systems –Precision vs. Recall –Cutoff Points –Test Collections/TREC –Blair & Maron Study

SLIDE 3IS 240 – Spring 2007 Mini-TREC Setup Reports Why did you choose the system? How easy was it to set up the system? Have you indexed the data yet? –Issues and problems? Plans or ideas about how to improve or change the system?

SLIDE 4IS 240 – Spring 2007 Today Mini-TREC Setup reports Evaluation of IR Systems –Precision vs. Recall –Cutoff Points –Test Collections/TREC –Blair & Maron Study

SLIDE 5IS 240 – Spring 2007 Evaluation Why Evaluate? What to Evaluate? How to Evaluate?

SLIDE 6IS 240 – Spring 2007 Why Evaluate? Determine if the system is desirable Make comparative assessments Test and improve IR algorithms

SLIDE 7IS 240 – Spring 2007 What to Evaluate? How much of the information need is satisfied. How much was learned about a topic. Incidental learning: –How much was learned about the collection. –How much was learned about other topics. How inviting the system is.

SLIDE 8IS 240 – Spring 2007 Relevance In what ways can a document be relevant to a query? –Answer precise question precisely. –Partially answer question. –Suggest a source for more information. –Give background information. –Remind the user of other knowledge. –Others...

SLIDE 9IS 240 – Spring 2007 Relevance How relevant is the document –for this user for this information need. Subjective, but Measurable to some extent –How often do people agree a document is relevant to a query How well does it answer the question? –Complete answer? Partial? –Background Information? –Hints for further exploration?

SLIDE 10IS 240 – Spring 2007 What to Evaluate? What can be measured that reflects users’ ability to use system? (Cleverdon 66) –Coverage of Information –Form of Presentation –Effort required/Ease of Use –Time and Space Efficiency –Recall proportion of relevant material actually retrieved –Precision proportion of retrieved material actually relevant effectiveness

SLIDE 11IS 240 – Spring 2007 Relevant vs. Retrieved Relevant Retrieved All docs

SLIDE 12IS 240 – Spring 2007 Precision vs. Recall Relevant Retrieved All docs

SLIDE 13IS 240 – Spring 2007 Why Precision and Recall? Get as much good stuff while at the same time getting as little junk as possible.

SLIDE 14IS 240 – Spring 2007 Retrieved vs. Relevant Documents Relevant Very high precision, very low recall

SLIDE 15IS 240 – Spring 2007 Retrieved vs. Relevant Documents Relevant Very low precision, very low recall (0 in fact)

SLIDE 16IS 240 – Spring 2007 Retrieved vs. Relevant Documents Relevant High recall, but low precision

SLIDE 17IS 240 – Spring 2007 Retrieved vs. Relevant Documents Relevant High precision, high recall (at last!)

SLIDE 18IS 240 – Spring 2007 Precision/Recall Curves There is a tradeoff between Precision and Recall So measure Precision at different levels of Recall Note: this is an AVERAGE over MANY queries precision recall x x x x

SLIDE 19IS 240 – Spring 2007 Precision/Recall Curves Difficult to determine which of these two hypothetical results is better: precision recall x x x x

SLIDE 20IS 240 – Spring 2007 Precision/Recall Curves

SLIDE 21IS 240 – Spring 2007 Document Cutoff Levels Another way to evaluate: –Fix the number of relevant documents retrieved at several levels: top 5 top 10 top 20 top 50 top 100 top 500 –Measure precision at each of these levels –Take (weighted) average over results This is sometimes done with just number of docs This is a way to focus on how well the system ranks the first k documents.

SLIDE 22IS 240 – Spring 2007 Problems with Precision/Recall Can’t know true recall value –except in small collections Precision/Recall are related –A combined measure sometimes more appropriate Assumes batch mode –Interactive IR is important and has different criteria for successful searches –We will touch on this in the UI section Assumes a strict rank ordering matters.

SLIDE 23IS 240 – Spring 2007 Relation to Contingency Table Accuracy: (a+d) / (a+b+c+d) Precision: a/(a+b) Recall: ? Why don’t we use Accuracy for IR? –(Assuming a large collection) –Most docs aren’t relevant –Most docs aren’t retrieved –Inflates the accuracy value Doc is Relevant Doc is NOT relevant Doc is retrieved ab Doc is NOT retrieved cd

SLIDE 24IS 240 – Spring 2007 The E-Measure Combine Precision and Recall into one number (van Rijsbergen 79) P = precision R = recall b = measure of relative importance of P or R For example, b = 0.5 means user is twice as interested in precision as recall

SLIDE 25IS 240 – Spring 2007 Old Test Collections Used 5 test collections –CACM (3204) –CISI (1460) –CRAN (1397) –INSPEC (12684) –MED (1033)

SLIDE 26IS 240 – Spring 2007 TREC Text REtrieval Conference/Competition –Run by NIST (National Institute of Standards & Technology) –2001 was the 10th year - 11th TREC in November Collection: 5 Gigabytes (5 CRDOMs), >1.5 Million Docs –Newswire & full text news (AP, WSJ, Ziff, FT, San Jose Mercury, LA Times) –Government documents (federal register, Congressional Record) –FBIS (Foreign Broadcast Information Service) –US Patents

SLIDE 27IS 240 – Spring 2007 TREC (cont.) Queries + Relevance Judgments –Queries devised and judged by “Information Specialists” –Relevance judgments done only for those documents retrieved -- not entire collection! Competition –Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66) –Results judged on precision and recall, going up to a recall level of 1000 documents Following slides from TREC overviews by Ellen Voorhees of NIST.

SLIDE 28IS 240 – Spring 2007

SLIDE 29IS 240 – Spring 2007

SLIDE 30IS 240 – Spring 2007

SLIDE 31IS 240 – Spring 2007

SLIDE 32IS 240 – Spring 2007

SLIDE 33IS 240 – Spring 2007

SLIDE 34IS 240 – Spring 2007 Sample TREC queries (topics) Number: 168 Topic: Financing AMTRAK Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK) Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to aMTRAK would also be relevant.

SLIDE 35IS 240 – Spring 2007

SLIDE 36IS 240 – Spring 2007

SLIDE 37IS 240 – Spring 2007

SLIDE 38IS 240 – Spring 2007

SLIDE 39IS 240 – Spring 2007

SLIDE 40IS 240 – Spring 2007

SLIDE 41IS 240 – Spring 2007

SLIDE 42IS 240 – Spring 2007

SLIDE 43IS 240 – Spring 2007

SLIDE 44IS 240 – Spring 2007

SLIDE 45IS 240 – Spring 2007

SLIDE 46IS 240 – Spring 2007 TREC Benefits: –made research systems scale to large collections (pre-WWW) –allows for somewhat controlled comparisons Drawbacks: –emphasis on high recall, which may be unrealistic for what most users want –very long queries, also unrealistic –comparisons still difficult to make, because systems are quite different on many dimensions –focus on batch ranking rather than interaction There is an interactive track.

SLIDE 47IS 240 – Spring 2007 TREC is changing Ad hoc track suspended in TREC 9 Emphasis on specialized “tracks” –Interactive track –Natural Language Processing (NLP) track –Multilingual tracks (Chinese, Spanish) –Filtering track –High-Precision –High-Performance

SLIDE 48IS 240 – Spring 2007 TREC Results Differ each year For the main track: –Best systems not statistically significantly different –Small differences sometimes have big effects how good was the hyphenation model how was document length taken into account –Systems were optimized for longer queries and all performed worse for shorter, more realistic queries

SLIDE 49IS 240 – Spring 2007 The TREC_EVAL Program Takes a “qrels” file in the form… – qid iter docno rel Takes a “top-ranked” file in the form… –qid iter docno rank sim run_id –030 Q0 ZF prise1 Produces a large number of evaluation measures. For the basic ones in a readable format use “-o” Demo…

SLIDE 50IS 240 – Spring 2007 Blair and Maron 1985 A classic study of retrieval effectiveness –earlier studies were on unrealistically small collections Studied an archive of documents for a legal suit –~350,000 pages of text –40 queries –focus on high recall –Used IBM’s STAIRS full-text system Main Result: –The system retrieved less than 20% of the relevant documents for a particular information need; lawyers thought they had 75% But many queries had very high precision

SLIDE 51IS 240 – Spring 2007 Blair and Maron, cont. How they estimated recall – generated partially random samples of unseen documents –had users (unaware these were random) judge them for relevance Other results: –two lawyers searches had similar performance –lawyers recall was not much different from paralegal’s

SLIDE 52IS 240 – Spring 2007 Blair and Maron, cont. Why recall was low –users can’t foresee exact words and phrases that will indicate relevant documents “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” … differing technical terminology slang, misspellings –Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied

SLIDE 53IS 240 – Spring 2007 What to Evaluate? Effectiveness –Difficult to measure –Recall and Precision are one way –What might be others?

SLIDE 54IS 240 – Spring 2007 Next Time No Class Thursday Attend Andrei Broder’s talk in CS this afternoon (4-5 in 310 Soda Hall) Next Week –Calculating standard IR measures and more on trec_eval –Theoretical limits of Precision and Recall –Intro to Alternative evaluation metrics