SLIDE 1 IS 202 – FALL 2002. Lecture 20: Evaluation (2002.11.12). Prof. Ray Larson & Prof. Marc Davis, UC Berkeley SIMS. Tuesday and Thursday 10:30 am - 12:00 pm. Fall 2002, SIMS 202: Information Organization and Retrieval

SLIDE 2IS 202 – FALL 2002 Lecture Overview Review –Lexical Relations –WordNet –Can Lexical and Semantic Relations be Exploited to Improve IR? Evaluation of IR systems –Precision vs. Recall –Cutoff Points –Test Collections/TREC –Blair & Maron Study Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

SLIDE 3IS 202 – FALL 2002 Syntax The syntax of a language is to be understood as a set of rules which accounts for the distribution of word forms throughout the sentences of a language These rules codify permissible combinations of classes of word forms

SLIDE 4IS 202 – FALL 2002 Semantics Semantics is the study of linguistic meaning Two standard approaches to lexical semantics (cf., sentential semantics and logical semantics): –(1) Compositional –(2) Relational

SLIDE 5IS 202 – FALL 2002 Pragmatics Deals with the relation between signs or linguistic expressions and their users Deixis (literally “pointing out”) –E.g., “I’ll be back in an hour” depends upon the time of the utterance Conversational implicature –A: “Can you tell me the time?” –B: “Well, the milkman has come.” [I don’t know exactly, but perhaps you can deduce it from some extra information I give you.] Presupposition –“Are you still such a bad driver?” Speech acts –Constatives vs. performatives –E.g., “I second the motion.” Conversational structure –E.g., turn-taking rules

SLIDE 6IS 202 – FALL 2002 Major Lexical Relations Synonymy Polysemy Metonymy Hyponymy/Hyperonymy Meronymy Antonymy

SLIDE 7IS 202 – FALL 2002 Thesauri and Lexical Relations Polysemy: Same word, different senses of meaning –Slightly different concepts expressed similarly Synonyms: Different words, related senses of meanings –Different ways to express similar concepts Thesauri help draw all these together Thesauri also commonly define a set of relations between terms that is similar to lexical relations –BT, NT, RT

SLIDE 8IS 202 – FALL 2002 WordNet Started in 1985 by George Miller, students, and colleagues at the Cognitive Science Laboratory, Princeton University Can be downloaded for free: – “In terms of coverage, WordNet’s goals differ little from those of a good standard college-level dictionary, and the semantics of WordNet is based on the notion of word sense that lexicographers have traditionally used in writing dictionaries. It is in the organization of that information that WordNet aspires to innovation.” –(Miller, 1998, Chapter 1)

SLIDE 9 IS 202 – FALL 2002: WordNet: Size. (Table of unique strings and synsets by part of speech – noun, verb, adjective, adverb, and totals – counts not preserved in this transcript.) WordNet uses “synsets” – sets of synonymous terms.

SLIDES 10–12 IS 202 – FALL 2002: Structure of WordNet. (Diagrams not preserved in this transcript.)

SLIDE 13IS 202 – FALL 2002 Lexical Relations and IR Recall that most IR research has primarily looked at statistical approaches to inferring the topicality or meaning of documents I.e., Statistics imply Semantics –Is this really true or correct? How has (or might) WordNet be used to provide more functionality in searching? What about other thesauri, classification schemes, and ontologies?
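One concrete way the question above is often answered (not from the lecture): expand query terms with the synonyms in their WordNet synsets. Below is a minimal Python sketch using the NLTK interface to WordNet; the library choice and the helper name expand_query are assumptions for illustration, and NLTK's WordNet data must be downloaded first.

```python
# Hypothetical sketch: synonym-based query expansion via WordNet synsets.
# Assumes NLTK is installed and the WordNet corpus has been downloaded
# (e.g., nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def expand_query(terms):
    expanded = set(terms)
    for term in terms:
        for synset in wn.synsets(term):         # every sense of the term
            for lemma in synset.lemma_names():  # synonyms in that synset
                expanded.add(lemma.replace('_', ' ').lower())
    return expanded

print(expand_query(["automobile"]))  # includes "car", "auto", "motorcar", ...
```

Whether such expansion helps or hurts retrieval depends heavily on sense disambiguation, which is exactly the issue this slide raises.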

SLIDE 14 IS 202 – FALL 2002: Using NLP (Strzalkowski). Pipeline diagram: text → NLP representation → database search, where the NLP stage consists of a tagger, a parser, and term extraction.

SLIDE 15IS 202 – FALL 2002 NLP & IR: Possible Approaches Indexing –Use of NLP methods to identify phrases Test weighting schemes for phrases –Use of more sophisticated morphological analysis Searching –Use of two-stage retrieval Statistical retrieval Followed by more sophisticated NLP filtering

SLIDE 16IS 202 – FALL 2002 Can Statistics Approach Semantics? One approach is the Entry Vocabulary Index (EVI) work being done here… (The following slides are from my presentation at JCDL 2002)

SLIDE 17 IS 202 – FALL 2002: What is an Entry Vocabulary Index? EVIs are a means of mapping from a user’s vocabulary to the controlled vocabulary of a collection of documents…

SLIDE 18 IS 202 – FALL 2002: Solution: Entry Level Vocabulary Indexes. Diagram: the EVI maps the index caption “pass mtr veh spark ign eng” to “Automobile”.

SLIDE 19 IS 202 – FALL 2002: Diagram: find “Plutonium” in Arabic, Chinese, Greek, Japanese, Korean, Russian, and Tamil – statistical association links the query vocabulary to digital library resources.
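The slides only name the technique (statistical association); the sketch below is a loose, hypothetical illustration of learning a mapping from free-text words to controlled-vocabulary categories from training pairs, not the actual EVI algorithm from the JCDL 2002 work. Function names and the toy training data are invented.

```python
# Hypothetical sketch of word-to-category statistical association,
# in the spirit of (but not identical to) an Entry Vocabulary Index.
from collections import defaultdict

def train(pairs):
    """pairs: iterable of (free_text, category) drawn from a classified collection."""
    counts = defaultdict(lambda: defaultdict(int))
    for text, category in pairs:
        for word in text.lower().split():
            counts[word][category] += 1
    return counts

def lookup(counts, word, k=3):
    """Return up to k categories most strongly associated with the word."""
    ranked = sorted(counts[word.lower()].items(), key=lambda kv: -kv[1])
    return [category for category, _ in ranked[:k]]

model = train([
    ("pass mtr veh spark ign eng", "Automobile"),
    ("spark ignition engine design", "Automobile"),
    ("plutonium reactor fuel rods", "Nuclear materials"),
])
print(lookup(model, "spark"))      # ['Automobile']
print(lookup(model, "plutonium"))  # ['Nuclear materials']
```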

SLIDE 20IS 202 – FALL 2002 Lecture Overview Review –Lexical Relations –WordNet –Can Lexical and Semantic Relations be Exploited to Improve IR? Evaluation of IR systems –Precision vs. Recall –Cutoff Points –Test Collections/TREC –Blair & Maron Study Credit for some of the slides in this lecture goes to Marti Hearst and Warren Sack

SLIDE 21IS 202 – FALL 2002 IR Evaluation Why Evaluate? What to Evaluate? How to Evaluate?

SLIDE 22IS 202 – FALL 2002 Why Evaluate? Determine if the system is desirable Make comparative assessments –Is system X better than system Y? Others?

SLIDE 23 IS 202 – FALL 2002: What to Evaluate? How much of the information need is satisfied. How much was learned about a topic. Incidental learning: –How much was learned about the collection –How much was learned about other topics. How inviting the system is.

SLIDE 24IS 202 – FALL 2002 Relevance In what ways can a document be relevant to a query? –Answer precise question precisely –Partially answer question –Suggest a source for more information –Give background information –Remind the user of other knowledge –Others...

SLIDE 25IS 202 – FALL 2002 Relevance How relevant is the document? –For this user for this information need Subjective, but Measurable to some extent –How often do people agree a document is relevant to a query? How well does it answer the question? –Complete answer? Partial? –Background Information? –Hints for further exploration?

SLIDE 26 IS 202 – FALL 2002: What to Evaluate? Effectiveness. What can be measured that reflects users’ ability to use the system? (Cleverdon 66) –Coverage of information –Form of presentation –Effort required/ease of use –Time and space efficiency –Recall: proportion of relevant material actually retrieved –Precision: proportion of retrieved material actually relevant

SLIDE 27 IS 202 – FALL 2002: Relevant vs. Retrieved. (Venn diagram: the relevant set and the retrieved set within the collection of all docs.)

SLIDE 28 IS 202 – FALL 2002: Precision vs. Recall. (Venn diagram: precision and recall defined by the overlap of the relevant and retrieved sets within all docs.)
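For reference, the quantities these Venn diagrams depict can be written set-theoretically (matching the Cleverdon-style proportions on the earlier “What to Evaluate? Effectiveness” slide), with Rel the set of relevant documents and Ret the set of retrieved documents:

```latex
\text{Precision} = \frac{|\mathit{Rel} \cap \mathit{Ret}|}{|\mathit{Ret}|},
\qquad
\text{Recall} = \frac{|\mathit{Rel} \cap \mathit{Ret}|}{|\mathit{Rel}|}
```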

SLIDE 29IS 202 – FALL 2002 Why Precision and Recall? Get as much good stuff while at the same time getting as little junk as possible

SLIDE 30 IS 202 – FALL 2002: Retrieved vs. Relevant Documents – very high precision, very low recall.

SLIDE 31 IS 202 – FALL 2002: Retrieved vs. Relevant Documents – very low precision, very low recall (0 in fact).

SLIDE 32 IS 202 – FALL 2002: Retrieved vs. Relevant Documents – high recall, but low precision.

SLIDE 33 IS 202 – FALL 2002: Retrieved vs. Relevant Documents – high precision, high recall (at last!).

SLIDE 34 IS 202 – FALL 2002: Precision/Recall Curves. There is a tradeoff between precision and recall, so measure precision at different levels of recall. Note: this is an AVERAGE over MANY queries. (Plot: precision vs. recall curve.)
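A sketch of how such an averaged curve is commonly computed: interpolated precision at the 11 standard recall levels (0.0, 0.1, …, 1.0), averaged over queries. This is the standard recipe rather than code from the lecture; the function name and toy data are illustrative.

```python
# Interpolated precision at 11 recall levels, averaged over queries.
def eleven_point_curve(ranked_lists, relevant_sets):
    levels = [i / 10 for i in range(11)]
    sums = [0.0] * 11
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        # (recall, precision) after each retrieved document
        hits, points = 0, []
        for i, doc in enumerate(ranking, start=1):
            if doc in relevant:
                hits += 1
            points.append((hits / len(relevant), hits / i))
        # interpolated precision: best precision at any recall >= this level
        for j, level in enumerate(levels):
            sums[j] += max((p for r, p in points if r >= level), default=0.0)
    return [s / len(ranked_lists) for s in sums]

curve = eleven_point_curve([["d3", "d1", "d7", "d2"]], [{"d1", "d2"}])
print(curve)  # averaged precision at recall 0.0, 0.1, ..., 1.0
```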

SLIDE 35 IS 202 – FALL 2002: Precision/Recall Curves. It is difficult to determine which of these two hypothetical results is better. (Plot: two precision vs. recall curves.)

SLIDE 36 IS 202 – FALL 2002: TREC (Manual Queries). (Figure not preserved in this transcript.)

SLIDE 37 IS 202 – FALL 2002: Document Cutoff Levels. Another way to evaluate: –Fix the number of documents retrieved at several cutoff levels: Top 5, Top 10, Top 20, Top 50, Top 100, Top 500 –Measure precision at each of these levels –Take a (weighted) average over the results. This is a way to focus on how well the system ranks the first k documents.
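A minimal sketch of precision at these document cutoff levels (standard P@k; the function name and toy data are illustrative):

```python
# Precision at fixed cutoff levels: fraction of the top-k retrieved
# documents that are relevant.
def precision_at_cutoffs(ranking, relevant, cutoffs=(5, 10, 20, 50, 100, 500)):
    return {k: sum(1 for doc in ranking[:k] if doc in relevant) / k
            for k in cutoffs}

p_at_k = precision_at_cutoffs(["d3", "d1", "d7", "d2", "d9"], {"d1", "d2"})
print(p_at_k[5])  # 2 relevant documents in the top 5 -> 0.4
```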

SLIDE 38IS 202 – FALL 2002 Problems with Precision/Recall Can’t know true recall value –Except in small collections Precision/Recall are related –A combined measure sometimes more appropriate Assumes batch mode –Interactive IR is important and has different criteria for successful searches –We will touch on this in the UI section Assumes a strict rank ordering matters

SLIDE 39 IS 202 – FALL 2002: Relation to Contingency Table

                         Doc is relevant    Doc is NOT relevant
  Doc is retrieved             a                    b
  Doc is NOT retrieved         c                    d

Accuracy: (a+d) / (a+b+c+d)
Precision: a / (a+b)
Recall: ?
Why don’t we use accuracy for IR evaluation? (Assuming a large collection) –Most docs aren’t relevant –Most docs aren’t retrieved –Inflates the accuracy value
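A small sketch tying the contingency cells to the measures on this slide; recall (the slide’s open question) is a / (a + c). The toy numbers are invented to show why accuracy is inflated when most documents are neither relevant nor retrieved.

```python
# Accuracy, precision, and recall from the contingency-table cells.
def contingency_measures(a, b, c, d):
    return {
        "accuracy":  (a + d) / (a + b + c + d),
        "precision": a / (a + b),
        "recall":    a / (a + c),
    }

# In a 1,000,000-document collection, almost everything falls in cell d,
# so accuracy looks excellent even when precision and recall are poor.
print(contingency_measures(a=10, b=40, c=90, d=999_860))
# {'accuracy': 0.99987, 'precision': 0.2, 'recall': 0.1}
```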

SLIDE 40IS 202 – FALL 2002 The E-Measure Combine Precision and Recall into one number (van Rijsbergen 79) P = precision R = recall b = measure of relative importance of P or R For example, b = 0.5 means user is twice as interested in precision as recall
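The formula itself did not survive in this transcript; the standard form of van Rijsbergen’s E measure, consistent with the description above (b < 1 emphasizes precision, b > 1 emphasizes recall), is:

```latex
E_b = 1 - \frac{(1 + b^{2})\,P\,R}{b^{2}P + R}
```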

SLIDE 41IS 202 – FALL 2002 F Measure (Harmonic Mean)
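The F measure formula is likewise missing from the transcript; the harmonic mean of precision and recall (the balanced b = 1 case, where F = 1 - E) is:

```latex
F = \frac{2\,P\,R}{P + R}
```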

SLIDE 42 IS 202 – FALL 2002: Test Collections
–Cranfield 2
  –1400 documents, 221 queries
  –200 documents, 42 queries
–INSPEC: 542 documents, 97 queries
–UKCIS: > documents (count not preserved in transcript), multiple sets, 193 queries
–ADI: 82 documents, 35 queries
–CACM: 3204 documents, 50 queries
–CISI: 1460 documents, 35 queries
–MEDLARS (Salton): 273 documents, 18 queries

SLIDE 43 IS 202 – FALL 2002: TREC. Text REtrieval Conference/Competition –Run by NIST (National Institute of Standards & Technology) –1999 was the 8th year; the 9th TREC was held in early November. Collection: >6 gigabytes (5 CD-ROMs), >1.5 million docs –Newswire & full-text news (AP, WSJ, Ziff, FT) –Government documents (Federal Register, Congressional Record) –Radio transcripts (FBIS) –Web “subsets” (the separate “Large Web” collection has 18.5 million pages of Web data – 100 GB) –Patents

SLIDE 44IS 202 – FALL 2002 TREC (cont.) Queries + Relevance Judgments –Queries devised and judged by “Information Specialists” –Relevance judgments done only for those documents retrieved—not entire collection! Competition –Various research and commercial groups compete (TREC 6 had 51, TREC 7 had 56, TREC 8 had 66) –Results judged on precision and recall, going up to a recall level of 1000 documents Following slides from TREC overviews by Ellen Voorhees of NIST
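The slide above notes that runs are judged on precision and recall down to a cutoff of 1000 documents. One common per-topic summary of such a ranked run is (non-interpolated) average precision; the sketch below is a generic illustration, not code from the lecture or from NIST’s trec_eval.

```python
# Non-interpolated average precision for one topic, over a ranked run
# truncated at a fixed cutoff (here 1000 documents).
def average_precision(ranking, relevant, cutoff=1000):
    hits, precisions = 0, []
    for i, doc in enumerate(ranking[:cutoff], start=1):
        if doc in relevant:
            hits += 1
            precisions.append(hits / i)  # precision at each relevant document
    return sum(precisions) / len(relevant) if relevant else 0.0

print(average_precision(["d3", "d1", "d7", "d2"], {"d1", "d2"}))  # 0.5
```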

SLIDES 45–49 IS 202 – FALL 2002: (TREC overview figures from Ellen Voorhees of NIST – not preserved in this transcript.)

SLIDE 50IS 202 – FALL 2002 Sample TREC Query (Topic) Number: 168 Topic: Financing AMTRAK Description: A document will address the role of the Federal Government in financing the operation of the National Railroad Transportation Corporation (AMTRAK) Narrative: A relevant document must provide information on the government’s responsibility to make AMTRAK an economically viable entity. It could also discuss the privatization of AMTRAK as an alternative to continuing government subsidies. Documents comparing government subsidies given to air and bus transportation with those provided to AMTRAK would also be relevant.

SLIDES 51–55 IS 202 – FALL 2002: (Figures not preserved in this transcript.)

SLIDE 56IS 202 – FALL 2002 TREC Benefits: –Made research systems scale to large collections (pre-WWW) –Allows for somewhat controlled comparisons Drawbacks: –Emphasis on high recall, which may be unrealistic for what most users want –Very long queries, also unrealistic –Comparisons still difficult to make, because systems are quite different on many dimensions –Focus on batch ranking rather than interaction There is an interactive track

SLIDE 57IS 202 – FALL 2002 TREC is Changing Emphasis on specialized “tracks” –Interactive track –Natural Language Processing (NLP) track –Multilingual tracks (Chinese, Spanish) –Filtering track –High-Precision –High-Performance

SLIDE 58 IS 202 – FALL 2002: Blair and Maron 1985. A classic study of retrieval effectiveness –Earlier studies were on unrealistically small collections. Studied an archive of documents for a lawsuit –~350,000 pages of text –40 queries –Focus on high recall –Used IBM’s STAIRS full-text system. Main result: –The system retrieved less than 20% of the relevant documents for a particular information need –The lawyers thought they had found 75% –But many queries had very high precision

SLIDE 59 IS 202 – FALL 2002: Blair and Maron (cont.). How they estimated recall –Generated partially random samples of unseen documents –Had users (unaware these were random) judge them for relevance. Other results: –The two lawyers’ searches had similar performance –The lawyers’ recall was not much different from the paralegals’
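A back-of-the-envelope sketch of the sampling arithmetic described above. This is an illustration of the idea, not Blair and Maron’s exact procedure, and all numbers are invented.

```python
# Estimate recall when the full relevant set is unknown: judge a random
# sample of the unretrieved documents and extrapolate the miss count.
def estimated_recall(relevant_retrieved, sample_size, sample_relevant,
                     unretrieved_total):
    est_missed = (sample_relevant / sample_size) * unretrieved_total
    return relevant_retrieved / (relevant_retrieved + est_missed)

# e.g. 200 relevant documents retrieved; 5 of a 500-document random sample
# of the 100,000 unretrieved documents judged relevant -> ~1,000 missed.
print(estimated_recall(200, 500, 5, 100_000))  # ~0.167
```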

SLIDE 60IS 202 – FALL 2002 Blair and Maron (cont.) Why recall was low –Users can’t foresee exact words and phrases that will indicate relevant documents “accident” referred to by those responsible as: “event,” “incident,” “situation,” “problem,” … Differing technical terminology Slang, misspellings –Perhaps the value of higher recall decreases as the number of relevant documents grows, so more detailed queries were not attempted once the users were satisfied