INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID


Lecture # 24: BENCHMARKS FOR THE EVALUATION OF IR SYSTEMS

ACKNOWLEDGEMENTS
The material for this lecture has been taken from the following sources:
- "Introduction to Information Retrieval" by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze
- "Managing Gigabytes" by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
- "Modern Information Retrieval" by Ricardo Baeza-Yates and Berthier Ribeiro-Neto
- "Web Information Retrieval" by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla

Outline
- Happiness: elusive to measure
- Gold standard search algorithm
- Relevance judgments
- Standard relevance benchmarks
- Evaluating an IR system

Happiness: elusive to measure
The most common proxy is the relevance of search results. But how do we measure relevance? We will detail a methodology here, then examine its issues.
Relevance measurement requires three elements:
- A benchmark document collection
- A benchmark suite of queries
- A (usually binary) assessment of Relevant or Nonrelevant for each query and each document
There is some work on more-than-binary judgments, but binary is the standard.

Human Labeled Corpora (Gold Standard)
- Start with a corpus of documents.
- Collect a set of queries for this corpus.
- Have one or more human experts exhaustively label the relevant documents for each query.
- Typically assumes binary relevance judgments.
- Requires considerable human effort for large document/query collections.
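As a rough sketch (not part of the lecture), such a gold standard can be represented as a document set, a query set, and a "qrels" mapping from each query to its judged-relevant documents; all names and data below are illustrative:

```python
# Minimal sketch of a gold-standard test collection (illustrative names/data).
from dataclasses import dataclass, field

@dataclass
class TestCollection:
    docs: dict[str, str]        # doc_id -> document text
    queries: dict[str, str]     # query_id -> query text
    qrels: dict[str, set[str]] = field(default_factory=dict)  # query_id -> relevant doc_ids

    def judge(self, query_id: str, doc_id: str, relevant: bool) -> None:
        """Record one binary human relevance judgment."""
        if relevant:
            self.qrels.setdefault(query_id, set()).add(doc_id)

# Example: one query, two documents, one judged relevant.
tc = TestCollection(
    docs={"d1": "how to clean a pool bottom", "d2": "pool table maintenance"},
    queries={"q1": "pool cleaner"},
)
tc.judge("q1", "d1", relevant=True)
tc.judge("q1", "d2", relevant=False)
print(tc.qrels)   # {'q1': {'d1'}}
```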

So you want to measure the quality of a new search algorithm
- Benchmark documents: nozama's products (5 million nozama.com products)
- Benchmark query suite (50,000 sample queries): more on this later
- Relevance judgments for each query-document pair
- Do we really need every query-document pair?

Relevance judgments
- Binary (relevant vs. non-relevant) in the simplest case; more nuanced (0, 1, 2, 3, ...) in others.
- What are some issues already?
- 5 million documents times 50K queries takes us into the range of a quarter trillion judgments.
- If each judgment took a human 2.5 seconds, we would still need on the order of 10^11 seconds, or nearly $300 million if you pay people $10 per hour.
- And the collection is not static: think 10K new products per day.
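A quick back-of-the-envelope calculation of that workload (the input figures are the ones quoted on the slide; the slide's totals are rounded to order of magnitude):

```python
# Back-of-the-envelope cost of exhaustive relevance judgments,
# using the per-unit figures quoted on the slide.
num_docs = 5_000_000          # benchmark documents
num_queries = 50_000          # benchmark queries
seconds_per_judgment = 2.5    # time for one human judgment
hourly_wage = 10.0            # dollars per hour

judgments = num_docs * num_queries               # 2.5e+11, "a quarter trillion"
total_seconds = judgments * seconds_per_judgment # ~6e+11 seconds
total_cost = total_seconds / 3600 * hourly_wage  # hundreds of millions to billions of dollars

print(f"{judgments:.2e} judgments, {total_seconds:.1e} s, ${total_cost:,.0f}")
```

Whichever rounding you use, the conclusion is the same: exhaustively judging every query-document pair is infeasible.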

Crowd-source relevance judgments?
- Present query-document pairs to low-cost labor on online crowd-sourcing platforms.
- Hope that this is cheaper than hiring qualified assessors.
- There is a large literature on using crowd-sourcing for such tasks.
- Main takeaway: you get some signal, but the variance in the resulting judgments is very high.
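One common way to cope with that variance (not covered on the slide) is to collect several crowd judgments per query-document pair and aggregate them, for example by majority vote. A minimal sketch, with illustrative names:

```python
# Minimal sketch: aggregate noisy crowd judgments by majority vote.
# Each (query_id, doc_id) pair is judged by several workers; labels are 0/1.
from collections import defaultdict

def majority_vote(crowd_labels):
    """crowd_labels: iterable of (query_id, doc_id, label) tuples with label in {0, 1}.
    Returns {(query_id, doc_id): aggregated_label}."""
    votes = defaultdict(list)
    for qid, did, label in crowd_labels:
        votes[(qid, did)].append(label)
    # A pair is judged relevant if more than half of its workers said so.
    return {pair: int(sum(v) > len(v) / 2) for pair, v in votes.items()}

labels = [
    ("q1", "d1", 1), ("q1", "d1", 1), ("q1", "d1", 0),   # 2 of 3 say relevant
    ("q1", "d2", 0), ("q1", "d2", 1), ("q1", "d2", 0),   # 2 of 3 say non-relevant
]
print(majority_vote(labels))   # {('q1', 'd1'): 1, ('q1', 'd2'): 0}
```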

What else? (Sec. 8.5)
We still need test queries:
- They must be germane to the documents available.
- They must be representative of actual user needs.
- Random query terms drawn from the documents are generally not a good idea.
- Sample from query logs if available (see the sketch after this list).
Classically (non-Web):
- Low query rates, so there are not enough query logs.
- Experts hand-craft "user needs".
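When a query log is available, a benchmark query suite can be drawn from it by simple random sampling. A sketch under assumed conditions (file name, one query per line, and sample size are illustrative; real pipelines would also filter and normalize queries):

```python
# Minimal sketch: sample a benchmark query suite from a query log.
import random

def sample_queries(log_path: str, k: int, seed: int = 42) -> list[str]:
    """Draw k distinct queries uniformly at random from a one-query-per-line log."""
    with open(log_path, encoding="utf-8") as f:
        queries = {line.strip() for line in f if line.strip()}  # deduplicate
    random.seed(seed)
    return random.sample(sorted(queries), min(k, len(queries)))

# e.g. benchmark = sample_queries("query_log.txt", k=50_000)
```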

Standard relevance benchmarks
- TREC (Text REtrieval Conference): the National Institute of Standards and Technology (NIST) has run a large IR test bed for many years.
- Reuters and other benchmark document collections are also used.
- "Retrieval tasks" are specified, sometimes as queries.
- Human experts mark, for each query and for each document, Relevant or Nonrelevant, or at least for the subset of documents that some system returned for that query.
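That last point is the idea behind pooling: only the union of the top-k results from the participating systems is sent to the assessors. A rough sketch (function and parameter names are illustrative, not TREC's own tooling):

```python
# Minimal sketch of pooling: judge only the union of each system's top-k results.
def build_pool(runs, k=100):
    """runs: {system_name: {query_id: [doc_id, ... ranked ...]}}.
    Returns {query_id: set of doc_ids to send to the human assessors}."""
    pool = {}
    for ranking_by_query in runs.values():
        for qid, ranked_docs in ranking_by_query.items():
            pool.setdefault(qid, set()).update(ranked_docs[:k])
    return pool

runs = {
    "systemA": {"q1": ["d3", "d1", "d7"]},
    "systemB": {"q1": ["d1", "d9"]},
}
print(build_pool(runs, k=2))   # {'q1': {'d3', 'd1', 'd9'}}
```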

Benchmarks
A benchmark collection contains:
- A set of standard documents and queries/topics.
- A list of relevant documents for each query.
Standard collections for traditional IR:
- SMART collection: ftp://ftp.cs.cornell.edu/pub/smart
- TREC: http://trec.nist.gov/
Evaluation setup (figure on slide): the algorithm under test runs the standard queries against the standard document collection; its retrieved results are compared with the standard (judged) results, typically using precision and recall.
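As a sketch of that evaluation step, set-based precision and recall for one query can be computed from the retrieved set and the judged-relevant set (illustrative names; rank-free, i.e. it ignores result ordering):

```python
# Minimal sketch: set-based precision and recall for one query.
def precision_recall(retrieved, relevant):
    """retrieved, relevant: sets of doc_ids. Returns (precision, recall)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: the system returned 4 documents and found 2 of the 3 relevant ones.
p, r = precision_recall({"d1", "d2", "d5", "d8"}, {"d1", "d2", "d9"})
print(p, r)   # 0.5 0.666...
```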

Some public test collections [table on slide, not reproduced here; lists typical TREC and other public test collections]

Evaluating an IR system
- Note: the user need is translated into a query.
- Relevance is assessed relative to the user need, not the query.
- E.g., information need: "My swimming pool bottom is becoming black and needs to be cleaned." Query: pool cleaner.
- Assess whether the document addresses the underlying need, not whether it merely contains these words.