Assessing the Retrieval
Chapter 2 considered various ways of breaking text into indexable features. Chapter 3 considered various ways of weighting combinations of those features to find the best match to a query. So many alternatives are possible – which combination is best? Users can give their personal points of view through relevance feedback. The system builder wants to construct a search engine that robustly finds the right documents for each query – an omniscient expert often determines which documents should have been retrieved.

Personal Assessment of Relevance
What is relevance? The lack of a fully satisfactory definition of a core concept (e.g. relevance, information, intelligence) does not entirely stop progress. How thorough a search does the user wish to perform? A single football result, every piece of science that might cure a patient, or all previous relevant court cases? This variability can be observed across different users, and even across the same user at different times.

Prototypic retrievals
Relevance feedback (Oddy) treats retrieval as a task of object recognition. The object to be recognised is an internally represented prototypic “ideal” document satisfying the user’s information need. For each retrieved document, users judge how well it matches this prototype. We assume that users are capable of grading the quality of the match, e.g. on a five-point scale: not_relevant, no_response, possibly_relevant, relevant, critically_relevant. Extreme judgements are the most useful for relevance feedback.

Relevance Feedback (RF) is Nonmetric
While users find it easy to critique documents with + (relevant), # (neutral), and - (not_relevant), they would find it harder to assign numeric quantities reflecting the exact degree of relevance. For example, if the best document was rated 10, the second 6 and the third 2, could we be sure that the difference in relevance between the best and the second was exactly the same as the difference between the second and the third?
Types of scales:
Ratio: 2 metres is twice as long as 1 metre.
Interval: the gap between 30 degrees and 20 degrees is the same as the gap between 20 degrees and 10 degrees – but 20 degrees is not exactly twice as hot as 10 degrees.
Ordinal: e.g. + > # > - in RF.
Nominal: separate unranked categories, e.g. noun, verb, adjective.

Extending the dialogue with RF
In RF, users’ reactions to just-retrieved documents provide the link between assessments, forming the FOA search dialogue. Do we assess retrieval performance according to which documents come out the first time around, or after a number of iterations of “berrypicking” (Bates, 1989)? In reality, an assessment of one document’s relevance will depend greatly on the “basket” of other documents we have already seen – but IR evaluation usually makes the independence assumption.

Using RF for Query Refinement (1)
We expect that there is some localised region in vector space where + (relevant) documents are most likely to occur. If these positively rated documents are in fact clustered, we can consider a hypothetical centroid (average) document d+, which is at the centre of all the documents the users have rated relevant.
To calculate the centroid, consider for example a vocabulary of [apple, bean, carrot]:
Let doc_vector1 be [1, 0, 0]
Let doc_vector2 be [2, 1, 0]
Let doc_vector3 be [1, 1, 0]
Then d+ = [1.33, 0.67, 0]
It is less reasonable to imagine that negatively labelled documents are similarly clustered.
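A minimal sketch of the centroid calculation above, in Python; the vocabulary and document vectors are the toy example from this slide.

```python
# Toy vocabulary from the slide: [apple, bean, carrot]
doc_vectors = [
    [1, 0, 0],  # doc_vector1
    [2, 1, 0],  # doc_vector2
    [1, 1, 0],  # doc_vector3
]

def centroid(vectors):
    """Component-wise average of the positively rated document vectors."""
    n = len(vectors)
    return [round(sum(component) / n, 2) for component in zip(*vectors)]

d_plus = centroid(doc_vectors)
print(d_plus)  # [1.33, 0.67, 0.0]
```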

RF for Query Refinement (2)
Most typically, RF is used to refine the user’s query – we “take a step toward” the centroid of the positively rated cluster d+. The size of this step can vary, e.g. half the distance between the original query and d+:
d+ = [1.33, 0.67, 0]
Original_query = [1, 1, 1]
New_query = [1.17, 0.83, 0.5]
Negative RF involves “taking a step away from” the centroid of the negatively rated documents, but this works less well because that cluster is less well defined.
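A minimal sketch of this half-step query update (a simplified Rocchio-style move toward d+); the step size of 0.5 and the vectors reproduce the example above.

```python
def refine_query(query, d_plus, step=0.5):
    """Move the query part of the way toward the positive centroid d+."""
    return [q + step * (d - q) for q, d in zip(query, d_plus)]

original_query = [1.0, 1.0, 1.0]
d_plus = [4 / 3, 2 / 3, 0.0]  # exact fractions behind the slide's 1.33, 0.67, 0
new_query = refine_query(original_query, d_plus)
print([round(x, 2) for x in new_query])  # [1.17, 0.83, 0.5]
```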

RF: Research Ideas
1. Make changes to the documents rather than the query (see next slide). Adaptive document modifications made in response to RF are not expected to be of (immediate) use to the users who provide it, but are made available to later searchers: useful documents are moved slowly into the part of the semantic space where users’ queries are concentrated (Salton & McGill, 1983).
2. Weight propagation (Yamout, two slides away).

Documents Propagating Positive and Negative Weights (Fadi Yamout, 2006)

Search Engine Performance
We have discussed RF from the user’s point of view and how this information can be used to modify users’ retrievals. Another use of RF information is to evaluate which search engine is doing a better job. If one system can consistently, across a range of typical queries, retrieve more of the documents that users mark as relevant and fewer that they mark as irrelevant, then that system is doing a better job.

Underlying assumptions
Real FOA vs. laboratory retrieval: we assume the lab setting is similar to real life, i.e. “guinea pig” users will have reactions that mirror real ones.
Intersubject reliability = consistency between users. But users differ in education, time available, preferred writing styles, etc. See consensual relevance and the Kappa statistic (a sketch of Kappa follows this slide).
The relevance of a document can be assessed independently of assessments of other documents – a questionable assumption.
We are assessing the document proxy rather than the document itself.
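To illustrate the Kappa statistic mentioned above, here is a minimal sketch (not from the original slides) that computes Cohen's kappa for two assessors' binary relevance judgements; the judgement lists are invented for the example.

```python
def cohens_kappa(judge_a, judge_b):
    """Cohen's kappa: agreement between two assessors, corrected for chance."""
    n = len(judge_a)
    observed = sum(a == b for a, b in zip(judge_a, judge_b)) / n
    # Chance agreement from each judge's marginal label frequencies.
    labels = set(judge_a) | set(judge_b)
    expected = sum((judge_a.count(label) / n) * (judge_b.count(label) / n) for label in labels)
    return (observed - expected) / (1 - expected)

# Hypothetical relevance judgements for 8 documents (1 = relevant, 0 = not relevant).
judge_a = [1, 1, 0, 1, 0, 0, 1, 0]
judge_b = [1, 0, 0, 1, 0, 1, 1, 0]
print(cohens_kappa(judge_a, judge_b))  # 0.5: moderate agreement beyond chance
```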

Traditional Evaluation Methodologies
When text corpora were small, it was possible to compare a set of test queries exhaustively against every document in the corpus, e.g. the Cranfield collection: 1400 documents on aerodynamics, with 221 queries generated by some of the documents’ authors.
The Text REtrieval Conference (TREC) is held annually for search engine evaluation. It uses much larger corpora and avoids exhaustive assessment of all documents by the pooling method: each search engine is run independently, and their results (the top k documents, e.g. k = 100) are pooled to form a set of documents that are at least potentially relevant. All unassessed documents are assumed to be irrelevant.
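A minimal sketch of the pooling method described above; the ranked runs and pool depth are invented for illustration.

```python
def build_pool(runs, k=100):
    """Union of the top-k documents from each system's ranked run."""
    pool = set()
    for ranked_docs in runs:
        pool.update(ranked_docs[:k])
    return pool

# Hypothetical ranked runs (document ids) from three search engines.
runs = [
    ["d3", "d1", "d7", "d9"],
    ["d1", "d2", "d3", "d8"],
    ["d5", "d1", "d4", "d6"],
]
pool = build_pool(runs, k=3)
print(sorted(pool))  # ['d1', 'd2', 'd3', 'd4', 'd5', 'd7']
# Only pooled documents are judged; everything else is assumed irrelevant.
```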

Recall and Precision: A reminder
Recall = | Ret ∩ Rel | / | Rel |
Precision = | Ret ∩ Rel | / | Ret |
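A small sketch of these two definitions using Python sets; the document ids are placeholders.

```python
def recall(retrieved, relevant):
    """|Ret ∩ Rel| / |Rel|"""
    return len(retrieved & relevant) / len(relevant)

def precision(retrieved, relevant):
    """|Ret ∩ Rel| / |Ret|"""
    return len(retrieved & relevant) / len(retrieved)

retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d2", "d4", "d5"}
print(recall(retrieved, relevant))     # 2/3 ≈ 0.67
print(precision(retrieved, relevant))  # 2/4 = 0.5
```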

Notes on the recall/precision curves
In Figure 4.10, the top-ranked document is relevant, the second relevant, the third non-relevant, etc. The dotted line shows what would happen if the top-ranked document were non-relevant and the second and third relevant (i.e. the judgements for the top- and third-ranked documents are swapped relative to the solid curve).
The best retrieval envelope would be achieved if the top 5 ranked documents were all relevant and all lower-ranked documents non-relevant. The worst retrieval envelope would be achieved if the top-ranked documents were all non-relevant and the 5 lowest-ranked documents were all relevant.
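A sketch (not from the slides) of how the raw recall/precision points behind such a curve can be computed from a ranked list of binary relevance judgements; the judgement list is invented.

```python
def recall_precision_points(ranked_relevance, total_relevant):
    """Recall/precision after each rank position of a ranked result list."""
    points, hits = [], 0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        hits += is_relevant
        points.append((hits / total_relevant, hits / rank))
    return points

# 1 = relevant, 0 = non-relevant, mirroring the ordering described above.
judgements = [1, 1, 0, 1, 0]
for r, p in recall_precision_points(judgements, total_relevant=3):
    print(f"recall={r:.2f} precision={p:.2f}")
```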

Multiple retrievals across a set of queries
Figure 4.14 shows R/P curves for two queries. Even with these two queries, there is no guarantee that we will have R/P data points at any particular recall level. This necessitates interpolation of data points at desired recall levels, e.g. 0, 0.25, 0.5, 0.75 and 1. The 11-point average curve finds the average precision over a set of queries at recall levels of 0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9 and 1. Another useful measure for web search engines is the precision over the top 10 ranked documents.
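A sketch of the interpolation step for a single query, under the common convention that interpolated precision at a recall level is the maximum precision achieved at that recall level or above; the input points reuse the helper from the previous sketch.

```python
def interpolated_precision(points, recall_levels=None):
    """Interpolated precision: max precision at recall >= each target level."""
    if recall_levels is None:
        recall_levels = [i / 10 for i in range(11)]  # 0.0, 0.1, ..., 1.0
    interpolated = []
    for level in recall_levels:
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates) if candidates else 0.0)
    return interpolated

# (recall, precision) points for one query, e.g. from recall_precision_points().
points = [(0.33, 1.0), (0.67, 1.0), (0.67, 0.67), (1.0, 0.75), (1.0, 0.6)]
print(interpolated_precision(points))
# Averaging these 11 values over a set of queries gives the 11-point average curve.
```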

Combining Precision and Recall
Jardine & Van Rijsbergen’s F-measure is the “harmonic mean” of Precision and Recall:
F = (2 * P * R) / (P + R)
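A one-line sketch of the formula, continuing the toy precision/recall numbers used earlier.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return (2 * precision * recall) / (precision + recall)

print(f_measure(0.5, 2 / 3))  # ≈ 0.571 for the earlier precision/recall example
```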