Top-K documents: Exact retrieval

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

Goal
Given a query Q, find the exact top K docs for Q, using some ranking function r.
Simplest strategy:
- Find all documents in the intersection of the query terms' postings
- Compute the score r(d) for each such document d
- Sort the results by score
- Return the top K results
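A minimal sketch of this exhaustive strategy, assuming an in-memory inverted index that maps each term to a dict of docID → per-term score r(t, d) (the index layout and names are illustrative, not from the slides):

import heapq
from functools import reduce

def exhaustive_top_k(index, query_terms, K):
    # index: term -> {docID: r(t, d)}
    postings = [index[t] for t in query_terms if t in index]
    if not postings:
        return []
    # Intersection: keep only the documents containing every query term
    candidates = reduce(lambda a, b: a & b, (set(p) for p in postings))
    # Score each candidate as the sum of its per-term scores, then keep the K best
    scored = ((sum(p[d] for p in postings), d) for d in candidates)
    return heapq.nlargest(K, scored)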

Background
- Score computation is a large fraction of the CPU work on a query
- Generally, we have a tight budget on latency (say, 100 ms), so we cannot exhaustively score every document on every query
- Goal: cut CPU usage for scoring without compromising (too much) the quality of the results
- Basic idea: avoid scoring docs that cannot make it into the top K

The WAND technique
- We wish to compute the top K docs for a query
- WAND is a pruning method: it maintains a heap over the scores of the current top-K candidates, and it provably returns the exact top K
- The basic idea is reminiscent of branch and bound:
  - Maintain a running threshold = the K-th highest score computed so far
  - Prune away all docs whose scores are guaranteed to be below the threshold
  - Compute exact scores only for the un-pruned docs

Index structure for WAND
- Postings are ordered by docID
- Assume a special iterator on the postings of the form "go to the first docID greater than or equal to X", implemented e.g. with skip pointers or with Elias-Fano compressed lists
- We keep a "pointer" (current position) in the postings list of each query term; each pointer moves only to the right, to larger docIDs
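A minimal sketch of such an iterator over a docID-sorted postings list, with plain binary search standing in for skip pointers or Elias-Fano (names are illustrative):

import bisect

class PostingsIterator:
    """Cursor over a docID-sorted postings list; it only moves rightwards."""
    def __init__(self, doc_ids):
        self.doc_ids = doc_ids   # sorted list of docIDs
        self.pos = 0             # current position in the list

    def current(self):
        # Current docID, or None if the list is exhausted
        return self.doc_ids[self.pos] if self.pos < len(self.doc_ids) else None

    def next_geq(self, x):
        # Move to the first docID >= x, never moving backwards
        self.pos = max(self.pos, bisect.bisect_left(self.doc_ids, x))
        return self.current()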

Score functions
We assume that:
- r(t,d) = score of term t in document d
- The score of document d is the sum of the scores of the query terms: r(d) = r(t1,d) + … + r(tn,d)
- For each query term t there is an upper bound UB(t) such that r(t,d) ≤ UB(t) for all d
- These upper bounds are pre-computed and stored
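For instance, with the toy index layout used above, the upper bounds can be precomputed in one pass (hypothetical helpers, just to fix notation):

def precompute_upper_bounds(index):
    # UB(t) = max over all documents d of r(t, d)
    return {t: max(postings.values()) for t, postings in index.items()}

def score(index, query_terms, d):
    # r(d) = sum of r(t, d) over the query terms occurring in d
    return sum(index[t].get(d, 0.0) for t in query_terms if t in index)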

Threshold
- We maintain, inductively, a threshold θ such that every document d in the final top-K satisfies r(d) ≥ θ
- θ can be initialized to 0
- θ can be raised as soon as we have found K documents satisfying the query
- θ can keep being raised whenever the "worst" of the currently found top-K documents has a score above the current threshold

The algorithm
Sort the pointers to the inverted lists by increasing (current) document id.
Example query: "catcher in the rye", with the pointers currently at: rye → 304, catcher → 273, the → 762, in → 589

Sort the pointers
After sorting by current document id: catcher → 273, rye → 304, in → 589, the → 762

Find the pivot
The "pivot" is the first pointer in this order for which the sum of the upper bounds of the terms up to and including it is at least equal to the threshold.
Example: UB(catcher) = 2.3, UB(rye) = 1.8, UB(in) = 3.3, UB(the) = 4.3, threshold = 6.8.
The cumulative sums are 2.3, 4.1, 7.4, …, so the pivot is the pointer of "in", currently at docID 589.

Advance to the pivot
Advance all the preceding pointers (catcher and rye) until they reach the pivot docID (589): all documents skipped on the way are certainly not part of the result, because they can only contain terms whose upper bounds sum to less than the threshold.

Prune docs that have no hope
Documents 273 and 304 are hopeless: among the remaining candidates they can contain at most catcher and rye, and UB(catcher) + UB(rye) = 4.1 < 6.8, so their scores cannot reach the threshold. The pointers of catcher and rye therefore jump to the first docID ≥ 589.

Compute 589's score if need be
- If 589 now appears in enough of the postings lists (soft AND), compute its full score; otherwise move the pointers past 589
- If 589 enters the current top-K, update the threshold
- Then pivot again and repeat: in the example, catcher, rye and in are now all at 589, while the is at 762
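A compact sketch of this pivoting loop, built on the PostingsIterator and UB(t) helpers sketched earlier (a simplified illustration of the technique, not a production implementation):

import heapq

def wand_top_k(iterators, ub, score, K):
    # iterators: term -> PostingsIterator; ub: term -> UB(t)
    # score(docID) computes the full score r(d); returns the exact top-K (score, docID) pairs
    heap, threshold = [], 0.0                  # min-heap holding the current top-K scores
    while True:
        # Order the cursors by their current docID, dropping exhausted lists
        cursors = sorted((t for t in iterators if iterators[t].current() is not None),
                         key=lambda t: iterators[t].current())
        if not cursors:
            break
        # Find the pivot: the first prefix whose summed upper bounds reach the threshold
        acc, pivot = 0.0, None
        for i, t in enumerate(cursors):
            acc += ub[t]
            if acc >= threshold:
                pivot = i
                break
        if pivot is None:
            break                              # no remaining document can reach the threshold
        pivot_doc = iterators[cursors[pivot]].current()
        if all(iterators[t].current() == pivot_doc for t in cursors[:pivot + 1]):
            # All preceding cursors sit on the pivot doc: compute its full score
            s = score(pivot_doc)
            if len(heap) < K:
                heapq.heappush(heap, (s, pivot_doc))
            elif s > heap[0][0]:
                heapq.heapreplace(heap, (s, pivot_doc))
            if len(heap) == K:
                threshold = heap[0][0]         # K-th highest score found so far
            for t in cursors:                  # move every cursor sitting on the pivot doc past it
                if iterators[t].current() == pivot_doc:
                    iterators[t].next_geq(pivot_doc + 1)
        else:
            # The leftmost cursor is strictly before the pivot doc: skip it forward to the pivot
            iterators[cursors[0]].next_geq(pivot_doc)
    return sorted(heap, reverse=True)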

WAND summary
- In tests, WAND leads to a 90+% reduction in score computations, with better gains on longer queries
- WAND gives us safe ranking: the returned top-K is exact
- "Careless" variants can be devised that are a bit faster, but are no longer safe

Blocked WAND
- In WAND, UB(t) is computed over the full postings list of t, so it may be a loose bound
- To avoid this: partition each list into blocks, and store for each block b the maximum score UB_b(t) among the docIDs stored in it

The new algorithm: Block-Max WAND (2-level checks)
- As in classic WAND: find the pivot list p via the threshold q of the heap, and let d be the current docID in list(p)
- Move the pointers/blocks in lists 0..p-1 to the blocks that may contain d, i.e. whose docID-ranges overlap d
- Sum the upper bounds UB_b of those blocks:
  - if the sum ≤ q, then d cannot make it: skip past the block whose right end is leftmost, and repeat
  - otherwise compute score(d); if score(d) ≤ q, move the iterators to the next docIDs > d
  - if instead score(d) > q, insert d into the heap and re-evaluate the threshold q
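A minimal sketch of the per-block metadata that enables these two-level checks (block size and dict layout are illustrative assumptions):

def build_block_maxima(postings, scores, block_size=64):
    # postings: sorted docIDs of term t; scores[d] = r(t, d)
    # Returns one record per block: its docID range and its local upper bound UB_b(t)
    blocks = []
    for i in range(0, len(postings), block_size):
        chunk = postings[i:i + block_size]
        blocks.append({"first_doc": chunk[0],
                       "last_doc": chunk[-1],
                       "ub": max(scores[d] for d in chunk)})
    return blocks

def block_ub(blocks, d):
    # Upper bound on r(t, d): the local UB of the block whose docID range covers d,
    # or 0.0 if d falls outside every block (the term cannot occur in d)
    for b in blocks:
        if b["first_doc"] <= d <= b["last_doc"]:
            return b["ub"]
    return 0.0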

Document re-ranking: Relevance feedback

Relevance feedback (Sec. 9.1)
Relevance feedback: user feedback on the relevance of the docs in an initial set of results.
- The user issues a (short, simple) query
- The user marks some results as relevant or non-relevant
- The system computes a better representation of the information need based on this feedback
- Relevance feedback can go through one or more iterations

Rocchio (SMART) (Sec. 9.1.1)
Used in practice:
- Dr = set of known relevant doc vectors
- Dnr = set of known irrelevant doc vectors
- qm = modified query vector; q0 = original query vector
- α, β, γ: weights (hand-chosen or set empirically)
The new query moves toward the relevant documents and away from the irrelevant ones.
SMART: Cornell (Salton) IR system of the 1970s to 1990s.
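In LaTeX, the standard Rocchio (SMART) update with these symbols is:

\vec{q}_m = \alpha \vec{q}_0 + \frac{\beta}{|D_r|} \sum_{\vec{d}_j \in D_r} \vec{d}_j - \frac{\gamma}{|D_{nr}|} \sum_{\vec{d}_j \in D_{nr}} \vec{d}_j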

Relevance feedback: problems
- Users are often reluctant to provide explicit feedback
- It is often harder to understand why a particular document was retrieved after applying relevance feedback
- There is no clear evidence that relevance feedback is the "best use" of the user's time
- A long vector-space query is like a disjunction: documents containing any of the words must be considered, and partial cosine similarities summed over the various terms, which makes the expanded query expensive to process

Pseudo relevance feedback (Sec. 9.1.6)
Pseudo-relevance feedback automates the "manual" part of true relevance feedback:
- Retrieve a list of hits for the user's query
- Assume that the top k are relevant
- Do relevance feedback (e.g., Rocchio)
It works very well on average, but it can go horribly wrong for some queries, and several iterations can cause query drift.
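A minimal sketch of one pseudo-feedback round over dense term-weight vectors (the numpy representation, the search helper, and the alpha/beta defaults are assumptions for illustration; gamma is 0 since no documents are marked non-relevant):

import numpy as np

def pseudo_feedback(q0, search, doc_vectors, k=10, alpha=1.0, beta=0.75):
    # search(q) -> ranked list of docIDs; doc_vectors[d] -> term-weight vector of document d
    top_k = search(q0)[:k]                   # assume the top k hits are relevant
    centroid = np.mean([doc_vectors[d] for d in top_k], axis=0)
    qm = alpha * q0 + beta * centroid        # Rocchio update with gamma = 0
    return search(qm)                        # re-run the search with the modified query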

Query expansion (Sec. 9.2.2)
- In relevance feedback, users give additional input (relevant/non-relevant) on documents, which is used to reweight the terms of the query
- In query expansion, users give additional input (good/bad search term) on words or phrases

How to augment the user query? (Sec. 9.2.2)
- Manual thesaurus (costly to generate), e.g. MedLine: physician, syn: doc, doctor, MD
- Global analysis (static, over all docs in the collection): automatically derived thesaurus (co-occurrence statistics), refinements based on query-log mining (common on the web)
- Local analysis (dynamic): analysis of the documents in the result set

Quality of a search engine
Paolo Ferragina, Dipartimento di Informatica, Università di Pisa

Is it good?
- How fast does it index: number of documents per hour (for a given average document size)
- How fast does it search: latency as a function of index size
- Expressiveness of the query language

Measures for a search engine
- All of the preceding criteria are measurable
- The key measure, however, is user happiness: useless answers won't make a user happy
- Hence the need for user groups for testing

General scenario
[Venn diagram: within the whole collection, the set of Retrieved docs and the set of Relevant docs partially overlap]

Precision vs. Recall
- Precision: % of retrieved docs that are relevant (the issue: how much "junk" is found)
- Recall: % of relevant docs that are retrieved (the issue: how much of the "info" is found)

How to compute them
- Precision: fraction of retrieved docs that are relevant: P = tp/(tp + fp)
- Recall: fraction of relevant docs that are retrieved: R = tp/(tp + fn)

                 Relevant                Not relevant
Retrieved        tp (true positive)      fp (false positive)
Not retrieved    fn (false negative)     tn (true negative)
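A direct transcription of these formulas (a trivial sketch; retrieved and relevant are any collections of docIDs):

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    tp = len(retrieved & relevant)           # relevant docs that were retrieved
    fp = len(retrieved - relevant)           # retrieved but not relevant
    fn = len(relevant - retrieved)           # relevant but not retrieved
    precision = tp / (tp + fp) if retrieved else 0.0
    recall = tp / (tp + fn) if relevant else 0.0
    return precision, recall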

Precision-Recall curve
Measure precision at various levels of recall.
[Plot: precision on the y-axis vs. recall on the x-axis, one point per recall level]

A common picture
[Plot: the typical precision-recall trade-off, with precision decreasing as recall increases]

F measure
Combined measure (weighted harmonic mean of precision and recall).
People usually use the balanced F1 measure, i.e., with α = ½, thus 1/F = ½ (1/P + 1/R).
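In LaTeX, the standard weighted harmonic mean with weight α is:

\frac{1}{F} = \alpha \frac{1}{P} + (1 - \alpha) \frac{1}{R}
\qquad \Longleftrightarrow \qquad
F_{\beta} = \frac{(\beta^2 + 1) P R}{\beta^2 P + R}, \quad \beta^2 = \frac{1 - \alpha}{\alpha}

With α = ½ (β = 1) this gives the balanced F1 = 2PR/(P + R).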