1 INF 2914 Information Retrieval and Web Search Lecture 10: Query Processing These slides are adapted from Stanford’s class CS276 / LING 286 Information Retrieval and Web Mining

2 Algorithms for Large Data Sets Ziv Bar-Yossef

3 Abstract Formulation. Ingredients: D: document collection; Q: query space; f: D × Q → R: relevance scoring function. For every q in Q, f induces a ranking (partial order) ≤q on D. Functions of an IR system: preprocess D and create an index I; given q in Q, use I to produce a permutation π on D.

4 Document Representation. T = {t1, …, tk}: a “token space” (a.k.a. “feature space” or “term space”). Ex: all words in English. Ex: phrases, URLs, … A document: a real vector d in R^k; di: “weight” of token ti in d. Ex: di = normalized # of occurrences of ti in d.

5 Classic IR (Relevance) Models The Boolean model The Vector Space Model (VSM)

6 The Boolean Model. A document: a boolean vector d in {0,1}^k; di = 1 iff ti belongs to d. A query: a boolean formula q over tokens, q: {0,1}^k → {0,1}. Ex: “Michael Jordan” AND (NOT basketball). Ex: +“Michael Jordan” –basketball. Relevance scoring function: f(d,q) = q(d).
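The Boolean model fits in a few lines of code. Below is an illustrative sketch (the documents, tokens, and names are invented for the example, not taken from the slides): a document is a set of tokens and the query is a boolean predicate over that set.

```python
# Illustrative sketch of the Boolean model (documents and tokens are made up).
docs = {
    "d1": {"michael", "jordan", "basketball", "nba"},
    "d2": {"michael", "jordan", "machine", "learning"},
}

def q(tokens):
    # Query: "Michael Jordan" AND (NOT basketball)
    return "michael" in tokens and "jordan" in tokens and "basketball" not in tokens

# Relevance scoring function f(d, q) = q(d): 1 (relevant) or 0 (not relevant)
print([doc_id for doc_id, tokens in docs.items() if q(tokens)])   # ['d2']
```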

7 The Boolean Model: Pros & Cons Advantages: Simplicity for users Disadvantages: Relevance scoring is too coarse

8 The Vector Space Model (VSM). A document: a real vector d in R^k; di = weight of ti in d (usually the TF-IDF score). A query: a real vector q in R^k; qi = weight of ti in q. Relevance scoring function: f(d,q) = sim(d,q), the “similarity” between d and q.

9 Popular Similarity Measures. L1 or L2 distance: d and q are first normalized to have unit norm. Cosine similarity: the cosine of the angle between d and q. (Figure on slide: vectors d, q, and d − q.)
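For the VSM, cosine similarity is the standard choice. A minimal sketch, assuming documents and queries are given as sparse dicts mapping token to weight (an assumed input format, not specified on the slides):

```python
import math

# Cosine similarity between a document vector d and a query vector q,
# both represented as sparse dicts: token -> weight.
def cosine(d, q):
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine({"java": 0.7, "coffee": 0.2}, {"java": 1.0}))   # ~0.96
```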

10 TF-IDF Score: Motivation. Motivating principle: a term ti is relevant to a document d if ti occurs many times in d relative to other terms that occur in d, and ti occurs many times in d relative to its number of occurrences in other documents. Examples: 10 out of 100 terms in d are “java”; 10 out of 10,000 terms in d are “java”; 10 out of 100 terms in d are “the”.

11 TF-IDF Score: Definition. n(d,ti) = # of occurrences of ti in d. N = Σi n(d,ti) (# of tokens in d). Di = # of documents containing ti. D = # of documents in the collection. TF(d,ti): “Term Frequency”. Ex: TF(d,ti) = n(d,ti) / N. Ex: TF(d,ti) = n(d,ti) / max_j { n(d,tj) }. IDF(ti): “Inverse Document Frequency”. Ex: IDF(ti) = log(D / Di). TFIDF(d,ti) = TF(d,ti) × IDF(ti).
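A small sketch of the TF-IDF variant defined on this slide (TF(d,ti) = n(d,ti)/N and IDF(ti) = log(D/Di)), computed on a toy three-document collection invented for illustration:

```python
import math
from collections import Counter

collection = [
    "java java the coffee".split(),
    "the cat sat on the mat".split(),
    "java programming the language".split(),
]
D = len(collection)
# Di: number of documents containing each term
df = Counter(t for doc in collection for t in set(doc))

def tfidf(doc, term):
    n = Counter(doc)
    tf = n[term] / len(doc)        # TF(d, ti) = n(d, ti) / N
    idf = math.log(D / df[term])   # IDF(ti) = log(D / Di)
    return tf * idf

print(tfidf(collection[0], "java"))   # ~0.20: frequent here, rare elsewhere
print(tfidf(collection[0], "the"))    # 0.0: "the" appears in every document
```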

12 VSM: Pros & Cons Advantages: Better granularity in relevance scoring Good performance in practice Efficient implementations Disadvantages: Assumes term independence

13 Retrieval Evaluation. Notations: D: document collection. Dq: documents in D that are “relevant” to query q (Ex: f(d,q) is above some threshold). Lq: list of results on query q. (Venn diagram on slide: D, Lq, Dq.) Recall = |Lq ∩ Dq| / |Dq|. Precision = |Lq ∩ Dq| / |Lq|.
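A sketch of these two set-ratio definitions; the relevant set matches the example on the next slide, while the retrieved list is hypothetical:

```python
# Recall = |Lq ∩ Dq| / |Dq|, Precision = |Lq ∩ Dq| / |Lq|
def recall(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(set(relevant))

def precision(retrieved, relevant):
    return len(set(retrieved) & set(relevant)) / len(retrieved)

relevant = {"d123", "d56", "d9", "d25", "d3"}
retrieved = ["d84", "d56", "d6", "d9", "d25", "d123", "d8", "d11", "d17", "d40"]
print(recall(retrieved, relevant), precision(retrieved, relevant))   # 0.8 0.4
```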

14 Precision & Recall: Example. List A (10 results): …, d84, d56, d6, d8, d9, …, d25. List B (10 results): d81, d74, d56, …, d25, d9, …, d3, d5. Relevant docs: d123, d56, d9, d25, d3. Recall(A) = 80%, Precision(A) = 40%. Recall(B) = 100%, Precision(B) = 50%.

15 Precision at k and Recall at k. Notations: Dq: documents in D that are “relevant” to q. Lq,k: top k results on the list.

16 Example. Precision at k and recall at k for Lists A and B from the previous example (table shown on slide).

17 Example (ctd.). Precision at k and recall at k for Lists A and B, continued (table shown on slide).

18 “Interpolated” Precision. Notations: Dq: documents in D that are “relevant” to q. r: a recall level (e.g., 20%). k(r): first k such that Recall@k ≥ r. Interpolated precision at recall level r = max { Precision@k : k ≥ k(r) }.
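A sketch of interpolated precision under these definitions, assuming a ranked result list and a set of relevant documents as inputs (document IDs here are illustrative):

```python
def precision_at_k(ranked, relevant, k):
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / k

def recall_at_k(ranked, relevant, k):
    hits = sum(1 for d in ranked[:k] if d in relevant)
    return hits / len(relevant)

def interpolated_precision(ranked, relevant, r):
    n = len(ranked)
    # k(r): first k such that Recall@k >= r
    ks = [k for k in range(1, n + 1) if recall_at_k(ranked, relevant, k) >= r]
    if not ks:
        return 0.0
    k_r = ks[0]
    # max precision at any cutoff k >= k(r)
    return max(precision_at_k(ranked, relevant, k) for k in range(k_r, n + 1))

relevant = {"d123", "d56", "d9", "d25", "d3"}
ranked = ["d84", "d56", "d6", "d9", "d25", "d123", "d8", "d3", "d17", "d40"]
print(interpolated_precision(ranked, relevant, r=0.2))   # ~0.667 = Precision@6
```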

19 Precision vs. Recall: Example. Precision vs. recall curves for Lists A and B (plot shown on slide).

20 Top-k Query Processing Optimal aggregation algorithms for middleware Ronald Fagin, Amnon Lotem, and Moni Naor Based on the presentation of Wesley Sebrechts, Joost Voordouw. Modified by Vagelis Hristidis

21 Why top-k query processing? Multimedia brings fuzzy data: attribute values are graded, typically in [0,1]. There is no clear boundary between “answer” / “no answer”. A query in a multimedia database means combining graded attributes. Combine attributes by an aggregation function; the aggregation function gives the overall grade of an object. Return the k objects with the highest overall grade. (Example shown on slide.)

22 Top-k query processing = finding the k objects that have the highest overall grades. How? Which algorithms? Fagin’s Algorithm (FA), Threshold Algorithm (TA). Which is the best algorithm? Keep in mind: the database system serves as middleware; multimedia (objects) may be kept in different subsystems, e.g. photoDB, videoDB, search engine; take into account the limitations of these subsystems.

23 Simple database model. Simple query. Explaining Fagin’s Algorithm (FA). Finding top-k with FA. Explaining the Threshold Algorithm (TA). Finding top-k with TA. Example.

24 Example – Simple Database Model. N objects (here a, b, c, d), each with an Object ID and M attributes; each attribute has its own sorted list. Sorted L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6). Sorted L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2).

25 Example – Simple Query. Find the top 2 (k = 2) objects for the following ‘query’ executed on the middleware: A1 & A2 (e.g. color = red & shape = round). Aggregation function: a function that gives objects an overall grade based on their attribute grades; examples: min, max (note: monotonicity!). A1 & A2 as a ‘query’ to the middleware results in the middleware combining the grades of A1 and A2 by min(A1, A2).
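As a baseline, the middleware could simply compute min(A1, A2) for every object and sort; FA and TA on the next slides avoid scanning everything. A brief sketch using the example database from slide 24:

```python
# Brute-force baseline for the k = 2 query A1 & A2 with min() aggregation,
# on the example database of slide 24 (FA/TA below avoid reading everything).
grades = {
    "a": (0.9, 0.85),
    "b": (0.8, 0.7),
    "c": (0.72, 0.2),
    "d": (0.6, 0.9),
}
overall = {obj: min(a1, a2) for obj, (a1, a2) in grades.items()}
top2 = sorted(overall, key=overall.get, reverse=True)[:2]
print(top2)   # ['a', 'b'] -- overall grades: a 0.85, b 0.7, d 0.6, c 0.2
```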

26 Example – Fagin’s Algorithm, STEP 1: read attributes from every sorted list; stop when k objects have been seen in common from all lists. L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6). L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2). Objects seen so far: a, d, b, c (table of ID, A1, A2, Min(A1,A2) partially filled on slide).

27 Example – Fagin’s Algorithm, STEP 2: random access to find the missing grades of every object seen under sorted access (table of ID, A1, A2, Min(A1,A2) completed on slide).

28 Example – Fagin’s Algorithm, STEP 3: compute the overall grades of the seen objects and return the k highest-graded objects (here a and b).
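A hedged sketch of Fagin’s Algorithm for this two-list, min-aggregation example; it follows the three steps above and is written for illustration rather than as the paper’s exact pseudocode:

```python
# Illustrative sketch of Fagin's Algorithm (FA) for two sorted lists and min()
# as the aggregation function; the data is the example from slides 24-28.
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def fagin_top_k(lists, k, agg=min):
    seen = {}    # object -> set of lists it has appeared in under sorted access
    depth = 0
    # STEP 1: sorted access in parallel until k objects are seen in all lists
    while sum(1 for s in seen.values() if len(s) == len(lists)) < k:
        for i, lst in enumerate(lists):
            obj, _ = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
    # STEP 2: random access for the missing grades of every seen object
    # STEP 3: compute overall grades and return the k highest-graded objects
    overall = {obj: agg(dict(lst)[obj] for lst in lists) for obj in seen}
    return sorted(overall.items(), key=lambda x: x[1], reverse=True)[:k]

print(fagin_top_k([L1, L2], k=2))   # [('a', 0.85), ('b', 0.7)]
```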

29 New Idea!!! Threshold Algorithm (TA). Read all grades of an object once it is seen from a sorted access; no need to wait until the lists give k common objects. Do sorted access (and corresponding random accesses) until you have seen the top k answers. How do we know that the grades of seen objects are higher than the grades of unseen objects? Predict the maximum possible grade of unseen objects: the threshold value T aggregates the grades at the current position in the sorted lists. (Figure on slide: the seen and possibly unseen portions of L1 and L2; after three sorted accesses per list, T = min(0.72, 0.7) = 0.7, so no unseen object can score above 0.7.)

30 Example – Threshold Algorithm, Step 1: parallel sorted access to each list (L1: (a, 0.9), (b, 0.8), (c, 0.72), (d, 0.6); L2: (d, 0.9), (a, 0.85), (b, 0.7), (c, 0.2)); objects seen: a and d. For each object seen: get all its grades by random access, determine Min(A1,A2), and keep it in the buffer if it is amongst the 2 highest seen.

31 Example – Threshold Algorithm, Step 2: determine the threshold value based on the entries currently seen under sorted access: T = min(L1, L2) = min(0.9, 0.9) = 0.9. Buffer: a (0.85), d (0.6). Are there 2 objects with overall grade ≥ the threshold value? No, so go to the next entry position in the sorted lists and repeat Step 1.

32 Example – Threshold Algorithm, Step 1 (again): parallel sorted access to each list at the next position; newly seen object: b. For each object seen: get all its grades by random access, determine Min(A1,A2), and keep it in the buffer if it is amongst the 2 highest seen.

33 Example – Threshold Algorithm, Step 2 (again): determine the threshold value based on the entries currently seen: T = min(L1, L2) = min(0.8, 0.85) = 0.8. Buffer: a (0.85), b (0.7). Are there 2 objects with overall grade ≥ the threshold value? No, so go to the next entry position in the sorted lists and repeat Step 1.

34 Example – Threshold Algorithm: situation at the stopping condition. T = min(0.72, 0.7) = 0.7; the buffer holds a (0.85) and b (0.7), both with overall grade ≥ T, so the algorithm stops and returns a and b.
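A hedged sketch of the Threshold Algorithm on the same two lists; it performs sorted access round by round, resolves each newly seen object by random access, and stops once the k buffered objects all reach the threshold T, matching the trace above:

```python
# Illustrative sketch of the Threshold Algorithm (TA) on the same example;
# it stops as soon as the k buffered objects all have overall grade >= T.
L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]

def threshold_top_k(lists, k, agg=min):
    random_access = [dict(lst) for lst in lists]     # simulated random access
    buffer = {}                                      # k highest overall grades seen
    for depth in range(len(lists[0])):
        # Step 1: parallel sorted access; resolve new objects by random access
        for lst in lists:
            obj, _ = lst[depth]
            if obj not in buffer:
                buffer[obj] = agg(ra[obj] for ra in random_access)
        buffer = dict(sorted(buffer.items(), key=lambda x: x[1], reverse=True)[:k])
        # Step 2: threshold T = aggregate of the grades at the current depth
        T = agg(lst[depth][1] for lst in lists)
        if len(buffer) == k and all(g >= T for g in buffer.values()):
            break
    return sorted(buffer.items(), key=lambda x: x[1], reverse=True)

print(threshold_top_k([L1, L2], k=2))   # [('a', 0.85), ('b', 0.7)], stops at T = 0.7
```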

35 Comparison of Fagin’s and Threshold Algorithm. TA sees fewer objects than FA: TA stops at least as early as FA; when we have seen k objects in common in FA, their grades are higher than or equal to the threshold in TA. TA may perform more random accesses than FA: in TA, (m-1) random accesses for each object; in FA, random accesses are done at the end, only for the missing grades. TA requires only bounded buffer space (k), at the expense of more random seeks; FA makes use of unbounded buffers.

36 The best algorithm. Which algorithm is the best? Define “best”: middleware cost, concept of instance optimality. Consider: wild guesses; aggregation function characteristics (monotone, strictly monotone, strict); database restrictions (distinctness property).

37 The best algorithm: concept of optimality. Middleware cost = cost for processing data in the subsystems = s·cS + r·cR (s sorted accesses at cost cS each, r random accesses at cost cR each). A = class of algorithms, A ∈ A represents an algorithm. D = legal inputs to the algorithms (databases), D ∈ D represents a database. Cost(A,D) = middleware cost when running algorithm A over database D. Algorithm B is instance optimal over A and D if: B ∈ A and Cost(B,D) = O(Cost(A,D)) for every A ∈ A and D ∈ D. Which means: Cost(B,D) ≤ c · Cost(A,D) + c’ for every A ∈ A, D ∈ D; the constant c is called the optimality ratio.

38 The best algorithm: instance optimality & wild guesses. Intuitively: B instance optimal = always the best algorithm in A = always optimal. In reality, “always” comes with a caveat: we exclude algorithms that make wild guesses. A wild guess is a random access on an object not previously encountered by sorted access; in practice this is not possible, since the database needs to know the object ID to do a random access. If wild guesses are allowed in A, then no algorithm can be instance optimal: wild guesses can find the top-k objects with only k·m random accesses (k = # objects, m = # lists).

39 The best algorithm: aggregation functions. An aggregation function t combines an object’s grades into the object’s overall grade: (x1, …, xm) → t(x1, …, xm). Monotone: t(x1, …, xm) ≤ t(x’1, …, x’m) if xi ≤ x’i for every i. Strictly monotone: t(x1, …, xm) < t(x’1, …, x’m) if xi < x’i for every i. Strict: t(x1, …, xm) = 1 precisely when xi = 1 for every i.

40 The best algorithm: database restrictions Distinctness property: A database has no (sorted) attribute list in which two objects have the same grade

41 Fagin’s Algorithm. Database with N objects, each with m attributes; the orderings of the lists are independent. FA finds the top k with middleware cost O(N^((m-1)/m) k^(1/m)). FA is optimal with high probability in the worst case for strict monotone aggregation functions.

42 Threshold Algorithm. TA is instance optimal (always optimal) for every monotone aggregation function, over every database (excluding wild guesses); this is optimality in a much stronger sense than Fagin’s Algorithm. For a strictly monotone aggregation function: optimality ratio = m + m(m-1)·cR/cS = best possible (m = # attributes). If random access is not possible (cR = 0), the optimality ratio = m; if sorted access is not possible (cS = 0), the optimality ratio is infinite and TA is not instance optimal. TA is also instance optimal (always optimal) for every strictly monotone aggregation function, over every database (including wild guesses) that satisfies the distinctness property; in that case the optimality ratio = c·m^2 with c = max{cR/cS, cS/cR}.

Optimized Query Execution in Large Search Engines with Global Page Ordering Xiaohui Long Torsten Suel CIS Department Polytechnic University Brooklyn, NY 11201

Talk Outline: intro: query processing in search engines; related work: query execution and pruning techniques; algorithmic techniques; experimental evaluation: single and multiple nodes; concluding remarks. The Problem: “how to optimize query throughput in large search engines, when the ranking function is a combination of term-based ranking and a global ordering such as Pagerank”.

Query Processing in Parallel Search Engines. (Figure: LAN cluster with query integrator; each node stores pages and an index; the integrator broadcasts each query and combines the results.) Local index organization: every node stores and indexes a subset of the pages; every query is broadcast to all nodes by the query integrator (QI); every node supplies its top-10, and the QI computes the global top-10. Note: we don’t really need the top-10 from all nodes, maybe only the top-2. Low-cost cluster architecture (usually with additional replication).

Related Work on top-k Queries. IR: optimized evaluation of cosine measures (since the 1980s). DB: top-k queries for multimedia databases (Fagin 1996); does not consider combinations of term-based and global scores. Brin/Page 1998: fancy lists in Google. Related Work (IR): basic idea: “presort entries in each inverted list by contribution to cosine”; also process inverted lists from shortest to longest list; various schemes, either reliable or probabilistic. Most closely related: Persin/Zobel/Sacks-Davis 1993/96; Anh/Moffat 1998; Anh/de Kretser/Moffat 2001. Typical assumptions: many keywords per query, OR semantics.

Related Work (DB) (Fagin 1996 and others). Motivation: searching multimedia objects by several criteria. Typical assumptions: few attributes, OR semantics, random access. FA (Fagin’s algorithm), TA (Threshold algorithm), others. Formal bounds for k lists when the lists are independent. Term-based ranking: presort each list by contribution to cosine.

Related Work (Google) (Brin/Page 1998). “Fancy lists” optimization in Google: create an extra, shorter inverted list for “fancy matches” (matches that occur in the URL, anchor text, title, bold face, etc.). Note: fancy matches can be modeled by higher weights in the term-based vector space model. No details given or numbers published. (Figure: inverted lists for “chair” and “table”, each split into a fancy list followed by the rest of the list with the other matches.)

Results of our Paper: pruning techniques for query execution in large search engines; focus on a combination of a term-based and a global score (such as Pagerank); techniques combine previous approaches such as fancy lists and presorting of lists by term scores; experimental evaluation on 120 million pages; very significant savings with almost no impact on results; it’s good to have a global ordering!

Algorithms: exhaustive algorithm: “no pruning, traverse entire list”. first-m: “a naïve algorithm with lists sorted by Pagerank; stop after m elements in the intersection are found”. fancy first-m: “use fancy and non-fancy lists, each sorted by Pagerank, and stop after m elements found”. reliable pruning: “stop when the top-k results are found”. fancy last-m: “stop when at most m elements are unresolved”. Single-node and parallel case with optimization.

Experimental setup: 120 million pages on 16 machines (1.8 TB uncompressed); P-4 1.7 GHz with 2x80 GB Seagate Barracuda IDE; compressed index based on Berkeley DB (using the mg compression macros); queries from an Excite query trace from December 1999; queries with 2 terms in the following; local index organization with query integrator; first results for one node (7.5 million pages), then 16. Note: we do not need the top-10 from every node, which motivates the top-1, top-4 schemes and precision at 1 and 4. Ranking by cosine + log(PR) with normalization.

A naïve approach: first-m. Sort inverted lists by Pagerank (docID = rank due to Pagerank). Exhaustive: top-10. first-m: return the 10 highest-scoring among the first 10/100/1000 pages in the intersection.
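A sketch of the first-m idea under the stated assumptions (lists sorted by Pagerank, docID equal to Pagerank rank); the score lookup stands in for the cosine + log(PR) ranking, and all names here are illustrative, not from the paper:

```python
# Hedged sketch of first-m: scan the intersection of two Pagerank-sorted lists,
# stop after m intersecting documents, return the k highest-scoring among them.
def first_m(list1, list2, score, m, k=10):
    in_list2 = set(list2)
    candidates = []
    for doc in list1:                    # ascending docID == descending Pagerank
        if doc in in_list2:
            candidates.append(doc)
            if len(candidates) == m:     # stop after m docs in the intersection
                break
    return sorted(candidates, key=score, reverse=True)[:k]
```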

first-m (ctd.): loose/strict precision, relative to the “correct” cosine + log(PR) ranking. For first-10, about 45% of the top-10 results belong in the top-10; for first-1000, about 85% of the top-10 results belong in the top-10. For first-100, about 80% of queries return the correct top-1 result; for first-1000, about 70% of queries return all correct top-10 results. (Plot on slide: average cost per query in terms of disk blocks.)

How can we do better? (1) Use better stopping criteria? Reliable pruning: stop when we are sure. Probabilistic pruning: stop when almost sure. These do not work well for a Pagerank-sorted index. (2) Reorganize the index structure? Sort lists by term score (cosine) instead of Pagerank: does not do any better than sorting by Pagerank only. Sort lists by term score combined with log(PR): some problems in normalization and dependence on the # of keywords. Generalized fancy lists: for each list, put the entries with the highest term value in a fancy list; sort both lists by Pagerank (docID); note: anything that does well in 2 out of 3 scores is found soon; combine with deterministic or probabilistic pruning, or first-k. (Figure: lists for “chair” and “table”, each with a fancy list followed by the rest of the list, with cosine < x and cosine < y respectively.)
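One way to picture the generalized fancy-list layout is the index-building step: per term, the highest-scoring postings go into a short fancy list and both lists stay in docID (Pagerank) order. This is an illustrative sketch with an assumed cutoff fraction, not the paper’s implementation:

```python
# Hedged sketch of building a generalized fancy list for one term.
# postings: list of (docID, term_score); docID order == Pagerank order.
def build_fancy_list(postings, fancy_fraction=0.05):
    cutoff = max(1, int(len(postings) * fancy_fraction))
    by_score = sorted(postings, key=lambda p: p[1], reverse=True)
    fancy = sorted(by_score[:cutoff])   # short list of best term scores, by docID
    rest = sorted(postings)             # full list; the slides note that fancy
                                        # entries are not removed from it
    return fancy, rest
```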

Results for generalized fancy lists: loose vs. strict precision for various sizes of the fancy lists (plots on slide). MUCH better precision than without fancy lists! For first-1000, we always get the correct top-1 in these runs.

Costs of Fancy Lists: cost is similar to first-m without fancy lists, plus the additional cost of reading the fancy lists. Cost increases slightly with the size of the fancy list. Slight inefficiency: fancy-list items are not removed from the other list. Note: we do not consider savings due to caching.

Reliable Pruning: always gives the “correct” result. Top-4 can be computed reliably with ~20% of the original cost. With 16 nodes, the top-4 from each node suffices with 99% probability to get the top-10.

Results for 16 Nodes: first-30 returns the correct top-10 for almost 98% of all queries.

Throughput and Latency for 16 Nodes: top-10 queries on 16 machines with 120 million pages; up to 10 queries/sec with reliable pruning; up to 20 queries/sec with the first-30 scheme. Note: reliable pruning is not implemented in a purely incremental manner.

Current and Future Work: results for 3+ terms and an incremental query integrator; need to do a precision/recall study; need to engineer the ranking function and reevaluate; how to include term distance in documents; impact of caching at a lower level; working on a publicly available engine prototype; tons of loose ends and open questions.