INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID

Presentation transcript:

INFORMATION RETRIEVAL TECHNIQUES BY DR. ADNAN ABID Lecture # 40 Top-k Query Processing

ACKNOWLEDGEMENTS
The presentation of this lecture has been taken from the following sources:
- “Introduction to information retrieval” by Prabhakar Raghavan, Christopher D. Manning, and Hinrich Schütze
- “Managing gigabytes” by Ian H. Witten, Alistair Moffat, and Timothy C. Bell
- “Modern information retrieval” by Ricardo Baeza-Yates
- “Web Information Retrieval” by Stefano Ceri, Alessandro Bozzon, and Marco Brambilla

Outline
- Top-k Query Processing
- Simple Database model
- Fagin’s Algorithm
- Threshold Algorithm
- Comparison of Fagin’s and Threshold Algorithm

Top-k Query Processing
Optimal aggregation algorithms for middleware
Ronald Fagin, Amnon Lotem, and Moni Naor

Top-k vs Nested Loop Query

Example
- Simple database model
- Simple query
- Explaining Fagin’s Algorithm (FA)
- Finding top-k with FA
- Explaining Threshold Algorithm (TA)
- Finding top-k with TA

Example – Simple Database model
Database table with N objects and M attributes (here M = 2):
Object ID   Attribute 1   Attribute 2
a           0.9           0.85
b           0.8           0.7
c           0.72          0.2
d           0.6           0.9
...
Sorted L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
Sorted L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...
We start by introducing the database model used in the paper. A database has only one relation, so we have one table containing N objects, each with M (in this case 2) attributes. Each object has a grade for each attribute. The same database can be represented by one sorted list per attribute, ordered by grade in descending order. Each entry of these lists contains an object ID and a grade.
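
The two representations of the model can be sketched in a few lines of Python. This is only an illustration of the slide's example data; the names `grades` and `build_sorted_lists` are made up for this sketch and do not come from the paper.

```python
# Grades from the slide's example: object ID -> (attribute 1, attribute 2).
grades = {
    "a": (0.9, 0.85),
    "b": (0.8, 0.7),
    "c": (0.72, 0.2),
    "d": (0.6, 0.9),
}

def build_sorted_lists(grades):
    """Return one list of (object ID, grade) per attribute, highest grade first."""
    m = len(next(iter(grades.values())))  # number of attributes
    return [
        sorted(((obj, g[i]) for obj, g in grades.items()),
               key=lambda entry: entry[1], reverse=True)
        for i in range(m)
    ]

L1, L2 = build_sorted_lists(grades)
print(L1)  # sorted list for attribute 1
print(L2)  # sorted list for attribute 2
```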

Example – Fagin’s Algorithm
STEP 1
- Read attributes from every sorted list
- Stop when k objects have been seen in common from all lists
L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...
Objects seen with k = 2, after three sorted accesses per list ("-" marks a grade not read yet):
ID   A1     A2     Min(A1,A2)
a    0.9    0.85
d    -      0.9
b    0.8    0.7
c    0.72   -

Example – Fagin’s Algorithm
STEP 2
- Random access to find missing grades
L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85
d    0.6    0.9
b    0.8    0.7
c    0.72   0.2

Example – Fagin’s Algorithm
STEP 3
- Compute the grades of the seen objects.
- Return the k highest graded objects.
L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
b    0.8    0.7    0.7
c    0.72   0.2    0.2
With k = 2, the objects returned are a (0.85) and b (0.7).
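
The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the lists are held in memory, all lists have the same length, a dictionary per list stands in for random access, and the function name `fagin_top_k` is hypothetical.

```python
def fagin_top_k(lists, k, t=min):
    """Sketch of Fagin's Algorithm over equally long, grade-sorted lists."""
    m = len(lists)
    # Assumption of this sketch: a dict per list simulates random access by ID.
    lookup = [dict(lst) for lst in lists]
    seen = {}   # object ID -> indices of the lists in which it was seen
    depth = 0   # number of sorted accesses made on each list
    # Step 1: parallel sorted access until k objects were seen in every list.
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            obj, _grade = lst[depth]
            seen.setdefault(obj, set()).add(i)
        depth += 1
        if sum(len(s) == m for s in seen.values()) >= k:
            break
    # Step 2: random access supplies the missing grades. For brevity every
    # grade is read through `lookup`; only the missing ones need fetching.
    # Step 3: aggregate with t and keep the k best objects.
    overall = {obj: t(lookup[i][obj] for i in range(m)) for obj in seen}
    ranked = sorted(overall.items(), key=lambda e: e[1], reverse=True)
    return ranked[:k], depth

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
top, depth = fagin_top_k([L1, L2], k=2)
print(top, depth)  # [('a', 0.85), ('b', 0.7)] 3
```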

New Idea !!! Threshold Algorithm (TA)
- Read all grades of an object as soon as it is seen under sorted access
- No need to wait until the lists give k common objects
- Do sorted access (and the corresponding random accesses) until you have seen the top k answers
How do we know that the grades of seen objects are higher than the grades of unseen objects? Predict the maximum possible grade of unseen objects:
L1, seen: a: 0.9, b: 0.8, c: 0.72; not yet read: d: 0.6 ...
L2, seen: d: 0.9, a: 0.85, b: 0.7; not yet read: c: 0.2 ...
Threshold value: T = min(0.72, 0.7) = 0.7
Possibly unseen: f: 0.6, f: 0.65

Example – Threshold Algorithm
Step 1:
- Parallel sorted access to each list
- For each object seen:
  - get all grades by random access
  - determine Min(A1,A2)
  - amongst 2 highest seen? keep in buffer
L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6

Example – Threshold Algorithm
Step 2:
- Determine threshold value based on objects currently seen under sorted access: T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value? Stop
- Else go to next entry position in sorted list and repeat step 1
L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6 ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2 ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
T = min(0.9, 0.9) = 0.9

Example – Threshold Algorithm
Step 1 (Again):
- Parallel sorted access to each list
- For each object seen:
  - get all grades by random access
  - determine Min(A1,A2)
  - amongst 2 highest seen? keep in buffer
Sorted access = sequential access
L1: (a, 0.9) (b, 0.8) (c, 0.72) (d, 0.6) ...
L2: (d, 0.9) (a, 0.85) (b, 0.7) (c, 0.2) ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
d    0.6    0.9    0.6
b    0.8    0.7    0.7

Example – Threshold Algorithm
Step 2 (Again):
- Determine threshold value based on objects currently seen: T = min(L1, L2)
- 2 objects with overall grade ≥ threshold value? Stop
- Else go to next entry position in sorted list and repeat step 1
L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6 ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2 ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7
T = min(0.8, 0.85) = 0.8

Example – Threshold Algorithm
Situation at stopping condition
L1: a: 0.9, b: 0.8, c: 0.72, d: 0.6 ...
L2: d: 0.9, a: 0.85, b: 0.7, c: 0.2 ...
ID   A1     A2     Min(A1,A2)
a    0.9    0.85   0.85
b    0.8    0.7    0.7
T = min(0.72, 0.7) = 0.7
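
The walk-through above can be reproduced with a short sketch of TA. It is an illustration under assumptions rather than the paper's code: lists are in memory and equally long, a dictionary per list stands in for random access, and the name `threshold_top_k` is hypothetical.

```python
def threshold_top_k(lists, k, t=min):
    """Sketch of the Threshold Algorithm with a buffer of at most k objects."""
    m = len(lists)
    # Assumption of this sketch: a dict per list simulates random access by ID.
    lookup = [dict(lst) for lst in lists]
    buffer = {}        # object ID -> overall grade, the best k seen so far
    threshold = None
    rounds = 0         # number of sorted accesses made on each list
    for depth in range(len(lists[0])):
        rounds = depth + 1
        last_seen = []  # grade last read under sorted access in each list
        # Step 1: one sorted access per list, then random access for the rest.
        for lst in lists:
            obj, grade = lst[depth]
            last_seen.append(grade)
            if obj not in buffer:
                buffer[obj] = t(lookup[i][obj] for i in range(m))
                if len(buffer) > k:
                    # Drop the lowest graded object to keep the buffer at k.
                    del buffer[min(buffer, key=buffer.get)]
        # Step 2: no unseen object can have an overall grade above the threshold.
        threshold = t(last_seen)
        if len(buffer) >= k and min(buffer.values()) >= threshold:
            break
    ranked = sorted(buffer.items(), key=lambda e: e[1], reverse=True)
    return ranked, rounds, threshold

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
top, rounds, threshold = threshold_top_k([L1, L2], k=2)
print(top, rounds, threshold)  # [('a', 0.85), ('b', 0.7)] 3 0.7
```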

Comparison of Fagin’s and Threshold Algorithm
- TA sees no more objects than FA
  - TA stops at least as early as FA
  - When we have seen k objects in common in FA, their grades are greater than or equal to the threshold in TA.
- TA may perform more random accesses than FA
  - In TA, (m-1) random accesses for each object
  - In FA, random accesses are done at the end, only for missing grades
- TA requires only bounded buffer space (k)
  - At the expense of more random seeks
  - FA makes use of unbounded buffers
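
To make the trade-off concrete, the sketch below counts accesses for both algorithms on the lecture's four-object example with k = 2. The counting conventions are assumptions of this sketch: FA pays one random access per missing grade, TA pays (m-1) random accesses for each object that is not already in its buffer, and the function names `fa_costs` and `ta_costs` are hypothetical.

```python
def fa_costs(lists, k):
    """Return (sorted accesses, random accesses, objects buffered) for FA."""
    m = len(lists)
    seen = {}   # object ID -> indices of the lists in which it was seen
    depth = 0
    while depth < len(lists[0]):
        for i, lst in enumerate(lists):
            seen.setdefault(lst[depth][0], set()).add(i)
        depth += 1
        if sum(len(s) == m for s in seen.values()) >= k:
            break
    # FA remembers every seen object and fetches only the missing grades.
    random_accesses = sum(m - len(s) for s in seen.values())
    return depth * m, random_accesses, len(seen)

def ta_costs(lists, k, t=min):
    """Return (sorted accesses, random accesses, objects buffered) for TA."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    buffer = {}
    sorted_accesses = random_accesses = 0
    for depth in range(len(lists[0])):
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth]
            sorted_accesses += 1
            last_seen.append(grade)
            if obj not in buffer:
                random_accesses += m - 1
                buffer[obj] = t(lookup[i][obj] for i in range(m))
                if len(buffer) > k:
                    del buffer[min(buffer, key=buffer.get)]
        if len(buffer) >= k and min(buffer.values()) >= t(last_seen):
            break
    return sorted_accesses, random_accesses, len(buffer)

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(fa_costs([L1, L2], k=2))  # (6, 2, 4)
print(ta_costs([L1, L2], k=2))  # (6, 4, 2)
```

On this small example both algorithms stop at the same depth; TA makes more random accesses but ends with only k objects in its buffer, while FA keeps all four seen objects.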

The best algorithm
Which algorithm is the best: TA, FA??
- Define “best”
  - middleware cost
  - concept of instance optimality
- Consider:
  - wild guesses
  - aggregation function characteristics: monotone, strictly monotone, strict
  - database restrictions: distinctness property

The best algorithm: aggregation functions
Aggregation function t combines object grades into the object’s overall grade: (x1,…,xm) → t(x1,…,xm)
- Monotone: t(x1,…,xm) ≤ t(x’1,…,x’m) if xi ≤ x’i for every i
- Strictly monotone: t(x1,…,xm) < t(x’1,…,x’m) if xi < x’i for every i
- Strict: t(x1,…,xm) = 1 precisely when xi = 1 for every i
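
The three properties can be spot-checked numerically. The sketch below tests candidate functions on a small grid of grades; a passing check on a finite grid is a sanity check, not a proof. The grid values, the helper names, and the choice of example functions (min, max, average, a constant) are assumptions made for illustration.

```python
from itertools import product

GRID = [0.0, 0.5, 1.0]  # sample grades; an arbitrary choice for this sketch

def average(x):
    return sum(x) / len(x)

def constant(x):
    return 0.5

def comparable_pairs(m, strict):
    """Yield pairs (x, y) with xi < yi (strict) or xi <= yi for every i."""
    points = list(product(GRID, repeat=m))
    for x in points:
        for y in points:
            if strict:
                ok = all(a < b for a, b in zip(x, y))
            else:
                ok = all(a <= b for a, b in zip(x, y))
            if ok:
                yield x, y

def is_monotone(t, m=2):
    return all(t(x) <= t(y) for x, y in comparable_pairs(m, strict=False))

def is_strictly_monotone(t, m=2):
    return all(t(x) < t(y) for x, y in comparable_pairs(m, strict=True))

def is_strict(t, m=2):
    # t(x) = 1 exactly when every xi = 1
    return all((t(x) == 1) == all(a == 1 for a in x)
               for x in product(GRID, repeat=m))

for name, t in [("min", min), ("max", max),
                ("average", average), ("constant", constant)]:
    print(name, is_monotone(t), is_strictly_monotone(t), is_strict(t))
```

On this grid, min and average satisfy all three properties, max is monotone and strictly monotone but not strict (max(1, 0) = 1), and the constant function is monotone only.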

Extending TA
- What if sorted access is restricted? e.g. use distance database → TAz
- What if random access is not possible? e.g. web search engine → No Random Access Algorithm (NRA)
- What if we want only the approximate top k objects? → TAθ
- What if we consider relative costs of random and sorted access? → Combined Algorithm (between TA and NRA)
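
As an illustration of the approximate variant, the sketch below relaxes TA's stopping rule: it halts once k buffered objects have an overall grade of at least T/θ for some θ > 1, which is how TAθ is described in Fagin, Lotem, and Naor's paper. Everything else is an assumption of this sketch: in-memory lists of equal length, dictionaries standing in for random access, and the hypothetical name `ta_theta_top_k`.

```python
def ta_theta_top_k(lists, k, theta=1.0, t=min):
    """Sketch of TA with the relaxed stopping rule: grade >= threshold / theta."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    buffer = {}                             # best k objects seen so far
    rounds = 0
    for depth in range(len(lists[0])):
        rounds = depth + 1
        last_seen = []
        for lst in lists:
            obj, grade = lst[depth]
            last_seen.append(grade)
            if obj not in buffer:
                buffer[obj] = t(lookup[i][obj] for i in range(m))
                if len(buffer) > k:
                    del buffer[min(buffer, key=buffer.get)]
        # theta = 1 gives the exact TA rule; larger theta may stop earlier.
        if len(buffer) >= k and min(buffer.values()) >= t(last_seen) / theta:
            break
    ranked = sorted(buffer.items(), key=lambda e: e[1], reverse=True)
    return ranked, rounds

L1 = [("a", 0.9), ("b", 0.8), ("c", 0.72), ("d", 0.6)]
L2 = [("d", 0.9), ("a", 0.85), ("b", 0.7), ("c", 0.2)]
print(ta_theta_top_k([L1, L2], k=2, theta=1.0))  # exact: stops after 3 rounds
print(ta_theta_top_k([L1, L2], k=2, theta=1.2))  # approximate: stops after 2 rounds
```

On the lecture's example both settings return a and b; with θ = 1.2 the run stops one round earlier because 0.7 ≥ 0.8 / 1.2.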

Taxonomy of Top-k Joins