Topic 3 Top-K and Skyline Algorithms

What is top-k processing?
Find k items that best answer a user's query
– As a set, as a sorted list, or as a sorted list with scores
– Usually from among N items, where N >> k
Application domains
– Search over structured datasets with user-defined preferences
  Find large apartments in a good school district in Brooklyn
  Find cheap hotels that are near a beach
– Web search and other document retrieval / ranking tasks
  Find documents about Massachusetts election health care
– Search over multimedia repositories (with probability scores)
  Find images that show a palm tree next to a house
– Many others

SQL Example
  SELECT h.id, s.name, h.price, s.tuition
  FROM Houses h, Schools s
  WHERE h.location = s.location
  ORDER BY h.price, s.tuition;
  SELECT h.id, s.name
  FROM Houses h, Schools s
  WHERE h.location = s.location
  ORDER BY h.price + 10 * s.tuition  -- score
  STOP AFTER 10;
  SELECT h.id, s.name
  FROM Houses h, Schools s
  ORDER BY h.price + 10 * s.tuition + 2 * distance(h.location, s.location)  -- score; the distance term is dist
  STOP AFTER 10;

Top-K vs. SQL querying
Compared to standard SQL querying
– Relevance to a degree, not Boolean
– Return only the best items, not all items
– Quality of an item is expressed by a score
  SELECT h.id, s.name
  FROM Houses h, Schools s
  ORDER BY h.price + 10 * s.tuition + 2 * distance(h.location, s.location)  -- score; the distance term is dist
  STOP AFTER 10;

Why do we care about top-k processing?
Many practical applications
Representative of many data management problems
– Solid application scenarios, with new ones emerging every day
  E.g., top-k over Web-accessible resources
  E.g., top-k for social search
– Variety of algorithmic approaches
– Explores the trade-off between run-time performance and space overhead
– Costs and balances the use of available operators

Ranking functions
Ranking (scoring) functions are used to compute the score of an item.
Item r(x_1, …, x_m), where x_i are the ranking attributes, e.g., square footage of a house, number of times "Massachusetts" occurs in the text, etc.
score(r) = g(f_1(x_1), …, f_m(x_m))
– where f_i are monotone functions, e.g., f(x) = 2 * x
– g is a monotone aggregation function, e.g., sum, average, max, min
– e.g., score(r) = 2 * sq. ft. + 3 * quality of school district
Definition
– A function f is monotone if f(x) < f(y) whenever x < y
– An aggregation function g is monotone if g(r) < g(r') whenever r.x_i < r'.x_i, for all i
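A minimal sketch of the score(r) = g(f_1(x_1), …, f_m(x_m)) construction, assuming the slide's example weights (2 and 3) and sum as the aggregation. The helper name make_score and the two sample houses are hypothetical, introduced only for illustration.

```python
def make_score(fs, g):
    """Build score(r) = g(f_1(x_1), ..., f_m(x_m)) from per-attribute
    functions fs and an aggregation function g."""
    def score(r):
        return g([f(x) for f, x in zip(fs, r)])
    return score

# Slide example: 2 * sq. ft. + 3 * quality of school district.
score = make_score([lambda x: 2 * x, lambda x: 3 * x], sum)

# Hypothetical items: (square footage, school district quality).
house_a = (1000, 7)
house_b = (1200, 9)   # larger than house_a in every attribute

print(score(house_a))  # 2*1000 + 3*7 = 2021
print(score(house_b))  # 2*1200 + 3*9 = 2427
```

Because every f_i and g here is monotone, house_b, which is larger in every attribute, necessarily gets the larger score. The top-k algorithms in this deck rely on that property to bound the scores of items they have not fully seen.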

Quantifying top-k algorithm performance
Execution time
– Sequential access (SA)
  accessing items in order, e.g., by reading from a cursor
  similar concept to a sequential disk read, where seek time is amortized over multiple accesses
– Random access (RA)
  accessing items out of order, e.g., a primary key lookup
  similar to a random disk read
  typically more expensive than an SA (even orders of magnitude), sometimes impossible
– Why not use wall clock time?
Buffer size
– How much state do we have to keep during computation?
– Is the size bounded by some constant (e.g., k), or is it linear in the size of the dataset (N)? (recall that k << N)

Outline
Intro
Fundamentals of top-k processing
– Fagin algorithm (FA)
– Threshold algorithm (TA)
– No random access algorithm (NRA)
Extensions and alternatives
– top-k with expensive predicates (MPro)
– TAAT vs. DAAT vs. …
Skylines
– Dominance
– Fundamental algorithms
– Extensions (demo)

Naïve Computation of the top-k Answers
Example 1 (on the board & next slide): R(id, annual income, net worth), score(r) = r.income + r.net_worth
Algorithm
– Compute the score of each item
– Sort items in decreasing order of score
– Return k items with the highest score
Properties of naïve solution
– Advantage: simple
– Disadvantage: unacceptable run-time performance when N is high
Idea: throw space at the problem
– pre-compute inverted lists for components of the score
– aggregate partial scores at run-time
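The three steps of the naïve algorithm can be sketched as follows. The relation R below is hypothetical data (the values of Example 1 are not used), and the function name naive_top_k is an assumption for illustration.

```python
def naive_top_k(items, score, k):
    """Score every item, sort by decreasing score, return the best k."""
    scored = [(score(attrs), item_id) for item_id, attrs in items.items()]
    scored.sort(reverse=True)          # O(N log N) over the whole dataset
    return [(item_id, s) for s, item_id in scored[:k]]

# Hypothetical relation R(id, income, net worth).
R = {"a": (10, 1), "b": (8, 7), "c": (7, 9), "d": (3, 10),
     "e": (2, 2), "f": (6, 6), "g": (1, 4)}

top3 = naive_top_k(R, score=sum, k=3)
print(top3)  # [('c', 16), ('b', 15), ('d', 13)]
```

Every item is touched and sorted even though only k are returned, which is the cost the later algorithms try to avoid.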

Example 1
Table with columns id, income (K$), net worth (K$), score = income + net worth, for items r1 to r10 (cell values not preserved in the transcript)

The Basic Indexing Structure: Inverted List
L1 (sorted by income)    L2 (sorted by net worth)
id    income             id    net worth
r1    150                r7    500
r2    ?                  r3    450
r3    125                r4    ?
r4    100                r2    425
r5    ?                  r1    350
r6    80                 r9    300
r7    75                 r5    200
r8    ?                  r6    100
r9    50                 r8    50
r10   ?                  r10   50
(? marks a value not preserved in the transcript)

Fagin Algorithm (FA)
Algorithm
– Access all lists sequentially (SA), in parallel
– STOP once k items have been seen sequentially in all lists
– Compute scores of incomplete items by performing a random access (RA)
– Sort on score, return the best k items
Work out Example 1 on the board for k = 3
Performance
– 8 SA + 2 RA
– 5 objects in buffer
Is this algorithm correct?
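A sketch of FA under stated assumptions: every list contains every item, all lists have the same length, and a dictionary per list stands in for random access. The lists L1 and L2 are hypothetical data, not the values of Example 1, and the function name is illustrative.

```python
def fagin_algorithm(lists, k, agg=sum):
    """lists[i] holds (item_id, value) pairs sorted by value, descending.
    Returns (top-k as (item_id, score), #sorted accesses, #random accesses)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    seen = {}                               # item_id -> {list index: value}
    sa, depth, n = 0, 0, len(lists[0])
    # Phase 1: sorted access in parallel until k items were seen in all lists.
    while depth < n and sum(len(v) == m for v in seen.values()) < k:
        for i, lst in enumerate(lists):
            item_id, value = lst[depth]
            seen.setdefault(item_id, {})[i] = value
            sa += 1
        depth += 1
    # Phase 2: random access for the missing values of every seen item.
    ra, scored = 0, []
    for item_id, vals in seen.items():
        for i in range(m):
            if i not in vals:
                vals[i] = lookup[i][item_id]
                ra += 1
        scored.append((agg(vals[i] for i in range(m)), item_id))
    scored.sort(reverse=True)
    return [(item_id, s) for s, item_id in scored[:k]], sa, ra

# Hypothetical inverted lists over the same seven items.
L1 = [("a", 10), ("b", 8), ("c", 7), ("f", 6), ("d", 3), ("e", 2), ("g", 1)]
L2 = [("d", 10), ("c", 9), ("b", 7), ("f", 6), ("g", 4), ("e", 2), ("a", 1)]

print(fagin_algorithm([L1, L2], k=2))  # ([('c', 16), ('b', 15)], 6, 2)
```

Note that the buffer (seen) holds every item encountered during phase 1, so its size is not bounded by k.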

Threshold Algorithm (TA)
Algorithm
– Access all lists sequentially (SA), in parallel
– After each cursor move
  Compute the score of the item r under the cursor with random accesses (RA)
  Record r in the buffer if
    (i) buffer size < k, or
    (ii) r's score > k-th score; in this case remove the k-th item from the buffer
  Update the threshold τ = aggregate of the current list scores
  STOP when k-th score > τ
– Return the k items currently in the buffer
Work out Example 1 on the board for k = 3
Performance
– 5 SA + 4 RA; #RA = #SA * (m-1), where m is the number of lists
– 3 objects in buffer
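A sketch of TA under the same assumptions as before: every list contains every item, and a dictionary per list stands in for random access. The threshold is updated once per round of parallel accesses, and the strict ">" stopping test follows the slide. L1 and L2 are hypothetical data, not Example 1.

```python
def threshold_algorithm(lists, k, agg=sum):
    """lists[i] holds (item_id, value) pairs sorted by value, descending.
    Returns (top-k as (item_id, score), #sorted accesses, #random accesses)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # stands in for random access
    buffer = {}                             # item_id -> score, at most k entries
    sa = ra = 0
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            item_id, value = lst[depth]
            sa += 1
            # Random accesses fetch the item's values from the other m - 1 lists.
            values = [value if j == i else lookup[j][item_id] for j in range(m)]
            ra += m - 1
            s = agg(values)
            if item_id in buffer:
                continue
            if len(buffer) < k:
                buffer[item_id] = s
            else:
                worst = min(buffer, key=buffer.get)
                if s > buffer[worst]:
                    del buffer[worst]       # evict the current k-th item
                    buffer[item_id] = s
        # Threshold: aggregate of the values currently under the cursors.
        threshold = agg(lst[depth][1] for lst in lists)
        if len(buffer) == k and min(buffer.values()) > threshold:
            break
    top = sorted(buffer.items(), key=lambda kv: kv[1], reverse=True)
    return top, sa, ra

L1 = [("a", 10), ("b", 8), ("c", 7), ("f", 6), ("d", 3), ("e", 2), ("g", 1)]
L2 = [("d", 10), ("c", 9), ("b", 7), ("f", 6), ("g", 4), ("e", 2), ("a", 1)]

print(threshold_algorithm([L1, L2], k=2))  # ([('c', 16), ('b', 15)], 6, 6)
```

The buffer never grows beyond k entries. The price is one batch of random accesses per sorted access, which is why #RA = #SA * (m-1) in this sketch.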

Comparison between FA and TA
Example 2 on the board (k=3)
Theorem: # SA in TA ≤ # SA in FA (whenever FA's stopping condition holds, TA's holds as well)
Theorem: TA requires only bounded buffers, FA's buffers are not bounded

Instance optimality of TA
Theorem: TA is instance-optimal over the class of algorithms A that:
1. correctly find the top-k answers and
2. do not make any random guesses
What about over all algorithms that correctly identify the top-k answers?
– Example 4.4 (PODS 2001) on board

What if we couldn't do random accesses?
Sometimes it suffices to output the top-k as a set
Sometimes we can get away with outputting top-k in sorted order, but with no scores
Example 8.3 from [Fagin et al. JCSS 2003]

No Random Access Algorithm (NRA)
Algorithm
– Access all lists sequentially (SA), in parallel
– After each cursor move compute
  Worst-case score W(r), best-case score B(r) for each seen r
  Sort all seen items on W(r), breaking ties by B(r)
  τ = aggregate of the current list scores (this is the best-case score of any unseen object)
  STOP when W(r) of k-th object > τ
– If random accesses are possible, compute complete scores of the top-k items
– Return the top-k items
Example 1 on the board for k = 3
Performance
– 13 SA + 0 RA
– Optimal performance if no RAs are allowed
– In reality, computation may be slow: re-sorting potentially large buffers at each step
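A sketch of NRA for score = sum of non-negative values, so that an unseen component contributes 0 to W(r). One deliberate difference from the slide's condensed stopping rule: the sketch also requires that no seen item outside the current top-k has a best-case score above the k-th worst-case score, which is the halting condition in Fagin et al.; it uses ">=" where the slide writes ">". The data and the function name are hypothetical.

```python
def nra(lists, k):
    """lists[i] holds (item_id, value) pairs sorted by value, descending.
    Returns (top-k item ids, #sorted accesses); no random accesses are made."""
    m = len(lists)
    seen = {}          # item_id -> {list index: value seen under sorted access}
    sa = 0
    ranked = []
    for depth in range(len(lists[0])):
        for i, lst in enumerate(lists):
            item_id, value = lst[depth]
            seen.setdefault(item_id, {})[i] = value
            sa += 1
        last = [lst[depth][1] for lst in lists]   # current list scores
        tau = sum(last)                           # best case of any unseen item

        def worst(r):   # W(r): unseen components count as 0
            return sum(seen[r].values())

        def best(r):    # B(r): unseen components take the current list score
            return sum(seen[r].get(i, last[i]) for i in range(m))

        ranked = sorted(seen, key=lambda r: (worst(r), best(r)), reverse=True)
        if len(ranked) >= k:
            kth_w = worst(ranked[k - 1])
            rest_best = max((best(r) for r in ranked[k:]), default=tau)
            if kth_w >= tau and kth_w >= rest_best:
                break
    return ranked[:k], sa

L1 = [("a", 10), ("b", 8), ("c", 7), ("f", 6), ("d", 3), ("e", 2), ("g", 1)]
L2 = [("d", 10), ("c", 9), ("b", 7), ("f", 6), ("g", 4), ("e", 2), ("a", 1)]

print(nra([L1, L2], k=2))  # (['c', 'b'], 10)
```

On this data NRA needs 10 sorted accesses where TA needed 6, and it re-sorts all seen items after every round, which illustrates the slow computation the slide warns about.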

Top-k Selection
Rank aggregation from multiple lists, by access type
– Both sorted and random access: FA, TA, Quick-combine, Multi-Step
– Sorted access only: NRA, Stream-combine
– Random access only: MPro, Upper, Pick
Slide courtesy of Ihab F. Ilyas and Walid G. Aref

Minimal Probing (MPro)
Recall: TA sequentially accesses one list, probes all other lists for each encountered object.
What if random accesses (probes) were very expensive?
– User-defined functions
– Calls to external services, e.g., web-based
Idea: execute only the necessary probes [Chang & Hwang, SIGMOD 2002]
Assumptions
– One component of the score is accessed sequentially (SA), other components are probed (RA)
– Probing schedule is given, global (same for all items) or per-item

The MPro Example
  SELECT id FROM House
  WHERE new(age) s,               -- sequential access
        cheap(price, size) p_C,   -- random access, a.k.a. probe
        large(size) p_L           -- random access
  ORDER BY MIN(s, p_C, p_L)
  STOP AFTER 2
Table with columns id, s = new(age), p_C = cheap(price, size), p_L = large(size), score = MIN(s, p_C, p_L), for items a to e (cell values not preserved in the transcript)

The MPro Algorithm
For an item r:
– W(r): score seen so far
– B(r): best-case score, unseen components have max score (1.0)
What is a necessary probe?
– The probe pr(r) is necessary if there do not exist k items v_1, …, v_k s.t. B(r) < B(v_i) for each v_i
The MPro Algorithm
– Access items sequentially
– Maintain a best-case queue (a.k.a. the ceiling queue)
  STOP when k items with complete scores are at the head of the queue
  Probe the first incomplete item, re-sort the queue
Example 5 on the board (Figure 4 from [Chang & Hwang, SIGMOD 2002])
– MIN is the aggregation, not SUM!
– find the top-2 items (k=2)
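The ceiling-queue loop can be sketched with a heap, assuming a global probing schedule and scores in [0, 1]. For brevity the sketch loads all sequentially accessed scores into the queue up front. The predicate values below are hypothetical (the slide's table values are not available), and the function name mpro is illustrative.

```python
import heapq

def mpro(items, probes, k, agg=min):
    """items: {item_id: s}, the sequentially accessed score in [0, 1].
    probes: functions item_id -> score in [0, 1], in global schedule order.
    Returns (top-k as (item_id, score), number of probes executed)."""
    total = 1 + len(probes)

    def ceiling(known):
        # Best-case score: predicates not yet probed are assumed to be 1.0.
        return agg(known + [1.0] * (total - len(known)))

    # Ceiling queue as a max-heap (heapq is a min-heap, so keys are negated).
    heap = [(-ceiling([s]), item_id, [s]) for item_id, s in items.items()]
    heapq.heapify(heap)
    result, probe_count = [], 0
    while heap and len(result) < k:
        neg_ceiling, item_id, known = heapq.heappop(heap)
        if len(known) == total:
            # A complete item at the head of the queue is a final answer.
            result.append((item_id, -neg_ceiling))
            continue
        # Necessary probe: the head of the queue is incomplete.
        known = known + [probes[len(known) - 1](item_id)]
        probe_count += 1
        heapq.heappush(heap, (-ceiling(known), item_id, known))
    return result, probe_count

# Hypothetical predicate scores for five houses.
new = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.6, "e": 0.5}
cheap = {"a": 0.85, "b": 0.78, "c": 0.75, "d": 0.9, "e": 0.3}
large = {"a": 0.75, "b": 0.9, "c": 0.2, "d": 0.9, "e": 0.8}

print(mpro(new, [cheap.__getitem__, large.__getitem__], k=2))
# ([('b', 0.78), ('a', 0.75)], 4)
```

Only 4 of the 10 possible probes are executed here: items c, d and e are never probed because their ceilings stay below the complete scores of b and a.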

The big picture: top-k is term-at-a-time
[Query evaluation: strategies and optimizations. Turtle & Flood, 1995.]
Term-at-a-time (TAAT)
– Partial scores for documents kept in accumulators
– May (it better!) terminate before all documents are considered, but each document may be considered more than once
– Typically outperforms DAAT when contributions of terms to the score are independent and the dataset is reasonably small
Document-at-a-time (DAAT)
– Complete scores for documents computed (i.e., typically all documents are considered)
– Smaller memory footprint than in TAAT
– Easier to execute in parallel than TAAT

Score-at-a-time revisited
Lots of trade-offs determine whether TAAT or DAAT is better
– E.g., optimized DAAT costly because of re-sorting (for long queries)
– E.g., TAAT costly because of memory footprint
– E.g., DAAT much better for context-sensitive queries
– Particular algorithms, datasets, scoring functions, architectures bring their own trade-offs
What about score-at-a-time (Anh and Moffat, SIGIR 2006)?
– Similar to TAAT, but the order in which terms are considered is determined by the impact of a given term in a given document
– A custom (efficient) indexing method for a custom (effective) scoring function

Ranking functions and dominance
Monotone aggregation ranking functions express user preferences
– r(price, distToBeach); score(r) = w_1 * price + w_2 * distToBeach
– score(r_1) < score(r_2) whenever r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
– this holds for any positive weights w_1 and w_2! (What if one of the weights was negative? What if both were negative?)
– In this example, we are interested in minimizing the score
Mapping a multi-dimensional (e.g., 2D) space into R
– In top-k, we look for k tuples that have the best score, which is weight-specific, but …
  users may differ on the weights in their preferences
  users may want to understand the structure of the 2D space
– Skylines to the rescue!
  make the structure of the space explicit
  are independent of the weights

Skylines represent dominance
Tuple r_1 is better than r_2, according to its score, if
– score(r_1) < score(r_2), i.e.,
– r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
This is equivalent to a notion of dominance (Pareto-optimality)
– Definition: r_1 dominates r_2 iff r_1 is as good or better than r_2 in all dimensions, and strictly better in at least one dimension.
– Some tuples may be incomparable
  r_1: $100, 1.0 mi
  r_2: $90, 1.2 mi
  r_3: $110, 1.1 mi
  r_4: $120, 1.1 mi
– Observe: dominance is transitive
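The dominance definition translates directly into a small predicate. The sketch below assumes that smaller is better in every dimension (as with price and distance to the beach) and uses the four tuples from the slide; the function name is an assumption.

```python
def dominates(r1, r2):
    """True if r1 is at least as good as r2 in all dimensions and
    strictly better in at least one (smaller is better)."""
    return (all(a <= b for a, b in zip(r1, r2))
            and any(a < b for a, b in zip(r1, r2)))

# (price in $, distance to beach in miles), from the slide.
r1, r2, r3, r4 = (100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)

print(dominates(r1, r3))                     # True: cheaper and closer
print(dominates(r3, r4))                     # True: cheaper, same distance
print(dominates(r1, r4))                     # True: follows by transitivity
print(dominates(r1, r2), dominates(r2, r1))  # False False: incomparable
```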

Skyline examples

Skyline properties
For any monotone ranking function F
– If point r is best according to F, then r is on the skyline
  A top-1 hotel will always be on the skyline, irrespective of the weights!
– If point r is on the skyline, then there exists an F for which r is best
  Every hotel on the skyline is someone's favorite

Computing the Skyline: nested-loops
We will focus on 2D skylines; these methods generalize to multiple dimensions (some non-trivially)
  SELECT * FROM Hotels h
  WHERE h.city = 'Nassau' AND NOT EXISTS (
    SELECT * FROM Hotels h1
    WHERE h1.city = 'Nassau'
      AND h1.distToBeach <= h.distToBeach
      AND h1.price <= h.price
      AND (h1.distToBeach < h.distToBeach OR h1.price < h.price))
How is this evaluated? Complexity?
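One way to read the NOT EXISTS query is as a nested loop: for every hotel, scan all hotels for one that dominates it. A minimal in-memory sketch, assuming smaller is better in both dimensions and using the four example tuples from the dominance slide as hypothetical Hotels rows:

```python
def dominates(p, q):
    """p is at least as good as q everywhere and strictly better somewhere."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline_nested_loops(points):
    """Keep p unless some q dominates it: O(n^2) dominance tests."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

# (price, distToBeach)
hotels = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]
print(skyline_nested_loops(hotels))  # [(100, 1.0), (90, 1.2)]
```

The inner scan runs once per outer tuple, so the number of comparisons grows quadratically with the number of hotels unless the database can use an index or another rewrite.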

Computing the Skyline: block-nested-loops
Maintain a window of incomparable tuples in memory
Iterate until all tuples are either on the skyline or discarded
For each tuple r
– If r is dominated by any tuple in the window, discard r
– If r dominates a tuple r' in the window, discard r', insert r
– If r is incomparable with all tuples in the window, insert r
– If the window is full, write r to a temporary file on disk
At the end of the iteration, output all tuples in the window to a temporary file, with their sequence numbers
r is on the skyline if it was compared to all other tuples; we can tell this by looking at the sequence number (see ICDE 2001 for details)
Complexity: O(n) best case (when the skyline is small), O(n^2) worst case
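A simplified in-memory sketch of block-nested-loops. It is not the exact bookkeeping of the ICDE 2001 paper: instead of comparing sequence numbers while re-reading the temporary file, window tuples that entered after the first overflow write are simply fed back into the next pass together with the overflow. The overflow list stands in for the temporary file, and all names are hypothetical.

```python
def dominates(p, q):
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline_bnl(points, window_size):
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    skyline, pending = [], list(points)
    while pending:
        window = []     # (tuple, number of overflow writes before it entered)
        overflow = []   # stands in for the temporary file on disk
        for r in pending:
            if any(dominates(w, r) for w, _ in window):
                continue                              # r is dominated: discard
            # Drop window tuples that r dominates.
            window = [(w, ts) for w, ts in window if not dominates(r, w)]
            if len(window) < window_size:
                window.append((r, len(overflow)))
            else:
                overflow.append(r)                    # window full
        # Tuples that entered before any overflow write met every other tuple.
        skyline.extend(w for w, ts in window if ts == 0)
        pending = [w for w, ts in window if ts > 0] + overflow
    return skyline

pts = [(5, 1), (4, 2), (3, 3), (2, 4), (1, 5), (3, 4), (5, 5)]
print(skyline_bnl(pts, window_size=2))
# [(5, 1), (4, 2), (3, 3), (2, 4), (1, 5)]
```

Each pass confirms or discards at least its first tuple, so the loop terminates. With a window large enough to hold the whole skyline, a single pass suffices, which corresponds to the best case mentioned above.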

Computing the Skyline: divide-and-conquer
Based on multidimensional divide-and-conquer [Bentley80]
In the block-nested-loops algorithm, tuples were considered multiple times
– This is because no order was assumed, and a sequence number was assigned to keep order
– Idea: sort tuples on one attribute, then compute dominance
The algorithm
– Compute the median along dimension d_1
– Recursively compute skylines S_1 and S_2 for the partitions
– Merge S_1 and S_2: eliminate all r in S_2 that are dominated by some r' in S_1
Complexity: O(n log n) for d=2, best and worst case
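A sketch of the 2D case, assuming smaller is better in both dimensions. Sorting once on the first dimension makes the median split trivial, and the merge reduces to a comparison against the smallest second coordinate in S_1. Duplicate points are collapsed by set(), a simplification of this sketch; the function name is hypothetical.

```python
def skyline_dc(points):
    """2D skyline by divide-and-conquer (smaller is better in both dimensions)."""
    pts = sorted(set(points))   # sort on dimension 1, ties broken by dimension 2

    def solve(lo, hi):
        if hi - lo <= 1:
            return pts[lo:hi]
        mid = (lo + hi) // 2    # median position along dimension 1
        s1 = solve(lo, mid)
        s2 = solve(mid, hi)
        # Every point of s1 precedes every point of s2 in sorted order, so a
        # point of s2 survives only if it beats all of s1 on dimension 2.
        best_y = min(p[1] for p in s1)
        return s1 + [p for p in s2 if p[1] < best_y]

    return solve(0, len(pts))

hotels = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]
print(skyline_dc(hotels))  # [(90, 1.2), (100, 1.0)]
```

The result comes back ordered by the first dimension. The sort and the recursion with linear-time merges are each O(n log n), matching the stated bound for d = 2.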

What if one of the coordinates is unknown?
Skylines for scientific literature search [ICDE 2010]
– PubMed: 19 million scientific articles
– Score components: relevance (higher is better) + publication date (more recent is better)
– Relevance is expensive to compute: a combination of non-independent components (so, document-at-a-time)
– An upper bound on relevance is much cheaper to compute
Idea: modify divide-and-conquer to work with score upper bounds
An add-on: compute the skyline and k contours
Demo

Skyline Computation
[Figure sequence over five slides. Axes: publication date (Mar 2010, Feb 2010, Jan 2010, Dec 2009) and relevance. Annotations: batch boundary, UB sort order, 3 skyline contours, dominated with score, dominated no score.]

References
1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor. J. Comput. Syst. Sci. 66(4): (2003).
2. Minimal probing: supporting expensive predicates for top-K queries. Kevin Chen-Chuan Chang and Seung-won Hwang. SIGMOD 2002.
3. The Skyline operator. Stephan Börzsönyi, Donald Kossmann, Konrad Stocker. ICDE 2001.
4. Semantic Ranking and Result Visualization for Life Sciences Publications. Julia Stoyanovich, William Mee and Kenneth A. Ross. ICDE 2010.