Topic 3 Top-K and Skyline Algorithms
What is top-k processing?
Find the k items that best answer a user's query
– As a set, as a sorted list, or as a sorted list with scores
– Usually from among N items, where N >> k
Application domains
– Search over structured datasets with user-defined preferences
  – Find large apartments in a good school district in Brooklyn
  – Find cheap hotels that are near a beach
– Web search and other document retrieval / ranking tasks
  – Find documents about "Massachusetts election health care"
– Search over multimedia repositories (with probability scores)
  – Find images that show a palm tree next to a house
– Many others…
SQL Example
SELECT h.id, s.name, h.price, s.tuition
FROM Houses h, Schools s
WHERE h.location = s.location
ORDER BY h.price, s.tuition;

SELECT h.id, s.name, h.price + 10 * s.tuition AS score
FROM Houses h, Schools s
WHERE h.location = s.location
ORDER BY score
STOP AFTER 10;

SELECT h.id, s.name,
       h.price + 10 * s.tuition + 2 * distance(h.location, s.location) AS score
FROM Houses h, Schools s
ORDER BY score
STOP AFTER 10;

(STOP AFTER k is not standard SQL; many systems write LIMIT k or FETCH FIRST k ROWS ONLY instead.)
Top-K vs. SQL querying
Compared to standard SQL querying
– Relevance to a degree, not Boolean
– Return only the best items, not all items
– Quality of an item is expressed by a score

SELECT h.id, s.name,
       h.price + 10 * s.tuition + 2 * distance(h.location, s.location) AS score
FROM Houses h, Schools s
ORDER BY score
STOP AFTER 10;
Why do we care about top-k processing?
Many practical applications
Representative of many data management problems
– Solid application scenarios, with new ones emerging every day
  – E.g., top-k over Web-accessible resources
  – E.g., top-k for social search
– Variety of algorithmic approaches
– Explores the trade-off between run-time performance and space overhead
– Costs and balances the use of available operators
Ranking functions
Ranking (scoring) functions are used to compute the score of an item.
An item is r(x_1, …, x_m), where the x_i are the ranking attributes, e.g., square footage of a house, number of times "Massachusetts" occurs in the text, etc.
score(r) = g(f_1(x_1), …, f_m(x_m))
– where the f_i are monotone functions, e.g., f(x) = 2 * x
– g is a monotone aggregation function, e.g., sum, average, max, min
– e.g., score(r) = 2 * sq. ft. + 3 * quality of school district
Definition
– A function f is monotone if f(x) ≤ f(y) whenever x ≤ y
– An aggregation function g is monotone if g(r) ≤ g(r') whenever r.x_i ≤ r'.x_i for all i
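The definition of score(r) can be sketched in a few lines of Python. This is only an illustration: `make_score`, `house_score`, and the two sample houses are hypothetical names and values, not part of the slides.

```python
from typing import Callable, Sequence

def make_score(fs: Sequence[Callable[[float], float]],
               g: Callable[[Sequence[float]], float]) -> Callable[[Sequence[float]], float]:
    """Build score(r) = g(f_1(x_1), ..., f_m(x_m)) from per-attribute functions."""
    def score(xs: Sequence[float]) -> float:
        return g([f(x) for f, x in zip(fs, xs)])
    return score

# The slide's example: 2 * sq. ft. + 3 * quality of school district.
house_score = make_score([lambda sqft: 2 * sqft, lambda quality: 3 * quality], sum)

small = (1000, 8)   # hypothetical house: 1000 sq. ft., school quality 8
large = (1200, 9)   # at least as good in every attribute

print(house_score(small), house_score(large))
```

Because every f_i and g is monotone, a house that is at least as good in every attribute can never receive a lower score; swapping `sum` for `max` or `min` preserves this property.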
Quantifying top-K algorithm performance
Execution time
– Sequential access (SA)
  – accessing items in order, e.g., by reading from a cursor
  – similar concept to a sequential disk read, where seek time is amortized over multiple accesses
– Random access (RA)
  – accessing items out of order, e.g., a primary key lookup
  – similar to a random disk read
  – typically more expensive than an SA (even by orders of magnitude), sometimes impossible
– Why not use wall clock time?
Buffer size
– How much state do we have to keep during computation?
– Is the size bounded by some constant (e.g., k), or is it linear in the size of the dataset (N)? (recall that k << N)
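One simple way to turn access counts into a single number is a weighted sum. The sketch below assumes per-access costs `c_sa` and `c_ra` (hypothetical values chosen for illustration) and plugs in the SA/RA counts that these slides report for FA and TA on Example 1.

```python
def access_cost(num_sa: int, num_ra: int, c_sa: float = 1.0, c_ra: float = 10.0) -> float:
    """Weighted access cost; c_sa and c_ra are assumed per-access costs."""
    return num_sa * c_sa + num_ra * c_ra

# Counts quoted in the slides for Example 1, k = 3.
fa_cost = access_cost(8, 2)   # FA: 8 SA + 2 RA
ta_cost = access_cost(5, 4)   # TA: 5 SA + 4 RA
print(fa_cost, ta_cost)

# If random accesses were as cheap as sequential ones, the ranking flips.
print(access_cost(8, 2, c_ra=1.0), access_cost(5, 4, c_ra=1.0))
```

Which algorithm looks cheaper depends on the assumed ratio between the two access costs, which is one reason to report SA and RA counts separately.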
Outline
Intro
Fundamentals of top-k processing
– Fagin algorithm (FA)
– Threshold algorithm (TA)
– No random access algorithm (NRA)
Extensions and alternatives
– top-k with expensive predicates (MPro)
– TAAT vs. DAAT vs. …
Skylines
– Dominance
– Fundamental algorithms
– Extensions (demo)
Naïve Computation of the top-k Answers
Example 1 (on the board & next slide): R(id, annual income, net worth)
score(r) = r.income + r.net_worth
Algorithm
– Compute the score of each item
– Sort items in decreasing order of score
– Return k items with the highest score
Properties of the naïve solution
– Advantage: simple
– Disadvantage: unacceptable run-time performance when N is high
Idea: throw space at the problem
– pre-compute inverted lists for components of the score
– aggregate partial scores at run-time
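A minimal Python sketch of the naïve algorithm and of the pre-computed inverted lists follows. The item values in `ITEMS` are hypothetical (they are not the values of Example 1), and `naive_top_k` and `build_sorted_lists` are illustrative names.

```python
from typing import Dict, List, Tuple

# Hypothetical items: id -> (x1, x2). Not the data of Example 1.
ITEMS: Dict[str, Tuple[int, int]] = {
    "a": (9, 3), "b": (8, 8), "c": (7, 10),
    "d": (5, 2), "e": (3, 7), "f": (1, 1),
}

def naive_top_k(items, k, agg=sum):
    """Score every item, sort by score descending, return the best k."""
    scored = [(agg(attrs), item_id) for item_id, attrs in items.items()]
    scored.sort(reverse=True)
    return [(item_id, s) for s, item_id in scored[:k]]

def build_sorted_lists(items) -> List[List[Tuple[str, int]]]:
    """Pre-compute one inverted list per attribute, sorted by value descending."""
    m = len(next(iter(items.values())))
    return [sorted(((item_id, attrs[i]) for item_id, attrs in items.items()),
                   key=lambda pair: pair[1], reverse=True)
            for i in range(m)]

print(naive_top_k(ITEMS, 2))
print(build_sorted_lists(ITEMS))
```

The naïve version touches all N items for every query; the inverted lists are built once and let later algorithms stop after reading only a prefix of each list.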
Example 1
[Table of items r1 to r10 with columns: id, income (K$), net worth (K$), score = income + net worth. Cell values were not preserved.]
The Basic Indexing Structure: Inverted List
(a "?" marks a value that was not preserved)

L1: sorted by income        L2: sorted by net worth
id    income                id    net worth
r1    150                   r7    500
r2    ?                     r3    450
r3    125                   r4    ?
r4    100                   r2    425
r5    ?                     r1    350
r6    80                    r9    300
r7    75                    r5    200
r8    ?                     r6    100
r9    50                    r8    50
r10   ?                     r10   50
Fagin Algorithm (FA)
Algorithm
– Access all lists sequentially (SA), in parallel
– STOP once k items have each been seen under sequential access in all lists
– Compute the scores of incomplete items (seen in some lists but not all) by performing random accesses (RA)
– Sort on score, return the best k items
Work out Example 1 on the board for k = 3
Performance
– 8 SA + 2 RA
– 5 objects in buffer
Is this algorithm correct?
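The steps of FA can be sketched as follows. This is an illustrative Python version under stated assumptions: the lists in `LISTS` are hypothetical data (not Example 1), all lists have the same length, random access is simulated with a dictionary lookup, and `fagin_algorithm` is a name chosen here.

```python
from typing import Dict, List, Tuple

# Hypothetical inverted lists of (id, partial score), sorted by score descending.
LISTS: List[List[Tuple[str, int]]] = [
    [("a", 9), ("b", 8), ("c", 7), ("d", 5), ("e", 3), ("f", 1)],
    [("c", 10), ("b", 8), ("e", 7), ("a", 3), ("d", 2), ("f", 1)],
]

def fagin_algorithm(lists, k, agg=sum):
    """Return (top-k as [(id, score)], number of SA, number of RA)."""
    m = len(lists)
    seen: Dict[str, Dict[int, int]] = {}   # id -> {list index: partial score}
    num_sa = 0
    # Phase 1: sequential access in parallel until k items are seen in all lists.
    for depth in range(len(lists[0])):     # assumes equally long lists
        for i, lst in enumerate(lists):
            item, value = lst[depth]
            seen.setdefault(item, {})[i] = value
            num_sa += 1
        if sum(len(parts) == m for parts in seen.values()) >= k:
            break
    # Phase 2: random access for the missing components of incomplete items.
    lookup = [dict(lst) for lst in lists]  # simulates random access by id
    num_ra = 0
    scored = []
    for item, parts in seen.items():
        for i in range(m):
            if i not in parts:
                parts[i] = lookup[i][item]
                num_ra += 1
        scored.append((agg(parts[j] for j in range(m)), item))
    scored.sort(reverse=True)
    return [(item, s) for s, item in scored[:k]], num_sa, num_ra

print(fagin_algorithm(LISTS, 2))
```

Note that the buffer `seen` holds every item encountered in phase 1, so its size is not bounded by k.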
Threshold Algorithm (TA)
Algorithm
– Access all lists sequentially (SA), in parallel
– After each cursor move
  – Compute the score of the item r under the cursor with random accesses (RA)
  – Record r in the buffer if
    (i) buffer size < k, or
    (ii) r's score > the k-th score; in this case, remove the k-th item from the buffer
  – Update the threshold τ = g(current list scores)
  – STOP when the k-th score ≥ τ
– Return the k items currently in the buffer
Work out Example 1 on the board for k = 3
Performance
– 5 SA + 4 RA; #RA is at most #SA * (m-1), where m is the number of lists
– 3 objects in buffer
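A compact Python sketch of TA is shown below. It is an illustration under assumptions: `LISTS` is hypothetical data (not Example 1), lists are equally long, random access is simulated with dictionaries, the cursor of a list that has not been read yet is treated as +infinity, and an item already in the buffer is not probed again.

```python
import heapq
from typing import List, Tuple

# Hypothetical inverted lists of (id, partial score), sorted by score descending.
LISTS: List[List[Tuple[str, int]]] = [
    [("a", 9), ("b", 8), ("c", 7), ("d", 5), ("e", 3), ("f", 1)],
    [("c", 10), ("b", 8), ("e", 7), ("a", 3), ("d", 2), ("f", 1)],
]

def threshold_algorithm(lists, k, agg=sum):
    """Return (top-k as [(id, score)], number of SA, number of RA)."""
    m = len(lists)
    lookup = [dict(lst) for lst in lists]   # simulates random access by id
    cursor = [float("inf")] * m             # last value read from each list
    top = []                                # min-heap of (score, id), at most k entries
    in_top = set()                          # ids currently in the buffer
    num_sa = num_ra = 0
    for depth in range(len(lists[0])):      # assumes equally long lists
        for i, lst in enumerate(lists):
            item, value = lst[depth]
            num_sa += 1
            cursor[i] = value
            if item not in in_top:
                score = agg(value if j == i else lookup[j][item] for j in range(m))
                num_ra += m - 1
                if len(top) < k:
                    heapq.heappush(top, (score, item))
                    in_top.add(item)
                elif score > top[0][0]:
                    _, evicted = heapq.heapreplace(top, (score, item))
                    in_top.discard(evicted)
                    in_top.add(item)
            threshold = agg(cursor)         # best possible score of any unseen item
            if len(top) == k and top[0][0] >= threshold:
                return [(it, s) for s, it in sorted(top, reverse=True)], num_sa, num_ra
    return [(it, s) for s, it in sorted(top, reverse=True)], num_sa, num_ra

print(threshold_algorithm(LISTS, 2))
```

The only state kept between steps is the k-element heap and the m cursor values, which is what makes the buffer bounded.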
Comparison between FA and TA
Example 2 on the board (k = 3)
Theorem: #SA in TA ≤ #SA in FA
Theorem: TA requires only bounded buffers; FA's buffers are not bounded
Instance optimality of TA
Theorem: TA is instance-optimal over the class of algorithms A that
1. correctly find the top-k answers, and
2. do not make any random ("wild") guesses, i.e., random accesses to items not yet seen under sequential access
What about over all algorithms that correctly identify the top-k answers?
– Example 4.4 (PODS 2001) on board
What if we couldn't do random accesses?
Sometimes it suffices to output the top-k as a set
Sometimes we can get away with outputting the top-k in sorted order, but with no scores
Example 8.3 from [Fagin et al. JCSS 2003]
No Random Access Algorithm (NRA)
Algorithm
– Access all lists sequentially (SA), in parallel
– After each cursor move
  – Compute the worst-case score W(r) and the best-case score B(r) for each seen r
  – Sort all seen items on W(r), breaking ties by B(r)
  – Update τ = g(current list scores) (this is the best-case score of any unseen object)
  – STOP when W(r) of the k-th object ≥ τ and no seen object outside the top k has a best-case score above it
– If random accesses are possible, compute complete scores of the top-k items
– Return the top-k items
Example 1 on the board for k = 3
Performance
– 13 SA + 0 RA
– Optimal performance if no RAs are allowed
– In reality, computation may be slow: potentially large buffers are re-sorted at each step
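The bookkeeping of NRA can be sketched as below. Assumptions made for this illustration: `LISTS` is hypothetical data (not Example 1), lists are equally long, partial scores are non-negative so that a missing component contributes `floor = 0` to the worst case, and the returned scores are worst-case scores, which may be lower bounds.

```python
from typing import List, Tuple

# Hypothetical inverted lists of (id, partial score), sorted by score descending.
LISTS: List[List[Tuple[str, int]]] = [
    [("a", 9), ("b", 8), ("c", 7), ("d", 5), ("e", 3), ("f", 1)],
    [("c", 10), ("b", 8), ("e", 7), ("a", 3), ("d", 2), ("f", 1)],
]

def nra(lists, k, agg=sum, floor=0):
    """Return (top-k as [(id, worst-case score)], number of SA). No random access."""
    m = len(lists)
    seen = {}                          # id -> {list index: partial score}
    cursor = [float("inf")] * m        # last value read from each list
    num_sa = 0

    def worst(parts):                  # W(r): unknown components take the minimum
        return agg(parts.get(j, floor) for j in range(m))

    def best(parts):                   # B(r): unknown components take the cursor value
        return agg(parts.get(j, cursor[j]) for j in range(m))

    for depth in range(len(lists[0])): # assumes equally long lists
        for i, lst in enumerate(lists):
            item, value = lst[depth]
            num_sa += 1
            cursor[i] = value
            seen.setdefault(item, {})[i] = value
            ranked = sorted(seen, key=lambda it: (worst(seen[it]), best(seen[it])),
                            reverse=True)
            if len(ranked) < k:
                continue
            top, rest = ranked[:k], ranked[k:]
            w_k = worst(seen[top[-1]])
            # Stop when neither an unseen item nor a seen outsider can beat the k-th.
            if w_k >= agg(cursor) and all(best(seen[it]) <= w_k for it in rest):
                return [(it, worst(seen[it])) for it in top], num_sa
    ranked = sorted(seen, key=lambda it: worst(seen[it]), reverse=True)
    return [(it, worst(seen[it])) for it in ranked[:k]], num_sa

print(nra(LISTS, 2))
```

The full re-sort after every cursor move mirrors the slide's remark about slow computation; a practical implementation would maintain the ranking incrementally.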
Top-k Selection: Rank Aggregation from Multiple Lists
Access model                        Algorithms
Both sorted and random access       FA, TA, Quick-combine, Multi-Step
Sorted access only                  NRA, Stream-combine
Random access only                  MPro, Upper, Pick
Slide courtesy of Ihab F. Ilyas and Walid G. Aref
Minimal Probing (MPro)
Recall: TA sequentially accesses one list and probes all other lists for each encountered object.
What if random accesses (probes) were very expensive?
– User-defined functions
– Calls to external services, e.g., web-based
Idea: execute only the necessary probes [Chang & Hwang, SIGMOD 2002]
Assumptions
– One component of the score is accessed sequentially (SA); the other components are probed (RA)
– The probing schedule is given, either global (same for all items) or per-item
The MPro Example
SELECT id
FROM House
WHERE new(age) s,               -- sequential access
      cheap(price, size) p_C,   -- random access, a.k.a. probe
      large(size) p_L           -- random access
ORDER BY MIN(s, p_C, p_L)
STOP AFTER 2

[Table of items a to e with columns: id, s = new(age), p_C = cheap(price, size), p_L = large(size), score = MIN(s, p_C, p_L). Cell values were not preserved.]
The MPro Algorithm
For an item r:
– W(r): score seen so far
– B(r): best-case score, where unseen components take the maximum score (1.0)
What is a necessary probe?
– The probe pr(r) is necessary if there do not exist k items v_1, …, v_k such that B(r) < B(v_i) for each v_i
The MPro Algorithm
– Access items sequentially
– Maintain a best-case queue (a.k.a. the ceiling queue)
  – STOP when k items with complete scores are at the head of the queue
  – Otherwise, probe the first incomplete item and re-sort the queue
Example 5 on the board (Figure 4 from [Chang & Hwang, SIGMOD 2002])
– MIN is the aggregation, not SUM!
– Find the top-2 items (k = 2)
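The ceiling-queue idea can be sketched in Python as follows. Everything concrete here is an assumption for illustration: the scores in `STREAM`, `P_C`, and `P_L` are made up (they are not the values of the board example), the probing schedule is global (p_C first, then p_L), the aggregation is MIN, and the function returns a probe count so the savings are visible.

```python
import heapq

# Hypothetical data: items sorted by the sequentially accessed score s, descending.
STREAM = [("a", 0.9), ("b", 0.8), ("c", 0.75), ("d", 0.5), ("e", 0.4)]
P_C = {"a": 0.6, "b": 0.9, "c": 0.3, "d": 0.9, "e": 0.9}   # expensive predicate 1
P_L = {"a": 0.8, "b": 0.7, "c": 0.9, "d": 0.9, "e": 0.9}   # expensive predicate 2
PROBES = [P_C.__getitem__, P_L.__getitem__]                 # global probing schedule

def mpro(stream, probes, k):
    """Top-k under MIN aggregation; returns ([(id, score)], number of probes)."""
    queue = []        # max-heap on best-case score B via (-B, id, probes done)
    pos = 0           # next unread position in the sorted stream
    result = []
    num_probes = 0
    while len(result) < k:
        # Read sequentially while an unread item could still outrank the head.
        while pos < len(stream) and (not queue or stream[pos][1] >= -queue[0][0]):
            item, s = stream[pos]
            heapq.heappush(queue, (-s, item, 0))   # unprobed components count as 1.0
            pos += 1
        if not queue:
            break
        neg_b, item, done = heapq.heappop(queue)
        if done == len(probes):
            result.append((item, -neg_b))          # complete item at the head: output
        else:
            b = min(-neg_b, probes[done](item))    # necessary probe: refine B(r)
            num_probes += 1
            heapq.heappush(queue, (-b, item, done + 1))
    return result, num_probes

print(mpro(STREAM, PROBES, 2))
```

On this data the sketch performs 5 probes, while probing every predicate of every item would take 10; items whose best-case score never reaches the head of the queue are never probed.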
The big picture: top-k is term-at-a-time
[Query evaluation: strategies and optimizations. Turtle & Flood, 1995.]
Term-at-a-time (TAAT)
– Partial scores for documents are kept in accumulators
– May (it had better!) terminate before all documents are considered, but each document may be considered more than once
– Typically outperforms DAAT when the contributions of terms to the score are independent and the dataset is reasonably small
Document-at-a-time (DAAT)
– Complete scores for documents are computed (i.e., typically all documents are considered)
– Smaller memory footprint than in TAAT
– Easier to execute in parallel than TAAT
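The difference between the two evaluation orders can be illustrated with a small Python sketch. The postings in `POSTINGS` are hypothetical (integer term weights, sorted by document id), the score is a plain sum of term contributions, and neither function implements early termination; the point is only where the state lives.

```python
import heapq
from collections import defaultdict
from itertools import groupby

# Hypothetical postings: term -> [(doc id, term weight)], sorted by doc id.
POSTINGS = {
    "election": [(1, 5), (2, 2), (4, 9)],
    "health":   [(1, 4), (3, 7), (4, 1)],
}

def taat(postings, terms, k):
    """Term-at-a-time: one accumulator per document, one pass per term."""
    acc = defaultdict(int)
    for term in terms:
        for doc, weight in postings[term]:
            acc[doc] += weight                   # partial score grows term by term
    return heapq.nlargest(k, acc.items(), key=lambda kv: kv[1])

def daat(postings, terms, k):
    """Document-at-a-time: merge postings by doc id, keep only a k-element heap."""
    merged = heapq.merge(*(postings[t] for t in terms))   # ordered by doc id
    top = []                                     # min-heap of (score, doc)
    for doc, group in groupby(merged, key=lambda p: p[0]):
        score = sum(weight for _, weight in group)        # complete score at once
        if len(top) < k:
            heapq.heappush(top, (score, doc))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, doc))
    return [(doc, s) for s, doc in sorted(top, reverse=True)]

print(taat(POSTINGS, ["election", "health"], 2))
print(daat(POSTINGS, ["election", "health"], 2))
```

TAAT keeps an accumulator for every document it touches, whereas DAAT needs only the k-element heap plus one cursor per term, which is the memory-footprint difference noted above.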
Score-at-a-time revisited
Lots of trade-offs determine whether TAAT or DAAT is better
– E.g., optimized DAAT is costly because of re-sorting (for long queries)
– E.g., TAAT is costly because of its memory footprint
– E.g., DAAT is much better for context-sensitive queries
– Particular algorithms, datasets, scoring functions, and architectures bring their own trade-offs
What about score-at-a-time (Anh and Moffat, SIGIR 2006)?
– Similar to TAAT, but the order in which terms are considered is determined by the impact of a given term in a given document
– A custom (efficient) indexing method for a custom (effective) scoring function
Ranking functions and dominance
Monotone aggregation ranking functions express user preferences
– r(price, distToBeach); score(r) = w_1 * price + w_2 * distToBeach
– score(r_1) < score(r_2) whenever r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
– this holds for any positive weights w_1 and w_2! (What if one of the weights was negative? What if both were negative?)
– In this example, we are interested in minimizing the score
Mapping a multi-dimensional (e.g., 2D) space into R
– In top-k, we look for k tuples that have the best score, which is weight-specific, but …
  – users may differ on the weights in their preferences
  – users may want to understand the structure of the 2D space
– Skylines to the rescue! They
  – make the structure of the space explicit
  – are independent of the weights
Skylines represent dominance
Tuple r_1 is better than r_2 according to its score if
– score(r_1) < score(r_2), which holds for all positive weights whenever
– r_1.price < r_2.price AND r_1.distToBeach < r_2.distToBeach
This corresponds to the notion of dominance (Pareto-optimality)
– Definition: r_1 dominates r_2 iff r_1 is as good as or better than r_2 in all dimensions, and strictly better in at least one dimension.
– Some tuples may be incomparable
  r_1: $100, 1.0 mi
  r_2: $90, 1.2 mi
  r_3: $110, 1.1 mi
  r_4: $120, 1.1 mi
– Observe: dominance is transitive
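The dominance definition translates directly into a short Python predicate. The four tuples are the ones listed on the slide, written as (price, distance) pairs with smaller values being better; the function name `dominates` is chosen here for illustration.

```python
def dominates(p, q):
    """True iff p is at least as good as q everywhere and strictly better somewhere.

    Tuples are (price, distToBeach); smaller is better in both dimensions.
    """
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

r1, r2, r3, r4 = (100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)

print(dominates(r1, r3))                     # r1 is cheaper and closer than r3
print(dominates(r1, r2), dominates(r2, r1))  # incomparable: neither dominates
print(dominates(r3, r4), dominates(r1, r4))  # transitivity: r1 > r3 > r4
```

Here r1 and r2 are incomparable (r2 is cheaper, r1 is closer), while r3 and r4 are both dominated by r1.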
Skyline examples
[Figures not preserved.]
Skyline properties
For any monotone ranking function F
– If point r is best according to F, then r is on the skyline
  – A top-1 hotel will always be on the skyline, irrespective of the weights!
– If point r is on the skyline, then there exists an F for which r is best
  – Every hotel on the skyline is someone's favorite
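The first property can be checked empirically on a small example. The sketch below uses the four (price, distance) tuples from the dominance slide and a hypothetical grid of positive weights; it is a sanity check on one dataset with linear scoring functions, not a proof.

```python
def dominates(p, q):
    """Smaller is better in every dimension."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def skyline(points):
    return [p for p in points if not any(dominates(q, p) for q in points)]

HOTELS = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]   # (price, distToBeach)
SKY = skyline(HOTELS)

def best(points, w1, w2):
    """Top-1 under score = w1 * price + w2 * distToBeach (lower is better)."""
    return min(points, key=lambda p: w1 * p[0] + w2 * p[1])

# Hypothetical weight grid; every weight is positive.
winners = {best(HOTELS, w1, w2)
           for w1 in (0.1, 1, 10)
           for w2 in (0.1, 1, 10, 100, 1000)}

print(SKY)
print(winners)
```

Every winner lies on the skyline, and on this grid each skyline hotel wins for at least one choice of weights.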
Computing the Skyline: nested-loops
We will focus on 2D skylines; these methods generalize to multiple dimensions (some non-trivially)

SELECT *
FROM Hotels h
WHERE h.city = 'Nassau'
  AND NOT EXISTS (
    SELECT *
    FROM Hotels h1
    WHERE h1.city = 'Nassau'
      AND h1.distToBeach <= h.distToBeach
      AND h1.price <= h.price
      AND (h1.distToBeach < h.distToBeach OR h1.price < h.price))

How is this evaluated? Complexity?
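One way to think about the evaluation question is to write the NOT EXISTS query as two nested loops. The Python sketch below does that and counts dominance tests; the hotel tuples are the ones from the dominance slide, and the function and counter are illustrative, not a description of how any particular database engine executes the query.

```python
def dominates(p, q):
    """Smaller is better in every dimension."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def nested_loops_skyline(points):
    """Return (skyline, number of dominance tests performed)."""
    comparisons = 0
    result = []
    for p in points:                 # outer query block: candidate hotel h
        dominated = False
        for q in points:             # NOT EXISTS subquery: look for a dominating h1
            comparisons += 1
            if dominates(q, p):
                dominated = True
                break                # one witness is enough
        if not dominated:
            result.append(p)
    return result, comparisons

HOTELS = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]   # (price, distToBeach)
print(nested_loops_skyline(HOTELS))
```

Every skyline point is compared against all n tuples, so the number of dominance tests is quadratic in the worst case.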
Computing the Skyline: block-nested-loops
Maintain a window of incomparable tuples in memory
Iterate until all tuples are either on the skyline or discarded
For each tuple r
– If r is dominated by any tuple in the window, discard r
– If r dominates a tuple r' in the window, discard r', insert r
– If r is incomparable with all tuples in the window, insert r
– If the window is full, write r to a temporary file on disk
At the end of the iteration, output all tuples in the window to a temporary file, with their sequence numbers
r is on the skyline if it was compared to all other tuples; we can tell this by looking at the sequence number (see [Börzsönyi et al., ICDE 2001] for details)
Complexity: O(n) best case (when the skyline is small), O(n^2) worst case
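A simplified in-memory Python sketch of the block-nested-loops idea follows. It is not the exact bookkeeping of the ICDE 2001 paper: the temporary file is modeled as a list, each window entry records how many tuples had overflowed when it was inserted, and window tuples that cannot be confirmed yet are simply re-queued for the next pass. The names and the `window_size` parameter are assumptions of this sketch.

```python
def dominates(p, q):
    """Smaller is better in every dimension."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def bnl_skyline(tuples, window_size):
    """Skyline with a bounded window; overflow goes to a simulated temp file."""
    skyline = []
    pending = list(tuples)
    while pending:
        window = []      # entries: (tuple, overflow count when it was inserted)
        overflow = []    # stands in for the temporary file on disk
        for r in pending:
            if any(dominates(w, r) for w, _ in window):
                continue                                  # r is dominated: discard
            window = [(w, ts) for w, ts in window if not dominates(r, w)]
            if len(window) < window_size:
                window.append((r, len(overflow)))
            else:
                overflow.append(r)                        # window full: defer r
        # Entries inserted before any overflow were compared to all later tuples.
        skyline.extend(w for w, ts in window if ts == 0)
        # The rest must still be compared with earlier overflow tuples.
        pending = [w for w, ts in window if ts > 0] + overflow
    return skyline

HOTELS = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]   # (price, distToBeach)
print(bnl_skyline(HOTELS, window_size=1))
print(bnl_skyline(list(reversed(HOTELS)), window_size=1))
```

With a window large enough to hold the whole skyline, a single pass suffices; with a tiny window, as in the usage above, several passes are needed but the result is the same.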
Computing the Skyline: divide-and-conquer
Based on multidimensional divide-and-conquer [Bentley80]
In the block-nested-loops algorithm, tuples were considered multiple times
– This is because no order was assumed, and a sequence number was assigned to keep order
– Idea: sort tuples on one attribute, then compute dominance
The algorithm
– Compute the median along dimension d_1 and split the tuples into two partitions at it
– Recursively compute skylines S_1 and S_2 for the partitions
– Merge S_1 and S_2: eliminate all r in S_2 that are dominated by some r' in S_1
Complexity: O(n log n) for d = 2, best and worst case
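The recursion can be sketched in Python as below. Assumptions of this illustration: tuples are sorted lexicographically once up front so that the lower half can never be dominated by the upper half, and the merge step uses a generic pairwise dominance test for clarity. That merge is quadratic in the sizes of the partial skylines, so the sketch shows the structure of the algorithm rather than its O(n log n) bound.

```python
def dominates(p, q):
    """Smaller is better in every dimension."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))

def dc_skyline(points):
    """Divide-and-conquer skyline; splits at the median of the first dimension."""
    ordered = sorted(points)               # lexicographic: first dimension, then rest

    def solve(part):
        if len(part) <= 1:
            return list(part)
        mid = len(part) // 2               # median position along d_1
        s1 = solve(part[:mid])             # partition with the smaller d_1 values
        s2 = solve(part[mid:])             # partition with the larger d_1 values
        # Merge: nothing in s2 can dominate s1, so only s2 needs filtering.
        return s1 + [q for q in s2 if not any(dominates(p, q) for p in s1)]

    return solve(ordered)

HOTELS = [(100, 1.0), (90, 1.2), (110, 1.1), (120, 1.1)]   # (price, distToBeach)
print(dc_skyline(HOTELS))
```

In two dimensions the merge can be replaced by a comparison against the smallest second coordinate found in S_1, which is what makes the linear-time merge possible.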
What if one of the coordinates is unknown?
Skylines for scientific literature search [ICDE 2010]
– PubMed: 19 million scientific articles
– Score components: relevance (higher is better) + publication date (more recent is better)
– Relevance is expensive to compute: a combination of non-independent components (so, document-at-a-time)
– An upper bound on relevance is much cheaper to compute
Idea: modify divide-and-conquer to work with score upper bounds
An add-on: compute the skyline and k contours
Demo
Skyline Computation
[Figure sequence, images not preserved. Axes: publication date (Mar 2010, Feb 2010, Jan 2010, Dec 2009) and relevance, with batch boundaries marked. Later frames add the UB sort order and 3 skyline contours; the legend distinguishes items that are dominated with a score from items that are dominated with no score.]
References
1. Optimal aggregation algorithms for middleware. Ronald Fagin, Amnon Lotem and Moni Naor. J. Comput. Syst. Sci. 66(4) (2003).
2. Minimal probing: supporting expensive predicates for top-K queries. Kevin Chen-Chuan Chang and Seung-won Hwang. SIGMOD 2002.
3. The Skyline operator. Stephan Börzsönyi, Donald Kossmann, Konrad Stocker. ICDE 2001.
4. Semantic Ranking and Result Visualization for Life Sciences Publications. Julia Stoyanovich, William Mee and Kenneth A. Ross. ICDE 2010.