Reverse Top-k Queries Akrivi Vlachou *, Christos Doulkeridis *, Yannis Kotidis #, Kjetil Nørvåg * *Norwegian University of Science and Technology (NTNU), Trondheim, Norway # Athens University of Economics and Business (AUEB), Greece
2 Outline Motivation & Preliminaries Monochromatic Reverse Top-k Queries Bichromatic Reverse Top-k Queries Threshold-based Algorithm Materialized Views Experimental Evaluation Conclusions & Future Work
3 Rank-aware Query Processing Huge amount of available data Users prefer to retrieve a limited set of k ranked data objects that best match their preferences (top-k queries)
4 Top-k Query Given a scoring function f(), retrieve the k object that best match the user preferences Linear scoring function f w (p) = Σ w[i]*p[i] Weight w[i]: relative importance of attribute i Definition TOP k (w): Given a weighting vector w and a positive integer k, find the k data points p with the minimum f(p) scores Query line of w at point p: defines the score of p Query space of w defined by point p: number of enclosed points determines the rank of p
5 From the perspective of manufacturers: it is important that a product is returned in the highest ranked positions for as many user preferences as possible estimate the impact of a product compared to their competitors products advertise a product to potential customers Reversing the Top-k Query sales representative customer Which customers would be interested?
6 Reverse top-k query: Given a potential product q and a positive integer k, which are the weighting vectors w for which q is in the top-k query result set? Two different versions Monochromatic: no knowledge of user preferences Bichromatic: a dataset with user preferences is given Reversing the Top-k Query sales representative customer
7 Car Database Example A database containing information about different cars Different users have different preferences Bob prefers a cheap car, and does not care much about the age the best choice (top-1) for Bob is the car p 1 with score 2.5 Tom prefers a newer car rather than a cheap car the best choice for Tom and Max is the car p 2
8 Car Database Example Query point q=p 2, k=1: Bichromatic reverse top-k: {(0.2,0.8), (0.5,0.5)} advertise product to Tom and Max Monochromatic reverse top-k: line segment w[price]=[1/7,5/6] estimate the impact of p 2 as 69% Query point q=p 3, k=1: empty result set for the bichromatic query
9 Outline Motivation & Preliminaries Monochromatic Reverse Top-k Queries Bichromatic Reverse Top-k Queries Threshold-based Algorithm Materialized Views Experimental Evaluation Conclusions & Future Work
10 Monochromatic Reverse Top-k Query mRTOP k (q): Given a point q, a positive number k and a dataset S, the result set of the monochromatic reverse top-k query is the locus for which there exists p in TOP k (w i ) such that f wi (q) ≤ f wi (p). The solution space W can be split into a finite set of non- adjacent partitions such that query point q has the same rank for all the weighting vectors. For the monochromatic case: we focus on the 2-d space Solution space 2 2 mRTOP 1 (q) 1
11 Geometric Interpretation d=2, k =1 If q belongs to the convex hull, then there exists exactly one partition in mRTOP 1 (q) Weighting vectors that are perpendicular to pq and qr define the line segment For weighting vectors with smaller and larger slopes than w 1, the relative order of p and q changes Monochromatic reverse top-k, k>1: The solution space may contain more than 1 partition
12 Geometric Interpretation, k >1 Monochromatic reverse top-k, k>1: The solution space may contain more than 1 partition Point q is in the top-2 result set for both w 1 and w 3, but not for w w1w1 w2w2 w3w3 w2w2
13 Example of Algorithm Points that dominate q are ranked higher for any w Points that are dominated by q are ranked lower for any w Incomparable points change their relative ranking once Vectors that are perpendicular to the line segments p i q define the partitions
14 Outline Motivation & Preliminaries Monochromatic Reverse Top-k Queries Bichromatic Reverse Top-k Queries Threshold-based Algorithm Materialized Views Experimental Evaluation Conclusions & Future Work
15 Bichromatic Reverse Top-k Query bRTOPk(q): Given a point q, a positive number k and two datasets S and W, where S represents data points and W is a dataset containing different weighting vectors, a weighting vector w i belongs to the result set, if and only if there exists p in TOP k (w i ) such that f wi (q) ≤ f wi (p) Naïve approach: for each weighting vector process the top-k query test if query point q is in the top-k list
16 Threshold-based Algorithm (RTA) Goal: reduce the number of top-k evaluations by discarding weighting vectors Threshold-based Algorithm (RTA): sort the weighting vectors based on pairwise similarity top-k queries defined by similar vectors, have similar result sets evaluate the first top-k query, calculate a threshold For each weighting vector possibly prune based on threshold refine threshold
17 Example of RTA Algorithm (k=2) Evaluate top-2 query for w 1 Set threshold based on w 2 f w2 (q) > threshold discard w 2 Refine threshold for w 3 W=[ w 1, w 2, w 3 ] Buffer: p1, p2 w1w1 q p4p4 p1p1 p2p2 p3p3 p5p5 p6p6 p7p7 p8p8 p9p9 p 10 w2w2 w3w3
18 Materialized Views Threshold-based Algorithm (RTA) reduce the top-k evaluations by discarding some weighting vectors that are not in the reverse top-k result set process at least as many top-k evaluations as the cardinality of the result set Materialized Views find weighting vectors that belong definitely to the result without top-k evaluation
19 Materialized Views Grid-based space partitioning cell C i lower left corner C i L upper right corner C i U We store for each cell C i the results of reverse top-k queries for corners C i L and C i U w 1, w 2, w 3
20 Materialized Views Given a point q enclosed in cell C i all weighting vectors in RTOP k (C i U ) belong to the result set of q only weighting vectors in RTOP k (C i L ) - RTOP k (C i U ) have to be examined Materialized views can be generalized for arbitrary k<K values w 1, w 2, w 3 w 1, w 2, w 3, w 4
21 Outline Motivation & Preliminaries Monochromatic Reverse Top-k Queries Bichromatic Reverse Top-k Queries Threshold-based Algorithm Materialized Views Experimental Evaluation Conclusions & Future Work
22 Experimental Setup Comparison between Naïve and RTA (varying dimensionality, cardinality, data distribution – real data) Queries: uniform and k-skyband points Metrics: time I/Os number of top-k evaluations
23 RTA vs. Naïve RTA outperforms naive by 1 to 2 orders of magnitude as dimensionality increases, |RTOP k (q)| decreases leading to fewer top-k evaluations uniform distribution of S and uniform weights W |S|=10K, |W|=10K, top-k=10, skyband query points
24 Scalability of RTA Algorithm naive requires |W| top-k query evaluations |W|=5K, correlated dataset: RTA needs on 544 out of 5000 top-k evaluations (saves 89.12% of the cost) the average size of the result set is 459 various distributions (UN, AC, CO) of S and uniform weights W |S|=10K or |W|=10K, d=5, top-k=10, skyband query points
25 Performance of RTA on Real Data uniform and clustered weights W (|W|=10K) clustered weights lead to fewer top-k evaluations NBA consists of tuples, d=5 (number of points scored, rebounds, assists, steals and blocks) HOUSE consists of tuples, d=6 (income spent on gas, electricity, water, heating, insurance, and property tax)
26 Outline Motivation & Preliminaries Example of Reverse Top-k Queries Monochromatic Reverse Top-k Queries Bichromatic Reverse Top-k Queries Threshold-based Algorithm Materialized Views Experimental Evaluation Conclusions & Future Work
27 We introduced reverse top-k queries geometric interpretation of the solution space efficient algorithm for bichromatic reverse top-k query materialized reverse top-k views Future Work interpretation of solution space for higher dimensions (monochromatic reverse top-k) improve the performance of the bichromatic reverse top-k computation Conclusions and Future Work
28 Thank you! More information: Related work: Akrivi Vlachou, Christos Doulkeridis, Yannis Kotidis, Kjetil Nørvåg: "Reverse Top-k Queries" Akrivi Vlachou, Christos Doulkeridis, Kjetil Nørvåg, Yannis Kotidis: "Identifying the Most Influential Data Objects with Reverse Top-k Queries"