Published by Hilda Campbell. Modified over 9 years ago.
1 Top-k Dominating Queries
DB seminar
Speaker: Ken Yiu
Date: 25/05/2006
2 Outline
- Motivations and applications
- Background
- Skyline-based algorithm
- Best-first algorithms
- Experimental results
- Conclusions
3 Top-k query, skyline query
- D: dataset of points in multi-dimensional space ℝ^d
- Top-k query
  - Returns the k points with the lowest F values
  - Example: top-2 is p4, p6
  - Requires a ranking function F
  - Result is affected by the scales of the dimensions
- Skyline query
  - Dominance: p ≻ p' ⇔ (∃i, p[i] < p'[i]) ∧ (∀i, p[i] ≤ p'[i])
  - Returns the points not dominated by any other point
  - Example skyline: p1, p4, p6, p7
  - Result size cannot be controlled
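
The two query types on this slide can be sketched in a few lines of Python. This is a minimal in-memory illustration, not the speaker's code: the helper names (`dominates`, `skyline`, `top_k`) and the sample points are assumptions, and smaller values are taken to be better in every dimension, as in the dominance definition above.

```python
from typing import Callable, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q: p is no worse in every dimension and strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points: List[Point]) -> List[Point]:
    """Points that are not dominated by any other point (quadratic scan)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def top_k(points: List[Point], k: int, f: Callable[[Point], float]) -> List[Point]:
    """The k points with the lowest value of the ranking function f."""
    return sorted(points, key=f)[:k]


# Hypothetical sample data, not the points shown on the slide.
data = [(1, 9), (2, 4), (4, 2), (5, 5), (9, 1)]

sky = skyline(data)                                   # size is not under user control
by_sum = top_k(data, 2, sum)                          # F = x + y
by_scaled = top_k(data, 2, lambda p: 10 * p[0] + p[1])  # F after rescaling x
```

With this data, `by_sum` and `by_scaled` differ, which illustrates how a top-k result depends on the scales of the dimensions, while `sky` returns four of the five points regardless of any k.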
4 Top-k dominating query
- Intuitive score function: μ(p) = |{ p' ∈ D : p ≻ p' }|
- Property: ∀ p, p' ∈ D, p ≻ p' ⇒ μ(p) > μ(p')
- Top-k dominating query: the k points with the highest μ values
  - Also known as k-dominating query [PTFS05]
  - Example: top-2 dominating points are p4 (μ = 3) and p5 (μ = 2)
- Applications: decision support systems, finding the most "popular" objects
- Advantages
  - Control of result size
  - No need to specify a ranking function
  - Result independent of the scales of the dimensions
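
As a rough illustration of the score function, the following brute-force sketch counts dominated points for every point and keeps the k highest counts. The function names and the sample points are assumptions made for this example; the quadratic scan is only meant to show the definition, not an efficient method.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    """p dominates q (smaller is better in every dimension)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    """Number of points in data that p dominates."""
    return sum(dominates(p, q) for q in data)


def top_k_dominating(data: List[Point], k: int) -> List[Tuple[Point, int]]:
    """The k points with the highest scores, as (point, score) pairs."""
    scored = [(p, score(p, data)) for p in data]
    return sorted(scored, key=lambda item: -item[1])[:k]


# Hypothetical sample data, not the points shown on the slide.
data = [(1, 9), (2, 4), (4, 2), (5, 5), (9, 1), (6, 6), (7, 3)]
result = top_k_dominating(data, 2)

# Rescaling one dimension by a positive factor leaves the dominance
# relationships, and therefore the ranking, unchanged.
scaled = [(1000 * x, y) for x, y in data]
scaled_result = top_k_dominating(scaled, 2)
```

Here `result` has exactly k entries, and `scaled_result` ranks the same two objects in the same order, matching the advantages listed above.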
5 Related work
- Spatial aggregation processing
  - Aggregate measures (e.g., number of cars in car parks) in a region (e.g., a district)
- aR-trees [PKZT01]
  - Each entry is augmented with the aggregate measure of all points in its subtree
  - Example: COUNT R-tree
- Query: find the number of points that intersect a window W
  - Prune entries that do not intersect W
  - Entry fully covered by W: increment the result by its count
  - Entry partially covered by W: recursive call
  - Cost in the example: 10 for the aR-tree, but 17 for a typical R-tree
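
The three cases of the COUNT query can be sketched as a recursive function over a tiny hand-built tree. The `Node` class, its fields, and the sample tree are hypothetical and only stand in for an aR-tree; a real implementation would store nodes on disk pages.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, ...]
Box = Tuple[Point, Point]  # (lower corner, upper corner)


@dataclass
class Node:
    box: Box                      # minimum bounding rectangle of the subtree
    count: int                    # aggregate: number of points in the subtree
    children: List["Node"] = field(default_factory=list)
    points: List[Point] = field(default_factory=list)  # only set in leaves


def intersects(a: Box, b: Box) -> bool:
    return all(al <= bh and bl <= ah
               for al, ah, bl, bh in zip(a[0], a[1], b[0], b[1]))


def covers(outer: Box, inner: Box) -> bool:
    return all(ol <= il and ih <= oh
               for ol, oh, il, ih in zip(outer[0], outer[1], inner[0], inner[1]))


def inside(p: Point, w: Box) -> bool:
    return all(lo <= x <= hi for x, lo, hi in zip(p, w[0], w[1]))


def count_query(node: Node, w: Box) -> int:
    if not intersects(node.box, w):
        return 0                          # pruned: no overlap with W
    if covers(w, node.box):
        return node.count                 # fully covered: use the stored count
    if node.points:
        return sum(inside(p, w) for p in node.points)      # partially covered leaf
    return sum(count_query(c, w) for c in node.children)   # partially covered: recurse


# Hypothetical two-leaf tree.
leaf1 = Node(box=((1, 1), (2, 2)), count=2, points=[(1, 1), (2, 2)])
leaf2 = Node(box=((5, 5), (8, 8)), count=2, points=[(5, 5), (8, 8)])
root = Node(box=((1, 1), (8, 8)), count=4, children=[leaf1, leaf2])

answer = count_query(root, ((0, 0), (6, 6)))
```

In this example `leaf1` is fully covered, so its points are never inspected; only `leaf2` has to be opened.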
6 Related work: skyline computation
- Non-indexed data: DC (divide-and-conquer), BNL (block-nested loop), SFS (sort-filter-skyline), LESS (linear elimination sort for skyline)
- Indexed data: NN, BBS [PTFS05]
- Skyline variants based on the dominance relationship
  - Top-k frequent skyline points [CJT+06a]
    - Frequency(p): number of subspaces in which p is a skyline point
  - k-dominant skyline points [CJT+06b]
    - Relax the dominance relationship by k
    - k = d: original skyline; as k decreases, the skyline size decreases
  - Data cube for analyzing the dominance relationship of points [LOTW06]
7 Top-k dominating query
- How to answer a top-k dominating query?
  - Block nested loop join: compute the score of every point
    - Quadratic cost
  - Skyline-based solution: retrieve the skyline points, compute their scores, and find the top-1 among them
    - Expensive for datasets with large skylines
- Goal: develop an efficient algorithm for the query on indexed multi-dimensional data
8 Problem characteristics
- Pre-computation possible?
  - Materialize the score of every point
  - Updates change the scores of influenced points
  - Update cost is high for dynamic datasets
- Find K (K >> k) points with the highest dominating area, then compute their scores to get the best-k results?
  - Approximate solution; hard to specify K
  - Dominating area cannot provide bounds for μ
    - DomArea(p1) = (1 – 0.25) * (1 – 0.50) = 0.375
    - DomArea(p4) = (1 – 0.45) * (1 – 0.40) = 0.330
    - Yet μ(p1) = 1 < μ(p4) = 2
- Unlike the dominating area, computing the μ value (or even an upper bound of it) requires accessing data
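
The dominating-area figures can be checked with a short calculation. The sketch assumes a unit data space [0, 1]^d and reads the coordinates p1 = (0.25, 0.50) and p4 = (0.45, 0.40) off the two formulas above; the function name `dom_area` is made up for this example.

```python
from math import prod
from typing import Tuple


def dom_area(p: Tuple[float, ...]) -> float:
    """Area of the region dominated by p inside the unit space [0, 1]^d."""
    return prod(1.0 - x for x in p)


p1 = (0.25, 0.50)
p4 = (0.45, 0.40)

area_p1 = dom_area(p1)  # 0.75 * 0.50
area_p4 = dom_area(p4)  # 0.55 * 0.60
```

The area ranks p1 above p4, whereas the scores given on the slide rank p4 above p1, so the area depends only on a point's coordinates and says nothing about how many data points actually fall in the dominated region.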
9 Skyline-based solution
- BBS top-k dominating algorithm [PTFS05]
- Example: top-2 dominating query
- Iteration 1
  - Find the skyline points
    - The score of a point is smaller than that of any point dominating it
  - Compute their scores (by accessing the tree)
  - Report p2 (μ = 4) as the first result
- Iteration 2
  - Find the constrained skyline (gray region)
    - Gray region: dominated by p2 but not by the others (p1, p3)
    - Why? A point dominated by p1 or p3 scores lower than p1 or p3, which are still candidates
  - Compute their scores and compare them with the points retrieved in all previous iterations
  - Report p4 (μ = 2) as the second result
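
The iterations above can be imitated in memory. This sketch is a simplification and not the BBS-based algorithm itself: instead of retrieving a constrained skyline from the tree, it recomputes the skyline of the points not yet reported, which yields the same candidates because a dominated point always scores lower than its dominator. All names and the sample data are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    return sum(dominates(p, q) for q in data)


def skyline(points: List[Point]) -> List[Point]:
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_based_top_k(data: List[Point], k: int) -> List[Tuple[Point, int]]:
    remaining = list(data)
    cache: Dict[Point, int] = {}   # scores of all points retrieved so far
    result: List[Tuple[Point, int]] = []
    while remaining and len(result) < k:
        candidates = skyline(remaining)
        for p in candidates:
            if p not in cache:     # only newly exposed points need counting
                cache[p] = score(p, data)
        best = max(candidates, key=lambda p: cache[p])
        result.append((best, cache[best]))
        remaining.remove(best)     # exposes the points only `best` dominated
    return result


# Hypothetical sample data, not the points shown on the slide.
data = [(1, 9), (2, 4), (4, 2), (5, 5), (9, 1), (6, 6), (7, 3)]
result = skyline_based_top_k(data, 2)
counted = len(result) and None  # placeholder-free: see usage note below
```

In the second iteration only (7, 3) is newly exposed, since (5, 5) and (6, 6) are still dominated by (2, 4); this mirrors the role of the constrained skyline.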
10 Our optimizations
- Hilbert ordering of retrieved points before counting
  - Exhibits locality of node accesses
- Batch counting
  - Pack B (page capacity) points into one page and count their scores simultaneously
- e– and e+ denote the lower and upper corners (virtual points) of an entry e, respectively
- Properties
  - p1 ≻ e– ⇒ p1 dominates all points in e
  - p2 ≻ e+ and p2 ⊁ e– ⇒ p2 may dominate some points in e
  - p3 ⊁ e+ ⇒ p3 dominates no points in e
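
The three properties translate into a small classification routine that batch counting can use. The sketch below works on a flat list of hypothetical leaf entries, each given as (lower corner, upper corner, points); the names `relation` and `batch_count` are illustrative, and Hilbert ordering is left out.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]
Leaf = Tuple[Point, Point, List[Point]]  # (e-, e+, points in the entry)


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def relation(p: Point, lower: Point, upper: Point) -> str:
    if dominates(p, lower):
        return "all"      # p dominates e-, hence every point in e
    if dominates(p, upper):
        return "partial"  # p dominates e+ but not e-: must look inside e
    return "none"         # p does not dominate e+: nothing in e is dominated


def batch_count(batch: List[Point], leaves: List[Leaf]) -> Dict[Point, int]:
    scores = {p: 0 for p in batch}
    for lower, upper, pts in leaves:   # each entry is examined once per batch
        for p in batch:
            rel = relation(p, lower, upper)
            if rel == "all":
                scores[p] += len(pts)  # use the count, skip the points
            elif rel == "partial":
                scores[p] += sum(dominates(p, q) for q in pts)
    return scores


# Hypothetical entries and batch.
leaves = [
    ((1, 1), (2, 2), [(1, 2), (2, 1)]),
    ((5, 5), (8, 8), [(5, 8), (8, 5), (6, 6)]),
]
scores = batch_count([(0, 0), (1, 2), (6, 6), (9, 9)], leaves)
```

Processing a whole batch per entry is what lets several points share one node access instead of each point traversing the tree separately.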
11 The best-first approach
- The optimized BBS is inefficient when the skyline is large
  - It is not necessary to compute the whole skyline
- Best-first approach: visit the nodes in descending order of their upper bound scores
  - Use a max-heap H to organize the entries to be visited
  - Keep an array W of the best-k data points found so far
  - Terminate when the top entry in H cannot improve the result
- Compute upper bound scores of entries in the same non-leaf node
  - The upper bound score of an entry e is μ(e–)
    - Why? e– dominates or coincides with every point in e, so no point in e can score higher than μ(e–)
  - For each entry e in the node, put the point e– in the set T
  - Perform batch counting for the points in T
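
A heavily simplified sketch of the best-first idea follows. It assumes a flat list of leaf entries rather than a multi-level aR-tree, and it computes scores by brute force instead of batch counting on the tree; the function names and sample leaves are hypothetical. What it does keep is the max-heap ordered by μ(e–), the best-k array, and the termination test.

```python
import heapq
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    return sum(dominates(p, q) for q in data)


def best_first_top_k(leaves: List[List[Point]], k: int):
    data = [p for leaf in leaves for p in leaf]
    heap = []  # heapq is a min-heap, so upper bounds are stored negated
    for i, leaf in enumerate(leaves):
        lower = tuple(min(c) for c in zip(*leaf))   # lower corner e-
        heapq.heappush(heap, (-score(lower, data), i))
    best: List[Tuple[int, Point]] = []  # min-heap of the best k (score, point)
    visited = 0
    while heap:
        neg_ub, i = heapq.heappop(heap)
        if len(best) == k and -neg_ub <= best[0][0]:
            break  # the top entry cannot improve the result
        visited += 1
        for p in leaves[i]:
            s = score(p, data)
            if len(best) < k:
                heapq.heappush(best, (s, p))
            elif s > best[0][0]:
                heapq.heapreplace(best, (s, p))
    return sorted(best, reverse=True), visited


# Hypothetical leaves, two points each.
leaves = [
    [(1, 9), (1, 8)],
    [(2, 4), (4, 2)],
    [(5, 5), (6, 6)],
    [(7, 3), (9, 1)],
]
result, visited = best_first_top_k(leaves, 2)
```

In this example the search stops after opening a single leaf, because the next-best upper bound (2) does not exceed the lowest score already in the result.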
12 Optimizations for best-first search
- Pruning technique
  - Let γ be the best-k score found so far (the lowest score in the result array)
  - Suppose that a point p satisfies μ(p) ≤ γ. Then any point p' dominated by p has μ(p') < γ
  - Keep a pruner set F of visited data points whose scores are ≤ γ
    - Among the points in F, only their skyline needs to be maintained
  - Apply F to eliminate unqualified entries
- Lazy counting (for computing scores of leaf entries)
  - Only some data points (in the same leaf node) may remain before counting, so batch counting for them alone is not cost-effective
  - Use a FIFO queue L to store discovered points
  - Once L is full (i.e., |L| = page capacity B), perform batch counting for the points in L, update the result, and clear L
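
One possible way to maintain and apply the pruner set is sketched below. It assumes that every point passed to `add_pruner` has already been found to have a score no higher than the current best-k score, and that an entry is pruned when some pruner dominates its lower corner; both function names are made up for this illustration.

```python
from typing import List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def add_pruner(pruners: List[Point], p: Point) -> List[Point]:
    """Insert p while keeping only the skyline of the pruner set."""
    if any(f == p or dominates(f, p) for f in pruners):
        return pruners  # p prunes nothing that an existing pruner does not
    return [f for f in pruners if not dominates(p, f)] + [p]


def can_prune(pruners: List[Point], lower: Point) -> bool:
    """An entry is unqualified if a pruner dominates its lower corner e-,
    because that pruner then dominates every point inside the entry."""
    return any(dominates(f, lower) for f in pruners)


# Hypothetical low-scoring points discovered during the search.
F: List[Point] = []
for p in [(3, 3), (5, 5), (2, 6)]:
    F = add_pruner(F, p)

pruned = can_prune(F, (4, 4))   # dominated by (3, 3)
kept = can_prune(F, (1, 7))     # no pruner dominates this corner
```

Keeping only the skyline of F bounds the number of dominance tests per entry without weakening the pruning power, since any corner dominated by a discarded pruner is also dominated by the pruner that replaced it.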
13 Lightweight best-first search
- Computing upper bound scores for non-leaf entries is expensive
  - Example: the root node contains e1, e2, e3
  - Compute μ(e1–), μ(e2–), μ(e3–) in batch
  - e2– may dominate some points in e1 and e3
  - Cost: 3 node accesses; μ(e1–) = 3, μ(e2–) = 7, μ(e3–) = 3
- Use a lightweight function to compute upper bound scores for non-leaf entries
  - Goal: do not allow leaf nodes to be accessed
  - Compute μ^ub(e1–), μ^ub(e2–), μ^ub(e3–) in batch
  - Cost: 1 node access, since leaf nodes are not accessed
  - e2– dominates all points in e2 and some in e1 and e3
  - μ^ub(e1–) = 3, μ^ub(e2–) = 9, μ^ub(e3–) = 3
  - The bound is correct, and it approximately preserves the original ordering of entries
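
A lightweight bound of this kind can be sketched by adding the full count of every entry that a point fully or partially dominates, without opening the entry. The coordinates and the per-entry count of 3 below are hypothetical, chosen only so that the output reproduces the bounds 3, 9, 3 quoted on the slide; the slide itself does not give the geometry.

```python
from typing import List, Tuple

Point = Tuple[float, ...]
Entry = Tuple[Point, Point, int]  # (lower corner e-, upper corner e+, count)


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def lightweight_ub(p: Point, entries: List[Entry]) -> int:
    """Upper bound on the score of p using only the entries of one node.

    An entry contributes its whole count whenever p dominates its upper
    corner, i.e. whenever p dominates all or possibly some of its points.
    """
    return sum(count for _, upper, count in entries if dominates(p, upper))


# Hypothetical root node with three entries of three points each.
entries: List[Entry] = [
    ((1, 6), (3, 9), 3),  # e1
    ((2, 2), (5, 5), 3),  # e2
    ((6, 1), (9, 3), 3),  # e3
]
bounds = [lightweight_ub(lower, entries) for lower, _, _ in entries]
```

The bound for e2– overestimates the exact value because partially dominated entries are counted in full, but e2 still comes first in the ordering, which is all the max-heap needs.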
14 Incremental best-first search
- No objects need to be pruned
- Data points are inserted into the heap H after their scores have been computed
- When a data point p is deheaped, check whether its score is greater than those of the points in the lazy counting queue L
  - If yes, report p as the next result
  - If not, consider the points in L whose (upper bound) score is greater than that of p
    - Compute their actual scores and insert them into H
    - Insert p back into H
    - The next result is now at the top of H (to be found in the next iteration)
15 Query variant
- Bichromatic top-k dominating query
  - Given a provider dataset D_P and a consumer dataset D_A, the score of a point p in D_P is μ_A(p) = |{ a ∈ D_A : p ≻ a }|
  - Example: μ_A(p1) = 2, μ_A(p2) = 3, μ_A(p3) = 1; the bichromatic top-1 point is p2
  - Application: find the most popular hotel, where D_P contains hotels and D_A specifies requirements from different customers
- Query processing
  - The proposed algorithms are still applicable
  - Search for the results in D_P
  - Perform counting on D_A
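
The bichromatic score differs from the ordinary one only in which dataset is counted. The sketch below uses made-up provider and consumer coordinates, picked so that the scores come out as 2, 3, 1 like the example on the slide; the function names are assumptions.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bichromatic_score(p: Point, consumers: List[Point]) -> int:
    """Number of consumer points dominated by the provider point p."""
    return sum(dominates(p, a) for a in consumers)


def bichromatic_top_k(providers: Dict[str, Point],
                      consumers: List[Point], k: int) -> List[Tuple[str, int]]:
    scored = [(name, bichromatic_score(p, consumers)) for name, p in providers.items()]
    return sorted(scored, key=lambda item: -item[1])[:k]


# Hypothetical hotels (providers) and customer requirements (consumers),
# e.g. (price, distance) with smaller values preferred.
providers = {"p1": (2, 5), "p2": (3, 3), "p3": (6, 2)}
consumers = [(4, 6), (5, 5), (7, 4)]

scores = {name: bichromatic_score(p, consumers) for name, p in providers.items()}
top1 = bichromatic_top_k(providers, consumers, 1)
```

Candidates are drawn from the provider set only, while all counting runs against the consumer set, which is the split described under query processing.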
16 Setup of efficiency experiments
- Algorithms
  - BBS (skyline-based method)
  - Best-first search: BF1 (basic), BF2 (lightweight)
  - Incremental best-first: IBF1, IBF2
- Synthetic datasets: UI (independent), CO (correlated), AC (anti-correlated)
- Parameters and other settings
  - aR-tree node page size: 4K bytes
  - LRU buffer size (%): 0, 1, 2, 5, 10, 15, 20
  - Data size N (million): 0.25, 0.5, 1, 2, 4
  - Data dimensionality d: 2, 3, 4, 5
  - Result size k: 1, 4, 16, 64, 256
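17 I/O cost vs buffer size
[Figure: I/O cost vs buffer size; panels: UI, CO, AC]
18 I/O cost vs k
[Figure: I/O cost vs k; panels: UI, CO, AC]
19 I/O cost vs d
[Figure: I/O cost vs d; panels: UI, CO, AC]
20 I/O cost vs N
[Figure: I/O cost vs N; panels: UI, CO, AC]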
21 Bichromatic queries, I/O cost vs dataset combination
- Column UI/CO means that the provider dataset D_P is UI and the consumer dataset D_A is CO
- BF1 is more efficient than BBS in 7 cases
- BF2 outperforms its competitors in all cases
22 Meaningfulness of results
- Explore the meaningfulness of the results returned by top-k dominating queries
- Real datasets
  - Statistics of NBA players: http://basketballreference.com/stats_download.htm
    - 19112 players (identified by both name and year)
    - Attributes for query: GP (games played), PTS (points), REB (rebounds), and AST (assists)
  - Statistics of BASEBALL pitchers: http://baseball1.com/statistics/
    - 36898 players (identified by both name and year)
    - Attributes for query: W (wins), H (hits), ERA (earned run average), and R (runs allowed)
23 Top-k dominating points meaningful?
- Top-5 dominating points
- Results match the public's view of superstar players in NBA and BASEBALL
- Enables users to discover "top" objects without any specific domain knowledge
24 Skyline vs top-k dominating points
[Figure panels: NBA, BASEBALL]
- Perform a skyline query, then compute top-k dominating points by setting k to the skyline size (69 for NBA and 700 for BASEBALL)
- Plot their dominating scores in descending order
- Observations
  - Top-k dominating points have much higher scores than skyline points
  - Top-k dominating points are more informative to users
25 Conclusions
- Study top-k dominating queries on indexed multi-dimensional data
- Present algorithms for the problem
  - The lightweight best-first algorithm BF2 performs the best
- Top-k dominating queries produce more meaningful results than skylines
26 References
[FLN01] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.
[BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The Skyline Operator. In ICDE, 2001.
[PKZT01] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP Operations in Spatial Data Warehouses. In SSTD, 2001.
[PTFS05] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive Skyline Computation in Database Systems. TODS, 30(1):41–82, 2005.
[CJT+06a] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. On High Dimensional Skylines. In EDBT, 2006.
[CJT+06b] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. Finding k-Dominant Skylines in High Dimensional Space. In SIGMOD, 2006.
[LOTW06] C. Li, B. C. Ooi, A. Tung, and S. Wang. DADA: A Data Cube for Dominant Relationship Analysis. In SIGMOD, 2006.