1 Top-k Dominating Queries DB seminar Speaker: Ken Yiu Date: 25/05/2006.

Presentation transcript:

1 Top-k Dominating Queries DB seminar Speaker: Ken Yiu Date: 25/05/2006

2 Outline
Motivations and applications
Background
Skyline-based algorithm
Best-first algorithms
Experimental results
Conclusions

3 Top-k query, skyline query
D: dataset of points in the multi-dimensional space $\mathbb{R}^d$
Top-k query: the k points with the lowest F values
  Top-2: $p_4$, $p_6$
  Requires a ranking function F
  Result is affected by the scales of the dimensions
Skyline query
  $p > p'$ (p dominates p'): $(\exists i,\ p[i] < p'[i]) \land (\forall i,\ p[i] \le p'[i])$
  Returns the points not dominated by any other point
  Skyline: $p_1$, $p_4$, $p_6$, $p_7$
  Result size cannot be controlled

4 Top-k dominating query
Intuitive score function: $\tau(p) = |\{\, p' \in D,\ p > p' \,\}|$
  Property: $\forall p, p' \in D,\ p > p' \Rightarrow \tau(p) > \tau(p')$
Top-k dominating query: the k points with the highest $\tau$ values
  Also known as k-dominating query [PTFS05]
  Top-2 dominating points: $p_4$ (3), $p_5$ (2)
Applications: decision support systems, finding the most `popular' objects
Advantages
  Control of the result size
  No need to specify a ranking function
  Result is independent of the scales of the dimensions
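The score definition above can be sketched as a brute-force computation. This is only an illustration of the definition, not one of the algorithms from the talk; the function names and the six sample points are hypothetical, and smaller values are assumed to be better in every dimension.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q: no worse in every dimension, strictly better in at least one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    """Number of points in data that p dominates."""
    return sum(1 for q in data if dominates(p, q))


def top_k_dominating(data: List[Point], k: int) -> List[Tuple[int, Point]]:
    """Score every point (quadratic cost) and keep the k highest scores."""
    scored = [(score(p, data), p) for p in data]
    scored.sort(key=lambda sp: (-sp[0], sp[1]))  # ties broken by coordinates
    return scored[:k]


# Hypothetical sample data, not the points shown on the slide.
data = [(1, 6), (2, 3), (3, 2), (4, 4), (5, 5), (6, 1)]
print(top_k_dominating(data, 2))
```

Because the result is defined only by dominance counts, rescaling a dimension (for example, multiplying every first coordinate by 100) leaves the output ranking unchanged.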

5 Related work
Spatial aggregation processing
  Aggregate measures (e.g., number of cars in car parks) in a region (e.g., a district)
aR-trees [PKZT01]
  Each entry is augmented with the aggregate measure of all points in its subtree
Example: COUNT aR-tree
  Query: find the number of points that intersect the window W
  Prune entries that do not intersect W
  Entry fully covered by W: increment the result by its count
  Entry partially covered by W: recursive call
  Cost in the example: 10 for the aR-tree, but 17 for a typical R-tree
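The three cases of the COUNT query (prune, add the stored count, recurse) can be sketched on a tiny in-memory tree. This is a minimal sketch under stated assumptions: the `Entry` class, the helper names, and the sample points are hypothetical, and node construction is done by hand rather than by a real R-tree insertion algorithm.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x_lo, y_lo, x_hi, y_hi)


@dataclass
class Entry:
    mbr: Rect
    count: int  # aggregate: number of points in the subtree
    children: List["Entry"] = field(default_factory=list)  # empty for a data point


def point(x: float, y: float) -> Entry:
    return Entry((x, y, x, y), 1)


def node(children: List[Entry]) -> Entry:
    """Build a parent entry whose MBR and count summarize its children."""
    mbr = (min(c.mbr[0] for c in children), min(c.mbr[1] for c in children),
           max(c.mbr[2] for c in children), max(c.mbr[3] for c in children))
    return Entry(mbr, sum(c.count for c in children), children)


def intersects(a: Rect, b: Rect) -> bool:
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def covers(w: Rect, r: Rect) -> bool:
    return w[0] <= r[0] and w[1] <= r[1] and r[2] <= w[2] and r[3] <= w[3]


def range_count(e: Entry, w: Rect) -> int:
    if not intersects(e.mbr, w):
        return 0                      # pruned: no point of e lies in W
    if covers(w, e.mbr):
        return e.count                # fully covered: use the stored aggregate
    return sum(range_count(c, w) for c in e.children)  # partially covered: recurse


leaf1 = node([point(1, 1), point(2, 2), point(3, 1)])
leaf2 = node([point(6, 6), point(7, 8), point(8, 7)])
root = node([leaf1, leaf2])
print(range_count(root, (0, 0, 4, 4)), range_count(root, (2, 0, 7, 7)))
```

The saving over a plain R-tree comes from the fully covered case, where the subtree is never opened.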

6 Related work: skyline computation
Non-indexed data
  DC (divide-and-conquer), BNL (block-nested loop), SFS (sort-filter-skyline), LESS (linear elimination sort for skyline)
Indexed data
  NN, BBS [PTFS05]
Skyline variants based on the dominance relationship
  Top-k frequent skyline points [CJT+06a]
    Frequency(p): number of subspaces in which p is a skyline point
  k-dominant skyline points [CJT+06b]
    Relax the dominance relationship by k
    k = d: original skyline; as k decreases, the skyline size decreases
  Data cube for analyzing the dominance relationship of points [LOTW06]

7 Top-k dominating query
How to answer a top-k dominating query?
Block nested loop join: compute the score of every point
  Quadratic cost
Skyline-based solution: retrieve the skyline points, compute their scores, and find the top-1 among the skyline points
  Expensive for datasets with large skylines
Goal: develop an efficient algorithm for the query on indexed multi-dimensional data

8 Problem characteristics
Pre-computation possible? No.
  Materialize the `score' of every point
  Updates change the `score' of the influenced points
  The update cost is expensive for dynamic datasets
Find K (K >> k) points with the highest dominating area, then compute their scores to get the best k results? No.
  Approximate solution; hard to specify K
  The dominating area cannot provide bounds for $\tau$
    DomArea($p_1$) = (1 - 0.25) * (1 - 0.50) = 0.375
    DomArea($p_4$) = (1 - 0.45) * (1 - 0.40) = 0.33
    Yet $\tau(p_1) = 1 < \tau(p_4) = 2$ !!!
Unlike the dominating area, computing the $\tau$ value (or even its upper bound) requires accessing the data
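The mismatch between dominating area and score can be reproduced with a few lines. The coordinates of p1 and p4 are the ones on the slide; the three other points are hypothetical, chosen only so that the scores come out as 1 and 2, and the unit data space [0, 1]^d is an assumption.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    return sum(1 for q in data if dominates(p, q))


def dom_area(p: Point) -> float:
    """Volume of the region dominated by p inside the unit space [0, 1]^d."""
    area = 1.0
    for x in p:
        area *= 1.0 - x
    return area


p1, p4 = (0.25, 0.50), (0.45, 0.40)          # coordinates taken from the slide
others = [(0.30, 0.90), (0.60, 0.45), (0.70, 0.42)]  # hypothetical points
data = [p1, p4] + others

print(dom_area(p1), score(p1, data))  # larger area, lower score
print(dom_area(p4), score(p4, data))  # smaller area, higher score
```

The area depends only on the coordinates of the point itself, whereas the score depends on where the other points actually fall, which is why the area cannot serve as a bound.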

9 Skyline-based solution
BBS top-k dominating algorithm [PTFS05]
Example: top-2 dominating query
Iteration 1
  Find the skyline points
    The score of a point is smaller than that of any point dominating it
  Compute their scores (by accessing the tree)
  Report $p_2$ (4) as the first result
Iteration 2
  Find the constrained skyline (gray region). WHY?
    Region dominated by $p_2$ but not by the others ($p_1$, $p_3$)
  Compute their scores and compare them with the points retrieved in all previous iterations
  Report $p_4$ (2) as the second result
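The iteration scheme can be sketched in memory as follows. This is not BBS: no R-tree is used, scores are recomputed by brute force, and the constrained skyline is obtained simply by recomputing the skyline of the points not yet reported (which yields the earlier skyline points plus the points exclusively dominated by the reported one). All names and the sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    return sum(1 for q in data if dominates(p, q))


def skyline(points: List[Point]) -> List[Point]:
    return [p for p in points if not any(dominates(q, p) for q in points)]


def skyline_based_top_k(data: List[Point], k: int) -> List[Tuple[int, Point]]:
    remaining = list(data)
    results: List[Tuple[int, Point]] = []
    while remaining and len(results) < k:
        # A dominated point always scores lower than its dominator, so the
        # next result must lie on the skyline of the unreported points.
        candidates = skyline(remaining)
        neg_score, best = min((-score(p, data), p) for p in candidates)
        results.append((-neg_score, best))
        remaining.remove(best)
    return results


data = [(1, 6), (2, 3), (3, 2), (4, 4), (5, 5), (6, 1)]  # hypothetical points
print(skyline_based_top_k(data, 3))
```

Only skyline points are ever scored in an iteration, which is cheap when the skyline is small and expensive when it is large.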

10 Our optimizations
Hilbert ordering of retrieved points before counting
  Exhibits locality of node accesses
Batch counting
  Pack B (page capacity) points into one page and count their scores simultaneously
$e^-$ and $e^+$ denote the lower and upper corners (virtual points) of an entry e, respectively
Properties
  $p_1 > e^- \Rightarrow p_1$ dominates all points in e
  $p_2 > e^+$ and $p_2 \not> e^- \Rightarrow p_2$ may dominate some points in e
  $p_3 \not> e^+ \Rightarrow p_3$ dominates no points in e
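The three corner properties translate directly into a classification routine. The sketch below assumes an entry is given by its lower and upper corner; the function name, the return labels, and the sample entry are hypothetical.

```python
from typing import Sequence


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def relation(p: Sequence[float], lo: Sequence[float], hi: Sequence[float]) -> str:
    """Classify point p against an entry with lower corner lo and upper corner hi."""
    if dominates(p, lo):
        return "all"    # p dominates every point in the entry: add its count
    if dominates(p, hi):
        return "some"   # p may dominate some points: the entry must be opened
    return "none"       # p dominates no point in the entry: skip it


lo, hi = (4, 4), (6, 6)  # hypothetical entry
print(relation((2, 3), lo, hi), relation((5, 1), lo, hi), relation((7, 2), lo, hi))
```

With a COUNT aR-tree, the "all" case lets a whole subtree contribute to the score without being read.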

11 The best-first approach
The optimized BBS is inefficient when the skyline is large
  It is not necessary to compute the whole skyline
Best-first approach: visit the nodes in descending order of their upper bound scores
  Use a max-heap H to organize the entries to be visited in descending order of their upper bound scores
  Keep an array W of the best k data points found so far
  Terminate when the top entry in H cannot improve the result
Compute the upper bound scores of the entries in the same non-leaf node
  The upper bound score of an entry e is $\tau(e^-)$. WHY?
  For each entry e in the node, put the point $e^-$ in the set T
  Perform batch counting for the points in T
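A simplified, in-memory version of the best-first idea is sketched below. It assumes a flat list of leaves instead of an aR-tree, computes $\tau(e^-)$ by brute force instead of batch counting, and uses hypothetical names throughout. The bound is valid because $e^-$ is no worse than any point p in e in every dimension, so it dominates everything p dominates.

```python
import heapq
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def score(p: Point, data: List[Point]) -> int:
    return sum(1 for q in data if dominates(p, q))


def best_first_top_k(data: List[Point], k: int, leaf_size: int = 4):
    pts = sorted(data)
    leaves = [pts[i:i + leaf_size] for i in range(0, len(pts), leaf_size)]
    heap = []  # max-heap H via negated upper bounds
    for i, leaf in enumerate(leaves):
        lower_corner = tuple(min(c) for c in zip(*leaf))  # virtual point e^-
        heapq.heappush(heap, (-score(lower_corner, data), i))
    best: List[Tuple[int, Point]] = []  # W, kept as a min-heap of size <= k
    while heap:
        neg_ub, i = heapq.heappop(heap)
        if len(best) == k and -neg_ub <= best[0][0]:
            break  # no remaining entry can beat the current k-th best score
        for p in leaves[i]:
            s = score(p, data)
            if len(best) < k:
                heapq.heappush(best, (s, p))
            elif s > best[0][0]:
                heapq.heapreplace(best, (s, p))
    return sorted(best, key=lambda sp: (-sp[0], sp[1]))


data = [(1, 6), (2, 3), (3, 2), (4, 4), (5, 5), (6, 1)]  # hypothetical points
print(best_first_top_k(data, 2, leaf_size=2))
```

On this sample the third leaf is never opened: its bound (2) cannot exceed the second-best score already found.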

12 Optimizations for best-first search
Pruning technique
  Let $\gamma$ be the best-k score found so far (the lowest score in the result array)
  Suppose that a point p satisfies $\tau(p) \le \gamma$. Then any point p' dominated by p has $\tau(p') < \gamma$.
  Keep a pruner set F of visited data points whose scores are $\le \gamma$
    Among the points in F, only their skyline needs to be maintained
  Apply F to eliminate unqualified entries
Lazy counting (for computing the scores of leaf entries)
  When only some data points (in the same leaf node) remain before counting, it is not cost-effective to perform batch counting for them
  Use a FIFO queue L to store the discovered points
  Once L is full (i.e., |L| = page capacity B), perform batch counting for the points in L, update the result, and clear L
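The pruner set can be sketched as a small skyline-maintaining container. How F is applied to a non-leaf entry is not spelled out on the slide; the sketch assumes an entry is discarded when some point of F dominates its lower corner $e^-$, since every point in that entry is then dominated by a point scoring at most $\gamma$. The class and method names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


class PrunerSet:
    """Visited points with score <= gamma; only their skyline is stored."""

    def __init__(self) -> None:
        self.points: List[Point] = []

    def add(self, p: Point) -> None:
        # A point already dominated by (or equal to) a pruner adds no pruning power.
        if any(q == p or dominates(q, p) for q in self.points):
            return
        self.points = [q for q in self.points if not dominates(p, q)] + [p]

    def prunes_point(self, p: Point) -> bool:
        return any(dominates(q, p) for q in self.points)

    def prunes_entry(self, lower_corner: Point) -> bool:
        # Assumption: an entry is unqualified if a pruner dominates its corner e^-.
        return any(dominates(q, lower_corner) for q in self.points)


F = PrunerSet()
for p in [(4, 4), (5, 5), (6, 1)]:  # hypothetical points with score <= gamma
    F.add(p)
print(F.points, F.prunes_entry((5, 6)), F.prunes_entry((3, 5)))
```

Since $\gamma$ only grows as better results are found, a point admitted to F stays a valid pruner for the rest of the search.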

13 Lightweight best-first search
Computing the upper bound scores of non-leaf entries is expensive
  The root node contains $e_1$, $e_2$, $e_3$
  Compute $\tau(e_1^-)$, $\tau(e_2^-)$, $\tau(e_3^-)$ in batch
  $e_2^-$ may dominate some points in $e_1$ and $e_3$
  Cost: 3 node accesses; $\tau(e_1^-) = 3$, $\tau(e_2^-) = 7$, $\tau(e_3^-) = 3$
Use a lightweight function to compute the upper bound scores of non-leaf entries
  Goal: do not allow leaf nodes to be accessed
  Compute $\tau^{ub}(e_1^-)$, $\tau^{ub}(e_2^-)$, $\tau^{ub}(e_3^-)$ in batch
  Cost: 1 node access, since leaf nodes are not accessed
  $e_2^-$ dominates all points in $e_2$ and some in $e_1$ and $e_3$
  $\tau^{ub}(e_1^-) = 3$, $\tau^{ub}(e_2^-) = 9$, $\tau^{ub}(e_3^-) = 3$
  Correct bound! Approximately preserves the original ordering of the entries!
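One way to read the lightweight bound is that a partially dominated entry contributes its whole count instead of being opened. The sketch below follows that reading; the entry geometry is hypothetical (three entries of three points each, placed so that the bounds match the 3, 9, 3 of the example), and the function name is an assumption.

```python
from typing import List, Sequence, Tuple

Corner = Tuple[float, float]
Entry = Tuple[Corner, Corner, int]  # (lower corner, upper corner, COUNT aggregate)


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def lightweight_upper_bound(p: Corner, entries: List[Entry]) -> int:
    """Add the full count of every entry that p dominates fully or partially.

    Only the corners and counts stored in the current node are read, so no
    child node is accessed.
    """
    return sum(count for _lo, hi, count in entries if dominates(p, hi))


# Hypothetical root node with three entries of three points each.
e1: Entry = ((0.1, 0.6), (0.4, 0.9), 3)
e2: Entry = ((0.3, 0.3), (0.5, 0.5), 3)
e3: Entry = ((0.6, 0.1), (0.9, 0.4), 3)
root = [e1, e2, e3]

print([lightweight_upper_bound(e[0], root) for e in root])
```

The bound for the second entry is looser than its exact score (9 versus 7 in the example), but it is obtained from a single node and still ranks that entry first.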

14 Incremental best-first search
No objects need to be pruned
Data points are inserted into the heap H after their scores have been computed
When a data point p is deheaped, check whether its score is greater than the scores of the points in the lazy counting queue L
  If yes, report p as the next result
  If not, consider the points in L whose (upper bound) score is greater than that of p
    Compute their actual scores and insert them into H
    Insert p back into H
    The next result is now at the top of H (to be found in the next iteration)

15 Query variant
Bichromatic top-k dominating query
  Given a provider dataset $D_P$, a consumer dataset $D_A$, and a point p in $D_P$: $\tau_A(p) = |\{\, a \in D_A,\ p > a \,\}|$
  $\tau_A(p_1) = 2$, $\tau_A(p_2) = 3$, $\tau_A(p_3) = 1$
  Bichromatic top-1 point: $p_2$
Application: find the most popular hotel, where $D_P$ contains hotels and $D_A$ specifies the requirements of different customers
Query processing
  The proposed algorithms are still applicable
  Search for the results in $D_P$
  Perform the counting on $D_A$
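The bichromatic score differs from the ordinary one only in which dataset is counted. A brute-force sketch is given below; the hotel and customer coordinates (read as price and distance, smaller is better) are hypothetical, chosen so that the scores match the 2, 3, 1 of the example.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def bichromatic_top_k(providers: List[Point], consumers: List[Point],
                      k: int) -> List[Tuple[int, Point]]:
    """Rank provider points by how many consumer points they dominate."""
    scored = [(sum(1 for a in consumers if dominates(p, a)), p) for p in providers]
    scored.sort(key=lambda sp: (-sp[0], sp[1]))
    return scored[:k]


hotels = [(3, 5), (2, 2), (6, 1)]        # hypothetical p1, p2, p3
requirements = [(4, 6), (3, 3), (7, 7)]  # hypothetical customer requirements
print(bichromatic_top_k(hotels, requirements, 1))
```

Candidates are drawn only from the provider set, while the counting touches only the consumer set, mirroring the two query-processing steps listed above.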

16 Setup of efficiency experiments
Algorithms
  BBS (skyline-based method)
  Best-first search: BF1 (basic), BF2 (lightweight)
  Incremental best-first: IBF1, IBF2
Synthetic datasets
  UI (independent), CO (correlated), AC (anti-correlated)
Parameters and other settings
  aR-tree node page size: 4K bytes
  LRU buffer size (%): 0, 1, 2, 5, 10, 15, 20
  Data size N (million): 0.25, 0.5, 1, 2, 4
  Data dimensionality d: 2, 3, 4, 5
  Result size k: 1, 4, 16, 64, 256

17 I/O cost vs buffer size (plots for UI, CO, AC)

18 I/O cost vs k (plots for UI, CO, AC)

19 I/O cost vs d (plots for UI, CO, AC)

20 I/O cost vs N (plots for UI, CO, AC)

21 Bichromatic queries, I/O cost vs dataset combination
Column UI/CO means that the provider dataset $D_P$ is UI and the consumer dataset $D_A$ is CO
BF1 is more efficient than BBS in 7 cases
BF2 outperforms its competitors in all cases

22 Meaningfulness of results
Explore the meaningfulness of the results returned by top-k dominating queries
Real datasets
  NBA: statistics of NBA players (each identified by both name and year)
    Attributes for the query: GP (games played), PTS (points), REB (rebounds), and AST (assists)
  BASEBALL: statistics of baseball pitchers (each identified by both name and year)
    Attributes for the query: W (wins), H (hits), ERA (earned run average), and R (runs allowed)

23 Are top-k dominating points meaningful?
Top-5 dominating points
Results match the public's view of superstar players in NBA and BASEBALL
Enables users to discover `top' objects without any specific domain knowledge

24 Skyline vs top-k dominating points (plots for NBA, BASEBALL)
Perform a skyline query, then compute the top-k dominating points with k set to the skyline size (69 for NBA and 700 for BASEBALL)
Plot their dominating scores in descending order
Observations
  Top-k dominating points have much higher scores than skyline points
  Top-k dominating points are more informative to users

25 Conclusions
Study top-k dominating queries on indexed multi-dimensional data
Present algorithms for the problem
  The lightweight best-first algorithm BF2 performs the best
Top-k dominating queries produce more meaningful results than skylines

26 References
[FLN01] R. Fagin, A. Lotem, and M. Naor. Optimal Aggregation Algorithms for Middleware. In PODS, 2001.
[BKS01] S. Borzsonyi, D. Kossmann, and K. Stocker. The Skyline Operator. In ICDE, 2001.
[PKZT01] D. Papadias, P. Kalnis, J. Zhang, and Y. Tao. Efficient OLAP Operations in Spatial Data Warehouses. In SSTD, 2001.
[PTFS05] D. Papadias, Y. Tao, G. Fu, and B. Seeger. Progressive Skyline Computation in Database Systems. TODS, 30(1):41–82, 2005.
[CJT+06a] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. On High Dimensional Skylines. In EDBT, 2006.
[CJT+06b] C.-Y. Chan, H. Jagadish, K.-L. Tan, A. Tung, and Z. Zhang. Finding k-Dominant Skylines in High Dimensional Space. In SIGMOD, 2006.
[LOTW06] C. Li, B. C. Ooi, A. Tung, and S. Wang. DADA: A Data Cube for Dominant Relationship Analysis. In SIGMOD, 2006.