Progressive Computation of Constrained Subspace Skyline Queries. Evangelos Dellis, Akrivi Vlachou, Ilya Vladimirskiy, Bernhard Seeger, Yannis Theodoridis.


1 Progressive Computation of Constrained Subspace Skyline Queries. Evangelos Dellis (1), Akrivi Vlachou (1), Ilya Vladimirskiy (1), Bernhard Seeger (1), Yannis Theodoridis (2). (1) Department of Computer Science, University of Marburg, Germany. (2) Department of Computer Science, University of Piraeus, Greece.

2 Overview: Introduction; Motivation and Related Work; Basic STA; Improved Pruning; Indexing using Low-dimensional R-trees; Experimental Evaluation; Conclusions and Future Work.

4 Finding a Hotel Close to the Beach. Which one is better: i or h? (i, because its price and distance dominate those of h.) And i or k?

5 Skyline Queries. Retrieve the points that are not dominated by any other point. A point p dominates another point q if p is as good as or better than q in all dimensions and strictly better in at least one dimension.
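The dominance definition above can be sketched in a few lines of Python. This is a minimal illustration, assuming that smaller values are better in every dimension; the hotel coordinates (price, distance to the beach) are hypothetical values chosen so that i dominates h, as on the previous slide.

```python
from typing import List, Sequence


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """True if p is no worse than q everywhere and strictly better somewhere.

    Assumption: smaller values are better in every dimension.
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Naive quadratic skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical hotels as (price, distance to the beach).
hotels = {"h": (80, 3.0), "i": (60, 2.0), "k": (90, 0.5)}
result = skyline(list(hotels.values()))
```

Here h drops out because i is both cheaper and closer, while i and k are incomparable (one is cheaper, the other is closer), so both stay in the skyline.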

6 Skyline of Manhattan Which buildings can we see? Higher or nearer (a building dominates another building if it is higher, closer to the river, and has the same x position)

7 SQL Extension (SQL syntax). Examples: a) Find a hotel that is cheap and close to the beach. b) Find salespersons who were very successful in 1999 and have a low salary.

9 Motivation Constrained Skyline (car database): A user may only be interested in records within the price range from 3 thousand to 7 thousand euros and with mileage reading between 20K and 100K. The traditional skyline (dashed line) fails to return interesting points.
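A small sketch of the constrained skyline for the car example, assuming smaller is better for both price and mileage. The car records are hypothetical; they are picked so that the unconstrained skyline lies entirely outside the requested ranges, which is the situation the slide describes.

```python
from typing import List, Sequence, Tuple

Car = Tuple[float, float]  # (price in thousand euros, mileage in thousand km)


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """p dominates q: no worse in all dimensions, strictly better in one."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def constrained_skyline(points: List[Car], lower: Car, upper: Car) -> List[Car]:
    """Apply the range constraints first, then compute the skyline of what is left."""
    inside = [
        p for p in points
        if all(lo <= x <= hi for x, lo, hi in zip(p, lower, upper))
    ]
    return [p for p in inside if not any(dominates(q, p) for q in inside)]


# Hypothetical car records.
cars = [(2, 10), (4, 60), (5, 30), (6, 90), (8, 15)]

# Unconstrained skyline: bounds wide enough to include every record.
traditional = constrained_skyline(cars, (0, 0), (1000, 1000))

# Price between 3 and 7 thousand euros, mileage between 20K and 100K.
constrained = constrained_skyline(cars, (3, 20), (7, 100))
```

The traditional skyline contains only the car (2, 10), which violates both constraints, whereas the constrained skyline returns (4, 60) and (5, 30).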

10 Motivation (continued). Subspace Skyline: A car database could contain many other attributes of the cars: horsepower, age, fuel consumption, etc. A customer who is sensitive to the price and the mileage reading (a 2-dimensional subspace) would like to pose a skyline query on those attributes only, rather than on the whole data space. While the dimensionality of the data space might be rather high, skyline queries generally refer to a low-dimensional subspace. Constrained subspace skyline queries form the generalization of all meaningful skyline queries over a given dataset.

11 Related Work. SKYCUBE [VLDB 2005, SIGMOD 2006]: The Skyline Cube (SKYCUBE) consists of the skylines in all possible (2^d - 1) subspaces. Drawback: It is not possible to pre-calculate the points of the full-space skyline and their duplicates, since the result depends on the given constraints (static). SUBSKY [ICDE 2006]: Transforms the multi-dimensional data into one-dimensional values, and therefore permits indexing the dataset with a B+-tree. Drawbacks: 1. It is unable to answer constrained subspace skyline queries, as all points have to be transformed in a pre-processing step. 2. It does not deliver the skyline points progressively.
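The count 2^d - 1 is simply the number of non-empty subsets of the d dimensions. A short sketch that enumerates them (the function name is illustrative, not taken from the SKYCUBE papers):

```python
from itertools import combinations
from typing import Iterator, Sequence, Tuple


def nonempty_subspaces(dims: Sequence[int]) -> Iterator[Tuple[int, ...]]:
    """Yield every non-empty subset of the given dimensions, smallest first."""
    for size in range(1, len(dims) + 1):
        yield from combinations(dims, size)


subspaces_4d = list(nonempty_subspaces(range(4)))
count_10d = sum(1 for _ in nonempty_subspaces(range(10)))
```

For a 10-dimensional dataset this already gives 1023 subspaces, each of which would need its own precomputed skyline.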

12 Related Work (continued). BBS [SIGMOD 2003, TODS 2005]: All points are indexed in an R-tree. mindist(MBR) = the L1 distance between the lower-left corner of the MBR and the origin (NN). BBS keeps a heap of index entries and objects, ordered by mindist. It is still the most efficient method for (constrained subspace) skyline retrieval!
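The role of mindist can be illustrated without an R-tree. The sketch below is a simplification, not BBS itself: it puts plain points (instead of R-tree entries) on a heap keyed by their L1 distance to the origin. Because a dominating point always has a strictly smaller L1 distance than the point it dominates, each popped point only needs to be compared against the skyline found so far.

```python
import heapq
from typing import List, Sequence, Tuple


def mindist_mbr(lower_left: Sequence[float]) -> float:
    """L1 distance between the lower-left corner of an MBR and the origin."""
    return sum(lower_left)


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline_by_mindist(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Pop points in ascending mindist order and test them against the skyline so far.

    Simplification: a flat heap of points stands in for the heap of R-tree
    entries; node expansion and subtree pruning are omitted.
    """
    heap = [(sum(p), p) for p in points]
    heapq.heapify(heap)
    result: List[Tuple[float, ...]] = []
    while heap:
        _, p = heapq.heappop(heap)
        if not any(dominates(s, p) for s in result):
            result.append(p)  # reported progressively, in mindist order
    return result


found = skyline_by_mindist([(1, 9), (5, 5), (9, 1), (6, 6), (2, 8)])
```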

13 Related Work (continued). Shortcomings of BBS: Maintaining a single high-dimensional index to support constrained skyline queries of arbitrary dimensionality is not suitable. 1. Curse of dimensionality: it has been shown that the performance of such high-dimensional indexes deteriorates with an increasing number of dimensions. 2. Random grouping effect: the performance of low-dimensional constrained skyline queries degrades when the dimensionality of the indexed space is high while that of the query space is low. Only low-dimensional indexes, e.g. R-trees, seem to perform well in practice, and for that reason they have found their place in commercial database management systems (DBMS).

14 Our Approach. We vertically partition the data space into several low-dimensional subspaces and index each of these subspaces using an R-tree. A constrained skyline query is then partitioned into several sub-queries, each of which is processed on the corresponding index using incremental NN search. TA-INDEX [DAWAK 2005]: An algorithm for vertically partitioned nearest neighbor queries.
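A rough sketch of what one sub-query sees, under simplifying assumptions: a heap over the projected points stands in for incremental NN search on a low-dimensional R-tree, distances are Manhattan distances measured from the lower corner of the constrained region, and only the constraints on the index's own dimensions are checked. The function name and data are hypothetical.

```python
import heapq
from typing import Iterator, List, Sequence, Tuple


def incremental_nn(
    points: List[Sequence[float]],
    dims: Sequence[int],
    lower: Sequence[float],
    upper: Sequence[float],
) -> Iterator[Tuple[float, int]]:
    """Yield (partial distance, point id) in ascending order for one subspace.

    Stand-in for incremental NN search on the R-tree that indexes `dims`.
    """
    entries = []
    for pid, p in enumerate(points):
        if all(lower[d] <= p[d] <= upper[d] for d in dims):
            entries.append((sum(p[d] - lower[d] for d in dims), pid))
    heapq.heapify(entries)
    while entries:
        yield heapq.heappop(entries)


# A 4-d dataset split vertically into two 2-d subspaces (hypothetical data).
data = [(1, 9, 2, 8), (2, 2, 2, 2), (3, 1, 5, 5)]
low, high = (0, 0, 0, 0), (10, 10, 10, 10)
from_index_1 = list(incremental_nn(data, (0, 1), low, high))
from_index_2 = list(incremental_nn(data, (2, 3), low, high))
```

Each index returns the same points in a different order, which is why the partial results have to be combined by a threshold-based algorithm.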

15 Contributions We present a threshold-based skyline algorithm (called STA), which exploits multiple indexes. We propose different pruning strategies to identify dominated regions and to discard irrelevant sub-trees of the indexes. A workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes is presented.

17 Problem Definition. Constrained Subspace Skyline Queries: A point p ∈ D is said to dominate another point q ∈ D on subspace S′ if: 1. on every dimension d_i ∈ S′, p_i ≤ q_i; and 2. on at least one dimension d_j ∈ S′, p_j < q_j. For a point p ∈ D_c in the dimension set S′: the dominance region contains the points that are dominated by p; the anti-dominance region refers to the set of points dominating p.
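The two conditions and the two regions translate directly into code. This is a minimal sketch in which a subspace is a tuple of dimension indexes and the regions are computed by scanning a small, hypothetical dataset.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates_on(p: Point, q: Point, subspace: Sequence[int]) -> bool:
    """p dominates q on the subspace: p_i <= q_i on all its dimensions, p_j < q_j on one."""
    return (all(p[d] <= q[d] for d in subspace)
            and any(p[d] < q[d] for d in subspace))


def dominance_region(p: Point, data: List[Point], subspace: Sequence[int]) -> List[Point]:
    """Points of the dataset that p dominates on the subspace."""
    return [q for q in data if dominates_on(p, q, subspace)]


def anti_dominance_region(p: Point, data: List[Point], subspace: Sequence[int]) -> List[Point]:
    """Points of the dataset that dominate p on the subspace."""
    return [q for q in data if dominates_on(q, p, subspace)]


data = [(1, 1, 9), (2, 2, 1), (3, 3, 5), (2, 3, 0)]
p = (2, 2, 1)
dominated_2d = dominance_region(p, data, (0, 1))
dominating_2d = anti_dominance_region(p, data, (0, 1))
dominating_3d = anti_dominance_region(p, data, (0, 1, 2))
```

The point p is dominated on the subspace (0, 1) but not in the full space, which shows that subspace skylines cannot simply be read off the full-space skyline.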

18 One-point Pruning. Observation: A point p is a skyline point in S′ if and only if there exists no point q that belongs to the anti-dominance area of p for all dimension sets S′_i (1 ≤ i ≤ n). Pruning with the nearest neighbor: we need to prune objects that are not part of the skyline, and the nearest neighbor is a good pruning point for two reasons. 1. Because it is a member of the skyline, there is no point dominating it. 2. Among all the skyline points, it is one whose dominance region has a large volume, and hence it is expected to prune a large percentage of the data points.

19 STA: A Threshold-based Skyline Algorithm. Our algorithm works in two steps. Filter step: All retrieved points are organized in a priority queue (heap) based on their Manhattan distance according to the dimension set S′. We use the Manhattan distance of the last reported point of each S′_i as a threshold to speed up the filtering phase. Refinement step (domination test): The refinement step begins when the first constrained nearest neighbor based on S′ is returned by the filter step. This point is guaranteed to be a skyline point. In the next iteration, when another candidate is found, the refinement step needs to determine whether this candidate is a skyline point or not. The dominance test is performed in a way similar to traditional window queries, using a main-memory R-tree whose dimensionality is equal to the query dimensionality.
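The filter and refinement steps can be sketched as follows. This is an illustrative reconstruction under several assumptions, not the authors' implementation: sorted lists stand in for incremental NN search on the R-trees, indexes are visited round-robin, the threshold is the sum of the last partial distances reported by the indexes, and the dominance test is a linear scan rather than a main-memory R-tree. All names are hypothetical.

```python
import heapq
from typing import Iterator, List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(p: Point, q: Point, dims: Sequence[int]) -> bool:
    return all(p[d] <= q[d] for d in dims) and any(p[d] < q[d] for d in dims)


def sta_sketch(points: List[Point], groups: List[Tuple[int, ...]],
               lower: Sequence[float], upper: Sequence[float]) -> Iterator[Point]:
    """Progressively yield the constrained skyline on the union of `groups`."""
    dims = [d for g in groups for d in g]
    inside = [p for p in points if all(lower[d] <= p[d] <= upper[d] for d in dims)]

    def partial(p: Point, g: Sequence[int]) -> float:
        return sum(p[d] - lower[d] for d in g)  # Manhattan distance on g

    # Stand-in for incremental NN search: ids sorted per simulated index.
    streams = [sorted(range(len(inside)), key=lambda i, g=g: partial(inside[i], g))
               for g in groups]
    pos = [0] * len(groups)     # next unread position in each index
    last = [0.0] * len(groups)  # last partial distance reported by each index
    seen, heap, skyline = set(), [], []

    def refine(threshold: float) -> Iterator[Point]:
        # No unseen point can be closer than the threshold, so every candidate
        # at or below it has already met all of its possible dominators.
        while heap and heap[0][0] <= threshold:
            _, i = heapq.heappop(heap)
            if not any(dominates(s, inside[i], dims) for s in skyline):
                skyline.append(inside[i])
                yield inside[i]

    turn = 0
    while any(pos[k] < len(streams[k]) for k in range(len(groups))):
        k = turn % len(groups)  # round-robin index scheduling
        turn += 1
        if pos[k] == len(streams[k]):
            continue
        i = streams[k][pos[k]]
        pos[k] += 1
        last[k] = partial(inside[i], groups[k])
        if i not in seen:       # filter step: queue by full Manhattan distance
            seen.add(i)
            heapq.heappush(heap, (partial(inside[i], dims), i))
        yield from refine(sum(last))
    yield from refine(float("inf"))


pts = [(1, 9, 2, 8), (2, 2, 2, 2), (3, 1, 5, 5), (9, 9, 9, 9),
       (1, 3, 4, 1), (5, 5, 1, 1), (2, 2, 2, 2), (0, 0, 20, 0)]
result = list(sta_sketch(pts, [(0, 1), (2, 3)], (0, 0, 0, 0), (10, 10, 10, 10)))
```

The first point yielded is the constrained nearest neighbor, and later points appear in ascending order of Manhattan distance, so results can be consumed before all indexes are exhausted. The sketch uses integer coordinates; with floating-point data the comparison against the threshold would need a tolerance.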

20 Index Scheduling. Round-robin strategy: inefficient. We are interested in more advanced strategies that result in a fast increase of the threshold. We choose the index that will increase the partial distance the most, as this is the most beneficial for our threshold. Strategies for index scheduling for nearest neighbor search on a vertically partitioned dataset have been studied in [DAWAK 2005].
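One way to read "the index that will increase the partial distance the most" is a greedy choice over the next entry of each index. The sketch below is only an assumption about how such a rule could look: it presumes the indexes allow peeking at the distance of their next entry, which the slides do not state, and it is not the strategy evaluated in [DAWAK 2005].

```python
from typing import List, Optional


def pick_index(last: List[float], upcoming: List[Optional[float]]) -> Optional[int]:
    """Return the index whose next entry raises its partial distance the most.

    last[k]     : partial distance of the last point reported by index k
    upcoming[k] : partial distance of the next point of index k, or None if
                  index k is exhausted (peeking is an assumption of this sketch)
    """
    best, best_gain = None, -1.0
    for k, nxt in enumerate(upcoming):
        if nxt is None:
            continue
        gain = nxt - last[k]
        if gain > best_gain:
            best, best_gain = k, gain
    return best


choice = pick_index([4, 2], [4, 5])
```

In the example, index 0 would report another point at distance 4 (no gain), while index 1 jumps from 2 to 5, so index 1 is read next and the threshold rises by 3.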

22 Improved Pruning. Motivating example: with non-uniform distributions, points form clusters. Need: pruning using multiple points. Simultaneous pruning: we are not able to prune simultaneously in both subspaces using the same point.

23 Multiple-point Pruning. Observation: When points lying in the dominance region of a point are not discarded in at least one subspace, we are able, under certain conditions, to discard points in all remaining subspaces while guaranteeing no false dismissals. We use the points that are retrieved as local constrained nearest neighbors from one index for pruning in all other indexes. Example: A 4-dimensional data space is divided into two 2-dimensional subspaces. When the point p1 is retrieved from subspace S1, the dominance area of p1 in subspace S2 is used for pruning.

24 Avoiding False Hits. Unfortunately, by following this strategy some skyline points may be falsely discarded. Case 1: Let the point q in the projection S2 collapse on the point q1. The point p is not a skyline point in S, since it is dominated by q in all dimension sets of S. Case 2: On the other hand, if the point q in the projection S2 collapses on the point q2, then the point p may be discarded falsely, since it is a potential skyline point. Solution: To discard points from the dominance area of p in S2, the point p and a point qi must be dominated by the projection of the same point in S2 and S1, respectively. This condition must hold for each point qi that belongs to the discarded area of S1.

26 Random Grouping Effect. Since not all dimensions are used for splitting the axes during the index creation for a leaf node, when a query that requires projection is posed to the index, the performance of the index corresponds to that of a random low-dimensional index, i.e. an index that groups the points into leaf nodes in a mostly random manner. Example: Consider a 10-d data space and assume that we are interested in retrieving the skyline of any 2-d subspace. If only two dimensions are used for splitting, then the probability that the chosen dimensions have been used for splitting is very small. Thus, the query performance is similar to the performance of a 2-d index in which the data points were grouped together randomly.

27 Number of Indexes. If every leaf node is split at least once in each dimension, we need a total of at least 2^d leaf nodes. Well-performing index: every leaf node is split by each dimension once (L ≥ 2^d). This defines a maximum dimensionality for a low-dimensional index. Example: 32-d Color dataset, 68,040 records. Our formula suggests 2 indexes. In this way we index high-dimensional datasets more effectively, by avoiding performance degradation due to the random grouping effect.
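Solving L ≥ 2^d for d gives the largest dimensionality an index with L leaf nodes can support, d ≤ floor(log2 L). A small sketch of just that step follows; the leaf counts are hypothetical, and how the deck derives the final number of indexes from this bound is not shown on the slide, so it is not reproduced here.

```python
def max_index_dimensionality(leaf_count: int) -> int:
    """Largest d with 2**d <= leaf_count, i.e. floor(log2(leaf_count))."""
    if leaf_count < 1:
        raise ValueError("an index needs at least one leaf node")
    # bit_length avoids floating-point error near exact powers of two.
    return leaf_count.bit_length() - 1


d_max = max_index_dimensionality(1024)
```

With 1024 leaf nodes, at most 10 dimensions can each split every leaf once; one leaf fewer already lowers the bound to 9.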

28 Dimension Assignment Algorithm. Number of distinct values: a quality measure of a subspace Si. It reflects points whose projections coincide on a single low-dimensional point, so that such a point is dominated by some duplicate point in the query-dimensional space. DAA: a greedy algorithm that distributes the attributes over the n indexes with two goals: restrict the random grouping effect, and maximize the number of distinct values.

29 Workload-adaptive Extension. User preferences are correlated: use multiple indexes, which are built on the most preferred subspaces. Simple but very powerful extension: associate a probability with each subspace (the frequency with which it is queried) and weight the cost estimation of each dimension set by its probability. This extension allows us to examine the performance of our algorithm under a workload that is closer to real applications, instead of picking random subspaces.
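The weighting step amounts to an expected cost over the query workload. The sketch below is illustrative only: the slides do not give the cost model, so the number of indexes a query has to touch is used here as a hypothetical stand-in for the cost estimate, and the layouts and probabilities are invented.

```python
from typing import Dict, List, Tuple

Subspace = Tuple[int, ...]


def indexes_touched(query: Subspace, layout: List[Subspace]) -> int:
    """Hypothetical cost proxy: how many indexes hold a dimension of the query."""
    return sum(1 for index_dims in layout if set(index_dims) & set(query))


def expected_cost(workload: Dict[Subspace, float], layout: List[Subspace]) -> float:
    """Weight the cost estimate of each queried subspace by its probability."""
    return sum(prob * indexes_touched(query, layout)
               for query, prob in workload.items())


# 80% of the queries ask for dimensions (0, 1), 20% for (2, 3).
workload = {(0, 1): 0.8, (2, 3): 0.2}
layout_a = [(0, 1), (2, 3)]  # indexes aligned with the preferred subspaces
layout_b = [(0, 2), (1, 3)]  # indexes that cut across them
cost_a = expected_cost(workload, layout_a)
cost_b = expected_cost(workload, layout_b)
```

Under this proxy the aligned layout answers every query from a single index, so it has the lower expected cost and would be preferred for this workload.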

31 Experimental Evaluation. Datasets: Three datasets from real-world applications. The NBA dataset contains 17,… 13-dimensional points, where each point corresponds to the statistics of a player in 13 categories. The Color moments dataset contains 9-dimensional features of 68,040 photo images extracted from the Corel Draw database. The Color histogram dataset consists of 32-dimensional features, each representing the histogram of an image. Additionally, we generated 10-dimensional uniform datasets with cardinalities of 10,000, 50,000 and 100,000 data points. Implementation details: We compare our algorithm against the current state-of-the-art method, BBS. We set the page size for each R-tree to 4 KB, and each dimension was represented by a real number. Measurement: the number of disk I/Os (page accesses).

32 Examination of Constrained Subspace Skylines. Effect of the constrained region: We vary the constrained region from 50% to 100% of each axis. We examine subspaces with dimensionality d_sub = 3. Uniform dataset: full-space dimensionality of 10 and a cardinality of 50,000 points. Observation: the performance of our algorithm is not affected significantly by the size of the constrained region.

33 Examination of Constrained Subspace Skylines. Effect of subspace dimensionality: We vary the query subspace dimensionality from 2 to 4. We keep the constrained region constant (60% of the values of each requested axis). Panels: a) 10-d Uniform Dataset, 50k; b) 9-d Color Dataset, 68k. These results demonstrate that the STA algorithm leads to substantially fewer page accesses than BBS.

34 Scalability with the Dataset Cardinality. We use uniform datasets (dimensionality of 10) and vary the cardinality between 10,000 and 100,000 points. We set the constrained region to cover 60% of each axis. In addition, we request the skyline of 3-dimensional subspaces. The proposed method scales better with cardinality than BBS.

35 Scalability with Full-space Dimensionality. Varying the full-space dimensionality: We set the constrained region to cover 60% of each axis. In addition, we request the skyline of 3-dimensional subspaces. Uniform datasets with dimensionality of 10, 20 and 30. Real datasets with dimensionality of 9, 13 and 32. Panels: a) Uniform Datasets; b) Real Datasets. In both cases our algorithm consistently outperforms BBS in this experiment.

36 Adaptation to the Query Workload. Query workload using the "80-20" law: 20% of the attributes contribute to 80% of the queries. Dataset: 32-dimensional Color histogram dataset, which consists of 68,040 records. Subspace skyline with d_sub = 3. Constrained region: 60% of each axis. Panels (scalability using the "80-20" law): a) I/O cost; b) CPU cost.

38 Conclusions and Future Work. We addressed the problem of constrained subspace skyline queries and presented a threshold-based skyline algorithm that exploits multiple indexes. We proposed different pruning strategies to identify dominated regions and to discard irrelevant sub-trees of the indexes. We presented a workload-adaptive strategy for determining the number of indexes and the assignment of dimensions to the indexes. An extensive performance evaluation shows the superiority of our proposed technique over related work. Future work may include: examination of STA using external queues; development of a cost model for constrained subspace skyline queries.

39 References
SKYCUBE [VLDB 2005, SIGMOD 2006]:
Yuan, Y., Lin, X., Liu, Q., Wang, W., Yu, J., Zhang, Q.: Efficient Computation of the Skyline Cube. Very Large Data Bases Conference (VLDB), Trondheim, Norway, August 30 - September 2, 2005.
Pei, J., Jin, W., Ester, M., Tao, Y.: Catching the Best Views of Skyline: A Semantic Approach Based on Decisive Subspaces. Very Large Data Bases Conference (VLDB), Trondheim, Norway, August 30 - September 2, 2005.
Xia, T., Zhang, D.: Refreshing the Sky: The Compressed Skycube with Efficient Support for Frequent Updates. To appear in Proceedings of the 2006 ACM SIGMOD International Conference on Management of Data (SIGMOD), Chicago, IL, USA.
SUBSKY [ICDE 2006]:
Tao, Y., Xiao, X., Pei, J.: SUBSKY: Efficient Computation of Skylines in Subspaces. IEEE International Conference on Data Engineering (ICDE), Atlanta, Georgia, USA, April 3-7, 2006.
BBS [SIGMOD 2003, TODS 2005]:
Papadias, D., Tao, Y., Fu, G., Seeger, B.: An Optimal and Progressive Algorithm for Skyline Queries. ACM Conference on the Management of Data (SIGMOD), San Diego, CA, June 9-12, 2003.
Papadias, D., Tao, Y., Fu, G., Seeger, B.: Progressive Skyline Computation in Database Systems. ACM Transactions on Database Systems, 30(1): 41-82, 2005.
TA-INDEX [DAWAK 2005]:
Dellis, E., Seeger, B., Vlachou, A.: Nearest Neighbor Search on Vertically Partitioned High-Dimensional Data. In Proceedings of the 7th International Conference on Data Warehousing and Knowledge Discovery (DaWaK), Copenhagen, Denmark, 2005.

40 Thank You Questions?