Evaluation of Top-k OLAP Queries Using Aggregate R-trees Nikos Mamoulis (HKU) Spiridon Bakiras (HKUST) Panos Kalnis (NUS)

Slides:

Advertisements

Similar presentations

The A-tree: An Index Structure for High-dimensional Spaces Using Relative Approximation Yasushi Sakurai (NTT Cyber Space Laboratories) Masatoshi Yoshikawa.

Advertisements

A General Algorithm for Subtree Similarity-Search The Hebrew University of Jerusalem ICDE 2014, Chicago, USA Sara Cohen, Nerya Or 1.

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates By Yihong Zhao, Prasad M. Desphande and Jeffrey F. Naughton Presented by Kia Hall.

Ch2 Data Preprocessing part3 Dr. Bernard Chen Ph.D. University of Central Arkansas Fall 2009.

Spatial Indexing SAMs. Spatial Indexing Point Access Methods can index only points. What about regions? Z-ordering and quadtrees Use the transformation.

1 A FAIR ASSIGNMENT FOR MULTIPLE PREFERENCE QUERIES Leong Hou U, Nikos Mamoulis, Kyriakos Mouratidis Gruppo 10: Paolo Barboni, Tommaso Campanella, Simone.

1 Top-k Spatial Joins

Fast Algorithms For Hierarchical Range Histogram Constructions

Pete Bohman Adam Kunk.  Introduction  Related Work  System Overview  Indexing Scheme  Ranking  Evaluation  Conclusion.

Indexing and Range Queries in Spatio-Temporal Databases

School of Computer Science and Engineering Finding Top k Most Influential Spatial Facilities over Uncertain Objects Liming Zhan Ying Zhang Wenjie Zhang.

Progressive Computation of The Min-Dist Optimal-Location Query Donghui Zhang, Yang Du, Tian Xia, Yufei Tao* Northeastern University * Chinese University.

Spatial Mining.

Answering Metric Skyline Queries by PM-tree Tomáš Skopal, Jakub Lokoč Department of Software Engineering, FMP, Charles University in Prague.

I/O-Algorithms Lars Arge Aarhus University February 27, 2007.

Branch and Bound Searching Strategies

Subscription Subsumption Evaluation for Content-Based Publish/Subscribe Systems Hojjat Jafarpour, Bijit Hore, Sharad Mehrotra, and Nalini Venkatasubramanian.

Efficient Processing of Top-k Spatial Keyword Queries João B. Rocha-Junior, Orestis Gkorgkas, Simon Jonassen, and Kjetil Nørvåg 1 SSTD 2011.

Spatial Indexing for NN retrieval

Optimization of Spatial Joins on Mobile Devices N. Mamoulis 1, P. Kalnis 2, S. Bakiras 3, X. Li 2 1 Department of Computer Science and Information Systems,

Computer Science Spatio-Temporal Aggregation Using Sketches Yufei Tao, George Kollios, Jeffrey Considine, Feifei Li, Dimitris Papadias Department of Computer.

1 SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases Mohamed F. Mokbel, Xiaopeng Xiong, Walid G. Aref Presented by.

Lower and Upper Bounds on Obtaining History Independence Niv Buchbinder and Erez Petrank Technion, Israel.

I/O-Algorithms Lars Arge Spring 2009 March 3, 2009.

Approximate Computation of Multidimensional Aggregates of Sparse Data Using Wavelets Based on the work of Jeffrey Scott Vitter and Min Wang.

Hierarchical Constraint Satisfaction in Spatial Database Dimitris Papadias, Panos Kalnis And Nikos Mamoulis.

Lars Arge1, Mark de Berg2, Herman Haverkort3 and Ke Yi1

Chapter 3: Data Storage and Access Methods

Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.

1 SINA: Scalable Incremental Processing of Continuous Queries in Spatio-temporal Databases Mohamed F. Mokbel, Xiaopeng Xiong, Walid G. Aref Presented by.

Ad-hoc Distributed Spatial Joins on Mobile Devices Panos Kalnis, Xiaochen Li National University of Singapore Nikos Mamoulis The University of Hong Kong.

Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.

CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.

Spatial Indexing I Point Access Methods. Spatial Indexing Point Access Methods (PAMs) vs Spatial Access Methods (SAMs) PAM: index only point data Hierarchical.

Chetan Bhirud Raza Mohammad Abinash Sahoo Online Marketing Giant.

Improving Min/Max Aggregation over Spatial Objects Donghui Zhang, Vassilis J. Tsotras University of California, Riverside ACM GIS’01.

Spatial Data Management Chapter 28. Types of Spatial Data Point Data –Points in a multidimensional space E.g., Raster data such as satellite imagery,

SUBSKY: Efficient Computation of Skylines in Subspaces Authors: Yufei Tao, Xiaokui Xiao, and Jian Pei Conference: ICDE 2006 Presenter: Kamiru Superviosr:

OnLine Analytical Processing (OLAP)

The X-Tree An Index Structure for High Dimensional Data Stefan Berchtold, Daniel A Keim, Hans Peter Kriegel Institute of Computer Science Munich, Germany.

Sorting with Heaps Observation: Removal of the largest item from a heap can be performed in O(log n) time Another observation: Nodes are removed in order.

Towards Robust Indexing for Ranked Queries Dong Xin, Chen Chen, Jiawei Han Department of Computer Science University of Illinois at Urbana-Champaign VLDB.

Top-k Similarity Join over Multi- valued Objects Wenjie Zhang Jing Xu, Xin Liang, Ying Zhang, Xuemin Lin The University of New South Wales, Australia.

Benjamin AraiUniversity of California, Riverside Reliable Hierarchical Data Storage in Sensor Networks Song Lin – Benjamin.

Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.

Prof. Bayer, DWH, Ch.4, SS Chapter 4: Dimensions, Hierarchies, Operations, Modeling.

Efficient Subwindow Search: A Branch and Bound Framework for Object Localization ‘PAMI09 Beyond Sliding Windows: Object Localization by Efficient Subwindow.

Designing Aggregations. Performance Fundamentals - Aggregations Pre-calculated summaries of data Intersections of levels from each dimension Tradeoff.

1 Top-k Dominating Queries DB seminar Speaker: Ken Yiu Date: 25/05/2006.

Efficient Processing of Top-k Spatial Preference Queries

Spatio-temporal Pattern Queries M. Hadjieleftheriou G. Kollios P. Bakalov V. J. Tsotras.

Prof. Bayer, DWH, Ch.5, SS Chapter 5. Indexing for DWH D1Facts D2.

The university of Hong Kong Department of Computer Science Continuous Monitoring of Top-k Queries over Sliding Windows Authors: Kyriakos Mouratidis, Spiridon.

1 Branch and Bound Searching Strategies Updated: 12/27/2010.

August 30, 2004STDBM 2004 at Toronto Extracting Mobility Statistics from Indexed Spatio-Temporal Datasets Yoshiharu Ishikawa Yuichi Tsukamoto Hiroyuki.

A FAIR ASSIGNMENT FOR MULTIPLE PREFERENCE QUERIES

R-Trees: A Dynamic Index Structure For Spatial Searching Antonin Guttman.

Efficient OLAP Operations in Spatial Data Warehouses Dimitris Papadias, Panos Kalnis, Jun Zhang and Yufei Tao Department of Computer Science Hong Kong.

1 CSIS 7101: CSIS 7101: Spatial Data (Part 1) The R*-tree ： An Efficient and Robust Access Method for Points and Rectangles Rollo Chan Chu Chung Man Mak.

Indexing OLAP Data Sunita Sarawagi Monowar Hossain York University.

1 Complex Spatio-Temporal Pattern Queries Cahide Sen University of Minnesota.

Progressive Computation of The Min-Dist Optimal-Location Query Donghui Zhang, Yang Du, Tian Xia, Yufei Tao* Northeastern University * Chinese University.

DASFAA 2005, Beijing 1 Nearest Neighbours Search using the PM-tree Tomáš Skopal 1 Jaroslav Pokorný 1 Václav Snášel 2 1 Charles University in Prague Department.

23 1 Christian Böhm 1, Florian Krebs 2, and Hans-Peter Kriegel 2 1 University for Health Informatics and Technology, Innsbruck 2 University of Munich Optimal.

1 Spatial Query Processing using the R-tree Donghui Zhang CCIS, Northeastern University Feb 8, 2005.

Dense-Region Based Compact Data Cube

Spatial Data Management

Spatial Indexing I Point Access Methods.

Efficient Processing of Top-k Spatial Preference Queries

Efficient Aggregation over Objects with Extent

Presentation transcript:

Evaluation of Top-k OLAP Queries Using Aggregate R-trees Nikos Mamoulis (HKU) Spiridon Bakiras (HKUST) Panos Kalnis (NUS)

2 Background  On-line Analytical Processing (OLAP) refers to the set of operations that are applied on a Data Warehouse to assist analysis and decision support.  Some measures (e.g., sales) are summarized with respect to some interesting dimensions (e.g., products, stores, time, etc.), representing business perspectives.  E.g., “retrieve the total sales per month, product-color and store location”.

3 Background  Fact table: stores measures and values of all dimensions at the most refined abstraction level.  Dimensional tables: store information about the multi-level hierarchies of each dimension.  Some of the dimensions could be spatial (e.g., location). Hierarchies for spatial attributes may exist (e.g., exact location, district, city, county, state, country).

4 Background  OLAP queries: summarize measures based on selected dimensions and hierarchies thereof.  E.g.: SELECT product-type, store-city, sum(quantity) FROM Sales GROUP BY product-type, store-city

5 Top-k OLAP Queries  a top-k OLAP query selects the k groups with the highest aggregate values.  related: iceberg query SELECT product-type, store-city, sum(quantity) FROM Sales GROUP BY product-type, store-city ORDER BY sum(quantity) STOP AFTER k SELECT product-type, store-city, sum(quantity) FROM Sales GROUP BY product-type, store-city HAVING sum(quantity)>1000

6 Problem formulation  D = {d 1,d 2,...,d n } a set of interesting dimensions (some dimensions could be spatial) e.g., product, location, etc.  The values of each dimension d i are partitioned to a set R i of ranges, either ad-hoc or based on some hierarchy level. e.g., product-ids partitioned to product-types e.g., spatial dimension is partitioned using a grid  Retrieve the k multi-dimensional groups with the greatest values loc-x loc-y prod-id type-A type-B

7 Assumption  We assume that the set of interesting dimensions is already indexed by an aggregate R-tree (aR-tree)  Realistic, by view selection on the most- refined hierarchical level We do not require hierarchical data summaries to be indexed.  Objective: Evaluate top-k OLAP queries (and iceberg queries) using the aR-tree.

8 The aggregate R-tree  Each entry is augmented with aggregate information about all objects indexed in the subtree pointed by it.  Simple aggregate range queries: retrieving the aggregate for a region that spatially covers an entry does not require accessing the corresponding node.

9 Using the aR-tree for top-k OLAP queries  Observation: By browsing the tree partially, we can derive upper and lower bounds for the aggregate value of each cell: E.g., c 1.agg = 120, 90  c 4.agg  140 c 6 (with c 6.agg = 150) is the top-1 result

10 Using the aR-tree for top-k OLAP queries  Lemma: Let t be the k-th largest lower bound of all cells. Let e i be an aR–tree entry. If for all cells c that intersect e i, c.ub ≤ t, then the subtree pointed by e i cannot contribute to any top-k result, and thus it can be pruned from search. E.g., t=c 6.agg=150, e 9 intersects c 8,c 9, c 8.ub=20, c 9.ub=40

11 Sketch of basic algorithm  Assume (for now) that information about all cube cells (upper and lower aggregate bounds) can be maintained in memory.  Initialize c.lb=c.ub=0, for all cells.  Maintain a heap H with the non-visited entries yet. Initially H contains the root entries of the aR-tree. Update c.lb, c.ub for all cells based on their intersection/containment of entries.  Maintain a heap LB with the cells of top-k lower bounds. Let t be the lowest c.lb.  Each entry e in H is prioritized based on e.ub = max{c.ub, for all c intersected by e}  While top(H).ub > t, remove top entry from H, visit the corresponding R-tree node and update upper/lower bounds and LB.

12 Example (  =sum)  visit root; t=0; c 2.ub=420; pick e 1 (greatest e.agg);  visit e 1.ptr; t=90=c 4.lb; c 2.ub=300; pick e 2 ;  visit e 2.ptr; t=90=c 4.lb; c 2.ub=250; pick e 7 ;  visit e 7.ptr; t=140=c 6.lb; c 6.ub=170; pick e 6 ;  visit e 6.ptr; t=150=c 6.lb; c 4.ub=140<t; terminate

13 Reducing Memory Requirements  The basic algorithm requires maintenance of lower/upper bounds for each cell. This might be infeasible in practice (huge number of groups).  Optimizations need not keep information about cells that are intersected by at most one entry keep a single upper bound for all cells intersected by the same set of entries need not keep information about cells that may not end up in the top-k result maintain upper bounds only at tree entries (not at cells)

14 Extensions  Iceberg queries. Similar algorithm; replace floating bound of k-th c.lb by constant t. No need of a priority queue; visit nodes in depth-first order.  Range-restricted top-k OLAP queries. Use selection range to prune.  Non-orthocanonical partitions. Use a bipartite graph that links tree entries to regions of partitions  Multiple measures and different aggregate functions. Easy to extend assuming that the tree stores aggregates (e.g., sum, min, etc.)

15 Experimental Settings  Synthetic data (uniform): d-dimensional points in a [1:10000] d map. 10 (random) anchor points measure generated using a Zipfian distribution; points close to an anchor get higher values.  Real spatial data: Centroids of 400K road segments from North America Measures generated as for synthetic data  Comparison includes aR-tree based top-k OLAP algorithm naive, hash-based method; find the measures for all cells, then select top-k cells. Assumes all cells fit in mem (best case).  Default parameters 200K points; d=2 dimensions; =1 (Zipf parameter).

16 Performance and Scalability (synthetic data)

17 Effect of skew on measures (synthetic data)

18 Effect of the number of partitions (syn. data) I/O costmemory requirements

19 Effect of dimensionality (synthetic data)

20 Effect of partitions and skew (spatial data)

21 Conclusions and current work  This is the first work (to our knowledge) that studies top-k OLAP queries.  We developed an efficient branch-and-bound algorithm for top-k OLAP queries that operates on aR-trees.  The algorithm can be easily applied for iceberg queries and other query variants.  Experiments show that it performs well for spatial data and low-dimensional data in general.  Does not scale well with dimensionality (due to aR-trees and the dimensionality “curse”)  Currently working on branch-and-bound algorithms for top-k OLAP queries on non- indexed multi-dimensional data.