An Array-Based Algorithm for Simultaneous Multidimensional Aggregates

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, by Yihong Zhao, Prasad M. Deshpande and Jeffrey F. Naughton. Presented by Kia Hall.

Presentation transcript:

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates Y. Zhao, P. Deshpande, J. Naughton

Motivation Previous papers showed the usefulness of the CUBE operator. There are several algorithms for computing the CUBE in Relational OLAP systems. This paper proposes an algorithm for computing the CUBE in Multidimensional OLAP systems.

CUBE in ROLAP In ROLAP systems, 3 main ideas for efficiently computing the CUBE Group related tuples together (using sorting or hashing) Use grouping performed on sub-aggregates to speed computation Compute an aggregate from another aggregate rather than the base table

CUBE in MOLAP Cannot transfer algorithms from ROLAP to MOLAP, because of the nature of the data In ROLAP, data is stored in tuples that can be sorted and reordered by value In MOLAP, data cannot be rearranged, because the position of the data determines the attribute values

Multidimensional Array Storage Data is stored in large, sparse arrays, which leads to certain problems: The array may be too big for memory Many of the cells may be empty and the array will be too sparse

Chunking Arrays Why chunk? A simple row-major layout (partitioning by dimension) will favor certain dimensions over others. What is chunking? A method for dividing an n-dimensional array into smaller n-dimensional chunks and storing each chunk as one object on disk.
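
A minimal sketch of chunk addressing, assuming chunks are numbered in row-major order and cells inside a chunk are also laid out row-major. The function name locate and this numbering scheme are illustrative assumptions; the slides do not specify them.

```python
def locate(coords, dim_sizes, chunk_sizes):
    """Map a cell's coordinates to (chunk_number, offset_in_chunk).

    Both numbers are row-major; this numbering is an assumption made
    for illustration, not something the slides specify.
    """
    chunk_no = 0
    offset = 0
    for x, dim, chunk in zip(coords, dim_sizes, chunk_sizes):
        chunks_along = -(-dim // chunk)  # ceiling division
        chunk_no = chunk_no * chunks_along + x // chunk
        offset = offset * chunk + x % chunk
    return chunk_no, offset
```

For a 16 x 16 x 16 array with 4 x 4 x 4 chunks, every cell lands in one of 64 chunks, and a chunk can be read from disk as a single object regardless of which dimension a query favors.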

Chunks [Figure: an array divided into chunks along Dimension A and Dimension B; chunk extents are labeled CA and CB.]

Array Compression Chunk-offset compression: for each valid entry, we store (offsetInChunk, data), where offsetInChunk is the offset from the start of the chunk. Compression is applied only to sparse chunks; dense chunks (defined as chunks more than 40% filled with data) are stored uncompressed.
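
A small sketch of chunk-offset compression, assuming a chunk is held as a flat Python list with None marking an empty cell. The helper names and the None convention are assumptions for illustration. A density helper is included because the slide ties the choice of representation to a 40% fill threshold.

```python
def density(chunk, empty=None):
    """Fraction of cells in the chunk that hold a valid value."""
    return sum(v is not empty for v in chunk) / len(chunk)


def compress_chunk(chunk, empty=None):
    """Keep only valid entries as (offset_in_chunk, data) pairs."""
    return [(i, v) for i, v in enumerate(chunk) if v is not empty]


def decompress_chunk(pairs, length, empty=None):
    """Rebuild the flat chunk from its (offset_in_chunk, data) pairs."""
    chunk = [empty] * length
    for offset, value in pairs:
        chunk[offset] = value
    return chunk
```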

Naïve Array Cubing Algorithm Similar to ROLAP, each aggregation is computed from its parent in the lattice. Each chunk is aggregated completely and then written to disk before moving on to the next chunk. Lattice levels: ABC; AB, AC, BC; A, B, C; {}.

Illustrative example Example for BC: Start with BC face on 1 and “sweep” through dimension A to aggregate.

Problems with Naïve approach Each sub-aggregate is calculated independently. E.g., this algorithm will compute AB from ABC, then rescan ABC to calculate AC, then rescan ABC to calculate BC. We need a method to simultaneously compute all children of a parent in a single pass over the parent.

Single-Pass Multi-Way Array Cubing Algorithm The order of scanning is vitally important in determining how much memory is needed to compute the aggregates. A dimension order O = (Dj1, Dj2, … Djn) defines the order in which dimensions are scanned. |Di| = size of dimension i |Ci| = size of the chunk for dimension i |Ci| << |Di| in general
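
The single-pass idea can be sketched as follows for a three-dimensional array. This is an illustration only: it assumes a sparse array stored as a dict keyed by (a, b, c), SUM as the aggregate, and chunks visited with A varying fastest. It keeps the full AB, AC and BC results in memory, so it shows the simultaneous update but not the memory savings discussed on the following slides.

```python
from itertools import product


def multiway_aggregate(cells, dim_sizes, chunk_sizes):
    """Sum a sparse 3-D array {(a, b, c): value} into AB, AC and BC in one pass."""
    ab, ac, bc = {}, {}, {}
    # Number of chunks along each dimension (ceiling division).
    chunks_per_dim = [-(-d // size) for d, size in zip(dim_sizes, chunk_sizes)]
    # Visit chunks with dimension A varying fastest, then B, then C.
    for kc, kb, ka in product(*(range(n) for n in reversed(chunks_per_dim))):
        ranges = [
            range(k * size, min((k + 1) * size, d))
            for k, size, d in zip((ka, kb, kc), chunk_sizes, dim_sizes)
        ]
        for a, b, c in product(*ranges):
            value = cells.get((a, b, c))
            if value is None:
                continue  # empty cell
            # Each cell is read once and updates all three children.
            ab[(a, b)] = ab.get((a, b), 0) + value
            ac[(a, c)] = ac.get((a, c), 0) + value
            bc[(b, c)] = bc.get((b, c), 0) + value
    return ab, ac, bc
```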

Example of Multi-way Array

Concrete Example |Ci| = 4, |Di| = 16 For BC group-bys, we need 1 chunk (4x4) For AC, we need 4 chunks (16x4) For AB, we need to keep track of whole slice of the AB plane, so (16x16)

How much memory? A formula for the minimum amount of memory can be generalized. Define p = size of the largest common prefix between the current group-by and its parent. The minimum memory is $\prod_{i=1}^{p} |D_i| \times \prod_{i=p+1}^{n-1} |C_i|$.

Example calculation O = (A, B, C, D), |Ci| = 10, |Di| = (100, 200, 300, 400). For the ABD group-by, the max common prefix is AB. Therefore the minimum amount of memory necessary is: |DA| x |DB| x |CD| = 100 x 200 x 10 = 200,000.
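
The calculation above can be checked with a short sketch. The function name min_memory and the dict-based inputs are assumptions for illustration; the rule itself is the one stated on the slides: dimensions in the common prefix contribute their full size, the remaining dimensions contribute their chunk size.

```python
def min_memory(order, group_by, dim_size, chunk_size):
    """Minimum memory (in array cells) for one group-by under a dimension order."""
    # p = length of the common prefix between the group-by and the order.
    p = 0
    for o, g in zip(order, group_by):
        if o != g:
            break
        p += 1
    memory = 1
    for position, dim in enumerate(group_by):
        # Prefix dimensions contribute |D|, the others contribute |C|.
        memory *= dim_size[dim] if position < p else chunk_size[dim]
    return memory


sizes = {"A": 100, "B": 200, "C": 300, "D": 400}
chunks = {"A": 10, "B": 10, "C": 10, "D": 10}
abd = min_memory("ABCD", "ABD", sizes, chunks)  # 100 * 200 * 10
```

The same function reproduces the earlier concrete example: with |Di| = 16 and |Ci| = 4 under order ABC, it gives 16 for BC, 64 for AC and 256 for AB.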

More Memory Notes In simple terms, every element q in the common prefix contributes |Dq|, while every other element r not in the prefix contributes |Cr|. Since |Ci| << |Di|, to minimize the memory usage, we should minimize the max common prefix and choose the dimension order so that the largest dimensions appear in the fewest prefixes.

Minimum Memory Spanning Trees O = { A B C } Why is the cost of B=4?

Minimum Memory Spanning Trees (cont.) Using the formula for calculating the minimum amount of memory, we can build an MMST such that the total memory requirement is minimum for a given dimension order. For different dimension orders, the MMSTs may be very different, with very different memory requirements.

Effects of Dimension Order

More Effects of Dimension Order The early elements in O (particularly the first one) appear in the most prefixes and therefore, contribute their dimension sizes to the memory requirements. The last element in O can never appear in any prefix. Therefore, the total memory requirement for computing the CUBE is independent of the size of the last dimension.

Optimal Dimension Order Based on the previous two ideas, the optimal dimension ordering is to sort the dimensions by increasing size. The total memory requirement will be minimized and will be independent of the size of the largest dimension.
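
A hedged sketch that compares dimension orders by brute force. It is a simplification: it sums the memory rule only over the (n-1)-dimensional group-bys, not over the full MMST, and the function names and the sizes used (20, 200, 2000 with chunk size 10) are made up for illustration.

```python
from itertools import combinations, permutations


def min_memory(order, group_by, dim_size, chunk_size):
    """Prefix dimensions contribute |D|, the remaining ones contribute |C|."""
    p = 0
    for o, g in zip(order, group_by):
        if o != g:
            break
        p += 1
    memory = 1
    for position, dim in enumerate(group_by):
        memory *= dim_size[dim] if position < p else chunk_size[dim]
    return memory


def first_level_memory(order, dim_size, chunk_size):
    """Total memory for all (n-1)-dimensional group-bys of the given order."""
    return sum(
        min_memory(order, group_by, dim_size, chunk_size)
        for group_by in combinations(order, len(order) - 1)
    )


sizes = {"A": 20, "B": 200, "C": 2000}
chunks = {"A": 10, "B": 10, "C": 10}
totals = {o: first_level_memory(o, sizes, chunks) for o in permutations("ABC")}
best_order = min(totals, key=totals.get)
```

Under these assumptions the increasing-size order (A, B, C) has the smallest total, and changing the size of C, the last and largest dimension, does not change that total.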

Graphs And Results

ROLAP vs. MOLAP

MOLAP wins

MOLAP for ROLAP system The last chart demonstrates one of the unexpected results from this paper. We can use the MOLAP algorithm with ROLAP systems in three steps: scan the table and load it into an array, compute the CUBE on the array, and convert the results into tables.

MOLAP for ROLAP (cont.) The results show that even with the additional cost of conversion between data structures, the MOLAP algorithm runs faster than directly computing the CUBE on the ROLAP tables, and it scales much better. In this scheme, the multidimensional array is used as a query evaluation data structure rather than a persistent storage structure.

Summary The multidimensional array of MOLAP should be chunked and compressed. The Single-Pass Multi-Way Array method is a simultaneous algorithm that allows the CUBE to be calculated with a single pass over the data. By minimizing the overlap in prefixes and sorting dimensions in order of increasing size, we can build an MMST that gives a plan for computing the CUBE.

More Summary On MOLAP systems, the CUBE is calculated much faster than on ROLAP systems. The most surprising (and useful) result is that the MOLAP algorithm is so much faster that it can be used on ROLAP systems as an intermediate step in computing the CUBE.

Caching Multidimensional Queries Using Chunks P. Deshpande, K. Ramasamy, A. Shukla, J. Naughton

Caching Caching is very important in OLAP systems, since the queries are complex and they are required to respond quickly. Previous work in caching dealt with table-level caching and query-level caching. This paper will propose another level of granularity using chunks.

Chunk-based caching Benefits: Frequently accessed chunks will stay in the cache. A new query need not be “contained” within a cached query to benefit from the cache

More on Chunking More benefits: Closure property of chunks: we can aggregate chunks on one level to obtain chunks at different levels of aggregation. Less redundant storage leads to a better hit ratio of the cache.

Chunking the Multi-dimensional Space To divide the space into chunks, distinct values along each dimension have to be divided into ranges. Hierarchical locality in OLAP queries suggests that ordering by the hierarchy level is the best option.

Ordering on Dimensions

Chunk Ranges Uniform chunk ranges do not work so well with hierarchical data.

Hierarchical Chunk Ranges

Caching Query Results When a new query is issued, the chunks needed to answer the query are determined. The list of chunks is broken into 2 parts: relevant chunks from the cache, and missing chunks that have to be computed from the backend.
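
The split described above can be sketched in a few lines, assuming the cache is a dict keyed by chunk number. The function name split_chunks and the dict representation are assumptions for illustration.

```python
def split_chunks(needed, cache):
    """Partition the chunk numbers a query needs into cache hits and misses."""
    hits = {number: cache[number] for number in needed if number in cache}
    misses = [number for number in needed if number not in cache]
    return hits, misses
```

Only the chunks in the miss list have to be requested from the backend, so a query that overlaps earlier queries is partially answered from the cache.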

Chunked File Organization The cost of a chunk miss can be reduced by organizing data in chunks at the backend. One possible method is to use multi-dimensional arrays, but these require changing the data structures a great deal and may result in the loss of relational access to data.

Chunk Indexing A chunk index is created so that, given a chunk number, it is possible to access all tuples in that chunk. The chunked file will have two interfaces: the relational interface for SQL statements, and a chunk-based interface for direct access to chunks.

Query Processing How to determine whether a cached chunk can be used to answer a query: Level of aggregation: cached chunks at the same level are used. Condition clause: selections on non-group-by predicates must match exactly.
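
One way to express the two reuse conditions is to make them part of the cache key, so that a lookup succeeds only when both match. The ChunkKey class, its field names and the string form of predicates are hypothetical, introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkKey:
    level: tuple          # aggregation level per dimension, e.g. ("month", "state")
    predicate: frozenset  # selections on non-group-by attributes
    chunk_no: int


def lookup(cache, level, predicate, chunk_no):
    """Return the cached chunk only if level and predicate match exactly."""
    return cache.get(ChunkKey(tuple(level), frozenset(predicate), chunk_no))
```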

Implementation of Chunked Files Add a new chunked file type to the backend database. Add a level of abstraction. Add a new attribute for the chunk number. Sort based on chunk number. Create a chunk index with a B-tree on the chunk number.
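
The steps above can be sketched in memory as follows. This is an assumption-laden illustration: a plain dict stands in for the B-tree, a sorted list stands in for the file, and chunk_of is a caller-supplied function that assigns a chunk number to a tuple.

```python
def build_chunked_file(tuples, chunk_of):
    """Tag each tuple with its chunk number, sort, and index each chunk's start."""
    tagged = sorted(((chunk_of(t), t) for t in tuples), key=lambda pair: pair[0])
    index = {}  # chunk number -> position of its first tuple (B-tree stand-in)
    for position, (chunk_no, _) in enumerate(tagged):
        index.setdefault(chunk_no, position)
    return tagged, index


def read_chunk(tagged, index, chunk_no):
    """Chunk-based interface: return all tuples of one chunk."""
    start = index.get(chunk_no)
    if start is None:
        return []
    rows = []
    for number, row in tagged[start:]:
        if number != chunk_no:
            break  # tuples of a chunk are contiguous after sorting
        rows.append(row)
    return rows
```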

Replacement Schemes LRU is not viable, because chunks at different levels have different costs. The benefit of a chunk is measured by the fraction of the base table it represents. Use the benefits of chunks as weights when determining which chunk to replace in the cache.
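
A deliberately simplified sketch of benefit-weighted replacement: when the cache is full, the chunk with the lowest benefit is evicted. This is not presented as the policy from the paper; the class name, the fixed capacity in chunks and the evict-lowest rule are assumptions for illustration.

```python
class BenefitCache:
    """Fixed-capacity chunk cache that evicts the lowest-benefit chunk."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}  # chunk key -> (benefit, data)

    def put(self, key, data, benefit):
        # benefit: fraction of the base table the chunk represents (0..1).
        if key not in self.entries and len(self.entries) >= self.capacity:
            victim = min(self.entries, key=lambda k: self.entries[k][0])
            del self.entries[victim]
        self.entries[key] = (benefit, data)

    def get(self, key):
        entry = self.entries.get(key)
        return None if entry is None else entry[1]
```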

Cost Saving Ratio Defined as the percentage of the total cost of the queries saved due to hits in the cache. Better than a normal hit ratio, since chunks at different levels have different benefits.
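
A small sketch contrasting the two metrics, assuming each chunk request is recorded as a (cost, was_hit) pair. The record format and function names are assumptions for illustration.

```python
def cost_saving_ratio(requests):
    """Share of total query cost that was saved by cache hits."""
    requests = list(requests)
    total = sum(cost for cost, _ in requests)
    saved = sum(cost for cost, was_hit in requests if was_hit)
    return saved / total if total else 0.0


def hit_ratio(requests):
    """Plain hit ratio: share of requests served from the cache."""
    requests = list(requests)
    hits = sum(1 for _, was_hit in requests if was_hit)
    return hits / len(requests) if requests else 0.0
```

With one expensive hit (cost 100) and three cheap misses (cost 1 each), the hit ratio is only 0.25 while the cost saving ratio is 100/103, which reflects that the cached chunk saved most of the work.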

Summary Chunk-based caching allows fine granularity and lets queries be partially answered from the cache. A chunked file organization can reduce the cost of a chunk miss with minimal cost in implementation. Performance depends heavily on choosing the right chunk range and a good replacement policy.