Efficient Methods for Data Cube Computation

Slides:



Advertisements
Similar presentations
Cache –Warming Strategies for Analysis Services 2008 Chris Webb Crossjoin Consulting Limited
Advertisements

An Array-Based Algorithm for Simultaneous Multidimensional Aggregates By Yihong Zhao, Prasad M. Desphande and Jeffrey F. Naughton Presented by Kia Hall.
Materialization and Cubing Algorithms. Cube Materialization Each cell of the data cube is a view consisting of an aggregation of interest. The values.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Multidimensional Data. Many applications of databases are "geographic" = 2­dimensional data. Others involve large numbers of dimensions. Example: data.
Multidimensional Data
5.1Database System Concepts - 6 th Edition Chapter 5: Advanced SQL Advanced Aggregation Features OLAP.
Implementation & Computation of DW and Data Cube.
Dimensional Modeling – Part 2
B + -Trees COMP171 Fall AVL Trees / Slide 2 Dictionary for Secondary storage * The AVL tree is an excellent dictionary structure when the entire.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
OLAP OPERATIONS. OLAP ONLINE ANALYTICAL PROCESSING OLAP provides a user-friendly environment for Interactive data analysis. In the multidimensional model,
Ahsan Abdullah 1 Data Warehousing Lecture-12 Relational OLAP (ROLAP) Virtual University of Pakistan Ahsan Abdullah Assoc. Prof. & Head Center for Agro-Informatics.
1 B Trees - Motivation Recall our discussion on AVL-trees –The maximum height of an AVL-tree with n-nodes is log 2 (n) since the branching factor (degree,
Multi-Dimensional Databases & Online Analytical Processing This presentation uses some materials from: “ An Introduction to Multidimensional Database Technology,
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
OnLine Analytical Processing (OLAP)
Efficient Methods for Data Cube Computation and Data Generalization
Building the cube – Chapter 9 & 10 Let’s be over with it.
1 Relational Databases and SQL. Learning Objectives Understand techniques to model complex accounting phenomena in an E-R diagram Develop E-R diagrams.
Closed Cube Computation Data cube produces large outputs –1,015,367 tuples (39MB) –210,343,580 tuples (8GB)(200 times) Two methods to reduce.
BI Terminologies.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
UNIT-II Principles of dimensional modeling
1 On-Line Analytic Processing Warehousing Data Cubes.
병렬분산컴퓨팅연구실 1 Cubing Algorithms, Storage Estimation, and Storage and Processing Alternatives for OLAP 병렬 분산 컴퓨팅 연구실 석사 1 학기 이 은 정
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Dense-Region Based Compact Data Cube
Data Analysis and OLAP Dr. Ms. Pratibha S. Yalagi Topic Title
Nan Zhang Texas A&M University
On-Line Application Processing
Computer Organization
Chapter 6: Integrity (and Security)
A data cube (1) OR a lattice of cuboids
Chapter 5. Data Cube Technology
CHP - 9 File Structures.
Data Mining: Concepts and Techniques (3rd ed.) — Chapter 5 —
Indexing ? Why ? Need to locate the actual records on disk without having to read the entire table into memory.
On-Line Analytic Processing
Data warehouse and OLAP
Merchandising Activities
What is OLAP OLAP allows to model data in a multidimensional way like a data cube in order to look for the data from many perspectives.
Chapter 5: Advanced SQL Database System concepts,6th Ed.
SQL/OLAP Sang-Won Lee Let’s e-Wha! URL: Jul. 12th, 2001 SQL/OLAP
Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube Introducing iceberg cubes will lessen the burden of computing trivial aggregate.
COMP 430 Intro. to Database Systems
B+-Trees.
B+-Trees.
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
Datamining : Refers to extracting or mining knowledge from large amounts of data Applications : Market Analysis Fraud Detection Customer Retention Production.
DATA CUBE Advanced Databases 584.
CMSC 341 Lecture 10 B-Trees Based on slides from Dr. Katherine Gibson.
Market Basket Analysis and Association Rules
Chapter 11: Indexing and Hashing
Data warehouse Design Using Oracle
MIS2502: Review for Exam 1 JaeHwuen Jung
B- Trees D. Frey with apologies to Tom Anastasio
B- Trees D. Frey with apologies to Tom Anastasio
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
B-Trees Disk Storage What is a multiway tree? What is a B-tree?
B- Trees D. Frey with apologies to Tom Anastasio
On-Line Application Processing
Retail Sales is used to illustrate a first dimensional model
Fundamentals of Data Cube & OLAP Operations
Data Models in DBMS A model is a representation of reality, 'real world' objects and events, associations. Data Model can be defined as an integrated collection.
Chapter 11: Indexing and Hashing
Slides based on those originally by : Parminder Jeet Kaur
Shelly Cashman: Microsoft Access 2016
Introduction to Graphs
Presentation transcript:

Efficient Methods for Data Cube Computation The pre-computation of all or part of a data cube can greatly reduce the response time and enhance the performance of on-line analytical processing. However, such computation is challenging because it may require substantial computational time and storage space.

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube A data cube is a lattice of cuboids. Each cuboid represents a group-by. ABC is the base cuboid, containing all three of the dimensions. Here, the aggregate measure, M, is computed for each possible combination of the three dimensions. The base cuboid is the least generalized of all of the cuboids in the data cube. The most generalized cuboid is the apex cuboid, commonly represented as all. It contains one value—it aggregates measure M for all of the tuples stored in the base cuboid. To drill down in the data cube, we move from the apex cuboid, downward in the lattice. To roll up, we move from the base cuboid, upward

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube A cell in the base cuboid is a base cell. A cell from a nonbase cuboid is an aggregate cell. An aggregate cell aggregates over one or more dimensions, where each aggregated dimension is indicated by a * in the cell notation. Suppose we have an n-dimensional data cube. Let a = (a1, a2, . . . , an, measures) be a cell from one of the cuboids making up the data cube. We say that a is an m-dimensional cell (that is, from an m-dimensional cuboid) if exactly m (m <= n) values among {a1, a2, : : : , an} are not *. If m = n, then a is a base cell; otherwise, it is an aggregate cell (i.e., where m < n).

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube Consider a data cube with the dimensions month, city, and customer group, and the measure price. (Jan, * , * , 2800) and (*, Toronto, * , 1200) are 1-D cells, (Jan, * , Business, 150) is a 2-D cell, and (Jan, Toronto, Business, 45) is a 3-D cell. An ancestor-descendant relationship may exist between cells. In an n-dimensional data cube, an i-D cell a = (a1, a2, . . . , an, measuresa) is an ancestor of a j-D cell b = (b1, b2, . . . , bn, measuresb), and b is a descendant of a, if and only if (1) i < j, and (2) for 1 <=m<=n, am = bm whenever am  *. In particular, cell a is called a parent of cell b, and b is a child of a, if and only if j = i+1 and b is a descendant of a.

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube Referring to our previous example, 1-D cell a=(Jan, * , * , 2800), and 2-D cell b = (Jan, * , Business, 150), are ancestors of 3-D cell c = (Jan, Toronto, Business, 45); c is a descendant of both a and b. b is a parent of c, and c is a child of b. In order to ensure fast on-line analytical processing, it is sometimes desirable to pre-compute the full cube. This, however, is exponential to the number of dimensions. That is, a data cube of n dimensions contains 2n cuboids. There are even more cuboids if we consider concept hierarchies for each dimension. In addition, the size of each cuboid depends on the cardinality of its dimensions. Thus, pre-computation of the full cube can require huge and often excessive amounts of memory.

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube Partial materialization of data cubes offers an interesting trade-off between storage space and response time for OLAP. Instead of computing the full cube, we can compute only a subset of the data cube’s cuboids, or subcubes consisting of subsets of cells from the various cuboids. Many cells in a cuboid may actually be of little or no interest to the data analyst. Recall that each cell in a full cube records an aggregate value. Measures such as count, sum, or sales in dollars are commonly used. For many cells in a cuboid, the measure value will be zero. When the product of the cardinalities for the dimensions in a cuboid is large relative to the number of nonzero-valued tuples that are stored in the cuboid, then we say that the cuboid is sparse. If a cube contains many sparse cuboids, we say that the cube is sparse.

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube In many cases, a substantial amount of the cube’s space could be taken up by a large number of cells with very low measure values. This is because the cube cells are often quite sparsely distributed within a multiple dimensional space. For example, a customer may only buy a few items in a store at a time. Such an event will generate only a few nonempty cells, leaving most other cube cells empty. In such situations, it is useful to materialize only those cells in a cuboid (group-by) whose measure value is above some minimum threshold.

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube In a data cube for sales, say, we may wish to materialize only those cells for which count is 10 or only those cells representing sales $100. This not only saves processing time and disk space, but also leads to a more focused analysis. The cells that cannot pass the threshold are likely to be too trivial to warrant further analysis. Such partially materialized cubes are known as iceberg cubes. The minimum threshold is called the minimum support threshold, or minimum

Cube Materialization: Full Cube, Iceberg Cube, Closed Cube, and Shell Cube An iceberg cube can be specified with an SQL query, as shown in the following example. compute cube sales iceberg as select month, city, customer group, count(*) from salesInfo cube by month, city, customer group having count(*) >= min sup The compute cube statement specifies the precomputation of the iceberg cube, sales iceberg, with the dimensions month, city, and customer group, and the aggregate measure count(). The input tuples are in the salesInfo relation. The cube by clause specifies that aggregates (group-by’s) are to be formed for each of the possible subsets of the given dimensions.