Published by Jack Cameron. Modified over 9 years ago.
1 CUBE: A Relational Aggregate Operator Generalizing Group By
By Ata İsmet Özçelik
2 The Data Analysis Cycle
User extracts data from the database with a query.
Then visualizes and analyzes the data with desktop tools.
[Figure: the extract, visualize, analyze cycle, with a spreadsheet table and two example charts, "Size vs Speed" (size in bytes) and "Price vs Speed" ($/MB), each plotted against access time in seconds for cache, main, secondary disc, online tape, nearline tape, and offline tape.]
3 N-Dimensional Data
What exactly is N-dimensional data?
– A relation with N attribute domains.
– Each dimension in the main table could have its own domain table.
Why is this alone not enough?
– We need aggregation of various kinds to make the data humanly readable.
4 Relational Representation of a 3-D Data Model
[Figure: Sales fact table with columns model_key, year_key, color_key, and sales; figure labels: Measures, Year, Color.]
5 Aggregate Functions
Aggregate functions:
– SQL standard: SUM(), COUNT(), MIN(), MAX(), and AVG().
– Many systems provide their own aggregate functions, and some let users define custom ones.
The basic idea: combine all values in a column into a single scalar value.
6 Relational Group By Operator
GROUP BY allows aggregates over table sub-groups.
The result is a new table.
Syntax:
SELECT location, SUM(units)
FROM inventory
WHERE nation = 'USA'
GROUP BY location;
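A minimal Python sketch of what such a query computes: filter the rows by nation, then sum units per location. The inventory rows and the function name group_sum are hypothetical, made up for illustration only.

```python
from collections import defaultdict

# Hypothetical inventory rows (location, nation, units); not from the slides.
inventory = [
    ("Seattle", "USA", 10),
    ("Seattle", "USA", 5),
    ("Boston", "USA", 7),
    ("Toronto", "Canada", 4),
]

def group_sum(rows, nation):
    """Sum units per location, keeping only rows for the given nation."""
    totals = defaultdict(int)
    for location, row_nation, units in rows:
        if row_nation == nation:
            totals[location] += units
    return dict(totals)

usa_units = group_sum(inventory, "USA")
print(usa_units)
```

The result is again a table (here a dict) with one row per group, which is the property the CUBE operator builds on.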
7 Problems with GROUP BY
Histograms
– In standard SQL, histograms are computed indirectly from a table-valued expression, which is then aggregated.
Roll-up totals and sub-totals for drill-downs
– Reports commonly aggregate data at a coarse level, and then at successively finer levels.
– Roll-up: going up levels. Drill-down: going down levels.
Cross-tabulation (cross-tab for short)
– A symmetric aggregation table.
The resulting problem: with plain GROUP BY, a cross-tab over N dimensions needs a 2^N-way union of GROUP BY queries, and a roll-up needs one union per dimension.
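To make the counting concrete, here is a small Python sketch that lists the GROUP BY column sets a plain-SQL user would have to union by hand. The function names are hypothetical; the dimension names are the ones used in the deck's CUBE example.

```python
from itertools import combinations

def cube_grouping_sets(dims):
    """All 2^N subsets of the dimensions, finest grouping first."""
    return [s for k in range(len(dims), -1, -1) for s in combinations(dims, k)]

def rollup_grouping_sets(dims):
    """The N+1 prefixes of the dimension list, finest grouping first."""
    return [tuple(dims[:k]) for k in range(len(dims), -1, -1)]

dims = ("Model", "Year", "Color")
cube_sets = cube_grouping_sets(dims)      # one GROUP BY per subset
rollup_sets = rollup_grouping_sets(dims)  # one GROUP BY per prefix

print(len(cube_sets), len(rollup_sets))
```

For three dimensions the cross-tab needs 8 separate GROUP BY queries, while the roll-up needs 4 (that is, 3 unions).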
8 An Example Approach
Not relational.
Not convenient.
9 'ALL'
A dummy value that fills all the super-aggregation items.
It is actually a set representing all the values present for the corresponding dimension.
There are two ways of dealing with it.
– Define a new keyword ALL in SQL:
   An ALL() function is defined to enumerate the set that ALL represents.
   ALL [NOT] ALLOWED is added to the column definition syntax.
   The set interpretation guides the relational operators {=, IN} for ALL.
– Avoid the ALL keyword:
   NULL is used instead of ALL.
   A GROUPING() function discriminates between ALL and NULL.
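A rough Python sketch of the second option, assuming None plays the role of NULL and an extra 0/1 flag plays the role of GROUPING(). The sample rows and the function name are hypothetical; the point is only that a NULL found in the data and a NULL standing for ALL stay distinguishable.

```python
from collections import defaultdict

# Hypothetical sales rows (model, color, units); None is a missing color in the data.
sales = [("Chevy", "red", 5), ("Chevy", None, 2), ("Ford", "red", 3)]

def sums_with_color_subtotals(rows):
    """Return (model, color, grouping_color, units) rows.

    grouping_color is 1 when the color column was aggregated away (ALL),
    and 0 when the color value, possibly None, comes from the data.
    """
    detail = defaultdict(int)
    per_model = defaultdict(int)
    for model, color, units in rows:
        detail[(model, color)] += units
        per_model[model] += units
    out = [(m, c, 0, u) for (m, c), u in detail.items()]
    out += [(m, None, 1, u) for m, u in per_model.items()]
    return out

result = sums_with_color_subtotals(sales)
for row in result:
    print(row)
```

Without the flag, the rows for Chevy with an unknown color and for Chevy over all colors would both read (Chevy, NULL, ...).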
10 3D Roll-Up
This is a simple 3-dimensional roll-up.
Aggregating over N dimensions requires N such unions.
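A minimal Python sketch of a roll-up that produces all levels in one pass instead of by unions, assuming SUM as the aggregate and the string "ALL" as the placeholder value. The sample rows and the function name rollup are illustrative assumptions.

```python
from collections import defaultdict

ALL = "ALL"  # placeholder for an aggregated-away dimension

# Hypothetical rows (model, year, color, sales).
rows = [
    ("Chevy", 1994, "black", 50),
    ("Chevy", 1994, "white", 40),
    ("Chevy", 1995, "black", 85),
    ("Chevy", 1995, "white", 115),
]

def rollup(rows, n_dims):
    """Sum the last field at every prefix level of the first n_dims fields."""
    totals = defaultdict(int)
    for *dims, value in rows:
        for k in range(n_dims, -1, -1):
            # Keep the first k dimensions, replace the rest with ALL.
            key = tuple(dims[:k]) + (ALL,) * (n_dims - k)
            totals[key] += value
    return dict(totals)

rolled = rollup(rows, 3)
print(rolled[("Chevy", 1994, ALL)], rolled[("Chevy", ALL, ALL)])
```

Each input row contributes to N+1 result rows, one per level of the hierarchy.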
11 Cross Tabs The symmetric aggregation result is a table called cross-tabulation.
12 Data Cube Relational Operator
13 N-dimensional Cube
Each attribute is a dimension.
N-dimensional aggregate (sum(), max(), ...)
– Fits the relational model exactly: a_1, a_2, ..., a_N, f()
Super-aggregate over (N-1)-dimensional sub-cubes:
   ALL, a_2, ..., a_N, f()
   a_1, ALL, a_3, ..., a_N, f()
   ...
   a_1, a_2, ..., ALL, f()
– This is the (N-1)-dimensional cross-tab.
Super-aggregate over (N-2)-dimensional sub-cubes:
   ALL, ALL, a_3, ..., a_N, f()
   ...
   a_1, a_2, ..., ALL, ALL, f()
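The enumeration above can be sketched in Python by letting every dimension of every row either keep its value or become ALL. This is a naive illustration with SUM as f(); the sample rows and the function name cube are assumptions, not part of the deck.

```python
from collections import defaultdict
from itertools import product

ALL = "ALL"  # placeholder for an aggregated-away dimension

# Hypothetical rows (model, year, color, sales).
rows = [
    ("Chevy", 1994, "black", 50),
    ("Chevy", 1994, "white", 40),
    ("Ford", 1994, "black", 50),
    ("Ford", 1994, "white", 10),
]

def cube(rows, n_dims):
    """Sum the last field for every combination of kept and ALL dimensions."""
    totals = defaultdict(int)
    for *dims, value in rows:
        for mask in product((False, True), repeat=n_dims):
            key = tuple(ALL if hide else d for d, hide in zip(dims, mask))
            totals[key] += value
    return dict(totals)

cells = cube(rows, 3)
print(cells[(ALL, 1994, "black")], cells[(ALL, ALL, ALL)])
```

Each input row updates 2^N cells, which is why the later slides look for cheaper ways to fill the super-aggregates.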
14 CUBE Operator
Syntax:
SELECT Model, Year, Color, SUM(sales) AS Sales
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE (Model, Year, Color)
Semantics:
15 CUBE Result of a Cube Operator
16 ROLL UP Operator
Syntax:
SELECT Manufacturer, Year, Color, Model, SUM(price) AS Revenue
FROM Weather
GROUP BY Manufacturer,
  ROLLUP Year(Time) AS Year, Month(Time) AS Month, Day(Time) AS Day
Semantics:
17 Snowflake Schema A snowflake schema showing the core fact table and some of the many aggregation granularities of the core dimensions.
18 Addressing the Data Cube
SQL3 defines a Turing-complete procedural programming language.
SELECT Year, Color, Model, SUM(sales) AS total,
       SUM(sales) / total(ALL, ALL, ALL)
FROM Sales
WHERE Model IN ('Ford', 'Chevy')
  AND Year BETWEEN 1990 AND 1992
GROUP BY CUBE Model, Year, Color
19 Computing Data Cubes
If attribute i has N_i values, the CUBE has Π_i (N_i + 1) values.
Compute the N-D cube with hashing if it fits in RAM.
Compute the N-D cube with sorting if it overflows RAM.
The same comments apply to sub-cubes:
– Compute the (N-1)-D sub-cube from the N-D cube.
– Aggregate on the "biggest" domain first when more than one level deep.
– Aggregate functions need hidden variables: e.g. average needs sum and count.
Use standard techniques from query processing:
– arrays, hashing, hybrid hashing
– fall back on sorting.
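The size formula is easy to check with a short Python sketch. The function name cube_size is hypothetical, and the count assumes every combination of values actually occurs in the data, so it is an upper bound for sparse data.

```python
from math import prod

def cube_size(cardinalities):
    """Upper bound on cube cells: each dimension gains one extra value, ALL."""
    return prod(n + 1 for n in cardinalities)

# Example: 2 models, 3 years, 3 colors.
size = cube_size([2, 3, 3])
print(size)  # 3 * 4 * 4 = 48
```

The core alone has 2 * 3 * 3 = 18 cells here, so the super-aggregates add 30 more.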
20 Computing Data Cubes
2^N-algorithm for cube computation
– The simplest algorithm to compute the cube is to allocate a handle for each cube cell.
Categorization of aggregation functions
– Distributive: the function can be calculated in the following distributed manner:
   Partition the data into n sets.
   Compute the aggregation function on each partition to get an aggregate value.
   Apply a function g() to the n aggregates to get a final aggregate.
   This aggregate is the same as if the whole data set had been aggregated at once.
   Examples: COUNT(), SUM(), MIN(), MAX().
   Can be calculated more efficiently than by the 2^N-algorithm.
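The distributive property can be sketched in Python as follows: f is applied to each partition and g combines the partial results. The helper name aggregate_in_parts and the sample data are illustrative assumptions.

```python
def aggregate_in_parts(values, n_parts, f, g):
    """Partition values, apply f to each non-empty partition, combine with g."""
    parts = [values[i::n_parts] for i in range(n_parts)]
    partials = [f(part) for part in parts if part]
    return g(partials)

data = [3, 1, 4, 1, 5, 9, 2, 6]

total = aggregate_in_parts(data, 3, sum, sum)     # SUM: g is SUM
count = aggregate_in_parts(data, 3, len, sum)     # COUNT: g is SUM of counts
largest = aggregate_in_parts(data, 3, max, max)   # MAX: g is MAX
smallest = aggregate_in_parts(data, 3, min, min)  # MIN: g is MIN

print(total, count, largest, smallest)
```

Note that g equals f for SUM, MIN, and MAX, while for COUNT the partial counts are combined with SUM.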
21 Computing Data Cubes, continued
– Algebraic: the function can be calculated by an algebraic function with M arguments (M a bounded positive integer), each the result of a distributive function.
   Examples: min_N(), max_N(), standard_deviation(), avg().
   Can also be calculated more efficiently than by the 2^N-algorithm.
– Holistic: there is no constant bound on the storage size needed to describe a sub-aggregate.
   Examples: rank(), median(), mode() (these need the base data).
   The 2^N-algorithm is the fastest known for exact results, but better algorithms exist for approximate results.
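For the algebraic case, avg() is the usual example: a fixed-size pair (sum, count) describes each sub-aggregate. A minimal Python sketch, with hypothetical function names:

```python
def avg_start(values):
    """Sub-aggregate for avg(): a fixed-size (sum, count) pair."""
    return (sum(values), len(values))

def avg_merge(a, b):
    """Combine two sub-aggregates; both parts are distributive."""
    return (a[0] + b[0], a[1] + b[1])

def avg_end(state):
    """Final algebraic step: divide sum by count."""
    total, count = state
    return total / count

state = avg_merge(avg_start([1, 2, 3]), avg_start([4, 5]))
average = avg_end(state)
print(state, average)
```

Averaging the two partial averages directly would give (2.0 + 4.5) / 2 = 3.25 rather than 3.0, which is why the pair has to be carried along. A holistic function such as median() has no such fixed-size summary.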
22 Example
Compute the 2D core of the 2 x 3 cube.
Then compute the 1D edges.
Then compute the 0D point.
Works for algebraic and distributive functions.
Saves "lots" of calls.
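A Python sketch of this order of computation for SUM, assuming a hypothetical 2 x 3 core of models by colors. The helper collapse and all numbers are made up for illustration.

```python
from collections import defaultdict

# Hypothetical 2 x 3 core: (model, color) -> sales.
core = {
    ("Chevy", "black"): 50, ("Chevy", "white"): 40, ("Chevy", "red"): 10,
    ("Ford", "black"): 30, ("Ford", "white"): 20, ("Ford", "red"): 5,
}

def collapse(cells, axis):
    """Sum out one axis of a dict keyed by tuples."""
    out = defaultdict(int)
    for key, value in cells.items():
        out[key[:axis] + key[axis + 1:]] += value
    return dict(out)

by_model = collapse(core, 1)         # 1D edge: color summed out
by_color = collapse(core, 0)         # 1D edge: model summed out
grand_total = collapse(by_model, 0)  # 0D point, built from the smaller edge

print(by_model, by_color, grand_total)
```

The edges are computed from the 6 core cells and the point from 2 edge cells, so the base rows are read only once, when the core is built.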
23 Maintaining a Data Cube
– Up until now we have been discussing only SELECT statements.
– Now we have to accommodate INSERT, DELETE, and UPDATE.
– Example: max() is distributive for SELECT and INSERT, but holistic for DELETE.
– If a function is algebraic for INSERT, UPDATE, and DELETE, it is easy to maintain the cube.
– If it is distributive, it is fairly inexpensive (using scratchpads).
– If it is holistic, it is expensive to maintain the cube.
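The max() example can be sketched in Python as a single cube cell. The class MaxCell and its rescans counter are hypothetical, added only to show when the base data has to be revisited.

```python
class MaxCell:
    """One cube cell holding max() over its base values."""

    def __init__(self, values):
        self.values = list(values)  # stands in for the base data
        self.current = max(self.values) if self.values else None
        self.rescans = 0            # how often the base data was rescanned

    def insert(self, v):
        # Distributive: the new maximum needs only the old maximum and v.
        self.values.append(v)
        if self.current is None or v > self.current:
            self.current = v

    def delete(self, v):
        # Holistic: removing the current maximum forces a rescan.
        self.values.remove(v)
        if v == self.current:
            self.rescans += 1
            self.current = max(self.values) if self.values else None

cell = MaxCell([3, 9, 4])
cell.insert(7)  # no rescan needed
cell.delete(4)  # not the maximum, no rescan needed
cell.delete(9)  # the maximum is gone, rescan the remaining values
print(cell.current, cell.rescans)
```

Deleting a non-maximal value is cheap, but deleting the maximum cannot be answered from the cell alone.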
24 Summary
The CUBE operator generalizes relational aggregates.
Needs an ALL value to denote sub-cubes
– ALL values represent aggregation sets.
Needs a generalization of user-defined aggregates.
Decorations and abstractions are interesting.
Computation has interesting optimizations.
The relationship to the "rest of SQL" is not fully worked out.