Based on notes by Jim Gray Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals Jim Gray Microsoft Adam Bosworth Microsoft Andrew Layman Microsoft Hamid Pirahesh IBM Presented by: Changwu Li Based on notes by Jim Gray 9/17/2018
The Data Analysis Cycle Spread Sheet Table 1 10 15 12 9 6 3 Size vs Speed Access Time (seconds) -9 -6 -3 Cache Main Secondary Disc Nearline Tape Offline Online 4 2 -2 -4 Price vs Speed Size(B) $/MB visualize Extract analyze User extracts data from database with query Then visualizes, analyzes data with desktop tools 9/17/2018
Relational Aggregate Operators SQL has several aggregate operators: sum(), min(), max(), count(), avg() Other systems extend this with many others: stat functions, financial functions, ... The basic idea is: Combine all values in a column into a single scalar value. Syntax select sum(units) from inventory; 9/17/2018
Relational Group By Operator Group By allows aggregates over table sub-groups Result is a new table Syntax: select deptno, sum(salary) from emp group by deptno 9/17/2018
Problems With This Design Users Want Histograms Users want sub-totals and totals drill-down & roll-up reports Users want CrossTabs Conventional wisdom These are not relational operators They are in many report writers and query engines F() G() H() sum M T W T F S S AIR HOTEL FOOD MISC 9/17/2018
Table 5a: Ford Sales Cross Tab A cross tab example Table 5a: Ford Sales Cross Tab Ford 1994 1995 total (ALL) black 50 85 135 white 10 75 60 160 220 9/17/2018
How to solve this problem? Answer: cube 9/17/2018
The Idea: Think of the N-dimensional Cube Each Attribute is a Dimension N-dimensional Aggregate (sum(), max(),...) fits relational model exactly: a1, a2, ...., aN, f() Super-aggregate over N-1 Dimensional sub-cubes ALL, a2, ...., aN , f() a3 , ALL, a3, ...., aN , f() ... a1, a2, ...., ALL, f() this is the N-1 Dimensional cross-tab. Super-aggregate over N-2 Dimensional sub-cubes ALL, ALL, a3, ...., aN , f() a1, a2 ,...., ALL, ALL, f() 9/17/2018
An Example CUBE 9/17/2018
Why the ALL Value? Need a new “Null” value (overloads the null indicator) Value must not already be in the aggregated domain Can’t use NULL since may aggregate on it. Think of ALL as a token representing the set All(color)={red, white, blue}, All(year)={1990, 1991, 1992}, All(model)={Chevy, Ford} Follow “set of values” semantics. 9/17/2018
CUBE operator: Syntax select model, make, year, sum(sales) Proposed syntax: Note: Group By operator repeats aggregate list in select list in group by list select model, make, year, sum(sales) from car_sales where model in {“chevy”, “ford”} and year between 1990 and 1994 group by model, make, year with cube having sum(sales) > 0; 9/17/2018
How To Compute the Cube? If each attribute has Ni values CUBE has P (Ni+1) values Compute N-D cube with hash if fits in RAM Compute N-D cube with sort if overflows RAM Same comments apply to subcubes: compute N-D-1 subcube from N-D cube. Aggregate on “biggest” domain first when >1 deep Aggregate functions need hidden variables: e.g. average needs sum and count. 9/17/2018
Example: Compute 2D core of 2 x 3 cube Then compute 1D edges Then compute 0D point Works for algebraic and distributive functions Saves “lots” of calls 9/17/2018
Real world implementation Both Oracle 9i and SQL server 2000 An example in Oracle 9i: select deptno, job, sum(sal) as salary from emp group by cube(deptno, job) 9/17/2018
-------- ------------------ -------------------- 10 CLERK 1300 DEPTNO JOB SALARY -------- ------------------ -------------------- 10 CLERK 1300 10 MANAGER 2450 10 PRESIDENT 5000 10 8750 20 ANALYST 6000 20 CLERK 1900 20 MANAGER 2975 20 10875 30 CLERK 950 30 MANAGER 2850 30 SALESMAN 5600 30 9400 ANALYST 6000 CLERK 4150 MANAGER 8275 PRESIDENT 5000 SALESMAN 5600 2902 9/17/2018
1999 ACM Turing Award Jim Gray 9/17/2018