Online Analytical Processing (OLAP): An Overview
Kian Win Ong, Nicola Onose
March 3rd, 2006
Overview Motivation Multi-Dimensional Data Model Research Areas Optimizations –Materializing multiple aggregates simultaneously –Materialization strategy
Motivation Aggregation, summarization and exploration Of historical data To help management make informed decisions
Different Goal
Aggregation, summarization and exploration of historical data, to help management make informed decisions.

Product             Branch         Time    Price
Coke (0.5 gallon)   Convoy Street  :00:01  $1.00
Pepsi (0.5 gallon)  UTC            :00:01  $1.03
Coke (1 gallon)     UTC            :00:02  $1.50
Altoids             Costa Verde    :01:33  $

Find the total sales for each product and month.
Find the percentage change in the total monthly sales for each product.
Different Requirements
OLTP: On-Line Transaction Processing
OLAP: On-Line Analytical Processing

                     OLTP                                            OLAP
Tasks                Day to day operation                            High level decision support
Size of database     Gigabytes                                       Terabytes
Time span            Recent, up-to-date                              Spanning over months / years
Size of working set  Tens of records, accessed through primary keys  Consolidated data from multiple databases
Workload             Structured / repetitive                         Ad-hoc, exploratory queries
Performance          Transaction throughput                          Query latency
Query Language Extensions
In the real world, data is stored in relational databases (RDBs).
How can N-dimensional problems be expressed using 2-D tables?
Can we combine OLAP and SQL queries?
Jim Gray et al.: Data Cube: A Relational Aggregation Operator, 1997
Query Language Extensions
Problems with GROUP BY
1. histograms

Standard GROUP BY cannot group directly on a computed category such as Population(City, State):

SELECT sales, prod_name, population
FROM sales_history
GROUP BY Population(City, State) AS population
Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown

Non-relational representation:

Product Category  Product Name  Month  Sales  Sales by Cat., by Name  Sales by Cat.
Drinks            Coke          Feb    30.3
                                Mar    93.9   124.2
                  Heineken      Feb    34.8
                                Mar    123.8  158.6                   282.8
Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown

Relational, but the rollup is huge:

Product Category  Product Name  Month  Sales  Sales by Cat., by Name  Sales by Cat.
Drinks            Coke          Feb    30.3   124.2                   282.8
Drinks            Coke          Mar    93.9   124.2                   282.8
Drinks            Heineken      Feb    34.8   158.6                   282.8
Drinks            Heineken      Mar    123.8  158.6                   282.8
Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown
3. cross tabulations

Could be represented as:

Product Category  Product Name  Month  Sales
Drinks            Coke          Feb    30.3
Drinks            Coke          Mar    93.9
Drinks            Coke          Total  124.2
Drinks            Heineken      Feb    34.8
Drinks            Heineken      Mar    123.8
Drinks            Heineken      Total  158.6
Drinks            Total                282.8
Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown
3. cross tabulations

2-D aggregation is more compact and more natural:

Drinks    Feb   Mar    Total
Coke      30.3  93.9   124.2
Heineken  34.8  123.8  158.6
Total     65.1  217.7  282.8
Query Language Extensions
Problems with GROUP BY
1. histograms
2. rollup/drilldown
3. cross tabulations
4. complex expressions, hard to optimize

Reducing an N-dimensional aggregation to 1-D aggregations (GROUP BY) needs 2^N GROUP BYs, where N is the number of dimensions.
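To make the 2^N blow-up concrete, here is a minimal Python sketch that writes out one GROUP BY per subset of the dimensions and joins them with UNION ALL. The helper name `union_of_group_bys` is hypothetical, and filling aggregated-away columns with the literal 'ALL' is an assumption borrowed from the relational representation in Gray et al.; the table and column names are the ones used on these slides.

```python
from itertools import combinations

def union_of_group_bys(table, dims, measure):
    """Build one GROUP BY query per subset of dims (2^N queries in total)."""
    queries = []
    # Walk the subsets from the full grouping down to the empty one.
    for size in range(len(dims), -1, -1):
        for kept in combinations(dims, size):
            # Dimensions that are aggregated away are filled with the literal 'ALL'.
            columns = [d if d in kept else f"'ALL' AS {d}" for d in dims]
            query = (f"SELECT {', '.join(columns)}, "
                     f"SUM({measure}) AS {measure} FROM {table}")
            if kept:
                query += f" GROUP BY {', '.join(kept)}"
            queries.append(query)
    return queries

queries = union_of_group_bys(
    "sales_history", ["prod_category", "prod_name", "month"], "sales"
)
print(len(queries))  # one query per subset of the three dimensions
print("\nUNION ALL\n".join(queries))
```

With three dimensions the sketch already emits eight queries, which is the kind of expression that is tedious to write and hard for an optimizer to recognise as a single cube computation.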
Query Language Extensions
Reducing the number of attributes

Product Category  Product Name  Month  Sales
Drinks            Coke          Feb    30.3
Drinks            Coke          Mar    93.9
Drinks            Coke          ALL    124.2
Drinks            Heineken      Feb    34.8
Drinks            Heineken      Mar    123.8
Drinks            Heineken      ALL    158.6
Drinks            ALL           ALL    282.8
Drinks            ALL           Feb    65.1
Drinks            ALL           Mar    217.7
Query Language Extensions
Reducing the number of attributes
Introduce a new value: "ALL"
"ALL" = the set over which we aggregate

Drinks       Feb   Mar    Total (ALL)
Coke         30.3  93.9   124.2
Heineken     34.8  123.8  158.6
Total (ALL)  65.1  217.7  282.8
Query Language Extensions
General approach
GROUP BY (1D)
[Figure: 1-D aggregation of Sales by Product Name; labels Feb, Mar, Coke, Heineken, SUM]
Query Language Extensions
General approach
GROUP BY (1D)
Cross Tab (2D)

Drinks    Feb   Mar    ALL
Coke      30.3  93.9   124.2
Heineken  34.8  123.8  158.6
ALL       65.1  217.7  282.8

The corresponding relation:

Product Category  Product Name  Month  Sales
Drinks            Coke          Feb    30.3
Drinks            Coke          Mar    93.9
Drinks            Coke          ALL    124.2
Drinks            Heineken      Feb    34.8
Drinks            Heineken      Mar    123.8
Drinks            Heineken      ALL    158.6
Drinks            ALL           Feb    65.1
Drinks            ALL           Mar    217.7
Drinks            ALL           ALL    282.8
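The step from a cross tab to its relation can be checked with a small Python sketch. It starts from the four base rows of the Drinks example and adds every row in which the product name, the month, or both are replaced by ALL. The function name `cross_tab_relation` is hypothetical, and `Decimal` is used only so that the sums print exactly.

```python
from collections import defaultdict
from decimal import Decimal

# Base rows (category, name, month, sales) taken from the Drinks example.
rows = [
    ("Drinks", "Coke", "Feb", Decimal("30.3")),
    ("Drinks", "Coke", "Mar", Decimal("93.9")),
    ("Drinks", "Heineken", "Feb", Decimal("34.8")),
    ("Drinks", "Heineken", "Mar", Decimal("123.8")),
]

def cross_tab_relation(rows):
    totals = defaultdict(Decimal)
    for category, name, month, sales in rows:
        # Each base row contributes to its own cell, to its row total,
        # to its column total and to the grand total.
        for n in (name, "ALL"):
            for m in (month, "ALL"):
                totals[(category, n, m)] += sales
    return dict(totals)

relation = cross_tab_relation(rows)
for key in sorted(relation):
    print(*key, relation[key])
```

The nine tuples produced are the nine cells of the 2-D cross tab, which is the sense in which ALL lets a cross tab be stored as an ordinary relation.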
Query Language Extensions
General approach
GROUP BY (1D)
Cross Tab (2D)
Cube (3D)

Product Category  Product Name  Month  Sales
Drinks            Coke          Feb    30.3
Drinks            Coke          Mar    93.9
Drinks            Coke          ALL    124.2
…                 …             …      …
Snacks            Doritos       Feb    123.8
Snacks            Doritos       Mar    158.6
Snacks            Doritos       ALL    65.1
…                 …             …      …
ALL               …             …      …

Aggregates in the cube:
By cat. and month
By cat. and name (does it make sense?)
By month and name
Query Language Extensions GROUP BY (1D) Cross Tab (2D) Cube (3D) Any hypercube can be represented as a relation! General approach
Query Language Extensions
General approach
A CUBE relation, with aggregation function f():
(x_1, x_2, …, x_{n-1}, x_n, f())
…
(x_1, x_2, …, x_{n-1}, ALL, f())
…
(x_1, x_2, …, ALL, x_n, f())
…
After ROLLUP, the number of tuple patterns is reduced to a linear one (n + 1 patterns instead of 2^n):
(x_1, x_2, …, x_{n-1}, x_n, f())
…
(x_1, x_2, …, x_{n-1}, ALL, f())
…
(x_1, x_2, …, ALL, ALL, f())
…
(ALL, ALL, …, ALL, ALL, f())
Query Language Extensions
The new operators: CUBE, ROLLUP

SELECT prod_category, prod_name, month, SUM(sales) AS sales
FROM sales_history
GROUP BY CUBE prod_category, prod_name, month

Product Category  Product Name  Month  Sales
Drinks            Coke          Feb    30.3
Drinks            Coke          Mar    93.9
Drinks            Coke          ALL    124.2
…                 …             …      …
Drinks            ALL           Feb    99.8
…                 …             …      …
ALL               …             …      …

Idea:
Group by the CUBE list.
Union the aggregates.
Introduce the ALL values.
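The three-step idea (group by the CUBE list, union the aggregates, introduce the ALL values) can be sketched in a few lines of Python. This is only an illustration of the semantics, not how a database engine would evaluate CUBE; the function name `cube` and the sample rows with integer sales are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Return {tuple of dimension values or 'ALL': SUM(measure)} for every grouping."""
    result = defaultdict(int)
    # All 2^N subsets of dimension positions that stay un-aggregated.
    groupings = [set(c) for r in range(len(dims) + 1)
                 for c in combinations(range(len(dims)), r)]
    for row in rows:
        for kept in groupings:
            # Positions outside the grouping are replaced by the ALL value.
            key = tuple(row[d] if i in kept else "ALL" for i, d in enumerate(dims))
            result[key] += row[measure]
    return dict(result)

# Hypothetical sample data (not the figures from the slide).
sales_history = [
    {"prod_category": "Drinks", "prod_name": "Coke", "month": "Feb", "sales": 30},
    {"prod_category": "Drinks", "prod_name": "Coke", "month": "Mar", "sales": 90},
    {"prod_category": "Drinks", "prod_name": "Heineken", "month": "Feb", "sales": 35},
    {"prod_category": "Snacks", "prod_name": "Doritos", "month": "Feb", "sales": 12},
]

result = cube(sales_history, ["prod_category", "prod_name", "month"], "sales")
for key in sorted(result):
    print(*key, result[key])
```

Because every grouping lands in the same dictionary, the union of the aggregates is implicit; in SQL terms each distinct key is one output tuple of the CUBE query.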
Query Language Extensions
The new operators: CUBE, ROLLUP

SELECT prod_category, month, day, state, prod_name, SUM(sales) AS sales
FROM sales_history
GROUP BY prod_category
ROLLUP month, day
CUBE state, prod_name

Product Category  Month  Day  State  Product Name  Sales
Drinks            Feb    26   CA     Coke          12.3
                  Feb    26   CA     Heineken      5.4
                  …      …    …      …             …
                  Feb    26   CA     ALL           30.4
                  Feb    26   ALL    Coke          …
                  …      …    …      …             …
Snacks            Feb    26   CA     Doritos       12.0
                  …      …    …      …             …
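One way to read a query that mixes GROUP BY, ROLLUP and CUBE is as a list of grouping sets. The Python sketch below enumerates them under two assumptions: the three clauses combine as a cross product (plain columns always kept, ROLLUP contributing its prefixes, CUBE contributing its subsets), and the CUBE list is state, prod_name, matching the columns of the result table. The helper name `grouping_sets` is hypothetical.

```python
from itertools import combinations

def grouping_sets(group_by, rollup, cube):
    """List the sets of columns that stay un-aggregated (all others become ALL)."""
    # ROLLUP drops columns from the right, one at a time.
    rollup_prefixes = [rollup[:i] for i in range(len(rollup), -1, -1)]
    # CUBE keeps every subset of its columns.
    cube_subsets = [list(c) for r in range(len(cube), -1, -1)
                    for c in combinations(cube, r)]
    return [group_by + prefix + subset
            for prefix in rollup_prefixes
            for subset in cube_subsets]

sets = grouping_sets(["prod_category"], ["month", "day"], ["state", "prod_name"])
for s in sets:
    print(s)
```

Under these assumptions the example query produces 1 x 3 x 4 = 12 grouping sets, compared with 2^5 = 32 for a CUBE over all five columns.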
Research Areas SQL language extensions Server architecture Parallel processing Index structures Materialized views
Simultaneous Aggregates
Multi-dimensional optimization to calculate multiple aggregates simultaneously.
Useful for materialization of aggregate views.
Y. Zhao, P. Deshpande, J. Naughton: An Array-Based Algorithm for Simultaneous Multidimensional Aggregates, SIGMOD 1997
Multiple Aggregates

Product  City         Month   Sales
Coke     San Diego    Feb 06  12
Pepsi    Los Angeles  Feb 06  13
Doritos  San Diego    Mar 06  72
Altoids  San Diego    Mar

Aggregate on…
Month / Product: rows Altoids, Coke, Doritos, Heineken, Pepsi, Pringles, Total; columns Feb, Mar, Total
City / Product: rows Altoids, Coke, Doritos, Heineken, Pepsi, Pringles, Total; columns San Diego, Los Angeles, Total
Month / City: rows Los Angeles, San Diego, Total; columns Feb, Mar, Total
Multiple Aggregates

Product  City         Month   Sales
Coke     San Diego    Feb 06  12
Pepsi    Los Angeles  Feb 06  13
Doritos  San Diego    Mar 06  72
Altoids  San Diego    Mar

Aggregate on…
1. Sales by Product / City
2. Sales by Product / Month
3. Sales by Month / City
4. Sales by Product
5. Sales by City
6. Sales by Month
7. Sales (Total)

Is it possible to make a single pass over the transactional table and calculate multiple aggregates simultaneously?
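As a baseline answer to that question, the seven aggregates can be maintained side by side in one scan of the rows. The Python sketch below uses plain dictionaries rather than the array chunks of Zhao et al., so it illustrates the goal rather than their algorithm. It uses the three rows of the example table whose sales figure is legible; the function name `aggregate_in_one_pass` is hypothetical.

```python
from collections import defaultdict
from itertools import combinations

DIMS = ("product", "city", "month")

rows = [
    {"product": "Coke", "city": "San Diego", "month": "Feb 06", "sales": 12},
    {"product": "Pepsi", "city": "Los Angeles", "month": "Feb 06", "sales": 13},
    {"product": "Doritos", "city": "San Diego", "month": "Mar 06", "sales": 72},
]

def aggregate_in_one_pass(rows):
    # One accumulator per proper subset of the dimensions: 3 pairs,
    # 3 single dimensions and the grand total, i.e. 7 aggregates.
    groupings = [c for r in range(len(DIMS)) for c in combinations(DIMS, r)]
    sums = {g: defaultdict(int) for g in groupings}
    for row in rows:  # the only scan of the data
        for g in groupings:
            sums[g][tuple(row[d] for d in g)] += row["sales"]
    return sums

sums = aggregate_in_one_pass(rows)
print(len(sums))                       # number of aggregates maintained
print(dict(sums[("city",)]))           # sales by city
print(sums[()][()])                    # total sales
```

The drawback of this naive version is that all seven accumulators stay in memory for the whole scan, which is the cost the chunk-based algorithm is designed to reduce.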
Chunking
Partition transactional data into array chunks.
[Figure: 3-D array with axes Dimension A, Dimension B and Dimension C, labelled Product, City and Month; the row (Coke, San Diego, Feb 06, 12) is placed in one array chunk]
Naïve Algorithm
Pivot on AB, aggregate on all C
Pivot on AC, aggregate on all B
Pivot on BC, aggregate on all A
Each pivot needs its own pass over the data.
[Figure: chunked array with axes Dimension A, Dimension B and Dimension C]
Single Pass Algorithm
Make a single pass over data
Simultaneously maintain multiple aggregates
Write out completed aggregates
Only allocate memory that is necessary
[Figure: chunked array with axes Dimension A, B and Dimension C, and the aggregate planes AB, AC and BC]
Single Pass Algorithm
Minimum memory spanning tree

Node               Memory
ABC (array chunk)  4 x 4 x 4
AB                 16 x 4 x 4
AC                 4 x 4 x 4
BC                 4 x 4
A                  4 x 4
B                  4
C                  4
all                1
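The memory figures in the tree appear to follow one rule: when chunks are read in dimension order A, B, C, a group-by needs the full dimension size for the longest prefix of that order it contains, and only one chunk width for its remaining dimensions. The Python sketch below encodes that reading. The dimension size of 16 and the chunk width of 4 are assumptions chosen to be consistent with the entry "AB 16 x 4 x 4", and the function name `memory_cells` is hypothetical.

```python
from math import prod

def memory_cells(group_by, order, dim_size, chunk_size):
    """Cells needed to hold `group_by` while chunks are read in dimension `order`."""
    # Length of the longest prefix of the read order contained in the group-by.
    p = 0
    while p < len(order) and order[p] in group_by:
        p += 1
    prefix = order[:p]
    rest = [d for d in order[p:] if d in group_by]
    # Full size for the prefix dimensions, one chunk width for the others.
    return prod(dim_size[d] for d in prefix) * prod(chunk_size[d] for d in rest)

ORDER = "ABC"
DIM_SIZE = {"A": 16, "B": 16, "C": 16}   # assumed array size per dimension
CHUNK_SIZE = {"A": 4, "B": 4, "C": 4}    # assumed chunk width per dimension

for node in ["AB", "AC", "BC", "A", "B", "C", ""]:
    cells = memory_cells(node, ORDER, DIM_SIZE, CHUNK_SIZE)
    print(node or "all", cells)
```

Under these assumptions BC, whose dimensions vary fastest in the read order, is completed after each chunk along A and needs the least memory, while AB must be held until the whole array has been scanned.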
Multi Pass Algorithm
Recursively aggregate

ABCD
ABC  ABD  ACD  BCD
AB  AC  BC  AD  BD  CD
A  B  C  D
all
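Recursive aggregation over this lattice means that each group-by is computed from a group-by one level up rather than from the base data. The Python sketch below picks, for every node, the parent with the fewest cells. This "smallest parent" choice is an assumption used for illustration and is not necessarily the plan chosen by the multi-pass algorithm; the dimension sizes and the name `smallest_parent_plan` are hypothetical.

```python
from itertools import combinations
from math import prod

def smallest_parent_plan(dims, size):
    """Map each group-by to the parent (one extra dimension) with the fewest cells."""
    plan = {}
    for r in range(len(dims) - 1, -1, -1):
        for node in combinations(dims, r):
            # A parent adds exactly one dimension, kept in the original order.
            parents = [tuple(d for d in dims if d in node or d == extra)
                       for extra in dims if extra not in node]
            plan[node] = min(parents, key=lambda p: prod(size[d] for d in p))
    return plan

DIMS = ("A", "B", "C", "D")
SIZE = {"A": 10, "B": 20, "C": 30, "D": 40}  # hypothetical dimension sizes

plan = smallest_parent_plan(DIMS, SIZE)
for node, parent in plan.items():
    print("".join(node) or "all", "<-", "".join(parent))
```

Every node except the root ABCD gets exactly one parent, so the plan is a spanning tree of the lattice.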
Implementing Data Cubes
Biggest problem for data warehouses: the size
Space / time trade-off: accelerate queries by materializing the cube
The size of the relations gets even bigger!
MOLAP (Multidimensional OLAP): good query performance, but bad scalability
ROLAP (Relational OLAP): very scalable; query performance improved by materializing (partial) results
Implementing Data Cubes
V. Harinarayan, A. Rajaraman, J.D. Ullman: Implementing Data Cubes Efficiently, SIGMOD 1996
Presents a materialization strategy for the cells of the cube.
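Implementing Data Cubes
[Figure: schema diagram. Fact table: Time Id, City Id, Product Id, Sales. Other tables, listed by their attributes: (Day, Month, Week); (City Id, City, State); (Product Id, Name, Category); (Week, Month, Year); (Category Id, Category Name)]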
Implementing Data Cubes
Cast as a particular case of the rewriting-using-views problem:
what cells to materialize becomes what SQL views to materialize
p = product, t = time, c = city
Simple idea: Q1 depends on Q2 (Q1 ≤ Q2) if Q1 can be fully answered using the results of Q2

ptc
pt  tc  pc
t  p  c
none
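For views that differ only in their group-by attributes, the dependence relation comes down to a subset test. A minimal Python sketch, with p, t and c as on the slide; the function names are hypothetical.

```python
from itertools import combinations

DIMS = "ptc"  # p = product, t = time, c = city

def views(dims=DIMS):
    """All 2^N views of the lattice, from the top view down to 'none'."""
    return ["".join(c) for r in range(len(dims), -1, -1)
            for c in combinations(dims, r)]

def depends_on(q1, q2):
    """Q1 <= Q2: Q1 groups by a subset of the attributes of Q2."""
    return set(q1) <= set(q2)

def ancestors(q, dims=DIMS):
    """Views from which q can be answered, including q itself."""
    return [v for v in views(dims) if depends_on(q, v)]

print(views())            # the empty string stands for the view 'none'
print(ancestors("p"))     # views that can answer a query grouped by product
print(depends_on("pt", "tc"))
```

The relation is only a partial order: pt and tc, for example, cannot answer each other, which is why the views form a lattice rather than a chain.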
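Implementing Data Cubes
But cube dimensions are usually hierarchical:
product: product_name → product_category → none
time: day → week → none, and day → month → year → none
city: city → state → none
Combining the three hierarchies gives the direct-product lattice.
p = product, t = time, c = city
[Figure: part of the direct-product lattice, with nodes ptc, pt, tc, pc, pts, pwc, pyc, pmc, ps, p cat t, …]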
Implementing Data Cubes
Definition: the cost of answering Q is the number of rows in the table of the materialized ancestor of Q that is used to answer it.
The row counts can be estimated without materializing the views.
Assume that all queries are identical to some view in the lattice.
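Implementing Data Cubes
For a set S of materialized views and a view v, the benefit of adding v is
$B(v, S) = \sum_{w \le v,\ w \notin S} \max\{\mathrm{cost}(w) - \mathrm{cost}(v),\ 0\}$
where cost(w) is the current cost of answering w from the views in S and cost(v) is the number of rows of v.
Greedy algorithm for selecting k views to materialize from the lattice:
1. S := {top view}
2. For i = 1 to k, add to S the view v not in S for which B(v, S) is maximized
The benefit achieved by the greedy algorithm is at least (e-1)/e ≈ 0.63 of the optimum.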
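A minimal Python sketch of the greedy selection, using the benefit formula and the row-count cost model from these slides. The row counts, the `children` adjacency map and the function name `greedy_select` are hypothetical illustration values, not data from the paper; the sketch also assumes that a view never has more rows than the views it depends on.

```python
def greedy_select(rows, children, top, k):
    """Pick k views besides `top`; rows[v] = row count, children[v] = direct dependents."""
    def descendants(v):
        # All w <= v, including v itself.
        seen, stack = {v}, [v]
        while stack:
            for c in children.get(stack.pop(), ()):
                if c not in seen:
                    seen.add(c)
                    stack.append(c)
        return seen

    below = {v: descendants(v) for v in rows}
    # Initially every view is answered from the top view.
    cost = {w: rows[top] for w in rows}
    selected = [top]
    for _ in range(k):
        def benefit(v):
            return sum(max(cost[w] - rows[v], 0)
                       for w in below[v] if w not in selected)
        best = max((v for v in rows if v not in selected), key=benefit)
        selected.append(best)
        # Views below the new choice may now be answered more cheaply.
        for w in below[best]:
            cost[w] = min(cost[w], rows[best])
    return selected

# Hypothetical row counts for the p / t / c lattice.
rows = {"ptc": 1000, "pt": 800, "pc": 600, "tc": 300,
        "p": 50, "t": 12, "c": 20, "none": 1}
children = {"ptc": ["pt", "pc", "tc"], "pt": ["p", "t"], "pc": ["p", "c"],
            "tc": ["t", "c"], "p": ["none"], "t": ["none"], "c": ["none"]}

print(greedy_select(rows, children, "ptc", 2))
```

With these numbers the first pick is tc, because it cheapens four views by 700 rows each; after that the benefit of t and c shrinks, and p becomes the best second pick.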
Discussion Questions from the audience…