 N. Roussopoulos 2007 OLAP & Data Cubing Spring 2007 Nick Roussopoulos

Slides:



Advertisements
Similar presentations
1 CUBE: A Relational Aggregate Operator Generalizing Group By Jim Gray Adam Bosworth Andrew Layman Microsoft Microsoft.com Hamid Pirahesh IBM.
Advertisements

Query Optimization Reserves Sailors sid=sid bid=100 rating > 5 sname (Simple Nested Loops) Imperative query execution plan: SELECT S.sname FROM Reserves.
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates By Yihong Zhao, Prasad M. Desphande and Jeffrey F. Naughton Presented by Kia Hall.
1 DynaMat A Dynamic View Management System for Data Warehouses Vicky :: Cao Hui Ping Sherman :: Chow Sze Ming CTH :: Chong Tsz Ho Ronald :: Woo Lok Yan.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Dwarf: A High Performance OLAP Engine Nick Roussopoulos ACT Inc. & UMD.
Quick Review of Apr 10 material B+-Tree File Organization –similar to B+-tree index –leaf nodes store records, not pointers to records stored in an original.
Chapter 11 Indexing and Hashing (2) Yonsei University 2 nd Semester, 2013 Sanghyun Park.
Generating the Data Cube (Shared Disk) Andrew Rau-Chaplin Faculty of Computer Science Dalhousie University Joint Work with F. Dehne T. Eavis S. Hambrusch.
Technical BI Project Lifecycle
March DGRC FedStats Visit Aggregation in Main Memory Kenneth A. Ross Columbia University.
Advanced Databases: Lecture 2 Query Optimization (I) 1 Query Optimization (introduction to query processing) Advanced Databases By Dr. Akhtar Ali.
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
OLAP Services Business Intelligence Solutions. Agenda Definition of OLAP Types of OLAP Definition of Cube Definition of DMR Differences between Cube and.
Query Evaluation. An SQL query and its RA equiv. Employees (sin INT, ename VARCHAR(20), rating INT, age REAL) Maintenances (sin INT, planeId INT, day.
Cube Tree Dimension: number of group-by values Relation tuples map to a point in the space Aggregates: projection of all data points on all the subspaces.
Data Cube and OLAP Server
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
Chap8: Trends in DBMS 8.1 Database support for Field Entities 8.2 Content-based retrieval 8.3 Introduction to spatial data warehouses 8.4 Summary.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Evaluation of Relational Operations. Relational Operations v We will consider how to implement: – Selection ( ) Selects a subset of rows from relation.
An Array-Based Algorithm for Simultaneous Multidimensional Aggregates
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Online Analytical Processing (OLAP) Hweichao Lu CS157B-02 Spring 2007.
Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals Presenter : Parminder Jeet Kaur Discussion Lead : Kailang.
Objectives of the Lecture :
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Database Systems: Design, Implementation, and Management Eighth Edition Chapter 10 Database Performance Tuning and Query Optimization.
1 CUBE: A Relational Aggregate Operator Generalizing Group By By Ata İsmet Özçelik.
1 Cube Computation and Indexes for Data Warehouses CPS Notes 7.
Multidimensional Indexes Applications: geographical databases, data cubes. Types of queries: –partial match (give only a subset of the dimensions) –range.
Data Warehousing.
13 1 Chapter 13 The Data Warehouse Database Systems: Design, Implementation, and Management, Seventh Edition, Rob and Coronel.
Storage Structures. Memory Hierarchies Primary Storage –Registers –Cache memory –RAM Secondary Storage –Magnetic disks –Magnetic tape –CDROM (read-only.
Online Analytical Processing (OLAP) An Overview Kian Win Ong, Nicola Onose Mar 3 rd 2006.
CPSC 404, Laks V.S. Lakshmanan1 Overview of Query Evaluation Chapter 12 Ramakrishnan & Gehrke (Sections )
The Cubetree Storage Organization A High Performance ROLAP Datablade 데이터베이스 연구실 석사 3 학기 강 주 영
Data Warehousing.
IMS 4212: Database Implementation 1 Dr. Lawrence West, Management Dept., University of Central Florida Physical Database Implementation—Topics.
File Processing : Query Processing 2008, Spring Pusan National University Ki-Joune Li.
Implementation of Database Systems, Jarek Gryz1 Evaluation of Relational Operations Chapter 12, Part A.
DMBS Internals I February 24 th, What Should a DBMS Do? Store large amounts of data Process queries efficiently Allow multiple users to access the.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
SF-Tree and Its Application to OLAP Speaker: Ho Wai Shing.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Evaluation of Relational Operations Chapter 14, Part A (Joins)
CS4432: Database Systems II
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
1 Database Systems, 8 th Edition Star Schema Data modeling technique –Maps multidimensional decision support data into relational database Creates.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
Data Warehousing and OLAP Outline u Models & operations u Implementing a warehouse u Future directions.
1 Copyright © 2008, Oracle. All rights reserved. Repository Basics.
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Analysis Decision Support Systems Data Analysis and OLAP Data Warehousing.
Dense-Region Based Compact Data Cube
BlinkDB.
CPS216: Data-intensive Computing Systems
Parallel Databases.
BlinkDB.
Hash-Based Indexes Chapter 11
Database Management Systems (CS 564)
Based on notes by Jim Gray
DATA CUBE Advanced Databases 584.
File Processing : Query Processing
Hash-Based Indexes Chapter 10
CUBE MATERIALIZATION E0 261 Jayant Haritsa
Data Warehousing Concepts
Chapter 11 Instructor: Xin Zhang
Slides based on those originally by : Parminder Jeet Kaur
Presentation transcript:

 N. Roussopoulos 2007 OLAP & Data Cubing Spring 2007 Nick Roussopoulos

 N. Roussopoulos 2007 Database Management Systems-2- OLAP-The Data Analysis Cycle User extracts data from database with query Then visualizes, analyzes data with desktop tools Spread Sheet Table Size vs Speed Access Time (seconds) Cache Main Secondary Disc Nearline Tape Offline Tape Online Tape Price vs Speed Access Time (seconds) Cache Main Secondary Disc Nearline Tape Offline Tape Online Tape Size(B) $/MB

 N. Roussopoulos 2007 Database Management Systems-3- The Data Cube [Gray, Bosworth, Layman, Pirahesh ICDE 96] summarize multidimensional data for trend analysis weather(time,latitude,longitude,altitude,temp,b-pressure) groupby with statistical functions (avg,min,max,count,sum) aggregates over table sub-groups results in a new table select location, sum(units) from inventory group by location having nation = “USA”; Table SUM() A B C D attribute A A A B B B B B C C C C C D D select avg(temp) from weather select time,altitude from weather groupby time,altitude

 N. Roussopoulos 2007 Database Management Systems-4- An Example SALES ModelYearColorSales Chevy1990red5 Chevy1990white87 Chevy1990blue62 Chevy1991red54 Chevy1991white95 Chevy1991blue49 Chevy1992red31 Chevy1992white54 Chevy1992blue71 Ford1990red64 Ford1990white62 Ford1990blue63 Ford1991red52 Ford1991white9 Ford1991blue55 Ford1992red27 Ford1992white62 Ford1992blue39 CUB E

 N. Roussopoulos 2007 Database Management Systems-5- Division of labor Computation vs Visualization Relational system builds CUBE relation u aggregation best done close to data u filtering of data is possible u Cube computation may be recursive è (e.g., percent of total, quartile,....) Visualization System displays/explores the cube

 N. Roussopoulos 2007 Database Management Systems-6- Problems with SQL Groubys Histograms (aggregation over computed categories) F() G() H() GROUP BY CUBE

 N. Roussopoulos 2007 Database Management Systems-7- Problems with SQL Groubys drill-down and roll-up Not relational (null values in the keys)

 N. Roussopoulos 2007 Database Management Systems-8- More problems with Groubys roll-up is asymmetric (e.g. does not aggregate by year or by color alone cross-tabulation (spreadsheets) even if SQL syntax can be devised, a 6D cross-tab requires 64 groupby queries to generate it and 64 scans and sorts of the data u most of these are not relational expressions but are in many report writers

 N. Roussopoulos 2007 Database Management Systems-9- CUBE: A Relational Aggregate Operator Generalizing Group By By Make & Color CHEVY FORD RED WHITE BLUE By Color By Make & Year By Color & Year By Make By Year Sum The Data Cube and The Sub-Space Aggregates RED WHITE BLUE ChevyFord By Make By Color Sum Cross Tab Sum Aggregate RED WHITE BLUE By Color Sum Group By (with total)

 N. Roussopoulos 2007 Database Management Systems-10- Idea: N-dimensional Cube Each Attribute is a Dimension N-dimensionalAggregate (sum(), max(),...) u fits relational model exactly: è a 1, a 2,...., a N, f(*) Super-aggregate over N-1 Dimensional sub-cubes è ALL, a 2,...., a N, f(*) è a 3, ALL, a 3,...., a N, f(*) è... è a 1, a 2,...., ALL, f(*) u this is the N-1 Dimensional cross-tab. Super-aggregate over N-2 Dimensional sub-cubes è ALL, ALL, a 3,...., a N, f(*) è... è a 1, a 2,...., ALL, ALL, f(*)

 N. Roussopoulos 2007 Database Management Systems-11- Summary of the Cube CUBE operator generalizes relational aggregates Needs ALL value to denote sub-cubes u ALL values represent aggregation sets Needs generalization of user-defined aggregates Decorations and abstractions are interesting Computation has interesting optimizations Relationship to “rest of SQL” not fully worked out.

 N. Roussopoulos 2007 Computing the (full) Cube Discussion from: “Computation of Multi-dimensional Aggregates”; SIGMOD 1996 Options:  One SQL query for each group by on the original data  Use one group by to compute another e.g., Group-by on (A, B) can be computed from group-bys on (A, B, C) or (A, B, D) etc…  Overlap Method (main contribution of the above paper) Compute multiple group-by’s simultaneously More specifically:  For each group-by, can use “sorting” or “hashing”  Say (A, B): Sorting: Sort the relation first by A, then by B: make a scan and compute the aggregates one by one Hashing: Maintain the aggregates (|A| * |B|) in memory – scan the relation once and appropriately update What if we want to compute (A, B) from (A, B, C)?  If the group-by on (A, B, C) is already sorted, we can do it very efficiently  Records (a1, b1, c1, aggr), (a1, b1, c2, aggr) … sequential, so need just one tuple of memory What if we want to compute (A, B) from (A, C, B)?  This is trickier since all tuples needed to compute (a1, b1, aggr) are not contiguous  Need memory equal to |B| tuples Database Management Systems-12-

 N. Roussopoulos 2007 Computing the (full) Cube An example of memory requirements and actual computation Database Management Systems-13-

 N. Roussopoulos 2007 Computing the (full) Cube An example of memory requirements and actual computation Database Management Systems-14-

 N. Roussopoulos 2007 Database Management Systems-15- Cube={Materialized Views} [Harinarayan, Rajaraman, Ullman 96] each groupby creates a “summary table” which is a materialized view with some dressing storing these summary tables speed up cube queries what to store and what not TPC-D example for sale analysis

 N. Roussopoulos 2007 Database Management Systems-16- The Lattice Organization the query sales groupby part will be answered at  p - cost of scanning 0.2M records  pc - -”- 6.0M -”-  psc - -”- 6.0M -”- select the views that minimize overall query performance  need a good query model  need a good optimization criterion

 N. Roussopoulos 2007 Database Management Systems-17- Views grow exponentially in general 2**N subspaces ABCD ABCBCDACDABD ABADACBCBDCD ABCD none

 N. Roussopoulos 2007 Database Management Systems-18- Greedy Allocation Algorithm optimization criterion:  storage S (total capacity)  query model (query frequencies to all views)  find the best views to materialize linear cost model:  cost of answering Q from a materialized view A generated by QA (Ansestor) is the size of the table A  cost of accessing part of a view is equal to cost of accessing all the view for each view v in a subset of views S  C(v) is the storage cost  B(v,S) is the benefit of v wrt S  for each w <= v (w is covered by v): u is the min cost in S s.t. w <= u if C(v)<C(u) then B w =C(u)-C(v) else B w =0  computes the benefit of v by considering how much it helps other views that it covers- if the cost of answering thru v is better than v’s competitors, then it adds this to the total benefit of v

 N. Roussopoulos 2007 Greedy Allocation Algorithm Can be shown to have an approximation ratio of (1 – 1/e) = 0.63…  Seems straightforward application of submodularity of the objective but did not check carefully Issues:  Dealing with hierarchies for roll-ups and drill-downs Can be incorporated w/o much trouble  Cost of using a view is assumed to be linear Usually not true We may want to build indexes on the views Addressed in a later paper by Gupta, Harinarayan, Rajaraman, and Ullman, 1997  Did not look at “refresh time” Even with batch updates, we still have limited time to do the updates Database Management Systems-19-

 N. Roussopoulos 2007 Database Management Systems-20- DynaMat Yannis Kotidis, Nick Roussopoulos (Sigmod 1999) Conventional Data Warehouse  pre-computed set view is static (too hard to select and adjust)  usually selected by an administrator DynaMat proposed a framework for automatic management of views  Unifies view selection & view refresh  Amortizes generation and maintenance cost over multiple uses of cached results Techniques  DynMat caches the results of every query  Each incoming query is evaluated against the cached results to see if any of those can be used  The captured set is updated within an update cycle to the extent possible

 N. Roussopoulos 2007 Database Management Systems-21- DynaMat Architecture Online Operation Try to match each query from the view pool (Fragment Locator)  Fragments are either single value predicates or complete ranges  A Directory Index is maintained for efficient searches On the fly decide whether to cache the result in the pool (Admission Control Entity)

 N. Roussopoulos 2007 Database Management Systems-22- Materialized Range Fragments Materialized Results are is restricted to one of a) a full Range R_i = {min_d, max_d} b) a single value for d_i c) an empty range denotes a dimension that is not present in the query SQL queries are mapped to MR queries that are answered by cached MRFs MRFs are Coarser than query results (expanded when necessary) No combination of MRFs are used to answer a query (more costly especially when MRFs are too small and/or overlap) An R-tree based index is used to identify possible MRFs that can answer the query- among those, the best fit is chosen The use of MRFs makes matching efficient.

 N. Roussopoulos 2007 Database Management Systems-23- Storage Structures & Construction of Cubes Subcubes vs. Full cubes  Subcube selection Cost of construction and indexing Maintenance

 N. Roussopoulos 2007 Database Management Systems-24- Cubetrees [Roussopoulos, Kotidis, Roussopoulos 97] better storage organization needed  materialized views and indexes are not different  single storage organization for both bulk load techniques are very important  rates should be in the order of GB/hour (industrial strength) incremental bulk updates is the MOST important issue we had lots of experience with spatial access methods: mainly with all possible variations of R-trees (handy) packed R-trees

 N. Roussopoulos 2007 Database Management Systems-25- Extended Data Cube Model Table R(A,B,C,Q) Relation tuple groupby(A,C) groupby(B) groupby(A,B) groupby(B,C) groupby(A) groupby(C) groupby(none) u relation tuples: points in the N-d space u groupby projections: also points u point data is very efficient for multidimensional indexing C T(a,b,c,q) (a,0,0,q) (a,b,0,q) (a,0,c,q) (0,b,0,q) (a,b,0,q) T (a,b,c,q) (a,0,0,q) (a,0,c,q) A B (0,b,0,q) (0,b,c,q) (0,0,c,q) (0,b,c,q) 0 (0,0,c,q)

 N. Roussopoulos 2007 Database Management Systems-26- Dataless Cubetree separate the fact table (relation) points keep only the aggregate projection points in the cubetree to reduce the size Table R(A,B,C,Q) groupby(A,C) groupby(B) groupby(A,B) groupby(B,C) groupby(A) groupby(C) groupby(none) (a,0,0,q) (a,b,0,q) (a,0,c, q) (0,b,0,q) (a,b,0,q) (a,0,0,q) (a,0,c,q) (0,0,c,q) A B (0,b,0,q) (0,b,c,q) (0,0,c,q) (0,b,c,q) C 0

 N. Roussopoulos 2007 Database Management Systems-27- BIGGEST CHALLENGE: In-place Update problem T(a,b,c,q) B C A T’(a,b,c’,q’) (a,0,0,q) (a,b,0,q) (0,0,0,q) (0,b,0,q) (0,b,c,q) (a,0,c,q) (0,0,c,q) (a,0,0,q+q’) (a,b,0,q+q’) (0,0,0,q+q’) (0,b,0,q+q’) (0,b,c,q) (a,0,c,q) (0,0,c,q) (a,0,c’,q’)(0,0,c’,q’) (0,b,c’,q’) each record in the fact table may update exponential number of other points (in 3-d 2^3 = 8 points)  record-at-a-time updates are  too expensive in terms of I/O  destroys clustering of data points  Kills indexes  main reason for SIR < 2%

 N. Roussopoulos 2007 Database Management Systems-28- Future of OLAP & Data Cubing The big rush to data warehousing has passed and left bitter taste  Too costly  Did not achieve the promised database integration New applications with multi-dimensional data are needed  Cost with today’s technology is much less  Data integration is not easier and requires hard brain work Promising data areas  Scientific  Security  Web