Data Warehousing & OLAP Yannis Kotidis AT&T Labs-Research.


Slide 2: What is a Data Warehouse?
"A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of data in support of management's decision-making process." (W. H. Inmon)
A data warehouse is used for On-Line Analytical Processing (OLAP): "Class of tools that enables the user to gain insight into data through interactive access to a wide variety of possible views of the information"
3 billion market worldwide [1999 figure, olapreport.com]
  – Retail industries: user profiling, inventory management
  – Financial services: credit card analysis, fraud detection
  – Telecommunications: call analysis, fraud detection

Slide 3: Data Warehouse Initiatives
Organized around major subjects, such as customer, product, sales
  – integrate multiple, heterogeneous data sources
  – exclude data that are not useful in the decision support process
Focusing on the modeling and analysis of data for decision makers, not on daily operations or transaction processing
  – emphasis is on complex, exploratory analysis, not day-to-day operations
Large time horizon for trend analysis (current and past data)
Non-volatile store
  – physically separate store from the operational environment

Slide 4: Data Warehouse Architecture
Extract data from operational data sources
  – clean, transform
Bulk load/refresh
  – warehouse is offline
OLAP server provides multidimensional view
  – Multidimensional OLAP (Essbase, Oracle Express)
  – Relational OLAP (Redbrick, Informix, Sybase, SQL Server)

Slide 5: Why do we need all that?
Operational databases are for On-Line Transaction Processing (OLTP)
  – automate day-to-day operations (purchasing, banking etc.)
  – transactions access (and modify!) a few records at a time
  – database design is application oriented
  – metric: transactions/sec
A data warehouse is for On-Line Analytical Processing (OLAP)
  – complex queries that access millions of records
  – need historical data for trend analysis
  – long scans would interfere with normal operations
  – synchronizing data-intensive queries among physically separated databases would be a nightmare!
  – metric: query response time

Slide 6: Examples of OLAP
Comparisons (this period vs. last period)
  – Show me the sales per region for this year and compare them to those of the previous year to identify discrepancies
Multidimensional ratios (percent to total)
  – Show me the contribution to weekly profit made by all items sold in the northeast stores between May 1 and May 7
Ranking and statistical profiles (top N/bottom N)
  – Show me sales, profit and average call volume per day for my 10 most profitable salespeople
Custom consolidation (market segments, ad hoc groups)
  – Show me an abbreviated income statement by quarter for the last four quarters for my northeast region operations

Slide 7: Multidimensional Modeling
Example: compute total sales volume per product and store
[Figure: two-dimensional grid with Product and Store axes; one cell shows the value 800]

Slide 8: Dimensions and Hierarchies
A cell in the cube may store values (measurements) relative to the combination of the labeled dimensions
Example cell: NY, DVD, August holds the sales of DVDs in NY in August
DIMENSIONS (hierarchy levels listed from coarsest to finest)
  – PRODUCT: category, product
  – LOCATION: region, country, state, city, store
  – TIME: year, quarter, month and week, day

Slide 9: Common OLAP Operations
Roll-up: move up the hierarchy
  – e.g. given total sales per city, we can roll up to get sales per state
Drill-down: move down the hierarchy
  – more fine-grained aggregation
  – lowest level can be the detail records (drill-through)
[Figure: the PRODUCT, LOCATION and TIME hierarchies of the previous slide]
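The roll-up from city to state can be sketched in a few lines of Python. This is only an illustration: the city names, the city-to-state mapping and the totals are made-up data, and `roll_up` is a hypothetical helper, not part of any OLAP product.

```python
from collections import defaultdict

# Hypothetical hierarchy fragment (city -> state) and per-city totals.
CITY_TO_STATE = {"Albany": "NY", "Buffalo": "NY", "Newark": "NJ"}
sales_per_city = {"Albany": 120, "Buffalo": 80, "Newark": 50}

def roll_up(totals, parent_of):
    """Aggregate totals one level up a hierarchy (here: city -> state)."""
    rolled = defaultdict(int)
    for member, amount in totals.items():
        rolled[parent_of[member]] += amount
    return dict(rolled)

sales_per_state = roll_up(sales_per_city, CITY_TO_STATE)
print(sales_per_state)
```

Drill-down is the inverse direction and cannot be computed from the rolled-up totals alone; it needs the finer-grained data (here `sales_per_city`) to still be available.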

Slide 10: Pivoting
Pivoting: aggregate on selected dimensions
  – usually 2 dims (cross-tabulation)

Slide 11: Slice and Dice Queries
Slice and Dice: select and project on one or more dimensions
[Figure: cube with product, customers and store axes, sliced at customer = "Smith"]

Slide 12: Roadmap
What is a data warehouse and what is it for
What are the differences between OLTP and OLAP
Multi-dimensional data modeling
Data warehouse design
  – the star schema, bitmap indexes
The Data Cube operator
  – semantics and computation
Aggregate View Selection
Dynamic View Management
Other Issues

Slide 13: Data Warehouse Design
Most data warehouses adopt a star schema to represent the multidimensional model
Each dimension is represented by a dimension table
  – LOCATION(location_key, store, street_address, city, state, country, region)
  – dimension tables are not normalized
Transactions are described through a fact table
  – each tuple consists of a pointer to each of the dimension tables (foreign key) and a list of measures (e.g. sales $$$)

Slide 14: Star Schema Example
Dimension tables
  – TIME(time_key, day, day_of_the_week, month, quarter, year)
  – LOCATION(location_key, store, street_address, city, state, country, region)
  – PRODUCT(product_key, product_name, category, brand, color, supplier_name)
Fact table
  – SALES(time_key, product_key, location_key, units_sold, amount)
  – measures: units_sold, amount

Slide 15: Advantages of Star Schema
Facts and dimensions are clearly depicted
  – dimension tables are relatively static; data is loaded (append mostly) into the fact table(s)
  – easy to comprehend (and to write queries for)
"Find total sales per product-category in our stores in Europe"
    SELECT PRODUCT.category, SUM(SALES.amount)
    FROM SALES, PRODUCT, LOCATION
    WHERE SALES.product_key = PRODUCT.product_key
      AND SALES.location_key = LOCATION.location_key
      AND LOCATION.region = 'Europe'
    GROUP BY PRODUCT.category
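To try the query on a toy star schema, the following sketch loads a few rows into an in-memory SQLite database through Python's standard `sqlite3` module. The rows and the trimmed-down column lists are assumptions made for illustration (the TIME dimension and most attributes are omitted), and an `ORDER BY` is added only to make the printed result deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PRODUCT  (product_key INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE LOCATION (location_key INTEGER PRIMARY KEY, store TEXT, region TEXT);
CREATE TABLE SALES    (product_key INTEGER, location_key INTEGER, amount REAL);
-- Made-up sample data
INSERT INTO PRODUCT  VALUES (1, 'DVD player', 'Video'), (2, 'Laptop', 'Computers');
INSERT INTO LOCATION VALUES (10, 'Athens-1', 'Europe'), (20, 'NY-1', 'America');
INSERT INTO SALES    VALUES (1, 10, 100.0), (2, 10, 900.0), (1, 20, 50.0), (1, 10, 25.0);
""")

# The star join from the slide: fact table joined to two dimension tables.
rows = conn.execute("""
    SELECT PRODUCT.category, SUM(SALES.amount)
    FROM SALES, PRODUCT, LOCATION
    WHERE SALES.product_key = PRODUCT.product_key
      AND SALES.location_key = LOCATION.location_key
      AND LOCATION.region = 'Europe'
    GROUP BY PRODUCT.category
    ORDER BY PRODUCT.category
""").fetchall()
print(rows)
```

Only the sales made in the European store contribute to the totals; the sale recorded for the American store is filtered out by the predicate on LOCATION.region.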

Slide 16: Star Schema Query Processing
[Figure: the star schema of the previous example (TIME, LOCATION and PRODUCT around the SALES fact table), annotated with the steps of the query: the predicate region = "Europe" on LOCATION, the JOIN with SALES, and category taken from PRODUCT]

Slide 17: Indexing OLAP Data: Bitmap Index
Each value in the column has a bit vector:
  – the i-th bit is set if the i-th row of the base table has the value for the indexed column
  – the length of the bit vector is the number of records in the base table
Mainly intended for small-cardinality domains
[Figure: LOCATION table next to its bitmap index on Region]
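A minimal sketch of such an index, using Python integers as bit vectors. The `region` column below is made-up data, and `build_bitmap_index` and `rows_of` are hypothetical helper names chosen for this example.

```python
def build_bitmap_index(column):
    """Map each distinct value to an int whose i-th bit is set iff row i holds that value."""
    index = {}
    for row_id, value in enumerate(column):
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

def rows_of(bitmap):
    """Decode a bit vector back into a sorted list of row ids."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Made-up Region column of a LOCATION table (row ids 0..5).
region = ["Europe", "Asia", "Europe", "America", "Africa", "Asia"]
region_index = build_bitmap_index(region)

# Predicates on the indexed column become bitwise operations on the vectors.
europe_or_asia = region_index["Europe"] | region_index["Asia"]
print(rows_of(region_index["Europe"]))
print(rows_of(europe_or_asia))
```

One bit vector is kept per distinct value, which is why the technique suits columns with few distinct values: the index size grows with (number of values) x (number of rows).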

Slide 18: Join Index
A join index relates the values of the dimensions of a star schema to rows in the fact table.
  – a join index on region maintains, for each distinct region, a list of ROW-IDs of the tuples recording the sales in the region
Join indices can span multiple dimensions, OR
  – can be implemented as bitmap indexes (per dimension)
  – use bit-ops for multiple joins
[Figure: LOCATION regions (Africa, America, Asia, Europe) pointing to SALES rows such as R102, R117, R118, R124]
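The per-dimension bitmap variant can be sketched as follows. The dimension contents and fact rows are made up, list positions stand in for ROW-IDs, and `bitmap_join_index` is a hypothetical helper; a real system would store the vectors compressed on disk.

```python
# Hypothetical dimension tables: key -> attribute value.
LOCATION_REGION = {10: "Europe", 20: "America", 30: "Europe"}
PRODUCT_CATEGORY = {1: "Video", 2: "Computers"}

# Fact rows: (product_key, location_key, amount); the list position plays the role of ROW-ID.
SALES = [(1, 10, 100), (2, 20, 300), (2, 30, 900), (1, 20, 50), (1, 30, 25)]

def bitmap_join_index(fact_rows, key_position, dimension):
    """For each dimension attribute value, build a bit vector over the fact rows."""
    index = {}
    for row_id, row in enumerate(fact_rows):
        value = dimension[row[key_position]]
        index[value] = index.get(value, 0) | (1 << row_id)
    return index

by_region = bitmap_join_index(SALES, 1, LOCATION_REGION)
by_category = bitmap_join_index(SALES, 0, PRODUCT_CATEGORY)

# Two joins resolved with a single bitwise AND over the fact-table bit vectors.
matching = by_region["Europe"] & by_category["Video"]
total = sum(row[2] for row_id, row in enumerate(SALES) if (matching >> row_id) & 1)
print(total)
```

The AND selects exactly the fact rows that satisfy both dimension predicates, so neither dimension table has to be joined at query time.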

Slide 19: Problem Solved?
"Find total sales per product-category in our stores in Europe"
  – the join index will prune ¾ of the data (assuming uniform sales), but the remaining ¼ is still large (several million transactions)
Index is unclustered
High-level aggregations are expensive!
  – long scans to get the data
  – hashing or sorting necessary for group-bys
=> Long query response times
=> Pre-computation is necessary
[Figure: LOCATION hierarchy: region, country, state, city, store]

Slide 20: Multiple Simultaneous Aggregates
Cross-tabulation (products/store)
  – sub-totals per store
  – sub-totals per product
  – total sales
4 group-bys here: (store, product), (store), (product), ()
Need to write 4 queries!

Slide 21: The Data Cube Operator (Gray et al.)
All previous aggregates in a single query:
    SELECT LOCATION.store, SALES.product_key, SUM(amount)
    FROM SALES, LOCATION
    WHERE SALES.location_key = LOCATION.location_key
    CUBE BY SALES.product_key, LOCATION.store
Challenge: optimize aggregate computation
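The semantics of the operator (one group-by per subset of the cube dimensions, with ALL standing in for an aggregated-away dimension) can be sketched as below. The input rows are made up and assumed to be already joined; the `cube` function is a deliberately naive illustration that rescans the input once per group-by, which is exactly the cost the optimization challenge is about.

```python
from collections import defaultdict
from itertools import combinations

def cube(rows, dims, measure):
    """Return {key: sum}, where key holds one entry per dimension ('ALL' if aggregated away)."""
    result = defaultdict(int)
    for r in range(len(dims) + 1):
        for kept in combinations(dims, r):      # one group-by per subset of dims
            for row in rows:                    # naive: full scan per group-by
                key = tuple(row[d] if d in kept else "ALL" for d in dims)
                result[key] += row[measure]
    return dict(result)

# Made-up, already-joined SALES/LOCATION rows.
sales = [
    {"store": 1, "product_key": 1, "amount": 10},
    {"store": 1, "product_key": 2, "amount": 20},
    {"store": 2, "product_key": 1, "amount": 5},
]
result = cube(sales, ("store", "product_key"), "amount")
for key in sorted(result, key=str):
    print(key, result[key])
```

With two dimensions the output contains the four group-bys of the previous slide: the cross-tabulation cells, the per-store and per-product sub-totals, and the grand total under the key ('ALL', 'ALL').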

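Slide 22: Relational View of Data Cube
    SELECT LOCATION.store, SALES.product_key, SUM(amount)
    FROM SALES, LOCATION
    WHERE SALES.location_key = LOCATION.location_key
    CUBE BY SALES.product_key, LOCATION.store
Result table with columns Store, Product_key, sum(amount); the per-store, per-product rows did not survive in this transcript. Recoverable rows:
  – per-store sub-totals (Product_key = ALL): 1379, 1268, 536, 1937
  – per-product sub-totals (Store = ALL): product 1: 1870, product 2: 800, product 3: 780, product 4: 1670
  – grand total (Store = ALL, Product_key = ALL): 5120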

Slide 23: Data Cube: Multidimensional View
[Figure: three-dimensional cube with axes Quarter (1Qtr, 2Qtr, 3Qtr, 4Qtr, sum), Product (DVD, VCR, PC, sum) and Region (America, Europe, Asia, sum); the highlighted cell holds the total annual sales of DVDs in America]

Slide 24: Other Extensions to SQL
Complex aggregation at multiple granularities (Ross et al. 1998)
  – compute multiple dependent aggregates
    SELECT LOCATION.store, SALES.product_key, SUM(amount)
    FROM SALES, LOCATION
    WHERE SALES.location_key = LOCATION.location_key
    CUBE BY SALES.product_key, LOCATION.store: R
    SUCH THAT R.amount = max(amount)
Other proposals: the MD-join operator [Chatziantoniou et al. 1999]

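Slide 25: Data Cube Computation
Model dependencies among the aggregates:
  – (product, store, quarter) is the most detailed "view"
  – a coarser view such as (product, store) can be computed from view (product, store, quarter) by summing up all quarterly sales
Lattice of views, from most to least detailed:
  (product, store, quarter)
  (product, store)   (product, quarter)   (store, quarter)
  (product)   (store)   (quarter)
  none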
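The dependency between views can be sketched as a function that derives a coarser view from a finer one. The view contents are made-up numbers, and `aggregate_from` is a hypothetical helper that assumes a distributive measure such as SUM.

```python
from collections import defaultdict

def aggregate_from(parent_view, parent_dims, child_dims):
    """Derive a coarser view from a finer one by summing over the dropped dimensions."""
    positions = [parent_dims.index(d) for d in child_dims]
    child = defaultdict(int)
    for key, total in parent_view.items():
        child[tuple(key[p] for p in positions)] += total
    return dict(child)

# Made-up contents of the most detailed view (product, store, quarter).
psq = {
    ("DVD", "NY", "Q1"): 10,
    ("DVD", "NY", "Q2"): 15,
    ("PC", "NY", "Q1"): 40,
    ("DVD", "LA", "Q1"): 5,
}
ps = aggregate_from(psq, ("product", "store", "quarter"), ("product", "store"))
p = aggregate_from(ps, ("product", "store"), ("product",))
print(ps)
print(p)
```

Note that (product) is computed here from the already aggregated (product, store) view rather than from the detailed data, which is the property the lattice captures.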

Slide 26: Computation Directives
Hash/sort based methods (Agrawal et al., VLDB'96)
  1. Smallest-parent
  2. Cache-results
  3. Amortize-scans
  4. Share-sorts
  5. Share-partitions
Lattice of views, from most to least detailed:
  (product, store, quarter)
  (product, store)   (product, quarter)   (store, quarter)
  (product)   (store)   (quarter)
  none
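As an illustration of the first directive only, the sketch below picks, for a given view, the parent in the lattice with the fewest rows. The view sizes are invented, views are modeled as frozensets of dimension names, and `smallest_parent` is a hypothetical helper rather than the algorithm from the cited paper.

```python
def smallest_parent(view, sizes):
    """Among the views with exactly one extra dimension, return the one with the fewest rows."""
    parents = [v for v in sizes if len(v) == len(view) + 1 and view < v]
    return min(parents, key=lambda v: sizes[v]) if parents else None

fs = frozenset
# Invented row counts for the eight views of the (product, store, quarter) lattice.
sizes = {
    fs({"product", "store", "quarter"}): 1000,
    fs({"product", "store"}): 400,
    fs({"product", "quarter"}): 120,
    fs({"store", "quarter"}): 80,
    fs({"product"}): 30,
    fs({"store"}): 20,
    fs({"quarter"}): 4,
    fs(): 1,
}
parent_of_product = smallest_parent(fs({"product"}), sizes)
parent_of_quarter = smallest_parent(fs({"quarter"}), sizes)
print(sorted(parent_of_product), sorted(parent_of_quarter))
```

With these sizes, (product) is cheaper to compute from (product, quarter) than from (product, store), because fewer rows have to be read and aggregated.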

Slide 27: Alternative Array-based Approach
Model data as a sparse multidimensional array
  – partition the array into chunks (a chunk is a small sub-cube which fits in memory)
  – fast addressing based on (chunk_id, offset)
Compute aggregates in "multi-way" fashion by visiting cube cells in the order which minimizes the number of times each cell is visited, and reduces memory access and storage cost
What is the best traversing order to do multi-way aggregation?
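The (chunk_id, offset) addressing can be sketched as follows. The row-major numbering of chunks and of cells inside a chunk is an assumption made for this example, as are the helper names `linearize` and `chunk_address`; the slide does not prescribe a particular layout.

```python
from math import ceil

def linearize(coords, shape):
    """Row-major position of coords inside an array of the given shape."""
    index = 0
    for c, n in zip(coords, shape):
        index = index * n + c
    return index

def chunk_address(cell, array_shape, chunk_shape):
    """Translate cell coordinates into (chunk_id, offset within the chunk)."""
    chunks_per_dim = [ceil(a / c) for a, c in zip(array_shape, chunk_shape)]
    chunk_coords = [x // c for x, c in zip(cell, chunk_shape)]
    inner_coords = [x % c for x, c in zip(cell, chunk_shape)]
    return linearize(chunk_coords, chunks_per_dim), linearize(inner_coords, chunk_shape)

# A 4x4 array split into 2x2 chunks: cell (3, 1) lies in chunk 2 at offset 3.
print(chunk_address((3, 1), (4, 4), (2, 2)))
```

Because the address is pure arithmetic on the coordinates, no index lookup is needed to find a cell, which is what makes array-based storage attractive when the chunks are dense enough.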

Slide 28: Reality check: too many views!
2^n views for n dimensions (no hierarchies)
Storage/update-time explosion
More pre-computation doesn't mean better performance!
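A quick way to see the explosion is to count the views. The sketch below uses the common counting rule that a dimension with L hierarchy levels can be grouped at any of its levels or aggregated away, giving a product of (L + 1) terms; with one level per dimension this reduces to the 2^n on the slide. Treating the hierarchies of the earlier slides as linear chains of 2, 5 and 5 levels is a simplifying assumption.

```python
from math import prod

def count_views(levels_per_dimension):
    """Number of group-bys when each dimension is grouped at one of its levels or aggregated away."""
    return prod(levels + 1 for levels in levels_per_dimension)

print(count_views([1, 1, 1]))    # 3 flat dimensions: 2**3
print(count_views([1] * 10))     # 10 flat dimensions: 2**10
print(count_views([2, 5, 5]))    # PRODUCT, LOCATION, TIME treated as linear hierarchies
```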

Slide 29: How to choose among the views?
Use some notion of benefit per view
Limit: disk space or maintenance time
Harinarayan et al., SIGMOD'96: pick views greedily until space is filled
Catch: quadratic in the number of views, which is exponential!
Lattice of views, from most to least detailed:
  (product, store, quarter)
  (product, store)   (product, quarter)   (store, quarter)
  (product)   (store)   (quarter)
  none
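A simplified sketch of greedy selection under a space limit is given below. It assumes the linear cost model (answering a view costs the row count of its smallest materialized ancestor), ranks candidates by benefit per unit of space, and uses invented view sizes; `greedy_select` is a hypothetical name, and the details differ from the published algorithm.

```python
def greedy_select(sizes, top, space_budget):
    """Greedily materialize views with the best benefit per unit of space.

    sizes: {frozenset(dims): row count}; top: the most detailed view (always available).
    """
    selected = {top}
    used = 0

    def query_cost(view):
        # A view is answered from its smallest materialized ancestor (or itself).
        return min(sizes[m] for m in selected if view <= m)

    def benefit(candidate):
        # Total cost reduction over every view the candidate can answer.
        return sum(max(0, query_cost(w) - sizes[candidate]) for w in sizes if w <= candidate)

    while True:
        fitting = [v for v in sizes if v not in selected and used + sizes[v] <= space_budget]
        best = max(fitting, key=lambda v: benefit(v) / sizes[v], default=None)
        if best is None or benefit(best) <= 0:
            return selected - {top}
        selected.add(best)
        used += sizes[best]

fs = frozenset
sizes = {
    fs({"product", "store", "quarter"}): 1000, fs({"product", "store"}): 400,
    fs({"product", "quarter"}): 120, fs({"store", "quarter"}): 80,
    fs({"product"}): 30, fs({"store"}): 20, fs({"quarter"}): 4, fs(): 1,
}
chosen = greedy_select(sizes, fs({"product", "store", "quarter"}), space_budget=200)
print(sorted(sorted(v) for v in chosen))
```

Each round evaluates every remaining candidate against every view it can answer, which is where the quadratic behavior in the number of views comes from.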

Slide 30: View Selection Problem
Selection is based on a workload estimate (e.g. logs) and a given constraint (disk space or update window)
NP-hard: the optimal selection cannot be computed in practice for more than 4-5 dimensions
  – greedy algorithms (e.g. [Harinarayan96]) run in time at least polynomial in the number of views, i.e. exponential in the number of dimensions!
Optimal selection cannot be approximated [Karloff99]
  – greedy view selection can behave arbitrarily badly
Lack of good models for cost-based optimization!

Slide 31: Problem Generalization
View Management Problem: materialize and maintain the right subset of views with respect to the workload and the available resources
What is the workload?
  – "Farmers" vs. "Explorers" [Inmon99]
  – pre-compiled queries (report generating tools, data mining)
  – ad-hoc analysis (unpredictable)
What are the resources?
  – disk space (getting cheaper)
  – update window (getting smaller)

Slide 32: DynaMat: A Dynamic View Management System
Continuous management based on disk space and update window restrictions
Engage views whenever possible for incoming queries
  – e.g. infer monthly sales out of pre-computed daily sales
  – support both ad-hoc and pre-compiled queries
Exploit dependencies among the views to maintain the best subset of them within the given update window

Slide 33: System Overview
Utilize a dedicated disk space (View Pool) for results of past queries
Engage stored results for answering new queries
  – amortize query execution cost through multiple uses of the result
[Figure: system components: User, Query Interface, Aggregate Locator, Admission Control, View Pool, DW base tables]

Slide 34: The Space & Time Bounds
Pool utilization increases between updates
Space bound: new results compete with stored aggregates for the limited space
Time bound: results are evicted from the pool due to the update-time window
[Figure: pool utilization, annotated with the time bound and the space bound]

Slide 35: Dynamic View Management
Space and time restrictions will lead us to evict materialized aggregates
Not a traditional caching problem
  – aggregates don't have the same size, cost, cost/size
  – aggregates are not independent
  – costs are dynamic
goodness(f) = accesses(f) / (t - t_last_access(f)) * cost(f) / size(f)
  – accesses(f): number of accesses
  – t - t_last_access(f): staleness
  – cost(f): re-computation cost
  – size(f): size in pages
[Figure: stored aggregates f1 and f2 over the product, store and customer dimensions, connected by a link]
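The goodness measure can be turned into a small eviction sketch. The pool contents and the clock value are invented, the helper names are hypothetical, and the code assumes that the current time is strictly later than the last access so that the staleness is positive; it does not model the dependencies between aggregates mentioned above.

```python
def goodness(accesses, last_access, cost, size, now):
    """accesses / staleness * cost / size, following the formula on the slide."""
    staleness = now - last_access      # assumed to be > 0
    return (accesses / staleness) * (cost / size)

def pick_victim(pool, now):
    """Return the name of the stored aggregate with the lowest goodness."""
    return min(pool, key=lambda f: goodness(now=now, **pool[f]))

# Invented pool: cost is the re-computation cost, size is in pages.
pool = {
    "f1": {"accesses": 10, "last_access": 90, "cost": 500, "size": 50},
    "f2": {"accesses": 2, "last_access": 40, "cost": 200, "size": 100},
}
victim = pick_victim(pool, now=100)
print(victim, goodness(now=100, **pool["f1"]))
```

Here f2 is evicted first: it is accessed rarely, has not been used for a long time, is cheap to recompute and occupies more pages than f1.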

Slide 36: Exploiting Dependencies For Updates
For each stored aggregate f, compute the minimum update cost UC(f)
  – incrementally from deltas
  – re-computation from father
Shared maintenance cost
Total Update Cost: [formula lost in extraction]
[Figure: aggregate f over the product, store and customer dimensions, updated either incrementally from the deltas or by re-computation from updated results]

Slide 37: Roadmap
What is a data warehouse and what is it for
What are the differences between OLTP and OLAP
Multi-dimensional data modeling
Data warehouse design
  – the star schema, bitmap indexes
The Data Cube operator
  – semantics and computation
Aggregate View Selection
Dynamic View Management
Other Issues

Slide 38: Other Issues
Fact and dimension tables in the DW are views of tables stored in the sources
Lots of view maintenance problems
  – correctly reflect asynchronous changes at the sources
  – making views self-maintainable
Interactive queries (on-line aggregation)
  – e.g. show running estimates + confidence intervals
Computing iceberg queries efficiently
Approximation
  – rough estimates for high-level aggregates are often good enough
  – histogram, wavelet, sampling based techniques (e.g. AQUA)

Slide 39: The End
Thank you!