CS 345: Topics in Data Warehousing Thursday, September 30, 2004.

Slides:



Advertisements
Similar presentations
Dimensional Modeling.
Advertisements

CHAPTER OBJECTIVE: NORMALIZATION THE SNOWFLAKE SCHEMA.
BY LECTURER/ AISHA DAWOOD DW Lab # 2. LAB EXERCISE #1 Oracle Data Warehousing Goal: Develop an application to implement defining subject area, design.
C6 Databases.
Outline What is a data warehouse? A multi-dimensional data model Data warehouse architecture Data warehouse implementation Further development of data.
Copyright © Starsoft Inc, Data Warehouse Architecture By Slavko Stemberger.
Technical BI Project Lifecycle
Data Warehousing CPS216 Notes 13 Shivnath Babu. 2 Warehousing l Growing industry: $8 billion way back in 1998 l Range from desktop to huge: u Walmart:
Data Warehousing M R BRAHMAM.
Dimensional Modeling CS 543 – Data Warehousing. CS Data Warehousing (Sp ) - Asim LUMS2 From Requirements to Data Models.
Data Warehouse IMS5024 – presented by Eder Tsang.
OLAP. Overview Traditional database systems are tuned to many, small, simple queries. Some new applications use fewer, more time-consuming, analytic queries.
Advanced Querying OLAP Part 2. Context OLAP systems for supporting decision making. Components: –Dimensions with hierarchies, –Measures, –Aggregation.
COMP 578 Data Warehousing And OLAP Technology Keith C.C. Chan Department of Computing The Hong Kong Polytechnic University.
Data Warehousing. On-Line Analytical Processing (OLAP) Tools The use of a set of graphical tools that provides users with multidimensional views of their.
CSE6011 Warehouse Models & Operators  Data Models  relations  stars & snowflakes  cubes  Operators  slice & dice  roll-up, drill down  pivoting.
Chapter 13 The Data Warehouse
Data Warehousing: Defined and Its Applications Pete Johnson April 2002.
Components of the Data Warehouse Michael A. Fudge, Jr.
CS 345: Topics in Data Warehousing Tuesday, September 28, 2004.
ETL Design and Development Michael A. Fudge, Jr.
Data Conversion to a Data warehouse Presented By Sanjay Gunasekaran.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Decision Support Chapter 23.
Week 6 Lecture The Data Warehouse Samuel Conn, Asst. Professor
OnLine Analytical Processing (OLAP)
CS 157B: Database Management Systems II March 20 Class Meeting Department of Computer Science San Jose State University Spring 2013 Instructor: Ron Mak.
Data warehousing and online analytical processing- Ref Chap 4) By Asst Prof. Muhammad Amir Alam.
DIMENSIONAL MODELLING. Overview Clearly understand how the requirements definition determines data design Introduce dimensional modeling and contrast.
1 Data Warehouses BUAD/American University Data Warehouses.
OLAP & DSS SUPPORT IN DATA WAREHOUSE By - Pooja Sinha Kaushalya Bakde.
Data Warehousing.
BI Terminologies.
Decision Support and Date Warehouse Jingyi Lu. Outline Decision Support System OLAP vs. OLTP What is Date Warehouse? Dimensional Modeling Extract, Transform,
6.1 © 2010 by Prentice Hall 6 Chapter Foundations of Business Intelligence: Databases and Information Management.
Winter 2006Winter 2002 Keller, Ullman, CushingJudy Cushing 19–1 Warehousing The most common form of information integration: copy sources into a single.
Ayyat IT Group Murad Faridi Roll NO#2492 Muhammad Waqas Roll NO#2803 Salman Raza Roll NO#2473 Junaid Pervaiz Roll NO#2468 Instructor :- “ Madam Sana Saeed”
UNIT-II Principles of dimensional modeling
1 On-Line Analytic Processing Warehousing Data Cubes.
On-Line Application Processing Warehousing Data Cubes (Data Mining) (slides borrowed from Stanford)
ADVANCED TOPICS IN RELATIONAL DATABASES Spring 2011 Instructor: Hassan Khosravi.
OLAP On Line Analytic Processing. OLTP On Line Transaction Processing –support for ‘real-time’ processing of orders, bookings, sales –typically access.
Data Warehousing.
Copyright© 2014, Sira Yongchareon Department of Computing, Faculty of Creative Industries and Business Lecturer : Dr. Sira Yongchareon ISCG 6425 Data Warehousing.
Database Management Systems, 2 nd Edition. R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support.
Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke1 Data Warehousing and Decision Support Chapter 25.
1 Database Systems, 8 th Edition Star Schema Data modeling technique –Maps multidimensional decision support data into relational database Creates.
Introduction to OLAP and Data Warehouse Assoc. Professor Bela Stantic September 2014 Database Systems.
An Overview of Data Warehousing and OLAP Technology
Data Warehouses and OLAP 1.  Review Questions ◦ Question 1: OLAP ◦ Question 2: Data Warehouses ◦ Question 3: Various Terms and Definitions ◦ Question.
Or How I Learned to Love the Cube…. Alexander P. Nykolaiszyn BLOG:
CSE6011 Implementing a Warehouse  Monitoring: Sending data from sources  Integrating: Loading, cleansing,...  Processing: Query processing, indexing,...
Data Mining and Data Warehousing: Concepts and Techniques What is a Data Warehouse? Data Warehouse vs. other systems, OLTP vs. OLAP Conceptual Modeling.
Data warehouse.
Data Warehousing CIS 4301 Lecture Notes 4/20/2006.
Data Warehouse.
On-Line Analytic Processing
Data warehouse and OLAP
Chapter 13 The Data Warehouse
Data storage is growing Future Prediction through historical data
Summarized from various resources Modern Database Management
Data Warehouse.
Applying Data Warehouse Techniques
Data Warehouse and OLAP
Introduction of Week 9 Return assignment 5-2
Retail Sales is used to illustrate a first dimensional model
Data Warehousing Concepts
Applying Data Warehouse Techniques
Applying Data Warehouse Techniques
Data Warehouse and OLAP
Presentation transcript:

CS 345: Topics in Data Warehousing Thursday, September 30, 2004

Review of Tuesday’s Class Roles of Information Technology –1. Automate clerical work –2. Decision support OLAP vs. OLTP –Different query characteristics –Different performance requirements –Different data modeling requirements –OLAP combines data from many sources High-level course outline –Logical Database Design –Query Processing –Physical Database Design –Data Mining

Outline of Today’s Class Data integration Basic OLAP queries –Data cubes –Slice and dice, drill down, roll up –MOLAP vs. ROLAP –SQL OLAP Extensions: ROLLUP, CUBE Star Schemas –Facts and Dimensions

Loading the Data Warehouse Source Systems Data Staging AreaData Warehouse (OLTP) Data is periodically extracted Data is cleansed and transformed Users query the data warehouse

Terminology: ETL ETL = Extraction, Transformation, & Load Extraction: Get the data out of the source systems Transformation: Convert the data into a useful format for analysis Load: Get the data into the data warehouse (…and build indexes, materialized views, etc.) We will return to this topic in a couple weeks.

Data Integration is Hard Data warehouses combine data from multiple sources Data must be translated into a consistent format Data integration represents ~80% of effort for a typical data warehouse project! Some reasons why it’s hard: –Metadata is often poor or non-existent –Data quality is often bad Missing or default values Multiple spellings of the same thing (Cal vs. UC Berkeley vs. University of California) –Inconsistent semantics What is an airline passenger?

Federated Databases An alternative to data warehouses Data warehouse –Create a copy of all the data –Execute queries against the copy Federated database –Pull data from source systems as needed to answer queries “lazy” vs. “eager” data integration Data WarehouseFederated Database Query Answer Query Extraction Rewritten Queries Answer Source Systems Source Systems Warehouse Mediator

Warehouses vs. Federation Advantages of federated databases: –No redundant copying of data –Queries see “real-time” view of evolving data –More flexible security policy Disadvantages of federated databases: –Analysis queries place extra load on transactional systems –Query optimization is hard to do well –Historical data may not be available –Complex “wrappers” needed to mediate between analysis server and source systems Data warehouses are much more common in practice –Better performance –Lower complexity –Slightly out-of-date data is acceptable

Two Approaches to Data Warehousing Data mart: like a data warehouse, but smaller and more focused Top-down approach –First build single unified data warehouse with all enterprise data –Then create data marts containing specialized subsets of the data from the warehouse Bottom-up approach –First build a data mart to solve the most pressing problem –Then build another data mart, then another –Data warehouse = union of all data marts In practice, not much difference between the two Our book advocates the bottom-up approach

Data Cube Axes of the cube represent attributes of the data records –Generally discrete-valued / categorical –e.g. color, month, state –Called dimensions Cells hold aggregated measurements –e.g. total $ sales, number of autos sold –Called facts Real data cubes have >> 3 dimensions JulAugSep CA OR WA Red Blue Gray Auto Sales

Slicing and Dicing JulAugSep CA OR WA Red Blue Gray Red Blue Gray JulAugSep CA OR WA Blue JulAugSep CA OR WA Blue JulAugSep Total

Querying the Data Cube Cross-tabulation –“Cross-tab” for short –Report data grouped by 2 dimensions –Aggregate across other dimensions –Include subtotals Operations on a cross-tab –Roll up (further aggregation) –Drill down (less aggregation) CAORWATotal Jul Aug Sep Total Number of Autos Sold

Roll Up and Drill Down CAORWATotal Jul Aug Sep Total Number of Autos Sold CAORWATotal Number of Autos Sold CAORWATotal Red Blue Gray Total Roll up by Month Number of Autos Sold Drill down by Color

“Standard” Data Cube Query Measurements –Which fact(s) should be reported? Filters –What slice(s) of the cube should be used? Grouping attributes –How finely should the cube be diced? –Each dimension is either: (a) A grouping attribute (b) Aggregated over (“Rolled up” into a single total) –n dimensions → 2 n sets of grouping attributes –Aggregation = projection to a lower-dimensional subspace

Full Data Cube with Subtotals Pre-computation of aggregates → fast answers to OLAP queries Ideally, pre-compute all 2 n types of subtotals Otherwise, perform aggregation as needed Coarser-grained totals can be computed from finer-grained totals –But not the other way around

Data Cube Lattice Total State MonthColor State, Month State, Color Month, Color State, Month, Color Drill Down Roll Up

MOLAP vs. ROLAP MOLAP = Multidimensional OLAP Store data cube as multidimensional array (Usually) pre-compute all aggregates Advantages: –Very efficient data access → fast answers Disadvantages: –Doesn’t scale to large numbers of dimensions –Requires special-purpose data store

Sparsity Imagine a data warehouse for Safeway. Suppose dimensions are: Customer, Product, Store, Day If there are 100,000 customers, 10,000 products, 1,000 stores, and 1,000 days… …data cube has 1,000,000,000,000,000 cells! Fortunately, most cells are empty. A given store doesn’t sell every product on every day. A given customer has never visited most of the stores. A given customer has never purchased most products. Multi-dimensional arrays are not an efficient way to store sparse data.

MOLAP vs. ROLAP ROLAP = Relational OLAP Store data cube in relational database Express queries in SQL Advantages: –Scales well to high dimensionality –Scales well to large data sets –Sparsity is not a problem –Uses well-known, mature technology Disadvantages: –Query performance is slower than MOLAP –Need to construct explicit indexes

Creating a Cross-tab with SQL SELECT state, month, SUM(quantity) FROM sales GROUP BY state, month WHERE color = 'Red' Grouping Attributes Measurements Filters

What about the totals? SQL aggregation query with GROUP BY does not produce subtotals, totals Our cross-tab report is incomplete. CAORWATotal Jul453330? Aug503642? Sep383140? Total???? Number of Autos Sold StateMonthSUM CAJul45 CAAug50 CASep38 ORJul33 ORAug36 ORSep31 WAJul30 WAAug42 WASep40

One solution: a big UNION ALL SELECT state, month, SUM(quantity) FROM sales GROUP BY state, month WHERE color = 'Red‘ UNION ALL SELECT state, "ALL", SUM(quantity) FROM sales GROUP BY state WHERE color = 'Red' UNION ALL SELECT "ALL", month, SUM(quantity) FROM sales GROUP BY month WHERE color = 'Red‘ UNION ALL SELECT "ALL", "ALL", SUM(quantity) FROM sales WHERE color = 'Red' Original Query State Subtotals Month Subtotals Overall Total

A better solution “UNION ALL” solution gets cumbersome with more than 2 grouping attributes n grouping attributes → 2 n parts in the union OLAP extensions added to SQL 99 are more convenient –CUBE, ROLLUP SELECT state, month, SUM(quantity) FROM sales GROUP BY CUBE(state, month) WHERE color = 'Red'

Results of the CUBE query StateMonthSUM(quantity) CAJul45 CAAug50 CASep38 CANULL133 ORJul33 ORAug36 ORSep31 ORNULL100 WAJul30 WAAug42 WASep40 WANULL112 NULLJul108 NULLAug128 NULLSep109 NULLNULL345 Notice the use of NULL for totals Subtotals at all levels

ROLLUP vs. CUBE CUBE computes entire lattice ROLLUP computes one path through lattice –Order of GROUP BY list matters –Groups by all prefixes of the GROUP BY list GROUP BY ROLLUP(A,B,C) A,B,C (A,B) subtotals (A) subtotals Total GROUP BY CUBE(A,B,C) A,B,C Subtotals for the following: (A,B), (A,C), (B,C), (A), (B), (C) Total

ROLLUP example Total State MonthColor State, Month State, Color Month, Color State, Month, Color SELECT color, month, state, SUM(quantity) FROM sales GROUP BY ROLLUP(color,month,state)

Logical Database Design Logical database design is the topic for the next two weeks. Logical design vs. physical design: –Logical design = conceptual organization for the database Create an abstraction for a real-world process –Physical design = how is the data stored Select data structures (tables, indexes, materialized views) Organize data structures on disk Three main goals for logical design: –Simplicity –Expressiveness –Performance

Goals for Logical Design Simplicity –Users should understand the design –Data model should match users’ conceptual model –Queries should be easy and intuitive to write Expressiveness –Include enough information to answer all important queries –Include all relevant data (without irrelevant data) Performance –An efficient physical design should be possible

Another perspective… From a presentation at SIGMOD*: Three design rules: –Simplicity *by Jeff Byard and Donovan Schneider, Red Brick Systems

Star Schema Sales Date Product Store Promotion Fact table Dimension tables

Dimension Tables Each one corresponds to a real-world object or concept. –Examples: Customer, Product, Date, Employee, Region, Store, Promotion, Vendor, Partner, Account, Department Properties of dimension tables: –Contain many descriptive columns Dimension tables are wide (dozens of columns) –Generally don’t have too many rows At least in comparison to the fact tables Usually < 1 million rows –Contents are relatively static Almost like a lookup table Uses of dimension tables –Filters are based on dimension attributes –Grouping columns are dimension attributes –Fact tables are referenced through dimensions

Fact Tables Each fact table contains measurements about a process of interest. Each fact row contains two things: –Numerical measure columns –Foreign keys to dimension tables –That’s all! Properties of fact tables: –Very big Often millions or billions of rows –Narrow Small number of columns –Changes often New events in the world → new rows in the fact table Typically append-only Uses of fact tables: –Measurements are aggregated fact columns.

Comparing Facts and Dimensions Narrow Big (many rows) Numeric Growing over time Wide Small (few rows) Descriptive Static Facts Dimensions Facts contain numbers, dimensions contain labels

Four steps in dimensional modeling 1.Identify the process being modeled. 2.Determine the grain at which facts will be stored. 3.Choose the dimensions. 4.Identify the numeric measures for the facts.

Grain of a Fact Table Grain of a fact table = the meaning of one fact table row Determines the maximum level of detail of the warehouse Example grain statements: (one fact row represents a…) –Line item from a cash register receipt –Boarding pass to get on a flight –Daily snapshot of inventory level for a product in a warehouse –Sensor reading per minute for a sensor –Student enrolled in a course Finer-grained fact tables: –are more expressive –have more rows Trade-off between performance and expressiveness –Rule of thumb: Err in favor of expressiveness –Pre-computed aggregates can solve performance problems

Choosing Dimensions Determine a candidate key based on the grain statement. –Example: a student enrolled in a course –(Course, Student, Term) is a candidate key Add other relevant dimensions that are functionally determined by the candidate key. –For example, Instructor and Classroom Assuming each course has a single instructor!