Data Warehouse and the Star Schema


Data Warehouse and the Star Schema CSCI 242 ©Copyright 2019, David C. Roberts, all rights reserved

Finally we are talking about something not invented by IBM!
The inventor is unknown.
Popularized by Ralph Kimball and his company, Red Brick Systems, maker of the Red Brick Warehouse.

History
The first product was introduced by Red Brick: a standalone system for data warehousing.
The algorithm was figured out by Oracle and Sybase.
Oracle built it into its DBMS; Sybase made a separate software product.
IBM bought Red Brick.

Agenda
Definition
Why data warehouse
Product history
Processing star queries
Data warehouse in the enterprise
Data warehouse design
Relevance of normalization
Star schema
Processing the star schema

Definition
Data warehouse: a repository of integrated information, available for queries and analysis.
Data and information are extracted from heterogeneous sources as they are generated.
The point is that it is not used for transaction processing; that is, it is read-only.
The data can come from heterogeneous sources, and it can all be queried in one database.

Why Data Warehouse
Analytical use of a database involves lots of reading and some table scanning.
Reading and scanning interfere with updates, but not with other reading and scanning.
It makes sense to keep a separate copy of the data for analytics, recopied once a day.

Data Warehouse vs. OLTP

                    OLTP                             DW
Purpose             Automate day-to-day operations   Analysis
Structure           RDBMS                            RDBMS
Data Model          Normalized                       Dimensional
Access              SQL                              SQL and business analysis programs
Data                Data that runs the business      Current and historical information
Condition of data   Changing, incomplete             Historical, complete, descriptive

Red Brick
Red Brick invented the data warehouse as a product; it sold a hardware product with a star schema database.
You loaded the Red Brick Warehouse and then queried it for analysis (OLAP rather than OLTP).
It featured new optimizations for star schemas and was very fast.

Enter Sybase
Sybase learned the optimization and developed its own product.
The Sybase product was a stand-alone software data warehouse product.
It couldn't do general-purpose database work; it was just a data warehouse.
Sybase appears to have copied the Red Brick idea, without selling hardware.

Enter Oracle
Oracle later copied the same optimization.
It added a bitmap index to its database product, along with the star schema optimization.
Now its product could serve as a data warehouse as well as a general-purpose database.

Status Today
Oracle dominates the field today.
IBM eventually bought Red Brick, so it still offers some sort of Red Brick product.
Sybase's data warehouse product is now an offering of SAP.

Processing Star Queries
So what is this algorithm that is so copied?

Star Schema
A data warehouse relies on the star schema.
The data is not normalized.
The DW is loaded from a normalized database.
There is a fact table surrounded by multiple dimension tables.
The fact table holds all the measures for the subject area; each row carries a foreign key to every dimension that describes its measures.

A Sample OLTP Schema
[Diagram: four related tables: customers, orders, order items, products.]

Transformed to a Star Schema
[Diagram: the sales fact table, keyed by Time, Product, Customer and Channel, surrounded by four dimension tables: products, times, customers and channels. The Time dimension is annotated with Hour, Day of Week, Month, Year and Season.]

Star Schema
[Diagram: a central fact table surrounded by the dimensions Time, Location, Item, Supplier and Customer.]

Processing Star Queries
Build a bitmap index on each foreign key column of the fact table.
The index is a two-dimensional array of bits: one column for each row being indexed, one row for each distinct value of the indexed column.
Bitmap indexes are typically much smaller than b-tree indexes, which can be larger than the data itself.

Bitmap Index Example
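
As a minimal sketch of the idea (not the slide's original figure), the Python below builds a bitmap index over one foreign key column. The column values and the names build_bitmap_index, rows_in and store_key are hypothetical, and each bitmap is stored as a Python integer in which bit i stands for row i.

```python
from collections import defaultdict

def build_bitmap_index(column_values):
    """Return {value: bitmap}; bit i of a bitmap is set when row i holds that value."""
    index = defaultdict(int)
    for row_number, value in enumerate(column_values):
        index[value] |= 1 << row_number
    return dict(index)

def rows_in(bitmap):
    """List the row numbers whose bits are set in a bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]

# Hypothetical store_key column of a six-row fact table.
store_key = [10, 20, 10, 30, 20, 10]
store_index = build_bitmap_index(store_key)

# Rows for store 10 OR store 30: one bitwise OR of two bitmaps.
selected_rows = rows_in(store_index[10] | store_index[30])
```

Combining conditions is then a matter of bitwise AND and OR over whole bitmaps, which is why this kind of index suits fact-table filtering.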

Query Processing
The typical query joins the fact table to the dimension tables on the fact table's foreign keys.
This is processed in two phases:
1. From the fact table, retrieve all rows that are part of the result, using bitmap indexes.
2. Join the result of step 1 to the dimension tables.

Example Query
Find sales and profits from the grocery departments of stores in the San Francisco and Los Angeles sales districts over three consecutive fiscal quarters (3Q95, 4Q95 and 1Q96).

Example Query
    SELECT store.sales_district,
           time.fiscal_period,
           SUM(sales.dollar_sales) revenue,
           SUM(dollar_sales) - SUM(dollar_cost) income
    FROM sales, store, time, product
    WHERE sales.store_key = store.store_key
      AND sales.time_key = time.time_key
      AND sales.product_key = product.product_key
      AND time.fiscal_period IN ('3Q95', '4Q95', '1Q96')
      AND product.department = 'Grocery'
      AND store.sales_district IN ('San Francisco', 'Los Angeles')
    GROUP BY store.sales_district, time.fiscal_period;

Phase 1
Finding the rows in the SALES table (using bitmap indexes), with the same filters as the example query:
    SELECT ...
    FROM sales
    WHERE store_key IN (SELECT store_key FROM store
                        WHERE sales_district IN ('San Francisco', 'Los Angeles'))
      AND time_key IN (SELECT time_key FROM time
                       WHERE fiscal_period IN ('3Q95', '4Q95', '1Q96'))
      AND product_key IN (SELECT product_key FROM product
                          WHERE department = 'Grocery');

Phase 2
Now the fact table is joined to the dimension tables.
For dimension tables of small cardinality, a full-table scan may be used.
For large cardinality, a hash join could be used.

The Star Transformation
Use bitmap indexes to retrieve all relevant rows from the fact table, based on foreign key values.
This happens very fast.
Join this result set to the dimension tables.
If there are many values, a hash join may be used.
If there are fewer values, a b-tree driven join may be used.
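
The two phases can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular DBMS implements the transformation: the tables, keys and amounts are made up, bitmaps are plain integers, and a dictionary lookup stands in for the hash join.

```python
# Hypothetical toy data: dimension tables as {key: attribute}, fact rows as tuples.
store = {1: "West", 2: "East", 3: "Southwest"}
product = {100: "Grocery", 200: "Hardware"}
# Fact rows: (store_key, product_key, dollar_sales)
sales = [(1, 100, 50.0), (2, 100, 20.0), (3, 200, 70.0), (3, 100, 30.0), (1, 200, 10.0)]

def bitmap_index(rows, column):
    """Build {key: bitmap} over one foreign key column; bit i stands for fact row i."""
    index = {}
    for i, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << i)
    return index

def keys_to_bitmap(index, keys):
    """OR together the bitmaps of every qualifying dimension key."""
    bitmap = 0
    for key in keys:
        bitmap |= index.get(key, 0)
    return bitmap

store_idx = bitmap_index(sales, 0)
product_idx = bitmap_index(sales, 1)

# Phase 1: filter each dimension, OR the bitmaps per dimension, AND across dimensions.
store_keys = [k for k, district in store.items() if district in ("West", "Southwest")]
product_keys = [k for k, dept in product.items() if dept == "Grocery"]
hits = keys_to_bitmap(store_idx, store_keys) & keys_to_bitmap(product_idx, product_keys)
fact_rows = [sales[i] for i in range(len(sales)) if (hits >> i) & 1]

# Phase 2: join the surviving fact rows back to a dimension and aggregate.
revenue = {}
for store_key, _product_key, dollars in fact_rows:
    district = store[store_key]  # dictionary lookup plays the role of a hash join
    revenue[district] = revenue.get(district, 0.0) + dollars
```

Only the fact rows that survive every filter are ever joined, so the expensive join works on a small result set rather than on the whole fact table.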

How DW Fits into the Enterprise
[Diagram: users work with Applications A, B and C, backed by the operational systems OLTP1, OLTP2 and OLTP3. Extract, Transform and Load, with integration steps, feeds the Data Warehouse, which in turn feeds several Data Marts, each with its own users.]

Data Warehouse Database Design
A conventional database design for a data warehouse would lead to joins on large amounts of data that would run slowly.
The star schema allows for fast processing of very large quantities of data in the data warehouse.
It also allows for a very compact representation of events that occur many times.

A Sample OLTP Schema
[Diagram: four related tables: customers, orders, order items, products.]

Transformed to a Star Schema
[Diagram: the sales fact table surrounded by four dimension tables: products, times, customers and channels.]

Star Schema
[Diagram: a central fact table surrounded by the dimensions Time, Location, Item, Supplier and Customer.]

Fact Table
The fact table contains the actual business process measurements or metrics for a specific event, called facts, which are usually numbers.
A fact table identifies each fact by foreign keys to other tables, called "dimension" tables.
These foreign keys are usually generated keys, in order to save fact table space.
If you are building a DW of monthly sales in dollars, your fact table will contain monthly sales, one row per month.
If you are building a DW of retail sales, the fact table might have one row for each item sold.

Fact Table Design
A fact table may contain one or more facts.
Usually you create one fact table per business event. For example, if you want to analyze sales numbers and also advertising spending, those are two separate business processes, so you will create two separate fact tables: one for sales data and one for advertising cost data.
On the other hand, if you want to track sales tax in addition to the sales number, you simply create one more fact column, called Tax, in the Sales fact table.

Dimension Table
Dimension tables have a small number of rows (compared to fact tables) but a large number of columns.
For each value at the lowest level of granularity of a fact in the fact table, a dimension table has one row that gives all the categories for that value.
The dimension table is often all key, so a generated key is used to keep the fact table's reference to the dimension table small.
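
To make the generated key concrete, here is a small Python sketch that hands out compact integer keys to distinct dimension rows. The function name assign_surrogate_keys, the starting value 1001 and the sample rows are assumptions chosen for illustration.

```python
def assign_surrogate_keys(dimension_rows, first_key=1001):
    """Map each distinct dimension row (a tuple of attributes) to a small integer key."""
    key_of = {}
    for row in dimension_rows:
        if row not in key_of:
            key_of[row] = first_key + len(key_of)
    return key_of

# Hypothetical location rows; the first one arrives twice from the source system.
locations = [
    ("IL01", "Chicago Loop", "Illinois", "USA"),
    ("NY01", "Brooklyn", "New York", "USA"),
    ("IL01", "Chicago Loop", "Illinois", "USA"),
]
location_key = assign_surrogate_keys(locations)

# A fact row stores only the small integer, not the wide descriptive tuple.
fact_row = (location_key[("NY01", "Brooklyn", "New York", "USA")], 98765)
```

The fact table then repeats one integer per row instead of the full set of descriptive columns, which is where the space saving comes from.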

Time Dimension Schema

Column Name     Type
Dim_Id          INTEGER (4)
Month           SMALL INTEGER (2)
Month_Name      VARCHAR (3)
Quarter         SMALL INTEGER (4)
Quarter_Name    VARCHAR (2)
Year

Time Dimension Data

TM_Dim_Id   TM_Month   TM_Month_Name   TM_Quarter   TM_Quarter_Name   TM_Year
1001        1          Jan             1            Q1                2003
1002        2          Feb             1            Q1                2003
1003        3          Mar             1            Q1                2003
1004        4          Apr             2            Q2                2003
1005        5          May             2            Q2                2003

Location Dimension Schema

Column Name     Type
Dim_Id          INTEGER (4)
Loc_Code        VARCHAR (4)
Name            VARCHAR (50)
State_Name      VARCHAR (20)
Country_Name

Location Dimension Data

Dim_Id   Loc_Code   Name            State_Name         Country_Name
1001     IL01       Chicago Loop    Illinois           USA
1002     IL02       Arlington Hts   Illinois           USA
1003     NY01       Brooklyn        New York           USA
1004     TO01       Toronto         Ontario            Canada
1005     MX01       Mexico City     Distrito Federal   Mexico

Product Data Schema

Column Name     Type
Dim_Id          INTEGER (4)
SKU             VARCHAR (10)
Name            VARCHAR (30)
Category

Product Data

Dim_Id   SKU        Name                Category
1001     DOVE6K     Dove Soap 6Pk       Sanitary
1002     MLK66F#    Skim Milk 1 Gal     Dairy
1003     SMKSAL55   Smoked Salmon 6oz   Meat

Categories in Dimension Tables
Categories may or may not be hierarchical; a dimension can hold both kinds.
Categories provide canned values that can be offered to users for queries.

Granularity (Grain) of the Fact Table
The level of detail of the fact table is known as the grain of the fact table.
In this example the grain of the fact table is the monthly sales number per location per product.

Note about Granularity
There may be multiple star schemas at different levels of granularity, especially for very large data warehouses.
The first could be the finest: say, each transaction, such as a sale.
The next could be an aggregation, like the previous example.
There could be more levels of aggregation.
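
Moving from one grain to a coarser one is an aggregation. The sketch below rolls hypothetical transaction-grain rows up to the monthly grain of sales per location per product; the row layout, the ids and the amounts are invented for illustration.

```python
from collections import defaultdict

# Hypothetical transaction-grain facts: (date "YYYY-MM-DD", location_id, product_id, amount)
transactions = [
    ("2003-01-05", 1001, 1003, 12.50),
    ("2003-01-19", 1001, 1003, 7.50),
    ("2003-01-20", 1002, 1003, 4.00),
    ("2003-02-02", 1001, 1003, 9.00),
]

def roll_up_to_month(rows):
    """Aggregate transaction rows to one total per (month, location, product)."""
    monthly = defaultdict(float)
    for date, location_id, product_id, amount in rows:
        month = date[:7]  # "YYYY-MM"
        monthly[(month, location_id, product_id)] += amount
    return dict(monthly)

monthly_sales = roll_up_to_month(transactions)
```

Four transaction rows collapse into three monthly rows here. The coarser table answers monthly questions faster, but it can no longer answer questions about individual sales.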

Design Approach
1. Identify the business process. Determine which business process your data warehouse represents. This process will be the source of your metrics or measurements.
2. Identify the grain. Determine what one row of the fact table means. In the previous example the grain is 'monthly sales per location per product'. It might instead be daily sales, or each sale could be one row.
3. Identify the dimensions. Your dimensions should be descriptive (SQL VARCHAR or CHARACTER) as much as possible and conform to your grain.
4. Identify the facts. Identify your measurements (or metrics or facts). The facts should be numeric and should conform to the grain defined in step 2.

Monthly Sales Fact Table Schema

Field Name      Type
TM_Dim_Id       INTEGER (4)
PR_Dim_Id
LOC_Dim_Id
Sales

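Monthly Sales Fact Table Data
Columns: TM_Dim_Id, PR_Dim_Id, LOC_Dim_Id, Sales
[Table: the cell alignment was lost in extraction. The surviving values, in order, are 1001, 1003, 435677, 1002, 451121, 98765, 1004, 65432.]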
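
As a hedged end-to-end sketch, the monthly sales star schema can be built in an in-memory SQLite database and queried by quarter and country. The dimension rows follow the sample data given for the dimension tables (the year for April is assumed to be 2003); the fact rows and their sales amounts are invented for illustration, and the lowercase table and column names are my own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim (
    dim_id INTEGER PRIMARY KEY, month INTEGER, month_name TEXT,
    quarter_name TEXT, year INTEGER);
CREATE TABLE location_dim (
    dim_id INTEGER PRIMARY KEY, loc_code TEXT, name TEXT,
    state_name TEXT, country_name TEXT);
CREATE TABLE product_dim (
    dim_id INTEGER PRIMARY KEY, sku TEXT, name TEXT, category TEXT);
CREATE TABLE monthly_sales (
    tm_dim_id INTEGER REFERENCES time_dim(dim_id),
    pr_dim_id INTEGER REFERENCES product_dim(dim_id),
    loc_dim_id INTEGER REFERENCES location_dim(dim_id),
    sales INTEGER);
""")
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)",
                 [(1001, 1, "Jan", "Q1", 2003), (1004, 4, "Apr", "Q2", 2003)])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?, ?, ?)",
                 [(1001, "IL01", "Chicago Loop", "Illinois", "USA"),
                  (1004, "TO01", "Toronto", "Ontario", "Canada")])
conn.executemany("INSERT INTO product_dim VALUES (?, ?, ?, ?)",
                 [(1001, "DOVE6K", "Dove Soap 6Pk", "Sanitary"),
                  (1003, "SMKSAL55", "Smoked Salmon 6oz", "Meat")])
# Invented fact rows: (tm_dim_id, pr_dim_id, loc_dim_id, sales)
conn.executemany("INSERT INTO monthly_sales VALUES (?, ?, ?, ?)",
                 [(1001, 1001, 1001, 100), (1001, 1003, 1001, 250),
                  (1004, 1001, 1004, 75), (1004, 1003, 1001, 40)])

# Star query: join the fact table to two dimensions and total sales per group.
rows = conn.execute("""
    SELECT t.quarter_name, l.country_name, SUM(f.sales)
    FROM monthly_sales f
    JOIN time_dim t ON f.tm_dim_id = t.dim_id
    JOIN location_dim l ON f.loc_dim_id = l.dim_id
    GROUP BY t.quarter_name, l.country_name
    ORDER BY t.quarter_name, l.country_name
""").fetchall()
```

The fact table holds only three integer keys and one measure per row. Everything descriptive (quarter names, countries, product categories) lives in the dimension tables and is reached through the joins.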

Data Mart
A data mart is a collection of subject areas organized for decision support based on the needs of a given department. Examples: finance has its data mart, marketing has its own, sales has its own, and so on. Each department generally runs its own data mart. Ownership of the data mart allows each department to bypass the control that might otherwise coordinate the data found in the different departments. Each department's data mart is specific to its own needs. Typically, the database design for a data mart is built around a star-join structure designed for that department. The data mart contains only a modicum of historical information and is granular only to the point that suits the needs of the department. The data mart may also include data from outside the organization, such as normative salary data purchased by an HR department.

About the Data Mart
The structure of the data in the data mart may or may not be compatible with the structure of data in the data warehouse.
The amount of historical data found in the data mart is different from the history of the data found in the warehouse. Data warehouses contain robust amounts of history, while data marts usually contain modest amounts of history.
The subject areas found in the data mart are only faintly related to the subject areas found in the data warehouse.
The relationships found in the data mart may not be those relationships that are found in the data warehouse.
The types of queries satisfied in the data mart are quite different from those queries found in the data warehouse.

Walmart's Data Warehouse
Half a petabyte in capacity (0.5 x 10^15 bytes).
World's largest DW.
Tracks 100 million customers buying billions of products every week.
Every sale from every store is transmitted to Bentonville every night.
Walmart has more than 18,000 retail stores, employs 2.2 million people, and serves 245 million customers every week.

Typical Questions
How much orange juice did we sell last year, last month, last week in store X?
What internal factors (position in store, advertising campaigns...) influence orange juice sales?
How much orange juice are we going to sell next week, next month, next year?