Physical Design Michael A. Fudge, Jr.


IST722 Data Warehousing: Physical Design. Michael A. Fudge, Jr.

Pop Quiz! For dimensional modeling, define these:
- Conformed dimension
- Degenerate dimension
- Junk dimensions
- Type 1, 2, 3 SCDs
- 3 types of facts
- 3 fact table grains

Pop Quiz! - Answers (Dimensional Modeling)
- Conformed dimension: shared among data marts (DMs)
- Degenerate dimension: dimensions stored in the fact table
- Junk dimensions: categorical dimension / catch-all
- Type 1, 2, 3 SCDs: 1. replace, 2. new row, 3. new column
- 3 types of facts: additive, semi-additive, non-additive
- 3 fact table grains: transaction / periodic snapshot / accumulating snapshot

So, where are we?
Last week (Detailed Design):
- We covered: dimensional modeling.
- We learned how to: design dimensional models for relational databases.
This week (Technical):
- We'll cover: ROLAP implementation of dimensional models.
- We'll learn how to: implement dimensional models in relational databases.

Recall: Kimball Lifecycle Describes an approach for data warehouse projects

The Goal: Detailed Design to ROLAP Implementation

Today's Agenda:
- Describe the process of implementing dimensional model designs in a relational database (ROLAP).
- Discuss approaches to implementation.
- Walk through an implementation together using a case study, so you can see this in action!

The Physical Design Process, At a Glance
Design:
- Develop standards
- Detailed dimensional model / physical model
Development Environment:
- Instantiate relational database
- Develop security, auditing and staging tables and index plan
- Design ROLAP database & test / verify
Test Environment:
- Add aggregations and improved indexes
- Finalize database designs

A word about Environments
- Dev: isolated to the developers; can use subsets of data; not for "testing".
- Test: networked so others can access it; should be identical to prod in data and function; measure performance here.
- Prod

Our In-Class Case Study: Fudgemart Employee Time Sheets
We will:
- Implement the ROLAP schema
- Load it with data to test / verify the model
Let's see the Detailed Design Workbook…

The ROLAP Star Schema: Simple Data Mart
- We'll use this throughout our lesson today.
- You can generate the SQL from the Excel Dimensional Modeling Worksheet!

Developing Standards (Design phase of the physical design process)

Naming Conventions
- Follow your organization's naming conventions.
- Develop them if you don't have any!
- Consistency is key here.
Examples:
- Customer_Dim
- DimCustomer (I use this one)
- dim_customer
- [Dim Customer]
Prefixes: Dim == Dimension, Fact == Fact Table, Stg == Staged Data

To Null or Not to Null?
- The attributes in your dimension tables should not have nulls.
- Attributes without a value (null) should be assigned one. Example: No email? Use "No Email".
- Null dates should get a special flag surrogate key.
- Foreign keys in the fact table should never be null.
- Nulls are okay for values in the fact tables.
- We do this for the business users!
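The null-replacement rule above can be sketched as a small load step. This is a minimal illustration using Python's built-in sqlite3 module rather than SQL Server; the staging table stg_customer, its columns, and the sample rows are assumptions made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- hypothetical staged source data; email may be missing
    CREATE TABLE stg_customer (name TEXT, email TEXT);
    -- dimension attributes are declared NOT NULL
    CREATE TABLE DimCustomer (
        CustomerKey INTEGER PRIMARY KEY,   -- surrogate key
        Name  TEXT NOT NULL,
        Email TEXT NOT NULL
    );
    INSERT INTO stg_customer VALUES ('Ann', 'ann@example.com'), ('Bob', NULL);
""")

# COALESCE swaps a missing email for a readable default during the load.
conn.execute("""
    INSERT INTO DimCustomer (Name, Email)
    SELECT name, COALESCE(email, 'No Email') FROM stg_customer
""")

rows = conn.execute("SELECT Name, Email FROM DimCustomer ORDER BY Name").fetchall()
print(rows)
```

A business user filtering or grouping on Email then sees "No Email" as an ordinary value instead of a blank.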

Synonyms & Views
- Synonyms and views are logical abstractions of tables and SQL SELECT statements, respectively.
- For any table directly accessible by an end user, a view or synonym should be used.
- This way you can change the underlying tables without affecting the user's external dependencies (report, web page, etc.).
- Syntax: CREATE VIEW name AS … and CREATE SYNONYM name FOR …
Quick demo:
    -- a synonym gives DimCustomer a user-facing name
    create synonym customers for DimCustomer;
    select * from customers;
    go
    -- a view wraps a SELECT; CREATE VIEW must be alone in its batch,
    -- so "go" (the SSMS / sqlcmd batch separator) surrounds it
    create view vProductSales as
        select s.*, p.size as ProductSize, p.Color as ProductColor,
               p.Description as ProductDescription
        from FactSales s
            join DimProduct p on p.ProductKey = s.ProductKey;
    go
    select * from vProductSales;
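To make the "change the underlying tables without affecting the user" point concrete, here is a small sketch in Python with the built-in sqlite3 module. SQLite has views but no synonyms, and the restructured table DimCustomerV2 with its columns is a hypothetical invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    INSERT INTO DimCustomer VALUES (1, 'Ann');
    -- users query the view, never the table
    CREATE VIEW customers AS SELECT CustomerKey, Name FROM DimCustomer;
""")

USER_QUERY = "SELECT Name FROM customers"   # what a report would run
before = conn.execute(USER_QUERY).fetchall()

# Restructure the storage, then redefine the view to expose the same shape.
conn.executescript("""
    DROP VIEW customers;
    CREATE TABLE DimCustomerV2 (
        CustomerKey INTEGER PRIMARY KEY, FirstName TEXT NOT NULL, LastName TEXT NOT NULL
    );
    INSERT INTO DimCustomerV2 VALUES (1, 'Ann', 'Lee');
    CREATE VIEW customers AS SELECT CustomerKey, FirstName AS Name FROM DimCustomerV2;
""")

after = conn.execute(USER_QUERY).fetchall()
print(before, after)
```

The report's query text never changes; only the view definition absorbs the redesign.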

Primary Keys
- Dimension tables should use surrogate keys.
- Fact tables should use composite keys made up of dimension foreign keys and degenerate dimensions.
- Most surrogate keys are number sequences.
- Date surrogate keys can be of the form YYYYMMDD.
- Surrogate keys can be used in the fact table, but they increase the table size and do not improve performance.
Demo:
- Show how another customer would be added.
- Show the PK of the fact table (by showing contents).
- List the contents of the date table.
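The YYYYMMDD date key mentioned above is easy to compute. A minimal Python sketch follows; the function names date_key and key_to_date are hypothetical helpers, not part of any workbook or library.

```python
from datetime import date

def date_key(d: date) -> int:
    """Return the YYYYMMDD integer surrogate key for a calendar date."""
    return d.year * 10000 + d.month * 100 + d.day

def key_to_date(key: int) -> date:
    """Invert date_key: split the integer back into year, month, day."""
    return date(key // 10000, key // 100 % 100, key % 100)

print(date_key(date(2015, 3, 9)))
```

A key built this way sorts in calendar order and stays readable when someone inspects the fact table directly.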

Foreign Keys
- Foreign keys are important. Don't devalue them!
- FKs enforce referential integrity between the PK in the dimension table and the FK in the fact table.
- This prevents you from inserting invalid data into the fact table.
- If you're concerned about the performance impact of constraint checking, you can drop the FKs, insert the data, then reinstate the constraints with the nocheck option. (show diagram)
    -- drop the constraint before a bulk load
    alter table FactSales drop constraint fkSalesCustomerKey;
    -- (load the data here)
    -- reinstate it; with nocheck skips validating rows already in the table
    alter table FactSales with nocheck
        add constraint fkSalesCustomerKey
        foreign key (CustomerKey) references DimCustomer(CustomerKey);
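The protection a foreign key gives the fact table can be seen in a few lines. This sketch uses Python's built-in sqlite3 module instead of SQL Server, and the sample rows are made up; note that SQLite only checks foreign keys after PRAGMA foreign_keys = ON.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE DimCustomer (CustomerKey INTEGER PRIMARY KEY, Name TEXT NOT NULL);
    CREATE TABLE FactSales (
        CustomerKey INTEGER NOT NULL REFERENCES DimCustomer(CustomerKey),
        SalesAmt REAL
    );
    INSERT INTO DimCustomer VALUES (1, 'Derek-Smith');
""")

conn.execute("INSERT INTO FactSales VALUES (1, 19.99)")       # valid key: accepted

try:
    conn.execute("INSERT INTO FactSales VALUES (99, 5.00)")   # no such customer
    rejected = False
except sqlite3.IntegrityError:
    rejected = True                                           # the FK blocked the row

fact_count = conn.execute("SELECT COUNT(*) FROM FactSales").fetchone()[0]
print(rejected, fact_count)
```

Only the row with a matching dimension key lands in the fact table, which is the behavior you give up temporarily when you drop the constraint for a bulk load.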

The Physical Model (Design phase of the physical design process)

Use Data Modeling Tools!
- Useful for documenting metadata for tables and columns.
- Produce reports based on the model and documentation.
- Most tools generate the SQL required to create your model.
- The poor man's option is to hand-write the SQL.
Examples:
- Oracle SQL Developer Data Modeler
- SAP PowerDesigner
- CA ERwin
- IBM Rational / InfoSphere
- Microsoft Visio
- Enterprise Architect
- MySQL Workbench

A Tour of the Kimball Detailed Dimensional Modeling Workbook Part documentation. Part data modeling tool (DMT). All Fun!

Is It Time to Use an SCM? Yes.
- SCM = Source Code Management: Git, Subversion, Mercurial, CVS.
- Time to get serious about an SCM, since you'll be:
  - generating / creating code
  - making lots of changes
  - collaborating with others concurrently
- SCM tools let you record and track changes to your code, easily roll back to earlier versions, and collaborate with others.
- Learn Git: http://git-scm.com/doc

Handling SCDs in the Dimension Tables
- Type 1 = No change to the table is required.
- Type 2 = Requires extra columns in your dimension table to track changes.
- Type 3 = Each time a change is made, a new column needs to be added to the dimension table.

Example: Type 2 Handling
- Type 2 is the most common SCD.
- These columns should be added to assist with tracking, but not displayed to the end user.
Add these columns:
- RowIsCurrent (yes/no): is this the current row?
- RowStartDate (datetime): start date of the valid row
- RowEndDate (datetime): end date of the valid row
- RowChangeReason (text): explains why the row changed
Demo: Fudgemart Workbook…
    alter table DimCustomer add
        RowIsCurrent bit default(1) not null,
        RowStartDate datetime default('1/1/1900') not null,
        RowEndDate datetime default('12/31/9999') not null,
        RowChangeReason nvarchar(200) null;
    select * from DimCustomer;

    -- example: 'Derek-Smith' moves
    -- always update first: expire only the row that is current
    update DimCustomer
        set RowIsCurrent = 0, RowEndDate = GetDate(),
            [RowChangeReason] = 'Change of Address'
        where [Name] = 'Derek-Smith' and RowIsCurrent = 1;
    -- then insert the new version (the defaults mark it as current)
    insert into DimCustomer ([Name], [City], [State], [ZipCode], [RowStartDate])
        values ('Derek-Smith', 'Syracuse', 'NY', '13244', getdate());
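The update-then-insert pattern for a Type 2 change can also be wrapped in one transaction so a row is never expired without its replacement. This is a sketch in Python with the built-in sqlite3 module, not the SQL Server demo itself; the function apply_type2_change, the starting city 'Albany', and the fixed dates are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimCustomer (
        CustomerKey     INTEGER PRIMARY KEY,
        Name            TEXT NOT NULL,
        City            TEXT NOT NULL,
        RowIsCurrent    INTEGER NOT NULL DEFAULT 1,
        RowStartDate    TEXT NOT NULL DEFAULT '1900-01-01',
        RowEndDate      TEXT NOT NULL DEFAULT '9999-12-31',
        RowChangeReason TEXT
    );
    INSERT INTO DimCustomer (Name, City) VALUES ('Derek-Smith', 'Albany');
""")

def apply_type2_change(conn, name, new_city, change_date, reason):
    """Expire the current row for `name`, then insert the new version."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE DimCustomer "
            "SET RowIsCurrent = 0, RowEndDate = ?, RowChangeReason = ? "
            "WHERE Name = ? AND RowIsCurrent = 1",
            (change_date, reason, name),
        )
        conn.execute(
            "INSERT INTO DimCustomer (Name, City, RowStartDate) VALUES (?, ?, ?)",
            (name, new_city, change_date),
        )

apply_type2_change(conn, "Derek-Smith", "Syracuse", "2015-03-09", "Change of Address")
history = conn.execute(
    "SELECT City, RowIsCurrent, RowStartDate, RowEndDate "
    "FROM DimCustomer ORDER BY CustomerKey"
).fetchall()
print(history)
```

The old row keeps its surrogate key, so facts recorded before the move still join to the Albany version of the customer.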

Star vs. Snowflake
- Star schema is preferred over snowflake, as it is easier for users to understand.
- If you need to snowflake, collapse your multi-valued / outrigger dimensions into a view.
- Snowflaking makes it easier to attach fact tables at different grains.
Demo: Fudgemart Workbook (DimEmployee + dates)…

Sizing Estimates
- You need to know how much disk space you'll need.
- Calculate row lengths for fact and large dimension tables.
- Estimate based on the sizes of the data types.
- Come up with the initial load size plus the scheduled ETL loads.
- Assume indexes will consume as much room as the base data.
- A good rule of thumb: total space = 3 to 4 × star schema size.
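The sizing arithmetic above can be worked through in a few lines of Python. Every number here (column widths, row counts, number of loads) is an assumed, illustrative value, and the sketch sizes a single fact table only, ignoring dimension tables and per-row storage overhead.

```python
# Assumed byte widths for one fact row (illustrative, not measured).
FACT_COLUMN_BYTES = {
    "DateKey": 4, "ProductKey": 4, "CustomerKey": 4, "SalesAmt": 8, "SalesQty": 4,
}

def estimate_bytes(row_bytes, initial_rows, rows_per_load, loads, multiplier):
    """Base data size (initial load + scheduled loads) times the rule-of-thumb multiplier."""
    total_rows = initial_rows + rows_per_load * loads
    return row_bytes * total_rows * multiplier

row_bytes = sum(FACT_COLUMN_BYTES.values())   # 24 bytes per row

# One year of nightly loads on top of a 10-million-row initial load.
low = estimate_bytes(row_bytes, 10_000_000, 50_000, 365, multiplier=3)
high = estimate_bytes(row_bytes, 10_000_000, 50_000, 365, multiplier=4)

print(f"{low / 1e9:.2f} GB to {high / 1e9:.2f} GB")
```

With these assumptions the base data is 28,250,000 rows × 24 bytes, about 0.68 GB, so the 3 to 4 × rule suggests planning for roughly 2.0 to 2.7 GB.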

Build Your Development Environment (Development Environment phase of the physical design process)

Physical Modeling Checklist
- Design the physical ROLAP structure (using your DMT or SQL).
- Initial ETL load (not automated with ETL tooling).
- Test and verify your data in the model.
- Finalize your source-to-target map:
  - Check naming conventions for tables and columns.
  - Name user-accessed views and synonyms.
  - Verify data type and length of columns.
  - Re-check your SCD types.
  - Set rules for replacing NULL with a default value.
  - Add columns for maintenance and auditing purposes.

Instantiate the ROLAP Database
- You'll need this before you can develop the ETL process.
- You don't need to focus on performance at this point because you don't know the bottlenecks.
- The development environment should be separate from the test environment.
- Use your SCM tool to manage code changes as you make them.
- And update your documentation!
Demo: Fudgemart Workbook, generate SQL.

Add An Auditing Dimension
- An audit dimension is a special table for tracking the ETL process.
- Each time the ETL process is run, a row is added to the audit dimension table.
- Each dimension and fact table gets two more columns:
  - InsertAuditKey: which process loaded this row?
  - UpdateAuditKey: which process changed this row most recently?
- We will explore this while covering ETL.
Demo: Fudgemart Workbook…
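A rough sketch of how the audit keys tie loaded rows back to an ETL run, written in Python with the built-in sqlite3 module. Only InsertAuditKey and UpdateAuditKey come from the slide; the DimAudit columns (ProcessName, RunStarted), the run_load function, and the sample data are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- one row per ETL run (column names are hypothetical)
    CREATE TABLE DimAudit (
        AuditKey    INTEGER PRIMARY KEY,
        ProcessName TEXT NOT NULL,
        RunStarted  TEXT NOT NULL
    );
    CREATE TABLE DimCustomer (
        CustomerKey    INTEGER PRIMARY KEY,
        Name           TEXT NOT NULL,
        InsertAuditKey INTEGER NOT NULL REFERENCES DimAudit(AuditKey),
        UpdateAuditKey INTEGER NOT NULL REFERENCES DimAudit(AuditKey)
    );
""")

def run_load(conn, process_name, started, names):
    """Record the run in DimAudit, then stamp every loaded row with its key."""
    with conn:
        audit_key = conn.execute(
            "INSERT INTO DimAudit (ProcessName, RunStarted) VALUES (?, ?)",
            (process_name, started),
        ).lastrowid
        conn.executemany(
            "INSERT INTO DimCustomer (Name, InsertAuditKey, UpdateAuditKey) "
            "VALUES (?, ?, ?)",
            [(n, audit_key, audit_key) for n in names],
        )
    return audit_key

first = run_load(conn, "initial_load", "2015-03-09", ["Ann", "Bob"])
second = run_load(conn, "nightly_load", "2015-03-10", ["Cy"])
loaded_by = conn.execute(
    "SELECT Name, InsertAuditKey FROM DimCustomer ORDER BY CustomerKey"
).fetchall()
print(loaded_by)
```

Joining any dimension or fact row to DimAudit then answers "which run put this here?" without parsing ETL logs.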

Initial Stage + ETL
- To verify your ROLAP model, you'll need to populate it with data.
- Initial stage and ETL are typically done with SQL queries.
- If the data volume is too large, use subsets of the source data.
- You're still exploring and validating your ROLAP star schema.
- Carry the lessons you learn while profiling the data into the automated ETL process to come.

Best Practices for Staging Data
- Always stage your data "as is" to avoid a dependency on the source systems.
- You do not want your stage data in the same database or schema as your data warehouse. This helps keep the models "tidy".
- On your server, you'll notice you have Stage and DW for this reason.
Demo: Stage and Initial Load via ETL

Security Tables
- Security tables are used to filter row data based on user access or group access.
- For example: the current user is a member of Store 102, so she only sees sales for that store.
- In SQL Server we use SYSTEM_USER to identify the user. All DBMSs have a means to do this.
Demo: Add a security table so managers can see only their employees' timesheets.
    -- security table
    create table UserStores (
        SystemUser varchar(100) not null,
        StoreNumber smallint not null,
        constraint PkUserStores primary key (SystemUser, StoreNumber)
    );
    insert into UserStores values (SYSTEM_USER, 102);
    select * from UserStores;
    go
    -- now we create a view (CREATE VIEW must be alone in its batch,
    -- hence the "go" batch separators used by SSMS / sqlcmd)
    create view vSales as
        select s.*
        from FactSales s
            join DimStore d on d.StoreKey = s.StoreKey
            join UserStores u on u.StoreNumber = d.StoreNumber
        where SYSTEM_USER = u.SystemUser;
    go
    select * from vSales;
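The row-filtering join behind the security view can be tried outside SQL Server as well. In this Python sketch with the built-in sqlite3 module, the user name is passed as a query parameter because SQLite has no SYSTEM_USER; the users 'alice' and 'bob', the function sales_for, and the sample rows are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimStore (StoreKey INTEGER PRIMARY KEY, StoreNumber INTEGER NOT NULL);
    CREATE TABLE FactSales (StoreKey INTEGER NOT NULL, SalesAmt REAL NOT NULL);
    CREATE TABLE UserStores (
        SystemUser  TEXT NOT NULL,
        StoreNumber INTEGER NOT NULL,
        PRIMARY KEY (SystemUser, StoreNumber)
    );
    INSERT INTO DimStore VALUES (1, 101), (2, 102);
    INSERT INTO FactSales VALUES (1, 10.0), (2, 20.0), (2, 30.0);
    INSERT INTO UserStores VALUES ('alice', 102);   -- alice may see store 102 only
""")

def sales_for(conn, user):
    """Return only the sales rows for stores the given user is mapped to."""
    return conn.execute("""
        SELECT d.StoreNumber, s.SalesAmt
        FROM FactSales s
        JOIN DimStore d   ON d.StoreKey = s.StoreKey
        JOIN UserStores u ON u.StoreNumber = d.StoreNumber
        WHERE u.SystemUser = ?
        ORDER BY s.SalesAmt
    """, (user,)).fetchall()

alice_rows = sales_for(conn, "alice")
bob_rows = sales_for(conn, "bob")     # bob has no mapping, so he sees nothing
print(alice_rows, bob_rows)
```

Because the filter is an inner join, a user with no row in UserStores gets an empty result rather than everything, which is the safer default.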

The Test Environment (Test Environment phase of the physical design process)

Test Environment This is the point where end-users enter into the process. Your system will be loaded with data so you will be able to monitor usage and adjust performance accordingly. Your test environment is separate from your Development environment. It should be network accessible.

Indexing Dimension & Fact Tables
- If your DBMS supports bitmapped indexes, add them to your dimension tables on attributes involved in row filters.
- Bitmapped indexes are good for low-cardinality columns (Y/N or High, Med, Low).
- They are supported in Oracle, not SQL Server.
- For fact tables, follow the index plan optimizer of your DBMS.
Demo: Execution Plans. Build a visual view of vSalesMart in the GUI view builder, then run the view with Actual Execution Plan on.

Aggregations
- Aggregate popular rollup data.
- Monitor queries to find out what's popular.
- Improves performance.
Diagram (FactSales rolled up into FactSalesSummary):
- FactSales: Date Key (PK, FK), Product Key (PK, FK), Sales Amt, Sales Qty
- DimDate: Date Key (PK), Date Name, Year-Month, Year-Qtr, ….
- DimProduct: Product Key (PK), Product Name, Product Color, Product Subcat Key, Product Subcat, ….
- FactSalesSummary: Year-Month Key (PK, FK), Product Subcat Key (PK, FK), Sales Amt, Sales Qty
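The rollup from FactSales to FactSalesSummary amounts to a GROUP BY over the dimension attributes that define the coarser grain. Here is a sketch in Python with the built-in sqlite3 module; the table names follow the diagram, but the trimmed-down columns, the underscore-free key names, and the sample rows are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE DimDate (DateKey INTEGER PRIMARY KEY, YearMonthKey INTEGER NOT NULL);
    CREATE TABLE DimProduct (ProductKey INTEGER PRIMARY KEY, ProductSubcatKey INTEGER NOT NULL);
    CREATE TABLE FactSales (
        DateKey INTEGER NOT NULL, ProductKey INTEGER NOT NULL,
        SalesAmt REAL, SalesQty INTEGER,
        PRIMARY KEY (DateKey, ProductKey)
    );
    INSERT INTO DimDate VALUES (20150301, 201503), (20150302, 201503), (20150401, 201504);
    INSERT INTO DimProduct VALUES (1, 10), (2, 10), (3, 20);
    INSERT INTO FactSales VALUES
        (20150301, 1, 5.0, 1), (20150302, 2, 7.0, 2),
        (20150302, 3, 1.0, 1), (20150401, 1, 4.0, 1);

    -- roll daily, per-product facts up to month and product subcategory
    CREATE TABLE FactSalesSummary AS
    SELECT d.YearMonthKey, p.ProductSubcatKey,
           SUM(f.SalesAmt) AS SalesAmt, SUM(f.SalesQty) AS SalesQty
    FROM FactSales f
    JOIN DimDate d    ON d.DateKey = f.DateKey
    JOIN DimProduct p ON p.ProductKey = f.ProductKey
    GROUP BY d.YearMonthKey, p.ProductSubcatKey;
""")

summary = conn.execute(
    "SELECT * FROM FactSalesSummary ORDER BY YearMonthKey, ProductSubcatKey"
).fetchall()
print(summary)
```

Queries at the month and subcategory grain can read the three summary rows instead of scanning and joining the detailed facts; in a real warehouse the summary would be refreshed as part of the ETL run.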

Summary
- Develop standards for consistency.
- Use a data modeling tool to help document the physical design.
- Use an SCM tool to track changes to your design.
- Add to your schema to support Type 2 and Type 3 dimensions.
- Include a framework for auditing the ETL process.
- Build and verify your model in development.
- Introduce users during the test phase.
