Data Warehouse Toolkit

Presentation transcript:

Data Warehouse Toolkit Kimball

Key Definitions: A data mart is a specific, subject-oriented repository of data designed to answer specific questions. Usually, multiple data marts exist to serve the needs of multiple business units (sales, marketing, operations, collections, accounting, etc.). A data warehouse is a single organizational repository of enterprise-wide data across many or all subject areas; in this view, it is an enterprise-wide collection of data marts.

Key Definitions “Business Intelligence” refers to reporting and analysis of data stored in the warehouse Data warehouse is the foundation for business intelligence. ‘‘Data warehouse/business intelligence’’ (DW/BI) refers to the complete end-to-end system.

Two Main Data Warehouse Development Methodologies. Top-down approach (Inmon’s approach): the DW is developed based on the enterprise-wide data model; the DW, as a single repository, feeds data into data marts; takes longer to implement; may fail due to a lack of patience and commitment. Bottom-up approach (Kimball’s approach): starts with one data mart (e.g., sales), and additional data marts are added later (e.g., collections, marketing); data flows from the sources into data marts, then into the data warehouse; faster to implement; implemented in stages; metadata must be kept consistent, making sure every data mart calls an apple an apple. There is also a hybrid approach.

The Kimball Lifecycle Diagram

The Kimball Lifecycle Illustrates the general flow of a DW implementation Identifies task sequencing and highlights activities that should happen concurrently May need to be customized to address the unique needs of your organization Not every detail of every Lifecycle task will be performed on every project

The Kimball Lifecycle, SDLC, and DBLC
SDLC: Planning, Analysis, Detailed System Design, Implementation, Maintenance
DBLC: DB Initial Study, DB Design, Implementation, Testing, Operation, Maintenance

Program/Project Planning: Kimball’s view of programs and projects. A project is a single iteration of the Kimball Lifecycle from launch through deployment. A program is the broader, ongoing coordination of resources, infrastructure, timelines, and communication across multiple projects; a program contains multiple projects. In the real world, programs do not necessarily start before projects, although ideally they should.

Program/Project Planning Scope definition understanding business requirements Tasks’ identification Scheduling Resource planning Workload assignment The end document represents a blueprint of the project

Program/Project Management Enforces the project plan Activities: Status monitoring Issue tracking Development of a comprehensive communication plan that addresses both the business and IT units

Business Requirements Definition Success of the project depends on a solid understanding of the business requirements!!! Understanding the key factors driving the business is crucial for successful translation of the business requirements into design considerations

What follows the business requirements definition? 3 concurrent tracks focusing on Technology Data Business intelligence applications Arrows in the diagram indicate the activity workflow along each of the parallel tracks Dependencies between the tasks are illustrated by the vertical alignment of the task boxes.

Technology Track Technical Architecture Design Overall architectural framework and vision Considerations: the business requirements current technical environment planned strategic technical directions

Technology Track Product Selection and Installation Based on the designed technical architecture Evaluation and selection of Products that will deliver needed capabilities Hardware platform Database management system Extract-transformation-load (ETL) tools Data access query tools Reporting tools must be evaluated Installation of selected products/components/tools Testing of installed products to ensure appropriate end-to-end integration within the data warehouse environment.

Data Track Design of the dimensional model The physical design of the model Extraction, transformation, and loading (ETL) of source data into the target models.

Dimensional Modeling Detailed data analysis of a single business process is performed to identify the fact table granularity, associated dimensions and attributes, and numeric facts. Dimensional models contain the same data content and relationships as models normalized into third normal form, but structured differently. Improve understandability and query performance required by DW/BI Primary constructs of a dimensional model fact tables dimension tables

Dimensional Modeling, fact tables: Fact tables contain the metrics resulting from a business process or measurement event, such as the sales ordering process or a service call event. Dimensional models should be structured around business processes and their associated data sources. This makes it possible to design identical, consistent views of data for all observers, regardless of which business unit they belong to, which goes a long way toward eliminating misunderstandings at business meetings. A fact table’s granularity should be set at the lowest, most atomic level captured by the business process. This allows for maximum flexibility and extensibility: business users will be able to ask constantly changing, free-ranging, and very precise questions.

Dimensional Modeling, dimension tables: Dimension tables contain the descriptive attributes and characteristics associated with specific, tangible measurement events, such as the customer, product, or sales representative associated with an order being placed. Dimension attributes are used for constraining, grouping, or labeling in a query. Hierarchical many-to-one relationships are denormalized into single dimension tables.

Star Schema: a fact table and multiple dimension tables. Example: assume this schema belongs to a retail chain. The fact is revenue (money). Each way in which you want to view the data is called a dimension.
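
The retail example can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table names, column names, and sample rows (dim_store, dim_product, fact_sales) are assumptions, not part of the slides. It shows one revenue fact viewed "by" two different dimensions.

```python
import sqlite3

# Hypothetical retail star schema: one fact table plus two dimension tables.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_store   (store_key   INTEGER PRIMARY KEY, city     TEXT);
CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, category TEXT);
CREATE TABLE fact_sales (
    store_key   INTEGER REFERENCES dim_store(store_key),
    product_key INTEGER REFERENCES dim_product(product_key),
    revenue     REAL
);
INSERT INTO dim_store   VALUES (1, 'Boston'), (2, 'Denver');
INSERT INTO dim_product VALUES (1, 'Grocery'), (2, 'Apparel');
INSERT INTO fact_sales  VALUES (1, 1, 100.0), (1, 2, 50.0), (2, 1, 70.0);
""")

def revenue_by(dim_table, key, attribute):
    """Sum the revenue fact grouped by one dimension attribute."""
    sql = (
        f"SELECT d.{attribute}, SUM(f.revenue) FROM fact_sales f "
        f"JOIN {dim_table} d ON d.{key} = f.{key} "
        f"GROUP BY d.{attribute} ORDER BY d.{attribute}"
    )
    return conn.execute(sql).fetchall()

# The same fact, seen along two different dimensions.
revenue_by_city = revenue_by("dim_store", "store_key", "city")
revenue_by_category = revenue_by("dim_product", "product_key", "category")
```

Every query has the same shape: join the fact table to one dimension and group by a descriptive attribute.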

Snowflake Schema The snowflake schema is a variation of the star schema used in a data warehouse. The snowflake schema is a more complex schema than the star schema because the tables which describe the dimensions are normalized.

Snowflake Schema. Disadvantages: Normalizing the dimension tables saves little space, because fact tables are typically responsible for 90% or more of the storage requirements, so the benefit is normally insignificant. Normalization of the dimension tables ("snowflaking") can also impair the performance of a data warehouse. Advantages: If a dimension is very sparse (i.e. most of the possible values for the dimension have no data) and/or a dimension has a very long list of attributes which may be used in a query, the dimension table may occupy a significant proportion of the database and snowflaking may be appropriate. In practice, many data warehouses will normalize some dimensions and not others, and hence use a combination of snowflake and classic star schema.

Physical Design Defining the physical structures setting up the database environment Setting up appropriate security preliminary performance tuning strategies, from indexing to partitioning and aggregations. If appropriate, OLAP databases are also designed during this process.

ETL Design and Development The MOST important stage 70% of the risk and effort in the DW project is attributed to this stage ETL system capabilities: Extraction Cleansing and conforming Delivery and management

ETL: Raw data is extracted from the operational source systems and transformed into meaningful information for the business. ETL processes must be architected long before any data is extracted from the source. The ETL system strives to deliver high throughput as well as high-quality output. Incoming data is checked for reasonable quality. Data quality conditions are continuously monitored. Kimball calls ETL the data warehouse “back room”.

Business Intelligence Application Track Applications that query, analyze, and present information from the dimensional model. BI applications deliver business value from the DW/BI solution, rather than just delivering the data The goal is to deliver capabilities that are accepted by the business to support and enhance their decision making. BI Application Design Identify the candidate BI applications and appropriate navigation interfaces to address the users’ needs and needed capabilities. Produce BI application specification BI Application Development Configuration of the business metadata and tool infrastructure Construction and validation of the specified analytic and operational BI applications and the navigational portal

Deployment It is crucial that adequate planning was performed to make sure that: the results of technology, data, and BI application tracks are tested and fit together properly Appropriate education and support infrastructure is in place. It is critical that deployment be well orchestrated Deployment should be deferred if all the pieces, such as training, documentation, and validated data, are not ready for production release.

Maintenance Occurs when the system is in production Includes: technical operational tasks that are necessary to keep the system performing optimally usage monitoring performance tuning index maintenance system backup Ongoing support, education, and communication with business users

Growth DW systems tend to expand (if they were successful) Is considered as a sign of success New requests need to be prioritized Starting the cycle again Building upon the foundation that has already been established Focusing on the new requirements

Questions ?

Dimensional Modeling

Dimensional Modeling Dimensional modeling Logical design technique for structuring data It is intuitive to business users Easy-to-understand Fast query performance Primary constructs of a dimensional model fact tables dimension tables

Star Schema: a fact table and multiple dimension tables. Example: assume this schema belongs to a retail chain. The fact is revenue (money). Each way in which you want to view the data is called a dimension.

Facts: Facts are numeric measurements and are additive, semi-additive, or non-additive. Additive facts are critical, because BI applications do not retrieve a single fact table row; data is summarized. Semi-additive facts cannot be summed across time periods (examples: account balances, inventory levels). Non-additive facts cannot be summed across any dimension and are stored in dimension tables.
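
A small sketch of why a semi-additive fact such as an account balance needs care. The account names and amounts are made up for illustration. Balances can be summed across accounts within one period, but across periods an average (or a period-end value) is used instead of a sum.

```python
# Hypothetical month-end balances for two accounts (a semi-additive fact).
balances = [
    ("2024-01", "acct_A", 100.0),
    ("2024-01", "acct_B", 200.0),
    ("2024-02", "acct_A", 150.0),
    ("2024-02", "acct_B", 250.0),
]

def total_by_month(rows):
    """Summing across the account dimension is valid within each month."""
    totals = {}
    for month, _account, balance in rows:
        totals[month] = totals.get(month, 0.0) + balance
    return totals

monthly_totals = total_by_month(balances)

# Across the time dimension the balances must not be added together;
# here they are averaged over the months instead.
average_total_balance = sum(monthly_totals.values()) / len(monthly_totals)
```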

Fact Tables: Fact tables store numeric additive facts. Conformed facts are facts with identical definitions; they may have the same standardized name in separate tables. Non-conformed facts, which have different interpretations, must be given different names.

Fact Tables Fact table keys Complex key that consists of foreign keys from intersecting dimension tables Every foreign key must match a unique primary key in the corresponding dimension table Foreign keys should not be null Special keys such as “unknown”, “N/A”, etc. should be used instead.
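
One way to keep fact table foreign keys non-null is to reserve a special dimension row. The sketch below assumes a convention where key 0 means "Unknown"; the table and column names are illustrative only.

```python
import sqlite3

UNKNOWN_KEY = 0  # assumed convention: surrogate key 0 is the "Unknown" row

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY,
    customer_id  TEXT,              -- natural key from the source system
    name         TEXT NOT NULL
);
CREATE TABLE fact_orders (
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    amount       REAL NOT NULL
);
INSERT INTO dim_customer VALUES (0, NULL, 'Unknown'), (1, 'C-100', 'Acme');
""")

def customer_key_for(customer_id):
    """Look up the surrogate key; fall back to the Unknown row."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_id = ?",
        (customer_id,),
    ).fetchone()
    return row[0] if row else UNKNOWN_KEY

# One known customer, one missing id, one id absent from the dimension.
for customer_id, amount in [("C-100", 25.0), (None, 10.0), ("C-999", 5.0)]:
    conn.execute(
        "INSERT INTO fact_orders VALUES (?, ?)",
        (customer_key_for(customer_id), amount),
    )

totals = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_orders f JOIN dim_customer d USING (customer_key)
    GROUP BY d.name ORDER BY d.name
""").fetchall()
```

Because every fact row joins to some dimension row, an inner join does not silently drop the orders with missing customers.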

Fact Tables Fact table granularity Data should be at the lowest, most detailed atomic grain captured by a business process Flexibility in querying/reporting Scalability

Dimension Tables Dimension tables Dimensions Consist of highly correlated groups of attributes that represent key objects in business such as products, customers, employees, facilities Store attributes for Query constraining/filtering Query result labeling Dimensions Can be easily identified when business users use “by” word Example: by year, by product, by region, etc.

Dimension Tables, dimension attributes: textual fields, or numeric values that behave like text; non-additive. Requirements: labels consist of full words; descriptive; no missing values; discretely valued (only one value for each row in the dimension table); quality assured (no misspellings, obsolete or orphaned values, or different versions of the same attribute).

Dimension Tables Dimension tables are small with regard to the number of rows Storing descriptions for each attribute is critical Easy-to-use for business users Rows are uniquely identified by a single key, usually, a sequential surrogate key

Dimension Tables Advantages of using surrogate keys Performance Efficient joins smaller indexes more rows per block Data integrity When the keys in operational systems are reused Discontinued products, Deceased customers, etc. Mapping when integrating data from different sources Keys from different sources may be different Mapping table of the surrogate key and keys from different sources

Dimension Tables, advantages of using surrogate keys (cont.): Handling unknown or N/A values: it is easy to assign a surrogate key value to rows with these values. Tracking changes in dimension attribute values: a new row is created and assigned the next available surrogate key.

Dimension Tables Disadvantages of using surrogate keys Assignment and management of surrogate keys and appropriate substitution of these keys for natural keys – extra load for ETL system Many ETL tools have built-in capabilities to support surrogate key processing Once the process is developed, it can be easily reused for other dimensions
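
The surrogate key handling described on these slides can be sketched as a small in-memory mapping table. The class name, method names, and source system labels ("erp", "webshop") are hypothetical; a real ETL tool would persist the mapping.

```python
from itertools import count

class SurrogateKeyMap:
    """Mapping table from (source system, natural key) to a surrogate key."""

    def __init__(self):
        self._next_key = count(start=1)  # sequential surrogate keys
        self._keys = {}

    def key_for(self, source_system, natural_key):
        """Return the existing surrogate key, or assign the next one."""
        lookup = (source_system, natural_key)
        if lookup not in self._keys:
            self._keys[lookup] = next(self._next_key)
        return self._keys[lookup]

    def link(self, source_system, natural_key, surrogate_key):
        """Map another source's natural key onto an existing surrogate key."""
        self._keys[(source_system, natural_key)] = surrogate_key

product_keys = SurrogateKeyMap()
widget_key = product_keys.key_for("erp", "P-001")     # first key assigned
product_keys.link("webshop", "W-77", widget_key)      # same product, other source
gadget_key = product_keys.key_for("erp", "P-002")     # next available key
```

Once written for one dimension, the same mapping logic can be reused for the others, which is the reuse point made above.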

Conformed Dimensions a.k.a. master or common reference dimensions Shared across the DW environment joining to multiple fact tables representing various business processes 2 types Identical dimensions One dimension being a subset of a more detailed dimension

Conformed Dimensions. Identical dimensions: same content, interpretation, and presentation regardless of the business process involved; same keys, attribute names, attribute definitions, and domain values regardless of the fact table they join to. Example: the product dimension referenced by orders and the one referenced by inventory are identical. One dimension being a perfect subset of a more detailed, granular dimension table: same attribute names, definitions, and domain values. Example: sales is linked to a dimension table at the individual product level; the sales forecast is linked at the brand level.

Conformed Dimensions (example)
Product Dimension: Product key (PK), Product description, SKU number, Brand description, Sub class description, Class description, Department description, Color, Size, Display type
Sales Fact Table: Date key (FK), Product key (FK), … other FKs …, Sales quantity, Sales amount
Sales Forecast Fact Table: Month key (FK), Brand key (FK), … other FKs …, Forecast quantity, Forecast amount
Brand Dimension: Brand key (PK), Brand description, Sub class description, Class description, Department description, Display type

Conformed Dimensions, benefits: Consistency: every fact table is filtered consistently and results are labeled consistently. Integration: users can create queries that drill across fact tables, querying each process's fact table individually and then joining the result sets on common dimension attributes. Reduced development time to market: once created, conformed dimensions are reused.

Dimensional Design Process Based on business requirements and data realities Step 1 – choose the business process Step 2 – declare the grain Step 3 – identify dimensions Step 4 – Identify facts

Enterprise Bus Architecture: Requirements are gathered and represented in the form of an Enterprise Data Warehouse Bus Matrix. Each row corresponds to a business process. Each column corresponds to a dimension of the business; each column is a conformed dimension. The Enterprise Data Warehouse Bus Matrix documents the overall data architecture for the DW/BI system.

Enterprise Bus Architecture Matrix

Enterprise Bus Architecture Matrix Possible Problems: Level of details for each column and row in the matrix Row-related Listing departments/imitating organizational chart instead of business processes Listing reports and analytics related to business process instead of the business process itself Ex. Shipping orders business process supports various analytics such as customer ranking, sales rep performance, product movement analyses

Enterprise Bus Architecture Matrix Possible Problems (Cont): Column-related Generalized columns/dimensions Example: “Entity” column is too general as it includes employees, suppliers, contractors, vendors, customers Too many columns related to the same dimension Worst case when each attribute is listed separately Example: Product, Product Group, LOB are all related to the Product dimension and should be listed as one.

Date/Time Dimensions Standard date dimension table at a daily grain Rationale: remove association with calendar from BI applications Use numeric surrogate keys for date dimension tables Date Dimension Date key pk Calendar Date Calendar Month Calendar Day Calendar Quarter Calendar Half year Calendar Year Fiscal Quarter Fiscal Year …
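
A daily-grain date dimension is usually generated rather than loaded from a source. The sketch below builds the columns listed above; the fiscal calendar (fiscal year starting in July and named for the calendar year in which it ends) is purely an assumption for illustration.

```python
from datetime import date, timedelta

FISCAL_START_MONTH = 7  # assumption: the fiscal year starts in July

def build_date_dimension(start, end):
    """Return one row (dict) per calendar day from start to end inclusive."""
    rows = []
    day = start
    while day <= end:
        # Months elapsed since the start of the fiscal year (0..11).
        fiscal_month_index = (day.month - FISCAL_START_MONTH) % 12
        rows.append({
            "date_key": len(rows) + 1,  # sequential numeric surrogate key
            "calendar_date": day.isoformat(),
            "calendar_month": day.strftime("%B"),
            "calendar_day": day.day,
            "calendar_quarter": (day.month - 1) // 3 + 1,
            "calendar_half_year": 1 if day.month <= 6 else 2,
            "calendar_year": day.year,
            "fiscal_quarter": fiscal_month_index // 3 + 1,
            # Assumption: the fiscal year is named for the year it ends in.
            "fiscal_year": day.year + 1 if day.month >= FISCAL_START_MONTH else day.year,
        })
        day += timedelta(days=1)
    return rows

date_dim = build_date_dimension(date(2024, 6, 30), date(2024, 7, 1))
```

BI queries can then filter on columns such as fiscal_quarter directly, without any calendar logic in the application.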

Date/Time Dimensions: Time of day should be treated as a dimension only if there are meaningful textual descriptions for periods within the day (examples: lunch hour, rush hours). Otherwise, time of day should be represented as a simple non-additive fact or a date/timestamp.

Date/Timestamp: Used in the fact table to support precise time interval calculations across fact rows. The calculations are performed by the ETL system. Example: elapsed time between the original claim date and the first payment date.

Multiple Time Zones: Express time in coordinated universal time (UTC); additionally, it may be expressed in local time. Another option: use a single time zone (for example, ET) and express all times in this zone.
Call Center Activity Fact: Local call date key (FK), UTC call date key (FK), Local call time of day (FK), UTC call time of day (FK), …
Dimensions: Local call date dimension, Local call time of day dimension, UTC call date dimension, UTC call time of day dimension

Degenerate Dimensions Occur in transaction fact tables that have a natural parent-child structure Key remains the only attribute left after other attributes got separated into dimensions Key should be the actual transaction number Stored in a fact table - do not create a corresponding dimension table

Degenerate Dimensions, example:
ORDERS TRANSACTIONS (source): order#, customer id, customer lname, customer fname, shipto street address, shipto city, shipto state, shipto zip, order total amount, discount amount, net order amount, payment amount, order date
ORDERS FACTS: customer key, shipto address key, order date key, order total amount, discount amount, net order amount, payment amount, order# (degenerate dimension)
DIM CUSTOMER: customer key, customer id, customer lname, customer fname
DIM SHIPTO ADDRESS: shipto address key, shipto street address, shipto city, shipto state, shipto zip
DIM ORDER DATE: order date key, calendar date, calendar month, …

Slowly Changing Dimensions: Dimension table attributes change infrequently. Mini-dimensions: more frequently changing attributes are separated into their own dimension table, a.k.a. a mini-dimension. There are 3 ways of handling slowly changing dimensions: overwrite the dimension attribute; add a new dimension row; add a new dimension attribute.

Slowly Changing Dimensions - Overwrite the dimension attribute New values overwrite old ones No history is kept Problems occur if data was previously aggregated based on old values Will not match ad-hoc aggregations based on new values Previous aggregations need to be updated to keep aggregated data in-sync.

Slowly Changing Dimensions - Add a new dimension row Most popular technique New row with new surrogate PK is inserted into dimension table to reflect new attribute values Both, old and new values are stored along with effective and expiration dates, and the current row indicator Example:
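
The slide's own example is not preserved in this transcript. As a stand-in, here is a minimal sketch of the add-a-new-row technique, with hypothetical column names and an assumed far-future expiration date for the current row.

```python
from datetime import date

HIGH_DATE = date(9999, 12, 31)  # assumed marker for "not yet expired"

def apply_type2_change(dimension_rows, natural_key, new_attributes, change_date):
    """Expire the current row for natural_key and append a new current row."""
    next_key = max((row["customer_key"] for row in dimension_rows), default=0) + 1
    for row in dimension_rows:
        if row["customer_id"] == natural_key and row["is_current"]:
            # Assumed convention: the old row expires on the change date.
            row["expiration_date"] = change_date
            row["is_current"] = False
    dimension_rows.append({
        "customer_key": next_key,      # new surrogate primary key
        "customer_id": natural_key,    # natural key stays the same
        **new_attributes,
        "effective_date": change_date,
        "expiration_date": HIGH_DATE,
        "is_current": True,
    })
    return next_key

customer_dim = [{
    "customer_key": 1, "customer_id": "C-100", "city": "Boston",
    "effective_date": date(2020, 1, 1), "expiration_date": HIGH_DATE,
    "is_current": True,
}]
new_key = apply_type2_change(customer_dim, "C-100", {"city": "Denver"}, date(2024, 3, 1))
```

Fact rows loaded before the change keep pointing at surrogate key 1, so history is preserved, while new facts use the new key.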

Slowly Changing Dimensions - Add a new dimension attribute Used infrequently A new column is added to the dimension table Old value is recorded in a “prior” attribute column New value is recorded in the existing column All BI applications transparently use the new attribute Queries can be written to access values stored in the “prior“ attribute column

Role-playing Dimensions Same physical dimension table plays different logical role in a dimension model Example: multiple date dimensions Order Date Dimension Order date key PK Order date Order date day of week Order date month … Order Transaction Fact Order date key FK Ship date key FK Product key FK Order amount … Ship Date Dimension Ship date key PK Ship date Ship date day of week Ship date month …

Role-playing Dimensions Other examples: Customer (ship to, bill to, sold to) Facility or port (origin, destination) Provider (referring, performing) Stored in the same physical table but presented in a separately-labeled view Implemented using views or aliases depending on the database platform
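
A sketch of the view-based implementation, using sqlite3 and made-up table and view names: one physical dim_date table is exposed through two separately labeled views, one for each role.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One physical date dimension ...
CREATE TABLE dim_date (
    date_key      INTEGER PRIMARY KEY,
    calendar_date TEXT,
    day_of_week   TEXT
);
CREATE TABLE fact_order (
    order_date_key INTEGER REFERENCES dim_date(date_key),
    ship_date_key  INTEGER REFERENCES dim_date(date_key),
    order_amount   REAL
);
-- ... presented through one view per logical role.
CREATE VIEW order_date_dim AS
    SELECT date_key AS order_date_key, calendar_date AS order_date,
           day_of_week AS order_date_day_of_week
    FROM dim_date;
CREATE VIEW ship_date_dim AS
    SELECT date_key AS ship_date_key, calendar_date AS ship_date,
           day_of_week AS ship_date_day_of_week
    FROM dim_date;

INSERT INTO dim_date VALUES (1, '2024-03-01', 'Friday'), (2, '2024-03-04', 'Monday');
INSERT INTO fact_order VALUES (1, 2, 99.0);
""")

order_row = conn.execute("""
    SELECT o.order_date, s.ship_date, f.order_amount
    FROM fact_order f
    JOIN order_date_dim o ON o.order_date_key = f.order_date_key
    JOIN ship_date_dim  s ON s.ship_date_key  = f.ship_date_key
""").fetchone()
```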

“Junk” Dimensions Miscellaneous flags and text attributes that cannot be placed into one of existing dimension tables Store them in a “junk” dimension Store as unique combinations Example: Data profiling is useful in identifying junk dimension candidates
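
Storing flags "as unique combinations" can be sketched as follows. The flag names and values are invented for illustration: each distinct combination observed in the data gets one junk dimension row and one surrogate key.

```python
def build_junk_dimension(transactions, flag_names):
    """Assign one surrogate key per distinct combination of flag values."""
    junk_keys = {}
    for txn in transactions:
        combo = tuple(txn[name] for name in flag_names)
        # setdefault only stores a new key when the combination is unseen.
        junk_keys.setdefault(combo, len(junk_keys) + 1)
    return junk_keys

orders = [
    {"payment_type": "cash", "gift_wrap": "N"},
    {"payment_type": "card", "gift_wrap": "Y"},
    {"payment_type": "cash", "gift_wrap": "N"},  # repeats the first combination
]
junk_dim = build_junk_dimension(orders, ["payment_type", "gift_wrap"])
```

The fact table then carries a single junk key instead of one column per flag.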

Snowflaking: Occurs when dimension tables are normalized. Increases complexity for users and decreases performance.
Product dimension: Product key (PK), Product descr, SKU number, Brand key (FK), Package type key (FK)
Brand dimension: Brand key (PK), Brand description, Subcategory key (FK)
Subcategory dimension: Subcategory key (PK), Subcategory description
Package type dimension: Package type key (PK), Package type descr

Outrigger Dimensions: Look like the beginning of a snowflake. Example: county demographics, which have a large number of attributes, a different grain, and a different update frequency.
Customer dimension: Customer key (PK), Fname, Lname, Address, County, County demographics, …
County demographics outrigger dimension: County demogr key, Total population, Males, Females, Under 18, …
Fact table: Customer key (FK), …

Bridge Tables: Used to implement variable-depth hierarchies. Should be used only when absolutely necessary, because they negatively affect usability and decrease performance. Example: reporting revenue for customers who have subsidiary relationships.
Customer hierarchy bridge: Parent customer key, Subsidiary customer key, # levels from parent, Bottom flag, Top flag
Fact table: Date key (FK), Customer key (FK), …
Customer dimension: Customer key (PK), …

3 Fundamental Fact Table Grains Transaction One row per transaction/line of transaction Rows are inserted into fact tables only when a transaction activity occurs

3 Fundamental Fact Table Grains, periodic snapshot: At predetermined intervals, snapshots at the same level of detail are taken and stacked consecutively in the fact table. Examples: most financial reports, bank account values. Complements detailed transaction facts but does not replace them. Shares the same conformed dimensions but has fewer dimensions.

3 Fundamental Fact Table Grains Accumulating snapshot Less frequently used Have multiple date FK that correspond to each milestone in the workflow Lots of N/A or Unknown fields when a row is originally inserted Requires a special row in date dimension table as discussed earlier

Facts of Different Granularity: A single fact table cannot have facts with different granularity; all measurements must be at the same level of detail. Example: measurements are captured for each order line, except for the shipping charge, which applies to the entire order. Solutions: allocate the higher-level facts to the lower granularity, or create two separate fact tables.
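
The first solution, allocating an order-level shipping charge down to the order lines, might look like the sketch below. The slides do not say how to allocate; splitting in proportion to each line's amount is an assumption, as are the field names.

```python
def allocate_order_charge(lines, order_charge):
    """Spread an order-level charge across lines in proportion to line amount."""
    total_amount = sum(line["amount"] for line in lines)
    allocated = []
    allocated_so_far = 0.0
    for index, line in enumerate(lines):
        if index == len(lines) - 1:
            # The last line absorbs any rounding remainder so that the
            # line-level charges add back up to the order-level charge.
            share = round(order_charge - allocated_so_far, 2)
        else:
            share = round(order_charge * line["amount"] / total_amount, 2)
            allocated_so_far += share
        allocated.append({**line, "shipping_charge": share})
    return allocated

order_lines = [
    {"line_no": 1, "amount": 30.0},
    {"line_no": 2, "amount": 60.0},
    {"line_no": 3, "amount": 10.0},
]
allocated_lines = allocate_order_charge(order_lines, 10.0)
```

After allocation, the shipping charge is an additive fact at the same grain as the other line-level measurements.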

Multiple Currencies and Units of Measures Measurements are provided in a local currency Measurements are also converted to a standardized currency or conversion rates must be stored Similarly, in case of multiple units of measures, conversions to all different units of measure are provided

Factless Fact Tables: Represent business processes that do not generate quantifiable measurements. Example: student attendance. Can be easily converted into traditional fact tables by adding an attribute, Count, which is always equal to 1; this helps to perform aggregations.
Student attendance event facts: Date key, Student key, Facility key, Faculty key, Course/section key
Dimensions: Date dimension, Student dimension, Facility dimension, Faculty dimension, Course/section dimension
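
A sketch of the attendance example with the always-1 count column, using sqlite3. Only three of the dimension keys are included, and the key values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Factless fact table: only dimension keys, plus a count that is always 1.
CREATE TABLE fact_attendance (
    date_key         INTEGER NOT NULL,
    student_key      INTEGER NOT NULL,
    course_key       INTEGER NOT NULL,
    attendance_count INTEGER NOT NULL DEFAULT 1
);
INSERT INTO fact_attendance (date_key, student_key, course_key)
VALUES (1, 10, 100), (1, 11, 100), (2, 10, 100);
""")

# With the count column, attendance aggregates like any other additive fact.
attendance_by_date = conn.execute("""
    SELECT date_key, SUM(attendance_count)
    FROM fact_attendance
    GROUP BY date_key ORDER BY date_key
""").fetchall()
```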

Consolidated Fact Tables Fact tables populated from different sources may potentially be consolidated into single one Level of granularity must be the same Measurements are listed side-by-side Example: by combining forecast and actual sales amounts, a forecast/actual sales variance amount can be easily calculated and stored
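
The forecast/actual example can be sketched as a merge of two sources that share the same grain (here month and brand). The figures are invented, and defining variance as actual minus forecast is an assumption.

```python
# Two hypothetical sources at the same grain: (month, brand).
actual_sales = {("2024-01", "BrandA"): 120.0, ("2024-01", "BrandB"): 80.0}
forecast_sales = {("2024-01", "BrandA"): 100.0, ("2024-01", "BrandB"): 90.0}

def consolidate(actual, forecast):
    """List both measurements side by side and derive the variance."""
    rows = []
    for grain_key in sorted(actual.keys() | forecast.keys()):
        actual_amount = actual.get(grain_key, 0.0)
        forecast_amount = forecast.get(grain_key, 0.0)
        rows.append({
            "month": grain_key[0],
            "brand": grain_key[1],
            "actual_amount": actual_amount,
            "forecast_amount": forecast_amount,
            # Assumed definition: variance = actual minus forecast.
            "variance_amount": actual_amount - forecast_amount,
        })
    return rows

consolidated_fact = consolidate(actual_sales, forecast_sales)
```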

Recommendations to Avoid Common Misconceptions about Dimensional Modeling: Do not take a “report-centric” approach: do not create a new dimensional model for each slightly different report. Do not create a new dimensional model for each department for data from the same source. Create dimensional models with the finest level of granularity (atomic data): they are flexible, independent of a specific business question or report, and scalable. Use conformed dimensions: they ease integration efforts, make the ETL process structured, and avoid chaos when integrating multiple data marts.

Comprehensive example – Video rental

E-R Diagram
Customer: #Cust No, F Name, L Name, Ads1, Ads2, City, State, Zip, Tel No, CC No, Expire
Rental: #Rental No, Date, Clerk No, Pay Type, CC No, Expire, CC Approval
Line: #Line No, Due Date, Return Date, OD charge, Pay type
Title: #Title No, Name, Vendor No, Cost
Video: #Video No, One-day fee, Extra days, Weekend
Relationship labels: Requestor of, Owner of, Holder of, Name for

Dimensional Model
Customer: CustID, Cust No, F Name, L Name
Rental: RentalID, Rental No, Clerk No, Store, Pay Type
Line: LineID, OD Charge, OneDayCharge, ExtraDaysCharge, WeekendCharge, DaysReserved, DaysOverdue, AddressID, RentalId, VideoID, TitleID, RentalDateID, DueDateID, ReturnDateID
Video: Video No
Title: TitleNo, Name, Cost, Vendor Name
Rental Date: SQLDate, Day, Week, Quarter, Holiday
Due Date
Return Date
Address: Address1, Address2, City, State, Zip, AreaCode, Phone

Modeling Process

4 steps of dimensional modeling Choose a business process Declare the grain Identify dimensions Identify facts

High-level model diagram: Is a data model at the entity level. Shows the specific fact and dimension tables applicable to a specific business process. A great communication and training tool. Example: Orders, surrounded by Currency, Date (Order, Due), Product, Promotion, Order junk, Channel, Customer, and Sales person.

Derived facts: Additive calculation using other facts in the same table: can be calculated using a view. Example: net sales, obtained by subtracting the commission amount from gross sales. Non-additive calculation expressed at a different level of detail than the fact table itself: can be calculated by BI tools at query time. Example: year-to-date sales.

Derived facts

Detailed Dimensional Design Worksheet

Updating bus matrix

Sample Data Model Issue List

Design document: Brief description of the business processes included in the design. High-level discussion of the business requirements to be supported, pointing back to the detailed requirements document. High-level data model diagram. Detailed dimensional design worksheet for each fact and dimension table. Open issues list highlighting the unresolved issues. Discussion of any known limitations of the design to support the project scope and business requirements. Other items of interest, such as design compromises or source data concerns.

Questions ?

Outline What is Dimensional Modeling. Star Schemas (Facts and Dimensions) Star Schema vs. ER Diagram SQL Comparison

Outline (continued) Strengths of Dimensional Modeling. Myths of Dimensional Modeling. Designing the Data Warehouse. Keys. References.

What is a MDDB? An MDDB is a specialized data storage facility that stores summarized data for fast and easy access. Users can quickly view large amounts of data as a value at any cross-section of business dimensions. A business dimension can be any logical vision of the data -- time, geography, or product, for example. Once an MDDB is created, it can be copied or transported to any platform. In addition, regardless of where the MDDB resides, it is accessible to requesting applications on any supported platform anywhere on the network, including the Web.

MDDB (continued): An MDDB can be implemented either on a proprietary MDDB product or as a dimensional model on an RDBMS. The latter is more common. For our purposes we will use Oracle 8i, a relational database. Proprietary MDDB products include Oracle’s Express, Arbor Essbase, Microsoft’s SQL Server OLAP component, etc.

What is a data warehouse? Data warehouses began in the 1970s out of the need of many companies to combine the data of their various operational systems into a useful and consistent form for analysis. Data warehouses are used to provide data to Decision Support Systems (DSS). Many data warehouses also work with OLAP (Online Analytical Processing) servers and clients. Data warehouses are updated only in batch, not by transactions. They are optimized for SQL SELECT statements; this optimization includes de-normalization.

DW (continued): Inmon’s four characteristics of a data warehouse. Subject-oriented: DWs answer a question; they don’t just store data. Integrated: DWs provide a unified view of the company’s data. Nonvolatile: DWs are read-only for analytical purposes; de-normalization is OK. Time-variant: DW data is time sensitive. Analyze the past to predict the future.

Review of ER Modeling: Entity-relationship modeling is a logical design technique that seeks to eliminate data redundancy and maintain the integrity of the database. It does this by highly normalizing the data. The more you normalize, the more entities and relationships you wind up with. This is necessary in an online transaction processing (OLTP) system because inserts, deletes, and updates against de-normalized data require additional operations to keep all the redundant data in sync, which is both highly inefficient and prone to errors. The ER model is the best model for OLTP.

The Problem with ER Diagrams: ER diagrams are a spider web of all entities and their relationships to other entities throughout the database schema. Unrelated relationships clutter the view of what you really want to get at. ER diagrams are too complex for most end users to understand, and because of all the joins required to get any meaningful data for analysis, queries against them are highly inefficient. They are not useful for data warehouses, which need intuitive, high-performance retrieval of data.

What is Dimensional Modeling? Dimensional modeling is a logical design technique often used for data warehouses. It seeks to present the data in a standard framework that is intuitive and allows for high-performance access. Dimensional modeling provides the best results for both ease of use and high performance.

It uses the relational model with a few restrictions: Every dimensional model is composed of one table with a multi-part key, called the fact table, and a set of smaller tables called dimension tables. Each dimension table has a single-part primary key that corresponds exactly to one of the components of the multi-part key in the fact table. This creates a structure that looks a lot like a star, hence the term “star schema”. Interestingly, the facts-and-dimensions vocabulary dates back to the late 1960s, so the approach may pre-date ER diagrams.
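The star structure described above can be sketched as a small schema, created here with Python's built-in sqlite3 module. All table and column names (product_dim, store_dim, date_dim, sales_fact) are hypothetical and chosen only for illustration.

```python
import sqlite3

# One fact table with a multi-part key, plus three dimension tables whose
# single-part primary keys each match one component of that key.
STAR_SCHEMA = """
CREATE TABLE product_dim (product_key INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE store_dim   (store_key   INTEGER PRIMARY KEY, store_name  TEXT);
CREATE TABLE date_dim    (date_key    INTEGER PRIMARY KEY, calendar_date TEXT);
CREATE TABLE sales_fact (
    product_key  INTEGER REFERENCES product_dim(product_key),
    store_key    INTEGER REFERENCES store_dim(store_key),
    date_key     INTEGER REFERENCES date_dim(date_key),
    quantity     INTEGER,
    sales_amount REAL,
    PRIMARY KEY (product_key, store_key, date_key)
);
"""


def create_star_schema():
    """Build the example star schema in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(STAR_SCHEMA)
    return conn


def fact_key_columns(conn):
    """Return the columns of the fact table's multi-part key, in key order."""
    rows = conn.execute("PRAGMA table_info(sales_fact)").fetchall()
    # Field 5 (pk) is the 1-based position within the primary key, or 0.
    key_rows = sorted((r for r in rows if r[5] > 0), key=lambda r: r[5])
    return [r[1] for r in key_rows]
```

Each component of the fact table's key points at exactly one dimension table, which is what gives the diagram its star shape.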

What is a Fact Table? A fact table has a multi-part primary key made up of two or more foreign keys, one per dimension, and usually also contains numeric data. Because its key always combines at least two foreign keys, it always expresses a many-to-many (M-M) relationship.

What is a Dimension? Dimension tables, on the other hand, have a single primary key and contain only textual data, or non-text data that is used for textual purposes. This data is used for descriptive purposes only. Each dimension is considered an equal entry point into the fact table. The textual attributes that describe things are organized within the dimensions. For example, in a retail database we would have product, store, customer, promotion, and time dimensions. Whether or not to combine related dimensions into one dimension is usually up to intuition. Remember, however, that the guiding principles of dimensional modeling are (1) intuitive design and (2) performance.

Dimensions (continued): Because dimensions are the entry point into the facts that the user is looking for, they should be very descriptive and intuitive to the user. Here are some rules. Dimension attributes should be: verbose (full words); descriptive; complete (no missing values); quality assured (no misspellings, impossible values, obsolete or orphaned values, or cosmetically different versions of the same attribute); indexed (perhaps B-tree or bitmap); documented in metadata that explains the origin and interpretation of each attribute.

SQL Comparison

Dimensional Model:
SELECT description, SUM(quoted_price), SUM(quantity), SUM(unit_price), SUM(total_comm)
FROM order_fact f
JOIN part_dimension pd ON f.part_nr = pd.part_nr
GROUP BY description;

ER Model:
SELECT description, SUM(quoted_price), SUM(quantity), SUM(unit_price), SUM(total_comm)
FROM "order" o
JOIN order_detail od ON o.order_nr = od.order_nr
JOIN part p ON p.part_nr = od.part_nr
JOIN customer c ON o.customer_nr = c.customer_nr
JOIN slsrep s ON s.slsrep_nr = c.slsrep_nr
GROUP BY description;

Notice that the dimensional model joins only two tables, while the ER model joins all five tables in the ER diagram. This is very typical of highly normalized ER models. Imagine a typical normalized database with hundreds of tables.

Rules about Facts and Dimensions: The basic tenet of dimensional modeling: “If you want to be able to slice your data along a particular attribute, you simply need to make the attribute appear in a dimension table.” Facts and their corresponding dimensions must be of the same granularity. That is, if the fact table holds numeric data for days, then the dimensions must have attributes that describe daily data. An attribute can live in one and only one dimension, whereas a fact can be repeated in multiple fact tables. If an attribute appears to belong in more than one place, it is probably playing multiple roles and needs a slightly different textual description in each.

Rules (continued): There is not necessarily a one-to-one relation between source data and dimensional data. In fact, usually one source will create multiple dimensions, or multiple sources will create one dimension. Every fact should have a default aggregation, even if that aggregation is “no aggregation”.

ER to Dimensional Models: (1) Separate the ER model into the discrete business processes that it represents. (2) Create fact tables by selecting the M-M relationships that contain numeric and additive non-key facts. Fact tables may be at a detail level or at an aggregated level, depending on business needs. (3) Create dimensions by de-normalizing all the remaining tables into flat tables with atomic keys that connect directly to the fact tables. Kimball: 146/147
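The conversion steps above can be sketched on a toy order-entry model using Python's built-in sqlite3 module. The column layout and sample rows are assumptions made for illustration (for instance, that quantity and quoted_price live on order_detail), and the table is named orders here to avoid the reserved word ORDER.

```python
import sqlite3


def load_er_model():
    """Load a tiny normalized (ER-style) order-entry model with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE part     (part_nr INTEGER PRIMARY KEY, description TEXT);
        CREATE TABLE customer (customer_nr INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders   (order_nr INTEGER PRIMARY KEY, customer_nr INTEGER);
        CREATE TABLE order_detail (order_nr INTEGER, part_nr INTEGER,
                                   quantity INTEGER, quoted_price REAL);
        INSERT INTO part VALUES (1, 'Bolt'), (2, 'Nut');
        INSERT INTO customer VALUES (10, 'Acme');
        INSERT INTO orders VALUES (100, 10), (101, 10);
        INSERT INTO order_detail VALUES
            (100, 1, 5, 2.0), (100, 2, 3, 1.0), (101, 1, 4, 2.5);
    """)
    return conn


def build_dimensional_model(conn):
    """De-normalize the ER tables into one fact table and flat dimensions."""
    conn.executescript("""
        -- Dimensions: flat copies of the descriptive tables.
        CREATE TABLE part_dimension     AS SELECT part_nr, description FROM part;
        CREATE TABLE customer_dimension AS SELECT customer_nr, name FROM customer;
        -- Fact: the many-to-many order_detail relationship, with the customer
        -- key pulled down from orders so every dimension joins directly.
        CREATE TABLE order_fact AS
            SELECT od.order_nr, od.part_nr, o.customer_nr,
                   od.quantity, od.quoted_price
            FROM order_detail od
            JOIN orders o ON o.order_nr = od.order_nr;
    """)


def quantity_by_customer_er(conn):
    """ER model: three tables must be joined to reach the customer name."""
    return conn.execute("""
        SELECT c.name, SUM(od.quantity)
        FROM order_detail od
        JOIN orders o   ON o.order_nr = od.order_nr
        JOIN customer c ON c.customer_nr = o.customer_nr
        GROUP BY c.name
    """).fetchall()


def quantity_by_customer_star(conn):
    """Dimensional model: the fact table joins straight to the dimension."""
    return conn.execute("""
        SELECT cd.name, SUM(f.quantity)
        FROM order_fact f
        JOIN customer_dimension cd ON cd.customer_nr = f.customer_nr
        GROUP BY cd.name
    """).fetchall()
```

Both queries return the same totals; the dimensional version simply needs one join fewer because the customer key was carried into the fact table when it was built.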

Strengths of Dimensional Modeling: The dimensional model is: Predictable: query tools can make strong assumptions about it. Dynamic: it extends gracefully by adding rows or columns. A standardized approach to modeling business events. Supported by a growing number of software applications. Kimball: 147 to 149

Myths about Dimensional Modeling: “Dimensional models are non-dynamic”: only when you pre-aggregate. Kept in its detail form, a dimensional model is just as dynamic as ER. “Dimensional models are too complex”: just the opposite. “Snowflaking is an alternative to dimensional modeling”: snowflaking is an extension to the star schema. It adds sub-dimensions to dimensions and therefore looks like a snowflake. It decreases the simplicity of the star schema and should be avoided. Kimball: 150/151

Designing the Data Warehouse: There are two approaches to building the data warehouse. The first is the top-down approach, in which an entire organization-wide data warehouse is built and then smaller data marts use it as a source. The second approach, which is much more feasible, is the bottom-up approach, in which individual data marts are built using conformed dimensions and a standardized architecture across the enterprise.

Design Success Factors: Create a surrounding architecture that defines the scope and implementation of the complete data warehouse. Oversee the construction of each piece of the complete data warehouse. Kimball, in chapter five, refers to a design called the data warehouse bus architecture. Kimball: 155

Drilling: There are two types of drilling. Drill down simply means “give me more detail”, that is, a lower level of granularity. For example, show sales figures for each county instead of for each state. Drill up simply means “give me less detail”, that is, a higher level of granularity. For example, show sales figures for each state instead of for each county. Most reporting/OLAP tools these days have this capability.
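In SQL terms, drilling down and up amounts to adding or removing grouping columns. The sketch below uses Python's built-in sqlite3 module; the geography_dim and sales_fact tables and their sample rows are hypothetical.

```python
import sqlite3


def load_sales():
    """Load a small geography dimension and sales fact table (sample data)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE geography_dim (geo_key INTEGER PRIMARY KEY,
                                    state TEXT, county TEXT);
        CREATE TABLE sales_fact (geo_key INTEGER, sales_amount REAL);
        INSERT INTO geography_dim VALUES
            (1, 'Ohio', 'Franklin'), (2, 'Ohio', 'Summit'), (3, 'Iowa', 'Polk');
        INSERT INTO sales_fact VALUES (1, 100.0), (1, 50.0), (2, 25.0), (3, 40.0);
    """)
    return conn


def sales_by(conn, *levels):
    """Total sales grouped by the given dimension attributes."""
    allowed = ("state", "county")
    if not levels or any(level not in allowed for level in levels):
        raise ValueError(f"levels must be chosen from {allowed}")
    cols = ", ".join(levels)  # safe: every name was checked against 'allowed'
    sql = (
        f"SELECT {cols}, SUM(f.sales_amount) "
        "FROM sales_fact f JOIN geography_dim g ON f.geo_key = g.geo_key "
        f"GROUP BY {cols} ORDER BY {cols}"
    )
    return conn.execute(sql).fetchall()
```

Calling sales_by(conn, "state") gives the drilled-up view; adding "county" as a second level drills down. The fact table is untouched in both cases, which is why keeping facts at the detail grain makes drilling cheap to support.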

Special Types of Dimensions: Time dimension: should be nation neutral (p. 176). Person dimension: very atomic, for example separate fields for all parts of the name and address (p. 178). Small static (slowly changing) dimensions. Small dynamic (rapidly changing) dimensions. Large static (slowly changing) dimensions. Large dynamic (rapidly changing) dimensions. Degenerate dimensions: dimensions without attributes. Miscellaneous dimensions: miscellaneous data that doesn’t fit anywhere else, but that you want to keep.

Keys: It is best to use only artificial keys assigned by the data warehouse; don’t use the original production keys. Also avoid smart keys, which are keys that usually also carry attribute information.
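One way to assign artificial keys during loading is a simple lookup from production key to warehouse key. The class below is a minimal sketch under that assumption; its name, the source-system labels, and the sample keys are all hypothetical.

```python
from itertools import count


class SurrogateKeyGenerator:
    """Hand out meaningless integer keys, one per distinct production key."""

    def __init__(self):
        self._next_key = count(1)
        self._by_production_key = {}

    def key_for(self, source_system, production_key):
        """Return the warehouse key for a production key, creating it if new."""
        natural = (source_system, production_key)
        if natural not in self._by_production_key:
            self._by_production_key[natural] = next(self._next_key)
        return self._by_production_key[natural]
```

For example, key_for("erp", "P-100") and key_for("crm", "P-100") receive different warehouse keys, so identical production keys from two source systems cannot collide, and the integer itself encodes nothing about the row it identifies.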

Designing the Fact Table: Kimball defines a four-step process. (1) Choose the data mart. (2) Choose the fact table grain, which should be as granular as possible. (3) Choose the dimensions, which are usually determined by the fact table. (4) Choose the facts of interest to you. Kimball: 194

Granularity: Detail granularity has several advantages over aggregate granularity: it is more dynamic, it is required for data mining, and it allows for behavior analysis (207/208). Aggregates offer increased performance when details are not needed. The best of both worlds can be achieved using something called a snapshot. In Oracle this is achieved using a materialized view. Transactions and snapshots are the yin and yang of data warehousing. Kimball: 211
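The relationship between detail transactions and a snapshot can be sketched as follows, using Python's built-in sqlite3 module. SQLite has no materialized views, so a plain table rebuilt on demand stands in for one here; the table names, the YYYYMMDD integer date key, and the monthly grain are assumptions for illustration.

```python
import sqlite3


def load_transactions():
    """Load a detail-grain transaction fact table with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE sales_transaction_fact (
            date_key     INTEGER,  -- assumed YYYYMMDD integer key
            product_key  INTEGER,
            sales_amount REAL
        );
        INSERT INTO sales_transaction_fact VALUES
            (20240105, 1, 10.0), (20240120, 1, 15.0),
            (20240203, 1,  5.0), (20240210, 2,  8.0);
    """)
    return conn


def refresh_monthly_snapshot(conn):
    """Rebuild the aggregate table from the detail rows (stand-in for a refresh)."""
    conn.executescript("""
        DROP TABLE IF EXISTS sales_monthly_snapshot;
        CREATE TABLE sales_monthly_snapshot AS
            SELECT date_key / 100 AS month_key,  -- integer division: YYYYMM
                   product_key,
                   SUM(sales_amount) AS sales_amount
            FROM sales_transaction_fact
            GROUP BY date_key / 100, product_key;
    """)


def monthly_snapshot(conn):
    """Read the pre-aggregated rows without touching the detail table."""
    return conn.execute(
        "SELECT month_key, product_key, sales_amount "
        "FROM sales_monthly_snapshot ORDER BY month_key, product_key"
    ).fetchall()
```

Queries that only need monthly totals read the small snapshot table, while the detail rows remain available for data mining and behavior analysis.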

Thank You