Shawn Nesser Microsoft
Introduction Basic Concepts Slowly Changing Dimensions Conformed Dimensions Advanced Concepts Summary
Dimensional models are required for end-user access
Focusing on the data presentation area, not the data staging area
[Diagram: Enterprise Data Warehouse, showing the Data Staging Area (data staging), the Data Mart (data presentation) and End User Data Access Tools]
Differences between transactional processing (OLTP) and data warehousing understood

Transactional Processing               Data Warehousing
Represents current state               Preserves history
Predictable usage                      Highly unpredictable usage
Optimized to get data “in”             Optimized to get data “out”
Deals with small number of records     Deals with millions of records
Optimized for machine processing       Optimized for human processing
Transaction processing systems prefer a normalized (3NF) structure
- Optimized for single-record data entry and retrieval
- Very well defined process and access path
- Presupposes the types of work being performed
Data warehousing requires a different approach
- Optimized for query (discovery / exploration) performance
- Simple, understandable and memorable
- Flexible
[Diagram: normalized entity model. Entity labels: Product, Product Line, Ship Type, Shipper, Contract, Contract Type, Customer, Order Item, Customer Location, Sales Order, Contact, Location, Product Group, Ship Contact]
Result:

Education Level   Exp.  Bilingual  US Eligible  Country (NID)  Name            Birth Date  Hire Date
A-Not Indicated   0.0   N          Y            USA            Morgan,Richard  29/03/1950  25/11/1985

Query:

    SELECT PSXLATITEM.XLATLONGNAME AS EDUCATION_LEVEL,
           PS_PERS_DATA_EFFDT.YEARS_OF_EXP,
           PS_PERS_DATA_EFFDT.BILINGUALISM_CODE,
           PS_PERS_DATA_EFFDT.US_WORK_ELIGIBILTY,
           PS_PERS_NID.COUNTRY,
           PS_NAMES.NAME,
           PS_PERSON.BIRTHDATE,
           PS_PERSON.ORIG_HIRE_DT
    FROM PS_PERS_DATA_EFFDT
      INNER JOIN PSXLATITEM
        ON PS_PERS_DATA_EFFDT.HIGHEST_EDUC_LVL = PSXLATITEM.FIELDVALUE
      INNER JOIN PS_PERS_NID
        ON PS_PERS_DATA_EFFDT.EMPLID = PS_PERS_NID.EMPLID
      INNER JOIN PS_NAMES
        ON PS_PERS_DATA_EFFDT.EMPLID = PS_NAMES.EMPLID
      INNER JOIN PS_PERSON
        ON PS_PERS_DATA_EFFDT.EMPLID = PS_PERSON.EMPLID
      INNER JOIN (SELECT EMPLID, MAX(EFFDT) AS EFFDT
                  FROM PS_PERS_DATA_EFFDT AS PS_PERS_DATA_EFFDT_1
                  GROUP BY EMPLID) AS PERS
        ON PS_PERS_DATA_EFFDT.EMPLID = PERS.EMPLID
       AND PS_PERS_DATA_EFFDT.EFFDT = PERS.EFFDT
      INNER JOIN (SELECT EMPLID, MAX(EFFDT) AS EFFDT
                  FROM PS_NAMES AS PS_NAMES_1
                  WHERE (NAME_TYPE = 'PRI')
                  GROUP BY EMPLID) AS NAMES
        ON PS_NAMES.EMPLID = NAMES.EMPLID
       AND PS_NAMES.EFFDT = NAMES.EFFDT
    WHERE (PS_PERS_DATA_EFFDT.EMPLID = …)
      AND (PSXLATITEM.FIELDNAME = 'HIGHEST_EDUC_LVL')
      AND (PSXLATITEM.EFF_STATUS = 'A')
      AND (PS_NAMES.NAME_TYPE = 'PRI')
Introduction
Basic Concepts
- Star Schema (Fact and Dimension Tables)
- Surrogate Keys
- Hierarchies / Drill Down
- Factless Fact Tables
Slowly Changing Dimensions
Conformed Dimensions
Advanced Concepts
Summary
What is a Star Schema? Single data (fact) table surrounded by multiple descriptive (dimension) tables
- Easier to understand and navigate
- Better performance due to fewer joins
- Extensible to handle change: new dimensions, new attributes, new facts, new aggregates, new data sources
- Flexibility to support ad hoc queries / exploration
- Recommended by most data access tools
[Diagram: fact table (Fact 1, Fact 2, Fact 3, Dim 1 Key, Dim 2 Key, Dim 3 Key, Dim 4 Key) joined to Dim 1 Table, Dim 2 Table, Dim 3 Table and Dim 4 Table on their respective keys]
Report labels and query constraints
- “By” words
- “Where” clauses
Descriptive attributes
- Verbose; minimal or no codes
- End-user familiar terminology
Hierarchical relationships
- Supports “drill down” into more detail
Dimension tables store redundant, descriptive data in order to improve ease of use, support single-table browsing, and reduce table joins.
[Diagram: product dimension with columns Product Key, Product Description, SKU #, Size, Flavor, Formula, Brand Description, Category Description]
Product dimension columns: Product Key, Product Description, SKU #, Size, Class, Formula, Brand Description, Category Description

Product Key  Product Description  Size   Class   Brand Description
0001         Cheerios 10 oz       10 oz  Family  Cheerios
0002         Cheerios 24 oz       24 oz  Family  Cheerios
0003         Total 10 oz          10 oz  Health  Total
0004         Total 24 oz          24 oz  Health  Total
Facts
- Metrics that result from business processes
- Rapidly changing
- Usually numeric and additive
- Tend to have huge numbers of records
- Contain the lowest level of grain possible
- Typically built around a single business process: orders, shipments, claims, deposits, sales transactions
Multi-part key
- Foreign keys to dimension tables
- Date is always a key
- Resolves many-to-many relationships
Types of fact tables
- Transactional
- Periodic snapshots (points in time)
- Accumulating snapshots
[Diagram: fact table with keys Date Key, Product Key, Store Key, Promotion Key and facts Total $ Sales, Total Unit Sales, Promoted $ Sales, Promoted Unit Sales]
DIM_PRODUCT: PRODUCT_KEY, PRODUCT_CODE, PRODUCT_CNAME, PRODUCT_NAME, PRODUCT_CAT, PRODUCT_SUBCAT, PRODUCT_DEPT, PRODUCT_BRAND, …, EFF_FROM_DT, EFF_TO_DT, CURRENT_FLAG
DIM_LOCATION: LOCATION_KEY, LOCATION_NAME, LOCATION_TYPE, LOCATION_DESC, …, EFF_FROM_DT, EFF_TO_DT, CURRENT_FLAG
DIM_TIME: DATE_KEY, DAY_DATE, DAY_IN_YEAR, MONTH_NUM, QTR_NUM, HALF_NUM, YEAR_NUM, …
FACT_SALES: PRODUCT_KEY, LOCATION_KEY, DATE_KEY, …, SUM_QUANTITY, SUM_SALES, SUM_COST, SUM_MARGIN, SUM_TAX, PROMO_QUANTITY, PROMO_SALES, …

    select location_type, sum(sum_sales)
    from dim_product, dim_location, fact_sales
    where fact_sales.product_key = dim_product.product_key
      and fact_sales.location_key = dim_location.location_key
      and product_cat = 'LAPTOPS'
    group by location_type

Sales of: LAPTOPS

Location Type  Sales
Store          $150,000
Internet       $320,000
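The star join above can be tried end to end with Python's built-in sqlite3 module. This is only a minimal sketch: the table and column names follow the slide, but the tables are cut down to a few columns and every row is made-up illustrative data chosen to reproduce the slide's totals. The function name `laptop_sales_by_location_type` is hypothetical.

```python
import sqlite3

def laptop_sales_by_location_type():
    """Build a tiny in-memory star schema and run the slide's query."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_product  (product_key INTEGER PRIMARY KEY, product_name TEXT, product_cat TEXT);
        CREATE TABLE dim_location (location_key INTEGER PRIMARY KEY, location_name TEXT, location_type TEXT);
        CREATE TABLE fact_sales   (product_key INTEGER, location_key INTEGER, date_key INTEGER, sum_sales REAL);
        -- Illustrative rows only; not taken from any real system.
        INSERT INTO dim_product  VALUES (1, 'Laptop A', 'LAPTOPS'), (2, 'Monitor B', 'MONITORS');
        INSERT INTO dim_location VALUES (10, 'Downtown', 'Store'), (20, 'Web shop', 'Internet');
        INSERT INTO fact_sales VALUES
            (1, 10, 20050101, 100000), (1, 10, 20050102, 50000),
            (1, 20, 20050101, 320000), (2, 10, 20050101, 75000);
    """)
    # One join per dimension, a constraint on a dimension attribute,
    # and a GROUP BY on another dimension attribute.
    rows = con.execute("""
        SELECT l.location_type, SUM(f.sum_sales)
        FROM fact_sales f
        JOIN dim_product  p ON f.product_key  = p.product_key
        JOIN dim_location l ON f.location_key = l.location_key
        WHERE p.product_cat = 'LAPTOPS'
        GROUP BY l.location_type
        ORDER BY l.location_type
    """).fetchall()
    con.close()
    return rows
```

The monitor sale is filtered out by the constraint on `product_cat`, and the two store rows are summed into a single line of the report.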
Recommend surrogate keys
- Integer, non-meaningful, sequence number
- Don’t use operational IDs or meaningful keys
Benefits
- Isolate warehouse keys from operational system changes
- Improve performance
- Support integration from multiple data sources
- Enable tracking of dimension changes (slowly changing dimensions)
- Can record “Not applicable” or “TBD” values
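As a rough illustration of the points above, the sketch below loads a dimension with an integer surrogate key that is independent of the operational code, and reserves a row for “Not applicable”. The use of key 0 for that row, the `'N/A'` code and the function name `load_with_surrogate_keys` are assumptions made for the example, not something the slides prescribe.

```python
import sqlite3

def load_with_surrogate_keys(source_rows):
    """Load (operational_code, name) pairs; return operational code -> surrogate key."""
    con = sqlite3.connect(":memory:")
    con.execute("""
        CREATE TABLE dim_product (
            product_key  INTEGER PRIMARY KEY,  -- surrogate key: meaningless sequence number
            product_code TEXT UNIQUE,          -- operational (natural) key, kept as an attribute
            product_name TEXT
        )""")
    # Assumed convention: key 0 is the "Not applicable" member that facts
    # without a product can still point at.
    con.execute("INSERT INTO dim_product VALUES (0, 'N/A', 'Not applicable')")
    for code, name in source_rows:
        # The warehouse assigns the key; the operational code is never used as one.
        con.execute(
            "INSERT INTO dim_product (product_code, product_name) VALUES (?, ?)",
            (code, name),
        )
    mapping = dict(con.execute("SELECT product_code, product_key FROM dim_product"))
    con.close()
    return mapping
```

During fact loading, a mapping like this is what translates operational codes into warehouse keys, so a renumbering in the source system only affects the lookup, not the fact table.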
Dimension tables can represent multiple hierarchical roll-ups
For example, the Store dimension could have the following hierarchies:
- Physical Geography: Zip, City, County, State, Country
- Sales Organization: District, Region, Zone
- Distribution Roll-up: DC Center, DC Region
[Diagram: Store dimension with columns Store Key, Store Description, Store Type, Zip, City, State, Sales District, Sales Region, Sales Zone, DC Center, DC Region]
How does the data warehouse know that you had a promotion if no promoted products sell?
Factless fact table
- Records either the coverage or occurrence of an event
[Diagram: Promotion Event Coverage Table (Date Key, Item Key, Store Key, Promo Key, Promo Counter) joined to the Date (Date Key, Date Desc.), Item (Item Key, Item Desc.), Store (Store Key, Store Desc.) and Promo (Promo Key, Promo Desc.) dimensions]
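One way to answer the question above is to compare the coverage table with the sales fact table. The sketch below assumes a `promo_coverage` table with the four keys from the slide and a hypothetical `sales_fact` table at the same grain; all rows are invented for illustration.

```python
import sqlite3

def promoted_but_unsold():
    """Return item keys that were on promotion but have no matching sales row."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE promo_coverage (date_key INT, item_key INT, store_key INT, promo_key INT);
        CREATE TABLE sales_fact     (date_key INT, item_key INT, store_key INT, promo_key INT, units INT);
        -- Items 1 and 2 were both promoted in store 10; only item 1 sold.
        INSERT INTO promo_coverage VALUES (20050101, 1, 10, 7), (20050101, 2, 10, 7);
        INSERT INTO sales_fact     VALUES (20050101, 1, 10, 7, 5);
    """)
    # Coverage rows with no sales counterpart are the promotions that sold nothing.
    rows = con.execute("""
        SELECT c.item_key
        FROM promo_coverage c
        LEFT JOIN sales_fact s
          ON  s.date_key  = c.date_key
          AND s.item_key  = c.item_key
          AND s.store_key = c.store_key
          AND s.promo_key = c.promo_key
        WHERE s.item_key IS NULL
    """).fetchall()
    con.close()
    return [r[0] for r in rows]
```

The sales fact table alone could never produce item 2, because no row exists for it; the factless coverage table supplies the missing "what was supposed to happen" side.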
Dimensional attributes evolve over time
- Customers move, get married, have children, adjust salaries
You may want to track the history of these changes, as it could affect your analysis capabilities. For example:
- How does getting married impact the customer’s buying behaviors?
- What impact did the sales reorganization have on sales force productivity?
For every dimensional attribute, you need to identify a “change” strategy
- Type 1: Store only the current value
- Type 2: Create a dimension record for each value (with or without date stamps)
- Type 3: Create an attribute in the dimension record for the previous value
- Hybrid
Type 1: Overwrite the Changed Attribute
Situation: Tracking the change history of this attribute has no analytic value

Original Record
Item Key  Item Desc      Dept
12345     Sim City 3000  Educational SW

Updated Record
Item Key  Item Desc      Dept
12345     Sim City 3000  Strategy SW
Operational System (new description read from system)
STATUS_CODE  STATUS_DESC
OPEN         Open Orders

Prior state of table
STATUS_KEY  STATUS_CODE  STATUS_DESC
1           OPEN         Open

Post-update state of table
STATUS_KEY  STATUS_CODE  STATUS_DESC
1           OPEN         Open Orders

Overwrite history
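A Type 1 change is a plain in-place update. The following is a minimal sketch using the status table from the slide; the table name `dim_status` and the helper names are assumptions.

```python
import sqlite3

def apply_type1(con, status_code, new_desc):
    """Type 1: overwrite the attribute; the old value is not kept anywhere."""
    con.execute(
        "UPDATE dim_status SET status_desc = ? WHERE status_code = ?",
        (new_desc, status_code),
    )

def demo_type1():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE dim_status "
        "(status_key INTEGER PRIMARY KEY, status_code TEXT, status_desc TEXT)"
    )
    con.execute("INSERT INTO dim_status VALUES (1, 'OPEN', 'Open')")  # prior state
    apply_type1(con, "OPEN", "Open Orders")                           # new description
    rows = con.execute("SELECT * FROM dim_status").fetchall()
    con.close()
    return rows
```

The surrogate key stays at 1, so existing fact rows now report under the new description.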
Type 2: Add a New Dimension Record
Situation: Maintains history as a new dimension record (need to access two records to determine the change date)

Original Record
Item Key  Item Desc      Dept
12345     Sim City 3000  Educational SW

Updated Record
Item Key  Item Desc      Dept
13456     Sim City 3000  Strategy SW
Operational System (new description read from system, 1st January)
STATUS_CODE  STATUS_DESC
OPEN         Open Orders

Prior state of table
STATUS_KEY  STATUS_CODE  STATUS_DESC  EFF_FROM_DT  EFF_TO_DT
1           OPEN         Open         01/01/1970   12/31/2999

Post-update state of table
STATUS_KEY  STATUS_CODE  STATUS_DESC  EFF_FROM_DT  EFF_TO_DT
1           OPEN         Open         01/01/1970   …/31/…
2           OPEN         Open Orders  …/01/…       12/31/2999

New columns (EFF_FROM_DT, EFF_TO_DT) added to enable key lookup during fact loading
Keep history
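The Type 2 mechanics can be sketched as "close the current row, then insert a new one", with the effective dates driving the key lookup at fact-load time. The code below is an illustration under stated assumptions: dates are stored as ISO strings, the old row is closed on the day before the change, 2999-12-31 marks the open-ended row, and the change date of 1 January 2005 is chosen for the example. The helper names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

HIGH_DATE = "2999-12-31"  # assumed marker for the currently effective row

def apply_type2(con, status_code, new_desc, change_date):
    """Type 2: expire the current row and add a new row with a new surrogate key."""
    day_before = (change_date - timedelta(days=1)).isoformat()
    con.execute(
        "UPDATE dim_status SET eff_to_dt = ? WHERE status_code = ? AND eff_to_dt = ?",
        (day_before, status_code, HIGH_DATE),
    )
    con.execute(
        "INSERT INTO dim_status (status_code, status_desc, eff_from_dt, eff_to_dt) "
        "VALUES (?, ?, ?, ?)",
        (status_code, new_desc, change_date.isoformat(), HIGH_DATE),
    )

def lookup_key(con, status_code, as_of):
    """Key lookup during fact loading: find the row effective on the fact's date."""
    row = con.execute(
        "SELECT status_key FROM dim_status "
        "WHERE status_code = ? AND ? BETWEEN eff_from_dt AND eff_to_dt",
        (status_code, as_of.isoformat()),
    ).fetchone()
    return row[0] if row else None

def demo_type2():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE dim_status (status_key INTEGER PRIMARY KEY, status_code TEXT, "
        "status_desc TEXT, eff_from_dt TEXT, eff_to_dt TEXT)"
    )
    con.execute(
        "INSERT INTO dim_status VALUES (1, 'OPEN', 'Open', '1970-01-01', ?)", (HIGH_DATE,)
    )
    apply_type2(con, "OPEN", "Open Orders", date(2005, 1, 1))
    keys = (lookup_key(con, "OPEN", date(2004, 6, 30)),
            lookup_key(con, "OPEN", date(2005, 3, 1)))
    rows = con.execute("SELECT * FROM dim_status ORDER BY status_key").fetchall()
    con.close()
    return keys, rows
```

Facts dated before the change resolve to key 1 and later facts to key 2, which is how history is preserved without touching the fact table.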
Even simple name changes cause problems
Change of name from ‘Eastern Region’ to ‘East’

Department dimension columns: DEPT_KEY, DEPTID, DESCR, EFF_FROM_DT
Descriptions shown: U.S. Ops., Eastern Region (EFF_FROM_DT 01/01/…), East (EFF_FROM_DT …/03/2005)

Report for all 2005
Department      Year  Invoice $
Eastern Region  2005  $200,000
East            2005  $300,000

Report for January 2005
Department      Month  Invoice $
Eastern Region  Jan    $200,000

Report for March 2005
Department      Month  Invoice $
East            Mar    $300,000
Type 3: Add a “Prior” Attribute
Situation: Maintains history as a new dimensional attribute

Original Record
Item Key  Item Desc      Dept
12345     Sim City 3000  Educational SW

Updated Record
Item Key  Item Desc      Dept         Prior Dept
12345     Sim City 3000  Strategy SW  Educational SW
Operational System (new description read from system, 1st January 2005)
STATUS_CODE  STATUS_DESC
OPEN         Open Orders

Prior state of table
STATUS_KEY  STATUS_CODE  STATUS_DESC  PREV_STATUS_DESC
1           OPEN         Open         Open

Post-update state of table
STATUS_KEY  STATUS_CODE  STATUS_DESC  PREV_STATUS_DESC
1           OPEN         Open Orders  Open

Keep one version of history
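A Type 3 change can be written as a single update that moves the current value into the “previous” column before overwriting it. This is a small sketch against the status table from the slide; the second change to 'Open Items' is an invented value used only to show that just one prior version survives.

```python
import sqlite3

def apply_type3(con, status_code, new_desc):
    """Type 3: shift the current value into the 'previous' column, then overwrite."""
    # Both right-hand sides see the row as it was before the UPDATE, so
    # prev_status_desc receives the old description.
    con.execute(
        "UPDATE dim_status SET prev_status_desc = status_desc, status_desc = ? "
        "WHERE status_code = ?",
        (new_desc, status_code),
    )

def demo_type3():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE dim_status (status_key INTEGER PRIMARY KEY, status_code TEXT, "
        "status_desc TEXT, prev_status_desc TEXT)"
    )
    con.execute("INSERT INTO dim_status VALUES (1, 'OPEN', 'Open', 'Open')")
    apply_type3(con, "OPEN", "Open Orders")
    first = con.execute("SELECT * FROM dim_status").fetchall()
    apply_type3(con, "OPEN", "Open Items")  # hypothetical second change
    second = con.execute("SELECT * FROM dim_status").fetchall()
    con.close()
    return first, second
```

After the second change the original 'Open' description is gone, which is the "one version of history" limitation noted on the slide.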
Hybrid Approach
- Use Type 2 to track changes as they occur
- Include a “current” Type 3 attribute, treated as Type 1

Item Key  Item Desc      “As Was” Dept  “Current” Dept
12345     Sim City 3000  Strategy SW    Simulation SW
13456     Sim City 3000  Strategy SW    Simulation SW
14567     Sim City 3000  Simulation SW  Simulation SW
Operational System (new description read from system, 1st January)
STATUS_CODE  STATUS_DESC
OPEN         Open Orders

Prior state of table
STATUS_KEY  STATUS_CODE  CURR_STATUS  HIST_STATUS  EFF_FROM_DT  EFF_TO_DT
1           OPEN         Open         Open         01/01/1970   12/31/2999

Post-update state of table
STATUS_KEY  STATUS_CODE  CURR_STATUS  HIST_STATUS  EFF_FROM_DT  EFF_TO_DT
1           OPEN         Open Orders  Open         01/01/1970   …/31/…
2           OPEN         Open Orders  Open Orders  …/01/…       12/31/2999

Keep history + keep current on each row
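The hybrid approach combines the two previous mechanisms: a Type 2 row is added for the historical value, and the "current" column is then overwritten on every row for that code. The sketch below assumes ISO date strings, a 2999-12-31 open-ended marker, closing the old row on the day before the change, and an example change date of 1 January 2005; the helper names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta

HIGH_DATE = "2999-12-31"  # assumed marker for the currently effective row

def apply_hybrid(con, status_code, new_desc, change_date):
    day_before = (change_date - timedelta(days=1)).isoformat()
    # Type 2 part: expire the current row and insert a new one.
    con.execute(
        "UPDATE dim_status SET eff_to_dt = ? WHERE status_code = ? AND eff_to_dt = ?",
        (day_before, status_code, HIGH_DATE),
    )
    con.execute(
        "INSERT INTO dim_status (status_code, curr_status, hist_status, eff_from_dt, eff_to_dt) "
        "VALUES (?, ?, ?, ?, ?)",
        (status_code, new_desc, new_desc, change_date.isoformat(), HIGH_DATE),
    )
    # Type 1 part: overwrite the "current" value on every row for this code.
    con.execute(
        "UPDATE dim_status SET curr_status = ? WHERE status_code = ?",
        (new_desc, status_code),
    )

def demo_hybrid():
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE dim_status (status_key INTEGER PRIMARY KEY, status_code TEXT, "
        "curr_status TEXT, hist_status TEXT, eff_from_dt TEXT, eff_to_dt TEXT)"
    )
    con.execute(
        "INSERT INTO dim_status VALUES (1, 'OPEN', 'Open', 'Open', '1970-01-01', ?)",
        (HIGH_DATE,),
    )
    apply_hybrid(con, "OPEN", "Open Orders", date(2005, 1, 1))
    rows = con.execute(
        "SELECT status_key, curr_status, hist_status FROM dim_status ORDER BY status_key"
    ).fetchall()
    con.close()
    return rows
```

Grouping by `hist_status` reports each fact under the description in force at the time, while grouping by `curr_status` rolls all history up under today's description.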
    select product_name, product_cname, quarter, sum(sum_sales)
    from dim_product, dim_time, fact_sales
    where dim_product.product_key = fact_sales.product_key
      and dim_time.date_key = fact_sales.date_key
    group by product_name, product_cname, quarter

Using historical name
Name      Quarter  Invoice $
Marathon  Q1       $500,000
Snickers  Q1       $250,000
Skittles  Q2       $300,000

Using current name
Name      Quarter  Invoice $
Snickers  Q1       $750,000
Skittles  Q2       $300,000
1. Choose the Business Process
   - Examples: invoices, orders, shipments, claims, deposits
   - A set of related fact and dimension tables for each process
2. Identify the Grain
   - Select the lowest possible granularity (transaction line item, call detail record)
3. Identify the Dimensions
   - Conformed dimensions
4. Identify the Facts
Separate fact tables for each business process
- Each business process is represented by one fact table, since the grain, dimensions and facts are unique for each business process
- The conformed dimensions enable the linkage between the different business processes (stars)
[Diagram: Inventory Facts (Date Key, Item Key, Store Key) and Sales Facts (Date Key, Item Key, Store Key, Promo Key) sharing the Date, Item, Store and Promo dimensions]
[Diagram: the business processes Store Sales, Store Inventory and Purchase Orders connected to the shared dimensions Date, Store, Promo, Distribution Center, Shipper, Vendor and Item]
Business processes and shared dimensions
Dimensions: Date, Item, Store, Promo, Dist Ctr, Shipper, Vendor

Business process    Dimensions used
Store Sales         X X X X
Store Inventory     X X X
Store Deliveries    X X X X X
Dist Ctr Inventory  X X X
Dist Ctr Delivery   X X X X X
Purchase Orders     X X X X
Definition: Data marts that combine business measurements from multiple business processes; sometimes called a second-level data mart

Budget Variance Fact
  Accounting Period Key (FK)
  G/L Account Key (FK)
  G/L Organization Key (FK)
  ===========================
  Accounting Period Actual Amount
  Accounting Period Budget Amount
  Accounting Period Budget Variance (calculated (derived) metric)

Accounting Period Dimension
  Accounting Period Key (PK)
  Accounting Period Attributes…

G/L Account Dimension
  G/L Account Key (PK)
  G/L Account Attributes…

G/L Organization Dimension
  G/L Organization Key (PK)
  G/L Organization Attributes…
Definition: Type of fact table with multiple dates representing the major milestones of a relatively short-lived process or pipeline
- Useful for identifying and understanding process bottlenecks / lag analysis
Example: Order fulfillment pipeline
Orders -> Backlog -> Mfg Release -> Finished Goods Inventory -> Shipment -> Invoicing
Date Dimension (views for 9 roles)

Order Fulfillment Accumulating Fact
  Order Date Key (FK)
  Backlog Date Key (FK)
  Release to Mfg Date Key (FK)
  Finished Inventory Date Key (FK)
  Requested Ship Date Key (FK)
  Scheduled Ship Date Key (FK)
  Actual Ship Date Key (FK)
  Arrival Date Key (FK)
  Invoice Date Key (FK)
  ================================
  Calculated metrics:
  Order to Mfg Release Lag
  Mfg Release to Inventory Lag
  Inventory to Shipment Lag
  Order to Shipment Lag
  Requested to Actual Ship Lag
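The lag metrics are simple differences between milestone dates, recalculated each time the row is revisited as the order moves through the pipeline. The sketch below is one possible reading: the dictionary field names are hypothetical, lags are measured in whole days, and a lag is left empty until both of its milestones have been reached.

```python
from datetime import date

def lag_days(start, end):
    """Days between two milestones, or None while either one is still unknown."""
    if start is None or end is None:
        return None
    return (end - start).days

def fulfillment_lags(row):
    """Compute the five lag metrics for one accumulating snapshot row (a dict of dates)."""
    g = row.get
    return {
        "order_to_mfg_release_lag": lag_days(g("order_date"), g("mfg_release_date")),
        "mfg_release_to_inventory_lag": lag_days(g("mfg_release_date"), g("finished_inventory_date")),
        "inventory_to_shipment_lag": lag_days(g("finished_inventory_date"), g("actual_ship_date")),
        "order_to_shipment_lag": lag_days(g("order_date"), g("actual_ship_date")),
        "requested_to_actual_ship_lag": lag_days(g("requested_ship_date"), g("actual_ship_date")),
    }

# Illustrative order that has not shipped yet.
order = {
    "order_date": date(2005, 1, 3),
    "mfg_release_date": date(2005, 1, 5),
    "finished_inventory_date": date(2005, 1, 10),
    "requested_ship_date": date(2005, 1, 12),
}
before_shipment = fulfillment_lags(order)

# The same row is updated in place once the shipment milestone is reached.
order["actual_ship_date"] = date(2005, 1, 14)
after_shipment = fulfillment_lags(order)
```

Unlike a transactional fact row, the same row is updated repeatedly, which is why the shipment-related lags only appear in the second result.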
Definition: Attributes of a larger dimension, such as customer, that are placed into a separate, smaller dimension to control explosive growth
- May change more quickly than other dimensional attributes

Customer Score Dimension
  Customer Score Key (PK)
  Attrite Score
  Lapse Score
  Fraud Score
  Cross Sell Score
  Up Sell Score

Customer Dimension
  Customer Key (PK)
  Customer Name
  Customer Address
  Other Customer Attributes…

Order Entry Fact Table
  Customer Key (FK)
  Customer Score Key (FK)
  More Foreign Keys…
  Facts…
Some questions … What does it mean to “conform” dimension tables? Should I treat very large dimensions differently from other dimensions?
Separate fact tables for each business process
- Each business process is represented by one fact table, since the grain, dimensions and facts are unique for each business process
- The conformed dimensions enable the linkage between the different business processes (stars)
[Diagram: Inventory Facts (Date Key, Product Key, Store Key) and Sales Facts (Date Key, Product Key, Store Key, Promo Key) sharing the Date, Product, Location and Promo dimensions]
All ‘Product’ tables should conform
Conform = built from a common source, e.g. DIM_PRODUCT

GL Facts: Account Key, GL Product Key, Cost Center Key, Period Activity $
AR Facts: Account Key, Cost Center Key, Product Key, Line Amount $, Line Number
Dimensions: Account Dimension, GL Product Dimension, Cost Center Dimension, Product Dimension

DIM_PRODUCT: PRODUCT_KEY, PRODUCT_CODE, PRODUCT_NAME, PRODUCT_CAT, …, GL_PRODUCT, GL_PRODUCT_DESC, …
DIM_GL_PRODUCT: GL_PRODUCT_KEY, GL_PRODUCT, GL_PRODUCT_DESC, …
Invoice Sales

    select gl_product, sum(line_amt)
    from dim_product, fact_inv_ln
    where fact_inv_ln.product_key = dim_product.product_key
      and gl_product in ('License', 'Services')
    group by gl_product

GL Product  Invoice $
License     $550,000,000
Services    $125,000,000

GL Period Activity

    select gl_product, sum(period_act)
    from dim_gl_product, fact_gl
    where fact_gl.gl_product_key = dim_gl_product.gl_product_key
      and gl_product in ('License', 'Services')
    group by gl_product

GL Product  Period Activity
License     $550,000,000
Services    $125,000,000

‘Reconciliation’

GL Product  Invoice $     Period Activity
License     $550,000,000  $550,000,000
Services    $125,000,000  $125,000,000
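The reconciliation report works because both dimensions carry the same `gl_product` values. A sketch of that "drill across" pattern follows: each fact table is queried separately through its own dimension and the two result sets are merged on the conformed attribute. Table names follow the slides, the tables are reduced to the columns needed, and the rows and scaled-down amounts are invented for illustration.

```python
import sqlite3

def reconcile_gl_products():
    """Query two stars separately and line the results up on the conformed gl_product."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE dim_product    (product_key INTEGER PRIMARY KEY, product_name TEXT, gl_product TEXT);
        CREATE TABLE dim_gl_product (gl_product_key INTEGER PRIMARY KEY, gl_product TEXT);
        CREATE TABLE fact_inv_ln    (product_key INTEGER, line_amt INTEGER);
        CREATE TABLE fact_gl        (gl_product_key INTEGER, period_act INTEGER);
        -- Illustrative rows; amounts are scaled down from the slide.
        INSERT INTO dim_product VALUES
            (1, 'Server licence', 'License'), (2, 'Desktop licence', 'License'), (3, 'Consulting', 'Services');
        INSERT INTO dim_gl_product VALUES (1, 'License'), (2, 'Services');
        INSERT INTO fact_inv_ln VALUES (1, 300), (2, 250), (3, 125);
        INSERT INTO fact_gl     VALUES (1, 550), (2, 125);
    """)
    invoice = dict(con.execute("""
        SELECT p.gl_product, SUM(f.line_amt)
        FROM fact_inv_ln f JOIN dim_product p ON f.product_key = p.product_key
        GROUP BY p.gl_product
    """))
    ledger = dict(con.execute("""
        SELECT g.gl_product, SUM(f.period_act)
        FROM fact_gl f JOIN dim_gl_product g ON f.gl_product_key = g.gl_product_key
        GROUP BY g.gl_product
    """))
    con.close()
    # Merge on the shared attribute; a missing side shows up as None.
    names = sorted(set(invoice) | set(ledger))
    return {name: (invoice.get(name), ledger.get(name)) for name in names}
```

If the two dimensions spelled the values differently (say 'License' and 'Licenses'), the merge would produce unmatched rows, which is the practical meaning of "conform".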
The process of de-normalization naturally leads to a single customer dimension table
Potential issues
- Tracking change on some attributes = several million rows of data
- Each customer query hits a very large table
- How often do you need address information?
Conclusion
- Non-optimal query performance
- Customer table is best split according to use: demographic queries, current customer queries, historical customer queries

DIM_CUSTOMER: CUSTOMER_KEY, CUST_NUMBER, CUST_NAME, CUST_TYPE, CUST_STATUS, GENDER, NATIONALITY, CHILDREN, INCOME_BAND, ADDR_LINE1, ADDR_LINE2, ADDR_LINE3, …
DIM_CST: Type I/III SCD, current attributes
MDIM_CST_DEMOG: mini-dimension, demographic attributes
DIM_CST_EXTEND: Type II SCD, changing attributes
Row counts shown in the diagram: 20 million, 100 million, max 2 million, >1 billion
Using the mini-dimension
- Retrieve large number of rows for each index value
- Constrain small table, join to big table

    select gender, sum(sum_sales)
    from mdim_cst_demog demog, fact_sales sales
    where sales.cst_demog_key = demog.cst_demog_key
      and gender = 'Male'
    group by gender

Sales by Gender
Gender  Invoice $
Male    $1,250,000

Using the ‘extension’ dimension
- Retrieve small number of rows for each index value
- Constrain big table, join to bigger table

    select state, sum(sum_sales)
    from dim_cst_extend extend, fact_sales sales
    where sales.cst_extend_key = extend.cst_extend_key
      and state in ('CA', 'FL')
    group by state

Sales by State
State  Invoice $
CA     $1,500,000
FL     $750,000
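A mini-dimension stays small because it holds one row per distinct combination of demographic values rather than one row per customer. The sketch below shows that idea in plain Python, assuming a mini-dimension built from just two of the DIM_CUSTOMER attributes (GENDER and INCOME_BAND); the function name and the sample customers are hypothetical.

```python
def build_mini_dimension(customers):
    """Assign one mini-dimension key per distinct (gender, income_band) combination.

    customers: iterable of (cust_number, gender, income_band) tuples.
    Returns (combination -> key, cust_number -> key).
    """
    combos = {}
    assignment = {}
    for cust_number, gender, income_band in customers:
        combo = (gender, income_band)
        # A new combination gets the next sequence number; a known one reuses its key.
        key = combos.setdefault(combo, len(combos) + 1)
        assignment[cust_number] = key
    return combos, assignment

combos, assignment = build_mini_dimension([
    ("C1", "Male", "Low"),
    ("C2", "Female", "High"),
    ("C3", "Male", "Low"),
])
```

Three customers collapse to two mini-dimension rows here; with a handful of banded attributes the table is bounded by the number of possible combinations, however many customers there are.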
Get the grain right upfront, and everything else takes care of itself
- Atomic data is the most naturally dimensional data
- For example, POS business process data mart: the natural grain is the individual POS transaction

POS Transaction Fact: Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), POS Transaction # (DD), Dollars, Units, Costs
Date Dimension: Date Key (PK), Day of Week, Week Number, Month, Month Number
Product Dimension: Product Key (PK), Description, Brand, Category, Flavor, Size
Store Dimension: Store Key (PK), Store ID, Store Name, Address, District, Region
Promotion Dimension: Promotion Key (PK), Promotion Name, Price Treatment, Ad Treatment, Display Treatment, Coupon Treatment
If you get the grain right, then the data model is easily extensible
In our POS example, we can extend for:
- Customer Dimension (to support new loyalty card)
- Clerk Dimension (to track individual clerk performance)

POS Transaction Fact: Date Key (FK), Product Key (FK), Store Key (FK), Promotion Key (FK), Customer Key (FK), Clerk Key (FK), POS Transaction # (DD), Dollars, Units, Costs
Customer Dimension: Customer Key (PK), Customer Name, Customer City, Customer State, Customer Age Band, Customer LTV Estimate
Clerk Dimension: Clerk Key (PK), Clerk Name, Clerk Age Band, Clerk Years of Service, Clerk Education
Date Dimension: Date Key (PK), Day of Week, Week Number, Month, Month Number
Product Dimension: Product Key (PK), Description, Brand, Category, Flavor, Size
Store Dimension: Store Key (PK), Store ID, Store Name, Address, District, Region
Promotion Dimension: Promotion Key (PK), Promotion Name, Price Treatment, Ad Treatment, Display Treatment, Coupon Treatment
Dimensional models are required for end-user access
Focusing on the data presentation area, not the data staging area
[Diagram: Enterprise Data Warehouse, showing the Data Staging Area (data staging), the Data Mart (data presentation) and End User Data Access Tools]
Characteristics of a dimensional model
- Optimized for query (discovery / exploration) performance
- Simple, understandable and memorable
- Flexible
- “The Data Warehouse Toolkit”, second edition
- “The Microsoft Data Warehouse Toolkit”
- Kimball University Design Tips
- Kimball’s Data Warehouse Designer column
- The Data Warehouse Institute