ETL Design - Stage. Philip Noakes, May 9, 2015
Who am I? Philip Noakes, Database Developer/Designer, CapTech Consulting; MCITP in SQL Server BI
Agenda: Background – ETL and Staging Data (ETL Architecture; ETL vs ELT). Data Modeling in Stage (Concepts; Table Structure; Data Flow). Auxiliary tables in the stage environment (Control tables; Logging; Process Execution; Errors; Notification/Reporting).
ETL Architecture
ETL vs ELT: ELT – loading raw data into the presentation layer, then performing transformations at the target. ETL – loading already-transformed data into the presentation layer.
ETL vs ELT: When to use ELT: traceability to untransformed source data; larger volumes of data. When to use ETL: all other times. “The ETL process can take a long time. If we are processing in stream, we’ll have a connection open to the source system. A long-running process can create problems with database locks and stress the transaction system.” (Reference 1)
Stage Design: Built by database developers, for database developers!
Stage Concepts: Schemas (source specific; secured). Denormalized Data. Data Cleansing (flag bad or unusable data).
Table Design - Schemas: Organization (source system identification; cleansed vs raw). Security Administration (grant access by source system).
Table Design - Denormalizing: Flattening data. Pulling higher granularity attributes into lower granularity records. Pivoting lower granularity data into columns on higher granularity rows.
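To make the two moves concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (src_order, src_order_line, src_phone, stg_order_line, stg_customer_phone) are hypothetical and are not the examples from the slides. The first statement copies order-header attributes onto each order-line row; the second pivots one-row-per-phone data into columns on a one-row-per-customer table.

```python
import sqlite3


def build_demo(conn):
    """Create hypothetical source tables and two denormalized stage tables."""
    conn.executescript("""
        -- Hypothetical normalized source tables.
        CREATE TABLE src_order (order_id INTEGER, customer_name TEXT, order_date TEXT);
        CREATE TABLE src_order_line (order_id INTEGER, line_no INTEGER, product TEXT, qty INTEGER);
        CREATE TABLE src_phone (customer_name TEXT, phone_type TEXT, phone_no TEXT);

        INSERT INTO src_order VALUES (1, 'Acme', '2015-05-01');
        INSERT INTO src_order_line VALUES (1, 1, 'Widget', 2), (1, 2, 'Gadget', 5);
        INSERT INTO src_phone VALUES ('Acme', 'office', '555-0100'),
                                     ('Acme', 'mobile', '555-0101');

        -- Flattening: copy order-header attributes onto every order-line row.
        CREATE TABLE stg_order_line AS
        SELECT l.order_id, l.line_no, l.product, l.qty,
               o.customer_name, o.order_date
        FROM src_order_line AS l
        JOIN src_order AS o ON o.order_id = l.order_id;

        -- Pivoting: one row per phone becomes columns on one row per customer.
        CREATE TABLE stg_customer_phone AS
        SELECT customer_name,
               MAX(CASE WHEN phone_type = 'office' THEN phone_no END) AS office_phone,
               MAX(CASE WHEN phone_type = 'mobile' THEN phone_no END) AS mobile_phone
        FROM src_phone
        GROUP BY customer_name;
    """)
```

Both stage tables stand alone: a downstream load can read them without joining back to the source.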
Table Design - Denormalizing Example: Orders Table
Table Design - Denormalizing Example: Product Categories
Table Design - Denormalizing: "[...] design staging tables to better suit the target rather than the source. Reasons: 1. ETL is usually a two-step process. Stage then load. If the staging does mild transformations to better suit the target, I need only create one set of load processes. If the DW gets similar data from multiple sources, all I need to do is create new source specific staging processes and let the existing load processes handle the new source. 2. Sources change. I don't want to rewrite ETL processes from end-to-end because of a change in the source. 3. Most of the heavy transformation logic occurs on the load side. With the staging tables closer in structure to the target, the load process code tends to be simpler." - Nicholas Galemmo, Kimball Group Forums
Table Design - Denormalizing: Why denormalize in stage? Stand-alone tables; reflect target architecture; utilize keys and indexes on source. Why not denormalize in stage? Strain on source system; flexibility.
Table Design – Persisted Tables: Provides traceability and reload capabilities without hitting source. Store more attributes than required in the presentation layer. Defined retention period. Track processed records.
Data Cleansing: Identify data scenarios that you don’t want in the target. Enforce business rules. Look for duplicates. Check referential integrity.
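One way to picture a persisted stage table is sketched below with Python's sqlite3 module. Everything named here is an assumption for illustration: the table stg_sales_persisted, its columns, and the 30-day RETENTION_DAYS value. Each row carries a load date and a processed flag, and the purge removes only rows that are both processed and older than the retention period.

```python
import sqlite3
from datetime import date, timedelta

RETENTION_DAYS = 30  # assumed retention period, for illustration only


def create_persisted_stage(conn):
    """Create a hypothetical persisted stage table with tracking columns."""
    conn.execute("""
        CREATE TABLE stg_sales_persisted (
            sale_id        INTEGER,
            amount         REAL,
            source_note    TEXT,              -- extra attribute the target does not need
            load_date      TEXT NOT NULL,     -- ISO date the row was staged
            processed_flag INTEGER NOT NULL DEFAULT 0
        )""")


def mark_processed(conn, sale_ids):
    """Record that the given rows have been loaded to the target."""
    conn.executemany(
        "UPDATE stg_sales_persisted SET processed_flag = 1 WHERE sale_id = ?",
        [(sale_id,) for sale_id in sale_ids],
    )
    conn.commit()


def purge_expired(conn, today):
    """Delete processed rows older than the retention period; return the count."""
    cutoff = (today - timedelta(days=RETENTION_DAYS)).isoformat()
    cur = conn.execute(
        "DELETE FROM stg_sales_persisted "
        "WHERE processed_flag = 1 AND load_date < ?",
        (cutoff,),
    )
    conn.commit()
    return cur.rowcount
```

Unprocessed rows are never purged in this sketch, so a failed load can be rerun from stage without going back to the source.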
Data Cleansing - Status/Audit Fields: Status Code (Process/Do Not Process). Error Description.
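A minimal sketch of how status and audit fields might carry cleansing results, using Python's sqlite3 module. The tables (stg_order, stg_customer), the status values 'PROCESS' and 'DO_NOT_PROCESS', and the three rules are assumptions chosen to illustrate duplicates, referential integrity, and a business rule. Bad rows are flagged in place rather than deleted.

```python
import sqlite3


def create_stage(conn):
    """Create hypothetical stage tables; stg_order carries status/audit fields."""
    conn.executescript("""
        CREATE TABLE stg_customer (customer_id INTEGER);
        CREATE TABLE stg_order (
            order_id          INTEGER,
            customer_id       INTEGER,
            qty               INTEGER,
            status_code       TEXT NOT NULL DEFAULT 'PROCESS',
            error_description TEXT
        );
    """)


def flag(conn, description, condition):
    """Mark still-processable rows matching `condition` as DO_NOT_PROCESS."""
    conn.execute(
        "UPDATE stg_order "
        "SET status_code = 'DO_NOT_PROCESS', error_description = ? "
        f"WHERE status_code = 'PROCESS' AND ({condition})",
        (description,),
    )


def cleanse(conn):
    # Duplicates: keep the first row staged for each order_id, flag the rest.
    flag(conn, "Duplicate order_id",
         "rowid NOT IN (SELECT MIN(rowid) FROM stg_order GROUP BY order_id)")
    # Referential integrity: the customer must exist in stage.
    flag(conn, "Unknown customer_id",
         "NOT EXISTS (SELECT 1 FROM stg_customer AS c "
         "WHERE c.customer_id = stg_order.customer_id)")
    # Business rule (assumed): quantity must be a positive number.
    flag(conn, "Quantity must be positive", "qty IS NULL OR qty <= 0")
    conn.commit()
```

The target load would then read only rows where status_code is 'PROCESS', while the flagged rows stay available for review.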
Data Cleansing Cleansed Data Tables
Table Design – Data Typing: Match the source. Log rejected records (and maybe fail). Image source: http://dba.stackexchange.com/questions/6589/ssis-data-flow-error-output-runs-all-the-time
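The "log rejected records (and maybe fail)" idea can be sketched in plain Python. This is an illustration only, not an SSIS error output; the function name type_rows, the column names, and the max_rejects threshold are hypothetical. Rows that convert cleanly are accepted, rows that do not are kept with a reason, and the load fails only when rejects exceed the threshold.

```python
from datetime import date


class TooManyRejects(Exception):
    """Raised when rejected rows exceed the allowed threshold."""


def type_rows(raw_rows, max_rejects=None):
    """Convert raw string rows to typed rows, collecting rejects with a reason.

    max_rejects=None means never fail; otherwise fail once rejects exceed it.
    """
    accepted, rejected = [], []
    for row in raw_rows:
        try:
            typed = {
                "order_id": int(row["order_id"]),
                "order_date": date.fromisoformat(row["order_date"]),
            }
        except (KeyError, TypeError, ValueError) as exc:
            # Keep the untouched source row so it can be inspected or reloaded.
            rejected.append({"row": row, "reason": str(exc)})
        else:
            accepted.append(typed)
    if max_rejects is not None and len(rejected) > max_rejects:
        raise TooManyRejects(f"{len(rejected)} rows rejected (limit {max_rejects})")
    return accepted, rejected
```

In practice the rejected list would be written to a reject table alongside the stage table so the counts can be reconciled later.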
Table Design – Keys, Indexes, Etc.: Foreign Keys? No! Indexing? No. Primary Keys? Yes. Not NULLs/Check Constraints. (* - Persisted tables)
Auxiliary Tables: Set up a framework! Log process execution stats. Keep track of errors. Run your system. Image source: http://parascadd.com/products/rcdetailing/panopliapreprocessor.html
Framework Components and Capabilities: Control Table (incremental loads). Package Execution Logging (system health reporting; SLA tracking). Error Tables (record accounting; notification).
Control Table
Using the Control Table
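As a rough illustration of a control table driving incremental loads, the sketch below uses Python's sqlite3 module. The names etl_control, last_loaded_at, src_order, and stg_order are assumptions, and the layout is not the one shown on the slide. Each run reads the stored high-water mark, stages only the rows modified after it, and then advances the mark.

```python
import sqlite3


def setup_demo(conn):
    """Create hypothetical source, stage, and control tables."""
    conn.executescript("""
        CREATE TABLE src_order (order_id INTEGER, modified_at TEXT);
        CREATE TABLE stg_order (order_id INTEGER, modified_at TEXT);
        -- Control table: one row per process, holding its high-water mark.
        CREATE TABLE etl_control (
            process_name   TEXT PRIMARY KEY,
            last_loaded_at TEXT NOT NULL
        );
        INSERT INTO etl_control VALUES ('load_orders', '1900-01-01T00:00:00');
    """)


def incremental_load(conn, process_name="load_orders"):
    """Stage rows changed since the last run; return how many were staged."""
    (watermark,) = conn.execute(
        "SELECT last_loaded_at FROM etl_control WHERE process_name = ?",
        (process_name,),
    ).fetchone()
    rows = conn.execute(
        "SELECT order_id, modified_at FROM src_order "
        "WHERE modified_at > ? ORDER BY modified_at",
        (watermark,),
    ).fetchall()
    if rows:
        conn.executemany(
            "INSERT INTO stg_order (order_id, modified_at) VALUES (?, ?)", rows
        )
        # Advance the watermark to the newest row just staged.
        conn.execute(
            "UPDATE etl_control SET last_loaded_at = ? WHERE process_name = ?",
            (rows[-1][1], process_name),
        )
    conn.commit()
    return len(rows)
```

Because the insert and the watermark update are committed together, a failed run leaves the mark where it was and the next run picks up the same rows.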
Process Tables
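A process table can be sketched as one row per execution, written by a small wrapper around each load. The example below uses Python's sqlite3 and contextlib; the table etl_process_log, its columns, and the helper logged_process are hypothetical. Start and end times support duration and SLA reporting, and the three row counts support record accounting.

```python
import sqlite3
import time
from contextlib import contextmanager


def create_process_log(conn):
    """Create a hypothetical process-execution log table."""
    conn.execute("""
        CREATE TABLE etl_process_log (
            run_id        INTEGER PRIMARY KEY,
            process_name  TEXT NOT NULL,
            started_at    REAL NOT NULL,   -- epoch seconds
            ended_at      REAL,
            status        TEXT NOT NULL,   -- RUNNING / SUCCEEDED / FAILED
            rows_read     INTEGER,
            rows_written  INTEGER,
            rows_rejected INTEGER
        )""")


@contextmanager
def logged_process(conn, process_name):
    """Log the start of a run, yield a stats dict, then log how the run ended."""
    cur = conn.execute(
        "INSERT INTO etl_process_log (process_name, started_at, status) "
        "VALUES (?, ?, 'RUNNING')",
        (process_name, time.time()),
    )
    conn.commit()
    stats = {"run_id": cur.lastrowid, "rows_read": 0,
             "rows_written": 0, "rows_rejected": 0}
    status = "FAILED"  # stays FAILED unless the body finishes without raising
    try:
        yield stats
        status = "SUCCEEDED"
    finally:
        conn.execute(
            "UPDATE etl_process_log SET ended_at = ?, status = ?, rows_read = ?, "
            "rows_written = ?, rows_rejected = ? WHERE run_id = ?",
            (time.time(), status, stats["rows_read"], stats["rows_written"],
             stats["rows_rejected"], stats["run_id"]),
        )
        conn.commit()
```

A load would run inside `with logged_process(conn, "load_orders") as stats:` and update the counts as it goes; rows_read should equal rows_written plus rows_rejected when every record is accounted for.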
Error Logging Reference 2
Error Logging
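For illustration, an error table might record one row per failed record, tied back to the run that produced it. The sketch below uses Python's sqlite3 module; the table etl_error_log, its columns, and both helper functions are assumptions and are not taken from the cited reference. The summary function groups errors by description, which is the kind of text a notification could carry.

```python
import sqlite3


def create_error_log(conn):
    """Create a hypothetical error table keyed to a process run."""
    conn.execute("""
        CREATE TABLE etl_error_log (
            error_id          INTEGER PRIMARY KEY,
            run_id            INTEGER NOT NULL,   -- ties the error to a process run
            process_name      TEXT NOT NULL,
            record_key        TEXT,               -- identifies the offending record
            error_description TEXT NOT NULL,
            logged_at         TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
        )""")


def log_error(conn, run_id, process_name, record_key, description):
    """Write one error row for one rejected or failed record."""
    conn.execute(
        "INSERT INTO etl_error_log "
        "(run_id, process_name, record_key, error_description) "
        "VALUES (?, ?, ?, ?)",
        (run_id, process_name, record_key, description),
    )
    conn.commit()


def notification_lines(conn, run_id):
    """Summarize a run's errors as short lines suitable for an alert."""
    rows = conn.execute(
        "SELECT process_name, error_description, COUNT(*) "
        "FROM etl_error_log WHERE run_id = ? "
        "GROUP BY process_name, error_description "
        "ORDER BY process_name, error_description",
        (run_id,),
    ).fetchall()
    return [f"{name}: {count} x {desc}" for name, desc, count in rows]
```

An empty list from notification_lines means the run logged no errors, so no alert needs to be sent.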
Summary: Stage = Exciting. Maintain security considerations in stage. Table design can reduce impact on source system. Stage can decrease the complexity of target load. Stage can be used for recovery and reload. Use stage to limit risk of data quality issues.
Questions
References
1 – KimballGroup.com, Design Tip #99. http://www.kimballgroup.com/2008/03/design-tip-99-staging-areas-and-etl-tools/
2 – Erik Veerman, Jessica M. Moss, Brian Knight, Jay Hackney. 2008. SQL Server 2008 Integration Services: Problem, Design, Solution.