SLOWLY CHANGING DIMENSIONS
  Features vs. Performance
  Benjamin Sigursteinsson
  Miracle Iceland
Who am I?
  Database programmer since 1987
  BI/DW since 1997
  Mostly Oracle to begin with
  SQL Server entered the DW picture in 2005
  DW projects in the US, Europe, the Middle East and of course Iceland
  Miracle Iceland since 2003
Structure of session
  Short overview of SCDs (5 mins)
    What are they?
    What is the problem?
  Demonstration of 3-4 common approaches
    Standard SCD wizard using SSIS
    3rd-party SCD approach (Kimball)
    T-SQL approach using MERGE
    Manual SSIS approach
SCD types
  Assume we all know what a dimension is
  Basically 3 types of dimensions
  Will not bother with type 3
SCD type 1
  A „regular“ dimension. Nothing special here.
  No history kept; behaves as most OLTP systems do
  Benefits
    Changes are overwritten
    PK is usually an integer, but could be the business key, such as an SSN for a customer
    Simple
  Drawbacks
    We lose history with each update
SCD type 2
  History kept
  Additional columns added to track changes: ValidFrom, ValidTo, isCurrent
  Primary key is always an integer of some sort
  Benefit
    We can see the status as it was in the past
  Drawbacks
    Grows big
    Updating is slower
    Complex to maintain
    Can increase the number of dimensions (current value dimensions)
    Can confuse end users if not properly presented
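SCD type 1 - Example
  CustomerDim, handling of SCD1
  A change is made and the name of Coke is changed to Coca-Cola
  (Cells shown as ... were lost when the slide was extracted.)

  Before:
    CustomerPK  CustomerBK  Name       Zip
    100         ABC         Snapple    ...
    ...         DEF         Coke       ...
    ...         GHI         Pepsi      10012

  After:
    CustomerPK  CustomerBK  Name       Zip
    100         ABC         Snapple    ...
    ...         DEF         Coca-Cola  ...
    ...         GHI         Pepsi      10012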
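The overwrite behaviour in the example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the demo code: the dictionary stands in for CustomerDim, the function name `apply_scd1` is hypothetical, and the surrogate keys 200 and 300 are borrowed from the type 2 example slide because they are not legible on this one.

```python
# Hypothetical in-memory stand-in for CustomerDim, keyed by business key.
# The CustomerPK values for DEF and GHI are assumptions (see lead-in).
customer_dim = {
    "ABC": {"CustomerPK": 100, "Name": "Snapple"},
    "DEF": {"CustomerPK": 200, "Name": "Coke"},
    "GHI": {"CustomerPK": 300, "Name": "Pepsi"},
}


def apply_scd1(dim, source_rows):
    """Type 1: overwrite changed attributes in place, insert unseen keys."""
    next_pk = max((row["CustomerPK"] for row in dim.values()), default=0) + 1
    for src in source_rows:
        bk = src["CustomerBK"]
        if bk in dim:
            # Overwrite: the previous value is gone after this line.
            dim[bk]["Name"] = src["Name"]
        else:
            dim[bk] = {"CustomerPK": next_pk, "Name": src["Name"]}
            next_pk += 1
    return dim


apply_scd1(customer_dim, [{"CustomerBK": "DEF", "Name": "Coca-Cola"}])
```

After the call, the DEF row keeps its surrogate key and only carries the new name, which is exactly why history is lost with each update.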
SCD type 2 - Example
  (Cells shown as ... were lost when the slide was extracted.)

    CustomerPK  CustomerBK  Name       Zip  ValidFrom  ValidTo  Current
    100         ABC         Snapple    ...  ...        ...      Y
    200         DEF         Coke       ...  ...        ...      N
    300         GHI         Pepsi      ...  ...        ...      Y
    301         DEF         Coca-Cola  ...  ...        ...      Y

  A new record has been inserted for the changed customer and the old one has been expired.
  All new transactions will be on Coca-Cola, but the old ones will be on Coke.
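The expire-and-insert step in the example above can be sketched as follows. It is a minimal Python illustration under stated assumptions: the list stands in for CustomerDim, the function name `apply_scd2_change` is hypothetical, Zip is left out, and all dates are made-up placeholders because the ValidFrom and ValidTo values are not legible on the slide.

```python
from datetime import date

# Rows mirror the slide before the change. The dates are placeholders
# (assumptions), and Zip is omitted because its values are not legible.
customer_dim = [
    {"CustomerPK": 100, "CustomerBK": "ABC", "Name": "Snapple",
     "ValidFrom": date(2000, 1, 1), "ValidTo": None, "Current": "Y"},
    {"CustomerPK": 200, "CustomerBK": "DEF", "Name": "Coke",
     "ValidFrom": date(2000, 1, 1), "ValidTo": None, "Current": "Y"},
    {"CustomerPK": 300, "CustomerBK": "GHI", "Name": "Pepsi",
     "ValidFrom": date(2000, 1, 1), "ValidTo": None, "Current": "Y"},
]


def apply_scd2_change(dim, bk, new_name, load_date):
    """Type 2: expire the current version and append a new current row.

    Returns the new row, or None when nothing changed.
    """
    current = next(
        (r for r in dim if r["CustomerBK"] == bk and r["Current"] == "Y"),
        None,
    )
    if current is not None:
        if current["Name"] == new_name:
            return None  # no change, keep the current version as it is
        # Expire the old version instead of overwriting it.
        current["ValidTo"] = load_date
        current["Current"] = "N"
    new_row = {
        "CustomerPK": max((r["CustomerPK"] for r in dim), default=0) + 1,
        "CustomerBK": bk,
        "Name": new_name,
        "ValidFrom": load_date,
        "ValidTo": None,  # NULL expiry date marks the open-ended version
        "Current": "Y",
    }
    dim.append(new_row)
    return new_row


apply_scd2_change(customer_dim, "DEF", "Coca-Cola", date(2011, 10, 1))
```

The old Coke row stays in the table with Current = N, and the new Coca-Cola row receives surrogate key 301, matching the table on the slide.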
What are we looking for?
  Speed
  Logging
  Error handling
  Ease of use
  Flexibility
  Sources and destinations
Data we are using for demos
  Destination table is CustomerDim, 1.6 million records
    Street, Customerversion and Policy are SCD1
    ZIP code is SCD2
  Source is a single table with records, of which:
    SCD1 changes
    SCD2 changes
    500 new records
Very quickly – SSIS SCD Wizard
  Introduced in 2005
  Has been more or less unchanged since
  Inflexible
  Logging
  Slow
  Easy to use at first, but manual changes are lost when the wizard is re-run
  Only 1 data destination by default
  Demo
Kimball SCD component
  Name has been changed to Dimension Merge SCD?
  Todd McDermid
  Far better than the SSIS SCD in terms of logging
  More flexible
  Speed... needs some tweaking and depends on the number of updates
  Enhanced logging/auditing
  More choice of outputs
  Superior to the SSIS SCD in most aspects
  NULL expiry dates are not the only option; we can use other methods of identifying the current record
  Demo
T-SQL Merge
  MERGE statement added in SQL Server 2008
  A set-based operation, not a row-by-row one
  A classic UPSERT component
  Functionality similar to:
    Try to update the row
    If it fails, then insert it
  MERGE has an OUTPUT clause (SQL Server 2008 and later) that lets us do type 2 operations easily
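The "try to update, insert if that fails" behaviour can be sketched row by row. Note what this is and is not: it uses Python with SQLite as a stand-in, not T-SQL MERGE, which does the same work as a single set-based statement. The function name `upsert`, the table layout and the zip codes are assumptions made for the illustration. The returned action string is loosely analogous to what the OUTPUT clause reports per row.

```python
import sqlite3

# In-memory SQLite stand-in for CustomerDim (layout and zips are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE CustomerDim ("
    "CustomerBK TEXT PRIMARY KEY, Name TEXT, Zip TEXT)"
)
conn.execute("INSERT INTO CustomerDim VALUES ('DEF', 'Coke', '10011')")


def upsert(conn, bk, name, zip_code):
    """Try to update the row; if no row was touched, insert it instead."""
    cur = conn.execute(
        "UPDATE CustomerDim SET Name = ?, Zip = ? WHERE CustomerBK = ?",
        (name, zip_code, bk),
    )
    if cur.rowcount == 0:
        conn.execute(
            "INSERT INTO CustomerDim (CustomerBK, Name, Zip) VALUES (?, ?, ?)",
            (bk, name, zip_code),
        )
        return "INSERT"
    return "UPDATE"


actions = [
    upsert(conn, "DEF", "Coca-Cola", "10011"),  # existing business key
    upsert(conn, "JKL", "Fanta", "10013"),      # new business key
]
```

For type 2 handling, the rows reported as updated are the ones that need a new current version inserted, which is the role the OUTPUT clause plays on the slide.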
T-SQL Merge
  Very fast - fastest
  Flexible and „easy“
  My favourite
  Foreign keys are a problem if using the OUTPUT clause
    Drop them before the merge
    Enable them after the merge
  Limited logging and error handling, especially since you have to disable foreign keys in some cases
  Demo
Manual SSIS – from scratch
  Fast
  Not always so easy to implement
  Logging is decent, but has to be set up manually
  Demo
Roundup (1 is best)

                    Wizard  KSCD  Manual  Merge
    Speed*          4       3     2       1
    Logging         3       1     2       4
    Flexibility     4       1     1       3
    Destinations    2       1     1       4
    Sources         1       1     1       4
    Ease of use     1*      1     3       3
    Error handling  2       1     2       2

  Use your own judgement to evaluate (e.g. is speed more valuable than logging?)
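One way to apply your own judgement is to weight each criterion and add up the ranks, where the lowest total wins. The sketch below is only an illustration: the ranks are copied from the roundup table, but the weighting scheme and the function name `weighted_totals` are assumptions and not part of the session.

```python
# Ranks from the roundup table (1 is best).
RANKS = {
    "Speed":          {"Wizard": 4, "KSCD": 3, "Manual": 2, "Merge": 1},
    "Logging":        {"Wizard": 3, "KSCD": 1, "Manual": 2, "Merge": 4},
    "Flexibility":    {"Wizard": 4, "KSCD": 1, "Manual": 1, "Merge": 3},
    "Destinations":   {"Wizard": 2, "KSCD": 1, "Manual": 1, "Merge": 4},
    "Sources":        {"Wizard": 1, "KSCD": 1, "Manual": 1, "Merge": 4},
    "Ease of use":    {"Wizard": 1, "KSCD": 1, "Manual": 3, "Merge": 3},
    "Error handling": {"Wizard": 2, "KSCD": 1, "Manual": 2, "Merge": 2},
}


def weighted_totals(weights=None):
    """Sum rank * weight per approach; criteria without a weight count as 1."""
    weights = weights or {}
    totals = {}
    for criterion, ranks in RANKS.items():
        weight = weights.get(criterion, 1)
        for approach, rank in ranks.items():
            totals[approach] = totals.get(approach, 0) + weight * rank
    return totals


equal = weighted_totals()                     # every criterion counts the same
speed_first = weighted_totals({"Speed": 20})  # hypothetical: speed dominates
best_equal = min(equal, key=equal.get)
best_speed_first = min(speed_first, key=speed_first.get)
```

With equal weights the lowest total belongs to KSCD; with a heavy, arbitrarily chosen weight on speed the result flips to Merge. The weights themselves are a judgement call, which is the point of the slide.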
THANK YOU!
  For attending this session and PASS SQLRally Nordic 2011, Stockholm