Patterns and Best Practices in SSIS or, how to keep your DBA happy with your crazy-ass ETL
It’s all about me... I’m a SpeakingMentor !
Obligatory LOTR reference…
Mantra C# is *not* the only way to initialise a variable C# is *not* the only way to move files C# is *not* the only way to call a web service C# scripts are opaque to the SSIS runtime C#... <sigh> Mantra
Not a pipeline Data does not get passed to components (cough) Components manipulate blocks of data (true) Not a pipeline
What you *think* happens…. Get data Replace nulls Conditional Split Merge Sort
What actually happens… Get data Conditional Split Replace nulls Sort Merge
SSIS not being a pipeline
It’s all about speed… There are 2 transformation types: Synchronous – fastest (streaming and row-based) Asynchronous – slower And three ‘modes’: Non-blocking Semi-blocking Full blocking It’s all about speed…
Non-blocking synchronous streaming transforms Audit Cache Transform Character Map Conditional Split Copy Column Data Conversion Derived Column Multicast Percent Sampling Row Count Lookup Non-blocking synchronous streaming transforms
Non-blocking synchronous row transforms DQS cleansing Export Column Import Column OLEDB Command Script Task SCD Lookup Non-blocking synchronous row transforms
Wait, there’s two ‘Lookup’s ? Lookups are non-blocking streaming transforms when the ‘Full Cache’ option is used for the lookup data Using ‘Partial Cache’ or ‘No Cache’ options in the lookup make the Lookup a row-based transform, which is necessarily slower Wait, there’s two ‘Lookup’s ?
Semi-blocking asynchronous transforms Data Mining Merge Merge Join Pivot Unpivot Union All Term Lookup Semi-blocking asynchronous transforms
Full blocking asynchronous transforms Aggregate Fuzzy Grouping Fuzzy Lookup Row Sampling Sort Term Extraction Script Task Full blocking asynchronous transforms
Wait, there’s two ‘Script Task’s ?? Script tasks are non-blocking when they’re using an outside resource (i.e. not the data that you’re working on) They become blocking when they collect a dataset before sending it on to a destination Set the ‘SynchronousInputID’ property on the output columns to ‘None’ Wait, there’s two ‘Script Task’s ??
Large Data Sets BufferTempStoragePath BlobTempStoragePath These can either be set in your package template, or injected into the .dtsx Large Data Sets
Really ? You can just inject stuff ? Yes Find and Replace default entries with your custom requirement… any decent text editor will do. BLOBTempStoragePath="F:\astDisk\WithLoadsOfSpace_temp" bufferTempStoragePath="F:\astDisk\WithLoadsOfSpace_temp” Really ? You can just inject stuff ?
Things that make you go hmmm Logging Package Checkpoints Location, location, location Expressions Parallel Operations ‘IsSorted=True’ property Auditing Bad Data Things that make you go hmmm
Here’s one I prepared earlier… Time to dissect Here’s one I prepared earlier…
Take-aways – You should know.. Your data Your environment Transform types Ingest / Egress of data What you are trying to achieve When to give up and use C# Take-aways – You should know..