Helping Your Data Warehouse Succeed: 10 Mistakes to Avoid in Data Integration Rafael Salas w:
About Rafael
DW/BI – 12 years
SQL Server MVP
Solution Architect - Quaero, a CSG Systems Solution
Charlotte, NC
Lots of mistakes along the way!
Mistakes are the portals of discovery. James Joyce
Today’s Plan
10 mistakes to avoid
What, why, how to prevent them
Share real-life examples
No magic formulas
Mistake 1: Ignoring Data Realities …or finding them too late.
The Problem (Mistake 1)
Relying on common knowledge:
  –The data is ‘good’
  –I know this data well
We don’t have time
Cycle: Code → Load → Explode!
Research-Recode-Retest = Rework
The Fix (Mistake 1)
Compare the requirements against data profiling results
Clue: the business wants ‘good’ quality data:
  –Accurate
  –Timely
  –Relevant
  –Complete
  –Understood
  –Trusted
Benefits (Mistake 1)
Early awareness of data quality issues
Better ETL development estimates
Uncover new business rules
Better understanding of business requirements
How? (Mistake 1)
3rd-party tools
Hand-crafted SQL queries
SSIS: Data Profiling Task
  –Decent profiling
  –Get up and running quickly
  –SQL Server data sources only
  –Output is XML
  –Results can be loaded into a table, but XSLT is required
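The hand-crafted SQL option can be sketched as follows. This is a minimal illustration, not the SSIS Data Profiling Task: it uses Python's built-in sqlite3 module as a stand-in for the real source database, and the `customer` table, its columns, and the helper names `profile_column` and `build_demo_db` are all hypothetical.

```python
import sqlite3


def profile_column(conn, table, column):
    """Return basic profile metrics for one column.

    Table and column names are interpolated into the SQL text, so they
    must come from trusted metadata, never from user input.
    """
    sql = f"""
        SELECT COUNT(*),
               SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END),
               COUNT(DISTINCT {column}),
               MIN({column}),
               MAX({column})
        FROM {table}
    """
    row = conn.execute(sql).fetchone()
    keys = ("row_count", "null_count", "distinct_count", "min_value", "max_value")
    return dict(zip(keys, row))


def build_demo_db():
    """Create a tiny in-memory table with typical 'surprises' (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER, state TEXT)")
    conn.executemany(
        "INSERT INTO customer VALUES (?, ?)",
        [(1, "NC"), (2, "NC"), (3, None), (4, "nc"), (5, "SC")],
    )
    return conn


if __name__ == "__main__":
    conn = build_demo_db()
    # A missing value and an inconsistent casing ('nc') both show up
    # before any ETL code has been written.
    print(profile_column(conn, "customer", "state"))
```

Running a query like this per column, and comparing the numbers against what the requirements assume, is one inexpensive way to surface data realities early.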
Mistake 2: Exception Handling …actually, the lack thereof.
The Problem (Mistake 2)
The data’s ‘ifs and buts’ nobody mentioned
Unreliable data sources
Missed homework: data profiling
  –Data type mismatches and overflow
  –Referential integrity
Cycle: Run → Fail → Patch
The Fix (Mistake 2)
Consider exceptions at different levels
  –Data/Database
  –Network
  –Operating system
Design a system-wide strategy
  –Design patterns, templates
Log and notify!
How? (Mistake 2)
Data/Database:
  –In SSIS: use data flow error outputs to redirect offending rows
Network:
  –Pre-process: test connectivity
  –In SSIS: event handlers, precedence constraints with conditional logic
O/S:
  –Pre-process: validate that disk space, files, etc. are available
  –In SSIS: event handlers, precedence constraints with conditional logic
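The idea behind redirecting offending rows can be shown outside SSIS. The sketch below is plain Python, not SSIS itself: rows that fail conversion go to a separate 'error output' with the reason attached, instead of failing the whole load. The row layout, the 16-bit `quantity` limit, and the function names are assumptions made for the illustration.

```python
from datetime import date


def convert_row(raw):
    """Convert a raw source row (all strings) to typed values; raises on bad data."""
    quantity = int(raw["quantity"])
    # Hypothetical target column is a 16-bit integer, so check for overflow.
    if not -32768 <= quantity <= 32767:
        raise OverflowError(f"quantity {quantity} does not fit a 16-bit column")
    return {
        "order_id": int(raw["order_id"]),
        "quantity": quantity,
        "order_date": date.fromisoformat(raw["order_date"]),
    }


def split_rows(raw_rows):
    """Return (good_rows, rejected_rows); rejected rows keep the original data."""
    good, rejected = [], []
    for raw in raw_rows:
        try:
            good.append(convert_row(raw))
        except (KeyError, TypeError, ValueError, OverflowError) as exc:
            # Keep the untouched source row plus the reason, so the
            # rejects can be logged, reported, and reprocessed later.
            rejected.append({"row": raw, "error": f"{type(exc).__name__}: {exc}"})
    return good, rejected


if __name__ == "__main__":
    rows = [
        {"order_id": "1", "quantity": "5", "order_date": "2012-03-01"},
        {"order_id": "2", "quantity": "five", "order_date": "2012-03-01"},
    ]
    good, rejected = split_rows(rows)
    print(len(good), "loaded;", len(rejected), "redirected")
```

The rejected list plays the role of the error output: the load keeps going, and the bad rows end up somewhere they can be logged and reported on.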
Mistake 3: Inadequate Logging …what, when, how?
The Problem (Mistake 3)
No/little logging
Too much logging
Meaningless logging
Logging is needed for:
  –Error troubleshooting
  –Execution monitoring
  –Performance tracking
  –Auditing
The Fix (Mistake 3)
Add logging capabilities. Start with key events, add more as needed
  –Start and end dates and times
  –Row counts
  –On Error
  –On Warning
Create reports on top of logging tables
Don’t forget to clean/prune logs
Logging I/O is expensive
The Fix (Mistake 3), continued
SSIS logging
SSIS event handlers
Be aware of the concept of containers in SSIS: events ‘bubble up’
Logging has to be included in each package
  –Use package templates
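As a rough illustration of 'start with key events', the sketch below records start and end times, a row count, and errors for each step in a logging table, and prunes old entries. It is plain Python with sqlite3 standing in for the warehouse database, not SSIS logging; the `etl_log` table and the helpers `create_log_table`, `logged_step`, and `prune_log` are hypothetical names.

```python
import sqlite3
from contextlib import contextmanager
from datetime import datetime, timedelta, timezone


def _now():
    # UTC timestamps in one fixed format, so they sort correctly as text.
    return datetime.now(timezone.utc).isoformat(timespec="seconds")


def create_log_table(conn):
    conn.execute(
        """CREATE TABLE IF NOT EXISTS etl_log (
               log_id     INTEGER PRIMARY KEY,
               step_name  TEXT NOT NULL,
               started_at TEXT NOT NULL,
               ended_at   TEXT,
               row_count  INTEGER,
               status     TEXT NOT NULL,
               message    TEXT)"""
    )


@contextmanager
def logged_step(conn, step_name):
    """Log start, end, row count, and any error for one ETL step."""
    cur = conn.execute(
        "INSERT INTO etl_log (step_name, started_at, status) VALUES (?, ?, 'RUNNING')",
        (step_name, _now()),
    )
    log_id = cur.lastrowid
    conn.commit()
    stats = {"row_count": None}  # the step fills this in
    try:
        yield stats
    except Exception as exc:
        conn.execute(
            "UPDATE etl_log SET ended_at = ?, status = 'FAILED', message = ? "
            "WHERE log_id = ?",
            (_now(), str(exc), log_id),
        )
        conn.commit()
        raise  # the failure is logged, not swallowed
    else:
        conn.execute(
            "UPDATE etl_log SET ended_at = ?, status = 'SUCCEEDED', row_count = ? "
            "WHERE log_id = ?",
            (_now(), stats["row_count"], log_id),
        )
        conn.commit()


def prune_log(conn, keep_days):
    """Delete log rows older than keep_days; returns the number removed."""
    cutoff = (datetime.now(timezone.utc) - timedelta(days=keep_days)).isoformat(
        timespec="seconds"
    )
    cur = conn.execute("DELETE FROM etl_log WHERE started_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_log_table(conn)
    with logged_step(conn, "load_customers") as stats:
        stats["row_count"] = 3  # a real step would count the rows it loaded
    print(conn.execute("SELECT step_name, status, row_count FROM etl_log").fetchall())
```

Because every step goes through the same wrapper, reports on the logging table stay consistent, which is the same motivation as using package templates.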
Mistake 4: No Recovery & Restart …Game Over!
The Problem (Mistake 4)
Restarting after failure is not automated
It requires manual clean-up of partial results
Prone to human error
May require restarting the process from the beginning
Risk of ‘skipping’ data
Risk of duplicating data
The Fix (Mistake 4)
Create restart-ability points
Consider piggybacking on logging
Use ternary logic at each recovery point:
  –Skip
  –Run
  –Clean-up and re-run
Staging source data is handy
Custom
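The skip / run / clean-up-and-re-run decision can be sketched in a few lines. This is one possible reading of the ternary logic, in plain Python; the status values ('SUCCEEDED', 'RUNNING', 'FAILED'), the in-memory status dictionary, and the function names are assumptions. In practice the last status of each step would more likely be read from the logging tables.

```python
def decide_action(last_status):
    """Map the last recorded status of a step to one of three actions."""
    if last_status == "SUCCEEDED":
        return "skip"  # already done in the interrupted run
    if last_status is None:
        return "run"  # never started, nothing to clean up
    return "cleanup_and_rerun"  # RUNNING or FAILED: partial results may exist


def run_with_restart(steps, status_by_step):
    """Run steps in order, resuming from wherever the previous run stopped.

    steps: list of (name, run_fn, cleanup_fn) tuples.
    status_by_step: dict of step name -> last status; updated in place.
    Returns the list of (name, action) decisions that were taken.
    """
    actions = []
    for name, run_fn, cleanup_fn in steps:
        action = decide_action(status_by_step.get(name))
        actions.append((name, action))
        if action == "skip":
            continue
        if action == "cleanup_and_rerun":
            cleanup_fn()  # remove partial results first to avoid duplicates
        status_by_step[name] = "RUNNING"
        try:
            run_fn()
        except Exception:
            status_by_step[name] = "FAILED"
            raise
        status_by_step[name] = "SUCCEEDED"
    return actions


if __name__ == "__main__":
    # Hypothetical previous run: extract finished, transform died half-way.
    statuses = {"extract": "SUCCEEDED", "transform": "FAILED"}
    steps = [
        (name, lambda n=name: print("run", n), lambda n=name: print("clean up", n))
        for name in ("extract", "transform", "load")
    ]
    print(run_with_restart(steps, statuses))
```

Skipping completed steps avoids duplicating data, and cleaning up before a re-run avoids building on partial results, the two risks named in the problem slide.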
Mistake 5: Staging Area Unauthorized Use …could cause injuries.
The Problem (Mistake 5)
Failing to understand that the staging area is a ‘construction zone’
Reports and applications accessing staging data
Using staging tables as an on-line data archive
The Fix (Mistake 5)
Easy: keep the staging area off-limits
Make all required data available in the data presentation layer
Keep staging data available only for the required time
Use appropriate data aging and archiving policies and processes
How not to write a report? A classic example (Mistake 5)
Mistake 6: Performance: Losing the focus …
Very fast, but… (Mistake 6)
Mistake 7: Vanity Testing …good for feeling awesome.
Mistake 8: No Portability …deployment in progress!
Mistake 9: Forgetting the Owner’s Manual …aka the beloved documentation.
Mistake 10: Missing the Bigger Picture …the architecture.
The Problem (Mistake 10)
Jumping to coding without a blueprint
Break it down into groups of tasks
List all tasks and functionality you can’t live without
Place the tasks in the appropriate group
The Fix (Mistake 10)
Create an attack plan
Embrace an architecture
Divide and conquer!
List all tasks and functionality you require
Place the tasks in the appropriate group
An example (Mistake 10)
Extract: Changed data capture, Data staging
Transform: Data cleansing, Other data transformations, Deduplication, Exception handling
Load: Data load, Load aggregates, OLAP cube processing
ETL Management: Job scheduler, Recovery & Restart, Activity monitor, ABC support, Backup, Data error tracking, Other post-load actions, Alerting, Security, Compliance