Base SAS® vs. SAS® Data Integration Studio
Greg Nelson and Danny Grasse
Outline
– Overview: understanding ETL
– What SAS approaches do we have?
– 38 best practices: Kimball's 38 subsystems
– Key areas for comparison:
  1. Design and data profiling
  2. Source data extraction
  3. Transformation and loading
  4. Change data capture
  5. Quality review, auditing, and exception handling/management
  6. Integration with the production environment and business process components
– Summary and conclusions
Overview
– ETL and Data Warehousing 101
– Data Integration Studio
– A "consistent" version of the truth
– Credible information versus data quality
Corporate Information Factory
Ralph Kimball
– A history of excellence
– Father of "dimensional" data warehouse design
– Author of The Data Warehouse Toolkit (I and II), The Data Warehouse Lifecycle Toolkit, and The Data Warehouse ETL Toolkit
The Data Integration Process
38 Subsystems
The 38 subsystems define your ETL strategy:
– Design and data profiling
– Source data extraction
– Transformation and loading
– Change data capture
– Quality review, auditing, and exception handling/management
– Integration with the production environment and business process components
38 Subsystems: Category 1 (Design and Data Profiling)
Design and Data Profiling
– Design is often handled through "soft skills" in the SAS ecosystem
– Data profiling:
  – Base SAS – frequencies, crosstabs, macros, and toolkits
  – DataFlux software – data profiling on steroids
  – DI Studio – data profiling currently supported through generalized metadata
– Data profiling is an analysis exercise, not a technical one (see the sketch after this slide)
3. Data profiling system: column property analysis, including discovery of inferred domains, and structure analysis, including candidate foreign key to primary key relationships, data rule analysis, and value rule analysis.
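A minimal Base SAS profiling sketch of the kind the slide alludes to; the dataset work.customers and its columns (state, gender, age, income) are hypothetical:

  /* Column properties and types */
  proc contents data=work.customers;
  run;

  /* Value distributions, useful for discovering inferred domains */
  proc freq data=work.customers;
    tables state gender / missing;
  run;

  /* Numeric ranges and missing-value counts */
  proc means data=work.customers n nmiss min max mean;
    var age income;
  run;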
38 Subsystems: Category 2 (Source Data Extraction)
Source Data Extraction
– Source data adapters (including data conversions and filters)
  – SAS/ACCESS products
  – DATA step, SQL, and DI Studio
– Push/pull/dribble – how we move the data
– Filtering and sorting – how we select the data to be moved
– Data staging (versus accessing) – how we land the data (see the sketch after this slide)
1. Extract system: source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to the ETL environment.
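A sketch of a pull-style extract that filters at the source and stages the result, assuming an Oracle source; the connection options, the sales.orders table, and the staging path are all hypothetical:

  /* Register the source via a SAS/ACCESS engine, and a landing area */
  libname src oracle user=etl_user password=xxxx path=proddb schema=sales;
  libname stage '/data/warehouse/staging';

  /* Filter at the source and land the rows in a staging table */
  proc sql;
    create table stage.orders_extract as
    select order_id, customer_id, order_date, amount
    from src.orders
    where order_date >= '01JAN2006'd;
  quit;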
38 Subsystems: Category 3 (Transformation and Loading)
Transformation and Loading
5. Data conformer.
9. Surrogate key creation system.
12. Fixed hierarchy dimension builder.
13. Variable hierarchy dimension builder.
14. Multivalued dimension bridge table builder.
15. Junk dimension builder.
16. Transaction grain fact table loader.
17. Periodic snapshot grain fact table loader.
18. Accumulating snapshot grain fact table loader.
19. Surrogate key pipeline.
20. Late arriving fact handler.
21. Aggregate builder.
22. Multidimensional cube builder.
23. Real-time partition builder.
24. Dimension manager system.
25. Fact table provider system.
Transformation and Loading
– Conforming dimensions and facts
  – Good old SAS code (DATA step, SQL, formats) – all available in DI Studio
  – Good design
– Creation of surrogate keys (see the sketch after this slide)
  – Hand-coded in Base SAS
  – "Automagic" in DI Studio
– Building summary tables and cubes
  – Just another "target" in DI Studio
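A minimal sketch of hand-coded surrogate key creation in Base SAS; the dimension dim_customer, its key customer_sk, and the incoming new_customers rows are hypothetical:

  /* Look up the current high-water mark for the surrogate key */
  proc sql noprint;
    select coalesce(max(customer_sk), 0) into :max_sk
    from dim_customer;
  quit;

  /* Assign sequential keys to the new dimension rows */
  data new_dim_rows;
    set new_customers;
    customer_sk = &max_sk + _n_;
  run;

  /* Load: append the keyed rows to the dimension */
  proc append base=dim_customer data=new_dim_rows;
  run;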
38 Subsystems: Category 4 (Change Data Capture)
Change Data Capture
2. Change data capture system: source log file readers, source date and sequence number filters, and CRC-based record comparison in the ETL system.
10. Slowly Changing Dimension (SCD) processor: transformation logic for handling the three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).
11. Late arriving dimension handler: insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.
Change Data Capture
When we see new data coming in from the operational system, we have to decide how to handle that change. We have three options:
– Overwrite or update the old value (Type 1).
– Create a new record, together with some mechanism for recreating historical references to that data, depending on the as-of date of the report being requested (Type 2).
– Retain both values as alternative "realities": we usually create a new column and put the old value in it, allowing reports to use either version (Type 3).
Change Data Capture
SAS approaches:
– Base SAS – very robust using macros; you can control everything about the load
– SAS does support CRC-based record comparisons (the MD5 function) – see the sketch after this slide
– DI Studio has limited coverage:
  – Three loading techniques: update, refresh, append
  – Type 1 and Type 2 are dropdown choices; Type 2 SCD is a transform
  – Type 3 is not supported outside of transform code
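A minimal sketch of MD5-based record comparison driving change data capture; dim_customer, stage.customers, and the compared columns are hypothetical, and both tables are assumed sorted by customer_id:

  data changes;
    merge dim_customer (in=old keep=customer_id name city
                        rename=(name=old_name city=old_city))
          stage.customers (in=new);
    by customer_id;
    length change_type $6;
    /* CRC-style comparison: hash the tracked columns of each version */
    old_crc = put(md5(catx('|', old_name, old_city)), $hex32.);
    new_crc = put(md5(catx('|', name, city)), $hex32.);
    if new and not old then change_type = 'INSERT';
    else if new and old and new_crc ne old_crc then change_type = 'UPDATE';
    else delete;   /* unchanged or source-deleted rows are ignored here */
  run;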
38 Subsystems: Category 5 (Quality and Exception Handling)
Quality Handling
4. Data cleansing system: typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. "De-duplication," including identification and removal, usually of individuals and organizations, possibly products or locations; often uses fuzzy logic. "Surviving," using specialized data merge logic that preserves specified fields from certain sources as the final saved versions, maintaining back references (such as natural keys) to all participating original sources.
6. Audit dimension assembler: assembly of the metadata context surrounding each fact table load, such that the metadata context can be attached to the fact table as a normal dimension.
7. Quality screen handler: in-line ETL tests applied systematically to all data flows, checking for data quality issues; one of the feeds to the error event handler (see subsystem 8).
8. Error event handler: a comprehensive system for reporting and responding to all ETL error events, including branching logic to handle various classes of errors and real-time monitoring of ETL data quality.
Quality Handling
– Detecting errors
– Handling them
– Providing audit records
Quality Management
– Detecting errors: SAS errors versus data errors
– DataFlux
  – Data rationalization
  – At the point of data entry
– Base SAS
  – If-then-else routines (lookup tables, formats) – see the sketch after this slide
– DI Studio
  – Not much beyond Base SAS
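A minimal sketch of if-then-else data screening with a lookup format, splitting rows into clean and exception tables; stage.orders, its columns, and the list of valid state codes are hypothetical:

  /* Lookup table of valid codes, expressed as a format */
  proc format;
    value $okstate 'NC','NY','CA' = 'Y'
                   other           = 'N';
  run;

  data clean errors;
    set stage.orders;
    length reason $40;
    if amount <= 0 then do;
      reason = 'non-positive amount';  output errors;
    end;
    else if put(state, $okstate.) = 'N' then do;
      reason = 'unknown state code';   output errors;
    end;
    else output clean;
  run;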
Audit Trail
– Base SAS
  – Log parsing routines (see the sketch after this slide)
– DI Studio
  – Workspace server logs
– Event System
  – Detailed logs, summary logs, and event triggers
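A minimal sketch of a log-parsing routine that harvests ERROR and WARNING lines from a saved SAS log into an audit table; the log path and job name are hypothetical:

  data audit_log;
    infile '/logs/etl/nightly_load.log' truncover;
    input logline $char200.;
    /* Keep only lines that begin with ERROR: or WARNING: */
    if index(logline, 'ERROR:') = 1 or index(logline, 'WARNING:') = 1;
    jobname = 'nightly_load';
    scanned = datetime();
    format scanned datetime20.;
  run;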
Exception Handling
– Base SAS
  – Macros and PUT statements writing to the log file (see the sketch after this slide)
– DI Studio
  – Simple email, exception tables, and the log file
– Event System
  – Subscribe to events
  – Responds to errors, warnings, notes, and custom assertions
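A minimal sketch of macro-based exception handling that inspects the automatic SYSERR status after a step and writes a flagged message to the log; the step name and follow-up action are hypothetical:

  %macro check_step(stepname);
    %if &syserr > 4 %then %do;
      %put ERROR: ETL step &stepname failed (SYSERR=&syserr);
      /* a production job might insert into an exception table or abort here */
    %end;
    %else %put NOTE: ETL step &stepname completed cleanly;
  %mend check_step;

  proc sort data=work.orders;
    by order_id;
  run;
  %check_step(sort_orders)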
38 Subsystems: Category 6 (Productionization)
Productionization of SAS ETL
26. Job scheduler.
27. Workflow monitor.
28. Recovery and restart system.
29. Parallelizing/pipelining system.
30. Problem escalation system.
31. Version control system.
32. Version migration system.
33. Lineage and dependency analyzer.
34. Compliance reporter.
35. Security system.
36. Backup system.
37. Metadata repository manager.
38. Project management system.
Productionization of SAS
Version control, change control, promotion, backup, and recovery:
– None of these is available in Base SAS
– Version control – minimal for multi-developer access
– Change management – partially available in DI Studio
– Automated promotion – weak and/or not available
– Backup – the metadata server can be backed up (but there is no rollback feature)
Productionization of SAS
Scheduling, dependency management, and restartability, including parallelization:
– Provided by the LSF Scheduler
– Managed by the person doing the scheduling, not the person writing the code
– LSF provides parallelization, but also 'grid' computing with the associated 'pipelining' of steps
Productionization of SAS
Metadata management and impact analysis:
– Very good in DI Studio
Productionization of SAS
Project management and problem escalation:
– Not in scope for DI Studio
Summary
– DI Studio is a code generator
– It can do everything Base SAS can do
– Metadata and security are the key reasons to use DI Studio
– DI Studio writes "better" code in some cases
– Challenges: change control, and what happens when things go bad
ThotWave Technologies – Thinking Data
How to reach us:
Greg Nelson, CEO and Founder – greg@thotwave.com
Danny Grasse, Senior Consultant – dgrasse@thotwave.com