Base SAS® vs. SAS® Data Integration Studio
Greg Nelson and Danny Grasse


2 Outline
–Overview of Understanding ETL
–What SAS approaches do we have?
–38 Best Practices
–Key Areas for Comparison
 1. Design and data profiling
 2. Source data extraction
 3. Transformation and loading
 4. Change data capture
 5. Quality review, auditing and exception handling/management
 6. Integration with the production environment and business process components
–Summary and Conclusions

3 Overview
ETL
Data Warehousing 101
Data Integration Studio
“Consistent” version of the truth
Credible information versus data quality

4 Corporate Information Factory

5 Ralph Kimball
History
Excellence
Father of “dimensional” data warehouse design
The Data Warehouse Toolkit (I and II)
The Data Warehouse Lifecycle Toolkit
The Data Warehouse ETL Toolkit

6 The Data Integration Process

7 38 Subsystems
38 subsystems define your ETL strategy:
–Design and data profiling
–Source data extraction
–Transformation and loading
–Change data capture
–Quality review, auditing and exception handling/management
–Integration with the production environment and business process components

8 38 Subsystems: Category 1
Design and data profiling
Transformation and loading
Change data capture
Source data extraction
Quality and exception handling
Productionization

9 Design and Data Profiling
Design is often handled through “soft skills” in the SAS ecosystem
Data profiling
–Base SAS – frequencies, crosstabs, macros and toolkits (a profiling sketch follows below)
–DataFlux software – data profiling on steroids
–DI Studio – data profiling currently supported through generalized metadata
Data profiling is an analysis exercise, not a technical one
3. Data profiling system. Column property analysis including discovery of inferred domains, and structure analysis including candidate foreign key to primary key relationships, data rule analysis, and value rule analysis.
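A minimal Base SAS profiling sketch along the lines of the bullet above; the SRC library and the CUSTOMER table and columns are hypothetical placeholders, not anything from the paper.

```sas
/* Minimal data-profiling sketch in Base SAS.                        */
/* The SRC library and the CUSTOMER table/columns are hypothetical.  */

/* Frequency distributions: spot unexpected codes and missing values */
proc freq data=src.customer;
   tables gender state / missing nocum;
run;

/* Numeric profile: counts, missing counts, ranges and means */
proc means data=src.customer n nmiss min max mean;
   var age credit_limit;
run;

/* Candidate-key check: any duplicated natural keys? */
proc sql;
   select customer_id, count(*) as n_rows
   from src.customer
   group by customer_id
   having count(*) > 1;
quit;
```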

10 38 Subsystems: Category 2
Design and data profiling
Transformation and loading
Change data capture
Source data extraction
Quality and exception handling
Productionization

11 Source Data Extraction
Source data adapters (including data conversions and filters)
–SAS/ACCESS products
–DATA step, SQL and DI Studio (an extraction sketch follows below)
Push/pull/dribble
–How we move the data
Filtering and sorting
–How we select the data to be moved
Data staging (versus accessing)
–How we land the data
1. Extract system. Source data adapters, push/pull/dribble job schedulers, filtering and sorting at the source, proprietary data format conversions, and data staging after transfer to ETL environment.
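As a hedged illustration of the extract pattern (adapter, source-side filtering, then staging), the sketch below uses a SAS/ACCESS LIBNAME engine; the engine choice, connection options and all library, table and column names are placeholder assumptions.

```sas
/* Hedged extraction sketch: pull through a SAS/ACCESS LIBNAME engine, */
/* filter at the source, and land a staged copy. All names and         */
/* connection options are placeholders.                                */
libname src oracle user=etl_user password="&pw" path=prod_db schema=sales;
libname stage '/etl/stage';

proc sql;
   create table stage.orders_delta as
   select order_id, customer_id, order_dt, amount
   from src.orders
   where status = 'NEW'   /* simple filters are typically pushed to the source */
   order by order_id;     /* sort before landing the staged copy */
quit;

libname src clear;
```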

12 38 Subsystems: Category 3
Design and data profiling
Transformation and loading
Change data capture
Source data extraction
Quality and exception handling
Productionization

13 Transformation and Loading
5. Data conformer.
9. Surrogate key creation system.
12. Fixed hierarchy dimension builder.
13. Variable hierarchy dimension builder.
14. Multivalued dimension bridge table builder.
15. Junk dimension builder.
16. Transaction grain fact table loader.
17. Periodic snapshot grain fact table loader.
18. Accumulating snapshot grain fact table loader.
19. Surrogate key pipeline.
20. Late arriving fact handler.
21. Aggregate builder.
22. Multidimensional cube builder.
23. Real-time partition builder.
24. Dimension manager system.
25. Fact table provider system.

14 Transformation and Loading
Conforming dimensions and facts
–Good old SAS code (DATA step, SQL, formats) – all available in DI Studio
–Good design
Creation of surrogate keys
–Hand coded in Base SAS (see the sketch below)
–“Automagic” in DI Studio
Building summary tables and cubes
–Just another “target” in DI Studio
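A sketch of the "hand coded in Base SAS" surrogate key approach: look up the current maximum key, then number the incoming rows. The DW and STAGE libraries and all table and column names are hypothetical.

```sas
/* Hand-coded surrogate key assignment (a sketch; names are hypothetical). */

/* Capture the current maximum surrogate key in the dimension */
proc sql noprint;
   select coalesce(max(customer_sk), 0) into :max_sk
   from dw.customer_dim;
quit;

/* Assign sequential keys to the new dimension rows */
data new_dim_rows;
   set stage.new_customers;
   customer_sk = &max_sk + _n_;
run;

/* Append the keyed rows to the dimension */
proc append base=dw.customer_dim data=new_dim_rows;
run;
```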

15 38 Subsystems: Category 4
Design and data profiling
Transformation and loading
Change data capture
Source data extraction
Quality and exception handling
Productionization

16 Change Data Capture
2. Change data capture system. Source log file readers, source date and sequence number filters, and CRC-based record comparison in ETL system.
10. Slowly Changing Dimension (SCD) processor. Transformation logic for handling three types of time variance possible for a dimension attribute: Type 1 (overwrite), Type 2 (create new record), and Type 3 (create new field).
11. Late arriving dimension handler. Insertion and update logic for dimension changes that have been delayed in arriving at the data warehouse.

17 Change Data Capture
When we see new data coming in from the operational system, we have to decide how to handle that change. We have three options:
–We can overwrite or update the old value (Type I); a minimal overwrite sketch follows below.
–We can create a new record and provide some mechanism for recreating historical references to that data, depending on the date of the report being requested (Type II).
–We can retain both values as alternative “realities” (Type III): we usually create a new column and put the old value in it to allow for alternative views in reporting.
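As a minimal illustration of the Type I option, the PROC SQL sketch below overwrites a changed attribute in place; the tables and the EMAIL column are hypothetical.

```sas
/* Hedged Type I (overwrite) sketch: replace the old value in place.   */
/* dw.customer_dim and stage.customer_changes are hypothetical tables. */
proc sql;
   update dw.customer_dim as d
      set email = (select s.email
                   from stage.customer_changes as s
                   where s.customer_id = d.customer_id)
      where d.customer_id in
            (select customer_id from stage.customer_changes);
quit;
```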

18 Change Data Capture
SAS approaches
–Base SAS – very robust using macros
–Can control everything about the load
–DI Studio has limited coverage
–SAS does support CRC-based record comparisons (the MD5 function; see the sketch below)
DI Studio
–Three loading techniques: update, refresh, append
–Type I and Type II are dropdowns; Type II SCD is a transform
–Doesn't support Type III outside of transform code
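The sketch below illustrates the CRC-style comparison the slide mentions, using SAS's MD5 function to detect changed rows and then applying a Type II load (close the old row, append a new current row). All library, table and column names are hypothetical, and surrogate key assignment is omitted for brevity.

```sas
/* Hedged sketch of CRC-style change detection with the MD5 function  */
/* feeding a Type II load. Names are hypothetical; surrogate key      */
/* assignment is omitted.                                              */

/* Hash the tracked attributes on both sides */
data source_hash;
   set stage.customer_extract;
   row_hash = put(md5(catx('|', name, address, city, segment)), $hex32.);
run;

data dim_hash;
   set dw.customer_dim(where=(current_flag = 1));
   row_hash = put(md5(catx('|', name, address, city, segment)), $hex32.);
run;

/* Keep only the rows whose attribute hash has changed */
proc sql;
   create table changed as
   select s.*
   from source_hash as s inner join dim_hash as d
        on s.customer_id = d.customer_id
   where s.row_hash ne d.row_hash;

   /* Close out the superseded current rows */
   update dw.customer_dim
      set current_flag = 0, valid_to = today()
      where current_flag = 1
        and customer_id in (select customer_id from changed);
quit;

/* Append the changed rows as the new current versions */
data new_versions;
   set changed(drop=row_hash);
   current_flag = 1;
   valid_from   = today();
   valid_to     = .;
run;

proc append base=dw.customer_dim data=new_versions force;
run;
```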

19 38 Subsystems: Category 5
Design and data profiling
Transformation and loading
Change data capture
Source data extraction
Quality and exception handling
Productionization

20 Quality Handling
4. Data cleansing system. Typically a dictionary-driven system for complete parsing of names and addresses of individuals and organizations, possibly also products or locations. "De-duplication" including identification and removal, usually of individuals and organizations, possibly products or locations. Often uses fuzzy logic. "Surviving" using specialized data merge logic that preserves specified fields from certain sources to be the final saved versions. Maintains back references (such as natural keys) to all participating original sources.
6. Audit dimension assembler. Assembly of metadata context surrounding each fact table load in such a way that the metadata context can be attached to the fact table as a normal dimension.
7. Quality screen handler. In-line ETL tests applied systematically to all data flows, checking for data quality issues. One of the feeds to the error event handler (see subsystem 8).
8. Error event handler. Comprehensive system for reporting and responding to all ETL error events. Includes branching logic to handle various classes of errors, and includes real-time monitoring of ETL data quality.

21 Quality Handling
Detecting errors
Handling them
Providing audit records

22 Quality Management
Detecting errors
–SAS errors versus data errors
DataFlux
–Data rationalization
–At the point of data entry
Base SAS
–If/then/else routines (lookup tables, formats); a validation sketch follows below
DI Studio
–Not much beyond Base SAS
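A minimal Base SAS sketch of the "if/then/else with lookup tables and formats" approach to detecting data errors: a format acts as the lookup and failing rows are routed to an exception table. The format values, libraries and column names are illustrative only.

```sas
/* Minimal data-validation sketch: a format as a lookup table and    */
/* if/then logic that routes failing rows to an exception data set.  */
proc format;
   value $state_chk 'NC','SC','VA','GA' = 'OK'
                    other               = 'BAD';
run;

data clean_customers exceptions;
   set stage.customer_extract;
   length reason $40;
   if missing(customer_id) then do;
      reason = 'Missing natural key';
      output exceptions;
   end;
   else if put(state, $state_chk.) = 'BAD' then do;
      reason = 'Unknown state code';
      output exceptions;
   end;
   else output clean_customers;
run;
```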

23 Audit Trail
Base SAS
–Log parsing routines (see the sketch below)
DI Studio
–Workspace server logs
Event System
–Detailed logs, summary logs and event triggers
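A hedged sketch of a Base SAS log-parsing routine: scan a saved job log and keep ERROR and WARNING lines as an audit record. The log path and the AUDIT library are placeholders, and it assumes the job's log was written to a file (for example via PROC PRINTTO or the -log option).

```sas
/* Hedged log-parsing audit step; the log path and AUDIT library are  */
/* placeholders.                                                       */
data audit.log_issues;
   length logline $512 severity $8;
   infile '/etl/logs/nightly_load.log' truncover;
   input logline $char512.;
   if logline =: 'ERROR:' then severity = 'ERROR';
   else if logline =: 'WARNING:' then severity = 'WARNING';
   else delete;                 /* keep only the problem lines */
   run_dt = datetime();
   format run_dt datetime20.;
run;
```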

24 Exception Handling
Base SAS
–Macros, PUT statements in the log file (a macro sketch follows below)
DI Studio
–Simple email, exception tables and log file
Event System
–Subscribe to events
–Responds to errors, warnings, notes and custom assertions
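A hedged sketch of the macro-plus-PUT-statement style of exception handling: check the automatic SYSERR macro variable after each step and write a tagged message to the log. The macro name and the threshold are illustrative.

```sas
/* Hedged exception-handling sketch: inspect SYSERR after each step. */
%macro check_step(stepname);
   %if &syserr > 4 %then %do;
      %put ERROR: [&stepname] failed with SYSERR=&syserr - see the log above.;
      /* a real job would stop or branch here, e.g. skip downstream steps */
   %end;
   %else %do;
      %put NOTE: [&stepname] completed (SYSERR=&syserr).;
   %end;
%mend check_step;

/* Usage: call immediately after each DATA or PROC step */
proc sort data=stage.orders_delta out=work.orders_sorted;
   by order_id;
run;
%check_step(sort_orders)
```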

25 38 Subsystems: Category 6
Design and data profiling
Transformation and loading
Change data capture
Source data extraction
Quality and exception handling
Productionization

26 Productionization of SAS ETL
26. Job scheduler.
27. Workflow monitor.
28. Recovery and restart system.
29. Parallelizing/pipelining system.
30. Problem escalation system.
31. Version control system.
32. Version migration system.
33. Lineage and dependency analyzer.
34. Compliance reporter.
35. Security system.
36. Backup system.
37. Metadata repository manager.
38. Project management system.

27 Productionization of SAS
Version control, change control, promotion, backup and recovery
–None of this is available in Base SAS itself
–Version control – minimal for multi-developer access
–Change management – partially available in DI Studio
–Automated promotion – weak and/or not available
–Backup – the metadata server can be backed up (but there is no rollback feature)

28 Productionization of SAS
Scheduling, dependency management and restartability, including parallelization
–Provided by the LSF Scheduler
–Managed by the person doing the scheduling, not the person writing the code
–LSF provides parallelization, but also 'grid' computing with the associated 'pipelining' of steps

29 Productionization of SAS
Metadata management and impact analysis
–Very good in DI Studio

30 Productionization of SAS
Project management and problem escalation
–Not in scope for DI Studio

31 Summary
DI Studio is a code generator
It can do everything Base SAS can do
Metadata and security are the key reasons to use DI Studio
DI Studio writes "better" code in some cases
Challenges: change control and what happens when things go wrong

32 ThotWave Technologies
Thinking Data
How to reach us...
Greg Nelson, CEO and Founder
Danny Grasse, Senior Consultant