About Me
- csnapp@captechconsulting.com | @SnappSQL
- MCSE and PMP certified IT consultant, with CapTech since 2006 and over 14 years of Microsoft SQL Server experience
- Computer Science degree from the University of Richmond
- Master's degree in IT Management from the University of Virginia
- Founded my own MLB data analytics company
Topics We'll Cover
- Explanation of ETL and ELT Strategies
- Debate the Characteristics
- In Azure?
- How to Implement an ELT Architecture
- The Tactical Benefits: Superior Traceability, Reduced Execution Times, Extensible Design Fits
- Next Steps for Database Developers
Presentation Disclaimer
- I've chosen to focus on my SQL Server and SSIS experiences, but the concepts still apply to your database platform of choice
- ELTL is assumed to be the same as ELT; the extra L takes place but feels unnecessary
- Sources can be anything, but let's assume they're tabular
- Targets typically support transactional systems or business intelligence, but let's talk big data too!
Why the Debate?
- Landscape changes are driving the discussion: Volume, Variety, Velocity, Veracity, Value
- Analytic tools are changing the game too
- Codebase complexity and delivery times need to be controlled
- Key Difference: when and where the transformation step is performed
- Key Truth: data quality is always a concern
- Key Takeaway: business needs and technical capabilities still drive the data management decision
- Offer data to analysts and let them link it in their BI tool
- Consumers don't care as long as it's immediate and correct
- Technology or methodology never absolves you of analysis
Background Level Set - ETL
- Extract: data from disparate data sources
- Transform: data in the tool, in flight; modifications done in memory
- Load: data to the destination structure
Image source: https://tekclasses.com/difference-between-etl-and-elt-process/
Conventional ETL Design - Let's Debate?
- Graphical view of data pipelines
- Traditionally batch oriented, parallel, and scheduled
- Offers functionality not available in the RDBMS
- Requires specialized developer skills
- Business rules, cleansing, validation, filtering, joins, and lookups are all programmed into the tool
- Focus on a single destination model and work backwards
- Leverages a proprietary engine, potentially on a separate server
- Performance depends on component configurations, order of operations, and the memory-to-data-volume ratio
Background Level Set - ELT
- Extract: data from disparate data sources
- Load: data in its original form to a staging area on the target server
- Transform: data with a semantic layer and MERGE to the destination
Conventional ELT Design - Let's Debate?
- Considered to be a more modern approach
- Allows for more real-time processing
- Fewer volume and structure concerns
- Extract data fast, limiting source system strain
- Reuse the stage/processing area to load multiple target structures
- Consolidates the area where business and data quality rules exist
- Lays a foundation for a Data Lake, populating multiple destinations
- Leverages an RDBMS's transaction engine to work with data locally
- Can require more drive space and CPU, but reduces other hardware needs
- Maintaining a large repository is not quite so simple
Image source: https://tekclasses.com/difference-between-etl-and-elt-process/
But What About Azure?
- Options: Data Factory v2 Integration Runtimes, PolyBase, SSIS on a VM, AML, T-SQL, Hive, Spark
- Data Transfer Units are a major factor; scaling the destination's pricing tier throttles performance
- Scale Azure SQL DB to 25-50 DTUs per MB of bandwidth
- SQL DW offers a Massively Parallel Processing architecture; best to use PolyBase and leverage T-SQL on the DW
- Many "distributed" design considerations
- Loads from Hadoop/Data Lake? Consider Spark or Hive
https://docs.microsoft.com/en-us/azure/data-factory/concepts-integration-runtime
https://docs.microsoft.com/en-us/azure/data-factory/tutorial-deploy-ssis-packages-azure
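As a rough illustration of the "PolyBase plus T-SQL on the DW" recommendation above, here is a minimal sketch; the external data source, file format, schemas, and all table and column names are assumptions, not part of the original material.

```sql
-- Assumes an external data source and file format were already created against
-- the storage account; every name here is an illustrative placeholder.
CREATE EXTERNAL TABLE stg.SalesOrder_Ext
(
    SalesOrderID INT,
    CustomerID   INT,
    OrderDate    DATE,
    TotalDue     DECIMAL(19, 4)
)
WITH
(
    LOCATION    = '/sales/orders/',     -- folder in the data lake
    DATA_SOURCE = AzureDataLakeStore,   -- assumed external data source
    FILE_FORMAT = ParquetFileFormat     -- assumed file format
);

-- CTAS pulls the external data into the DW in parallel across distributions,
-- keeping the transform step in T-SQL on the target.
CREATE TABLE stg.SalesOrder
WITH (DISTRIBUTION = HASH(SalesOrderID))
AS
SELECT SalesOrderID, CustomerID, OrderDate, TotalDue
FROM stg.SalesOrder_Ext;
```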
My Proposed ELT Style Architecture
- Stage: use SSIS to truncate the stage table and run a 1:1 data flow task; simultaneously copy all source data about the transactions that changed
- Semantic Views: a SELECT statement that applies all business rules, joins, and transformations to the stage area; output columns match the format of a single target table
- T-SQL MERGE: source is the view, destination is the target table, joined on natural keys; can handle Type 1 or Type 2 deltas
- Key Benefits: faster development cycles and execution times; decreased cost and complexity of code maintenance; flexible enough to fit many scenarios
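To make the semantic view and MERGE steps concrete, here is a minimal T-SQL sketch of the pattern; the schemas, tables, and columns (stg.Customer, sem.Customer_Load, dbo.Customer, and so on) are hypothetical.

```sql
-- The semantic view applies every business rule, join, and transformation to
-- the staged copy of the source, shaped to match one target table.
CREATE VIEW sem.Customer_Load
AS
SELECT  s.CustomerNumber                    AS CustomerNaturalKey,
        UPPER(LTRIM(RTRIM(s.CustomerName))) AS CustomerName,
        r.RegionName,
        CAST(s.CreatedDate AS DATE)         AS CreatedDate
FROM    stg.Customer AS s
JOIN    stg.Region   AS r
        ON r.RegionCode = s.RegionCode;
GO

-- The MERGE uses the view as its source and the target table as its
-- destination, joined on the natural key (Type 1 handling shown here).
MERGE dbo.Customer AS tgt
USING sem.Customer_Load AS src
    ON tgt.CustomerNaturalKey = src.CustomerNaturalKey
WHEN MATCHED AND (tgt.CustomerName <> src.CustomerName
               OR tgt.RegionName   <> src.RegionName) THEN
    UPDATE SET tgt.CustomerName = src.CustomerName,
               tgt.RegionName   = src.RegionName
WHEN NOT MATCHED BY TARGET THEN
    INSERT (CustomerNaturalKey, CustomerName, RegionName, CreatedDate)
    VALUES (src.CustomerNaturalKey, src.CustomerName, src.RegionName, src.CreatedDate);
```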
Superior Traceability
- Stage is a 1:1 copy of the source: table names, column names, data types
- The same process is used by multiple data flows
- Views contain all the code: no longer tracing dozens of ETL tool components; code is more readable and reviews are faster
- Repeatable design patterns limit variability of implementation (an object-oriented methodology)
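As a small illustration of the 1:1 staging convention above, a stage table simply mirrors the source object; the table and its columns below are hypothetical, mirroring an assumed source table.

```sql
-- Stage table mirrors the source exactly: same name, same columns, same types.
-- Everything is nullable so a raw 1:1 copy can never fail on constraints.
CREATE TABLE stg.Customer
(
    CustomerNumber VARCHAR(20)   NULL,
    CustomerName   NVARCHAR(200) NULL,
    RegionCode     CHAR(3)       NULL,
    CreatedDate    DATETIME      NULL
);
```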
Reduced Execution Time
- Leverage TRUNCATE on stage: no waste of the transaction log; connections are opened and closed quickly
- Remove the use of blocking components: no components that cause wait times or require careful caching
- Use the MERGE command to move data locally: stages execute in parallel; fast at determining if and what needs to change, and much faster than the OLE DB Command or Slowly Changing Dimension components
- Source queries are small, fast, and leverage indexes: only stage data about the data that needs to be refreshed
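A hedged sketch of the staging step this slide describes follows; the table names and the watermark variable are illustrative, and in practice the date would come from the package's parameterization rather than a local DECLARE.

```sql
-- Minimally logged reset of the stage table before each run.
TRUNCATE TABLE stg.SalesOrder;

-- Illustrative watermark; normally supplied by the framework's date-range lookup.
DECLARE @LastLoadDate DATETIME = '2018-01-01';

-- Source-side query used by the SSIS data flow: small, index-friendly, and
-- limited to rows changed since the last successful load.
SELECT  SalesOrderID,
        CustomerID,
        OrderDate,
        TotalDue,
        LastModifiedDate
FROM    dbo.SalesOrder
WHERE   LastModifiedDate > @LastLoadDate;
```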
Extensible Design Fits
- Adaptable to the most common scenarios: some difference in "what to stage"; order of target loads matters
- Code reusability: dynamically generate the MERGE command; mapping table to translate vernacular between systems
- Frameworks too! Use an Execution Log to track all executing processes; consistent parameterization and lookup of date ranges; re-startability; data validation and retry-errors process
- Fits across Target Structures (3rd Normal Form Transactional, Star Schema, Reporting Mart) and Load Frequencies (One-Time Migration, Ongoing Refresh)
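As one hedged example of the framework pieces listed above, here is a minimal execution log and the date-range lookup that supports re-startability; all object, column, and process names are assumptions.

```sql
-- Minimal execution log used for re-startability and date-range lookups.
CREATE TABLE etl.ExecutionLog
(
    ExecutionLogID INT IDENTITY(1, 1) PRIMARY KEY,
    ProcessName    SYSNAME      NOT NULL,
    RangeStart     DATETIME2(3) NOT NULL,
    RangeEnd       DATETIME2(3) NOT NULL,
    StartTime      DATETIME2(3) NOT NULL DEFAULT SYSDATETIME(),
    EndTime        DATETIME2(3) NULL,
    Status         VARCHAR(20)  NOT NULL DEFAULT 'Running'  -- Running / Succeeded / Failed
);

-- Each package looks up where its last successful run ended, so a failed run
-- can simply be re-executed and pick up the same window.
SELECT  COALESCE(MAX(RangeEnd), '1900-01-01') AS LastSuccessfulRangeEnd
FROM    etl.ExecutionLog
WHERE   ProcessName = 'Load_Customer'
  AND   Status      = 'Succeeded';
```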
Next Steps for Database Developers
Consider:
- Scaling back on using complex ETL tools to house ETL logic
- Hardware sizing – more drive space and virtual memory on the target server
- Leveraging Visual Studio's Database Projects to manage your objects
- Data analysts and source-to-target documentation
Implement:
- Build queries that replicate existing ETL processes
- Pilot this process against a traditional approach
Questions?