ETL Patterns in the Cloud with Azure Data Factory
Mark Kromer, Senior Program Manager, Microsoft Azure Data Management (@kromerbigdata)
ETL Patterns in the Cloud: Important factors for success
- What is ETL? More than Extract, Transform, Load. It also covers scheduling, monitoring, maintenance, source control, CI/CD, and operationalization.
- Platform as a Service (ADF) vs. Infrastructure as a Service (IaaS/SSIS): provider-managed vs. self-managed.
- ELT or ETL? The difference is primarily highly parsed semantics. However, in the cloud the common pattern is to stage data in low-cost storage. It is not typically performant to process data in flight, particularly when crossing boundaries (on-prem, VNets, data centers, regions).
- Scale is very important in cloud ETL. Cloud projects assume elastic scale, and ETL is not immune to this expectation.
- Flexible schema is very important in cloud ETL. Assume “Big Data tenets”, aka “data chaos”: your data sources will change shape, size and volume. Often!
Cloud ETL Patterns with ADF
Easy-to-use Wizard for Copying Data at Scale
6/1/2019 6:40 AM. © Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Nightly ETL Data Loads, Code-free
- Design code-free ETL workflows
- Copy data from on-prem, other clouds and Azure
- Stage data for transformation
- Build visual data transformations
- Schedule triggers for your pipeline execution
- Monitor processes and configure alerts
- All within ADF
Build Resilient Data Flows with Schema Drift: Handling of Flexible Schemas
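The idea behind schema drift handling can be illustrated outside ADF with a minimal Python sketch. It is not how ADF Data Flow implements drift; the function name `unify_drifted_rows` and the sample columns are hypothetical. The sketch assumes rows arrive as dictionaries and shows a flow that accepts a newly added column instead of failing on it.

```python
from typing import Any


def unify_drifted_rows(
    batches: list[list[dict[str, Any]]],
) -> tuple[list[str], list[dict[str, Any]]]:
    """Merge batches whose columns differ into one column list and one row set."""
    columns: list[str] = []
    for batch in batches:
        for row in batch:
            for name in row:
                if name not in columns:
                    # A drifted (newly seen) column is accepted, not rejected.
                    columns.append(name)
    # Rows that predate a column get None for it.
    rows = [
        {name: row.get(name) for name in columns}
        for batch in batches
        for row in batch
    ]
    return columns, rows


# Hypothetical data: the source adds a "region" column overnight.
monday = [{"id": 1, "name": "Ana"}]
tuesday = [{"id": 2, "name": "Bo", "region": "EU"}]
columns, rows = unify_drifted_rows([monday, tuesday])
```

Older rows are padded with `None` so that downstream steps see one consistent shape regardless of when a column first appeared.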
Slowly Changing Dimension Scenario
- Common DW pattern to manage changing attributes of dimension members
- Graphically build a code-free SCD ETL pattern to load your data warehouse
- Connect directly to Azure SQL DB and Azure SQL DW
- Use Lookup, Surrogate Key, Derived Column and Select transforms
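For readers new to the pattern, here is a rough Python sketch of the logic that such a flow expresses. It assumes a Type 2 dimension (history kept by expiring the old row), which the slide does not specify. The names `DimRow`, `apply_scd2`, `customer_id` and `city` are hypothetical, and the code is not ADF's implementation.

```python
from datetime import date
from typing import Optional, TypedDict


class DimRow(TypedDict):
    sk: int                    # surrogate key
    customer_id: str           # business key
    city: str                  # tracked attribute
    valid_from: date
    valid_to: Optional[date]
    is_current: bool


def apply_scd2(dimension: list[DimRow], incoming: list[dict], load_date: date) -> list[DimRow]:
    """Apply Type 2 changes in place: expire changed rows and append new versions."""
    # Lookup: index the current dimension row for each business key.
    current = {row["customer_id"]: row for row in dimension if row["is_current"]}
    # Surrogate key: continue numbering after the highest existing key.
    next_sk = max((row["sk"] for row in dimension), default=0) + 1
    for src in incoming:
        existing = current.get(src["customer_id"])
        if existing is not None and existing["city"] == src["city"]:
            continue  # attribute unchanged, nothing to do
        if existing is not None:
            # Derived columns: close out the old version.
            existing["valid_to"] = load_date
            existing["is_current"] = False
        new_row = DimRow(
            sk=next_sk,
            customer_id=src["customer_id"],
            city=src["city"],
            valid_from=load_date,
            valid_to=None,
            is_current=True,
        )
        dimension.append(new_row)
        current[src["customer_id"]] = new_row
        next_sk += 1
    return dimension


# Hypothetical data: C1 moves city, C2 is a new member.
dim = [DimRow(sk=1, customer_id="C1", city="Oslo",
              valid_from=date(2019, 1, 1), valid_to=None, is_current=True)]
incoming = [{"customer_id": "C1", "city": "Bergen"},
            {"customer_id": "C2", "city": "Rome"}]
result = apply_scd2(dim, incoming, date(2019, 6, 1))
```

The comments map each step loosely onto the Lookup, Surrogate Key and Derived Column transforms named on the slide.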
Load Star Schema DW Scenario
- Classic ETL pattern is easy to build in ADF’s code-free Data Flow visual data transformation environment
- Add Aggregate transforms to produce calculations that you store in your analytical database schema
- Use the Join transform to combine data from multiple data sources and data streams inside your data flow
- Land your data in your Lake folders or directly in Azure SQL DW
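As an illustration only, the join-then-aggregate shape of a fact load could be sketched in plain Python as below. The function `build_sales_fact`, the field names and the integer prices are all hypothetical, and the code says nothing about how ADF executes these transforms.

```python
from collections import defaultdict


def build_sales_fact(orders: list[dict], products: list[dict]) -> list[dict]:
    """Join orders to products, then aggregate revenue by date and category."""
    # Join: index the product stream by its key.
    product_by_id = {p["product_id"]: p for p in products}
    totals: dict[tuple[str, str], int] = defaultdict(int)
    for order in orders:
        product = product_by_id.get(order["product_id"])
        if product is None:
            continue  # inner join: drop orders with no matching product
        key = (order["order_date"], product["category"])
        # Aggregate: sum revenue per (date, category) group.
        totals[key] += order["quantity"] * product["unit_price"]
    return [
        {"order_date": d, "category": c, "revenue": revenue}
        for (d, c), revenue in sorted(totals.items())
    ]


# Hypothetical data; prices are whole currency units to keep arithmetic exact.
products = [
    {"product_id": "P1", "category": "Bikes", "unit_price": 100},
    {"product_id": "P2", "category": "Bikes", "unit_price": 50},
]
orders = [
    {"order_date": "2019-06-01", "product_id": "P1", "quantity": 2},
    {"order_date": "2019-06-01", "product_id": "P2", "quantity": 1},
    {"order_date": "2019-06-01", "product_id": "P9", "quantity": 5},  # unknown product
    {"order_date": "2019-06-02", "product_id": "P1", "quantity": 1},
]
fact_rows = build_sales_fact(orders, products)
```

The resulting rows are what would be landed in a fact table or a lake folder.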
Data Lake Data Science Scenario
- ADF supports building visual data transformations against your data directly in Data Lake locations (e.g. Azure Blob Store, Azure Data Lake Store)
- Built-in handling of schema drift for frequent changes in data lake file formats, columns, and data types
- Perform data exploration and data profiling across your data lake in ADF Data Flow with interactive debug data preview
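To make "data profiling" concrete, a minimal Python sketch of a per-column profile follows. The function `profile_column` and the sample rows are hypothetical, and the statistics chosen here (row count, nulls, distinct values, min, max) are an assumption about what a basic profile contains, not a description of ADF's output.

```python
from typing import Any


def profile_column(rows: list[dict[str, Any]], column: str) -> dict[str, Any]:
    """Compute simple profile statistics for one column of a row set."""
    values = [row.get(column) for row in rows]        # a missing key counts as null
    present = [v for v in values if v is not None]
    return {
        "rows": len(values),
        "nulls": len(values) - len(present),
        "distinct": len(set(present)),
        "min": min(present, default=None),
        "max": max(present, default=None),
    }


# Hypothetical sample: one explicit null and one row missing the column entirely.
sample = [{"age": 30}, {"age": None}, {"age": 41}, {"age": 30}, {}]
age_profile = profile_column(sample, "age")
```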
Azure Data Factory Workflow: Data Pipelines/Control Flow
Conditional execution: Incremental Delta Data Copy
- If-Then, Lookup, Execute Pipeline
- Connection Managers
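An incremental (delta) copy is commonly built around a stored watermark. The following Python sketch shows that general pattern under stated assumptions: each source row carries a `modified` timestamp, and the function name `incremental_copy` is hypothetical. It mirrors the look-up-then-branch shape of the control flow but is not ADF pipeline code.

```python
from datetime import datetime


def incremental_copy(source_rows: list[dict], sink_rows: list[dict],
                     watermark: datetime) -> datetime:
    """Copy only rows modified after the watermark; return the new watermark."""
    # Lookup step: find rows that changed since the last successful run.
    changed = [row for row in source_rows if row["modified"] > watermark]
    # If-Then step: skip the copy entirely when there is nothing new.
    if not changed:
        return watermark
    sink_rows.extend(changed)
    # Advance the watermark to the newest row that was copied.
    return max(row["modified"] for row in changed)


# Hypothetical data: the previous run stopped at 30 May 2019.
source = [
    {"id": 1, "modified": datetime(2019, 5, 30)},
    {"id": 2, "modified": datetime(2019, 5, 31)},
    {"id": 3, "modified": datetime(2019, 6, 1)},
]
sink: list[dict] = []
first_watermark = incremental_copy(source, sink, datetime(2019, 5, 30))
second_watermark = incremental_copy(source, sink, first_watermark)  # no new rows
```

Running the copy a second time with the returned watermark moves no data, which is what makes the load safe to schedule repeatedly.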
Built-in source control support
Operationalize – Monitor your data pipelines
Use Templates to quickly get started
- Quickly get started with building data integration solutions.
- Avoid building the same workflows repeatedly. Simply instantiate a template.
- Improve developer productivity and reduce development time for repeat processes.
ADF Integration Runtime
- Activity Dispatch/Monitor
- Data Movement
- SSIS Package Execution
Azure Data Factory Service: Command & Control
[Diagram labels: Data Flow; UX & SDK (Authoring | Monitoring/Mgmt); Azure Cloud; Azure Data Factory Service (Scheduling | Orchestration | Monitoring); PaaS Cloud Host Integration Runtime; Installable Agent Integration Runtime; Cloud Apps, Svcs & Data; On Premises Apps & Data]
Azure Data Factory “Integration Runtime” deployed on premises for transformation and then moved to cloud
[Diagram labels: Customer 1; Customer 1 firewall border; “IR”; ADF; Foo; On-prem]
SSIS in ADF
Provision SSIS IR in ADF
Deployment via SSMS (on premises / in Azure)
Once connected, you can deploy projects/packages to SSIS PaaS from your local file system/SSIS on premises
Execution via SSMS
You can select some packages to execute on SSIS PaaS
Monitoring via SSMS (on premises / in Azure)
You can see package execution error messages
Execute SSIS Package in ADF Pipeline
ADF Mapping Data Flows
What is ADF Mapping Data Flow?
Data Flow is a new feature of Azure Data Factory that allows you to build data transformations in a visual user interface.
- Transform Data, At Scale, in the Cloud, Zero-Code: cloud-first, scale-out ELT; code-free dataflow pipelines; serverless scale-out transformation execution engine
- Maximum Productivity for Data Engineers: does NOT require understanding of Spark / Scala / Python / Java
- Resilient Data Transformation Flows: built for big data scenarios with unstructured data requirements; operationalize with Data Factory scheduling, control flow and monitoring
Code-free Data Transformation At Scale
- Does not require understanding of Spark, Big Data Execution Engines, Clusters, Scala, Python …
- Focus on building business logic and data transformation: data cleansing, aggregation, data conversions, data prep, data exploration
… not …
Build your logical data flows adding data transformations in a guided experience
Microsoft Azure Data Factory Continues to Extend Data Flow Library with a Rich Set of Transformations and Expression Functions
Debug mode provides row-level context and visible results in inspector pane
Debug Data Flows with Data Preview and Data Sampling
Interactive Expression Builder – Build data transform expressions, not Spark code
Deep Monitoring Introspection of Data Transformations