Beyond orchestration with Azure Data Factory


1 Beyond orchestration with Azure Data Factory
SQL Saturday Oslo August

2 Our sponsors Platinum and event Gold Global Silver Bronze Raffle

3 Compute options in Azure Data Factory
- Azure Batch Service
- Azure Functions
- Databricks
- Data Lake Analytics
- HDInsight
- Mapping Data Flows
- SSIS
- Stored Procedures
- Wrangling Data Flows


5 Today's case – Norwegian corporate registry
- Brønnøysundregistrene contains data on all registered organizations in Norway
  - Businesses (small, medium, large)
  - Voluntary organizations
  - Foundations
- Data is available for download through APIs
  - Full data API
  - Updates API
- We want to load the data into a SQL target table with Slowly Changing Dimensions type 1 history
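As a rough illustration of the load target described above, the sketch below applies a Slowly Changing Dimension type 1 update, which as commonly defined overwrites changed attributes in place and inserts unseen keys. It is only a sketch under stated assumptions: an in-memory SQLite database stands in for the SQL target, and the table name `organization`, the columns `org_no` and `name`, and the sample values are all hypothetical, not taken from the talk.

```python
import sqlite3


def scd1_upsert(conn, rows):
    """Overwrite the name of existing keys and insert keys not seen before."""
    cur = conn.cursor()
    for org_no, name in rows:
        # Try to overwrite the current value for an existing key.
        cur.execute(
            "UPDATE organization SET name = ? WHERE org_no = ?",
            (name, org_no),
        )
        # No row matched, so this key is new and gets inserted.
        if cur.rowcount == 0:
            cur.execute(
                "INSERT INTO organization (org_no, name) VALUES (?, ?)",
                (org_no, name),
            )
    conn.commit()


# Hypothetical stand-in for the SQL target table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organization (org_no TEXT PRIMARY KEY, name TEXT)")

# Initial (full) load followed by an update batch; all values are made up.
scd1_upsert(conn, [("100000001", "Example AS"), ("100000002", "Sample Foundation")])
scd1_upsert(conn, [("100000002", "Sample Foundation Renamed"), ("100000003", "New Club")])

result = conn.execute(
    "SELECT org_no, name FROM organization ORDER BY org_no"
).fetchall()
```

The update-then-insert pattern is used here because it is portable across SQL dialects; on a real SQL Server target a set-based statement would likely be preferable to a row-by-row loop.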

6 Common setup

7 Setup
- I have set up a separate source pipeline to load data into Data Lake Storage (Gen1)
  - One pipeline for loading full data
  - One pipeline for loading updates, with one file for each unit
- For Databricks I have chosen to use JSON
- Mapping Data Flows does not support JSON (yet)
- SSIS does not support JSON natively
- All tools have been used with only native functionality, not considering possible extensions
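To make the "one file for each unit" layout of the updates pipeline concrete, here is a minimal sketch that collects such files into a list of records. It assumes each file holds a single JSON object; a local temporary folder stands in for the data lake, and the function name `load_unit_files`, the file naming, and the fields `org_no` and `name` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def load_unit_files(folder):
    """Read every *.json file in folder (one unit per file) into a list of dicts."""
    records = []
    for path in sorted(Path(folder).glob("*.json")):
        with path.open(encoding="utf-8") as handle:
            records.append(json.load(handle))
    return records


# Demo: a temporary folder stands in for the updates folder in the data lake.
with tempfile.TemporaryDirectory() as tmp:
    sample_units = (
        {"org_no": "100000001", "name": "Example AS"},
        {"org_no": "100000002", "name": "Sample Foundation"},
    )
    for unit in sample_units:
        target = Path(tmp) / f"{unit['org_no']}.json"
        target.write_text(json.dumps(unit), encoding="utf-8")
    units = load_unit_files(tmp)
```

Sorting the paths keeps the result order deterministic, which makes reruns easier to compare.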

8 Databricks

9 Databricks
- Spark clusters supporting Scala, SQL, R and Python notebooks, JAR or Python
- Can have autoscaling
- Built-in versioning
- Integrated with Azure (AD, Key Vault, etc.)
- Step-by-step debugging
- Databricks also exists for AWS
Demo notes: show how to set up a cluster (autoscaling, etc.); show a notebook, including versioning; show step-by-step execution

10 Databricks pros and cons
Pros:
- Can handle JSON (with different formats)
- Can handle filesets
- Built-in delta
- Easy coding and debugging
- Can use SQL, Scala, R, Python
- Auto startup and shutdown of clusters
- Monitoring
- Connected to Azure services
Cons:
- Code language – not graphical
- Cluster configuration can be tricky
- Not integrated with ADF datasets
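One possible reading of "JSON with different formats" is that the same records may arrive either as a single JSON document (an object or an array) or as line-delimited JSON. The plain-Python sketch below handles both layouts; it only illustrates the idea and is not Databricks code, and the helper name `parse_json_records` is hypothetical.

```python
import json


def parse_json_records(text):
    """Return a list of records from a JSON document or from JSON Lines text."""
    stripped = text.strip()
    if not stripped:
        return []
    try:
        parsed = json.loads(stripped)
    except json.JSONDecodeError:
        # Not a single document: treat each non-empty line as its own JSON value.
        return [json.loads(line) for line in stripped.splitlines() if line.strip()]
    # A single document is either a list of records or one record.
    return parsed if isinstance(parsed, list) else [parsed]


array_layout = parse_json_records('[{"a": 1}, {"a": 2}]')
lines_layout = parse_json_records('{"a": 1}\n{"a": 2}\n')
```

Both calls yield the same list of two records, so downstream code does not need to know which layout the file used.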

11 Mapping Data Flows Preview

12 Mapping data flows
- Currently in preview
- Graphical ETL tool integrated in ADF
- Uses Databricks as compute engine (but cannot use datasets in Databricks)
- Has its own expression language
- Can do row operations (insert, update, delete, etc.) and table creation

13 Mapping Data Flows pros and cons
Pros:
- Can handle filesets
- Tightly integrated with Data Factory
- Auto startup and shutdown of clusters
- Monitoring with lineage
- Debug helps you to view data at each step
Cons:
- Limited support for file formats, not JSON yet
- Slowest performance
- Not full flexibility

14 SSIS

15 SSIS
- Microsoft's well-known ETL tool
- SSIS jobs can be deployed from Visual Studio into Data Factory and run by an SSIS Integration Runtime (SSIS-IR) in Data Factory
- SSIS-IR will not start and stop when activities are invoked
- SSIS-IR must be running to deploy code
- Uses Azure SQL Server for storing configuration

16 SSIS pros and cons
Pros:
- Familiar for many developers
- No need to rewrite everything
- Best performance
Cons:
- Limited support for file formats and specifications, not JSON
- Limited support for cloud data stores
- Limited monitoring
- Manual startup and shutdown of integration runtime

17 Conclusions....

18 Comparison time!
Tools compared: Databricks | Data flows | SSIS
Code/"nocode": Code | Visual
DF integration: Ok | Strong
Monitoring: Best | Hard
Scale: Auto | Stop/Resume | Stop/resume
Cluster startup: 4 minutes | 14 minutes | 5 – 26 minutes
Init load: 2 minutes + 6 minutes
DB load: 25 minutes | 8 minutes
SCD1: 2.5 minutes | 9 minutes | 55 seconds

19 My view on things...
- SSIS is not where you want to do data transformation in Azure. It is more for legacy or complex transformations that you don't want to change
- Databricks is the best choice if you want full flexibility and the least lock-in
- Mapping Data Flows is promising and has great features if you would like a tightly integrated tool with a graphical user interface
- Azure Functions are for smaller tasks (or could we have a PowerShell compute task in the future?)
Want to follow me? Twitter.com/datahelge and blog on datahelge.no

