Custom Activities in Azure Data Factory
  Presented by Jared Zagelbaum
  Senior Consultant, Blue Granite
Introduction
  About me:
    Microsoft Data Platform since 2008
    Azure since Azure
    MCSE Data & Analytics
    Microsoft Certificate in Data Science (R)
    Senior Consultant with Blue Granite
  Recent projects (last 6 months): Manufacturing, Logistics
  Technologies implemented: Power BI, SQL DW, ADF, SSIS (BIML), SSAS, SQL Server, Azure Data Lake, DevOps / CI
Objectives
  Understand when to use a custom activity
  Know how to go about creating one
  Save you some pain with undocumented things I’ve encountered
  Appreciate the scope of what you can really do with ADF v2 orchestrating Azure services
Agenda
  Azure Prerequisites for Custom Activities
  Overview of Azure Batch
  Implementation of custom activities in Azure Data Factory (v1 and v2)
  Review the use cases for custom activities in Azure Data Factory
  ADFv1 Deep Dive
    Setting up development environment for ADFv1 custom activities
    Developing a custom activity for ADF v1
    Deployment and Debugging
  ADFv2 Deep Dive
    Developing a custom activity for ADF v2 (much more fun version)
Azure Batch
  Azure Batch creates and manages a pool of compute nodes (virtual machines), installs the applications you want to run, and schedules jobs to run on the nodes. There is no cluster or job scheduler software to install, manage, or scale.
  There is no additional charge for using Batch. You only pay for the underlying resources consumed, such as the virtual machines, storage, and networking.
  Batch works well with intrinsically parallel (also known as "embarrassingly parallel") workloads, where the applications can run independently and each instance completes part of the work.
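As a rough local analogy for an intrinsically parallel workload (this is not the Azure Batch API), the sketch below splits the input into independent chunks and processes each one in its own worker. The names process_chunk and run_parallel are hypothetical, and a thread pool merely stands in for a pool of compute nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk):
    """Hypothetical unit of work: it depends only on its own input."""
    return sum(x * x for x in chunk)


def run_parallel(items, chunk_size=3, workers=4):
    """Split items into independent chunks and process them concurrently."""
    chunks = [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so partial results line up with chunks.
        return list(pool.map(process_chunk, chunks))


if __name__ == "__main__":
    partials = run_parallel(list(range(10)))
    # Each partial result was computed without reference to any other chunk.
    print(partials, sum(partials))
```

Because no chunk needs another chunk's result, the same split could in principle be handed to separate machines, which is the property Batch relies on.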
Custom Activities Compared: ADF v1 vs v2
  Differences between the version 2 Custom Activity and the version 1 (Custom) DotNet Activity:
  How custom logic is defined
    v2: By providing an executable
    v1: By implementing a .NET DLL
  Execution environment of the custom logic
    v2: Windows or Linux
    v1: Windows (.NET Framework 4.5.2)
  Executing scripts
    v2: Supports executing scripts directly (for example "cmd /c echo hello world" on a Windows VM)
    v1: Requires implementation in the .NET DLL
  Dataset required
    v2: Optional
    v1: Required to chain activities and pass information
  Pass information from activity to custom logic
    v2: Through ReferenceObjects (LinkedServices and Datasets) and ExtendedProperties (custom properties)
    v1: Through ExtendedProperties (custom properties), Input, and Output Datasets
  Retrieve information in custom logic
    v2: Parses activity.json, linkedServices.json, and datasets.json stored in the same folder as the executable
    v1: Through the .NET SDK (.NET Framework 4.5.2)
  Logging
    v2: Writes directly to STDOUT
    v1: Implementing Logger in the .NET DLL
  ADFv1
    Execution restricted to a single activity run (no opportunity to scale within an activity definition)
  ADFv2
    Can run in parallel / scale out easily via control activities
    Can run packaged executables if callable from the command line (Linux or Windows), not just scripts!
    Must use a cloud-hosted integration runtime and Azure Batch
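The v2 approach of parsing activity.json from the executable's folder and logging to STDOUT might look roughly like the sketch below. It assumes the custom properties sit under typeProperties.extendedProperties in activity.json; that layout and the helper names load_json, read_extended_properties, and main are assumptions for illustration, so check the file your own activity run actually produces.

```python
import json
import sys
from pathlib import Path


def load_json(folder, name):
    """Read one of the JSON files placed next to the executable."""
    path = Path(folder) / name
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def read_extended_properties(folder="."):
    """Return the custom properties from activity.json, or {} if absent.

    Assumption: they live under typeProperties.extendedProperties.
    """
    activity = load_json(folder, "activity.json")
    return activity.get("typeProperties", {}).get("extendedProperties", {})


def main(folder="."):
    props = read_extended_properties(folder)
    for key in sorted(props):
        # Logging is just printing to STDOUT; print names only,
        # since the values may contain secrets.
        print(f"extended property: {key}")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```

linkedServices.json and datasets.json could be read with the same load_json helper.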
Key Takeaways
  ADFv1
    Custom (.NET) activities are designed to interact with datasets that require specific access methods / transformation rules.
    Azure Batch is used as an anonymizer of resources more than for its actual potential to scale.
    Requires .NET 4.5.2, the IDotNetActivity interface, and the NuGet package Microsoft.Azure.Management.DataFactories. If you need a custom activity, you’re basically building it from scratch.
  ADFv2
    Run any executable: self-compiled, script, or packaged executable (with command arguments), on Windows or Linux.
    Control activities leverage the full power of Azure Batch to scale out parallel workloads.
    “No holds barred”: not expected to produce or transform a dataset.
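To make the ADFv2 scale-out point concrete, here is a hedged sketch that builds a pipeline fragment as a Python dictionary: a ForEach control activity wrapping a Custom activity. The property names (command, isSequential, batchCount, items, activities) follow the ADF v2 pipeline JSON as I understand it and should be verified against the current documentation. The names AzureBatchLinkedService, ProcessFile, FanOut, inputFiles, and process.py are hypothetical.

```python
import json


def custom_activity(name, command, batch_linked_service):
    """Build a Custom activity that runs `command` on an Azure Batch pool."""
    return {
        "name": name,
        "type": "Custom",
        "linkedServiceName": {
            "referenceName": batch_linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": {"command": command},
    }


def foreach_activity(name, items_expression, inner, batch_count=10):
    """Wrap an activity in a ForEach that runs its iterations in parallel."""
    return {
        "name": name,
        "type": "ForEach",
        "typeProperties": {
            "isSequential": False,      # run iterations concurrently
            "batchCount": batch_count,  # upper bound on concurrent iterations
            "items": {"value": items_expression, "type": "Expression"},
            "activities": [inner],
        },
    }


if __name__ == "__main__":
    inner = custom_activity(
        "ProcessFile", "python process.py", "AzureBatchLinkedService"
    )
    outer = foreach_activity(
        "FanOut", "@pipeline().parameters.inputFiles", inner
    )
    print(json.dumps(outer, indent=2))
```

How each iteration's item reaches the executable, and where the executable is stored, are deliberately left out of this sketch.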
Use cases for custom activities
  ADFv1
    You need to access a source or service not supported with native components
    You need to perform a specific compute task on “small data”
  ADFv2
    ADFv1 use cases
    You want to run an SMP application based on conditions / wall clock and possibly have the output of the application trigger additional actions
    You want batch processes all logging to a common system
    You are filling in the holes unsupported in current SSIS lift and shift: https://docs.microsoft.com/en-us/sql/integration-services/lift-shift/ssis-azure-validate-packages
ADFv1 Deep Dive
ADFv1
  Adding custom code to projects and deployment is fairly easy with Data Factory Tools for Visual Studio 2015
  Debugging a .NET class library requires additional work
  Developing pipelines and activities is all JSON
  Debugging in ADFv1 is centralized
ADFv2 Deep Dive
ADFv2
  Use any development environment you want, heck, even any framework, as long as it’s SMP-based
  No slick tooling for deployment like in v1, so slightly more work
  Debugging locally doesn’t require much refactoring, if any
  Developing pipelines and activities is helped by the visual editor initially
  Debugging in ADFv2 is buggy: it’s still in preview!
Session Summary
  ADFv1 and v2 use Azure Batch to run custom activities
    v1 mostly for convenience
    v2 fer reelz yo: you can use it to run parallel tasks at enormous scale (along with your Azure bill)
  ADFv1 had a dream of things being nice, neat, tumbling windows where custom activities had a certain place in this tiny little world
  ADFv2 lets you run pretty much any workload and access pretty much any data source without restriction to platform or scale, and orchestrates everything into a single service
  Kids, you can drive the car now.
Evaluation
  Thanks for attending and filling out the evaluations for all the sessions you go to today. They really matter to the presenters!