ETL process management with TSQL

Presentation transcript:

ETL process management with TSQL Richard Swinbank

ETL process management
- ETL is performed by a collection of processes
  - SSIS packages
  - TSQL stored procedures
  - Other bits of sticky tape and string
  - Lots of them!
- Process execution has to be managed
  - What runs when
  - In what order
  - What happens when things go wrong

Five desirable ETL behaviours
- Parallel processing: fast to finish
- Convenient way to locate faults: fast to fix
- Easy to resume after error…: fast to restart
- …with as little as possible left to do: fast to finish after restart
- Easy to add new processes (we’ll come back to this)

A very small example
- Ten processes (A to J)
- Process dependencies
- We’ll be using stored procedures for now
- Let’s look at some possible approaches
[Diagram: dependency graph for processes A to J]

Approach #1: Stepwise SQL Agent job
- Call each SP in a separate job step
- FYI, demo.usp_ProcessE is broken
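As a rough illustration of the stepwise approach, a job of this shape could be scripted with the msdb job procedures. This is only a sketch: the job name, step names and database name (EtlDemo) are assumptions, and only two of the ten steps are shown. The remaining procedures would be added the same way, with only the final step using @on_success_action = 1.

```sql
-- Sketch of a stepwise Agent job. Job, step and database names are hypothetical;
-- only the demo.usp_Process* procedure names come from the slides.
USE msdb;
GO

EXEC dbo.sp_add_job
    @job_name = N'Demo ETL (stepwise)';

-- One job step per stored procedure.
EXEC dbo.sp_add_jobstep
    @job_name          = N'Demo ETL (stepwise)',
    @step_name         = N'Run demo.usp_ProcessA',
    @subsystem         = N'TSQL',
    @database_name     = N'EtlDemo',              -- assumed database name
    @command           = N'EXEC demo.usp_ProcessA;',
    @on_success_action = 3,                       -- 3 = go to the next step
    @on_fail_action    = 2;                       -- 2 = quit the job reporting failure

-- Final step shown here: quit with success when it completes.
EXEC dbo.sp_add_jobstep
    @job_name          = N'Demo ETL (stepwise)',
    @step_name         = N'Run demo.usp_ProcessB',
    @subsystem         = N'TSQL',
    @database_name     = N'EtlDemo',
    @command           = N'EXEC demo.usp_ProcessB;',
    @on_success_action = 1,                       -- 1 = quit the job reporting success
    @on_fail_action    = 2;

-- Target the job at the local server so that Agent can run it.
EXEC dbo.sp_add_jobserver
    @job_name = N'Demo ETL (stepwise)';
GO
```

With @on_fail_action = 2 on every step, the job stops at the first failing step, which makes the fault easy to find but leaves every later step unrun, whether or not it depended on the failed one.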

Agent job: Step-by-step
[Diagram sequence: the dependency graph for processes A to J, stepped through job steps #1 to #5]

Stepwise SQL Agent job: Results

Stepwise SQL Agent job: Evaluation
- Parallel processing
- Convenient way to locate faults
- Easy to resume after error…
- …with as little as possible left to do

Approach #2: Master SSIS package
- Call each SP from an Execute SQL Task
- Deploy package to SSIS catalog; run in agent job

Master SSIS package: Results

Master SSIS package: Evaluation
- Parallel processing
- Convenient(ish) way to locate faults
- Easy to resume after error…
- …with as little as possible left to do

Recap
- We’ve identified some desirable behaviours
  - Parallel processing
  - Convenient way to locate faults
  - Easy to resume after error…
  - …with as little as possible left to do
  - (Easy to add new processes: we’ll come back to this)
- We’ve looked at two process management approaches
  - Stepwise SQL Agent job
  - Master SSIS package
- Each has some of the behaviours we want…
- …but neither has all of them
10 MINUTES

Dependency-driven process management in TSQL
1. Table of processes (columns: Process, Status)
   - demo.usp_ProcessA, demo.usp_ProcessB, demo.usp_ProcessC: Ready
   - demo.usp_ProcessD to demo.usp_ProcessJ: Not ready
2. Dependency information (columns: Process, RunsAfter)
   - One row per dependency, e.g. demo.usp_ProcessD runs after demo.usp_ProcessA
3. Process handler SP
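The first two components could be laid out roughly as follows. This is a minimal sketch, not Sprockit's actual schema: the table names demo.ProcessList and demo.ProcessDependency, their columns, and the status values in the CHECK constraint are assumptions, and the demo schema is assumed to exist already. The sample dependency rows are taken from the Graphviz example later in the deck, because the full dependency set is not recoverable from this transcript.

```sql
-- Hypothetical schema for the process table and the dependency table.
-- Assumes the demo schema already exists.
CREATE TABLE demo.ProcessList (
    ProcessName sysname     NOT NULL PRIMARY KEY,
    [Status]    varchar(10) NOT NULL DEFAULT 'Not ready'
        CHECK ([Status] IN ('Not ready', 'Ready', 'Running', 'Done', 'Errored'))
);

CREATE TABLE demo.ProcessDependency (
    ProcessName sysname NOT NULL REFERENCES demo.ProcessList (ProcessName),
    RunsAfter   sysname NOT NULL REFERENCES demo.ProcessList (ProcessName),
    PRIMARY KEY (ProcessName, RunsAfter)
);
GO

-- A few sample processes; every process starts as 'Not ready'.
INSERT INTO demo.ProcessList (ProcessName)
VALUES (N'demo.usp_ProcessB'),
       (N'demo.usp_ProcessC'),
       (N'demo.usp_ProcessG'),
       (N'demo.usp_ProcessH');

-- G runs after B and C; H runs after C.
INSERT INTO demo.ProcessDependency (ProcessName, RunsAfter)
VALUES (N'demo.usp_ProcessG', N'demo.usp_ProcessB'),
       (N'demo.usp_ProcessG', N'demo.usp_ProcessC'),
       (N'demo.usp_ProcessH', N'demo.usp_ProcessC');

-- Anything with no predecessors can run straight away.
UPDATE p
SET    [Status] = 'Ready'
FROM   demo.ProcessList AS p
WHERE  NOT EXISTS (SELECT 1
                   FROM   demo.ProcessDependency AS d
                   WHERE  d.ProcessName = p.ProcessName);
```

After the final UPDATE, B and C are 'Ready' while G and H stay 'Not ready', mirroring the starting state shown on the slide.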

Process handler (pseudo-TSQL)
    WHILE (anything's ready)
    BEGIN
        SELECT ready process
        EXECUTE selected process
        UPDATE ProcessList
        SET process status = 'Done'
          , process's dependants = 'Ready'
    END

Process handler (walkthrough)
[Diagram: dependency graph for processes A to J, shown beside the handler pseudo-TSQL and the process status table]
Status table as the handler runs:
- Start: A, B, C Ready; D to J Not ready
- After running A: A Done; B, C, D, E Ready; F to J Not ready

Better process handler (pseudo-TSQL)
    WHILE (anything's ready)
    BEGIN
        BEGIN TRY
            SELECT ready process
            EXECUTE selected process
            UPDATE ProcessList
            SET process status = 'Done'
              , process's dependants = 'Ready'
        END TRY
        BEGIN CATCH
            SET process status = 'Errored'
        END CATCH
    END

Better process handler (walkthrough)
[Diagram: dependency graph for processes A to J, shown beside the handler pseudo-TSQL and the process status table]
Status table as the handler runs:
- A Done; B, C, D, E Ready; F to J Not ready
- A Done; B, C, D Ready; E Errored; F to J Not ready
- A, B, C Done; D Ready; E Errored; F, G, H Ready; I, J Not ready
- A to D Done; E Errored; F, G, H Ready; I, J Not ready
- A to D, F, G Done; E Errored; H Ready; I, J Not ready
- A to D and F to H Done; E Errored; I, J Not ready

Better process handler: Evaluation
- Parallel processing
- Convenient way to locate faults
- Easy to resume after error…
- …with as little as possible left to do
Final status table: A to D and F to H Done; E Errored; I, J Not ready
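The pseudo-TSQL handler with error handling might be fleshed out along these lines. This is a sketch under assumptions, not the Sprockit implementation: it presumes a hypothetical table demo.ProcessList (ProcessName, Status) and a hypothetical table demo.ProcessDependency (ProcessName, RunsAfter), and the procedure name is invented for illustration.

```sql
-- Sketch of a sequential, error-tolerant handler.
-- demo.ProcessList and demo.ProcessDependency are assumed (hypothetical) tables.
CREATE PROCEDURE demo.usp_ProcessHandlerSketch
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @process sysname;

    WHILE EXISTS (SELECT 1 FROM demo.ProcessList WHERE [Status] = 'Ready')
    BEGIN
        -- Pick any ready process (alphabetical order is an arbitrary choice).
        SELECT TOP (1) @process = ProcessName
        FROM   demo.ProcessList
        WHERE  [Status] = 'Ready'
        ORDER BY ProcessName;

        BEGIN TRY
            -- EXECUTE accepts a procedure name held in a variable.
            EXECUTE @process;

            UPDATE demo.ProcessList
            SET    [Status] = 'Done'
            WHERE  ProcessName = @process;

            -- Release every waiting process whose predecessors are now all done.
            UPDATE p
            SET    [Status] = 'Ready'
            FROM   demo.ProcessList AS p
            WHERE  p.[Status] = 'Not ready'
              AND  NOT EXISTS (SELECT 1
                               FROM   demo.ProcessDependency AS d
                               JOIN   demo.ProcessList AS pre
                                 ON   pre.ProcessName = d.RunsAfter
                               WHERE  d.ProcessName = p.ProcessName
                                 AND  pre.[Status] <> 'Done');
        END TRY
        BEGIN CATCH
            -- Clear any transaction the failed process left open, then record the failure.
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;

            UPDATE demo.ProcessList
            SET    [Status] = 'Errored'
            WHERE  ProcessName = @process;
        END CATCH
    END
END;
```

Because an errored process never reaches 'Done', its dependants stay 'Not ready' and the loop ends once nothing else is 'Ready'. Resuming after a fix would then amount to setting the errored process back to 'Ready' and running the handler again.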

Parallel processing
- Run multiple handlers at the same time
- Must prevent different handlers from running the same process
- Make handler reserve a process before executing it (and set status of processes in execution to 'Running')
- Reserve by inserting details into reservations table
- Catch PK violation, continue
[Table: reservations table whose Process column is the primary key, holding e.g. demo.usp_ProcessA, demo.usp_ProcessC, demo.usp_ProcessE, demo.usp_ProcessB, demo.usp_ProcessD]

Parallelisable process handler (pseudo-TSQL)
    WHILE (anything's ready)
    BEGIN
        BEGIN TRY -- for process execution
            SELECT ready process
            BEGIN TRY -- for process reservation
                INSERT ready process details
                INTO ProcessReservations
            END TRY
            BEGIN CATCH
                CONTINUE
            END CATCH
            UPDATE ProcessList
            SET process status = 'Running'
            EXECUTE selected process
            SET process status = 'Done'
    [...]
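The reservation step could look something like the following. It is a sketch only: demo.ProcessReservation and its columns are hypothetical, demo.ProcessList (ProcessName, Status) is an assumed table, and the outer TRY/CATCH for failed processes and the release of dependants are left out to keep the reservation logic visible.

```sql
-- Hypothetical reservations table: the primary key allows one reservation per process.
CREATE TABLE demo.ProcessReservation (
    ProcessName sysname NOT NULL PRIMARY KEY,
    ReservedBy  int     NOT NULL   -- session id of the handler holding the reservation
);
GO

DECLARE @process sysname;

WHILE EXISTS (SELECT 1
              FROM   demo.ProcessList AS p
              WHERE  p.[Status] = 'Ready'
                AND  NOT EXISTS (SELECT 1 FROM demo.ProcessReservation AS r
                                 WHERE  r.ProcessName = p.ProcessName))
BEGIN
    -- Pick a ready process that nobody has reserved yet.
    SET @process = NULL;

    SELECT TOP (1) @process = p.ProcessName
    FROM   demo.ProcessList AS p
    WHERE  p.[Status] = 'Ready'
      AND  NOT EXISTS (SELECT 1 FROM demo.ProcessReservation AS r
                       WHERE  r.ProcessName = p.ProcessName)
    ORDER BY p.ProcessName;

    -- Another handler may have taken everything since the WHILE check.
    IF @process IS NULL CONTINUE;

    BEGIN TRY
        INSERT INTO demo.ProcessReservation (ProcessName, ReservedBy)
        VALUES (@process, @@SPID);
    END TRY
    BEGIN CATCH
        -- 2627 = primary key violation: another handler reserved it first.
        IF ERROR_NUMBER() = 2627 CONTINUE;
        THROW;
    END CATCH

    UPDATE demo.ProcessList SET [Status] = 'Running' WHERE ProcessName = @process;

    EXECUTE @process;

    UPDATE demo.ProcessList SET [Status] = 'Done' WHERE ProcessName = @process;
    -- Releasing dependants and handling a failed EXECUTE are omitted here.
END
```

Letting the primary key arbitrate means no explicit locking is needed: whichever handler inserts first wins and the others simply move on. A scheme like this presumably also needs the reservations cleared before each ETL run, which is not shown.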

What about SSIS packages?
- Can’t EXECUTE selected package
- Execute package using SSIS catalog SPs
  - SSISDB.catalog.create_execution
  - SSISDB.catalog.start_execution
- Need EXECUTE-like behaviour
  - Return only when package execution has finished
  - Raise error if something goes wrong
- Wrap up in package runner SP
- Handler executes process or package runner:
    IF process is SP
        EXECUTE process
    ELSE IF process is SSIS package
        EXECUTE usp_RunPackage @process
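A package runner with EXECUTE-like behaviour might be sketched as below. The procedure name and its three-parameter signature are assumptions for illustration (the slide's usp_RunPackage takes a single @process argument, and how it resolves folder and project is not shown). The SSISDB catalog procedures, the SYNCHRONIZED system parameter and the catalog.executions view are standard parts of the SSIS catalog.

```sql
-- Sketch of a package runner. Name and parameters are hypothetical.
CREATE PROCEDURE demo.usp_RunPackageSketch
    @folder  nvarchar(128),
    @project nvarchar(128),
    @package nvarchar(260)
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @execution_id bigint;
    DECLARE @status       int;

    -- Create an execution instance for the package.
    EXEC SSISDB.catalog.create_execution
        @folder_name  = @folder,
        @project_name = @project,
        @package_name = @package,
        @execution_id = @execution_id OUTPUT;

    -- SYNCHRONIZED = 1 makes start_execution wait until the package has finished.
    EXEC SSISDB.catalog.set_execution_parameter_value
        @execution_id    = @execution_id,
        @object_type     = 50,               -- 50 = system parameter
        @parameter_name  = N'SYNCHRONIZED',
        @parameter_value = 1;

    EXEC SSISDB.catalog.start_execution
        @execution_id = @execution_id;

    -- Status 7 means the execution succeeded; anything else is treated as failure.
    SELECT @status = [status]
    FROM   SSISDB.catalog.executions
    WHERE  execution_id = @execution_id;

    IF @status IS NULL OR @status <> 7
        RAISERROR(N'SSIS execution %I64d did not succeed (status %d).',
                  16, 1, @execution_id, @status);
END;
```

Raising a severity 16 error on any non-success status lets a calling handler's TRY/CATCH treat a failed package exactly like a failed stored procedure.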

Demo
- Sprockit: my implementation of this approach
- Pure TSQL & SQL Server Agent
- Completely free & open-source
20 MINUTES

Those five desirable behaviours
- Parallel processing
- Convenient way to locate faults
- Easy to resume after error…
- …with as little as possible left to do
- Easy to add new processes? What’s so hard about this anyway?!
40 MINUTES

Adding new processes
- Where do I put new process demo.usp_ProcessK?
- To decide, I need to know what everything else does
- Difficult unless I know the ETL landscape very well
- Takes a while for newbies to get up to speed
[Diagram: processes A to J in no particular order]

Process dependencies
[Diagram: processes A to J, first in no particular order, then arranged by dependency]

Resource dependencies
[Diagram: processes A to J linked through the tables they read and write, T01 to T14]

Dependencies in Sprockit
- Developers provide resource dependencies (table sprockit.Resource)
- Handlers infer process dependencies (uvw_ProcessDependency view)
- When adding a process, the dependency information you need is right there in the SP/package…
- …so no need for the full process dependency picture

Process            Resource   Input/output
demo.usp_ProcessB  Table T01  Input
demo.usp_ProcessB  Table T04  Output
demo.usp_ProcessG  Table T06
demo.usp_ProcessG  Table T09

Process            RunsAfter
demo.usp_ProcessG  demo.usp_ProcessB
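Inferring process dependencies from resource dependencies can be pictured as a self-join: a process runs after any other process that writes a resource it reads. The sketch below uses a hypothetical table demo.ProcessResource and a hypothetical view name rather than Sprockit's real sprockit.Resource and uvw_ProcessDependency, whose definitions are not shown in the slides. The sample rows are illustrative too: the transcript does not preserve which table links the two processes, so demo.usp_ProcessG reading T04 is an assumption.

```sql
-- Hypothetical resource-dependency table, one row per process/resource/direction.
CREATE TABLE demo.ProcessResource (
    ProcessName  sysname    NOT NULL,
    ResourceName sysname    NOT NULL,
    Direction    varchar(6) NOT NULL CHECK (Direction IN ('Input', 'Output')),
    PRIMARY KEY (ProcessName, ResourceName, Direction)
);
GO

-- A process runs after every other process that outputs one of its inputs.
CREATE VIEW demo.uvw_ProcessDependencySketch
AS
SELECT DISTINCT
       consumer.ProcessName AS ProcessName,
       producer.ProcessName AS RunsAfter
FROM   demo.ProcessResource AS consumer
JOIN   demo.ProcessResource AS producer
  ON   producer.ResourceName = consumer.ResourceName
WHERE  consumer.Direction = 'Input'
  AND  producer.Direction = 'Output'
  AND  producer.ProcessName <> consumer.ProcessName;
GO

-- Illustrative rows: B reads T01 and writes T04; G is assumed to read T04.
INSERT INTO demo.ProcessResource (ProcessName, ResourceName, Direction)
VALUES (N'demo.usp_ProcessB', N'T01', 'Input'),
       (N'demo.usp_ProcessB', N'T04', 'Output'),
       (N'demo.usp_ProcessG', N'T04', 'Input'),
       (N'demo.usp_ProcessG', N'T09', 'Output');

SELECT ProcessName, RunsAfter
FROM   demo.uvw_ProcessDependencySketch;
```

With these rows the view returns a single dependency, demo.usp_ProcessG running after demo.usp_ProcessB. A developer adding a new process only has to list the tables it reads and writes.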

What if I want the full picture?
- Structured dependency information is a data source
- Force-directed graphs in Power BI
- Graphviz (e.g. http://www.webgraphviz.com/)

    digraph G {
      n2 [label="usp_ProcessB"];
      n3 [label="usp_ProcessC"];
      n7 [label="usp_ProcessG"];
      n8 [label="usp_ProcessH"];
      n10 [label="Process_J.dtsx"];
      n2 -> n7;
      n3 -> n7;
      n3 -> n8;
      n7 -> n10;
    }
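Graphviz input of this kind can be generated straight from the dependency data. The query below is a sketch that assumes a hypothetical table demo.ProcessDependency (ProcessName, RunsAfter). It uses quoted process names as node identifiers instead of the numbered nodes and labels shown on the slide.

```sql
-- Sketch: emit one line of DOT per row, with a sort key to keep the lines in order.
-- demo.ProcessDependency (ProcessName, RunsAfter) is an assumed (hypothetical) table.
SELECT DotLine
FROM (
    SELECT 1 AS SortKey, N'digraph G {' AS DotLine
    UNION ALL
    SELECT 2, N'  "' + RunsAfter + N'" -> "' + ProcessName + N'";'
    FROM   demo.ProcessDependency
    UNION ALL
    SELECT 3, N'}'
) AS dot
ORDER BY SortKey, DotLine;
```

Pasting the result into a Graphviz renderer draws an arrow from each predecessor to the process that runs after it.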

Summary
- ETL process management in TSQL is simple but powerful
  - Parallel processing
  - Convenient way to locate faults (and tolerate transaction deadlocks)
  - Easy to resume after error…
  - …with as little as possible left to do
  - Easy to add new processes
- Pure TSQL means everything’s in the database
  - Exploit structured dependency information
  - Leverage ETL activity information
- Code available at http://RichardSwinbank.net/sprockit
Thanks for listening!

Questions?
http://RichardSwinbank.net/sprockit
richard@RichardSwinbank.net
@RichardSwinbank