Dissecting the Data Flow: SSIS Transformations, Memory & the Pipeline
SQL Saturday #622, Albany, NY, July 29, 2017
Matt Batalon, MCSE: Data Management & Analytics
Blog: codebatalon.wordpress.com
About Me
Presently a Data Solutions Developer for a healthcare company in RI
Using SQL Server since SQL Server 2000
MCSE: Data Management & Analytics
President of the Rhode Island SQL Server Users Group (formerly SNESSUG)
Also involved in the Rhode Island Business Intelligence User Group
Co-Organizer of Providence SQL Saturday: sqlsaturday.com
Presentation Agenda
High-level overview of the SSIS architecture
Control Flow vs. Data Flow
Dissecting the Data Flow (200-300 level, not a deep dive into internals)
SSIS pipeline and memory buffers
Transformations: synchronous vs. asynchronous
Streaming, partially blocking, fully blocking, row by row
What do these settings even do?
Why do we even care about this stuff?
It works! Great! But could it be more efficient?
How long did it take to complete? Is that a problem? Will it be in the future?
Always develop the most efficient, optimized solutions we can
It is much easier to optimize from the get-go!
SSIS Architecture
Control Flow
Data Flow
Control Flow
When first viewing a package, the control flow gives a view of what is supposed to happen
Tasks (such as a Data Flow Task, Script Task, Send Mail, FTP)
Precedence constraints
Containers
A separate engine from the Data Flow engine
Example Control Flow
Data Flow
"The Data Flow task encapsulates the data flow engine that moves data between sources and destinations, and lets the user transform, clean, and modify data as it is moved." (Microsoft Books Online)
Typical Data Flow Task
Source(s): extract data
Transformation(s): modify, route, cleanse, summarize data
Destination(s): load data
Lots of options
Sources
The starting point of the data flow
Flat files, XML files, databases, etc.
The data is converted into buffers
Transformations
Transformations take place in memory, or should whenever possible
The first transformation can start before all data is loaded into memory
The best-performing transformations modify, or "transform," data that already exists inside a buffer
Destinations
The destination component connects to the target file or database and empties the buffers as it loads the target with the transformed data
Data Flow Engine
Buffer: a physical piece of memory that contains rows and columns
SSIS Buffer Manager: determines all the buffers that will be needed before the package executes, and controls and manages their creation
Pipeline: the part of the Data Flow engine that uses buffers to manipulate data in memory
What if there's A LOT of data?
With many rows in the source, the Data Flow task can read data from the source component, transform data, and start loading data into the target all at the same time
This means multiple buffers can be in use at the same time by different components of the data flow task, depending on how we manage the buffers throughout the pipeline
Buffer Settings and the Buffer Manager
The Buffer Manager computes the estimated number of bytes per row from the source
DefaultBufferMaxRows defaults to 10,000 rows
DefaultBufferSize defaults to 10 MB; the maximum is 2 GB (SQL Server 2016)
Whichever limit is hit first wins
Example Source Row
SSIS data type conversions:
DT_STR = 10 bytes
DT_Date = 3 bytes
DT_STR = 20 bytes
DT_UI4 = 4 bytes
DT_CY = 8 bytes
Total: 79 bytes
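To get real per-column byte counts for your own table instead of the slide's illustration, you can read them from the table's metadata. A minimal sketch, assuming a hypothetical dbo.SalesStaging table; note that SSIS buffer widths do not always match SQL Server storage lengths exactly, so treat this as an approximation:

    -- Per-column storage sizes for a hypothetical staging table.
    SELECT c.name       AS column_name,
           t.name       AS data_type,
           c.max_length AS max_length_bytes
    FROM sys.columns AS c
    JOIN sys.types   AS t ON t.user_type_id = c.user_type_id
    WHERE c.object_id = OBJECT_ID('dbo.SalesStaging');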
A look inside the buffer…
EstimatedRowSize * DefaultBufferMaxRows = EstimatedBufferSize
Our example (using defaults): 79 bytes * 10,000 = 790 KB
790 KB < 10 MB, so a buffer can hold many more rows than the DefaultBufferMaxRows default of 10,000
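The same arithmetic, written out as a quick T-SQL sanity check using the numbers from the example row:

    DECLARE @EstimatedRowSize     int = 79;     -- bytes per row, from the example
    DECLARE @DefaultBufferMaxRows int = 10000;  -- SSIS default
    -- 790,000 bytes (~790 KB), well under the 10 MB DefaultBufferSize
    SELECT @EstimatedRowSize * @DefaultBufferMaxRows AS EstimatedBufferSizeBytes;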
sp_spaceused <table name>
bytes = 42,112 KB * 1,024 = 43,122,688 bytes
bytes per row = 43,122,688 / 500,000 rows ≈ 86 bytes per row
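A sketch of the same estimate from live table statistics, again assuming a hypothetical dbo.SalesStaging table holding 500,000 rows:

    -- sp_spaceused reports the table's data size in KB.
    EXEC sp_spaceused 'dbo.SalesStaging';
    -- If the data column reports 42,112 KB for 500,000 rows:
    SELECT (42112 * 1024) / 500000 AS ApproxBytesPerRow;  -- ~86 bytes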
DefaultBufferMaxRows
DefaultBufferSize / row size = tuned DefaultBufferMaxRows
10 MB * 1,024 * 1,024 / 86 bytes ≈ 121,927 rows
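And the inverse calculation, for a tuned DefaultBufferMaxRows given an 86-byte row:

    -- How many 86-byte rows fit in the default 10 MB buffer?
    SELECT (10 * 1024 * 1024) / 86 AS TunedDefaultBufferMaxRows;  -- 121,927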
New for SQL Server 2016
BufferSizeTuning Log Event
BLOBs are an exception
BLOB data types do not fit into buffers because they do not have a fixed width
Instead, a "descriptor" of the BLOB is stored in the buffer that describes the BLOB's location
The %TEMP% and %TMP% environment variables are the default spool locations (the C:\ drive)
VARCHAR(MAX) and NVARCHAR(MAX) are treated as BLOBs (see the sketch below)
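One way to keep short (N)VARCHAR(MAX) data out of the BLOB path, sketched with a hypothetical dbo.Notes table and a hypothetical 500-character cap: if the real values are short, cast to a bounded type in the source query so the column travels inside the buffer instead of being spooled:

    -- Check how long the MAX column's values actually are.
    SELECT MAX(DATALENGTH(NoteText)) AS MaxBytesUsed
    FROM dbo.Notes;

    -- If they fit, cast in the source query so the column is handled
    -- in-row instead of as a spooled BLOB.
    SELECT NoteID,
           CAST(NoteText AS NVARCHAR(500)) AS NoteText
    FROM dbo.Notes;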
BLOB & Buffer Paths
BLOBTempStoragePath: BLOB, text, image, etc.
BufferTempStoragePath: anything else
Buffers Spooled
Transformations Affect Buffers
The transformations used by SSIS affect the creation and performance of buffers, and hence the efficiency of your package
Transformations fall into two main categories: synchronous and asynchronous
Synchronous Transformations
Output is synchronous with input
Does not need information about any other rows in the data set
Rows entering = rows exiting the transformation, i.e. the record count stays the same
Can usually operate on the same buffer
Usually better performing than an asynchronous transformation
Streaming (non-blocking) transformations are typically synchronous
Asynchronous Transformations
Does not process rows independently of the rest of the data set
Rather than outputting rows as they are processed, the transformation must output data asynchronously, at a later time
Record counts usually change from input to output
Must create a new buffer for its output
Generally poorer performance than a synchronous transformation
Typically a semi-blocking or fully blocking transformation
Different Types of Transformations
Non-blocking (streaming)
Partially blocking, or semi-blocking
Fully blocking
Non-blocking (row by row)
Here the term "blocking" has nothing to do with concurrency as in the database engine
Streaming Transformations
Non-Blocking Streaming Transformations
Generally synchronous
The logic inside the transformation does not impede the data flow from moving on to the next transformation in the pipeline
The fastest of all transformations because they operate on data inside an already created buffer
These transformations update the buffer in place, and the data is then passed on to the destination
Non-Blocking Streaming Transformations
Audit
Cache Transform
Character Map
Conditional Split
Copy Column
Data Conversion
Derived Column
Lookup (full cache)
Multicast
Percent Sampling
Row Count
Partially Blocking Transformations
Partially Blocking Transformations
Generally asynchronous
Also known as "semi-blocking" transformations
These transformations may hold back, or "block," buffers from being sent down the pipeline because they are waiting on another action (e.g. a Merge Join waiting for matching rows on its other input)
Their output creates a new buffer, unlike non-blocking streaming transformations
Partially Blocking Transformations
Data Mining Query
Merge
Merge Join
Pivot
Term Lookup
Unpivot
Union All
Fully Blocking Transformations
Fully Blocking Transformations
Asynchronous
The slowest performing, most memory-intensive transformations
Nothing can pass down the pipeline until the transformation is finished
Must read all input rows before applying the transformation logic and producing any output rows (e.g. a Sort cannot emit its first row until it has seen its last input row)
The pipeline is stopped at these transformations, and new buffers are always created
Can potentially use up all available memory, at which point data spills to disk
Fully Blocking Transformations
Aggregate
Fuzzy Grouping
Fuzzy Lookup
Row Sampling
Sort
Term Extraction
Non-Blocking Row-by-Row Transformations
Considered non-blocking or streaming because data can still flow to the next transformation
Rows flowing through these transformations are processed one by one because each row has to interact with an outside process, such as a database, a file, or another component
Buffers can get held up waiting on those external calls
Non-Blocking Row-by-Row Transformation Examples
DQS Cleansing
Export Column
Import Column
Lookup (non-cached or partially cached)
OLE DB Command
Script (if it calls an outside process)
Slowly Changing Dimension (each row does a lookup against the dimension table)
Lookup Transformation
Can be a streaming transformation (full cache)
Beware of the no-cache and partial-cache modes
Also beware of full cache with a very large reference data set; keep the reference query narrow (see the sketch below)
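A minimal sketch of a narrow reference query for a full-cache Lookup, with hypothetical table and column names; caching only the join key and the returned column keeps the cache small:

    -- Lookup reference query: avoid SELECT * against the reference table.
    SELECT CustomerKey, CustomerName
    FROM dbo.DimCustomer;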
Alternatives to Blocking Transformations
Optimize at the source if you can
Choose streaming transformations over partially or fully blocking transformations whenever possible
T-SQL can sometimes replace the fully blocking transformations (see the sketch after this list):
Aggregate = T-SQL SUM or AVG with GROUP BY
Sort = T-SQL ORDER BY, then set the IsSorted property on the source output (Demo 6)
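A sketch of pushing both replacements into the source query, with hypothetical table and column names; after sorting in T-SQL, mark the source output with IsSorted and set SortKeyPosition on the key column so downstream components know the order:

    -- Replaces an Aggregate transformation: group and sum at the source.
    SELECT CustomerKey,
           SUM(SalesAmount) AS TotalSales
    FROM dbo.FactSales
    GROUP BY CustomerKey
    ORDER BY CustomerKey;  -- replaces a Sort transformation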
RunInOptimizedMode Setting on the Data Flow Task
Ignores columns that are unused downstream
It still has to figure out what is and is not used, however
PipelineInitialization Logging Event
Destinations
Table or view "fast load" uses bulk insert
Maximum Insert Commit Size: commits rows in batches that are the smaller of the Maximum Insert Commit Size or the remaining rows in the buffer currently being processed
0 = one big commit of the entire load (see the analogy below)
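The commit-batching behavior is easiest to see in plain T-SQL. A hedged analogy using BULK INSERT with hypothetical table and file names, where BATCHSIZE plays the role of Maximum Insert Commit Size:

    -- Each 100,000-row batch commits as its own transaction; omitting
    -- BATCHSIZE commits the whole file at once, like a commit size of 0.
    BULK INSERT dbo.SalesStaging
    FROM 'C:\data\sales.csv'
    WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n',
          BATCHSIZE = 100000, TABLOCK);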
MaxConcurrentExecutables
Controls how many executables can run in parallel inside the Control Flow; possibly use containers with multiple Data Flows
-1 (the default) = number of cores + 2, e.g. a 4-core server can run 6 executables concurrently
If CPU usage is low, possibly increase it to have more Data Flows run in parallel
EngineThreads
A property on each Data Flow task
Two types of threads:
Source threads
Worker threads (transformations and destinations)
Embrace but Be Cautious
Too many things running in parallel may not be a good thing
Ensure there are enough resources
Stagger ETL jobs
Use precedence constraints where appropriate
More Optimization
Loading flat files as fast as possible is especially good for staging
Use the "Fast Parse" option (if the data is not locale-sensitive)
Check the Advanced properties of the source to change the field lengths
The Suggest Types button might help
Partition with Conditional Split
Inside the Data Flow, create multiple streams (see the sketch after this list)
Great for staging data
Produces equal-sized partitions or files
The modulo operator can be used if the primary key is an incremental value such as an IDENTITY column, or if there is a counter or number column
Also see the Balanced Data Distributor
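The same modulo partitioning sketched in T-SQL with hypothetical names; in a Conditional Split the equivalent SSIS expressions would be OrderID % 4 == 0, OrderID % 4 == 1, and so on, one per output:

    -- One of four equal-sized streams keyed off an incremental ID.
    SELECT OrderID, CustomerKey, SalesAmount
    FROM dbo.FactSales
    WHERE OrderID % 4 = 0;  -- repeat with = 1, = 2, = 3 for the others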
SSIS Catalog
catalog.execution_component_phases
catalog.execution_data_statistics
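A sketch of querying both views for one execution (the execution_id of 12345 is hypothetical). Note that execution_data_statistics is only populated at the Verbose logging level, and execution_component_phases at Performance or Verbose:

    USE SSISDB;

    -- Rows sent down each data flow path for one execution.
    SELECT package_name, task_name,
           source_component_name, destination_component_name,
           SUM(rows_sent) AS total_rows_sent
    FROM catalog.execution_data_statistics
    WHERE execution_id = 12345
    GROUP BY package_name, task_name,
             source_component_name, destination_component_name;

    -- Time spent in each phase of each data flow component.
    SELECT task_name, subcomponent_name, phase,
           SUM(DATEDIFF(millisecond, start_time, end_time)) AS total_ms
    FROM catalog.execution_component_phases
    WHERE execution_id = 12345
    GROUP BY task_name, subcomponent_name, phase;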
Key Takeaways
A little bit of SSIS internals is a good thing!
Optimize your sources as best you can
Understand transformations
Optimize your destinations as best you can
Embrace parallelism whenever possible
Sources and References for this Presentation
Advanced Integration Services, Stacia Misner Varga
Microsoft Books Online
SQL Server CAT Team: Top 10 SQL Server Integration Services Best Practices
Simple Talk