Dissecting the Data Flow: SSIS Transformations, Memory & the Pipeline


Dissecting the Data Flow: SSIS Transformations, Memory & the Pipeline Albany, NY SQL Saturday #622 July 29, 2017 By Matt Batalon MCSE: Data Management & Analytics Email: mbatalon@gmail.com Blog: codebatalon.wordpress.com Twitter:@MattBatalon

About Me Presently a Data Solutions Developer for a healthcare company in RI. Using SQL Server since SQL Server 2000. MCSE: Data Management & Analytics. President of the Rhode Island SQL Server Users Group (formerly SNESSUG). Also involved in the Rhode Island Business Intelligence User Group. Co-organizer of Providence SQL Saturday (sqlsaturday.com).

Presentation Agenda High-level overview of the SSIS architecture. Control Flow vs. Data Flow. Dissecting the Data Flow (200~300 level, not a deep dive into internals). SSIS pipeline and memory buffers. Transformations: synchronous vs. asynchronous (streaming, partially blocking, fully blocking, row by row). What do these settings even do?

Why do we even care about this stuff? It works! Great! But could it be more efficient? How long did it take to complete? Is that a problem? Will it be in the future? We should always develop the most efficient, optimized solutions we can, and it is much easier to optimize from the get-go!

SSIS Architecture Control Flow Data Flow

Control Flow When first viewing a package, the control flow gives a view of what's supposed to happen: tasks (such as a Data Flow task, Script task, Send Mail, FTP), precedence constraints, and containers. It runs on a separate engine from the Data Flow engine.

Example Control Flow

Data Flow The Data Flow task encapsulates the data flow engine that moves data between sources and destinations, and lets the user transform, clean, and modify data as it is moved. Microsoft Books Online: https://docs.microsoft.com/en-us/sql/integration-services/control-flow/data-flow-task

Typical Data Flow Task Sources: extract data. Transformations: modify, route, cleanse, and summarize data. Destinations: load data.

Lots of options

Sources The starting point: flat files, XML files, databases, etc. The data is converted into buffers.

Transformations Transformations take place in memory, or should take place in memory if possible. The first transformation can start before all data is loaded into memory. The best-performing transformations modify, or "transform", data that already exists inside a buffer.

Destinations The destination then connects to the target file or database and empties the buffers as it loads the target with the transformed data.

Data Flow Engine Buffer: a physical piece of memory that contains rows and columns. SSIS Buffer Manager: determines all the buffers that are needed before the package is executed, and also controls and manages their creation. Pipeline: the part of the Data Flow Engine that uses buffers to manipulate data in memory.

What if there's A LOT of data? Large sources: with many rows in the source, the Data Flow task can read data from the source component, transform data, and start loading data into the target all at the same time. This means we can have multiple buffers in use at the same time by different components of the data flow task, depending on how we manage the buffers throughout the pipeline.

Buffer Settings and the Buffer Manager The Buffer Manager computes the number of bytes per row from the source. DefaultBufferMaxRows defaults to 10,000 rows. DefaultBufferSize defaults to 10 MB, with a maximum of 2 GB (as of 2016). Whichever limit is reached first caps the buffer.

Example Source Row SSIS data type conversions: DT_STR = 10 bytes; DT_Date = 3 bytes; DT_STR = 20 bytes; DT_UI4 = 4 bytes; DT_CY = 8 bytes. Total row size: 79 bytes.

A look inside the buffer… EstimatedRowSize * DefaultBufferMaxRows = EstimatedBufferSize. Our example (using the defaults): 79 bytes * 10,000 = 790 KB. 790 KB < 10 MB, therefore we can fit many more rows than the default (10,000) into a buffer.
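The buffer-fit check above can be sketched as a few lines of arithmetic (the constant and function names below are illustrative, not actual SSIS internals):

```python
# Defaults from the Data Flow task (names here are illustrative).
DEFAULT_BUFFER_MAX_ROWS = 10_000
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024  # 10 MB

def estimated_buffer_size(row_size_bytes, max_rows=DEFAULT_BUFFER_MAX_ROWS):
    # EstimatedRowSize * DefaultBufferMaxRows = EstimatedBufferSize
    return row_size_bytes * max_rows

size = estimated_buffer_size(79)   # the 79-byte example row
print(size)                        # 790,000 bytes (~790 KB)
print(size < DEFAULT_BUFFER_SIZE)  # True: well under the 10 MB cap
```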

sp_spaceused <table name> bytes = 42,112 KB * 1,024 = 43,122,688 bytes per row = 43,122,688 / 500,000 rows ≈ 86 bytes per row
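The per-row estimate from the sp_spaceused output can be checked with quick arithmetic (the figures below are the slide's numbers):

```python
reserved_kb = 42_112              # data size reported by sp_spaceused, in KB
row_count = 500_000               # rows in the table

total_bytes = reserved_kb * 1024  # convert KB to bytes
print(total_bytes)                # 43,122,688
print(total_bytes // row_count)   # 86 bytes per row
```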

DefaultBufferMaxRows DefaultBufferSize / row size = DefaultBufferMaxRows: 10 * 1,024 * 1,024 / 86 = 121,927
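Working backwards, the measured row size gives the number of rows that actually fit in one default-sized buffer (a small sketch; the function name is illustrative):

```python
def rows_per_buffer(row_size_bytes, buffer_size_bytes=10 * 1024 * 1024):
    # DefaultBufferSize / row size = a better DefaultBufferMaxRows
    return buffer_size_bytes // row_size_bytes

print(rows_per_buffer(86))  # 121,927 rows instead of the default 10,000
```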

New for SQL Server 2016

BufferSizeTuning Log Event

BLOBs are an exception BLOB data types do not fit into buffers because they do not have a fixed width. Instead, a 'descriptor' of the BLOB is stored in the buffer that describes the BLOB's location. The %TEMP% and %TMP% environment variables are the default spill locations (the C:\ drive). VARCHAR(MAX) and NVARCHAR(MAX) count as BLOB types.

BLOB & Buffer Paths The BLOB path is used for BLOB, text, image, etc.; the buffer path is used for anything else.

Buffers Spooled

Transformations Affect Buffers Transformations used by SSIS affect the creation and performance of buffers, thereby affecting the efficiency of your package. Transformations fall into two main categories: synchronous and asynchronous.

Synchronous Transformations Output is synchronous with input. A synchronous transformation does not need information about any other rows in the data set. Rows entering = rows exiting the transformation, i.e. the record count stays the same. It can usually operate on the same buffer, and usually performs better than an asynchronous transformation. Streaming (non-blocking) transformations are typically synchronous.

Asynchronous Transformations An asynchronous transformation does not process rows independently in the dataset. Rather than output rows as they are processed, it must output data asynchronously, at a different time. Record counts usually change from input to output. It must create a new buffer for its output, and generally performs worse than a synchronous transformation. Typically a semi-blocking or fully blocking transformation.
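A rough analogy in plain Python (not SSIS code): a synchronous transformation can emit each row as it arrives, while an asynchronous, fully blocking one such as a sort must consume every input row before emitting the first output row.

```python
def derived_column(rows):
    # Synchronous / streaming: each output row depends only on its input row,
    # so rows can be yielded (and the buffer reused) as they arrive.
    for row in rows:
        yield {**row, "total": row["qty"] * row["price"]}

def sort_rows(rows, key):
    # Asynchronous / fully blocking: sorted() must read the whole input
    # before it can produce even one output row.
    return sorted(rows, key=key)

rows = [{"qty": 2, "price": 5}, {"qty": 1, "price": 3}]
print(list(derived_column(rows)))              # totals computed row by row
print(sort_rows(rows, key=lambda r: r["price"]))
```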

Different Types of Transformations SSIS transformations fall into several types: non-blocking (streaming), partially blocking (semi-blocking), fully blocking, and non-blocking (row by row). In this case the term "blocking" has nothing to do with concurrency as in the database engine.

Streaming Transformations

Non-Blocking Streaming Transformations Generally synchronous. The logic inside the transformation does not impede the data flow from moving on to the next transformation in the pipeline. These are the fastest of all transformations because they operate on data inside an already created buffer: they update the buffer in place, and the data is then passed on to the destination.

Non-Blocking Streaming Transformations Audit Cache Transform Character Map Conditional Split Copy Column Data Conversion Derived Column Lookup (full cache) Multicast Percent Sampling Row Count

Partially Blocking Transformations

Partially Blocking Transformations Generally asynchronous. Also known as "semi-blocking" transformations. These transformations might prevent, or "block", buffers from being sent down the pipeline because they are waiting on another action. Their output creates a new buffer, unlike non-blocking streaming transformations.

Partially Blocking Transformations Data Mining Query Merge Merge Join Pivot Term Lookup Unpivot Union All

Fully Blocking Transformations

Fully Blocking Transformations Asynchronous. The slowest performing, most memory-intensive transformations. They cannot let anything pass through the pipeline until they are finished: all input rows must be read before the transformation logic can be applied and a single output row created. The pipeline is stopped at these transformations, and new buffers are always created. They can potentially use up all memory, and data can then spill to disk.

Fully Blocking Transformations Aggregate Fuzzy Grouping Fuzzy Lookup Row Sampling Sort Term Extraction

Non-Blocking Row-by-Row Transformations Considered non-blocking or streaming because data can still flow to the next transformation. Rows flowing through these transformations are processed one by one because they have to interact with an outside process, such as a database, file, or another component. Buffers can get held up.

Non-Blocking Row-by-Row Transformation Examples DQS Cleansing, Export Column, Import Column, Lookup (non-cached or partially cached), OLE DB Command, Script (if it's calling an outside process), SCD (each row does a lookup against the dimension table).

Lookup Transformation Can be a streaming transformation (full cache). Beware of no-cache or partial-cache modes, or full cache with a very large reference data set.

Alternatives to Blocking Transformations Optimize at the source if you can. Choose streaming transformations over partially or fully blocking transformations whenever possible. T-SQL can sometimes replace fully blocking transformations: Aggregate = T-SQL SUM or AVG; Sort = T-SQL ORDER BY plus setting the IsSorted property on the source (Demo 6).

RunInOptimizedMode A setting on the Data Flow task. Ignores unused columns downstream, though it still has to figure out what is and is not being used.

PipelineInitialization Logging Event

Destinations "Table or view – fast load" uses bulk insert. Maximum insert commit size: commits rows in batches that are the smaller of the maximum insert commit size or the remaining rows in the buffer currently being processed. 0 = one big commit.

MaxConcurrentExecutables Inside the Control Flow, possibly use containers with multiple Data Flows. The default of -1 means (# of cores + 2), e.g. a 4-core server can run 6 executables in parallel. If CPU usage is low, possibly increase it to have more Data Flows run in parallel.
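The default value of -1 resolves as described above; a trivial sketch of that rule (the function name is illustrative):

```python
def max_concurrent_executables(setting, core_count):
    # -1 means "number of cores + 2"; any other value is taken literally.
    return core_count + 2 if setting == -1 else setting

print(max_concurrent_executables(-1, 4))  # 6 on a 4-core server
```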

EngineThreads A property on each Data Flow task. Two types of threads: source threads, and worker threads (transformations and destinations).

Embrace but Be Cautious Too many things running in parallel may not be a good thing. Ensure there are enough resources. Stagger ETL jobs. Use precedence constraints where appropriate.

More Optimization Loading flat files as fast as possible is especially good for staging. Use the "Fast Parse" option (if the data is not locale-sensitive). Check the advanced properties of the source to change the field lengths; the Suggest Types button might help.

Partition with Conditional Split Inside the Data Flow, create multiple streams. Great for staging data into equal-sized partitions or files. The modulo function can be used if the primary key is an incremental value such as an IDENTITY, or if there is a counter or number column. Also see the Balanced Data Distributor.
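A modulo split over an incrementing key can be sketched in Python; a Conditional Split would do the same with expressions like ID % 3 == 0, ID % 3 == 1, ID % 3 == 2 (the function and column names below are illustrative):

```python
def modulo_split(rows, n_streams, key=lambda r: r["id"]):
    # Route each row to stream (key % n): this yields roughly equal-sized
    # partitions when the key is an incrementing IDENTITY-style value.
    streams = [[] for _ in range(n_streams)]
    for row in rows:
        streams[key(row) % n_streams].append(row)
    return streams

rows = [{"id": i} for i in range(10)]
print([len(s) for s in modulo_split(rows, 3)])  # [4, 3, 3]
```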

SSIS Catalog catalog.execution_component_phases catalog.execution_data_statistics

Key Takeaways A little bit of SSIS internals is a good thing! Optimize your sources as best you can. Understand transformations. Optimize your destinations as best you can. Embrace parallelism whenever possible.

Sources and References for this Presentation

Sources and References for this Presentation Advanced Integration Services, Stacia Misner Varga

Sources and References for this Presentation - Links
Books Online: https://docs.microsoft.com/en-us/sql/integration-services/control-flow/data-flow-task and https://docs.microsoft.com/en-us/sql/integration-services/data-flow/data-flow-performance-features
SQL Server CAT Team, Top 10 SQL Server Integration Services Best Practices: http://blogs.msdn.com/b/sqlcat/archive/2013/09/16/top-10-sql-server-integration-services-best-practices.aspx
Simple Talk: https://www.simple-talk.com/sql/ssis/working-with-precedence-constraints-in-sql-server-integration-services/