Data Warehouse ETL
By Garrett Edmondson

Garrett Edmondson, MCITP, MCSE
GEdmondson@SolidQ.com
Blog: http://garrettedmondson.wordpress.com
More videos: http://www.youtube.com/garrettedmondson
(c) 2011 Microsoft. All rights reserved.

ETL = Move Data Over the Network
- The network is the slowest part of any data warehouse!
- Minimize transformations: load data from the source(s) as fast as possible
- Incremental loads: pull the least amount of data possible
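The incremental-load point can be sketched with a simple high-water mark. This is only an illustration, not the presenter's method: the `sales` table, its `modified_at` column, and the `pull_incremental` helper are hypothetical, and SQLite stands in for the source system so the example is self-contained.

```python
import sqlite3

def pull_incremental(conn, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark."""
    rows = conn.execute(
        "SELECT id, amount, modified_at FROM sales "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_watermark,),
    ).fetchall()
    # If nothing changed, keep the old watermark for the next run.
    new_watermark = rows[-1][2] if rows else last_watermark
    return rows, new_watermark

# Hypothetical source table standing in for an OLTP system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, amount REAL, modified_at TEXT)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        (1, 10.0, "2011-01-01T00:00:00"),
        (2, 20.0, "2011-01-02T00:00:00"),
        (3, 30.0, "2011-01-03T00:00:00"),
    ],
)

# The previous ETL run stopped at Jan 1, so only rows 2 and 3 cross the wire.
rows, watermark = pull_incremental(conn, "2011-01-01T00:00:00")
```

Storing the returned watermark after each successful run is what keeps the next pull small.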

ETL Types
OLTP ETL-ish
- Mirroring, or Database Availability Groups in SQL Server 2012
- Replication (Transactional, Merge, Peer-to-Peer)
- Log Shipping
Data Warehouse ETL
- Data state
- Change Data Capture / incremental
- Integration Services (SSIS)
- Compress and BCP flat files

OLTP Based ETL-ish
- Easy to set up
- DBAs are very familiar with replication technologies
- 3rd NF (typically non-dimensional)
- Transactional consistency
- Replay transactions on the "Data Warehouse" server and support reporting queries
- Scalability issues: hundreds of sources/instances?!
- Can be used for "real-time" data warehousing
- Be very careful! See above

Data Warehouse ETL
- Load pattern: typically daily ETL loads
- Load data state: no need to replay DML statements
- Change Data Capture (CDC)
  - Transaction log reader for DW workloads
  - Convert LSN to a datetime stamp
  - Net changes since the last ETL run, i.e. row version
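To illustrate "load data state" versus "replay DML", here is a minimal sketch that collapses an ordered change log into one net change per key. It is not how SQL Server CDC is implemented; the log layout `(lsn, op, key, row)` and the `net_changes` function are assumptions made for the example.

```python
def net_changes(change_log):
    """Collapse an ordered change log into one net change per key.

    change_log: iterable of (lsn, op, key, row) with op in {"I", "U", "D"}.
    Returns {key: (op, row)}: the state to apply to the warehouse.
    """
    net = {}
    for lsn, op, key, row in sorted(change_log, key=lambda c: c[0]):
        previous = net.get(key)
        if op == "D":
            if previous and previous[0] == "I":
                del net[key]  # inserted and deleted inside the same window
            else:
                net[key] = ("D", None)
        elif op == "I":
            # A re-insert after a delete in the window nets out to an update.
            net[key] = ("U", row) if previous and previous[0] == "D" else ("I", row)
        else:
            # An update keeps "I" if the row was also inserted in this window.
            first_op = "I" if previous and previous[0] == "I" else "U"
            net[key] = (first_op, row)
    return net

log = [
    (1, "I", 100, {"qty": 1}),
    (2, "U", 100, {"qty": 5}),
    (3, "U", 200, {"qty": 7}),
    (4, "I", 300, {"qty": 2}),
    (5, "D", 300, None),
]
result = net_changes(log)
```

Five logged operations reduce to two rows to load: key 100 arrives once with its final values, and key 300 never reaches the warehouse at all.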

Data Processing with SSIS - Transformations

Transforms
Row Based (synchronous)
- Logically works row by row
- No memory is copied
- Buffer is reused
Partially Blocking (asynchronous)
- Works with groups of rows
- Memory is copied
- Shape of the buffer can change
Blocking (asynchronous)
- Needs all input rows from all buffers before producing any output rows
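The difference between row-based and blocking transforms can be shown with plain Python generators. This is an analogy only, not SSIS code: `source`, `derived_column`, and `sort_rows` are hypothetical stand-ins for a data source, a synchronous transform, and a blocking transform.

```python
def source(rows, consumed):
    """Yield rows one at a time, recording each read in `consumed`."""
    for row in rows:
        consumed.append(row)
        yield row

def derived_column(rows):
    """Row-based (synchronous): each input row produces an output row at once."""
    for row in rows:
        yield row * 2

def sort_rows(rows):
    """Blocking (asynchronous): reads every input row before emitting any."""
    yield from sorted(rows)

data = [3, 1, 2]

consumed_sync = []
first_sync = next(derived_column(source(data, consumed_sync)))
# Only one input row has been read when the first output row appears.

consumed_blocking = []
first_blocking = next(sort_rows(source(data, consumed_blocking)))
# All three input rows were read before the first output row appeared.
```

The blocking transform has to hold the whole input before it can answer, which is why its cost grows with row count.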

Asynchronous Data Processing in SSIS: Aggregation Demo

Asynchronous Processing in SSIS = Linear Performance
- Each fully blocking asynchronous component must spool all the rows
- More rows = longer processing time
- No way to process the rows faster
- No DB engine optimizations (query engine, statistics, compression, columnstore indexes)
- Procedural processing (do this, then that) versus relational, declarative processing (give me that)
- Good for VM/SAN solutions as long as processing times are acceptable

<rant> NEVER use the OLE DB Command for data warehouse batch load processes. It is pure evil because it executes DML commands on a row-by-row basis. Good luck loading a lot of rows with that! </rant>
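A rough sketch of why row-by-row DML hurts: the same update expressed as one statement per row versus one set-based statement. SQLite stands in for the database engine, and the `fact` and `stage` tables are hypothetical; the statement counts, not the timings, are the point.

```python
import sqlite3

def make_db():
    """Create a fact table and a staging table holding 1,000 new amounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fact (id INTEGER PRIMARY KEY, amount REAL)")
    conn.execute("CREATE TABLE stage (id INTEGER PRIMARY KEY, amount REAL)")
    conn.executemany("INSERT INTO fact VALUES (?, ?)", [(i, 0.0) for i in range(1000)])
    conn.executemany("INSERT INTO stage VALUES (?, ?)", [(i, i * 1.5) for i in range(1000)])
    return conn

def update_row_by_row(conn):
    """One UPDATE statement per staged row."""
    statements = 0
    for row_id, amount in conn.execute("SELECT id, amount FROM stage").fetchall():
        conn.execute("UPDATE fact SET amount = ? WHERE id = ?", (amount, row_id))
        statements += 1
    return statements

def update_set_based(conn):
    """One UPDATE statement for the whole staged batch."""
    conn.execute(
        "UPDATE fact SET amount = "
        "(SELECT s.amount FROM stage s WHERE s.id = fact.id) "
        "WHERE id IN (SELECT id FROM stage)"
    )
    return 1

db_a, db_b = make_db(), make_db()
row_statements = update_row_by_row(db_a)
set_statements = update_set_based(db_b)
result_a = db_a.execute("SELECT id, amount FROM fact ORDER BY id").fetchall()
result_b = db_b.execute("SELECT id, amount FROM fact ORDER BY id").fetchall()
```

Both approaches leave the table in the same state, but the first sends 1,000 statements to the engine and the second sends one.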

Fact Data: ELT
ETL: Extract, Transform, Load
- Extract (SSIS)
- Transform (SSIS)
- Load to DB engine
ELT: Extract, Load, Transform
- Transform with the DB engine
ELT advantages
- Asynchronous (blocking) transforms are much faster
- Join optimization
- SQL Server DBE query engine
- Fastest loads with flat files (PDW dwloader)

SSIS: The Right Tool for the Job
Well suited:
- Multiple data sources and destinations
- Synchronous transformations
- Workflow management
Be careful:
- Trickle feed
- Real-time ETL
- Asynchronous transformations
Consider something else:
- Single source and destination server
- BULK INSERT works just fine

Once your operations are defined, you need to figure out if SSIS really is the right tool for the job. First let's consider what SSIS is best suited for. SSIS is a great choice if you're pulling in data from multiple sources, or splitting it up and sending it to a number of places. It's also good if your data needs to go through a series of transforms, or you're merging multiple sources of data. Finally, the package designer in BIDS lets you visually lay out your workflow, which for a lot of people is easier than doing everything directly inside of stored procedures with SQL.

You'll want to be careful about using SSIS if your design requires you to do trickle feed or real-time ETL type operations. SSIS can do them, but it was really designed for bulk data loads. Our data pipeline is really fast, but the runtime that loads and hosts it can be slow to start up at times. When you're moving large amounts of data, you don't notice this startup cost, but you will if you're running your package every 15 seconds, or only moving a row or two at a time.

On one of the first big customer issues I worked on, the customer had a single set of packages to do their bulk loads and their incremental feeds. I say incremental, but it was more like a trickle feed: they had some web process that would kick off all of the packages for one to five rows of customer data. Their solution was big, too, something like 30 packages. They'd run through these really complex data flows that worked great when they were moving their entire data set, but seemed to take forever to run through just a couple of rows.

Finally, there are a couple of reasons you'll want to use something other than SSIS. If your source and destination databases are on the same server, you'll probably want to do everything using SQL. Otherwise you'll be copying all of the data out to SSIS, and then pushing it all back in. It's usually way more efficient to just process it directly on the server at that point.

Another reason not to use SSIS is if you're doing a straight file-to-database load, or sometimes even a database-to-database load, without applying any transformations or control flow type logic to it. You can do it with SSIS, and if you want the graphical design experience, it'll still work, but you can get the same performance, or maybe even a little better, if you use a BULK INSERT statement, or BCP.

Summary
- 90% of customers will hit their performance goals with the correct package design
- Most tuning and optimization will be done at the database and environment level

FlatFile ELT
- Data compression: the most efficient way to transfer data over the wire
- FlatFile demo
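As a small illustration of compressing a flat file before it crosses the network, the sketch below builds a synthetic CSV extract in memory and gzips it. The column layout and the `build_flat_file` helper are hypothetical, and the synthetic data is very repetitive, so real extracts will likely compress less well.

```python
import csv
import gzip
import io

def build_flat_file(row_count):
    """Build a synthetic CSV extract in memory and return it as bytes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for i in range(row_count):
        writer.writerow([i, "2011-01-01", "Store-%03d" % (i % 10), "19.99"])
    return buffer.getvalue().encode("utf-8")

raw = build_flat_file(10_000)
compressed = gzip.compress(raw)           # what would travel over the wire
ratio = len(compressed) / len(raw)        # fraction of the original size
restored = gzip.decompress(compressed)    # what the target server would load
```

The trade is CPU time on both ends for fewer bytes on the network, which fits the deck's premise that the network is the slowest part.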

Partition Switching

Partition Switching Pattern 1
[Diagram: several source DBs feed an SSIS surrogate key lookup, which loads STG Fact (partitioned heap) on Filegroup A in the target DB; the staging data is then switched into Fact (partitioned, CSI).]
- Concurrent bulk inserts = # cores
- Create indexes with SORT_IN_TEMPDB and MAXDOP = 1

Partition Switching Demo: Table Partitioning Demo.sql

Partition Switching Pattern - FlatFiles (Fastest)
[Diagram: source data files are bulk inserted into STG Fact (partitioned heap) on Filegroup A in the target DB; the staging data is then switched into Fact (partitioned, CL/CI).]
- Concurrent bulk inserts = # cores
- Create indexes with SORT_IN_TEMPDB and MAXDOP = 1

Partitioning Fact Tables
- See the Data Loading Performance Guide from the SQLCAT team
- Minimally logged operations are key
- Best practices
  - Remove indexes (empty tables), or use the ORDER hint to load sorted data
  - Use TABLOCK
  - Insert in parallel
  - Scales linearly to 16 streams if you're not IO bound
- SQL 2012: 15,000 partitions
- Switching on the same filegroup
- Target partition must be empty
- Load the staging table with BULK INSERT with TABLOCK
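The "BULK INSERT with TABLOCK" and ORDER hint advice could be scripted along these lines. This is a sketch that only builds the T-SQL text: the `bulk_insert_sql` helper, the table name, the file path, and the `DateKey` column are all hypothetical, and the table name is not escaped, so it should come from trusted configuration. Whether a load is actually minimally logged also depends on the recovery model and the state of the target table.

```python
def bulk_insert_sql(table, data_file, order_by=None):
    """Build a T-SQL BULK INSERT statement that requests a table lock."""
    options = ["TABLOCK", "FIELDTERMINATOR = ','", "ROWTERMINATOR = '\\n'"]
    if order_by:
        # ORDER tells the engine the file is already sorted on these columns.
        options.append("ORDER (%s)" % ", ".join(order_by))
    path = data_file.replace("'", "''")  # escape quotes inside the path literal
    return "BULK INSERT %s FROM '%s' WITH (%s);" % (table, path, ", ".join(options))

# Hypothetical staging table and file for the 2001 partition.
statement = bulk_insert_sql(
    "dbo.stgFact_2001", r"C:\loads\fact_2001.csv", ["DateKey ASC"]
)
```

Generating one such statement per source file makes it straightforward to run several loads in parallel, one per staging table.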

[Diagram: four-step partitioned load into the target database on an 8 core server]
- Step 1 "Base Load": 8 source data files, 8 concurrent bulk inserts into the base heap stage table (filegroup "Stage A")
- Step 2 "Stage Insert": 8 concurrent inserts into 8 heap stage tables ("Stage B"), with a constraint on the CI partition key
- Step 3 "Transform": create clustered index with compression INTO the "Final Destination"; 2 sets, 4 concurrent create CI operations; 2 CI stage tables
- Step 4 "Final Append": 8 concurrent partition switches into the destination partitioned CI table (partitions 1 to 8)
- Filegroup "A": partitions 1,2. Filegroup "B": partitions 3,4. Filegroup "C": partitions 5,6. Filegroup "D": partitions 7,8

Notes
- Don't run more than 1 create CI per filegroup, to avoid page splits.
- Determine the number of filegroups and partitions per filegroup by examining available memory and CPU cores.
- 4 clustered index builds at a time = 4 x 40 GB of memory for sorts: 160 GB of the 192 GB available.
- CPU: the general rule is to run with half the number of physical cores if compression is being used, and parity with physical cores if no compression is used.
- Since we can run up to 4 partitions at a time, 4 independent filegroups is ideal for a table of this size on a system with this much memory and this many CPU cores.
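The sizing arithmetic on this slide can be written down as a small helper. This is only a sketch of the two rules of thumb stated above (memory per sort, and half the physical cores when compression is used); the function name and its parameters are hypothetical.

```python
def max_concurrent_index_builds(total_memory_gb, sort_memory_gb,
                                physical_cores, compression):
    """Apply the slide's two rules of thumb and return the tighter limit."""
    # Memory rule: each clustered index build needs its own sort memory.
    by_memory = total_memory_gb // sort_memory_gb
    # CPU rule: half the physical cores with compression, parity without.
    by_cpu = physical_cores // 2 if compression else physical_cores
    return max(1, min(by_memory, by_cpu))

# The slide's numbers: 192 GB of memory, 40 GB per sort, 8 cores, compression on.
builds = max_concurrent_index_builds(192, 40, 8, compression=True)
memory_used_gb = builds * 40
```

With at most one index build per filegroup, the number of concurrent builds is also the number of independent filegroups to plan for, which is where the slide's four filegroups come from.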

Bulk Insert
[Diagram: Sales data for 2001, 2002, 2003 and 2004 is loaded with BULK INSERT into one staging table per year (stgFact_2001, stgFact_2002, stgFact_2003, stgFact_2004); each staging table is then moved with SWITCH into the matching partition of Fact.]
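The switch step in this diagram could be scripted roughly as follows. The sketch only generates ALTER TABLE ... SWITCH statements; the `switch_statements` helper and the mapping of partition numbers 1 to 4 onto the years 2001 to 2004 are assumptions for illustration. In practice each staging table also has to match the target's schema, sit on the same filegroup, and carry a check constraint matching the partition range, and the target partition must be empty.

```python
def switch_statements(fact_table, staging_by_partition):
    """Build one ALTER TABLE ... SWITCH statement per staging table.

    staging_by_partition maps a target partition number to the staging
    table that holds that partition's rows.
    """
    return [
        "ALTER TABLE %s SWITCH TO %s PARTITION %d;" % (staging, fact_table, partition)
        for partition, staging in sorted(staging_by_partition.items())
    ]

# Hypothetical mapping: partitions 1-4 hold the years 2001-2004.
staging = {
    number: "dbo.stgFact_%d" % year
    for number, year in enumerate(range(2001, 2005), start=1)
}
statements = switch_statements("dbo.Fact", staging)
```

Because a switch is a metadata operation, the expensive work stays in the parallel bulk inserts into the staging tables.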

Large Updates
[Diagram: Fact is partitioned by year (2001, 2002, 2003, 2004). Update records are loaded with BULK INSERT into Fact_Delta; the update is applied to produce Fact_New; the affected partition is moved with SWITCH out of Fact into Fact_Old, and Fact_New is moved with SWITCH into Fact.]

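Large Deletes
[Diagram: Fact is partitioned by year (2001, 2002, 2003, 2004). The 2001 partition is moved with SWITCH out of Fact into Fact_Temp (2001); the rows to keep are loaded with BULK INSERT into Fact_Temp (2001 Filtered), which is then moved with SWITCH back into Fact as the filtered 2001 partition.]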