Azure SQL Data Warehouse Performance Tuning

Presentation transcript:

Azure SQL Data Warehouse Performance Tuning Simon Facer Microsoft PFE

Simon Facer
- Microsoft PFE since 2011
- SQL Server since 1995 – version 4.21 (Sybase System 12)
- APS since 2013
- ADW since 2016

Out of Scope
What we aren’t going to talk about (in detail) …
- Databricks
- PolyBase
- Optimized for Elasticity
- Adaptive Caching
- HDInsight
- Azure Data Lake
- Azure Data Factory
- Azure Analytics
- Power BI
- Optimized for Compute / Gen 2

In Scope (aka the Agenda)
- Azure SQL DW Overview
- Table Basics
- Capturing Query Data
- Common Design and Performance Issues

What is MPP?
Massively Parallel Processing
(Diagram: a Control node and several Compute nodes, each Compute node shown with its own block of data.)

What is Azure Data Warehouse?
Massively Parallel Processing
(Diagram: Storage holding the data in separate blocks, labelled (60), attached to a set of Compute nodes.)
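
The node and distribution counts behind the diagram can be checked from the catalog. This is a minimal sketch, assuming a connection to the data warehouse itself; the column aliases are illustrative.

```sql
-- Number of distributions in this data warehouse
-- (the diagram above labels the storage blocks "(60)").
SELECT COUNT(*) AS distribution_count
FROM sys.pdw_distributions;

-- Nodes currently serving the data warehouse, grouped by node type.
SELECT [type], COUNT(*) AS node_count
FROM sys.dm_pdw_nodes
GROUP BY [type];
```

The distribution count stays the same when the service level changes; what changes is how many Compute nodes share those distributions.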

Table use cases
Fact Tables
- Millions / billions of rows
- Aggregatable data
- Goal: fast scan of millions or billions of rows (SCAN)
Dimension Tables
- Attribute data
- Goal: fast read of specific rows (SEEK)
Stage Tables
- Data sink
- Goal: fast write

Table geometries
Hash Distributed Tables
- Data is distributed based on the hash of the distribution key value
- 60 distributions
- Fact tables
Round-Robin Distributed Tables
- Data is distributed evenly across all distributions
- Staging tables
Replicated Tables
- Copy on each Compute node
- Dimension tables < 2 GB
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute

Hash Distributed:
- The hashing algorithm is deterministic: value ‘x’ will always go to the same distribution.
- All tables use the same hashing algorithm, so value ‘x’ in the hash key for table tbl_SalesHdr will be in the same distribution as value ‘x’ in the hash key for table tbl_SalesDtl.
- The hashing algorithm is data-type agnostic: it is based on the bytes of the field, not the value. The results are counter-intuitive: value 1 in an INT field will hash differently to value 1 in a BIGINT field, because the BIGINT field has more bytes than the INT field.
- May be a good candidate for tables with frequent DML operations.

Round-Robin Distributed:
- Rows are distributed evenly across all 60 distributions.
- Data is not assigned to distributions in any deterministic pattern.
- Queries against round-robin tables will almost always incur data movement.
- Can be a good candidate when:
  - there is no obvious JOIN key
  - there is no good candidate HASH key
  - the table does not share a common join key with other tables
  - the table is a temporary staging table

Replicated Tables:
- Data for replicated tables is copied to each Compute node.
- Should be < 2 GB in size.
- After scaling the DW, the table needs to be re-initialized on the Compute nodes.
- After updates, the table needs to be re-initialized on the Compute nodes.
- Query the DMV sys.pdw_replicated_table_cache_state to identify tables that need to be re-initialized:

    -- Code Start:
    SELECT [ReplicatedTable] = t.[name]
    FROM sys.tables t
    JOIN sys.pdw_replicated_table_cache_state c
        ON c.object_id = t.object_id
    JOIN sys.pdw_table_distribution_properties p
        ON p.object_id = t.object_id
    WHERE c.[state] = 'NotReady'
      AND p.[distribution_policy_desc] = 'REPLICATE'

    -- Re-initialize table with:
    SELECT TOP 1 * FROM [ReplicatedTable]

- Re-initialization incurs a table-level EXCLUSIVE lock.
- Re-initialization also rebuilds all indexes on the table.
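
The three geometries map onto the DISTRIBUTION option of CREATE TABLE. The sketch below is illustrative only: tbl_SalesHdr and tbl_SalesDtl are the names used on the slide, while stg_Sales, dim_Store and every column definition are hypothetical.

```sql
-- Hash distributed: both tables hash on OrderID, declared with the SAME data type,
-- so matching OrderID values land in the same distribution.
CREATE TABLE dbo.tbl_SalesHdr
(
    OrderID   BIGINT NOT NULL,
    OrderDate DATE   NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderID));

CREATE TABLE dbo.tbl_SalesDtl
(
    OrderID  BIGINT      NOT NULL,  -- BIGINT here too; INT would hash differently
    SKU      VARCHAR(20) NOT NULL,
    Quantity INT         NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderID));

-- Round-robin: no distribution key, rows are spread evenly (staging data).
CREATE TABLE dbo.stg_Sales
(
    OrderID  BIGINT      NULL,
    SKU      VARCHAR(20) NULL,
    Quantity INT         NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);

-- Replicated: a full copy on each Compute node (small dimension table).
CREATE TABLE dbo.dim_Store
(
    StoreKey  INT          NOT NULL,
    StoreName NVARCHAR(50) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);
```

No index option is given, so each table gets the default storage, a clustered columnstore index.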

Storage Options
Rowstore – Clustered Index
- Indexed / ordered data
- Indexes get fragmented over time
- Data insert ordered on Cluster Key: new rows are appended to the end of the table – fast load performance
- Data insert not ordered on Cluster Key: new rows inserted into existing pages result in Page Splits – poor load performance
- Index maintenance on DML – overhead on data load
- Good lookup performance
- Ideal for limited range scans & singleton selects (Seeks)
- Slower for table scans / partition scans / loading

Storage Options
Rowstore – Heap
- No clustered index / unordered data
- New rows appended to the end of the table – fast load performance
- Whole table is / may be read for lookups (Seeks)
- Whole table is read for Scans
- Bad read performance

Storage Options
Clustered ColumnStore Index (CCI)
- Highly compressed – IO efficient
- Compression up to 15x (vs. RowStore up to 3.5x)
- Load performance dependent on batch size
- Lookup (seek) queries perform badly
- Scan queries – optimized!
- Query performance depends on CCI quality / health
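
Each storage option is chosen in the WITH clause of CREATE TABLE. This is a sketch under assumptions: the table and column names are hypothetical, and pairing each storage option with a table role follows the scan / seek / write goals described earlier in the deck rather than any fixed rule.

```sql
-- Clustered columnstore index: suited to large scans and aggregations.
CREATE TABLE dbo.fact_Sales_demo
(
    OrderID     BIGINT        NOT NULL,
    ProductKey  INT           NOT NULL,
    SalesAmount DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(OrderID), CLUSTERED COLUMNSTORE INDEX);

-- Rowstore clustered index: suited to seeks and limited range scans.
CREATE TABLE dbo.dim_Product_demo
(
    ProductKey  INT           NOT NULL,
    ProductName NVARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = REPLICATE, CLUSTERED INDEX (ProductKey));

-- Heap: no index to maintain, suited to fast writes into a staging table.
CREATE TABLE dbo.stg_Sales_demo
(
    OrderID     BIGINT        NULL,
    ProductKey  INT           NULL,
    SalesAmount DECIMAL(18,2) NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN, HEAP);
```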

How to capture Query MetaData EXPLAIN Equivalent of SQL Server’s ‘Estimated Execution Plan’
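
A minimal sketch of EXPLAIN, assuming the hypothetical tables dbo.tbl_SalesHdr and dbo.tbl_SalesDtl exist with OrderID and SKU columns. Prefixing a statement with EXPLAIN returns the plan as XML without running the statement.

```sql
-- Returns the estimated distributed plan as XML; the query itself is not executed.
EXPLAIN
SELECT d.SKU, COUNT_BIG(*) AS line_count
FROM dbo.tbl_SalesHdr AS h
JOIN dbo.tbl_SalesDtl AS d
    ON d.OrderID = h.OrderID
GROUP BY d.SKU;
```

Data movement operations in the XML are the first thing to look for, since they show where rows have to be moved between distributions before a join or aggregation can run.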

How to capture Query MetaData DMVs Equivalent of SQL Server’s ‘Actual Execution Plan’
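
One way to follow a query through the DMVs is sketched below. The label 'demo_query' and the request id 'QID12345' are placeholders: the first assumes the query was submitted with OPTION (LABEL = 'demo_query'), the second stands for whatever request_id the first query returns.

```sql
-- 1. Find the request by its label.
SELECT request_id, [status], submit_time, total_elapsed_time, resource_class, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'demo_query'
ORDER BY submit_time DESC;

-- 2. List the steps of the distributed plan for that request.
SELECT step_index, operation_type, location_type, [status],
       row_count, total_elapsed_time, command
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID12345'      -- placeholder: use the id from step 1
ORDER BY step_index;

-- 3. For a data movement step, look at the per-distribution workers.
SELECT pdw_node_id, distribution_id, [type], [status], rows_processed, bytes_processed
FROM sys.dm_pdw_dms_workers
WHERE request_id = 'QID12345'      -- placeholder
  AND step_index = 2               -- placeholder: a data movement step from step 2
ORDER BY rows_processed DESC;
```

Long-running steps and uneven rows_processed across distributions are the usual starting points for the issues covered later in the deck.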

How to capture Query MetaData XML Output shows D-SQL operations:

How to capture Query MetaData Demo … See the ‘Resources’ slide at the end for scripts used in this session.

Common Design and Perf. Issues
- Data Movement
- Data Skew
- Statistics
- CCI Health
- Locking
- Resource Contention

Common Issues – Data Movement
Why does data move?
(Diagram: rows of fact_OrderHeader, keyed by OrderID 1, 2, 3, 4, 7, 8, and rows of fact_OrderDetail, keyed by SKU with the OrderID in parentheses: 999-111-222 (7), 999-222-555 (3), 888-111-222 (1), 999-111-333 (7), spread across Compute #1, Compute #2 and Compute #3.)

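Common Issues – Data Movement
Why does data move?
- Distribution incompatible JOINs
- Distribution incompatible AGGREGATIONs
Example: both tables are hash-distributed on [ProductKey], but the key data types differ, so equal key values hash to different distributions:
- Store_Sales: HASH([ProductKey]), [ProductKey] INT NULL
- Web_Sales: HASH([ProductKey]), [ProductKey] BIGINT NULL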
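
One possible fix for a key type mismatch is to rebuild one of the tables with CTAS so that both distribution keys share a data type. This is a sketch under assumptions: the SalesAmount column and the _new / _old table names are hypothetical, and a real table would list all of its columns.

```sql
-- Rebuild Store_Sales with ProductKey widened to BIGINT, matching Web_Sales.
CREATE TABLE dbo.Store_Sales_new
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT CAST(ProductKey AS BIGINT) AS ProductKey,
       SalesAmount                 -- hypothetical column; list the real columns here
FROM dbo.Store_Sales;

-- Swap the tables once the copy has been verified.
RENAME OBJECT dbo.Store_Sales     TO Store_Sales_old;
RENAME OBJECT dbo.Store_Sales_new TO Store_Sales;
```

With matching key types, a join of the two tables on ProductKey can be resolved inside each distribution.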

Common Issues – Data Movement
Why does data move?
- Distribution incompatible JOINs
- Distribution incompatible AGGREGATIONs
Example:
    SELECT COUNT_BIG(*) FROM [dbo].[FactOnlineSales] GROUP BY [StoreKey];
Incompatibility: FactOnlineSales is distributed by ProductKey, but the query groups by StoreKey
Resolution: re-distribute the data on StoreKey
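
If grouping by StoreKey is the dominant access pattern, the table can be re-distributed on the grouping column with CTAS. This is a sketch, not a recommendation for every workload: the _ByStore and _old names are hypothetical, and changing the distribution key affects every other query that joins or aggregates on ProductKey.

```sql
-- Copy the table, hashing on the column used in GROUP BY.
CREATE TABLE dbo.FactOnlineSales_ByStore
WITH (DISTRIBUTION = HASH(StoreKey), CLUSTERED COLUMNSTORE INDEX)
AS
SELECT *
FROM dbo.FactOnlineSales;

-- Swap the tables once the copy has been verified.
RENAME OBJECT dbo.FactOnlineSales         TO FactOnlineSales_old;
RENAME OBJECT dbo.FactOnlineSales_ByStore TO FactOnlineSales;
```

After the swap, all rows for a given StoreKey sit in one distribution, so the GROUP BY can be computed without shuffling rows first.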

Common Issues – Data Skew
Causes of Data Skew
- Natural skew
- NULL hash key values
- Default hash key value
- Bad hash key choice
Resolution:
- Pick a different hash key
- Split default values into a secondary table
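
Skew can be checked per distribution and per key value. A minimal sketch, assuming a hash-distributed table dbo.FactOnlineSales with ProductKey as its distribution key:

```sql
-- One row per distribution: compare the ROWS column across the 60 distributions.
DBCC PDW_SHOWSPACEUSED('dbo.FactOnlineSales');

-- Most frequent hash key values: a NULL or a default value at the top of this
-- list points to the causes named on the slide.
SELECT TOP 10 ProductKey, COUNT_BIG(*) AS row_count
FROM dbo.FactOnlineSales
GROUP BY ProductKey
ORDER BY row_count DESC;
```

A distribution holding far more rows than the others sets the pace for every query on the table, because each query waits for its slowest distribution.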

Common Issues - Statistics
- The MPP Query Optimizer relies heavily on statistics to evaluate plans
- Out-of-date or non-existent statistics are the most common reason for MPP performance issues!
- Avoid issues with statistics by creating them on all recommended columns and updating them after every load

Common Issues - Statistics
It is recommended that statistics are created on all columns used in:
- Joins
- Predicates
- Aggregations
- Group By’s
- Order By’s
- Computations
Don’t forget about multi-column statistics …
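
The recommendation can be sketched as follows. The table dbo.FactOnlineSales and its StoreKey and ProductKey columns are taken from the earlier example query; the statistics names are hypothetical.

```sql
-- Single-column statistics on a column used in GROUP BY / joins.
CREATE STATISTICS stat_FactOnlineSales_StoreKey
    ON dbo.FactOnlineSales (StoreKey);

-- Multi-column statistics for columns that are filtered or joined together.
CREATE STATISTICS stat_FactOnlineSales_StoreKey_ProductKey
    ON dbo.FactOnlineSales (StoreKey, ProductKey);

-- After each load: refresh every statistics object on the table ...
UPDATE STATISTICS dbo.FactOnlineSales;

-- ... or only one of them, when a full refresh takes too long.
UPDATE STATISTICS dbo.FactOnlineSales (stat_FactOnlineSales_StoreKey);
```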

Common Issues - Statistics
Azure SQL DW now supports automatic creation of column-level statistics (May 10th release)
- Auto Update not supported, yet
- Multi-column stats are not auto created
- Stats creation is synchronous
- Stats creation is triggered by: SELECT, INSERT-SELECT, CTAS, UPDATE, DELETE, EXPLAIN
https://azure.microsoft.com/en-us/blog/sql-dw-now-supports-automatic-creation-of-statistics/
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-statistics
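
A sketch of how to check and enable the setting, and how to see which statistics have gone stale given that updates are not automatic. MyDataWarehouse is a placeholder database name.

```sql
-- Is automatic statistics creation enabled?
SELECT name, is_auto_create_stats_on
FROM sys.databases;

-- Enable it (placeholder database name).
ALTER DATABASE MyDataWarehouse SET AUTO_CREATE_STATISTICS ON;

-- When was each statistics object last updated? Oldest first.
SELECT sm.name AS schema_name,
       tb.name AS table_name,
       st.name AS stats_name,
       st.auto_created,
       STATS_DATE(st.object_id, st.stats_id) AS stats_last_updated
FROM sys.stats   AS st
JOIN sys.tables  AS tb ON tb.object_id = st.object_id
JOIN sys.schemas AS sm ON sm.schema_id = tb.schema_id
WHERE st.user_created = 1 OR st.auto_created = 1
ORDER BY stats_last_updated;
```

Comparing stats_last_updated with the time of the latest load shows which tables still need UPDATE STATISTICS.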

Common Issues – CCI Health
Clustered ColumnStore Indexes
- Better performance with > 100K rows / compressed Row Group
- Best performance with 1,048,576 rows / compressed Row Group
- Deleted rows impact performance
- Open Row Groups (Delta Store) are HEAPs
- Loading batches of > 100K rows / distribution go direct to compressed format
- Small Resource Class: memory pressure can limit the size of compressed Row Groups
- Compressing a Row Group requires memory:
  72 MB + (r * c * 8) + (r * short str col * 32) + (long str col * 16 MB)
  where r = rows, c = columns, short str ≤ 32 bytes, long str > 32 bytes
- Distributed tables have 60 sets of Row Groups
  - Recommended ≥ 60M rows (1M / distribution)
  - Each distribution has its own Delta Store
- Partitions add CCIs / distribution
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-memory-optimizations-for-columnstore-compression
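
The memory formula and the row group targets can be tried out as follows. This is a sketch under assumptions: the column counts are made-up inputs, 1 MB is taken as 1,048,576 bytes, and dbo.FactOnlineSales stands for any clustered columnstore table.

```sql
-- Memory needed to compress one full row group, per the formula above.
-- Inputs are illustrative: 20 columns, 5 short string columns, 1 long string column.
DECLARE @rows              BIGINT = 1048576;
DECLARE @columns           INT    = 20;
DECLARE @short_str_columns INT    = 5;
DECLARE @long_str_columns  INT    = 1;

SELECT 72
     + (@rows * @columns * 8) / 1048576.0
     + (@rows * @short_str_columns * 32) / 1048576.0
     + (@long_str_columns * 16) AS estimated_mb;

-- Row group health: count and average size of row groups by state.
SELECT t.name                 AS table_name,
       rg.state_desc,
       COUNT(*)               AS row_group_count,
       SUM(rg.total_rows)     AS total_rows,
       SUM(rg.deleted_rows)   AS deleted_rows,
       AVG(rg.total_rows)     AS avg_rows_per_row_group
FROM sys.tables AS t
JOIN sys.pdw_table_mappings AS mp ON mp.object_id = t.object_id
JOIN sys.pdw_nodes_tables   AS nt ON nt.name = mp.physical_name
JOIN sys.dm_pdw_nodes_db_column_store_row_group_physical_stats AS rg
    ON  rg.object_id       = nt.object_id
    AND rg.pdw_node_id     = nt.pdw_node_id
    AND rg.distribution_id = nt.distribution_id
WHERE t.name = 'FactOnlineSales'
GROUP BY t.name, rg.state_desc;

-- Rebuild when row groups are small or carry many deleted rows.
ALTER INDEX ALL ON dbo.FactOnlineSales REBUILD;
```

With these inputs the estimate works out to 72 + 160 + 160 + 16 = 408 MB for a single row group, which is why a small resource class can end up producing undersized row groups. The rebuild is best run under a user in a larger resource class for the same reason.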

Common Issues – Resource Contention
- Queries occupy Concurrency Slots, based on Resource Class
- The number of concurrent queries depends on the DWU service objective
- The RAM allocated per query depends on Resource Class and DWU
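
Resource classes are database roles, so a user is moved between them with role membership. A minimal sketch, assuming a hypothetical database user named LoadUser:

```sql
-- Give the (hypothetical) load user more memory per query, at the cost of more slots.
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Move the user back to the default class later.
EXEC sp_droprolemember 'largerc', 'LoadUser';

-- Which requests are running, and which are queued waiting for slots?
SELECT request_id, [status], resource_class, submit_time, start_time, command
FROM sys.dm_pdw_exec_requests
WHERE [status] IN ('Running', 'Suspended')
ORDER BY submit_time;

-- Concurrency slot waits for sessions other than this one.
SELECT session_id, request_id, [state], request_time, acquire_time,
       concurrency_slots_used, resource_class
FROM sys.dm_pdw_resource_waits
WHERE session_id <> SESSION_ID();
```

A request that stays in the 'Suspended' status while others run is typically waiting for concurrency slots rather than doing work.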

Common Issues – Resource Contention
Memory:

Gen 1
Performance level   Compute nodes   Memory per data warehouse (GB)
DW100               1               24
DW200               2               48
DW300               3               72
DW400               4               96
DW500               5               120
DW600               6               144
DW1000              10              240
DW1200              12              288
DW1500              15              360
DW2000              20              480
DW3000              30              720
DW6000              60              1440

Gen 2
Performance level   Compute nodes   Memory per data warehouse (GB)
DW1000c             2               600
DW1500c             3               900
DW2000c             4               1200
DW2500c             5               1500
DW3000c             6               1800
DW5000c             10              3000
DW6000c             12              3600
DW7500c             15              4500
DW10000c            20              6000
DW15000c            30              9000
DW30000c            60              18000

Common Issues – Resource Contention
Gen 1 (Static Resource Classes): Concurrency Slots Used
Table columns: Service level, Maximum concurrent queries, Maximum concurrency slots, and the slots used by staticrc10, staticrc20, staticrc30, staticrc40, staticrc50, staticrc60, staticrc70 and staticrc80.

Service level   Maximum concurrency slots
DW100           4
DW200           8
DW300           12
DW400           16
DW500           20
DW600           24
DW1000          40
DW1200          48
DW1500          60
DW2000          80
DW3000          120
DW6000          240

Per-class slot values: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/memory-and-concurrency-limits

Common Issues – Resource Contention
Gen 1 (Dynamic Resource Classes): Concurrency Slots Used
Table columns: Service level, Maximum concurrent queries, Concurrency slots available, and the slots used by smallrc, mediumrc, largerc and xlargerc.

Service level   Concurrency slots available
DW100           4
DW200           8
DW300           12
DW400           16
DW500           20
DW600           24
DW1000          40
DW1200          48
DW1500          60
DW2000          80
DW3000          120
DW6000          240

Per-class slot values: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/memory-and-concurrency-limits

Common Issues – Resource Contention
Gen 2 (Static Resource Classes): Concurrency Slots Used
Table columns: Service level, Maximum concurrent queries, Concurrency slots available, and the slots used by staticrc10, staticrc20, staticrc30, staticrc40, staticrc50, staticrc60, staticrc70 and staticrc80.

Service level   Concurrency slots available
DW1000c         40
DW1500c         60
DW2000c         80
DW2500c         100
DW3000c         120
DW5000c         200
DW6000c         240
DW7500c         300
DW10000c        400
DW15000c        600
DW30000c        1200

Per-class slot values: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/memory-and-concurrency-limits

Common Issues – Resource Contention
Gen 2 (Dynamic Resource Classes): Concurrency Slots Used

Service level   Max concurrent queries   Concurrency slots available   smallrc   mediumrc   largerc   xlargerc
DW1000c         32                       40                            1         4          8         28
DW1500c         32                       60                            1         6          13        42
DW2000c         48                       80                            2         8          17        56
DW2500c         48                       100                           3         10         22        70
DW3000c         64                       120                           3         12         26        84
DW5000c         64                       200                           6         20         44        140
DW6000c         128                      240                           7         24         52        168
DW7500c         128                      300                           9         30         66        210
DW10000c        128                      400                           12        40         88        280
DW15000c        128                      600                           18        60         132       420
DW30000c        128                      1200                          36        120        264       840

https://docs.microsoft.com/en-us/azure/sql-data-warehouse/memory-and-concurrency-limits

Resources
- Microsoft Azure SQL Data Warehouse
- SQL Data Warehouse Documentation

Questions …