Azure SQL DWH: Optimization

Sergiy Lunyakin

Our Partners
If you think that a SQLSaturday is a nice opportunity to learn from and network with fellow SQL Server enthusiasts FOR FREE, I ask you just one thing: visit the sponsor booths and chat with the sponsors! They are covering the expenses for each and every one of you, which is around EUR 60 …

About me
Data Architect in the BigData and Analytics team at SoftServe Inc
Data Platform MVP, MCSE BI, MCSA Cloud Platform
Leader of …
Speaker at SQL conferences
Organizer of SQLSaturday Lviv
Contacts: @slunyakin

Agenda
Architecture of Azure SQL DW
Sizing: service levels and resource classes
Data loading options: which is better
Choosing the right distribution type
Updating statistics
Indexing

Architecture

Architecture of Azure SQL DW
[Diagram: 60 distribution databases, Dist_DB_1 through Dist_DB_60]

Sizing

Sizing factors
Number of nodes
Tempdb size
Concurrency and memory
Load
Transaction size
DWU/cDWU
  DWU: Optimized for Elasticity
  cDWU: Optimized for Compute (more resources, plus an NVMe solid-state disk cache that keeps the most frequently accessed data close to the CPUs). Gets 2.5x more memory.

Introducing DWU
DWU: CPU, RAM, I/O
DWU       Max queries  Max MB mem/dist
DW100     4            400
DW200     8            800
DW300     12           1200
DW400     16           1600
DW500     20           2000
DW600     24           2400
DW1000    32           4000
DW1200                 4800
DW1500                 6000
DW2000                 8000
DW3000                 12000
DW6000                 24000
DWU       Max queries  Max GB mem/dist
DW1000c   32           10
DW1500c                15
DW2000c                20
DW2500c                25
DW3000c                30
DW5000c                50
DW6000c                60
DW7500c                75
DW10000c               100
DW15000c               150
DW30000c               300
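
The service level in the tables above can be inspected and changed with T-SQL. The following is a minimal sketch, not a definitive procedure: MySampleDW is a hypothetical database name, and the statements are assumed to run while connected to the master database of the logical server.

```sql
-- Run while connected to the master database.
-- MySampleDW is a hypothetical database name.

-- Show the current service level of each database on the server.
SELECT db.name AS database_name,
       ds.edition,
       ds.service_objective
FROM sys.database_service_objectives AS ds
JOIN sys.databases AS db
    ON ds.database_id = db.database_id;

-- Scale to another service level, for example before a heavy load.
ALTER DATABASE MySampleDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000c');
```

Scaling is an online request but it interrupts running queries, so it is usually scheduled around load windows.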

Resource Classes
Static resource classes allocate the same amount of memory regardless of the current service level.
Dynamic resource classes allocate a variable amount of memory depending on the current service level. When you scale up to a larger service level, your queries automatically get more memory.
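
Resource classes are implemented as database roles, so a user is placed in one by adding the user to the role. A short sketch, assuming a hypothetical database user named LoadUser:

```sql
-- LoadUser is a hypothetical database user.

-- Dynamic resource class: the memory grant grows with the service level.
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Static resource class alternative: same memory grant at every service level.
-- EXEC sp_addrolemember 'staticrc40', 'LoadUser';

-- To return the user to the default (smallest) resource class later:
-- EXEC sp_droprolemember 'largerc', 'LoadUser';
```

A larger resource class gives each query more memory but uses more concurrency slots, so fewer queries can run at once.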

Load management
The load performance scales as you increase DWUs.
PolyBase automatically parallelizes the data load process.
Multiple readers will not work against compressed text files (e.g. gzip).
Multiple readers will work against compressed columnar/block format files (e.g. ORC, RC).
DWU       Readers  Writers
DW100     8        60
DW200     16
DW300     24
DW400     32
DW500     40
DW600     48
DW1000+   80+
Microsoft Build 2016, 12/4/2018 7:54 PM
© 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Data Loading

Options
Parallel
  PolyBase
  Azure Data Factory (PolyBase)
  SSIS (Azure SQL DW Comp)
Single Gated Client
  bcp / Insert Bulk
  SQLBulkCopy
  SSIS (data flow)
  Azure Data Factory

Single Gated Client
[Diagram: Control Node, Bridge, DMS, Compute Nodes]

Single Gated Client Parallelised
[Diagram: several Clients, Control Node, Bridge, DMS, Compute Nodes]

Parallel Loading with PolyBase
[Diagram: Azure Storage Blob (ASB), Control Node, Bridge, DMS, Compute Nodes]

Recommendations for data loading
Load data with enough compute.
Use a separate user with a high resource class.
Use CTAS to load data from Azure Blob into a staging table with PolyBase.
Staging table: heap, round-robin (hash if the production table is hash distributed), no partitions.
Do the transformations, then load the data into the production table.
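
The staging pattern recommended here could look roughly like the sketch below. It assumes that an external table [ext].[FactOnlineSales] over the blob files already exists and that the [stage] schema has been created; both names are hypothetical. The production table is the hash-distributed [build].[FactOnlineSales] from the "Creating distributed tables" slide.

```sql
-- Assumed to exist: external table [ext].[FactOnlineSales] (hypothetical),
-- schema [stage] (hypothetical), production table [build].[FactOnlineSales].

-- 1. Land the raw data in a heap staging table with no partitions.
--    Hashing on the production key lets step 2 avoid extra data movement.
CREATE TABLE [stage].[FactOnlineSales]
WITH
(
    HEAP,
    DISTRIBUTION = HASH([ProductKey])
)
AS
SELECT *
FROM [ext].[FactOnlineSales];

-- 2. Apply transformations and load the production table.
INSERT INTO [build].[FactOnlineSales]
SELECT [OnlineSalesKey], [DateKey], [StoreKey], [ProductKey],
       [PromotionKey], [CurrencyKey], [CustomerKey],
       [SalesOrderNumber], [SalesOrderLineNumber],
       [SalesQuantity], [SalesAmount]
FROM [stage].[FactOnlineSales];

-- 3. Drop the staging table once the load is verified.
DROP TABLE [stage].[FactOnlineSales];
```

Running this as a dedicated load user in a larger resource class matches the first two recommendations on the slide.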

Data preparation
Transfer data to blob storage.
Use one root folder per table.
Split uncompressed text files bigger than 2 GB.
Each split can be targeted by a different reader.
Compressed text files (gzip) cannot be split by a reader.
Multiple compressed text files can be read in parallel.
Multiple readers will work against compressed columnar/block format files (e.g. ORC, RC).
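
To show how the "one root folder per table" layout is consumed, here is a hedged sketch of the PolyBase objects that would sit on top of it. Every object name, the storage account, the container, and the field terminator are assumptions for illustration; a database scoped credential named BlobCredential is assumed to exist already.

```sql
-- All names, the storage account, and the container are hypothetical.
-- Assumes a database scoped credential named BlobCredential already exists.

CREATE EXTERNAL DATA SOURCE AzureBlobStore
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://mycontainer@myaccount.blob.core.windows.net',
    CREDENTIAL = BlobCredential
);

-- Gzip-compressed, pipe-delimited text. A gzip file cannot be split,
-- so several files are needed to keep multiple readers busy.
CREATE EXTERNAL FILE FORMAT GzipPipeDelimited
WITH
(
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|'),
    DATA_COMPRESSION = 'org.apache.hadoop.io.compress.GzipCodec'
);

-- LOCATION is the table's root folder; all files under it are read.
CREATE EXTERNAL TABLE [ext].[FactOnlineSales]
(
    [OnlineSalesKey]       int          NOT NULL
,   [DateKey]              datetime     NOT NULL
,   [StoreKey]             int          NOT NULL
,   [ProductKey]           int          NOT NULL
,   [PromotionKey]         int          NOT NULL
,   [CurrencyKey]          int          NOT NULL
,   [CustomerKey]          int          NOT NULL
,   [SalesOrderNumber]     nvarchar(20) NOT NULL
,   [SalesOrderLineNumber] int          NULL
,   [SalesQuantity]        int          NOT NULL
,   [SalesAmount]          money        NOT NULL
)
WITH
(
    LOCATION = '/FactOnlineSales/',
    DATA_SOURCE = AzureBlobStore,
    FILE_FORMAT = GzipPipeDelimited
);
```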

Distributions

Distributions
A distribution is a SQL database that stores one or more distributed tables.
Table data is split into 60 buckets across the compute nodes.
Distribution options:
  Hash distributed table
  Round-robin distributed table
  Replicated table
Selecting the right distribution method is key for good performance.
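
The 60 distributions and the distribution method of each table can be read from the catalog views. A small sketch, assuming it is run inside the data warehouse database:

```sql
-- How the distributions are spread over the compute nodes.
SELECT pdw_node_id,
       COUNT(*) AS distributions_on_node
FROM sys.pdw_distributions
GROUP BY pdw_node_id
ORDER BY pdw_node_id;

-- Distribution method (HASH, ROUND_ROBIN, REPLICATE) of every user table.
SELECT s.name AS schema_name,
       t.name AS table_name,
       dp.distribution_policy_desc
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id
JOIN sys.pdw_table_distribution_properties AS dp
    ON t.object_id = dp.object_id
ORDER BY s.name, t.name;
```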

Creating distributed tables
-- Option 1: round-robin distribution
CREATE TABLE [build].[FactOnlineSales]
(
    [OnlineSalesKey]       int          NOT NULL
,   [DateKey]              datetime     NOT NULL
,   [StoreKey]             int          NOT NULL
,   [ProductKey]           int          NOT NULL
,   [PromotionKey]         int          NOT NULL
,   [CurrencyKey]          int          NOT NULL
,   [CustomerKey]          int          NOT NULL
,   [SalesOrderNumber]     nvarchar(20) NOT NULL
,   [SalesOrderLineNumber] int          NULL
,   [SalesQuantity]        int          NOT NULL
,   [SalesAmount]          money        NOT NULL
)
WITH
(   CLUSTERED COLUMNSTORE INDEX
,   DISTRIBUTION = ROUND_ROBIN
)
;
-- Option 2: the same table, hash distributed on ProductKey
CREATE TABLE [build].[FactOnlineSales]
(
    [OnlineSalesKey]       int          NOT NULL
,   [DateKey]              datetime     NOT NULL
,   [StoreKey]             int          NOT NULL
,   [ProductKey]           int          NOT NULL
,   [PromotionKey]         int          NOT NULL
,   [CurrencyKey]          int          NOT NULL
,   [CustomerKey]          int          NOT NULL
,   [SalesOrderNumber]     nvarchar(20) NOT NULL
,   [SalesOrderLineNumber] int          NULL
,   [SalesQuantity]        int          NOT NULL
,   [SalesAmount]          money        NOT NULL
)
WITH
(   CLUSTERED COLUMNSTORE INDEX
,   DISTRIBUTION = HASH([ProductKey])
)
;

Round Robin Distribution
[Diagram: distributions 01 through 60]

HASH Distribution
[Diagram: key values 01, 02, 03, … N pass through HASH( ) and are placed in distributions 01 through 60]

Hash distribution key guidance
The distribution key is not updateable!
Use a column that:
  has static values
  does not contain NULL values
  has a large number of distinct values
  has an even distribution of values
  is used frequently in joins and GROUP BY
Avoid columns used in the WHERE clause.
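
The guidance above can be checked against real data before and after a table is built. This sketch uses the [build].[FactOnlineSales] table from this deck with ProductKey as the candidate key; substitute your own table and column.

```sql
-- Before choosing the key: many distinct values and zero NULLs are good signs.
SELECT COUNT_BIG(*) AS total_rows,
       COUNT_BIG(DISTINCT [ProductKey]) AS distinct_keys,
       SUM(CASE WHEN [ProductKey] IS NULL THEN 1 ELSE 0 END) AS null_keys
FROM [build].[FactOnlineSales];

-- After loading: one result row per distribution.
-- Large differences in the row counts indicate data skew.
DBCC PDW_SHOWSPACEUSED('build.FactOnlineSales');
```

If a few distributions hold far more rows than the rest, queries wait on those distributions, which is a sign that another key may be a better choice.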

Replicated (vs. hash distributed)
[Diagram: Node 00 through Node 06; the SALES table is shown alongside a copy of the PRODUCT table on each node]

Recommendations
Replicated
  Small dimension tables in a star schema with less than 2 GB of storage after compression (~5x compression)
  Warm up replicated tables after create, scale, and pause/resume
Round-Robin
  Temporary/staging tables
  Tables with no obvious joining key or good candidate column
Hash
  Fact tables
  Large dimension tables
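
A replicated copy of a small dimension can be created with CTAS and then warmed up with a cheap query. In this sketch [build].[DimProduct] is a hypothetical existing dimension table.

```sql
-- [build].[DimProduct] is a hypothetical existing dimension table.

-- Create a replicated copy of the dimension.
CREATE TABLE [build].[DimProduct_REPLICATE]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = REPLICATE
)
AS
SELECT *
FROM [build].[DimProduct];

-- Warm-up: the first query against the table triggers the copy to each
-- compute node, so run a trivial one after create, scale, or resume.
SELECT TOP 1 *
FROM [build].[DimProduct_REPLICATE];
```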

Statistics

Key points
Created manually (should be fixed soon)
Updated manually
Can make a huge difference to performance
Matter most on columns used in:
  DISTINCT
  JOIN (+ composite)
  WHERE
  GROUP BY (+ composite)
  ORDER BY

Recommendations
Create a stored procedure to identify and create missing statistics.
Create a stored procedure to update statistics after data is loaded or changed.
Automate it.
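
The building blocks for such stored procedures are shown in the sketch below. The statistics names are hypothetical, and the table is [build].[FactOnlineSales] from this deck. The last query is one possible way to list columns without any statistics object, not the only one.

```sql
-- Statistics names are hypothetical.

-- Single-column statistics on join and filter columns.
CREATE STATISTICS stats_FactOnlineSales_ProductKey
    ON [build].[FactOnlineSales] ([ProductKey]);
CREATE STATISTICS stats_FactOnlineSales_DateKey
    ON [build].[FactOnlineSales] ([DateKey]);

-- Composite statistics for a multi-column join or GROUP BY.
CREATE STATISTICS stats_FactOnlineSales_Store_Product
    ON [build].[FactOnlineSales] ([StoreKey], [ProductKey]);

-- After each load: refresh every statistics object on the table.
UPDATE STATISTICS [build].[FactOnlineSales];

-- Columns that are not covered by any statistics object yet.
SELECT s.name AS schema_name,
       t.name AS table_name,
       c.name AS column_name
FROM sys.tables AS t
JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id
JOIN sys.columns AS c
    ON c.object_id = t.object_id
LEFT JOIN sys.stats_columns AS sc
    ON sc.object_id = c.object_id
   AND sc.column_id = c.column_id
WHERE sc.column_id IS NULL
ORDER BY s.name, t.name, c.name;
```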

Indexing

Indexing
Heap
  Staging/temporary tables
  Small tables with small lookups
Clustered Index
  Tables with up to 100 million rows
  Large tables (more than 100 million rows) where only 1-2 columns are heavily used
Clustered ColumnStore (CCI)
  Large tables (more than 100 million rows)
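
The three options differ only in the WITH clause of CREATE TABLE. The tables and columns in this sketch are hypothetical and kept short on purpose, and the [stage] and [build] schemas are assumed to exist.

```sql
-- Table and column names are hypothetical.

-- Heap: staging or temporary data.
CREATE TABLE [stage].[SalesLanding]
(
    [SalesKey] int   NOT NULL
,   [Amount]   money NOT NULL
)
WITH (HEAP, DISTRIBUTION = ROUND_ROBIN);

-- Clustered index: lookups driven by one or two columns.
CREATE TABLE [build].[DimStore]
(
    [StoreKey]  int           NOT NULL
,   [StoreName] nvarchar(100) NOT NULL
)
WITH (CLUSTERED INDEX ([StoreKey]), DISTRIBUTION = HASH([StoreKey]));

-- Clustered columnstore index: large fact tables.
CREATE TABLE [build].[FactSales]
(
    [SalesKey]   int   NOT NULL
,   [ProductKey] int   NOT NULL
,   [Amount]     money NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([ProductKey]));
```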

Recommendations
Consider adding a nonclustered index on a column heavily used for filtering.
Updates on indexed columns take memory, so use a higher resource class for them.
Avoid trimming and creating many small compressed row groups in a CCI.
Aim for at least 100k rows per compressed row group; the ideal is 1 million rows per row group.
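
Row group quality can be inspected through the node-level DMVs. The query below is a simplified sketch of the usual join pattern between sys.pdw_nodes_column_store_row_groups and the logical catalog; it assumes that state = 3 marks a COMPRESSED row group. The index name in the second statement is hypothetical.

```sql
-- Compressed row groups per table, and how many fall below 100k rows.
-- Assumption: rg.state = 3 means COMPRESSED.
SELECT s.name AS schema_name,
       t.name AS table_name,
       COUNT(*) AS compressed_row_groups,
       AVG(rg.total_rows) AS avg_rows_per_row_group,
       SUM(CASE WHEN rg.total_rows < 100000 THEN 1 ELSE 0 END) AS row_groups_under_100k
FROM sys.pdw_nodes_column_store_row_groups AS rg
JOIN sys.pdw_nodes_tables AS nt
    ON rg.object_id = nt.object_id
   AND rg.pdw_node_id = nt.pdw_node_id
   AND rg.distribution_id = nt.distribution_id
JOIN sys.pdw_table_mappings AS mp
    ON nt.name = mp.physical_name
JOIN sys.tables AS t
    ON mp.object_id = t.object_id
JOIN sys.schemas AS s
    ON t.schema_id = s.schema_id
WHERE rg.state = 3
GROUP BY s.name, t.name
ORDER BY row_groups_under_100k DESC;

-- Nonclustered index on a column that is heavily used as a filter.
-- The index name is hypothetical.
CREATE INDEX ix_FactOnlineSales_CustomerKey
    ON [build].[FactOnlineSales] ([CustomerKey]);
```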

Recommendations
Slow performance can be caused by poor compression of your row groups; consider rebuilding or reorganizing the CCI using a higher resource class.
Consider partitioning your table when you have a large fact table (more than 1 billion rows). The partition key should be based on date.
Be careful not to over-partition, especially with a CCI.
A table benefits from a CCI when COUNT(1) FROM YourTable >= 60 distributions * N partitions * 1 million rows.
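
The rebuild and the partition rule of thumb can be sketched as follows. The partition count of 12, the boundary dates, and the name of the partitioned table are hypothetical; the statements are assumed to run as a user in a larger resource class such as largerc or xlargerc.

```sql
-- Rebuild all row groups with enough memory to fill them properly.
ALTER INDEX ALL ON [build].[FactOnlineSales] REBUILD;

-- Lighter alternative that works on existing row groups.
-- ALTER INDEX ALL ON [build].[FactOnlineSales] REORGANIZE;

-- Rule-of-thumb check for a planned partition count (12 is hypothetical):
-- total_rows should be at least rows_needed.
SELECT COUNT_BIG(*) AS total_rows,
       CAST(60 AS bigint) * 12 * 1000000 AS rows_needed,
       COUNT_BIG(*) / (60 * 12) AS avg_rows_per_distribution_partition
FROM [build].[FactOnlineSales];

-- Date-based partitioning via CTAS; boundary values are hypothetical.
CREATE TABLE [build].[FactOnlineSales_Partitioned]
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION ([DateKey] RANGE RIGHT FOR VALUES ('20170101', '20180101'))
)
AS
SELECT *
FROM [build].[FactOnlineSales];
```

With 12 partitions the threshold is 60 * 12 * 1,000,000 = 720 million rows; below that, fewer partitions keep the row groups fuller.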

Summary
Service level size and resource classes are very important for performance.
Load data with a higher resource class, more DWUs, and PolyBase.
Carefully select your distribution type.
Keep statistics under control.
Check compression in the CCI's row groups; rebuild or reorganize the CCI with a higher resource class (more memory).
Do not over-partition tables with a CCI.

Questions?
