Advanced Topics for Azure SQL Data Warehouse

Presentation transcript:

Advanced Topics for Azure SQL Data Warehouse. Microsoft Build 2016, 5/20/2018 7:50 AM. James Rowland-Jones (JRJ), Principal Program Manager, jrj@microsoft.com, @jrowlandjones. © 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.

Agenda
- Data movement
- Resource management
- Workload concurrency
- Statistics
- Wrap-up

Data Movement

Why data moves
- Incompatible join
- Incompatible aggregation
- Re-distribute data
- Data consistency
- Query syntax

What data moves? As little as possible!
- Remove columns: retain only the columns required for query resolution.
- Remove rows: apply WHERE clause predicates.
- Pre-aggregate data: group by the distribution key for partial aggregation.
- Transport remote rows: send to other nodes only the rows that need to be stored remotely.

How data moves
- Data Movement Service (DMS)
- Exists on Control and Compute nodes
- Responsible for all load and query data movement

Simple example
- The Control node receives: SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales] ;
- Each Compute node runs the same statement against its own distributions: SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales] ;
- The Control node then sums the partial counts. The slide labels this step SELECT SUM(*) FROM dbo.[FactInternetSales] ; which is shorthand rather than valid T-SQL.

EXPLAIN (MPP PLAN)

    <?xml version="1.0" encoding="utf-8"?>
    <dsql_query number_nodes="1" number_distributions="60" number_distributions_per_node="60">
      <sql>SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales]</sql>
      <dsql_operations total_cost="0.00192" total_number_operations="4">
        <dsql_operation operation_type="ON">
          <location permanent="false" distribution="Control" />
          <sql_operations>
            <sql_operation type="statement">CREATE TABLE [tempdb].[QTables].[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c] ([col] BIGINT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
          </sql_operations>
        </dsql_operation>
        <dsql_operation operation_type="PARTITION_MOVE">
          <operation_cost cost="0.00192" accumulative_cost="0.00192" average_rowsize="8" output_rows="1" />
          <location distribution="AllDistributions" />
          <source_statement>SELECT [T1_1].[col] AS [col] FROM (SELECT COUNT_BIG(CAST ((0) AS INT)) AS [col] FROM (SELECT 0 AS [col] FROM [JRJDW].[dbo].[FactInternetSales] AS T3_1) AS T2_1 GROUP BY [T2_1].[col]) AS T1_1</source_statement>
          <destination>Control</destination>
          <destination_table>[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c]</destination_table>
        </dsql_operation>
        <dsql_operation operation_type="RETURN">
          <location distribution="Control" />
          <select>SELECT [T1_1].[col] AS [col] FROM (SELECT ISNULL([T2_1].[col], CONVERT (BIGINT, 0, 0)) AS [col] FROM (SELECT SUM([T3_1].[col]) AS [col] FROM [tempdb].[QTables].[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c] AS T3_1) AS T2_1) AS T1_1</select>
        </dsql_operation>
        <dsql_operation operation_type="ON">
          <location permanent="false" distribution="Control" />
          <sql_operations>
            <sql_operation type="statement">DROP TABLE [tempdb].[QTables].[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c]</sql_operation>
          </sql_operations>
        </dsql_operation>
      </dsql_operations>
    </dsql_query>

Creating distributed tables

    -- Option 1: round-robin distribution
    CREATE TABLE [build].[FactOnlineSales]
    (   [OnlineSalesKey]        int           NOT NULL
    ,   [DateKey]               datetime      NOT NULL
    ,   [StoreKey]              int           NOT NULL
    ,   [ProductKey]            int           NOT NULL
    ,   [PromotionKey]          int           NOT NULL
    ,   [CurrencyKey]           int           NOT NULL
    ,   [CustomerKey]           int           NOT NULL
    ,   [SalesOrderNumber]      nvarchar(20)  NOT NULL
    ,   [SalesOrderLineNumber]  int           NULL
    ,   [SalesQuantity]         int           NOT NULL
    ,   [SalesAmount]           money         NOT NULL
    )
    WITH
    (   CLUSTERED COLUMNSTORE INDEX
    ,   DISTRIBUTION = ROUND_ROBIN
    )
    ;

    -- Option 2: the same table, hash-distributed on ProductKey
    CREATE TABLE [build].[FactOnlineSales]
    (   [OnlineSalesKey]        int           NOT NULL
    ,   [DateKey]               datetime      NOT NULL
    ,   [StoreKey]              int           NOT NULL
    ,   [ProductKey]            int           NOT NULL
    ,   [PromotionKey]          int           NOT NULL
    ,   [CurrencyKey]           int           NOT NULL
    ,   [CustomerKey]           int           NOT NULL
    ,   [SalesOrderNumber]      nvarchar(20)  NOT NULL
    ,   [SalesOrderLineNumber]  int           NULL
    ,   [SalesQuantity]         int           NOT NULL
    ,   [SalesAmount]           money         NOT NULL
    )
    WITH
    (   CLUSTERED COLUMNSTORE INDEX
    ,   DISTRIBUTION = HASH([ProductKey])
    )
    ;

ROUND ROBIN DISTRIBUTION (diagram: distributions 01 to 60)

HASH DISTRIBUTION (diagram: rows with key values 01, 03, 01, 02, ... N pass through HASH() and land on distributions 01 to 60)
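
The two placement strategies on the preceding slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib.crc32` stands in for the service's hash function, which the slides do not describe, and the function names are hypothetical.

```python
import zlib
from collections import Counter
from itertools import cycle

NUM_DISTRIBUTIONS = 60  # the slides show distributions 01 to 60


def round_robin_assign(rows):
    """Assign each row to the next distribution in turn, ignoring its content."""
    targets = cycle(range(1, NUM_DISTRIBUTIONS + 1))
    return [(next(targets), row) for row in rows]


def hash_assign(rows, key):
    """Assign each row by hashing its distribution key.

    zlib.crc32 is an assumed stand-in for the real (undocumented here) hash.
    """
    def bucket(value):
        return zlib.crc32(str(value).encode("utf-8")) % NUM_DISTRIBUTIONS + 1

    return [(bucket(row[key]), row) for row in rows]


rows = [{"ProductKey": k % 7, "SalesQuantity": 1} for k in range(120)]

# Round robin: rows are spread evenly, but equal keys end up scattered.
rr_counts = Counter(dist for dist, _ in round_robin_assign(rows))

# Hash: every row with the same ProductKey lands on the same distribution.
homes = {}
for dist, row in hash_assign(rows, "ProductKey"):
    homes.setdefault(row["ProductKey"], set()).add(dist)
```

Because equal key values always share a distribution under hashing, two tables hashed on the same key can be joined on that key without moving rows, which is the point the join slides build on.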

Joining HASH tables
- Store_Sales HASH([ProductKey]) joined to Web_Sales HASH([ProductKey])
- Join compatibility depends on the data TYPE of the distribution keys
- The join must also be an equi-join

Joining HASH tables
- Store_Sales HASH([ProductKey]): [ProductKey] INT NULL
- Web_Sales HASH([ProductKey]): [ProductKey] BIGINT NULL
- The distribution key data types differ (INT vs. BIGINT), so the join is not compatible.

Joining HASH tables
- Store_Sales HASH([ProductKey]): [ProductKey] INT NULL
- Web_Sales HASH([ProductKey]): [ProductKey] INT NULL
- The distribution key data types match, so an equi-join on [ProductKey] is compatible.

Hash distribution key guidance
- Distribution key is not updateable! Use a column that has static values
- Does not contain NULL values
- Large number of distinct values
- Even distribution of values
- Used frequently in joins and GROUP BY
- Avoid columns used in the WHERE clause
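
To see why the guidance asks for many distinct, evenly spread values, here is a rough Python sketch that compares the skew of two candidate keys. It is a simulation only: `zlib.crc32` is an assumed stand-in for the real hash, and `distribution_skew` is a hypothetical helper, not a service API.

```python
import zlib
from collections import Counter

NUM_DISTRIBUTIONS = 60


def distribution_skew(key_values):
    """Return (rows on the fullest distribution) / (mean rows per distribution).

    A value close to 1.0 means the key spreads rows evenly; large values
    mean a few distributions hold most of the data.
    """
    counts = Counter(
        zlib.crc32(str(value).encode("utf-8")) % NUM_DISTRIBUTIONS
        for value in key_values
    )
    mean_rows = len(key_values) / NUM_DISTRIBUTIONS
    return max(counts.values()) / mean_rows


row_count = 100_000
low_cardinality_key = [i % 3 for i in range(row_count)]   # only 3 distinct values
high_cardinality_key = list(range(row_count))             # every value distinct

skew_low = distribution_skew(low_cardinality_key)
skew_high = distribution_skew(high_cardinality_key)
```

With only three distinct values, at most three of the 60 distributions can hold any rows, so `skew_low` is at least 20 regardless of the hash used; the high-cardinality key spreads far more evenly.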

Distribution compatibility
- Distribution types: HASH, ROUND ROBIN, REPLICATED*
- Round robin joins always trigger data movement

Hash & Replicated join compatibility
[table: compatibility by Left Table / Right Table distribution (Replicated, HASH) and join type (Inner, Left, Right, Full, Cross)]
- In PDW, if you write a LEFT join that requires movement, the engine will try to rewrite it as a RIGHT join if that prevents movement.
- Conditions: use equality joins combined with AND; a predicate such as col = col OR col = x is not good.
- For a Distributed-to-Distributed join to be compatible (green), the join must:
  - contain the distribution key of both tables
  - match data types on the distribution keys
  - be an equality join
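
The three conditions for a compatible join between two hash-distributed tables can be written down as a small check. This is a sketch of the rule as stated on the slide, not of the optimizer itself; the `HashTable` class and `is_compatible_join` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HashTable:
    name: str
    distribution_key: str
    key_type: str  # declared data type of the distribution key, e.g. "INT"


def is_compatible_join(left, right, join_pairs, equality_only):
    """Return True when no data movement should be needed for the join.

    join_pairs: set of (left_column, right_column) pairs compared in the ON clause.
    equality_only: True when every join predicate is an equality joined by AND.
    """
    joins_on_keys = (left.distribution_key, right.distribution_key) in join_pairs
    same_type = left.key_type.upper() == right.key_type.upper()
    return joins_on_keys and same_type and equality_only


store_sales = HashTable("Store_Sales", "ProductKey", "INT")
web_sales_bigint = HashTable("Web_Sales", "ProductKey", "BIGINT")
web_sales_int = HashTable("Web_Sales", "ProductKey", "INT")
on_product = {("ProductKey", "ProductKey")}

mismatched_types = is_compatible_join(store_sales, web_sales_bigint, on_product, True)
matched_types = is_compatible_join(store_sales, web_sales_int, on_product, True)
non_equi_join = is_compatible_join(store_sales, web_sales_int, on_product, False)
```

The INT vs. BIGINT case mirrors the Store_Sales / Web_Sales slides: the keys have the same name and the same hash distribution, yet the type mismatch alone makes the join incompatible.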

Aggregation incompatibility
- Data needs to be moved for full aggregation. Two approaches:
  - Re-distribute data by a column in the GROUP BY: keeps data down at the compute level.
  - Push data to a central point for aggregation: uses the Control node; most commonly seen with aggregates that have no GROUP BY.
- Example: "Show me the sales by month and category." The GROUP BY requires data movement because the data is not distributed on these columns. You can keep the aggregation on the Compute nodes or move it to the Control node, depending on how you write your query.

Incompatible aggregation example

    --EXPLAIN
    SELECT COUNT_BIG(*)
    FROM [cso].[FactOnlineSales]
    GROUP BY [StoreKey]
    OPTION (LABEL = 'Shuffle : Aggregate')
    ;

- Incompatibility: FactOnlineSales is distributed by ProductKey; the query groups by StoreKey.
- Resolution: re-distribute data on StoreKey. N.B. data is pre-aggregated by StoreKey first.
- The Shuffle move resolves the aggregation. PDW does its best to make sure that only the minimum comes off the Compute node when moving data.
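
The pre-aggregate-then-shuffle pattern can be simulated in Python to show how little data needs to move. This is a simplified model under assumptions: `zlib.crc32` stands in for the real hash, the data is synthetic, and `count_by_store` is a hypothetical name.

```python
import zlib
from collections import Counter, defaultdict

NUM_DISTRIBUTIONS = 60


def bucket(value):
    """Map a key value to a distribution (crc32 is an assumed stand-in hash)."""
    return zlib.crc32(str(value).encode("utf-8")) % NUM_DISTRIBUTIONS


def count_by_store(rows):
    """COUNT(*) GROUP BY StoreKey over rows that are distributed by ProductKey."""
    # Step 1: rows live on the distribution chosen by ProductKey.
    distributions = defaultdict(list)
    for row in rows:
        distributions[bucket(row["ProductKey"])].append(row)

    # Step 2: each distribution pre-aggregates its own rows by StoreKey.
    partials = [Counter(r["StoreKey"] for r in local) for local in distributions.values()]

    # Step 3: shuffle only the partial counts, re-distributing them on StoreKey.
    shuffled = defaultdict(Counter)
    moved = 0
    for partial in partials:
        for store_key, partial_count in partial.items():
            shuffled[bucket(store_key)][store_key] += partial_count
            moved += 1

    # Step 4: each StoreKey now sits on one distribution; combine the results.
    final = Counter()
    for local in shuffled.values():
        final.update(local)
    return final, moved


rows = [{"ProductKey": i % 200, "StoreKey": i % 5} for i in range(10_000)]
final_counts, moved = count_by_store(rows)
```

In this model at most 60 x 5 = 300 partial rows are shuffled instead of the 10,000 source rows, which is the effect the slide describes as only the minimum coming off the Compute node.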

Re-distributing data
- You can move:
  - from hash to round_robin and vice versa
  - from hash (a) to hash (b)
  - from hash to replicated and vice versa*
  - from round_robin to replicated and vice versa*
- Typically found when data is being persisted rather than returned to the user
* APS only today

Re-distribution example

    --EXPLAIN
    CREATE TABLE [tmp].[DimEmployee]
    WITH (DISTRIBUTION = HASH(EmployeeKey))
    AS
    SELECT *
    FROM [cso].[DimEmployee]
    OPTION (LABEL = 'CTAS : Redistribution')
    ;

Query syntax
- Causes of data movement: expressions on the distribution key.
- Additional causes of data movement: OVER() and COUNT(DISTINCT [col]). These can be optimised.
- For OVER() and COUNT DISTINCT, data movement occurs when the distribution key is not included!

COUNT DISTINCT examples

    --EXPLAIN
    SELECT COUNT_BIG(DISTINCT [DateKey])
    FROM [cso].[FactOnlineSales]
    OPTION (LABEL = 'COUNT DISTINCT incompatible dist key')
    ;

    --EXPLAIN
    SELECT COUNT_BIG(DISTINCT ([ProductKey]))
    FROM [cso].[FactOnlineSales]
    OPTION (LABEL = 'COUNT DISTINCT compatible dist key')
    ;

OVER() examples

    --EXPLAIN
    SELECT SUM([SalesAmount]) OVER(PARTITION BY [DateKey])
    FROM [cso].[FactOnlineSales]
    OPTION (LABEL = 'OVER() incompatible dist key')
    ;

    --EXPLAIN
    SELECT SUM([SalesAmount]) OVER(ORDER BY [ProductKey])
    FROM [cso].[FactOnlineSales]
    OPTION (LABEL = 'OVER() incompatible no partition key')
    ;

    --EXPLAIN
    SELECT SUM([SalesAmount]) OVER(PARTITION BY [ProductKey])
    FROM [cso].[FactOnlineSales]
    OPTION (LABEL = 'OVER() compatible dist key')
    ;

Resource management

Introducing DWU (Data Warehouse Unit): CPU, RAM, I/O

    ALTER DATABASE ContosoRetailDW
    MODIFY (service_objective = 'DW1000')
    ;

Database capacity

    CREATE DATABASE MyDB COLLATE SQL_Latin1_General_CP1_CI_AS
    (   EDITION           = 'DataWarehouse'
    ,   SERVICE_OBJECTIVE = 'DW400'
    ,   MAXSIZE           = 10240 GB
    )
    ;

- Factor in the compression of the data: assuming 5x compression, a 100 TB database will hold 500 TB of data.
- Sizing by storage capacity? 1 TB per DWU100 is a good place to start.
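
The two rules of thumb on this slide reduce to simple arithmetic, sketched below. The function names are hypothetical, the 5x ratio is the slide's own assumption, and the DWU figure is a raw starting point that would still need rounding to a service objective the platform actually offers.

```python
import math


def raw_capacity_tb(database_size_tb, compression_ratio=5.0):
    """Uncompressed data that fits in a database of the given (compressed) size."""
    return database_size_tb * compression_ratio


def starting_dwu(compressed_size_tb, tb_per_dwu100=1.0):
    """Starting DWU from the '1 TB per DWU100' rule, rounded up to a multiple of 100."""
    return math.ceil(compressed_size_tb / tb_per_dwu100) * 100


capacity_for_100tb = raw_capacity_tb(100)   # the slide's example: 100 TB at 5x
dwu_for_10tb = starting_dwu(10)             # 10 TB compressed
dwu_for_2_5tb = starting_dwu(2.5)           # partial terabytes round up
```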

Tempdb sizing

Load management
- Loading tools: PolyBase, Azure Data Factory, SSIS, bcp, 3rd party data loading tools

Delimited text guidance
- Evenly split the data into multiple files
- One file per reader
- Delimited text is the fastest
- Compressed text limits concurrent access to text files: split data across files OR use a different file format

    DWU       Readers   Writers
    DW100     8         60
    DW200     16        60
    DW300     24        60
    DW400     32        60
    DW500     40        60
    DW600     48        60
    DW1000+   60        60
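
The reader table and the "one file per reader" guidance can be turned into a small lookup for planning how many files to split a load into. This is a sketch: the reader counts are the ones listed in this deck (including 60 readers for DW1000 and above from the summary slide), and the function names are hypothetical.

```python
# Reader counts per service objective, as listed in this deck.
READERS_BY_DWU = {100: 8, 200: 16, 300: 24, 400: 32, 500: 40, 600: 48}
MAX_READERS = 60  # DW1000 and above


def readers_for(dwu):
    """Return the number of concurrent readers for a DWU setting."""
    if dwu >= 1000:
        return MAX_READERS
    try:
        return READERS_BY_DWU[dwu]
    except KeyError:
        raise ValueError(f"no reader count listed for DW{dwu}") from None


def file_count(dwu, files_per_reader=1):
    """Number of evenly sized files that keeps every reader equally busy."""
    return readers_for(dwu) * files_per_reader


files_at_dw400 = file_count(400)        # one file per reader
files_at_dw1000 = file_count(1000, 2)   # two files per reader
```

Keeping the file count at a whole multiple of the reader count means no reader is left idle while others work through a final, uneven batch of files.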

Transaction Size

Memory Management (MB per distribution)

Memory grant size estimation for row groups

    DECLARE @total_num_columns INT = 10 ;

    SELECT CEILING((75497472 + (@total_num_columns * 1048576 * 16) / 10) / 1048576.0) AS Rowgroup_memorygrant_estimate_in_MiB_for_100K
    ,      CEILING((75497472 + (@total_num_columns * 1048576 * 16)) / 1048576.0) AS Rowgroup_memorygrant_estimate_in_MiB_for_1M
    ;
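
For a quick check of the numbers this estimate produces, the same arithmetic can be reproduced in Python. The constants are taken directly from the T-SQL on the slide; the function name and the `divisor` parameter are hypothetical, and the formula itself is the slide's estimate rather than a guaranteed figure.

```python
import math

BASE_BYTES = 75_497_472            # fixed overhead used on the slide (72 MiB)
ROWS_PER_ROWGROUP = 1_048_576      # rows in a full rowgroup
BYTES_PER_COLUMN_VALUE = 16        # per-row, per-column allowance used on the slide
MIB = 1_048_576


def rowgroup_grant_mib(total_columns, divisor=1):
    """Estimated memory grant in MiB for compressing one rowgroup.

    divisor=1 models a full ~1M-row rowgroup; divisor=10 models ~100K rows,
    matching the integer division in the T-SQL version.
    """
    column_bytes = total_columns * ROWS_PER_ROWGROUP * BYTES_PER_COLUMN_VALUE // divisor
    return math.ceil((BASE_BYTES + column_bytes) / MIB)


grant_100k = rowgroup_grant_mib(10, divisor=10)
grant_1m = rowgroup_grant_mib(10)
```

For a 10-column table this gives 88 MiB for a 100K-row rowgroup and 232 MiB for a full one, which can be compared against the memory per distribution available to the loading user's resource class.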

Estimating memory grant

Create Login (master)

    CREATE LOGIN newperson WITH PASSWORD = 'SQLB1ts!';
    CREATE USER newperson FOR LOGIN newperson;
    EXEC sp_addrolemember 'loginmanager', 'newperson';
    EXEC sp_addrolemember 'dbmanager', 'newperson';

Resource class roles

    SELECT ro.[name] AS [db_role_name]
    FROM sys.database_principals ro
    WHERE ro.[type_desc] = 'DATABASE_ROLE'
    AND   ro.[is_fixed_role] = 0
    ;

Create user (user db)

    CREATE USER newperson FOR LOGIN newperson ;

    GRANT CONTROL ON DATABASE::ContosoRetailDW TO newperson ;

    -- List current members of the elevated resource class roles
    SELECT r.[name] AS role_principal_name
    ,      m.[name] AS member_principal_name
    FROM sys.database_role_members rm
    JOIN sys.database_principals AS r ON rm.[role_principal_id] = r.[principal_id]
    JOIN sys.database_principals AS m ON rm.[member_principal_id] = m.[principal_id]
    WHERE r.[name] IN ('mediumrc', 'largerc', 'xlargerc')
    ;

    -- Add the user to a resource class
    EXEC sp_addrolemember 'mediumrc', 'newperson' ;

Identifying users with elevated requests

    SELECT r.[request_id]     AS Req_ID
    ,      r.[command]        AS Req_command
    ,      r.[status]         AS Req_Status
    ,      r.[submit_time]    AS Req_SubmitTime
    ,      r.[start_time]     AS Req_StartTime
    ,      DATEDIFF(ms, r.[submit_time], r.[start_time]) AS Req_WaitDuration_ms
    ,      r.[resource_class] AS Req_resource_class
    FROM sys.dm_pdw_exec_requests r
    WHERE r.[session_id] <> SESSION_ID()
    ;

Concurrency

Concurrency & Memory

Statistics

Key points
- Created manually
- Updated manually
- Can make a huge difference to performance
- Create statistics on columns used in: DISTINCT, JOIN (+composite), WHERE, GROUP BY (+composite), ORDER BY

Create table – default sizing

    CREATE TABLE [cso].[FactOnlineSales_INS]
    (   [OnlineSalesKey]        int           NOT NULL
    ,   [DateKey]               datetime      NOT NULL
    ,   [StoreKey]              int           NOT NULL
    ,   [ProductKey]            int           NOT NULL
    ,   [PromotionKey]          int           NOT NULL
    ,   [CurrencyKey]           int           NOT NULL
    ,   [CustomerKey]           int           NOT NULL
    ,   [SalesOrderNumber]      nvarchar(20)  NOT NULL
    ,   [SalesOrderLineNumber]  int           NULL
    ,   [SalesQuantity]         int           NOT NULL
    ,   [SalesAmount]           money         NOT NULL
    ,   [ReturnQuantity]        int           NOT NULL
    ,   [ReturnAmount]          money         NULL
    ,   [DiscountQuantity]      int           NULL
    ,   [DiscountAmount]        money         NULL
    ,   [TotalCost]             money         NOT NULL
    ,   [UnitCost]              money         NULL
    ,   [UnitPrice]             money         NULL
    ,   [ETLLoadID]             int           NULL
    ,   [LoadDate]              datetime      NULL
    ,   [UpdateDate]            datetime      NULL
    )
    WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([ProductKey]))
    ;

    INSERT INTO [cso].[FactOnlineSales_INS]
    SELECT * FROM [cso].[FactOnlineSales]
    ;

    -- Inspect the steps of the requests issued in this session
    SELECT rs.*
    FROM sys.dm_pdw_exec_requests er
    JOIN sys.dm_pdw_request_steps rs ON er.[request_id] = rs.[request_id]
    WHERE er.[session_id] = SESSION_ID()
    ;

Investigate the MPP plan

    --EXPLAIN
    SELECT SUM([SalesAmount])
    FROM [cso].[FactOnlineSales_INS] AS fos
    JOIN [cso].[DimProduct] AS dip ON fos.[ProductKey] = dip.[ProductKey]
    WHERE fos.DateKey BETWEEN '2007-01-01 00:00:00.000' AND '2008-01-01 00:00:00.000'
    GROUP BY dip.[BrandName]
    ;

    CREATE STATISTICS stat_1 ON [cso].[FactOnlineSales_INS]([ProductKey]);

DMV: Row count and age of stats

    -- Row count per partition
    SELECT *
    FROM sys.schemas s
    JOIN sys.tables t ON s.[schema_id] = t.[schema_id]
    JOIN sys.partitions p ON t.[object_id] = p.[object_id]
    WHERE s.[name] = 'cso'
    AND   t.[name] = 'FactOnlineSales_INS'
    ;

    -- Age of statistics
    SELECT stats_id
    ,      name AS stats_name
    ,      STATS_DATE(object_id, stats_id) AS statistics_date
    FROM sys.stats s
    WHERE s.object_id = OBJECT_ID('cso.DimCustomer')
    ;

Summary

Minimising movement
- Distribution key NOT NULL
- Distribution key data skew validated
- Distribution key data types compatible
- Variable length columns optimised
- Equijoins used on distribution keys
- Distribution keys included in the join
- Statistics up to date

Resource management
- Configure load user
- Scale for additional resources
- Size the rowgroup for memory grant
- Set appropriate resource class
- DWU1000+: 60 readers
- Make the number of files a multiple of the number of readers for balanced throughput (i.e. 60, 120, 180 etc.)
- Tempdb (300GB per DWU100)
- Transaction size
- Concurrency slots