1
Advanced Topics for Azure SQL Data Warehouse
Microsoft Build 2016 5/20/2018 7:50 AM Advanced Topics for Azure SQL Data Warehouse James Rowland-Jones (JRJ) Principal Program Manager @jrowlandjones © 2016 Microsoft Corporation. All rights reserved. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
2
Agenda
Data Movement
Resource Management
Workload concurrency
Statistics
Wrap-up
3
Data Movement
4
Why data moves
Incompatible join
Incompatible aggregation
Re-distribute data
Data consistency
Query syntax
5
What data moves?
As little as possible!
Remove columns
  Retain only the columns required for query resolution
Remove rows
  Apply WHERE clause predicates
Pre-aggregate data
  Group by the distribution key for partial aggregation
Transport remote rows
  Only send rows to other nodes that need to be stored remotely
6
How data moves
Data Movement Service (DMS)
  Exists on Control and Compute nodes
  Responsible for all load & query data movement
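DMS activity can be inspected after the fact. The query below is a minimal sketch, assuming the request id 'QID1234' is a placeholder that you replace with a value taken from sys.dm_pdw_exec_requests; it lists what each DMS worker moved for that request.

```sql
-- 'QID1234' is a hypothetical request id used only for illustration.
SELECT  w.[request_id]
,       w.[step_index]
,       w.[pdw_node_id]
,       w.[distribution_id]
,       w.[type]              -- kind of DMS worker (reader, writer, ...)
,       w.[status]
,       w.[rows_processed]
,       w.[bytes_processed]
FROM    sys.dm_pdw_dms_workers AS w
WHERE   w.[request_id] = 'QID1234'
ORDER BY w.[step_index], w.[distribution_id]
;
```

Steps that never touch DMS should return no rows here, which is a quick way to tell movement steps apart from purely local ones.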
7
Simple example
User query:
  SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales] ;
Control: sums the partial counts received from Compute, in effect a SUM over one row per distribution
Compute: every distribution runs the count locally
  SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales] ;
8
EXPLAIN (MPP PLAN)
<?xml version="1.0" encoding="utf-8"?>
<dsql_query number_nodes="1" number_distributions="60" number_distributions_per_node="60">
  <sql>SELECT COUNT_BIG(*) FROM dbo.[FactInternetSales]</sql>
  <dsql_operations total_cost=" " total_number_operations="4">
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">CREATE TABLE [tempdb].[QTables].[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c] ([col] BIGINT ) WITH(DATA_COMPRESSION=PAGE);</sql_operation>
      </sql_operations>
    </dsql_operation>
    <dsql_operation operation_type="PARTITION_MOVE">
      <operation_cost cost=" " accumulative_cost=" " average_rowsize="8" output_rows="1" />
      <location distribution="AllDistributions" />
      <source_statement>SELECT [T1_1].[col] AS [col] FROM (SELECT COUNT_BIG(CAST ((0) AS INT)) AS [col] FROM (SELECT 0 AS [col] FROM [JRJDW].[dbo].[FactInternetSales] AS T3_1) AS T2_1 GROUP BY [T2_1].[col]) AS T1_1</source_statement>
      <destination>Control</destination>
      <destination_table>[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c]</destination_table>
    </dsql_operation>
    <dsql_operation operation_type="RETURN">
      <location distribution="Control" />
      <select>SELECT [T1_1].[col] AS [col] FROM (SELECT ISNULL([T2_1].[col], CONVERT (BIGINT, 0, 0)) AS [col] FROM (SELECT SUM([T3_1].[col]) AS [col] FROM [tempdb].[QTables].[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c] AS T3_1) AS T2_1) AS T1_1</select>
    </dsql_operation>
    <dsql_operation operation_type="ON">
      <location permanent="false" distribution="Control" />
      <sql_operations>
        <sql_operation type="statement">DROP TABLE [tempdb].[QTables].[QTable_7cb3c9d5271e41bc9a28e583eeb2bd4c]</sql_operation>
      </sql_operations>
    </dsql_operation>
  </dsql_operations>
</dsql_query>
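A plan like the one above is produced by prefixing the statement with EXPLAIN. As a minimal sketch, using the same table as the slide:

```sql
-- EXPLAIN returns the MPP plan as XML; the query itself is not executed.
EXPLAIN
SELECT COUNT_BIG(*)
FROM   dbo.[FactInternetSales]
;
```

Reading the operation_type values from top to bottom gives the order of the steps: a temporary table is created on Control, the partial counts are moved into it, the final result is returned, and the temporary table is dropped.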
9
Creating distributed tables

Round robin:
CREATE TABLE [build].[FactOnlineSales]
( [OnlineSalesKey]       int          NOT NULL
, [DateKey]              datetime     NOT NULL
, [StoreKey]             int          NOT NULL
, [ProductKey]           int          NOT NULL
, [PromotionKey]         int          NOT NULL
, [CurrencyKey]          int          NOT NULL
, [CustomerKey]          int          NOT NULL
, [SalesOrderNumber]     nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int          NULL
, [SalesQuantity]        int          NOT NULL
, [SalesAmount]          money        NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = ROUND_ROBIN
) ;

Hash (same columns, different distribution):
CREATE TABLE [build].[FactOnlineSales]
( [OnlineSalesKey]       int          NOT NULL
, [DateKey]              datetime     NOT NULL
, [StoreKey]             int          NOT NULL
, [ProductKey]           int          NOT NULL
, [PromotionKey]         int          NOT NULL
, [CurrencyKey]          int          NOT NULL
, [CustomerKey]          int          NOT NULL
, [SalesOrderNumber]     nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int          NULL
, [SalesQuantity]        int          NOT NULL
, [SalesAmount]          money        NOT NULL
)
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
) ;
10
ROUND ROBIN DISTRIBUTION
[Figure: the 60 distributions, numbered 01 to 60]
11
HASH DISTRIBUTION
[Figure: key values (01, 03, 01, 02 ... N) pass through HASH( ) and are mapped onto the 60 distributions, numbered 01 to 60]
12
Joining HASH tables
Store_Sales HASH([ProductKey])
Web_Sales HASH([ProductKey])
Join compatibility depends on the data TYPE of the distribution key
Must also be an equi-join
13
Joining HASH tables
Store_Sales HASH([ProductKey])  [ProductKey] INT NULL
Web_Sales HASH([ProductKey])  [ProductKey] BIGINT NULL
(data types differ, so the join is not distribution compatible)
14
Joining HASH tables
Store_Sales HASH([ProductKey])  [ProductKey] INT NULL
Web_Sales HASH([ProductKey])  [ProductKey] INT NULL
(data types match, so the join is distribution compatible)
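One way to get from the INT/BIGINT mismatch to matching types is to rebuild one of the tables with CTAS. This is only a sketch: the dbo schema, the new table name and the [SalesAmount] column are assumptions for illustration, and the cast is safe only if every existing key value fits in an INT.

```sql
-- Hypothetical rebuild of Web_Sales so that its distribution key is INT,
-- matching Store_Sales. Schema, target name and column list are illustrative.
CREATE TABLE [dbo].[Web_Sales_int]
WITH
( CLUSTERED COLUMNSTORE INDEX
, DISTRIBUTION = HASH([ProductKey])
)
AS
SELECT  CAST([ProductKey] AS INT) AS [ProductKey]  -- fails if a value overflows INT
,       [SalesAmount]
FROM    [dbo].[Web_Sales]
;
```

Running EXPLAIN on the join before and after the rebuild should show whether the move step has disappeared.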
15
Hash distribution key guidance
Distribution key is not updateable!
Use a column that:
  Has static values
  Does not contain NULL values
  Has a large number of distinct values
  Has an even distribution of values
  Is used frequently in joins and group by
Avoid columns used in the where clause
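The guidance above can be checked against real data before committing to a key. The sketch below profiles [ProductKey] on the [build].[FactOnlineSales] table from the earlier slide and then reports rows per distribution; treating a large spread between distributions as a warning sign is an assumption, not a documented threshold.

```sql
-- Profile a candidate distribution key: NULLs and number of distinct values.
SELECT  COUNT_BIG(*)                              AS total_rows
,       COUNT_BIG(DISTINCT [ProductKey])          AS distinct_keys
,       COUNT_BIG(*) - COUNT_BIG([ProductKey])    AS null_keys  -- COUNT_BIG(col) skips NULLs
FROM    [build].[FactOnlineSales]
;

-- Rows and space used per distribution; uneven row counts indicate skew.
DBCC PDW_SHOWSPACEUSED('build.FactOnlineSales');
```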
16
Distribution compatibility
Distribution types: HASH, ROUND ROBIN, REPLICATED*
Round robin joins always trigger data movement
17
Hash & Replicated join compatibility
[Table: Left Table / Right Table distribution (Replicated, HASH) against join type (Inner, Left, Right, Full, Cross); the cell colouring that marks compatible combinations is not recoverable]
In PDW, if you write a query as a LEFT join and that requires movement, the engine will try to rewrite it as a RIGHT join if that prevents the movement.
Equality conditions combined with AND are fine; a condition such as col = col OR col = x is not.
Conditions! For a Distributed – Distributed join to be compatible (green) the join must:
  Contain the distribution key of both tables
  Match data types on the distribution keys
  Be an equality join
18
Aggregation Incompatibility
Data needs to be moved for full aggregation
Two approaches:
  Re-distribute data by a column in the group by
    Keeps data down at the compute level
  Push data to a central point for aggregation
    Uses the control node
    Most commonly seen with aggregates that have no group by
Example: "Show me the sales by month and category." The GROUP BY requires the data to be moved because the data is not distributed on these columns. You can keep the aggregation on the Compute nodes or move it to the Control node, depending on how you write your query.
19
Incompatible Aggregation example
--EXPLAIN
SELECT COUNT_BIG(*)
FROM [cso].[FactOnlineSales]
GROUP BY [StoreKey]
OPTION (LABEL = 'Shuffle : Aggregate') ;

Incompatibility:
  FactOnlineSales is distributed by ProductKey
  Query groups by StoreKey
Resolution:
  Re-distribute data on StoreKey
  N.B. Data is pre-aggregated by StoreKey first
The Shuffle move resolves the aggregation. PDW does its best to make sure that only the minimum comes off the Compute node when moving data.
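Because the example query carries a label, its executed steps can be looked up afterwards. A minimal sketch, assuming the query has already been run with the 'Shuffle : Aggregate' label shown above:

```sql
-- Find the steps of the labelled request, most recent submission first;
-- a shuffle move step is expected among the operation types.
SELECT  rs.[request_id]
,       rs.[step_index]
,       rs.[operation_type]
,       rs.[location_type]
,       rs.[row_count]
,       rs.[total_elapsed_time]
FROM    sys.dm_pdw_exec_requests  AS er
JOIN    sys.dm_pdw_request_steps  AS rs ON er.[request_id] = rs.[request_id]
WHERE   er.[label] = 'Shuffle : Aggregate'
ORDER BY er.[submit_time] DESC, rs.[step_index]
;
```

The row_count on the move step shows how many pre-aggregated rows actually left the distributions, which should be far fewer than the table's row count.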
20
Re-distributing Data
You can move:
  From hash to round_robin and vice versa
  From hash (a) to hash (b)
  From hash to replicated and vice versa*
  From round_robin to replicated and vice versa*
Typically found when data is being persisted rather than returned to the user
* APS only today
21
Re-distribution example
--EXPLAIN
CREATE TABLE [tmp].[DimEmployee]
WITH (DISTRIBUTION = HASH([EmployeeKey]))
AS
SELECT *
FROM [cso].[DimEmployee]
OPTION (LABEL = 'CTAS : Redistribution') ;
22
Query Syntax
Causes of data movement:
  Expressions on the distribution key
Additional causes of data movement:
  OVER()
  COUNT(DISTINCT [col])
…these can be optimised…
For OVER() and COUNT(DISTINCT), data movement occurs when the distribution key is not included!
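To illustrate the first cause, the pair of statements below groups once by the bare distribution key and once by an expression on it. This is a sketch against [cso].[FactOnlineSales], which the deck describes as distributed on ProductKey; the modulo bucket is an arbitrary illustration. Run each statement separately and compare the plans.

```sql
-- Grouping on the distribution key itself: each key lives on one distribution,
-- so the aggregate can be completed locally.
EXPLAIN
SELECT  [ProductKey], COUNT_BIG(*) AS row_count
FROM    [cso].[FactOnlineSales]
GROUP BY [ProductKey]
;

-- Grouping on an expression over the key: keys from different distributions
-- fall into the same bucket, so partial results have to be moved and combined.
EXPLAIN
SELECT  [ProductKey] % 10 AS bucket, COUNT_BIG(*) AS row_count
FROM    [cso].[FactOnlineSales]
GROUP BY [ProductKey] % 10
;
```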
23
COUNT DISTINCT examples
--EXPLAIN
SELECT COUNT_BIG(DISTINCT [DateKey])
FROM [cso].[FactOnlineSales]
OPTION (LABEL = 'COUNT DISTINCT incompatible dist key') ;

--EXPLAIN
SELECT COUNT_BIG(DISTINCT [ProductKey])
FROM [cso].[FactOnlineSales]
OPTION (LABEL = 'COUNT DISTINCT compatible dist key') ;
24
OVER() examples
--EXPLAIN
SELECT SUM([SalesAmount]) OVER(PARTITION BY [DateKey])
FROM [cso].[FactOnlineSales]
OPTION (LABEL = 'OVER() incompatible dist key') ;

--EXPLAIN
SELECT SUM([SalesAmount]) OVER(ORDER BY [ProductKey])
FROM [cso].[FactOnlineSales]
OPTION (LABEL = 'OVER() incompatible no partition key') ;

--EXPLAIN
SELECT SUM([SalesAmount]) OVER(PARTITION BY [ProductKey])
FROM [cso].[FactOnlineSales]
OPTION (LABEL = 'OVER() compatible dist key') ;
25
Resource management
26
Introducing DWU
CPU, RAM, I/O
ALTER DATABASE ContosoRetailDW
MODIFY (SERVICE_OBJECTIVE = 'DW1000') ;
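After a scale request it is useful to confirm which service objective the database is actually running at. A minimal sketch, assuming it is run while connected to the master database of the logical server:

```sql
-- Lists each database with its edition and current service objective (DWU).
SELECT  db.[name]               AS database_name
,       ds.[edition]
,       ds.[service_objective]
FROM    sys.database_service_objectives AS ds
JOIN    sys.databases                   AS db ON ds.[database_id] = db.[database_id]
;
```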
27
Sizing by storage capacity?
Database Capacity
CREATE DATABASE MyDB COLLATE SQL_Latin1_General_CP1_CI_AS
( EDITION = 'DataWarehouse'
, SERVICE_OBJECTIVE = 'DW400'
, MAXSIZE = GB
) ;
You need to factor in the compression of the data: assuming 5x compression, a 100 TB database will hold 500 TB of data.
1 TB per DWU100 is a good place to start.
28
Tempdb sizing
29
Load management
Loading tools: PolyBase, Azure Data Factory, SSIS, bcp, 3rd party data loading tools
Delimited text guidance
  Evenly split the data into multiple files
  One file per reader
  Delimited text is the fastest
  Compressed text limits concurrent access to text files: split data across files OR use a different file format

DWU       Readers   Writers
DW100     8         60
DW200     16
DW300     24
DW400     32
DW500     40
DW600     48
DW1000+
30
Transaction Size
31
Memory Management (MB per distribution)
32
Memory grant size estimation for row groups
INT = 10 SELECT CEILING ( ( ( * * 16) /10 ) )/ ) AS Rowgroup_memorygrant_estimate_in_MiB_for_100K , CEILING ( ( * * 16) / ) AS Rowgroup_memorygrant_estimate_in_MiB_for_1M ;
33
Estimating memory grant
34
Create Login (master)
CREATE LOGIN newperson WITH PASSWORD = 'SQLB1ts!';
CREATE USER newperson FOR LOGIN newperson;
EXEC sp_addrolemember 'loginmanager', 'newperson';
EXEC sp_addrolemember 'dbmanager', 'newperson';
35
Resource class roles
SELECT ro.[name] AS [db_role_name]
FROM sys.database_principals ro
WHERE ro.[type_desc] = 'DATABASE_ROLE'
AND ro.[is_fixed_role] = 0 ;
36
Create user (user db)
CREATE USER newperson FOR LOGIN newperson ;
GRANT CONTROL ON DATABASE::ContosoRetailDW TO newperson ;

-- Add the user to a resource class role
EXEC sp_addrolemember 'mediumrc', 'newperson' ;

-- Check resource class membership
SELECT r.[name] AS role_principal_name
, m.[name] AS member_principal_name
FROM sys.database_role_members rm
JOIN sys.database_principals AS r ON rm.[role_principal_id] = r.[principal_id]
JOIN sys.database_principals AS m ON rm.[member_principal_id] = m.[principal_id]
WHERE r.[name] IN ('mediumrc', 'largerc', 'xlargerc') ;
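Resource class membership is ordinary role membership, so it can be undone in the same way. As a small sketch using the same user and role as above:

```sql
-- Remove the user from the mediumrc resource class role.
EXEC sp_droprolemember 'mediumrc', 'newperson' ;
```

Re-running the membership query afterwards should no longer list newperson against mediumrc.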
37
Identifying users with elevated requests
SELECT r.[request_id] AS Req_ID
, r.[command] AS Req_command
, r.[status] AS Req_Status
, r.[submit_time] AS Req_SubmitTime
, r.[start_time] AS Req_StartTime
, DATEDIFF(ms, r.[submit_time], r.[start_time]) AS Req_WaitDuration_ms
, r.[resource_class] AS Req_resource_class
FROM sys.dm_pdw_exec_requests r
WHERE r.[session_id] <> SESSION_ID() ;
38
Concurrency
39
Concurrency & Memory
40
Statistics
41
Key points
Created manually
Updated manually
Can make a huge difference to performance
Create statistics on columns used in:
  DISTINCT
  JOIN (+composite)
  WHERE
  GROUP BY (+composite)
  ORDER BY
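Since statistics are created and updated by hand, a typical maintenance script might look like the sketch below. The statistics names and the choice of columns are assumptions for illustration, based on the [cso].[FactOnlineSales] table used elsewhere in the deck.

```sql
-- Single-column statistics on a column used in GROUP BY.
CREATE STATISTICS stat_fos_storekey
ON [cso].[FactOnlineSales] ([StoreKey]) ;

-- Composite (multi-column) statistics for a join/group by pair, built from all rows.
CREATE STATISTICS stat_fos_productkey_datekey
ON [cso].[FactOnlineSales] ([ProductKey], [DateKey])
WITH FULLSCAN ;

-- Refresh every statistics object on the table after a load.
UPDATE STATISTICS [cso].[FactOnlineSales] ;
```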
42
Create table – default sizing
CREATE TABLE [cso].[FactOnlineSales_INS]
( [OnlineSalesKey]       int          NOT NULL
, [DateKey]              datetime     NOT NULL
, [StoreKey]             int          NOT NULL
, [ProductKey]           int          NOT NULL
, [PromotionKey]         int          NOT NULL
, [CurrencyKey]          int          NOT NULL
, [CustomerKey]          int          NOT NULL
, [SalesOrderNumber]     nvarchar(20) NOT NULL
, [SalesOrderLineNumber] int          NULL
, [SalesQuantity]        int          NOT NULL
, [SalesAmount]          money        NOT NULL
, [ReturnQuantity]       int          NOT NULL
, [ReturnAmount]         money        NULL
, [DiscountQuantity]     int          NULL
, [DiscountAmount]       money        NULL
, [TotalCost]            money        NOT NULL
, [UnitCost]             money        NULL
, [UnitPrice]            money        NULL
, [ETLLoadID]            int          NULL
, [LoadDate]             datetime     NULL
, [UpdateDate]           datetime     NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX, DISTRIBUTION = HASH([ProductKey])) ;

INSERT INTO [cso].[FactOnlineSales_INS]
SELECT * FROM [cso].[FactOnlineSales] ;

SELECT rs.*
FROM sys.dm_pdw_exec_requests er
JOIN sys.dm_pdw_request_steps rs ON er.[request_id] = rs.[request_id]
WHERE er.[session_id] = SESSION_ID() ;
43
Investigate the MPP plan
--EXPLAIN
SELECT SUM([SalesAmount])
FROM [cso].[FactOnlineSales_INS] AS fos
JOIN [cso].[DimProduct] AS dip ON fos.[ProductKey] = dip.[ProductKey]
WHERE fos.DateKey BETWEEN ' :00:00.000' AND ' :00:00.000'
GROUP BY dip.[BrandName] ;

CREATE STATISTICS stat_1 ON [cso].[FactOnlineSales_INS]([ProductKey]);
44
DMV: Row count and age of stats
SELECT *
FROM sys.schemas s
JOIN sys.tables t ON s.[schema_id] = t.[schema_id]
JOIN sys.partitions p ON t.[object_id] = p.[object_id]
WHERE s.[name] = 'cso'
AND t.[name] = 'FactOnlineSales_INS' ;

SELECT stats_id
, name AS stats_name
, STATS_DATE(object_id, stats_id) AS statistics_date
FROM sys.stats s
WHERE s.object_id = OBJECT_ID('cso.DimCustomer') ;
45
Summary
46
Minimising movement
Distribution key NOT NULL
Distribution key data skew validated
Distribution key data types compatible
Variable length columns optimised
Equijoins used on distribution keys
Distribution keys included in the join
Statistics up to date
47
Resource management
Configure load user
  Size the rowgroup for memory grant
  Set appropriate resource class
Scale for additional resources
  DWU readers
  Multiply #files by readers for balanced throughput (i.e. 60, 120, 180 etc.)
  Tempdb (300GB per DWU100)
  Transaction size
  Concurrency slots