Azure SQL Data Warehouse for SQL Server DBAs. June 2018. Warner Chaves, SQL MCM / Data Platform MVP
Thanks to our Global, Gold, Silver and Bronze sponsors: Microsoft, JetBrains, Rubrik, Delphix, Solution OMD
Bio: DBA and Consultant for 11 years. Previously L3 DBA at HP in Costa Rica, now Principal Consultant at Pythian in Ottawa, Ontario. Microsoft Data Platform MVP. Twitter: @warchav Blog: Sqlturbo.com Email: warner@sqlturbo.com Company: Pythian.com
Agenda. Objective: cover Azure SQL Data Warehouse in a way that is easy to understand and adopt for SQL Server DBAs. We will go over: why Data Warehousing in the cloud; service cost and model; fundamental differences from SQL Server; loading and querying data.
Pre-requisites: SQL Server experience. Basic Data Warehousing concepts.
Cloud vs Traditional Data Warehousing
Traditional: significant upfront investment; capacity is forecasted and fixed; the client needs to manage the solution; static or semi-static software; the client needs to complete the ecosystem.
Cloud: predictable recurring bill; dynamic capacity; the solution is managed by the provider; software in continuous improvement; tightly integrated with the rest of the cloud services.
So what is Azure SQL DW? A Microsoft Azure service, and the successor to the on-premises appliance known as APS/PDW. Targeted at running multi-TB Data Warehousing workloads. It's a PaaS service, a DWaaS (comparable to AWS Redshift and Google BigQuery). It's an MPP (Massively Parallel Processing) system: compute and storage are distributed and independent.
SMP vs MPP: Symmetric MultiProcessing vs Massively Parallel Processing
Azure SQL DW – Gen1 (architecture diagram): client connections arrive at the Control Node; the Data Movement Service coordinates the Compute Nodes, which own the Distributions stored on Azure Premium Storage.
Azure SQL DW – Compute Optimized (Gen 2) (architecture diagram): Azure Premium Storage holds the files and Columnstore segments; the Control Node and Data Movement Service coordinate the Compute Nodes, which cache data on local NVMe storage and own the Distributions.
Service Model. Compute and storage are scaled and billed separately. Compute is measured in Data Warehousing Units (DWUs); the DWUs control the capacity of the Compute Nodes. Storage is billed in 1TB increments. The service allows you to PAUSE compute and stop getting charged for it.
Backup and Recovery. The service keeps backups for 7 days. A snapshot is made every 4 to 8 hours. In case of DR, you can do a geo-restore to a 'paired datacenter' with the daily backup. If you need to retain a copy for more than 7 days, right now the option is to do a restore and then pause compute so you're only paying for storage (we're hoping for improvements in this regard…)
How is the engine different from SQL Server?
Distribution Method. The most important concept for good performance in Azure SQL DW. It determines the way ASDW will distribute the records into different buckets. There are three methods: HASH distribution, Round-Robin distribution, and Replicated.
Hash Distribution. Rows with the same distribution column value end up in the same bucket. If the distribution column is used in joins or in a GROUP BY, no data movement is necessary. If a particular value dominates the table, that distribution can become overloaded compared to the others and lower system performance.
(Diagram: an overloaded distribution caused by a dominant value)
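A minimal DDL sketch of a HASH-distributed table (the table and column names here are hypothetical examples, not from the demo):

CREATE TABLE dbo.FactSales
(
    CustomerKey INT   NOT NULL,
    ProductKey  INT   NOT NULL,
    Quantity    INT   NOT NULL,
    Amount      MONEY NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(CustomerKey),   -- rows with the same CustomerKey land in the same distribution
    CLUSTERED COLUMNSTORE INDEX         -- the default table type in ASDW
);

Joins or GROUP BYs on CustomerKey can then run without data movement.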
Round Robin Distribution. ASDW simply round-robins over the records and puts each record in a different bucket; the values in the record don't matter when assigning a bucket. Data movement is required for most operations. If a table doesn't have a good HASH column and is too big to be a replicated table, this can be the best option. Even if the data is skewed, the distribution will still be uniform.
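A round-robin sketch, typically used for staging tables that have no good hash key (hypothetical names again):

CREATE TABLE dbo.StageSales
(
    CustomerKey INT,
    ProductKey  INT,
    Quantity    INT
)
WITH
(
    DISTRIBUTION = ROUND_ROBIN,  -- rows are spread evenly regardless of their values
    HEAP                         -- heaps are a common choice for staging loads
);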
Replicated Distribution. The table is copied to each compute node. Recommended for tables smaller than 2GB: smaller tables that are usually part of join predicates, with simple predicates like equality or inequality. The storage used is the table size multiplied by the number of compute nodes, so don't abuse it.
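A replicated dimension sketch for a small, frequently joined table (hypothetical names):

CREATE TABLE dbo.DimProduct
(
    ProductKey   INT           NOT NULL,
    ProductName  NVARCHAR(100) NOT NULL,
    CurrentPrice MONEY         NOT NULL
)
WITH
(
    DISTRIBUTION = REPLICATE,       -- a full copy of the table on every compute node
    CLUSTERED INDEX (ProductKey)    -- small dimensions often do better as B-Trees than columnstores
);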
T-SQL Differences. ASDW encourages the use of the CTAS (CREATE TABLE AS SELECT) construct: it is fully parallel and logging is minimized. Joins on UPDATE and DELETE are not supported (there are workarounds, see the sketch below). MERGE is not supported (for now at least). Some of the complex data types are not present (geography, geometry, hierarchyid, xml). Full list here: https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-migrate-code
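A CTAS sketch reusing the hypothetical tables above; it shows the common workaround for an UPDATE that needs a join, rewritten as a CTAS followed by RENAME OBJECT:

-- CTAS: create a new hash-distributed columnstore table from a query, fully parallel and minimally logged
CREATE TABLE dbo.FactSales_Updated
WITH
(
    DISTRIBUTION = HASH(CustomerKey),
    CLUSTERED COLUMNSTORE INDEX
)
AS
SELECT  s.CustomerKey,
        s.ProductKey,
        s.Quantity,
        s.Quantity * p.CurrentPrice AS Amount   -- recalculation that would otherwise need an UPDATE with a join
FROM dbo.FactSales AS s
JOIN dbo.DimProduct AS p ON s.ProductKey = p.ProductKey;

-- Swap the new table in place of the old one
RENAME OBJECT dbo.FactSales         TO FactSales_Old;
RENAME OBJECT dbo.FactSales_Updated TO FactSales;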
DEMO: Portal and Metadata
Data Warehouse Design. A good service to consider if your DW is at 1TB+ and growing. The default table type is Clustered Columnstore. The ideal columnstore segment is 1 million records (same as SQL Server). ASDW uses 60 Distributions. Fact Tables: Columnstore (optionally with Partitioning) with HASH distribution (if possible). Dimension Tables: B-Tree or Columnstore (if it's a large dimension), with HASH, Replicated or Round-Robin distribution.
The thing about Partitioning. Daily: 365 partitions x 60 distributions = 21,900 partitions. At the ideal segment size of 1 million rows, it would take 21,900,000,000 records to fill them.
Partitioning is usually done at the weekly or monthly level, if it is necessary at all.
The thing about Partitioning. Monthly: 12 partitions x 60 distributions = 720 partitions. At the ideal segment size of 1 million rows, it takes 720,000,000 records to fill them.
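A sketch of a monthly-partitioned, hash-distributed fact table following the sizing above (only the FactTransactionHistory name comes from the demo; the columns and boundary dates are hypothetical):

CREATE TABLE dbo.FactTransactionHistory
(
    TransactionDate DATE NOT NULL,
    ProductKey      INT  NOT NULL,
    Quantity        INT  NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),
    CLUSTERED COLUMNSTORE INDEX,
    PARTITION ( TransactionDate RANGE RIGHT FOR VALUES
        ('2018-01-01', '2018-02-01', '2018-03-01') )   -- one boundary per month, extended as data arrives
);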
Data Loading. Two ways of loading data: through the Control Node, or with PolyBase.
Control Node methods: SSIS, BCP. For large data loads the Control Node can become a bottleneck.
PolyBase: loads from Blob Storage or Azure Data Lake. Parallel, multi-threaded load that does not go through the Control Node.
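A PolyBase load sketch: the storage account, container, credential secret, file layout and all object names are hypothetical, and it assumes a database master key already exists.

-- Credential and external data source pointing at Blob Storage
CREATE DATABASE SCOPED CREDENTIAL BlobStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

CREATE EXTERNAL DATA SOURCE SalesBlobStorage
WITH
(
    TYPE = HADOOP,
    LOCATION = 'wasbs://sales@mystorageaccount.blob.core.windows.net',
    CREDENTIAL = BlobStorageCredential
);

-- File format for comma-delimited text files
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT, FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

-- External table over the files; no data is moved yet
CREATE EXTERNAL TABLE dbo.SalesExternal
(
    TransactionDate DATE,
    ProductKey      INT,
    Quantity        INT
)
WITH (LOCATION = '/2018/', DATA_SOURCE = SalesBlobStorage, FILE_FORMAT = CsvFormat);

-- Parallel load into an internal table with CTAS, bypassing the Control Node
CREATE TABLE dbo.FactTransactionHistory_Load
WITH (DISTRIBUTION = HASH(ProductKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM dbo.SalesExternal;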
DEMO: Loading data with PolyBase
Querying Data. Azure SQL DW has some differences in terms of query execution. There are concurrency limits depending on the DWUs. There are also transaction size limits per Distribution, based on the DWUs. Each user is assigned a resource class that determines how much compute they get. Some DMVs keep historical information. The use of Query Labels is recommended for troubleshooting and monitoring.
Concurrency Limits
DWUs:               100  200  300  400  500  600  1000+
Concurrent Queries:   4    8   12   16   20   24   32 (Gen1) / 128 (Gen2)
Query execution is queued if necessary.
Resource Classes
Class:   Small (default)   Medium          Large           X-Large
Memory:  100MB             Up to 3200MB    Up to 6400MB    Up to 12800MB
Memory assigned is per distribution. The classes can also be static.
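Resource classes are implemented as database roles, so assigning one is a role membership change; a quick sketch (the user name is hypothetical):

-- Give a loading user more memory per query by placing it in the large dynamic resource class
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Return the user to the default (small) class by removing the membership
EXEC sp_droprolemember 'largerc', 'LoadUser';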
Query Label
SELECT SUM(Quantity) FROM FactTransactionHistory OPTION (LABEL = 'QuantitySum');
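Because the label is kept with the request history, the query is easy to find again in the DMVs; a small follow-up sketch:

-- Look up the labeled query in the request history DMV
SELECT request_id, [status], total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'QuantitySum'
ORDER BY submit_time DESC;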
DEMO: Querying Data
Questions?
Thanks!!