Azure SQL Data Warehouse - Data Warehouse on Cloud
Luca Ferrari
Sponsor
Organizers: GetLatestVersion.it
Agenda:
- Solutions: On Prem vs Cloud
- Introduction to Azure DWH
- SMP vs MPP
- Architecture
- Tables design
- Queries
- Scaling
- Data Load
Data warehousing solutions
On-Premise:
- Traditional SMP SQL Server (SMP)
- Analytics Platform System (PDW - APS): MPP SQL Server, PolyBase, Hadoop (MPP)
Cloud:
- Azure SQL Data Warehouse (MPP)
SMP vs MPP
SMP vs MPP
MPP: a divide-and-conquer approach to solving large data problems by using parallel computing.
- Data is divided and distributed across many computing resources
- Each computing resource operates on its portion of the data in parallel
Data warehousing solutions - On Premise
Pros:
- Fast ETL data loading (On-Prem to On-Prem only)
- No external network needed
- Flexible backup/restore policy
Cons:
- HW cost
- Storage cost
- Energy cost
- License cost
- Limited scalability (SQL SMP)
- Maintenance
- No Big Data support on the box *
* Available only with: Microsoft APS AU2, SQL Server 2016
Data warehousing solutions - Cloud
Pros:
- No HW cost
- No storage cost
- No license cost
- Scalable (up and down) according to your actual needs
- Big Data support on the box
Cons:
- Slow ETLs (on-prem to cloud and vice versa)
- No on-demand backup
Introducing Azure SQL DWH
- A relational data warehouse-as-a-service, fully managed by Microsoft.
- The industry's first elastic cloud data warehouse with proven SQL Server capabilities.
- Supports your smallest to your largest data storage needs, from GB to PB.
Architecture: Control Node
- Endpoint for connections: regular SQL endpoint (TCP 1433)
- Persists no user data (metadata only)
- Coordinates compute activity using MPP
[Diagram: a Control Node (SQL DB) running the Massively Parallel Processing (MPP) engine in front of four Compute Nodes (SQL DB), with Blob storage [WASB(S)] and HDInsight]
Architecture: Compute Node(s)
- The Control Node and the Compute Nodes are Azure SQL Databases
- An increase of DWU will increase the number of compute nodes
Architecture: Storage
- GRS storage, PBs of storage
- Blob storage [WASB(S)]: load data without incurring compute costs
Architecture: Compute and Storage
Storage and compute are decoupled, enabling a truly elastic service and separate charging for compute and storage.
- Application or user connections reach the Control Node
- Data loading: SSIS, REST, OLE, ADO, ODBC, WebHDFS, AzCopy, PS
- DMS (Data Movement Service) executes across all database nodes
- Compute: scale compute up or down when required (SLA <= 60 seconds). Pause, Restart, Stop, Start.
- Storage: add or load data to WASB(S) without incurring compute costs
Architecture: Data Movement Service
- Data Movement Service (DMS) moves data between the nodes.
- DMS gives the Compute nodes access to the data they need for joins and aggregations.
- DMS is not an Azure service. It is a Windows service that runs alongside SQL Database on all the nodes.
- Since DMS runs behind the scenes, you won't interact with it directly. However, when you look at query plans, you will notice they include some DMS operations, since data movement is necessary to run each query in parallel.
Data movement depends on:
- Table design
- Table statistics
- Queries
Demo - 1 Create Azure SQL DW
Table Architecture
Data can be distributed across nodes or replicated.
Three table types:
- Hash Distributed
- Round-Robin
- Replicated
Hash Distributed
Rows are distributed across multiple distributions based on a hash function applied to a column.

    CREATE TABLE [dbo].[FactInternetSales]
    (
        [ProductKey]        int          NOT NULL,
        [OrderDateKey]      int          NOT NULL,
        [CustomerKey]       int          NOT NULL,
        [PromotionKey]      int          NOT NULL,
        [SalesOrderNumber]  nvarchar(20) NOT NULL,
        [OrderQuantity]     smallint     NOT NULL,
        [UnitPrice]         money        NOT NULL,
        [SalesAmount]       money        NOT NULL
    )
    WITH
    (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = HASH([ProductKey])
    );
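How evenly a hash-distributed table is spread depends on the values of the chosen column. One way to check this is DBCC PDW_SHOWSPACEUSED, which reports rows and space per distribution. A minimal sketch, assuming the FactInternetSales table defined above exists and has already been loaded:

```sql
-- Returns one row per distribution for the table.
-- Compare the ROWS column across distributions: large differences
-- suggest data skew on the hash column ([ProductKey] in this example).
DBCC PDW_SHOWSPACEUSED('dbo.FactInternetSales');
```

If a few distributions hold most of the rows, a column with more distinct and more evenly spread values is probably a better distribution key.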
Round-Robin
Data is evenly (or as evenly as possible) distributed among all the distributions without a hash function.

    CREATE TABLE [dbo].[FactInternetSales]
    (
        [ProductKey]        int          NOT NULL,
        [OrderDateKey]      int          NOT NULL,
        [CustomerKey]       int          NOT NULL,
        [PromotionKey]      int          NOT NULL,
        [SalesOrderNumber]  nvarchar(20) NOT NULL,
        [OrderQuantity]     smallint     NOT NULL,
        [UnitPrice]         money        NOT NULL,
        [SalesAmount]       money        NOT NULL
    )
    WITH
    (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = ROUND_ROBIN
    );
Replicated
Data is replicated to every node. Each node holds a full copy of the table's dataset.

    CREATE TABLE DimCustomer
    (
        id        int NOT NULL,
        lastName  varchar(20),
        zipCode   varchar(6)
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED INDEX (lastName)
    );
Table Architecture and query considerations
Join compatibility by table type (left table + right table):
- Replicated + Replicated, all join types: compatible, no data movement required.
- Replicated + Distributed, Inner Join / Right Outer Join / Cross Join: compatible, no data movement required.
- Distributed + Replicated, Inner Join / Left Outer Join / Cross Join: compatible, no data movement required.
- Distributed + Distributed, all join types except cross joins can be compatible: no data movement is required if the join predicate meets the following conditions:
  - The predicate is an equality join.
  - The predicate joins two distribution columns that have matching data types.
For example, if table A is distributed on column a and table B is distributed on column b, and both a and b have matching data types, the following join is compatible:

    SELECT * FROM A JOIN B ON (A.a = B.b)

SQL Server PDW analyzes the logic of some conjunctive predicates to see if they are compatible. For example, the following join is compatible:

    SELECT * FROM A JOIN B ON (A.a = B.b AND A.a = B.b + 1)

Cross joins between two distributed tables are always incompatible.
Table Architecture - Unsupported Features
- Primary Key, Foreign Key, Unique and Check table constraints
- Unique Indexes
- Computed Columns
- Sparse Columns
- User-Defined Types
- Sequence
- Triggers
- Indexed Views
- Synonyms
Table Architecture - Unsupported Data Types
Data type: workaround
- geometry: varbinary
- geography: varbinary
- hierarchyid: nvarchar(4000)
- image: varbinary
- text: varchar
- ntext: nvarchar
- sql_variant: split the column into several strongly typed columns.
- table: convert to temporary tables.
- timestamp: rework code to use datetime2 and the CURRENT_TIMESTAMP function. Only constants are supported as defaults, so CURRENT_TIMESTAMP cannot be defined as a default constraint. If you need to migrate row version values from a timestamp typed column, use BINARY(8) for NOT NULL or VARBINARY(8) for NULL row version values.
- xml: varchar
- user defined types: convert back to their native types where possible.
- default values: default values support literals and constants only. Non-deterministic expressions or functions, such as GETDATE() or CURRENT_TIMESTAMP, are not supported.
Table Store Types
Azure SQL Data Warehouse supports both:
- Row store
  - Traditional B-Tree: Clustered, Non Clustered
  - Heap
- Columnstore
  - Only Clustered Columnstore Indexes
  - Data compression
Table Store Types: Columnstore
[Diagram: a row group split into column segments C1 to C6]
Table Store Types: Columnstore provides dramatic performance
- Updateable clustered columnstore index (CCI)
- Stores data in columnar format
- Memory-optimized for next-generation performance
- Updateable to support bulk and/or trickle loading
- Up to 100x faster
- Up to 15x compression
- Saves time and costs
Table Partitioning
Full partition support:
- Merge
- Split
- Switch
Do not over-partition your data!
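As an illustration of the partition operations listed above, here is a minimal sketch. The table name, the reduced column list and the boundary values are hypothetical and chosen only for the example:

```sql
-- Hypothetical partitioned fact table; boundaries are illustrative only.
CREATE TABLE [dbo].[FactInternetSales_Partitioned]
(
    [ProductKey]    int    NOT NULL,
    [OrderDateKey]  int    NOT NULL,
    [SalesAmount]   money  NOT NULL
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey]),
    PARTITION ( [OrderDateKey] RANGE RIGHT FOR VALUES (20150101, 20160101, 20170101) )
);

-- Split: add a new boundary. On a clustered columnstore table the
-- partition being split is expected to be empty, as it is here.
ALTER TABLE [dbo].[FactInternetSales_Partitioned] SPLIT RANGE (20180101);

-- Merge: remove a boundary, combining two adjacent partitions.
ALTER TABLE [dbo].[FactInternetSales_Partitioned] MERGE RANGE (20150101);
```

The warning against over-partitioning follows from the architecture: every table is already spread over the distributions, so each partition is divided again and a fine-grained scheme leaves very few rows per partition in each distribution.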
Table Statistics
The more SQL Data Warehouse knows about your data, the faster it can execute queries against it. You tell SQL Data Warehouse about your data by collecting statistics on it.
- Statistics are not created automatically on the Control Node: we have to create them ourselves.
- Statistics are not updated automatically on the Control Node: we have to maintain them ourselves.
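Creating and maintaining statistics by hand could look like the sketch below. It assumes the FactInternetSales table from the earlier slides; the statistics names are hypothetical:

```sql
-- Create single-column statistics on columns used in joins,
-- GROUP BY and WHERE clauses (statistics names are hypothetical).
CREATE STATISTICS stats_FactInternetSales_CustomerKey
    ON [dbo].[FactInternetSales] ([CustomerKey]);

CREATE STATISTICS stats_FactInternetSales_OrderDateKey
    ON [dbo].[FactInternetSales] ([OrderDateKey]) WITH FULLSCAN;

-- After a significant data load, refresh every statistics object on the table...
UPDATE STATISTICS [dbo].[FactInternetSales];

-- ...or refresh a single statistics object.
UPDATE STATISTICS [dbo].[FactInternetSales] (stats_FactInternetSales_OrderDateKey);
```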
Demo - 2
- Distributed: Hash, Round-Robin
- CTL/CMP Table Mapping
- Data Skew
Queries
Query execution:
- All queries are sent to the Control Node (Azure SQL Database)
- The Control Node (CTL) creates the MPP plan, which depends on:
  - Table design
  - Statistics
  - Joins
- The MPP plan is executed by the Compute (CMP) nodes
- Results are sent to the client
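To look at the MPP plan the Control Node would generate, a query can be prefixed with EXPLAIN, which returns the plan as XML without executing it. A minimal sketch, assuming the FactInternetSales and DimCustomer tables from the earlier slides exist in the dbo schema:

```sql
-- EXPLAIN returns the distributed (MPP) plan as XML instead of running the query.
EXPLAIN
SELECT c.[lastName], SUM(f.[SalesAmount]) AS TotalSales
FROM [dbo].[FactInternetSales] AS f
JOIN [dbo].[DimCustomer] AS c
    ON f.[CustomerKey] = c.[id]
GROUP BY c.[lastName];
```

In the returned XML, operations such as SHUFFLE_MOVE or BROADCAST_MOVE are data movement steps. Which of them appear depends on table design and statistics.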
Queries
Almost all T-SQL commands can be used with Azure SQL DWH:
- DDL
- DML
- Security
- Monitoring
Queries - Monitoring
Monitoring Azure SQL DWH using DMVs and System Views
DMVs:
- Connections / Sessions / Requests
- CMP configuration
- ...
System Views:
- Tables
- Tables space allocation
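A few DMV queries of the kind this slide refers to, as a minimal sketch. The selected columns are only a subset, chosen for readability:

```sql
-- Nodes that make up the data warehouse (control and compute).
SELECT pdw_node_id, type, name
FROM sys.dm_pdw_nodes;

-- Sessions that are still open.
SELECT session_id, login_name, status, login_time
FROM sys.dm_pdw_exec_sessions
WHERE status <> 'Closed';

-- Requests that have not finished yet, most recent first.
SELECT request_id, session_id, status, submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE status NOT IN ('Completed', 'Failed', 'Cancelled')
ORDER BY submit_time DESC;
```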
Monitoring - Tools DWInsight – History Monitoring tool
Queries Troubleshooting
Label your queries!

    SELECT * FROM sys.tables OPTION (LABEL = 'My Query Label')
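The label pays off when searching the request DMVs. A minimal sketch that looks up the labeled query above and then inspects its steps; the request id in the second query is a placeholder to be replaced with a value returned by the first:

```sql
-- Find the request(s) submitted with the label used above.
SELECT request_id, status, submit_time, total_elapsed_time, command
FROM sys.dm_pdw_exec_requests
WHERE [label] = 'My Query Label'
ORDER BY submit_time DESC;

-- Inspect the individual steps of one request, including DMS operations.
SELECT step_index, operation_type, location_type, status,
       total_elapsed_time, row_count, command
FROM sys.dm_pdw_request_steps
WHERE request_id = 'QID1234'   -- placeholder: use a request_id from the query above
ORDER BY step_index;
```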
Demo - 3 MPP Plan Monitoring Azure DWH
Data Warehouse Unit (DWU)
DWUs are a measure of the underlying resources, such as CPU, memory and IOPS, that are allocated to your SQL Data Warehouse. Increasing the number of DWUs increases resources and performance.
Data Warehouse Unit (DWU)
How fast do you want to go?
- It is difficult for a customer to choose which HW to go with and to know what the implications will be for performance
- The customer can grow compute and storage as needed, independently of each other
Data Warehouse Unit (DWU)
1. Select a small number of DWUs
2. Monitor your application performance
3. Determine how much faster or slower performance should be for you
4. Increase or decrease the number of DWUs
5. Continue making adjustments until you reach an optimum performance level for your business requirements
Data Warehouse Unit (DWU) - Workload Management
DW:                100  200  300  400  500  600  1000  1200  1500  2000  3000  6000
Engine nodes:      1
Worker nodes:      1    2    3    4    5    6    10    12    15    20    30    60
Concurrency slots: 8, 16, 24, 32
Scaling
Increase or decrease DWUs as needed, via the Azure Portal, T-SQL, PowerShell or the REST API.
PowerShell command:

    Set-AzureRmSqlDatabase -DatabaseName "MySQLDW" -ServerName "MyServer" -RequestedServiceObjectiveName "DW1000"

T-SQL command:

    ALTER DATABASE MyDWHName MODIFY (SERVICE_OBJECTIVE = 'DWxxxxx');
Cost saving
You can pause your DWH when you don't need it, via the Azure Portal, PowerShell or the REST API.
PowerShell command:

    Suspend-AzureRmSqlDatabase -ResourceGroupName "ResourceGroup1" -ServerName "Server01" -DatabaseName "Database02"
Demo - 4
- Pause/Resume
- Scaling: Azure Portal, T-SQL
- ... And my queries ???
Data Load
Azure to Azure:
- All data resides on Azure
- Fast and simple
On Prem to Azure:
- Data needs to be sent to Azure over the internet
- Many options, but slower than Azure to Azure
Data Load
Azure to Azure (from Blob storage):
- PolyBase to load data from Azure Blob storage (T-SQL)
- Azure Data Factory
- SSIS (SQL Server Integration Services)
On Prem to Azure:
- AzCopy (< 10 TB): load to Azure Blob storage
- bcp: from SQL to flat file, then from flat file to Azure SQL DWH
- Export data to disk (> 10 TB): send the disk to the data center by FedEx, DHL, UPS
The external network is a potential bottleneck.
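A PolyBase load from Blob storage, as mentioned above, could be sketched as follows. All object names, the storage account, the container, the folder and the file layout are hypothetical placeholders; only a reduced set of columns is shown:

```sql
-- All names below (credential, data source, file format, tables) are hypothetical.
CREATE MASTER KEY;

CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
WITH IDENTITY = 'user', SECRET = '<storage-account-key>';

-- External data source pointing at a blob container (placeholders in the URL).
CREATE EXTERNAL DATA SOURCE AzureBlobSales
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
    CREDENTIAL = AzureStorageCredential
);

-- Assumes pipe-delimited text files.
CREATE EXTERNAL FILE FORMAT PipeDelimitedText
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (FIELD_TERMINATOR = '|')
);

-- External table: metadata only, the data stays in Blob storage.
CREATE EXTERNAL TABLE [dbo].[FactInternetSales_ext]
(
    [ProductKey]    int    NOT NULL,
    [OrderDateKey]  int    NOT NULL,
    [CustomerKey]   int    NOT NULL,
    [SalesAmount]   money  NOT NULL
)
WITH (
    LOCATION = '/factinternetsales/',
    DATA_SOURCE = AzureBlobSales,
    FILE_FORMAT = PipeDelimitedText
);

-- CTAS reads the files in parallel and writes a distributed internal table.
CREATE TABLE [dbo].[FactInternetSales_loaded]
WITH (
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH([ProductKey])
)
AS
SELECT * FROM [dbo].[FactInternetSales_ext];
```

Because statistics are not created automatically, it is worth creating them on the loaded table once the CTAS has finished.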
Data Load - (Van as a Service)
Questions ?
Resources
https://azure.microsoft.com/it-it/services/sql-data-warehouse/
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-load
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-overview-manage
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-best-practices
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-manage-monitor
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-reference-tsql-statements
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-overview
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-distribute
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-develop-label
https://docs.microsoft.com/en-us/azure/sql-data-warehouse/sql-data-warehouse-tables-data-types