MPP – Maximize Parallel Productivity: Getting the most for your efforts and money in SQL DW
Agenda: MPP Architecture; Data Warehousing Units (DWUs); Cost: Storage and Compute; Creating an Instance; Distribution Keys, Indexes, Partitions, and Statistics; Loading Data; DTSQL, the Language of SQL DW
About Me: Live in Indianapolis, Indiana, USA; Data Warehousing / Analytics Consultant at DMI; Was a software developer for 8 years; Been in analytics for 10 years; MCSE: Data Management and Analytics; Also hold Hortonworks and SAP BI Certifications
Architecture: SQL DB - SMP. Shared.
Architecture: SQL DW - MPP. Shared? Nothing.
Architecture SQL DW – MPP (a little more detail)
Architecture 100 DWU
Architecture 200 DWU
Architecture 500 DWU
Cost FREE €8.72/DWU/mo. COMPUTE STORAGE DATA TRANSFER
CREATE AND CONNECT Starting the Process
What You Need: Azure Account; Spending Limit; Azure SQL Database; Azure VM (optional); Software
Azure Account: MSDN / VS or Trial
Use existing ---or--- Create later
Software: SQL Server Management Studio; Visual Studio; SQL Server Data Tools; Azure Storage Explorer; Azure Feature Pack for SSIS
Create The Instance Provision the SQL DW
Create The Instance The Blade
Create The Instance: Set Firewall Rules. Connecting from a client outside Azure requires a server-level firewall rule for the client's IP address.
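Firewall rules are usually set in the portal blade, but they can also be scripted. The following is a minimal sketch, assuming you are connected to the master database of the logical server with an admin login. The rule name and IP address are made-up placeholders, not values from this deck.

```sql
-- Run while connected to the master database of the logical server.
-- Rule name and IP address are hypothetical placeholders.
EXECUTE sp_set_firewall_rule
    @name = N'OfficeWorkstation',
    @start_ip_address = '203.0.113.10',
    @end_ip_address = '203.0.113.10';

-- List the server-level rules currently in place.
SELECT name, start_ip_address, end_ip_address
FROM sys.firewall_rules;
```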
CONNECT SSMS: 16.0-17.2+
CONNECT Visual Studio (2017 has code completion!)
Storage concepts: Distribution, Indexes, Partitions, Statistics
Distribution: Round Robin
Source rows:
  Cust #  Name             City
  24      Britton Gray     McCordsville, IN
  72      Pamela Stephens  Duluth, GA
  119     Louis Wright     Gloucester, MA
  240     Amy Crosby       Renton, WA
  278     Max Cook         Belton, TX
Distributions A, B, C are filled in turn:
  Next up A: Cust # 24 goes to A
  Next up B: Cust # 72 goes to B
  Next up C: Cust # 119 goes to C
  Next up A: Cust # 240 goes to A
  Next up B: Cust # 278 goes to B
Result: A = 24, 240; B = 72, 278; C = 119
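In DDL terms, round-robin placement is requested in the table's WITH clause. This is a sketch only: the table name, column names, and column types are assumptions based on the customer rows in the slide.

```sql
-- Hypothetical customer table; rows are spread evenly across
-- distributions with no regard to their values.
CREATE TABLE dbo.Customer_RR
(
    CustomerNumber INT          NOT NULL,
    Name           VARCHAR(100) NOT NULL,
    City           VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = ROUND_ROBIN);
```

Round robin needs no key choice, which makes it a common pick for staging tables. The trade-off is that joins on any column may require data movement between distributions.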
Distribution: Hash Distributed
Source rows:
  Cust #  Name             City
  24      Britton Gray     McCordsville, IN
  72      Pamela Stephens  Duluth, GA
  119     Louis Wright     Gloucester, MA
  240     Amy Crosby       Renton, WA
  278     Max Cook         Belton, TX
Each Cust # is hashed, and the hash result picks the distribution:
  Cust #  Hash Result
  24      B
  72      C
  119     B
  240     A
  278     C
Result: A = 240; B = 24, 119; C = 72, 278
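A hash-distributed version of the same table might look like the sketch below. The table and column names are assumptions for illustration. The DBCC command reports rows and space per distribution, which is one way to spot the kind of uneven placement the slide ends with.

```sql
-- Hypothetical customer table; the hash of CustomerNumber decides
-- which distribution stores each row.
CREATE TABLE dbo.Customer_Hash
(
    CustomerNumber INT          NOT NULL,
    Name           VARCHAR(100) NOT NULL,
    City           VARCHAR(100) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerNumber));

-- After loading, check how evenly the rows landed.
DBCC PDW_SHOWSPACEUSED('dbo.Customer_Hash');
```

Rows with the same key value always land in the same distribution, so a skewed key column leads to skewed distributions.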
INDEXES: Much like SQL Server, but indexed within a distribution. Default: Clustered Columnstore
  Cust #  Order #  Order Date
  24      1095     3/1/2016
          2210     8/15/2016
          2901     11/14/2016
  119     1140     3/16/2016
          3319     12/10/2016
INDEXES Available: B-tree column indexes MSFT: “Use them for high cardinality columns that are used as filters in queries returning a small number of rows.”
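As a rough illustration of the rowstore alternative, the sketch below builds a table with a clustered index instead of the columnstore default and adds a nonclustered index on a filter column. The table, column, and index names are hypothetical.

```sql
-- Hypothetical lookup table stored as a rowstore with a clustered index.
CREATE TABLE dbo.OrderLookup
(
    CustomerNumber INT  NOT NULL,
    OrderNumber    INT  NOT NULL,
    OrderDate      DATE NOT NULL
)
WITH (CLUSTERED INDEX (OrderNumber),
      DISTRIBUTION = HASH(CustomerNumber));

-- Secondary index for queries that filter on OrderDate
-- and return only a few rows.
CREATE INDEX IX_OrderLookup_OrderDate
ON dbo.OrderLookup (OrderDate);
```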
Partitions: Also much like SQL Server. Partitions exist within a distribution, but must be consistent across distributions. Partition switching is particularly effective given the “CTAS” nature of ELT. Optimal: make sure each partition will have >1M rows in each distribution. Diagram: distributions A, B, and C each hold the same Order Date partitions, [2017] and [2016].
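The CTAS-plus-switch pattern can be sketched as follows. All table and column names are assumptions. The sketch also assumes the staging query produces columns with the same types and nullability as the target, and that the target partition is empty before the switch.

```sql
-- Hypothetical target table, partitioned by year on OrderDate.
CREATE TABLE dbo.FactOrders
(
    CustomerNumber INT  NOT NULL,
    OrderNumber    INT  NOT NULL,
    OrderDate      DATE NOT NULL
)
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(CustomerNumber),
      PARTITION (OrderDate RANGE RIGHT FOR VALUES ('2016-01-01', '2017-01-01')));

-- Build the 2017 rows with CTAS, using the same distribution and boundaries.
CREATE TABLE dbo.FactOrders_2017
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(CustomerNumber),
      PARTITION (OrderDate RANGE RIGHT FOR VALUES ('2016-01-01', '2017-01-01')))
AS
SELECT CustomerNumber, OrderNumber, OrderDate
FROM dbo.StageOrders
WHERE OrderDate >= '2017-01-01';

-- With RANGE RIGHT and two boundaries, partition 3 holds OrderDate >= '2017-01-01'.
ALTER TABLE dbo.FactOrders_2017 SWITCH PARTITION 3 TO dbo.FactOrders PARTITION 3;
```

The switch is a metadata operation, so the new year's rows appear in the target table without being copied again.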
Statistics: SQL Server: statistics kept on tables. SQL DW: statistics are kept on individual columns. They tell the control node what column value distributions look like across nodes (“How can I move the least amount of data?”). Index columns used to filter or join.
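Creating column statistics by hand might look like this. It is a sketch only: the table, columns, and statistics names are hypothetical, chosen to match a fact table that is joined on a customer key and filtered on an order date.

```sql
-- Single-column statistics on a join column and a filter column.
CREATE STATISTICS stats_FactOrders_CustomerNumber
ON dbo.FactOrders (CustomerNumber);

CREATE STATISTICS stats_FactOrders_OrderDate
ON dbo.FactOrders (OrderDate);

-- Refresh all statistics on the table after a sizeable load.
UPDATE STATISTICS dbo.FactOrders;
```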
Loading Data: Getting lots of data in efficiently
Loading Methods
PolyBase Preferred Method Scales with DWUs as each compute node is PolyBase capable
PolyBase SETUP PROCESS: Copy data into Blob storage (Storage Explorer or AzCopy). Create: scoped credential; external data source; external file format; external table
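The four objects in the setup list could be created roughly as shown here, assuming comma-delimited text files in Azure Blob storage. Every object name, the column list, the folder path, and the angle-bracket placeholders are assumptions to be replaced with your own values.

```sql
-- A master key is needed to protect the credential secret.
CREATE MASTER KEY;

-- 1. Scoped credential holding the storage account key (placeholder).
CREATE DATABASE SCOPED CREDENTIAL BlobCredential
WITH IDENTITY = 'user',
     SECRET = '<storage-account-key>';

-- 2. External data source pointing at the Blob container (placeholders).
CREATE EXTERNAL DATA SOURCE AzureBlob
WITH (TYPE = HADOOP,
      LOCATION = 'wasbs://<container>@<account>.blob.core.windows.net',
      CREDENTIAL = BlobCredential);

-- 3. External file format describing the delimited text files.
CREATE EXTERNAL FILE FORMAT CsvFormat
WITH (FORMAT_TYPE = DELIMITEDTEXT,
      FORMAT_OPTIONS (FIELD_TERMINATOR = ',', STRING_DELIMITER = '"'));

-- 4. External table over the files in a hypothetical folder.
CREATE EXTERNAL TABLE dbo.ExternalTable
(
    MyDistColumn INT            NOT NULL,
    DateColumn   DATE           NOT NULL,
    Amount       DECIMAL(18, 2) NULL
)
WITH (LOCATION = '/orders/',
      DATA_SOURCE = AzureBlob,
      FILE_FORMAT = CsvFormat);
```

The external table stores no data itself. Selecting from it reads the files in Blob storage at query time.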
PolyBase LOAD PROCESS
Initial load: CTAS (CREATE TABLE AS SELECT)
    CREATE TABLE MyTable
    WITH (CLUSTERED COLUMNSTORE INDEX,
          DISTRIBUTION = HASH(MyDistColumn),
          PARTITION (DateColumn RANGE RIGHT FOR VALUES ('2010-01-01', '2011-01-01' . . .)))
    AS SELECT * FROM ExternalTable;
Incremental load: INSERT INTO. Try to keep these in smaller batches.
Loads are automatically parallelized.
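One way to keep an incremental load small is to restrict each INSERT to a narrow date range. The sketch below reuses the MyTable, ExternalTable, and DateColumn names from the slide. The month boundaries are made-up example values.

```sql
-- Append a single month of new rows from the external table.
INSERT INTO MyTable
SELECT *
FROM ExternalTable
WHERE DateColumn >= '2017-06-01'
  AND DateColumn <  '2017-07-01';
```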
BCP LOAD PROCESS: ASCII or UTF-16 only. Run from machine with source files (or direct access to them). bcp <table name> in <file> -S <server> -d <database> -U <user> -P <password> -t <'delimiter'>
SSIS. LEGACY METHOD: Use SQL Server destination; change connection target; < 10K rows per second. AZURE SQL DW UPLOAD TASK: Part of SSIS Azure Feature Pack; UTF-8 text files only.
DTSQL: Diminished Distributed Transact SQL
ELT with CTAS CREATE TABLE AS SELECT All-or-nothing Minimal logging Preferred Method of ELT (Extract, Load, Transform)
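A transform step written as CTAS might look like the following sketch. It assumes a hypothetical staging table dbo.StageOrders with CustomerNumber, OrderDate, and OrderAmount columns. None of these names come from the deck.

```sql
-- Build an aggregated table from staged rows in one statement.
-- If the statement fails, no partial table is left behind.
CREATE TABLE dbo.OrderTotals
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(CustomerNumber))
AS
SELECT CustomerNumber,
       YEAR(OrderDate)  AS OrderYear,
       COUNT(*)         AS OrderCount,
       SUM(OrderAmount) AS TotalAmount
FROM dbo.StageOrders
GROUP BY CustomerNumber, YEAR(OrderDate);
```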
NOT SUPPORTED: Many Functions. TRY_CAST(), TRY_CONVERT(): use ISNUMERIC() or ISDATE() before CAST/CONVERT (not perfect). FORMAT. TRIM. XML/JSON functions. Security: row-level security; dynamic data masking.
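The ISNUMERIC()/ISDATE() workaround can be sketched with CASE expressions, as in this example. The staging table and its Raw* columns are hypothetical. LTRIM(RTRIM()) is shown as a common stand-in for TRIM.

```sql
SELECT RawAmount,
       -- Returns NULL instead of failing when the text is not numeric.
       -- Not perfect: ISNUMERIC accepts strings such as '$' or '1e5',
       -- which still fail the CAST to DECIMAL.
       CASE WHEN ISNUMERIC(RawAmount) = 1
            THEN CAST(RawAmount AS DECIMAL(18, 2))
       END AS Amount,
       -- Same idea for dates.
       CASE WHEN ISDATE(RawOrderDate) = 1
            THEN CAST(RawOrderDate AS DATE)
       END AS OrderDate,
       -- Stand-in for TRIM.
       LTRIM(RTRIM(RawName)) AS Name
FROM dbo.StageOrdersRaw;
```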
NOT SUPPORTED: Miscellanea. MERGE statement; global temporary tables (##); cursors; geometric / geospatial data; R Services. Pausing / scaling immediately kills all running operations.
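Without MERGE, an upsert is often rebuilt as a CTAS followed by a rename. The sketch below shows one such pattern under assumed names: dbo.DimCustomer is the existing table and dbo.StageCustomer holds the new and changed rows. It is an illustration, not the only workaround.

```sql
-- New table = every staged row, plus existing rows that were not restaged.
CREATE TABLE dbo.DimCustomer_upsert
WITH (CLUSTERED COLUMNSTORE INDEX,
      DISTRIBUTION = HASH(CustomerNumber))
AS
SELECT s.CustomerNumber, s.Name, s.City
FROM dbo.StageCustomer AS s
UNION ALL
SELECT t.CustomerNumber, t.Name, t.City
FROM dbo.DimCustomer AS t
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.StageCustomer AS s
                  WHERE s.CustomerNumber = t.CustomerNumber);

-- Swap the new table in, then drop the old one.
RENAME OBJECT dbo.DimCustomer TO DimCustomer_old;
RENAME OBJECT dbo.DimCustomer_upsert TO DimCustomer;
DROP TABLE dbo.DimCustomer_old;
```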
Benchmarks SQL DW is so fast…
HOW FAST IS IT? Test data set
HOW FAST IS IT? Loading Data (Polybase)
HOW FAST IS IT? Star Query
QUESTIONS?