Microsoft Analytics Platform System 03 – Distribution Theory & Design

Brian Walker | Microsoft Architect – Data Insights COE
Jesse Fountain | Microsoft WW TSP Lead
April 17, 2019

Agenda
MPP Database Design & Layout: Elements, Tempdb
MPP Table Distribution Concepts
Understanding Data Skew

Elements of MPP database

APS benefits: Appliance simplicity for DBA
CREATE DATABASE database_name
WITH (
    [ AUTOGROW = ON | OFF , ]
    REPLICATED_SIZE = replicated_size [ GB ] ,
    DISTRIBUTED_SIZE = distributed_size [ GB ] ,
    LOG_SIZE = log_size [ GB ]
)

Creating a database
User creates a database.
PDW creates a logical shell database on the Control node.
PDW creates a physical application database on each Compute node.
Total number of databases created = # of Compute nodes + 1
[Diagram labels: WFOHST01; CTL01; CMP01, CMP02, CMP03, CMP04, CMP05, CMP06. With these six Compute nodes, 7 databases are created.]

Understanding distributions
Each distribution:
  Maps to a physical table
  Has allocated physical space
  Equates to a bucket of data
  Contains all records for a distribution value
There are 8 distributions on every compute node (A-H).
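The counting rules on these two slides (one shell database plus one database per Compute node, and 8 distributions per Compute node) can be checked with a small sketch. The function name `appliance_layout` and the `CMPnn` / `DIST_x` label format are illustrative assumptions borrowed from the diagram labels; this is not an APS API.

```python
# Hypothetical helper: models the counting rules from the slides, nothing more.
DISTRIBUTION_LETTERS = "ABCDEFGH"  # 8 distributions per compute node (A-H)


def appliance_layout(compute_nodes):
    """Return (number of databases created, list of (node, distribution) pairs)."""
    if compute_nodes < 1:
        raise ValueError("an appliance needs at least one compute node")
    # One logical shell database on the Control node + one per Compute node.
    databases = compute_nodes + 1
    distributions = [
        (f"CMP{node:02d}", f"DIST_{letter}")
        for node in range(1, compute_nodes + 1)
        for letter in DISTRIBUTION_LETTERS
    ]
    return databases, distributions


databases, distributions = appliance_layout(6)
print(databases)           # 7
print(len(distributions))  # 48
```

For the six-node layout in the diagram, a distributed table is therefore spread over 48 buckets.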

[Diagram: compute node database DB_GUID, showing DIST_A through DIST_H, PRIMARY and REPLICATED, with labels V0 to V16, F1 to F16 and L1 to L16.]

Tempdb

Tempdb in PDW
Logical name tempdb, physical name pdwtempdb1
Must specify #<table_name> WITH (LOCATION = User_DB)
[Diagram labels: WFOHST01; CTL01 (Shell); CMP01 to CMP06 (User).]

Temporary databases
Table columns: Logical Name, Physical Name, Control Node, Compute Nodes, Purpose
tempdb / pdwtempdb1: temp tables
tempdb-sql: Q Tables, Sorts & Spills
Q Tables are used by the DMS during data movement.

MPP tables

Two table types
Replicated:
  Duplicate copy of table maintained on each node
  Smaller tables (<3GB) only
Distributed:
  Table is hashed on a single column and uniformly distributed across all nodes
  Each node has 8 distributions
  Each distribution is a separate physical table
Why two types? To co-locate data on each node and so minimize data movement for multi-table joins.

APS benefits: Appliance simplicity for table creation
CREATE TABLE table_name
    ( { <column_definition> } [ ,...n ] )
WITH (
    CLUSTERED COLUMNSTORE INDEX
  | CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] ) ,
    DISTRIBUTION = { HASH ( distribution_column_name ) | REPLICATE } ,
    PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ]
                FOR VALUES ( [ boundary_value [ ,...n ] ] ) )
)

Star schema example: Replicated tables
Smaller dimension tables are replicated: every compute node holds a full copy of each dimension (TD, SD, PD, MD).
Time Dim (TD): Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
Store Dim (SD): Store Dim ID, Store Name, Store Mgr, Store Size
Product Dim (PD): Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
Mktg Campaign Dim (MD): Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
Sales Facts: Mktg Camp Id, Qty Sold, Dollars Sold

What is the cost of replicating data?
Writes will be slower than for distributed tables, since there is only one writer on each node (as opposed to 8 per node for distributed tables).
Instead of writing data once, you write it n times:
  PDW region: n = number of compute nodes in your appliance
  Hadoop: n = replication factor set in Hadoop (default is 3)
  HDInsight: n = 2 or 3 by default, depending on the size of the Hadoop region
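As a rough illustration of the write cost described above, the sketch below counts physical row writes and parallel writers. The function names are hypothetical, and the model ignores everything except the two figures the slide gives (n copies, and 1 versus 8 writers per node).

```python
WRITERS_PER_NODE_DISTRIBUTED = 8  # one writer per distribution (A-H)
WRITERS_PER_NODE_REPLICATED = 1   # a single writer per node


def rows_physically_written(rows, copies):
    """A replicated table writes every row once per copy (n times in total)."""
    return rows * copies


def parallel_writers(compute_nodes, replicated):
    """Number of writers working at the same time across the appliance."""
    per_node = (WRITERS_PER_NODE_REPLICATED if replicated
                else WRITERS_PER_NODE_DISTRIBUTED)
    return compute_nodes * per_node


# PDW region with 6 compute nodes: n = 6
print(rows_physically_written(1_000_000, copies=6))  # 6000000
print(parallel_writers(6, replicated=True))          # 6
print(parallel_writers(6, replicated=False))         # 48
```

Under these assumptions a replicated load writes six times the data with one eighth of the writers, which is why the slides reserve replication for small tables.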

Create replicated table example
Create table metadata on the Control Node, then send the CREATE TABLE SQL to each compute node (Compute Node 1, Compute Node 2, … Compute Node 8), where the Product table is created.

CREATE TABLE Product (
    ProductKey      INT         NOT NULL,
    ProductCode     INT         NOT NULL,
    ProductEffDate  INT         NOT NULL,
    ProductDesc     VARCHAR(50) NOT NULL,
    ProductSubGrp   VARCHAR(20) NOT NULL,
    ProductCategory VARCHAR(20) NOT NULL,
    ProductPrice    FLOAT       NOT NULL,
    …
)
WITH (
    DISTRIBUTION = REPLICATE,       -- full copy on every compute node
    CLUSTERED INDEX (ProductKey),
    PARTITION (ProductEffDate RANGE RIGHT FOR VALUES ( , , ))
);

Load sequence for replicated tables:
  The table gets loaded on one node first (node determined by round-robin).
  When complete, APS copies the table onto the remaining nodes.

Star schema example: Distributed tables
Larger fact table is hash distributed across all nodes.
[Diagram: the same star schema (Time Dim, Store Dim, Product Dim and Mktg Campaign Dim around Sales Facts). Each compute node holds the replicated dimensions TD, SD, PD, MD plus its own slices of the fact table: Sales Fact A, Sales Fact B, … Sales Fact H.]

Create distributed table: Under the covers
Create table metadata on the Control Node, then send the CREATE TABLE SQL to each compute node (Compute Node 1, Compute Node 2, … Compute Node 8).

CREATE TABLE FactSales (
    ProductKey        INT         NOT NULL,
    OrderDateKey      INT         NOT NULL,
    DueDateKey        INT         NOT NULL,
    ShipDateKey       INT         NOT NULL,
    ResellerKey       INT         NOT NULL,
    EmployeeKey       INT         NOT NULL,
    PromotionKey      INT         NOT NULL,
    CurrencyKey       INT         NOT NULL,
    SalesTerritoryKey INT         NOT NULL,
    SalesOrderNumber  VARCHAR(20) NOT NULL,
    …
)
WITH (
    DISTRIBUTION = HASH (ProductKey),   -- rows are placed by hashing this column
    CLUSTERED INDEX (OrderDateKey),
    PARTITION (OrderDateKey RANGE RIGHT FOR VALUES ( , , ))
);

On each compute node this becomes one physical table per distribution:
Create Table FactSales_A, Create Table FactSales_B, Create Table FactSales_C, … Create Table FactSales_H
Resulting tables: FactSales_A, FactSales_B, FactSales_C, FactSales_D, FactSales_E, FactSales_F, FactSales_G, FactSales_H

Distributed tables Data is distributed across the entire cluster
A hashing function is used to spread the data.
The hash is created during either Insert or Load.
The hash is performed by DMS, not SQL Server.
The hash is based on a single column in a table.
The column chosen for the hash is not updateable.
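The points above can be pictured with a minimal sketch of hash placement. The slides do not say which hash function DMS uses, so MD5 over the value's text form is used here purely as a stand-in; `distribution_for` is a hypothetical name, not an APS function.

```python
import hashlib
from collections import Counter

DISTRIBUTION_LETTERS = "ABCDEFGH"  # 8 distributions per compute node


def distribution_for(value, compute_nodes):
    """Map one distribution-column value to (node number, distribution letter).

    Assumption: MD5 stands in for the real, undocumented DMS hash.
    """
    digest = hashlib.md5(repr(value).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % (
        compute_nodes * len(DISTRIBUTION_LETTERS))
    node, dist = divmod(bucket, len(DISTRIBUTION_LETTERS))
    return node + 1, DISTRIBUTION_LETTERS[dist]


# The same value always lands in the same bucket, on insert or on load.
assert distribution_for(42, 6) == distribution_for(42, 6)

# 10,000 distinct keys spread over all 6 * 8 = 48 buckets.
placement = Counter(distribution_for(key, 6) for key in range(10_000))
print(len(placement))  # 48
```

Because a row's location is derived from the column value, changing that value would mean physically moving the row, which is consistent with the column not being updateable.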

Extremely important points
The hash is consistent across all tables.
Great for joining tables together.
Columns that are joined must have consistent data types (e.g. numeric, date, character).
Rationalize your data types!
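Why a consistent hash makes joins cheap can be sketched as follows: if two tables are distributed on the join column with the same function, matching rows always share a bucket, so each bucket can be joined locally. All names here are hypothetical, and MD5 is again only a stand-in for the real hash.

```python
import hashlib


def bucket_of(value, buckets=48):
    """Stand-in hash (assumption): same value and type give the same bucket."""
    digest = hashlib.md5(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def distribute(rows, key, buckets=48):
    """Place each row in the bucket chosen by its distribution column."""
    placed = {b: [] for b in range(buckets)}
    for row in rows:
        placed[bucket_of(row[key], buckets)].append(row)
    return placed


def local_join(left, right, key, buckets=48):
    """Join bucket by bucket; no row ever leaves its own bucket."""
    left_placed = distribute(left, key, buckets)
    right_placed = distribute(right, key, buckets)
    return sorted(
        (l["id"], r["id"])
        for b in range(buckets)
        for l in left_placed[b]
        for r in right_placed[b]
        if l[key] == r[key]
    )


def global_join(left, right, key):
    """Reference result: join as if all rows sat on one server."""
    return sorted((l["id"], r["id"]) for l in left for r in right
                  if l[key] == r[key])


sales = [{"id": i, "ProductKey": i % 7} for i in range(50)]
returns = [{"id": 100 + i, "ProductKey": i % 5} for i in range(20)]
print(local_join(sales, returns, "ProductKey")
      == global_join(sales, returns, "ProductKey"))  # True
```

In this model the integer 3 and the string "3" are hashed from different text forms and need not share a bucket, which is one way to read the advice to rationalize data types.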

Consequences of distribution theory
Distribution key is not updateable!
Pick a column that:
  Is static
  Does not need inferred members (-1)
  Does not contain NULL
Changing the distribution = re-creating the table!
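One reading of the advice about inferred members and NULLs is that every row carrying the same placeholder key hashes to the same bucket. The sketch below measures that effect under stated assumptions: MD5 stands in for the real hash, and `largest_bucket_share` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def bucket_of(value, buckets=48):
    """Stand-in hash (assumption), not the real DMS function."""
    digest = hashlib.md5(repr(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def largest_bucket_share(keys, buckets=48):
    """Fraction of all rows that end up in the fullest bucket."""
    counts = Counter(bucket_of(key, buckets) for key in keys)
    return max(counts.values()) / len(keys)


clean_keys = list(range(1, 10_001))                    # every key is distinct
inferred_keys = [-1] * 3_000 + list(range(1, 7_001))   # 30% inferred members

print(largest_bucket_share(clean_keys) < 0.05)      # True: close to 1/48
print(largest_bucket_share(inferred_keys) >= 0.30)  # True: all -1 rows share a bucket
```

With 30% of rows keyed as -1, one of 48 buckets holds at least 30% of the table, however well the remaining keys spread.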

APS distributed query in action
The optimizer on the Control Node creates a parallel query plan.
Each compute node runs a portion of the query in parallel.
Results are aggregated on each node.
Final results are streamed back through the Control Node.
[Diagram: User Query; Control Node, a.k.a. ‘The Brains’ (Optimizer, Metadata, Statistics, Data Movement Services); Compute Servers, a.k.a. ‘The Brawn’ (DMS, Balanced Storage) …]
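The flow on this slide (partial work on each compute node, final results assembled through the Control Node) resembles a two-stage aggregation. The sketch below models a SUM grouped by key; the function names and the sample rows are illustrative assumptions, not APS internals.

```python
from collections import Counter


def compute_node_partial(rows):
    """Stage 1, run on each compute node: aggregate only the local rows."""
    partial = Counter()
    for key, amount in rows:
        partial[key] += amount
    return partial


def control_node_merge(partials):
    """Stage 2, run on the Control Node: combine the per-node partial results."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values key by key
    return dict(total)


# Hypothetical rows (product, dollars sold) as they sit on three compute nodes.
node_rows = [
    [("bike", 100), ("helmet", 20)],
    [("bike", 250)],
    [("helmet", 30), ("lock", 15)],
]
partials = [compute_node_partial(rows) for rows in node_rows]
print(control_node_merge(partials))  # {'bike': 350, 'helmet': 50, 'lock': 15}
```

Only the small partial results travel to the Control Node; the detail rows stay where they are stored.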

A word about data skew

Performance impact of skew
When the data is skewed:
  Some queries finish very quickly.
  Others take disproportionately longer.
  The user query is only complete when all queries have finished.
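Since the user query completes only when the slowest per-distribution query completes, elapsed time follows the fullest distribution rather than the average. A minimal sketch, assuming a hypothetical constant scan rate of one million rows per second per distribution:

```python
def query_elapsed_seconds(rows_per_distribution, rows_per_second=1_000_000):
    """The user query finishes only when the slowest distribution finishes."""
    return max(rows_per_distribution) / rows_per_second


even = [1_000_000] * 8                # 8,000,000 rows, evenly spread
skewed = [4_500_000] + [500_000] * 7  # the same 8,000,000 rows, skewed

print(query_elapsed_seconds(even))    # 1.0
print(query_elapsed_seconds(skewed))  # 4.5
```

Seven of the eight skewed distributions finish in half a second, yet the query as a whole takes 4.5 times as long as in the even case.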

Impact of skew on storage
When one bucket is full, all buckets are full.
Skewed data leads to accelerated consumption of storage capacity.
Available storage is therefore a logical concept: the storage effectively consumed is calculated as MAX(storage used by bucket) * # of buckets.
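The storage formula on this slide can be worked through with a small example. The function names and the sample sizes in GB are illustrative assumptions.

```python
def effective_storage_used(used_per_bucket):
    """MAX(storage used by bucket) * # of buckets, as stated on the slide."""
    return max(used_per_bucket) * len(used_per_bucket)


def capacity_lost_to_skew(used_per_bucket):
    """Storage counted as consumed that holds no data."""
    return effective_storage_used(used_per_bucket) - sum(used_per_bucket)


even_gb = [10] * 8           # 80 GB of data, evenly spread
skewed_gb = [10] * 7 + [40]  # 110 GB of data, one hot bucket

print(effective_storage_used(even_gb))    # 80
print(effective_storage_used(skewed_gb))  # 320
print(capacity_lost_to_skew(skewed_gb))   # 210
```

In the skewed case 110 GB of data consumes capacity as if it were 320 GB, because every bucket is treated as being as full as the fullest one.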

Demo | Distribution

© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
