Microsoft Analytics Platform System 03 – Distribution Theory & Design

Presentation transcript:

Microsoft Analytics Platform System 03 – Distribution Theory & Design
Brian Walker | Microsoft Architect – Data Insights COE
Jesse Fountain | Microsoft WW TSP Lead
April 17, 2019

Agenda
- MPP Database Design & Layout
  - Elements
  - Tempdb
- MPP Table Distribution Concepts
- Understanding Data Skew

Elements of MPP database

APS benefits: Appliance simplicity for DBA

    CREATE DATABASE database_name
    WITH (
        [ AUTOGROW = ON | OFF , ]
        REPLICATED_SIZE = replicated_size [ GB ] ,
        DISTRIBUTED_SIZE = distributed_size [ GB ] ,
        LOG_SIZE = log_size [ GB ]
    )

Creating a database
- User creates database
- PDW creates a logical shell database on the Control node
- PDW creates a physical application database on each Compute node
- Total number of databases created = # of Compute nodes + 1
[Diagram: WFOHST01; Control node CTL01; Compute nodes CMP01, CMP02, CMP03, CMP04, CMP05, CMP06]

Understanding distributions
- Each distribution:
  - Maps to a physical table
  - Is allocated physical space
- 8 distributions on every compute node (A-H)
- Each distribution equates to a bucket of data
- Each distribution contains all records for a distribution value
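The counts on the two slides above can be worked through as a small sketch. The function name `appliance_layout` is hypothetical and only restates the slide arithmetic (one shell database plus one database per compute node, and 8 distributions per compute node); it does not talk to an appliance.

```python
DISTRIBUTIONS_PER_NODE = 8  # distributions A-H on every compute node, per the slide


def appliance_layout(compute_nodes):
    """Return database and distribution counts for one user database."""
    if compute_nodes < 1:
        raise ValueError("an appliance needs at least one compute node")
    return {
        # one logical shell database on the Control node plus one per Compute node
        "databases": compute_nodes + 1,
        # every compute node hosts the same fixed number of distributions
        "distributions": compute_nodes * DISTRIBUTIONS_PER_NODE,
    }


# Six compute nodes, as in the CMP01-CMP06 diagram.
print(appliance_layout(6))  # {'databases': 7, 'distributions': 48}
```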

Compute node database
[Diagram: database DB_GUID on a compute node, showing DIST_A through DIST_H, PRIMARY, and REPLICATED, with labels V0-V16, F1-F16, and L1-L16]

Tempdb

Tempdb in PDW
- tempdb (Pdwtempdb1): must specify #<table_name>
- User database: must specify #<table_name> WITH (LOCATION = User_DB)
[Diagram: WFOHST01; CTL01 holds the shell database; CMP01 through CMP06 each hold a user database]

Temporary databases
(table columns: Logical Name, Physical Name, Control Node, Compute Nodes, Purpose)
- tempdb / pdwtempdb1: temp tables
- tempdb-sql: Q Tables, Sorts & Spills
Q Tables are used by the DMS during data movement.

MPP tables

Two table types
- Replicated
  - Duplicate copy of the table maintained on each node
  - Smaller tables (<3GB) only
- Distributed
  - Table is hashed on a single column and uniformly distributed across all nodes
  - Each node has 8 distributions
  - Each distribution is a separate physical table
- Why two types? To co-locate data on each node and minimize data movement for multi-table joins

APS benefits: Appliance simplicity for table creation

    CREATE TABLE table_name
        ( { <column_definition> } [ ,...n ] )
    WITH (
        CLUSTERED COLUMNSTORE INDEX
      | CLUSTERED INDEX ( { index_column_name [ ASC | DESC ] } [ ,...n ] ) ,
        DISTRIBUTION = { HASH ( distribution_column_name ) | REPLICATE } ,
        PARTITION ( partition_column_name RANGE [ LEFT | RIGHT ]
            FOR VALUES ( [ boundary_value [ ,...n ] ] ) )
    )

Star schema example: Replicated tables
Smaller dimension tables are replicated.
- Time Dim (TD): Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
- Store Dim (SD): Store Dim ID, Store Name, Store Mgr, Store Size
- Product Dim (PD): Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
- Mktg Campaign Dim (MD): Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
- Sales Facts: Mktg Camp Id, Qty Sold, Dollars Sold
[Diagram: every compute node holds a full copy of TD, SD, PD, and MD]

What is the cost of replicating data?
- Writes will be slower than for distributed tables, since there is only one writer on each node (as opposed to 8 per node for distributed tables).
- Instead of writing data once, you write it n times:
  - PDW region: n = number of compute nodes in your appliance
  - Hadoop: n = replication factor set in Hadoop (default is 3)
  - HDInsight: n = 2 or 3 by default, depending on the size of the Hadoop region
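The "write it n times" point can be put into numbers with a minimal sketch. The helper `physical_row_writes`, the one-million-row load, and the 6-node region are illustrative assumptions, not figures from the deck.

```python
def physical_row_writes(logical_rows, copies):
    """Rows physically written when every logical row is stored `copies` times."""
    if logical_rows < 0 or copies < 1:
        raise ValueError("logical_rows must be >= 0 and copies >= 1")
    return logical_rows * copies


rows = 1_000_000  # hypothetical load size
print(physical_row_writes(rows, copies=1))  # distributed table: each row written once
print(physical_row_writes(rows, copies=6))  # replicated table on a hypothetical 6-node PDW region
print(physical_row_writes(rows, copies=3))  # Hadoop with the default replication factor of 3
```

This counts only the volume written. The slide's second cost, one writer per node instead of eight, comes on top of it.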

Create replicated table example

1. Create table metadata on the Control Node

    CREATE TABLE Product
    (
        ProductKey      INT         NOT NULL,
        ProductCode     INT         NOT NULL,
        ProductEffDate  INT         NOT NULL,
        ProductDesc     VARCHAR(50) NOT NULL,
        ProductSubGrp   VARCHAR(20) NOT NULL,
        ProductCategory VARCHAR(20) NOT NULL,
        ProductPrice    FLOAT       NOT NULL,
        …
    )
    WITH
    (
        DISTRIBUTION = REPLICATE,
        CLUSTERED INDEX (ProductKey),
        PARTITION (ProductEffDate RANGE RIGHT FOR VALUES (20010601, 20010901, …))
    );

2. Send Create Table SQL to each compute node; each node creates table Product

[Diagram: Control Node; Compute Node 1, Compute Node 2, …, Compute Node 8, each holding Product]

Load sequence for replicated tables:
- The table gets loaded on one node first (node determined by round-robin)
- When complete, APS copies the table onto the remaining nodes

Star schema example: Distributed tables
Larger fact table is hash distributed across all nodes.
- Time Dim (TD): Date Dim ID, Calendar Year, Calendar Qtr, Calendar Mo, Calendar Day
- Store Dim (SD): Store Dim ID, Store Name, Store Mgr, Store Size
- Product Dim (PD): Prod Dim ID, Prod Category, Prod Sub Cat, Prod Desc
- Mktg Campaign Dim (MD): Mktg Camp ID, Camp Name, Camp Mgr, Camp Start, Camp End
- Sales Facts: Mktg Camp Id, Qty Sold, Dollars Sold
[Diagram: every compute node holds TD, SD, PD, and MD plus its own Sales Fact A, Sales Fact B, …, Sales Fact H]

Create distributed table: Under the covers

1. Create table metadata on the Control Node

    CREATE TABLE FactSales
    (
        ProductKey        INT         NOT NULL,
        OrderDateKey      INT         NOT NULL,
        DueDateKey        INT         NOT NULL,
        ShipDateKey       INT         NOT NULL,
        ResellerKey       INT         NOT NULL,
        EmployeeKey       INT         NOT NULL,
        PromotionKey      INT         NOT NULL,
        CurrencyKey       INT         NOT NULL,
        SalesTerritoryKey INT         NOT NULL,
        SalesOrderNumber  VARCHAR(20) NOT NULL,
        …
    )
    WITH
    (
        DISTRIBUTION = HASH (ProductKey),
        CLUSTERED INDEX (OrderDateKey),
        PARTITION (OrderDateKey RANGE RIGHT FOR VALUES (20010601, 20010901, …))
    );

2. Send Create Table SQL to each compute node
   - Create Table FactSales_A
   - Create Table FactSales_B
   - Create Table FactSales_C
   - …
   - Create Table FactSales_H

[Diagram: Control Node; Compute Node 1, Compute Node 2, …, Compute Node 8, each holding FactSales_A through FactSales_H]

Distributed tables
- Data is distributed across the entire cluster
- A hashing function is used to spread the data
- Hash is created during either Insert or Load
- Hash is performed by DMS, not SQL Server
- Hash is based on a single column in a table
- Column chosen for hash is not updateable
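The row-to-distribution mapping described above can be sketched as follows. The slides do not show the hash function DMS uses, so SHA-256 is only a stand-in here, and `distribution_for`, the 6-node appliance, and the sample `ProductKey` values are hypothetical.

```python
import hashlib
from collections import Counter

DISTRIBUTIONS_PER_NODE = 8
LETTERS = "ABCDEFGH"


def distribution_for(key, compute_nodes):
    """Map a distribution-key value to (compute node number, distribution letter)."""
    total = compute_nodes * DISTRIBUTIONS_PER_NODE
    # Stand-in hash (assumption): the real DMS hash is internal to the appliance.
    digest = hashlib.sha256(repr(key).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % total
    node, slot = divmod(bucket, DISTRIBUTIONS_PER_NODE)
    return node + 1, LETTERS[slot]


# Hypothetical FactSales rows, identified only by their ProductKey.
product_keys = [101, 102, 103, 101, 102, 101]
placement = Counter(distribution_for(k, 6) for k in product_keys)
for (node, letter), count in sorted(placement.items()):
    print(f"CMP{node:02d} distribution {letter}: {count} row(s)")
```

Because the mapping depends only on the key value, every row with the same `ProductKey` lands in the same distribution. Changing the key would mean moving the row, which fits the slide's rule that the hash column is not updateable.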

Extremely important points
- The hash is consistent across all tables
  - Great for joining tables together
- Columns that are joined must have consistent data types (e.g. numeric, date, character)
- Rationalize your data types!
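One way to picture why the join columns need the same data type: if the hash input depends on the type, equal-looking values of different types land in different distributions and the join is no longer co-located. The sketch below models that under loud assumptions. SHA-256 stands in for the unpublished DMS hash, the 48 distributions assume 6 compute nodes, and folding the Python type name into the hashed bytes is only an illustration of type-sensitive hashing, not the appliance's actual behavior.

```python
import hashlib

TOTAL_DISTRIBUTIONS = 48  # hypothetical: 6 compute nodes x 8 distributions


def bucket(value):
    """Stand-in hash (assumption): the value's type is part of the hashed bytes."""
    data = f"{type(value).__name__}:{value}".encode("utf-8")
    digest = hashlib.sha256(data).digest()
    return int.from_bytes(digest[:8], "big") % TOTAL_DISTRIBUTIONS


def colocated_fraction(left_keys, right_keys):
    """Fraction of joined key pairs whose two rows land in the same distribution."""
    pairs = list(zip(left_keys, right_keys))
    same = sum(1 for left, right in pairs if bucket(left) == bucket(right))
    return same / len(pairs)


keys = list(range(1000))
print(colocated_fraction(keys, keys))                    # both sides integer: 1.0
print(colocated_fraction(keys, [str(k) for k in keys]))  # integer joined to character: far below 1.0
```

When both tables hash the same type, every matching pair is already on the same distribution and the join needs no data movement. With mismatched types most pairs sit on different distributions and rows would have to be moved first.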

Consequences of distribution theory
- Distribution key is not updateable!
- Pick a column that:
  - Is static
  - Does not need inferred members (-1)
  - Does not contain NULL
- Changing distribution = re-create the table!

APS distributed query in action
- Control Node, a.k.a. 'The Brains'
  - Optimizer creates parallel query plan
- Compute Server, a.k.a. 'The Brawn'
  - Each compute node runs a portion of the query in parallel
  - Results aggregated on each node
- Final results streamed back through Control Node
[Diagram: User Query; Control Node with Query Optimizer, Metadata, Statistics, and Data Movement Services; compute servers with DMS and balanced storage]

A word about data skew

Performance impact of skew
- When the data is skewed:
  - Some queries finish very quickly
  - Others take disproportionately longer
- The user query is only complete when all queries have finished

Impact of skew on storage
- When one bucket is full, all buckets are full
- Skewed data leads to accelerated consumption of storage capacity
- Available storage is therefore a logical concept
  - Calculated as MAX(storage used by bucket) * # of buckets
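Both skew slides reduce to taking a maximum over the distributions, which a short sketch makes concrete. The function names and the bucket sizes are made up for illustration, and only 8 buckets (one compute node's worth) are shown to keep the numbers readable.

```python
def query_elapsed(per_distribution_seconds):
    """A user query finishes only when its slowest distribution query finishes."""
    return max(per_distribution_seconds)


def effective_storage_used(bucket_sizes_gb):
    """Slide formula: MAX(storage used by bucket) * # of buckets."""
    return max(bucket_sizes_gb) * len(bucket_sizes_gb)


even = [10] * 8           # hypothetical: 80 GB spread evenly over 8 buckets
skewed = [3] * 7 + [59]   # hypothetical: the same 80 GB, most of it on one hot key

for name, buckets in (("even", even), ("skewed", skewed)):
    print(name, sum(buckets), effective_storage_used(buckets))
# even 80 80
# skewed 80 472
```

The same 80 GB of data counts as 472 GB of consumed capacity once one bucket holds 59 GB. By the same logic, a query that scans the skewed table waits on that one distribution while the other seven sit idle.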

Demo | Distribution

Microsoft Analytics Platform System 4/17/2019 © 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.