Presentation transcript:

Before we begin

Goals:
- Columnstore fundamentals
- Health and monitoring
- Performance and scalability
- Loading patterns and tools

Non-goals (but feel free to ask questions):
- Every detail about SQL Data Warehouse and columnstore
- A columnstore deep dive (deeper than design and behavior)
- Reading every bullet on every slide

Prerequisites:
- Working knowledge of SQL Server and SQL Data Warehouse

Getting peak performance from your SQL Data Warehouse columnstore
BRK4016 | 9/17/2018
Shivani Gupta, Joe Yong | SQL Data Warehouse Program Management
© Microsoft Corporation. All rights reserved.

Agenda

- Azure SQLDW Columnstore Primer
  - Terminology and operations (insert, update, delete, scan)
  - Distributions and partitions
- Columnstore Health
  - Pressures (memory and dictionary)
  - Monitoring for health
- ELT/ETL Patterns and Guidance
  - Maximizing row group quality
  - Partitioning guidance, ordering
  - Secondary B-tree indexes
- Data Loading Tools
  - PolyBase, SSIS, Data Factory, BCP

Azure SQL Data Warehouse

- Fully managed data warehouse as a service
- Massively Parallel Processing (MPP) architecture
- Separation of storage and compute
- Industry-leading SQL Server in each compute node

[Architecture diagram: application or user connections arrive at the MPP engine on the control node; data loading (PolyBase, ADF, SSIS, REST, ODBC, AZCopy) fans out through DMS to the SQL DB compute nodes, backed by Azure Blob storage (WASB(S)).]

SQLDW Columnstore Primer

Columnstore Index: Why?

Rowstore (data stored as rows):
- Ideal for OLTP
- Frequent reads/writes on small sets of rows

Columnstore (data stored as columns, organized into rowgroups and segments):
- Ideal for data warehouse analytics on large numbers of rows
- Improved compression: data from the same domain compresses better
- Reduced I/O: fetch only the columns needed
- Improved performance: more data fits in memory
- Optimized for CPU utilization: batch mode execution, vector processing

Terminology

- Row group: a set of rows handled together, usually 1 million (a "1M row group")
- Column segment: the values from one column of a row group
- Dictionary: encodes string column values to integers
  - Primary dictionary: shared across row groups
  - Secondary dictionary: local to a segment (at most 1)
- Metadata: min/max values per segment, used for segment elimination

Load/Insert

- Batch inserts >= 100K rows go directly to compressed row groups
- Row groups can be trimmed prematurely under memory pressure
- Batch inserts < 100K rows go to a delta row group (which is a B-tree)
- Delta row groups are closed after 1M rows
- The Tuple Mover moves closed delta row groups into compressed row groups

DML: delete and update

- Rows in delta stores are deleted directly
- Rows in compressed row groups are not physically removed; their locators are tracked in the delete bitmap
- Update = delete + insert

Scan

- Scans combine data from compressed row groups, delta stores, and delete bitmaps to produce correct results
- Segment metadata is used to eliminate row groups that do not qualify

Columnstore in SQLDW: 60 Distributions

- Each distribution has its own columnstore per table
- Multiple row groups per columnstore

Partitioning

Each distribution's table is partitioned N ways, so the total number of columnstores = 60 * N.

    CREATE TABLE [cso].[FactOnlineSales_PTN]
    (
        [OnlineSalesKey] int NOT NULL,
        [DateKey] datetime NOT NULL,
        [StoreKey] int NOT NULL,
        [ProductKey] int NOT NULL,
        [CurrencyKey] int NOT NULL,
        [SalesQuantity] int NOT NULL,
        [SalesAmount] money NOT NULL,
        [UnitPrice] money NULL
    )
    WITH
    (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = HASH([ProductKey]),
        PARTITION ([DateKey] RANGE RIGHT FOR VALUES
            ('2007-01-01 00:00:00.000', '2008-01-01 00:00:00.000'))
    );

Partitioning and Columnstore in SQLDW

- Table is defined with 2-way partitioning
- Each distribution's data is split into 2 partitions
- 2-way partitioning => 120 total columnstores (60 distributions * 2 partitions)

Columnstore DMVs

- sys.pdw_nodes_column_store_row_groups: state of each row group (open, closed, compressed), total rows
- sys.pdw_nodes_column_store_segments: encoding used, dictionary id, min/max metadata
- sys.pdw_nodes_column_store_dictionaries: on-disk dictionary size
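As a rough sketch, the row-group DMV can be aggregated to see how much data sits in open or closed delta stores versus compressed row groups (mapping node and distribution IDs back to a logical table generally requires joining the pdw table-metadata views, omitted here):

```sql
-- Summarize row group states across all distributions (illustrative sketch).
SELECT
    rg.state_description,
    COUNT(*)           AS row_group_count,
    SUM(rg.total_rows) AS total_rows
FROM sys.pdw_nodes_column_store_row_groups AS rg
GROUP BY rg.state_description
ORDER BY rg.state_description;
```

A large share of rows in OPEN or CLOSED states indicates delta-store buildup that will hurt scan performance until the Tuple Mover or a reorganize compresses them.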

Demo: Primer Concepts

Columnstore Health

What determines scan performance?

Size of compressed row groups (rowgroup quality):
- Compression quality decreases with fewer rows
- Per-row overhead of ancillary structures increases with fewer rows
- 1M rows per row group is ideal
- 100,000 rows per row group is acceptable
- < 10,000 is abysmal (now disallowed)

Number of rows in delta stores:
- Delta stores are scanned row by row
- Small batch inserts and singleton inserts end up in delta stores
- Higher DOP increases the number of delta stores

Size of the delete bitmap:
- A larger delete bitmap implies more time spent merging
- A large amount of delete and/or update activity grows the delete bitmap

Factors affecting Row Group Quality

Available rows:
1. Load batch size
2. Partitioning

Premature trimming:
1. Memory pressure
2. Dictionary pressure

Understanding Memory Pressure

Memory required to build quality row groups depends on:
- Target rows per row group (ideal is 1,048,576, roughly 1 million)
- Fixed overhead
- Number of columns in the table
- Number of short string columns (string data types <= 32 bytes)
- Number of long string columns (string data types > 32 bytes)
(See the rowgroup memory documentation for details.)

Memory available to build row groups depends on:
- The SLO (DWU level) of the SQL Data Warehouse
- The user's resource class
- Complexity of the load query
- Partitioning
- Degree of parallelism

Trimming occurs if required memory > available memory.

Est. mem grant example #1: 10 column table

Memory Management (MB per distribution)

[Chart: memory available per distribution by DWU and resource class, including the service admin account.]

Degree of Parallelism (DOP)

The statement memory grant is divided between parallel workers. Use a hint to force serial execution for CTAS (if needed):

    CREATE TABLE MyFactSalesQuota
    AS SELECT * FROM FactSalesQuota
    OPTION (MAXDOP 1);

DOP by service level:
- Below DWU2000: DOP 1
- DWU2000: DOP 2
- DWU3000: DOP 3
- DWU6000: DOP 6

Dictionary Pressure

- Caused by high cardinality (many distinct values) and wide strings
- Maximum dictionary size is 16MB in memory
- The DMV reports dictionary size on disk; on-disk size < in-memory size

DMV for Columnstore Health

sys.dm_pdw_nodes_db_column_store_row_group_physical_stats reports:
- Row group state
- Number of rows
- Trim reason

More details at https://azure.microsoft.com/en-us/blog/sql-data-warehouse-columstore-monitor/
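As a sketch of how this DMV is typically used, the following query surfaces undersized compressed row groups together with the reason they were trimmed (filter to your own table in practice):

```sql
-- Find trimmed (undersized) compressed row groups and why they were trimmed.
SELECT
    ps.trim_reason_desc,
    COUNT(*)           AS row_group_count,
    AVG(ps.total_rows) AS avg_rows_per_row_group
FROM sys.dm_pdw_nodes_db_column_store_row_group_physical_stats AS ps
WHERE ps.state_desc = 'COMPRESSED'
  AND ps.total_rows < 1048576
GROUP BY ps.trim_reason_desc;
```

A trim reason of MEMORY_LIMITATION points at resource class or DWU sizing; DICTIONARY_SIZE points at wide or high-cardinality string columns.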

Demo: Columnstore Health

ELT/ETL Patterns and Guidance

ELT for maximizing scan performance

- Sizing batch loads
- Partitioning guidance
- Using the correct resource class
- Working around dictionary pressure
- Judicious use of updates and deletes
- Secondary B-tree indexes (NCI) for needle-in-a-haystack queries
- How to improve segment elimination

Sizing Batch Loads

Target > 100,000 rows per columnstore in each load:
- With no partitions, this means > 100,000 * 60 (~6 million) rows per CTAS or bulk insert
- With 4 partitions, this means > 100,000 * 60 * 4 (~24 million) rows per CTAS or bulk insert

Ideally 1 million rows per columnstore in each load:
- With no partitions, this means > 1,000,000 * 60 (~60 million) rows per CTAS or bulk insert
- With 4 partitions, this means > 1,000,000 * 60 * 4 (~240 million) rows per CTAS or bulk insert

Batching trickle loads

Scenario: targeting 100,000 rows per columnstore with no partitions, so 6,000,000+ rows are required for each batch load.

    Incoming rate    Time until load threshold exceeded
    500 rows/sec     < 3.5 hours
    1000 rows/sec    < 2 hours
    2000 rows/sec    < 1 hour

Partitioning Guidance

- Row groups cannot cross partition boundaries
- Over-partitioning hurts row group quality and compression
- Make sure your targeted data set allows for at least 6 million rows per partition; ideally 60 million rows per partition
- Partitioning impacts the memory grant
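A quick sanity check against the 6M/60M rows-per-partition guidance can be sketched from the partition stats DMV (this sketch aggregates across all tables; in practice you would join the pdw node-table metadata to filter to one logical table):

```sql
-- Rows per partition, summed across the 60 distributions (illustrative).
SELECT
    ps.partition_number,
    SUM(ps.row_count) AS total_rows
FROM sys.dm_pdw_nodes_db_partition_stats AS ps
GROUP BY ps.partition_number
ORDER BY ps.partition_number;
```

Partitions landing well under ~6 million rows are a sign the table is over-partitioned for columnstore.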

Using the Correct Resource Class

- Compute the memory required for quality row groups
  - See the columnstore memory guidance; use the views provided
- Grant sufficient memory
  - Use the correct resource class (usually not smallrc)
  - Scale DWU if needed
  - Keep the load query simple; stage to a heap or clustered index first if needed
  - Force serial execution if needed
- Don't over-allocate the resource class
  - It does impact concurrency (especially if multiple ELT jobs are scheduled in parallel)
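Resource classes in SQL Data Warehouse are granted through database role membership, so moving a dedicated load user to a larger class can be sketched as follows ('LoadUser' is a hypothetical user name):

```sql
-- Give a dedicated loading user a larger resource class so its load
-- statements receive a bigger memory grant (LoadUser is hypothetical).
EXEC sp_addrolemember 'largerc', 'LoadUser';

-- Drop back down later if concurrency matters more than grant size.
EXEC sp_droprolemember 'largerc', 'LoadUser';
```

The change takes effect for the user's next session, which is why a separate load-only user is a common pattern.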

Working around Dictionary Pressure

- Isolate problematic string columns into a separate table
- Optimize column types where possible (e.g., use varchar(36) instead of nchar(255), or smallint instead of nvarchar(4000))
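As a hypothetical illustration of right-sizing column types (table and column names invented for the example):

```sql
-- Store a GUID-like key and a small status code with right-sized types
-- instead of oversized Unicode strings that inflate dictionaries.
CREATE TABLE dbo.FactEvents_Typed
(
    EventGuid  varchar(36) NOT NULL,  -- instead of nchar(255)
    StatusCode smallint    NOT NULL   -- instead of nvarchar(4000)
)
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = ROUND_ROBIN
);
```

Narrower non-Unicode types keep dictionary entries small, so more rows fit before the 16MB dictionary limit forces a trim.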

Judicious use of Updates and Deletes

- Use heap or clustered-index staging tables for transformations
- Use ALTER INDEX REORGANIZE / REBUILD to defragment
  - REORGANIZE is lighter weight and online
  - REBUILD is heavyweight and offline
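The two maintenance options can be sketched as follows (table name taken from the earlier partitioning example):

```sql
-- Online, lighter-weight cleanup: compress closed delta row groups and
-- clean up deleted rows where possible.
ALTER INDEX ALL ON [cso].[FactOnlineSales_PTN] REORGANIZE;

-- Heavier, offline defragmentation: rebuild rowgroups from scratch.
-- Can also be limited to a single partition (PARTITION = n) to shrink
-- the offline window.
ALTER INDEX ALL ON [cso].[FactOnlineSales_PTN] REBUILD;
```

Prefer REORGANIZE for routine maintenance and reserve REBUILD for heavily fragmented partitions.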

Secondary B-Tree Indexes (NCI)

For needle-in-a-haystack queries (equality or short range):

    CREATE INDEX l_orderkeynci ON lineitem (l_orderkey);

    SELECT l_orderkey, l_returnflag, l_linestatus
    FROM lineitem
    WHERE l_orderkey = 5660553859;

Runs in under a second on 1TB TPC-H data. The NCI sits alongside the compressed row groups, delta row groups, and delete bitmap.

Real-world usage of NCI: Yammer

- Runs a DWU6000 warehouse
- Tracks all user activity to A/B test every new feature
- fact_events: 300 billion rows (~40TB)
- Built an NCI on user_id
- Query went from 10+ hours to under 5 minutes:

    SELECT TOP 1000 * FROM fact_events WHERE user_id = '1556041068';

Segment Elimination

- Min/max segment metadata is used to filter out segments
- The columnstore itself is not ordered
- Elimination helps when data arrives naturally ordered, e.g., by timestamp
- Index rebuild will not keep ordering intact
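The per-segment min/max metadata can be inspected directly, as in the sketch below (note that min_data_id/max_data_id are encoded values; for dictionary-encoded string columns they are dictionary ids rather than raw values):

```sql
-- Inspect per-segment min/max metadata used for segment elimination.
SELECT
    s.column_id,
    s.segment_id,
    s.row_count,
    s.min_data_id,
    s.max_data_id
FROM sys.pdw_nodes_column_store_segments AS s
ORDER BY s.column_id, s.segment_id;
```

Wide, overlapping min/max ranges on your filter column mean few segments can be skipped at scan time.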

Ordered CCI for improved segment elimination

- Load data ordered (e.g., by date) into a CCI and switch partitions in
- ALTER INDEX REBUILD compresses across rowgroups and removes any manual ordering from the CCI
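One way to sketch the pattern: stage one date range whose rows arrive pre-sorted, then switch it into the target partition instead of rebuilding (table names and partition numbers are hypothetical; later dedicated SQL pools also support an explicit ORDER clause on the CCI):

```sql
-- Stage one date range; loading it in date order (e.g., from pre-sorted
-- source files) keeps per-segment DateKey min/max ranges narrow.
CREATE TABLE dbo.FactSales_Stage
WITH
(
    CLUSTERED COLUMNSTORE INDEX,
    DISTRIBUTION = HASH(ProductKey)
)
AS SELECT * FROM dbo.FactSales_Source;  -- hypothetical pre-sorted source

-- Switch the staged data into the target partition; unlike ALTER INDEX
-- REBUILD, a switch preserves the ordering already baked into the rowgroups.
ALTER TABLE dbo.FactSales_Stage SWITCH PARTITION 2
    TO dbo.FactOnlineSales_PTN PARTITION 2;
```

The switch requires matching schemas and partition boundaries between staging and target tables.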

Data Loading Tools

Data Loading Options

Tools: PolyBase, SSIS*, ADF, BCP, SQLBulkCopy API, Attunity CloudBeam, ASA/Storm**

    Method               Rate increases with higher DWU    Performance
    PolyBase             Yes                               Fastest
    SSIS / ADF           Yes*                              -
    BCP / SQLBulkCopy    No                                Slowest

Rate also increases with more concurrent loads.

* With the SSIS Azure Feature Pack, June 2017 or newer
** Not a good idea

Single gated client loading with SSIS data flows (before)

- Using SSIS, customers can create data flows with an ADO.NET/OLE DB destination to load data into Azure SQL DW
- Similar to loading data into Azure SQL DB or SQL Server
- This method has two bottlenecks: the single machine running SSIS and the control node

Single gated client loading with SSIS data flows (before)

Customers can also execute parallel loads with multiple SSIS machines by:
- Dividing the input data into multiple sources
- Loading concurrently into separate temporary tables
- Switching them into partitions of the full final table

Overall throughput is still limited by the control node.

Parallel loading with PolyBase – T-SQL script

1. Configure the credentials to access your Azure Blob Storage:

    CREATE DATABASE SCOPED CREDENTIAL AzureStorageCredential
    WITH IDENTITY = 'user',
         SECRET = '<azure_storage_account_key>';

2. Define the data source in your Azure Blob Storage with the previously configured credentials:

    CREATE EXTERNAL DATA SOURCE AzureStorage
    WITH (
        TYPE = HADOOP,
        LOCATION = 'wasbs://<blob_container_name>@<azure_storage_account_name>.blob.core.windows.net',
        CREDENTIAL = AzureStorageCredential);

3. Define the file format of your input data:

    CREATE EXTERNAL FILE FORMAT TextFile
    WITH (
        FORMAT_TYPE = DelimitedText,
        FORMAT_OPTIONS (FIELD_TERMINATOR = ','));

Parallel loading with PolyBase – T-SQL script

4. Define the external table for your input data:

    CREATE EXTERNAL TABLE dbo.DimDate2External
    (
        DateId INT NOT NULL,
        CalendarQuarter TINYINT NOT NULL,
        FiscalQuarter TINYINT NOT NULL
    )
    WITH (
        LOCATION = '/datedimension/',
        DATA_SOURCE = AzureStorage,
        FILE_FORMAT = TextFile);

5. Load the external table from your Azure Blob Storage into Azure SQL DW:

    CREATE TABLE dbo.DimDate2
    WITH (
        CLUSTERED COLUMNSTORE INDEX,
        DISTRIBUTION = ROUND_ROBIN)
    AS SELECT * FROM [dbo].[DimDate2External];

Parallel loading with PolyBase (with the Azure Feature Pack)

- Create an Execute SQL Task that triggers PolyBase to load data into Azure SQL DW, leveraging the MPP architecture
- Throughput is not limited by the control node; it scales out based on the DWU level

Parallel loading with PolyBase

Parallel loading with PolyBase is the recommended method for loading large volumes of data into SQL DW. It requires users to manually create, deploy, and run multiple SSIS tasks and/or write T-SQL scripts:

1. Export on-premises data (e.g., from SQL Server) to a flat file, creating transformation tasks to convert it into a supported format if necessary
2. Run an Azure Blob Upload Task to load it into Azure Blob Storage
3. Run an Execute SQL Task to convert it into external table(s) and trigger PolyBase to load them into Azure SQL DW
4. Run clean-up tasks on the staging area if necessary

The released Azure SQL DW Upload Task automates most of these steps (2 through 4):
- Leverages existing SSIS expertise
- Relieves customers from the "burden" of writing T-SQL scripts

Key takeaways

Wrap Up: Key Takeaways

- Columnstore behaviour is very different from rowstore
- Loading methods and efficiency matter a lot
- Monitor and maintain columnstore health for better query performance
- Assess the guidance and ELT patterns for your environment and implement them for maximum performance, if appropriate
- Picking the right loading tools can make a BIG difference

Please evaluate this session

- From your PC or tablet: visit MyIgnite, https://myignite.microsoft.com/evaluations
- From your phone: download and use the Microsoft Ignite mobile app, https://aka.ms/ignite.mobileapp

Your input is important!
