Using Columnstore indexes in Azure DevOps Services. Lessons learned.

Presentation transcript:

Using Columnstore indexes in Azure DevOps Services. Lessons learned. Konstantin Kosinsky

About
- Principal Software Engineer in the Azure DevOps Analytics service
- SQL Server / Data Platform MVP before joining Microsoft in 2012
- @kkosinsky, kokosins@microsoft.com

Agenda
- What am I working on, and how do Columnstore indexes help?
- Columnstore index internals
- Lessons learned 1..9

Azure DevOps Analytics Service
- Reporting platform for Azure DevOps
- Includes data from:
  - Pipelines (CI/CD)
  - Work Item Tracking (stories, bugs, etc.)
  - Automated and Manual Tests
  - Code (Azure Repos and GitHub), coming soon

Analytics Service: In Product

Analytics Service: Power BI

Analytics Service: OData OData v4.0 with Aggregation Extensions

Query Engine Requirements
Must support:
- Huge amounts of data
- Queries that aggregate data across millions of records
- Arbitrary filters, aggregations and groupings
- Both updates and deletes (due to late-arriving data and re-transformations)
- Near real-time ingestion and availability of new data
- On-premises installations
- Subsecond query response times
- Online DB maintenance operations
...all within a reasonable cost structure when deployed in large multi-tenant environments.

Columnstore Indexes
- 10x+ data compression
- Good performance for data warehouse queries
- No need to create and maintain indexes for each report
- Still support updates and trickle inserts

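Columnstore Internals
[Figure: a table split into three row groups, with per-segment metadata for column c4: min = 1, max = 10; min = 11, max = 20; min = 21, max = 30. Two example queries are shown against it.]
    SELECT sum(c1), sum(c2) FROM MyTable
    SELECT sum(c1), sum(c2) FROM MyTable WHERE c4 > 22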
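To make the min/max metadata on this slide concrete, here is a toy Python model of segment elimination. It is only a sketch of the idea, not how SQL Server implements it; the `Segment` class and `segments_to_scan` function are hypothetical names, and the row groups hold just ten values each so the result is easy to check by hand.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    """Min/max metadata for column c4 in one row group, plus its c4 values."""
    c4_min: int
    c4_max: int
    values: list

def segments_to_scan(segments, lower_bound):
    """Keep only segments whose max says they could hold rows with c4 > lower_bound."""
    return [s for s in segments if s.c4_max > lower_bound]

# Three row groups with the ranges from the slide: 1-10, 11-20, 21-30.
row_groups = [
    Segment(1, 10, list(range(1, 11))),
    Segment(11, 20, list(range(11, 21))),
    Segment(21, 30, list(range(21, 31))),
]

# WHERE c4 > 22: the first two row groups are skipped without reading their data.
scanned = segments_to_scan(row_groups, 22)
matching = [v for s in scanned for v in s.values if v > 22]
print(len(scanned), "of", len(row_groups), "row groups scanned")
```

The filter is still applied to the rows inside the surviving row group; the metadata only decides which row groups are worth reading at all.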

Why Columnstore Indexes Are Performant
- Segment elimination
- Predicate pushdown
- Local aggregation (aka aggregate pushdown)
- Compression
- Batch mode

Lesson #1: Data Types Are Important
- Not all data types support aggregate pushdown
    SELECT SUM(DurationSeconds) FROM AnalyticsModel.tbl_TestResultDaily
- Details: https://docs.microsoft.com/en-us/sql/relational-databases/indexes/columnstore-indexes-query-performance?view=sql-server-2017#aggregate-pushdown

Lesson #2: Cardinality Matters
- The number of distinct values affects segment size. For 1B rows:
  - CompleteDate DATETIMEOFFSET -> 5.8 GB
  - CompleteDate DATETIMEOFFSET(0) -> 2.2 GB
  - CompleteDateSK INT (YYYYMMDD) or DATE -> 900 KB
- Query to measure the on-disk size of a column:
    select sum(on_disk_size)
    from sys.column_store_segments s
    join sys.partitions p on s.partition_id = p.partition_id
    join sys.columns c on p.object_id = c.object_id and c.column_id = s.column_id
    where p.object_id = OBJECT_ID('AnalyticsModel.tbl_TestResult')
      and c.name = 'CompletedDateSK'
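As a rough illustration of why reducing precision helps, the Python sketch below counts distinct values for the same timestamps at three precisions. The data is synthetic (100,000 random timestamps over 30 days, an assumption made only for this example), and Python's microsecond precision stands in for the default DATETIMEOFFSET precision. It shows cardinality only, not the resulting on-disk sizes.

```python
import random
from datetime import datetime, timedelta, timezone

random.seed(0)  # fixed seed so the counts are reproducible
start = datetime(2019, 1, 1, tzinfo=timezone.utc)
window_us = 30 * 24 * 3600 * 1_000_000  # 30 days in microseconds

# Hypothetical completion timestamps with full (microsecond) precision.
stamps = [start + timedelta(microseconds=random.randrange(window_us))
          for _ in range(100_000)]

# Distinct values at full precision, truncated to seconds, and as a YYYYMMDD integer key.
distinct_full = len(set(stamps))
distinct_seconds = len({t.replace(microsecond=0) for t in stamps})
distinct_date_sk = len({t.year * 10000 + t.month * 100 + t.day for t in stamps})

print(distinct_full, distinct_seconds, distinct_date_sk)
```

Almost every full-precision timestamp is unique, while the integer date key has only one value per day, which is the kind of column that dictionary and run-length encoding compress well.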

Lesson #2: Cardinality Matters
- The number of distinct values per row group affects aggregate pushdown for GROUP BY
- [Slide table: # rows: 1M; # of unique values for the column: 16K, 17K]
- The number of distinct values isn't the only criterion: https://sqlperformance.com/2019/04/sql-plan/grouped-aggregate-pushdown

Lesson #3: Predicate Pushdown
- Avoid predicates that touch multiple columns
    ColumnA = 1 OR ColumnB = 2    -- No pushdown
    ColumnA > ColumnB             -- No pushdown
    ColumnA = 1 AND ColumnB = 2   -- Allows pushdown
- Consider creating custom columns that combine the logic
- Pushdown for string predicates has limitations
  - No pushdown before SQL Server 2016
  - Consider replacing strings with numeric codes or surrogate keys

Lesson #3.1: Strings + Segment Elimination
- Segment elimination works only for numeric and datetime types
- Segment elimination doesn't work for string and GUID types
- In most cases this isn't a problem
- When it is, consider replacing them with numeric codes or surrogate keys

Lesson #4: Segment Elimination
- Row groups should be aligned with filters
- Try to insert data in a way that helps segment elimination

Lesson #4: Segment Elimination
- An update is a DELETE + INSERT
- The ranges of the old segment and the new segment will overlap
- Queries need to read both
[Figure: a table with columns c1..c6 in three row groups with c4 ranges 1-10, 11-20 and 21-30. After the update below, the updated rows form a new row group with c4 min = 1 and c4 max = 30, which overlaps the range the query filters on.]
    UPDATE MyTable SET c6 += 1 WHERE C4 < 9 OR C4 > 25
    SELECT sum(c1), sum(c2) FROM MyTable WHERE c4 > 22
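The effect of that update on segment elimination can be sketched with plain min/max ranges in Python. This is a simplified model under two assumptions: the old segments keep their original min/max metadata after rows are marked deleted, and all updated rows end up together in one new row group. The function name is hypothetical.

```python
def ranges_to_read(ranges, lower_bound):
    """Ranges (c4_min, c4_max) that cannot be eliminated for WHERE c4 > lower_bound."""
    return [r for r in ranges if r[1] > lower_bound]

# Row group ranges before the update, as on the slide.
before = [(1, 10), (11, 20), (21, 30)]

# UPDATE ... WHERE c4 < 9 OR c4 > 25 touches c4 values 1..8 and 26..30.
updated_values = [v for v in range(1, 31) if v < 9 or v > 25]

# The re-inserted rows form a new row group spanning both ends of the key range.
new_range = (min(updated_values), max(updated_values))
after = before + [new_range]

print("before:", ranges_to_read(before, 22))
print("after: ", ranges_to_read(after, 22))
```

Before the update only the 21-30 row group qualifies for `c4 > 22`; afterwards the new 1-30 row group has to be read as well, even though most of its rows do not match.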

Lesson #4: Segment Elimination (cont.)
- Avoid updates if you can
- Consider splitting the table:
  - Current: small, changes can still happen
  - History: records graduate here when they are done
- Do periodic maintenance of the index

Lesson #5: Watch the Delta Store
- Delta store is a HEAP without compression
- To start compression, the delta store needs to reach 1M records
- Each query reads the entire delta store
- The delta store could be larger than you expect
- REORGANIZE WITH (COMPRESS_ALL_ROW_GROUPS = ON)

Lesson #5: Watch the Delta Store
- Could we avoid the delta store?
  - Yes for CCI: insert at least 102,400 rows at a time
  - Keep statistics up to date: a low row-count estimate leads to skipping that optimization
  - No for NCCIs
- Could the delta store hold more than 1M rows?
  - Many parallel writes could lead to multiple delta stores
  - When the Tuple Mover is busy
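The 102,400-row threshold can be illustrated with a small Python model of where the rows of one bulk insert into a clustered columnstore index end up. It assumes a single-threaded load, the maximum row group size of 1,048,576 rows, and no early trimming from memory pressure or dictionary size; `plan_bulk_load` is a hypothetical helper, not a SQL Server API.

```python
MAX_ROWGROUP_ROWS = 1_048_576        # maximum rows in one compressed row group
DIRECT_COMPRESS_MIN_ROWS = 102_400   # smallest chunk that bypasses the delta store

def plan_bulk_load(batch_rows):
    """Return (compressed_rowgroup_sizes, rows_left_for_delta_store) for one batch."""
    compressed = []
    remaining = batch_rows
    # Chunks of at least 102,400 rows are compressed directly, up to the row group maximum.
    while remaining >= DIRECT_COMPRESS_MIN_ROWS:
        size = min(remaining, MAX_ROWGROUP_ROWS)
        compressed.append(size)
        remaining -= size
    # Whatever is left is too small to compress directly and lands in the delta store.
    return compressed, remaining

for batch in (50_000, 102_400, 1_100_000):
    print(batch, plan_bulk_load(batch))
```

Note the last case: a batch slightly larger than one full row group still leaves a tail in the delta store, so batch sizes are worth choosing with both limits in mind.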

Lesson #6: Physical Partitioning
In a multi-tenant environment, you may have a mix of small and large tenants, so the partitioning strategy matters.
- All tenants in one large physical partition
  - Small tenants must scan all records from big tenants
  - Locks are at the row group level
  - Column cardinality could be high, which means less segment elimination
- One physical partition per tenant
  - SQL Server limit of 15K partitions per table
  - Small tenants may never reach 1M rows and stay in the delta store forever
- Group small tenants in one partition to help them compress, and put huge tenants in dedicated partitions
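The last recommendation can be sketched as a simple assignment rule in Python. The threshold, the tenant names and the choice of partition 1 as the shared partition are all assumptions made for this illustration; the talk does not say how the Analytics service decides which tenants count as huge.

```python
def assign_partitions(tenant_rows, dedicated_threshold):
    """Map each tenant to a partition number.

    Tenants with at least `dedicated_threshold` rows get their own partition;
    all smaller tenants share partition 1 so their rows can compress together.
    """
    assignment = {}
    next_dedicated = 2  # partition 1 is reserved for the shared group
    for tenant, rows in sorted(tenant_rows.items()):
        if rows >= dedicated_threshold:
            assignment[tenant] = next_dedicated
            next_dedicated += 1
        else:
            assignment[tenant] = 1
    return assignment

# Hypothetical row counts per tenant.
tenants = {"contoso": 250_000_000, "fabrikam": 40_000, "litware": 900, "northwind": 12_000_000}
print(assign_partitions(tenants, dedicated_threshold=10_000_000))
```

In this sketch the two small tenants share one partition, which gives their combined rows a better chance of filling a row group than each would have alone.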

Lesson #7: Physical Partition Maintenance
- ALTER INDEX REORGANIZE: cleans up deleted rows and merges row groups
  - Will not touch a row group if its trim reason is DICTIONARY_SIZE
  - Could merge old and new row groups and interfere with segment elimination
- ALTER INDEX REBUILD
  - Removes all deleted rows
  - Does not guarantee insert order
  - May affect segment elimination
- Query to inspect row group state and fragmentation:
    SELECT i.type_desc,
           CSRowGroups.state_desc,
           total_rows,
           deleted_rows,
           size_in_bytes,
           trim_reason_desc,
           transition_to_compressed_state_desc,
           100*(ISNULL(deleted_rows,0))/total_rows AS 'Fragmentation'
    FROM sys.indexes AS i
    JOIN sys.dm_db_column_store_row_group_physical_stats AS CSRowGroups
      ON i.object_id = CSRowGroups.object_id
     AND i.index_id = CSRowGroups.index_id
    WHERE total_rows > 0
    ORDER BY object_name(i.object_id), i.name, row_group_id;

Lesson #7: Physical Partition Maintenance
- We built a custom solution that periodically restores the proper order
- A stored procedure sorts partitions that are in bad shape:
  - Clones the table structure
  - Copies data from the affected partition in the desired order (1M batches)
  - Stops ETL for the affected partition
  - Applies modifications that happened since the process started
  - Switches partitions
  - Restarts ETL
- SPLITs and MERGEs use the same approach

Lesson #8: Schema Updates
- Index and column modifications are mostly offline operations
- Adding a NULL column, or a NOT NULL column with a DEFAULT, is online and fast
- Adding a column and then issuing an UPDATE to set the new value leads to fragmentation (delta store insert + Columnstore delete)
- Analytics Service solution:
  - Create a new table, copy the data with the required modifications, switch tables
  - Use the same approach as for maintenance to minimize data latency

Lesson #9: Paging
- Power BI needs raw data with arbitrary filters and projections
- Power BI could request a lot of data, so we need to force server-side paging
- The OFFSET-FETCH approach:
  - Needs sorting
  - Reads all data and throws most of it away
  - Page N + 1 is more expensive than page N
- Use a skip token (identity column) as a pointer to the next page:
  - Decreases the amount of data that needs to be thrown away
  - Still needs sorting
  - Page N + 1 is cheaper than page N
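A minimal sketch of skip-token paging, using Python's built-in sqlite3 module as a stand-in for SQL Server (so `LIMIT` replaces `TOP`, and there is no columnstore involved). The `results` table, its columns and the `fetch_page` helper are hypothetical names for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (id INTEGER PRIMARY KEY, outcome TEXT)")
conn.executemany(
    "INSERT INTO results (id, outcome) VALUES (?, ?)",
    [(i, "Failed" if i % 3 == 0 else "Passed") for i in range(1, 101)],
)

def fetch_page(conn, outcome, skip_token, page_size):
    """Return one page of rows after skip_token, plus the token for the next page."""
    rows = conn.execute(
        "SELECT id, outcome FROM results "
        "WHERE outcome = ? AND id > ? ORDER BY id LIMIT ?",
        (outcome, skip_token, page_size),
    ).fetchall()
    # A short page means there is nothing left; otherwise the last id is the next token.
    next_token = rows[-1][0] if len(rows) == page_size else None
    return rows, next_token

# Walk all pages of failed results, 10 rows at a time.
pages, token = [], 0
while token is not None:
    rows, token = fetch_page(conn, "Failed", token, 10)
    pages.append(rows)
print([len(p) for p in pages])
```

Each request filters on `id > skip_token`, so rows from earlier pages are excluded by the predicate instead of being read and discarded as with OFFSET.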

Lesson #9: Paging (cont.)
- To make sorting cheap we could use a B-tree index
  - But we need arbitrary filters
- Sorting a wide SELECT requires a lot of memory
- Analytics solution: two queries, leveraging Columnstore behaviors
  - The first query gets the page boundaries: it eliminates most of the columns and uses aggregation
  - The second query gets the page: it eliminates most of the segments and does not sort
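One possible reading of the two-query approach, sketched with Python's sqlite3 module. The slides do not show the actual queries, so the SQL below is an assumption about their shape: the first query touches only the key and filter columns and aggregates to find the page's upper bound, and the second fetches the wide rows by key range without an ORDER BY. SQLite has no columnstore, so this shows only the query pattern, not the performance effect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE results (id INTEGER PRIMARY KEY, outcome TEXT, duration REAL, title TEXT)"
)
conn.executemany(
    "INSERT INTO results VALUES (?, ?, ?, ?)",
    [(i, "Failed" if i % 3 == 0 else "Passed", i * 0.5, f"test {i}") for i in range(1, 101)],
)

def page_upper_bound(conn, outcome, skip_token, page_size):
    """Query 1: narrow query (key + filter column only) that returns the page's last id."""
    row = conn.execute(
        "SELECT MAX(id) FROM ("
        "  SELECT id FROM results WHERE outcome = ? AND id > ? ORDER BY id LIMIT ?"
        ")",
        (outcome, skip_token, page_size),
    ).fetchone()
    return row[0]  # None when no rows remain

def fetch_page(conn, outcome, skip_token, page_size):
    """Query 2: wide query restricted to the key range, with no sort."""
    upper = page_upper_bound(conn, outcome, skip_token, page_size)
    if upper is None:
        return [], None
    rows = conn.execute(
        "SELECT id, outcome, duration, title FROM results "
        "WHERE outcome = ? AND id > ? AND id <= ?",
        (outcome, skip_token, upper),
    ).fetchall()
    return rows, upper

page, token = fetch_page(conn, "Failed", 0, 10)
print(len(page), token)
```

The sort happens only over the narrow key column in the first query; the second query's range predicate on the key is the kind of filter that segment elimination can use when data is loaded in key order.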

Questions? Contact me: @kkosinsky, kokosins@microsoft.com