Turbocharge your DW Queries with ColumnStore Indexes
Susan Price
Senior Program Manager, DW and Big Data
Waiting … April 2012PNWSQL 2
Why use ColumnStore Indexes?
Faster, interactive query response time
  ▫ Easier data exploration
  ▫ Better decisions
Reduced physical DB design effort
  ▫ Fewer indexes
  ▫ Reduced need for summary aggregates and indexed views
  ▫ May eliminate need for OLAP cubes
  ▫ Transparent to the application
Lower TCO
Demo
Agenda
Columnstore indexes
Batch mode processing
How to use ColumnStore Indexes
Best practices
Troubleshooting ColumnStore Indexes
Resources for more information
How do columnstore indexes speed up queries?
Heaps, B-trees store data row-wise
Columnstore indexes store data column-wise (C1, C2, C3, C4, …)
Each page stores data from a single column
Highly compressed
About 2x better than PAGE compression
More data fits in memory
Each column can be accessed independently
Fetch only needed columns
Can dramatically decrease IO
Columnstore index: Column Segment
Segment contains values from one column for a set of rows
Segments for the same set of rows comprise a row group
Segments are compressed
Each segment stored in a separate LOB
Segment is unit of transfer between disk and memory
[Figure: columns C1 to C6, each divided into segments]
Index creation and storage
[Figure: base table (columns A, B, C, D) is split into row groups of 1M rows; each row group is encoded and compressed into column segments, which are stored as blobs in the column store index alongside a segment directory]
New catalog view: sys.column_store_segments
  ▫ Includes segment metadata: size, min, max, …
Observed compression ratios
Data Set | Uncompressed table size (MB) | Column store index size (MB) | Compression Ratio
Cosmetics | 1,… | … | …
SQM | 1,… | … | …
Xbox | 1,… | … | …
MSSales | 642,000 | 126,… | …
Web Analytics | 2,… | … | …
Telecom | 2,… | … | …
…X better compression than SQL’s page compression
Columnstore index example
[Figure sequence on a table with columns OrderDateKey, ProductKey, StoreKey, RegionKey, Quantity, SalesAmount]
Horizontally partition (Row Groups)
Vertically partition (Segments)
Compress each segment*
Fetch only needed columns
Creating ColumnStore Indexes
Create a columnstore index
  ▫ Create the table
  ▫ Load data into the table
  ▫ Create a non-clustered columnstore index on all, or some, columns
    CREATE NONCLUSTERED COLUMNSTORE INDEX ncci ON myTable(OrderDate, ProductID, SaleAmount)
Object Explorer
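The three steps above can be sketched end to end as follows. Only the index name and column names come from the slide; the dbo schema, the column types, and the sample rows are assumptions made for illustration.

```sql
-- 1. Create the table (hypothetical column types; names are from the slide).
CREATE TABLE dbo.myTable
(
    OrderDate  datetime NOT NULL,
    ProductID  int      NOT NULL,
    SaleAmount money    NOT NULL
);

-- 2. Load data while the table is still updateable (illustrative rows).
INSERT INTO dbo.myTable (OrderDate, ProductID, SaleAmount)
VALUES ('20120401', 1, 19.99),
       ('20120401', 2, 5.00),
       ('20120402', 1, 19.99);

-- 3. Create the nonclustered columnstore index on all, or some, columns.
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci
    ON dbo.myTable (OrderDate, ProductID, SaleAmount);
```

Once step 3 completes, the table can be read but not updated until the index is disabled or dropped, which is why the load comes first.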
Memory management
Memory management is automatic
Columnstore is persisted on disk
Needed columns fetched into memory
Columnstore segments flow between disk and memory
SELECT C2, SUM(C4) FROM T GROUP BY C2;
[Figure: all segments (T.C1, T.C2, T.C3, T.C4) on disk; only T.C2 and T.C4 in memory]
IO and caching
New (large) object cache
  ▫ Cache for column segments and dictionaries
Aggressive read ahead
  ▫ At segment level
  ▫ At page level within segment
New memory broker
  ▫ Brokers memory between buffer pool and object cache
Data reduction
Early segment elimination based on segment metadata
  ▫ Min and max values stored in metadata for each segment
Simple filters evaluated in storage engine during CS index scan
  ▫ Conjunctions of comparisons, in-list
Bitmap filters
  ▫ Evaluated during index scan
  ▫ Built by Hash Table Build operator
Segment elimination
[Figure: segments of OrderDateKey, ProductKey, and SalesAmount, each labelled with its Min and Max values]
Segment elimination
Best practice: Create CS index from a clustered index
  ▫ Rows distributed to row groups in clustered index order
  ▫ Does not affect ordering within segments
    VertiPaq orders data within row groups
  ▫ Good segment elimination for filters on leading key column using min/max values
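One way to see whether segments are good candidates for elimination is to look at the per-segment metadata in sys.column_store_segments. The sketch below assumes a hypothetical table dbo.FactSales that already has a columnstore index. Note that min_data_id and max_data_id may hold encoded IDs rather than raw column values, depending on the segment's encoding.

```sql
-- List the min/max metadata for every segment of the columnstore index
-- on a hypothetical table dbo.FactSales.
SELECT  p.partition_number,
        s.column_id,
        s.segment_id,
        s.row_count,
        s.min_data_id,
        s.max_data_id
FROM    sys.column_store_segments AS s
JOIN    sys.partitions AS p
        ON p.hobt_id = s.hobt_id
WHERE   p.object_id = OBJECT_ID(N'dbo.FactSales')
ORDER BY p.partition_number, s.column_id, s.segment_id;
```

For the leading clustered key column, narrow and mostly non-overlapping min/max ranges across segments suggest that filters on that column can skip many segments.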
Batch mode query execution
Vector-oriented processing
Compact data representation
Highly efficient algorithms
Better parallelism
Would you rather process your data like this … or like this?
Processing data
Columnstore index scan can produce batches or rows
  ▫ Batch-enabled operators get batches
  ▫ Non-batch operators get rows
Query optimizer decides
[Figure: a batch object holds column vectors and a list of qualifying rows]
Using ColumnStore Indexes
Let the query optimizer do the work
  ▫ Optimizer makes a cost-based decision
    Data access method: Columnstore index | B-tree index | Heap
    Processing mode: Batch mode | Row mode
Most things “just work”
  ▫ Backup and restore
  ▫ Mirroring, log shipping
  ▫ SSMS
Limitations on using columnstore indexes
Creating columnstore index
  ▫ Only on common business data types
    Yes: int, real, string, money, datetime, decimal <= 18 digits
    No: decimal > 18 digits, binary, varbinary, CLR, (n)varchar(max), varbinary(max), uniqueidentifier, datetimeoffset with precision > 2
Maintain table: limited operations
  ▫ Can read but not update the data
  ▫ Can switch partitions in and out
Processing queries
  ▫ All read-only T-SQL queries run
  ▫ Some queries are accelerated more than others
Loading new data
Table can be read, not updated
  ▫ Partition switching is allowed
  ▫ INSERT, UPDATE, DELETE, and MERGE not allowed
Methods for loading data
  ▫ Disable, update, rebuild
  ▫ Partition switching
  ▫ UNION ALL between large table with columnstore index and smaller updateable table
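The three loading methods might look roughly like this. These are alternatives, not one script to run top to bottom. The tables dbo.myTable_Staging and dbo.myTable_Delta, the partition number, and the sample row are all hypothetical; the index name ncci follows the earlier CREATE statement.

```sql
-- Method 1: disable, update, rebuild.
ALTER INDEX ncci ON dbo.myTable DISABLE;

INSERT INTO dbo.myTable (OrderDate, ProductID, SaleAmount)
VALUES ('20120403', 3, 42.00);          -- illustrative row

ALTER INDEX ncci ON dbo.myTable REBUILD;

-- Method 2: partition switching.
-- Assumes dbo.myTable is partitioned and dbo.myTable_Staging (hypothetical)
-- has an identical schema and columnstore index, sits on the same filegroup,
-- and carries a check constraint matching the target partition's range.
ALTER TABLE dbo.myTable_Staging
    SWITCH TO dbo.myTable PARTITION 5;  -- partition number is an assumption

-- Method 3: UNION ALL between the large read-only table and a small
-- updateable table (dbo.myTable_Delta, hypothetical, no columnstore index).
SELECT  ProductID, SUM(SaleAmount) AS TotalSales
FROM   (SELECT ProductID, SaleAmount FROM dbo.myTable
        UNION ALL
        SELECT ProductID, SaleAmount FROM dbo.myTable_Delta) AS combined
GROUP BY ProductID;
```

Method 1 rebuilds the whole index, so it suits smaller tables or long load windows. Method 2 touches only the switched partition.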
When to build a columnstore index
Workload
  ▫ Read mostly
  ▫ Most updates are appends
  ▫ Star joins
  ▫ Queries that scan and aggregate large data volumes
Workflow
  ▫ Permits partition switching (or drop and rebuild index)
  ▫ Typically nightly load window
Table size
  ▫ Large fact tables
  ▫ Consider for large dimension tables
  ▫ Very wide tables
When not to build a columnstore index
Workload
  ▫ Frequent loads
  ▫ Many updates and deletes to existing data
    Especially if in multiple/unpredictable partitions
  ▫ Frequent small look-up queries
    B-tree indexes may give better performance
  ▫ Your workload does not benefit
Workflow
  ▫ Partition switching or rebuilding the index does not fit your workflow
Best practices for creating the index
Use a star schema when possible
  ▫ Build CS index on fact tables
  ▫ Consider for large dimension tables
Include all the columns in the CS index
  ▫ Don’t use to seek into a row
  ▫ Order of listed columns not important
Convert decimal to precision <= 18 if possible
Use integer types whenever possible
Best practices for creating the index (continued)
Ensure enough memory to build the CS index
Consider table partitioning to facilitate updates
Consider creating the CS index from a clustered index
  ▫ Better segment elimination when predicate on key
  ▫ Slightly better compression (no RID)
Best practices for writing queries
Consider modifying queries to hit the “sweet spot”
  ▫ Star joins
  ▫ Inner joins
  ▫ Group By
Keep statistics up to date
Use MAXDOP > 1
  ▫ Batch mode processing only for parallel queries
Troubleshooting: Creating the index
Are you getting out of memory errors?
  ▫ Ensure enough memory
  ▫ Memory requirement is related to number of columns, data, and DOP
  ▫ Memory available ≠ memory on the box when there is concurrent activity
  ▫ By default, a query is restricted to 25% even when Resource Governor (RG) is not enabled
  ▫ Check showplan XML for memory grant info
  ▫ Rough estimate (see FAQs on technet wiki):
    Memory grant request in MB = [(4.2 * Num of columns in the CS index) + 68] * DOP + (Num of string cols * 34)
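As a worked example of the rough estimate above, the following plugs in hypothetical values (20 columns in the index, 5 of them strings, DOP 8). The variable names and input values are illustrative only.

```sql
-- Hypothetical inputs for the rough memory grant estimate.
DECLARE @num_columns        int = 20;  -- columns in the CS index
DECLARE @num_string_columns int = 5;   -- of which string columns
DECLARE @dop                int = 8;   -- degree of parallelism

SELECT ((4.2 * @num_columns) + 68) * @dop
       + (@num_string_columns * 34) AS estimated_grant_mb;
-- (4.2 * 20 + 68) * 8 + 5 * 34 = 152 * 8 + 170 = 1386 MB
```

Because the first term is multiplied by DOP, lowering MAXDOP for the index build is one lever for reducing the requested grant.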
Troubleshooting: Creating the index
Why is my index not building in parallel?
  ▫ Index build is parallel only if table has > 1M rows
How big is my columnstore index?
  ▫ For size and other info, check new catalog views
    sys.column_store_segments
    sys.column_store_dictionaries
  ▫ Queries in the FAQ make it easy
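A size query over the two catalog views could look like the sketch below, which adds up segment and dictionary sizes for one table. It is written in the spirit of the FAQ queries rather than copied from them, and the table name dbo.myTable is an assumption.

```sql
DECLARE @table_name sysname = N'dbo.myTable';  -- hypothetical table

-- Total on-disk size = column segments + dictionaries, in MB.
SELECT SUM(x.on_disk_size) / (1024.0 * 1024.0) AS total_size_mb
FROM (
    SELECT css.on_disk_size
    FROM   sys.indexes AS i
    JOIN   sys.partitions AS p
           ON p.object_id = i.object_id AND p.index_id = i.index_id
    JOIN   sys.column_store_segments AS css
           ON css.hobt_id = p.hobt_id
    WHERE  i.object_id = OBJECT_ID(@table_name)
      AND  i.type_desc = 'NONCLUSTERED COLUMNSTORE'

    UNION ALL

    SELECT csd.on_disk_size
    FROM   sys.indexes AS i
    JOIN   sys.partitions AS p
           ON p.object_id = i.object_id AND p.index_id = i.index_id
    JOIN   sys.column_store_dictionaries AS csd
           ON csd.hobt_id = p.hobt_id
    WHERE  i.object_id = OBJECT_ID(@table_name)
      AND  i.type_desc = 'NONCLUSTERED COLUMNSTORE'
) AS x;
```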
Troubleshooting: Query performance
Is the columnstore index being used?
Troubleshooting: Query performance
If the columnstore index is not being used:
  ▫ Are all needed columns present?
  ▫ Cardinality estimate?
    If selective, optimizer will choose a B-tree
Are other nonclustered indexes being used?
  ▫ Too many indexes plus bad statistics can confuse the optimizer
  ▫ Consider using hints and/or disabling other indexes
If the columnstore is being used, are there other issues?
  ▫ Sorts, spills?
  ▫ Table spools?
  ▫ Is a lot of data being returned to the client?
    Not all bottlenecks are query processing
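The hints mentioned above can be used to compare plans with and without the columnstore index. This is a sketch against the hypothetical dbo.myTable and its index ncci; SET STATISTICS XML ON returns the actual plan, where the chosen index and execution mode can be inspected.

```sql
SET STATISTICS XML ON;   -- return the actual execution plan with each result

-- Force the columnstore index with a table hint.
SELECT  ProductID, SUM(SaleAmount) AS TotalSales
FROM    dbo.myTable WITH (INDEX (ncci))
GROUP BY ProductID;

-- For comparison, tell the optimizer to ignore the columnstore index.
SELECT  ProductID, SUM(SaleAmount) AS TotalSales
FROM    dbo.myTable
GROUP BY ProductID
OPTION (IGNORE_NONCLUSTERED_COLUMNSTORE_INDEX);

SET STATISTICS XML OFF;
```

If the forced plan is clearly faster than the one the optimizer picks on its own, stale statistics or a poor cardinality estimate is a likely cause.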
Troubleshooting: Query performance
Is batch mode being used to process most of the data?
Troubleshooting: Query performance
If batch mode is not being used to process most of the data
  ▫ Is there a columnstore index being used?
  ▫ Outer joins?
  ▫ DOP?
  ▫ Loop join? Check cardinality estimate
  ▫ Operators not enabled for batch mode?
    Batch-enabled:
    Scan, filter, project
    Local hash partial aggregation
    Hash inner join, hash table build
Troubleshooting: Query performance
Filters or joins on strings?
  ▫ Filters on strings are not pushed into storage engine
  ▫ Joins on integers are more efficient
Filter with “OR”?
  ▫ IN-lists but not OR filters are pushed down
Hash tables don’t fit into memory?
  ▫ Usually due to a small memory grant based on a cardinality estimate (CE) error, not a physical memory limitation
  ▫ Fall back to row mode processing
  ▫ Slower than a row mode join
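To illustrate the OR versus IN-list point, here are two equivalent filters on a hypothetical dbo.FactSales table. The slide does not say whether the optimizer rewrites the first form on its own, so comparing the two plans is the safest check.

```sql
-- OR form: per the slide, OR filters are not pushed into the storage engine.
SELECT SUM(SalesAmount) AS TotalSales
FROM   dbo.FactSales                    -- hypothetical table
WHERE  StoreKey = 1 OR StoreKey = 2 OR StoreKey = 3;

-- IN-list form: same result, expressed as the filter shape that is pushed down.
SELECT SUM(SalesAmount) AS TotalSales
FROM   dbo.FactSales
WHERE  StoreKey IN (1, 2, 3);
```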
Real customer experiences
Customer Type | Industry segment/Application | Measure | Without ColumnStore Index (sec) | With ColumnStore Index | Improvement
External | Online services | Query | … | … | …x
External | Retail | Query | … | … | …x
External | Healthcare | Set of 6 Queries | … | … | …x
Internal | HR reporting | Avg. response time on production system | … | … | …x
Internal | Financial reporting | 3 Queries | … | … | Each > 50x
Internal | Financial reporting | Queries taking longer than 10 min | … | … | 90% reduction
Take-aways
Columnstore indexes can enable phenomenal performance gains
Batch mode processing is an essential ingredient for speedup
Some adjustments to schema and loading processes may be necessary
Some queries can benefit from tuning
Columnstore indexes are not a magic bullet
Resources
Columnstore FAQ:
  ▫ http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-index-faq.aspx
Tuning Guide:
  ▫ http://social.technet.microsoft.com/wiki/contents/articles/sql-server-columnstore-performance-tuning.aspx
SIGMOD paper:
  ▫ http://dl.acm.org/citation.cfm?doid=…9448
Thank you! Questions?
Data Warehouse Workload
Data warehouse workload
Read-mostly
  ▫ Load large amounts of data
  ▫ Append new data incrementally
  ▫ Rarely update existing data
  ▫ Often retain data for given window of time (e.g. 1 yr, 3 yr, 7 yr)
    Sliding window data management
Queries touch large amounts of data
  ▫ Join multiple tables
  ▫ Large “fact” tables
Star schema is common
Star joins are common
Sliding window
Star schema
Fact table:
  FactSales(CustomerKey int, ProductKey int, EmployeeKey int, StoreKey int, OrderDateKey int, SalesAmount money)
Dimension tables:
  DimCustomer(CustomerKey int, FirstName nvarchar(50), LastName nvarchar(50), Birthdate date, Address nvarchar(50))
  DimProduct …
  DimDate
  DimEmployee
  DimStore
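The two tables spelled out on the slide could be declared as below. The column names and types are from the slide; the dbo schema, the primary key, the nullability, and the index name ncci_FactSales are assumptions. The columnstore index covers every fact table column, in line with the best practice of including all columns.

```sql
-- Dimension table (primary key and NOT NULL are assumptions).
CREATE TABLE dbo.DimCustomer
(
    CustomerKey int          NOT NULL PRIMARY KEY,
    FirstName   nvarchar(50) NULL,
    LastName    nvarchar(50) NULL,
    Birthdate   date         NULL,
    Address     nvarchar(50) NULL
);

-- Fact table: integer keys to each dimension plus the measure.
CREATE TABLE dbo.FactSales
(
    CustomerKey  int   NOT NULL,
    ProductKey   int   NOT NULL,
    EmployeeKey  int   NOT NULL,
    StoreKey     int   NOT NULL,
    OrderDateKey int   NOT NULL,
    SalesAmount  money NOT NULL
);

-- Columnstore index on all fact columns (index name is hypothetical).
CREATE NONCLUSTERED COLUMNSTORE INDEX ncci_FactSales
    ON dbo.FactSales (CustomerKey, ProductKey, EmployeeKey,
                      StoreKey, OrderDateKey, SalesAmount);
```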
Star join query
    SELECT TOP 10 p.ModelName, p.EnglishDescription,
           SUM(f.SalesAmount) AS SalesAmount
    FROM FactResellerSalesPart f, DimProduct p, DimEmployee e
    WHERE f.ProductKey = p.ProductKey
      AND e.EmployeeKey = f.EmployeeKey
      AND f.OrderDateKey >=
      AND p.ProductLine = 'M' -- Mountain
      AND p.ModelName LIKE '%Frame%'
      AND e.SalesTerritoryKey = 1
    GROUP BY p.ModelName, p.EnglishDescription
    ORDER BY SUM(f.SalesAmount) DESC;
“Typical” data warehouse queries
Process large amounts of data
  ▫ Joins, aggregation, filtering
Reporting queries
Ad hoc queries
Often slow (minutes to hours)
DBAs spend considerable effort
  ▫ Designing indexes, tuning queries
  ▫ Building summary tables, indexed views, OLAP cubes