SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ dsarka@solidq.com

Agenda DW Problems Bitmap Filtered Hash Joins Table Partitioning Filtered Indexes Indexed Views Data Compression Window Functions Columnstore Indexes 2

Algorithms Complexity Forever* = about 40 billion billion years! 3

SSAS Dimensional Addressing Axis(1).Position(3) Axis(1).Position (1).Members(2) Axis(1) Every cell has an address 4

SSAS Tabular Problems 5

6 RDBMS Joins Merge: complexity~ O(n) –Needs sorted inputs, equijoin Hash:complexity~ O(n) / ~O(n 2 ) –Needs equijoin Nested Loops: complexity~ O(n) (indexed), ~ O(n 2 ) (not indexed) –Works always, can become quadratic Non-equijoins are frequently quadratic –E.g., running totals

Linearize Joins xy 1 = xy 2 = x 2 y 3 = x 2 per partes 0000 0,2 0,04 0,4 0,16 0,6 0,36 0,8 0,64 1111 1,2 1,441,04 1,4 1,961,16 1,6 2,561,36 1,8 3,241,64 2242 2,2 4,842,04 2,4 5,762,16 2,6 6,762,36 2,8 7,842,64 3393 7

8 Bitmap Filtered Star Joins Optimized bitmap filtering for star schema joins –Bitmap representation of a set of values from a dim table to pre-filter rows to join from a fact table –Enables filtering rows early in the plan, allowing subsequent operators to operate on fewer rows

Bloom Filter (1) * Bloom filter is a bit array of m bits –Start with all bits set to 0 k different hash functions defined –Each of which maps some set element to one of the m positions with a uniform random distribution To add an element, feed it to each of the k hash functions to get k array positions –Set the bits at all these positions to 1 9 Source: WikipediaWikipedia

Bloom Filter (2) To test whether and element it is in the set, feed it to each of the k hash functions to get k array positions –If any of the bits at these positions are 0, the element is not in the set –If all are 1, then either the element is in the set, or the bits have been set to 1 during the insertion of other elements 10

Table Partitioning Partition function Partition scheme Aligned indexes Partition elimination Partition switching 11

Filtered Indexes Where clause in the Create Index statement Small B-trees on subset of data only Useful when some values are selective, while others dense –Index on selective values only 12

Indexed Views Useful for queries that aggregate data –Can also reduce number of joins Depending on edition of SQL Server can be used automatically –No need to change reporting queries Many limitations 13

Data Compression Pre-SQL 2005: variable-length data types SQL 2005: vardecimal SQL 2008 − Row compression − Page compression SQL 2008 R2 − Unicode compression

SQL 2008 Compression Row compression –Fixed-width data type values stored in variable format Page compression Prefix compression Dictionary compression

Unicode Compression Works on nchar(n) and nvarchar(n) Automatically with row or page compression Savings depends on language –Up to 50% in English, German –Only 15% in Japanese Very low performance penalty

Window Functions Functions operating on a window (set) of rows defined by an OVER clause Types of functions: Ranking Aggregate Distribution SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty FROM Sales.EmpOrders; SELECT empid, ordermonth, qty, SUM(qty) OVER(PARTITION BY empid ORDER BY ordermonth ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS runqty FROM Sales.EmpOrders; 17

Window Functions in SQL Server SQL Server 2005: –Ranking calculations –Aggregates with only window partitioning SQL Server 2012: –Aggregates with also window ordering and framing –Offset functions: LAG, LEAD, FIRST_VALUE, LAST_VALUE –Distribution functions: PERCENT_RANK, CUME_DIST, PERCENTILE_CONT, PERCENTILE_DISC 18

SQL Server DW / OLAP Offerings Personal and team level –PowerPivot for Excel (client) –PowerPivot for SharePoint (server) Corporate level –SQL Server –SSAS Tabular –SSAS Dimensional –Fast Track Data Warehouse –Parallel Data Warehouse VertiPaqVertiPaq 19

Trans-Relational Model Not “beyond” relational –Transformation between logical and physical layer Steve Tarin, Required Technologies Inc. (1999) All columns stored in sorted order –All joins become merge joins –Can condense storage –Of course, updates suffer Logically, this is a pure relational model SQL Server uses own variant –Order of columns not preserved – optimized for compression –Leverages parallel hash joins rather than merge joins 20

Columnar Storage (1) Row / Col123 NameColorCity 1NutRedLondon 2BoltGreenParis 3ScrewBlueOslo 4ScrewRedLondon 5CamBlueParis 6CogRedLondon Row / Col123 NameColorCity 1BoltBlueLondon 2CamBlueLondon 3CogGreenLondon 4NutRedOslo 5ScrewRedParis 6ScrewRedParis 21

Columnar Storage (2) Row / Col123 NameColorCity 1Bolt [1:1]Blue [1:2]London [1:3] 2Cam [2:2]Green [3:3]Oslo [4:4] 3Cog [3:3]Red [4:6]Paris [5:6] 4Nut [4:4] 5Screw [5:6] 6 22 Row / Col123 NameColorCity 1BoltBlueLondon 2CamBlueLondon 3CogGreenLondon 4NutRedOslo 5ScrewRedParis 6ScrewRedParis

Row Reconstruction Table Row / Col123 NameColorCity 1Bolt [1:1]Blue [1:2]London [1:3] 2Cam [2:2]Green [3:3]Oslo [4:4] 3Cog [3:3]Red [4:6]Paris [5:6] 4Nut [4:4] 5Screw [5:6] 6 Row / Col123 NameColorCity 1364 2146 3653 4415 5221 6532 23

SQL Server Solution (1) * Converting rows to column segments 24 Source: SQL Server Column Store Indexes by Per-Åke Larson, et al., Microsoft SIGMOD’10, June 12–16, 2011

SQL Server Solution (2) Storing column segments as BLOBs –Leverages existing BLOB storage –Additional segment metadata –Multiple compression algorithms 25

Columnstore Compression Encoding values to 32-bit or 64-bit integer –Dictionary-based encoding –Value-based (prefix) encoding Optimal row ordering with VertiPaq™ algorithm to rearrange rows –Optimal ordering for Run-Length Encoding (RLE) for best overall compression Compression –RLE - data stored as pairs –Bit-Pack– use min number of bits for a value 26

Result: Reduced I/O Fetches only needed columns from disk Columns are compressed Less IO Better buffer hit rates C1 C2 C4 C5 C6 C3 SELECT region, sum (sales) …

Result: Reading Segments Column segment contains values from one column for a set of about 1M rows Column segment is unit of transfer from disk Storage engine can eliminate segments early in the process Because of additional column segment metadata C1 C2 C3 C5C6C4 Set of about 1M rows Column Segment

Reducing CPU Usage Columnstore indexes reduce disk IO Bitmap-filtered hash joins can be executed in parallel Problem: CPU becomes a bottleneck Solution: reduce CPU usage by processing large numbers of rows –Iterators that do not process row-at-a-time –Process batch-at-a-time

Batch Processing Orthogonal to columnstore indices –Can support other storage However, best results with columnstore indices –Sometimes can perform batch operations directly on compressed data Can mix batch and row operators –Can dynamically switch from batch to row mode 30

Batch Operators The following operators support batch mode processing: –Filter –Project –Scan –Local hash (partial) aggregation –Hash inner join –Batch hash table build 31 Source: http://social.technet.microsoft.com/wiki/contents/articles/sql-server- columnstore-index-faq.aspx#Batch_mode_processinghttp://social.technet.microsoft.com/wiki/contents/articles/sql-server- columnstore-index-faq.aspx#Batch_mode_processing

Columnstore Indexes Constraints Base table must be clustered B-tree or heap Columnstore index: –Nonclustered –One per table –Must be partition-aligned –Not allowed on indexed view –Can’t be a filtered index 32

Data Type Restrictions Unsupported types –Decimal > 18 digits –Binary –BLOB –(n)varchar(max) –Uniqueidentifier –Date/time types > 8 bytes –CLR 33

Query Performance Restrictions Outer joins Unions Consider modifying queries to hit “sweet spot” –Inner joins –Star joins –Aggregation 34

Loading New Data Columnstore index makes table read-only Partition switching allowed INSERT, UPDATE, DELETE, and MERGE not allowed Two recommended methods for loading data Disable, update, rebuild Partition switching 35

36 Columnstore Indexes Usage Use when: –Read-mostly workload –Most updates are appending new data –Workflow permits partitioning or index drop/rebuild –Queries often scan & aggregate lots of data Use on fact (and large dimensions) tables Do not use when: –Frequent updates –Partition switching or rebuilding index doesn’t fit workflow –Frequent small look up queries –VertiPaq cannot handle your data model

Review DW Problems Bitmap Filtered Hash Joins Table Partitioning Filtered Indexes Indexed Views Data Compression Windows Functions Columnstore Indexes 37

38 Q & A Questions? Thank you for coming to this conference… …and this presentation!

References Books: –SQL Server Books OnLine –Dejan Sarka, Grega Jerkič and Matija Lah: MCTS Self-Paced Training Kit (Exam 70-463): Building Data Warehouses with Microsoft SQL Server 2012 Courses and Seminars –SQL Server 2012 and SharePoint BI Immersion –Advanced Transact-SQL 39

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ

Similar presentations

Presentation on theme: "SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ

Similar presentations

Presentation on theme: "SQL Server 2012 Data Warehousing Deep Dive Dejan Sarka, SolidQ"— Presentation transcript:

Similar presentations

About project

Feedback