Making Data Warehouse Easy Conor Cunningham – Principal Architect Thomas Kejser – Principal PM.

Introduction
We build and implement Data Warehouses (and the engines that run them)
We also fix DWs that others build
This talk covers the key patterns we use
We will also show you how you can make your life easier with Microsoft’s SQL technologies

What World do you Live in?
Hardware should be bought when I know the details
I need to know my hardware CAPEX before I decide to invest
I can’t wait for you to figure all that out. Do it, NOW!

Sketch a Rough Model
1. Roughly define the business problem
2. Decide on Dimensions
– Dim columns can wait
3. Build Dimension/Fact Matrix
Fact/Dim: Sales, Inventory, Purchases
Customer: X
Product: X X X
Time: X X X
Date: X X X
Store: X X
Warehouse: X X

Estimate Storage
Column sizes: ≈ 4B, ≈ 8B
Compression: ≈ 1/3, or measure it with sp_estimate_data_compression_savings
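A rough sizing pass along these lines can be sketched in Python. The sketch assumes that the ≈ 4B figure applies to integer key columns, that the ≈ 8B figure applies to measure columns, and that compression brings the table to about one third of its raw size. The function name and the example row and column counts are hypothetical.

```python
def estimate_fact_bytes(rows, key_cols, measure_cols, compression_divisor=3):
    """Return (raw_bytes, compressed_bytes) for a fact table.

    Assumptions for this sketch: 4 bytes per integer key column,
    8 bytes per measure column, no per-row overhead, and compression
    to roughly 1/compression_divisor of the raw size.
    """
    row_bytes = key_cols * 4 + measure_cols * 8
    raw_bytes = rows * row_bytes
    compressed_bytes = raw_bytes // compression_divisor
    return raw_bytes, compressed_bytes


# Hypothetical fact table: 1 billion rows, 6 dimension keys, 4 measures.
raw, compressed = estimate_fact_bytes(1_000_000_000, 6, 4)
print(f"raw: {raw / 1e9:.0f} GB, compressed: {compressed / 1e9:.1f} GB")
```

Once a data sample is loaded, a measured compression figure is a better input than the one-third rule of thumb.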

Why Integer Keys are Cheaper
Smaller row sizes
More rows/page = more compression
Faster to join
Faster in column stores
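The rows-per-page effect can be illustrated with a little arithmetic. This is a simplified sketch: it uses the 8,096 bytes that follow the 96-byte header on an 8 KB SQL Server page and ignores per-row overhead and the slot array. The row layouts (five keys, three 8-byte measures, 20-byte string keys) are hypothetical.

```python
PAGE_DATA_BYTES = 8096  # 8 KB page minus the 96-byte page header


def rows_per_page(row_bytes):
    """Rows that fit on one page, ignoring per-row overhead (simplification)."""
    return PAGE_DATA_BYTES // row_bytes


# Hypothetical fact row: 5 dimension keys plus 3 measures of 8 bytes each.
int_key_row = 5 * 4 + 3 * 8      # 4-byte integer keys
string_key_row = 5 * 20 + 3 * 8  # 20-byte string keys (assumed width)

int_rows = rows_per_page(int_key_row)
string_rows = rows_per_page(string_key_row)
print(int_rows, string_rows)
```

Under these assumptions the integer-keyed row fits almost three times as many rows on a page, so fewer pages have to be read for the same scan.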

Pick Standard HW Configuration
Small (GB to low TB): Business Decision Appliance
Medium (up to 80TB): Fast Track
Large (100s of TB): PDW
– Note: Elastic scale is a plus for lower sizes too!
Careful with sizes, some are listed pre-compression

Server Config / File Layout
1. Follow FT Guidance!
2. You probably don’t need to do anything else

Why does Fast Track/PDW Work?
Warehouses are I/O hungry
– GB/sec
– This is high (in SAN terms)
We did the HW testing for you
Guidance on data layout

Implement Prototype Model
Design schema
Analyse data quality with DQS/Excel
– Probably not what you expected to find!
Start with small data samples!

Schema Tool Discussion!
SSMS with Schema Designer
SQL Server Data Tools

Prototyping Hints
Generate INTEGER keys out of string keys with a hash
Focus on Type 1 Dimensions
PowerPivot/Excel to show data fast
Drive conversation with end users!
Customer Type 1
Key | Name | City
1 | Thomas | London
Customer Type 2
Key | Name | City | From | To
1 | Thomas | Malmo | |
 | Thomas | London | |
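The hash-based key generation and the difference between Type 1 and Type 2 dimensions can be sketched in Python. Everything here is illustrative: SHA-256 is an arbitrary hash choice, the source key "CUST-000123" and the dates are made up, and the function names are hypothetical. Truncating a hash to 32 bits can collide, so a real load should check for duplicates.

```python
import hashlib


def string_key_to_int(business_key):
    """Map a string key to a signed 32-bit integer (fits a SQL INT)."""
    digest = hashlib.sha256(business_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], byteorder="big", signed=True)


def apply_type1(dim, key, city):
    """Type 1: overwrite the attribute in place; no history is kept."""
    dim[key]["City"] = city


def apply_type2(rows, name, city, change_date):
    """Type 2: close the current row and append a new version."""
    current = next(r for r in rows if r["Name"] == name and r["To"] is None)
    current["To"] = change_date
    new_key = max(r["Key"] for r in rows) + 1  # new surrogate key per version
    rows.append({"Key": new_key, "Name": name, "City": city,
                 "From": change_date, "To": None})


customer_key = string_key_to_int("CUST-000123")  # hypothetical source key

type1 = {1: {"Name": "Thomas", "City": "Malmo"}}
apply_type1(type1, 1, "London")

type2 = [{"Key": 1, "Name": "Thomas", "City": "Malmo",
          "From": "2010-01-01", "To": None}]
apply_type2(type2, "Thomas", "London", "2012-06-01")
```

After the calls, the Type 1 dimension holds only the London row, while the Type 2 dimension holds both the closed Malmo row and the open London row.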

Prototype: What users will teach you
They will change or refocus their minds when they see the actual data
You have probably forgotten some dimension data
You may have misestimated data sizes

Schema Design Hints
Build Star Schema
Beginners may want to avoid snowflakes (most of our users just use star)
Implement a Date Table (use INT key in YYYYMMDD format)
– Fact.MyDate BETWEEN AND
– Fact.MyDate BETWEEN ‘ ’ and ‘ ’
– YEAR(Fact.MyDate) = 2000
Identity, Sequences
Usually you can validate PK/FK Constraints during load and avoid them in the model
Fact Table
– fixed-size columns, declared NOT NULL (if possible)
For ColumnStore, data types need to be the basic ones…
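The YYYYMMDD integer key is simple to generate. The following Python sketch builds date-table rows for one year; the function names and the choice of extra columns (year, month) are assumptions for illustration.

```python
from datetime import date, timedelta


def date_key(d):
    """Encode a date as an integer in YYYYMMDD form."""
    return d.year * 10000 + d.month * 100 + d.day


def date_table(start, end):
    """Yield one (key, date, year, month) row per day, both ends included."""
    d = start
    while d <= end:
        yield (date_key(d), d, d.year, d.month)
        d += timedelta(days=1)


rows = list(date_table(date(2000, 1, 1), date(2000, 12, 31)))
keys = [r[0] for r in rows]
```

Because the keys sort in the same order as the dates, a filter for a whole year becomes a plain integer range on the key column rather than a function applied to every row.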

Why Facts/Dimensions?
Optimizers have a tough job
Our QO generates star joins early in search
We look for the star join pattern to do this
– 1 big table, dimensions joined to it…
Following this pattern will help you
– Reduced compilation time
– Better plan quality (on average)
You can look at the plans and see whether the optimizer got the “right” shape
– Wrong plan means your query is non-standard OR perhaps the QO messed up!

Partition/Index the Model
Partition fact by load window
Fact cluster/heap?
– Cluster fact on seek key
– Cluster fact on date column (if cardinality > partitions)
– Leave as heap
Column Store index on
– All columns of fact
– Columns of large dim
Cluster the Dim on Key
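How a load-window partitioning scheme routes rows can be modelled in a few lines of Python. This sketch assumes monthly load windows keyed by YYYYMMDD integers and RANGE RIGHT boundary semantics, where each boundary value is the first value of the partition to its right. The boundary values and the function name are hypothetical.

```python
from bisect import bisect_right


def partition_number(boundaries, key):
    """Return the 1-based partition for key under RANGE RIGHT semantics.

    boundaries must be sorted ascending; N boundaries give N + 1 partitions.
    """
    return bisect_right(boundaries, key) + 1


# Hypothetical monthly load windows.
monthly_boundaries = [20120101, 20120201, 20120301]

for key in (20111231, 20120101, 20120115, 20120201, 20120315):
    print(key, partition_number(monthly_boundaries, key))
```

With one partition per load window, each ETL run touches only the newest partition, which is what makes switching partitions in and out cheap.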

If(followedpattern) {expect …}
Star Join Shape
Properties:
– Usually all Hash Joins
– Parallelism
– Bitmaps
– Join dimensions together, then scan Fact
– Indexes on filtered Dim columns helpful if they are covering

The Approximate Plan
[Plan diagram. Operators shown: Fact CSI Scan, Dim Scan, Dim Seek, Batch Build (×2), Hash Join (×2), Partial Aggregate, Hash, Stream Aggregate]
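The general idea behind this plan shape can be mimicked with a toy Python model: build a hash table from each filtered dimension, scan the fact once, drop rows whose keys miss either table (the role a bitmap filter plays), then aggregate. This is a row-at-a-time illustration with made-up tables and names, not a description of how SQL Server executes the plan.

```python
from collections import defaultdict


def star_join_sum(fact_rows, dim_product, dim_store, category, region):
    """Total amount per (category, region) for the selected dimension members."""
    # Build side: hash tables holding only the dimension rows that pass the filter.
    products = {k: v for k, v in dim_product.items() if v == category}
    stores = {k: v for k, v in dim_store.items() if v == region}

    totals = defaultdict(int)
    # Probe side: a single scan of the fact table.
    for product_key, store_key, amount in fact_rows:
        if product_key not in products or store_key not in stores:
            continue  # filtered out early, like a bitmap filter would do
        totals[(products[product_key], stores[store_key])] += amount
    return dict(totals)


# Hypothetical dimensions (key -> attribute) and fact rows.
dim_product = {1: "Bikes", 2: "Helmets", 3: "Bikes"}
dim_store = {10: "North", 20: "South"}
fact_sales = [(1, 10, 100), (2, 10, 15), (3, 10, 250), (1, 20, 80)]

result = star_join_sum(fact_sales, dim_product, dim_store, "Bikes", "North")
```

The small dimensions are read first and the big fact table is read exactly once, which is why the pattern scales with fact size.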

Column Store Plan Shapes
For ColumnStore, it’s the same shape
Minor differences
– Batch mode (Not Row Mode)
– Parallelism works differently
– Converts to row mode above the star join shape
If you don’t get a batch mode plan, performance is likely to be much slower (usually this implies a schema design issue or a plan costing issue)
Partitioning Sliding Window works well with ColumnStore (especially since the table must be read-only)

Data Maintenance
Statistics
– Add manually on Correlated Columns
– Update fact statistics after ETL load
– Leave Dim to auto update
Rebuilding indexes?
– Probably not needed
– If needed, make part of ETL load
Switch out old partitions and drop switch target
– Automate this
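Automating the sliding window starts with deciding which partitions have aged out. A minimal Python sketch of that decision follows, assuming monthly partitions identified by YYYYMM integers and a retention count; both the function name and the retention policy are hypothetical. The switch and drop themselves would be issued against the database and are not shown.

```python
def partitions_to_switch_out(partition_keys, keep):
    """Return the oldest partition keys beyond the newest `keep` ones."""
    ordered = sorted(partition_keys)
    if keep <= 0:
        return ordered
    return ordered[:-keep]


# Hypothetical state after the latest load: five monthly partitions.
loaded = [201201, 201202, 201203, 201204, 201205]
expired = partitions_to_switch_out(loaded, keep=3)
print(expired)
```

Running this as the last step of each ETL load keeps the window at a fixed length without manual intervention.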

Serve the Data
Self Service
– Tabular / Dimensional Cubes
– Excel / PowerPivot / PowerView
Fixed Reports
– Reporting Services
– PowerView
Don’t clean data in “serving engines”
– Materialise post-cleaned data as column in relational source

?? !!