Data Warehousing Best Practices


Douglas Barrett, Sr. Solutions Architect

Me Grey hairs and lots of scars over 12 years: a retail bank, Oracle 7 with OLAP services, Microsoft Consulting, many projects. At WhereScape: pre-sales, consultant, architect, jack of all trades. WhereScape is a data warehousing company – it provides products and services to build data warehouses pragmatically (pragmatic = agile with a small "a"). We started off in NZ (hence the funny accent), which is a very constrained market, and realised that nobody could spend a fortune on building a data warehouse. So we built a product to automate the build – not as easy as it sounds – and it is now a mature product that allows us to take an agile approach to data warehouse development.

Data Warehouse Projects You have to align the stars to get started and build a relevant, long-lived data warehouse: Sponsorship & Governance; Project Approach & Team; DWH Implementation & Platform; the Wider BI Environment. There are still a large number of failed, unloved and short-lived data warehouses. We use agile techniques during the build of a data warehouse – not the same as an agile methodology, which grew up in code development – but agile techniques including a small team of experienced people, stand-ups, frequent delivery cycles and engagement with the business. The big difference between traditional agile and agile DWH is data: data means changing code and refactoring the database at the same time, and deployment is much more interesting when you are promoting change into a production data warehouse that needs to retain its data.

Data Warehousing The Warehouse Platform; Loading the Warehouse; Querying the Warehouse. From a technical point of view I have broken the data warehouse into three main parts: the platform (represented by the bucket), the data warehouse load (extract data from source systems and make it presentable), and data warehouse querying, or getting data out of the bucket.

Data Warehouse vs Transactional Systems Scan vs Seek; Sequential vs Random; Read vs Mixed R/W; Bulk Inserts vs Transactions; Low Concurrency vs High Concurrency; Scan Rate vs IOPS. Typically a data warehouse is loaded on a batch basis – perhaps once a night – and is then effectively read-only. People will be running reports over a large chunk of the data, e.g. sales per salesperson year to date; KPIs typically show a calculation over a big chunk of data. There are usually far fewer concurrent users of a data warehouse or BI solution than there are of the source systems, so we are dealing with far fewer queries, but much bigger ones. The key metric for determining the effectiveness of the system is scan rate, not IOPS.

Data Warehouse Design – the Inmon vs Kimball debate [Chart: complexity vs data volume, showing where a pure dimensional design fits and where an EDW plus dimensional design takes over.] If you have worked in data warehousing you will know that there are two competing camps – Inmon and Kimball. They can actually coexist. You can go for a pure dimensional data warehouse if your solution is simple or there are only one or two source systems. But when data volumes increase, or there are multiple source systems, or there is a lot of complexity, I would suggest that you build an enterprise data layer into the warehouse to keep the processing manageable.

Classic Processing Streams [Diagram: source systems (Sales, Orders, Financials, Inventory, HR) feeding data marts (Ops, Marketing, Financial, Planning, Forecast) directly.] Complexity creeps up: with every source feeding every mart, there are potentially (sources × data marts) data feeds.

Classic Processing Streams – with an EDW [Diagram: the same source systems (Sales, Orders, Financials, Inventory, HR) feeding an EDW / Data Vault / Data Store in the middle, which in turn feeds the data marts.] This is where an enterprise data repository sitting in the middle becomes more useful. Traditionally it takes a lot of time to build, slowing down delivery. Consider Data Vault: it is a modular, pattern-based design that is quick to implement and relatively simple to build, and it allows you to simplify the bit in the middle. It bridges the gap between Kimball and Inmon to provide both longevity and delivery. With the EDW in the middle there are potentially (sources + data marts) data feeds. http://danlinstedt.com/

Processing Implementation [Diagram: Source System → Data Preparation (Load, Transform) → Data Warehouse (Detail, Aggregate) → End User Layer; source tables and files flow through Load tables, Data Vault tables and Staging tables into Dimension, Fact and Aggregate tables, and on to OLAP.] Incrementally load the data into the staging area of the warehouse. Publish the data to a data store, data vault or enterprise data layer, which maintains a raw, granular, historical record of the data that we load. Transform the data into a presentation-layer data mart and then aggregate it for reporting and analysis. There are several layers of processing.

The Warehouse Platform Small: Classic, <4 TB. Medium: Fast Track, 2–48 TB. Large: PDW, 20+ TB. I have tried to categorize the different sets of practices into small, medium and large. Classic data warehousing is what has been built to date on standard SQL Server boxes. The Fast Track best practices take a new approach to data warehousing that allows SQL Server to scale up much higher on a single box optimized and balanced for data warehousing. Parallel Data Warehouse is a massively parallel, shared-nothing server architecture for scaling the data warehouse over many servers acting as a single database. It takes Microsoft SQL Server up to compete with Teradata and Netezza, where data warehouses can reach hundreds of terabytes in size.

Classic Data Warehouse – SQL Setup A beefy Windows server with SQL Server. The basics: use the Bulk-Logged or Simple recovery model; pre-allocate file space; turn auto-shrink off; keep auto-update statistics on; use all available disks; separate Staging from the EUL. The classic data warehouse is the smaller-sized warehouse, scaling to a mere handful of terabytes – which covers the vast majority of medium and even larger-sized organizations. First off we would use a standard SQL Server installation – preferably Enterprise Edition, if we need some of the more advanced scalability features – sitting on a Windows server.
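A minimal sketch of these basic settings in T-SQL; the database and file names (DW, DW_data) are illustrative:

    ALTER DATABASE DW SET RECOVERY BULK_LOGGED;       -- or SIMPLE
    ALTER DATABASE DW SET AUTO_SHRINK OFF;
    ALTER DATABASE DW SET AUTO_UPDATE_STATISTICS ON;
    -- Pre-allocate file space rather than relying on small autogrow increments
    ALTER DATABASE DW MODIFY FILE (NAME = DW_data, SIZE = 500GB, FILEGROWTH = 10GB);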

Classic Data Warehouse - Disk Layout Data files: random I/O – RAID 5. Log files: sequential I/O – RAID 1+0. TempDB: separate it onto its own disks (SSD) or the data disks; pre-allocate space; the more spindles the better; equal-sized files, one per CPU. A SAN is often shared and tuned for OLTP: a shared SAN will be optimized for transactional operations, not for the scan-based workloads that the data warehouse will generate.
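A hedged sketch of the tempdb layout described above – equally sized, pre-allocated data files, one per CPU core; file names, the T: drive path and sizes are illustrative:

    ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev,  SIZE = 8GB, FILEGROWTH = 0);
    ALTER DATABASE tempdb ADD FILE    (NAME = tempdev2, FILENAME = 'T:\tempdb2.ndf', SIZE = 8GB, FILEGROWTH = 0);
    ALTER DATABASE tempdb ADD FILE    (NAME = tempdev3, FILENAME = 'T:\tempdb3.ndf', SIZE = 8GB, FILEGROWTH = 0);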

Loading the Classic Data Warehouse Loading basics: load incrementally; load as little as possible; don't use foreign keys in the star schema; monitor fragmentation; partition big tables; sort in tempdb for indexes. SQL Server 2008+: MERGE simplifies code; minimal logging improves load performance; CDC assists change detection. Loading is typically I/O bound – the less that needs to be written the better. (A MERGE sketch follows below.)
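A hedged sketch of using MERGE to simplify dimension-load code; the table and column names (DimCustomer, stage.Customer, CustomerBK) are illustrative:

    MERGE dbo.DimCustomer AS tgt
    USING stage.Customer AS src
        ON tgt.CustomerBK = src.CustomerBK          -- match on the business key
    WHEN MATCHED THEN
        UPDATE SET tgt.Name = src.Name, tgt.City = src.City
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerBK, Name, City)
        VALUES (src.CustomerBK, src.Name, src.City);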

Change Data Capture Supports incremental loading with no changes to the applications: it identifies changes and previous states. Unfortunately there is an impact on the source, and in practice almost no one is using it. Use it in conjunction with replication.
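Enabling CDC takes two documented system procedure calls; the schema and table names (dbo.Orders) are illustrative:

    EXEC sys.sp_cdc_enable_db;
    EXEC sys.sp_cdc_enable_table
        @source_schema = N'dbo',
        @source_name   = N'Orders',
        @role_name     = NULL;    -- no gating role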

Minimal Logging – Bulk Loading in the Database 30–60% performance improvement**; TABLOCK hint on INSERT; can be part of a transaction; trace flag 610; fn_dblog. Minimal logging turns a standard INSERT statement into a bulk-logged, or minimally logged, command.
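A hedged sketch of a minimally logged INSERT … SELECT, with fn_dblog used to compare the volume of log records before and after; the table names are illustrative:

    DBCC TRACEON (610, -1);                    -- trace flag 610, instance-wide

    INSERT INTO dbo.FactSales WITH (TABLOCK)   -- TABLOCK hint enables minimal logging
    SELECT DateKey, CustomerKey, Amount
    FROM stage.Sales;

    SELECT COUNT(*) FROM fn_dblog(NULL, NULL); -- inspect the log record count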

Minimal Logging
Table Indexes    Rows in table  Hints               Without TF 610  With TF 610  Concurrent possible
Heap             Any            TABLOCK             Minimal         Minimal      Yes
Heap             Any            None                Full            Full         Yes
Heap + Index     Any            TABLOCK             Full            Depends (3)  No
Cluster          Empty          TABLOCK, ORDER (1)  Minimal         Minimal      No
Cluster          Empty          None                Full            Minimal      Yes (2)
Cluster          Any            None                Full            Minimal      Yes (2)
Cluster          Any            TABLOCK             Full            Minimal      No
Cluster + Index  Any            None                Full            Depends (3)  Yes (2)
Cluster + Index  Any            TABLOCK             Full            Depends (3)  No
Trace flag 610 encourages minimal logging.

Minimal Logging (The same matrix as the previous slide.) These are the common scenarios in a classic data warehouse.

The Classic Data Warehouse Model – Star Schema Design Dimensions – short and fat: clustered index on the integer surrogate key; B-tree index on the business key; B-tree indexes on other commonly used attributes. Facts – long and thin: clustered index on the primary date key; B-tree indexes on the dimension keys; compress fact tables; build aggregations using tables or views; partition big fact tables (sliding window); use Analysis Services for ad-hoc reporting. The dimension tables in a star schema should be clustered on the identity integer surrogate key – it increments, so it will not get fragmented. The business key should have a B-tree index; this will be useful for key-lookup operations during processing of the data warehouse. The fact table should be clustered in the order of an incrementing date that is commonly used to scan or sort the data, e.g. transaction date. (See the DDL sketch below.)
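A hedged DDL sketch of this indexing pattern; all table and column names are illustrative:

    CREATE TABLE dbo.DimCustomer (
        CustomerKey int IDENTITY(1,1) NOT NULL,     -- surrogate key
        CustomerBK  varchar(20)  NOT NULL,          -- business key
        Name        nvarchar(100) NULL,
        City        nvarchar(50)  NULL,
        CONSTRAINT PK_DimCustomer PRIMARY KEY CLUSTERED (CustomerKey)
    );
    CREATE NONCLUSTERED INDEX IX_DimCustomer_BK ON dbo.DimCustomer (CustomerBK);

    CREATE TABLE dbo.FactSales (
        TransactionDate date  NOT NULL,
        CustomerKey     int   NOT NULL,
        ProductKey      int   NOT NULL,
        Amount          money NOT NULL
    );
    -- Cluster on the incrementing date commonly used to scan/sort the data
    CREATE CLUSTERED INDEX CIX_FactSales ON dbo.FactSales (TransactionDate);
    CREATE NONCLUSTERED INDEX IX_FactSales_Customer ON dbo.FactSales (CustomerKey);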

Classic Data Warehouse Optimizations – SQL 2008+ Star join query optimization; compression; partition table parallelism; partition-aligned indexed views.

Star Join Optimisation Automatic (no coding or hints required); speeds up queries (15–20%); detects facts and dimensions based on row counts; join reduction processing via bitmap filters. These are extensions added to the optimizer. The query plan shows bitmaps created on the dimension records going into a predicate on the fact.

Compression ROW and PAGE compression shrink storage: less I/O and more data in memory, at the cost of more CPU. PAGE includes ROW. PAGE compression minimizes the data redundancy in columns across one or more rows on a given page. It uses a proprietary implementation of the LZ78 (Lempel-Ziv) algorithm, storing the redundant data only once on the page and then referencing it from the multiple columns. Note that when you use PAGE compression, ROW compression is also included. ROW and PAGE compression can be enabled on a table, an index, or one or more partitions of partitioned tables and indexes.
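A hedged sketch: estimate the savings first, then enable PAGE compression on the fact table (the table name is illustrative):

    EXEC sp_estimate_data_compression_savings
         @schema_name      = 'dbo',
         @object_name      = 'FactSales',
         @index_id         = NULL,
         @partition_number = NULL,
         @data_compression = 'PAGE';

    ALTER TABLE dbo.FactSales REBUILD WITH (DATA_COMPRESSION = PAGE);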

Partitioning Improves large-table management. Improves load times: insert into an empty table and swap it into the fact; remove indices. Improves query time: partition pruning; distributed partitions. Improves delete time: sliding-window deletes; reduced fragmentation. The cost is added complexity – and watch out for statistics (full table). Partitioning improves the management of large tables by treating one table as a set of smaller, related tables. It can also improve query time through pruning during a table scan: only the partitions included in the query are scanned. (A sliding-window sketch follows below.)
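A hedged sketch of a monthly sliding-window pattern; function, scheme, table names, dates and partition numbers are all illustrative, and the staging table must match the fact's structure and sit on the same filegroup for the switch to work:

    CREATE PARTITION FUNCTION pfMonthly (date)
    AS RANGE RIGHT FOR VALUES ('2011-01-01', '2011-02-01', '2011-03-01');

    CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- Load into an empty staging table, then switch it in (metadata-only)
    ALTER TABLE stage.FactSales_201103 SWITCH TO dbo.FactSales PARTITION 4;

    -- Slide the window: switch the oldest partition out to an archive table
    ALTER TABLE dbo.FactSales SWITCH PARTITION 1 TO archive.FactSales_201012;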

The Warehouse Platform – Fast Track A reference architecture and design guidance. The core to making SQL Server scale on a single server up to 48 TB is to make sure that the server is optimized for the data warehouse and that the data warehouse design optimizes usage of the server. Scaling up is defined as providing predictable performance for a specific data warehouse workload. Principles: balanced systems; predictable performance; workload centric.

System components The whole design is optimized for scan-based operations – so in combination with a scan-based design the data warehouse can scale. Microsoft has worked with hardware vendors to make sure that they provide a set of balanced servers; each component is matched to support the DWH workload. You can use fewer disks than in a traditional system built for random I/O.

Platform sizing There is science around platform sizing, based on: Maximum Core Consumption Rate (MCR); Benchmark Consumption Rate (BCR); User Data Capacity (UDC). Then head to a hardware partner and buy a pre-configured DWH server, delivered with Windows 2008 R2 and SQL 2008 R2. The three key calculations: MCR – the SQL processing rate per core using standard data and a standard query, i.e. the maximum I/O bandwidth for the server; BCR – the consumption rate for a query or set of queries that are definitive of the DWH workload; UDC – user data capacity, taking growth rates into account. Tweaks are made to Windows and SQL: multi-path I/O enabled in Windows; -E increases the contiguous extents in each file allocated to a database table; T1117 gives even growth of files in a filegroup; T834 gives large page allocations in memory for the buffer pool.
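A hedged sketch of how MCR is typically measured under the Fast Track methodology – run a representative query from the buffer cache at a fixed MAXDOP and derive MB/s per core from logical reads and CPU time; the table name is illustrative and the formula is my paraphrase of the Fast Track guidance:

    SET STATISTICS IO ON;
    SET STATISTICS TIME ON;

    -- Run from cache so the measurement reflects CPU consumption, not disk
    SELECT COUNT_BIG(*) FROM dbo.FactSales OPTION (MAXDOP 4);

    -- MCR (MB/s per core) ~= (logical reads / CPU time in seconds) * 8 / 1024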

Disk layout Spread the database files across all data LUNs: each filegroup should have a file on each data LUN, and each file should be of equal size. The files are distributed over a set of LUNs dedicated to data files and set to an equal size, with the logs on a separate set of LUNs. Files are set to grow at the same rate to keep the data distributed evenly over the drives. (See the sketch below.)
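A hedged sketch: one equally sized file per data LUN in the filegroup, with the log on its own LUN; drive letters, names and sizes are illustrative:

    CREATE DATABASE DW ON PRIMARY
        (NAME = DW_01, FILENAME = 'E:\DW_01.mdf', SIZE = 100GB, FILEGROWTH = 10GB),
        (NAME = DW_02, FILENAME = 'F:\DW_02.ndf', SIZE = 100GB, FILEGROWTH = 10GB),
        (NAME = DW_03, FILENAME = 'G:\DW_03.ndf', SIZE = 100GB, FILEGROWTH = 10GB)
    LOG ON
        (NAME = DW_log, FILENAME = 'L:\DW_log.ldf', SIZE = 50GB, FILEGROWTH = 5GB);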

The Warehouse Design – Fast Track Encourage scanning of contiguous data. Indexing light – heaps of heaps: this encourages scans and reduces maintenance overhead. Aggregation light – avoid premature aggregation. Avoid fragmentation at all costs. The database design relies on high scan rates and encourages table scans; fragmentation will undo all of this work by reducing the efficiency of the scan.

The Warehouse Design – Fast Track Large tables: a heap is good for full table scans; a partitioned heap will restrict the scan (pruning); a clustered index suits range-restricted scans; add secondary indexes where restrictive queries are common; compress large tables.

Fast Track - Fragmentation Fragmentation occurs at three levels. File system: do not use OS defragmentation; pre-allocate file space. Extent: limit concurrent DML. Index: rebuild, do not reorganise, and use tempdb for the sort. Monitor using the DMV sys.dm_db_index_physical_stats. (A monitoring sketch follows below.)
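A hedged sketch: check fragmentation with the DMV named above, then rebuild (not reorganise) with the sort in tempdb; the table name and 10% threshold are illustrative:

    SELECT OBJECT_NAME(ips.object_id) AS table_name,
           ips.index_id,
           ips.avg_fragmentation_in_percent
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    WHERE ips.avg_fragmentation_in_percent > 10;

    ALTER INDEX ALL ON dbo.FactSales REBUILD WITH (SORT_IN_TEMPDB = ON);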

The Data Warehouse Design – Loading Loading data: BULK INSERT with TABLOCK. Moving data: INSERT … SELECT with TABLOCK and MAXDOP 1; use partition switching or a partition per load period. Use bulk insert and minimal logging where possible. No parallel DML operations. (See the sketch below.)
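A hedged sketch of both steps; the file path, table and column names are illustrative:

    -- Loading data: bulk insert from a flat file with TABLOCK
    BULK INSERT stage.Sales
    FROM 'D:\extract\sales.csv'
    WITH (TABLOCK, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n');

    -- Moving data: INSERT .. SELECT with TABLOCK, serial plan
    INSERT INTO dbo.FactSales WITH (TABLOCK)
    SELECT DateKey, CustomerKey, Amount
    FROM stage.Sales
    OPTION (MAXDOP 1);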

The Warehouse Design – Statistics Leave auto create and auto update statistics on. Manually update statistics for partitioned tables, and for increasing keys such as the date in a fact table: the distribution values in the existing statistics do not include the newly loaded range.
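A hedged sketch: after each load, refresh the statistics on the increasing date key so the histogram covers the new range; the table and index names are illustrative:

    UPDATE STATISTICS dbo.FactSales (CIX_FactSales) WITH FULLSCAN;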

Completing The Value Proposition… Hardware partners, plus a development tool to build and manage the data warehouse. There are several Fast Track hardware partners of Microsoft, but only one software vendor.

Parallel Data Warehouse A massively parallel, shared-nothing architecture. I'd love to see one of these.

Denali Data Warehouse Column store indexes: VertiPaq – columns stored in pages, not rows.
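A hedged sketch of a SQL Server 2012 (Denali) nonclustered columnstore index; the table and column names are illustrative, and note that in SQL Server 2012 the table is read-only while the columnstore index exists:

    CREATE NONCLUSTERED COLUMNSTORE INDEX csi_FactSales
    ON dbo.FactSales (TransactionDate, CustomerKey, ProductKey, Amount);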

Data Warehouse Best Practices The Data Warehouse Platform; Loading; Querying. Questions: douglas.barrett@wherescape.com