1
Data Warehousing Best Practices
Douglas Barrett, Sr. Solutions Architect
2
Me
Grey hairs, lots of scars over 12 years
Retail Bank
Oracle 7 with OLAP Services
Microsoft Consulting
Many projects
WhereScape Consulting
Pre-sales, consultant, architect, Jack of all trades
WhereScape
A data warehousing company that provides products and services to build data warehouses pragmatically
Pragmatic = agile with a small a
WhereScape is a data warehousing company offering products and services. We started off in NZ (hence the funny accent), which is a very constrained market, and realised that nobody could spend a fortune on building the data warehouse. So we built a product to automate the build. That is not as easy as it sounds, but it is a mature product now that allows us to take an agile approach to data warehouse development.
3
Data Warehouse Projects
Have to align the stars to get started and build relevant, long-lived data warehouses:
Sponsorship & Governance
Project Approach & Team
DWH Implementation & Platform
Wider BI Environment
There is still a large number of failed, unloved and short-lived data warehouses. We use agile techniques during the build of a data warehouse. This is not the same as an agile methodology, which grew up in code development, but it borrows agile techniques: a small team of experienced people, stand-ups, frequent delivery cycles and engagement with the business. The big difference between traditional agile and agile data warehousing is the data. Data means changing code and refactoring the database at the same time, and deployment is much more interesting when you are promoting change into a production data warehouse that needs to retain its data.
4
The Warehouse Platform
Data Warehousing
The Warehouse Platform
Loading the Warehouse
Querying the Warehouse
From a technical point of view I have broken the data warehouse into three main parts: the platform, represented by the bucket; the data warehouse load, which extracts data from source systems and makes it presentable; and data warehouse querying, or getting data out of the bucket.
5
Data Warehouse vs Transactional Systems
Scan vs Seek
Sequential vs Random
Read vs Mixed R/W
Bulk Inserts vs Transactions
Low Concurrency vs High Concurrency
Scan Rate vs IOPS
Typically a data warehouse is loaded on a batch basis, perhaps once a night, and then it is effectively read-only. People run reports that cover a large chunk of data, e.g. sales per salesperson year to date. KPIs typically show a calculation over a big chunk of data. There are usually fewer concurrent users of a data warehouse or BI solution than of the source systems, so we are dealing with far fewer queries, but much bigger ones. The key metric for the effectiveness of the system is therefore scan rate, not IOPS.
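To make the scan-rate point concrete, here is a minimal back-of-the-envelope sketch in Python. The table size and scan rates are hypothetical figures chosen only for illustration; they do not come from the presentation.

```python
def scan_seconds(table_gb: float, scan_rate_mb_per_s: float) -> float:
    """Seconds needed to read a table sequentially at a sustained scan rate."""
    return table_gb * 1024 / scan_rate_mb_per_s


# Hypothetical figures: a 500 GB fact table read at three different scan rates.
fact_table_gb = 500
for rate in (200, 1000, 4000):
    minutes = scan_seconds(fact_table_gb, rate) / 60
    print(f"{rate:>5} MB/s -> {minutes:.1f} min")
```

A report that touches most of the fact table is bounded by this figure, however many small random reads the storage could serve per second.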
6
Inmon vs Kimball debate
Data Warehouse Design
Inmon vs Kimball debate
Chart labels: EDW and Dimensional, Pure Dimensional; axes Complexity and Data Volume
If you have worked in data warehousing you will know that there are two competing camps, Inmon and Kimball. They actually coexist. You can go for a pure dimensional data warehouse if your solution is simple or there are only one or two source systems. But when data volumes increase, there are multiple source systems, or there is a lot of complexity, I would suggest that you add an Enterprise Data Layer to the warehouse to keep the processing manageable.
7
Classic Processing Streams
Diagram labels: Sales, Ops, Financials, Marketing, Inventory, Financial, HR, Planning, Orders, Forecast
Complexity creeps up
Potentially source * datamart = data feeds
8
Classic Processing Streams
Diagram labels: Sales, Financials, EDW, DV, DS, Inventory, HR, Orders, Data, Business
Potentially source + datamart = data feeds
This is where an enterprise data repository sitting in the middle becomes more useful. Traditionally it takes a lot of time to build, which slows down delivery. Consider Data Vault: it is a modular, pattern-based design that is quick to implement and relatively simple to build, and it allows you to simplify the bit in the middle. It bridges the gap between Kimball and Inmon to provide both longevity and delivery.
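The two feed counts, source * datamart for point-to-point streams and source + datamart with a repository in the middle, can be compared with a small Python sketch. The counts of sources and data marts are hypothetical.

```python
def point_to_point_feeds(sources: int, datamarts: int) -> int:
    """Worst case without a central layer: every source feeds every data mart."""
    return sources * datamarts


def hub_feeds(sources: int, datamarts: int) -> int:
    """With a central repository: one feed in per source, one feed out per data mart."""
    return sources + datamarts


# Hypothetical sizes, for illustration only.
for n in (2, 4, 8):
    print(n, "sources and marts:", point_to_point_feeds(n, n), "vs", hub_feeds(n, n))
```

With two sources and two marts the counts are equal; the gap only opens up as the numbers grow, which matches the advice to stay purely dimensional while the solution is small.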
9
Processing Implementation
Diagram labels: Data Warehouse, Data Preparation, End User Layer, Source System, Load, Transform, Detail, Aggregate
Tables shown: Src Table, Source Files, Load Table, DV Table, Stg Table, Dim Table, Fact Table, Agg Table, OLAP
Incrementally load the data into the staging area of the warehouse. Publish the data to a data store, data vault or enterprise data layer, which maintains a raw, granular, historical record of the data that we load. Transform the data into a presentation layer data mart and then aggregate it for reporting and analysis. There are several layers of processing.
10
The Warehouse Platform
Small: Classic, <4TB
Medium: Fast Track, 2 to 48TB
Large: PDW, 20+TB
I have tried to categorize the different sets of practices into small, medium and large. Classic data warehousing is what has been built to date on standard SQL Server boxes. Fast Track best practices take a new approach that allows SQL Server to scale much higher on a single box that is optimized and balanced for data warehousing. Parallel Data Warehouse (PDW) is a massively parallel, shared-nothing server architecture for scaling the data warehouse over many servers acting as a single database. This takes Microsoft SQL Server up to compete with Teradata and Netezza, where data warehouses can reach hundreds of terabytes in size.
11
Classic Data Warehouse – SQL Setup
Beefy Windows Server with SQL Server
The basics:
Bulk / Simple Recovery Model
Pre-allocate file space
Turn auto-shrink off
Keep update stats on
Use all available disks
Separate Staging from EUL
The classic data warehouse is the smaller warehouse that only scales to a mere handful of terabytes, which covers the vast majority of medium and even larger sized organizations. First off, we would use a standard SQL Server installation, preferably Enterprise Edition (if we need some of the more advanced scalability features), sitting on a Windows server.
12
Classic Data Warehouse - Disk Layout
Data Files: Random I/O, RAID 5
Log Files: Sequential I/O, RAID 1+0
TempDb: Separate on own disks (SSD) or Data disk
Pre-allocate space
More spindles the better
Equal sized files, one per CPU
SAN often shared using OLTP optimization
A shared SAN will be optimized for OLTP or transactional operations, not for the scan-based workloads that the data warehouse will generate.
13
Loading the Classic Data Warehouse
Loading Basics
Load incrementally
Load as little as possible
Don’t use foreign keys in the star schema
Monitor fragmentation
Partition big tables
Sort in tempdb for indexes
SQL Server 2008+
MERGE: simplifies code
Minimal logging: improves load performance
CDC: assists change detection
Loads are typically I/O bound, so the less that needs to be written the better.
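As an illustration of "load incrementally, load as little as possible", here is a minimal high-water-mark sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, and the table names, column names and dates are all hypothetical.

```python
import sqlite3


def load_increment(conn, last_loaded):
    """Copy only the source rows modified after the previous high-water mark."""
    rows = conn.execute(
        "SELECT order_id, amount, modified_at FROM src_orders "
        "WHERE modified_at > ? ORDER BY modified_at",
        (last_loaded,),
    ).fetchall()
    conn.executemany("INSERT INTO load_orders VALUES (?, ?, ?)", rows)
    conn.commit()
    # New high-water mark: the latest change seen, or unchanged if nothing arrived.
    return rows[-1][2] if rows else last_loaded


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src_orders (order_id INTEGER, amount REAL, modified_at TEXT)")
conn.execute("CREATE TABLE load_orders (order_id INTEGER, amount REAL, modified_at TEXT)")
conn.executemany(
    "INSERT INTO src_orders VALUES (?, ?, ?)",
    [(1, 10.0, "2011-01-01"), (2, 20.0, "2011-01-02"), (3, 30.0, "2011-01-03")],
)

# Rows up to 2011-01-01 were loaded previously, so only orders 2 and 3 are copied.
mark = load_increment(conn, "2011-01-01")
print(mark)
```

The sketch assumes the source carries a reliable modification timestamp. Where it does not, change detection has to come from somewhere else, which is the gap CDC is meant to fill.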
14
Change Data Capture
Incremental loading
No changes to applications
Identify changes
Identify previous states
Unfortunately there is an impact on the source
No one is using it
Use in conjunction with Replication
15
Minimal Logging
Bulk loading in the database
30-60% performance improvement
TABLOCK hint on INSERT
Can be part of a transaction
Trace Flag 610
fn_dblog
Minimal logging turns a standard INSERT statement into a bulk-logged, or minimally logged, command.
16
Minimal Logging
Table columns: Table Indexes, Rows in table, Hints, Without TF 610, With TF 610, Concurrent possible
Table values: Heap; Any; TABLOCK; Minimal; Yes; None; Full; Heap + Index; Depends (3); No; Cluster; Empty; TABLOCK, ORDER (1); Yes (2); Cluster + Index
Trace flag 610 encourages minimal logging.
17
Minimal Logging
Common scenarios in the classic data warehouse.
18
The Classic Data Warehouse Model
Star Schema Design
Dimensions: short and fat
Clustered index on integer surrogate key
B-tree index on business key
B-tree index on other commonly used attributes
Facts: long and thin
Clustered index on primary date key
B-tree indexes on dimension keys
Compress fact tables
Aggregations using tables or views
Partition big fact tables (sliding window)
Use Analysis Services for ad-hoc reporting
The dimension tables in a star schema should be clustered on the identity integer surrogate key. This key always increments, so the index will not get fragmented. The business key should have a B-tree index, which is useful for key lookup operations during processing of the data warehouse. The fact table should be clustered in the order of an incrementing date that is commonly used to scan or sort the data, e.g. transaction date.
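The key lookup mentioned above can be sketched as follows. This uses Python's built-in sqlite3 module as a stand-in for SQL Server, so clustered indexes and compression are not represented, and every table, column and value is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_customer (
    customer_key  INTEGER PRIMARY KEY AUTOINCREMENT,  -- incrementing surrogate key
    customer_code TEXT NOT NULL,                      -- business key from the source
    customer_name TEXT
);
CREATE INDEX ix_dim_customer_code ON dim_customer (customer_code);
CREATE TABLE fact_sales (
    date_key     INTEGER NOT NULL,  -- e.g. 20110315
    customer_key INTEGER NOT NULL,
    amount       REAL NOT NULL
);
""")
conn.executemany(
    "INSERT INTO dim_customer (customer_code, customer_name) VALUES (?, ?)",
    [("C-100", "Acme"), ("C-200", "Globex")],
)


def surrogate_key(conn, customer_code):
    """Resolve a business key to its surrogate key via the business-key index."""
    row = conn.execute(
        "SELECT customer_key FROM dim_customer WHERE customer_code = ?",
        (customer_code,),
    ).fetchone()
    return row[0] if row else None


# Staged rows still carry the business key; swap it for the surrogate key on load.
staged = [(20110315, "C-200", 99.5), (20110316, "C-100", 12.0)]
conn.executemany(
    "INSERT INTO fact_sales VALUES (?, ?, ?)",
    [(d, surrogate_key(conn, code), amount) for d, code, amount in staged],
)
```

Every fact row needs this lookup for each of its dimensions, which is why the business-key index earns its place on the dimension table.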
19
Classic Data Warehouse Optimizations
SQL 2008+
Star Join Query Optimization
Compression
Partition Table Parallelism
Partition Aligned Indexed Views
20
Star Join Optimisation
Automatic (no coding or hints required)
Speeds up queries (15-20%)
Detects facts and dimensions based on row counts
Join reduction processing
Bitmap filters
These are extensions added to the optimizer. Show a query plan with bitmaps created on the dimension records and applied as a predicate on the fact table.
21
Compression
ROW and PAGE
Compresses storage
Less I/O
More data in memory
More CPU
PAGE includes ROW
PAGE compression minimizes the data redundancy in columns in one or more rows on a given page. It uses a proprietary implementation of the LZ78 (Lempel-Ziv) algorithm, storing the redundant data only once on the page and then referencing it from the multiple columns. Note that when you use PAGE compression, ROW compression is also included. ROW and PAGE compression can be enabled on a table or an index, or on one or more partitions of partitioned tables and indexes.
22
Partitioning
Improves large table management
Improves load times
Insert into an empty table & swap into fact
Remove indices
Improves query time
Partition pruning
Distribute partitions
Improves delete time: sliding window deletes
Reduces fragmentation
Added complexity
And watch out for statistics (full table)
Partitioning improves management of a large table by managing it as a set of smaller, related tables. Partitioning can also improve query time through pruning: a table scan reads only the partitions that are included in the query.
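Partition pruning and sliding window deletes can be pictured with a toy model in Python. This is only an in-memory sketch of the idea, not how SQL Server implements partitioning; the class, its methods and the monthly partitioning scheme are assumptions made for illustration.

```python
from collections import defaultdict


class PartitionedFact:
    """Fact rows grouped into one list per month partition, keyed by YYYYMM."""

    def __init__(self):
        self.partitions = defaultdict(list)

    def insert(self, date_key, amount):
        # A date key such as 20110315 lands in partition 201103.
        self.partitions[date_key // 100].append((date_key, amount))

    def total(self, first_month, last_month):
        """Sum amounts for a month range, touching only the matching partitions."""
        scanned = [m for m in sorted(self.partitions) if first_month <= m <= last_month]
        amount = sum(a for m in scanned for _, a in self.partitions[m])
        return amount, scanned

    def drop_partition(self, month):
        """Sliding window delete: discard a whole partition instead of deleting rows."""
        self.partitions.pop(month, None)


fact = PartitionedFact()
for date_key, amount in [(20110115, 10), (20110220, 20), (20110305, 30), (20110330, 5)]:
    fact.insert(date_key, amount)

# Only the February and March partitions are read; January is pruned.
print(fact.total(201102, 201103))
```

Dropping the oldest month is a single operation on one partition, which is why sliding window deletes are so much cheaper than row-by-row deletes.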
23
The Warehouse Platform – Fast Track
Reference architecture & design guidance
Principles:
Balanced systems
Predictable performance
Workload centric
The core of making SQL Server scale on a single server up to 48 TB is to make sure that the server is optimized for the data warehouse and that the data warehouse design makes the best use of the server. Scaling up is defined as providing predictable performance for a specific data warehouse workload.
24
System components
The whole design is optimized for scan-based operations, so in combination with a scan-based design the data warehouse can scale. Microsoft has worked with hardware vendors to make sure that they provide a set of balanced servers. Each component needs to be matched to support the data warehouse workload. A Fast Track system can use fewer disks than a traditional system built for random I/O.
25
Platform sizing
Science around platform sizing based on:
Maximum Core Consumption Rate (MCR)
Benchmark Consumption Rate (BCR)
User Data Capacity (UDC)
Then head to a hardware partner and buy a pre-configured data warehouse server
Pre-configured with Windows Server 2008 R2 and SQL Server 2008 R2
There are three key calculations:
MCR: SQL processing rate using standard data and a standard query. Maximum I/O bandwidth for the server.
BCR: measured with a query or set of queries that is representative of the data warehouse workload.
UDC: user data capacity, taking growth rates into account.
Tweaks are made for Windows and SQL Server:
Multi-path I/O enabled for Windows
-E: increases contiguous extents in each file allocated to a database table
-T1117: even growth of files in a filegroup
-T834: large page allocations in memory for the buffer pool
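A rough sketch of a consumption-rate calculation, in Python. It assumes that the per-core rate is estimated as logical page reads per CPU second multiplied by the 8 KB SQL Server page size; that formula, the function names and the measurement figures are assumptions for illustration and are not taken from the slides.

```python
PAGE_KB = 8  # SQL Server data pages are 8 KB


def core_consumption_rate(logical_reads: int, cpu_seconds: float) -> float:
    """MB/s processed by one core, assuming rate = pages per CPU second * page size."""
    return logical_reads / cpu_seconds * PAGE_KB / 1024


def server_consumption_rate(per_core_mb_s: float, cores: int) -> float:
    """Scale the per-core rate to the whole server, assuming linear scaling."""
    return per_core_mb_s * cores


# Hypothetical measurement: 256,000 logical reads in 10 CPU seconds on one core.
per_core = core_consumption_rate(256_000, 10)
print(per_core, server_consumption_rate(per_core, 16))
```

The same arithmetic run against a representative workload query, rather than a standard one, would give a benchmark-style figure to size the storage against.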
26
Disk layout
Spread database files across all data LUNs
Each filegroup should have a file on each data LUN
Each file should be of equal size
Files are distributed over a set of LUNs dedicated to data files. The files are set to an equal size over the LUNs, with logs on a separate set of LUNs. Files are set to grow at the same rate to keep the data distributed evenly over the drives.
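The layout rule, one equally sized file per data LUN for each filegroup, can be sketched as a small planning helper in Python. The mount points, filegroup name and file naming pattern are hypothetical.

```python
def file_plan(filegroup, data_luns, total_gb):
    """Plan one equally sized data file per data LUN for the given filegroup."""
    size_gb = total_gb / len(data_luns)
    return [
        ("{}/{}_{:02d}.ndf".format(lun, filegroup, n), size_gb)
        for n, lun in enumerate(data_luns, start=1)
    ]


# Hypothetical mount points for four data LUNs; log LUNs are planned separately.
luns = ["E:/data01", "F:/data02", "G:/data03", "H:/data04"]
for path, size_gb in file_plan("FG_SALES", luns, 400):
    print(path, size_gb)
```

Keeping the files the same size, and growing them at the same rate, is what keeps the data spread evenly across the LUNs.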
27
The Warehouse Design – Fast Track
Encourage scanning of contiguous data
Indexing light
Heaps of heaps
Encourages scans
Reduces maintenance overhead
Aggregation light
Avoid premature aggregation
Avoid fragmentation at all costs
The database design relies on high scan rates and encourages table scans. Fragmentation will undo all of this work by reducing the efficiency of the scan.
28
The Warehouse Design – Fast Track
Large tables:
Heap is good for full table scans
Partitioned heap will restrict the scan (pruning)
Clustered index for range-restricted scans
Secondary indexes where restrictive queries are common
Compress large tables
29
Fast Track - Fragmentation
Fragmentation occurs at:
File system: do not use OS defragmentation; pre-allocate file space
Extent: limit concurrent DML
Index: rebuild, not re-organise; use TempDB
Monitor using the DMV sys.dm_db_index_physical_stats
30
The Data Warehouse Design – Loading
Loading data
BULK INSERT with TABLOCK
Moving data
INSERT.. SELECT
Use TABLOCK
Use MAXDOP 1
Use partition switching or partition per load period
Use bulk insert and minimal logging where possible. No parallel DML operations.
31
The Warehouse Design – Statistics
Auto update and auto create statistics on
Manually update statistics for partitioned tables
Manually update statistics for increasing keys, e.g. date in the fact table
Distribution values for existing statistics do not include the new range.
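The increasing-key problem can be shown with a deliberately simplified toy model in Python. It is not how the SQL Server optimizer builds or reads its histograms; the class and the row counts are hypothetical, and the only point is that a distribution captured before a load knows nothing about the newest date range.

```python
from collections import Counter


class ToyStatistics:
    """Toy stand-in for column statistics: row counts per date key at refresh time."""

    def __init__(self, date_keys):
        self.refresh(date_keys)

    def refresh(self, date_keys):
        """Manual statistics update: rebuild the distribution from current data."""
        self.histogram = Counter(date_keys)

    def estimate_rows(self, date_key):
        """Estimated row count for one date key, based on the last refresh only."""
        return self.histogram.get(date_key, 0)


fact_dates = [20110101] * 1000 + [20110102] * 1200
stats = ToyStatistics(fact_dates)

fact_dates += [20110103] * 900             # tonight's load adds a new date range
before = stats.estimate_rows(20110103)     # stale statistics do not see the new range
stats.refresh(fact_dates)
after = stats.estimate_rows(20110103)
print(before, after)
```

Reports usually ask about the newest dates first, so the stale estimate hits exactly the queries that matter most after a load.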
32
Completing The Value Proposition…
Hardware partners
Development tool to build/manage the data warehouse
Fast Track partners of Microsoft: there are several hardware vendors but only one software vendor.
33
Parallel Data Warehouse
Massively Parallel, Shared Nothing Architecture
Love to see one of these.
34
Denali Data Warehouse
Column Store Index
Vertipaq: columns in pages, not rows.
35
Data Warehouse best practices
Data Warehouse
Platform
Loading
Querying
Questions: