Optimizing SQL Server and Databases for Large Fact Tables
=tg= Thomas Grohser, NTT Data
SQL Server MVP, SQL Server Performance Engineering
SQL Saturday #575, December 10th 2016, Providence, RI
select * from =tg= where topic = @@Version

SQL 4.21 - First SQL Server ever used (1994)
SQL 6.0 - First log shipping with failover
SQL 6.5 - First SQL Server cluster (NT 4.0 + Wolfpack)
SQL 7.0 - 2+ billion rows/month in a single table
SQL 2000 - 938 days with 100% availability
SQL 2000 IA64 - First SQL Server on Itanium IA64
SQL 2005 IA64 - First OLTP long-distance database mirroring
SQL 2008 IA64 - First replication into mirrored databases
SQL 2008R2 IA64 / SQL 2008R2 x64 - First 256 CPUs and >500,000 statements/sec; first scale-out >1,000,000 statements/sec; first time 1.2+ trillion rows in a table
SQL 2012 - >220,000 transactions per second; >1.3 trillion rows in a table
SQL 2014 - >400,000 transactions per second; fully automated deployment and management
SQL 2016 - AlwaysOn automatic HA and DR; crossed the PB mark in storage
SQL vNext - Can't wait to push the limits even further

=tg= Thomas Grohser, NTT DATA
Senior Director Technical Solutions Architecture
email: Thomas.grohser@nttdata.com / tg@grohser.com
Focus on SQL Server security, performance engineering, infrastructure, and architecture
New papers coming in 2016
Close relationship with SQLCAT (SQL Server Customer Advisory Team), SCAN (SQL Server Customer Advisory Network), TAP (Technology Adoption Program), and the product teams in Redmond
Active PASS member and PASS Summit speaker
22 years with SQL Server
NTT DATA Overview

Why NTT DATA for MS Services:
20,000 professionals - optimizing balanced global delivery
$1.6B - annual revenues with a history of above-market growth
Long-term relationships - >1,000 clients, mid-market to large enterprise
Delivery excellence - enabled by process maturity, tools, and accelerators
Flexible engagement - spans consulting, staffing, managed services, outsourcing, and cloud
Industry expertise - driving depth in select industry verticals

NTT DATA is a Microsoft Gold Certified Partner. We cover the entire MS stack, from applications to infrastructure to the cloud. Proven track record with 500+ MS solutions delivered in the past 20 years.
Agenda
Defining the issue/problem
Looking at the tools
Using the right tools
Q&A

ATTENTION: Important information may be displayed on any slide at any time! Without warning!
Definition of a large fact table
A moving, individual target over time:
2001: for me, big was > 1 billion rows, > 90 GB
2011: for me, big was > 1.3 trillion rows, > 250 TB
2016: ??? 10 PB ???
Size matters not! Having the right tools in place and knowing how to use them to handle the data is the solution.
The Problem
Trying to run 30 reports on a big fact table, each of which needs to scan the whole table.
The data is ready at 5 AM.
Reports need to be ready by 9 AM.
The baseline: each report takes about 2 hours to finish (30 reports x ~2 hours is roughly 60 hours of scan work to fit into a 4-hour window).
Tools
Hardware (server, storage)
SQL Server (Standard, (BI), Enterprise)
Clever configuration
Clever query scheduling
Good news for people with SA
Hardware “The grade of steel”
CPU is not the limit
On a modern CPU, each core can process about 500 MB/s.
How many cores do we have in a commodity server?
4-22 cores per socket (that's 4 more since April 2016), 1-8 sockets
That's 4 to 176 cores, or ~2 to ~88 GB per second, or ~7 to ~300 TB per hour.
CPU capacity is rarely the bottleneck.
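As a back-of-the-envelope check of those numbers (a minimal sketch, assuming the ~500 MB/s per core figure from the slide above):

```sql
-- Rough scan-throughput envelope, assuming ~500 MB/s per core
SELECT
    4   * 500 / 1024.0           AS min_gb_per_sec,  -- 4 cores   ~  2 GB/s
    176 * 500 / 1024.0           AS max_gb_per_sec,  -- 176 cores ~ 86 GB/s
    176 * 500 * 3600 / 1048576.0 AS max_tb_per_hour; -- ~300 TB/h
```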
Understanding how SQL Server scans data
SQL Server reads the data page by page.
SQL Server may perform read-ahead and dynamically adjusts the read-ahead size per table:
Standard Edition: up to 128 pages
Enterprise Edition: up to 512 pages
That's up to 1 MB (Std) or 4 MB (Ent) per read.
Read ahead as much as possible. Why? Reading 4 MB takes about as long as reading 8 KB.
So let's help SQL Server do it.
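A quick way to watch read-ahead at work is SET STATISTICS IO; this is just an illustrative sketch, and dbo.FactSales is a hypothetical fact table name:

```sql
-- Force a full scan and look at the "read-ahead reads" counter in the
-- Messages tab; dbo.FactSales is a hypothetical fact table.
SET STATISTICS IO ON;

SELECT COUNT(*)
FROM dbo.FactSales;

SET STATISTICS IO OFF;
-- A high "read-ahead reads" count relative to "physical reads" means
-- SQL Server was able to pre-fetch large contiguous chunks.
```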
Read-ahead happens if ...
the next data needed is in contiguous pages on the disk.
Problem: two or more tables that grow at the same time end up with their pages interleaved.
Multiple Data Files
[Diagram: examples of how consecutive pages end up distributed across two data files, e.g. 1-3-5-7-9-... / 2-4-6-8-..., 1-2-4-5-7-8-... / 3-6-9-..., 1-3 5 8-9 ... / 2-4 6-7 ...]
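One way to put this in place is to add several equally sized data files to the filegroup so SQL Server's proportional-fill algorithm spreads allocations across them. A minimal sketch with hypothetical database, file, and path names:

```sql
-- Hypothetical names/paths: spread the primary filegroup over additional files
ALTER DATABASE BigDW
ADD FILE
    (NAME = N'BigDW_Data2', FILENAME = N'E:\Data\BigDW_Data2.ndf',
     SIZE = 100GB, FILEGROWTH = 10GB),
    (NAME = N'BigDW_Data3', FILENAME = N'F:\Data\BigDW_Data3.ndf',
     SIZE = 100GB, FILEGROWTH = 10GB);
```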
Multiple File Groups
[Diagram: with one filegroup per table (FG1, FG2), each table's pages stay contiguous: 1-2-3-4-5-6-7-8-9-... in FG1 and 1-2-3-4-5-6-7-8-9-... in FG2]
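A sketch of the filegroup-per-fact-table idea (all names are hypothetical): each large table lives in its own filegroup, so its extents stay contiguous within that filegroup's files.

```sql
-- Hypothetical names: one filegroup (and file) per large fact table
ALTER DATABASE BigDW ADD FILEGROUP FG1;
ALTER DATABASE BigDW ADD FILEGROUP FG2;

ALTER DATABASE BigDW ADD FILE
    (NAME = N'BigDW_FG1_1', FILENAME = N'E:\Data\BigDW_FG1_1.ndf', SIZE = 500GB)
TO FILEGROUP FG1;

ALTER DATABASE BigDW ADD FILE
    (NAME = N'BigDW_FG2_1', FILENAME = N'F:\Data\BigDW_FG2_1.ndf', SIZE = 500GB)
TO FILEGROUP FG2;

-- Each fact table is created on its own filegroup
CREATE TABLE dbo.FactSales
(
    SaleDate   date  NOT NULL,
    CustomerId int   NOT NULL,
    Amount     money NOT NULL
) ON FG1;
```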
SQL Server Startup Options
-E can be your friend if you have large tables.
-E allocates 64 extents at a time, that is, 4 MB at a time for each table instead of 64 KB.
The cost: every table is at least 4 MB (including all the ones in tempdb!).
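To verify whether -E (or any other startup parameter) is configured, the instance's startup arguments can be read from sys.dm_server_registry; a small sketch:

```sql
-- List the configured startup parameters (SQLArg0, SQLArg1, ...);
-- check whether one of them is -E
SELECT value_name, value_data
FROM sys.dm_server_registry
WHERE registry_key LIKE N'%MSSQLServer\Parameters'
  AND value_name LIKE N'SQLArg%';
```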
Multiple Data Files Revisited
IO and Storage Path
Read speed factor - Direct Attached
[Diagram: relative read speed factors for direct-attached storage - 1X / RAID 5: 0.25-4X, 1X / RAID 1: 1-2X, 2X / RAID 5: 0.5-2X]
Read speed factor - SAN
On a SAN, the paths to the array are most likely the limiting factor.
Ensure there are enough paths to the array.
Try disabling the read cache if possible (most of the time this makes it faster).
[Diagram: relative read speed factors - 1X, 1X, 1X, 2X]
Understand the path to the drives
[Diagram: the I/O path from server to storage - HBA or RAID controller, switch, fibre channel ports, controllers/processors, and cache on the SAN side vs. direct-attached (DAS), down to the SSD / NVRAM drives]
IO Bottlenecks
Rotating disks (10-160 MB/sec): ~0.1 GB/s
Disk interface / SSD (3-12 Gb/sec): ~1 GB/s
RAID controller (1-8 GB/sec): ~8 GB/s
Ethernet (1 or 10 Gb/sec): ~1 GB/s
Fibre Channel (2-16 Gb/sec): ~2 GB/s
Host bus adapter (2-32 Gb/sec): ~4 GB/s
PCI Express bus (0.25-32 GB/sec): ~32 GB/s
System (4-16 PCIe buses): ~512 GB/s
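To see which files the read I/O actually hits (and whether reads are stalling somewhere along that path), sys.dm_io_virtual_file_stats gives cumulative numbers since instance start; a minimal sketch:

```sql
-- Cumulative read volume and read stalls per database file since startup
SELECT  DB_NAME(vfs.database_id)          AS database_name,
        mf.physical_name,
        vfs.num_of_bytes_read / 1048576.0 AS mb_read,
        vfs.io_stall_read_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
      ON  mf.database_id = vfs.database_id
      AND mf.file_id     = vfs.file_id
ORDER BY vfs.num_of_bytes_read DESC;
```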
Schema and Indexes
Table Partitioning
A great tool for making database maintenance easier, but it does not give us much in terms of performance; it could actually slow us down.
It might still be needed to spread data across multiple filegroups.
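A minimal sketch of partitioning a fact table by month across two filegroups (all names, the date column, and the boundary values are hypothetical):

```sql
-- Hypothetical example: monthly partitions spread across filegroups FG1 and FG2
CREATE PARTITION FUNCTION pfMonth (date)
AS RANGE RIGHT FOR VALUES ('2016-01-01', '2016-02-01', '2016-03-01');

CREATE PARTITION SCHEME psMonth
AS PARTITION pfMonth TO (FG1, FG2, FG1, FG2);   -- 3 boundaries => 4 partitions

CREATE TABLE dbo.FactOrders
(
    OrderDate  date  NOT NULL,
    CustomerId int   NOT NULL,
    Amount     money NOT NULL
) ON psMonth (OrderDate);
```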
Row and Page Compression
ROW compression: almost no overhead; can save several unused bytes in each row. Remember: 1 byte less on 1 billion rows is 1 GB.
PAGE compression: some overhead; can save a lot on repeating patterns (same values within a page). New data is not compressed!
Never compress lookup data.
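A sketch of how this could be applied to a (hypothetical) fact table: estimate the savings first, then rebuild with the chosen compression level.

```sql
-- Estimate how much PAGE compression would save on dbo.FactSales (hypothetical name)
EXEC sys.sp_estimate_data_compression_savings
     @schema_name      = N'dbo',
     @object_name      = N'FactSales',
     @index_id         = NULL,
     @partition_number = NULL,
     @data_compression = N'PAGE';

-- Rebuild with PAGE compression (use ROW instead for lower CPU overhead)
ALTER TABLE dbo.FactSales
REBUILD WITH (DATA_COMPRESSION = PAGE);
```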
Merry-Go-Round (Piggyback) Scan
[Diagram: Query 1 and Query 2 sharing the same table scan]
Enterprise Edition only
Automatically invoked
With planning, much better results
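The planning part mostly means kicking off reports that scan the same fact table at roughly the same time, so later scans can attach to the one already in flight. A hypothetical sketch with two such reports:

```sql
-- Two reports that each scan the whole (hypothetical) fact table.
-- On Enterprise Edition, if report B starts while report A is still scanning,
-- B can piggyback on A's scan and the table is effectively read only once.

-- Session 1: report A
SELECT CustomerId, SUM(Amount) AS TotalAmount
FROM dbo.FactSales
GROUP BY CustomerId;

-- Session 2: report B (started concurrently)
SELECT SaleDate, COUNT(*) AS RowCnt
FROM dbo.FactSales
GROUP BY SaleDate;
```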
Column Store Index
With SQL 2016 it is finally fully usable (updateable without workarounds, can be the clustered index).
~40% faster than before.
Awesome compression ratios.
Even better results if a lot of queries only need a few columns of the fact table.
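A minimal sketch of converting the (hypothetical) fact table to a clustered columnstore index on SQL Server 2016:

```sql
-- If the table is a heap, create the clustered columnstore index directly:
CREATE CLUSTERED COLUMNSTORE INDEX cci_FactSales
ON dbo.FactSales;

-- If it already has a clustered rowstore index, convert it in place instead:
-- CREATE CLUSTERED COLUMNSTORE INDEX <existing_clustered_index_name>
-- ON dbo.FactSales WITH (DROP_EXISTING = ON);
```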
THANK YOU! and may the force be with you… Questions? thomas.grohser@nttdata.com tg@grohser.com