System Architecture: Big Iron (NUMA) Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation
Scaling SQL on NUMA Topics OLTP – Thomas Kejser session “Designing High Scale OLTP Systems” Data Warehouse Ongoing Database Development Bulk Load – SQL CAT paper + TK session “The Data Loading Performance Guide” Other Sessions with common coverage: Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6) Inside the SQL Server Query Optimizer, Conor Cunningham Notes from the field: High Performance Storage, John Langford SQL Server Storage – 1000GB Level, Brent Ozar
Server Systems and Architecture
Symmetric Multi-Processing [diagram: CPUs on a shared system bus to MCH, ICH, PXH] In SMP, processors are not dedicated to specific tasks (as in ASMP); there is a single OS image and each processor can access all memory. SMP makes no reference to memory architecture. Not to be confused with Simultaneous Multi-Threading (SMT). Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)
Non-Uniform Memory Access [diagram: four nodes, each with four CPUs, a memory controller and a node controller, connected by a shared bus or crossbar] NUMA architecture – path to memory is not uniform. Node: processors, memory, separate or combined memory + node controllers. Nodes connected by shared bus, crossbar, or ring. Traditionally 8-way+ systems. Local memory latency ~150ns, remote node memory ~300-400ns; can cause erratic behavior if OS/code is not NUMA aware
AMD Opteron [diagram: Opteron sockets with HT2100/HT1100 chipset] Technically, Opteron is NUMA, but remote node memory latency is low – no negative impact or erratic behavior! For practical purposes: behaves like an SMP system. Local memory latency ~50ns, 1 hop ~100ns, two hops ~150ns? Actual: more complicated because of snooping (cache coherency traffic)
8-way Opteron System Architecture [diagram: 8 sockets, CPUs 0-7] Opteron processors (prior to Magny-Cours) have 3 Hyper-Transport links. Note: in the 8-way layout the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT for IO; CPU 1 & 7 require 3 hops to reach each other
http://www.techpowerup.com/img/09-08-26/17d.jpg
Nehalem System Architecture Intel Nehalem generation processors have Quick Path Interconnect (QPI). Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4, so 8-way glue-less is possible. References: http://www.informit.com/articles/article.aspx?p=1606902 http://www.intel.com/intelpress/articles/qpi2.htm (Gurbir Singh)
NUMA Local and Remote Memory Local memory is closer than remote Physical access time is shorter What is actual access time? With cache coherency requirement!
HT Assist – Probe Filter part of L3 cache used as directory cache ZDNET
Source Snoop Coherency From HP PREMA Architecture whitepaper: All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980G7 From HP PREMA Architecture whitepaper (search HP PREMA, document 4AA3-0643ENW): Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory (*only cache tags, not cache data)
HP ProLiant DL980 Architecture Node Controllers reduce effective memory latency
Superdome 2 – Itanium, sx3000 Agent – Remote Ownership Tag + L4 cache tags 64M eDRAM L4 cache data
IBM x3850 X5 (Glue-less) Connect two 4-socket Nodes to make 8-way system
OS Memory Models SUMA: Sufficiently Uniform Memory Access – memory interleaved across nodes [diagram: consecutive memory stripes distributed round-robin across Nodes 0-3]. NUMA: memory first interleaved within a node, then spanned across nodes [diagram: addresses 0-7 on Node 0, 8-15 on Node 1, 16-23 on Node 2, 24-31 on Node 3]. The memory stripe is then spanned across nodes
Windows OS NUMA Support Memory models: SUMA – Sufficiently Uniform Memory Access; NUMA – separate memory pools by node [diagram: memory striped across NUMA Nodes 0-3]
Memory Model Example: 4 Nodes SUMA memory model: memory access uniformly distributed, 25% of memory accesses local, 75% remote. NUMA memory model: goal is better than 25% local node access. True local access time also needs to be faster; cache coherency may increase local access time
Architecting for NUMA – End to End Affinity (App Server, TCP Port, CPU, Memory, Table) Web tier determines the port for each user by group (but should not be by geography!). Affinitize each port to a NUMA node so each node accesses localized data (partition?). OS may allocate a substantial chunk from Node 0? [diagram: regions North East, Mid Atlantic, South East, Central, Texas, Mountain, California, Pacific NW mapped to TCP ports 1440-1447 and NUMA Nodes 0-7]
HP-UX LORA HP-UX (not Microsoft Windows) – Locality-Optimized Resource Alignment: 12.5% interleaved memory, 87.5% NUMA-node local memory
System Tech Specs
Processors       | Cores/socket | DIMM | PCI-E G2     | Total Cores | Max memory | Base
2 x Xeon X56x0   | 6            | 18   | 5 x8+, 1 x4  | 12          | 192G*      | $7K
4 x Opteron 6100 | 12           | 32   | 5 x8, 1 x4   | 48          | 512G       | $14K
4 x Xeon X7560   | 8            | 64   | 4 x8, 6 x4†  | 32          | 1TB        | $30K
8 x Xeon X7560   | 8            | 128  | 9 x8, 5 x4‡  | 64          | 2TB        | $100K
Memory pricing: 8GB $400 ea – 18 x 8G = 144GB, $7,200; 64 x 8G = 512GB, $26K. 16GB $1,100 ea – 12 x 16G = 192GB, $13K; 64 x 16G = 1TB, $70K.
* Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB
† Dell R910 and HP DL580G7 have different PCI-E
‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots
Software Stack
Operating System Windows Server 2003 RTM, SP1 – network limitations (default), Scalable Networking Pack (KB 912222). Windows Server 2008. Windows Server 2008 R2 (64-bit only) – breaks the 64 logical processor limit; NUMA IO enhancements? Impacts OLTP. Search: NUMA I/O Optimizations (Bruce Worthington); Microsoft Windows Server 2003 Scalable Networking Pack; numaio@microsoft.com; MSI-X. Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server. Don't try to do DW on SQL Server 2000
SQL Server version SQL Server 2000 Serious disk IO limitations (1GB/sec ?) Problematic parallel execution plans SQL Server 2005 (fixed most S2K problems) 64-bit on X64 (Opteron and Xeon) SP2 – performance improvement 10%(?) SQL Server 2008 & R2 Compression, Filtered Indexes, etc Star join, Parallel query to partitioned table Introduction to New Data Warehouse Scalability Features in SQL Server 2008 http://msdn.microsoft.com/en-us/library/cc278097(SQL.100).aspx
Configuration SQL Server startup parameter: -E Trace flags 834, 836, 2301 Auto_Date_Correlation: Order date < A, Ship date > A implies Order date > A-C, Ship date < A+C Port affinity – mostly OLTP Dedicated processor? for log writer?
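A minimal sketch of where these settings live (the database name is hypothetical; -E and trace flag 834 can only be set as startup parameters, not at runtime):

-- Startup parameters (set via SQL Server Configuration Manager): -E -T834 -T836
-- Trace flag 2301 (advanced decision-support optimizations) can be enabled globally at runtime:
DBCC TRACEON (2301, -1);
-- Date correlation optimization is a database-level option:
ALTER DATABASE [TPCH] SET DATE_CORRELATION_OPTIMIZATION ON;
-- Verify which trace flags are active:
DBCC TRACESTATUS (-1);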
Storage Performance for Data Warehousing Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation
Storage
Organization Structure In many large IT departments DB and Storage are in separate groups Storage usually has own objectives Bring all storage into one big system under full management (read: control) Storage as a Service, in the Cloud One size fits all needs Usually have zero DB knowledge Of course we do high bandwidth, 600MB/sec good enough for you?
Data Warehouse Storage OLTP – Throughput with Fast Response DW – Flood the queues for maximum through-put Do not use shared storage for data warehouse! Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs. Just tell us how much capacity you need and don’t worry about anything else. My advice: stay away from shared storage, controlled by different team.
Nominal and Net Bandwidth PCI-E Gen 2 – 5 Gbit/s signaling: x8 = 5GB/s, net BW 4GB/s; x4 = 2GB/s net SAS 6Gbit/s – x4 port: 3GB/s nominal, 2.2GB/s net? Fibre Channel – 8 Gbit/s nominal: 780MB/s point-to-point, 680MB/s from host to SAN to back-end loop SAS RAID Controller, x8 PCI-E G2, 2 x4 6G ports: 2.8GB/s – depends on the controller, will change!
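Most of the nominal-to-net gap is 8b/10b encoding (a sketch of the arithmetic; real nets also lose a few percent to protocol overhead): PCI-E Gen 2 x8: 5 Gbit/s x 8 lanes x 8/10 / 8 bits = 4.0 GB/s. SAS 6G x4: 6 x 4 x 8/10 / 8 = 2.4 GB/s raw, ~2.2GB/s observed. FC 8G: 8 x 8/10 / 8 = 0.8 GB/s, consistent with the ~780MB/s point-to-point figure above.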
Storage – SAS Direct-Attach [diagram: multiple RAID controllers in PCI-E x8 slots, each with two x4 SAS ports to disk enclosures; 2 x10GbE in a PCI-E x4 slot] Many fat pipes, very many disks. Option A: 24 disks in one enclosure for each x4 SAS port, two x4 SAS ports per controller. Option B: split enclosure over 2 x4 SAS ports, 1 controller. Balance by pipe bandwidth. Don't forget fat network pipes
Storage – FC/SAN [diagram: PCI-E x8 Gen 2 slots with quad-port 8Gb FC HBAs; 2 x10GbE in a PCI-E x4 slot] If 8Gb quad-port is not supported, consider a system with many x4 slots, or consider SAS! SAN systems typically offer 3.5in 15-disk enclosures – difficult to get high spindle count with density. 1-2 15-disk enclosures per 8Gb FC port, 20-30MB/s per disk?
Storage – SSD / HDD Hybrid (No RAID w/SSD?) [diagram: SAS and RAID controllers in PCI-E x8 slots, SSD and HDD shared across enclosures; 2 x10GbE in a PCI-E x4 slot] Storage enclosures typically have 12 disks per channel and can only support the bandwidth of a few SSDs. Use remaining bays for extra storage with HDD – no point expending valuable SSD space on backups and flat files. Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multi DB – SSD, otherwise too many RAID1 pairs for logs
SSD Current: mostly 3Gbps SAS/SATA SSD Some 6Gbps SATA SSD Fusion IO – direct PCI-E Gen2 interface 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s No RAID? HDD is fundamentally a single point of failure SSD could be built with redundant components HP reported problems with SSD on RAID controllers, Fujitsu did not?
Big DW Storage – iSCSI Are you nuts?
Storage Configuration - Arrays Shown: two 12-disk Arrays per 24-disk enclosure Options: between 6-16 disks per array SAN systems may recommend R10 4+4 or R5 7+1 Very Many Spindles Comment on Meta LUN
Data Consumption Rate: Xeon
TPC-H Query 1 – Lineitem scan (SF1 = 1GB; 875MB with DATE in 2k8)
Processors   |          | Total Cores | GHz  | Mem GB | SQL  | SF   | Q1 sec | Total MB/s | MB/s per core
2 Xeon 5355  | Conroe   | 8           | 2.66 | 64     | 5sp2 | 100  | 85.4   | 1,165.5    | 145.7
2 Xeon 5570  | Nehalem  | 8           | 2.93 | 144    | 8sp1 | 100  | 42.2   | 2,073.5    | 259.2
2 Xeon 5680  | Westmere | 12          | 3.33 | 192    | 8r2  | 100  | 21.0   | 4,166.7    | 347.2
4 Xeon 7560  | Neh.-EX  | 32          | 2.26 | 640    | –    | 300  | 37.2   | 7,056.5    | 220.5
8 Xeon 7560  | Neh.-EX  | 64          | 2.26 | 512    | –    | 3000 | 183.8  | 14,282     | 223.2
Data consumption rate is much higher for current generation Nehalem and Westmere processors than the Core 2 referenced in the Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.
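How the consumption figures are derived (a sketch; 875MB is the SF1 Lineitem size with the DATE type, per the header above): Total MB/s = SF x 875MB / Q1 sec; MB/s per core = Total MB/s / total cores. E.g. 2 Xeon 5680: 100 x 875 / 21.0 = 4,166.7 MB/s; 4,166.7 / 12 = 347.2 MB/s per core.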
Data Consumption Rate: Opteron
TPC-H Query 1 – Lineitem scan (SF1 = 1GB; 875MB with DATE in 2k8)
Processors  |           | GHz | Total Cores | Mem GB | SQL  | SF   | Q1 sec | Total MB/s | MB/s per core
4 Opt 8220  |           | 2.8 | 8           | 128    | 5rtm | 300  | 309.7  | 868.7      | 121.1
8 Opt 8360  | Barcelona | 2.5 | 32          | 256    | 8rtm | 300  | 91.4   | 2,872.0    | 89.7
8 Opt 8384  | Shanghai  | 2.7 | 32          | 256    | 8rtm | 300  | 72.5   | 3,620.7    | 113.2
8 Opt 8439  | Istanbul  | 2.8 | 48          | 256    | 8sp1 | 300  | 49.0   | 5,357.1    | 111.6
8 Opt 8439  | Istanbul  | 2.8 | 48          | 512    | 8rtm | 1000 | 166.9  | 5,242.7    | 109.2
2 Opt 6176  | Magny-C   | 2.3 | 24          | 192    | 8r2  | 100  | 20.2   | 4,331.7    | 180.5
4 Opt 6176  | Magny-C   | 2.3 | 48          | 512    | –    | 300  | 31.8   | 8,254.7    | 172.0
Expected Istanbul to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8 for Istanbul) – or is this Win/SQL 2K8 R2?
Data Consumption Rate
TPC-H Query 1 – Lineitem scan (SF1 = 1GB; 875MB with DATE in 2k8)
Processors             | Total Cores | GHz  | Mem GB | SQL  | SF   | Q1 sec | MB/s     | MB/s per core
2 Xeon 5355            | 8           | 2.66 | 64     | 5sp2 | 100  | 85.4   | 1,165.5  | 145.7
2 Xeon 5570            | 8           | 2.93 | 144    | 8sp1 | 100  | 42.2   | 2,073.5  | 259.2
2 Xeon 5680            | 12          | 3.33 | 192    | 8r2  | 100  | 21.0   | 4,166.7  | 347.2
2 Opt 6176             | 24          | 2.3  | 192    | 8r2  | 100  | 20.2   | 4,331.7  | 180.5
4 Opt 8220             | 8           | 2.8  | 128    | 5rtm | 300  | 309.7  | 868.7    | 121.1
8 Opt 8360 (Barcelona) | 32          | 2.5  | 256    | 8rtm | 300  | 91.4   | 2,872.0  | 89.7
8 Opt 8384 (Shanghai)  | 32          | 2.7  | 256    | 8rtm | 300  | 72.5   | 3,620.7  | 113.2
8 Opt 8439 (Istanbul)  | 48          | 2.8  | 256    | 8sp1 | 300  | 49.0   | 5,357.1  | 111.6
4 Opt 6176 (Magny-C)   | 48          | 2.3  | 512    | –    | 300  | 31.8   | 8,254.7  | 172.0
8 Xeon 7560            | 64          | 2.26 | 512    | –    | 3000 | 183.8  | 14,282   | 223.2
Storage Targets
2U disk enclosure: 24 x 73GB 15K 2.5in disks, $14K ($600 per disk)
Processors          | 2 Xeon X5680 | 4 Opt 6176 | 4 Xeon X7560 | 8 Xeon X7560
Total Cores         | 12           | 48         | 32           | 64
BW per Core (MB/s)  | 350          | 175        | 250          | 225
Target MB/s         | 4,200        | 8,400      | 8,000        | 14,400
PCI-E x8 - x4       | 5 - 1        | –          | 6 - 4        | 9 - 5
SAS HBA             | 2            | 4          | 6            | 11†
Storage Units/Disks | 2 - 48       | 4 - 96     | 6 - 144      | 10 - 240
Storage Units/Disks | 4 - 96       | 8 - 192    | 12 - 288     | 20 - 480
Actual Bandwidth    | 5 GB/s       | 10 GB/s    | 15 GB/s      | 26 GB/s
† 8-way: 9 controllers in x8 slots, 24 disks per x4 SAS port; 2 controllers in x4 slots, 12 disks
24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec – more achievable in practice
Think: shortest path to metal (iron-oxide)
Your Storage and the Optimizer
Model     | Disks | Sequential IOPS | BW (KB/s)  | "Random" IOPS | Sequential-Random IO ratio
Optimizer | –     | 1,350           | 10,800     | 320           | 4.22
SAS 2 x4  | 24    | 350,000         | 2,800,000  | 9,600         | 36.5
SAS 2 x4  | 48    | 350,000         | 2,800,000  | 19,200        | 18.2
FC 4G     | 30    | 45,000          | 360,000    | 12,000        | 3.75
SSD       | 8     | 350,000         | 2,800,000  | 280,000       | 1.25
Assumptions: 2.8GB/sec per SAS 2 x4 adapter (could be 3.2GB/sec per PCI-E G2 x8). HDD: 400 IOPS per disk – big query key lookup, loop join at high queue depth and short-stroked, possible skip-seek. SSD: 35,000 IOPS.
The SQL Server Query Optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has an 18-36 ratio; 30 disks per 4G FC about matches the QO; SSD is in the other direction.
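Where the 4.22 comes from (a sketch using the constants above – the cost model rates a scan at 1,350 sequential 8KB pages/sec and lookups at 320 IO/sec): 1,350 / 320 = 4.22. A DW-configured SAS system: (2,800,000 KB/s / 8KB) / (24 disks x 400 IOPS) = 350,000 / 9,600 ≈ 36.5.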
Data Consumption Rates TPC-H SF100 Query 1, 9, 13, 21 TPC-H SF300 Query 1, 9, 13, 21
Fast Track Reference Architecture My Complaints Several Expensive SAN systems (11 disks) Each must be configured independently $1,500-2,000 amortized per disk Too many 2-disk Arrays 2 LUN per Array, too many data files Build Indexes with MAXDOP 1 Is this brain dead? Designed around 100MB/sec per disk Not all DW is single scan, or sequential Scripting?
Fragmentation [diagram: Table → File → Partition → LUN → Disk] Weak storage system: 1) Fragmentation could degrade IO performance; 2) Defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time. Powerful storage system: 3) Fragmentation has very little impact; 4) Defragmenting has mild impact, and completes within the night-time window. What is the correct conclusion?
Operating System View of Storage
Operating System Disk View [diagram: Disks 2-7 (Basic, 396GB, Online), one per controller port – Controllers 1-3, Ports 0-1] Additional disks not shown; Disk 0 is the boot drive, Disk 1 – install source?
File Layout Each file group is distributed across all data disks. [diagram: on each of Disks 2-7, Partition 0 holds one file (File 1-6) of the file group for the big table; Partition 1 – file group for all others (small file group); Partition 2 – tempdb; Partition 4 – backup and load] Log disks not shown; tempdb shares a common pool with data
File Groups and Files Dedicated File Group for largest table Never defragment One file group for all other regular tables Load file group? Rebuild indexes to different file group
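A minimal sketch of this layout in T-SQL (hypothetical database name, drive letters, and file names; one file per data disk so the filegroup spans all spindles, as in the file-layout slide above):

-- Dedicated filegroup for the largest table, one file per data disk
ALTER DATABASE [DW] ADD FILEGROUP BigTableFG;
ALTER DATABASE [DW] ADD FILE
    (NAME = BigTable_1, FILENAME = 'E:\DW\BigTable_1.ndf', SIZE = 100GB),
    (NAME = BigTable_2, FILENAME = 'F:\DW\BigTable_2.ndf', SIZE = 100GB)
    -- ... one file on each remaining data disk
TO FILEGROUP BigTableFG;
-- Separate filegroup for all other regular tables
ALTER DATABASE [DW] ADD FILEGROUP OtherTablesFG;
-- Place the big table on its own filegroup via its clustered index
CREATE CLUSTERED INDEX CIX_LineItem ON dbo.LINEITEM (L_SHIPDATE) ON BigTableFG;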
Partitioning - Pitfalls [diagram: Table Partitions 1-6 mapped one-to-one to File Groups 1-6 on Disks 2-7] Common partitioning strategy: the partition scheme maps partitions to file groups. What happens in a table scan? Read first from Partition 1, then 2, then 3, …? SQL 2008 HF to read from each partition in parallel? What if partitions have disparate sizes?
Parallel Execution Plans Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation
So you bought a 64+ core box Now Learn all about Parallel Execution All guns (cores) blazing Negative scaling Super-scaling High degree of parallelism & small SQL Anomalies, execution plan changes etc Compression Partitioning Yes, this can happen, how will you know No I have not been smoking pot How much in CPU do I pay for this? Great management tool, what else?
Parallel Execution Plans Reference: Adam Machanic PASS
Execution Plan Quickie I/O and CPU cost components (F4) Estimated execution plan cost is duration in seconds on some reference platform. IO cost for a scan: 1.0 = 10,800KB/s, so 810 implies 8,748,000KB. IO in a nested loops join: 1.0 = 320 IO/s, i.e., a multiple of 0.003125
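The arithmetic behind those constants (a sketch): a scan IO cost of 810 corresponds to 810 x 10,800KB = 8,748,000KB ≈ 8,748MB; each additional random IO in a nested loops join adds 1/320 = 0.003125 to the IO cost.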
Index + Key Lookup - Scan Actual CPU time (data in memory): Key Lookup 1919, 1919; Scan 8736, 8727. (926.67 - 323,655 x 0.0001581) / 0.003125 = 280,160 (86.6%); true cross-over approx 1,400,000 rows. 1 row per page: 1,093,729 pages / 1,350 = 810.17 (8,748MB)
Index + Key Lookup - Scan Actual CPU time: Key Lookup 2138, 321; Scan 18622, 658. (817 - 280,326 x 0.0001581) / 0.003125 = 247,259 (88%). 8,748,000KB / 8 / 1,350 = 810
Actual Execution Plan Estimated Actual Note Actual Number of Rows, Rebinds, Rewinds Actual Estimated
Row Count and Executions Outer Inner Source For Loop Join inner source and Key Lookup, Actual Num Rows = Num of Exec × Num of Rows
Parallel Plans
Parallelism Operations Distribute Streams Non-parallel source, parallel destination Repartition Streams Parallel source and destination Gather Streams Destination is non-parallel
Parallel Execution Plans Note: gold circle with double arrow, and parallelism operations
Parallel Scan (and Index Seek) [plan cost charts at DOP 1, 2, 4, 8] IO cost stays the same; CPU cost is reduced by the degree of parallelism (8X, 4X, ...), except no further reduction at DOP 16. IO contributes most of the cost!
Parallel Scan 2 DOP 16
Hash Match Aggregate CPU cost only reduces by 2X
Parallel Scan IO cost is the same; CPU cost is reduced in proportion to the degree of parallelism, last 2X excluded? On a weak storage system a single thread can saturate the IO channel, and additional threads will not increase IO (reduce IO duration). A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this were an optimizer option? The IO component can be a very large portion of the overall plan cost. Not reducing IO cost in a parallel plan may inhibit generating the favorable plan, i.e., not sufficient to offset the contribution from the Parallelism operations. A parallel execution plan is more likely on larger systems (-P to fake it?)
Actual Execution Plan - Parallel
More Parallel Plan Details
Parallel Plan - Actual
Parallelism – Hash Joins
Hash Join Cost DOP 4 DOP 1 DOP 2 Search: Understanding Hash Joins For In-memory, Grace, Recursive DOP 8
Hash Join Cost CPU cost is linear with the number of rows, outer and inner source. See BOL on Hash Joins for In-Memory, Grace, Recursive. IO cost is zero for small intermediate data size; beyond a set point (proportional to server memory?) IO is proportional to the excess data (beyond the in-memory limit). Parallel plan: memory allocation is per thread! Summary: Hash Join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. Does not reflect real cost?
Parallelism Repartition Streams DOP 2 DOP 4 DOP 8
Bitmap BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed. SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.
Parallel Execution Plan Summary Queries with high IO cost may show little plan cost reduction on parallel execution Plans with high portion hash or sort cost show large parallel plan cost reduction Parallel plans may be inhibited by high row count in Parallelism Repartition Streams Watch out for (Parallel) Merge Joins!
Scaling Theory
Parallel Execution Strategy Partition work into little pieces: ensures each thread has the same amount of work, but high overhead to coordinate. Partition into big pieces: may have uneven distribution between threads. Small table join to big table: a thread for each row from the small table. Partitioned table options
What Should Scale? Trivially parallelizable: 1) Split a large chunk of work among threads, 2) Each thread works independently, 3) Small amount of coordination to consolidate threads
More Difficult? Parallelizable: 1) Split a large chunk of work among threads, 2) Each thread works on the first stage, 3) Large coordination effort between threads, 4) More work … then consolidate
Partitioned Tables No Repartition Streams Regular Table No Repartition Streams operations!
Scaling Reality 8-way Quad-Core Opteron Windows Server 2008 R2 SQL Server 2008 SP1 + HF 27
Test Queries TPC-H SF 10 database Standard, Compressed, Partitioned (30) Line Item Table SUM, 59M rows, 8.75GB Orders Table 15M rows
CPU-sec to SUM 1 or 2 columns in Line Item [charts: Standard, Compressed]
Speed Up Standard Compressed
Line Item sum 1 column CPU-sec Speed up relative to DOP 1
Line Item Sum w/Group By CPU-sec Speedup
Hash Join CPU-sec Speedup
Key Lookup and Table Scan CPU-sec 1.4M rows Speedup
Parallel Execution Summary Contention in queries w/low cost per page Simple scan, High Cost per Page – improves scaling! Multiple Aggregates, Hash Join, Compression Table Partitioning – alternative query plans Loop Joins – broken at high DOP Merge Join – seriously broken (parallel)
Scaling DW Summary Massive IO bandwidth Parallel options for data load, updates etc Investigate Parallel Execution Plans Scaling from DOP 1, 2, 4, 8, 16, 32 etc Scaling with and w/o HT Strategy for limiting DOP with multiple users
Fixes from Microsoft Needed Contention issues in parallel execution Table scan, Nested Loops Better plan cost model for scaling Back-off on parallelism if gain is negligible Fix throughput degradation with multiple users running big DW queries Sybase and Oracle, Throughput is close to Power or better
Test Systems
Test Systems 2-way quad-core Xeon 5430 2.66GHz Windows Server 2008 R2, SQL 2008 R2 8-way dual-core Opteron 2.8GHz Windows Server 2008 SP1, SQL 2008 SP1 8-way quad-core Opteron 2.7GHz Barcelona Windows Server 2008 R2, SQL 2008 SP1 Build 2789 8-way systems were configured for AD- not good!
Test Methodology Boot with all processors Run queries at MAXDOP 1, 2, 4, 8, etc Not the same as running on 1-way, 2-way, 4-way server Interpret results with caution
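A sketch of how DOP was varied per query (TPC-H column names; the hint caps parallelism for that statement only):

SELECT L_RETURNFLAG, L_LINESTATUS, SUM(L_EXTENDEDPRICE) AS sum_price
FROM dbo.LINEITEM
GROUP BY L_RETURNFLAG, L_LINESTATUS
OPTION (MAXDOP 4);   -- repeat with MAXDOP 1, 2, 8, 16, 32 ...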
References Search Adam Machanic PASS
SQL Server Scaling on Big Iron (NUMA) Systems TPC-H Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation
TPC-H
TPC-H DSS – 22 queries, geometric mean Power – single stream 60X range plan cost, comparable actual range Power – single stream Tests ability to scale parallel execution plans Throughput – multiple streams Scale Factor 1 – Line item data is 1GB 875MB with DATE instead of DATETIME Only single column indexes allowed, Ad-hoc
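How the composite is scored (the Power metric is a geometric mean over the 22 queries, as noted above; QphH is the geometric mean of the two tests – checked against the published figures later in this deck): QphH = sqrt(Power x Throughput), e.g. sqrt(23,378 x 13,381) ≈ 17,687.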
Observed Scaling Behaviors Good scaling, leveling off at high DOP Perfect Scaling ??? Super Scaling Negative Scaling especially at high DOP Execution Plan change Completely different behavior
TPC-H Published Results
TPC-H SF 100GB 2-way Xeon 5355, 5570, 5680, Opt 6176 Among the 2-way Xeon 5570 results, all are close: HDD has the best throughput, SATA SSD has the best composite, and Fusion-IO has the best power. Westmere and Magny-Cours, both 192GB memory, are very close
TPC-H SF 300GB 8x QC/6C & 4x12C Opt: 6C Istanbul improved over 4C Shanghai by 45% Power, 73% Through-put, 59% overall. 4x12C 2.3GHz improved 17% over 8x6C 2.8GHz
TPC-H SF 1000
TPC-H SF 3TB X7460 & X7560 Nehalem-EX 64 cores better than 96 Core 2.
TPC-H SF 100GB, 300GB & 3TB SF100 2-way: Westmere and Magny-Cours are very close. Among the 2-way Xeon 5570 results, all are close: HDD has the best through-put, SATA SSD has the best composite, and Fusion-IO has the best power. SF300 8x QC/6C & 4x12C: 6C Istanbul improved over 4C Shanghai by 45% Power, 73% Through-put, 59% overall; 4x12C 2.3GHz improved 17% over 8x6C 2.8GHz. SF 3TB X7460 & X7560: Nehalem-EX 64 cores better than 96 Core 2.
TPC-H Published Results SQL Server excels in Power Limited by Geometric mean, anomalies Trails in Throughput Other DBMS get better throughput than power SQL Server throughput below Power by wide margin Speculation – SQL Server does not throttle back parallelism with load?
TPC-H SF100
Processors         | Total Cores | SQL  | GHz  | Mem GB | SF  | Power    | Throughput | QphH
2 Xeon 5355        | 8           | 5sp2 | 2.66 | 64     | 100 | 23,378.0 | 13,381.0   | 17,686.7
2 Xeon 5570 HDD    | 8           | 8sp1 | 2.93 | 144    | 100 | 67,712.9 | 38,019.1   | 50,738.4
2 Xeon 5570 SSD    | 8           | 8sp1 | 2.93 | 144    | 100 | 70,048.5 | 37,749.1   | 51,422.4
2 Xeon 5570 Fusion | 8           | 8sp1 | 2.93 | 144    | 100 | 72,110.5 | 36,190.8   | 51,085.6
2 Xeon 5680        | 12          | 8r2  | 3.33 | 192    | 100 | 99,426.3 | 55,038.2   | 73,974.6
2 Opt 6176         | 24          | 8r2  | 2.3  | 192    | 100 | 94,761.5 | 53,855.6   | 71,438.3
TPC-H SF300
Processors  | Total Cores | SQL  | GHz  | Mem GB | SF  | Power     | Throughput | QphH
4 Opt 8220  | 8           | 5rtm | 2.8  | 128    | 300 | 25,206.4  | 13,283.8   | 18,298.5
8 Opt 8360  | 32          | 8rtm | 2.5  | 256    | 300 | 67,287.4  | 41,526.4   | 52,860.2
8 Opt 8384  | 32          | 8rtm | 2.7  | 256    | 300 | 75,161.2  | 44,271.9   | 57,684.7
8 Opt 8439  | 48          | 8sp1 | 2.8  | 256    | 300 | 109,067.1 | 76,869.0   | 91,558.2
4 Opt 6176  | 48          | 8r2  | 2.3  | 512    | 300 | 129,198.3 | 89,547.7   | 107,561.2
4 Xeon 7560 | 32          | –    | 2.26 | 640    | 300 | 152,453.1 | 96,585.4   | 121,345.6
All of the above are HP results? Sun result: Opt 8384, sp1 – Power 67,095.6, Throughput 45,343.5, QphH 55,157.5
TPC-H 1TB
Processors   | Total Cores | SQL/DBMS | GHz  | Mem GB | SF   | Power     | Throughput | QphH
8 Opt 8439   | 48          | 8R2?     | 2.8  | 512    | 1000 | 95,789.1  | 69,367.6   | 81,367.6
8 Opt 8439   | 48          | ASE      | 2.8  | 384    | 1000 | 108,436.8 | 96,652.7   | 102,375.3
Itanium 9350 | 64          | O11R2    | 1.73 | 512    | 1000 | 139,181.0 | 141,188.1  | 140,181.1
TPC-H 3TB
Processors   | Total Cores | DBMS   | GHz  | Mem GB | SF   | Power     | Throughput | QphH
16 Xeon 7460 | 96          | 8r2    | 2.66 | 1024   | 3000 | 120,254.8 | 87,841.4   | 102,254.8
8 Xeon 7560  | 64          | 8r2    | 2.26 | 512    | 3000 | 185,297.7 | 142,685.6  | 162,601.7
Itanium 9350 | 64          | Sybase | 1.73 | 512    | 1000 | 142,790.7 | 171,607.4  | 156,537.3
POWER6       | 64          | Sybase | 5.0  | 512    | 3000 | 142,790.7 | 171,607.4  | 156,537.3
TPC-H Published Results
Processors  | GHz  | Total Cores | Mem GB | SQL  | SF   | Power     | Throughput | QphH
2 Xeon 5355 | 2.66 | 8           | 64     | 5sp2 | 100  | 23,378    | 13,381     | 17,686.7
2 Xeon 5570 | 2.93 | 8           | 144    | 8sp1 | 100  | 72,110.5  | 36,190.8   | 51,085.6
2 Xeon 5680 | 3.33 | 12          | 192    | 8r2  | 100  | 99,426.3  | 55,038.2   | 73,974.6
2 Opt 6176  | 2.3  | 24          | 192    | 8r2  | 100  | 94,761.5  | 53,855.6   | 71,438.3
4 Opt 8220  | 2.8  | 8           | 128    | 5rtm | 300  | 25,206.4  | 13,283.8   | 18,298.5
8 Opt 8360  | 2.5  | 32          | 256    | 8rtm | 300  | 67,287.4  | 41,526.4   | 52,860.2
8 Opt 8384  | 2.7  | 32          | 256    | 8rtm | 300  | 75,161.2  | 44,271.9   | 57,684.7
8 Opt 8439  | 2.8  | 48          | 256    | 8sp1 | 300  | 109,067.1 | 76,869.0   | 91,558.2
4 Opt 6176  | 2.3  | 48          | 512    | 8r2  | 300  | 129,198.3 | 89,547.7   | 107,561.2
8 Xeon 7560 | 2.26 | 64          | 512    | 8r2  | 3000 | 185,297.7 | 142,685.6  | 162,601.7
SF100 2-way Big Queries (sec) Query time in sec Xeon 5570 with SATA SSD poor on Q9, reason unknown Both Xeon 5680 and Opteron 6176 big improvement over Xeon 5570
SF100 Middle Q Query time in sec Xeon 5570-HDD and 5680-SSD poor on Q12, reason unknown Opteron 6176 poor on Q11
SF100 Small Queries Query time in sec Xeon 5680 and Opteron poor on Q20 Note limited scaling on Q2, & 17
SF300 32+ cores Big Queries Query time in sec Opteron 6176 poor relative to 8439 on Q9 & 13, same number of total cores
SF300 Middle Q Query time in sec Opteron 6176 much better than 8439 on Q11 & 19 Worse on Q12
SF300 Small Q Query time in sec Opteron 6176 much better on Q2, even with 8439 on others
SF1000
SF1000
SF1000
SF1000
SF1000 Itanium - Superdome
SF 3TB – 8×7560 versus 16×7460 5.6X Broadly 50% faster overall, 5X+ on one, slower on 2, comparable on 3
64 cores, 7560 relative to PWR6
TPC-H Summary Scaling is impressive on some SQL. Limited ability (value) in scaling small Q. Anomalies, negative scaling
TPC-H Queries
Q1 Pricing Summary Report
Query 2 Minimum Cost Supplier Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q3
Q6 Forecasting Revenue Change
Q7 Volume Shipping
Q8 National Market Share
Q9 Product Type Profit Measure
Q11 Important Stock Identification Non-Parallel Parallel
Q12 Random IO?
Q13 Why does Q13 have perfect scaling?
Q17 Small Quantity Order Revenue
Q18 Large Volume Customer Non-Parallel Parallel
Q19
Q20? Date functions are usually written as because Line Item date columns are “date” type CAST helps DOP 1 plan, but get bad plan for parallel This query may get a poor execution plan
Q21 Suppliers Who Kept Orders Waiting Note 3 references to Line Item
Q22
TPC-H Studies Joe Chang jchang6@yahoo.com www.qdpma.com
About Joe Chang SQL Server Execution Plan Cost Model True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation
TPC-H
TPC-H DSS – 22 queries, geometric mean Power – single stream 60X range plan cost, comparable actual range Power – single stream Tests ability to scale parallel execution plans Throughput – multiple streams Scale Factor 1 – Line item data is 1GB 875MB with DATE instead of DATETIME Only single column indexes allowed, Ad-hoc
SF 10, test studies Not valid for publication Auto-Statistics enabled, Excludes compile time Big Queries – Line Item Scan Super Scaling – Mission Impossible Small Queries & High Parallelism Other queries, negative scaling Did not apply T2301, or disallow page locks
Big Q: Plan Cost vs Actual [charts: Plan Cost @ 10GB; actual query time in seconds] Memory affects hash IO onset. Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20%. Plan cost says scaling is poor except for Q18; Q18 & Q21 > 3X Q1, Q9. Plan cost is a poor indicator of true parallelism scaling
Big Query: Speed Up and CPU [charts: speed up relative to DOP 1; CPU time in seconds] Holy Grail: Q13 has slightly better than perfect scaling? In general, excellent scaling to DOP 8-24, weak afterwards. Poor DOP 32 might be resource contention; perhaps exclude SQL Server from 1 CPU
Super Scaling Suppose at DOP 1 a query runs for 100 seconds with one CPU fully pegged: CPU time = 100 sec, elapsed time = 100 sec. What is the best case for DOP 2? Assuming nearly zero Repartition Streams cost: CPU time = 100 sec, elapsed time = 50? Super Scaling: CPU time decreases going from a non-parallel to a parallel plan! No, I have not started drinking, yet
Super Scaling CPU normalized to DOP 1 CPU-sec goes down from DOP 1 to 2 and higher (typically 8) 3.5X speedup from DOP 1 to 2 (Normalized to DOP 1) Speed up relative to DOP 1
CPU and Query time in seconds CPU time Query time
Super Scaling Summary Most probable cause: the Bitmap operator in the parallel plan. Bitmap Filters are great. Question for Microsoft: Can I use Bitmap Filters in OLTP systems with non-parallel plans?
Small Queries – Plan Cost vs Actual [charts: plan cost; query time] Query 3 and 16 have lower plan cost than Q17, but are not included. Q4, 6, 17: great scaling to DOP 4, then weak. Negative scaling also occurs
Small Queries CPU & Speedup [charts: CPU time; speedup] What did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling. Query 2 negative at DOP 2, Q4 is good, Q6 gets speedup but at a CPU premium, Q17 and 20 negative after DOP 8
High Parallelism – Small Queries Why? Almost no value. TPC-H geometric mean scoring: small queries have as much impact as large (a linear sum would weight large queries). OLTP with 32, 64+ cores: parallelism good if super-scaling. Default max degree of parallelism is 0 – seriously bad news, especially for small Q. Increase cost threshold for parallelism? Sometimes you do get lucky. (A configuration sketch follows below.)
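A minimal sketch of the instance-level knobs mentioned above (the values are placeholders – the right cap depends on core count and workload):

EXEC sp_configure 'show advanced options', 1;  RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;        -- instead of the default 0
EXEC sp_configure 'cost threshold for parallelism', 25;  -- default is 5
RECONFIGURE;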
Q that go Negative Query time “Speedup”
CPU
Other Queries – CPU & Speedup CPU time Q3 has problems beyond DOP 2 Speedup
Other - Query Time seconds
Scaling Summary Some queries show excellent scaling Super-scaling, better than 2X Sharp CPU jump on last DOP doubling Need strategy to cap DOP To limit negative scaling Especially for some smaller queries? Other anomalies
Compression PAGE
Compression Overhead - Overall Query time compressed relative to uncompressed 40% overhead for compression at low DOP, 10% overhead at max DOP??? CPU time compressed relative to uncompressed
Query time compressed relative to uncompressed CPU time compressed relative to uncompressed
Compressed Table LINEITEM – real data may be more compressible Uncompressed: 8,749,760KB, Average Bytes per row: 149 Compressed: 4,819,592KB, Average Bytes per row: 82
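A sketch of enabling page compression on the fact table and checking the estimate first (table name per TPC-H; sp_estimate_data_compression_savings is the standard SQL Server 2008 procedure):

EXEC sp_estimate_data_compression_savings
     @schema_name = 'dbo', @object_name = 'LINEITEM',
     @index_id = NULL, @partition_number = NULL, @data_compression = 'PAGE';
-- Rebuild with page compression (roughly 149 -> 82 bytes per row in the test above)
ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);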
Partitioning: Orders and Line Item partitioned on the Order Key
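A minimal sketch of the arrangement (boundary values and object names are illustrative; both tables are partitioned on the order key so joins can be co-located):

-- Range partition function on the order key, e.g. 30 partitions for SF10
CREATE PARTITION FUNCTION pfOrderKey (bigint)
    AS RANGE RIGHT FOR VALUES (2000000, 4000000, 6000000 /* ... remaining boundaries ... */);
-- Every partition mapped to the same filegroup here; one filegroup per partition is the pitfall case discussed earlier
CREATE PARTITION SCHEME psOrderKey
    AS PARTITION pfOrderKey ALL TO ([PRIMARY]);
CREATE CLUSTERED INDEX CIX_Orders   ON dbo.ORDERS   (O_ORDERKEY) ON psOrderKey (O_ORDERKEY);
CREATE CLUSTERED INDEX CIX_LineItem ON dbo.LINEITEM (L_ORDERKEY) ON psOrderKey (L_ORDERKEY);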
Partitioning Impact - Overall Query time partitioned relative to not partitioned CPU time partitioned relative to not partitioned
Query time partitioned relative to not partitioned CPU time partitioned relative to not partitioned
Plan for Partitioned Tables
Scaling DW Summary Massive IO bandwidth Parallel options for data load, updates etc Investigate Parallel Execution Plans Scaling from DOP 1, 2, 4, 8, 16, 32 etc Scaling with and w/o HT Strategy for limiting DOP with multiple users
Fixes from Microsoft Needed Contention issues in parallel execution Table scan, Nested Loops Better plan cost model for scaling Back-off on parallelism if gain is negligible Fix throughput degradation with multiple users running big DW queries Sybase and Oracle, Throughput is close to Power or better
Query Plans
Big Queries
Q1 Pricing Summary Report
Q1 Plan Non-Parallel Parallel plan 28% lower than scalar, IO is 70%, no parallel plan cost reduction Parallel
Q9 Product Type Profit Measure IO from 4 tables contribute 58% of plan cost, parallel plan is 39% lower Non-Parallel Parallel
Q9 Non-Parallel Plan Table/Index Scans comprise 64%, IO from 4 tables contribute 58% of plan cost Join sequence: Supplier, (Part, PartSupp), Line Item, Orders
Q9 Parallel Plan Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp
Q9 Non-Parallel Plan details Table Scans comprise 64%, IO from 4 tables contribute 58% of plan cost
Q9 Parallel reg vs Partitioned
Q13 Why does Q13 have perfect scaling?
Q18 Large Volume Customer Non-Parallel Parallel
Q18 Graphical Plan Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan
Q18 Plan Details Non-Parallel Parallel Non-Parallel Plan Hash Match cost is 1245 IO, 494.6 CPU DOP 16/32: size is below IO threshold, CPU reduced by >10X
Q21 Suppliers Who Kept Orders Waiting Non-Parallel Parallel Note 3 references to Line Item
Q21 Non-Parallel Plan H3 H2 H1 H3 H2 H1
Q21 Parallel
Q21 3 full Line Item clustered index scans Plan cost is approx 3X Q1, single “scan”
Super Scaling
Q7 Volume Shipping Non-Parallel Parallel
Q7 Non-Parallel Plan Join sequence: Nation, Customer, Orders, Line Item
Q7 Parallel Plan Join sequence: Nation, Customer, Orders, Line Item
Q8 National Market Share Non-Parallel Parallel
Q8 Non-Parallel Plan Join sequence: Part, Line Item, Orders, Customer
Q8 Parallel Plan Join sequence: Part, Line Item, Orders, Customer
Q11 Important Stock Identification Non-Parallel Parallel
Q11 Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Q11 Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp
Small Queries
Query 2 Minimum Cost Supplier Wordy, but only touches the small tables, second lowest plan cost (Q15)
Q2 Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)
Q2 PartSupp is now Index Scan + Key Lookup
Q6 Forecasting Revenue Change Not sure why this blows CPU Scalar values are pre-computed, pre-converted
Q20? Date functions are usually written as because Line Item date columns are “date” type CAST helps DOP 1 plan, but get bad plan for parallel This query may get a poor execution plan
Q20
Q20
Q20 alternate - parallel Statistics estimation error here Penalty for mistake applied here
Other Queries
Q3
Q3
Q12 Random IO? Will this generate random IO?
Query 12 Plans Non-Parallel Parallel
Queries that go Negative
Q17 Small Quantity Order Revenue
Q17 Table Spool is concern
Q17 the usual suspects
Q19
Q19
Q22
Q22
Speedup from DOP 1 query time CPU relative to DOP 1