System Architecture: Big Iron (NUMA)


1 System Architecture: Big Iron (NUMA)
Joe Chang © 2010 Elemental Inc. All rights reserved.

2 About Joe Chang SQL Server Execution Plan Cost Model
True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation

3 Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session “Designing High Scale OLTP Systems”; Data Warehouse; Ongoing Database Development; Bulk Load – SQL CAT paper + TK session “The Data Loading Performance Guide”. Other sessions with common coverage: Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6); Inside the SQL Server Query Optimizer, Conor Cunningham; Notes from the field: High Performance Storage, John Langford; SQL Server Storage – 1000GB Level, Brent Ozar

4 Server Systems and Architecture

5 Symmetric Multi-Processing
Diagram: CPUs on a shared System Bus, MCH, ICH, PXH. In SMP, processors are not dedicated to specific tasks (unlike ASMP); there is a single OS image and each processor can access all memory. SMP makes no reference to memory architecture. Not to be confused with Simultaneous Multi-Threading (SMT); Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT).

6 Non-Uniform Memory Access
Diagram: 16 CPUs in four nodes, each node with its own Memory Controller and Node Controller, connected by a shared bus or crossbar. NUMA architecture – the path to memory is not uniform. A node: processors, memory, and separate or combined memory + node controllers; nodes are connected by a shared bus, crossbar, or ring. Traditionally 8-way+ systems. Local memory latency is ~150ns; remote node memory latency is considerably higher, which can cause erratic behavior if the OS/code is not NUMA aware.
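As a practical check of the node layout the OS presents, a minimal sketch (assuming SQL Server 2008 or later, which exposes NUMA information through DMVs):

-- List the NUMA nodes SQL Server detected, with scheduler counts and memory node mapping
SELECT node_id, node_state_desc, memory_node_id, online_scheduler_count
FROM sys.dm_os_nodes
WHERE node_state_desc <> 'ONLINE DAC';  -- exclude the dedicated admin connection node

On a NUMA-aware configuration this typically returns one row per hardware node; with node interleaving enabled in the BIOS (the SUMA case discussed later) the memory side collapses to a single node.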

7 AMD Opteron
Diagram labels: HT2100, HT1100. Technically, Opteron is NUMA, but remote node memory latency is low, with no negative impact or erratic behavior! For practical purposes it behaves like an SMP system. Local memory latency ~50ns, 1 hop ~100ns, two hops 150ns? Actual behavior is more complicated because of snooping (cache coherency traffic).

8 8-way Opteron Sys Architecture
The Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links. Note that in the 8-way diagram, the top and bottom right processors use 2 HT links to connect to other processors and the 3rd HT link for IO; CPUs 1 and 7 require 3 hops to reach each other.

9

10 Nehalem System Architecture
(Diagram: Gurbir Singh.) Intel Nehalem generation processors have Quick Path Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4. A glue-less 8-way configuration is possible.

11 NUMA Local and Remote Memory
Local memory is closer than remote Physical access time is shorter What is actual access time? With cache coherency requirement!

12 HT Assist – Probe Filter
Part of the L3 cache is used as a directory cache. (Source: ZDNet)

13 Source Snoop Coherency
From HP PREMA Architecture whitepaper: All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line

14 DL980G7
Search HP PREMA, document 4AA3-0643ENW. From the HP PREMA Architecture whitepaper: each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, and reduces latency to local memory (*only cache tags, not cache data).

15 HP ProLiant DL980 Architecture
Node Controllers reduce effective memory latency

16 Superdome 2 – Itanium, sx3000 Agent – Remote Ownership Tag
+ L4 cache tags 64M eDRAM L4 cache data

17 IBM x3850 X5 (Glue-less) Connect two 4-socket Nodes to make 8-way system

18

19 OS Memory Models SUMA: Sufficiently Uniform Memory Access
SUMA: memory is interleaved across nodes, so consecutive lines are striped round-robin over Nodes 0–3. NUMA: memory is first interleaved within a node, then spanned across nodes, so Node 0 holds lines 0–7, Node 1 holds 8–15, and so on. (The slide's diagrams show the line numbering for each model.)

20 Windows OS NUMA Support
Memory models: SUMA – Sufficiently Uniform Memory Access, where memory is striped across the NUMA nodes (Node 0 holds lines 0, 8, 16, 24, …); NUMA – separate memory pools by node.

21 Memory Model Example: 4 Nodes
SUMA memory model: memory access is uniformly distributed – 25% of memory accesses are local, 75% remote. NUMA memory model: the goal is better than 25% local node access. True local access time also needs to be faster; cache coherency may increase local access time.

22 Architecting for NUMA End to End Affinity
App server → TCP port → CPU → memory → table. The web tier determines the port for each user by group (but it should not be by geography!). Affinitize each port to a NUMA node; each node accesses localized data (a partition?). The OS may allocate a substantial chunk from Node 0? Diagram: user groups by region – North East, Mid Atlantic, South East, Central, Texas, Mountain, California, Pacific NW – map to TCP ports 1440–1447, affinitized to NUMA nodes 0–7 and their memory.
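A hedged note on the "affinitize port to NUMA node" step: SQL Server can bind TCP ports to NUMA nodes by appending a node affinity mask in brackets to the port number in SQL Server Configuration Manager (TCP/IP properties). Using the slide's hypothetical ports, the TCP Port property for nodes 0–3 might look like:

1440[0x1],1441[0x2],1442[0x4],1443[0x8]

The bracketed values are per-node bit masks; this only directs connections arriving on each port to that node's schedulers – the data-to-node placement is still up to the application and the partitioning design.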

23 Architecting for NUMA End to End Affinity

24 HP-UX LORA HP-UX – Not Microsoft Windows
Locality-Optimized Resource Alignment (LORA): 12.5% interleaved memory, 87.5% NUMA-node local memory.

25 System Tech Specs

Processors | Cores | DIMM | PCI-E G2 | Total Cores | Max memory | Base
2 x Xeon X56x0 | 6 | 18 | 5 x8+, 1 x4 | 12 | 192G* | $7K
4 x Opteron 6100 | 12 | 32 | 5 x8, 1 x4 | 48 | 512G | $14K
4 x Xeon X7560 | 8 | 64 | 4 x8, 6 x4† | 32 | 1TB | $30K
8 x Xeon X7560 | 8 | 128 | 9 x8, 5 x4‡ | 64 | 2TB | $100K

Memory pricing: 8GB $400 ea – 18 x 8G = 144GB, $7,200; 64 x 8G = 512GB, $26K. 16GB $1,100 ea – 12 x 16G = 192GB, $13K; 64 x 16G = 1TB, $70K.
* Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB. † Dell R910 and HP DL580G7 have different PCI-E. ‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots.

26

27 Software Stack

28 Operating System
Windows Server 2003 RTM, SP1: network limitations by default; Scalable Networking Pack (KB 912222). Windows Server 2008. Windows Server 2008 R2 (64-bit only): breaks the 64 logical processor limit; NUMA IO enhancements? Impacts OLTP. Search: NUMA I/O Optimizations, Bruce Worthington; Microsoft Windows Server 2003 Scalable Networking Pack (microsoft.com); MSI-X. Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server. Don't try to do DW on SQL Server 2000.

29 SQL Server version
SQL Server 2000: serious disk IO limitations (1GB/sec?); problematic parallel execution plans. SQL Server 2005 (fixed most S2K problems): 64-bit on X64 (Opteron and Xeon); SP2 – performance improvement ~10%(?). SQL Server 2008 & R2: compression, filtered indexes, etc.; star join, parallel query to partitioned tables. See: Introduction to New Data Warehouse Scalability Features in SQL Server 2008.

30 Configuration
SQL Server startup parameter: -E. Trace flags 834, 836, 2301. Auto_Date_Correlation: Order date < A, Ship date > A implies Order date > A-C, Ship date < A+C. Port affinity – mostly OLTP. A dedicated processor for the log writer?
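A sketch of how these settings are typically applied; the startup parameters go into SQL Server Configuration Manager, and the database name below is a placeholder:

-- Startup parameters (one per entry in Configuration Manager):
--   -E       allocate larger extent runs per file, which helps sequential DW scans
--   -T834    large-page buffer pool allocations (64-bit, requires Lock Pages in Memory)
--   -T2301   advanced decision-support optimizations
-- A trace flag can also be turned on for a running instance:
DBCC TRACEON (2301, -1);
-- Date correlation between the Orders and Line Item date columns
-- (the deck's Auto_Date_Correlation; the option name below is the T-SQL one):
ALTER DATABASE TPCH SET DATE_CORRELATION_OPTIMIZATION ON;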

31 Storage Performance for Data Warehousing
Joe Chang © 2010 Elemental Inc. All rights reserved.

32 About Joe Chang SQL Server Execution Plan Cost Model
True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation

33 Storage

34 Organization Structure
In many large IT departments, DB and Storage are in separate groups. Storage usually has its own objectives: bring all storage into one big system under full management (read: control); Storage as a Service, in the Cloud; one size fits all needs. They usually have zero DB knowledge: "Of course we do high bandwidth – 600MB/sec good enough for you?"

35 Data Warehouse Storage
OLTP – throughput with fast response. DW – flood the queues for maximum throughput. Do not use shared storage for a data warehouse! Storage system vendors like to give the impression the SAN is a magical, immensely powerful box that can meet all your needs: just tell us how much capacity you need and don't worry about anything else. My advice: stay away from shared storage controlled by a different team.

36 Nominal and Net Bandwidth
PCI-E Gen 2 – 5 Gbit/sec signaling: x8 = 5GB/s nominal, net BW 4GB/s; x4 = 2GB/s net. SAS 6Gbit/s – a 6 Gbit/s x4 port: 3GB/s nominal, 2.2GB/sec net? Fibre Channel 8 Gbit/s nominal: 780MB/s point-to-point, 680MB/s from host through SAN to back-end loop. SAS RAID controller, x8 PCI-E G2 with 2 x4 6G ports: 2.8GB/s – depends on the controller, and will change!
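For reference, the x8 figure follows from the signaling rate: 8 lanes × 5 Gbit/s = 40 Gbit/s raw, and Gen 2's 8b/10b encoding leaves roughly 4 GB/s of usable payload (about 2 GB/s for x4) before protocol overhead.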

37 Storage – SAS Direct-Attach
Diagram: RAID controllers on PCI-E x8 with x4 SAS links; PCI-E x4; 2 x 10GbE. Many fat pipes, very many disks. Option A: a 24-disk enclosure for each x4 SAS port, two x4 SAS ports per controller. Option B: split an enclosure over 2 x4 SAS ports on one controller. Balance by pipe bandwidth. Don't forget fat network pipes.

38 Storage – FC/SAN
Diagram: PCI-E x8 Gen 2 slots with quad-port 8Gb FC HBAs; PCI-E x4; 2 x 10GbE. If an 8Gb quad-port HBA is not supported, consider a system with many x4 slots, or consider SAS! SAN systems typically offer 3.5in 15-disk enclosures, so it is difficult to get a high spindle count with density. At multiple disk enclosures per 8Gb FC port, 20-30MB/s per disk?

39 Storage – SSD / HDD Hybrid
No RAID w/SSD? Diagram: SAS and RAID controllers on PCI-E x8/x4, SAS x4 links, 2 x 10GbE, SSD. Storage enclosures typically hold 12 disks per channel and can only support the bandwidth of a few SSDs; use the remaining bays for extra storage with HDD. There is no point expending valuable SSD space on backups and flat files. Log: single DB – HDD, unless rollbacks or T-log backups disrupt log writes; multi DB – SSD, otherwise too many RAID1 pairs for logs.

40 SSD
Current: mostly 3Gbps SAS/SATA SSD; some 6Gbps SATA SSD. Fusion-IO – direct PCI-E Gen2 interface, 320GB-1.2TB capacity, 200K IOPS, 1.5GB/s. No RAID? HDD is fundamentally a single point of failure, while an SSD could be built with redundant components. HP reported problems with SSD on RAID controllers; Fujitsu did not?

41 Big DW Storage – iSCSI Are you nuts?

42 Storage Configuration - Arrays
Shown: two 12-disk Arrays per 24-disk enclosure Options: between 6-16 disks per array SAN systems may recommend R or R5 7+1 Very Many Spindles Comment on Meta LUN

43

44 Data Consumption Rate: Xeon
TPC-H Query 1, Lineitem scan (SF1 is 1GB, or 875MB on 2k8 with DATE)

Processors | Total Cores | Q1 sec | SQL | Total MB/s | MB/s per core | GHz | Mem GB | SF
2 Xeon 5355 (Conroe) | 8 | 85.4 | 5sp2 | 1,165.5 | 145.7 | 2.66 | 64 | 100
2 Xeon 5570 (Nehalem) | 8 | 42.2 | 8sp1 | 2,073.5 | 259.2 | 2.93 | 144 | 100
2 Xeon 5680 (Westmere) | 12 | 21.0 | 8r2 | 4,166.7 | 347.2 | 3.33 | 192 | 100
4 Xeon 7560 (Neh.-EX) | 32 | 37.2 | – | 7,056.5 | 220.5 | 2.26 | 640 | 300
8 Xeon 7560 (Neh.-EX) | 64 | 183.8 | 8r2 | 14,282 | 223.2 | 2.26 | 512 | 3000

The data consumption rate is much higher for current generation Nehalem and Westmere processors than for the Core 2 referenced in the Microsoft FTDW document. TPC-H Q1 is more compute intensive than the FTDW light query.

45 Data Consumption Rate: Opteron
TPC-H Query 1, Lineitem scan (SF1 is 1GB, or 875MB on 2k8 with DATE)

Processors | GHz | Total Cores | Mem GB | SQL | Q1 sec | SF | Total MB/s | MB/s per core
4 Opt 8220 | 2.8 | 8 | 128 | 5rtm | 309.7 | 300 | 868.7 | 121.1
8 Opt 8360 (Barcelona) | 2.5 | 32 | 256 | 8rtm | 91.4 | 300 | 2,872.0 | 89.7
8 Opt 8384 (Shanghai) | 2.7 | 32 | 256 | 8rtm | 72.5 | 300 | 3,620.7 | 113.2
8 Opt 8439 (Istanbul) | 2.8 | 48 | 256 | 8sp1 | 49.0 | 300 | 5,357.1 | 111.6
8 Opt 8439 (Istanbul) | 2.8 | 48 | 512 | 8rtm | 166.9 | 1000 | 5,242.7 | 109.2
2 Opt 6176 (Magny-Cours) | 2.3 | 24 | 192 | 8r2 | 20.2 | 100 | 4,331.7 | 180.5
4 Opt 6176 (Magny-Cours) | 2.3 | 48 | 512 | 8r2 | 31.8 | 300 | 8,254.7 | 172.0

Istanbul was expected to have better performance per core than Shanghai due to HT Assist. Magny-Cours has much better performance per core (at 2.3GHz versus 2.8 for Istanbul) – or is this Win/SQL 2K8 R2?

46 Data Consumption Rate
TPC-H Query 1, Lineitem scan (SF1 is 1GB, or 875MB on 2k8 with DATE)

Processors | Total Cores | Q1 sec | SQL | Total MB/s | MB/s per core | GHz | Mem GB | SF
2 Xeon 5355 | 8 | 85.4 | 5sp2 | 1,165.5 | 145.7 | 2.66 | 64 | 100
2 Xeon 5570 | 8 | 42.2 | 8sp1 | 2,073.5 | 259.2 | 2.93 | 144 | 100
2 Xeon 5680 | 12 | 21.0 | 8r2 | 4,166.7 | 347.2 | 3.33 | 192 | 100
2 Opt 6176 (Magny-C) | 24 | 20.2 | 8r2 | 4,331.7 | 180.5 | 2.3 | 192 | 100
4 Opt 8220 | 8 | 309.7 | 5rtm | 868.7 | 121.1 | 2.8 | 128 | 300
8 Opt 8360 (Barcelona) | 32 | 91.4 | 8rtm | 2,872.0 | 89.7 | 2.5 | 256 | 300
8 Opt 8384 (Shanghai) | 32 | 72.5 | 8rtm | 3,620.7 | 113.2 | 2.7 | 256 | 300
8 Opt 8439 (Istanbul) | 48 | 49.0 | 8sp1 | 5,357.1 | 111.6 | 2.8 | 256 | 300
4 Opt 6176 (Magny-C) | 48 | 31.8 | 8r2 | 8,254.7 | 172.0 | 2.3 | 512 | 300
8 Xeon 7560 | 64 | 183.8 | 8r2 | 14,282 | 223.2 | 2.26 | 512 | 3000

47 Storage Targets

2U disk enclosure: 24 x 73GB 15K 2.5in disks, $14K ($600 per disk)

System | Total Cores | BW per core (MB/s) | Target MB/s | SAS HBA | Storage units – disks | Actual bandwidth
2 Xeon X5680 | 12 | 350 | 4,200 | 2 | 2 – 48 | 5 GB/s
4 Opt 6176 | 48 | 175 | 8,400 | 4 | 4 – 96 | 10 GB/s
4 Xeon X7560 | 32 | 250 | 8,000 | 6 | 4 – 96 | 15 GB/s
8 Xeon X7560 | 64 | 225 | 14,400 | 11† | – | 26 GB/s

PCI-E slots (x8 – x4) as listed on the slide: 5 – 1, 6 – 4, 9 – 5.
† 8-way: 9 controllers in x8 slots with 24 disks per x4 SAS port, plus 2 controllers in x4 slots with 12 disks. With 24 15K disks per enclosure: 12 disks per x4 SAS port requires 100MB/sec per disk – possible but not always practical; 24 disks per x4 SAS port requires 50MB/sec – more achievable in practice. Think: shortest path to metal (iron oxide).

48 Your Storage and the Optimizer
Model | Disks | Sequential IOPS | BW (KB/s) | "Random" IOPS | Sequential-to-random IO ratio
Optimizer | – | 1,350 | 10,800 | 320 | 4.22
SAS 2 x4 | 24 | 350,000 | 2,800,000 | 9,600 | 36.5
SAS 2 x4 | 48 | 350,000 | 2,800,000 | 19,200 | 18.2
FC 4G | 30 | 45,000 | 360,000 | 12,000 | 3.75
SSD | 8 | 350,000 | 2,800,000 | 280,000 | 1.25

Assumptions: 2.8GB/sec per SAS 2 x4 adapter (could be 3.2GB/sec per PCI-E G2 x8); HDD 400 IOPS per disk – big query key lookup, loop join at high queue depth, short-stroked, possible skip-seek; SSD 35,000 IOPS. The SQL Server query optimizer makes key lookup versus table scan decisions based on a 4.22 sequential-to-random IO ratio. A DW-configured storage system has a much higher ratio; 30 disks per 4G FC about matches the QO; SSD is in the other direction.
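The ratios follow from the slide's own assumptions: 10,800 KB/s ÷ 8 KB per page = 1,350 sequential pages/s, and 1,350 ÷ 320 random IOPS = 4.22 in the optimizer's model; a 2 x4 SAS adapter at 2,800,000 KB/s gives 350,000 pages/s, and with 24 disks × 400 IOPS = 9,600 random IOPS the ratio is about 36.5.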

49 Data Consumption Rates
TPC-H SF100 Query 1, 9, 13, 21 TPC-H SF300 Query 1, 9, 13, 21

50

51 Fast Track Reference Architecture
My complaints: several expensive SAN systems (11 disks each), each of which must be configured independently, at $1,500-2,000 amortized per disk. Too many 2-disk arrays, 2 LUNs per array, too many data files. Build indexes with MAXDOP 1 – is this brain dead? Designed around 100MB/sec per disk, but not all DW is single-scan or sequential. Scripting?
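For context on the MAXDOP 1 complaint, the Fast Track guidance keeps index builds single-threaded to preserve sequential layout; a minimal sketch, with the TPC-H LINEITEM table standing in for the large fact table:

-- Rebuild single-threaded, trading build time for sequential extent layout
ALTER INDEX ALL ON dbo.LINEITEM
REBUILD WITH (MAXDOP = 1, SORT_IN_TEMPDB = ON);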

52 Fragmentation
Weak storage system: 1) fragmentation could degrade IO performance, and 2) defragmenting a very large table on a weak storage system could render the database marginally to completely non-functional for a very long time. Powerful storage system: 3) fragmentation has very little impact, and 4) defragmenting has mild impact and completes within the nightly window. What is the correct conclusion? (Diagram: Table → File → Partition → LUN → Disk.)
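Before deciding whether a defragmentation pass is worth the risk, the fragmentation level can at least be measured; a sketch using the standard DMF (the table name is an assumption):

SELECT index_id, avg_fragmentation_in_percent, avg_page_space_used_in_percent, page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.LINEITEM'), NULL, NULL, 'SAMPLED');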

53 Operating System View of Storage

54 Operating System Disk View
Controller 1 ports 0 and 1 → Disks 2 and 3; Controller 2 ports 0 and 1 → Disks 4 and 5; Controller 3 ports 0 and 1 → Disks 6 and 7 (each Basic, 396GB, Online). Additional disks not shown; Disk 0 is the boot drive, Disk 1 the install source?

55 File Layout
Each file group is distributed across all data disks. Disk 2: Partition 0 – file group for the big table (File 1); Partition 1 – file group for all others; Partition 2 – tempdb; Partition 4 – backup and load. Disk 3: Partition 0 – File 2 (plus the small file group). Disks 4-7: Partition 0 – Files 3-6. Log disks not shown; tempdb shares a common pool with the data.

56 File Groups and Files
Dedicated file group for the largest table – never defragment it. One file group for all other regular tables. A load file group? Rebuild indexes to a different file group.
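A minimal sketch of that layout, assuming a database named TPCH and drive letters standing in for the slide's data disks (names, paths and sizes are placeholders):

ALTER DATABASE TPCH ADD FILEGROUP BigTableFG;
ALTER DATABASE TPCH ADD FILEGROUP OtherTablesFG;
-- One file per data disk so the big-table file group is striped across all spindles
ALTER DATABASE TPCH ADD FILE
    (NAME = BigTable_1, FILENAME = 'E:\SQLData\BigTable_1.ndf', SIZE = 100GB),
    (NAME = BigTable_2, FILENAME = 'F:\SQLData\BigTable_2.ndf', SIZE = 100GB)
    -- ...repeat for the remaining data disks
TO FILEGROUP BigTableFG;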

57 Partitioning - Pitfalls
Diagram: Table Partitions 1-6 mapped one-to-one to File Groups 1-6 on Disks 2-7. Common partitioning strategy: the partition scheme maps partitions to file groups. What happens in a table scan – read first from partition 1, then 2, then 3, …? SQL 2008 HF to read from each partition in parallel? What if partitions have disparate sizes?
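A sketch of the mapping the slide describes (boundary values and filegroup names are illustrative only):

-- Range-partition on the order key, one partition per file group
CREATE PARTITION FUNCTION pfOrderKey (bigint)
AS RANGE RIGHT FOR VALUES (10000000, 20000000, 30000000, 40000000, 50000000);

CREATE PARTITION SCHEME psOrderKey
AS PARTITION pfOrderKey TO (FG1, FG2, FG3, FG4, FG5, FG6);
-- A scan then touches FG1's disks first, FG2's next, and so on,
-- unless the partitions are read in parallel.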

58 Parallel Execution Plans
Joe Chang © 2010 Elemental Inc. All rights reserved.

59 About Joe Chang SQL Server Execution Plan Cost Model
True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation

60 So you bought a 64+ core box
Now Learn all about Parallel Execution All guns (cores) blazing Negative scaling Super-scaling High degree of parallelism & small SQL Anomalies, execution plan changes etc Compression Partitioning Yes, this can happen, how will you know No I have not been smoking pot How much in CPU do I pay for this? Great management tool, what else?

61 Parallel Execution Plans
Reference: Adam Machanic PASS

62 Execution Plan Quickie
I/O and CPU cost components (F4). The estimated execution plan cost is duration in seconds on some reference platform. IO cost for a scan: 1 = 10,800KB/s, which implies 8,748,000KB of IO here. In a Nested Loops Join: 1 = 320/s, in multiples of …

63 Index + Key Lookup - Scan
Actual CPU time (data in memory). LU vs. Scan: ( * ) / = (86.6%). The true cross-over is approximately 1,400,000 rows (1 row : page). 1,093,729 pages / 1,350 = 810 (8,748MB).

64 Index + Key Lookup - Scan
Actual CPU time. LU vs. Scan: ( * ) / = (88%). 8,748,000KB / 8 / 1,350 = 810.

65 Actual Execution Plan
Estimated versus Actual: note the Actual Number of Rows, Rebinds, and Rewinds.

66 Row Count and Executions
Outer Inner Source For Loop Join inner source and Key Lookup, Actual Num Rows = Num of Exec × Num of Rows

67

68 Parallel Plans

69 Parallelism Operations
Distribute Streams: non-parallel source, parallel destination. Repartition Streams: parallel source and destination. Gather Streams: destination is non-parallel.

70 Parallel Execution Plans
Note: gold circle with double arrow, and parallelism operations

71 Parallel Scan (and Index Seek)
IO cost is the same; CPU cost is reduced by the degree of parallelism, except there is no further reduction at DOP 16. Plans shown for DOP 1, 2, 4 and 8. IO contributes most of the cost!

72 Parallel Scan 2 DOP 16

73 Hash Match Aggregate: CPU cost only reduces by 2X

74 Parallel Scan IO Cost is the same
CPU cost is reduced in proportion to the degree of parallelism, last 2X excluded? On a weak storage system a single thread can saturate the IO channel, and additional threads will not increase IO (or reduce IO duration). A very powerful storage system can provide IO proportional to the number of threads. It might be nice if this were an optimizer option? The IO component can be a very large portion of the overall plan cost. Not reducing IO cost in a parallel plan may inhibit generating the favorable plan, i.e., it is not sufficient to offset the contribution from the Parallelism operations. A parallel execution plan is more likely on larger systems (-P to fake it?).

75 Actual Execution Plan - Parallel

76 More Parallel Plan Details

77 Parallel Plan - Actual

78 Parallelism – Hash Joins

79 Hash Join Cost
Plans shown for DOP 1, 2, 4 and 8. Search: Understanding Hash Joins – for in-memory, grace, and recursive hash joins.

80 Hash Join Cost
CPU cost is linear with the number of rows, outer and inner source (see BOL on hash joins for in-memory, grace, recursive). IO cost is zero for small intermediate data sizes; beyond a set point proportional to server memory(?), IO is proportional to the excess data (beyond the in-memory limit). Parallel plan: memory allocation is per thread! Summary: hash join plan cost depends on memory if the IO component is not zero, in which case it is disproportionately lower with parallel plans. Does not reflect real cost?

81 Parallelism Repartition Streams
DOP 2 DOP 4 DOP 8

82 Bitmap BOL: Optimizing Data Warehouse Query Performance Through Bitmap Filtering A bitmap filter uses a compact representation of a set of values from a table in one part of the operator tree to filter rows from a second table in another part of the tree. Essentially, the filter performs a semi-join reduction; that is, only the rows in the second table that qualify for the join to the first table are processed. SQL Server uses the Bitmap operator to implement bitmap filtering in parallel query plans. Bitmap filtering speeds up query execution by eliminating rows with key values that cannot produce any join records before passing rows through another operator such as the Parallelism operator. By removing unnecessary rows early in the query, subsequent operators have fewer rows to work with, and the overall performance of the query improves. The optimizer determines when a bitmap is selective enough to be useful and in which operators to apply the filter. For more information, see Optimizing Data Warehouse Query Performance Through Bitmap Filtering.

83 Parallel Execution Plan Summary
Queries with high IO cost may show little plan cost reduction on parallel execution. Plans with a high portion of hash or sort cost show large parallel plan cost reduction. Parallel plans may be inhibited by a high row count in Parallelism Repartition Streams. Watch out for (parallel) Merge Joins!

84

85 Scaling Theory

86 Parallel Execution Strategy
Partition work into little pieces: ensures each thread has the same amount, but high overhead to coordinate. Partition into big pieces: may have uneven distribution between threads. Small table join to big table: one thread per row from the small table. Partitioned table options.

87 What Should Scale?
Trivially parallelizable: 1) split a large chunk of work among threads, 2) each thread works independently, 3) a small amount of coordination to consolidate the threads.

88 More Difficult?
Parallelizable: 1) split a large chunk of work among threads, 2) each thread works on the first stage, 3) large coordination effort between threads, 4) more work to consolidate.

89 Partitioned Tables vs. Regular Table
With partitioned tables there are no Repartition Streams operations!

90 Scaling Reality 8-way Quad-Core Opteron Windows Server 2008 R2
SQL Server 2008 SP1 + HF 27

91 Test Queries TPC-H SF 10 database
Standard, Compressed, Partitioned (30) Line Item Table SUM, 59M rows, 8.75GB Orders Table 15M rows

92 CPU-sec Standard CPU-sec to SUM 1 or 2 columns in Line Item Compressed

93 Speed Up Standard Compressed

94 Line Item sum 1 column CPU-sec Speed up relative to DOP 1

95 Line Item Sum w/Group By
CPU-sec Speedup

96 Hash Join CPU-sec Speedup

97 Key Lookup and Table Scan
CPU-sec 1.4M rows Speedup

98

99 Parallel Execution Summary
Contention in queries w/low cost per page Simple scan, High Cost per Page – improves scaling! Multiple Aggregates, Hash Join, Compression Table Partitioning – alternative query plans Loop Joins – broken at high DOP Merge Join – seriously broken (parallel)

100 Scaling DW Summary Massive IO bandwidth
Parallel options for data load, updates etc Investigate Parallel Execution Plans Scaling from DOP 1, 2, 4, 8, 16, 32 etc Scaling with and w/o HT Strategy for limiting DOP with multiple users
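One way to implement the "limit DOP with multiple users" strategy is Resource Governor (SQL Server 2008 Enterprise); a minimal sketch, with the group name and the classifier left as assumptions:

-- Cap parallelism for the reporting workload without changing the server-wide setting
CREATE WORKLOAD GROUP DWUsers WITH (MAX_DOP = 8);
-- A classifier function (not shown) would route the DW logins into DWUsers
ALTER RESOURCE GOVERNOR RECONFIGURE;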

101 Fixes from Microsoft Needed
Contention issues in parallel execution Table scan, Nested Loops Better plan cost model for scaling Back-off on parallelism if gain is negligible Fix throughput degradation with multiple users running big DW queries Sybase and Oracle, Throughput is close to Power or better

102 Test Systems

103 Test Systems
2-way quad-core Xeon 5430 2.66GHz: Windows Server 2008 R2, SQL 2008 R2. 8-way dual-core Opteron 2.8GHz: Windows Server 2008 SP1, SQL 2008 SP1. 8-way quad-core Opteron 2.7GHz (Barcelona): Windows Server 2008 R2, SQL 2008 SP1 build 2789. The 8-way systems were configured for AD – not good!

104 Test Methodology Boot with all processors
Run queries at MAXDOP 1, 2, 4, 8, etc Not the same as running on 1-way, 2-way, 4-way server Interpret results with caution
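The per-query DOP was presumably controlled with the MAXDOP hint rather than the server-wide setting; a sketch against the TPC-H Line Item table:

SELECT SUM(L_EXTENDEDPRICE)
FROM dbo.LINEITEM
OPTION (MAXDOP 4);   -- repeat at 1, 2, 8, 16, 32 to build the scaling curve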

105 References Search Adam Machanic PASS

106 SQL Server Scaling on Big Iron (NUMA) Systems
TPC-H Joe Chang © 2010 Elemental Inc. All rights reserved.

107 About Joe Chang SQL Server Execution Plan Cost Model
True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation

108 TPC-H

109 TPC-H
DSS – 22 queries, geometric mean; a 60X range in plan cost, and a comparable actual range. Power – single stream: tests the ability to scale parallel execution plans. Throughput – multiple streams. Scale Factor 1 – Line Item data is 1GB (875MB with DATE instead of DATETIME). Only single-column indexes are allowed; ad-hoc.

110 Observed Scaling Behaviors
Good scaling, leveling off at high DOP Perfect Scaling ??? Super Scaling Negative Scaling especially at high DOP Execution Plan change Completely different behavior

111

112 TPC-H Published Results

113 TPC-H SF 100GB 2-way Xeon 5355, 5570, 5680, Opt 6176
Among the 2-way Xeon 5570 configurations, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power. Westmere and Magny-Cours, both with 192GB memory, are very close.

114 TPC-H SF 300GB 8x QC/6C & 4x12C Opt,
6C Istanbul improved over 4C Shanghai by 45% in Power, 73% in Throughput, 59% overall. The 4x12C 2.3GHz improved 17% over the 8x6C 2.8GHz.

115 TPC-H SF 1000

116 TPC-H SF 3TB X7460 & X7560 Nehalem-EX 64 cores better than 96 Core 2.

117 TPC-H SF 100GB, 300GB & 3TB
SF100 2-way: Westmere and Magny-Cours are very close. Among the 2-way Xeon 5570 configurations, all are close: HDD has the best throughput, SATA SSD the best composite, and Fusion-IO the best power. SF300 8x QC/6C & 4x12C: 6C Istanbul improved over 4C Shanghai by 45% in Power, 73% in Throughput, 59% overall; the 4x12C 2.3GHz improved 17% over the 8x6C 2.8GHz. SF 3TB X7460 & X7560: Nehalem-EX 64 cores better than 96 Core 2 cores.

118 TPC-H Published Results
SQL Server excels in Power Limited by Geometric mean, anomalies Trails in Throughput Other DBMS get better throughput than power SQL Server throughput below Power by wide margin Speculation – SQL Server does not throttle back parallelism with load?

119 TPC-H SF100

Processors | Total Cores | SQL | GHz | Mem GB | SF | Power | Throughput | QphH
2 Xeon 5355 | 8 | 5sp2 | 2.66 | 64 | 100 | 23,378.0 | 13,381.0 | 17,686.7
2 Xeon 5570 (HDD) | 8 | 8sp1 | 2.93 | 144 | 100 | 67,712.9 | 38,019.1 | 50,738.4
2 Xeon 5680 | 12 | 8r2 | 3.33 | 192 | 100 | 99,426.3 | 55,038.2 | 73,974.6
2 Opt 6176 | 24 | 8r2 | 2.3 | 192 | 100 | 94,761.5 | 53,855.6 | 71,438.3
2 Xeon 5570 (SSD) | 8 | 8sp1 | 2.93 | 144 | 100 | 70,048.5 | 37,749.1 | 51,422.4
2 Xeon 5570 (Fusion-IO) | 8 | 8sp1 | 2.93 | 144 | 100 | 72,110.5 | 36,190.8 | 51,085.6

120 TPC-H SF300

Processors | Total Cores | SQL | GHz | Mem GB | SF | Power | Throughput | QphH
4 Opt 8220 | 8 | 5rtm | 2.8 | 128 | 300 | 25,206.4 | 13,283.8 | 18,298.5
8 Opt 8360 | 32 | 8rtm | 2.5 | 256 | 300 | 67,287.4 | 41,526.4 | 52,860.2
8 Opt 8384 | 32 | 8rtm | 2.7 | 256 | 300 | 75,161.2 | 44,271.9 | 57,684.7
8 Opt 8439 | 48 | 8sp1 | 2.8 | 256 | 300 | 109,067.1 | 76,869.0 | 91,558.2
4 Opt 6176 | 48 | 8r2 | 2.3 | 512 | 300 | 129,198.3 | 89,547.7 | 107,561.2
4 Xeon 7560 | 32 | – | 2.26 | 640 | 300 | 152,453.1 | 96,585.4 | 121,345.6

All of the above are HP results? Sun result: Opt 8384, sp1 – Power 67,095.6, Throughput 45,343.5, QphH 55,157.5.

121 TPC-H 1TB

Processors | Total Cores | SQL | GHz | Mem GB | SF | Power | Throughput | QphH
8 Opt 8439 | 48 | 8R2? | 2.8 | 512 | 1000 | 95,789.1 | 69,367.6 | 81,367.6
8 Opt 8439 | 48 | ASE | 2.8 | 384 | 1000 | 108,436.8 | 96,652.7 | 102,375.3
Itanium 9350 | 64 | O11R2 | 1.73 | 512 | 1000 | 139,181.0 | 141,188.1 | 140,181.1

122 TPC-H 3TB

Processors | Total Cores | SQL | GHz | Mem GB | SF | Power | Throughput | QphH
16 Xeon 7460 | 96 | 8r2 | 2.66 | 1024 | 3000 | 120,254.8 | 87,841.4 | 102,254.8
8 Xeon 7560 | 64 | 8r2 | 2.26 | 512 | 3000 | 185,297.7 | 142,685.6 | 162,601.7
Itanium 9350 | 64 | Sybase | 1.73 | 512 | 1000 | 142,790.7 | 171,607.4 | 156,537.3
POWER6 | 64 | Sybase | 5.0 | 512 | 3000 | 142,790.7 | 171,607.4 | 156,537.3

123 TPC-H Published Results
Processors | GHz | Total Cores | Mem GB | SQL | SF | Power | Throughput | QphH
2 Xeon 5355 | 2.66 | 8 | 64 | 5sp2 | 100 | 23,378 | 13,381 | 17,686.7
2 Xeon 5570 | 2.93 | 8 | 144 | 8sp1 | 100 | 72,110.5 | 36,190.8 | 51,085.6
2 Xeon 5680 | 3.33 | 12 | 192 | 8r2 | 100 | 99,426.3 | 55,038.2 | 73,974.6
2 Opt 6176 | 2.3 | 24 | 192 | 8r2 | 100 | 94,761.5 | 53,855.6 | 71,438.3
4 Opt 8220 | 2.8 | 8 | 128 | 5rtm | 300 | 25,206.4 | 13,283.8 | 18,298.5
8 Opt 8360 | 2.5 | 32 | 256 | 8rtm | 300 | 67,287.4 | 41,526.4 | 52,860.2
8 Opt 8384 | 2.7 | 32 | 256 | 8rtm | 300 | 75,161.2 | 44,271.9 | 57,684.7
8 Opt 8439 | 2.8 | 48 | 256 | 8sp1 | 300 | 109,067.1 | 76,869.0 | 91,558.2
4 Opt 6176 | 2.3 | 48 | 512 | 8r2 | 300 | 129,198.3 | 89,547.7 | 107,561.2
8 Xeon 7560 | 2.26 | 64 | 512 | 8r2 | 3000 | 185,297.7 | 142,685.6 | 162,601.7

124 SF100 2-way Big Queries (sec)
Query time in sec Xeon 5570 with SATA SSD poor on Q9, reason unknown Both Xeon 5680 and Opteron 6176 big improvement over Xeon 5570

125 SF100 Middle Q Query time in sec
Xeon 5570-HDD and 5680-SSD poor on Q12, reason unknown Opteron 6176 poor on Q11

126 SF100 Small Queries Query time in sec
Xeon 5680 and Opteron poor on Q20 Note limited scaling on Q2, & 17

127

128 SF300 32+ cores Big Queries Query time in sec
Opteron 6176 poor relative to 8439 on Q9 & 13, same number of total cores

129 SF300 Middle Q Query time in sec
Opteron 6176 much better than 8439 on Q11 & 19 Worse on Q12

130 SF300 Small Q Query time in sec
Opteron 6176 much better on Q2, even with 8439 on others

131

132 SF1000

133 SF1000

134 SF1000

135 SF1000

136 SF1000 Itanium - Superdome

137

138 SF 3TB – 8×7560 versus 16×7460 5.6X Broadly 50% faster overall, 5X+ on one, slower on 2, comparable on 3

139 64 cores, 7560 relative to PWR6

140

141

142

143 TPC-H Summary
Scaling is impressive on some SQL. Limited ability (and value) in scaling the small queries. Anomalies, negative scaling.

144 TPC-H Queries

145 Q1 Pricing Summary Report

146 Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)

147 Q3

148 Q6 Forecasting Revenue Change

149 Q7 Volume Shipping

150 Q8 National Market Share

151 Q9 Product Type Profit Measure

152 Q11 Important Stock Identification
Non-Parallel Parallel

153 Q12 Random IO?

154 Q13 Why does Q13 have perfect scaling?

155 Q17 Small Quantity Order Revenue

156 Q18 Large Volume Customer
Non-Parallel Parallel

157 Q19

158 Q20? Date functions are usually written as
Date functions are usually written as …, because the Line Item date columns are "date" type. CAST helps the DOP 1 plan, but gets a bad plan for parallel. This query may get a poor execution plan.

159 Q21 Suppliers Who Kept Orders Waiting
Note 3 references to Line Item

160 Q22

161

162 Joe Chang jchang6@yahoo.com www.qdpma.com
TPC-H Studies Joe Chang © 2010 Elemental Inc. All rights reserved.

163 About Joe Chang SQL Server Execution Plan Cost Model
True cost structure by system architecture Decoding statblob (distribution statistics) SQL Clone – statistics-only database Tools ExecStats – cross-reference index use by SQL-execution plan Performance Monitoring, Profiler/Trace aggregation

164 TPC-H

165 TPC-H
DSS – 22 queries, geometric mean; a 60X range in plan cost, and a comparable actual range. Power – single stream: tests the ability to scale parallel execution plans. Throughput – multiple streams. Scale Factor 1 – Line Item data is 1GB (875MB with DATE instead of DATETIME). Only single-column indexes are allowed; ad-hoc.

166 SF 10 test studies
Not valid for publication. Auto-statistics enabled; compile time excluded. Big queries – Line Item scan. Super scaling – Mission Impossible. Small queries & high parallelism. Other queries, negative scaling. Did not apply T2301, or disallow page locks.

167

168 Big Q: Plan Cost vs Actual
Memory affects the Hash IO onset. Plan cost reduction from DOP 1 to 16/32: Q1 28%, Q9 44%, Q18 70%, Q21 20% (plan at 10GB). Plan cost says scaling is poor except for Q18; Q18 & Q21 are > 3X Q1, Q9. Actual query time is in seconds. Plan cost is a poor indicator of true parallelism scaling.

169 Big Query: Speed Up and CPU
Holy grail: Q13 has slightly better than perfect scaling? In general, excellent scaling to DOP 8-24, weak afterwards. Speed up is relative to DOP 1. Poor DOP 32 might be resource contention – perhaps exclude SQL from 1 CPU. CPU time in seconds.

170 Super Scaling
Suppose at DOP 1 a query runs for 100 seconds with one CPU fully pegged: CPU time = 100 sec, elapsed time = 100 sec. What is the best case for DOP 2? Assuming nearly zero Repartition Streams cost: CPU time = 100 sec, elapsed time = 50? Super scaling: CPU time decreases going from the non-parallel to the parallel plan! No, I have not started drinking, yet.

171 Super Scaling CPU normalized to DOP 1 CPU-sec goes down from DOP 1 to 2 and higher (typically 8) 3.5X speedup from DOP 1 to 2 (Normalized to DOP 1) Speed up relative to DOP 1

172 CPU and Query time in seconds
CPU time Query time

173 Super Scaling Summary
Most probable cause: the Bitmap operator in the parallel plan. Bitmap filters are great. Question for Microsoft: can I use bitmap filters in OLTP systems with non-parallel plans?

174 Small Queries – Plan Cost vs Act
Query 3 and 16 have lower plan cost than Q17, but not included Plan Cost Q4,6,17 great scaling to DOP 4, then weak Negative scaling also occurs Query time

175 Small Queries CPU & Speedup
CPU time: what did I get for all that extra CPU? Interpretation: a sharp jump in CPU means poor scaling; a disproportionate jump means negative scaling. Speed up: Query 2 goes negative at DOP 2, Q4 is good, Q6 gets a speedup but at a CPU premium, Q17 and 20 go negative after DOP 8.

176 High Parallelism – Small Queries
Why? Almost no value: with TPC-H geometric mean scoring, small queries have as much impact as large ones (a linear sum would weight the large queries). OLTP with 32, 64+ cores: parallelism is good only if super-scaling. The default max degree of parallelism of 0 is seriously bad news, especially for small queries. Increase the cost threshold for parallelism? Sometimes you do get lucky.
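The two server-wide knobs mentioned above, as they would be changed (the values are examples, not recommendations):

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8;
EXEC sp_configure 'cost threshold for parallelism', 25;
RECONFIGURE;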

177 Q that go Negative Query time “Speedup”

178 CPU

179 Other Queries – CPU & Speedup
CPU time Q3 has problems beyond DOP 2 Speedup

180 Other - Query Time seconds

181 Scaling Summary Some queries show excellent scaling
Super-scaling, better than 2X Sharp CPU jump on last DOP doubling Need strategy to cap DOP To limit negative scaling Especially for some smaller queries? Other anomalies

182

183 Compression PAGE

184 Compression Overhead - Overall
Query time compressed relative to uncompressed 40% overhead for compression at low DOP, 10% overhead at max DOP??? CPU time compressed relative to uncompressed

185 Query time compressed relative to uncompressed
CPU time compressed relative to uncompressed

186 Compressed Table LINEITEM – real data may be more compressible
Uncompressed: 8,749,760KB, Average Bytes per row: 149 Compressed: 4,819,592KB, Average Bytes per row: 82
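For reference, a sketch of how PAGE compression would be applied to that table (SQL Server 2008+, Enterprise Edition; table name from the test database):

-- Estimate savings first, then rebuild the table and its indexes compressed
EXEC sp_estimate_data_compression_savings 'dbo', 'LINEITEM', NULL, NULL, 'PAGE';
ALTER TABLE dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);
ALTER INDEX ALL ON dbo.LINEITEM REBUILD WITH (DATA_COMPRESSION = PAGE);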

187 Orders and Line Item on Order Key
Partitioning Orders and Line Item on Order Key

188 Partitioning Impact - Overall
Query time partitioned relative to not partitioned CPU time partitioned relative to not partitioned

189 Query time partitioned relative to not partitioned
CPU time partitioned relative to not partitioned

190 Plan for Partitioned Tables

191

192 Scaling DW Summary Massive IO bandwidth
Parallel options for data load, updates etc Investigate Parallel Execution Plans Scaling from DOP 1, 2, 4, 8, 16, 32 etc Scaling with and w/o HT Strategy for limiting DOP with multiple users

193 Fixes from Microsoft Needed
Contention issues in parallel execution Table scan, Nested Loops Better plan cost model for scaling Back-off on parallelism if gain is negligible Fix throughput degradation with multiple users running big DW queries Sybase and Oracle, Throughput is close to Power or better

194 Query Plans

195 Big Queries

196 Q1 Pricing Summary Report

197 Q1 Plan
Non-Parallel and Parallel plans. The parallel plan cost is 28% lower than the scalar (non-parallel) plan; IO is 70% of the cost, and there is no parallel plan cost reduction for IO.

198

199 Q9 Product Type Profit Measure
IO from 4 tables contribute 58% of plan cost, parallel plan is 39% lower Non-Parallel Parallel

200 Q9 Non-Parallel Plan Table/Index Scans comprise 64%, IO from 4 tables contribute 58% of plan cost Join sequence: Supplier, (Part, PartSupp), Line Item, Orders

201 Q9 Parallel Plan Non-Parallel: (Supplier), (Part, PartSupp), Line Item, Orders Parallel: Nation, Supplier, (Part, Line Item), Orders, PartSupp

202 Q9 Non-Parallel Plan details
Table Scans comprise 64%, IO from 4 tables contribute 58% of plan cost

203 Q9 Parallel reg vs Partitioned

204

205 Q13 Why does Q13 have perfect scaling?

206

207 Q18 Large Volume Customer
Non-Parallel Parallel

208 Q18 Graphical Plan Non-Parallel Plan: 66% of cost in Hash Match, reduced to 5% in Parallel Plan

209 Q18 Plan Details Non-Parallel Parallel Non-Parallel Plan Hash Match cost is 1245 IO, CPU DOP 16/32: size is below IO threshold, CPU reduced by >10X

210

211 Q21 Suppliers Who Kept Orders Waiting
Non-Parallel Parallel Note 3 references to Line Item

212 Q21 Non-Parallel Plan H3 H2 H1 H3 H2 H1

213 Q21 Parallel

214 Q21 3 full Line Item clustered index scans
Plan cost is approx 3X Q1, single “scan”

215 Super Scaling

216 Q7 Volume Shipping Non-Parallel Parallel

217 Q7 Non-Parallel Plan Join sequence: Nation, Customer, Orders, Line Item

218 Q7 Parallel Plan Join sequence: Nation, Customer, Orders, Line Item

219

220 Q8 National Market Share
Non-Parallel Parallel

221 Q8 Non-Parallel Plan Join sequence: Part, Line Item, Orders, Customer

222 Q8 Parallel Plan Join sequence: Part, Line Item, Orders, Customer

223

224 Q11 Important Stock Identification
Non-Parallel Parallel

225 Q11 Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp

226 Q11 Join sequence: A) Nation, Supplier, PartSupp, B) Nation, Supplier, PartSupp

227 Small Queries

228 Query 2 Minimum Cost Supplier
Wordy, but only touches the small tables, second lowest plan cost (Q15)

229 Q2 Clustered Index Scan on Part and PartSupp have highest cost (48%+42%)

230 Q2 PartSupp is now Index Scan + Key Lookup

231

232 Q6 Forecasting Revenue Change
Not sure why this blows CPU; scalar values are pre-computed and pre-converted.

233

234 Q20? Date functions are usually written as
Date functions are usually written as …, because the Line Item date columns are "date" type. CAST helps the DOP 1 plan, but gets a bad plan for parallel. This query may get a poor execution plan.
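A hedged sketch of the two predicate styles the slide contrasts; the actual Q20 text is not reproduced here, and '1994-01-01' is just a typical TPC-H-style parameter:

-- Usual form against a date column
SELECT COUNT(*) FROM dbo.LINEITEM
WHERE L_SHIPDATE >= '1994-01-01'
  AND L_SHIPDATE <  DATEADD(YEAR, 1, '1994-01-01');

-- With explicit CASTs to date, which the deck notes helps the DOP 1 plan
-- but can produce a bad parallel plan
SELECT COUNT(*) FROM dbo.LINEITEM
WHERE L_SHIPDATE >= CAST('1994-01-01' AS date)
  AND L_SHIPDATE <  CAST(DATEADD(YEAR, 1, '1994-01-01') AS date);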

235 Q20

236 Q20

237 Q20 alternate - parallel Statistics estimation error here
Penalty for mistake applied here

238 Other Queries

239 Q3

240 Q3

241

242 Q12 Random IO? Will this generate random IO?

243 Query 12 Plans Non-Parallel Parallel

244 Queries that go Negative

245 Q17 Small Quantity Order Revenue

246 Q17 Table Spool is concern

247 Q17 the usual suspects

248

249 Q19

250 Q19

251 Q22

252 Q22

253 Speedup from DOP 1 query time
CPU relative to DOP 1

254

