Auto-tuning Memory Intensive Kernels for Multicore


1 Auto-tuning Memory Intensive Kernels for Multicore
Sam Williams

2 Outline
Challenges arising from Optimizing Single Thread Performance
New Challenges Arising when Optimizing Multicore SMP Performance
Performance Modeling and Little's Law
Multicore SMPs of Interest
Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Summary

3 Challenges arising from Optimizing Single Thread Performance

4 Instruction-Level Parallelism
On modern pipelined architectures, operations (like floating-point addition) have a latency of 4-6 cycles (until the result is ready). However, independent adds can be pipelined one after another. Although this increases the peak flop rate, one can only achieve peak flops on the condition that on any given cycle the program has >4 independent adds ready to execute. Failing to do so will result in a >4x drop in performance. The problem is exacerbated by superscalar or VLIW architectures like POWER or Itanium. One must often reorganize kernels to express more instruction-level parallelism.

5 ILP Example (1x1 BCSR) Consider the core of SpMV
No ILP in the inner loop; out-of-order execution can't accelerate serial FMAs.
for (all rows) {
  y0 = 0.0;
  for (all tiles in this row) {
    y0 += V[i] * X[C[i]];
  }
  y[r] = y0;
}
(1x1 register block. Timing diagram: the dependent FMAs issue serially, one every 4 cycles: time = 0, 4, 8, 12, 16.)

6 ILP Example (1x4 BCSR) What about 1x4 BCSR ?
Still no ILP in the inner loop; the FMAs are still dependent on each other.
for (all rows) {
  y0 = 0.0;
  for (all tiles in this row) {
    y0 += V[i  ] * X[C[i]  ];
    y0 += V[i+1] * X[C[i]+1];
    y0 += V[i+2] * X[C[i]+2];
    y0 += V[i+3] * X[C[i]+3];
  }
  y[r] = y0;
}
(1x4 register block. Timing diagram: dependent FMAs still issue one every 4 cycles: time = 0, 4, 8, 12.)

7 ILP Example (4x1 BCSR) What about 4x1 BCSR ? Updating 4 different rows
The 4 FMAs are independent, so they can be pipelined.
for (all rows) {
  y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
  for (all tiles in this row) {
    y0 += V[i  ] * X[C[i]];
    y1 += V[i+1] * X[C[i]];
    y2 += V[i+2] * X[C[i]];
    y3 += V[i+3] * X[C[i]];
  }
  y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
}
(4x1 register block. Timing diagram: independent FMAs issue back to back, one per cycle: time = 0, 1, 2, ..., 7.)

8 Data-level Parallelism
DLP = apply the same operation to multiple independent operands. Today, rather than relying on superscalar issue, many architectures have adopted SIMD as an efficient means of boosting peak performance (SSE, Double Hummer, AltiVec, Cell, GPUs, etc.). Typically these instructions operate on four single-precision (or two double-precision) numbers at a time. However, some are wider: GPUs (32), Larrabee (16), and AVX (8). Failing to use these instructions may cause a 2-32x drop in performance. Unfortunately, most compilers utterly fail to generate these instructions.
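As a hedged illustration (my own sketch, not code from the slides), a minimal SSE2 example of an explicitly SIMDized 2-wide double-precision add using compiler intrinsics:

  #include <emmintrin.h>  /* SSE2 intrinsics */

  /* c[i] = a[i] + b[i], two doubles per instruction.  Assumes n is even
     and the arrays are 16-byte aligned. */
  void vec_add(const double *a, const double *b, double *c, int n) {
      for (int i = 0; i < n; i += 2) {
          __m128d va = _mm_load_pd(&a[i]);
          __m128d vb = _mm_load_pd(&b[i]);
          _mm_store_pd(&c[i], _mm_add_pd(va, vb));
      }
  }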

9 Memory-Level Parallelism (1)
Although caches may filter many memory requests, in HPC many memory references will still go all the way to DRAM. Memory latency (as measured in core cycles) grew by an order of magnitude in the 90's. Today, the latency of a memory operation can exceed 200 cycles (1 double every 80 ns is unacceptably slow). Like ILP, we wish to pipeline requests to DRAM. Several solutions exist today: HW stream prefetchers, HW multithreading (e.g. hyperthreading), SW line prefetch, and DMA.
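As a hedged sketch of the software line prefetch option (GCC-style; the prefetch distance is an assumed tuning parameter, not a value from the slides):

  /* Stream through a[], prefetching a line 'DIST' iterations ahead.
     DIST is tunable: too small hides no latency, too large evicts data
     before it is used. */
  #define DIST 64
  double sum_stream(const double *a, int n) {
      double sum = 0.0;
      for (int i = 0; i < n; i++) {
          __builtin_prefetch(&a[i + DIST], 0, 0);  /* read, low temporal locality */
          sum += a[i];
      }
      return sum;
  }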

10 Memory-Level Parallelism (2)
HW stream prefetchers are by far the easiest to implement and exploit. They detect a series of consecutive cache misses and speculate that the next addresses in the series will be needed. They then prefetch that data into the cache or a dedicated buffer. To effectively exploit a HW prefetcher, ensure your array references access hundreds of consecutive addresses, e.g. read A[i]...A[i+255] without any jumps or discontinuities. This limits the effectiveness (shape) of the cache blocking you implemented in HW1, as you accessed: A[(j+0)*N+i]...A[(j+0)*N+i+B], jump, A[(j+1)*N+i]...A[(j+1)*N+i+B], jump, A[(j+2)*N+i]...A[(j+2)*N+i+B], jump.

11 Cache Subtleties Set-associative caches have a limited number of sets (S) and ways (W), the product of which is the capacity (in cache lines). As seen in HW1, it can be beneficial to reorganize kernels to reduce the working set size and eliminate capacity misses. Conflict misses can severely impair performance and can be very challenging to identify and eliminate. A given address may only be placed in W different locations in the cache. Poor access patterns or roughly power-of-two problem sizes can be especially bad: too many addresses map to the same set, not all of them can be kept in the cache, and some must be evicted. Padding arrays (problem sizes) or skewing the access pattern can eliminate conflict misses.

12 Array padding Example Padding changes the data layout
Consider a large matrix whose leading dimension is roughly a power of two: double A[N][M]; // column major with M ~ pow2. A[i][j] and A[i+1][j] will likely be mapped to the same set. We can pad each column with a couple of extra rows: double A[N][M+pad]; Such techniques are applicable in many other domains (stencils, lattice-Boltzmann methods, etc.).
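A minimal sketch of the idea (the sizes and pad amount are illustrative assumptions, not values from the slides):

  /* Pad the leading dimension so that A[i][j] and A[i+1][j] no longer
     map to the same cache set. */
  #define N   1024
  #define M   1024        /* power of two: pathological for set-associative caches */
  #define PAD 8           /* illustrative padding, in doubles */

  static double A[N][M + PAD];

  double column_sum(int j) {
      double sum = 0.0;
      for (int i = 0; i < N; i++)
          sum += A[i][j]; /* stride is now (M+PAD)*8 bytes, not a power of two */
      return sum;
  }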

13 New Challenges Arising when Optimizing Multicore SMP Performance

14 What are SMPs ? SMP = shared memory parallel.
Multiple chips (typically < 32 threads) can address any location in a large shared memory through a network or bus. Caches are almost universally coherent. You can still run MPI on an SMP, but you trade free (you always pay for it) cache-coherency traffic for additional memory traffic (for explicit communication), and you trade user-level function calls for system calls. Alternatively, you use a SPMD threading model (pthreads, OpenMP, UPC). If communication between cores or threads is significant, then threaded implementations win out. As the computation:communication ratio increases, MPI asymptotically approaches threaded implementations.

15 What is multicore ? What are multicore SMPs ?
Today, multiple cores are integrated on the same chip, almost universally in an SMP fashion. For convenience, programming multicore SMPs is indistinguishable from programming multi-socket SMPs (an easy transition). Multiple cores can share memory controllers, caches, and occasionally FPUs. Although there was a graceful transition from multiple sockets to multiple cores from the point of view of correctness, achieving good performance can be incredibly challenging.

16 Affinity We may wish one pair of threads to share a cache, but be disjoint from another pair of threads. We can control the mapping of threads to Linux processors, but the mapping of Linux processors to physical cores/sockets is machine/OS dependent. Inspect /proc/cpuinfo or use PLPA. (Diagram: example mapping of threads to Linux processors.)
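As a hedged, Linux/GNU-specific sketch (not an API prescribed by the talk), pinning the calling thread to a given Linux processor looks like:

  #define _GNU_SOURCE
  #include <sched.h>
  #include <pthread.h>

  /* Pin the calling thread to one Linux processor.  Which physical
     core/socket that processor number maps to is machine/OS dependent
     (check /proc/cpuinfo). */
  int pin_to_cpu(int cpu) {
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(cpu, &mask);
      return pthread_setaffinity_np(pthread_self(), sizeof(mask), &mask);
  }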

17 NUMA Challenges Recent multicore SMPs have integrated the memory controllers on chip. As a result, memory access is non-uniform (NUMA). That is, the bandwidth to read a given address varies dramatically among cores. Exploit NUMA (affinity + first touch) when you malloc/init data.

18 Implicit allocation for NUMA
Consider an OpenMP example of implicit NUMA-aware initialization:
#pragma omp parallel for
for (j=0; j<N; j++) {
  a[j] = 1.0;
  b[j] = 2.0;
  c[j] = 0.0;
}
The first accesses to the arrays (read or write) must be parallelized. DO NOT TOUCH BETWEEN MALLOC AND INIT. When the for loop is parallelized, each thread initializes a range of j. This exploits the OS's first-touch policy. It relies on the assumption that OpenMP maps threads correctly.
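A slightly fuller hedged sketch of the same pattern (names and sizes are illustrative): allocate, then let each thread perform the first touch on its own range before any serial code touches the array.

  #include <stdlib.h>

  double *alloc_first_touch(long n) {
      double *a = malloc((size_t)n * sizeof(double)); /* pages not placed yet */
      #pragma omp parallel for schedule(static)
      for (long i = 0; i < n; i++)
          a[i] = 0.0;  /* first touch places each page near the touching thread */
      return a;
  }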

19 New Cache Challenges Shared caches + SPMD programming models can exacerbate conflict misses. Individually, threads may produce significant cache associativity pressure based on their access patterns (power-of-two problem sizes). Collectively, threads may produce excessive cache associativity pressure (power-of-two problem sizes decomposed among a power-of-two number of threads). This can be much harder to diagnose and correct. The problem arises whether using MPI or a threaded model.

20 New Memory Challenges The number of memory controllers and the bandwidth on multicore SMPs is growing much more slowly than the number of cores. Codes are becoming increasingly memory-bound, as a fraction of the cores can saturate a socket's memory bandwidth. Multicore has traded bit- or word-level parallelism for thread-level parallelism. However, main memory is still built from bit-parallel devices (DIMMs). We must restructure memory-intensive apps to match the bit-parallel nature of DIMMs (sequential access).

21 Synchronization Using multiple concurrent threads can create ordering and race errors. Locks are one solution; one must balance granularity and frequency. An SPMD programming model + barriers is often a better/simpler solution. Spin barriers can be orders of magnitude faster than pthread library barriers; a sketch follows.
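As a hedged illustration (my own code, not from the talk), a centralized sense-reversing spin barrier built on C11 atomics:

  #include <stdatomic.h>

  typedef struct {
      atomic_int count;   /* threads arrived in this episode */
      atomic_int sense;   /* flips each episode */
      int nthreads;
  } spin_barrier_t;

  void spin_barrier_init(spin_barrier_t *b, int n) {
      atomic_init(&b->count, 0);
      atomic_init(&b->sense, 0);
      b->nthreads = n;
  }

  /* Each thread keeps its own local_sense, initialized to 1. */
  void spin_barrier_wait(spin_barrier_t *b, int *local_sense) {
      int my_sense = *local_sense;
      if (atomic_fetch_add(&b->count, 1) == b->nthreads - 1) {
          atomic_store(&b->count, 0);        /* last arriver resets the count...  */
          atomic_store(&b->sense, my_sense); /* ...and releases the other threads */
      } else {
          while (atomic_load(&b->sense) != my_sense)
              ;                              /* spin */
      }
      *local_sense = 1 - my_sense;           /* flip for the next barrier episode */
  }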

22 Performance Modeling and Little’s Law

23 System Abstraction Abstractly describe any system (or subsystem) as a combination of black-boxed storage, computational units, and the bandwidth between them. These can be hierarchically composed. A volume of data must be transferred from the storage component, processed, and another volume of data must be returned. Consider the basic parameters governing the performance of the channel: bandwidth, latency, and concurrency. Bandwidth can be measured in GB/s, GFlop/s, MIPS, etc. Latency can be measured in seconds, cycles, etc. Concurrency is the volume in flight across the channel, and can be measured in bytes, cache lines, operations, instructions, etc. (Diagram: DRAM, CPU cache, core register file, functional units.)

24 Little's Law Little's law relates concurrency, bandwidth, and latency.
To achieve peak bandwidth, one must satisfy: Concurrency = Latency * Bandwidth. For example, a memory controller with 20 GB/s of bandwidth and 100 ns of latency requires the CPU to express 2 KB of concurrency (memory-level parallelism). Similarly, given an expressed concurrency, one can bound attained performance: BW_attained = min( Concurrency_expressed / Latency, BW_max ). That is, as more concurrency is injected, we get progressively better performance, up to the bandwidth limit. Note, this assumes continual, pipelined accesses.

25 Where’s the bottleneck?
We've described the bandwidths DRAM to CPU, cache to core, and register file to functional units. But in an application, one of these may be a performance-limiting bottleneck. We can take any pair and compare how quickly data can be transferred to how quickly it can be processed to determine the bottleneck.

26 Arithmetic Intensity
(Figure: spectrum of arithmetic intensity from O(1) through O(log N) to O(N), spanning SpMV, BLAS1/2, FFTs, stencils (PDEs), dense linear algebra (BLAS3), lattice methods, and particle methods.)
Consider the first case (DRAM-CPU). True arithmetic intensity (AI) ~ total flops / total DRAM bytes. Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality), while for others it remains constant. Arithmetic intensity is ultimately limited by compulsory traffic and is diminished by conflict or capacity misses.

27 Kernel Arithmetic Intensity and Architecture
For a given architecture, one may calculate its flop:byte ratio. For a 2.3 GHz quad-core Opteron: 1 SIMD add + 1 SIMD multiply per cycle per core gives 36.8 GFlop/s, and with 12.8 GB/s of DRAM bandwidth, 36.8 / 12.8 ~ 2.9 flops per byte. When a kernel's arithmetic intensity is substantially less than the architecture's flop:byte ratio, transferring data will take longer than computing on it: memory-bound. When a kernel's arithmetic intensity is substantially greater than the architecture's flop:byte ratio, computation will take longer than data transfers: compute-bound.

28 Memory Traffic Definition
Total bytes to/from DRAM. This can be categorized into compulsory misses, capacity misses, conflict misses, and write allocations. The definition is oblivious to any lack of sub-cache-line spatial locality.

29 Roofline Model Basic Concept
Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound and bottleneck analysis. Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance. Moreover, it provides insight as to which optimizations will potentially be beneficial.
Attainable Performance_ij = min( FLOP/s with Optimizations_1..i , AI * Bandwidth with Optimizations_1..j )
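As a hedged sketch (illustrative names, not code from the talk), the bound is just the minimum of a compute ceiling and a bandwidth line:

  /* Roofline bound: attainable GFLOP/s for a kernel of arithmetic
     intensity 'ai' (flops per DRAM byte) on a machine with the given
     compute ceiling (GFLOP/s) and bandwidth ceiling (GB/s). */
  double roofline_bound(double ai, double peak_gflops, double stream_gbs) {
      double bw_limit = ai * stream_gbs;   /* bandwidth-limited region */
      return bw_limit < peak_gflops ? bw_limit : peak_gflops;
  }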

30 Roofline Model Basic Concept
Plot on a log-log scale. Given AI, we can easily bound performance. But architectures are much more complicated; we will bound performance as we eliminate specific forms of in-core parallelism.
(Roofline figure: Opteron 2356 (Barcelona), attainable GFLOP/s vs. actual flop:byte ratio, with peak DP and Stream Bandwidth as the ceilings.)

31 Roofline Model computational ceilings
Opterons have dedicated multipliers and adders. If the code is dominated by adds, then attainable performance is half of peak. We call these ceilings; they act like constraints on performance.
(Roofline figure: adds a mul/add imbalance ceiling below peak DP.)

32 Roofline Model computational ceilings
Opterons have 128-bit datapaths. If instructions aren't SIMDized, attainable performance will be halved.
(Roofline figure: adds a w/out SIMD ceiling.)

33 Roofline Model computational ceilings
On Opterons, floating-point instructions have a 4-cycle latency. If we don't express 4-way ILP, performance will drop by as much as 4x.
(Roofline figure: adds a w/out ILP ceiling.)

34 Roofline Model communication ceilings
We can perform a similar exercise taking away parallelism from the memory subsystem.
(Roofline figure: Stream Bandwidth is the topmost bandwidth diagonal.)

35 Roofline Model communication ceilings
Explicit software prefetch instructions are required to achieve peak bandwidth.
(Roofline figure: adds a w/out SW prefetch bandwidth ceiling.)

36 Roofline Model communication ceilings
Opterons are NUMA. As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth. We could continue this by examining strided or random memory access patterns.
(Roofline figure: adds a w/out NUMA bandwidth ceiling.)

37 Roofline Model computation + communication ceilings
We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
(Roofline figure: all ceilings shown: peak DP, mul/add imbalance, w/out SIMD, w/out ILP; Stream Bandwidth, w/out SW prefetch, w/out NUMA.)

38 Roofline Model locality walls
Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
AI = FLOPs / compulsory miss traffic
(Roofline figure: adds a vertical locality wall at the AI defined by compulsory misses only.)

39 Roofline Model locality walls
Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
AI = FLOPs / (write allocation + compulsory miss traffic)
(Roofline figure: adds a wall for write allocation traffic.)

40 Roofline Model locality walls
Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
AI = FLOPs / (capacity + write allocation + compulsory miss traffic)
(Roofline figure: adds a wall for capacity miss traffic.)

41 Roofline Model locality walls
Remember, memory traffic includes more than just compulsory misses. As such, the actual arithmetic intensity may be substantially lower. Walls are unique to the architecture-kernel combination.
AI = FLOPs / (conflict + capacity + write allocation + compulsory miss traffic)
(Roofline figure: adds a wall for conflict miss traffic.)

42 Optimization Categorization
Maximizing (attained) In-core Performance
Maximizing (attained) Memory Bandwidth
Minimizing (total) Memory Traffic

43 Optimization Categorization
Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance.
Maximizing Memory Bandwidth.
Minimizing Memory Traffic.

44 Optimization Categorization
Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reordering, eliminating branches.
Maximizing Memory Bandwidth.
Minimizing Memory Traffic.

45 Optimization Categorization
Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reordering, eliminating branches.
Maximizing Memory Bandwidth: exploit NUMA; hide memory latency; satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
Minimizing Memory Traffic.

46 Optimization Categorization
Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reordering, eliminating branches.
Maximizing Memory Bandwidth: exploit NUMA; hide memory latency; satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
Minimizing Memory Traffic: eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior. Techniques: cache blocking, array padding, compressing data, streaming stores.

47 Optimization Categorization
Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reordering, eliminating branches.
Maximizing Memory Bandwidth: exploit NUMA; hide memory latency; satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
Minimizing Memory Traffic: eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior. Techniques: cache blocking, array padding, compressing data, streaming stores.

48 Roofline Model locality walls
Optimizations remove these walls and ceilings, which act to constrain performance.
(Roofline figure: all in-core ceilings, bandwidth ceilings, and locality walls shown.)

49 Roofline Model locality walls
Optimizations remove these walls and ceilings, which act to constrain performance.
(Roofline figure: conflict, capacity, and write-allocation walls removed; only compulsory miss traffic remains.)

50 Roofline Model locality walls
Optimizations remove these walls and ceilings, which act to constrain performance.
(Roofline figure: in-core ceilings removed; peak DP, Stream Bandwidth, w/out SW prefetch, w/out NUMA, and the compulsory-only wall remain.)

51 Roofline Model locality walls
Optimizations remove these walls and ceilings, which act to constrain performance.
(Roofline figure: only peak DP, Stream Bandwidth, and the compulsory-only wall remain.)

52 Optimization Categorization
Maximizing In-core Performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reordering, eliminating branches.
Maximizing Memory Bandwidth: exploit NUMA; hide memory latency; satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
Minimizing Memory Traffic: eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior. Techniques: cache blocking, array padding, compressing data, streaming stores.
Each optimization has a large parameter space. What are the optimal parameters? Note the difference between 'optimization' and 'parameter'.

53 Auto-tuning? Provides performance portability across the existing breadth and evolution of microprocessors. The one-time, up-front productivity cost is amortized by the number of machines it's used on. Auto-tuning does not invent new optimizations. Auto-tuning automates the code generation and the exploration of the optimization and parameter space. Two components: a parameterized code generator (we wrote ours in Perl) and an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search). Can be extended with ISA-specific optimizations (e.g. DMA, SIMD).

54 Multicore SMPs of Interest
(used throughout the rest of the talk)

55 Multicore SMPs Used Intel Xeon E5345 (Clovertown)
AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade Note: superscalar, multithreaded, and heterogeneous designs.

56 Multicore SMPs Used (Conventional cache-based memory hierarchy)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade

57 Multicore SMPs Used (local store-based memory hierarchy)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade

58 Multicore SMPs Used (CMT = Chip-MultiThreading)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade

59 Multicore SMPs Used (threads)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) 8 threads 8 threads Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade 128 threads 16* threads *SPEs only

60 Multicore SMPs Used (peak double precision flops)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) 75 GFlop/s 74 Gflop/s Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade 19 GFlop/s 29* GFlop/s *SPEs only

61 Multicore SMPs Used (total DRAM bandwidth)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) 21 GB/s (read) 10 GB/s (write) 21 GB/s Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade 42 GB/s (read) 21 GB/s (write) 51 GB/s *SPEs only

62 Multicore SMPs Used (Non-Uniform Memory Access - NUMA)
Intel Xeon E5345 (Clovertown) AMD Opteron 2356 (Barcelona) Sun T2+ T5140 (Victoria Falls) IBM QS20 Cell Blade *SPEs only

63 Roofline Model for these multicore SMPs
Note, the multithreaded Niagara is limited by the instruction mix rather than a lack of expressed in-core parallelism. Clearly some architectures are more dependent on bandwidth optimizations while others depend on in-core optimizations.
(Roofline figures, attainable GFLOP/s vs. actual flop:byte ratio: Xeon E5345 (Clovertown): peak DP, mul/add imbalance, w/out SIMD, w/out ILP, bandwidth on small vs. large datasets. Opteron 2356 (Barcelona): peak DP, mul/add imbalance, w/out SIMD, w/out ILP, Stream Bandwidth, w/out SW prefetch, w/out NUMA. UltraSparc T2+ T5140 (Victoria Falls): peak DP, 25%/12%/6% FP instruction mix, Stream Bandwidth, w/out SW prefetch, w/out NUMA. QS20 Cell Blade (PPEs): peak DP, w/out FMA, w/out ILP, Stream Bandwidth, w/out SW prefetch, w/out NUMA. QS20 Cell Blade (SPEs): peak DP, w/out SIMD, w/out ILP, w/out FMA, Stream Bandwidth, misaligned DMA, w/out NUMA.)

64 Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

65 Sparse Matrix Vector Multiplication
What's a sparse matrix? Most entries are 0.0; there is a performance advantage in only storing/operating on the nonzeros, but it requires significant metadata to reconstruct the matrix structure. What's SpMV? Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors. Challenges: very low arithmetic intensity (often < 0.166 flops/byte); difficult to exploit ILP (bad for pipelined or superscalar); difficult to exploit DLP (bad for SIMD).
CSR reference code:
for (r=0; r<A.rows; r++) {
  double y0 = 0.0;
  for (i=A.rowStart[r]; i<A.rowStart[r+1]; i++) {
    y0 += A.val[i] * x[A.col[i]];
  }
  y[r] = y0;
}
(Figure: (a) algebra conceptualization y = Ax; (b) CSR data structure: A.val[], A.rowStart[], A.col[].)

66 The Dataset (matrices)
Unlike dense BLAS, performance is dictated by sparsity. Suite of 14 matrices, all bigger than the caches of our SMPs. We'll also include a median performance number. The suite spans: a 2K x 2K dense matrix stored in sparse format; well-structured matrices (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology; a poorly structured hodgepodge: FEM/Accelerator, Circuit, webbase; and an extreme-aspect-ratio linear programming matrix (LP).

67 SpMV Performance (simple parallelization)
Out-of-the-box SpMV performance on a suite of 14 matrices. Scalability isn't great. Is this performance good? NOTE: on Cell, I'm just using the PPEs (for portability). (Chart series: Naïve, Naïve Pthreads.)

68 NUMA for SpMV On NUMA architectures, all large arrays should be partitioned either explicitly (multiple malloc()'s + affinity) or implicitly (parallelize the initialization and rely on first touch). You cannot partition at granularities smaller than the page size: 512 elements per page on x86, 2M elements per page on Niagara. For SpMV, partition the matrix and perform multiple malloc()'s. Pin submatrices so they are co-located with the cores tasked to process them; a hedged sketch follows.
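One way to realize the explicit variant is libnuma's node-targeted allocation (an illustrative sketch, not an API prescribed by the talk):

  #include <stddef.h>
  #include <numa.h>   /* link with -lnuma */

  /* Place one thread's submatrix values on the NUMA node that will
     process it.  The caller and the node mapping are assumptions. */
  double *alloc_values_on_node(size_t nnz, int node) {
      return numa_alloc_onnode(nnz * sizeof(double), node);
  }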

69 Prefetch for SpMV SW prefetch injects more MLP into the memory subsystem. One can try to prefetch the values, the indices, the source vector, or any combination thereof. In general, one should insert only one prefetch per cache line (this works best on unrolled code).
for (all rows) {
  y0 = 0.0; y1 = 0.0; y2 = 0.0; y3 = 0.0;
  for (all tiles in this row) {
    PREFETCH(V + PFDistance);
    y0 += V[i  ] * X[C[i]];
    y1 += V[i+1] * X[C[i]];
    y2 += V[i+2] * X[C[i]];
    y3 += V[i+3] * X[C[i]];
  }
  y[r+0] = y0; y[r+1] = y1; y[r+2] = y2; y[r+3] = y3;
}

70 SpMV Performance (NUMA and Software Prefetching)
NUMA-aware allocation is essential on memory-bound NUMA SMPs. Explicit software prefetching can boost bandwidth and change cache replacement policies. Cell PPEs are likely latency-limited. An exhaustive search was used here.

71 ILP/DLP vs Bandwidth In the multicore era, which is the bigger issue?
a lack of ILP/DLP (a major advantage of BCSR), or insufficient memory bandwidth per core? On many architectures, when running low-arithmetic-intensity kernels, there is so little available memory bandwidth per core that you won't notice a complete lack of ILP. Perhaps we should concentrate on minimizing memory traffic rather than maximizing ILP/DLP. Rather than benchmarking every combination, just select the register blocking that minimizes the matrix footprint; see the sketch below.
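A hedged sketch of that heuristic (the helper blocked_nnz() is a placeholder the real tuner would compute by scanning the matrix; the names, index sizes, and blocking limits are assumptions):

  #include <stddef.h>

  extern size_t blocked_nnz(int r, int c);  /* stored entries after zero fill-in */

  /* Pick the r x c register blocking that minimizes the BCSR footprint
     (values + column indices + row pointers) instead of benchmarking
     every combination. */
  void pick_blocking(int nrows, int *best_r, int *best_c) {
      size_t best_bytes = (size_t)-1;
      for (int r = 1; r <= 4; r++)
          for (int c = 1; c <= 4; c++) {
              size_t nnz   = blocked_nnz(r, c);
              size_t bytes = nnz * sizeof(double)                  /* values         */
                           + (nnz / ((size_t)r * c)) * sizeof(int) /* column indices */
                           + ((size_t)nrows / r + 1) * sizeof(int);/* row pointers   */
              if (bytes < best_bytes) {
                  best_bytes = bytes; *best_r = r; *best_c = c;
              }
          }
  }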

72 SpMV Performance (Matrix Compression)
After maximizing memory bandwidth, the only hope is to minimize memory traffic. Exploit register blocking, other formats, and smaller indices. Use a traffic-minimization heuristic rather than search. The benefit is clearly matrix-dependent. Register blocking also enables efficient software prefetching (one prefetch per cache line).

73 Cache blocking for SpMV
Cache-blocking sparse matrices is very different from cache-blocking dense matrices. Rather than changing loop bounds, we store entire submatrices contiguously. The columns spanned by each cache block are selected so that all submatrices place the same pressure on the cache, i.e. touch the same number of unique source-vector cache lines. TLB blocking is a similar concept, but instead of 8-byte granularities it uses 4 KB granularities. (Figure: matrix partitioned into cache blocks across threads 0-3.)

74 Auto-tuned SpMV Performance (cache and TLB blocking)
Fully auto-tuned SpMV performance across the suite of matrices. Why do some optimizations work better on some architectures? (Stacked bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.)

75 Auto-tuned SpMV Performance (architecture specific optimizations)
Fully auto-tuned SpMV performance across the suite of matrices, now including an SPE/local-store optimized version. Why do some optimizations work better on some architectures? (Stacked bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.)

76 Auto-tuned SpMV Performance (max speedup)
Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local-store optimized version. Why do some optimizations work better on some architectures? Maximum speedups: 2.7x, 4.0x, 2.9x, 35x. (Stacked bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.)

77 Auto-tuned SpMV Performance (architecture specific optimizations)
Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local-store optimized version. Why do some optimizations work better on some architectures? Auto-tuning resulted in better performance, but did it result in good performance? (Stacked bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Matrix Compression, +Cache/LS/TLB Blocking.)

78 Roofline model for SpMV
Double-precision Roofline models for the Intel Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade. In-core optimizations 1..i, DRAM optimizations 1..j. FMA is inherent in SpMV (so the FMA ceiling is placed at the bottom).
GFlops_i,j(AI) = min( InCoreGFlops_i , StreamBW_j * AI )
(Four Roofline panels, attainable GFlop/s vs. flop:DRAM byte ratio, each with its in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, FP instruction mix, bank conflicts) and bandwidth ceilings (Stream Bandwidth, dataset fits in snoop filter, w/out SW prefetch, w/out NUMA).)

79 Roofline model for SpMV (overlay arithmetic intensity)
Overlaying SpMV's arithmetic intensity: two unit-stride streams, inherent FMA, no ILP, no DLP, FP is 12-25% of instructions, and a naïve compulsory flop:byte < 0.166. There is no naïve SPE implementation.
(Same four Roofline panels, with SpMV's arithmetic intensity overlaid.)

80 Roofline model for SpMV (out-of-the-box parallel)
Out-of-the-box parallel performance, plotted on the Rooflines (for simplicity, a dense matrix in sparse format): two unit-stride streams, inherent FMA, no ILP, no DLP, FP is 12-25% of instructions, naïve compulsory flop:byte < 0.166. No naïve SPE implementation.
(Same four Roofline panels, with measured out-of-the-box performance marked.)

81 Roofline model for SpMV (NUMA & SW prefetch)
With NUMA-aware allocation and software prefetching: compulsory flop:byte ~ 0.166, and all memory channels are utilized.
(Same four Roofline panels, with the NUMA + SW prefetch performance marked.)

82 Roofline model for SpMV (matrix compression)
With matrix compression: FMA remains inherent, and register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions.
(Same four Roofline panels, with the matrix-compression performance marked.)

83 Roofline model for SpMV (matrix compression)
With matrix compression: FMA remains inherent, and register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions. Performance is bandwidth-limited.
(Same four Roofline panels, showing performance pinned against the bandwidth ceilings.)

84 SpMV Performance (summary)
Looking at median SpMV performance: unlike LBMHD, SSE was unnecessary to achieve good performance. Cell still requires a non-portable, ISA-specific implementation to achieve good performance. Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.

85 Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track

86 LBMHD Plasma turbulence simulation via Lattice Boltzmann Method
Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components). Three macroscopic quantities: density, momentum (vector), and magnetic field (vector). Arithmetic intensity: must read 73 doubles and update 79 doubles per lattice update (1216 bytes), and requires about 1300 floating-point operations per lattice update, i.e. just over 1.0 flops/byte (ideal). Cache capacity requirements are independent of problem size. Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB), with periodic boundary conditions.
(Figure: the momentum distribution, magnetic distribution, and macroscopic-variable lattices.)

87 LBMHD Performance (reference implementation)
Generally, scalability looks good, but is the performance good? PPE performance is abysmal. (Chart: Naïve+NUMA; collision() only.)

88 LBMHD Performance (lattice-aware array padding)
LBMHD touches >150 arrays. Most caches have limited associativity, so conflict misses are likely. We apply a heuristic to pad the arrays. (Chart series: Naïve+NUMA, +Padding.)

89 Vectorization Two phases with a lattice method’s collision() operator:
reconstruction of the macroscopic variables, and updating the discretized velocities. Normally this is done one point at a time. Change it to do a vector's worth of points at a time (loop interchange + tuning); a schematic sketch follows.
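A hedged, schematic sketch of the restructuring (generic names; the real collision() operator is far larger, and the vector length VL is a tuning parameter I am assuming):

  #define VL 64   /* tunable vector length, found by auto-tuning (an assumption) */

  /* Placeholders for the two phases of collision(). */
  static void reconstruct_macroscopics(int p) { (void)p; /* ... */ }
  static void update_velocities(int p)        { (void)p; /* ... */ }

  /* Original: one lattice point at a time. */
  void collision_pointwise(int npoints) {
      for (int p = 0; p < npoints; p++) {
          reconstruct_macroscopics(p);
          update_velocities(p);
      }
  }

  /* After loop interchange: each phase streams over VL points, so
     intermediate values stay hot and accesses are unit stride. */
  void collision_vectorized(int npoints) {
      for (int p = 0; p < npoints; p += VL) {
          int n = (npoints - p < VL) ? npoints - p : VL;
          for (int v = 0; v < n; v++) reconstruct_macroscopics(p + v);
          for (int v = 0; v < n; v++) update_velocities(p + v);
      }
  }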

90 LBMHD Performance (vectorization)
Restructure loops to attain good TLB page locality and streaming accesses. (Chart series: Naïve+NUMA, +Padding, +Vectorization; collision() only.)

91 LBMHD Performance (architecture specific optimizations)
Add unrolling and reordering of the inner loop. Additionally, exploit SIMD explicitly where the compiler doesn't, and include an SPE/local-store optimized version. (Chart series: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages; collision() only.)

92 LBMHD Performance (architecture specific optimizations)
Add unrolling and reordering of the inner loop. Additionally, exploit SIMD explicitly where the compiler doesn't, and include an SPE/local-store optimized version. Maximum speedups: 1.6x, 4x, 3x, 130x. (Chart series: Naïve+NUMA, +Padding, +Vectorization, +Unrolling, +SW Prefetching, +Explicit SIMDization, +small pages; collision() only.)

93 Roofline model for LBMHD
LBMHD has far more adds than multiplies (imbalance) and huge data sets.
(Roofline panels for the Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and QS20 Cell Blade, attainable GFlop/s vs. flop:DRAM byte ratio, each with its in-core ceilings (mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, FP instruction mix, bank conflicts) and bandwidth ceilings (Stream Bandwidth, dataset fits in snoop filter, w/out SW prefetch, w/out NUMA).)

94 Roofline model for LBMHD (overlay arithmetic intensity)
Overlaying LBMHD's arithmetic intensity: far more adds than multiplies (imbalance), essentially random access to memory, flop:byte ratio ~0.7, NUMA allocation/access, little ILP, no DLP, and high conflict misses. There is no naïve SPE implementation.
(Same four Roofline panels, with LBMHD's arithmetic intensity overlaid.)

95 Roofline model for LBMHD (out-of-the-box parallel performance)
Out-of-the-box parallel performance: far more adds than multiplies (imbalance), essentially random access to memory, flop:byte ratio ~0.7, NUMA allocation/access, little ILP, no DLP, high conflict misses. Peak Victoria Falls performance comes with 64 threads (out of 128) because of high conflict misses.
(Same four Roofline panels, with measured out-of-the-box performance marked.)

96 Roofline model for LBMHD (Padding, Vectorization, Unrolling, Reordering, …)
With padding, vectorization, unrolling, reordering, etc.: vectorize the code to eliminate TLB capacity misses, ensure unit-stride access (the bottom bandwidth ceiling), and tune for the optimal VL. Clovertown is pinned to the lower bandwidth ceiling.
(Same four Roofline panels, with the tuned performance marked.)

97 Roofline model for LBMHD (SIMDization + cache bypass)
With explicit SIMDization and cache bypass: making SIMDization explicit technically swaps the ILP and SIMD ceilings, and using the cache-bypass instruction movntpd increases the flop:byte ratio to ~1.0 on x86/Cell.
(Same four Roofline panels, with the SIMDized, cache-bypassing performance marked.)

98 Roofline model for LBMHD (SIMDization + cache bypass)
With explicit SIMDization and cache bypass: making SIMDization explicit technically swaps the ILP and SIMD ceilings, and using the cache-bypass instruction movntpd increases the flop:byte ratio to ~1.0 on x86/Cell. 3 out of 4 machines hit the Roofline.
(Same four Roofline panels, showing performance against the Roofline bound.)

99 LBMHD Performance (Summary)
The reference code is clearly insufficient. Portable C code is insufficient on Barcelona and Cell. Cell gets all of its performance from the SPEs despite having only 2x the area and 2x the peak DP flops.

100 Summary

101 Summary Introduced the Roofline Model
Apply bound and bottleneck analysis: performance and the requisite optimizations are inferred visually. Extended auto-tuning to multicore, which is fundamentally different from running auto-tuned serial code on multicore SMPs; applied the concept to LBMHD and SpMV. Auto-tuning LBMHD and SpMV: multicore has had a transformative effect on auto-tuning (the move from latency-limited to bandwidth-limited); maximizing memory bandwidth and minimizing memory traffic is key; compilers are reasonably effective at in-core optimizations but totally ineffective at cache and memory issues, so a library or framework is a necessity in managing these issues. Comments on architecture: ultimately, machines are bandwidth-limited without new algorithms, and architectures with caches required significantly more tuning than the local-store-based Cell.

102 Acknowledgements Research supported by:
Microsoft and Intel funding (Award # ) DOE Office of Science under contract number DE-AC02-05CH11231 NSF contract CNS Sun Microsystems - Niagara2 / Victoria Falls machines AMD - access to the quad-core Opteron (Barcelona) Forschungszentrum Jülich - access to QS20 Cell blades IBM - virtual loaner program for QS20 Cell blades

103 Questions ? Samuel Williams, Andrew Waterman, David Patterson, "Roofline: An Insightful Visual Performance Model for Floating-Point Programs and Multicore Architectures", Communications of the ACM (CACM), April 2009. Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007. Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track

