Slide 1: PERI: Auto-tuning Memory Intensive Kernels for Multicore
Samuel Williams (1,2), Kaushik Datta (1), Jonathan Carter (2), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), David Bailey (2)
(1) University of California, Berkeley; (2) Lawrence Berkeley National Laboratory
samw@cs.berkeley.edu
Parallel Computing Laboratory (Berkeley Par Lab), EECS, University of California, Berkeley
Slide 2: Motivation
- Multicore is the de facto solution for increased peak performance for the next decade.
- However, given the diversity of architectures, multicore guarantees neither good scalability nor good (attained) performance.
- We need a solution that provides performance portability.
Slide 3: What's a Memory Intensive Kernel?
Slide 4: Arithmetic Intensity in HPC
- True arithmetic intensity (AI) ~ total flops / total DRAM bytes.
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); for others it remains constant.
- Arithmetic intensity is ultimately limited by compulsory traffic and is diminished by conflict or capacity misses.
- [Figure: kernels ordered by arithmetic intensity, from O(1) (SpMV, BLAS1/2, stencils (PDEs), lattice methods) through O(log N) (FFTs) to O(N) (dense linear algebra (BLAS3), particle methods).]
Slide 5: Memory Intensive
- Let us define a kernel to be memory intensive when its arithmetic intensity is less than the machine's balance (flop:byte ratio).
- Performance ~ Stream bandwidth x arithmetic intensity.
- Technology allows peak flops to improve faster than bandwidth, so more and more kernels will be considered memory intensive.
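As a worked example (a sketch only, pairing the Barcelona numbers quoted later in the talk, roughly 74 Gflop/s of peak double precision and 21.33 GB/s of DRAM bandwidth, with the SpMV intensity quoted later):

\[
\text{machine balance} = \frac{74\ \text{Gflop/s}}{21.33\ \text{GB/s}} \approx 3.5\ \text{flops/byte},
\qquad
\text{AI}_{\text{SpMV}} \lesssim 0.166 \ll 3.5
\;\Rightarrow\;
\text{attainable} \approx 21.33 \times 0.166 \approx 3.5\ \text{Gflop/s}.
\]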
Slide 6: Outline
- Motivation
- Memory Intensive Kernels
- Multicore SMPs of Interest
- Software Optimizations
- Introduction to Auto-tuning
- Auto-tuning Memory Intensive Kernels:
  - Sparse Matrix Vector Multiplication (SpMV)
  - Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
  - Heat Equation Stencil (3D Laplacian)
- Summary
Slide 7: Multicore SMPs of Interest (used throughout the rest of the talk)
Slide 8: Multicore SMPs Used
[Block diagrams of the four systems studied:]
- Intel Xeon E5345 (Clovertown): 2 sockets x 4 cores; each pair of cores shares a 4MB L2 (16-way); front-side buses (10.66 GB/s each) to a chipset with 4x64b FBDIMM controllers (21.33 GB/s read, 10.66 GB/s write; 667MHz FBDIMMs).
- AMD Opteron 2356 (Barcelona): 2 sockets x 4 cores; 512KB victim L2 per core plus a 2MB shared quasi-victim L3 (32-way) behind the SRI/crossbar; 2x64b DDR2 memory controllers per socket (10.66 GB/s; 667MHz DDR2 DIMMs); sockets connected by HyperTransport (4 GB/s each direction).
- Sun T2+ T5140 (Victoria Falls): 2 sockets x 8 multithreaded (MT) SPARC cores behind a crossbar (179 GB/s / 90 GB/s); 4MB shared L2 (16-way, 64b interleaved) per socket; 4 coherency hubs (8 x 6.4 GB/s links, 1 per hub per direction); 2x128b FBDIMM controllers per socket (21.33 GB/s read, 10.66 GB/s write; 667MHz FBDIMMs).
- IBM QS20 Cell Blade: 2 chips, each with a VMT PPE (512K L2) and 8 SPEs (256K local store + MFC each) on the EIB ring network; XDR memory controllers to 512MB of XDR DRAM per chip (25.6 GB/s); chips connected via the BIF (<20 GB/s each direction).
Slide 9: Multicore SMPs Used
[Same four block diagrams as slide 8.] Clovertown, Barcelona, and Victoria Falls have conventional cache-based memory hierarchies; the Cell blade's SPEs have a disjoint local-store memory hierarchy.
Slide 10: Multicore SMPs Used
[Same diagrams.] The cache-based machines run a cache-based Pthreads implementation; the Cell SPEs run a local store-based libspe implementation.
Slide 11: Multicore SMPs Used
[Same diagrams, highlighting the machines with multithreaded cores.]
Slide 12: Multicore SMPs Used (threads)
[Same diagrams.] Thread counts: 8 threads on Clovertown and Barcelona, 16 threads on the Cell blade (SPEs only), 128 threads on Victoria Falls.
Slide 13: Multicore SMPs Used (peak double precision flops)
[Same diagrams.] Peak double precision: 75 GFlop/s (Clovertown), 74 GFlop/s (Barcelona), 19 GFlop/s (Victoria Falls), 29 GFlop/s (Cell blade, SPEs only).
Slide 14: Multicore SMPs Used (total DRAM bandwidth)
[Same diagrams.] Total DRAM bandwidth: 21 GB/s read / 10 GB/s write (Clovertown), 21 GB/s (Barcelona), 42 GB/s read / 21 GB/s write (Victoria Falls), 51 GB/s (Cell blade).
Slide 15: Categorization of Software Optimizations
Slide 16: Optimization Categorization
- Maximizing (attained) in-core performance
- Minimizing (total) memory traffic
- Maximizing (attained) memory bandwidth
Slide 17: Optimization Categorization
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance.
- Minimizing memory traffic
- Maximizing memory bandwidth
Slide 18: Optimization Categorization
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches.
- Minimizing memory traffic
- Maximizing memory bandwidth
Slide 19: Optimization Categorization
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches.
- Minimizing memory traffic: eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior. Techniques: cache blocking, array padding, compress data, streaming stores.
- Maximizing memory bandwidth: exploit NUMA, hide memory latency, satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.
Slides 20-21: Optimization Categorization (animation builds repeating the same three-category breakdown as slide 19).
Slide 22: Optimization Categorization
- Same three-category breakdown as slide 19.
- Each optimization has a large parameter space. What are the optimal parameters?
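Software prefetching is one example of such a knob. The prefetch distance below is a tunable parameter the auto-tuner must search over, not a value the talk prescribes (a hedged sketch using GCC's __builtin_prefetch):

    /* Streaming sum with software prefetch.  PF_DIST (in elements) is a
     * tunable parameter: too small hides no latency, too large evicts
     * useful data, and the best value differs from machine to machine. */
    #define PF_DIST 64   /* hypothetical value; the auto-tuner searches over this */

    double sum_with_prefetch(const double *a, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++) {
            __builtin_prefetch(&a[i + PF_DIST], 0, 0);  /* read, low temporal locality */
            s += a[i];
        }
        return s;
    }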
Slide 23: Introduction to Auto-tuning
Slide 24: The Out-of-the-box Code Problem
- Out-of-the-box code makes (unintentional) assumptions about cache sizes (>10MB), functional unit latencies (~1 cycle), etc.
- These assumptions may result in poor performance when they exceed the machine's characteristics.
Slide 25: Auto-tuning?
- Provides performance portability across the existing breadth and evolution of microprocessors.
- A one-time, up-front productivity cost is amortized over the number of machines it is used on.
- Auto-tuning does not invent new optimizations; it automates the exploration of the optimization and parameter space.
- Two components: a parameterized code generator (we wrote ours in Perl) and an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search).
- Can be extended with ISA-specific optimizations (e.g., DMA, SIMD).
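A minimal sketch of what the exploration component does (illustrative only; kernel_variant() and wall_time() are hypothetical stand-ins for the Perl-generated code variants and the timer used by the real framework):

    /* Hypothetical auto-tuning search loop: benchmark every generated
     * variant of a kernel over a small parameter space and keep the best. */
    #include <float.h>

    typedef void (*kernel_fn)(void);                         /* one generated variant */
    extern kernel_fn kernel_variant(int unroll, int block);  /* from the generator    */
    extern double    wall_time(void);                        /* seconds               */

    kernel_fn autotune(void)
    {
        double best_time = DBL_MAX;
        kernel_fn best = 0;
        for (int unroll = 1; unroll <= 8; unroll *= 2) {      /* unroll & jam */
            for (int block = 16; block <= 1024; block *= 2) { /* cache block  */
                kernel_fn k = kernel_variant(unroll, block);
                double t0 = wall_time();
                k();                                          /* run the trial */
                double t = wall_time() - t0;
                if (t < best_time) { best_time = t; best = k; }
            }
        }
        return best;   /* best-performing variant for this machine */
    }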
Slide 26: Auto-tuning Memory Intensive Kernels
- Sparse Matrix Vector Multiplication (SpMV) [SC'07]
- Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD) [IPDPS'08]
- Explicit Heat Equation (Stencil) [SC'08]
Slide 27: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Slide 28: Sparse Matrix Vector Multiplication
- What's a sparse matrix? Most entries are 0.0, so there is a performance advantage in only storing and operating on the nonzeros; however, significant metadata is required to reconstruct the matrix structure.
- What's SpMV? Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
- Challenges: very low arithmetic intensity (often < 0.166 flops/byte); difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD).
- (a) Algebra conceptualization: y = Ax. (b) CSR data structure: A.val[], A.rowStart[], A.col[]. (c) CSR reference code:

    for (r = 0; r < A.rows; r++) {
        double y0 = 0.0;
        for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
            y0 += A.val[i] * x[A.col[i]];
        }
        y[r] = y0;
    }
Slide 29: The Dataset (matrices)
- Unlike dense BLAS, performance is dictated by sparsity.
- A suite of 14 matrices, all bigger than the caches of our SMPs; we'll also include a median performance number.
- Well structured (sorted by nonzeros/row): Dense (a 2K x 2K dense matrix stored in sparse format), Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship.
- Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase.
- Extreme aspect ratio (linear programming): LP.
Slide 30: SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on the suite of 14 matrices.
- Scalability isn't great.
- Is this performance good?
- [Bar charts per machine; legend: Naïve Pthreads, Naïve.]
Slide 31: Auto-tuned SpMV Performance (portable C)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Why do some optimizations work better on some architectures?
- [Bar charts per machine; legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve.]
Slide 32: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices, now including an SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- [Bar charts per machine; same legend as slide 31.]
Slide 33: Auto-tuned SpMV Performance (architecture specific optimizations)
- Same data as slide 32.
- Auto-tuning resulted in better performance, but did it result in good performance?
Slide 34: Roofline Model
[Log-log plot: attainable Gflop/s (1 to 128) versus flop:DRAM byte ratio (1/8 to 16). Note the log scale.]
Slide 35: Naïve Roofline Model
- An unrealistically optimistic model.
- Uses a hand-optimized Stream bandwidth benchmark.
- Gflop/s(AI) = min( Peak Gflop/s, Stream BW x AI )
- [Four roofline plots (attainable Gflop/s vs. flop:DRAM byte ratio, log-log), one each for the Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), and IBM QS20 Cell Blade, each showing the peak DP ceiling and the hand-optimized Stream bandwidth diagonal.]
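A minimal sketch of that bound in code (illustrative only; the peak and bandwidth values are the Barcelona-like figures quoted on the hardware slides):

    #include <stdio.h>

    /* Roofline bound: attainable Gflop/s for a kernel with a given arithmetic
     * intensity (flops per DRAM byte) is capped by either peak flops or by
     * streaming bandwidth times intensity, whichever is lower. */
    static double roofline_gflops(double peak_gflops, double stream_bw_gbs, double ai)
    {
        double bw_bound = stream_bw_gbs * ai;
        return bw_bound < peak_gflops ? bw_bound : peak_gflops;
    }

    int main(void)
    {
        /* Example: a Barcelona-like machine (~74 Gflop/s peak, ~21.33 GB/s). */
        for (double ai = 0.0625; ai <= 16.0; ai *= 2.0)
            printf("AI = %6.4f flops/byte -> %6.2f Gflop/s\n",
                   ai, roofline_gflops(74.0, 21.33, ai));
        return 0;
    }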
Slide 36: Roofline Model for SpMV
- Double precision roofline models.
- In-core optimizations 1..i; DRAM optimizations 1..j.
- FMA is inherent in SpMV (place it at the bottom of the in-core ceilings).
- GFlops_{i,j}(AI) = min( InCoreGFlops_i, StreamBW_j x AI )
- [Four roofline plots, one per machine, with in-core ceilings (peak DP, w/out SIMD, w/out ILP, w/out FMA or mul/add imbalance, 25% FP, 12% FP) and bandwidth ceilings (w/out SW prefetch, w/out NUMA, bank conflicts, dataset fits in snoop filter).]
Slide 37: Roofline Model for SpMV (overlay arithmetic intensity)
- Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions.
- Naïve compulsory flop:byte < 0.166.
- No naïve SPE implementation for the Cell blade.
- [Same four roofline plots as slide 36, with the SpMV arithmetic intensity overlaid.]
Slide 38: Roofline Model for SpMV (out-of-the-box parallel)
- Same kernel characteristics as slide 37.
- For simplicity: a dense matrix stored in sparse format.
- [Same four roofline plots, with out-of-the-box parallel performance marked.]
Slide 39: Roofline Model for SpMV (NUMA & SW prefetch)
- Compulsory flop:byte ~ 0.166.
- Utilize all memory channels.
- [Same four roofline plots, with performance after NUMA and software prefetch optimizations marked.]
Slide 40: Roofline Model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves ILP, DLP, the flop:byte ratio, and the FP% of instructions.
- [Same four roofline plots, with performance after matrix compression marked.]
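A minimal sketch of the register-blocking idea (a hypothetical 2x2 BCSR-style kernel with made-up field names; the real tuner searches over block dimensions and index/value compression per matrix and per architecture):

    /* 2x2 register-blocked SpMV sketch: each stored block is a small dense
     * 2x2 multiply, which exposes ILP/DLP and amortizes the column index
     * (one index per block instead of one per nonzero).  Assumes the matrix
     * dimensions are multiples of 2. */
    struct bcsr22 {
        int     brows;      /* number of 2-row block rows              */
        int    *blkStart;   /* blocks of block row br: blkStart[br]..blkStart[br+1]-1 */
        int    *blkCol;     /* block column index (in units of 2)      */
        double *val;        /* 4 doubles per block, row-major          */
    };

    void spmv_bcsr22(const struct bcsr22 *A, const double *x, double *y)
    {
        for (int br = 0; br < A->brows; br++) {
            double y0 = 0.0, y1 = 0.0;
            for (int b = A->blkStart[br]; b < A->blkStart[br+1]; b++) {
                const double *v  = &A->val[4*b];
                const double *xp = &x[2*A->blkCol[b]];
                y0 += v[0]*xp[0] + v[1]*xp[1];
                y1 += v[2]*xp[0] + v[3]*xp[1];
            }
            y[2*br]   = y0;
            y[2*br+1] = y1;
        }
    }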
Slide 41: Roofline Model for SpMV (matrix compression)
- Same plots as slide 40.
- Performance is bandwidth limited.
Slide 42: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.
Slide 43: LBMHD
- Plasma turbulence simulation via the Lattice Boltzmann Method.
- Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
- Three macroscopic quantities: density, momentum (vector), and magnetic field (vector).
- Arithmetic intensity: must read 73 doubles and update 79 doubles per lattice update (1216 bytes), and requires about 1300 floating point operations per lattice update, i.e. just over 1.0 flops/byte (ideal).
- Cache capacity requirements are independent of problem size.
- Two problem sizes: 64^3 (0.3 GB) and 128^3 (2.5 GB); periodic boundary conditions.
- [Figures: the momentum distribution lattice, the magnetic distribution lattice, and the macroscopic variables, drawn on +X/+Y/+Z axes.]
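A quick check of those numbers (the write-allocate term is an assumption consistent with the later roofline slides, which quote a naïve flop:byte ratio of about 0.7 that rises back toward 1.0 once streaming stores bypass the cache):

\[
\text{AI}_{\text{ideal}} = \frac{1300\ \text{flops}}{(73+79)\times 8\ \text{bytes}} = \frac{1300}{1216} \approx 1.07,
\qquad
\text{AI}_{\text{write-allocate}} = \frac{1300}{1216 + 79\times 8} = \frac{1300}{1848} \approx 0.70 .
\]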
Slide 44: Initial LBMHD Performance
- Generally, scalability looks good.
- Scalability is good, but is performance good?
- (* collision() only)
- [Bar charts per machine; legend: Naïve+NUMA.]
Slide 45: Auto-tuned LBMHD Performance (portable C)
- Auto-tuning avoids cache conflict and TLB capacity misses.
- Additionally, it exploits SIMD where the compiler doesn't.
- An SPE/local store optimized version is also included.
- (* collision() only)
- [Bar charts per machine; legend: +Vectorization, +Padding, Naïve+NUMA.]
Slide 46: Auto-tuned LBMHD Performance (architecture specific optimizations)
- Auto-tuning avoids cache conflict and TLB capacity misses, and exploits SIMD where the compiler doesn't.
- Includes the SPE/local store optimized version.
- (* collision() only)
- [Bar charts per machine; legend: +Explicit SIMDization, +SW Prefetching, +Unrolling, +Vectorization, +Padding, Naïve+NUMA, +small pages.]
Slide 47: Roofline Model for LBMHD
- Far more adds than multiplies (imbalance).
- Huge data sets.
- [Four roofline plots, one per machine, with the same in-core and bandwidth ceilings as before.]
Slide 48: Roofline Model for LBMHD (overlay arithmetic intensity)
- Far more adds than multiplies (imbalance); essentially random access to memory.
- Flop:byte ratio ~0.7.
- NUMA allocation/access; little ILP; no DLP; high conflict misses.
- No naïve SPE implementation for the Cell blade.
- [Same four roofline plots, with the LBMHD arithmetic intensity overlaid.]
Slide 49: Roofline Model for LBMHD (out-of-the-box parallel performance)
- Same kernel characteristics as slide 48.
- Peak Victoria Falls performance is reached with 64 threads (out of 128) because of high conflict misses.
- [Same four roofline plots, with out-of-the-box parallel performance marked.]
Slide 50: Roofline Model for LBMHD (padding, vectorization, unrolling, reordering, ...)
- Vectorize the code to eliminate TLB capacity misses.
- Ensures unit-stride access (the bottom bandwidth ceiling).
- Tune for the optimal vector length (VL).
- Clovertown is pinned to the lower bandwidth ceiling.
- [Same four roofline plots, with performance after these optimizations marked.]
Slide 51: Roofline Model for LBMHD (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings).
- Use the cache bypass instruction movntpd.
- Increases the flop:byte ratio to ~1.0 on x86/Cell.
- [Same four roofline plots, with performance after explicit SIMDization and cache bypass marked.]
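A minimal sketch of the cache-bypass idea on x86 (SSE2 intrinsics; _mm_stream_pd compiles to movntpd, so stores avoid the write-allocate read of the destination; the simple vector add stands in for the much larger LBMHD update):

    #include <emmintrin.h>   /* SSE2: __m128d, _mm_stream_pd (movntpd) */

    /* c[i] = a[i] + b[i] with explicit SIMD and streaming (cache-bypass)
     * stores.  Assumes n is a multiple of 2 and c is 16-byte aligned. */
    void add_streaming(double *c, const double *a, const double *b, long n)
    {
        for (long i = 0; i < n; i += 2) {
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_stream_pd(&c[i], _mm_add_pd(va, vb));  /* movntpd: no write-allocate */
        }
        _mm_sfence();   /* make the non-temporal stores globally visible */
    }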
Slide 52: Roofline Model for LBMHD (SIMDization + cache bypass)
- Same plots as slide 51.
- Three out of four machines hit the roofline.
Slide 53: The Heat Equation Stencil
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architecture", Supercomputing (SC) (to appear), 2008.
Slide 54: The Heat Equation Stencil
- Explicit heat equation (Laplacian, ∇²F(x,y,z)) on a regular grid.
- Storage: one double per point in space.
- 7-point nearest-neighbor stencil.
- Must read every point from DRAM, perform 8 flops (a linear combination), and write every point back to DRAM.
- Just over 0.5 flops/byte (ideal).
- Cache locality is important.
- Run one problem size: 256^3.
- [Figures: the PDE grid and the 7-point stencil (the center point x,y,z and its x±1, y±1, z±1 neighbors) on +X/+Y/+Z axes.]
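A minimal sketch of that kernel (a naïve Jacobi-style sweep; the coefficient names alpha/beta and the (N+2)^3 padded layout are illustrative assumptions, not the paper's exact code):

    #define N 256   /* interior points per dimension; arrays are (N+2)^3 padded */

    /* One sweep of the 7-point heat-equation stencil: 8 flops (6 adds,
     * 2 multiplies) per point, one read and one write of every point,
     * so roughly 0.5 flops/byte before write-allocate traffic. */
    void stencil_sweep(double next[N+2][N+2][N+2],
                       const double now[N+2][N+2][N+2],
                       double alpha, double beta)
    {
        for (int z = 1; z <= N; z++)
            for (int y = 1; y <= N; y++)
                for (int x = 1; x <= N; x++)
                    next[z][y][x] =
                          alpha * now[z][y][x]
                        + beta  * ( now[z][y][x-1] + now[z][y][x+1]
                                  + now[z][y-1][x] + now[z][y+1][x]
                                  + now[z-1][y][x] + now[z+1][y][x] );
    }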
Slide 55: Stencil Performance (out-of-the-box code)
- Expect performance to be between SpMV and LBMHD.
- Scalability is universally poor.
- Performance is poor.
- [Bar charts per machine; legend: Naïve.]
Slide 56: Auto-tuned Stencil Performance (portable C)
- Proper NUMA management is essential on most architectures.
- Moreover, proper cache blocking is still essential even with MBs of cache.
- [Bar charts per machine; legend: +SW Prefetching, +Unrolling, +Thread/Cache Blocking, +Padding, +NUMA, Naïve, +Collaborative Threading.]
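A minimal sketch of what thread/cache blocking means for the sweep shown earlier (hypothetical block sizes BX and BY; the auto-tuner searches over them per machine, and stencil_point() stands in for the 8-flop update):

    /* Cache/thread-blocked variant of the 7-point sweep: traverse the grid in
     * BX x BY column blocks so the working set of adjacent planes fits in cache,
     * streaming through z within each block. */
    #define N  256
    #define BX 64     /* hypothetical values; tuned per machine */
    #define BY 8

    extern void stencil_point(double next[N+2][N+2][N+2],
                              const double now[N+2][N+2][N+2],
                              int z, int y, int x);

    void stencil_blocked(double next[N+2][N+2][N+2],
                         const double now[N+2][N+2][N+2])
    {
        for (int yy = 1; yy <= N; yy += BY)
            for (int xx = 1; xx <= N; xx += BX)
                for (int z = 1; z <= N; z++)          /* stream through z */
                    for (int y = yy; y < yy + BY && y <= N; y++)
                        for (int x = xx; x < xx + BX && x <= N; x++)
                            stencil_point(next, now, z, y, x);
    }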
Slide 57: Auto-tuned Stencil Performance (architecture specific optimizations)
- Cache bypass can significantly improve Barcelona performance.
- DMA, SIMD, and cache blocking were essential on Cell.
- [Bar charts per machine; legend: +Explicit SIMDization, +SW Prefetching, +Unrolling, +Thread/Cache Blocking, +Padding, +NUMA, +Cache bypass / DMA, Naïve, +Collaborative Threading.]
Slide 58: Roofline Model for Stencil (out-of-the-box code)
- Large datasets; 2 unit-stride streams; no NUMA; little ILP; no DLP.
- Far more adds than multiplies (imbalance).
- Ideal flop:byte ratio of 1/3.
- High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio.
- [Four roofline plots, one per machine, with the same in-core and bandwidth ceilings as before.]
Slides 59-60: Roofline Model for Stencil (out-of-the-box code, continued)
- Same bullets and plots as slide 58; the Cell plot notes that there is no naïve SPE implementation.
Slide 61: Roofline Model for Stencil (NUMA, cache blocking, unrolling, prefetch, ...)
- Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3.
- Clovertown has huge caches but is pinned to the lower bandwidth ceiling.
- Cache management is essential when capacity per thread is low.
- [Same four roofline plots, with performance after these optimizations marked.]
Slide 62: Roofline Model for Stencil (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings).
- Use the cache bypass instruction movntpd.
- Increases the flop:byte ratio to ~0.5 on x86/Cell.
- [Same four roofline plots, with performance after explicit SIMDization and cache bypass marked.]
Slide 63: Roofline Model for Stencil (SIMDization + cache bypass)
- Same plots as slide 62.
- Three out of four machines are basically on the roofline.
Slide 64: Summary
Slide 65: Out-of-the-box Performance
- Ideal productivity (just type 'make').
- Kernels sorted by arithmetic intensity; maximum performance at any concurrency.
- Note: Cell = 4 PPE threads (no SPEs).
- Surprisingly, most architectures achieved similar performance.
Slide 66: Portable Performance
- Portable (C only) auto-tuning: sacrifice some productivity up front, then amortize it by reuse.
- Dramatic increases in performance on Barcelona and Victoria Falls.
- Clovertown is increasingly memory bound.
- Cell = 4 PPE threads (no SPEs).
Slide 67: Architecture Specific Performance
- ISA-specific auto-tuning (reduced portability and productivity): explicit SPE parallelization with DMAs, SIMDization (SSE + Cell), and cache bypass.
- Cell gets all of its performance from the SPEs (double precision limits performance).
- Barcelona gets half of its performance from SIMDization.
Slide 68: Power Efficiency
- Used a digital power meter to measure sustained power under load.
- Efficiency = sustained performance / sustained power.
- Victoria Falls' power efficiency is severely limited by FBDIMM power.
- Despite Cell's weak double precision, it delivers the highest power efficiency.
Slide 69: Summary (auto-tuning)
- Auto-tuning provides portable performance.
- Auto-tuning with architecture (ISA) specific optimizations is essential on some machines.
- The roofline model qualifies performance; it also tells us which optimizations are important and how much further improvement is possible.
Slide 70: Summary (architectures)
- Clovertown: severely bandwidth limited.
- Barcelona: delivers good performance with architecture specific auto-tuning; bandwidth limited after ISA optimizations.
- Victoria Falls: challenged by the interaction of TLP with shared caches; often limited by in-core performance.
- Cell Broadband Engine: performance comes entirely from architecture specific auto-tuning; bandwidth limited on SpMV; double precision limited on the stencil (slightly) and LBMHD (severely); delivers the best power efficiency.
Slide 71: Acknowledgements
Research supported by:
- Microsoft and Intel funding (Award #20080469)
- DOE Office of Science under contract number DE-AC02-05CH11231
- NSF contract CNS-0325873
- Sun Microsystems: Niagara2 / Victoria Falls machines
- AMD: access to the quad-core Opteron (Barcelona)
- Forschungszentrum Jülich: access to QS20 Cell blades
- IBM: virtual loaner program for QS20 Cell blades
Slide 72: Questions?
References:
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
- Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.
- Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architecture", Supercomputing (SC) (to appear), 2008.