
Slide 1: PERI: Auto-tuning Memory Intensive Kernels for Multicore
Samuel Williams (1,2), Kaushik Datta (1), Jonathan Carter (2), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), David Bailey (2)
(1) University of California, Berkeley; (2) Lawrence Berkeley National Laboratory
samw@cs.berkeley.edu

Slide 2: Motivation
- Multicore is the de facto solution for increased peak performance for the next decade.
- However, given the diversity of architectures, multicore guarantees neither good scalability nor good (attained) performance.
- We need a solution that provides performance portability.

Slide 3: What's a Memory Intensive Kernel?

Slide 4: Arithmetic Intensity in HPC
- True arithmetic intensity (AI) ~ total flops / total DRAM bytes.
- Some HPC kernels have an arithmetic intensity that scales with problem size (increased temporal locality); for others it remains constant.
- Arithmetic intensity is ultimately limited by compulsory traffic.
- Arithmetic intensity is diminished by conflict or capacity misses.
- Figure: kernels arranged by arithmetic intensity. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.

Slide 5: Memory Intensive
- Let us define a kernel as memory intensive when its arithmetic intensity is less than the machine's balance (flop:byte). In that regime, performance ~ Stream BW * arithmetic intensity.
- Technology allows peak flops to improve faster than bandwidth, so more and more kernels will be considered memory intensive.
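The rule of thumb above can be stated compactly. This is a minimal restatement of the slide's definitions in formula form, not a formula taken verbatim from the deck:

```latex
% Arithmetic intensity and the memory-intensive regime (restating the slide)
\[
  \mathrm{AI} \;=\; \frac{\text{total flops}}{\text{total DRAM bytes}},
  \qquad
  \text{Performance} \;\approx\; \min\bigl(\text{Peak Gflop/s},\;\; \text{Stream BW} \times \mathrm{AI}\bigr).
\]
\[
  \text{A kernel is memory intensive when } \mathrm{AI} \;<\; \frac{\text{Peak Gflop/s}}{\text{Stream BW}}
  \quad\text{(the machine's flop:byte balance).}
\]
```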

Slide 6: Outline
- Motivation
- Memory Intensive Kernels
- Multicore SMPs of Interest
- Software Optimizations
- Introduction to Auto-tuning
- Auto-tuning Memory Intensive Kernels
  - Sparse Matrix Vector Multiplication (SpMV)
  - Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
  - Heat Equation Stencil (3D Laplacian)
- Summary

Slide 7: Multicore SMPs of Interest (used throughout the rest of the talk)

Slide 8: Multicore SMPs Used
- Figure: block diagrams of the four SMPs used throughout the talk.
  - Intel Xeon E5345 (Clovertown): two quad-core sockets on front-side buses (10.66 GB/s each), 4MB shared L2 per core pair, chipset with 4x64b controllers to 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write).
  - AMD Opteron 2356 (Barcelona): two quad-core sockets, 512KB victim L2 per core, 2MB shared quasi-victim (32-way) L3 and SRI/crossbar per socket, 2x64b controllers to 667MHz DDR2 DIMMs (10.66 GB/s per socket), HyperTransport between sockets (4 GB/s each direction).
  - Sun T2+ T5140 (Victoria Falls): two sockets of multithreaded SPARC cores behind a crossbar, 4MB shared L2 (16-way, 64b interleaved) and 4 coherency hubs per socket, 2x128b controllers to 667MHz FBDIMMs (21.33 GB/s read, 10.66 GB/s write per socket).
  - IBM QS20 Cell Blade: two Cell processors, each with a VMT PPE (512K L2) and 8 SPEs (256K local store plus MFC each) on the EIB ring, XDR memory controllers to 512MB XDR DRAM (25.6 GB/s per socket), BIF link between sockets (<20 GB/s each direction).

Slide 9: Multicore SMPs Used
- Figure: the same four block diagrams, annotated to contrast the conventional cache-based memory hierarchies with the Cell blade's disjoint local-store memory hierarchy.

Slide 10: Multicore SMPs Used
- Figure: the same four block diagrams, annotated with the programming model used on each: a cache-based Pthreads implementation on the conventional machines, and a local store-based libspe implementation on Cell.

Slide 11: Multicore SMPs Used
- Figure: the same four block diagrams, highlighting which machines have multithreaded cores.

Slide 12: Multicore SMPs Used (threads)
- Figure: the four block diagrams annotated with hardware thread counts: 8 threads on the x86 machines, 16* threads on the Cell blade (*SPEs only), and 128 threads on Victoria Falls.

Slide 13: Multicore SMPs Used (peak double precision flops)
- Figure: the four block diagrams annotated with peak double-precision rates: 75 GFlop/s (Clovertown), 74 GFlop/s (Barcelona), 29* GFlop/s (Cell, *SPEs only), and 19 GFlop/s (Victoria Falls).

Slide 14: Multicore SMPs Used (total DRAM bandwidth)
- Figure: the four block diagrams annotated with total DRAM bandwidth: 21 GB/s read / 10 GB/s write (Clovertown), 21 GB/s (Barcelona), 51 GB/s (Cell), and 42 GB/s read / 21 GB/s write (Victoria Falls).

Slide 15: Categorization of Software Optimizations

Slide 16: Optimization Categorization
- Three classes of optimization: maximizing (attained) in-core performance, minimizing (total) memory traffic, and maximizing (attained) memory bandwidth.

Slide 17: Optimization Categorization
- Column headings: maximizing in-core performance, minimizing memory traffic, maximizing memory bandwidth.
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance.

Slide 18: Optimization Categorization
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance.
- Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches.

Slide 19: Optimization Categorization
- Maximizing in-core performance: exploit in-core parallelism (ILP, DLP, etc.); good (enough) floating-point balance. Techniques: unroll & jam, explicit SIMD, reorder, eliminate branches.
- Minimizing memory traffic: eliminate capacity misses, conflict misses, compulsory misses, and write-allocate behavior. Techniques: cache blocking, array padding, data compression, streaming stores.
- Maximizing memory bandwidth: exploit NUMA, hide memory latency, satisfy Little's Law. Techniques: memory affinity, SW prefetch, DMA lists, unit-stride streams, TLB blocking.

Slide 20: Optimization Categorization
- Same table as slide 19 (animation build highlighting another column).

Slide 21: Optimization Categorization
- Same table as slide 19 (animation build highlighting another column).

Slide 22: Optimization Categorization
- Same table as slide 19, plus the punchline: each optimization has a large parameter space; what are the optimal parameters?
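As a concrete illustration of one in-core technique named in the table above (unroll & jam), here is a minimal C sketch on a generic row-sum loop nest; the kernel and names are hypothetical, not taken from the talk:

```c
/* Unroll & jam sketch: unroll the outer loop by 2 and fuse ("jam") the
 * copies of the inner loop so two independent accumulators are live,
 * exposing more instruction-level parallelism.  Assumes n is even. */
void rowsum_unroll_jam(const double *a, double *sum, int n)
{
    for (int i = 0; i < n; i += 2) {
        double s0 = 0.0, s1 = 0.0;      /* independent accumulators */
        for (int j = 0; j < n; j++) {
            s0 += a[(i    ) * n + j];
            s1 += a[(i + 1) * n + j];   /* jammed copy of the inner loop */
        }
        sum[i]     = s0;
        sum[i + 1] = s1;
    }
}
```

The two independent accumulators give the out-of-order core more work to schedule per cycle, which is exactly the "exploit ILP" goal in the in-core column.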

Slide 23: Introduction to Auto-tuning

Slide 24: Out-of-the-box Code Problem
- Out-of-the-box code has (unintentional) assumptions about:
  - cache sizes (>10MB)
  - functional unit latencies (~1 cycle)
  - etc.
- These assumptions may result in poor performance when they exceed the machine's characteristics.

Slide 25: Auto-tuning?
- Provides performance portability across the existing breadth and evolution of microprocessors.
- A one-time, up-front productivity cost is amortized by the number of machines it's used on.
- Auto-tuning does not invent new optimizations; it automates the exploration of the optimization and parameter space.
- Two components:
  - a parameterized code generator (we wrote ours in Perl)
  - an auto-tuning exploration benchmark (a combination of heuristics and exhaustive search; see the sketch below)
- Can be extended with ISA-specific optimizations (e.g. DMA, SIMD).
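To make the two components concrete, here is a minimal C sketch of the exploration side only: it assumes the code generator has already emitted a set of kernel variants and simply times them exhaustively. The function-pointer harness and names are illustrative; this is not the talk's actual Perl generator or benchmark framework:

```c
/* Exploration-benchmark sketch: exhaustively time pre-generated kernel
 * variants and report the fastest one. */
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(void);            /* one auto-generated variant */

static double time_kernel(kernel_fn k, int trials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int t = 0; t < trials; t++)
        k();                                /* run the candidate variant */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

int search_variants(kernel_fn *variants, int nvariants, int trials)
{
    int best = 0;
    double best_time = time_kernel(variants[0], trials);
    for (int v = 1; v < nvariants; v++) {
        double t = time_kernel(variants[v], trials);
        if (t < best_time) { best_time = t; best = v; }
    }
    printf("best variant: %d (%.3f s for %d trials)\n", best, best_time, trials);
    return best;
}
```

In practice the search space is pruned with heuristics rather than run fully exhaustively, as the slide notes.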

Slide 26: Auto-tuning Memory Intensive Kernels
- Sparse Matrix Vector Multiplication (SpMV), SC'07
- Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD), IPDPS'08
- Explicit Heat Equation (Stencil), SC'08

Slide 27: Auto-tuning Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Slide 28: Sparse Matrix Vector Multiplication
- What's a sparse matrix?
  - Most entries are 0.0.
  - There is a performance advantage in only storing/operating on the nonzeros.
  - Requires significant metadata to reconstruct the matrix structure.
- What's SpMV?
  - Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
- Challenges:
  - Very low arithmetic intensity (often < 0.166 flops/byte).
  - Difficult to exploit ILP (bad for superscalar) and difficult to exploit DLP (bad for SIMD).
- Figure: (a) algebra conceptualization (y = Ax); (b) CSR data structure (A.val[], A.rowStart[], A.col[]); (c) CSR reference code:

    for (r = 0; r < A.rows; r++) {
      double y0 = 0.0;
      for (i = A.rowStart[r]; i < A.rowStart[r+1]; i++) {
        y0 += A.val[i] * x[A.col[i]];
      }
      y[r] = y0;
    }
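For reference, a plausible C declaration of the CSR structure indexed by the reference code above; the field names (rows, rowStart, col, val) come from that code, while the exact layout used in the tuned implementations may differ:

```c
/* Compressed Sparse Row (CSR) storage, matching the fields used by the
 * reference kernel above (A.rows, A.rowStart[], A.col[], A.val[]). */
typedef struct {
    int     rows;       /* number of matrix rows                        */
    int    *rowStart;   /* rows+1 entries; row r occupies nonzeros      */
                        /*   rowStart[r] .. rowStart[r+1]-1             */
    int    *col;        /* column index of each stored nonzero          */
    double *val;        /* value of each stored nonzero                 */
} csr_matrix;
```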

Slide 29: The Dataset (matrices)
- Unlike dense BLAS, performance is dictated by sparsity.
- Suite of 14 matrices, all bigger than the caches of our SMPs.
- We'll also include a median performance number.
- Figure: thumbnails of the 14 matrices. Dense (a 2K x 2K dense matrix stored in sparse format); well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship; poorly structured (hodgepodge): Economics, Epidemiology, FEM/Accelerator, Circuit, webbase; extreme aspect ratio (linear programming): LP.

Slide 30: SpMV Performance (simple parallelization)
- Out-of-the-box SpMV performance on a suite of 14 matrices.
- Scalability isn't great.
- Is this performance good?
- Figure: per-machine performance bars (series: Naïve, Pthreads Naïve).

Slide 31: Auto-tuned SpMV Performance (portable C)
- Fully auto-tuned SpMV performance across the suite of matrices.
- Why do some optimizations work better on some architectures?
- Figure: stacked performance bars per machine (legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Pthreads Naïve, Naïve).

Slide 32: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices, now including the SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- Figure: same stacked performance bars (legend: +Cache/LS/TLB Blocking, +Matrix Compression, +SW Prefetching, +NUMA/Affinity, Pthreads Naïve, Naïve).

Slide 33: Auto-tuned SpMV Performance (architecture specific optimizations)
- Fully auto-tuned SpMV performance across the suite of matrices, including the SPE/local store optimized version.
- Why do some optimizations work better on some architectures?
- Auto-tuning resulted in better performance, but did it result in good performance?
- Figure: same stacked performance bars as the previous slide.

Slide 34: Roofline Model
- Figure: empty roofline axes: attainable Gflop/s (1 to 128) versus flop:DRAM byte ratio (1/8 to 16). Log scale!

Slide 35: Naïve Roofline Model
- An unrealistically optimistic model: Gflop/s(AI) = min(Peak Gflop/s, StreamBW * AI).
- Uses a hand-optimized Stream BW benchmark for the bandwidth diagonal.
- Figure: rooflines for the four machines (Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), IBM QS20 Cell Blade), each with the peak DP ceiling and the hand-optimized Stream BW diagonal.

Slide 36: Roofline model for SpMV
- Double precision roofline models for each machine.
- In-core optimization ceilings 1..i and DRAM optimization ceilings 1..j: GFlops_{i,j}(AI) = min(InCoreGFlops_i, StreamBW_j * AI).
- FMA is inherent in SpMV (place its ceiling at the bottom).
- Figure: the four rooflines with ceilings such as peak DP, w/out SIMD, w/out ILP, w/out FMA, mul/add imbalance, w/out SW prefetch, w/out NUMA, bank conflicts, 25% FP, 12% FP, and a "dataset fits in snoop filter" bandwidth region on Clovertown.

Slide 37: Roofline model for SpMV (overlay arithmetic intensity)
- Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions.
- Naïve compulsory flop:byte < 0.166.
- No naïve SPE implementation (Cell).
- Figure: the same four rooflines with the naïve SpMV arithmetic intensity overlaid.

Slide 38: Roofline model for SpMV (out-of-the-box parallel)
- Two unit-stride streams; inherent FMA; no ILP; no DLP; FP is 12-25% of instructions.
- Naïve compulsory flop:byte < 0.166.
- For simplicity: a dense matrix stored in sparse format.
- No naïve SPE implementation (Cell).
- Figure: the four rooflines with out-of-the-box parallel SpMV performance plotted against the ceilings.

Slide 39: Roofline model for SpMV (NUMA & SW prefetch)
- Compulsory flop:byte ~ 0.166.
- Utilize all memory channels.
- Figure: the four rooflines with performance after NUMA-aware allocation and software prefetching.

Slide 40: Roofline model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
- Figure: the four rooflines with performance after matrix compression.
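Register blocking, part of the matrix compression step referenced above, stores small dense blocks instead of individual nonzeros. Below is a minimal 2x2 blocked-CSR (BCSR) sketch with an illustrative data layout, not the paper's tuned format:

```c
/* 2x2 register-blocked (BCSR) SpMV sketch: each stored block contributes a
 * small dense multiply, so the two accumulators y0/y1 expose ILP and each
 * column index is amortized over four values. */
typedef struct {
    int     brows;      /* number of 2-row block rows                     */
    int    *blockStart; /* brows+1 entries: block range of each block row */
    int    *bcol;       /* first column of each 2x2 block                 */
    double *bval;       /* blocks stored contiguously, 4 doubles each     */
} bcsr_2x2;

void spmv_bcsr_2x2(const bcsr_2x2 *A, const double *x, double *y)
{
    for (int br = 0; br < A->brows; br++) {
        double y0 = 0.0, y1 = 0.0;
        for (int b = A->blockStart[br]; b < A->blockStart[br + 1]; b++) {
            const double *v  = &A->bval[4 * b];   /* row-major 2x2 block */
            const double *xx = &x[A->bcol[b]];
            y0 += v[0] * xx[0] + v[1] * xx[1];
            y1 += v[2] * xx[0] + v[3] * xx[1];
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```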

Slide 41: Roofline model for SpMV (matrix compression)
- Inherent FMA.
- Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions.
- Punchline: performance is bandwidth limited.
- Figure: same rooflines as the previous slide.

Slide 42: Auto-tuning Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.

Slide 43: LBMHD
- Plasma turbulence simulation via the Lattice Boltzmann Method.
- Two distributions: a momentum distribution (27 scalar components) and a magnetic distribution (15 vector components).
- Three macroscopic quantities: density, momentum (vector), magnetic field (vector).
- Arithmetic intensity:
  - Must read 73 doubles and update 79 doubles per lattice update (1216 bytes).
  - Requires about 1300 floating point operations per lattice update.
  - Just over 1.0 flops/byte (ideal); see the short check below.
  - Cache capacity requirements are independent of problem size.
- Two problem sizes, 64^3 (0.3 GB) and 128^3 (2.5 GB), with periodic boundary conditions.
- Figure: the momentum distribution (27 lattice directions), the magnetic distribution (15 lattice directions), and the macroscopic variables on the +X/+Y/+Z axes.
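The ideal arithmetic-intensity figure quoted above follows directly from the slide's byte and flop counts:

```latex
\[
  (73 + 79)\ \text{doubles} \times 8\ \tfrac{\text{bytes}}{\text{double}}
    \;=\; 1216\ \text{bytes per lattice update},
  \qquad
  \mathrm{AI}_{\text{ideal}} \;\approx\; \frac{1300\ \text{flops}}{1216\ \text{bytes}}
    \;\approx\; 1.07\ \tfrac{\text{flops}}{\text{byte}}.
\]
```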

Slide 44: Initial LBMHD Performance
- Generally, scalability looks good.
- Scalability is good, but is performance good?
- *collision() only.
- Figure: Naïve+NUMA performance bars per machine.

Slide 45: Auto-tuned LBMHD Performance (portable C)
- Auto-tuning avoids cache conflict and TLB capacity misses.
- Additionally, it exploits SIMD where the compiler doesn't.
- Includes a SPE/Local Store optimized version.
- *collision() only.
- Figure: stacked bars per machine (legend: +Vectorization, +Padding, Naïve+NUMA).

Slide 46: Auto-tuned LBMHD Performance (architecture specific optimizations)
- Auto-tuning avoids cache conflict and TLB capacity misses.
- Additionally, it exploits SIMD where the compiler doesn't.
- Includes a SPE/Local Store optimized version.
- *collision() only.
- Figure: stacked bars per machine (legend: +Explicit SIMDization, +SW Prefetching, +Unrolling, +Vectorization, +Padding, Naïve+NUMA, +small pages).
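Array padding, one of the tuned parameters in the legend above, skews each array's starting offset so that lattice components accessed together do not all map to the same cache sets. A minimal sketch with an invented pad granularity; the auto-tuner searches over the actual padding values:

```c
#include <stdlib.h>

/* Padding sketch: give each lattice component a distinct extra offset so
 * simultaneously accessed components do not alias to the same cache set.
 * PAD_QUANTUM and the allocation scheme are illustrative only.  Callers
 * must keep the unpadded base pointer if they intend to free(). */
#define PAD_QUANTUM 8   /* pad in units of 8 doubles (one 64-byte line) */

double *alloc_padded(size_t n_points, int component)
{
    size_t pad   = (size_t)component * PAD_QUANTUM;  /* per-component skew */
    double *base = malloc((n_points + pad) * sizeof(double));
    return base ? base + pad : NULL;                 /* hand back skewed ptr */
}
```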

Slide 47: Roofline model for LBMHD
- Far more adds than multiplies (imbalance).
- Huge data sets.
- Figure: rooflines for the four machines with ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, w/out SW prefetch, w/out NUMA, bank conflicts, 25% FP, 12% FP, dataset fits in snoop filter).

Slide 48: Roofline model for LBMHD (overlay arithmetic intensity)
- Far more adds than multiplies (imbalance).
- Essentially random access to memory; flop:byte ratio ~0.7.
- NUMA allocation/access; little ILP; no DLP; high conflict misses.
- No naïve SPE implementation (Cell).
- Figure: the four rooflines with LBMHD's arithmetic intensity overlaid.

Slide 49: Roofline model for LBMHD (out-of-the-box parallel performance)
- Far more adds than multiplies (imbalance).
- Essentially random access to memory; flop:byte ratio ~0.7.
- NUMA allocation/access; little ILP; no DLP; high conflict misses.
- Peak Victoria Falls performance with 64 threads (out of 128) due to high conflict misses.
- No naïve SPE implementation (Cell).
- Figure: the four rooflines with out-of-the-box parallel performance plotted.

Slide 50: Roofline model for LBMHD (Padding, Vectorization, Unrolling, Reordering, ...)
- Vectorize the code to eliminate TLB capacity misses.
- Ensures unit-stride access (the bottom bandwidth ceiling).
- Tune for the optimal vector length (VL).
- Clovertown is pinned to the lower bandwidth ceiling.
- Figure: the four rooflines with performance after the portable C optimizations.

Slide 51: Roofline model for LBMHD (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings).
- Use the cache bypass instruction movntpd; this increases the flop:byte ratio to ~1.0 on x86/Cell.
- Figure: the four rooflines with performance after explicit SIMDization and cache bypass.
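On x86 the cache-bypass instruction mentioned above is reached through the SSE2 intrinsic _mm_stream_pd, which compiles to movntpd. A minimal sketch of a streaming-store loop; the surrounding kernel is illustrative, not the LBMHD code itself:

```c
#include <emmintrin.h>   /* SSE2: _mm_stream_pd compiles to movntpd */

/* Cache-bypass sketch: write results with non-temporal stores so the
 * write-allocate read of the destination line is avoided, raising the
 * kernel's effective flop:byte ratio.  Assumes dst is 16-byte aligned
 * and n is even. */
void scale_stream(double *dst, const double *src, double alpha, long n)
{
    for (long i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(_mm_loadu_pd(&src[i]), _mm_set1_pd(alpha));
        _mm_stream_pd(&dst[i], v);      /* bypass the cache on the store */
    }
    _mm_sfence();                       /* order the non-temporal stores */
}
```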

Slide 52: Roofline model for LBMHD (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings).
- Use the cache bypass instruction movntpd; this increases the flop:byte ratio to ~1.0 on x86/Cell.
- Punchline: 3 out of 4 machines hit the Roofline.
- Figure: same rooflines as the previous slide.

Slide 53: The Heat Equation Stencil
Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures", Supercomputing (SC) (to appear), 2008.

Slide 54: The Heat Equation Stencil
- Explicit heat equation (Laplacian, ~∇²F(x,y,z)) on a regular grid.
- Storage: one double per point in space.
- 7-point nearest-neighbor stencil.
- Must read every point from DRAM, perform 8 flops (a linear combination), and write every point back to DRAM.
- Just over 0.5 flops/byte (ideal).
- Cache locality is important.
- Run one problem size: 256^3.
- Figure: the PDE grid and the 7-point stencil (the x±1, y±1, z±1 neighbors of point x,y,z).
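A minimal C sketch of the naive 7-point sweep described above; the grid layout and coefficient names (alpha, beta) are illustrative, but it performs exactly the 8 flops per point counted on the slide:

```c
/* Naive 7-point heat-equation sweep on an n^3 grid with a one-point
 * ghost layer: 1 mul (alpha), 1 mul (beta), 5 adds inside the sum of
 * six neighbors, and 1 final add = 8 flops per point. */
#define IDX(i, j, k, n) ((size_t)(k) * (n) * (n) + (size_t)(j) * (n) + (i))

void heat_sweep(const double *in, double *out, int n, double alpha, double beta)
{
    for (int k = 1; k < n - 1; k++)
        for (int j = 1; j < n - 1; j++)
            for (int i = 1; i < n - 1; i++)
                out[IDX(i, j, k, n)] =
                    alpha * in[IDX(i, j, k, n)] +
                    beta  * (in[IDX(i - 1, j, k, n)] + in[IDX(i + 1, j, k, n)] +
                             in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)] +
                             in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)]);
}
```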

Slide 55: Stencil Performance (out-of-the-box code)
- Expect performance to be between SpMV and LBMHD.
- Scalability is universally poor.
- Performance is poor.
- Figure: Naïve performance bars per machine.

Slide 56: Auto-tuned Stencil Performance (portable C)
- Proper NUMA management is essential on most architectures.
- Moreover, proper cache blocking is still essential even with MBs of cache.
- Figure: stacked bars per machine (legend: +SW Prefetching, +Unrolling, +Thread/Cache Blocking, +Padding, +NUMA, Naïve, +Collaborative Threading).
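Cache/thread blocking from the legend above amounts to tiling the sweep so that each tile's working set fits in cache. A minimal sketch that tiles the i/j loops; the tile sizes TI and TJ are exactly the kind of parameters the auto-tuner searches over. It reuses the IDX macro from the naive sketch, repeated here so the block stands alone (identical redefinition is legal C):

```c
#define IDX(i, j, k, n) ((size_t)(k) * (n) * (n) + (size_t)(j) * (n) + (i))

/* Cache-blocked 7-point sweep sketch: the jj/ii loops walk tiles of size
 * TJ x TI; within a tile the k sweep reuses the cached planes. */
void heat_sweep_blocked(const double *in, double *out, int n,
                        double alpha, double beta, int TI, int TJ)
{
    for (int jj = 1; jj < n - 1; jj += TJ) {
        int jmax = (jj + TJ < n - 1) ? jj + TJ : n - 1;
        for (int ii = 1; ii < n - 1; ii += TI) {
            int imax = (ii + TI < n - 1) ? ii + TI : n - 1;
            for (int k = 1; k < n - 1; k++)
                for (int j = jj; j < jmax; j++)
                    for (int i = ii; i < imax; i++)
                        out[IDX(i, j, k, n)] =
                            alpha * in[IDX(i, j, k, n)] +
                            beta  * (in[IDX(i - 1, j, k, n)] + in[IDX(i + 1, j, k, n)] +
                                     in[IDX(i, j - 1, k, n)] + in[IDX(i, j + 1, k, n)] +
                                     in[IDX(i, j, k - 1, n)] + in[IDX(i, j, k + 1, n)]);
        }
    }
}
```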

Slide 57: Auto-tuned Stencil Performance (architecture specific optimizations)
- Cache bypass can significantly improve Barcelona performance.
- DMA, SIMD, and cache blocking were essential on Cell.
- Figure: stacked bars per machine (legend: +Explicit SIMDization, +SW Prefetching, +Unrolling, +Thread/Cache Blocking, +Padding, +NUMA, +Cache bypass / DMA, Naïve, +Collaborative Threading).

Slide 58: Roofline model for Stencil (out-of-the-box code)
- Large datasets; 2 unit-stride streams; no NUMA; little ILP; no DLP.
- Far more adds than multiplies (imbalance).
- Ideal flop:byte ratio of 1/3.
- High locality requirements: capacity and conflict misses will severely impair the flop:byte ratio.
- Figure: rooflines for the four machines with the stencil's ceilings.

Slide 59: Roofline model for Stencil (out-of-the-box code)
- Same bullets as the previous slide.
- No naïve SPE implementation (Cell).
- Figure: the four rooflines (animation build of the previous slide).

Slide 60: Roofline model for Stencil (out-of-the-box code)
- Same bullets and figure as the previous slide (animation build continues).

Slide 61: Roofline model for Stencil (NUMA, cache blocking, unrolling, prefetch, ...)
- Cache blocking helps ensure the flop:byte ratio is as close as possible to 1/3.
- Clovertown has huge caches but is pinned to the lower bandwidth ceiling.
- Cache management is essential when capacity per thread is low.
- No naïve SPE implementation (Cell).
- Figure: the four rooflines with performance after the portable C optimizations.

Slide 62: Roofline model for Stencil (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings).
- Use the cache bypass instruction movntpd; this increases the flop:byte ratio to ~0.5 on x86/Cell.
- Figure: the four rooflines with performance after architecture-specific optimizations.

Slide 63: Roofline model for Stencil (SIMDization + cache bypass)
- Make SIMDization explicit (technically, this swaps the ILP and SIMD ceilings).
- Use the cache bypass instruction movntpd; this increases the flop:byte ratio to ~0.5 on x86/Cell.
- Punchline: 3 out of 4 machines are basically on the Roofline.
- Figure: same rooflines as the previous slide.

Slide 64: Summary

Slide 65: Out-of-the-box Performance
- Ideal productivity (just type 'make').
- Kernels sorted by arithmetic intensity.
- Maximum performance with any concurrency.
- Note: Cell = 4 PPE threads (no SPEs).
- Surprisingly, most architectures got similar performance.

Slide 66: Portable Performance
- Portable (C only) auto-tuning.
- Sacrifice some productivity upfront; amortize it by reuse.
- Dramatic increases in performance on Barcelona and Victoria Falls.
- Clovertown is increasingly memory bound.
- Cell = 4 PPE threads (no SPEs).

Slide 67: Architecture Specific Performance
- ISA-specific auto-tuning (reduced portability & productivity).
- Explicit: SPE parallelization with DMAs, SIMDization (SSE + Cell), cache bypass.
- Cell gets all its performance from the SPEs (DP limits performance).
- Barcelona gets half its performance from SIMDization.

Slide 68: Power Efficiency
- Used a digital power meter to measure sustained power under load.
- Efficiency = sustained performance / sustained power.
- Victoria Falls' power efficiency is severely limited by FBDIMM power.
- Despite Cell's weak double precision, it delivers the highest power efficiency.

Slide 69: Summary (auto-tuning)
- Auto-tuning provides portable performance.
- Auto-tuning with architecture (ISA) specific optimizations is essential on some machines.
- The Roofline model qualifies performance: it tells us which optimizations are important, as well as how much further improvement is possible.

Slide 70: Summary (architectures)
- Clovertown: severely bandwidth limited.
- Barcelona: delivers good performance with architecture-specific auto-tuning; bandwidth limited after ISA optimizations.
- Victoria Falls: challenged by the interaction of TLP with shared caches; often limited by in-core performance.
- Cell Broadband Engine: performance comes entirely from architecture-specific auto-tuning; bandwidth limited on SpMV; DP FP limited on Stencil (slightly) and LBMHD (severely); delivers the best power efficiency.

Slide 71: Acknowledgements
- Research supported by: Microsoft and Intel funding (Award #20080469), the DOE Office of Science under contract number DE-AC02-05CH11231, and NSF contract CNS-0325873.
- Sun Microsystems: Niagara2 / Victoria Falls machines.
- AMD: access to quad-core Opteron (Barcelona).
- Forschungszentrum Jülich: access to QS20 Cell blades.
- IBM: virtual loaner program for QS20 Cell blades.

Slide 72: Questions?
- Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
- Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008. Best Paper, Application Track.
- Kaushik Datta, Mark Murphy, Vasily Volkov, Samuel Williams, Jonathan Carter, Leonid Oliker, David Patterson, John Shalf, Katherine Yelick, "Stencil Computation Optimization and Autotuning on State-of-the-Art Multicore Architectures", Supercomputing (SC) (to appear), 2008.

