
Slide 1: Auto-tuning Performance on Multicore Computers
Samuel Williams, Ph.D. Dissertation Talk (samw@cs.berkeley.edu)
Committee: David Patterson (advisor and chair), Kathy Yelick, Sara McMains

Slide 2: Multicore Processors

Slide 3: Superscalar Era
- Single thread performance scaled at ~50% per year
- Bandwidth increases much more slowly, but we could add additional bits or channels
- Lack of diversity in architecture = lack of individual tuning
- The power wall has capped single thread performance
[Figure: log(performance) vs. year (1990-2005) with trend lines for processor (~50%/year), memory bandwidth, and memory latency.]
Multicore is the agreed solution

Slide 4: Superscalar Era
- (Same bullets and figure as slide 3.)
But, no agreement on what it should look like

Slide 5: Multicore Era
- Take existing power-limited superscalar processors and add cores.
- Serial code shows no improvement.
- Parallel performance might improve at 25% per year = the improvement in power efficiency from process technology.
- DRAM bandwidth is currently only improving by ~12% per year.
- Intel / AMD model.
[Figure: log(performance) vs. year (2003-2009): serial +0%/year, parallel +25%/year, DRAM bandwidth ~12%/year.]

Slide 6: Multicore Era
- Radically different approach: give up superscalar out-of-order cores in favor of small, efficient, in-order cores.
- Huge, abrupt, shocking drop in single thread performance.
- May eliminate the power wall, allowing many cores and performance to scale at up to 40% per year.
- Troublingly, the number of DRAM channels may need to double every three years.
- Examples: Niagara, Cell, GPUs.
[Figure: log(performance) vs. year (2003-2009): serial +0%/year, parallel +40%/year, DRAM bandwidth +12%/year.]

Slide 7: Multicore Era
- Another option would be to reduce per-core performance (and power) every year.
- Graceful degradation in serial performance.
- Bandwidth is still an issue.
[Figure: log(performance) vs. year (2003-2009): serial -12%/year, parallel +40%/year, DRAM bandwidth +12%/year.]

Slide 8: Multicore Era
- There are still many other architectural knobs:
- Superscalar? Dual issue? VLIW?
- Pipeline depth?
- SIMD width?
- Multithreading (vertical, simultaneous)?
- Shared caches vs. private caches?
- FMA vs. MUL+ADD vs. MUL or ADD?
- Clustering of cores?
- Crossbars vs. rings vs. meshes?
- Currently, there is no consensus on the optimal configuration.
- As a result, there is a plethora of multicore architectures.

Slide 9: Computational Motifs

Slide 10: Computational Motifs
- Evolved from Phil Colella's Seven Dwarfs of Parallel Computing: numerical methods common throughout scientific computing.
- Dense and sparse linear algebra
- Computations on structured and unstructured grids
- Spectral methods
- N-body particle methods
- Monte Carlo
- Within each dwarf there are a number of computational kernels.
- The Berkeley View, and subsequently the Par Lab, expanded these to many other domains in computing (embedded, SPEC, databases, games): graph algorithms, combinational logic, finite state machines, etc., and rechristened them Computational Motifs.
- Each could be black-boxed into libraries or frameworks by domain experts.
- But how do we get good performance given the diversity of architectures?

Slide 11: Auto-tuning

Slide 12: Auto-tuning (motivation)
- Given the huge diversity of processor architectures, a code hand-optimized for one architecture will likely deliver poor performance on another. Moreover, code optimized for one input data set may deliver poor performance on another.
- We want a single code base that delivers performance portability across the breadth of architectures today and into the future.
- Auto-tuners are composed of two principal components:
- a code generator based on high-level functionality rather than parsing C, and
- the auto-tuner proper, which searches for the optimal parameters for each optimization.
- Auto-tuners don't invent or discover optimizations; they search through the parameter space of a variety of known optimizations.
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).

Slide 13: Auto-tuning (code generation)
- The code generator produces many code variants for some numerical kernel using known optimizations and transformations.
- For example:
- cache blocking adds several parameterized loop nests,
- prefetching adds parameterized intrinsics to the code,
- loop unrolling and reordering explicitly unrolls the code; each unrolling is a unique code variant.
- Kernels can have dozens of different optimizations, some of which can produce hundreds of code variants.
- The code generators used in this work are kernel-specific and were written in Perl.
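To make the parameterization concrete, here is a minimal C sketch of the kind of code variant such a generator might emit; the kernel (a simple scaled vector update), the parameter names BX and UNROLL, and their values are illustrative assumptions, not code from the dissertation.

/* One auto-generated code variant: cache-blocked and 4-way unrolled.
 * BX and UNROLL are the tunable parameters baked into this variant;
 * a different unrolling factor would be emitted as a separate variant. */
#define BX     1024   /* cache block size (tunable) */
#define UNROLL 4      /* unrolling factor (tunable) */

void scale_add_variant(double *y, const double *x, double a, int n) {
  for (int ii = 0; ii < n; ii += BX) {            /* parameterized cache blocking */
    int imax = (ii + BX < n) ? ii + BX : n;
    int i = ii;
    for (; i + UNROLL <= imax; i += UNROLL) {     /* explicit unrolling */
      y[i+0] += a * x[i+0];
      y[i+1] += a * x[i+1];
      y[i+2] += a * x[i+2];
      y[i+3] += a * x[i+3];
    }
    for (; i < imax; i++)                         /* cleanup loop */
      y[i] += a * x[i];
  }
}

In practice the Perl generator would emit one such routine per (block size, unroll factor, prefetch distance, ...) combination, and the search phase would time each of them.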

Slide 14: Auto-tuning (search)
- In this work, we use two search techniques:
- Exhaustive - examine every combination of parameters for every optimization (often intractable)
- Heuristics - use knowledge of the architecture or algorithm to restrict the search space
[Figure: the parameter spaces for two optimizations, A and B, contrasting exhaustive enumeration of the full space with a heuristically pruned subset.]
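As a rough illustration of the exhaustive option, the sketch below enumerates a small cross-product of tuning parameters and keeps the fastest variant; the parameter lists and the benchmark_variant() stub are assumptions for the sake of the example, not the dissertation's search driver.

#include <float.h>
#include <stdio.h>

/* Stub: in a real auto-tuner this would run and time the pre-generated
 * code variant selected by (cache block, unroll factor, prefetch distance). */
static double benchmark_variant(int cb, int unroll, int pf) {
  (void)cb; (void)unroll; (void)pf;
  return 1.0;   /* placeholder timing */
}

int main(void) {
  const int cbs[]     = {64, 128, 256, 512, 1024};
  const int unrolls[] = {1, 2, 4, 8};
  const int pfs[]     = {0, 64, 128, 256};
  double best = DBL_MAX;
  int bcb = 0, bun = 0, bpf = 0;

  for (int i = 0; i < 5; i++)                    /* exhaustive enumeration */
    for (int j = 0; j < 4; j++)
      for (int k = 0; k < 4; k++) {
        double t = benchmark_variant(cbs[i], unrolls[j], pfs[k]);
        if (t < best) { best = t; bcb = cbs[i]; bun = unrolls[j]; bpf = pfs[k]; }
      }
  printf("best variant: block=%d unroll=%d prefetch=%d (%.3f s)\n", bcb, bun, bpf, best);
  return 0;
}

A heuristic search would simply prune these loops, e.g. fixing the prefetch distance from an architectural rule of thumb and only sweeping the remaining parameters.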

Slide 15: Overview
- Multicore SMPs
- The Roofline Model
- Auto-tuning LBMHD
- Auto-tuning SpMV
- Summary
- Future Work

Slide 16: Thesis Contributions
- Introduced the Roofline Model.
- Extended auto-tuning to the structured grid motif (specifically LBMHD).
- Extended auto-tuning to multicore: fundamentally different from running auto-tuned serial code on multicore SMPs; applied the concept to LBMHD and SpMV.
- Analyzed the breadth of multicore architectures in the context of auto-tuned SpMV and LBMHD.
- Discussed future directions in auto-tuning.

Slide 17: Multicore SMPs (Chapter 3)

Slide 18: Multicore SMP Systems
- Intel Xeon E5345 (Clovertown)
- AMD Opteron 2356 (Barcelona)
- Sun UltraSPARC T2+ T5140 (Victoria Falls)
- IBM QS20 (Cell Blade)
[Figure: block diagrams of the four dual-socket systems, showing cores, caches/local stores, on-chip interconnect, memory controllers, and DRAM channels with their bandwidths.]

Slide 19: Multicore SMP Systems (conventional cache-based memory hierarchy)
[Figure: the same four block diagrams, highlighting the parts of the machines whose cores access DRAM through conventional cache hierarchies (Clovertown, Barcelona, Victoria Falls, and the Cell PPEs).]

Slide 20: Multicore SMP Systems (local store-based memory hierarchy)
[Figure: the same block diagrams, highlighting the Cell blade's SPEs, which access DRAM via DMA into software-managed 256KB local stores rather than caches.]

Slide 21: Multicore SMP Systems (CMT = Chip Multithreading)
[Figure: the same block diagrams, highlighting the chip-multithreaded processors (the multithreaded UltraSPARC T2+ cores and the Cell's vertically multithreaded PPEs).]

Slide 22: Multicore SMP Systems (peak double-precision FLOP rates)
[Figure: the same block diagrams annotated with peak double-precision rates: 74.66 GFLOP/s (Clovertown), 73.60 GFLOP/s (Barcelona), 18.66 GFLOP/s (Victoria Falls), and 29.25 GFLOP/s (Cell Blade SPEs).]

Slide 23: Multicore SMP Systems (DRAM pin bandwidth)
[Figure: the same block diagrams annotated with DRAM pin bandwidths: 21 GB/s read + 10 GB/s write (Clovertown), 21 GB/s (Barcelona), 42 GB/s read + 21 GB/s write (Victoria Falls), and 51 GB/s (Cell Blade).]

Slide 24: Multicore SMP Systems (Non-Uniform Memory Access)
[Figure: the same block diagrams, highlighting the NUMA systems (Barcelona, Victoria Falls, and the Cell blade), where each socket has its own memory controllers and locally attached DRAM.]

Slide 25: Roofline Model (Chapter 4)

Slide 26: Memory Traffic
- Total bytes to/from DRAM.
- Can be categorized into:
- compulsory misses
- capacity misses
- conflict misses
- write allocations
- ...
- Oblivious to a lack of sub-cache-line spatial locality.

Slide 27: Arithmetic Intensity
- For the purposes of this talk, we'll deal with floating-point kernels.
- Arithmetic Intensity (AI) ~ total FLOPs / total DRAM bytes; includes cache effects.
- Many interesting problems have constant AI (with respect to problem size).
- Bad news given slowly increasing DRAM bandwidth.
- Bandwidth and traffic are the key optimization targets.
[Figure: spectrum of arithmetic intensity: O(1) - SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N) - FFTs; O(N) - dense linear algebra (BLAS3), particle methods.]
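As a hedged consistency check of the constant-AI claim, take SpMV over a matrix in compressed sparse row format (used later in this talk): each nonzero contributes one multiply and one add but costs roughly an 8-byte value plus a 4-byte column index of compulsory traffic, so AI ~ 2 / 12 = 0.166 flops per byte regardless of matrix dimension - the same figure quoted for the reference SpMV implementation in the SpMV chapter.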

Slide 28: Basic Idea
- Synthesize communication, computation, and locality into a single visually intuitive performance figure using bound and bottleneck analysis.
- Given a kernel's arithmetic intensity (based on DRAM traffic after being filtered by the cache), programmers can inspect the figure and bound performance.
- Moreover, it provides insight into which optimizations will potentially be beneficial.
- Attainable performance(i,j) = min( peak FLOP/s with optimizations 1..i, AI x bandwidth with optimizations 1..j )
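A minimal C sketch of evaluating this bound; the peak_dp value is the Barcelona peak quoted on slide 22, while the Stream bandwidth value is only a placeholder, not a measurement from this work.

#include <stdio.h>

/* Roofline bound: attainable GFLOP/s is the lesser of the in-core ceiling
 * and AI times the bandwidth ceiling.  The ceilings passed in reflect
 * whichever optimizations (SIMD, ILP, NUMA, prefetch, ...) have been applied. */
static double roofline_gflops(double ai, double flop_ceiling, double bw_ceiling) {
  double bw_bound = ai * bw_ceiling;
  return (bw_bound < flop_ceiling) ? bw_bound : flop_ceiling;
}

int main(void) {
  double peak_dp = 73.6;     /* GFLOP/s: Barcelona's peak from the slides */
  double stream  = 16.0;     /* GB/s: placeholder Stream bandwidth */
  for (double ai = 0.125; ai <= 16.0; ai *= 2.0)
    printf("AI = %6.3f -> %6.2f GFLOP/s\n", ai, roofline_gflops(ai, peak_dp, stream));
  return 0;
}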

Slide 29: Constructing a Roofline Model (computational ceilings)
- Plot on a log-log scale.
- Given AI, we can easily bound performance.
- But architectures are much more complicated.
- We will bound performance as we eliminate specific forms of in-core parallelism.
[Figure: Roofline for the Opteron 2356 (Barcelona): attainable GFLOP/s (0.5-256, log scale) vs. actual FLOP:byte ratio (1/8-16), with the peak DP ceiling and the Stream Bandwidth diagonal.]

Slide 30: Constructing a Roofline Model (computational ceilings)
- Opterons have dedicated multipliers and adders.
- If the code is dominated by adds, then attainable performance is half of peak.
- We call these ceilings; they act like constraints on performance.
[Figure: the Barcelona Roofline with a "mul/add imbalance" ceiling added below peak DP.]

Slide 31: Constructing a Roofline Model (computational ceilings)
- Opterons have 128-bit datapaths.
- If instructions aren't SIMDized, attainable performance will be halved.
[Figure: the Barcelona Roofline with a "w/out SIMD" ceiling added.]

Slide 32: Constructing a Roofline Model (computational ceilings)
- On Opterons, floating-point instructions have a 4-cycle latency.
- If we don't express 4-way ILP, performance will drop by as much as 4x.
[Figure: the Barcelona Roofline with a "w/out ILP" ceiling added.]

Slide 33: Constructing a Roofline Model (communication ceilings)
- We can perform a similar exercise, taking away parallelism from the memory subsystem.
[Figure: the Barcelona Roofline showing the peak DP ceiling and the Stream Bandwidth diagonal.]

Slide 34: Constructing a Roofline Model (communication ceilings)
- Explicit software prefetch instructions are required to achieve peak bandwidth.
[Figure: the Barcelona Roofline with a "w/out SW prefetch" bandwidth ceiling added.]

Slide 35: Constructing a Roofline Model (communication ceilings)
- Opterons are NUMA.
- As such, memory traffic must be correctly balanced between the two sockets to achieve good Stream bandwidth.
- We could continue this by examining strided or random memory access patterns.
[Figure: the Barcelona Roofline with a "w/out NUMA" bandwidth ceiling added.]

Slide 36: Constructing a Roofline Model (computation + communication)
- We may bound performance based on the combination of expressed in-core parallelism and attained bandwidth.
[Figure: the complete Barcelona Roofline with all computational ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth ceilings (Stream Bandwidth, w/out SW prefetch, w/out NUMA).]

Slide 37: Constructing a Roofline Model (locality walls)
- Remember, memory traffic includes more than just compulsory misses.
- As such, actual arithmetic intensity may be substantially lower.
- Walls are unique to the architecture-kernel combination.
- With only compulsory miss traffic: AI = FLOPs / Compulsory Misses
[Figure: the full Barcelona Roofline with a vertical locality wall at this AI.]

Slide 38: Constructing a Roofline Model (locality walls)
- (Same bullets as slide 37.)
- Adding write-allocation traffic: AI = FLOPs / (Write Allocations + Compulsory Misses)
[Figure: the locality wall moves left as write-allocate traffic is included.]

Slide 39: Constructing a Roofline Model (locality walls)
- (Same bullets as slide 37.)
- Adding capacity miss traffic: AI = FLOPs / (Capacity + Write Allocations + Compulsory Misses)
[Figure: the locality wall moves further left.]

Slide 40: Constructing a Roofline Model (locality walls)
- (Same bullets as slide 37.)
- Adding conflict miss traffic: AI = FLOPs / (Conflict + Capacity + Write Allocations + Compulsory Misses)
[Figure: the locality wall moves further left still.]

Slide 41: Roofline Models for SMPs
- Note, the multithreaded Niagara (Victoria Falls) is limited by the instruction mix (ceilings at 25%, 12%, and 6% FP) rather than a lack of expressed in-core parallelism.
- Clearly some architectures are more dependent on bandwidth optimizations, while others depend on in-core optimizations.
[Figure: Rooflines for the five targets - Xeon E5345 (Clovertown), Opteron 2356 (Barcelona), UltraSPARC T2+ T5140 (Victoria Falls), QS20 Cell Blade PPEs, and QS20 Cell Blade SPEs - each with its own computational ceilings (peak DP, mul/add imbalance, SIMD, ILP, FMA, FP instruction mix) and bandwidth ceilings (Stream bandwidth, SW prefetch, NUMA, misaligned DMA, small- vs. large-dataset bandwidth).]

Slide 42: Auto-tuning Lattice-Boltzmann Magnetohydrodynamics (LBMHD) (Chapter 6)

Slide 43: Introduction to Lattice Methods
- Structured grid code with a series of time steps.
- Popular in CFD.
- Allows for complex boundary conditions.
- No temporal locality between points in space within one time step.
- Higher-dimensional phase space.
- Simplified kinetic model that maintains the macroscopic quantities.
- Distribution functions (e.g., 5-27 velocities per point in space) are used to reconstruct macroscopic quantities.
- Significant memory capacity requirements.
[Figure: a 27-point 3D lattice of velocity directions (indices 0-26) along the +X, +Y, +Z axes.]

Slide 44: LBMHD (general characteristics)
- Plasma turbulence simulation.
- Two distributions:
- momentum distribution (27 scalar components)
- magnetic distribution (15 vector components)
- Three macroscopic quantities:
- density
- momentum (vector)
- magnetic field (vector)
- Must read 73 doubles and update 79 doubles per point in space.
- Requires about 1300 floating-point operations per point in space.
- Just over 1.0 FLOPs/byte (ideal).
[Figure: lattices for the momentum distribution (27 scalar components), the magnetic distribution (15 vector components), and the macroscopic variables.]
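A hedged back-of-the-envelope check of that intensity: (73 + 79) doubles of compulsory traffic is 1216 bytes per lattice point, so ~1300 flops gives about 1300 / 1216 = 1.07 flops/byte when stores bypass the cache; if the 79 written doubles are first read in by write-allocate, traffic grows to roughly (73 + 2x79) x 8 = 1848 bytes and the intensity falls to about 0.7 - the two values quoted on the LBMHD Roofline slide a few slides later.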

Slide 45: LBMHD (implementation details)
- Data structure choices:
- Array of structures: no spatial locality, strided access.
- Structure of arrays: huge number of memory streams per thread, but guarantees spatial locality and unit stride, and vectorizes well.
- Parallelization:
- The Fortran version used MPI to communicate between nodes - a bad match for multicore.
- The version in this work uses pthreads for multicore (this thesis is not about innovation in the threading model or programming language).
- MPI is not used when auto-tuning.
- Two problem sizes:
- 64^3 (~330MB)
- 128^3 (~2.5GB)

Slide 46: SOA Memory Access Pattern
- Consider a simple D2Q9 lattice method using a structure of arrays (SOA).
- There are 9 read arrays and 9 write arrays, but all accesses are unit stride.
- LBMHD has 73 read and 79 write streams per thread.
[Figure: unit-stride accesses into read_array[ ][ ] and write_array[ ][ ] along the x dimension.]
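A minimal C sketch of this structure-of-arrays layout for the D2Q9 example; the names are illustrative and the update body is a stand-in, not LBMHD's actual collision/stream operator.

#define NV 9   /* velocity components per lattice point in D2Q9 */

/* Structure of arrays: one contiguous, unit-stride array per component,
 * so a sweep streams through 9 read arrays and 9 write arrays
 * (LBMHD itself generates 73 read and 79 write streams per thread). */
typedef struct {
  double *f[NV];   /* f[v][x]: component v at lattice point x */
} LatticeSoA;

void sweep_soa(LatticeSoA *dst, const LatticeSoA *src, int npoints) {
  for (int v = 0; v < NV; v++)
    for (int x = 0; x < npoints; x++)     /* unit stride in x */
      dst->f[v][x] = src->f[v][x];        /* stand-in for the real update */
}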

Slide 47: Roofline Models for SMPs
- LBMHD has an AI of 0.7 on write-allocate architectures, and 1.0 on those with cache bypass or no write allocation.
- MUL/ADD imbalance.
- Some architectures will be bandwidth-bound, while others will be compute-bound.
[Figure: the five Rooflines from slide 41 with the LBMHD arithmetic intensities (0.7 and 1.0) marked.]

Slide 48: LBMHD Performance (reference implementation)
- The standard cache-based implementation can be easily parallelized with pthreads.
- NUMA is implicitly exploited.
- Although scalability looks good, is performance good?
[Figure: reference LBMHD performance vs. concurrency and problem size.]

Slide 49: LBMHD Performance (reference implementation)
- (Same bullets and chart as slide 48.)

Slide 50: LBMHD Performance (reference implementation)
- Superscalar performance is surprisingly good given the complexity of the memory access pattern.
- Cell PPE performance is abysmal.
[Figure: reference LBMHD performance plotted against the five Rooflines from slide 41.]

Slide 51: LBMHD Performance (array padding)
- LBMHD touches >150 arrays.
- Most caches have limited associativity, so conflict misses are likely.
- Apply a heuristic to pad the arrays.

Slide 52: LBMHD Performance (vectorization)
- LBMHD touches >150 arrays; most TLBs have far fewer than 128 entries.
- The vectorization technique creates a vector of points that are being updated.
- Loops are interchanged and strip-mined.
- Exhaustively search for the optimal "vector length" that balances page locality against L1 cache misses.
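A hedged sketch of the loop structure this produces; VLEN is the tuned vector length, and the gather/update bodies are placeholders rather than LBMHD's actual operator.

#define VLEN 64   /* tuned vector length: balances TLB/page locality against L1 misses */

void update_strip_mined(double **dst, double **src, int nstreams, int npoints) {
  double buf[VLEN];                               /* vector of points being updated */
  for (int x0 = 0; x0 < npoints; x0 += VLEN) {    /* strip-mine the spatial loop */
    int len = (x0 + VLEN < npoints) ? VLEN : npoints - x0;
    for (int s = 0; s < nstreams; s++) {          /* interchanged loop over streams */
      for (int i = 0; i < len; i++)
        buf[i] = src[s][x0 + i];                  /* gather one stream's strip */
      for (int i = 0; i < len; i++)
        dst[s][x0 + i] = buf[i];                  /* placeholder for the real update */
    }
  }
}

Sweeping VLEN (the exhaustive part of the search) trades the number of concurrently touched pages against how much of each strip stays resident in L1.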

Slide 53: LBMHD Performance (other optimizations)
- Heuristic-based software prefetching.
- Exhaustive search for low-level optimizations:
- loop unrolling/reordering
- SIMDization
- Cache bypass increases arithmetic intensity by 50%.
- Small TLB pages on Victoria Falls.
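The cache-bypass optimization can be sketched with SSE2 non-temporal stores; this is a hedged illustration of the idea, not the dissertation's SIMDized LBMHD code, and it assumes 16-byte-aligned pointers and an even element count.

#include <emmintrin.h>

/* Stream results straight to DRAM, skipping the write-allocate read of the
 * destination lines; eliminating that read is what raises LBMHD's arithmetic
 * intensity from ~0.7 toward ~1.0. */
void store_bypass(double *dst, const double *src, int n) {
  for (int i = 0; i < n; i += 2) {
    __m128d v = _mm_load_pd(&src[i]);   /* aligned 128-bit load */
    _mm_stream_pd(&dst[i], v);          /* non-temporal (cache-bypass) store */
  }
  _mm_sfence();                         /* order the streaming stores */
}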

Slide 54: LBMHD Performance (Cell SPE implementation)
- We can write a local store implementation and run it on the Cell SPEs.
- Ultimately, Cell's weak double precision hampers performance.

Slide 55: LBMHD Performance (speedup for largest problem)
- (Same bullets as slide 54.)
- Auto-tuning speedups over the reference implementation of 1.6x, 4x, 3x, and 130x across the four systems; the 130x is the Cell blade, where the SPE implementation replaces the PPE-only reference.

Slide 56: LBMHD Performance (vs. Roofline)
- Most architectures reach their Roofline-bounded performance.
- Clovertown's snoop filter is ineffective.
- Niagara suffers from instruction-mix issues.
- Cell PPEs are latency-limited.
- Cell SPEs are compute-bound.
[Figure: auto-tuned LBMHD performance plotted against the five Rooflines from slide 41.]

Slide 57: LBMHD Performance (Summary)
- The reference code is clearly insufficient.
- Portable C code is insufficient on Barcelona and Cell.
- Cell gets all its performance from the SPEs, despite only 2x the area and 2x the peak DP FLOPs.

Slide 58: Auto-tuning Sparse Matrix-Vector Multiplication (Chapter 8)

Slide 59: Sparse Matrix-Vector Multiplication
- Sparse matrix: most entries are 0.0.
- Performance advantage in only storing/operating on the nonzeros.
- Requires significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors.
- Challenges:
- difficult to exploit ILP (bad for superscalar)
- difficult to exploit DLP (bad for SIMD)
- irregular memory access to the source vector
- difficult to load balance
[Figure: y = Ax with a sparse A.]
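For reference, a minimal CSR (compressed sparse row) y = Ax kernel of the kind the reference implementation parallelizes by rows - a standard textbook sketch, not the dissertation's exact code.

/* CSR SpMV: rowptr has nrows+1 entries; colind/values hold the nonzeros. */
void spmv_csr(int nrows, const int *rowptr, const int *colind,
              const double *values, const double *x, double *y) {
  for (int r = 0; r < nrows; r++) {
    double sum = 0.0;
    for (int k = rowptr[r]; k < rowptr[r+1]; k++)
      sum += values[k] * x[colind[k]];   /* irregular access to the source vector */
    y[r] = sum;
  }
}

The two flops and roughly 12 bytes (value plus index) per nonzero are what pin the reference implementation's arithmetic intensity near 0.166.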

Slide 60: Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache; rank ranges from 2K to 1M.
- Subdivided into 4 categories:
- Dense: a 2K x 2K dense matrix stored in sparse format
- Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
- Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
- Extreme aspect ratio (linear programming): LP

Slide 61: Roofline Models for SMPs
- The reference SpMV implementation has an AI of 0.166, but can readily exploit FMA.
- The best we can hope for is an AI of 0.25 for non-symmetric matrices.
- All architectures are memory-bound, but some may need in-core optimizations.
[Figure: the five Rooflines from slide 41 with the SpMV arithmetic intensities (0.166 and 0.25) marked.]

Slide 62: SpMV Performance (Reference Implementation)
- The reference implementation is CSR.
- Simple parallelization by rows, balancing nonzeros.
- No implicit NUMA exploitation.
- Despite the superscalars' use of 8 cores, they see little speedup.
- Niagara and the Cell PPEs show near-linear speedups.
[Figure: reference SpMV performance for each matrix.]

Slide 63: SpMV Performance (Reference Implementation)
- (Same bullets and chart as slide 62.)

Slide 64: SpMV Performance (Reference Implementation)
- Roofline analysis for the dense matrix stored in sparse format:
- the superscalars achieve bandwidth-limited performance,
- Niagara comes very close to its bandwidth limit.
- Clearly, NUMA and prefetching will be essential.
[Figure: reference SpMV performance plotted against the five Rooflines from slide 41.]

Slide 65: SpMV Performance (NUMA and Software Prefetching)
- NUMA-aware allocation is essential on memory-bound NUMA SMPs.
- Explicit software prefetching can boost bandwidth and change cache replacement policies.
- Cell PPEs are likely latency-limited.
- Used exhaustive search.
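NUMA-aware allocation can be sketched with first-touch initialization: each thread initializes the partition it will later compute on, so the OS's default first-touch policy places those pages on that thread's socket. This is a hedged illustration (thread-to-core pinning and error handling are omitted), not necessarily the dissertation's exact mechanism.

#include <pthread.h>
#include <stdlib.h>

typedef struct { double *a; long lo, hi; } InitArg;

static void *first_touch(void *p) {
  InitArg *arg = (InitArg *)p;
  for (long i = arg->lo; i < arg->hi; i++)
    arg->a[i] = 0.0;                      /* first touch places the page locally */
  return NULL;
}

double *numa_aware_alloc(long n, int nthreads) {   /* assumes nthreads <= 64 */
  double *a = malloc(n * sizeof(double));
  pthread_t tid[64];
  InitArg arg[64];
  for (int t = 0; t < nthreads; t++) {
    arg[t].a  = a;
    arg[t].lo = n * t / nthreads;
    arg[t].hi = n * (t + 1) / nthreads;
    pthread_create(&tid[t], NULL, first_touch, &arg[t]);
  }
  for (int t = 0; t < nthreads; t++)
    pthread_join(tid[t], NULL);
  return a;
}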

Slide 66: SpMV Performance (Matrix Compression)
- After maximizing memory bandwidth, the only hope is to minimize memory traffic.
- Exploit:
- register blocking
- other formats
- smaller indices
- Use a traffic-minimization heuristic rather than search.
- The benefit is clearly matrix-dependent.
- Register blocking enables efficient software prefetching (one prefetch per cache line).
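One of these compression options, 2x2 register blocking (BCSR), can be sketched as below; this is an illustrative kernel only, and the auto-tuner actually generates and selects among many block sizes and formats.

/* 2x2 BCSR SpMV: each stored block holds 4 doubles in row-major order.
 * Partially full blocks carry explicit zeros, trading a few extra flops
 * for fewer/smaller indices, better ILP, and one prefetch per cache line. */
void spmv_bcsr_2x2(int nbrows, const int *browptr, const int *bcolind,
                   const double *blocks, const double *x, double *y) {
  for (int br = 0; br < nbrows; br++) {
    double y0 = 0.0, y1 = 0.0;
    for (int k = browptr[br]; k < browptr[br+1]; k++) {
      const double *b  = &blocks[4 * k];
      const double *xp = &x[2 * bcolind[k]];
      y0 += b[0] * xp[0] + b[1] * xp[1];
      y1 += b[2] * xp[0] + b[3] * xp[1];
    }
    y[2*br + 0] = y0;
    y[2*br + 1] = y1;
  }
}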

Slide 67: SpMV Performance (Cache and TLB Blocking)
- Based on limited architectural knowledge, create a heuristic to choose good cache and TLB block sizes.
- Hierarchically store the resulting blocked matrix.
- The benefit can be significant on the most challenging matrix.

Slide 68: SpMV Performance (Cell SPE implementation)
- Cache blocking can be easily transformed into local store blocking.
- With a few small tweaks for DMA, we can run a simplified version of the auto-tuner on Cell:
- BCOO format only
- 2x1 and larger register blocks
- always blocks

Slide 69: SpMV Performance (median speedup)
- (Same bullets as slide 68.)
- Median auto-tuning speedups of 2.8x, 2.6x, 1.4x, and 15x across the four systems, the largest being the Cell blade, where the SPE implementation replaces the PPE reference.

Slide 70: SpMV Performance (vs. Roofline)
- Roofline analysis for the dense matrix in sparse format.
- Compression improves AI.
- Auto-tuning can allow us to slightly exceed Stream bandwidth (but not pin bandwidth).
- Cell PPEs perennially deliver poor performance.
[Figure: auto-tuned SpMV performance plotted against the five Rooflines from slide 41.]

Slide 71: SpMV Performance (summary)
- Median SpMV performance across the suite (as an aside, unlike LBMHD, SSE was unnecessary to achieve this performance).
- Cell still requires a non-portable, ISA-specific implementation to achieve good performance.
- Novel SpMV implementations may require ISA-specific (SSE) code to achieve better performance.

Slide 72: Summary

Slide 73: Summary
- Introduced the Roofline Model:
- applies bound and bottleneck analysis,
- performance and the requisite optimizations are inferred visually.
- Extended auto-tuning to multicore:
- fundamentally different from running auto-tuned serial code on multicore SMPs,
- applied the concept to LBMHD and SpMV.
- Auto-tuning LBMHD and SpMV:
- Multicore has had a transformative effect on auto-tuning (a move from latency-limited to bandwidth-limited).
- Maximizing memory bandwidth and minimizing memory traffic is key.
- Compilers are reasonably effective at in-core optimizations, but totally ineffective at cache and memory issues.
- A library or framework is a necessity in managing these issues.
- Comments on architecture:
- Ultimately machines are bandwidth-limited without new algorithms.
- Architectures with caches required significantly more tuning than the local store-based Cell.

Slide 74: Future Directions in Auto-tuning (Chapter 9)

Slide 75: Future Work (Roofline)
- Automatic generation of Roofline figures:
- kernel-oblivious,
- select the computational metric of interest,
- select the communication channel of interest,
- designate common "optimizations",
- requires a benchmark.
- Using performance counters to generate runtime Roofline figures:
- given a real kernel, we wish to understand the bottlenecks to performance,
- a much friendlier visualization of performance counter data.

Slide 76: Future Work (Making search tractable)
- Given the explosion in optimizations, exhaustive search is clearly not tractable.
- Moreover, heuristics require extensive architectural knowledge.
- In our SC08 work, we tried a greedy approach (one optimization at a time).
- We could make it iterative, or we could make it look like steepest descent (with some local search).
[Figure: the parameter spaces for optimizations A and B, illustrating a restricted traversal instead of exhaustive enumeration.]

Slide 77: Future Work (Auto-tuning Motifs)
- We could certainly auto-tune other individual kernels in any motif, but this requires building a kernel-specific auto-tuner.
- However, we should strive for motif-wide auto-tuning.
- Moreover, we want to decouple the data type (e.g., double precision) from the parallelization structure.
- 1. A motif description or pattern language for each motif:
- e.g., a taxonomy of structured grids plus a code snippet for the stencil;
- we write an auto-tuner that parses these and produces optimized code.
- 2. A series of DAG rewrite rules for each motif. Rules allow:
- insertion of additional nodes,
- duplication of nodes,
- reordering.

Slide 78: Rewrite Rules
- Consider SpMV: in floating point, each node in the DAG is a multiply-accumulate (MAC).
- The DAG makes locality explicit (e.g., local store blocking).
- BCSR adds zeros to the DAG.
- We can cut edges and reconnect them to enable parallelization.
- We can reorder operations.
- Any other data type/node type conforming to these rules can reuse all our auto-tuning efforts.
[Figure: the y = Ax dataflow DAG, with explicit zeros added by register blocking.]

Slide 79: Acknowledgments

Slide 80: Acknowledgments
- Berkeley ParLab:
- Thesis committee: David Patterson, Kathy Yelick, Sara McMains
- BeBOP group: Jim Demmel, Kaushik Datta, Shoaib Kamil, Rich Vuduc, Rajesh Nishtala, et al.
- the rest of the ParLab
- Lawrence Berkeley National Laboratory:
- FTG group: Lenny Oliker, John Shalf, Jonathan Carter, ...
- Hardware donations and remote access: Sun Microsystems, IBM, AMD, FZ Julich, Intel
This research was supported by the ASCR Office in the DOE Office of Science under contract number DE-AC02-05CH11231, by Microsoft and Intel funding through award #20080469, and by matching funding by U.C. Discovery through award #DIG07-10227.

Slide 81: Questions?

