Parallel Computing Laboratory
EECS Electrical Engineering and Computer Sciences, Berkeley Par Lab

Auto-tuning Sparse Matrix Kernels
Sam Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)
(1) University of California Berkeley  (2) Lawrence Berkeley National Laboratory  (3) Georgia Institute of Technology
Motivation
- Multicore is the de facto solution for improving peak performance for the next decade.
- How do we ensure this applies to sustained performance as well?
- Processor architectures are extremely diverse, and compilers can rarely fully exploit them.
- We require a HW/SW solution that guarantees performance without completely sacrificing productivity.
Overview
- Examine the Sparse Matrix-Vector Multiplication (SpMV) kernel.
- Present and analyze two threaded & auto-tuned implementations.
- Benchmarked performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade
- We show:
  - Auto-tuning can significantly improve performance.
  - Cell consistently delivers good performance and efficiency.
  - Niagara2 delivers good performance and productivity.
Multicore SMPs Used
Multicore SMP Systems
(figure: block diagrams of the four platforms)

Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy (Clovertown, Opteron, Niagara2): Cache + Pthreads implementations.
- Disjoint local-store memory hierarchy (Cell SPEs): Local Store + libspe implementations.
Multicore SMP Systems (peak flops)
- Intel Clovertown: 75 Gflop/s
- AMD Opteron: 17 Gflop/s
- IBM Cell Blade: PPEs 13 Gflop/s, SPEs 29 Gflop/s
- Sun Niagara2: 11 Gflop/s
Multicore SMP Systems (peak DRAM bandwidth)
- Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
- AMD Opteron: 21 GB/s
- IBM Cell Blade: 51 GB/s
- Sun Niagara2: 42 GB/s (read), 21 GB/s (write)
Multicore SMP Systems (memory access)
- Uniform Memory Access: Clovertown, Niagara2
- Non-Uniform Memory Access: Opteron, Cell
Arithmetic Intensity
- Arithmetic intensity ~ total flops / total DRAM bytes.
- Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't.
- O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods
- O(log N): FFTs
- O(N): dense linear algebra (BLAS3), particle methods
Auto-tuning
Auto-tuning
- Hand-optimizing each architecture/dataset combination is not feasible.
- Goal: a productive solution for performance portability.
- Our auto-tuning approach finds a good performance solution by a combination of heuristics and exhaustive search:
  - A Perl script generates many possible kernels (including SIMD-optimized kernels).
  - An auto-tuning benchmark examines the kernels and reports back with the best one for the current architecture/dataset/compiler/...
- Performance depends on the optimizations generated; heuristics are often desirable when the search space isn't tractable.
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
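The benchmark-and-select loop above can be sketched as follows. This is a minimal illustration in Python, not the actual tuner: the real system generates compiled C kernels with a Perl script, and `variants`/`autotune` here are hypothetical stand-ins.

```python
import time

def autotune(variants, args, trials=3):
    """Exhaustive search: benchmark every generated kernel variant on the
    current machine/dataset and report back the fastest one.
    (Sketch only -- the real tuner times compiled, generated C kernels.)"""
    best_name, best_time = None, float("inf")
    for name, kernel in variants.items():
        elapsed = float("inf")
        for _ in range(trials):              # best-of-N run to damp timing noise
            t0 = time.perf_counter()
            kernel(*args)
            elapsed = min(elapsed, time.perf_counter() - t0)
        if elapsed < best_time:
            best_name, best_time = name, elapsed
    return best_name, best_time
```

When the variant space is intractable, heuristics would prune `variants` before this loop ever runs, which is exactly the heuristics-plus-search combination the slide describes.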
Sparse Matrix-Vector Multiplication (SpMV)

Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Sparse Matrix Vector Multiplication
- Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing/operating on the nonzeros — but this requires significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors.
- Challenges:
  - Difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD).
  - Irregular memory access to the source vector.
  - Difficult to load balance.
  - Very low computational intensity (often >6 bytes/flop), so likely memory bound.
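The ">6 bytes/flop" figure follows directly from the CSR storage format. A lower-bound estimate, assuming 8-byte double values and 4-byte integer indices (the standard layout, though not spelled out on the slide):

```python
def csr_spmv_bytes_per_flop(nnz, nrows, ncols):
    """Lower-bound memory traffic per flop for CSR SpMV.
    Each stored nonzero costs one multiply-add (2 flops) but must stream
    in 12 bytes of matrix data (8-byte value + 4-byte column index), so
    intensity can never beat 6 bytes/flop; row pointers and the vectors
    push real codes above that."""
    flops = 2 * nnz
    traffic = 12 * nnz            # nonzero values + column indices
    traffic += 4 * (nrows + 1)    # row pointers
    traffic += 8 * ncols          # read source vector x at least once
    traffic += 16 * nrows         # read + write destination vector y
    return traffic / flops
```

For any matrix this ratio exceeds 6 bytes/flop, while the machines here sustain well under 6 bytes of DRAM traffic per peak flop — hence SpMV is memory bound.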
Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache.
- Rank ranges from 2K to 1M.
- Subdivided into 4 categories:
  - Dense: a 2K x 2K dense matrix stored in sparse format.
  - Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship.
  - Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase.
  - Extreme aspect ratio (linear programming): LP.
Naïve Serial Implementation
- Vanilla C implementation; matrix stored in CSR (compressed sparse row).
- Explored compiler options, but only the best is presented here.
- An x86 core delivers >10x the performance of a Niagara2 thread.
(charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPE))
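The "vanilla" kernel is just the classic two-level CSR loop. A Python transliteration of that C loop (illustrative only; the benchmarked code is C):

```python
def spmv_csr(rowptr, colidx, vals, x):
    """Naive CSR SpMV (y = A*x): row r's nonzeros sit in
    vals[rowptr[r]:rowptr[r+1]] with their column indices in colidx.
    Note the irregular, data-dependent reads of x -- the access pattern
    that makes SpMV hard for caches and prefetchers."""
    nrows = len(rowptr) - 1
    y = [0.0] * nrows
    for r in range(nrows):
        acc = 0.0
        for k in range(rowptr[r], rowptr[r + 1]):
            acc += vals[k] * x[colidx[k]]
        y[r] = acc
    return y
```

The short, variable-length inner loop is also why ILP and SIMD are hard to exploit here: its trip count is a runtime quantity (the row's nonzero count).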
Naïve Parallel Implementation
- SPMD style: partition by rows, load balance by nonzeros.
- Niagara2 ~ 2.5x an x86 machine.
- Scaling: Clovertown, 8x cores = 1.9x performance; Opteron, 4x cores = 1.5x performance; Niagara2, 64x threads = 41x performance; Cell PPEs, 4x threads = 3.4x performance.
- Fraction of peak: Clovertown 1.4% of peak flops, 29% of bandwidth; Opteron 4% of flops, 20% of bandwidth; Niagara2 25% of flops, 39% of bandwidth; Cell PPEs 2.7% of flops, 4% of bandwidth.
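"Partition by rows, load balance by nonzeros" means each thread gets a contiguous row range holding roughly nnz/nthreads nonzeros, found by searching the CSR row-pointer array. A sketch of that decomposition (the real code is pthreads/C; this shows only the partitioning logic):

```python
import bisect

def partition_rows(rowptr, nthreads):
    """SPMD decomposition: give thread t the contiguous row range
    bounds[t]:bounds[t+1], chosen so each range holds about
    nnz/nthreads nonzeros.  rowptr is the (nondecreasing) CSR
    row-pointer array, so a binary search finds each cut point."""
    nnz, nrows = rowptr[-1], len(rowptr) - 1
    bounds = [0]
    for t in range(1, nthreads):
        target = nnz * t // nthreads
        bounds.append(bisect.bisect_left(rowptr, target))
    bounds.append(nrows)
    return bounds
```

Because rows are kept contiguous, each thread writes a private slice of y and no synchronization is needed inside the kernel.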
Auto-tuned Performance (+NUMA & SW Prefetching)
- Use first touch or libnuma to exploit NUMA; also includes process affinity.
- Tag prefetches with temporal locality.
- Auto-tune: search for the optimal prefetch distances.
Auto-tuned Performance (+Matrix Compression)
- If memory bound, the only hope is minimizing memory traffic.
- Heuristically compress the parallelized matrix to minimize it.
- Options: register blocking, index size, format, etc.
- Implemented with SSE.
- The benefit of prefetching is hidden by the requirement of register blocking.
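Register blocking is the key compression lever: nonzeros are stored as dense r x c tiles so one column index covers r*c values, shrinking index metadata and letting the inner loop be fully unrolled. A block-CSR sketch of the idea (the tuned kernels are generated, SSE-vectorized C; block shape r x c is what the auto-tuner searches over):

```python
def spmv_bcsr(browptr, bcolidx, bvals, x, r, c):
    """Register-blocked SpMV: block row br's tiles sit in
    bvals[browptr[br]:browptr[br+1]], each tile holding r*c values
    (row-major) under a single block-column index.  Explicit zeros
    inside partially-full tiles are the price of the compression."""
    nbrows = len(browptr) - 1
    y = [0.0] * (nbrows * r)
    for br in range(nbrows):
        for k in range(browptr[br], browptr[br + 1]):
            base = bcolidx[k] * c            # first source-vector column
            tile = bvals[k]                  # dense r x c tile
            for i in range(r):               # fixed trip counts: unrollable
                acc = 0.0
                for j in range(c):
                    acc += tile[i * c + j] * x[base + j]
                y[br * r + i] += acc
    return y
```

The fixed r and c trip counts are what make SIMD and unrolling pay off here, in contrast to the variable-length inner loop of plain CSR.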
Auto-tuned Performance (+Cache/TLB Blocking)
- Reorganize the matrix to maximize locality of source-vector accesses.
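One way to realize that reorganization, sketched here under the assumption of simple vertical panels (the tuned code also blocks for the TLB and picks the block size by search): split the matrix into column panels so each panel's SpMV touches only a cache-sized window of x.

```python
def cache_block_csr(rowptr, colidx, vals, ncols, bc):
    """Split a CSR matrix into vertical panels of at most `bc` columns,
    so one panel's SpMV reads at most `bc` source-vector entries.
    Returns a (rowptr, colidx, vals) triple per panel; the full product
    y = A*x is the sum of the per-panel SpMVs.  (Illustrative sketch.)"""
    npanels = (ncols + bc - 1) // bc
    panels = [([0], [], []) for _ in range(npanels)]
    for r in range(len(rowptr) - 1):
        for k in range(rowptr[r], rowptr[r + 1]):
            p = colidx[k] // bc              # panel owning this column
            panels[p][1].append(colidx[k])
            panels[p][2].append(vals[k])
        for prp, pci, _ in panels:           # close row r in every panel
            prp.append(len(pci))
    return panels
```

The reorganization is a one-time preprocessing cost, amortized over the many SpMVs a solver typically performs on the same matrix.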
Auto-tuned Performance (+DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs; gave the Opteron as many DIMMs as the Clovertown.
- Firmware update for Niagara2.
- Array padding to avoid inter-thread conflict misses.
- PPEs use ~1/3 of the Cell chip area.
- Fraction of peak: Clovertown 4% of peak flops, 52% of bandwidth; Opteron 20% of flops, 65% of bandwidth; Niagara2 54% of flops, 57% of bandwidth; Cell PPEs 10% of flops, 10% of bandwidth.
Auto-tuned Performance (+Cell/SPE version)
- Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks.
- Only the SpMV-proper routine changed.
- About 12x faster (median) than using the PPEs alone.
- Fraction of peak: Clovertown 4% of peak flops, 52% of bandwidth; Opteron 20% of flops, 65% of bandwidth; Niagara2 54% of flops, 57% of bandwidth; Cell SPEs 40% of flops, 92% of bandwidth.
Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)
- Model faster cores by commenting out the inner kernel calls, but still performing all DMAs.
- Enabled 1x1 BCOO.
- ~16% improvement.
Speedup from Auto-tuning: median (max)
(chart: per-platform median and maximum speedups from auto-tuning: 1.3x (2.9x), 26x (34x), 3.9x (4.4x), 1.6x (2.7x))
Summary
Aggregate Performance (fully optimized)
- Cell consistently delivers the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance.
- The dual-core Opteron delivers far better performance (bandwidth) than the Clovertown; the Clovertown has far too little effective FSB bandwidth.
- Huron has far more bandwidth than it can exploit (too much latency, too few cores).
Parallel Efficiency (average performance per thread, fully optimized)
- Aggregate Mflop/s / #cores.
- Niagara2 & Cell showed very good multicore scaling.
- Clovertown showed very poor multicore scaling on both applications.
- For SpMV, the Opteron and Clovertown showed good multisocket scaling.
Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power under load.
- Power efficiency = sustained performance / sustained power.
- All cache-based machines delivered similar power efficiency.
- FBDIMMs (~12 W each) drive up sustained power: 8 DIMMs on the Clovertown system (~330 W total) and 16 DIMMs on the Niagara2 machine (~450 W total).
Productivity
- Niagara2 required significantly less work to deliver good performance.
- The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune).
Summary
- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance.
- Niagara2 delivered both very good performance and productivity.
- Cell delivered very good performance and efficiency (processor and power).
- Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation.
- Architectural transparency is invaluable in optimizing code.
Acknowledgements
- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 donations
- Forschungszentrum Jülich: Cell blade cluster access
Questions?
switch to pOSKI