Presentation on theme: "Auto-tuning Sparse Matrix Kernels" — Presentation transcript:

Slide 1: Auto-tuning Sparse Matrix Kernels
Sam Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2)
(1) University of California, Berkeley; (2) Lawrence Berkeley National Laboratory; (3) Georgia Institute of Technology
samw@cs.berkeley.edu

Slide 2: Motivation
- Multicore is the de facto solution for improving peak performance for the next decade.
- How do we ensure this applies to sustained performance as well?
- Processor architectures are extremely diverse, and compilers can rarely fully exploit them.
- This requires a HW/SW solution that guarantees performance without completely sacrificing productivity.

Slide 3: Overview
- Examine the Sparse Matrix-Vector Multiplication (SpMV) kernel.
- Present and analyze two threaded and auto-tuned implementations.
- Benchmark performance across 4 diverse multicore architectures:
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade
- We show that:
  - Auto-tuning can significantly improve performance.
  - Cell consistently delivers good performance and efficiency.
  - Niagara2 delivers good performance and productivity.

Slide 4: Multicore SMPs Used (section divider)

Slide 5: Multicore SMP Systems

Slide 6: Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy

Slide 7: Multicore SMP Systems (memory hierarchy)
- Conventional cache-based memory hierarchy
- Disjoint local-store memory hierarchy

Slide 8: Multicore SMP Systems (memory hierarchy)
- Cache + Pthreads implementations
- Local store + libspe implementations

Slide 9: Multicore SMP Systems (peak flops)
- Intel Clovertown: 75 Gflop/s
- AMD Opteron: 17 Gflop/s
- Sun Niagara2 (Huron): 11 Gflop/s
- IBM Cell Blade: PPEs: 13 Gflop/s, SPEs: 29 Gflop/s

Slide 10: Multicore SMP Systems (peak DRAM bandwidth)
- Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
- AMD Opteron: 21 GB/s
- Sun Niagara2 (Huron): 42 GB/s (read), 21 GB/s (write)
- IBM Cell Blade: 51 GB/s

Slide 11: Multicore SMP Systems
- Uniform memory access vs. non-uniform memory access

Slide 12: Arithmetic Intensity
- Arithmetic intensity ~ total flops / total DRAM bytes.
- Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality).
- But there are many important and interesting kernels that don't.
[Figure: arithmetic-intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods]

Slide 13: Auto-tuning (section divider)

Slide 14: Auto-tuning
- Hand-optimizing each architecture/dataset combination is not feasible.
- Goal: a productive solution for performance portability.
- Our auto-tuning approach finds a good performance solution through a combination of heuristics and exhaustive search (see the search-loop sketch below):
  - A Perl script generates many possible kernels (including SIMD-optimized kernels).
  - An auto-tuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/...
  - Performance depends on the optimizations generated.
  - Heuristics are often desirable when the search space isn't tractable.
- Proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
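As a rough illustration of the exhaustive-search half of this approach, the sketch below times every generated kernel variant on the current matrix and keeps the fastest. The names (spmv_kernel_t, pick_best_kernel) and the fixed trial count are hypothetical stand-ins for the Perl-generated variants and the actual benchmark harness.

```c
/* Hypothetical auto-tuning search loop (illustrative only): benchmark every
 * generated kernel variant on the current matrix and keep the fastest. */
#include <stdio.h>
#include <time.h>

typedef void (*spmv_kernel_t)(const void *A, const double *x, double *y);

static double time_kernel(spmv_kernel_t kern, const void *A,
                          const double *x, double *y, int trials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < trials; i++)
        kern(A, x, y);                      /* run the candidate kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) +
            1e-9 * (t1.tv_nsec - t0.tv_nsec)) / trials;
}

int pick_best_kernel(spmv_kernel_t *candidates, int n_candidates,
                     const void *A, const double *x, double *y)
{
    int best = 0;
    double best_time = 1e300;
    for (int i = 0; i < n_candidates; i++) {
        double t = time_kernel(candidates[i], A, x, y, 10);
        if (t < best_time) { best_time = t; best = i; }
    }
    printf("best variant: %d (%.3g s per SpMV)\n", best, best_time);
    return best;                            /* index of the fastest variant */
}
```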

Slide 15: Sparse Matrix-Vector Multiplication (SpMV) (section divider)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Slide 16: Sparse Matrix-Vector Multiplication
- Sparse matrix:
  - Most entries are 0.0.
  - There is a performance advantage in only storing/operating on the nonzeros.
  - Requires significant metadata.
- Evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors.
- Challenges:
  - Difficult to exploit ILP (bad for superscalar) and DLP (bad for SIMD).
  - Irregular memory access to the source vector.
  - Difficult to load balance.
  - Very low computational intensity (often more than 6 bytes of DRAM traffic per flop), so SpMV is likely memory bound (see the rough estimate below).
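A back-of-the-envelope model of where the 6 bytes/flop figure comes from for plain CSR, assuming 8-byte values and 4-byte column indices and ignoring vector and row-pointer traffic (which only push the ratio higher). This is an illustrative sketch, not the authors' exact accounting.

```c
/* Rough compulsory-traffic model for CSR SpMV (illustrative sketch).
 * Each nonzero costs one multiply and one add (2 flops) and moves at least
 * one 8-byte value and one 4-byte column index from DRAM; source/destination
 * vectors and row pointers only add to the total. */
double csr_bytes_per_flop(void)
{
    const double bytes_per_nnz = 8.0 + 4.0; /* matrix value + column index    */
    const double flops_per_nnz = 2.0;       /* one multiply-add per nonzero   */
    return bytes_per_nnz / flops_per_nnz;   /* = 6.0 bytes/flop, a lower bound */
}
```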

Slide 17: Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices; none should fit in cache.
- Subdivided them into 4 categories; rank ranges from 2K to 1M.
- Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP.
- Categories: 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming).

Slide 18: Naïve Serial Implementation
- Vanilla C implementation (see the sketch below).
- Matrix stored in CSR (compressed sparse row) format.
- Explored compiler options, but only the best is presented here.
- An x86 core delivers more than 10x the performance of a Niagara2 thread.
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPE)]
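A minimal sketch of what such a vanilla serial CSR SpMV kernel looks like; the variable names are hypothetical, not the authors' exact code.

```c
/* Minimal sketch of a vanilla serial CSR SpMV kernel computing y = A*x. */
#include <stddef.h>

void spmv_csr(size_t n_rows,
              const size_t *row_ptr,   /* n_rows+1 offsets into col_idx/vals */
              const int    *col_idx,   /* column index of each nonzero       */
              const double *vals,      /* value of each nonzero              */
              const double *x,         /* dense source vector                */
              double       *y)         /* dense destination vector           */
{
    for (size_t r = 0; r < n_rows; r++) {
        double sum = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[col_idx[k]];  /* irregular access to x */
        y[r] = sum;
    }
}
```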

Slide 19: Naïve Parallel Implementation
- SPMD style: partition by rows, load balance by nonzeros (see the partitioning sketch below).
- Niagara2 is roughly 2.5x faster than the x86 machines.
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); series: Naïve, Naïve Pthreads]
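A sketch of the row partitioning just described: each thread receives a contiguous range of rows holding roughly nnz/n_threads nonzeros. This is a hypothetical helper built on the CSR row_ptr array from the earlier sketch; each thread would then run the serial kernel over its own row range.

```c
/* Sketch of SPMD row partitioning balanced by nonzeros: thread t owns rows
 * [row_start[t], row_start[t+1]) containing roughly nnz/n_threads nonzeros. */
#include <stddef.h>

void partition_rows_by_nnz(size_t n_rows, const size_t *row_ptr,
                           int n_threads, size_t *row_start /* n_threads+1 */)
{
    size_t nnz = row_ptr[n_rows];
    size_t r = 0;
    row_start[0] = 0;
    for (int t = 1; t < n_threads; t++) {
        size_t target = (nnz * (size_t)t) / (size_t)n_threads;
        while (r < n_rows && row_ptr[r] < target)   /* advance to the boundary */
            r++;
        row_start[t] = r;
    }
    row_start[n_threads] = n_rows;
}
```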

Slide 20: Naïve Parallel Implementation (continued)
- Same setup as the previous slide, annotated with parallel scaling:
  - Intel Clovertown: 8x cores = 1.9x performance
  - AMD Opteron: 4x cores = 1.5x performance
  - Sun Niagara2: 64x threads = 41x performance
  - IBM Cell Blade (PPEs): 4x threads = 3.4x performance

Slide 21: Naïve Parallel Implementation (continued)
- Same setup, annotated with fraction of peak flops and bandwidth achieved:
  - Intel Clovertown: 1.4% of peak flops, 29% of bandwidth
  - AMD Opteron: 4% of peak flops, 20% of bandwidth
  - Sun Niagara2: 25% of peak flops, 39% of bandwidth
  - IBM Cell Blade (PPEs): 2.7% of peak flops, 4% of bandwidth

Slide 22: Auto-tuned Performance (+NUMA & SW Prefetching)
- Use first touch or libnuma to exploit NUMA; also includes process affinity.
- Tag prefetches with temporal locality.
- Auto-tune: search for the optimal prefetch distances (see the sketch below).
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching]
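A sketch of the software-prefetching idea applied to the CSR inner loop. PF_DIST is a hypothetical stand-in for the auto-tuned prefetch distance, and the GCC builtin is used here in place of the actual SSE/x86 prefetch intrinsics; the third argument of __builtin_prefetch is the temporal-locality hint (0 = streaming data).

```c
/* CSR inner loop with software prefetching of the nonzero value and index
 * streams; the auto-tuner would search over PF_DIST. */
#include <stddef.h>

#define PF_DIST 64   /* hypothetical tuned prefetch distance, in nonzeros */

void spmv_csr_prefetch(size_t n_rows, const size_t *row_ptr,
                       const int *col_idx, const double *vals,
                       const double *x, double *y)
{
    for (size_t r = 0; r < n_rows; r++) {
        double sum = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            __builtin_prefetch(&vals[k + PF_DIST], 0, 0);     /* value stream */
            __builtin_prefetch(&col_idx[k + PF_DIST], 0, 0);  /* index stream */
            sum += vals[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
}
```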

Slide 23: Auto-tuned Performance (+Matrix Compression)
- If memory bound, the only hope is minimizing memory traffic.
- Heuristically compress the parallelized matrix to minimize it.
- Implemented with SSE.
- The benefit of prefetching is hidden by the requirement of register blocking.
- Options: register blocking, index size, format, etc. (see the sketch below).
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression]
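A sketch of one compression option from the list above: 2x1 register blocking (two rows per stored block) combined with 16-bit column indices. The real tuner searches over block sizes, index widths, and formats and uses SSE; this scalar layout and its field names are hypothetical. Blocking stores one column index per 2x1 block instead of one per nonzero, and narrower local indices shrink the metadata further, at the cost of explicit zero fill inside blocks.

```c
/* 2x1 register-blocked SpMV with compressed 16-bit column indices.
 * Each stored block holds two values (two consecutive rows, one column). */
#include <stddef.h>
#include <stdint.h>

void spmv_bcsr_2x1(size_t n_block_rows,        /* = number of rows / 2        */
                   const size_t   *brow_ptr,   /* offsets into block arrays   */
                   const uint16_t *bcol_idx,   /* compressed column indices   */
                   const double   *bvals,      /* 2 values stored per block   */
                   const double *x, double *y)
{
    for (size_t br = 0; br < n_block_rows; br++) {
        double sum0 = 0.0, sum1 = 0.0;
        for (size_t k = brow_ptr[br]; k < brow_ptr[br + 1]; k++) {
            double xv = x[bcol_idx[k]];        /* one x load serves both rows */
            sum0 += bvals[2 * k]     * xv;
            sum1 += bvals[2 * k + 1] * xv;
        }
        y[2 * br]     = sum0;
        y[2 * br + 1] = sum1;
    }
}
```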

Slide 24: Auto-tuned Performance (+Cache/TLB Blocking)
- Reorganize the matrix to maximize locality of source-vector accesses (see the sketch below).
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking]
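A sketch of the cache/TLB blocking idea: pre-split the matrix into column panels so the slice of the source vector each panel touches stays resident in cache and in the TLB. The panel structure and its field names are hypothetical, and y must be zeroed before the call because panels accumulate partial sums into it.

```c
/* Cache-blocked SpMV over column panels; each panel's nonzeros reference
 * only a small, contiguous slice of x. */
#include <stddef.h>

typedef struct {
    size_t n_rows;            /* rows spanned by this panel                 */
    const size_t *row_ptr;    /* CSR of just this panel's nonzeros          */
    const int    *col_idx;    /* column indices local to the panel          */
    const double *vals;
    size_t col_offset;        /* first source-vector column of the panel    */
} csr_panel_t;

void spmv_cache_blocked(const csr_panel_t *panels, int n_panels,
                        const double *x, double *y)   /* y zeroed by caller */
{
    for (int p = 0; p < n_panels; p++) {
        const csr_panel_t *A  = &panels[p];
        const double      *xp = x + A->col_offset;   /* cache-resident slice */
        for (size_t r = 0; r < A->n_rows; r++)
            for (size_t k = A->row_ptr[r]; k < A->row_ptr[r + 1]; k++)
                y[r] += A->vals[k] * xp[A->col_idx[k]];
    }
}
```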

Slide 25: Auto-tuned Performance (+DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs.
- Gave the Opteron as many DIMMs as Clovertown.
- Firmware update for Niagara2.
- Array padding to avoid inter-thread conflict misses.
- The PPEs use ~1/3 of the Cell chip area.
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron), FW fix, array padding (N2), etc.]

Slide 26: Auto-tuned Performance (+DIMMs, Firmware, Padding, continued)
- Same optimizations as the previous slide, annotated with fraction of peak flops and bandwidth achieved:
  - Intel Clovertown: 4% of peak flops, 52% of bandwidth
  - AMD Opteron: 20% of peak flops, 65% of bandwidth
  - Sun Niagara2: 54% of peak flops, 57% of bandwidth
  - IBM Cell Blade (PPEs): 10% of peak flops, 10% of bandwidth

Slide 27: Auto-tuned Performance (+Cell/SPE Version)
- Wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc.
- Only 2x1 and larger BCOO blocks.
- Only the SpMV-proper routine changed.
- About 12x faster (median) than using the PPEs alone.
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs); series: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching, +Compression, +Cache/TLB Blocking, +More DIMMs (Opteron), FW fix, array padding (N2), etc.]

Slide 28: Auto-tuned Performance (+Cell/SPE Version, continued)
- Same as the previous slide, annotated with fraction of peak flops and bandwidth achieved:
  - Intel Clovertown: 4% of peak flops, 52% of bandwidth
  - AMD Opteron: 20% of peak flops, 65% of bandwidth
  - Sun Niagara2: 54% of peak flops, 57% of bandwidth
  - IBM Cell Blade (SPEs): 40% of peak flops, 92% of bandwidth

Slide 29: Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?)
- Model faster cores by commenting out the inner kernel calls, but still performing all DMAs.
- Enabled 1x1 BCOO.
- ~16% improvement.
[Per-platform performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (SPEs); series as before, plus +better Cell implementation]

Slide 30: Speedup from Auto-tuning, Median & (Max)
- Median (max) speedup from auto-tuning: 1.3x (2.9x), 1.6x (2.7x), and 3.9x (4.4x) on the three cache-based machines, and 26x (34x) on the Cell blade (SPE implementation).

Slide 31: Summary (section divider)

Slide 32: Aggregate Performance (fully optimized)
- Cell consistently delivers the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance.
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown; Clovertown has far too little effective FSB bandwidth.
- Huron has far more bandwidth than it can exploit (too much latency, too few cores).

Slide 33: Parallel Efficiency (average performance per thread, fully optimized)
- Aggregate Mflop/s / #cores.
- Niagara2 and Cell showed very good multicore scaling.
- Clovertown showed very poor multicore scaling on both applications.
- For SpMV, Opteron and Clovertown showed good multisocket scaling.

Slide 34: Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power under load.
- Power efficiency is calculated as sustained performance / sustained power.
- All cache-based machines delivered similar power efficiency.
- FBDIMMs sustain ~12W each: 8 DIMMs on Clovertown (total sustained system power of ~330W), 16 DIMMs on the Niagara2 machine (total of ~450W).

Slide 35: Productivity
- Niagara2 required significantly less work to deliver good performance.
- The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune).

Slide 36: Summary
- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance.
- Niagara2 delivered both very good performance and productivity.
- Cell delivered very good performance and efficiency (processor and power).
- Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as discussed at SC07).
- Architectural transparency is invaluable in optimizing code.

Slide 37: Acknowledgements
- UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
- Sun Microsystems: Niagara2 donations
- Forschungszentrum Jülich: Cell blade cluster access

Slide 38: Questions?

Slide 39: switch to pOSKI

