Auto-tuning Sparse Matrix Kernels

Presentation transcript:

Slide 1: Auto-tuning Sparse Matrix Kernels. Sam Williams (1,2), Richard Vuduc (3), Leonid Oliker (1,2), John Shalf (2), Katherine Yelick (1,2), James Demmel (1,2). Affiliations: (1) University of California, Berkeley; (2) Lawrence Berkeley National Laboratory; (3) Georgia Institute of Technology.

Slide 2: Motivation. Multicore is the de facto solution for improving peak performance for the next decade. How do we ensure this applies to sustained performance as well? Processor architectures are extremely diverse, and compilers can rarely fully exploit them. We therefore require a HW/SW solution that guarantees performance without completely sacrificing productivity.

Slide 3: Overview. We examine the Sparse Matrix-Vector Multiplication (SpMV) kernel, present and analyze two threaded and auto-tuned implementations, and benchmark performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2 (Huron), and the IBM QS20 Cell Blade. We show that auto-tuning can significantly improve performance, that Cell consistently delivers good performance and efficiency, and that Niagara2 delivers good performance and productivity.

Slide 4: Multicore SMPs Used.

Slide 5: Multicore SMP Systems.

Slide 6: Multicore SMP Systems (memory hierarchy). Conventional cache-based memory hierarchy.

Slide 7: Multicore SMP Systems (memory hierarchy). Conventional cache-based memory hierarchy vs. the disjoint local-store memory hierarchy of the Cell SPEs.

Slide 8: Multicore SMP Systems (memory hierarchy). Cache + Pthreads implementations on the cache-based machines; local store + libspe implementations on the Cell SPEs.

Slide 9: Multicore SMP Systems (peak flops). Clovertown: 75 Gflop/s; Opteron: 17 Gflop/s; Cell PPEs: 13 Gflop/s, SPEs: 29 Gflop/s; Niagara2: 11 Gflop/s.

Slide 10: Multicore SMP Systems (peak DRAM bandwidth). Clovertown: 21 GB/s (read), 10 GB/s (write); Opteron: 21 GB/s; Cell blade: 51 GB/s; Niagara2: 42 GB/s (read), 21 GB/s (write).

Slide 11: Multicore SMP Systems. Uniform memory access (Clovertown, Niagara2) vs. non-uniform memory access (dual-socket Opteron, dual-Cell QS20 blade).

Slide 12: Arithmetic Intensity. Arithmetic intensity ~ total flops / total DRAM bytes. Some HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality), but there are many important and interesting kernels that don't. [Figure: arithmetic intensity spectrum. O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods.]

Slide 13: Auto-tuning.

Slide 14: Auto-tuning. Hand-optimizing each architecture/dataset combination is not feasible; the goal is a productive solution for performance portability. Our auto-tuning approach finds a good solution through a combination of heuristics and exhaustive search: a Perl script generates many possible kernels (including SIMD-optimized kernels), and an auto-tuning benchmark examines the kernels and reports back with the best one for the current architecture/dataset/compiler/etc. Performance depends on the optimizations generated, and heuristics are often desirable when the search space isn't tractable. The approach has proven value in dense linear algebra (ATLAS), spectral methods (FFTW, SPIRAL), and sparse methods (OSKI).
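To make the search half of this recipe concrete, here is a minimal sketch of a tuning driver in C, not the actual Perl/benchmark harness from the talk: it times each generated kernel variant on the target matrix and keeps the fastest. The csr_matrix type, the candidate-kernel function pointers, and the trial count are hypothetical stand-ins.

```c
#define _POSIX_C_SOURCE 199309L
#include <time.h>

/* Hypothetical handles: an opaque CSR matrix and a generated kernel variant. */
typedef struct csr_matrix csr_matrix;
typedef void (*spmv_kernel)(const csr_matrix *A, const double *x, double *y);

static double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* Time every candidate kernel on the target matrix; return the fastest one. */
int autotune_spmv(spmv_kernel *candidates, int ncand,
                  const csr_matrix *A, const double *x, double *y, int trials)
{
    int best = 0;
    double best_time = 1e30;
    for (int k = 0; k < ncand; k++) {
        double t0 = wall_seconds();
        for (int t = 0; t < trials; t++)
            candidates[k](A, x, y);                 /* run the generated variant */
        double dt = (wall_seconds() - t0) / trials;
        if (dt < best_time) { best_time = dt; best = k; }
    }
    return best;  /* best variant for this architecture/dataset/compiler */
}
```

In practice the candidate list is pruned by heuristics first (for example, only block sizes compatible with the nonzero pattern), so the exhaustive timing loop stays tractable.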

Slide 15: Sparse Matrix-Vector Multiplication (SpMV). Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.

Slide 16: Sparse Matrix-Vector Multiplication. Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing and operating on the nonzeros, but this requires significant metadata. We evaluate y = Ax, where A is a sparse matrix and x and y are dense vectors. Challenges: difficult to exploit ILP (bad for superscalar); difficult to exploit DLP (bad for SIMD); irregular memory access to the source vector; difficult to load balance; very low computational intensity (often >6 bytes per flop), so SpMV is likely memory bound.
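A rough back-of-the-envelope check of that >6 bytes/flop figure, assuming double-precision values, 32-bit column indices, CSR storage, and counting only compulsory matrix traffic:

$$\frac{\text{bytes}}{\text{flop}} \;\approx\; \frac{8\ \text{(value)} + 4\ \text{(column index)}}{2\ \text{(multiply + add)}} \;=\; 6\ \frac{\text{bytes}}{\text{flop}},$$

and traffic for the row pointers and the source/destination vectors only pushes this higher, which is why the later optimizations focus on shrinking matrix metadata and reusing x.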

Slide 17: Dataset (Matrices). We pruned the original SPARSITY suite down to 14 matrices (none should fit in cache) and subdivided them into 4 categories; rank ranges from 2K to 1M. Matrices: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP. Categories: a 2K x 2K dense matrix stored in sparse format; well-structured matrices (sorted by nonzeros/row); a poorly structured hodgepodge; extreme aspect ratio (linear programming).

Slide 18: Naïve Serial Implementation. Vanilla C implementation with the matrix stored in CSR (compressed sparse row). We explored compiler options, but only the best is presented here. An x86 core delivers >10x the performance of a Niagara2 thread. [Figure: naïve serial SpMV performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPE).]
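For reference, the textbook CSR kernel looks roughly like the following sketch; it is illustrative, not the exact code benchmarked in the talk.

```c
/* y = A*x with A stored in CSR: rowptr has n+1 entries; colidx/values hold
 * the nonzeros row by row. */
void spmv_csr(int n, const int *rowptr, const int *colidx,
              const double *values, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += values[k] * x[colidx[k]];   /* irregular access to x */
        y[i] = sum;
    }
}
```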

Slide 19: Naïve Parallel Implementation. SPMD style: partition by rows and load balance by nonzeros. Niagara2 achieves roughly 2.5x the performance of the x86 machines. [Figure: Naïve and Naïve Pthreads performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPEs).]
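A minimal sketch of the partitioning step described here, using the CSR row pointer as a prefix sum of nonzeros to give each thread a contiguous block of rows with roughly equal nonzero counts; the array names and thread handling are assumptions, not the talk's code.

```c
/* Assign thread t the contiguous row range [row_start[t], row_end[t]) so that
 * each range holds roughly nnz/nthreads nonzeros. */
void partition_by_nnz(int n, const int *rowptr, int nthreads,
                      int *row_start, int *row_end)
{
    long long nnz = rowptr[n];
    int row = 0;
    for (int t = 0; t < nthreads; t++) {
        long long target = (nnz * (long long)(t + 1)) / nthreads;
        row_start[t] = row;
        while (row < n && rowptr[row + 1] <= target)
            row++;
        row_end[t] = row;
    }
    row_end[nthreads - 1] = n;  /* ensure the last thread covers any tail rows */
}
```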

Slide 20: Naïve Parallel Implementation (continued). Same figure, with scaling callouts: 8x cores = 1.9x performance (Clovertown); 4x cores = 1.5x performance (Opteron); 64x threads = 41x performance (Niagara2); 4x threads = 3.4x performance (Cell PPEs).

Slide 21: Naïve Parallel Implementation (continued). Efficiency callouts: Clovertown 1.4% of peak flops, 29% of bandwidth; Opteron 4% of peak flops, 20% of bandwidth; Niagara2 25% of peak flops, 39% of bandwidth; Cell PPEs 2.7% of peak flops, 4% of bandwidth.

Slide 22: Auto-tuned Performance (+NUMA & SW Prefetching). Use first touch or libnuma to exploit NUMA; this also includes process affinity. Tag prefetches with temporal locality, and auto-tune: search for the optimal prefetch distances. [Figure: performance on Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), and the IBM Cell Blade (PPEs); bars: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching.]
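A hedged sketch of these two optimizations in C, assuming GCC-style builtins: each thread first-touches (initializes) its own partition so the pages land on its local NUMA node, and the inner loop issues software prefetches a tunable distance ahead. PF_DIST and the prefetch locality hint are exactly the kind of parameters the auto-tuner searches over.

```c
#define PF_DIST 64  /* tunable: how many nonzeros ahead to prefetch */

/* First-touch initialization: run by each thread over its own partition
 * (before the matrix is filled) so the OS places those pages on the
 * thread's local NUMA node. */
void first_touch_partition(double *values, int *colidx, int k_begin, int k_end)
{
    for (int k = k_begin; k < k_end; k++) {
        values[k] = 0.0;
        colidx[k] = 0;
    }
}

/* CSR SpMV over one thread's rows, prefetching the nonzero stream PF_DIST
 * elements ahead (prefetching past the end is harmless: it is only a hint). */
void spmv_csr_prefetch(int row_begin, int row_end, const int *rowptr,
                       const int *colidx, const double *values,
                       const double *x, double *y)
{
    for (int i = row_begin; i < row_end; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++) {
            __builtin_prefetch(&values[k + PF_DIST], 0, 0);  /* 3rd arg = locality hint */
            __builtin_prefetch(&colidx[k + PF_DIST], 0, 0);
            sum += values[k] * x[colidx[k]];
        }
        y[i] = sum;
    }
}
```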

Slide 23: Auto-tuned Performance (+Matrix Compression). If memory bound, the only hope is minimizing memory traffic, so we heuristically compress the parallelized matrix to minimize it (implemented with SSE). The benefit of prefetching is hidden by the requirement of register blocking. Options: register blocking, index size, format, etc. [Figure: performance on the four platforms; bars now add +Compression.]
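As an illustration of what "register blocking + smaller index size" can mean in code, here is a hypothetical 2x1 block-CSR kernel with 16-bit column indices (the real code generator emits many such variants, including SSE-intrinsic versions). Note that 16-bit indices only work when each stored block of the matrix spans fewer than 65,536 columns, which is one reason compression pairs naturally with cache blocking.

```c
#include <stdint.h>

/* y = A*x with A in 2x1 BCSR: each block holds two vertically adjacent values
 * that share a single 16-bit column index, halving index traffic.  brows is
 * the number of block rows (n/2); browptr has brows+1 entries. */
void spmv_bcsr_2x1(int brows, const int *browptr, const uint16_t *bcolidx,
                   const double *bvalues, const double *x, double *y)
{
    for (int ib = 0; ib < brows; ib++) {
        double y0 = 0.0, y1 = 0.0;
        for (int k = browptr[ib]; k < browptr[ib + 1]; k++) {
            double xj = x[bcolidx[k]];       /* one index load per two values */
            y0 += bvalues[2 * k + 0] * xj;
            y1 += bvalues[2 * k + 1] * xj;
        }
        y[2 * ib + 0] = y0;
        y[2 * ib + 1] = y1;
    }
}
```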

Slide 24: Auto-tuned Performance (+Cache/TLB Blocking). Reorganize the matrix to maximize locality of source-vector accesses. [Figure: performance on the four platforms; bars now add +Cache/TLB Blocking.]
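One hedged way to picture cache/TLB blocking: pre-split the matrix into column panels, each stored as its own CSR submatrix, so that the window of the source vector touched by a panel stays resident in the cache and TLB. The csr_panel struct below is an illustrative data structure, not the talk's actual one.

```c
/* One column panel: a CSR submatrix that references only columns
 * [col_begin, col_begin + panel_width), so the x window it touches is small. */
typedef struct {
    int col_begin;         /* first source-vector column covered by this panel */
    const int *rowptr;     /* n+1 entries over this panel's nonzeros */
    const int *colidx;     /* column indices relative to col_begin */
    const double *values;
} csr_panel;

/* y = A*x with A pre-split into npanels column panels; each panel reuses a
 * cache/TLB-resident window of x before moving on. */
void spmv_cache_blocked(int n, int npanels, const csr_panel *panels,
                        const double *x, double *y)
{
    for (int i = 0; i < n; i++)
        y[i] = 0.0;
    for (int p = 0; p < npanels; p++) {
        const csr_panel *P = &panels[p];
        const double *xw = x + P->col_begin;     /* the panel's window of x */
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = P->rowptr[i]; k < P->rowptr[i + 1]; k++)
                sum += P->values[k] * xw[P->colidx[k]];
            y[i] += sum;
        }
    }
}
```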

Slide 25: Auto-tuned Performance (+DIMMs, Firmware, Padding). The Clovertown was already fully populated with DIMMs, so we gave the Opteron as many DIMMs as the Clovertown. Firmware update for Niagara2. Array padding to avoid inter-thread conflict misses. Note that the PPEs use only ~1/3 of the Cell chip area. [Figure: performance on the four platforms; bars now add +More DIMMs (Opteron), +FW fix, array padding (N2), etc.]

Slide 26: Auto-tuned Performance (+DIMMs, Firmware, Padding), continued. Efficiency callouts: Clovertown 4% of peak flops, 52% of bandwidth; Opteron 20% of peak flops, 65% of bandwidth; Niagara2 54% of peak flops, 57% of bandwidth; Cell PPEs 10% of peak flops, 10% of bandwidth.

Slide 27: Auto-tuned Performance (+Cell/SPE version). We wrote a double-precision Cell/SPE version: DMA, local-store blocked, NUMA aware, etc., using only 2x1 and larger BCOO blocks. Only the SpMV-proper routine changed. It is about 12x faster (median) than using the PPEs alone. [Figure: performance on the four platforms, with the Cell bars now measured on the SPEs rather than the PPEs.]
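For intuition about the BCOO (blocked coordinate) format mentioned here, a plain-C 2x1 BCOO kernel might look like the sketch below; on Cell the same blocks are streamed into the SPE local stores with DMA rather than read through a cache, but the arithmetic per block is the same. The 16-bit coordinates are an assumption for compactness.

```c
#include <stdint.h>

/* 2x1 BCOO: every block carries its own (block-row, column) coordinates,
 * which makes it easy to split the nonzero stream into fixed-size chunks
 * for transfer into an SPE local store. */
typedef struct {
    uint16_t brow;     /* block row: covers matrix rows 2*brow and 2*brow+1 */
    uint16_t col;      /* column index of the block */
    double   v0, v1;   /* the two vertically adjacent nonzero values */
} bcoo_2x1;

/* y += A*x over one chunk of blocks (y is assumed pre-zeroed). */
void spmv_bcoo_2x1(int nblocks, const bcoo_2x1 *blocks,
                   const double *x, double *y)
{
    for (int k = 0; k < nblocks; k++) {
        double xj = x[blocks[k].col];
        y[2 * blocks[k].brow + 0] += blocks[k].v0 * xj;
        y[2 * blocks[k].brow + 1] += blocks[k].v1 * xj;
    }
}
```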

Slide 28: Auto-tuned Performance (+Cell/SPE version), continued. Efficiency callouts: Clovertown 4% of peak flops, 52% of bandwidth; Opteron 20% of peak flops, 65% of bandwidth; Niagara2 54% of peak flops, 57% of bandwidth; Cell SPEs 40% of peak flops, 92% of bandwidth.

Slide 29: Auto-tuned Performance (how much did double precision and 2x1 blocking hurt?). We model faster cores by commenting out the inner kernel calls while still performing all DMAs, and we enable 1x1 BCOO; this yields a ~16% improvement. [Figure: adds a "+better Cell implementation" bar for the Cell SPEs.]

Slide 30: Speedup from Auto-tuning, median and (max): 1.3x (2.9x), 26x (34x), 3.9x (4.4x), and 1.6x (2.7x) across the four platforms (Cell using the SPE version).

Slide 31: Summary.

Slide 32: Aggregate Performance (fully optimized). Cell consistently delivers the best full-system performance, although Niagara2 delivers nearly comparable per-socket performance. The dual-core Opteron delivers far better performance (bandwidth) than the Clovertown, which has far too little effective FSB bandwidth. Huron has far more bandwidth than it can exploit (too much latency, too few cores).

Slide 33: Parallel Efficiency (average performance per thread, fully optimized). Computed as aggregate Mflop/s / #cores. Niagara2 and Cell showed very good multicore scaling, while Clovertown showed very poor multicore scaling on both applications. For SpMV, the Opteron and Clovertown showed good multisocket scaling.

Slide 34: Power Efficiency (fully optimized). We used a digital power meter to measure sustained power under load and calculate power efficiency as sustained performance / sustained power. All cache-based machines delivered similar power efficiency. FBDIMMs (~12 W each) add significantly to sustained power: 8 DIMMs on the Clovertown (total of ~330 W) and 16 DIMMs on the N2 machine (total of ~450 W).

Slide 35: Productivity. Niagara2 required significantly less work to deliver good performance. The cache-based machines required search for some optimizations, while Cell relied solely on heuristics (less time to tune).

Slide 36: Summary. Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance. Niagara2 delivered both very good performance and productivity. Cell delivered very good performance and efficiency (processor and power). Our multicore-specific auto-tuned SpMV implementation significantly outperformed existing parallelization strategies, including an auto-tuned MPI implementation (as …). Architectural transparency is invaluable in optimizing code.

Slide 37: Acknowledgements. UC Berkeley: RADLab cluster (Opterons) and PSI cluster (Clovertowns). Sun Microsystems: Niagara2 donations. Forschungszentrum Jülich: Cell blade cluster access.

Slide 38: Questions?

Slide 39: switch to pOSKI.