1 Autotuning Sparse Matrix and Structured Grid Kernels
Samuel Williams 1,2, Richard Vuduc 3, Leonid Oliker 1,2, John Shalf 2, Katherine Yelick 1,2, James Demmel 1,2, Jonathan Carter 2, David Patterson 1,2
1 University of California, Berkeley   2 Lawrence Berkeley National Laboratory   3 Georgia Institute of Technology

2 Overview
- Multicore is the de facto performance solution for the next decade
- Examined the Sparse Matrix-Vector Multiplication (SpMV) kernel
  - Important, common, memory-intensive HPC kernel
  - Present 2 autotuned threaded implementations
  - Compare with the leading MPI implementation (PETSc) using an autotuned serial kernel (OSKI)
- Examined the Lattice-Boltzmann Magneto-hydrodynamics (LBMHD) application
  - Memory-intensive HPC application (structured grid)
  - Present 2 autotuned threaded implementations
- Benchmarked performance across 4 diverse multicore architectures
  - Intel Xeon (Clovertown)
  - AMD Opteron
  - Sun Niagara2 (Huron)
  - IBM QS20 Cell Blade
- We show
  - Cell consistently delivers good performance and efficiency
  - Niagara2 delivers good performance and productivity

3 Autotuning (section outline)
Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

4 Autotuning
- Hand-optimizing each architecture/dataset combination is not feasible
- Autotuning finds a good-performing solution by heuristics or exhaustive search
  - A Perl script generates many possible kernels (see the sketch below)
  - It also generates SSE-optimized kernels
  - An autotuning benchmark examines the kernels and reports back the best one for the current architecture/dataset/compiler/...
  - Performance depends on the optimizations generated
  - Heuristics are often desirable when the search space isn't tractable
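To make the exhaustive-search side concrete, the following is a minimal sketch of a benchmark driver that times every generated kernel variant and keeps the fastest. It is illustrative only: the names kernel_fn, time_kernel, and autotune are hypothetical, and the real harness also varies datasets and optimization parameters.

```c
#include <stdio.h>
#include <time.h>

typedef void (*kernel_fn)(void);          /* one generated kernel variant */

/* Wall-clock time for 'trials' invocations of one candidate kernel. */
static double time_kernel(kernel_fn k, int trials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < trials; i++) k();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

/* Exhaustive search: run every variant and report the fastest one for the
 * current architecture/dataset/compiler combination. */
int autotune(kernel_fn *variants, int n, int trials)
{
    int best = 0;
    double best_t = time_kernel(variants[0], trials);
    for (int i = 1; i < n; i++) {
        double t = time_kernel(variants[i], trials);
        if (t < best_t) { best_t = t; best = i; }
    }
    printf("best variant: %d (%.3f s for %d trials)\n", best, best_t, trials);
    return best;
}
```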

5 Multicore SMPs Used (section outline)
Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

6 Multicore SMP Systems
(Figure: block diagrams of the four evaluated platforms)

7 Multicore SMP Systems (memory hierarchy)
Conventional cache-based memory hierarchy

8 Multicore SMP Systems (memory hierarchy)
Conventional cache-based memory hierarchy vs. disjoint local-store memory hierarchy

9 Multicore SMP Systems (memory hierarchy)
Cache + Pthreads implementations vs. local store + libspe implementations

10 Multicore SMP Systems (peak flops)
- Intel Clovertown: 75 Gflop/s (TLP + ILP + SIMD)
- AMD Opteron: 17 Gflop/s (TLP + ILP)
- Sun Niagara2 (Huron): 11 Gflop/s (TLP)
- IBM Cell Blade: PPEs 13 Gflop/s (TLP + ILP), SPEs 29 Gflop/s (TLP + ILP + SIMD)

11 Multicore SMP Systems (peak DRAM bandwidth)
- Intel Clovertown: 21 GB/s (read), 10 GB/s (write)
- AMD Opteron: 21 GB/s
- Sun Niagara2 (Huron): 42 GB/s (read), 21 GB/s (write)
- IBM Cell Blade: 51 GB/s

12 Multicore SMP Systems
Uniform Memory Access (Clovertown, Niagara2) vs. Non-Uniform Memory Access (Opteron, Cell Blade)

13 HPC Kernels (section outline)
Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

14 Arithmetic Intensity
- Arithmetic intensity ~ total compulsory flops / total compulsory bytes
- Many HPC kernels have an arithmetic intensity that scales with problem size (increasing temporal locality)
- But there are many important and interesting kernels that don't
- Low arithmetic intensity kernels are likely to be memory bound
- High arithmetic intensity kernels are likely to be processor bound
- Ignores memory addressing complexity
(Figure: arithmetic-intensity spectrum — O(1): SpMV, BLAS1/2, stencils (PDEs), lattice methods; O(log N): FFTs; O(N): dense linear algebra (BLAS3), particle methods)

15 Arithmetic Intensity (continued)
Same spectrum as the previous slide, annotated with architectural affinity: the low-intensity (memory-bound) end is a good match for Cell and Niagara2; the high-intensity (compute-bound) end is a good match for Clovertown, eDP Cell, ...

16 Sparse Matrix-Vector Multiplication (SpMV)
Samuel Williams, Leonid Oliker, Richard Vuduc, John Shalf, Katherine Yelick, James Demmel, "Optimization of Sparse Matrix-Vector Multiplication on Emerging Multicore Platforms", Supercomputing (SC), 2007.
Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

17 Sparse Matrix-Vector Multiplication
- Sparse matrix
  - Most entries are 0.0
  - Performance advantage in only storing/operating on the nonzeros
  - Requires significant metadata
- Evaluate y = Ax
  - A is a sparse matrix; x and y are dense vectors
- Challenges
  - Difficult to exploit ILP (bad for superscalar)
  - Difficult to exploit DLP (bad for SIMD)
  - Irregular memory access to the source vector
  - Difficult to load balance
  - Very low computational intensity (often > 6 bytes/flop; see the estimate below), so likely memory bound
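A rough compulsory-traffic estimate behind the "> 6 bytes/flop" figure, assuming a plain CSR format with 8-byte values and 4-byte column indices and ignoring any reuse of the source vector:

```latex
\[
\frac{\text{bytes}}{\text{flop}}
\;\gtrsim\;
\frac{8\,\text{(value)} + 4\,\text{(column index)}}
     {2\,\text{(multiply and add per nonzero)}}
\;=\; 6\ \text{bytes/flop}
\]
```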

18 Dataset (Matrices)
- Pruned the original SPARSITY suite down to 14 matrices
  - None should fit in cache
  - Subdivided into 4 categories
  - Rank ranges from 2K to 1M
- Categories:
  - Dense matrix stored in sparse format: Dense (2K x 2K)
  - Well structured (sorted by nonzeros/row): Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship
  - Poorly structured hodgepodge: Economics, Epidemiology, FEM/Accelerator, Circuit, webbase
  - Extreme aspect ratio (linear programming): LP

19 Naïve Serial Implementation
- Vanilla C implementation
- Matrix stored in CSR (compressed sparse row) format
- Explored compiler options, but only the best is presented here
- An x86 core delivers > 10x the performance of a Niagara2 thread
(Performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPE))
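For reference, the naïve CSR kernel amounts to the following. This is a minimal sketch; the array names (rowptr, colidx, vals) are generic CSR conventions, not necessarily the identifiers in the authors' code.

```c
/* Naïve CSR SpMV: y = A*x in double precision. */
void spmv_csr(int nrows,
              const int    *rowptr,   /* nrows+1 row start offsets          */
              const int    *colidx,   /* column index of each nonzero       */
              const double *vals,     /* value of each nonzero              */
              const double *x,        /* dense source vector                */
              double       *y)        /* dense destination vector           */
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++)
            sum += vals[k] * x[colidx[k]];   /* irregular access to x       */
        y[r] = sum;                          /* 2 flops per stored nonzero  */
    }
}
```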

20 Naïve Parallel Implementation
- SPMD style
- Partition by rows
- Load balance by nonzeros (see the partitioning sketch below)
- Niagara2 is ~2.5x the x86 machines
(Performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); bars: Naïve Pthreads vs. Naïve serial)
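A minimal sketch of the row partitioning described above, giving each thread roughly the same number of nonzeros; illustrative only (partition_by_nnz is a hypothetical helper, and each thread would then run the CSR kernel on its own row range).

```c
/* Split CSR rows into 'nthreads' contiguous ranges with ~equal nonzeros. */
void partition_by_nnz(int nrows, const int *rowptr, int nthreads,
                      int *row_start, int *row_end)
{
    long long nnz = rowptr[nrows];
    int r = 0;
    for (int t = 0; t < nthreads; t++) {
        long long target = (nnz * (long long)(t + 1)) / nthreads;
        row_start[t] = r;
        while (r < nrows && rowptr[r + 1] <= target)
            r++;                    /* advance until cumulative nnz hits the goal */
        row_end[t] = r;             /* exclusive end row for thread t             */
    }
    row_end[nthreads - 1] = nrows;  /* last thread takes any remaining rows       */
}
```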

21 Naïve Parallel Implementation (scaling)
Same naïve parallel implementation as the previous slide, annotated with scaling relative to the naïve serial code:
- Clovertown: 8x cores = 1.9x performance
- Opteron: 4x cores = 1.5x performance
- Niagara2: 64x threads = 41x performance
- Cell (PPEs): 4x threads = 3.4x performance

22 Naïve Parallel Implementation (efficiency)
Same naïve parallel implementation, annotated with the fraction of peak flops and peak bandwidth achieved:
- Clovertown: 1.4% of peak flops, 29% of bandwidth
- Opteron: 4% of peak flops, 20% of bandwidth
- Niagara2: 25% of peak flops, 39% of bandwidth
- Cell (PPEs): 2.7% of peak flops, 4% of bandwidth

23 Autotuned Performance (+NUMA & SW Prefetching)
- Use first touch or libnuma to exploit NUMA
  - Also includes process affinity
- Tag prefetches with temporal locality hints
- Autotune: search for the optimal prefetch distances (see the sketch below)
(Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade (PPEs); stacked optimizations: Naïve, Naïve Pthreads, +NUMA/Affinity, +SW Prefetching)
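A minimal sketch of what a software-prefetched CSR inner loop might look like, with the prefetch distance dist left as the parameter the autotuner searches over; the non-temporal hint tags the streamed nonzeros as having no reuse. Illustrative only, not the generated code.

```c
#include <xmmintrin.h>   /* _mm_prefetch */

void spmv_csr_prefetch(int nrows, const int *rowptr, const int *colidx,
                       const double *vals, const double *x, double *y,
                       int dist /* autotuned prefetch distance, in nonzeros */)
{
    for (int r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (int k = rowptr[r]; k < rowptr[r + 1]; k++) {
            /* Prefetching a little past the end of the arrays is harmless:
             * prefetch instructions never fault. */
            _mm_prefetch((const char *)&vals[k + dist],   _MM_HINT_NTA);
            _mm_prefetch((const char *)&colidx[k + dist], _MM_HINT_NTA);
            sum += vals[k] * x[colidx[k]];
        }
        y[r] = sum;
    }
}
```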

24 Autotuned Performance (+Matrix Compression)
- If memory bound, the only hope is minimizing memory traffic
- Heuristically compress the parallelized matrix to minimize that traffic
- Implemented with SSE
- The benefit of prefetching is hidden by the requirement of register blocking
- Options: register blocking, index size, format, etc. (see the register-blocked sketch below)
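For illustration, a register-blocked 2x2 BCSR kernel with 16-bit column indices is one point in that compression search space. This is a minimal sketch, not the authors' generated code; the real search also covers other block sizes and formats.

```c
#include <stdint.h>

/* 2x2 register-blocked SpMV with compressed (16-bit) block column indices. */
void spmv_bcsr_2x2(int nbrows,               /* number of 2-row block rows      */
                   const int      *browptr,  /* nbrows+1 block-row offsets      */
                   const uint16_t *bcolidx,  /* block column index (compressed) */
                   const double   *bvals,    /* 4 values per 2x2 block          */
                   const double   *x, double *y)
{
    for (int br = 0; br < nbrows; br++) {
        double y0 = 0.0, y1 = 0.0;           /* accumulators held in registers  */
        for (int k = browptr[br]; k < browptr[br + 1]; k++) {
            const double *b = &bvals[4 * k];
            double x0 = x[2 * bcolidx[k]];
            double x1 = x[2 * bcolidx[k] + 1];
            y0 += b[0] * x0 + b[1] * x1;
            y1 += b[2] * x0 + b[3] * x1;
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}
```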

25 Autotuned Performance (+Cache/TLB Blocking)
- Reorganize the matrix to maximize locality of source-vector accesses

26 Autotuned Performance (+DIMMs, Firmware, Padding)
- Clovertown was already fully populated with DIMMs
- Gave the Opteron as many DIMMs as the Clovertown
- Firmware update for Niagara2
- Array padding to avoid inter-thread conflict misses
- The PPEs use only ~1/3 of the Cell chip area

27 Autotuned Performance (+DIMMs, Firmware, Padding — efficiency)
Same optimizations as the previous slide, annotated with the fraction of peak flops and peak bandwidth achieved:
- Clovertown: 4% of peak flops, 52% of bandwidth
- Opteron: 20% of peak flops, 65% of bandwidth
- Niagara2: 54% of peak flops, 57% of bandwidth
- Cell (PPEs): 10% of peak flops, 10% of bandwidth

28 Autotuned Performance (+Cell/SPE version)
- Wrote a double-precision Cell/SPE version
  - DMA, local-store blocked, NUMA aware, etc.
  - Only 2x1 and larger BCOO blocks
  - Only the SpMV-proper routine changed
- About 12x faster (median) than using the PPEs alone

29 Autotuned Performance (+Cell/SPE version — efficiency)
Same Cell/SPE version, annotated with the fraction of peak flops and peak bandwidth achieved:
- Clovertown: 4% of peak flops, 52% of bandwidth
- Opteron: 20% of peak flops, 65% of bandwidth
- Niagara2: 54% of peak flops, 57% of bandwidth
- Cell (SPEs): 40% of peak flops, 92% of bandwidth

30 Autotuned Performance (how much did double precision and 2x1 blocking hurt?)
- Model faster cores by commenting out the inner kernel calls, but still performing all DMAs
- Enabled 1x1 BCOO
- ~16% improvement

31 MPI vs. Threads
- On the x86 machines, the autotuned (OSKI) shared-memory MPICH implementation rarely scales beyond 2 threads
- Still debugging MPI issues on Niagara2, but so far it rarely scales beyond 8 threads
(Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron); bars: Autotuned Pthreads, Autotuned MPI, Naïve Serial)

32 Lattice-Boltzmann Magneto-Hydrodynamics (LBMHD)
- Preliminary results
Samuel Williams, Jonathan Carter, Leonid Oliker, John Shalf, Katherine Yelick, "Lattice Boltzmann Simulation Optimization on Leading Multicore Platforms", International Parallel & Distributed Processing Symposium (IPDPS), 2008 (to appear).
Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

33 Lattice Methods
- Structured grid code with a series of time steps
- Popular in CFD
- Allows for complex boundary conditions
- Higher-dimensional phase space
  - Simplified kinetic model that maintains the macroscopic quantities
  - Distribution functions (e.g. 27 velocities per point in space) are used to reconstruct macroscopic quantities
- Significant memory capacity requirements
(Figure: 3D lattice of velocity directions, with X, Y, Z axes)

34 LBMHD (general characteristics)
- Plasma turbulence simulation
- Two distributions:
  - Momentum distribution (27 components)
  - Magnetic distribution (15 vector components)
- Three macroscopic quantities:
  - Density
  - Momentum (vector)
  - Magnetic field (vector)
- Must read 73 doubles and update (write) 79 doubles per point in space
- Requires about 1300 floating-point operations per point in space
- Just over 1.0 flops/byte (ideal; worked out below)
- No temporal locality between points in space within one time step
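Spelling out the "just over 1.0 flops/byte" estimate from the figures above (compulsory traffic only, ignoring write-allocate and addressing overhead):

```latex
\[
\text{AI} \;\approx\; \frac{\text{flops per point}}{\text{bytes per point}}
         \;=\; \frac{1300}{(73 + 79) \times 8\ \text{bytes}}
         \;=\; \frac{1300}{1216}
         \;\approx\; 1.07\ \text{flops/byte}
\]
```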

35 LBMHD (implementation details)
- Data structure choices (see the sketch below):
  - Array of Structures: lacks spatial locality
  - Structure of Arrays: huge number of memory streams per thread, but vectorizes well
- Parallelization
  - The Fortran version used MPI to communicate between nodes — a bad match for multicore
  - This version uses pthreads within a multicore node, and MPI for inter-node communication
  - MPI is not used when autotuning
- Two problem sizes:
  - 64^3 (~330 MB)
  - 128^3 (~2.5 GB)
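A minimal sketch of the two layouts, using the component counts from the previous slide; the type names are hypothetical stand-ins, not the actual code.

```c
/* Array of Structures: all components of one lattice point are contiguous,
 * but sweeping the grid for a single component strides through memory,
 * giving poor spatial locality. */
typedef struct {
    double momentum[27];
    double magnetic[15][3];
    double density;
} point_aos;

/* Structure of Arrays: each component is its own contiguous stream, so a
 * sweep over the grid is unit-stride and SIMD-friendly, at the cost of a
 * huge number (~150) of simultaneous memory streams per thread. Each pointer
 * would be allocated as one array of npts doubles. */
typedef struct {
    double *momentum[27];
    double *magnetic[15][3];
    double *density;
} grid_soa;
```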

36 Pthread Implementation
- Not naïve
  - Fully unrolled loops
  - NUMA-aware
  - 1D parallelization
- Always used 8 threads per core on Niagara2
- The Cell version was not autotuned
(Charts: Intel Clovertown, AMD Opteron, Sun Niagara2 (Huron), IBM Cell Blade)

37 Pthread Implementation (efficiency)
Same pthread implementation, annotated with the fraction of peak flops and peak bandwidth achieved:
- Clovertown: 4.8% of peak flops, 16% of bandwidth
- Opteron: 14% of peak flops, 17% of bandwidth
- Niagara2: 54% of peak flops, 14% of bandwidth

38 Autotuned Performance (+Stencil-aware Padding)
- This lattice method is essentially 79 simultaneous 72-point stencils
- It can cause conflict misses even in highly associative L1 caches (not to mention the Opteron's 2-way L1)
- Solution: pad each component so that, when accessed with the corresponding stencil (spatial) offset, the components are uniformly distributed in the cache

39 Autotuned Performance (+Vectorization)
- Each update requires touching ~150 components, each likely to be on a different page
  - TLB misses can significantly impact performance
- Solution: vectorization
  - Fuse the spatial loops, strip-mine them into vectors of size VL, and interchange with the phase-dimension loops (see the sketch below)
  - Autotune: search for the optimal vector length
- Significant benefit on some architectures
- Becomes irrelevant when bandwidth dominates performance
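A minimal sketch of the loop transform described above. It is illustrative only: the per-component update is replaced by a placeholder scale, and the argument layout (one pointer per SoA component) is an assumption, not the actual LBMHD code.

```c
enum { NCOMP = 79 };   /* components updated per point (from the earlier slide) */

/* Strip-mined sweep: work on a strip of VL points for one component before
 * moving to the next component, so only a few pages are hot at a time (fewer
 * TLB misses) and the inner loop is unit-stride. */
void sweep_vectorized(int npts, int VL, double **dst, double **src)
{
    for (int base = 0; base < npts; base += VL) {
        int len = (npts - base < VL) ? (npts - base) : VL;
        for (int c = 0; c < NCOMP; c++)            /* phase-dimension loop      */
            for (int i = 0; i < len; i++)          /* unit-stride strip         */
                dst[c][base + i] = 1.1 * src[c][base + i];  /* placeholder op   */
    }
}
```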

40 Autotuned Performance (+Explicit Unrolling/Reordering)
- Give the compilers a helping hand with the complex loops
- Code generator: a Perl script generates all power-of-2 possibilities
- Autotune: search for the best unrolling and expression of data-level parallelism
- Is essential when using SIMD intrinsics

41 Autotuned Performance (+Software Prefetching)
- Expanded the code generator to insert software prefetches in case the compiler doesn't
- Autotune over:
  - no prefetch
  - prefetch 1 cache line ahead
  - prefetch 1 vector ahead
- Relatively little benefit for relatively little work

42 Autotuned Performance (+SIMDization, including non-temporal stores)
- The compilers (gcc & icc) failed at exploiting SIMD
- Expanded the code generator to use SIMD intrinsics
- Explicit unrolling/reordering was extremely valuable here
- Exploited movntpd (non-temporal stores) to minimize memory traffic — the only hope if memory bound (see the sketch below)
- Significant benefit for significant work
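As a minimal illustration of the non-temporal-store idea (not the generated LBMHD code), the SSE2 intrinsic behind movntpd writes results directly to memory, bypassing the cache and avoiding write-allocate traffic. The sketch assumes dst and src are 16-byte aligned and n is even.

```c
#include <emmintrin.h>   /* SSE2: _mm_load_pd, _mm_mul_pd, _mm_stream_pd */

void scale_stream(double *dst, const double *src, int n, double alpha)
{
    const __m128d va = _mm_set1_pd(alpha);
    for (int i = 0; i < n; i += 2) {
        __m128d v = _mm_mul_pd(_mm_load_pd(&src[i]), va);
        _mm_stream_pd(&dst[i], v);   /* movntpd: store without caching dst */
    }
    _mm_sfence();                    /* order the streaming stores         */
}
```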

43 Autotuned Performance (Cell/SPE version)
- First attempt at a Cell implementation
  - VL, unrolling, and reordering fixed
  - Exploits DMA and double buffering to load vectors
  - Went straight to SIMD intrinsics
- Despite the relative performance, Cell's double-precision implementation severely impairs performance
(* Cell results are for collision() only)

44 Autotuned Performance (Cell/SPE version — efficiency)
Same fully tuned LBMHD code, annotated with the fraction of peak flops and peak bandwidth achieved:
- Clovertown: 7.5% of peak flops, 17% of bandwidth
- Opteron: 42% of peak flops, 35% of bandwidth
- Niagara2: 59% of peak flops, 15% of bandwidth
- Cell (SPEs, collision() only): 57% of peak flops, 33% of bandwidth

45 Summary (section outline)
Autotuning | Multicore SMPs | HPC Kernels | SpMV | LBMHD | Summary

46 Aggregate Performance (fully optimized)
- Cell consistently delivers the best full-system performance
- Niagara2 delivers comparable per-socket performance
- The dual-core Opteron delivers far better performance (bandwidth) than Clovertown, but as the flop:byte ratio increases its performance advantage decreases
- Huron has far more bandwidth than it can exploit (too much latency, too few cores)
- Clovertown has far too little effective FSB bandwidth

47 Parallel Efficiency (average performance per thread, fully optimized)
- Aggregate Mflop/s / #cores
- Niagara2 & Cell show very good multicore scaling
- Clovertown showed very poor multicore scaling on both applications
- For SpMV, Opteron and Clovertown showed good multisocket scaling
- Clovertown runs into bandwidth limits far short of its theoretical peak, even for LBMHD
- Opteron lacks the bandwidth for SpMV, and the FP resources to use its bandwidth for LBMHD

48 Power Efficiency (fully optimized)
- Used a digital power meter to measure sustained power under load
- Calculate power efficiency as: sustained performance / sustained power
- All cache-based machines delivered similar power efficiency
- FBDIMMs (~12 W each) add significantly to sustained power
  - 8 DIMMs on Clovertown (total of ~330 W)
  - 16 DIMMs on the N2 machine (total of ~450 W)

49 Productivity
- Niagara2 required significantly less work to deliver good performance
- For LBMHD, Clovertown, Opteron, and Cell all required SIMD (hampers productivity) for best performance
- Virtually every optimization was required (sooner or later) for the Opteron and Cell
- Cache-based machines required search for some optimizations, while Cell always relied on heuristics

50 Summary
- Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
- Niagara2 delivered both very good performance and productivity
- Cell delivered very good performance and efficiency (processor and power)
- Our multicore-specific autotuned SpMV implementation significantly outperformed an autotuned MPI implementation
- Our multicore autotuned LBMHD implementation significantly outperformed the already optimized serial implementation
- Sustainable memory bandwidth is essential even on kernels with moderate computational intensity (flop:byte ratio)
- Architectural transparency is invaluable in optimizing code

51 Acknowledgements
- UC Berkeley
  - RADLab cluster (Opterons)
  - PSI cluster (Clovertowns)
- Sun Microsystems
  - Niagara2 access
- Forschungszentrum Jülich
  - Cell blade cluster access

52 Questions?

53 Backup Slides