Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)


Tuning Sparse Matrix Vector Multiplication for multi-core SMPs (details in paper at SC07)
Samuel Williams 1,2, Richard Vuduc 3, Leonid Oliker 1,2, John Shalf 2, Katherine Yelick 1,2, James Demmel 1,2
1 University of California Berkeley  2 Lawrence Berkeley National Laboratory  3 Georgia Institute of Technology

Overview
 Multicore is the de facto performance solution for the next decade
 Examined the Sparse Matrix Vector Multiplication (SpMV) kernel: an important HPC kernel, memory intensive, and challenging for multicore
 Present two auto-tuned threaded implementations: a Pthread, cache-based implementation and a Cell local-store-based implementation
 Benchmarked performance across 4 diverse multicore architectures: Intel Xeon (Clovertown), AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine
 Compared against the leading MPI implementation (PETSc) using an auto-tuned serial kernel (OSKI)
 Show that Cell delivers good performance and efficiency, while Niagara2 delivers good performance and productivity

Sparse Matrix Vector Multiplication
 Sparse matrix: most entries are 0.0, so there is a performance advantage in only storing/operating on the nonzeros, but this requires significant meta data
 Evaluate y = Ax, where A is a sparse matrix and x & y are dense vectors (a minimal CSR kernel sketch follows below)
 Challenges: difficult to exploit ILP (bad for superscalar), difficult to exploit DLP (bad for SIMD), irregular memory access to the source vector, difficult to load balance, very low computational intensity (often >6 bytes per flop)
[Figure: y = A·x, with the sparse matrix A, source vector x, and destination vector y]
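
The following is a minimal sketch of a reference CSR SpMV kernel in C, included only to make the data structure and access pattern concrete; the array names (row_ptr, col_idx, vals) are illustrative and this is not the authors' tuned code.

/* y = A*x with A stored in CSR (compressed sparse row). */
#include <stddef.h>

void spmv_csr(size_t nrows,
              const size_t *row_ptr,   /* nrows+1 offsets into col_idx/vals */
              const int    *col_idx,   /* column index of each nonzero      */
              const double *vals,      /* value of each nonzero             */
              const double *x,         /* source vector                     */
              double       *y)         /* destination vector                */
{
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++)
            sum += vals[k] * x[col_idx[k]];   /* irregular access to x */
        y[r] = sum;
    }
}

Note the two flops per 12+ bytes of matrix data (an 8-byte value plus a 4-byte index per nonzero), which is where the >6 bytes/flop figure above comes from.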

Test Suite
 Dataset (Matrices)
 Multicore SMPs

Matrices Used
 Pruned the original SPARSITY suite down to 14; none should fit in cache
 Subdivided them into 4 categories
 Rank ranges from 2K to 1M
[Matrix gallery: Dense, Protein, FEM/Spheres, FEM/Cantilever, Wind Tunnel, FEM/Harbor, QCD, FEM/Ship, Economics, Epidemiology, FEM/Accelerator, Circuit, webbase, LP. Categories: a 2K x 2K dense matrix stored in sparse format; well structured (sorted by nonzeros/row); poorly structured hodgepodge; extreme aspect ratio (linear programming)]

Multicore SMP Systems
[Block diagrams of the test platforms. Intel Clovertown: two sockets of four Core2 cores with paired 4MB shared L2 caches, dual front-side buses into a chipset with 4x64b fully buffered DRAM channels (10.6 GB/s write, ~21 GB/s read). AMD Opteron: two sockets of two cores, each core with a 1MB victim cache, per-socket memory controllers to DDR2 DRAM (10.6 GB/s each) linked by HyperTransport (4 GB/s each direction). Sun Niagara2: eight multithreaded UltraSparc cores (8K D$ and FPU each) behind a crossbar switch to a 4MB, 16-way shared L2 (179 GB/s fill, 90 GB/s writethru) and 4x128b FBDIMM memory controllers (42.7 GB/s read, 21.3 GB/s write).]

Multicore SMP Systems (memory hierarchy)
[Same platform diagrams, annotated to contrast a conventional cache-based memory hierarchy with a disjoint local-store memory hierarchy]

Multicore SMP Systems (2 implementations)
[Same platform diagrams, annotated: the cache-based machines run the Cache+Pthreads SpMV implementation; the local-store machine runs the Local Store SpMV implementation]

Multicore SMP Systems (cache)
[Same platform diagrams, annotated with on-chip capacities: 16MB (vectors fit), 4MB (local store), 4MB]

Multicore SMP Systems (peak flops)
[Same platform diagrams, annotated with peak double-precision flop rates: 75 Gflop/s (TLP + ILP + SIMD), 17 Gflop/s (TLP + ILP), 29 Gflop/s (TLP + ILP + SIMD), 11 Gflop/s (TLP), i.e. Clovertown, Opteron, Cell, and Niagara2 respectively]

Multicore SMP Systems (peak read bandwidth)
[Same platform diagrams, annotated with peak read bandwidth: 21 GB/s (Clovertown), 51 GB/s (Cell), 43 GB/s (Niagara2)]

Multicore SMP Systems (NUMA)
[Same platform diagrams, annotated to contrast uniform memory access machines with non-uniform memory access (NUMA) machines]

Naïve Implementation for Cache-based Machines
 Performance across the suite
 Also included a median performance number

Vanilla C Performance
[Performance charts: Intel Clovertown, AMD Opteron, Sun Niagara2]
 Vanilla C implementation
 Matrix stored in CSR (compressed sparse row)
 Explored compiler options; only the best is presented here
 One x86 core delivers >10x the performance of one Niagara2 thread

Pthread Implementation for Cache-Based Machines
 SPMD, strong scaling (a row-partitioned sketch follows below)
 Optimized for multicore/threading
 A variety of shared memory programming models would be acceptable (not just Pthreads)
 More colors = more optimizations = more work (in the stacked bar charts that follow)
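
A hedged sketch of the cache-based, row-partitioned Pthreads approach, assuming each thread owns a contiguous block of CSR rows and the kernel is bracketed by barriers as described on the Parallelization backup slide; the names (spmv_worker, thread_arg_t) are illustrative, not the authors' code.

#include <pthread.h>
#include <stddef.h>

typedef struct {
    size_t row_begin, row_end;       /* this thread's contiguous row range */
    const size_t *row_ptr;           /* CSR arrays shared by all threads   */
    const int    *col_idx;
    const double *vals;
    const double *x;
    double       *y;
    pthread_barrier_t *barrier;
} thread_arg_t;

static void *spmv_worker(void *p)
{
    thread_arg_t *a = (thread_arg_t *)p;
    pthread_barrier_wait(a->barrier);              /* start together  */
    for (size_t r = a->row_begin; r < a->row_end; r++) {
        double sum = 0.0;
        for (size_t k = a->row_ptr[r]; k < a->row_ptr[r + 1]; k++)
            sum += a->vals[k] * a->x[a->col_idx[k]];
        a->y[r] = sum;
    }
    pthread_barrier_wait(a->barrier);              /* finish together */
    return NULL;
}

A driver would create one spmv_worker per hardware thread, with row ranges balanced by nonzero count (see the partitioning sketch in the backup slides).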

Naïve Parallel Performance
[Performance charts: Naïve Pthreads vs. Naïve Single Thread; Intel Clovertown, AMD Opteron, Sun Niagara2]
 Simple row parallelization
 Pthreads (didn't have to be)
 1P Niagara2 outperformed the 2P x86 machines

Naïve Parallel Performance (scaling)
[Same charts, annotated: 8x cores = 1.9x performance (Clovertown); 4x cores = 1.5x performance (Opteron); 64x threads = 41x performance (Niagara2)]
 Simple row parallelization
 Pthreads (didn't have to be)
 1P Niagara2 outperformed the 2P x86 machines

Naïve Parallel Performance (efficiency)
[Same charts, annotated in panel order: Clovertown 1.4% of peak flops, 29% of bandwidth; Opteron 4% of peak flops, 20% of bandwidth; Niagara2 25% of peak flops, 39% of bandwidth]
 Simple row parallelization
 Pthreads (didn't have to be)
 1P Niagara2 outperformed the 2P x86 machines

Case for Auto-tuning
 How do we deliver good performance across all these architectures and across all matrices without exhaustively hand-optimizing every combination?
 Auto-tuning (in general): write a Perl script that generates all possible optimizations, then heuristically or exhaustively search the optimizations (a search-harness sketch follows below)
 Existing SpMV solution: OSKI (developed at UCB)
 This work: tuning geared toward multi-core/-threading; generates SSE/SIMD intrinsics, prefetching, loop transformations, alternate data structures, etc.; a "prototype for parallel OSKI"
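
The generator itself was a Perl script; the search side can be pictured with the following hedged C sketch, which simply times every generated kernel variant on the target matrix and keeps the fastest. The variant table, names, and void-argument kernel signature are illustrative simplifications, not the authors' tuner.

#include <stdio.h>
#include <time.h>

typedef void (*spmv_fn)(void);            /* one generated kernel variant */

typedef struct { const char *name; spmv_fn fn; } variant_t;

static double time_variant(spmv_fn fn, int trials)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < trials; i++) fn();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
}

/* Benchmark every generated variant on the target matrix and keep the best. */
static const variant_t *search(const variant_t *variants, int n, int trials)
{
    const variant_t *best = &variants[0];
    double best_t = time_variant(variants[0].fn, trials);
    for (int i = 1; i < n; i++) {
        double t = time_variant(variants[i].fn, trials);
        if (t < best_t) { best_t = t; best = &variants[i]; }
    }
    printf("best variant: %s (%.3f s for %d trials)\n", best->name, best_t, trials);
    return best;
}

A heuristic search would simply prune the variant list before this loop rather than timing every combination.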

Performance (+NUMA & SW Prefetching)
[Performance charts: +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread; Intel Clovertown, AMD Opteron, Sun Niagara2]
 NUMA-aware allocation
 Tag prefetches with the appropriate temporal locality
 Tune for the optimal prefetch distance (see the sketch below)
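
A hedged sketch of what a prefetch-augmented CSR inner loop might look like, assuming GCC's __builtin_prefetch: the third argument expresses temporal locality (0 = streaming/non-temporal through 3 = keep in all caches), and the distance PF_DIST is the parameter the tuner sweeps. Names and the chosen hints are illustrative, not the authors' generated code.

#include <stddef.h>

#ifndef PF_DIST
#define PF_DIST 64            /* nonzeros ahead; tuned per platform */
#endif

void spmv_csr_prefetch(size_t nrows, const size_t *row_ptr,
                       const int *col_idx, const double *vals,
                       const double *x, double *y)
{
    for (size_t r = 0; r < nrows; r++) {
        double sum = 0.0;
        size_t end = row_ptr[r + 1];
        for (size_t k = row_ptr[r]; k < end; k++) {
            /* matrix values/indices are streamed once: prefetch with low locality */
            __builtin_prefetch(&vals[k + PF_DIST], 0, 0);
            __builtin_prefetch(&col_idx[k + PF_DIST], 0, 0);
            sum += vals[k] * x[col_idx[k]];
        }
        y[r] = sum;
    }
}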

Matrix Compression
 For memory-bound kernels, minimizing memory traffic should maximize performance
 Compress the meta data
 Exploit structure to eliminate meta data
 Heuristic: select the compression that minimizes the matrix size: power-of-2 register blocking, CSR/COO format, 16b/32b indices, etc. (a register-blocked sketch follows below)
 Side effect: the matrix may be compressed to the point where it fits entirely in cache
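
As one concrete instance of the compression choices above, here is a hedged sketch of a 2x2 register-blocked (BCSR-style) layout with 16-bit block-column indices; the real heuristic picks whichever of many such layouts minimizes total matrix size. Struct and field names are illustrative.

#include <stddef.h>
#include <stdint.h>

typedef struct {
    size_t    n_block_rows;      /* number of 2-row blocks                     */
    size_t   *blk_row_ptr;       /* n_block_rows+1 offsets into blocks         */
    uint16_t *blk_col;           /* 16-bit block-column index (<64K blk cols)  */
    double   *blk_vals;          /* 4 values per 2x2 block, row-major          */
} bcsr_2x2_t;

void spmv_bcsr_2x2(const bcsr_2x2_t *A, const double *x, double *y)
{
    for (size_t br = 0; br < A->n_block_rows; br++) {
        double y0 = 0.0, y1 = 0.0;                 /* two destination rows */
        for (size_t k = A->blk_row_ptr[br]; k < A->blk_row_ptr[br + 1]; k++) {
            const double *v = &A->blk_vals[4 * k];
            size_t c = 2 * (size_t)A->blk_col[k];
            y0 += v[0] * x[c] + v[1] * x[c + 1];
            y1 += v[2] * x[c] + v[3] * x[c + 1];
        }
        y[2 * br]     = y0;
        y[2 * br + 1] = y1;
    }
}

Each block needs only one index for four values, so meta data shrinks by roughly 4x, at the cost of explicitly stored zeros wherever the block is not full.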

Performance (+matrix compression)
[Performance charts: +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread; Intel Clovertown, AMD Opteron, Sun Niagara2]
 Very significant on some matrices
 Missed the boat on others

Performance (+cache & TLB blocking)
[Performance charts: +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread; Intel Clovertown, AMD Opteron, Sun Niagara2]
 Cache blocking geared towards sparse accesses (details in the backup slides)

Performance (more DIMMs, firmware fix, …)
[Performance charts: +More DIMMs / rank configuration / etc., +Cache/TLB Blocking, +Matrix Compression, +Software Prefetching, +NUMA/Affinity, Naïve Pthreads, Naïve Single Thread; Intel Clovertown, AMD Opteron, Sun Niagara2]

Performance (more DIMMs, firmware fix, …)
[Same charts as the previous slide, annotated in panel order: Clovertown 4% of peak flops, 52% of bandwidth; Opteron 20% of peak flops, 66% of bandwidth; Niagara2 52% of peak flops, 54% of bandwidth]

Performance (more DIMMs, firmware fix, …)
[Same charts, annotated with the essential optimizations per platform: Clovertown, 3 (Parallelize, Prefetch, Compress); Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Niagara2, 2 (Parallelize, Compress)]

Cell Implementation
 Comments
 Performance

Cell Implementation
 No vanilla C implementation (aside from the PPE)
 In some cases, what were optional optimizations on cache-based machines are requirements for correctness on Cell
 Even SIMDized double precision is extremely weak; scalar double precision is unbearable
 Minimum register blocking is 2x1 (SIMDizable); this can increase memory traffic by 66%
 Optional cache blocking becomes required local-store blocking
 Spatial and temporal locality is captured by software when the matrix is optimized; in essence, the high bits of the column indices are grouped into DMA lists (a hedged DMA double-buffering sketch follows below)
 Branchless implementation
 Despite the following performance numbers, Cell was still handicapped by double precision
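
A hedged sketch of the local-store side, showing only a double-buffered DMA pattern (not the SIMDized kernel or the DMA-list index grouping): while the SPE computes on one local-store buffer, the next buffer is fetched with the Cell SDK's MFC intrinsics from spu_mfcio.h. Buffer size, names, and the compute() stub are assumptions for illustration, not the authors' kernel.

#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 2048                                  /* doubles per DMA (16 KB) */
static double buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(const double *vals, unsigned n);   /* placeholder kernel */

void stream_values(uint64_t ea_vals, unsigned nchunks)
{
    unsigned cur = 0;
    /* prime the pipeline: fetch chunk 0 into buffer 0 with tag 0 */
    mfc_get(buf[0], ea_vals, CHUNK * sizeof(double), 0, 0, 0);
    for (unsigned i = 0; i < nchunks; i++) {
        unsigned nxt = cur ^ 1;
        if (i + 1 < nchunks)                         /* start next DMA early   */
            mfc_get(buf[nxt],
                    ea_vals + (uint64_t)(i + 1) * CHUNK * sizeof(double),
                    CHUNK * sizeof(double), nxt, 0, 0);
        mfc_write_tag_mask(1u << cur);               /* wait for current chunk */
        mfc_read_tag_status_all();
        compute(buf[cur], CHUNK);                    /* work on resident data  */
        cur = nxt;
    }
}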

Performance (all)
[Performance charts for all four platforms: Intel Clovertown, AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine]

Hardware Efficiency
[Same charts, annotated: Clovertown 4% of peak flops, 52% of bandwidth; Opteron 20% of peak flops, 66% of bandwidth; Niagara2 52% of peak flops, 54% of bandwidth; Cell 39% of peak flops, 89% of bandwidth]

Productivity
[Same charts, annotated with the essential optimizations per platform: Clovertown, 3 (Parallelize, Prefetch, Compress); Opteron, 4 (Parallelize, NUMA, Prefetch, Compress); Niagara2, 2 (Parallelize, Compress); Cell, 5 (Parallelize, NUMA, DMA, Compress, Cache (LS) Block)]

Parallel Efficiency
[Parallel-efficiency charts: Intel Clovertown, AMD Opteron, Sun Niagara2, IBM Cell Broadband Engine]

Parallel Efficiency (breakdown)
[Same charts, broken down into multicore, multisocket, and MT (multithreading) contributions]

Multicore MPI Implementation
 This is the default approach to programming multicore

Multicore MPI Implementation
[Performance charts: MPI (autotuned), Pthreads (autotuned), Naïve Single Thread; Intel Clovertown, AMD Opteron]
 Used PETSc with shared-memory MPICH
 Used OSKI (UCB) to optimize each thread
 That is, a good auto-tuned shared-memory MPI implementation

Summary

Median Performance & Efficiency
 1P Niagara2 was consistently better than the 2P x86 machines (more potential bandwidth)
 Cell delivered by far the best performance (better utilization)
 Used a digital power meter to measure sustained system power
 FBDIMM is power hungry (12W/DIMM)
 Clovertown (330W)
 Niagara2 (350W)

Summary
 Paradoxically, the most complex/advanced architectures required the most tuning and delivered the lowest performance
 Niagara2 delivered both very good performance and productivity
 Cell delivered very good performance and efficiency: ~90% of memory bandwidth (most machines get ~50%), high power efficiency, easily understood performance
 Extra traffic = lower performance (future work can address this)
 Our multicore-specific auto-tuned implementation significantly outperformed an auto-tuned MPI implementation
 It exploited: matrix compression geared towards multicore rather than single core, NUMA, and prefetching

Acknowledgments
 UC Berkeley: RADLab cluster (Opterons), PSI cluster (Clovertowns)
 Sun Microsystems: Niagara2 access
 Forschungszentrum Jülich: Cell blade cluster access

Questions?

Backup Slides

Parallelization
 Matrix partitioned by rows and balanced by the number of nonzeros (a partitioning sketch follows below)
 SPMD-like approach
 A barrier() is called before and after the SpMV kernel
 Each sub-matrix stored separately in CSR
 Load balancing can be challenging
 # of threads explored in powers of 2 (in paper)
[Figure: A, x, and y partitioned into horizontal bands, one per thread]
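
A hedged sketch of nonzero-balanced row partitioning, assuming the CSR row_ptr array shown earlier: thread t receives a contiguous row range holding roughly nnz/nthreads nonzeros. Illustrative only, not the authors' partitioner.

#include <stddef.h>

/* row_ptr has nrows+1 entries; fills bounds[0..nthreads] with row offsets
 * such that thread t owns rows [bounds[t], bounds[t+1]). */
void partition_by_nnz(size_t nrows, const size_t *row_ptr,
                      int nthreads, size_t *bounds)
{
    size_t nnz = row_ptr[nrows];
    size_t r = 0;
    bounds[0] = 0;
    for (int t = 1; t < nthreads; t++) {
        size_t target = (nnz * (size_t)t) / (size_t)nthreads;
        while (r < nrows && row_ptr[r] < target)   /* advance to the row     */
            r++;                                   /* nearest the nnz target */
        bounds[t] = r;
    }
    bounds[nthreads] = nrows;
}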

Exploiting NUMA, Affinity
 Bandwidth on the Opteron (and Cell) can vary substantially based on the placement of data
 Bind each sub-matrix and the thread that processes it together (a libnuma/affinity sketch follows below)
 Explored libnuma, Linux, and Solaris routines
 Adjacent blocks bound to adjacent cores
[Figure: Opteron sockets with local DDR2 DRAM, contrasting a single thread, multiple threads on one memory controller, and multiple threads across both memory controllers]
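
A hedged sketch of one way to do this on Linux with libnuma plus thread affinity (the slide also mentions Solaris routines, not shown here). Node/core numbering and the bind_submatrix() helper are illustrative; link with -lnuma -lpthread.

#define _GNU_SOURCE
#include <numa.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Allocate a thread's sub-matrix on the NUMA node its core belongs to,
 * then pin the calling thread to that core so data and compute stay local. */
double *bind_submatrix(size_t bytes, int node, int core)
{
    double *m = NULL;
    if (numa_available() >= 0)
        m = numa_alloc_onnode(bytes, node);    /* memory on the local node */

    cpu_set_t cpus;
    CPU_ZERO(&cpus);
    CPU_SET(core, &cpus);
    pthread_setaffinity_np(pthread_self(), sizeof(cpus), &cpus);
    return m;                                   /* free with numa_free()   */
}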

Cache and TLB Blocking
 Accesses to the matrix and destination vector are streaming
 But access to the source vector can be random
 Reorganize the matrix (and thus the access pattern) to maximize reuse
 Applies equally to TLB blocking (caching PTEs)
 Heuristic: block the destination, then keep adding more columns as long as the number of source-vector cache lines (or pages) touched is less than the cache (or TLB) capacity; apply all previous optimizations individually to each cache block (a sketch of the line-counting step follows below)
 Search: neither, cache, cache & TLB
 Better locality at the expense of confusing the hardware prefetchers
[Figure: A, x, and y with the matrix tiled into cache blocks]
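
A hedged sketch of the measurement at the heart of the heuristic: counting how many distinct source-vector cache lines a candidate block of rows and columns would touch. The tuner would widen the column range while this count stays below the cache capacity (or, counting pages instead of lines, the TLB capacity). A 64-byte line and 8-byte doubles are assumed; this is illustrative, not the authors' code.

#include <stdlib.h>
#include <stddef.h>

#define LINE_DOUBLES 8      /* 64-byte cache line / 8-byte double */

size_t count_touched_lines(size_t row_begin, size_t row_end,
                           int col_begin, int col_end,
                           const size_t *row_ptr, const int *col_idx,
                           size_t ncols)
{
    size_t nlines = (ncols + LINE_DOUBLES - 1) / LINE_DOUBLES;
    unsigned char *touched = calloc(nlines, 1);   /* one flag per x line */
    size_t count = 0;
    for (size_t r = row_begin; r < row_end; r++)
        for (size_t k = row_ptr[r]; k < row_ptr[r + 1]; k++) {
            int c = col_idx[k];
            if (c < col_begin || c >= col_end) continue;   /* outside block */
            size_t line = (size_t)c / LINE_DOUBLES;
            if (!touched[line]) { touched[line] = 1; count++; }
        }
    free(touched);
    return count;            /* compare against cache lines (or TLB pages) */
}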

Banks, Ranks, and DIMMs
 In this SPMD approach, as the number of threads increases, so too does the number of concurrent streams to memory
 Most memory controllers have a finite capability to reorder requests (DMA can avoid or minimize this)
 Addressing/bank conflicts become increasingly likely
 Adding more DIMMs and configuring ranks can help
 The Clovertown system was already fully populated