1 Structured Grids and Sparse Matrix Vector Multiplication on the Cell Processor Sam Williams Lawrence Berkeley National Lab University of California Berkeley November 1, 2006

2 Outline Cell Architecture Programming Cell Benchmarks & Performance –Stencils on Structured Grids –Sparse Matrix-Vector Multiplication Summary

3 Cell Architecture

4 3.2GHz, 9 core SMP –One core is a conventional cache-based PPC –The other 8 are local-memory-based SIMD processors (SPEs) 25.6GB/s memory bandwidth (1.6GHz) to XDR DRAM 2-chip (16 SPE) SMP blades [Block diagram: two Cell chips, each with one PPU, eight SPUs, memory and I/O controllers, and its own 25.6GB/s XDR DRAM, connected through their I/O interfaces]

5 Memory Architectures Conventional: register file, cache, main memory; data is moved by a single load ("ld reg, addr"). Three-level (Cell SPE): register file, local memory, main memory; loads ("ld reg, Local_Addr") access only the local store, and explicit DMAs ("DMA LocalAddr, GlobalAddr") move data between main memory and the local store.

6 SPE notes SIMD Limited dual issue –1 floating-point/ALU instruction + 1 load/store/permute/etc. per cycle 4 single precision FMA datapaths, 6 cycle latency 1 double precision FMA datapath, 13 cycle latency Double precision instructions are not dual-issued, and stall all subsequent instruction issue by 7 cycles. 128b aligned loads/stores (local store to/from register file) –Must rotate to access unaligned data –Must permute to operate on scalar granularities

7 Cell Programming Modified SPMD (Single Program Multiple Data) –Dual Program Multiple Data (control + computation) –Data access similar to MPI, but data is shared like pthreads Power core is used to: –Load/initialize data structures –Spawn SPE threads –Parallelize data structures –Pass pointers to SPEs –Synchronize SPE threads –Communicate with other processors –Perform other I/O operations
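As a concrete illustration of this control pattern, here is a minimal, hypothetical SPE-side sketch assuming the standard spu_mfcio.h DMA intrinsics: the PPE passes each SPE the main-memory address of a small per-SPE control block (via argp), and the SPE DMAs that block into its local store before touching any data. The control_block_t layout and field names are illustrative assumptions, not the structures used in this work.

#include <spu_mfcio.h>

/* Hypothetical per-SPE control block; padded to 32 bytes (a legal DMA size). */
typedef struct {
  unsigned long long grid_ea;    /* effective address of this SPE's share of the grid */
  unsigned int       nx, ny, nz; /* dimensions of the assigned slab */
  unsigned int       pad[3];     /* pad so sizeof() is a multiple of 16 bytes */
} control_block_t;

static control_block_t cb __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp,
         unsigned long long envp)
{
  (void)speid; (void)envp;

  /* argp holds the main-memory address of this SPE's control block. */
  mfc_get(&cb, argp, sizeof(cb), 0 /*tag*/, 0, 0);
  mfc_write_tag_mask(1 << 0);
  mfc_read_tag_status_all();     /* block until the control block has arrived */

  /* ... DMA in data at cb.grid_ea, compute, DMA results back, signal the PPE ... */
  return 0;
}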

8 Processors Evaluated

                   Cell (SPE)   X1E (SSP)   Power5        Opteron       Itanium2
Architecture       SIMD         Vector      Super Scalar  Super Scalar  VLIW
Frequency          3.2 GHz      1.13 GHz    1.9 GHz       2.2 GHz       1.4 GHz
GFLOP/s (double)   1.83         4.5         7.6           4.4           5.6
Cores used         8            4 (MSP)     1             1             1

Aggregate:
L2 Cache           -            2MB         1.9MB         1MB           256KB
L3 Cache           -            -           36MB          -             3MB
DRAM Bandwidth     25.6 GB/s    34 GB/s     10+5 GB/s     6.4 GB/s      6.4 GB/s
GFLOP/s (double)   14.6         18          7.6           4.4           5.6

9 Stencil Operations on Structured Grids

10 Stencil Operations Simple Example - The Heat Equation –dT/dt = k∇²T –Parabolic PDE on a 3D discretized scalar domain Jacobi style (read from current grid, write to next grid) –8 FLOPs per point, typically double precision –Next[x,y,z] = Alpha*Current[x,y,z] + Beta*(Current[x-1,y,z] + Current[x+1,y,z] + Current[x,y-1,z] + Current[x,y+1,z] + Current[x,y,z-1] + Current[x,y,z+1]) –Doesn’t exploit the FMA pipeline well –Basically 6 streams presented to the memory subsystem Explicit ghost zones bound the grid [Figure: 7-point stencil, the center point [x,y,z] and its six neighbors in ±x, ±y, ±z]
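For reference, the update above corresponds to the following straightforward C kernel; the grid dimensions, indexing macro, and one-cell ghost zone handling are illustrative assumptions.

#include <stddef.h>

#define NX 256
#define NY 256
#define NZ 256
#define IDX(x,y,z) ((size_t)(z)*NY*NX + (size_t)(y)*NX + (size_t)(x))

/* One Jacobi-style sweep: read from current, write to next (8 flops per point). */
void heat_step(const double *current, double *next, double alpha, double beta)
{
  for (int z = 1; z < NZ - 1; z++)
    for (int y = 1; y < NY - 1; y++)
      for (int x = 1; x < NX - 1; x++)
        next[IDX(x,y,z)] =
            alpha * current[IDX(x,y,z)] +
            beta  * (current[IDX(x-1,y,z)] + current[IDX(x+1,y,z)] +
                     current[IDX(x,y-1,z)] + current[IDX(x,y+1,z)] +
                     current[IDX(x,y,z-1)] + current[IDX(x,y,z+1)]);
}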

11 Optimization - Planes Naïve approach (cacheless vector machine) is to load 5 streams and store one. This is 8 flops per 48 bytes –memory limits performance to 4.2 GFLOP/s A better approach is to make each DMA the size of a plane –cache the 3 most recent planes (z-1, z, z+1) –there are only two streams (one load, one store) –memory now limits performance to 12.8 GFLOP/s Still must compute on each plane after it is loaded –e.g. forall Current_local[x,y] update Next_local[x,y] –Note: computation can severely limit performance
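The two bounds quoted above follow directly from the 25.6 GB/s DRAM bandwidth and the bytes of traffic per point:
  naive:  (8 flops / (6 streams x 8 bytes)) x 25.6 GB/s ≈ 4.2 GFLOP/s
  planes: (8 flops / (2 streams x 8 bytes)) x 25.6 GB/s = 12.8 GFLOP/s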

12 Optimization - Double Buffering Add an input stream buffer and an output stream buffer (keep 6 planes in the local store) The two phases (transfer & compute) are now overlapped Thus it is possible to hide the shorter of the DMA transfer time and the computation time
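A hedged sketch of this pattern, assuming the spu_mfcio.h intrinsics and showing one input plus one output stream with two local-store buffers each; compute_plane is a placeholder for the stencil work (which keeps the three resident input planes from the previous slide), and the plane size is chosen to respect the 16 KB per-DMA limit (larger planes would need split transfers or DMA lists).

#include <spu_mfcio.h>
#include <stdint.h>

#define PLANE_BYTES (128u * 16u * sizeof(double))  /* 16 KB: one cache-blocked plane (illustrative) */

static double in_buf [2][PLANE_BYTES / sizeof(double)] __attribute__((aligned(128)));
static double out_buf[2][PLANE_BYTES / sizeof(double)] __attribute__((aligned(128)));

/* Placeholder for the per-plane stencil work (not the real kernel). */
static void compute_plane(int b) { (void)b; }

void stream_planes(uint64_t src_ea, uint64_t dst_ea, int nplanes)
{
  int cur = 0;
  mfc_get(in_buf[cur], src_ea, PLANE_BYTES, cur, 0, 0);        /* prefetch plane 0 */

  for (int z = 0; z < nplanes; z++) {
    int nxt = cur ^ 1;
    if (z + 1 < nplanes)                                       /* start loading the next plane early */
      mfc_get(in_buf[nxt], src_ea + (uint64_t)(z + 1) * PLANE_BYTES,
              PLANE_BYTES, nxt, 0, 0);

    mfc_write_tag_mask(1 << cur);                              /* wait only for the current plane */
    mfc_read_tag_status_all();

    compute_plane(cur);                                        /* overlaps with the in-flight DMA */

    mfc_put(out_buf[cur], dst_ea + (uint64_t)z * PLANE_BYTES,  /* write the finished plane back */
            PLANE_BYTES, cur, 0, 0);
    cur = nxt;
  }
  mfc_write_tag_mask(0x3);                                     /* drain both DMA tags before returning */
  mfc_read_tag_status_all();
}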

13 Optimization - Cache Blocking Domains can be quite large (~1GB); a single plane, let alone 6, might not fit in the local store Partition the domain into cache-blocked slabs so that 6 cache-blocked planes fit in the local store Partitioning in the Y dimension maintains good spatial and temporal locality, and has the added benefit that cache blocks are independent and thus can be parallelized across multiple SPEs Memory efficiency can be diminished, as intra-grid ghost zones are implicitly created.
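As a rough sizing example (assuming the 256 KB SPE local store, a 256-point x dimension, double precision, and roughly 192 KB left for plane buffers after code, stack, and other data): 6 planes x 256 x CBY x 8 bytes <= 192 KB gives CBY <= 16, i.e. slabs on the order of 16 points wide in y.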

14 Optimization - Register Blocking Instead of computing on pencils, compute on ribbons (4x2) Hides functional unit & local store latency Minimizes local store memory traffic Minimizes loop overhead May not be beneficial / noticeable for cache-based machines
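A scalar C illustration of the ribbon traversal; the real SPE kernel would express these eight updates with 128-bit SIMD intrinsics plus rotates/permutes for the unaligned x±1 accesses, and the fixed 4x2 trip counts let the compiler fully unroll and schedule the updates together, which is what hides latency and amortizes loop overhead.

/* Update one 4 (in x) by 2 (in y) ribbon of output points; cur_m1/cur/cur_p1 are
 * the z-1, z, z+1 input planes held in the local store, nx is the plane width. */
static void ribbon_4x2(const double *cur_m1, const double *cur, const double *cur_p1,
                       double *next, int x, int y, int nx,
                       double alpha, double beta)
{
  for (int dy = 0; dy < 2; dy++)
    for (int dx = 0; dx < 4; dx++) {
      int i = (y + dy) * nx + (x + dx);
      next[i] = alpha * cur[i] +
                beta  * (cur[i - 1]  + cur[i + 1] +
                         cur[i - nx] + cur[i + nx] +
                         cur_m1[i]   + cur_p1[i]);
    }
}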

15 Double Precision Results ~256^3 problem Performance model matches well with hardware 5-9x performance of Opteron/Itanium/Power5 X1E ran a slight variant (beta hard-coded to be 1.0, not cache blocked)

16 Optimization - Temporal Blocking If the application allows it, block the outer (temporal) loop Only appropriate for memory-bound implementations –Improves computational intensity –Relevant for Cell in single precision, or a Cell with faster double precision Simple approach –Overlapping trapezoids in the time-space plot –Can be inefficient due to duplicated work –If performing n steps, the local store must hold 3(n+1) planes Time skewing is algorithmically superior, but harder to parallelize. Cache-oblivious blocking is similar, but implemented recursively. [Figure: trapezoids spanning time steps t, t+1, t+2]

17 Temporal Blocking Results Cell is computationally bound in double precision (no benefit in temporal blocking, so only cache blocked shown) Cache machines show the average of 4 steps of time skewing

18 Temporal Blocking Results (2) Temporal blocking on Cell was implemented in single precision (four step average) Others still use double precision

19 Sparse Matrix-Vector Multiplication

20 Sparse Matrix Vector Multiplication Most of the matrix entries are zeros, so the nonzero entries are sparsely distributed Dense methods compute on all the data; sparse methods compute only on the nonzeros (as only they contribute to the result) Can be used for unstructured grid problems Issues –Like DGEMM, can exploit an FMA well –Very low computational intensity (1 FMA for every 12+ bytes) –Non-FP instructions can dominate –Can be very irregular –Row lengths can be unique and very short

21 Compressed Sparse Row Compressed Sparse Row (CSR) is the standard format –Array of nonzero values –Array of corresponding column for each nonzero value –Array of row starts containing index (in the above arrays) of first nonzero in the row
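For reference, the unoptimized CSR kernel built on those three arrays (y = A*x); array names are generic, and row_start holds nrows+1 entries, the last marking the end of the final row.

void spmv_csr(int nrows, const int *row_start, const int *col,
              const double *val, const double *x, double *y)
{
  for (int r = 0; r < nrows; r++) {
    double sum = 0.0;
    for (int k = row_start[r]; k < row_start[r + 1]; k++)
      sum += val[k] * x[col[k]];      /* one FMA per stored nonzero */
    y[r] = sum;
  }
}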

22 Optimization - Double Buffer Nonzeros Computation and communication are approximately equally expensive While operating on the current set of nonzeros, load the next (~1K-nonzero buffers) Needs somewhat intricate code to stop and restart a row between buffers Can nearly double performance

23 Optimization - SIMDization Row Padding –Pad rows to the nearest multiple of 128b –Might require O(N) explicit zeros –Loop overhead still present –Generally works better in double precision BCSR –Nonzeros are grouped into dense r x c blocks (submatrices) –O(nnz) explicit zeros are added –Choose r & c so that blocks mesh well with 128b registers –Performance can suffer, especially in DP, as computing on explicit zeros is wasteful –Can hide latency and amortize loop overhead
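As one concrete instance of the BCSR idea, a sketch of a 2x1-blocked kernel; the block shape and array names are illustrative assumptions, chosen because in double precision a 2x1 block lets the two running sums map naturally onto one 128-bit register.

/* y = A*x with 2x1 BCSR: block row br covers matrix rows 2*br and 2*br+1,
 * and bval stores each block's two values contiguously. */
void spmv_bcsr_2x1(int nblockrows, const int *brow_start, const int *bcol,
                   const double *bval, const double *x, double *y)
{
  for (int br = 0; br < nblockrows; br++) {
    double sum0 = 0.0, sum1 = 0.0;
    for (int k = brow_start[br]; k < brow_start[br + 1]; k++) {
      double xv = x[bcol[k]];          /* one source-vector gather serves both rows */
      sum0 += bval[2 * k]     * xv;
      sum1 += bval[2 * k + 1] * xv;
    }
    y[2 * br]     = sum0;
    y[2 * br + 1] = sum1;
  }
}

Note that an odd final row would need padding (one of the O(nnz) explicit zeros mentioned above).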

24 Optimization - Cache Blocked Vectors Doubleword DMA gathers from DRAM can be expensive Cache block the source and destination vectors Finite LS, so what’s the best aspect ratio? DMA large blocks into the local store Gather operations within the local store –ISA vs. memory subsystem inefficiencies –Exploits temporal and spatial locality within the SPE In effect, the sparse matrix is explicitly blocked into submatrices, and we can skip, or otherwise optimize, empty submatrices Indices are now relative to the cache block –can be stored as halfwords –reduces memory traffic by 16%

25 Optimization - Load Balancing Potentially irregular problem Load imbalance can severely hurt performance Partitioning by rows is clearly insufficient Partitioning by nonzeros is inefficient when the matrix has few but highly variable nonzeros per row. Define a cost function of number of row starts and number of nonzeros. Determine the parameters via static timing analysis or tuning.
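A minimal sketch of such a partitioner; the cost-function weight alpha is a placeholder that would come from static timing analysis or tuning, and the greedy row-granularity split is only one way to realize the idea.

/* Assign rows to nspes partitions of roughly equal cost, where the cost of a
 * row is (its nonzero count) + alpha * (1 row start). Partition p owns rows
 * [first_row[p], first_row[p+1]); first_row must have nspes+1 entries. */
void partition_rows(int nrows, const int *row_start, int nspes, int *first_row)
{
  const double alpha = 8.0;                      /* placeholder weight per row start */
  double total = row_start[nrows] + alpha * nrows;
  double per_spe = total / nspes;
  double acc = 0.0;
  int p = 0;
  first_row[0] = 0;
  for (int r = 0; r < nrows && p < nspes - 1; r++) {
    acc += (row_start[r + 1] - row_start[r]) + alpha;
    if (acc >= per_spe * (p + 1))
      first_row[++p] = r + 1;                    /* close partition p after row r */
  }
  while (p < nspes - 1)                          /* handle more SPEs than rows */
    first_row[++p] = nrows;
  first_row[nspes] = nrows;
}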

26 Other Approaches BeBop / OSKI on the Itanium2 & Opteron –uses BCSR –auto tunes for optimal r x c blocking –can also exploit symmetry –Cell implementation is base case Cray’s routines on the X1E –Report best of CSRP, Segmented Scan & Jagged Diagonal

27 Benchmark Matrices #15 - PDE N=40K, NNZ=1.6M #17 - FEM N=22K, NNZ=1M #18 - Circuit N=17K, NNZ=125K #36 - CFD N=75K, NNZ=325K #06 - FEM Crystal N=14K, NNZ=490K #09 - 3D Tube N=45K, NNZ=1.6M #25 - Financial N=74K, NNZ=335K #27 - NASA N=36K, NNZ=180K #28 - Vibroacoustic N=12K, NNZ=177K #40 - Linear Programming N=31K, NNZ=1M

28 Double Precision SpMV Performance (diagonally dominant matrices) Notes: few nonzeros per row severely limited performance on the CFD matrix; BCSR & symmetry were clearly exploited on the first two; 3-8x faster than Opteron/Itanium

29 Double Precision SpMV Performance (less structured/partitionable matrices) Notes: few nonzeros per row (hurts a lot); not structured as well as the previous set; 4-6x faster than Opteron/Itanium

30 Double Precision SpMV Performance (more random matrices) Notes: many nonzeros per row; load imbalance hurt Cell's performance by as much as 20%; 5-7x faster than Opteron/Itanium

31 Conclusions

32 Conclusions Even in double precision, Cell obtains much better performance on a surprising variety of codes. Cell can eliminate unneeded memory traffic and hide memory latency, and thus achieves a much higher percentage of peak memory bandwidth. The instruction set can be very inefficient for poorly SIMDizable or misaligned codes. Loop overheads can heavily dominate performance. The programming model could be streamlined.

33 Acknowledgments This work is a collaboration with the following LBNL/FTG members: –John Shalf, Lenny Oliker, Shoaib Kamil, Parry Husbands, Kaushik Datta and Kathy Yelick Additional thanks to –Joe Gebis and David Patterson X1E FFT numbers provided by: –Bracy Elton, and Adrian Tate Cell access provided by IBM

34 Questions?

35 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted, provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers, or to redistribute to lists, requires prior specific permission. GSPx, October 30-November 2, 2006, Santa Clara, CA. Copyright 2006 Global Technology Conferences, Inc.