The Potential of the Cell Processor for Scientific Computing Sam Williams samw@cs.berkeley.edu Computational Research Division Future Technologies Group July 12, 2006
Outline Cell Architecture Programming Cell Benchmarks & Performance Stencils on Structured Grids Sparse Matrix-Vector Multiplication Matrix-Matrix Multiplication FFTs Summary
Cell Architecture
Cell Chip Architecture 9-core SMP. One core is a conventional cache-based PPC; the other 8 are local-memory-based SIMD processors (SPEs). Cores connected via 4 rings (4 x 128b @ 1.6GHz). 25.6GB/s memory bandwidth (128b @ 1.6GHz) to XDR. I/O channels can be used to build multichip SMPs.
Cell Chip Characteristics Core frequency up to 3.2GHz; we had limited access to 2.1GHz hardware, and faster parts (2.4GHz & 3.2GHz) became available after the paper. ~221mm² chip: much smaller than Itanium2 (1/2 to 1/3), about the size of a Power5 or a pre-shrink Opteron/Pentium. ~500W blades (2 chips + DRAM + network).
SPE Architecture 128b SIMD. Dual issue (FP/ALU + load/store/permute/etc.), but not for DP. Statically scheduled, in order. 4 FMA SP datapaths, 6-cycle latency; 1 FMA DP datapath, 13-cycle latency (including 7 stall cycles). Software-managed BTB (branch hints). Local-memory-based architecture (two steps to access DRAM): DMA commands are queued in the MFC, 128b @ 1.6GHz to the local store (256KB), aligned loads from the local store to the RF (128 x 128b). ~3W @ 3.2GHz.
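As a concrete illustration of the 4-wide single-precision FMA datapath, here is a minimal sketch assuming the Cell SDK's spu_intrinsics.h (spu_splats, spu_madd); the function name and loop structure are illustrative, not from the paper.

#include <spu_intrinsics.h>

/* y[i] = a*x[i] + y[i] over 4-wide SP vectors already resident in the local store */
void saxpy_spu(vector float *x, vector float *y, float a, int nvec)
{
    vector float av = spu_splats(a);          /* replicate the scalar into all 4 lanes */
    for (int i = 0; i < nvec; i++)
        y[i] = spu_madd(av, x[i], y[i]);      /* one 4-wide SP FMA per instruction */
}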
Programming Cell
Estimation, Simulation and Exploration Perform analysis before writing any code. Conceptualize the algorithm with double buffering + long DMAs; on an in-order machine, static timing analysis + memory traffic modeling work well. Model performance: for regular data structures, spreadsheet modeling works; SpMV requires more advanced modeling. Full System Simulator. Cell+: how severely does a DP throughput of 1 SIMD instruction every 7 or 8 cycles impair performance? Cell+ assumes 1 DP SIMD instruction every 2 cycles and allows dual issuing of DP instructions with loads/stores/permutes/branches.
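A minimal sketch of the spreadsheet-style model in plain C, assuming a double-buffered kernel where per-block time is the max of compute time and DMA time; the block sizes and the 204.8 GFLOP/s SP peak (8 SPEs x 4 FMA datapaths x 2 flops @ 3.2GHz) are illustrative assumptions consistent with the slide's figures.

#include <stdio.h>

int main(void)
{
    double bandwidth  = 25.6e9;    /* bytes/s to XDR DRAM                    */
    double peak_flops = 204.8e9;   /* single-precision FLOP/s across 8 SPEs  */

    double flops_per_block = 7.0  * 256 * 256;  /* e.g. one stencil cache block      */
    double bytes_per_block = 16.0 * 256 * 256;  /* one load + one store stream (DP)  */

    double t_compute  = flops_per_block / peak_flops;
    double t_transfer = bytes_per_block / bandwidth;
    double t_block    = t_compute > t_transfer ? t_compute : t_transfer;  /* overlapped */

    printf("predicted rate: %.1f GFLOP/s\n", flops_per_block / t_block / 1e9);
    return 0;
}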
Cell Programming Modified SPMD (Single Program Multiple Data): Dual Program Multiple Data / Hierarchical SPMD. A kind of clunky approach compared to MPI or pthreads. The Power core is used to: load/initialize data structures, spawn SPE threads, parallelize data structures, pass pointers to the SPEs, synchronize SPE threads, communicate with other processors, and perform other I/O operations.
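A hypothetical sketch of the PPE-side setup: the Power core partitions a vector and builds a small, aligned control block per SPE whose effective address is handed to each SPE thread at spawn time (thread creation via the SDK's libspe is elided, and all names here are illustrative).

#include <stdint.h>

#define NUM_SPES 8

typedef struct {
    uint64_t src_addr;   /* effective address of this SPE's input slice  */
    uint64_t dst_addr;   /* effective address of this SPE's output slice */
    uint32_t nelems;     /* number of elements assigned to this SPE      */
    uint32_t pad;        /* keep the block 16-byte sized/aligned         */
} spe_args_t;

static spe_args_t args[NUM_SPES] __attribute__((aligned(16)));

void partition(double *src, double *dst, uint32_t n)
{
    uint32_t chunk = n / NUM_SPES;
    for (int i = 0; i < NUM_SPES; i++) {
        args[i].src_addr = (uint64_t)(uintptr_t)&src[i * chunk];
        args[i].dst_addr = (uint64_t)(uintptr_t)&dst[i * chunk];
        args[i].nelems   = (i == NUM_SPES - 1) ? n - i * chunk : chunk;
        /* &args[i] is then passed as the argument pointer when SPE thread i is spawned */
    }
}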
SPE Programming Software-controlled memory: use pointers from the PPC to construct global addresses; the programmer handles transfers from global to local, the compiler handles transfers from local to RF; most applicable when the address stream is long and independent. DMAs: granularity is 1, 2, 4, 8, or a multiple of 16 bytes; issued with very low-level intrinsics (no error checking); GETL = stanza gather/scatter = distributed in global memory, packed in the local store. Double buffering: SPU and MFC operate in parallel; operate on the current data set while transferring the next/last; time is the max of computation time and communication time; prologue and epilogue penalties. Although more verbose, intrinsics are an effective way of delivering peak performance.
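A sketch of the double-buffering pattern on the SPE side, assuming the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all); the buffer size and the compute() kernel are placeholders.

#include <stdint.h>
#include <spu_mfcio.h>

#define CHUNK 4096                          /* bytes per DMA; must be a multiple of 16 */

static char buf[2][CHUNK] __attribute__((aligned(128)));

void compute(char *data, int nbytes);       /* hypothetical user kernel, defined elsewhere */

void process_stream(uint64_t ea, int nchunks)
{
    int b = 0;
    mfc_get(buf[b], ea, CHUNK, b, 0, 0);    /* prologue: fetch the first chunk */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                /* start the DMA for the next chunk */
            mfc_get(buf[1 - b], ea + (uint64_t)(i + 1) * CHUNK, CHUNK, 1 - b, 0, 0);
        mfc_write_tag_mask(1 << b);         /* wait only on the current chunk's tag */
        mfc_read_tag_status_all();
        compute(buf[b], CHUNK);             /* compute while the MFC fills the other buffer */
        b = 1 - b;                          /* swap buffers */
    }
}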
Benchmark Kernels Stencil Computations on Structured Grids Sparse Matrix-Vector Multiplication Matrix-Matrix Multiplication 1D FFTs 2D FFTs
Processors Used Note: Cell performance does not include the Power core
Stencil Operations on Structured Grids
Stencil Operations Simple example - the heat equation: ∂T/∂t = k∇²T, a parabolic PDE on a 3D discretized scalar domain. Jacobi style (read from the current grid, write to the next grid). 7 FLOPs per point, typically double precision: Next[x,y,z] = Alpha*Current[x,y,z] + Current[x-1,y,z] + Current[x+1,y,z] + Current[x,y-1,z] + Current[x,y+1,z] + Current[x,y,z-1] + Current[x,y,z+1]. Doesn't exploit the FMA pipeline well. Basically 6 streams presented to the memory subsystem. Explicit ghost zones bound the grid. (Figure: the 7-point stencil and its neighbor points.)
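For reference, a plain-C version of that sweep (the Cell implementation instead streams planes of a cache block through the local store); the fixed 128³ grid and the IDX macro are just for illustration.

#define NX 128
#define NY 128
#define NZ 128
#define IDX(x, y, z) ((z) * NY * NX + (y) * NX + (x))

/* One Jacobi sweep of the 7-point heat-equation stencil: 7 FLOPs per point,
 * reading from 'current' and writing to 'next' (ghost zones left untouched). */
void sweep(const double *current, double *next, double alpha)
{
    for (int z = 1; z < NZ - 1; z++)
        for (int y = 1; y < NY - 1; y++)
            for (int x = 1; x < NX - 1; x++)
                next[IDX(x, y, z)] =
                    alpha * current[IDX(x, y, z)] +
                    current[IDX(x - 1, y, z)] + current[IDX(x + 1, y, z)] +
                    current[IDX(x, y - 1, z)] + current[IDX(x, y + 1, z)] +
                    current[IDX(x, y, z - 1)] + current[IDX(x, y, z + 1)];
}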
Optimization - Planes The naïve approach (as on a cacheless vector machine) is to load 5 streams and store one: 7 FLOPs per 48 bytes, so memory limits performance to 3.7 GFLOP/s. A better approach is to make each DMA the size of a plane and cache the 3 most recent planes (z-1, z, z+1); then there are only two streams (one load, one store), and memory now limits performance to 11.2 GFLOP/s. Still must compute on each plane after it is loaded, e.g. forall Current_local[x,y] update Next_local[x,y]. Note: computation can severely limit performance.
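Where those two bounds come from, assuming 25.6 GB/s of DRAM bandwidth and 8-byte doubles (the naïve scheme moves 6 doubles per point, the plane-cached scheme only 2):

\[
\frac{7\ \mathrm{flops}}{6 \times 8\ \mathrm{B}} \times 25.6\ \mathrm{GB/s} \approx 3.7\ \mathrm{GFLOP/s},
\qquad
\frac{7\ \mathrm{flops}}{2 \times 8\ \mathrm{B}} \times 25.6\ \mathrm{GB/s} = 11.2\ \mathrm{GFLOP/s}
\]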
Optimization - Double Buffering Add an input stream buffer and an output stream buffer (keep 6 planes in the local store). The two phases (transfer & compute) are now overlapped, so it is possible to hide the shorter of the DMA transfer time and the computation time.
Optimization - Cache Blocking Domains can be quite large (~1GB); a single plane, let alone 6, might not fit in the local store. Partition the domain (and thus the planes) into cache blocks so that 6 planes fit in the local store. This has the added benefit that cache blocks are independent and thus can be parallelized across multiple SPEs. Memory efficiency can be diminished, as an intra-grid ghost zone is implicitly created.
Optimization - Register Blocking Instead of computing on pencils, compute on ribbons (4x2). Hides functional unit & local store latency. Minimizes local store memory traffic. Minimizes loop overhead. May not be beneficial / noticeable for cache-based machines.
Optimization - Time Skewing If the application allows it, perform multiple time steps within the local store. Only appropriate for memory-bound implementations (single precision, or Cell+'s improved double precision). Improves computational intensity. Simple approach: overlapping trapezoids in the time-space plot; can be inefficient due to duplicated work. If performing two steps, the local store must now hold 9 planes (thus smaller cache blocks); if performing n steps, the local store must hold 3(n+1) planes. (Figure: trapezoids at time t, t+1, t+2.)
Stencil Performance Notes: The performance model, when accounting for limited dual issue, matches well with the FSS and hardware. Double precision runs don't exploit time skewing; in single precision the time skewing depth is 4. Problem was the best of 128³ and 256³.
Sparse Matrix-Vector Multiplication
Sparse Matrix Vector Multiplication Most of the matrix entries are zero; thus the nonzeros are sparsely distributed. Can be used for unstructured grid problems. Issues: like DGEMM, it can exploit an FMA well, but it has very low computational intensity (1 FMA for every 12+ bytes), non-FP instructions can dominate, and it can be very irregular; row lengths can be unique and very short.
Compressed Sparse Row Compressed Sparse Row (CSR) is the standard format: an array of nonzero values, an array of the corresponding column for each nonzero value, and an array of row starts containing the index (into the above arrays) of the first nonzero in each row.
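For reference, a plain-C CSR SpMV (y = A*x); the Cell implementation partitions this across SPEs and cache blocks and streams the nonzeros through the local store, but the per-nonzero work is the same FMA shown in the inner loop.

typedef struct {
    int     nrows;
    int    *row_start;   /* nrows+1 entries; index of the first nonzero of each row */
    int    *col;         /* column of each nonzero                                  */
    double *val;         /* value of each nonzero                                   */
} csr_t;

void spmv_csr(const csr_t *A, const double *x, double *y)
{
    for (int r = 0; r < A->nrows; r++) {
        double sum = 0.0;
        for (int k = A->row_start[r]; k < A->row_start[r + 1]; k++)
            sum += A->val[k] * x[A->col[k]];   /* one FMA per nonzero */
        y[r] = sum;
    }
}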
Optimization - Double Buffer Nonzeros Computation and communication are approximately equally expensive. While operating on the current set of nonzeros, load the next (~1K-nonzero buffers). Needs moderately complex code to stop and restart a row between buffers. Can nearly double performance.
Optimization - SIMDization BCSR: nonzeros are grouped into r x c blocks, and O(nnz) explicit zeros are added. Choose r & c so that the blocks mesh well with 128b registers. Performance can suffer, especially in DP, as computing on zeros is very wasteful. Can hide latency and amortize loop overhead. Only used in the initial performance model. Row padding: pad rows to the nearest multiple of 128b; might require O(N) explicit zeros; loop overhead is still present. Generally works better in double precision.
Optimization - Cache Blocked Vectors Doubleword DMA gathers from DRAM can be expensive, so cache block the source and destination vectors. Finite LS, so what's the best aspect ratio? DMA large blocks into the local store, then gather operations within the local store (ISA vs. memory subsystem inefficiencies). Exploits temporal and spatial locality within the SPE. In effect, the sparse matrix is explicitly blocked into submatrices, and we can skip, or otherwise optimize, empty submatrices. Indices are now relative to the cache block, so half words suffice; this reduces memory traffic by 16%.
Optimization - Load Balancing A potentially irregular problem; load imbalance can severely hurt performance. Partitioning by rows is clearly insufficient; partitioning by nonzeros is inefficient when the average number of nonzeros per row per cache block is small. Define a cost function of the number of row starts and the number of nonzeros; determine the parameters via static timing analysis or tuning.
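A hedged sketch of such a partitioner: the cost function below (ALPHA per row start plus one unit per nonzero) and the greedy split over contiguous row ranges are illustrative assumptions, with ALPHA standing in for the parameter obtained by static timing analysis or tuning.

#define NUM_SPES 8
#define ALPHA    4.0   /* hypothetical cost of a row start, in units of one nonzero */

/* row_start: CSR row pointers (nrows+1 entries); bound must hold NUM_SPES+1 ints.
 * Rows bound[p] .. bound[p+1]-1 form the contiguous range assigned to SPE p. */
void partition_rows(const int *row_start, int nrows, int *bound)
{
    double total  = ALPHA * nrows + row_start[nrows];       /* total cost of the matrix   */
    double target = total / NUM_SPES;
    double acc    = 0.0;
    int p = 0;
    bound[0] = 0;
    for (int r = 0; r < nrows && p < NUM_SPES - 1; r++) {
        acc += ALPHA + (row_start[r + 1] - row_start[r]);   /* cost of row r              */
        if (acc >= target * (p + 1))                        /* cumulative target reached  */
            bound[++p] = r + 1;
    }
    while (p < NUM_SPES) bound[++p] = nrows;                /* close remaining partitions */
}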
Benchmark Matrices 4 nonsymmetric SPARSITY matrices and the 7pt heat equation matrix.
Other Approaches BeBop / OSKI on the Itanium2 & Opteron uses BCSR and auto-tunes for the optimal r x c blocking; the Cell implementation is similar. Cray's routines on the X1E report the best of CSRP, segmented scan & jagged diagonal.
Double Precision SpMV Performance The 16 SPE version needs a broadcast optimization. Frequency helps (mildly computationally bound). Cell+ doesn't help much more (limited by non-FP instruction bandwidth).
Dense Matrix-Matrix Multiplication (performance model only)
Dense Matrix-Matrix Multiplication Blocking: explicit (BDL) or implicit blocking (gather stanzas); a hybrid method would be to convert and store to DRAM on the fly. Choose a block size so that the kernel is computationally bound: 64² in single precision; much easier in double precision (14x the computation time, 2x the transfer time). Parallelization: partition A & C among the SPEs. Future work - Cannon's algorithm.
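For reference, a plain-C sketch of one tile update C += A*B on the 64x64 single-precision blocks mentioned above; the real SPE kernel would be SIMDized with intrinsics and software pipelined, so this only shows the FMA-friendly loop structure.

#define NB 64   /* tile edge: 64x64 SP tiles keep the kernel compute bound */

/* One tile update C += A*B; the i-k-j order keeps a[i][k] in a register and
 * makes the inner loop a stream of FMAs over contiguous rows of B and C. */
void block_mm(const float a[NB][NB], const float b[NB][NB], float c[NB][NB])
{
    for (int i = 0; i < NB; i++)
        for (int k = 0; k < NB; k++) {
            float aik = a[i][k];
            for (int j = 0; j < NB; j++)
                c[i][j] += aik * b[k][j];
        }
}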
GEMM Performance
Fourier Transforms (performance model only)
1D Fast Fourier Transforms Naïve algorithm: load the roots of unity, load the data (cyclic), then local work, on-chip transpose, local work; i.e. the SPEs cooperate on a single FFT. No overlap of communication and computation.
2D Fast Fourier Transforms Each SPE performs 2 * (N/8) FFTs. Double buffer the rows to overlap communication and computation (2 incoming, 2 outgoing). Straightforward algorithm (N² 2D FFT): N simultaneous FFTs, transpose, N simultaneous FFTs, transpose. Long DMAs necessitate the transposes; the transposes represent about 50% of total SP execution time. The SP simultaneous FFTs run at ~75 GFLOP/s.
FFT Performance
Conclusions
Conclusions Cell is far more predictable than conventional out-of-order machines. Even in double precision, it obtains much better performance on a surprising variety of codes. Cell can eliminate unneeded memory traffic, hide memory latency, and thus achieve a much higher percentage of memory bandwidth. The instruction set can be very inefficient for poorly SIMDizable or misaligned codes, and loop overheads can heavily dominate performance. The programming model is clunky.
Acknowledgments This work (paper in CF06, poster in EDGE06) is a collaboration with the following FTG members: John Shalf, Lenny Oliker, Shoaib Kamil, Parry Husbands, and Kathy Yelick Additional thanks to Joe Gebis and David Patterson X1E FFT numbers provided by: Bracy Elton, and Adrian Tate Cell access provided by: Mike Perrone (IBM Watson) Otto Wohlmuth (IBM Germany) Nehal Desai (Los Alamos National Lab)
Questions?