High Performance Computing on the Cell Broadband Engine Vas Chellappa Electrical & Computer Engineering Carnegie Mellon University Dec 3 2008
Designing “faster” processors Need for speed Parallelism: forms (limitations) Superscalar (power density) Pipelining (latch overhead: frequency scaling, branching) Vector (programming, only numeric) Multi-core (memory wall, programming) Multi-node (interconnects, reliability)
Multi-core Parallelism Future is definitely multi-core parallelism But what problems/limitations do multi-cores have? Increased programming burden Scaling issues: power, interconnects etc.
The Cell BE Approach Frequency wall: many simple, in-order cores Power wall: vectorized, in-order, arithmetic cores Memory wall: Memory Flow Controller handles programmer driven DMA in background [Figure: Cell BE chip with the PPE and the SPEs (each with a local store, LS) connected by the EIB to main memory]
Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up
Cell Broadband Engine Designed for high-density floating-point computation (PlayStation 3, IBM Roadrunner) Compute: Heterogeneous multi-core (1 PPE + 8 SPEs) 204.8 Gflop/s single precision peak (only SPEs) High-speed on-chip interconnect Memory system: Explicit scratchpad-type “local store” DMA based programming Challenges: Parallelization, vectorization, explicit memory New design: new programming paradigm Writing optimized code by hand is hard. Automated tools exist, but do not deliver comparable performance. [Figure: eight SPEs, each with a local store (LS), connected through the EIB to main memory]
Cell BE Processor: A Closer Look Power Processing Element (PPE) Synergistic Processing Element (SPE) x8 Local Stores (LS) [Figure: Cell BE chip block diagram with PPE, SPEs and local stores (LS), EIB, main memory]
Power Processing Element (PPE) Purpose: Operating System, program control Uses POWER Instruction Set Architecture 2-way multithreaded Cache: 32KB L1-I, 32KB L1-D, 512KB L2 AltiVec SIMD System functions Virtualization, address translation/protection, exception handling http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Roadrunner-tutorial-session-1-web1.pdf
Synergistic Processing Element (SPE) SPU = processor + LS; SPE = SPU + MFC Synergistic Processing Unit (SPU) Local Store (LS) Memory Flow Controller (MFC)
Synergistic Processing Unit (SPU) Number cruncher Vectorization (4-way single precision / 2-way double precision) Peak performance (each SPE) 25.6 Gflop/s (single precision): 3.2 GHz x 4-way (vector) x 2 (FMA) <2 Gflop/s (double precision): Not pipelined EDP version: full speed double precision (12.8 Gflop/s) Comparison: Intel. 128 vector registers, each 128 bits wide Even, odd pipelines In-order, shallow pipelines No branch prediction (hinting) Completely deterministic
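The peak numbers on this slide are plain arithmetic, and a few lines of C can recompute them. This is only a sketch: `peak_gflops` and `print_cell_peaks` are names made up for the illustration, and the inputs (3.2 GHz, 4 single precision lanes, 2 flops per fused multiply-add, 8 SPEs) are the ones quoted on the slides.

```c
#include <stdio.h>

/* Peak rate in Gflop/s = clock (GHz) x vector lanes x flops per lane per cycle x cores.
 * peak_gflops is a hypothetical helper, not part of any Cell SDK. */
double peak_gflops(double clock_ghz, int lanes, int flops_per_lane, int cores)
{
    return clock_ghz * lanes * flops_per_lane * cores;
}

/* Print the peaks that the slides quote for the SPEs. */
void print_cell_peaks(void)
{
    printf("one SPE, single precision : %.1f Gflop/s\n", peak_gflops(3.2, 4, 2, 1));
    printf("8 SPEs, single precision  : %.1f Gflop/s\n", peak_gflops(3.2, 4, 2, 8));
    printf("one SPE, double (EDP)     : %.1f Gflop/s\n", peak_gflops(3.2, 2, 2, 1));
}
```

The same formula with 2 lanes gives the 12.8 Gflop/s double precision figure for the EDP version, provided a double precision FMA can issue every cycle.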
Local Stores (LS) and Memory Flow Cont. (MFC) Each SPU contains a 256KB LS (instead of cache) Explicit read/write (programmer issues DMA) Extremely fast (6-cycle load latency to SPU) Memory Flow Controller Co-processor to handle DMAs (in background) 8/16 command-queue entries Handles DMA-lists (scatter/gather) Barriers, fences, tag groups etc. Mailboxes, signals
Element Interconnect Bus (EIB) 4 data rings (16B wide each) 2 clockwise, 2 counter-clockwise Supports multiple data transfers Data ports: 25.6 GB/s per direction 204.8 GB/s sustained peak
Direct Memory Access (DMA) Programmer driven Packet sizes 1B – 16KB Several alignment constraints (bus errors!) Packet size vs. performance DMA lists Get, put: SPE-centric view Mailboxes/signals are also DMAs
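Most DMA bus errors come from size or alignment violations, so it can help to check arguments before issuing a command. The sketch below encodes the commonly documented MFC rules as I understand them: sizes of 1, 2, 4, 8 bytes or a multiple of 16 up to 16 KB, natural alignment for small transfers with matching low 4 address bits, and 16-byte alignment otherwise. The function names are hypothetical, and rejecting a size of 0 is a choice made for this sketch.

```c
#include <stdbool.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u  /* 16 KB upper limit per DMA command */

/* True if one DMA of `size` bytes between local-store address `ls` and
 * effective (main memory) address `ea` satisfies the size/alignment rules. */
bool dma_args_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size == 0 || size > DMA_MAX_BYTES)
        return false;
    if (size < 16) {
        /* Small transfers: 1, 2, 4 or 8 bytes, naturally aligned, and
         * both addresses must agree in their low 4 bits. */
        if (size != 1 && size != 2 && size != 4 && size != 8)
            return false;
        if ((ls & (size - 1)) != 0 || (ea & (size - 1)) != 0)
            return false;
        return (ls & 15u) == (ea & 15u);
    }
    /* Quadword transfers: size a multiple of 16, both addresses 16-byte aligned. */
    return size % 16 == 0 && ls % 16 == 0 && ea % 16 == 0;
}

/* 128-byte alignment is not required, but such transfers tend to perform best. */
bool dma_args_fast(uint32_t ls, uint64_t ea, uint32_t size)
{
    return dma_args_ok(ls, ea, size) &&
           ls % 128 == 0 && ea % 128 == 0 && size % 128 == 0;
}
```

A check like this is cheap enough to leave in debug builds of the SPE code.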
Systems using the Cell Sony PlayStation 3 6 available SPEs 7th: hypervisor 8th: defective (yield issues) Can run Linux (Fedora / Yellow Dog Linux) Various PS3-cluster projects IBM BladeCenter QS20/QS22 Two Cell processors Infiniband/Ethernet
IBM Roadrunner Supercomputer at Los Alamos National Lab (NM) Main purpose: model decay of the US nuclear arsenal Performance World’s fastest [TOP500.org] Peak: 1.7 petaflop/s. First to top 1.0 petaflop/s on Linpack Design: hybrid dual-core 64-bit AMD Opterons at 1.8GHz (6,480 Opterons) Cell attached to each Opteron core at 3.2GHz (12,960 Cells) Design hierarchy QS22 Blade = 2 PowerXCell 8i TriBlade = LS21 Opteron Blade + 2x QS22 Cell Blades (PCIe x8) Connected Unit = 180 TriBlades (Infiniband) Cluster = 18 CUs (Infiniband) 90% of peak performance from SPEs Porting programs over Endianness
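The processor counts on this slide can be cross-checked from the design hierarchy. The sketch below multiplies out the levels. The per-level counts are taken from the slide, except for two Opteron chips per LS21 blade, which is an assumption consistent with "Cell attached to each Opteron core"; the constant and function names are invented for the example.

```c
/* Component counts for the hierarchy described on the slide. */
enum {
    CELLS_PER_QS22    = 2,
    QS22_PER_TRIBLADE = 2,
    OPTERONS_PER_LS21 = 2,    /* assumption: two dual-core Opterons per LS21 */
    TRIBLADES_PER_CU  = 180,
    CUS_PER_CLUSTER   = 18
};

/* Total TriBlades in the cluster. */
int roadrunner_triblades(void)
{
    return TRIBLADES_PER_CU * CUS_PER_CLUSTER;
}

/* Total Cell processors: two QS22 blades per TriBlade, two Cells per QS22. */
int roadrunner_cells(void)
{
    return roadrunner_triblades() * QS22_PER_TRIBLADE * CELLS_PER_QS22;
}

/* Total Opteron chips: one LS21 blade per TriBlade. */
int roadrunner_opterons(void)
{
    return roadrunner_triblades() * OPTERONS_PER_LS21;
}
```

With these inputs the totals match the 12,960 Cells and 6,480 Opterons quoted above.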
Programming on the Cell: Philosophy Major differences to traditional processors Not designed for scalar performance Explicit memory access Heterogeneous multi-core Using the SPEs SPMD model (Single Program Multiple Data) Streaming model
Programming Tips What kind of code good/bad for SPEs? No branching (no prediction) Use branch hinting No scalar (no support) Use intrinsics for vectorization, DMA Context switches are expensive Program + data reside in LS. These have to be swapped in/out DMA code: alignment, alignment, alignment! Libraries available to emulate software-managed cache
DMA Programming Main idea: hide memory accesses with multibuffering Compute on one buffer in LS Write back / read in other batches of data Like a completely controlled cache Inter-chip communication Mailboxes Signals DMA
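The buffer rotation behind multibuffering can be sketched in portable C. This is a simulation under stated assumptions, not SPE code: `fake_get` and `fake_put` are hypothetical stand-ins that use `memcpy`, so nothing actually overlaps here; on an SPE they would be asynchronous MFC commands and the compute loop would run while the next chunk is in flight. `CHUNK` is an arbitrary size picked for the example.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 1024  /* floats per buffer; hypothetical size for this sketch */

/* Stand-ins for asynchronous DMA. On an SPE these would queue an MFC get/put
 * and return immediately; here memcpy finishes before returning. */
void fake_get(float *ls, const float *mem, size_t n) { memcpy(ls, mem, n * sizeof *ls); }
void fake_put(float *mem, const float *ls, size_t n) { memcpy(mem, ls, n * sizeof *ls); }

/* Two buffer sets standing in for local store. */
static float bx[2][CHUNK], by[2][CHUNK], bz[2][CHUNK];

/* Number of floats in chunk c of an n-float array. */
size_t chunk_len(size_t n, size_t c)
{
    size_t off = c * CHUNK;
    return (n - off < CHUNK) ? n - off : CHUNK;
}

/* Fetch chunk c of X, Y and Z into buffer set b. */
void fetch_chunk(int b, const float *x, const float *y, const float *z,
                 size_t n, size_t c)
{
    size_t off = c * CHUNK, len = chunk_len(n, c);
    fake_get(bx[b], x + off, len);
    fake_get(by[b], y + off, len);
    fake_get(bz[b], z + off, len);
}

/* X[i] += Y[i] * Z[i], processed chunk by chunk with two buffer sets. */
void madd_double_buffered(float *x, const float *y, const float *z, size_t n)
{
    size_t nchunks = (n + CHUNK - 1) / CHUNK;
    if (nchunks == 0)
        return;
    fetch_chunk(0, x, y, z, n, 0);                 /* prime the pipeline */
    for (size_t c = 0; c < nchunks; c++) {
        int cur = (int)(c & 1);
        size_t len = chunk_len(n, c);
        if (c + 1 < nchunks)                       /* request the next chunk first */
            fetch_chunk(cur ^ 1, x, y, z, n, c + 1);
        /* Real code would wait here for the DMA tag of buffer set `cur`. */
        for (size_t i = 0; i < len; i++)           /* compute on the current chunk */
            bx[cur][i] += by[cur][i] * bz[cur][i];
        fake_put(x + c * CHUNK, bx[cur], len);     /* write the result back */
    }
}
```

On real hardware the get into a buffer set must also be ordered after the earlier put from that same set, for example by using one tag per buffer set together with a fenced command.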
Tools for Cell Programming IBM’s Cell SDK 3.0 spu-gcc, ppu-gcc, xlc compilers Simulator libspe: SPE runtime management library Other tools: Assembly visualizer Because SPEs are in-order Single source compiler No OpenMP right now Other tools (from RapidMind, Mercury etc.)
Program Design Use knowledge of architecture to model Back of the envelope calculations Cost of processing? Cost of communication? Trends? Limits? How close is the model? What programming improvements can be made to fit the architecture better?
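A back-of-the-envelope model can be as small as one function that compares the compute peak with what the memory system can feed. The sketch below is one such model. `bound_gflops` is a hypothetical name, and the 25.6 GB/s main memory bandwidth used in the usage note is an assumption that does not appear on the slides.

```c
/* Upper bound on Gflop/s for a streaming kernel: the lower of the compute peak
 * and the rate at which memory can supply operands.
 *   compute_peak   : peak Gflop/s of the cores used
 *   mem_gb_per_s   : sustained memory bandwidth in GB/s
 *   flops_per_elem : floating-point operations per element
 *   bytes_per_elem : bytes moved to and from memory per element */
double bound_gflops(double compute_peak, double mem_gb_per_s,
                    double flops_per_elem, double bytes_per_elem)
{
    double mem_bound = mem_gb_per_s * flops_per_elem / bytes_per_elem;
    return mem_bound < compute_peak ? mem_bound : compute_peak;
}
```

For single precision X[] += Y[] * Z[], each element costs 2 flops and 16 bytes of traffic (three loads and one store of 4 bytes each). With the assumed 25.6 GB/s, `bound_gflops(204.8, 25.6, 2, 16)` is about 3.2 Gflop/s, far below the compute peak, which suggests the kernel is limited by communication rather than processing.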
Creating PPE Program, SPE Threads Each program consists of PPE and SPE sections Program is started up on PPE PPE creates SPE threads pthreads implementation Not full PPE data structure to keep track of SPE threads PPE/SPE shared data structure for argument passing X, Y, Z addresses Thread id Returned cycle count
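One possible layout for the shared argument structure is sketched below. It is not the skeleton code's actual definition: the type name, field names, and the helper `make_spe_args` are hypothetical. Addresses are stored as 64-bit integers so the layout does not depend on whether the PPE program is built as 32- or 64-bit, and the size is kept a multiple of 16 bytes so the block can be moved with a single DMA.

```c
#include <stdint.h>

/* Hypothetical per-thread control block shared between PPE and SPE. */
typedef struct {
    uint64_t x_ea;       /* effective address of X in main memory */
    uint64_t y_ea;       /* effective address of Y */
    uint64_t z_ea;       /* effective address of Z */
    uint32_t thread_id;  /* index of the SPE thread */
    uint32_t cycles;     /* cycle count written back by the SPE */
} spe_args_t;

/* DMA sizes of 16 bytes or more must be a multiple of 16. */
_Static_assert(sizeof(spe_args_t) % 16 == 0, "control block must be DMA-sized");

/* Fill a control block on the PPE side before starting the SPE thread. */
spe_args_t make_spe_args(const float *x, const float *y, const float *z,
                         uint32_t thread_id)
{
    spe_args_t a;
    a.x_ea = (uint64_t)(uintptr_t)x;
    a.y_ea = (uint64_t)(uintptr_t)y;
    a.z_ea = (uint64_t)(uintptr_t)z;
    a.thread_id = thread_id;
    a.cycles = 0;
    return a;
}
```

Instances that are transferred by DMA would also need to be placed at a 16-byte aligned address, which this sketch does not enforce.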
DMA Access spu_writech(MFC_WrTagMask, -1); spu_mfcdma64(ls_address, ea_high, ea_low, size_in_bytes, tag_id, MFC_GET_CMD); spu_mfcstat(MFC_TAG_UPDATE_ALL); For a get, the local store address is the destination and the effective address (high and low 32-bit words) is the source. Use my DMA_BL_GET, DMA_BL_PUT macros
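Two bookkeeping steps usually surround a call like `spu_mfcdma64`: splitting the 64-bit effective address into its high and low words, and breaking transfers larger than 16 KB into several commands. The sketch below shows both in portable C. It does not issue any DMA, the names `ea_high`, `ea_low`, `dma_chunk_t` and `plan_dma` are hypothetical, and it assumes the addresses and total size are already 16-byte aligned.

```c
#include <stddef.h>
#include <stdint.h>

#define DMA_MAX_BYTES 16384u  /* largest single DMA command: 16 KB */

/* Split a 64-bit effective address into the two 32-bit words that
 * spu_mfcdma64 takes as separate arguments. */
uint32_t ea_high(uint64_t ea) { return (uint32_t)(ea >> 32); }
uint32_t ea_low(uint64_t ea)  { return (uint32_t)(ea & 0xffffffffu); }

/* One planned DMA command. */
typedef struct {
    uint32_t ls;    /* local store address */
    uint64_t ea;    /* effective address in main memory */
    uint32_t size;  /* bytes, at most DMA_MAX_BYTES */
} dma_chunk_t;

/* Break a transfer of `total` bytes into commands of at most 16 KB each.
 * Writes up to max_chunks entries to `out` and returns how many were written. */
size_t plan_dma(uint32_t ls, uint64_t ea, uint32_t total,
                dma_chunk_t *out, size_t max_chunks)
{
    size_t n = 0;
    while (total > 0 && n < max_chunks) {
        uint32_t sz = total < DMA_MAX_BYTES ? total : DMA_MAX_BYTES;
        out[n].ls = ls;
        out[n].ea = ea;
        out[n].size = sz;
        ls += sz;
        ea += sz;
        total -= sz;
        n++;
    }
    return n;
}
```

On the SPE, each planned entry would turn into one get or put command on a shared tag, followed by a single wait on that tag.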
Compiling Compile ppe, spe programs separately Details: specify SPE program name, call from PPE 32/64 bit (watch out for pointer sizes etc.) Cell SDK has sample Makefiles We will use a simple Makefile
Performance Evaluation: Timing Performance measure: runtime, Gflop/s Timing Each SPE has its own decrementer Decrements at an independent, lower frequency (about 80 MHz on PS3; the timebase is listed by cat /proc/cpuinfo) Reset counter to highest value Measure on each SPE? Average? Min? Max? Which one fits the real-world scenario the best?
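Turning two decrementer readings into a Gflop/s figure takes a subtraction and a division. The sketch below shows the arithmetic in portable C. The function names are hypothetical, and the timebase value has to be read from the target system; 79.8 MHz in the usage note is an assumed value for a PS3.

```c
#include <stdint.h>

/* The decrementer counts down, so elapsed ticks = start - end.
 * Unsigned subtraction also gives the right result if the counter
 * wrapped around once between the two readings. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return start - end;
}

/* Convert decrementer ticks to seconds using the timebase frequency in Hz. */
double ticks_to_seconds(uint32_t ticks, double timebase_hz)
{
    return (double)ticks / timebase_hz;
}

/* Achieved rate in Gflop/s for flop_count operations taking `ticks` ticks. */
double gflops(double flop_count, uint32_t ticks, double timebase_hz)
{
    return flop_count / ticks_to_seconds(ticks, timebase_hz) / 1e9;
}
```

For example, with an assumed timebase of 79.8e6 Hz, a kernel that performs 2e9 flops in 79,800,000 ticks ran for one second at 2 Gflop/s. Because the timebase is much slower than the 3.2 GHz core clock, very short kernels need to be repeated to get a meaningful reading.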
Exercise 1: Add/Mul Two Arrays Goal: X[] += Y[] * Z[] Part 1: Infrastructure, understand skeleton code Part 2: Parallelization and vectorization (easy) Part 3: Hiding memory access costs
Part 1 Goal: Understand skeleton code. Get infrastructure up and running (compiler, basic code). Evaluate: scalar, sequential code performance. PPU’s tasks: Initialize vectors in main memory. Start up threads for each SPU, and let them run. Verify/print results, performance. Use only a single SPU. SPU’s tasks: Get (DMA) all 3 arrays from main memory. Perform computation. Put (DMA) back result to main memory. Write back time to PPU. Your tasks: Compile. Transform code. Timer code.
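The scalar, sequential baseline for Part 1 is a single loop. The sketch below is a plain C reference version, not the skeleton code itself; `madd_scalar` and `madd_flops` are hypothetical names. It is useful both as the starting point for timing and as the reference against which later versions are verified.

```c
#include <stddef.h>

/* Scalar, sequential reference: X[i] += Y[i] * Z[i] for i in [0, n). */
void madd_scalar(float *x, const float *y, const float *z, size_t n)
{
    for (size_t i = 0; i < n; i++)
        x[i] += y[i] * z[i];
}

/* Operation count used for Gflop/s: one multiply and one add per element. */
double madd_flops(size_t n)
{
    return 2.0 * (double)n;
}
```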
Part 2 Goal: Parallelize across 4 SPEs (easy with skeleton code). Vectorize X[] += Y[] * Z[] (easy). Evaluate: Parallel code performance. Vectorized parallel code performance. PPU: Start up 4 SPU threads. Performance evaluation: how? SPU: DMA-get, compute, DMA-put only its own chunk. 4-way single precision vectorization: (vector float) d = spu_madd(a,b,c); Your tasks: Parallelize. Vectorize. Performance?
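The two steps of Part 2 can be emulated in portable C to check the indexing before moving to the SPE. In this sketch, `vec4f` and `emu_madd` are hypothetical stand-ins for the SPU's `vector float` type and the `spu_madd(a,b,c)` intrinsic, which computes a*b+c in each lane; `spe_chunk` and `madd_chunk_vec` are also invented names. The code assumes n is divisible by 4 times the number of SPEs.

```c
#include <stddef.h>

/* Portable stand-in for the SPU's 4-lane `vector float`. */
typedef struct { float lane[4]; } vec4f;

/* Emulates spu_madd(a, b, c): a*b + c in each of the 4 lanes. */
vec4f emu_madd(vec4f a, vec4f b, vec4f c)
{
    vec4f d;
    for (int k = 0; k < 4; k++)
        d.lane[k] = a.lane[k] * b.lane[k] + c.lane[k];
    return d;
}

/* Contiguous chunk owned by SPE `id` out of `nspes` (SPMD partitioning). */
void spe_chunk(size_t n, int nspes, int id, size_t *begin, size_t *count)
{
    *count = n / (size_t)nspes;
    *begin = (size_t)id * *count;
}

/* Work done by one SPE: X += Y * Z on its own chunk, 4 elements at a time. */
void madd_chunk_vec(float *x, const float *y, const float *z,
                    size_t n, int nspes, int id)
{
    size_t begin, count;
    spe_chunk(n, nspes, id, &begin, &count);
    for (size_t i = begin; i < begin + count; i += 4) {
        vec4f a, b, c, d;
        for (int k = 0; k < 4; k++) {
            a.lane[k] = y[i + k];
            b.lane[k] = z[i + k];
            c.lane[k] = x[i + k];
        }
        d = emu_madd(a, b, c);
        for (int k = 0; k < 4; k++)
            x[i + k] = d.lane[k];
    }
}
```

On the SPE the lane loops disappear: the arrays in local store are accessed through `vector float` pointers and each iteration is a single `spu_madd`.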
Part 3 Goal: hide memory accesses How?
Exercise Debriefing How effectively did we use the architecture? Parallelization, vectorization mandatory! Memory overlapping: big difference Do our optimizations work for a large size range? Smaller sizes: lower packet sizes? Real world problems (Fourier transform, WHT) Real-world problems are rarely embarrassingly parallel Additional complexities?
WHT on the Cell Vectorization: as before Parallelization: locality-aware! Explicit memory access Provide code Multibuffering? How? Inter-SPE data exchange Algorithms that generate large packet sizes? Overlap? Fast barrier
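For reference, a textbook in-place fast Walsh-Hadamard transform is sketched below in plain C. It is not the Cell implementation discussed here, only the standard sequential algorithm, and the function name `wht` is chosen for the example. It shows why data exchange becomes an issue: the stride h doubles at each stage, so in late stages the two elements of a butterfly are far apart and, with the data split across local stores, would live on different SPEs.

```c
#include <stddef.h>

/* In-place, unnormalized fast Walsh-Hadamard transform.
 * n must be a power of two. */
void wht(float *x, size_t n)
{
    for (size_t h = 1; h < n; h *= 2) {             /* log2(n) stages */
        for (size_t i = 0; i < n; i += 2 * h) {     /* blocks of size 2h */
            for (size_t j = i; j < i + h; j++) {    /* butterfly, stride h */
                float a = x[j];
                float b = x[j + h];
                x[j] = a + b;
                x[j + h] = a - b;
            }
        }
    }
}
```

Applying the transform twice returns the input scaled by n, which gives a simple correctness check.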
WHT: Data Exchange
DMA Issues External multibuffering (streaming) Strategies for problem sizes Small/medium: data exchange on-chip, streaming Large: trickier. Break down into parts Using all memory banks
Cell Philosophy Cell philosophies: do they extend to other systems? Yes: Fundamental problems are the same Distributed memory computing Clusters, supercomputers Processing faster than interconnects Higher interconnect bandwidth with larger packets Multicore processors Trend: NUMA, even on-chip Locality-aware parallelism
Wrap-Up Programming Cell BE for high-performance computing Cell: chip multiprocessor designed for HPC Applications from video gaming to supercomputers Programming burden is a factor for performance Parallelization, vectorization, memory handling Automated tools yield limited performance Programmers must understand the microarchitecture and its tradeoffs to get performance (esp. on Cell)