1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)
2/21 Background
- Joint collaboration of IBM, Sony, and Toshiba
- Goal: develop a next-generation processor
- Initially for the PlayStation 3
- Other targets: multimedia applications (Blu-ray, HDTV) and server systems
3/21 Objective
- Outstanding performance
  - Overcome the memory wall
  - Improve power efficiency
  - Sustain high frequency without increasing pipeline depth
- Real-time response to the user
  - Visual, sound, and other sensory feedback
  - Connect to the internet (able to handle a variety of workloads)
- Applicable to a wide range of platforms
  - Next-generation consumer-electronic systems and beyond
4/21 Synergistic Processing Element
5/21 Power Processor Element (PPE)
- The PPE is a 64-bit "Power Architecture" core capable of running POWER or PowerPC binaries
- Extended Vector Scalar Unit (VSU)
- The PPE is in-order, dual-threaded, and dual-issue
6/21 PPE components Copyright: IBM
7/21 Synergistic Processing Elements
- An SPE is a self-contained vector (SIMD) processor that acts as a co-processor
- The SPE's ISA is a cross between VMX and the PS2's Emotion Engine
- In-order (again, to minimize circuitry and save power)
- Statically scheduled (the compiler plays a big role)
- No dynamic branch prediction hardware (relies on compiler-generated hints)
- Each SPE consists of:
  - 128 x 128-bit register file
  - Local Store (SRAM)
  - DMA unit
  - FP, LD/ST, Permute, and Branch units (each pipelined)
8/21 SPE Architecture Copyright: IBM
9/21 SPE Local Store
- Each SPE has local on-chip memory, a.k.a. the Local Store (LS)
- Serves as a secondary register file (not as a cache)
  - Avoids the coherence logic that caches need, as well as cache-miss penalties
- Is mapped into the memory map of the processor, which allows LS-to-LS transactions
- 128-bit instruction fetch, load, and store operations, 7 out of every 8 cycles
- Data and instructions are transferred between the LS and system memory or another SPE's LS using the DMA unit
  - 128 bytes at a time (transfer rate of 0.5 terabytes/sec)
  - DMA transactions are coherent
10/21 SPE DMA Unit
- Contains the Memory Flow Controller (MFC)
- Interface uses the Power Architecture page protection model
- The MFC has its own Memory Management Unit (MMU), which is a subset of the Power core's MMU
- This gives all processors a consistent interface to the system storage map despite the chip's heterogeneous structure
11/21 Floating Point Performance
- Both the PPE and the SPEs have vector instruction capability
- In particular, each SPU can complete either:
  - 2 double-precision operations per clock cycle, which translates to 6.4 GFLOPS at 3.2 GHz, or
  - 8 single-precision operations per clock cycle, which translates to 25.6 GFLOPS at 3.2 GHz
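The per-SPU figures on this slide are simply operations per cycle multiplied by the clock rate. A minimal Python sketch of that arithmetic follows; the function name `peak_gflops` is hypothetical, and the result is a theoretical peak, not a measured rate.

```python
CLOCK_GHZ = 3.2  # core clock quoted on the slide


def peak_gflops(ops_per_cycle, clock_ghz=CLOCK_GHZ):
    """Theoretical peak in GFLOPS: operations per cycle times clock in GHz."""
    return ops_per_cycle * clock_ghz


dp_per_spu = peak_gflops(2)  # double precision: 2 operations per cycle
sp_per_spu = peak_gflops(8)  # single precision: 8 operations per cycle

print(f"double precision: {dp_per_spu:.1f} GFLOPS per SPU")
print(f"single precision: {sp_per_spu:.1f} GFLOPS per SPU")
```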
12/21 Element Interconnect Bus
- Connects the various on-chip elements: PPE, 8 SPEs, memory controller (MIC), and off-chip I/O interfaces
- Data-ring structure with the control of a bus
- 4 unidirectional rings; 2 run in the opposite direction to the other 2
  - Worst-case latency is therefore only half the distance around the ring
- Each ring is 16 bytes wide and runs at half the core clock frequency (core clock ~3.2 GHz)
13/21 Memory and I/O
- Cell needs a tremendous amount of memory and I/O bandwidth
- Memory technology: Rambus XDR DRAM, supporting a total bandwidth of 25.6 GB/s
- I/O: Rambus FlexIO
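Two claims on this slide can be checked with a little arithmetic: counter-rotating rings cap the worst-case distance at half the ring, and a 16-byte ring at half of a 3.2 GHz clock has a simple peak rate. The Python sketch below is a toy model under stated assumptions: the node count of 12 is assumed for illustration (the slide does not give it), the function names are hypothetical, and arbitration and contention are ignored.

```python
def ring_hops(src, dst, n_nodes):
    """Hops from src to dst when data may travel in either ring direction."""
    one_way = (dst - src) % n_nodes
    other_way = (src - dst) % n_nodes
    return min(one_way, other_way)


def worst_case_hops(n_nodes):
    """Largest shortest-path distance between any two nodes on the ring."""
    return max(ring_hops(0, dst, n_nodes) for dst in range(n_nodes))


def ring_peak_gb_per_s(width_bytes=16, core_clock_ghz=3.2):
    """Peak rate of one ring: bytes per bus cycle times the bus clock (half the core clock)."""
    return width_bytes * (core_clock_ghz / 2)


N_NODES = 12  # assumed number of ring stops, for illustration only
print(worst_case_hops(N_NODES))  # half of N_NODES
print(ring_peak_gb_per_s())      # GB/s for a single ring
```

With only one direction available, the worst case would be `n_nodes - 1` hops, which is why the slide stresses the counter-rotating pairs.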
14/21 Programming the Cell is challenging
- Issues:
  - Dividing the program among different cores
  - Generating instructions in a different instruction set for the 8 SPEs than for the PowerPC core
  - Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
  - An SPU must use coherent DMA transfers through its local store to access system memory
15/21 IBM Approach
- Manually partition the application into separate code segments and use the compiler that targets the appropriate ISA
- For SPUs, SIMD code generation can be done by a parallelizing compiler with auto-SIMDization
- Allocate SPE program data in system memory (shared-memory view) and have the SPE compiler automatically manage the movement of data
  - Naive: the compiler inserts an explicit DMA transfer for each access to shared memory
  - Optimized: employ a software cache mechanism that permits reuse of the temporary buffers in the LS
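The difference between the naive and the software-cache strategies can be illustrated with a toy model. The Python sketch below is not IBM's compiler runtime: the class `SoftwareCache`, its LRU replacement policy, the 128-byte line size, and the 4-line capacity are all assumptions chosen for illustration, and it handles reads only (no write-back). It counts how many simulated DMA transfers a sequence of loads would trigger.

```python
from collections import OrderedDict

LINE_BYTES = 128  # assumed line size, matching the 128-byte DMA transfers on the Local Store slide


class SoftwareCache:
    """Toy read-only model of buffers in the local store caching system memory."""

    def __init__(self, system_memory, capacity_lines=4):
        self.memory = system_memory
        self.capacity = capacity_lines
        self.lines = OrderedDict()  # line index -> bytes, oldest first (LRU order)
        self.dma_transfers = 0

    def _fetch_line(self, line):
        # Stand-in for one DMA transfer from system memory into a local-store buffer
        self.dma_transfers += 1
        start = line * LINE_BYTES
        return self.memory[start:start + LINE_BYTES]

    def load(self, address):
        line, offset = divmod(address, LINE_BYTES)
        if line in self.lines:
            self.lines.move_to_end(line)  # hit: reuse the buffer already in the LS
        else:
            if len(self.lines) >= self.capacity:
                self.lines.popitem(last=False)  # evict the least recently used line
            self.lines[line] = self._fetch_line(line)
        return self.lines[line][offset]


memory = bytes(range(256)) * 4  # 1024 bytes of pretend system memory
cache = SoftwareCache(memory)
values = [cache.load(addr) for addr in range(256)]
print(cache.dma_transfers)  # a naive scheme would issue one transfer per load
```

In this access pattern the 256 sequential loads touch only two lines, so the cache issues two transfers where the naive scheme would issue 256.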
16/21 IBM Approach (contd.)
- Using the SPE linker and an embedding tool, generate a PPE object that contains the SPE binary embedded within its data section
- This PPE object is then linked by the PPE linker with the runtime libraries required for thread creation and management, producing a bound executable for the Cell BE program
17/21 Compiling and Binding of a program on CELL Copyright: IBM
18/21 Programming Models
- Stream processing
  - Serial or parallel pipelines can be set up
  - Example: a set-top box pipeline consists of reading, video and audio encoding, and display
  - Serial: chain SPEs so that each SPE does one subtask
  - Parallel: partition the same subtask among SPEs
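The two pipeline shapes differ only in how work is divided. The Python sketch below models that division with plain functions standing in for SPEs; the function names are hypothetical and everything runs sequentially, so it shows the partitioning logic only, not real concurrency or DMA hand-offs.

```python
def serial_pipeline(stages, frames):
    """Each stage stands for one SPE; every frame passes through all stages in order."""
    output = []
    for frame in frames:
        for stage in stages:
            frame = stage(frame)
        output.append(frame)
    return output


def parallel_partition(stage, frames, n_spes):
    """All SPEs run the same stage; frames are dealt out round-robin among them."""
    shares = [frames[i::n_spes] for i in range(n_spes)]      # one share per SPE
    results = [[stage(f) for f in share] for share in shares]
    merged = [None] * len(frames)
    for i, result in enumerate(results):
        merged[i::n_spes] = result  # re-interleave to restore the original order
    return merged


# Placeholder stages operating on integers instead of real media frames
stages = [lambda x: x + 1, lambda x: x * 2]
print(serial_pipeline(stages, [1, 2, 3]))
print(parallel_partition(lambda x: x * x, list(range(7)), n_spes=3))
```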
19/21 Programming Model
- Function Offload Model
  - Application executes on the PPE
  - Complex library functions invoked by the main application are offloaded onto one or more SPEs
  - Library functions are optimized and recompiled for the SPE environment
  - The SPE executable program is linked into the PPE object module with a small remote function invocation stub
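The stub idea can be sketched in a few lines: the caller sees an ordinary function, while the body runs elsewhere. In the Python analogy below a thread pool stands in for the SPEs; this is not how Cell code is actually built or launched, and the names `_spe_dot` and `dot` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

# Worker threads standing in for 8 SPEs (an analogy only)
_spe_pool = ThreadPoolExecutor(max_workers=8)


def _spe_dot(a, b):
    """'SPE-side' implementation of the offloaded library function."""
    return sum(x * y for x, y in zip(a, b))


def dot(a, b):
    """'PPE-side' stub: same signature as the library call, work runs on a worker."""
    future = _spe_pool.submit(_spe_dot, a, b)
    return future.result()  # block until the remote call completes


# The main application calls the stub as if it were a local function
result = dot([1, 2, 3], [4, 5, 6])
print(result)
```

Because the stub keeps the original signature, the main application does not need to change when a function moves to the co-processor.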
20/21 Current/Future Applications
- Sony PlayStation 3
  - Significant improvement over the PS2
- IBM Blade Server
  - Blade server prototype containing two Cell processors
  - Ran at 2.4 GHz (current systems run at 3.2 GHz), providing 200 GFLOPS of single-precision floating-point performance per CPU
- Mercury
  - Incorporates Cell-based systems into military vehicles
  - Used for target recognition, tracking geo-location, mapping, video processing, etc.