Cell Broadband Processor
Daniel Bagley, Meng Tan
Agenda
- General intro
- History of development
- Technical overview of the architecture
- Detailed technical discussion of components
- Design choices
- Other processors like the Cell
- Programming for the Cell
History of Development
- Sony PlayStation 2
  - Announced March 1999
  - Released March 2000 in Japan
  - 128-bit "Emotion Engine"
  - 294 MHz MIPS CPU
  - Single-precision FP optimizations
  - 6.2 GFLOPS
History Continued
- Partnership between Sony, Toshiba, and IBM
- Summer of 2000: high-level development talks
- Initial goal of 1000x PS2 power
- March 2001: Sony-IBM-Toshiba design center opened; $400M investment
Overall Goals for Cell
- High performance in multimedia apps
- Real-time performance
- Power consumption
- Cost
- Available by 2005
- Avoid memory latency issues associated with control structures
The Cell Itself
- PowerPC-based main core (PPE)
- Multiple SPEs
- On-die memory controller
- Inter-core transport bus
- High-speed I/O
Cell Die Layout
Cell Implementation
- Cell is an architecture
- Preliminary PS3 implementation:
  - 1 PPE
  - 8 SPEs on die, 1 disabled to increase yield (7 usable)
  - 221 mm² die size on a 90 nm process
  - Clocked at 3-4 GHz
  - 256 GFLOPS single precision @ 4 GHz
Why a Cell Architecture
- Follows a trend in computing architecture
- Natural extension of dual- and multi-core
- Extremely low hardware overhead
- Software controllable
- Specialized hardware is more useful for multimedia
Possible Uses
- PlayStation 3 (obviously)
- Blade servers (IBM)
  - Amazing single-precision FP performance
  - Scientific applications
- Toshiba HDTV products
Power Processing Element
- PowerPC instruction set with AltiVec
- Used for general-purpose computing and controlling the SPEs
- Simultaneous multithreading
- Separate 32 KB L1 caches and a unified 512 KB L2 cache
PPE (cont.)
- Slow but power-efficient PowerPC instruction set implementation
- Two-issue, in-order instruction fetch
- Conspicuous lack of an instruction window
- Compare to conventional PowerPC implementations (G5)
- Performance depends on SPE utilization
Synergistic Processing Element (SPE)
- Specialized hardware
- Meant to be used in parallel (7 in the PS3 implementation)
- On-chip memory (256 KB)
- No branch prediction
- In-order execution
- Dual issue
SPE Architecture
- 0.99 µm² cell on a 90 nm process
- 128 registers, each 128 bits wide
  - Instructions assume 4 x 32-bit lanes
- Variant of the VMX instruction set, modified for 128 registers
- On-chip memory is NOT a cache
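The 128-bit registers are treated as four 32-bit lanes. A minimal sketch of that lane model in plain C (the `vec128` type and `vec_add` function are illustrative stand-ins, not the real SPU intrinsics):

```c
#include <assert.h>

/* Hypothetical 4 x 32-bit lane type standing in for one 128-bit SPE register. */
typedef struct { float lane[4]; } vec128;

/* One SIMD add: all four lanes are updated by a single "instruction". */
vec128 vec_add(vec128 a, vec128 b) {
    vec128 r;
    for (int i = 0; i < 4; i++)
        r.lane[i] = a.lane[i] + b.lane[i];
    return r;
}
```

On real hardware the loop body is a single instruction operating on one register pair, which is where the SPE's FP throughput comes from.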
SPE Execution
- Dual issue, in-order
- Seven execution units
- Vector logic: 8 single-precision operations per cycle
- Significant performance hit for double precision
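The 8 ops/cycle figure follows from 4 SIMD lanes each doing a fused multiply-add (2 ops). A quick sanity check of how that scales to the deck's 256 GFLOPS peak, assuming all 8 on-die SPEs active:

```c
#include <assert.h>

/* Peak single-precision throughput implied by the slides:
   4 SIMD lanes x 2 ops (fused multiply-add) = 8 SP ops per SPE per cycle. */
double peak_gflops(int spes, double ghz) {
    const int lanes = 4;
    const int ops_per_lane = 2;   /* multiply + add, fused */
    return spes * lanes * ops_per_lane * ghz;
}
```

Eight SPEs at 4 GHz gives 8 x 8 x 4 = 256 GFLOPS, matching the implementation slide.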
SPE Execution Diagram
SPE Local Storage Area
- NOT a cache
- 256 KB: 4 x 64 KB ECC single-port SRAM
- Completely private to each SPE
- Directly addressable by software
- Can be used as a cache, but only with software control
- No tag bits or any extra hardware
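Because the local store has no tag bits, any cache-like behavior must be built in software. A minimal one-line "software cache" sketch, with the tag kept in an ordinary variable and the DMA replaced by a `memcpy` for illustration (all names here are hypothetical, not a real Cell API):

```c
#include <assert.h>
#include <string.h>

#define LINE_WORDS 4

typedef struct {
    unsigned long tag;              /* which main-memory block the line holds */
    int           valid;
    int           line[LINE_WORDS]; /* stand-in for data copied into local store */
} sw_cache;

/* Stand-in backing store; a real SPE would DMA from main memory instead. */
static const int main_memory[16] =
    {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15};

int sw_cache_read(sw_cache *c, unsigned long addr) {
    unsigned long tag = addr / LINE_WORDS;
    if (!c->valid || c->tag != tag) {                /* software tag check */
        memcpy(c->line, &main_memory[tag * LINE_WORDS],
               sizeof c->line);                      /* "DMA" the line in */
        c->tag = tag;
        c->valid = 1;
    }
    return c->line[addr % LINE_WORDS];
}
```

The tag comparison and fill that a hardware cache does transparently become explicit instructions here, which is exactly the overhead the slide is pointing at.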
SPE LS Scheduling
- Software-controlled DMA to and from main memory
- Scheduling is a HUGE problem, done primarily in software
  - IBM predicts 80-90% utilization ideally
- Request queue handles 16 simultaneous requests
  - Up to 16 KB per transfer
  - Priority: DMA, load/store, fetch
- Fetch/execute parallelism
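The deep DMA request queue exists so software can overlap transfers with compute, typically via double buffering: while one buffer is processed, the next chunk's transfer is already in flight. A serial sketch of the pattern (the `dma_get` helper is illustrative; on a real SPE it would be an asynchronous tagged DMA with a wait before use):

```c
#include <assert.h>
#include <string.h>

#define CHUNK 4

/* Illustrative stand-in for an asynchronous DMA get. */
static void dma_get(int *dst, const int *src) {
    memcpy(dst, src, CHUNK * sizeof *dst);
}

long process_stream(const int *src, int nchunks) {
    int buf[2][CHUNK];   /* two local-store buffers, used alternately */
    long sum = 0;
    if (nchunks > 0)
        dma_get(buf[0], src);                      /* prefetch first chunk */
    for (int i = 0; i < nchunks; i++) {
        if (i + 1 < nchunks)                       /* next transfer "in flight" */
            dma_get(buf[(i + 1) & 1], src + (i + 1) * CHUNK);
        for (int j = 0; j < CHUNK; j++)            /* compute on current buffer */
            sum += buf[i & 1][j];
    }
    return sum;
}
```

With real asynchronous DMA, the inner compute loop and the prefetch for the next chunk proceed concurrently, which is how the predicted 80-90% utilization would be reached.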
SPE Control Logic
- Very little in comparison; represents a shift in focus
- Complete lack of branch prediction
  - Software branch prediction
  - Loop unrolling
  - 18-cycle penalty for a taken mispredicted branch
- Software-controlled DMA
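With no branch predictor and an 18-cycle penalty, SPE code avoids data-dependent branches entirely where it can, computing both outcomes and selecting one with a bitmask (the SPU ISA has a select instruction for exactly this). A scalar C sketch of the idiom:

```c
#include <assert.h>
#include <stdint.h>

/* Branchless max: no data-dependent branch, so no mispredict penalty.
   mask is all-ones when a > b, all-zeros otherwise. */
int32_t select_max(int32_t a, int32_t b) {
    int32_t mask = -(int32_t)(a > b);
    return (a & mask) | (b & ~mask);
}
```

The same transformation applied lane-wise to 128-bit vectors is how SPE code keeps its pipeline full through conditional logic.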
SPE Pipeline
- Little ILP, and thus little control logic
- Dual issue
- Simple commit unit (no reorder buffer or other complexities)
- Same execution unit for FP and integer
SPE Summary
- Essentially a small vector computer
- Based on the AltiVec/VMX ISA
  - Extensions for DMA and LS management
  - Extended to a 128 x 128-bit register file
- Uniquely suited for real-time applications
- Extremely fast for certain FP operations
- Offloads a large amount onto the compiler/software
Element Interconnect Bus
- 4 concentric rings connecting all Cell elements
- 128-bit-wide interconnects
EIB (cont.)
- Designed to minimize coupling noise
- Rings carry data in alternating directions
- Buffers and repeaters at each SPE boundary
- Architecture can be scaled up, at the cost of increased bus latency
EIB (cont.)
- Total bandwidth of ~200 GB/s
- EIB controller located physically in the center of the chip, between the SPEs
- Controller reserves channels for each individual data transfer request
- Implementation allows the SPE array to be extended horizontally
Memory Interface
- Rambus XDR memory keeps the Cell at full utilization
- 3.2 Gbps effective data rate per pin on the XDR interface
- Dual-channel XDR with four devices and 16-bit-wide buses: 25.6 GB/s total memory bandwidth
Input/Output Bus
- Rambus FlexIO bus
- I/O interface consists of 12 unidirectional byte lanes
- Each lane supports 6.4 GB/s of bandwidth
- 7 outbound lanes, 5 inbound lanes
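Multiplying the slide's lane counts by the per-lane rate gives the aggregate I/O figures, and shows the interface is deliberately asymmetric (more outbound than inbound capacity):

```c
#include <assert.h>

/* FlexIO aggregate bandwidth from the slide's figures:
   each unidirectional byte lane carries 6.4 GB/s. */
double flexio_gbs(int lanes) {
    return lanes * 6.4;
}
```

7 outbound lanes give 44.8 GB/s out, 5 inbound give 32 GB/s in, for 76.8 GB/s raw total.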
Design Choices
- In-order execution
  - Abandoning ILP (ILP yields only a 10-20% increase per generation)
  - Reduces control logic
  - Real-time responsiveness
- Cache design
  - Software-configured local store on the SPE
  - Standard L2 cache on the PPE
Cell Programming Issues
- No Cell compiler exists that manages SPE utilization at compile time
- SPEs do not natively support context switching; it must be OS-managed
- SPEs are vector processors, not efficient for general-purpose computation
- The PPE and SPEs use different instruction sets
Cell Programming (cont.)
Functional Offload Model
- Simplest model for Cell programming
- Optimize existing libraries for SPE computation
- Requires no rebuild of the main application logic, which runs on the PPE
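The point of the functional offload model is that the application keeps calling the same library entry point; only the library's internals change. A minimal sketch of that seam (all function names here are illustrative, not a real Cell API):

```c
#include <assert.h>

/* Reference implementation that would run on the PPE. */
static long dot_ppe(const int *a, const int *b, int n) {
    long s = 0;
    for (int i = 0; i < n; i++)
        s += (long)a[i] * b[i];
    return s;
}

/* Public library entry point. In an offloaded build this would marshal the
   buffers to an SPE and wait for the result; the caller cannot tell the
   difference, so the application logic never needs rebuilding. */
long dot(const int *a, const int *b, int n) {
    return dot_ppe(a, b, n);   /* offloaded build: dispatch to an SPE kernel */
}
```

Swapping `dot_ppe` for an SPE-side kernel behind the unchanged `dot` signature is the entire porting effort under this model.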
Cell Programming (cont.)
Device Extension Model
- Take advantage of SPE DMA
- Use SPEs as interfaces to external devices
Cell Programming (cont.)
Computational Acceleration Model
- Traditional supercomputing methods on Cell
- Shared-memory or message-passing paradigms for accelerating inherently parallel math operations
- Math-intensive libraries can be replaced without rewriting applications
Cell Programming (cont.)
Streaming Model
- Use the Cell processor as one large programmable pipeline
- Partition algorithms into logically sensible steps
- Execute each step separately, in series, on separate processors
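A serial sketch of the streaming model's structure: the algorithm is split into stages, and on Cell each stage would run on its own SPE with the EIB moving buffers between local stores. Here the stages are simulated in series on one core, and both stage functions are illustrative:

```c
#include <assert.h>

#define N 4

/* Two pipeline stages; on Cell each would be a separate SPE program. */
static void stage_scale(int *buf)  { for (int i = 0; i < N; i++) buf[i] *= 2; }
static void stage_offset(int *buf) { for (int i = 0; i < N; i++) buf[i] += 1; }

void run_pipeline(int *buf) {
    /* The hand-off between calls is where the EIB would transfer the
       buffer from one SPE's local store to the next. */
    stage_scale(buf);
    stage_offset(buf);
}
```

With multiple buffers in flight, every stage works on a different buffer simultaneously, so throughput approaches one buffer per stage time.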
Cell Programming (cont.)
Asymmetric Thread Runtime Model
- Abstract the Cell architecture away from the programmer
- The OS schedules different threads onto the various processors
Sample Performance
- Demonstration physics engine for a real-time game
- http://www.research.ibm.com/cell/whitepapers/cell_online_game.pdf
- 182 compute-to-DMA ratio on the SPEs
- For the right tasks, the Cell architecture can be extremely efficient