
1 The IBM Cell Processor – Architecture and On-Chip Communication Interconnect

2 References
[1] Kevin Krewell. "Cell Moves Into the Limelight." Microprocessor Report, {2/14/05-01}.
[2] Michael Kistler, Michael Perrone, and Fabrizio Petrini. "Cell Multiprocessor Communication Network: Built for Speed." IEEE Micro, 26(3), May/June 2006.
[3] Cell Broadband Engine resource center, ibm.com/developerworks/power/cell/
[4] H. Peter Hofstee. "Introduction to Cell Broadband Engine."

3 Agenda
- Performance highlights of Cell
- Real-time enhancements
- Target applications
- Paper I (Cell Moves Into the Limelight)
- Paper II (Cell Multiprocessor Communication Network)
- Cell performance overview
- Programming model
- Power management
- Drawbacks

4 Performance Highlights of Cell
- Delivers 204.8 GFlop/s single-precision and 14.6 GFlop/s double-precision floating-point performance
- Supports virtualization and large pages from the Power architecture
- Aggregate memory bandwidth of 25.6 GB/s at 3.2 GHz
- Configurable I/O interface capable of a raw bandwidth of up to 25 GB/s inbound and 35 GB/s outbound
- EIB supports a peak bandwidth of 204.8 GB/s
- Extensible timers and counters to manage the real-time response of the system
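
A quick sanity check on the 25.6 GB/s memory figure, assuming the commonly described configuration of two 32-bit Rambus XDR channels at an effective 3.2 Gb/s per pin (the channel width and per-pin data rate are not stated on this slide):

\[
2~\text{channels} \times 32~\text{bits/channel} \times 3.2~\text{Gb/s} \div 8~\text{bits/byte} = 25.6~\text{GB/s}
\]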

5 Real Time Enhancements
- Resource Reservation system for reserving bandwidth on shared units such as system memory and I/O interfaces
- L2 Cache Locking system based on Effective or Real Address ranges
  - Supports both locking for streaming and locking for high reuse
- TLB Locking system based on Effective or Real Address ranges or DMA class
- Fully pre-emptible context-switching capability for each SPE
- Privileged Attention Event to the SPE for use in cooperative, lightweight context switching
- Multiple concurrent large-page support in the PPE and SPE to minimize real-time impact due to TLB misses
- Up to 4 software-controlled service classes for DMA commands (improves parallelism)
- Large-page I/O Translation facility for I/O devices, graphics subsystems, etc., to minimize I/O translation cache misses
- SPE Event Handling facilities for high-priority task notification
- PPE SMT thread-priority controls for low-, medium-, and high-priority instruction dispatch

6 Target Applications
- Advanced visualization
  - Ray tracing
  - Ray casting
  - Volume rendering
- Streaming applications
  - Media encoders and decoders
  - Streaming encryption and decryption
- Fast Fourier Transforms (single precision)
- E.g. the Sony PlayStation 3
- Scientific and parallel applications in general

7 CBE Architecture – Block Diagram of the Cell Processor (figure)

8 CBE Architecture – Overview
- 64-bit Power architecture forms the foundation
- Dual-thread Power Processor Element (PPE)
- Eight Synergistic Processor Elements (SPEs)
- On-chip Rambus XDR controller with support for two banks of Rambus XDR memory
- The Cell production die has 235 million transistors and measures 235 mm²
- Cell doesn't include networking peripherals or large memory arrays on chip
- Reaches high performance through a high clock speed and a high-performance XDR DRAM interface

9 CBE Architecture – Chip Layout

10 CBE Architecture – Power Core
- In-order, two-issue superscalar design
- 21-clock-cycle pipeline
- Support for simultaneous multithreading (up to 2 threads)
  - Round-robin scheduling
  - Duplicated register files, program counters, and parallel instruction buffers (before the decode stage)
- 512 KB on-chip L2 cache
- Mis-predicted branch: 8-cycle penalty
- Load: 4-cycle data-cache access time
- Big-endian processor

11 CBE Architecture – SPEs
- SIMD-RISC instruction set
- 128-entry, 128-bit unified register file for all data types
- 4-way SIMD capability (optional)
- "Branch hint" instructions instead of branch-prediction logic in hardware: software-controlled branch prediction
- Can complete up to two instructions per cycle
- Can perform a load, store, shuffle, channel, or branch operation in parallel with a computation
- Not multi-threaded
  - Avoids miss penalties by having all data present all the time
  - Reduces scheduling complexity and die-area requirement
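
To make the 4-way SIMD datapath and software-controlled branch hinting concrete, here is a minimal sketch using the SPU C language extensions (spu_intrinsics.h). The function name saxpy_spu, its arguments, and the assumption that the data already sits in the local store are illustrative, not taken from the slides:

```c
/* Minimal SPU SIMD sketch: y = a*x + y over 4-wide float vectors.
 * Assumes the arrays already reside in the SPE local store and that
 * n is a multiple of 4 (one "vector float" holds four 32-bit floats). */
#include <spu_intrinsics.h>

void saxpy_spu(float a, vector float *x, vector float *y, int n)
{
    vector float va = spu_splats(a);        /* replicate the scalar into all 4 lanes */
    int nv = n / 4;                         /* number of 128-bit vectors */

    /* __builtin_expect lets the compiler insert a branch hint for the
     * loop-back branch, standing in for the absent hardware predictor. */
    for (int i = 0; __builtin_expect(i < nv, 1); i++)
        y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add per lane */
}
```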

12 CBE Architecture – SPEs [2]
- The SPE is capable of limited dual-issue operation
- Improper alignment of instructions causes a swap operation, forcing single-issue operation

13 CBE Architecture – Memory Model
- Power core
  - 32 KB 2-way instruction cache and 32 KB 4-way set-associative data cache
- 256 KB local store on each SPE, 6-cycle load latency
  - Software must manage data in and out of the local store
  - Controlled by the Memory Flow Controller
  - Does not participate in hardware cache coherency
  - Aliased in the memory map of the processor: the PPE can load and store from a memory location mapped to the local store (slow)
- An SPE can use the DMA controller to move data to its own or other SPEs' local stores, and between the local store and main memory as well as I/O interfaces
- The Memory Flow Controller on an SPE can begin to transfer the data set of the next task while the present one is running: double buffering
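
The double-buffering idea on this slide can be sketched with the MFC DMA calls from the SDK's spu_mfcio.h (mfc_get, mfc_write_tag_mask, mfc_read_tag_status_all). The chunk size, buffer names, and the process() compute kernel are assumptions made for the example:

```c
/* Double-buffering sketch on an SPE: while buf[cur] is being processed,
 * the MFC streams the next chunk of main memory into buf[cur ^ 1]. */
#include <spu_mfcio.h>
#include <stdint.h>

#define CHUNK 16384                            /* one DMA transfer, max 16 KB */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

void process(char *data, unsigned size);       /* user compute kernel (assumed) */

void stream(uint64_t base_ea, unsigned nchunks)
{
    unsigned cur = 0;

    /* Prime the pipeline: fetch chunk 0 into buffer 0, tag group 0. */
    mfc_get(buf[0], base_ea, CHUNK, 0, 0, 0);

    for (unsigned i = 0; i < nchunks; i++) {
        unsigned nxt = cur ^ 1;

        /* Start the next transfer (tag = nxt) before touching current data. */
        if (i + 1 < nchunks)
            mfc_get(buf[nxt], base_ea + (uint64_t)(i + 1) * CHUNK, CHUNK, nxt, 0, 0);

        /* Block only on the tag group of the buffer we are about to use. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        process((char *)buf[cur], CHUNK);
        cur = nxt;
    }
}
```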

14 CBE Architecture – Memory Model [2]
- Only quad-word transfers from the SPE local store
  - The single-ported local store supports 1024-bit DMA transfers with quad-word enables
- The local store supports both a wide 128-byte and a narrow 16-byte access
- DMA reads occupy a single cycle for 128 bytes
- Access to the local store is prioritized (to resolve port conflicts):
  - DMA and PPE transfers occupy the highest priority
  - SPE loads and stores occupy the second-highest priority
  - SPE instruction prefetch gets the lowest priority

15 Memory Flow Controller (MFC)
- Local to each SPU; connects it to the EIB
  - SPU ↔ MFC communication via the SPU channel interface
  - Separate read/write channels with blocking and non-blocking semantics
- The MFC runs at the same frequency as the EIB
- Accepts and processes DMA commands issued asynchronously by the SPU or PPE through the channel interface or memory-mapped I/O (MMIO) registers
- Supports naturally aligned transfers of 1, 2, 4, or 8 bytes, or a multiple of 16 bytes, up to a maximum of 16 KB
- DMA list: up to 2048 DMA transfers using a single MFC DMA command
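
The DMA-list feature can be sketched as below; the element layout (mfc_list_element_t) and the mfc_getl() argument order follow my reading of the SDK's spu_mfcio.h and should be checked against the headers, and the element count, sizes, and strides are made up for illustration:

```c
/* DMA-list sketch: gather N scattered 4 KB regions of main memory into the
 * local store with a single MFC command (up to 2048 elements per list). */
#include <spu_mfcio.h>
#include <stdint.h>

#define N 8                                    /* list elements, <= 2048 */

static volatile mfc_list_element_t list[N] __attribute__((aligned(8)));
static volatile char dst[N * 4096] __attribute__((aligned(128)));

void gather(uint64_t ea_base)
{
    for (int i = 0; i < N; i++) {
        list[i].notify = 0;                    /* no stall-and-notify here */
        list[i].size   = 4096;                 /* bytes for this element */
        /* Low 32 bits of the effective address; the high word is shared
         * across the whole list and comes from ea_base below. */
        list[i].eal    = (uint32_t)(ea_base + (uint64_t)i * 65536);
    }

    /* One command queues all N transfers under tag group 5. */
    mfc_getl(dst, ea_base, list, N * sizeof(mfc_list_element_t), 5, 0, 0);

    mfc_write_tag_mask(1 << 5);
    mfc_read_tag_status_all();
}
```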

16 CBE Architecture – Communication
- Element Interconnect Bus (EIB)
  - A data-ring structure with a control bus
  - Each ring is 16 bytes wide and runs at half the core clock frequency, allowing 3 concurrent data transfers per ring as long as their paths don't overlap
  - Four unidirectional rings, two running in each direction, which implies a worst-case latency of only half the distance around the ring
  - Manages token transactions
  - Separate communication paths for command and data
  - Each bus element is connected through a point-to-point link to the address concentrator
  - The arbiter schedules transfers so they do not interfere with in-flight transactions; it gives highest priority to the memory controller (MIC) and serves the rest round-robin
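
A back-of-envelope check of the 204.8 GB/s peak quoted on slide 4, following the reasoning in [2]: the snooped command bus can start at most one 128-byte transfer per bus cycle, and the bus runs at half the 3.2 GHz core clock:

\[
128~\text{B/command} \times 1.6~\text{GHz} = 204.8~\text{GB/s}
\]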

17 CBE Architecture – Communication [2]: Element Interconnect Bus (figure)

18 CBE Architecture – Communication [3]
- The I/O interface can be configured as two logical interfaces
- MMIO allows easy access to I/O from the PPE and SPEs
- Interrupts from an SPE and memory-flow-controller events are treated as external interrupts to the PPE
- Two Cell processors can be connected via IOIF0 to form one coherent Cell domain using the BIF protocol
- Signal notification: two channels
- Mailboxes: 32-bit communication channels between the PPE and each SPE
  - One four-entry, read-blocking inbound mailbox
  - Two single-entry, write-blocking outbound mailboxes
- Special operations to support synchronization mechanisms
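
An SPU-side sketch of the mailbox handshake described above, using spu_read_in_mbox() and spu_write_out_mbox() from spu_mfcio.h; the CMD_EXIT encoding is invented for the example:

```c
/* SPU-side mailbox loop: block on the 4-entry, read-blocking inbound mailbox
 * for a command word from the PPE, then post a 32-bit status word to the
 * write-blocking outbound mailbox. */
#include <spu_mfcio.h>

#define CMD_EXIT 0xffffffffu                     /* illustrative sentinel value */

int main(void)
{
    for (;;) {
        unsigned int cmd = spu_read_in_mbox();   /* blocks until the PPE writes */
        if (cmd == CMD_EXIT)
            break;
        /* ... process the work item identified by cmd ... */
        spu_write_out_mbox(cmd);                 /* blocks if the entry is full */
    }
    return 0;
}
```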

19 CBE Architecture – DMA: Basic flow of a DMA transfer (figure)

20 DMA Latency

21 Interconnect Performance: Latency and bandwidth against DMA message size in the absence of contention (figure)

22 Interconnect Performance [2]

23 Interconnect Performance [3]

24 Interconnect Performance [4]

25 Interconnect Performance [5]

26 Cell vs. Sony Emotion Engine

27 CBE Programming
- Tool chain for Cell is built on PowerPC Linux
- Programming of the SPE is based on C, with limited C++ support
- Debugging tools include extensions for ptrace and an extended GNU debugger (GDB)
- Programming models:
  - Pipeline model
  - Parallel model
  - Combination of the two
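
A PPE-side sketch of the parallel programming model using libspe2 (spe_context_create, spe_program_load, spe_context_run), with one pthread per SPE context; the embedded SPE program handle name spe_kernel is an assumption:

```c
/* PPE-side "parallel model" sketch: run the same SPE program on 8 contexts,
 * one POSIX thread per SPE, and wait for them all to finish. */
#include <libspe2.h>
#include <pthread.h>
#include <stdio.h>

extern spe_program_handle_t spe_kernel;        /* embedded SPE ELF (assumed name) */

static void *spe_thread(void *arg)
{
    spe_context_ptr_t ctx = (spe_context_ptr_t)arg;
    unsigned int entry = SPE_DEFAULT_ENTRY;

    /* Blocks this PPE thread until the SPE program exits. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");
    return NULL;
}

int main(void)
{
    enum { NSPE = 8 };
    spe_context_ptr_t ctx[NSPE];
    pthread_t tid[NSPE];

    for (int i = 0; i < NSPE; i++) {
        ctx[i] = spe_context_create(0, NULL);
        spe_program_load(ctx[i], &spe_kernel);
        pthread_create(&tid[i], NULL, spe_thread, ctx[i]);
    }
    for (int i = 0; i < NSPE; i++) {
        pthread_join(tid[i], NULL);
        spe_context_destroy(ctx[i]);
    }
    return 0;
}
```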

28 Power Management
- Capable of being clocked at one-eighth the normal speed when idling
- Multiple power-management states available to privileged software
  - Active, slow, pause, state retained and isolated (SRI), and state lost and isolated (SLI)
  - Each progressively more aggressive in saving power
  - Software controls the transitions, but they can be linked to external events
  - In the SLI state the device is effectively shut off from the system

29 Drawbacks
- A full SPE context switch is relatively expensive
  - This can negatively affect virtualization of SPEs if not properly handled
- This instantiation of Cell is not suitable for double-precision math
  - No support for IEEE 754 precise mode
  - Use by supercomputer applications will require further development