Introduction to the Cell multiprocessor J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, D. Shippy (IBM Systems and Technology Group)

J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy IBM Systems and Technology Group IBM Journal of Research and Development.

Introduction to the Cell multiprocessor J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, D. Shippy (IBM Systems and Technology Group) Presented by Max Shneider

Additional Information Source: This presentation also contains more general information on Cell, found here: Cell Architecture Explained, Version 2 © Nicholas Blachford,

Background Info – Caches: Memory hierarchy, ordered from small and fast to large and slow: Registers, L1 Cache, L2 Cache, Memory, Hard Drive.

Background Info – Pipelining stage pipeline (each stage uses a different resource): … Instead of waiting for each instruction to complete before starting the next: Overlap the instructions to maximize resources:
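The benefit of overlapping instructions can be checked with a small cycle count. This is a minimal sketch assuming an idealized pipeline with no stalls; the stage count of 5 and the function names are illustrative assumptions, not values from the slides.

```python
def sequential_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when each instruction finishes every stage before the next starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Cycles when a new instruction enters the pipeline every cycle (no stalls)."""
    if num_instructions == 0:
        return 0
    # The first instruction needs num_stages cycles; each later one adds one cycle.
    return num_stages + (num_instructions - 1)


STAGES = 5  # assumed pipeline depth, chosen only for illustration
for n in (1, 10, 1000):
    print(n, sequential_cycles(n, STAGES), pipelined_cycles(n, STAGES))
```

For long instruction streams the speedup approaches the number of stages, which is why stalls (for example from branches) are costly.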

Project History: Collaboration between Sony, IBM, and Toshiba, initiated in 2000. The STI Design Center opened in Austin, TX with a joint investment of $400,000,000. Originally designed for the PS3, 100x faster than the PS2. Since it’s general purpose, it will be used for more than gaming (Blu-ray, HDTV, HD camcorders, IBM servers). A mixture of broadband, entertainment systems, and supercomputers (hence the three companies). Can think of it as: a computer that acts like cells in a biological system; a supercomputer on a chip designed for the home.

Program Objectives/Challenges: 1. Outstanding performance, especially on game/multimedia applications. Limited by: (a) Memory latency and bandwidth. Memory latencies have not kept up with processor frequencies, and the gap is getting worse every day. It is worse for multiprocessors (~1,000 cycles) than for single processors (100s of cycles). Conventional processors sustain few concurrent memory accesses because of cache misses and data dependencies, so the goal is more simultaneous memory transactions (bandwidth). (b) Power. Cooling imposes limits on the amount of power, so power efficiency must improve along with performance, since no alternative lower-power technology is available.
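Why high latency calls for many simultaneous memory transactions can be estimated with Little's law (data in flight = latency x bandwidth). This is a rough sketch: the ~1,000-cycle latency and 4 GHz clock come from the slides, while the 10 GB/sec target bandwidth and 128-byte transfer size are hypothetical values chosen only for illustration.

```python
import math


def transfers_in_flight(latency_cycles: float, clock_hz: float,
                        bandwidth_bytes_per_s: float, transfer_bytes: int) -> int:
    """Outstanding transfers needed to sustain a bandwidth despite memory latency."""
    latency_s = latency_cycles / clock_hz
    bytes_in_flight = latency_s * bandwidth_bytes_per_s  # Little's law
    return math.ceil(bytes_in_flight / transfer_bytes)


# ~1,000 cycles at 4 GHz (from the slides); 10 GB/s and 128 B are assumptions.
needed = transfers_in_flight(1000, 4e9, 10e9, 128)
print(needed)
```

Under these assumptions, a processor that can only keep a handful of accesses outstanding cannot reach the target bandwidth, whatever the raw speed of the memory.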

Objectives (cont.): 2. Real-time responsiveness to the user and network. The primary focus is keeping the player/user satisfied (not keeping the CPU busy), so real-time support is needed. Since most Cell processor devices will be connected to the Internet, Cell must be programmable and flexible to support a wide variety of standards. Concerns: security, digital rights management, privacy. 3. Applicability to a wide range of platforms. Game/media and Internet use implies a wide range of applications. An open (Linux-based) software development environment was developed to extend the reach of the architecture. 4. Support for introduction in 2005. Cell was based on the Power Architecture because 4 years was not enough to make a completely new design.

Cell Architecture: Composed of hardware/software cells that can cooperate on problems. Can potentially scale up and distribute cells over a network or the world. Performs 10x faster than existing CPUs on many applications. Similar to GPUs in that it gives higher performance, but can be used on a wider variety of tasks. Each individual cell can theoretically perform 256 GFLOPS at 4 GHz, with power consumption between Watts

Cell Components: 1 Power Processor Element (PPE); 8 Synergistic Processor Elements (SPEs)*; Element Interconnect Bus (EIB); Direct Memory Access Controller (DMAC); Rambus XDR memory controller; Rambus FlexIO (input/output) interface. * The PS3 will only have 7 SPEs, and consumer electronics will only have 6.

PPE: Runs the operating system and most of the applications, but offloads computationally intensive tasks to the SPEs. A 64-bit Power Architecture processor with 32-KB Level 1 instruction and data caches and a 512-KB Level 2 cache. Simpler than other processors: it can’t do things like reordering instructions (it is in-order), but it requires far less power (it has less circuitry). Can run two threads simultaneously (which means it keeps busy when one thread is stalled and waiting). Erratic performance on branch-heavy applications (a result of pipelining), so it requires a good compiler. Composed of 3 units: 1. Instruction unit (IU) – fetches/decodes instructions, includes the L1 instruction cache. 2. Fixed-point execution unit (XU) – fixed-point, load/store, and branch instructions. 3. Vector-scalar unit (VSU) – vector and floating-point instructions.

PPE

SPE: Each of the 8 SPEs acts as an independent processor. Given the right task, each can perform as well as a top-end CPU: 32 GFLOPS (so 32 x 8 = 256 GFLOPS for the system). Each SPE has a 256-KB local memory store instead of a cache (more like a second-level register file). This is less complex and faster, with no coherency problems. DMA transfers move data between the local store and system memory, which allows many simultaneous memory transactions and makes it easy to add more SPEs to the system. SPEs are vector processors – they can do multiple operations simultaneously with the same instruction. Programs need to be “vectorized” to take full advantage (this can be done with audio, video, 3D graphics, and scientific applications).
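The peak figures quoted for the SPEs can be reproduced with simple arithmetic. This sketch assumes 4 single-precision SIMD lanes per SPE and a fused multiply-add counted as 2 floating-point operations per lane per cycle; those two factors are not stated in the slides, and only the 4 GHz clock, the 8 SPEs, and the 32 and 256 GFLOPS totals are.

```python
def peak_gflops(clock_ghz: float, simd_lanes: int,
                flops_per_lane_per_cycle: int, num_units: int = 1) -> float:
    """Theoretical peak: every lane of every unit busy on every cycle."""
    return clock_ghz * simd_lanes * flops_per_lane_per_cycle * num_units


# Assumed: 4 lanes, multiply-add = 2 flops. From the slides: 4 GHz, 8 SPEs.
per_spe = peak_gflops(4.0, simd_lanes=4, flops_per_lane_per_cycle=2)
whole_chip = peak_gflops(4.0, simd_lanes=4, flops_per_lane_per_cycle=2, num_units=8)
print(per_spe, whole_chip)
```

These are theoretical peaks; code that is not vectorized uses only a fraction of the lanes and falls well short of them.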

SPE

SPE Chaining (Stream Processing): An SPE reads input from its local store, does processing, and stores the result back in its local store. The next SPE can read the output from the first SPE’s local store, do processing, … An absolute timer allows exactly timed stream processing. Multiple communication streams between SPEs allow this. Internal SPE transfers: 100s of GB/sec. Chip-to-chip transfers: 10s of GB/sec.
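The chaining idea can be modeled in plain Python with generators, where each stage stands in for one SPE and a list stands in for its local store. This is only a sketch of the data flow: `spe_stage`, `build_chain`, and the two kernels are hypothetical names, and real SPEs would run concurrently rather than lazily in one thread.

```python
from typing import Callable, Iterable, Iterator, List


def spe_stage(kernel: Callable[[int], int],
              upstream: Iterable[List[int]]) -> Iterator[List[int]]:
    """One stage: take a block from the previous stage, process it, pass it on."""
    for block in upstream:
        local_store = [kernel(x) for x in block]  # result kept in this stage's buffer
        yield local_store                         # the next stage reads it from here


def build_chain(kernels: List[Callable[[int], int]],
                source: Iterable[List[int]]) -> Iterator[List[int]]:
    """Connect the stages so each one consumes the previous stage's output."""
    stream: Iterable[List[int]] = source
    for kernel in kernels:
        stream = spe_stage(kernel, stream)
    return iter(stream)


# Two hypothetical kernels: add one, then double.
chain = build_chain([lambda x: x + 1, lambda x: x * 2], [[1, 2], [3]])
output = list(chain)
print(output)
```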

Additional Cell Components: DMAC – controls memory access for the PPE and SPEs. XDR RAM memory – Cell can be configured to have GB’s of memory GB/sec bandwidth (higher than any PC, but necessary to feed SPEs). EIB – connects everything together, allows 3 simultaneous transfers, peaking at 384 GB/sec. Rambus FlexIO interface – high bandwidth (76.8 GB/sec) and flexible to support multiple configurations (dual-processors, 4-way multiprocessors, etc.). IBM’s virtualization software – allows multiple operating systems to run at the same time.

Additional Cell Components (cont.): Power architecture compatibility – based on the Power architecture, so all existing Power applications can be run on the Cell processor without modification. Single-instruction, multiple-data (SIMD) architecture – SIMD units effectively accelerate multimedia applications. They also have mature software support (since they’re included in all mainstream PC processors). SIMD units are on the PPE and all SPEs, which simplifies software development and migration. DRM-like security – each SPE can lock most of its local store for its own use only.

Programming: The architecture is great, but it can’t be used without software. SPE local store memory has to be managed manually (this could eventually be handled by compilers), which is more efficient but adds complexity for developers. Also, limited ability to change hardware in the future. It is up to programmers to utilize the SIMD units for the best performance benefits. The primary development language is C with standard multithreading. The primary OS is Linux (since it already ran on the PowerPC).

Converting an Application to Cell: Requires the following steps: 1. Port the application to the PowerPC instruction set. 2. Figure out which parts of the code should run on SPEs, and make those self-contained. SPEs are best suited for small, repetitive tasks that can be vectorized or parallelized. 3. Vectorize the code, use the SIMD units properly, and balance data flow to make the most efficient use of the SPEs. Multiple execution threads and careful choice of algorithms and data flow control are all necessary (same as for multiprocessors). Cell will still suffer from the same things as a standard PC (i.e., lots of random memory reads). Must worry about size – the algorithm and at least some of the data need to fit within the local store.
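The size constraint in the last point can be sketched as a tiling calculation: split the data into tiles small enough that the code plus an input and an output buffer fit in the 256-KB local store. Everything other than the 256-KB figure is an assumption for illustration (the 64-KB code size, 4-byte elements, two buffers, and the function names), and list slicing merely stands in for DMA transfers.

```python
from typing import Callable, List

LOCAL_STORE_BYTES = 256 * 1024  # local store size given in the slides


def tile_elements(code_bytes: int, element_bytes: int, buffers: int = 2) -> int:
    """Largest tile (in elements) such that code plus all buffers fit."""
    free = LOCAL_STORE_BYTES - code_bytes
    if free <= 0:
        raise ValueError("code alone does not fit in the local store")
    return free // (buffers * element_bytes)


def process_in_tiles(data: List[int], kernel: Callable[[int], int],
                     code_bytes: int, element_bytes: int) -> List[int]:
    """Process data one tile at a time, as a kernel with a small local store must."""
    n = tile_elements(code_bytes, element_bytes)
    if n == 0:
        raise ValueError("not even one element fits next to the code")
    out: List[int] = []
    for start in range(0, len(data), n):
        tile = data[start:start + n]         # stands in for a DMA into the local store
        out.extend(kernel(x) for x in tile)  # compute only on local data
    return out


# Hypothetical sizes: 64 KB of code, 4-byte elements.
print(tile_elements(64 * 1024, 4))
```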

Programming Models: Function offload – the main application executes on the PPE and offloads complex functions to SPEs (currently identified by the programmer, might be done by the compiler in future). Device extension – use SPEs as intelligent front-ends to external devices (can lock local store for security/privacy). Computational acceleration – perform computationally intensive tasks on SPEs, parallelizing the work if necessary. Streaming – set up serial or parallel pipelines, as explained earlier (PPE controls, SPEs process). Shared-memory multiprocessor – set up Cell as a multiprocessor, with PPE and SPE units interoperating on shared memory. Asymmetric thread runtime – organize programs in threads that can be run on the PPE or SPEs.
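The function-offload model can be imitated on an ordinary machine with a worker pool: the main thread plays the PPE and hands chunks of a heavy computation to workers that stand in for SPEs. This is an analogy, not Cell code; `heavy_kernel` and `offload` are hypothetical names, and Python's standard `concurrent.futures.ThreadPoolExecutor` replaces the real SPE runtime.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

NUM_SPES = 8  # number of SPEs given in the slides


def heavy_kernel(chunk: List[int]) -> int:
    """The compute-intensive function the main program chooses to offload."""
    return sum(x * x for x in chunk)


def offload(data: List[int], num_workers: int = NUM_SPES) -> int:
    """Main ("PPE") side: split the work, dispatch it, combine partial results."""
    chunk_size = max(1, -(-len(data) // num_workers))  # ceiling division
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = list(pool.map(heavy_kernel, chunks))
    return sum(partials)


total = offload(list(range(1000)))
print(total)
```

The split-dispatch-combine shape is the part that carries over: on Cell the programmer currently decides which functions to offload, as the slide notes.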

Meeting the Design Objectives: 1. Outstanding performance, especially on game/multimedia applications: SPE local stores instead of caches, 256 KB in size to ease programmability; the SIMD model accelerates multimedia applications; considerable bandwidth and flexibility inside the chip. 2. Real-time responsiveness to the user and network: can interact with individual SPEs through their DMAs; the simplicity of the SPEs (no cache, etc.) makes it easier to analyze their performance. 3. Applicability to a wide range of platforms: can be used for a number of different purposes because of Linux (as opposed to a proprietary OS). 4. Support for introduction in 2005: met the goal by basing Cell on the Power Architecture, which also helps compatibility. SIMD on the PPE and SPEs eases programming.

Future Potential: Multiple discrete computers become multiple computers in a single system. Upgrade a system by enhancing it (adding more Cells) instead of replacing it. Your "computer" might include your PDA, TV, printer, and camcorder (basically a network). Moves hardware complexity to system software: slower, but provides more flexibility. The OS takes care of system changes, so the programmer doesn’t need to worry about it.

Future Potential (cont.)

Discussion: Any questions?