Kevin Eady, Ben Plunkett, Prateeksha Satyamoorthy

History
Jointly designed by Sony, Toshiba, and IBM (STI)
Design began in March 2001
First used in Sony's PlayStation 3
IBM's Roadrunner cluster contains over 12,000 Cell processors

Cell Broadband Engine
Nine cores:
– One Power Processing Element (PPE): the main processor
– Eight Synergistic Processing Elements (SPEs): fully functional co-processors, each comprised of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC)
Designed for stream processing

Power Processing Element
In-order, dual-issue design
64-bit Power Architecture
Two 32 KB L1 caches (instruction and data) and one 512 KB L2 cache
Instruction Unit: handles instruction fetch, decode, branch, issue, and completion
– Fetches 4 instructions per cycle per thread into an instruction buffer
– Dispatches instructions from the buffer, dual-issuing them to the Execution Unit
Branch prediction: 4-KB x 2-bit branch history table

Pipeline depth: 23 stages

Synergistic Processing Element
Implements a new instruction-set architecture
Each SPU contains a dedicated DMA management queue
256 KB of local store memory
– Stores both instructions and data
– Data is transferred via DMA between the local store and system memory
No data-load or branch prediction hardware
– Relies on "prepare-to-branch" (branch hint) instructions to pre-fetch instructions
– Loads at least 17 instructions starting at the branch target address
Two instructions per cycle
– 128-bit SIMD
– In-order, dual-issue, statically scheduled
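As a rough illustration of the DMA-based local-store model described above, the sketch below shows the SPU side of a single transfer using the Cell SDK's spu_mfcio.h intrinsics (mfc_get, mfc_put, tag-group waits). The buffer size, the tag value, and the assumption that the effective address arrives in argp are illustrative choices for this example, not details taken from the slides.

/* Minimal SPU-side sketch: pull a block of data from main memory into the
 * 256 KB local store via DMA, compute on it, and push the results back.
 * CHUNK, the tag id, and the use of argp as the effective address are
 * assumptions made for this example.
 */
#include <spu_mfcio.h>

#define CHUNK 4096                       /* multiple of 16, max 16 KB per DMA */

static volatile float buf[CHUNK / sizeof(float)] __attribute__((aligned(128)));

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    unsigned int tag = 1;                /* DMA tag group (0..31) */
    (void)speid; (void)envp;

    /* Start an asynchronous get: main memory (argp) -> local store (buf). */
    mfc_get(buf, argp, CHUNK, tag, 0, 0);

    /* Wait for all transfers in this tag group to complete. */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    /* ... compute on buf in the local store ... */

    /* Write results back: local store -> main memory. */
    mfc_put(buf, argp, CHUNK, tag, 0, 0);
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();

    return 0;
}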

On-chip Interconnect: Element Interconnect Bus (EIB)
Provides the internal connection for 12 'units':
– the PPE
– the 8 SPEs
– the Memory Interface Controller (MIC)
– 2 off-chip I/O interfaces
Each 'unit' has one 16 B read port and one 16 B write port
Circular ring: four 16-byte-wide unidirectional channels that counter-rotate in pairs

Includes an arbitration unit that functions as a set of traffic lights
Runs at half the system clock rate
Peak instantaneous EIB bandwidth is 96 B per system clock:
12 concurrent transactions x 16 bytes wide / 2 system clocks per transfer
An EIB channel is not permitted to convey data requiring more than six steps around the ring

Each unit on the EIB can simultaneously send and receive 16 B of data every bus cycle
The maximum data bandwidth of the entire EIB is limited by the maximum rate at which addresses are snooped across all units in the system: one address per bus cycle, with up to 128 bytes transferred per snooped address
The theoretical peak data bandwidth on the EIB at 3.2 GHz is therefore 128 B x 1.6 GHz = 204.8 GB/s
The actual peak data bandwidth achieved is about 197 GB/s
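To make the two bandwidth figures concrete, here is a small worked calculation. It only restates the arithmetic above; the constants (16 B per transaction, 12 concurrent transactions, 2 system clocks per transfer, 128 B per snooped address, 1.6 GHz bus clock) are the ones quoted in the slides.

/* Worked calculation of the EIB bandwidth figures quoted above. */
#include <stdio.h>

int main(void)
{
    const double transaction_width_bytes = 16.0;   /* 16 B per channel hop */
    const double concurrent_transactions = 12.0;   /* up to 12 in flight */
    const double system_clocks_per_xfer  = 2.0;    /* EIB runs at half the clock */

    /* Peak instantaneous bandwidth per system clock. */
    double peak_per_clock =
        concurrent_transactions * transaction_width_bytes / system_clocks_per_xfer;

    const double bytes_per_snoop = 128.0;          /* one snooped address moves 128 B */
    const double bus_clock_ghz   = 1.6;            /* half of the 3.2 GHz core clock */

    /* Snoop-limited theoretical peak data bandwidth. */
    double peak_gbs = bytes_per_snoop * bus_clock_ghz;

    printf("Peak instantaneous EIB bandwidth: %.0f B per system clock\n", peak_per_clock); /* 96 */
    printf("Theoretical peak data bandwidth:  %.1f GB/s\n", peak_gbs);                     /* 204.8 */
    return 0;
}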

David Krolak explains: “Well, in the beginning, early in the development process, several people were pushing for a crossbar switch, and the way the bus is designed, you could actually pull out the EIB and put in a crossbar switch if you were willing to devote more silicon space on the chip to wiring. We had to find a balance between connectivity and area, and there just was not enough room to put a full crossbar switch in. So we came up with this ring structure which we think is very interesting. It fits within the area constraints and still has very impressive bandwidth.”

Multi-threading Organization
The PPE is an in-order, 2-way Simultaneous Multi-Threading (SMT) core
– All architectural state is duplicated so that instruction issue can be interleaved between the two hardware threads
Each SPU is a vector accelerator targeted at the execution of SIMD code
Asynchronous DMA transfers
– Setting up a DMA takes the SPE only a few cycles, whereas a cache miss on a conventional system can stall the CPU for up to thousands of cycles
– SPEs can perform other calculations while waiting for data, as the double-buffering sketch below illustrates
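A common way to exploit these asynchronous transfers is double buffering: start the DMA for the next chunk, compute on the current chunk, and wait on the tag only when the data is actually needed. The sketch below again uses the spu_mfcio.h intrinsics; the chunk size, chunk count, the two tag ids, and the process_chunk() helper are hypothetical placeholders, not code from the slides.

/* Double-buffering sketch for an SPE: overlap the DMA of chunk i+1 with
 * computation on chunk i.
 */
#include <spu_mfcio.h>

#define CHUNK    4096
#define N_CHUNKS 64

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));

static void process_chunk(volatile char *data, unsigned int bytes)
{
    /* ... application-specific SIMD computation ... */
    (void)data; (void)bytes;
}

int main(unsigned long long speid, unsigned long long argp, unsigned long long envp)
{
    unsigned int cur = 0, nxt = 1;
    unsigned int i;
    (void)speid; (void)envp;

    /* Prime the pipeline: fetch chunk 0. */
    mfc_get(buf[cur], argp, CHUNK, cur, 0, 0);

    for (i = 0; i < N_CHUNKS; i++) {
        /* Start fetching chunk i+1 while chunk i is still in flight. */
        if (i + 1 < N_CHUNKS)
            mfc_get(buf[nxt], argp + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);

        /* Block only on the current chunk's tag, then compute on it. */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();
        process_chunk(buf[cur], CHUNK);

        /* Swap buffers (and their DMA tags). */
        cur ^= 1;
        nxt ^= 1;
    }
    return 0;
}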

Scheduling Policy
Two classes of threads are defined:
– PPU threads: run on the PPU
– SPU tasks: run on the SPUs
PPU threads are managed by the Linux Completely Fair Scheduler (CFS)
The SPU scheduler supports time-sharing in multi-programmed workloads and allows preemption of SPU tasks
Cell-based systems allow only one active application to run at a time to avoid performance degradation
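For context on what a "PPU thread" and an "SPU task" look like in code, the sketch below uses libspe2 to create an SPE context and run an embedded SPU program from a PPU pthread. The spe_context_create / spe_program_load / spe_context_run / spe_context_destroy calls are the standard libspe2 API; the hello_spu program handle name and the minimal error handling are assumptions made for this example.

/* PPU-side sketch: a PPU pthread that launches one SPU task via libspe2. */
#include <stdio.h>
#include <pthread.h>
#include <libspe2.h>

extern spe_program_handle_t hello_spu;   /* embedded SPU ELF (assumed name) */

static void *spu_task(void *arg)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx;
    (void)arg;

    ctx = spe_context_create(0, NULL);
    if (ctx == NULL) { perror("spe_context_create"); return NULL; }

    if (spe_program_load(ctx, &hello_spu) != 0) {
        perror("spe_program_load");
        spe_context_destroy(ctx);
        return NULL;
    }

    /* Blocks in this PPU thread until the SPU program stops. */
    if (spe_context_run(ctx, &entry, 0, NULL, NULL, NULL) < 0)
        perror("spe_context_run");

    spe_context_destroy(ctx);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, spu_task, NULL);   /* one PPU thread per SPU task */
    pthread_join(t, NULL);
    return 0;
}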

Completely Fair Scheduler
Tasks are ranked by virtual runtime: the task that has received the least CPU time so far runs next
Consider an example with two users, A and B, who are running jobs on a machine. User A has just two jobs running, while user B has 48 jobs running. Group scheduling enables CFS to be fair to users A and B, rather than to all 50 jobs individually. Each user gets a 50% share of the machine: B uses his 50% share to run his 48 jobs and cannot encroach on A's 50% share.
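The difference group scheduling makes can be seen with a quick calculation. The snippet below just works through the 2-job versus 48-job example from the slide; the per-job percentages are derived, not quoted.

/* Worked example of per-job CPU share with and without CFS group scheduling. */
#include <stdio.h>

int main(void)
{
    const double jobs_a = 2.0, jobs_b = 48.0;

    /* Without group scheduling: every job is weighted equally. */
    double per_job_flat = 100.0 / (jobs_a + jobs_b);             /* 2% each    */
    printf("Flat fairness:  each job %.1f%%, user B total %.1f%%\n",
           per_job_flat, per_job_flat * jobs_b);                 /* B gets 96% */

    /* With group scheduling: each user first gets 50%, split among their jobs. */
    printf("Group fairness: A's jobs %.1f%% each, B's jobs %.2f%% each\n",
           50.0 / jobs_a, 50.0 / jobs_b);                        /* 25% and ~1.04% */
    return 0;
}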