Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.

Presentation transcript:

Programming the Cell Multiprocessor Işıl ÖZ

Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs

Cell Processor Cell Broadband Engine Architecture – Cell BE Developed by STI (SCEI-Toshiba-IBM) design center – STI formed in 2000 – STI design center opened in 2001 – Introduced in 2005 – 65 nm in 2007, 45 nm in 2008

Cell Processor Objectives Outstanding performance especially on game/multimedia applications – Memory latency – Power efficiency – Processor frequency and pipeline depth Real time response to the user and the network Applicable to a wide range of platforms Support for introduction in 2005

Cell Architecture a 64-bit Power processor element (PPE) 8 synergistic processor elements (SPE) Memory controller Bus-interface controller Element interconnect bus

Power Processor Elements PPE – Power core – First level cache L1 – Second level cache L2

PPE Major Units

Synergistic Processor Elements SPEs – DMA (Direct Memory Access Unit) – LS (Local Store Memory) – SXUs (Execution Units)

SPE Organization

Controllers Memory Interface Controller – interfaces to the Rambus XDR I/O unit, which communicates directly with the DRAM modules Bus Interface Controller – interfaces to the Rambus FlexIO, which provides communication with other system components

Element Interconnect Bus EIB – Coherent, on-chip bus – Connects the processing elements, memory and I/O devices

Programming the Cell Local store memory in SPEs (256KB) SIMD nature of dataflows The size of the register file (128 entries of 128 bits each) Single program context
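The 256KB local store is the constraint that most directly shapes SPE code: a task's working set has to fit in it together with the SPE program itself. As a rough sketch of that budgeting, the function below computes the largest square block size for which a given number of blocks fits. The function name `max_block_dim` and the `reserved` parameter (space set aside for code and stack) are illustrative assumptions, not part of any Cell SDK.

```c
#include <assert.h>
#include <stddef.h>

/* Local store size of one SPE, as stated on the slide (256 KB). */
#define LOCAL_STORE_BYTES (256u * 1024u)

/*
 * Returns the largest block dimension bs such that `nblocks` square
 * bs x bs blocks of `elem_size`-byte elements fit in the local store
 * after `reserved` bytes have been set aside (for example for code and
 * stack; the amount to reserve is an assumption of the caller).
 * Returns 0 if no block fits.
 */
size_t max_block_dim(size_t nblocks, size_t elem_size, size_t reserved)
{
    assert(nblocks > 0 && elem_size > 0);

    if (reserved >= LOCAL_STORE_BYTES)
        return 0;

    size_t budget = LOCAL_STORE_BYTES - reserved;
    size_t bs = 0;

    /* Grow the block until the next size would exceed the budget. */
    while ((bs + 1) * (bs + 1) * elem_size * nblocks <= budget)
        bs++;

    return bs;
}
```

For a block matrix multiplication that keeps three single-precision blocks resident and reserves 64KB, `max_block_dim(3, sizeof(float), 64 * 1024)` gives a block dimension of 128.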

Programming Models Function offload model Device extension model Computational acceleration model Streaming models Shared-memory multiprocessor model Asymmetric thread runtime model

A programming model: CellSs Cell superscalar – Simple and flexible – Automatic parallelization of sequential programs – Task scheduling and data handling

CellSs Structure Based on – code annotations – C language Composed of – Source compiler – Runtime library

CellSs Compilation Environment

CellSs Compiler Source to source compiler – Function(task) to be executed in the SPEs – Function parameter directions – Parameters that are arrays and their lengths No pointers!

Parallelism on CellSs Annotated code Generated code for the PPE Generated code for the SPE

CellSs Syntax Three types of pragmas – initialization and finalization css start and css finish – task css task [input inout output] – synchronization css wait

Example CellSs Source Code start/finish task wait for task
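The code shown on the original slide is not preserved in this transcript. The sketch below is a stand-in, not the presenter's example: a small block matrix multiplication annotated in the style of the pragmas listed above. The clause form `input(A, B) inout(C)` follows the published CellSs examples as best recalled and should be checked against the CellSs manual. The names `NB`, `BS`, `block_addmultiply` and `matmul_blocked` are illustrative. A compiler that does not know the `css` pragmas ignores them (possibly with a warning), so the same source still runs sequentially.

```c
#include <assert.h>

#define NB 2   /* blocks per matrix dimension (illustrative value) */
#define BS 4   /* elements per block dimension (illustrative value) */

/* Task: C += A * B on one BS x BS block. Under CellSs this function
 * would be the unit of work shipped to an SPE; A and B are read only,
 * C is read and written. */
#pragma css task input(A, B) inout(C)
void block_addmultiply(float C[BS][BS], float A[BS][BS], float B[BS][BS])
{
    for (int i = 0; i < BS; i++)
        for (int j = 0; j < BS; j++)
            for (int k = 0; k < BS; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Main-program side: a plain sequential loop nest over blocks.
 * Each call to the annotated function becomes one task. */
void matmul_blocked(float C[NB][NB][BS][BS],
                    float A[NB][NB][BS][BS],
                    float B[NB][NB][BS][BS])
{
    assert(C != A && C != B);  /* the output must not alias an input */

#pragma css start
    for (int i = 0; i < NB; i++) {
        for (int j = 0; j < NB; j++) {
            for (int k = 0; k < NB; k++) {
                block_addmultiply(C[i][j], A[i][k], B[k][j]);
            }
        }
    }
    /* A "css wait" pragma would go here if the main program had to
     * read C before the end of the parallel section; its argument
     * syntax is not given in this transcript, so it is omitted. */
#pragma css finish
}
```

The successive calls that update the same `C[i][j]` block form a chain through the `inout` parameter, while calls for different `(i, j)` pairs are independent. That is the kind of structure the runtime's dependency analysis is meant to discover.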

CellSs Runtime Execute function – Add a node to the task graph – Data dependency analysis (RAW, WAR, WAW) – Parameter renaming – Task submission
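The dependency analysis step can be illustrated by classifying the hazard between two accesses to the same data, given the parameter directions from the task annotations. This is a minimal sketch, not the CellSs runtime's actual data structures: the types `param_t`, `direction_t` and the function names are hypothetical. It also reflects the usual reason for renaming: write-after-read and write-after-write hazards are false dependencies that disappear when the later task writes to a fresh copy, whereas read-after-write is a true dependency that needs an edge in the task graph.

```c
#include <assert.h>
#include <stddef.h>

/* Direction of a task parameter, mirroring input / output / inout. */
typedef enum { DIR_INPUT, DIR_OUTPUT, DIR_INOUT } direction_t;

/* Hazard kinds as bit flags, so one pair of accesses can carry several. */
enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* One parameter of a task instance: the data it refers to and how. */
typedef struct {
    const void *addr;
    direction_t dir;
} param_t;

static int reads(direction_t d)  { return d == DIR_INPUT  || d == DIR_INOUT; }
static int writes(direction_t d) { return d == DIR_OUTPUT || d == DIR_INOUT; }

/* Returns the set of hazards that the later access has with respect to
 * an earlier access, or DEP_NONE when they touch different data. */
unsigned classify_dependency(param_t earlier, param_t later)
{
    assert(earlier.addr != NULL && later.addr != NULL);

    unsigned deps = DEP_NONE;
    if (earlier.addr != later.addr)
        return deps;

    if (writes(earlier.dir) && reads(later.dir))  deps |= DEP_RAW;
    if (reads(earlier.dir)  && writes(later.dir)) deps |= DEP_WAR;
    if (writes(earlier.dir) && writes(later.dir)) deps |= DEP_WAW;
    return deps;
}

/* True when the hazards are only false dependencies (WAR and/or WAW),
 * which renaming the later task's output can remove. */
int removable_by_renaming(unsigned deps)
{
    return deps != DEP_NONE && (deps & DEP_RAW) == 0;
}
```

An `inout` parameter following another `inout` on the same data reports all three hazards at once, and because a read-after-write is among them, renaming alone cannot break that chain.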

CellSs Runtime Behavior

Middleware for the Cell Task scheduling – task control buffer – task grouping – dynamic scheduling
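Dynamic scheduling over a task graph generally comes down to tracking, for each task, how many predecessors are still unfinished, and releasing a task once that count reaches zero. The sketch below shows that generic mechanism only. It does not model the task control buffer or task grouping, and the type and function names (`task_graph_t`, `add_dependency`, `complete_task`) and the fixed capacities are illustrative assumptions.

```c
#include <assert.h>

#define MAX_TASKS 16   /* illustrative capacity */
#define MAX_SUCCS 4    /* illustrative fan-out limit per task */

typedef struct {
    int pending;            /* predecessors that have not finished yet */
    int nsuccs;             /* number of successors recorded below */
    int succs[MAX_SUCCS];   /* ids of tasks that depend on this one */
} task_node_t;

typedef struct {
    task_node_t nodes[MAX_TASKS];
    int ntasks;
} task_graph_t;

/* Adds a task with no dependencies yet and returns its id. */
int add_task(task_graph_t *g)
{
    assert(g->ntasks < MAX_TASKS);
    int id = g->ntasks++;
    g->nodes[id].pending = 0;
    g->nodes[id].nsuccs = 0;
    return id;
}

/* Records that task `to` must wait for task `from`. */
void add_dependency(task_graph_t *g, int from, int to)
{
    task_node_t *src = &g->nodes[from];
    assert(src->nsuccs < MAX_SUCCS);
    src->succs[src->nsuccs++] = to;
    g->nodes[to].pending++;
}

/* A task is ready when all of its predecessors have finished. */
int is_ready(const task_graph_t *g, int id)
{
    return g->nodes[id].pending == 0;
}

/* Marks task `id` as finished. Writes the ids of successors that became
 * ready into ready_out and returns how many were written. */
int complete_task(task_graph_t *g, int id, int ready_out[MAX_SUCCS])
{
    int nready = 0;
    const task_node_t *done = &g->nodes[id];

    for (int s = 0; s < done->nsuccs; s++) {
        int succ = done->succs[s];
        if (--g->nodes[succ].pending == 0)
            ready_out[nready++] = succ;
    }
    return nready;
}
```

A scheduler built on this would hand the ids returned by `complete_task` to whichever SPE is idle. Grouping several tasks per submission, as the slide mentions, would be layered on top of this bookkeeping.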

Locality Aware Task Scheduling

Tracing A tracing component embedded in the CellSs runtime generates Paraver trace files, recording events – when the main program enters or exits – when an annotated function is called in the main program – when a task is started or finished

Performance Analysis Matmul – Block matrix multiplication TSP – Recursive implementation of Traveling Salesman Problem Cholesky – Block matrix Cholesky factorization

Performance Analysis TSP – No data dependency Cholesky – Highly connected data dependency graph

Performance Analysis x-axis : timeline y-axis : a thread of the application green : events yellow : communications

Performance Analysis yellow : SPE thread DMA transfer brown : SPE executing the task

Pros and Cons Annotations – simple – but limited Data transfers are handled transparently to the user code Task dependency analysis is automatic Relies on other compilers for – code vectorization (SPE performance) – lower level code optimization

Related Work – OpenMP – Accelerated Library Framework (ALF) – Thread level synchronization – Sequoia – RapidMind – Ohara – Graphics Processor Units (GPUs)
