Cell Processor Programming: An Introduction
Pascal Comte, Brock University, Fall 2007



Goals of Presentation
1) Latest technology
2) Promote parallel programming
   Vector vs scalar programming
3) Incite you to program & design in parallel
4) Meant to be informative
5) Technical details & inner workings
6) Not to critique the design of the Cell Processor

Presentation Layout
1) IBM Cell Processor Design
2) IBM Cell Processor on PlayStation 3
3) IBM Cell Processor SDK
4) From Scalar to Vector Programming
5) Levels of Parallelism
6) SPE Program Modules
7) Data Transfers & Communication
8) Programming Techniques
9) Program Example

Cell Processor Design

Cell Processor Architecture
1) PPE register file: 32 x 128-bit vectors
2) SPE register file: 128 x 128-bit vectors
3) PPE: dual-issue in-order processor
   In-order & out-of-order computation (load instructions)
4) SPE: dual-issue in-order processor
   In-order computation & out-of-order data transfers

Cell Processor Architecture

1) PPE design goals
   Maximize performance/power ratio
   Maximize performance/area ratio
2) PPE main tasks
   Run the OS (Linux)
   Coordinate with the SPEs
3) SPEs have dedicated DMA engines
4) PPE & SPEs at 3.2 GHz
5) External Rambus XDR memory
   Two channels at 3.2 GHz (400 MHz, octal data rate)
6) I/O at 5 GHz
7) SPEs' parallel nature
   Even pipeline
   Odd pipeline

Cell Processor Design

Cell Processor on Playstation 3

1) Only 6 of the 8 SPEs are accessible
2) Only 256 MB of XDR memory
3) Gigabit Ethernet controller
   High latency (~250 µs): why?
4) Wi-Fi controller
5) 4 USB ports
6) 20 GB, 40 GB, 60 GB and 80 GB hard drives
7) Hypervisor: virtualization layer
8) Maximum power consumption / usual consumption

Cell Processor on PlayStation 3
1) Linux distributions available
   Fedora Core 5, 6, 7
   Yellow Dog 5.0+
   Gentoo PowerPC 64
   Debian
   IBM's choice: Fedora
2) Easy installation
   Format the PS3 hard drive
   USB key required for otherOS
   Cell Addon CD
   Fedora PPC DVD
3) Linux kernel has full support for the PS3
4) GCC compiler for C/C++/Fortran 95 on the PPE
5) Access to the SPEs requires the IBM Cell SDK

IBM Cell Processor SDK

Cell Processor SDK
1) SDK 2.1
   Fedora Core 6
   GNU tool chain by Sony Computer Entertainment
   IBM XL C/C++ Compiler
   IBM Full System Simulator
   Sysroot image for the System Simulator
   SIMD math library
   MASS (Mathematical Acceleration SubSystem)
   Sample code
   IBM Eclipse IDE for Cell BE
2) SDK 3.0
   Fedora Core 7
   BLAS library (single & double precision linear algebra functions)
   GNU Ada compiler for PPE

Cell Processor SDK
   GNU Fortran compiler for PPE & SPE
   Numactl library (for non-uniform memory access machines)
   FFT library: 1D & 2D Fast Fourier Transforms
   Random number generation (good for simulations)
   SPU isolation runtime environment: signing & encrypting SPE apps

From Scalar to Vector Programming

1) Cell designed for vector computations
   Vector arithmetic faster than scalar arithmetic
2) Designed for fast SIMD processing
3) Vectors stored in big-endian order
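To make the big-endian point concrete, here is a minimal sketch in portable C. The helper names `store_be32` and `load_be32` are illustrative and not part of the Cell SDK. In big-endian order the most significant byte of a word sits at the lowest address, which is how the PPE and SPEs lay out each vector element.

```c
#include <stdint.h>

/* Store a 32-bit word in big-endian order: most significant byte first.
   Helper names are illustrative, not part of the Cell SDK. */
void store_be32(uint8_t out[4], uint32_t value)
{
    out[0] = (uint8_t)(value >> 24);
    out[1] = (uint8_t)(value >> 16);
    out[2] = (uint8_t)(value >> 8);
    out[3] = (uint8_t)(value);
}

/* Read the word back the same way, independent of the host's byte order. */
uint32_t load_be32(const uint8_t in[4])
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8)  |  (uint32_t)in[3];
}
```

Helpers like these matter mostly when data is exchanged with a little-endian machine, for example over the network.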

Scalar vs Vector Programming

From Scalar to Vector Programming
1) sizeof() on a vector always returns 16
2) Vectors are aligned to a 16-byte boundary by default
   'result' addition faster than 'c' addition
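The slide's own code is not in the transcript, so the following is only a sketch of the idea. It assumes a GCC or Clang compiler and uses their generic vector extension as a stand-in for the SPU's `vector float` type. The typedef `vec4f` and the function names are hypothetical.

```c
#include <stddef.h>

/* GCC/Clang vector extension: four floats packed into one 16-byte value.
   Stand-in for the SPU's `vector float`; the typedef name is hypothetical. */
typedef float vec4f __attribute__((vector_size(16)));

/* Scalar version: four separate additions, one per element. */
void add_scalar(const float *a, const float *b, float *c)
{
    for (int i = 0; i < 4; i++)
        c[i] = a[i] + b[i];
}

/* Vector version: one element-wise addition over all four lanes. */
vec4f add_vector(vec4f a, vec4f b)
{
    return a + b;
}

/* Size and alignment of the vector type, both expected to be 16 bytes. */
size_t vec4f_size(void)  { return sizeof(vec4f); }
size_t vec4f_align(void) { return _Alignof(vec4f); }
```

On the SPE the vector form would map to a single SIMD instruction. On a host compiler the generated code depends on the target's SIMD support.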

From Scalar to Vector Programming
Cryptography performance up to 2.3x that of a leading-brand processor with SIMD at the same frequency

From Scalar to Vector Programming
High bandwidth
Best area-efficiency processor on the market*

Levels of Parallelism

1) Breaking a problem into modules
   Same or different modules
   Modularity of SPEs
2) SIMD operations on vector data types
   Arithmetic intrinsics
      spu_add – vector add
      spu_madd – vector multiply and add
      spu_msub – vector multiply and subtract
      spu_mul – vector multiply
      spu_sub – vector subtract
      spu_nmadd – negative vector multiply and add
      spu_nmsub – negative vector multiply and subtract
      spu_re – vector float reciprocal estimate
      spu_rsqrte – vector float reciprocal square-root estimate
   Byte operation intrinsics
      spu_absd – vector absolute difference
      spu_avg – average of 2 vectors

Levels of Parallelism
   Compare intrinsics
      spu_cmpabseq – element-wise absolute equal
      spu_cmpabsgt – element-wise absolute greater than
      spu_cmpeq – element-wise equal
      spu_cmpgt – element-wise greater than
   Bits and mask intrinsics
      spu_sel – select bits
      spu_shuffle – shuffle 2 vectors of bytes
   Logical intrinsics
      spu_and – vector bit-wise AND
      spu_nand – vector bit-wise complement AND
      spu_nor – vector bit-wise complement OR
      spu_or – vector bit-wise OR
      spu_xor – vector bit-wise XOR
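The intrinsics listed above only compile with an SPU toolchain, so the sketch below emulates three of them in portable C to show what they compute per lane. The `emu_*` functions and the `v4f`/`v4u` structs are hypothetical stand-ins, not SDK names, and the emulation assumes 32-bit floats.

```c
#include <stdint.h>
#include <string.h>

typedef struct { float    f[4]; } v4f;   /* stand-in for vector float */
typedef struct { uint32_t u[4]; } v4u;   /* stand-in for vector unsigned int */

/* Like spu_madd(a, b, c): element-wise a*b + c. */
v4f emu_madd(v4f a, v4f b, v4f c)
{
    v4f r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] * b.f[i] + c.f[i];
    return r;
}

/* Like spu_cmpgt(a, b): all-ones mask in each lane where a > b, else zero. */
v4u emu_cmpgt(v4f a, v4f b)
{
    v4u m;
    for (int i = 0; i < 4; i++)
        m.u[i] = (a.f[i] > b.f[i]) ? 0xFFFFFFFFu : 0u;
    return m;
}

/* Like spu_sel(a, b, mask): take bits from b where the mask bit is 1,
   and from a where it is 0. */
v4f emu_sel(v4f a, v4f b, v4u mask)
{
    v4f r;
    for (int i = 0; i < 4; i++) {
        uint32_t ua, ub, ur;
        memcpy(&ua, &a.f[i], sizeof ua);
        memcpy(&ub, &b.f[i], sizeof ub);
        ur = (ua & ~mask.u[i]) | (ub & mask.u[i]);
        memcpy(&r.f[i], &ur, sizeof ur);
    }
    return r;
}

/* Element-wise maximum built from a compare and a select, with no
   if statement on the data. */
v4f emu_max(v4f a, v4f b)
{
    return emu_sel(b, a, emu_cmpgt(a, b));
}
```

The compare-then-select pattern in `emu_max` is the usual way to replace a data-dependent branch on the SPE, which has no hardware branch predictor.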

Levels of Parallelism
1) SIMD Math Library
   Too many functions to list
2) SPE:
   Even pipeline:
      Float, double and integer multiply unit
      Fixed-point arithmetic, logical ops., word shifts unit
   Odd pipeline:
      Fixed-point permutes, shuffles, quadword rotates unit
      Instruction sequencing, branching, execution control unit
      Local store load/save, supply instructions to control unit
      DMA channel for input/output through MFC
3) Channel interface independent of SPE
4) SPE can issue & complete up to 2 instructions per cycle

SPE Program Modules

1) Separate compiler for SPE
   Embed SPE executable into library
   'extern spe_program_handle_t '
   Compile main PPU program with library
2) SPE Context
   How to acquire SPEs for computation...

SPE Program Modules
   How to load an SPE program into SPEs...
   How to release SPEs...

SPE Program Modules
   How to run pthreads with the SPEs: example...
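The slide's example code did not survive in the transcript. The sketch below shows only the threading pattern, under the assumption that the call which runs an SPE program blocks its calling thread (as libspe2's `spe_context_run` does), so the PPE starts one pthread per SPE. A trivial computation stands in for the SPE program so the code runs on any POSIX host. `worker_arg_t`, `worker_main` and `run_workers` are hypothetical names.

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_WORKERS 6   /* the PS3 exposes 6 SPEs to Linux */

typedef struct {
    int id;        /* which worker this is */
    int input;     /* data handed to the worker */
    int output;    /* result written by the worker */
} worker_arg_t;

/* Thread body. On a Cell system this is where the blocking call that runs
   the SPE program would go; here a trivial computation stands in for it. */
static void *worker_main(void *p)
{
    worker_arg_t *arg = (worker_arg_t *)p;
    arg->output = arg->input * arg->input;
    return NULL;
}

/* Start one thread per worker, then wait for all of them.
   Returns 0 on success, -1 if a thread could not be created or joined. */
int run_workers(worker_arg_t args[NUM_WORKERS])
{
    pthread_t threads[NUM_WORKERS];
    int started = 0, status = 0;

    for (int i = 0; i < NUM_WORKERS; i++) {
        args[i].id = i;
        if (pthread_create(&threads[i], NULL, worker_main, &args[i]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++)
        if (pthread_join(threads[i], NULL) != 0)
            status = -1;
    return status;
}
```

Each worker gets its own argument struct, so no locking is needed. On real hardware the struct's address would typically be passed to the SPE program as its argument pointer.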

Data Transfers & Communication

1) Data transfers are initiated with spu_mfcdma32() or spu_mfcdma64()
2) Tell the SPE's MFC which tag groups to wait on (a mask of -1 selects all of them)
   spu_writech(MFC_WrTagMask,-1);
3) Wait for the data to be completely transferred
   spu_mfcstat(MFC_TAG_UPDATE_ALL);
4) Different modes of data transfers:
   MFC_PUT_CMD
   MFC_PUTB_CMD
   MFC_PUTF_CMD
   MFC_GET_CMD
   MFC_GETB_CMD
   MFC_GETF_CMD
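The MFC calls above need an SPU compiler, so this sketch covers only the bookkeeping around them in portable C. It assumes the commonly documented MFC constraints: 32 tag groups, and transfers of 1, 2, 4 or 8 bytes or a multiple of 16 bytes up to 16 KB, with matching alignment. `tag_mask` and `dma_args_ok` are hypothetical helpers, not SDK functions.

```c
#include <stdint.h>

#define DMA_MAX_BYTES 16384u   /* assumed largest single MFC transfer: 16 KB */

/* Bit mask selecting one tag group (tags run from 0 to 31).
   A mask of -1, as on the slide, sets every bit and so selects all tags. */
uint32_t tag_mask(unsigned tag)
{
    return (tag < 32u) ? (UINT32_C(1) << tag) : 0u;
}

/* Returns 1 if (local-store address, effective address, size) form a legal
   single DMA transfer under the assumed rules, 0 otherwise. */
int dma_args_ok(uint32_t ls, uint64_t ea, uint32_t size)
{
    if (size == 1u || size == 2u || size == 4u || size == 8u) {
        /* Small transfers: naturally aligned, same offset within a quadword. */
        return (ls % size == 0u) && (ea % size == 0u) &&
               ((ls & 15u) == (uint32_t)(ea & 15u));
    }
    /* Larger transfers: multiple of 16 bytes, both ends 16-byte aligned. */
    return size >= 16u && size <= DMA_MAX_BYTES && (size % 16u == 0u) &&
           (ls % 16u == 0u) && (ea % 16u == 0u);
}
```

A check like `dma_args_ok` is useful in debug builds, since a misaligned transfer is otherwise hard to trace back to its cause.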

Data Transfers & Communication
1) MFC_PUTF_CMD & MFC_PUTB_CMD:
   'F' for Fence: the command is locally ordered w.r.t. all previously issued commands within the same tag group and command queue
   'B' for Barrier: the command and all subsequent commands with the same tag ID as this command are locally ordered w.r.t. all previously issued commands within the same tag group and command queue
2) PPU & SPE MailBox
3) SPE Events

Programming Techniques

1) XLC C/C++ Compiler vs GCC
   Which to choose?
   __align_hint(); (SPE only)
      Improves data access through pointers
      Provides information to the compiler for auto-vectorization
   __builtin_expect();
      Programmer-directed branch prediction
2) Double Buffering
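Double buffering can be sketched in portable C as follows. Two buffers alternate: while one is being processed, the next chunk is fetched into the other. Here `memcpy` stands in for an asynchronous DMA get, so on a host nothing actually overlaps and only the control flow is shown. `CHUNK`, `fetch_chunk` and `sum_double_buffered` are hypothetical names.

```c
#include <stddef.h>
#include <string.h>

#define CHUNK 4   /* elements per buffer; tiny so the flow is easy to trace */

/* Stand-in for starting a DMA "get" of one chunk into local storage.
   On the SPE this would be asynchronous; memcpy completes immediately. */
static void fetch_chunk(int *dst, const int *src, size_t n)
{
    memcpy(dst, src, n * sizeof *dst);
}

/* Sum `total` ints (assumed to be a multiple of CHUNK) using two
   alternating buffers: while buffer `cur` is processed, the next chunk
   is fetched into the other one. */
long sum_double_buffered(const int *data, size_t total)
{
    int buf[2][CHUNK];
    size_t chunks = total / CHUNK;
    long sum = 0;
    int cur = 0;

    if (chunks == 0)
        return 0;

    fetch_chunk(buf[cur], data, CHUNK);               /* prime first buffer */
    for (size_t c = 0; c < chunks; c++) {
        int next = cur ^ 1;
        if (c + 1 < chunks)                           /* start next transfer */
            fetch_chunk(buf[next], data + (c + 1) * CHUNK, CHUNK);
        /* On the SPE: wait here for the transfer into buf[cur] to finish. */
        for (size_t i = 0; i < CHUNK; i++)            /* compute on current */
            sum += buf[cur][i];
        cur = next;
    }
    return sum;
}
```

On real hardware each buffer would use its own DMA tag, so that the wait before the compute step only blocks on the buffer about to be read.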

Programming Techniques
1) Program flow: limit branching
   if statements...
   Pointer arithmetic
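One way to limit branching is to turn a comparison into a bit mask and select with bitwise operations. This is a sketch of that idea in scalar C, not the slide's original code, and the function names are hypothetical. The cast back to `int32_t` relies on two's-complement conversion, which mainstream compilers provide.

```c
#include <stdint.h>

/* Branching version: the if statement may compile to a conditional jump. */
int32_t max_branch(int32_t a, int32_t b)
{
    if (a > b)
        return a;
    return b;
}

/* Branch-free version: build an all-ones or all-zeros mask from the
   comparison, then select with bitwise operations. */
int32_t max_select(int32_t a, int32_t b)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(a > b);   /* 0xFFFFFFFF or 0 */
    return (int32_t)(((uint32_t)a & mask) | ((uint32_t)b & ~mask));
}

/* Pointer-arithmetic loop over an array using the branch-free maximum.
   Requires n >= 1. */
int32_t array_max(const int32_t *p, int n)
{
    const int32_t *end = p + n;
    int32_t best = *p++;
    while (p < end)
        best = max_select(best, *p++);
    return best;
}
```

The only branch left in `array_max` is the loop test itself, which is highly predictable and a natural candidate for a `__builtin_expect` hint where one is needed.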

Programming Techniques
1) Loop unrolling... especially innermost loops
2) Code width
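As an illustration of unrolling an innermost loop, the sketch below computes a dot product two ways. The unrolled version keeps four independent accumulators, which gives an in-order dual-issue core more independent work per iteration. The function names are hypothetical, and the factor of four is an arbitrary choice for the example.

```c
#include <stddef.h>

/* Straightforward loop: one multiply-add per iteration, each depending on
   the previous value of `sum`. */
float dot_simple(const float *a, const float *b, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}

/* Unrolled by four with independent accumulators, so consecutive
   multiply-adds do not wait on each other. */
float dot_unrolled(const float *a, const float *b, size_t n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    float sum;
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    sum = (s0 + s1) + (s2 + s3);
    for (; i < n; i++)            /* leftover elements when n % 4 != 0 */
        sum += a[i] * b[i];
    return sum;
}
```

Because floating-point addition is not associative, the two versions can differ in the last bits for general inputs. They agree exactly when every intermediate value is exactly representable, as with small integers.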

Program Example

Simple Hello World!