High Performance Computing on the Cell Broadband Engine

Presentation transcript:

High Performance Computing on the Cell Broadband Engine. Vas Chellappa, Electrical & Computer Engineering, Carnegie Mellon University. December 3, 2008.

Designing "faster" processors: the need for speed. Forms of parallelism: superscalar, pipelining, vector, multi-core, multi-node.

Designing "faster" processors: forms of parallelism and their limitations. Superscalar (power density), pipelining (latch overhead: frequency scaling, branching), vector (programming effort, numeric code only), multi-core (memory wall, programming), multi-node (interconnects, reliability).

Multi-core Parallelism. The future is clearly multi-core, but what problems and limitations do multi-cores have? Increased programming burden; scaling issues: power, interconnects, etc.

The Cell BE Approach. Frequency wall: many simple, in-order cores. Power wall: vectorized, in-order arithmetic cores. Memory wall: the Memory Flow Controller handles programmer-driven DMA in the background. [Figure: Cell BE chip block diagram showing the PPE, SPEs with Local Stores, the EIB, and main memory.]

Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

Cell Broadband Engine. Designed for high-density floating-point computation (PlayStation 3, IBM Roadrunner). Compute: heterogeneous multi-core (1 PPE + 8 SPEs); 204.8 Gflop/s single precision (SPEs only); high-speed on-chip interconnect (EIB). Memory system: explicit scratchpad-type "local store"; DMA-based programming. Challenges: parallelization, vectorization, explicit memory management. A new design brings a new programming paradigm: writing such code by hand is genuinely hard, and automated tools exist but do not deliver performance. [Figure: 8 SPEs with Local Stores connected via the EIB to the PPE and main memory.]

Cell BE Processor: A Closer Look. Power Processing Element (PPE); Synergistic Processing Elements (SPE) x8; Local Stores (LS). [Figure: Cell BE chip with PPE, SPEs and Local Stores, EIB, and main memory.]

Power Processing Element (PPE). Purpose: operating system and program control. Uses the POWER Instruction Set Architecture; 2-way multithreaded. Cache: 32KB L1-I, 32KB L1-D, 512KB L2. AltiVec SIMD. System functions: virtualization, address translation/protection, exception handling. http://www.lanl.gov/orgs/hpc/roadrunner/pdfs/Roadrunner-tutorial-session-1-web1.pdf

Synergistic Processing Element (SPE). SPU = processing core + Local Store; SPE = SPU + MFC. Components: Synergistic Processing Unit (SPU), Local Store (LS), Memory Flow Controller (MFC).

Synergistic Processing Unit (SPU). A number cruncher: vectorization (4-way single / 2-way double precision). Peak performance per SPE: 25.6 Gflop/s single precision (3.2 GHz x 4-way vector x 2 for FMA); under 2 Gflop/s double precision (not fully pipelined); the enhanced double-precision (PowerXCell 8i) version runs double precision at full speed (12.8 Gflop/s). Comparison: Intel. 128 vector registers, 128 bits each. Even and odd pipelines; in-order, shallow pipelines; no branch prediction (hinting instead); completely deterministic execution. A small vector-type example follows.
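A minimal sketch (not from the slides) of what 4-way SIMD looks like at the source level, using standard intrinsics from spu_intrinsics.h; the function name is made up for illustration:

/* Each 128-bit SPU register holds 4 single-precision floats. */
#include <spu_intrinsics.h>

void fma_example(void)
{
    vector float a = { 1.0f, 2.0f, 3.0f, 4.0f };  /* one 128-bit register */
    vector float b = spu_splats(2.0f);            /* broadcast 2.0 to all 4 slots */
    vector float c = spu_splats(1.0f);

    /* one fused multiply-add operates on all 4 elements at once: d = a*b + c */
    vector float d = spu_madd(a, b, c);

    float d0 = spu_extract(d, 0);                 /* read back a single element */
    (void)d0;
}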

Local Stores (LS) and Memory Flow Controller (MFC). Each SPU contains a 256KB LS (instead of a cache): explicit reads/writes (the programmer issues DMAs); extremely fast (6-cycle load latency to the SPU). The Memory Flow Controller is a co-processor that handles DMAs in the background: 8/16 command-queue entries; DMA lists (scatter/gather); barriers, fences, tag groups, etc.; mailboxes, signals. See the sketch below.
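A minimal sketch, assuming the composite MFC intrinsics from spu_mfcio.h in the IBM Cell SDK, of pulling a block of main memory into the Local Store and waiting for it; buffer and function names are illustrative:

/* LS buffer; DMA addresses should be at least 16B-aligned (128B for best performance). */
#include <spu_mfcio.h>

static float buf[1024] __attribute__((aligned(128)));

void fetch_block(unsigned long long ea /* effective address in main memory */)
{
    unsigned int tag = 1;                       /* tag group id, 0..31 */

    mfc_get(buf, ea, sizeof(buf), tag, 0, 0);   /* enqueue GET: main memory -> LS */
    mfc_write_tag_mask(1 << tag);               /* select which tag groups to wait on */
    mfc_read_tag_status_all();                  /* block until the transfer completes */
    /* buf[] is now valid and can be used by the SPU */
}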

Element Interconnect Bus (EIB). 4 data rings (16B wide each): 2 clockwise, 2 counter-clockwise. Supports multiple concurrent data transfers. Data ports: 25.6 GB/s per direction; 204.8 GB/s sustained peak bandwidth.

Direct Memory Access (DMA). Programmer-driven. Packet sizes 1B – 16KB. Several alignment constraints (violations cause bus errors!). Packet size vs. performance trade-offs. DMA lists. Get and put are named from the SPE-centric view. Mailboxes/signals are also DMAs. A buffer-declaration sketch follows.
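A short sketch (an assumption, not from the slides) of declaring DMA-safe buffers: transfers of 1/2/4/8 bytes must be naturally aligned, larger transfers must be multiples of 16 bytes with 16B alignment on both the LS and main-memory side, and 128B alignment gives the best EIB performance. The array names are hypothetical:

#include <stdint.h>

/* LS-side buffer, 128B-aligned */
static volatile float ls_buf[4096] __attribute__((aligned(128)));

/* Main-memory side: align the source array the same way on the PPE */
static float X[4096] __attribute__((aligned(128)));

/* A transfer size like sizeof(float) * n must be a multiple of 16 bytes,
 * so pad n up to a multiple of 4 floats before issuing the DMA. */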

Systems using the Cell. Sony PlayStation 3: 6 SPEs available to applications (the 7th is reserved for the hypervisor, the 8th is disabled for yield reasons); can run Linux (Fedora / Yellow Dog Linux); various PS3-cluster projects. IBM BladeCenter QS20/QS22: two Cell processors, InfiniBand/Ethernet.

IBM Roadrunner. Supercomputer at Los Alamos National Lab (NM); main purpose: modeling the decay of the US nuclear arsenal. Performance: world's fastest at the time [TOP500.org]; peak 1.7 petaflop/s; first to top 1.0 petaflop/s on Linpack. Design: hybrid. Dual-core 64-bit AMD Opterons at 1.8 GHz (6,480 Opterons); a Cell at 3.2 GHz attached to each Opteron core (12,960 Cells). Design hierarchy: QS22 Blade = 2 PowerXCell 8i; TriBlade = LS21 Opteron Blade + 2x QS22 Cell Blades (PCIe x8); Connected Unit = 180 TriBlades (InfiniBand); Cluster = 18 CUs (InfiniBand). About 90% of peak performance comes from the SPEs. Porting programs over: watch out for endianness.

Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

Programming on the Cell: Philosophy. Major differences from traditional processors: not designed for scalar performance; explicit memory access; heterogeneous multi-core. Using the SPEs: SPMD model (Single Program Multiple Data) or streaming model.

Programming Tips. What kind of code is good or bad for the SPEs? Avoid branching (there is no branch prediction; use branch hinting). Avoid scalar code (no scalar hardware support); use intrinsics for vectorization and DMA. Context switches are expensive: program and data reside in the LS and have to be swapped in and out. DMA code: alignment, alignment, alignment! Libraries are available that emulate a software-managed cache. A branchless example is sketched below.
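A sketch (an assumption, not from the slides) of removing a data-dependent branch on the SPU by computing both sides and selecting with a mask, using standard intrinsics from spu_intrinsics.h; the function name is made up:

#include <spu_intrinsics.h>

/* Branchy scalar version (bad on the SPU; a mispredicted branch costs ~18 cycles):
 *   for (i = 0; i < n; i++) out[i] = (a[i] > b[i]) ? a[i] : b[i];
 * Branchless vector version: */
void vec_max(vector float *out, const vector float *a, const vector float *b, int nvec)
{
    int i;
    for (i = 0; i < nvec; i++) {
        vector unsigned int gt = spu_cmpgt(a[i], b[i]);  /* per-element mask: a > b */
        out[i] = spu_sel(b[i], a[i], gt);                /* pick a where the mask is set */
    }
}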

DMA Programming. Main idea: hide memory accesses with multibuffering. Compute on one buffer in the LS while writing back / reading in the next batch of data into the other buffer: like a completely software-controlled cache. Inter-core communication: mailboxes, signals, DMA. A double-buffering sketch follows.
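A minimal double-buffering sketch, under assumptions (this is not the course's skeleton code; names like compute and process_stream are made up): computation on one LS buffer overlaps with the DMA for the next chunk.

#include <spu_mfcio.h>

#define CHUNK 4096                          /* floats per DMA chunk */
static float buf[2][CHUNK] __attribute__((aligned(128)));

extern void compute(float *data, int n);    /* placeholder for the real kernel */

void process_stream(unsigned long long ea, int nchunks)
{
    int cur = 0, i;

    /* prefetch the first chunk */
    mfc_get(buf[cur], ea, sizeof(buf[0]), cur, 0, 0);

    for (i = 0; i < nchunks; i++) {
        int next = cur ^ 1;

        /* start fetching chunk i+1 into the other buffer (overlaps with compute) */
        if (i + 1 < nchunks)
            mfc_get(buf[next], ea + (unsigned long long)(i + 1) * sizeof(buf[0]),
                    sizeof(buf[0]), next, 0, 0);

        /* wait only for the buffer we are about to use */
        mfc_write_tag_mask(1 << cur);
        mfc_read_tag_status_all();

        compute(buf[cur], CHUNK);           /* work on chunk i */
        cur = next;
    }
}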

Tools for Cell Programming. IBM's Cell SDK 3.0: spu-gcc, ppu-gcc, and xlc compilers; a full-system simulator; libspe, the SPE runtime management library. Other tools: an assembly visualizer (useful because the SPEs are in-order); a single-source compiler; no OpenMP right now; third-party tools (from RapidMind, Mercury, etc.).

Program Design. Use knowledge of the architecture to build a model: back-of-the-envelope calculations. Cost of processing? Cost of communication? Trends? Limits? How close is the model to measured performance? What programming improvements can be made to fit the architecture better?

Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

Creating the PPE Program and SPE Threads. Each program consists of PPE and SPE sections. The program starts on the PPE; the PPE then creates SPE threads (a pthreads-based implementation, though not full pthreads). A PPE data structure keeps track of the SPE threads. A PPE/SPE shared data structure is used for argument passing: the X, Y, Z addresses, the thread id, and the returned cycle count. A hedged sketch of the PPE-side thread creation is shown below.
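A minimal sketch of the PPE side, assuming the libspe2 API from the Cell SDK; the actual skeleton code may differ, and names such as vecadd_spu, spe_arg_t, and start_spes are hypothetical:

/* PPE side: create one pthread per SPE and run the embedded SPE program in it. */
#include <libspe2.h>
#include <pthread.h>
#include <stdint.h>

extern spe_program_handle_t vecadd_spu;    /* SPE image embedded at link time */

typedef struct {
    uint64_t x_addr, y_addr, z_addr;       /* main-memory addresses of the arrays */
    uint32_t thread_id;
    uint32_t cycles;                       /* decrementer count returned by the SPE */
} __attribute__((aligned(128))) spe_arg_t;

typedef struct { spe_context_ptr_t ctx; spe_arg_t *arg; } spe_thread_t;

static void *spe_thread_run(void *p)
{
    spe_thread_t *t = (spe_thread_t *)p;
    unsigned int entry = SPE_DEFAULT_ENTRY;
    /* blocks in this pthread until the SPE program exits */
    spe_context_run(t->ctx, &entry, 0, t->arg, NULL, NULL);
    return NULL;
}

void start_spes(spe_thread_t *threads, pthread_t *ptids, int nspes)
{
    int i;
    for (i = 0; i < nspes; i++) {
        threads[i].ctx = spe_context_create(0, NULL);
        spe_program_load(threads[i].ctx, &vecadd_spu);
        pthread_create(&ptids[i], NULL, spe_thread_run, &threads[i]);
    }
}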

DMA Access. A get from main memory into the LS, followed by a wait on the tag group:
spu_writech(MFC_WrTagMask, -1);
spu_mfcdma64(ls_address, ea_high, ea_low, size_in_bytes, tag_id, MFC_GET_CMD);
spu_mfcstat(MFC_TAG_UPDATE_ALL);
Use my DMA_BL_GET, DMA_BL_PUT macros.
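The DMA_BL_GET / DMA_BL_PUT macros themselves are not shown in the slides; the following is a hypothetical sketch of what such blocking get/put macros might expand to, built from the same intrinsics:

#include <spu_intrinsics.h>
#include <spu_mfcio.h>

/* Hypothetical: issue the DMA, then block until its tag group completes. */
#define DMA_BL_GET(ls, ea_hi, ea_lo, size, tag)                              \
    do {                                                                     \
        spu_mfcdma64((ls), (ea_hi), (ea_lo), (size), (tag), MFC_GET_CMD);    \
        spu_writech(MFC_WrTagMask, 1 << (tag));                              \
        spu_mfcstat(MFC_TAG_UPDATE_ALL);                                     \
    } while (0)

#define DMA_BL_PUT(ls, ea_hi, ea_lo, size, tag)                              \
    do {                                                                     \
        spu_mfcdma64((ls), (ea_hi), (ea_lo), (size), (tag), MFC_PUT_CMD);    \
        spu_writech(MFC_WrTagMask, 1 << (tag));                              \
        spu_mfcstat(MFC_TAG_UPDATE_ALL);                                     \
    } while (0)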

Compiling. Compile the PPE and SPE programs separately. Details: specify the SPE program name and call it from the PPE; 32/64-bit builds (watch out for pointer sizes, etc.). The Cell SDK has sample Makefiles; we will use a simple Makefile.

Performance Evaluation: Timing. Performance measures: runtime, Gflop/s. Timing: each SPE has its own decrementer, which counts down at an independent, much lower frequency (roughly 80 MHz on the PS3; see the timebase in cat /proc/cpuinfo). Reset the counter to its highest value, then read it again after the measured region. Measure on each SPE? Average? Min? Max? Which one fits the real-world scenario best? A decrementer-based timing sketch follows.
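A sketch (an assumption, not the course skeleton) of timing a region with the SPE decrementer via channel intrinsics from spu_intrinsics.h; the function name is made up:

#include <spu_intrinsics.h>

/* The decrementer counts DOWN at the timebase frequency (~80 MHz on the PS3). */
unsigned int time_region(void (*work)(void))
{
    unsigned int start, end;

    spu_writech(SPU_WrDec, 0xFFFFFFFF);   /* reset the counter to its highest value */
    start = spu_readch(SPU_RdDec);

    work();                               /* region being measured */

    end = spu_readch(SPU_RdDec);
    return start - end;                   /* elapsed timebase ticks */
}
/* seconds = ticks / timebase_Hz (timebase reported by /proc/cpuinfo on the PPE) */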

Exercise 1: Add/Mul Two Arrays Goal: X[] += Y[] * Z[] Part 1: Infrastructure, understand skeleton code Part 2: Parallelization and vectorization (easy) Part 3: Hiding memory access costs

Part 1. Goal: understand the skeleton code; get the infrastructure up and running (compiler, basic code). Evaluate: scalar, sequential code performance. PPU's tasks: initialize the vectors in main memory; start up a thread for each SPU and let them run; verify/print the results and performance. Use only a single SPU. SPU's task: get (DMA) all 3 arrays from main memory; perform the computation; put (DMA) the result back to main memory; write the time back to the PPU. Your tasks: compile; transform the code; add timer code.

Part 2. Goal: parallelize across 4 SPEs (easy with the skeleton code) and vectorize X[] += Y[] * Z[] (easy). Evaluate: parallel code performance; vectorized parallel code performance. PPU: start up 4 SPU threads; performance evaluation: how? SPU: DMA-get, compute, and DMA-put only its own chunk; 4-way single-precision vectorization: (vector float) d = spu_madd(a,b,c). Your tasks: parallelize, vectorize, measure performance. A sketch of the vectorized SPU inner loop is shown below.
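A sketch of the vectorized SPU inner loop, under assumptions (the skeleton code's actual variable names and chunking will differ); each SPE runs this on the chunk it has DMA'd into its Local Store:

#include <spu_intrinsics.h>

/* X[] += Y[] * Z[], 4 floats per iteration via fused multiply-add */
void madd_chunk(vector float *x, const vector float *y,
                const vector float *z, int nvec)
{
    int i;
    for (i = 0; i < nvec; i++)
        x[i] = spu_madd(y[i], z[i], x[i]);   /* x = y * z + x, elementwise */
}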

Part 3. Goal: hide memory accesses. How?

Presentation Overview Cell Broadband Engine: Design Programming on the Cell Exercise: implement addition of vectors Wrap-up

Exercise Debriefing. How effectively did we use the architecture? Parallelization and vectorization are mandatory; memory overlapping makes a big difference. Do our optimizations work over a large range of sizes? Smaller problem sizes mean smaller packet sizes. Real-world problems (Fourier transform, WHT) are rarely embarrassingly parallel; what additional complexities do they bring?

WHT (Walsh-Hadamard Transform) on the Cell. Vectorization: as before. Parallelization: must be locality-aware! Explicit memory access (code provided): multibuffering? how? Inter-SPE data exchange: algorithms that generate large packet sizes? overlap? fast barrier.

WHT: Data Exchange. [Sequence of four figure-only slides illustrating the inter-SPE data exchange steps; diagrams not transcribed.]

DMA Issues. External multibuffering (streaming). Strategies by problem size: small/medium: data exchange on-chip, streaming; large: trickier, break the problem down into parts, use all memory banks.

Cell Philosophy. Do the Cell's design philosophies extend to other systems? Yes: the fundamental problems are the same. Distributed-memory computing (clusters, supercomputers): processing is faster than interconnects; larger packets achieve higher interconnect bandwidth. Multicore processors: the trend is toward NUMA, even on-chip; locality-aware parallelism.

Wrap-Up. Programming the Cell BE for high-performance computing. The Cell is a chip multiprocessor designed for HPC, with applications ranging from video gaming to supercomputers. The programming burden is a major factor in performance: parallelization, vectorization, memory handling. Automated tools yield limited performance, so programmers must understand the microarchitecture and its trade-offs to get performance (especially on the Cell).