Programming Examples that Expose Efficiency Issues for the Cell Broadband Engine Architecture William Lundgren Gedae), Rick Pancoast.

Slides:

Advertisements

Similar presentations

Systems and Technology Group © 2006 IBM Corporation Cell Programming Tutorial - JHD24 May 2006 Cell Programming Tutorial Jeff Derby, Senior Technical Staff.

Advertisements

An OpenCL Framework for Heterogeneous Multicores with Local Memory PACT 2010 Jaejin Lee, Jungwon Kim, Sangmin Seo, Seungkyun Kim, Jungho Park, Honggyu.

Intel Pentium 4 ENCM Jonathan Bienert Tyson Marchuk.

Instructor Notes We describe motivation for talking about underlying device architecture because device architecture is often avoided in conventional.

Implementation of 2-D FFT on the Cell Broadband Engine Architecture William Lundgren Gedae), Kerry Barnes (Gedae), James Steed (Gedae)

Cell Broadband Engine. INF5062, Carsten Griwodz & Pål Halvorsen University of Oslo Cell Broadband Engine Structure SPE PPE MIC EIB.

 Understanding the Sources of Inefficiency in General-Purpose Chips.

Using Cell Processors for Intrusion Detection through Regular Expression Matching with Speculation Author: C˘at˘alin Radu, C˘at˘alin Leordeanu, Valentin.

CML Vector Class on Limited Local Memory (LLM) Multi-core Processors Ke Bai Di Lu and Aviral Shrivastava Compiler Microarchitecture Lab Arizona State University,

11 1 Hierarchical Coarse-grained Stream Compilation for Software Defined Radio Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge Advanced Computer.

ELEC 6200, Fall 07, Oct 29 McPherson: Vector Processors1 Vector Processors Ryan McPherson ELEC 6200 Fall 2007.

Michael A. Baker, Pravin Dalale, Karam S. Chatha, Sarma B. K. Vrudhula

Active Messages: a Mechanism for Integrated Communication and Computation von Eicken et. al. Brian Kazian CS258 Spring 2008.

CUDA Programming Lei Zhou, Yafeng Yin, Yanzhi Ren, Hong Man, Yingying Chen.

State Machines Timing Computer Bus Computer Performance Instruction Set Architectures RISC / CISC Machines.

Synergistic Processing In Cell’s Multicore Architecture Michael Gschwind, et al. Presented by: Jia Zou CS258 3/5/08.

Embedded Computer Architecture 5KK73 MPSoC Platforms Part2: Cell Bart Mesman and Henk Corporaal.

0 HPEC 2010 Automated Software Cache Management.

Kathy Grimes. Signals Electrical Mechanical Acoustic Most real-world signals are Analog – they vary continuously over time Many Limitations with Analog.

Programming the Cell Multiprocessor Işıl ÖZ. Outline Cell processor – Objectives – Design and architecture Programming the cell – Programming models CellSs.

Cell Architecture. Introduction The Cell concept was originally thought up by Sony Computer Entertainment inc. of Japan, for the PlayStation 3 The architecture.

© 2005 Mercury Computer Systems, Inc. Yael Steinsaltz, Scott Geaghan, Myra Jean Prelle, Brian Bouzas,

Gedae, Inc. Implementing Modal Software in Data Flow for Heterogeneous Architectures James Steed, Kerry Barnes, William Lundgren Gedae, Inc.

High Performance Linear Transform Program Generation for the Cell BE

Gedae Portability: From Simulation to DSPs to the Cell Broadband Engine James Steed, William Lundgren, Kerry Barnes Gedae, Inc

Software Pipelining for Stream Programs on Resource Constrained Multi-core Architectures IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEM 2012 Authors:

National Center for Supercomputing Applications University of Illinois at Urbana-Champaign Cell processor implementation of a MILC lattice QCD application.

CIS250 OPERATING SYSTEMS Memory Management Since we share memory, we need to manage it Memory manager only sees the address A program counter value indicates.

© 2007 SET Associates Corporation SAR Processing Performance on Cell Processor and Xeon Mark Backues, SET Corporation Uttam Majumder, AFRL/RYAS.

Seunghwa Kang David A. Bader Optimizing Discrete Wavelet Transform on the Cell Broadband Engine.

Simple, Efficient, Portable Decomposition of Large Data Sets William Lundgren Gedae), David Erb (IBM), Max Aguilar (IBM), Kerry Barnes.

1/30/2003 BARC1 Profile-Guided I/O Partitioning Yijian Wang David Kaeli Electrical and Computer Engineering Department Northeastern University {yiwang,

Kevin Eady Ben Plunkett Prateeksha Satyamoorthy.

HPEC SMHS 9/24/2008 MIT Lincoln Laboratory Large Multicore FFTs: Approaches to Optimization Sharon Sacco and James Geraci 24 September 2008 This.

Parallelization and Characterization of Pattern Matching using GPUs Author: Giorgos Vasiliadis 、 Michalis Polychronakis 、 Sotiris Ioannidis Publisher:

Group May Bryan McCoy Kinit Patel Tyson Williams.

IBM Research © 2008 Feeding the Multicore Beast: It’s All About the Data! Michael Perrone IBM Master Inventor Mgr, Cell Solutions Dept.

Summary Background –Why do we need parallel processing? Moore’s law. Applications. Introduction in algorithms and applications –Methodology to develop.

Optimization of Collective Communication in Intra- Cell MPI Optimization of Collective Communication in Intra- Cell MPI Ashok Srinivasan Florida State.

Sep 08, 2009 SPEEDUP – Optimization and Porting of Path Integral MC Code to New Computing Architectures V. Slavnić, A. Balaž, D. Stojiljković, A. Belić,

LLMGuard: Compiler and Runtime Support for Memory Management on Limited Local Memory (LLM) Multi-Core Architectures Ke Bai and Aviral Shrivastava Compiler.

Sam Sandbote CSE 8383 Advanced Computer Architecture The IBM Cell Architecture Sam Sandbote CSE 8383 Advanced Computer Architecture April 18, 2006.

High Performance Computing Group Feasibility Study of MPI Implementation on the Heterogeneous Multi-Core Cell BE TM Architecture Feasibility Study of MPI.

Hardware Benchmark Results for An Ultra-High Performance Architecture for Embedded Defense Signal and Image Processing Applications September 29, 2004.

Department of Computer Science MapReduce for the Cell B. E. Architecture Marc de Kruijf University of Wisconsin−Madison Advised by Professor Sankaralingam.

LYU0703 Parallel Distributed Programming on PS3 1 Huang Hiu Fung Wong Chung Hoi Supervised by Prof. Michael R. Lyu Department of Computer.

Gedae, Inc. Gedae: Auto Coding to a Virtual Machine Authors: William I. Lundgren, Kerry B. Barnes, James W. Steed HPEC 2004.

Pipelined and Parallel Computing Partition for 1 Hongtao Du AICIP Research Dec 1, 2005 Part 2.

Flexible Filters for High Performance Embedded Computing Rebecca Collins and Luca Carloni Department of Computer Science Columbia University.

Optimizing Ray Tracing on the Cell Microprocessor David Oguns.

Comparison of Cell and POWER5 Architectures for a Flocking Algorithm A Performance and Usability Study CS267 Final Project Jonathan Ellithorpe Mark Howison.

1 VSIPL++: Parallel Performance HPEC 2004 CodeSourcery, LLC September 30, 2004.

Presented by Jeremy S. Meredith Sadaf R. Alam Jeffrey S. Vetter Future Technologies Group Computer Science and Mathematics Division Research supported.

Aarul Jain CSE520, Advanced Computer Architecture Fall 2007.

FFTC: Fastest Fourier Transform on the IBM Cell Broadband Engine David A. Bader, Virat Agarwal.

KERRY BARNES WILLIAM LUNDGREN JAMES STEED

C.E. Goutis V.I.Kelefouras University of Patras Department of Electrical and Computer Engineering VLSI lab Date: 20/11/2015 Compilers for Embedded Systems.

Gedae Portability: From Simulation to DSPs to the Cell Broadband Engine James Steed, William Lundgren, Kerry Barnes Gedae, Inc

XRD data analysis software development. Outline  Background  Reasons for change  Conversion challenges  Status 2.

Institute of Software,Chinese Academy of Sciences An Insightful and Quantitative Performance Optimization Chain for GPUs Jia Haipeng.

IBM Cell Processor Ryan Carlson, Yannick Lanner-Cusin, & Cyrus Stoller CS87: Parallel and Distributed Computing.

1/21 Cell Processor Systems Seminar Diana Palsetia (11/21/2006)

Parallel Programming By J. H. Wang May 2, 2017.

High Performance Computing on an IBM Cell Processor --- Bioinformatics

Embedded Systems Design

Accelerating PFA FFT: Performance Comparison

Chapter 8: Memory management

Outline Module 1 and 2 dealt with processes, scheduling and synchronization Next two modules will deal with memory and storage Processes require data to.

VSIPL++: Parallel Performance HPEC 2004

Mapping the FFT Algorithm to the IBM Cell Processor

Presentation transcript:

Programming Examples that Expose Efficiency Issues for the Cell Broadband Engine Architecture William Lundgren Gedae), Rick Pancoast (Lockheed Martin), David Erb (IBM), Kerry Barnes (Gedae), James Steed (Gedae) HPEC 2007

Introduction Cell Broadband Engine (Cell/B.E.) Processor Programming Challenges –Distributed control –Distributed memory –Dependence on alignment for performance Synthetic Aperture Radar (SAR) benchmark Gedae is used to perform the benchmark If programming challenges can be addressed, great performance is possible –116X improvement over quad 500MHz PowerPC board 2

Cell/B.E. Architecture Power Processing Element (PPE) Eight Synergistic Processing Elements (SPE) –4 SIMD ALUs –DMA Engines –256 kB Local Storage (LS) System Memory –25 GB/s Element Interconnect Bus (EIB) –Over 200 GB/s 3

Gedae Addresses the Software Architecture Software architecture defines how software is distributed across processors like the SPEs Optimizing the software architecture requires a global view of the application This global view cannot be obstructed by libraries Software Architecture Libraries Source Code Compiler 4

Vector & IPC Libraries Gedae’s Approach is Automation The functional specification is specified by the programming language The implementation specification defines how the functionality is mapped to the HW (i.e., the software architecture) Automation, via the compiler, forms the multithreaded application 5 Implementation Specification Functional Specification Multicore Hardware Thread Manager Threaded Application Compiler

Synthetic Aperture Radar Algorithm

Stages of SAR Algorithm Partition –Distribute the matrix to multiple PEs Range –Compute intense operation on the rows of the matrix Corner Turn –Distributed matrix transpose Azimuth –Compute intense operation on the rows of [ M(i-1) M(i) ] Concatenation –Combine results from the PEs for display 7

Stages Execute Sequentially 8 Distributed Transpose Range processing Azimuth processing

SAR Performance Platforms used –Quad 500 MHz PowerPC AltiVec Board –IBM QS20 Cell/B.E. Blade Server (using 8 SPEs at 3.2 GHz) Comparison of large SAR throughput –Quad PowerPC Board3 images/second –IBM QS images/second Maximum achieved performance on IBM QS20 9 Algorithm8 SPEs Range112.9 GFLOPS Corner Turn12.88 GB/s Azimuth128.4 GFLOPS TOTAL SAR97.94 GFLOPS

Synthetic Aperture Radar Algorithm Tailoring to the Cell/B.E.

Processing Large Images Large images do not fit in LS –Size of each SPE’s LS: 256 kB –Size of example image: 2048 x 512 x 4 B/w = 4 MB Store large data sets in system memory Coordinate movement of data between system memory and LS 11 System Memory 256 MB on Sony Playstation 3 1 GB on IBM QS20 LS 25GB/s

Strip Mining Strip mine data from system memory –DMA pieces of image (rows or tiles) to LS of SPE –Process pieces of image –DMA result back to system memory 12 Data for Processor 0 Data for Processor 1 Data for Processor 2 Data for Processor 7 … System Memory SPU SPE 0 LS

Gedae Automated Implementation of Strip Mining Unmapped memory type –Platform independent of specifying memory outside of the PE’s address space, such as system memory Gedae can adjust the granularity –Up to increase vectorization –Down to reduce memory use Specify rowwise processing of a matrix as vector operations –Use matrix-to-vector and vector-to-matrix boxes to convert –Gedae can adjust the implementation to accommodate the processor 13

Synthetic Aperture Radar Algorithm Range Processing

Break matrix into sets of rows Triple buffering used so new data is always available for processing 15 TimeDMA to LSProcessDMA from LS 0Subarray 0 –> Buf 0 1Subarray 1 –> Buf 1Buf 0 2Subarray 2 –> Buf 2Buf 1Buf 0 –> Subarray 0 3Subarray 3 –> Buf 0Buf 2Buf 1 –> Subarray 1 4Subarray 4 –> Buf 1Buf 0Buf 2 –> Subarray 2 Repeat pattern NBuf 1Buf 0 –> Subarray N-2 N+1Buf 1 –> Subarray N-1 System Memory SPE LS 3 Buffers Last Current Next SPU

16 Implementation of Range Get rows from system memory Range processing Put rows into system memory

Trace Table for Range Processing Vector routines –FFT (2048) 5.62us –Real/complex vector multiply (2048) 1.14us Communication –Insert 0.91us –Extract 0.60us Total –8.27us per strip –529us per frame –832us measured Scheduling overhead –303us per frame –256 primitive firings 17

Scheduling Overhead Gaps between black boxes are –Static scheduling overhead: determine next primitive in current thread –Dynamic scheduling overhead: determine next thread Static scheduling overhead will be removed by automation 18

Synthetic Aperture Radar Algorithm Corner Turn

Distributed Corner Turn Break matrix into tiles Assign each SPU a set of tiles to transpose Four buffers in LS, two inputs and two outputs 20 0,20,3 TimeDMA to LSProcessDMA from LS 00,0 –> Buf0 10,1 –> Buf1Buf0 –> Buf2 20,2 –> Buf0Buf1 –> Buf3Buf2 –> 0,0 30,3 –> Buf1Buf0 –> Buf2Buf3 –> 1,0 40,4 –> Buf0Buf1 –> Buf3Buf2 –> 2,0 Repeat pattern R*C-1 Buf2 –> R-1,C-1 2,0 0,1 1,0 0,3 0,2 SPU Input array in system memory Output array in system memory

21 Implementation of Corner Turn Get tiles from system memory Transpose tiles on SPUs Put tiles into system memory

Trace Table for Corner Turn Vector routine –Matrix transpose (32x32) 0.991us Transfer –Vary greatly due to contention Total –1661us measured Scheduling overhead –384 primitive firings 22

Synthetic Aperture Radar Algorithm Azimuth Processing

Double buffering of data in system memory provides M(i-1) and M(i) for azimuth processing Triple buffering in LS allows continuous processing Output DMA’ed to separate buffer in system memory 24 Input arrays in system memory SPE LS – 3 Buffers Output array in system memory SPU

25 Implementation of Azimuth Get tiles from system memory Azimuth processing Put tiles into system memory

Trace Table for Azimuth Processing Vector routines –FFT/IFFT (1024) 2.71us –Complex vector multiply (1024) 0.618us Communication –Insert 0.229us –Get 0.491us Total –6.76us per strip –1731us per frame –2058us measured Scheduling overhead –327us per frame –1280 primitive firings 26

Synthetic Aperture Radar Algorithm Implementation Settings

Distribution to Processors Partition Table is used to group primitives Map Partition Table is used to assign partitions to processors 28 Map to processor numbers Group primitives by partition name

Implementation of Strip Mining Subscheduling is Gedae’s method of applying strip mining Gedae applies maximum amount of strip mining –Low granularity –Low memory use User can adjust the amount of strip mining to increase the vectorization 29 Result of automated strip mining

Synthetic Aperture Radar Algorithm Efficiency Considerations

Distributed Control SPEs are very fast compared to the PPE –SPEs can perform 25.6 GFLOPS at 3.2 GHz –PPE can perform 6.4 GFLOPS at 3.2 GHz PPE can be a bottleneck Minimize use of PPE –Do not use the PPE to control the SPEs –Distribute control amongst the SPEs Gedae automatically implements distributed control 31

Alignment Issues Misalignment can make a large impact in performance Input and output of DMA transfers must have same alignment Gedae automatically enforces proper alignment to the extent possible 32 Good Bad

Alignment in DMA List Destination of DMA List transfers are –Contiguous –On 16 byte boundaries 33 Not Possible with DMA List Possible but not Always Useful SysMem LS

Implications to Image Partitioning Rowwise partitioning –Rows should be 16 byte multiples Tile partitioning –Tile dimensions should be 16 byte multiples Inefficient – source not 16 B aligned Inefficient – source and destination for each row have different alignments SysMem LS SysMem LS

Evidence of Contention Performance of 8 SPE implementation is only 50% faster than 4 SPE implementation Sending tiles between system memory and LS is acting like a bottleneck Histogram of tile get/insert shows more variation in 8 SPE execution 35 Corner Turn - 4 Processors Corner Turn - 8 Processors

Summary Great performance and speedup can be achieved by moving algorithms to the Cell/B.E. processor That performance cannot be achieved without knowledge and a plan of attack on how to handle –Streaming processing through the SPE’s LS without involving the PPE –Using the system memory and the SPEs’ LS in concert –Use of all the SIMD ALU on the SPEs –Compensating for alignment in both vector processing and transfers Gedae can help mitigate the risk of moving to the Cell/B.E. processor by automating the plan of attack for these tough issues 36