Intro
This talk will focus on the Cell processor
– Cell Broadband Engine Architecture (CBEA)
  Power Processing Element (PPE)
  Synergistic Processing Element (SPE)
– Current implementations
  Sony Playstation 3 (1 chip with 6 SPEs)
  IBM Blades (2 chips with 8 SPEs each)
  Toshiba SpursEngine (1 chip with 4 SPEs)
Future work will try to include GPUs & Larrabee
Two Topics in One
Accelerators (Accel) …this is going to hurt…
Heterogeneous systems (Hetero) …kill me now…
Goal of this work: take away the pain and make code portable
Code examples
Why Use Accelerators?
Performance
Why Not Use Accelerators?
Hard to program
– Many architecture-specific details
  Different ISAs between core types
  Explicit DMA transactions to transfer data to/from the SPEs' local stores
  Scheduling of work and communication
– Code is not trivially portable
  The structure of code on an accelerator often does not match that of a commodity architecture
  A simple re-compile is not sufficient
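To give a flavor of what "explicit DMA transactions" means in practice, here is a minimal sketch of SPE-side code using the standard MFC intrinsics from spu_mfcio.h (compiled with spu-gcc); the buffer, function name, and tag choice are illustrative, not from the talk:

  #include <spu_mfcio.h>

  /* Hypothetical 16KB buffer in the SPE's 256KB local store.
     DMA addresses and sizes must be properly aligned. */
  volatile char localBuf[16384] __attribute__((aligned(128)));

  void fetch_chunk(unsigned long long ea) {
    const unsigned int tag = 3;               /* DMA tag group (0-31) */
    mfc_get(localBuf, ea, sizeof(localBuf), tag, 0, 0); /* main memory -> local store */
    mfc_write_tag_mask(1 << tag);             /* select which tag(s) to wait on */
    mfc_read_tag_status_all();                /* block until the transfer completes */
  }

Every data movement to or from an SPE has to be managed this way by hand, which is exactly the burden the extensions below aim to remove.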
Extensions
Extensions added to Charm++:
– Accelerated entry methods
– Accelerated blocks
– SIMD instruction abstraction
Extensions should be portable between architectures
Accelerated Entry Methods
Executed on an accelerator if one is present
Target computationally intensive code
Structure based on standard entry methods
– Data dependencies expressed via messages
– Code is self-contained
Managed by the runtime system
– DMAs automatically overlapped with work on the SPEs
– Scheduled based on data dependencies (messages, objects)
– Multiple independently written portions of code share the same SPE (link to multiple accelerated libraries)
Accel Entry Method Structure

entry [accel] void entryName( …passed parameters… )
                            [ …local parameters… ]
{
  … function body …
} callback_member_function;

Invocation: objProxy.entryName( …passed parameters… );
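As a concrete (hypothetical) instance of this structure, a .ci declaration might look like the following; the name doWork, its parameters, and the myId member are illustrative, and the local-parameter syntax follows the "HelloWorld" example later in the talk:

  // Hypothetical .ci declaration; names are illustrative, not from the talk.
  entry [accel] void doWork( int n, float data[n] )   // passed parameters (marshaled, DMAed in)
    [ readonly : int myId <impl_obj->myId> ]          // local parameter read from the object
  {
    for (int i = 0; i < n; i++)                       // function body runs on the accelerator
      data[i] *= 2.0f;
  } doWork_callback;                                  // member function invoked on completion

  // Invoked like any other entry method:
  //   objProxy.doWork(n, data);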
Accelerated Blocks
Additional code that is accessible to accelerated entry methods
– #include directives
– Functions called by accelerated entry methods
SIMD Abstraction
Abstract SIMD instructions supported by multiple architectures
– Currently adding support for: SSE (x86), AltiVec (PowerPC; PPE), SIMD instructions on the SPEs
– Generic C implementation when no direct architectural support is present
– Types: vec4f, vec2lf, vec4i, etc.
– Operations: vadd4f, vmul4f, vsqrt4f, etc.
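A minimal sketch of how such an abstraction might be used, employing only the type/operation names from the slide (vec4f, vadd4f, vmul4f); it assumes vec4f is 16 bytes wide, that aligned float arrays may be accessed through vec4f pointers, and that n is a multiple of 4 — all assumptions for illustration:

  void saxpy4(int n, float a, const float *x, float *y) {
    float aArr[4] __attribute__((aligned(16))) = { a, a, a, a };
    vec4f va = *(const vec4f *)aArr;       /* replicate the scalar into all 4 lanes */
    const vec4f *vx = (const vec4f *)x;
    vec4f *vy = (vec4f *)y;
    for (int i = 0; i < n / 4; i++)
      vy[i] = vadd4f(vmul4f(va, vx[i]), vy[i]);   /* y = a*x + y, 4 floats at a time */
  }

The same source compiles to SSE, AltiVec, SPE, or plain C code depending on which backing implementation the headers select.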
“HelloWorld” Code

hello.ci:

mainmodule hello {
  …
  accelblock {
    void sayMessage(char* msg, int thisIndex, int fromIndex) {
      printf("%d told %d to say \"%s\"\n", fromIndex, thisIndex, msg);
    }
  };
  array [1D] Hello {
    entry Hello(void);
    entry [accel] void saySomething( int msgLen, char msg[msgLen], int fromIndex )
      [ readonly : int thisIndex <impl_obj->thisIndex> ]
    {
      sayMessage(msg, thisIndex, fromIndex);
    } saySomething_callback;
  };
};

Hello.C:

class Main : public CBase_Main {
  Main(CkArgMsg* m) {
    CkPrintf("Running Hello on %d processors for %d elements\n",
             CkNumPes(), nElements);
    char *msg = "Hello from Main";
    arr[0].saySomething(strlen(msg) + 1, msg, -1);
  }
  void done(void) {
    CkPrintf("All done\n");
    CkExit();
  }
};

class Hello : public CBase_Hello {
  void saySomething_callback() {
    if (thisIndex < nElements - 1) {
      char msgBuf[128];
      int msgLen = sprintf(msgBuf, "Hello from %d", thisIndex) + 1;
      thisProxy[thisIndex + 1].saySomething(msgLen, msgBuf, thisIndex);
    } else {
      mainProxy.done();
    }
  }
};
“HelloWorld” Output

On x86:
Running Hello on 1 processors for 5 elements
-1 told 0 to say "Hello from Main"
0 told 1 to say "Hello from 0"
1 told 2 to say "Hello from 1"
2 told 3 to say "Hello from 2"
3 told 4 to say "Hello from 3"
All done

On a Cell blade:
SPE reported _end = 0x
Running Hello on 1 processors for 5 elements
-1 told 0 to say "Hello from Main"
0 told 1 to say "Hello from 0"
1 told 2 to say "Hello from 1"
2 told 3 to say "Hello from 2"
3 told 4 to say "Hello from 3"
All done
MD Example Code
Particles are evenly divided into equal-sized patches
– Compute objects calculate forces
  Coulomb's Law
  Single-precision floating point
– Patches sum forces and update particle data
– All particles interact with all other particles each timestep
~92K particles (similar to the ApoA1 benchmark)
Uses the SIMD abstraction for all versions
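For orientation, a plain scalar sketch of the pairwise Coulomb's-law computation each compute object performs; the actual kernel is vectorized with the SIMD abstraction, and the Particle layout and COULOMB constant here are illustrative:

  #include <math.h>

  struct Particle { float x, y, z, q; };   /* position and charge */

  /* Accumulate forces on patch 0's particles from all of patch 1's particles. */
  void computeForces(int n0, const Particle *p0, float (*f0)[3],
                     int n1, const Particle *p1) {
    const float COULOMB = 332.0636f;       /* kcal*A/(mol*e^2), typical MD units */
    for (int i = 0; i < n0; i++) {
      for (int j = 0; j < n1; j++) {
        float dx = p0[i].x - p1[j].x;
        float dy = p0[i].y - p1[j].y;
        float dz = p0[i].z - p1[j].z;
        float r2 = dx*dx + dy*dy + dz*dz;
        float rInv = 1.0f / sqrtf(r2);     /* ~ vsqrt4f in the SIMD version */
        float s = COULOMB * p0[i].q * p1[j].q * rInv * rInv * rInv;
        f0[i][0] += s * dx;  f0[i][1] += s * dy;  f0[i][2] += s * dz;
      }
    }
  }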
MD Example Code
Speedups (vs. 1 x86 core using SSE):
– 6 x86 cores: 5.89
– 1 QS20 chip (8 SPEs): 5.74
GFlops/sec for 1 QS20 chip:
– 50.1 GFlops/sec observed (24.4% of peak)
– Limited by the nature of the code (a single inner-loop iteration): the inner loop performs 124 flops using 54 instructions in 56 cycles
– Sequential code executing continuously can therefore achieve at most 56.7 GFlops/sec (27.7% of peak)
– We observe 88.4% of the ideal GFlops/sec for this code
178.2 GFlops/sec using 4 QS20s (net-linux layer)
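These figures are mutually consistent, assuming the standard 3.2 GHz Cell clock and the SPEs' single-precision peak of 25.6 GFlops/sec each:

  (124 flops / 56 cycles) × 3.2 GHz × 8 SPEs ≈ 56.7 GFlops/sec (the inner-loop bound)
  chip peak = 8 SPEs × 25.6 GFlops/sec = 204.8 GFlops/sec
  56.7 / 204.8 ≈ 27.7%    50.1 / 204.8 ≈ 24.4%    50.1 / 56.7 ≈ 88.4%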
Projections
Why Heterogeneous?
Trend towards specialized accelerator cores mixed with general-purpose cores
– #1 supercomputer on the Top500 list: Roadrunner at LANL (Cell & x86)
– Lincoln Cluster at NCSA (x86 & GPUs)
Aging workstations that are loosely clustered
Hetero System View
Messages Across Architectures
Makes use of Pack-UnPack (PUP) routines
– Object migration and parameter-marshaled entry methods work the same as before
– Custom pack/unpack routines for messages can use the PUP framework
Supported machine layers:
– net-linux
– net-linux-cell
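For reference, a minimal sketch of a PUP routine using the standard Charm++ PUP::er interface; the Patch class and its fields are illustrative. The same routine sizes, packs, and unpacks the object, which is what lets the runtime serialize it (or a message built the same way) consistently across machine layers:

  #include "pup.h"
  #include "pup_stl.h"   /* PUP support for STL containers */
  #include <vector>

  class Patch /* : public CBase_Patch in a full Charm++ program */ {
    int nParticles;
    std::vector<float> x, y, z;
   public:
    void pup(PUP::er &p) {
      p | nParticles;          /* primitive field */
      p | x; p | y; p | z;     /* STL vectors via pup_stl.h */
    }
  };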
Making Hetero Runs
Launch using charmrun
– Compile a separate binary for each architecture
– Modified nodelist files specify the correct binary based on architecture
Hetero “Hello World” Example

Nodelist:

group main ++shell "ssh -X"
host kaleblade ++pathfix __arch_dir__ net-linux
host blade_1 ++pathfix __arch_dir__ net-linux-cell
host ps3_1 ++pathfix __arch_dir__ net-linux-cell

Accelblock change in hello.ci (just for demonstration):

accelblock {
  void sayMessage(char* msg, int thisIndex, int fromIndex) {
    #if CMK_CELL_SPE != 0
      char *coreType = "SPE";
    #elif CMK_CELL != 0
      char *coreType = "PPE";
    #else
      char *coreType = "GEN";
    #endif
    printf("[%s] :: %d told %d to say \"%s\"\n",
           coreType, fromIndex, thisIndex, msg);
  }
};

Launch command:

./charmrun ++nodelist ./nodelist_hetero +p3 ~/charm/__arch_dir__/examples/charm++/cell/hello/hello 10

Output:

Running Hello on 3 processors for 10 elements
[GEN] :: -1 told 0 to say "Hello from Main"
[SPE] :: 0 told 1 to say "Hello from 0"
[SPE] :: 1 told 2 to say "Hello from 1"
[GEN] :: 2 told 3 to say "Hello from 2"
[SPE] :: 3 told 4 to say "Hello from 3"
[SPE] :: 4 told 5 to say "Hello from 4"
[GEN] :: 5 told 6 to say "Hello from 5"
[SPE] :: 6 told 7 to say "Hello from 6"
[SPE] :: 7 told 8 to say "Hello from 7"
[GEN] :: 8 told 9 to say "Hello from 8"
All done
Summary
Development still in progress (both topics)
Addition of accelerator extensions
– Example codes in the Charm++ distribution (the nightly build)
– Achieve good performance
Heterogeneous system support
– Simple example codes running
– Not in the public Charm++ distribution yet
Credits
Work partially supported by NIH grant PHS 5 P41 RR (Biophysics / Molecular Dynamics)
Cell hardware supplied by an IBM SUR grant awarded to the University of Illinois
Background Playstation controller image originally taken by "wlodi" on Flickr and modified by David Kunzman