Programming on IBM Cell Triblade
Jagan Jayaraj, Pei-Hung Lin, Mike Knox and Paul Woodward
University of Minnesota
April 1, 2009

Rayleigh–Taylor instability
An instability of the interface between two fluids of different densities, which occurs when the lighter fluid pushes the heavier fluid.
The simulation uses the multi-fluid Piecewise-Parabolic Method (PPM) to model the R-T instability.
The program is written in Fortran.

TriBlade
▫ Two QS22 blades, each with 2 PowerXCell 8i CPUs
▫ LS21 blade with two dual-core AMD Opterons
▫ 16 GB memory for the LS21 and 8 GB memory for the QS22

LCSE Cell Cluster
6 Triblades
4 QS22 Cell blades
2 QS20 Cell blades
4 AMD quad-core systems

Login instructions
Account credentials should be in your email.
Guest account: lcse / lcse$ncsa!
Login steps:
▫ SSH to frodo.lcse.umn.edu
▫ Once logged in to frodo, SSH to an assigned Cell processor host
  ◦ AMD – rra001a ~ rra006a
  ◦ Cell – rra001b / rra001c ~ rra006b / rra006c

Software available
Cell SDK 3.1
OpenMPI 1.3
DaCS Fortran bindings
Compilers:
▫ AMD: gfortran, gcc
▫ PPU: ppuxlf, ppu-gcc
▫ SPU: spuxlf, spu-gcc
Example code is available on /mnt/scratch/NCSA_Example

Compilation and Execution
On the AMD node:
▫ make ppm4f-x86
On the Cell node:
▫ make ppm4f-ppu
On the AMD node:
▫ ./ppm4f-x86

Triblade programming paradigm
Three levels of parallelism:
▫ within-Cell
▫ within-node
▫ node-to-node
Compute-communication overlap:
▫ DMA
▫ DaCS
▫ MPI
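As a concrete illustration of compute-communication overlap at the innermost level, the sketch below double-buffers DMA transfers on an SPU using the MFC intrinsics from the Cell SDK (spu_mfcio.h). The chunk size, tag choices, and the compute() routine are illustrative assumptions, not taken from the PPM code.

/* Double-buffered streaming on one SPU: while chunk i is being computed,
   the DMA engine fetches chunk i+1 into the other buffer. */
#include <spu_mfcio.h>

#define CHUNK 4096                          /* bytes per chunk (multiple of 128) */

static volatile char buf[2][CHUNK] __attribute__((aligned(128)));
static double checksum;                     /* stand-in for real PPM work */

static void compute(volatile char *p)       /* illustrative compute kernel */
{
    int j;
    for (j = 0; j < CHUNK; j++)
        checksum += p[j];
}

void process_stream(unsigned long long ea, int nchunks)
{
    int i, cur = 0;
    mfc_get(buf[0], ea, CHUNK, 0, 0, 0);    /* prefetch the first chunk, tag 0 */
    for (i = 0; i < nchunks; i++) {
        int nxt = cur ^ 1;
        if (i + 1 < nchunks)                /* start fetching the next chunk */
            mfc_get(buf[nxt], ea + (unsigned long long)(i + 1) * CHUNK,
                    CHUNK, nxt, 0, 0);
        mfc_write_tag_mask(1 << cur);       /* wait only for the current chunk */
        mfc_read_tag_status_all();
        compute(buf[cur]);                  /* overlaps the DMA issued above */
        cur = nxt;
    }
}

The same overlap idea applies one level up with DaCS transfers between the Opterons and the Cells, and at the top level with MPI between nodes.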

Programming for IBM Cell Tri-blade
Single code for Roadrunner and non-RR systems
◦ Using lots of #ifdef, #if, #endif… (see the sketch below)
◦ Using the preprocessor to generate three codes
Minimize the manual translation for SPU code
◦ Using a Fortran to Cell C translator
  ▫ Tedious portions of the SPU code can be translated.
Fortran codes for PPU and AMD
◦ Fortran binding programs for the C intrinsic libraries
Keep the memory footprint small
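A minimal sketch of the single-source idea. The real single source is Fortran run through the preprocessor; the C-style directives and the macro names (TARGET_SPU, TARGET_PPU, TARGET_X86) below are assumptions used only to show how one file can yield three target-specific codes.

/* One source file, three targets: the preprocessor keeps exactly one branch,
   so three different codes come out of the same file. Macro names are
   illustrative, not the ones used in the PPM source. */
#if defined(TARGET_SPU)
  #include <spu_intrinsics.h>
  typedef vector float vreal;          /* SPU path: 128-bit SIMD vectors      */
#elif defined(TARGET_PPU)
  typedef float vreal[4];              /* PPU path: data movement, SPU control */
#elif defined(TARGET_X86)
  typedef float vreal[4];              /* AMD path: built with the GNU tools   */
#else
  #error "define TARGET_SPU, TARGET_PPU, or TARGET_X86"
#endif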

Single source code → Preprocessor → PPU Fortran code, SPU Fortran code, AMD Fortran code
SPU Fortran code → Translation → SPU C code and Fortran binding programs
SPU C code → SPU C compiler → SPU executable
PPU Fortran code → PPU Fortran compiler → PPU executable
AMD Fortran code → GNU Fortran compiler → AMD executable
The SPU executable is embedded in the PPU executable.

Division of labor
▫ Define the jobs for AMD, PPU and SPU clearly
  ◦ AMD: I/O, MPI, relay data to the Cell…
  ◦ PPU: transfer data, manage the SPUs (see the sketch below)
  ◦ SPU: just compute
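A minimal PPU-side sketch of "manage the SPUs", using libspe2 from the Cell SDK. The embedded program handle name (ppm4f_spu) is an assumption and error handling is omitted; in practice each SPU context would run in its own PPU thread.

#include <libspe2.h>

extern spe_program_handle_t ppm4f_spu;      /* handle created when the SPU
                                               executable is embedded into the
                                               PPU executable (ppu-embedspu) */

int run_one_spu(void *argp)
{
    unsigned int entry = SPE_DEFAULT_ENTRY;
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    if (ctx == NULL)
        return -1;
    spe_program_load(ctx, &ppm4f_spu);      /* load the embedded SPU image     */
    spe_context_run(ctx, &entry, 0, argp,   /* blocks until the SPU finishes   */
                    NULL, NULL);
    spe_context_destroy(ctx);
    return 0;
}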

Items to take care of
▫ Three codes for three different ISAs
▫ Different endianness between the PPU and the AMD
  ◦ Need to do byte swapping (see the sketch below)
▫ 64-bit/32-bit conversion
  ◦ The SPU supports 32-bit addresses only, but DaCS requires 64-bit address mode
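A small sketch of the byte swapping needed at the PPU (big-endian) / Opteron (little-endian) boundary, written with plain shifts so it compiles with any of the compilers listed earlier; the function names are illustrative, not from the PPM code.

#include <stdint.h>
#include <stddef.h>

/* Reverse the byte order of one 32-bit word. */
static inline uint32_t bswap32(uint32_t x)
{
    return (x >> 24) | ((x >> 8) & 0x0000ff00u) |
           ((x << 8) & 0x00ff0000u) | (x << 24);
}

/* Swap every 32-bit word of a buffer in place, e.g. around the
   transfers between the QS22 and LS21 sides of the TriBlade. */
void bswap32_buffer(uint32_t *buf, size_t nwords)
{
    size_t i;
    for (i = 0; i < nwords; i++)
        buf[i] = bswap32(buf[i]);
}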

Translator
Fortran to C with Cell extensions
Needs directives
Built with ANTLR
Handles:
▫ Vector and scalar loops
▫ DMAs (including list DMAs)
▫ Variable declarations
▫ Conditional vector moves
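As an example of the last item, a Fortran conditional vector move such as a(i) = merge(b(i), c(i), x(i) > y(i)) can be mapped onto a branch-free SPU select. The sketch below shows the shape of such generated C; the variable names are illustrative, not output of the actual translator.

#include <spu_intrinsics.h>

/* Conditional vector move: pick b where x > y, otherwise c, one whole
   4-wide vector at a time and without any branch. */
vector float cond_move(vector float b, vector float c,
                       vector float x, vector float y)
{
    vector unsigned int mask = spu_cmpgt(x, y);  /* all-ones lanes where x > y   */
    return spu_sel(c, b, mask);                  /* bitwise select: b where mask */
}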

References
Woodward, P. R., J. Jayaraj, P.-H. Lin, and P.-C. Yew, “Moving Scientific Codes to Multicore Microprocessor CPUs,” Computing in Science & Engineering, special issue on novel architectures, Nov. 2008. Also available at
Woodward, P. R., J. Jayaraj, P.-H. Lin, and D. Porter, “Programming Techniques for Moving Scientific Simulation Codes to Roadrunner,” tutorial given 3/12/08 at Los Alamos, link available at
Woodward, P. R., J. Jayaraj, P.-H. Lin, and W. Dai, “First Experience of Compressible Gas Dynamics Simulation on the Los Alamos Roadrunner Machine,” submitted to Concurrency and Computation: Practice and Experience, preprint available at