UPC at NERSC/LBNL


UPC at NERSC/LBNL
Kathy Yelick, Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome
NERSC, U.C. Berkeley, and Concordia U.

Overview of NERSC Effort
Three components:
- Compilers
  - IBM SP platform and PC clusters are the main targets
  - Portable compiler infrastructure (UPC-to-C)
  - Optimization of communication and global pointers
- Runtime systems for multiple compilers
  - Allow use by other languages (Titanium and CAF) and in other UPC compilers
- Performance evaluation
  - Applications and benchmarks; currently looking at the NAS parallel benchmarks
  - Evaluating language and compilers
  - Plan to do a larger application next year

NERSC UPC Compiler
Compiler being developed by Costin Iancu, based on the Open64 compiler for C:
- Originally developed at SGI
- Has an IA64 backend with some ongoing development
- Software available on SourceForge
- Can be used as a C-to-C translator: code can be generated either before most optimizations, or after (the latter is known to be buggy right now)
Status:
- Parses and type-checks UPC
- Finishing code generation for the UPC-to-C translator
- Code generation for SMPs underway

Compiler Optimizations
Based on lessons learned from:
- Titanium: "UPC in Java"
- Split-C: one of the UPC predecessors
Optimizations:
- Pointer optimizations: optimization of phase-less pointers; turn global pointers into local ones
- Overlap: split-phase communication; merge "synchs" at a barrier
- Aggregation (Split-C data on the CM-5)

Portable Runtime Support
Developing a runtime layer that can be easily ported and tuned to multiple architectures. Layered design (top to bottom):
- UPCNet: global pointers (an opaque type with a rich set of pointer operations), memory management, job startup, etc.
- Generic support for UPC, CAF, and Titanium
- GASNet Extended API: supports put, get, locks, barrier, bulk transfer, scatter/gather
- GASNet Core API: a small interface based on "Active Messages"; the core alone is sufficient for a functional implementation
- Direct implementations of parts of the full GASNet where the network allows

Portable Runtime Support
- Full runtime designed to be used by multiple compilers:
  - NERSC compiler, based on Open64
  - Intrepid compiler, based on gcc
- Communication layer designed to run on multiple machines:
  - Hardware shared memory (direct load/store)
  - IBM SP (LAPI)
  - Myrinet 2000 (GM)
  - Quadrics (Elan3)
  - Dolphin (VIA) and InfiniBand, in anticipation of future networks
  - MPI for portability
- Use communication micro-benchmarks to choose optimizations

Possible Optimizations
- Use of lightweight communication
- Converting reads to writes (or the reverse)
- Overlapping communication with communication
- Overlapping communication with computation
- Aggregating small messages into larger ones

MPI vs. LAPI on the IBM SP
- LAPI is generally faster than MPI
- Non-blocking (relaxed) is faster than blocking

Overlapping Computation: IBM SP
- Nearly all of the cost is software overhead, so there is no computation overlap
- Recall: 36 usec blocking, 12 usec non-blocking

Conclusions for IBM SP
- LAPI is better than MPI
- Reads and writes have roughly the same cost
- Overlapping communication with communication (pipelining) is important
- Overlapping communication with computation:
  - Important if there is no communication overlap
  - Minimal value if >= 2 messages are overlapped
- Large messages are still much more efficient
- Generally noisy data: hard to control

Other Machines
Observations:
- Low latency reveals a programming advantage
- The T3E is still much better than the other networks

Applications Status
- Short-term goal: evaluate language and compilers using small applications
- Longer term: identify a large application
- Conjugate Gradient
  - Shows the advantage of the T3E network model and UPC
  - Performance on the Compaq machine is worse: serial code, communication performance
- Simple n^2 particle simulation
- Currently working on NAS MG
  - Need for shared-array arithmetic optimizations

Future Plans
This month:
- Draft of runtime spec
- Draft of GASNet spec
This year:
- Initial runtime implementation on shared memory
- Runtime implementation on distributed memory (M2K, SP)
- NERSC compiler release 1.0b for the IBM SP
Next year:
- Compiler release for PC clusters
- Development of a CLUMP compiler
- Begin a large application effort
- More GASNet implementations
- Advanced analysis and optimizations

Runtime Breakout
- How many runtime systems? Compaq, MTU, LBNL/Intrepid
- Language issues:
  - Locks
  - Richard Stallman's ?
  - upc_phaseof for pointers with indefinite block size
- Misc:
  - Runtime extensions
  - Strided and scatter/gather memcopy

Read/Write Behavior
- Negligible difference between blocking read and write performance

Overlapping Communication
- The effects of pipelined communication are significant
- 8 overlapped messages are sufficient to saturate the NI (plotted against queue depth)

Overlapping Computation
- Same experiment, but the total amount of computation is fixed

SPMV on Compaq/Quadrics
- Seeing 15 usec latency for small messages
- Data is for 1 thread per node
- Note: this data was hand-typed from a picture from Parry (couldn't cut and paste)

Optimization Strategy
- Optimization of communication is key to making UPC more usable
- Two problems:
  - Analysis of code to determine which optimizations are legal
  - Use of performance models to select transformations that improve performance
- The focus here is on the second problem

Runtime Status
- Characterizing network performance
  - Low latency (low overhead) -> programmability
- Specification of a portable runtime
  - Communication layer for UPC, Titanium, and Co-Array Fortran
  - Built on a small "core" layer; interoperability is a major concern
  - The full runtime has memory management, job startup, etc.

What is UPC?
- UPC is an explicitly parallel language
  - Global address space: threads can read/write remote memory
  - Programmer control over layout and scheduling
  - Derived from Split-C, AC, and PCP
- Why a new language?
  - Easier to use than MPI, especially for programs with complicated data structures
  - Possibly faster on some machines, but the current goal is comparable performance

Background
- UPC efforts elsewhere:
  - IDA: T3E implementation based on an old gcc
  - GMU (documentation) and UMC (benchmarking)
  - Compaq (Alpha cluster and C + MPI compiler, with MTU)
  - Cray, Sun, and HP (implementations)
  - Intrepid (SGI compiler and T3E compiler)
- UPC book: T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
- Three components of the NERSC effort:
  - Compilers (SP and PC clusters) + optimization (DOE/UPC)
  - Runtime systems for multiple compilers (DOE/Pmodels + NSA)
  - Applications and benchmarks (DOE/UPC)

Overlapping Computation on Quadrics
- 8-byte non-blocking put on Compaq/Quadrics