Communication Support for Global Address Space Languages
Kathy Yelick, Christian Bell, Dan Bonachea, Yannick Cote, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Mike Welcome
NERSC/LBNL, U.C. Berkeley, and Concordia U.

Outline
What is a Global Address Space Language?
–Programming advantages
–Potential performance advantage
Application example
Possible optimizations
LogP model
Cost on current networks

Two Programming Models
Shared memory
+Programming is easier: can build large shared data structures
–Machines don't scale (typically SMPs < 16 processors, DSM < 128 processors)
–Performance is hard to predict and control
Message passing
+Machines are easier to build and scale from commodity parts
+Programmer has control over performance
–Programming is harder: distributed data structures exist only in the programmer's mind; tedious packing/unpacking of irregular data structures
Losing programmers with each machine generation

Global Address-Space Languages
Unified Parallel C (UPC)
–Extension of C with distributed arrays
–UPC efforts:
  IDA: T3E implementation based on old gcc
  NERSC: Open64 implementation + generic runtime
  GMU (documentation) and UMD (benchmarking)
  Compaq (Alpha cluster and C+MPI compiler, with MTU)
  Cray, Sun, and HP (implementations)
  Intrepid (SGI compiler and T3E compiler)
Titanium (Berkeley)
–Extension of Java without the JVM
–Compiler available
–Runs on most machines (shared, distributed, and hybrid)
–Some experience calling libraries in other languages
CAF (Rice and U. Minnesota)

Global Address Space Programming
Intermediate point between message passing and shared memory
Program consists of a collection of processes
–Fixed at program startup time, like MPI
Local and shared data, as in the shared memory model
–But shared data is partitioned over the local processes
–Remote data stays remote on distributed memory machines
–Processes communicate by reads/writes to shared variables
Examples are UPC, Titanium, CAF, Split-C
Note: these are not data-parallel languages
–The compiler does not have to map the n-way loop to p processors
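As a hedged illustration of the model (not from the talk itself), a minimal UPC-style sketch: every thread runs the same program, keeps private data locally, and reads or writes a partitioned shared array directly, with remote accesses happening implicitly.

    /* A minimal UPC sketch of the global address space model (illustrative only). */
    #include <upc.h>
    #include <stdio.h>

    shared int counts[THREADS];   /* shared: one element has affinity to each thread */

    int main(void) {
        int local = MYTHREAD * 10;    /* private data: one copy per thread */
        counts[MYTHREAD] = local;     /* write to the element this thread owns */
        upc_barrier;
        if (MYTHREAD == 0) {          /* thread 0 reads every thread's element ... */
            int sum = 0;
            for (int t = 0; t < THREADS; t++)
                sum += counts[t];     /* ... remote reads happen implicitly */
            printf("total = %d\n", sum);
        }
        return 0;
    }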

UPC Pointers
Pointers may point to shared or private variables
–Same syntax for use, just add the qualifier:
  shared int *sp;
  int *lp;
–sp is a pointer to an integer residing in the shared memory space
–sp is called a shared pointer (somewhat sloppy)
Private pointers are faster; aliasing is common
[Figure: global address space with a shared region holding x = 3, referenced by the shared pointer sp, and a private region holding the local pointer lp]
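A small sketch (not on the slide) of why private pointers are faster: when a shared element is known to have affinity to the calling thread, it can be accessed through a cast to a private pointer, bypassing the shared-pointer machinery.

    #include <upc.h>

    shared int x[THREADS];

    void touch_mine(void) {
        shared int *sp = &x[MYTHREAD];  /* shared pointer: can name any thread's element */
        int *lp;

        /* The element with affinity to this thread is physically local, so the
           cast to a private pointer is legal and the dereference is fast. */
        lp = (int *)sp;
        *lp = 42;    /* plain local store */
        *sp = 43;    /* same location, but through the heavier shared-pointer path */
    }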

Shared Arrays in UPC
Shared array elements are spread across the threads:
  shared int x[THREADS];     /* one element per thread */
  shared int y[3][THREADS];  /* 3 elements per thread; really a 2D array */
  shared int z[3*THREADS];   /* 3 elements per thread, cyclic */
[Figure: layouts with THREADS = 4 and the elements with affinity to thread 0 marked; x one per thread, y blocked, z cyclic]
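A hedged sketch (not on the slide) of how such arrays are typically initialized so that each thread touches only elements with its own affinity, using UPC's upc_forall loop:

    #include <upc.h>

    shared int z[3*THREADS];   /* 3 elements per thread, cyclic (as above) */

    void init_z(void) {
        int i;
        /* The affinity expression &z[i] means iteration i runs on the thread
           that owns z[i], so every assignment below is a purely local write. */
        upc_forall (i = 0; i < 3*THREADS; i++; &z[i])
            z[i] = i;
    }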

Example Problem
Relaxation on a mesh (structured or not)
–Also known as sparse matrix-vector multiply
[Figure: mesh with vector v; color indicates the owner processor]
Implementation strategies
–Read values of v across edges, either local or remote
–Prefetch remote values
–Remote processor writes values (into a ghost region)
–Remote processor packs values and ships them as a block
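A hedged UPC-style sketch of the first strategy (read values of v across edges); the data layout and names are illustrative, not the actual benchmark code:

    #include <upc_relaxed.h>

    #define N 1024                    /* illustrative problem size */

    shared double v[N];               /* source vector, spread across threads */

    /* CSR-style structure for the rows this thread owns (private, illustrative). */
    extern int     nrows, *rowstart, *colidx;
    extern double *val, *y;

    void spmv(void) {
        for (int r = 0; r < nrows; r++) {
            double sum = 0.0;
            for (int k = rowstart[r]; k < rowstart[r+1]; k++)
                sum += val[k] * v[colidx[k]];   /* read across an edge: local or remote */
            y[r] = sum;
        }
    }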

Communication Requirements
One-sided communication
–The origin can read or write the memory of a target node with no explicit interaction by the target
Low latency for small messages
Hide latency with non-blocking accesses (UPC "relaxed"); low software overhead
–Overlap communication with communication
–Overlap communication with computation
Support for bulk, scatter/gather, and collective operations (as in MPI)
Portability to a number of architectures

Performance Advantage of Global Address Space Languages
Sparse matrix-vector multiplication on a T3E
–UPC model with remote reads is fastest
–Small messages (1 word), hand-coded prefetching
–Thanks to Bob Lucas
Explanations
–MPI on the T3E isn't very good
–Remote read/write is fundamentally faster than two-sided message passing

Optimization Opportunities
Introducing non-blocking communication
–Currently hand-optimized in the Titanium code generator
–Small-message versions of algorithms on the IBM SP

How Hard is the Compiler Problem?
Split-C, UPC, and Titanium experience
–Small effort
–Relied on lightweight communication
Distinguish between
–Single thread/process analysis
–Global, cross-thread analysis: two-sided communication, gets-to-puts, strong consistency semantics with a non-blocking implementation
Support for application-level optimization is key
–Bulk communication, scatter/gather, etc.

Portable Runtime Support
Developing a runtime layer that can be easily ported and tuned to multiple architectures.
Layering (top to bottom):
–UPCNet: global pointers (an opaque type with a rich set of pointer operations), memory management, job startup, etc.
–GASNet Extended API: supports put, get, locks, barrier, bulk, scatter/gather
–GASNet Core API: small interface based on "Active Messages"
Generic support for UPC, CAF, and Titanium; the core alone is sufficient for a functional implementation; direct implementations of parts of the full GASNet are possible.

Portable Runtime Support
Full runtime designed to be used by multiple compilers
–NERSC compiler based on Open64
–Intrepid compiler based on gcc
Communication layer designed to run on multiple machines
–Hardware shared memory (direct load/store)
–IBM SP (LAPI)
–Myrinet 2000 (GM)
–Quadrics (Elan3)
–Dolphin
–VIA and InfiniBand, in anticipation of future networks
–MPI for portability
Use communication micro-benchmarks to choose optimizations

Core API – Active Messages
Super-lightweight RPC
–Unordered, reliable delivery with "user"-provided handlers
–Request/reply messages
–3 sizes: small (<= 32 bytes), medium (<= 512 bytes), large (DMA)
Very general – provides extensibility
–Available for implementing compiler-specific operations
–Scatter/gather or strided memory access, remote allocation, …
Already implemented on a number of interconnects
–MPI, LAPI, UDP/Ethernet, VIA, Myrinet, and others
Allows a number of message-servicing paradigms
–Interrupts, main-thread polling, NIC-thread polling, or some combination
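A hedged sketch of the request/reply pattern in C, using GASNet-style names; the exact function names, types, and signatures here are assumptions drawn from later GASNet specifications and are not defined on this slide.

    #include <gasnet.h>

    #define HIDX_INCR_REQ   201    /* hypothetical handler indices */
    #define HIDX_INCR_REPLY 202

    static volatile int counter = 0;
    static volatile int acked   = 0;

    /* "User"-provided handlers: run when the message arrives. */
    void incr_request_handler(gasnet_token_t token, gasnet_handlerarg_t amount) {
        counter += amount;                            /* do the work on the target */
        gasnet_AMReplyShort0(token, HIDX_INCR_REPLY); /* small reply back to the origin */
    }

    void incr_reply_handler(gasnet_token_t token) {
        acked = 1;                                    /* origin-side completion flag */
    }

    /* Origin side: a small request (payload fits in the <= 32-byte class). */
    void send_increment(gasnet_node_t target) {
        gasnet_AMRequestShort1(target, HIDX_INCR_REQ, 5);
        while (!acked) gasnet_AMPoll();               /* main-thread polling paradigm */
    }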

Extended API – Remote Memory Operations
Want an orthogonal, expressive, high-performance interface
–Scalars and bulk contiguous data
–Blocking and non-blocking (returns a handle)
–Also a non-blocking form where the handle is implicit
Non-blocking synchronization
–Sync on a particular operation (using a handle)
–Sync on a list of handles (some or all)
–Sync on all pending reads, writes, or both (for implicit handles)
–Allow polling (trysync) or blocking (waitsync)
Misc. characteristics
–Gets specify a destination memory address (register-memory operations also exist)
–Remote addresses are expressed as (node id, virtual address)
–Loopback is supported
–Handles need not be explicitly freed
–Knows nothing about local UPC threads, but is thread-safe on platforms with POSIX threads

Extended API – Remote Memory
API for remote gets/puts:
  void   get    (void *dest, int node, void *src, int numbytes)
  handle get_nb (void *dest, int node, void *src, int numbytes)
  void   get_nbi(void *dest, int node, void *src, int numbytes)
  void   put    (int node, void *dest, void *src, int numbytes)
  handle put_nb (int node, void *dest, void *src, int numbytes)
  void   put_nbi(int node, void *dest, void *src, int numbytes)
"nb" = non-blocking with explicit handle
"nbi" = non-blocking with implicit handle
Also have "value" forms for register transfers
Recognize and optimize common sizes with macros
Extensibility of the core API makes it easy to add other, more complicated access patterns (scatter/gather, strided, etc.)

Extended API – Remote Memory
API for get/put synchronization:
Non-blocking ops with explicit handles:
  int  try_syncnb (handle)
  void wait_syncnb(handle)
  int  try_syncnb_some (handle *, int numhandles)
  void wait_syncnb_some(handle *, int numhandles)
  int  try_syncnb_all  (handle *, int numhandles)
  void wait_syncnb_all (handle *, int numhandles)
Non-blocking ops with implicit handles:
  int  try_syncnbi_gets()
  void wait_syncnbi_gets()
  int  try_syncnbi_puts()
  void wait_syncnbi_puts()
  int  try_syncnbi_all()   // gets & puts
  void wait_syncnbi_all()
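A short usage sketch tying the last two slides together: start a non-blocking get, overlap it with independent local work, then sync on the explicit handle before touching the data. The get_nb/wait_syncnb calls are exactly as listed above; the surrounding code is illustrative.

    void fetch_and_work(int node, void *remote_src, double *buf, int numbytes,
                        double *local, int n) {
        handle h = get_nb(buf, node, remote_src, numbytes);  /* start the transfer */

        for (int i = 0; i < n; i++)      /* overlap: work that does not need buf */
            local[i] *= 2.0;

        wait_syncnb(h);                  /* block until the get has completed */
        /* buf now holds the remote data and is safe to read */
    }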

Extended API – Other Operations
Basic job control
–Init, exit
–Job layout queries: get node rank and node count
–Common user interface for job startup
Synchronization
–Named split-phase barrier (wait & notify)
–Locking support
  Core API provides "handler-safe" locks for implementing upc_locks
  May also provide atomic compare-and-swap or fetch-and-increment
Collective communication
–Broadcast, exchange, reductions, scans?
Other
–Performance monitoring (counters)
–Debugging support?
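A hedged sketch of the split-phase barrier, using GASNet-style names; the exact calls and flags are assumptions for illustration, and do_local_work is a hypothetical helper.

    #include <gasnet.h>

    extern void do_local_work(void);   /* hypothetical: work needing no other node */

    void end_of_phase(void) {
        gasnet_barrier_notify(0, GASNET_BARRIERFLAG_ANONYMOUS);  /* announce arrival */
        do_local_work();                                         /* overlap with the barrier */
        gasnet_barrier_wait(0, GASNET_BARRIERFLAG_ANONYMOUS);    /* complete the barrier */
    }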

Software Overhead
Overhead: cost that cannot be hidden with overlap
–Shown here for 8-byte messages (put or send)
–Compare to 1.5 usec for the CM-5 using Active Messages

Small Message Bandwidth
If overhead fills all the time, there is no potential for overlapping computation

Latency (Including Overhead)

Large Message Bandwidth

What to Take Away
Opportunity to influence vendors to expose lighter-weight communication
–Overhead is most important
–Then gap (inverse bandwidth)
–Then latency
Global address space languages
–Easier first implementation
–Incremental performance tuning
Proposal for a GASNet
–Two layers: full interface + core

End of Slides

Performance Characteristics
The LogP model is useful for understanding small-message performance and overlap
–L: latency across the network
–o: overhead (sending and receiving busy time)
–g: gap between messages (1/rate)
–P: number of processors
[Figure: standard LogP diagram showing processor/memory nodes, send overhead o_s, receive overhead o_r, network latency L, and gap g]
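A hedged worked example of the standard LogP accounting (not data from the talk): the time to pipeline n small messages between two processors is roughly the send overhead, plus n-1 inter-message gaps, plus the wire latency, plus the receive overhead.

    /* Textbook LogP estimate for n pipelined small messages (illustrative). */
    double logp_time(int n, double L, double o, double g) {
        double gap = (g > o) ? g : o;       /* injection rate limited by max(g, o) */
        return o + (n - 1) * gap + L + o;   /* send o + pipeline + latency + recv o */
    }
    /* Example: n = 8, L = 10 us, o = 2 us, g = 1 us  ->  2 + 7*2 + 10 + 2 = 28 us. */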

Questions
Why Active Messages at the bottom?
–Changing the PC is the minimum work
What about machines with sophisticated NICs?
–Handled by direct implementation of the full API
Why not MPI-2 one-sided?
–Designed for the application level
–Too much synchronization required for a runtime
Why not ARMCI?
–Similar goals, but not designed for small (non-blocking) messages

Implications for Communication
Fast small-message read/write simplifies programming
Non-blocking read/write may be introduced by the programmer or the compiler
–UPC has "relaxed" to indicate that an access need not happen immediately
Bulk and scatter/gather support will be useful (as in MPI)
Non-blocking versions may also be useful
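A hedged sketch of the "relaxed" point (illustrative UPC, not from the slides): including upc_relaxed.h lets shared accesses be reordered and overlapped, with a fence inserted where ordering actually matters.

    #include <upc_relaxed.h>   /* shared accesses in this file default to relaxed */

    shared double ghost[THREADS];

    void exchange(double mine) {
        ghost[(MYTHREAD + 1) % THREADS] = mine;  /* relaxed write: may complete later */
        /* ... independent local computation can overlap the write ... */
        upc_fence;                               /* ensure completion before synchronizing */
        upc_barrier;
    }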

Overview of NERSC Effort
Three components:
1) Compilers
–IBM SP platform and PC clusters are the main targets
–Portable compiler infrastructure (UPC-to-C)
–Optimization of communication and global pointers
2) Runtime systems for multiple compilers
–Allow use by other languages (Titanium and CAF)
–And in other UPC compilers
–Performance evaluation
3) Applications and benchmarks
–Currently looking at the NAS Parallel Benchmarks
–Evaluating language and compilers
–Plan to do a larger application next year

NERSC UPC Compiler
Compiler being developed by Costin Iancu
–Based on the Open64 compiler for C
  Originally developed at SGI
  Has an IA-64 backend with some ongoing development
  Software available on SourceForge
–Can be used as a C-to-C translator
  Can generate code either before most optimizations
  Or after, but this is known to be buggy right now
Status
–Parses and type-checks UPC
–Finishing code generation for the UPC-to-C translator
–Code generation for SMPs is underway

Compiler Optimizations
Based on lessons learned from
–Titanium: UPC in Java
–Split-C: one of the UPC predecessors
Optimizations
–Pointer optimizations: optimization of phase-less pointers; turn global pointers into local ones
–Overlap: split-phase communication (sketched below); merge "syncs" at a barrier
–Aggregation
[Chart: Split-C data on the CM-5]
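An illustrative before/after sketch of the split-phase overlap transformation, written against the get/get_nb/wait_syncnb interface from the runtime slides; the code itself is made up, and performing this rewrite automatically is the compiler's job.

    /* Before: a blocking remote read stalls until the word arrives. */
    void use_remote_blocking(int node, double *remote_a, double *b, int n) {
        double tmp;
        get(&tmp, node, remote_a, sizeof(double));       /* blocking get */
        for (int i = 0; i < n; i++) b[i] += 1.0;         /* independent work */
        b[0] *= tmp;
    }

    /* After split-phase: issue the get early, sync just before first use. */
    void use_remote_split_phase(int node, double *remote_a, double *b, int n) {
        double tmp;
        handle h = get_nb(&tmp, node, remote_a, sizeof(double));
        for (int i = 0; i < n; i++) b[i] += 1.0;         /* overlapped with the get */
        wait_syncnb(h);
        b[0] *= tmp;
    }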

Possible Optimizations
Use of lightweight communication
Converting reads to writes (or the reverse)
Overlapping communication with communication
Overlapping communication with computation
Aggregating small messages into larger ones
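A hedged sketch of the aggregation idea using the same illustrative get interface: many word-sized transfers from one node replaced by a single contiguous bulk get that amortizes the per-message overhead.

    /* Before: k small gets, each paying the full per-message overhead. */
    void gather_small(int node, double *remote_base, double *local, int k) {
        for (int i = 0; i < k; i++)
            get(&local[i], node, &remote_base[i], sizeof(double));
    }

    /* After: one bulk get amortizes the overhead over k words. */
    void gather_bulk(int node, double *remote_base, double *local, int k) {
        get(local, node, remote_base, (int)(k * sizeof(double)));
    }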

MPI vs. LAPI on the IBM SP
–LAPI generally faster than MPI
–Non-blocking (relaxed) faster than blocking

Overlapping Computation: IBM SP Nearly all software overhead – no computation overlap –Recall: 36 usec blocking, 12 usec nonblocking

Conclusions for IBM SP
LAPI is better than MPI
Reads and writes are roughly the same cost
Overlapping communication with communication (pipelining) is important
Overlapping communication with computation
–Important if there is no communication overlap
–Minimal value if >= 2 messages are overlapped
Large messages are still much more efficient
Generally noisy data: hard to control

Other Machines
Observations:
–Low latency reveals the programming advantage
–The T3E is still much better than the other networks
[Chart: small-message latency in usec across machines]

Future Plans
This month
–Draft of the runtime spec
–Draft of the GASNet spec
This year
–Initial runtime implementation on shared memory
–Runtime implementation on distributed memory (Myrinet 2000, IBM SP)
–NERSC compiler release 1.0b for the IBM SP
Next year
–Compiler release for PC clusters
–Development of the CLUMP compiler
–Begin a large application effort
–More GASNet implementations
–Advanced analysis and optimizations

Read/Write Behavior Negligible difference between blocking read and write performance

Overlapping Communication
Effects of pipelined communication are significant
–8 overlapped messages are sufficient to saturate the NI
[Chart: bandwidth vs. queue depth]

Overlapping Computation Same experiment, but fix total amount of computation

SPMV on Compaq/Quadrics
–Seeing 15 usec latency for small messages
–Data for 1 thread per node

Optimization Strategy
Optimization of communication is key to making UPC more usable
Two problems:
–Analysis of the code to determine which optimizations are legal
–Use of performance models to select transformations that improve performance
The focus here is on the second problem

Runtime Status
Characterizing network performance
–Low latency (low overhead) -> programmability
Specification of a portable runtime
–Communication layer (UPC, Titanium, Co-Array Fortran)
–Built on a small "core" layer; interoperability is a major concern
–Full runtime has memory management, job startup, etc.
[Chart: measured latencies in usec]

What is UPC?
UPC is an explicitly parallel language
–Global address space; can read/write remote memory
–Programmer control over layout and scheduling
–Derived from Split-C, AC, PCP
Why a new language?
–Easier to use than MPI, especially for programs with complicated data structures
–Possibly faster on some machines, but the current goal is comparable performance
[Figure: threads p0, p1, p2 sharing a partitioned global address space]

Background
UPC efforts elsewhere
–IDA: T3E implementation based on old gcc
–GMU (documentation) and UMD (benchmarking)
–Compaq (Alpha cluster and C+MPI compiler, with MTU)
–Cray, Sun, and HP (implementations)
–Intrepid (SGI compiler and T3E compiler)
UPC book:
–T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
Three components of the NERSC effort
1) Compilers (SP and PC clusters) + optimization (DOE/UPC)
2) Runtime systems for multiple compilers (DOE/Pmodels + NSA)
3) Applications and benchmarks (DOE/UPC)

Overlapping Computation on Quadrics 8-Byte non-blocking put on Compaq/Quadrics