1 UPC and Titanium
Open-source compilers and tools for scalable global address space computing
Kathy Yelick, University of California, Berkeley and Lawrence Berkeley National Laboratory

2 Outline
Global address space languages in general
UPC
  Language overview
  Berkeley UPC compiler status and microbenchmarks
  Application benchmarks and plans
Titanium
  Berkeley Titanium compiler status

3 Global Address Space Languages
Explicitly parallel programming model with SPMD parallelism: fixed at program start-up, typically one thread per processor
Global address space model of memory: allows the programmer to directly represent distributed data structures
Address space is logically partitioned: local vs. remote memory (a two-level hierarchy)
Programmer control over performance-critical decisions: data layout and communication
Performance transparency and tunability are goals; an initial implementation can use fine-grained shared memory
Suitable for current and future architectures: either shared memory or lightweight messaging is key
Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)

4 Global Address Space
[Figure: shared arrays X[0], X[1], …, X[P] in the shared portion of the global address space; a private "ptr:" on each thread points into it]
The languages share the global address space abstraction
Shared memory is partitioned by processors
Remote memory may stay remote: no automatic caching is implied
One-sided communication through reads and writes of shared variables, both individual accesses and bulk memory copies
The languages differ on details: some models have a separate private memory area; distributed arrays differ in generality and in how they are constructed
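The following is a minimal UPC sketch of this model; the array name X and the neighbor access pattern are illustrative, not taken from the talk.

  /* Two-level model: a shared array spread across threads, plus a
   * private pointer and a pointer-to-shared on each thread. */
  #include <upc.h>
  #include <stdio.h>

  shared double X[THREADS];          /* one element has affinity to each thread */

  int main(void) {
      double *lp;                    /* private pointer: this thread's memory only */
      shared double *sp;             /* pointer-to-shared: may reference any thread */

      X[MYTHREAD] = MYTHREAD;        /* write the element with local affinity */
      upc_barrier;

      sp = &X[(MYTHREAD + 1) % THREADS];
      double neighbor = *sp;         /* one-sided read; may be remote */

      lp = (double *) &X[MYTHREAD];  /* casting to a private pointer is legal only for local data */
      printf("thread %d: local %g, neighbor %g\n", MYTHREAD, *lp, neighbor);
      return 0;
  }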

5 UPC Programming Model Features
SPMD parallelism: a fixed number of images during execution; images operate asynchronously
Several kinds of array distributions:
  double a[n]                  a private n-element array on each processor
  shared double a[n]           an n-element shared array, with cyclic mapping
  shared [4] double a[n]       a block-cyclic array with 4-element blocks
  shared [0] double *a = (shared [0] double *) upc_alloc(n);
                               a shared array with all elements local
Pointers for irregular data structures:
  shared double *sp            a pointer to shared data
  double *lp                   a pointer to private data
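A compilable sketch putting these layouts together; the size N, block size, and the loop body are illustrative, and the statically sized shared arrays assume THREADS is fixed at compile time (e.g., something like upcc -T 4 with the Berkeley compiler).

  #include <upc.h>
  #define N 1024                          /* illustrative; assumed a multiple of THREADS */

  double            priv[N];              /* private: a separate array per thread */
  shared double     cyc[N];               /* cyclic: element i has affinity to thread i%THREADS */
  shared [4] double blk[N];               /* block-cyclic: 4-element blocks dealt round-robin */

  int main(void) {
      /* indefinite (all-local) shared data, allocated in this thread's shared space */
      shared [0] double *loc = (shared [0] double *) upc_alloc(N * sizeof(double));

      shared double *sp = &cyc[MYTHREAD]; /* pointer-to-shared */
      double        *lp = priv;           /* pointer to private data */

      int i;
      /* upc_forall runs each iteration on the thread with affinity to cyc[i] */
      upc_forall (i = 0; i < N; i++; &cyc[i])
          cyc[i] = i;                     /* each element is written by its owner */

      *lp = *sp;                          /* copy our own cyc element into private memory */
      blk[MYTHREAD] = *lp;                /* possibly remote one-sided write */
      upc_free(loc);
      return 0;
  }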

6 UPC Programming Model Features
Global synchronization:
  upc_barrier              traditional barrier
  upc_notify / upc_wait    split-phase global synchronization
Pairwise synchronization:
  upc_lock / upc_unlock    traditional locks
Memory consistency distinguishes two types of accesses:
  Strict: must be performed immediately and atomically; typically a blocking round-trip message if remote
  Relaxed: must still preserve dependencies, but other processors may observe these accesses out of order
Parallel I/O: based on ideas in MPI I/O; specification for UPC by Thakur, El-Ghazawi, et al.
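A short sketch of these synchronization and consistency features in use; the flag and counter variables are illustrative.

  #include <upc.h>
  #include <upc_relaxed.h>              /* relaxed is the default reference mode here */

  strict  shared int flag;              /* strict accesses are ordered; remote ones block */
  relaxed shared int counter;
  upc_lock_t *lock;

  int main(void) {
      lock = upc_all_lock_alloc();      /* collective: all threads get the same lock */

      upc_notify;                       /* split-phase barrier: signal arrival */
      /* independent local work can overlap the barrier here */
      upc_wait;                         /* wait for everyone else's notify */

      upc_lock(lock);                   /* pairwise synchronization around an update */
      counter += 1;
      upc_unlock(lock);

      if (MYTHREAD == 0) flag = 1;      /* strict write: seen in order by all threads */

      upc_barrier;                      /* traditional barrier */
      return 0;
  }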

7 Berkeley UPC Compiler
Compiler based on Open64: recently merged Rice sources; multiple front-ends, including gcc; intermediate form called WHIRL; current focus on a C backend, with IA64 possible in the future
UPC Runtime: pointer representation; shared/distributed memory; communication in GASNet (portable, language-independent)
[Diagram: UPC source lowers through Higher WHIRL, optimizing transformations, and Lower WHIRL to either C + Runtime or Assembly (IA64, MIPS, …) + Runtime]

8 Design for Portability & Performance
UPC-to-C translator: translates UPC to C and inserts runtime calls for parallel features
UPC runtime: allocates shared data; implements pointers-to-shared
GASNet: a uniform interface for low-level communication primitives
Portability: C is our intermediate language; GASNet is itself layered, with a small core as the essential part
High performance: the native C compiler optimizes serial code; the translator can perform communication optimizations; GASNet can access the network directly
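To make the layering concrete, here is a hand-written sketch of the kind of C the translator might emit for one remote write. The upcr_*-style names, the struct layout, and the stub bodies are illustrative assumptions, not the actual Berkeley runtime or GASNet API.

  #include <stdint.h>
  #include <stdio.h>

  /* A pointer-to-shared lowered to a plain C struct (layout assumed) */
  typedef struct { uint64_t addr; uint32_t thread; uint32_t phase; } upcr_pshared_t;

  /* Stub runtime calls: a real runtime would map the put onto a GASNet operation */
  static upcr_pshared_t upcr_index(upcr_pshared_t p, int elems, size_t elemsz) {
      p.addr += (uint64_t)elems * elemsz;   /* indefinite-pointer case only */
      return p;
  }
  static void upcr_put_double(upcr_pshared_t dst, double v) {
      printf("put %g to thread %u at offset 0x%llx\n",
             v, (unsigned)dst.thread, (unsigned long long)dst.addr);
  }

  /* What a UPC statement like  a[t] = 3.14;  might become after translation */
  int main(void) {
      upcr_pshared_t a = { 0, 0, 0 };       /* base of an indefinite shared array */
      int t = 3;
      upcr_put_double(upcr_index(a, t, sizeof(double)), 3.14);
      return 0;
  }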

9 Berkeley UPC Compiler Status
UPC extensions added to the front-end; code generation complete
Some issues remain related to code quality (hints to backend compilers)
GASNet communication layer: running on Quadrics/Elan, IBM/LAPI, Myrinet/GM, and MPI; optimized for small non-blocking messages and compiled code; next step: strided and indexed put/get, leveraging ARMCI work
UPC runtime layer: developed and tested on all GASNet implementations; supports multiple pointer representations; next step: direct shared memory support
Release scheduled for later this month; a glitch related to include files and usability remains to be ironed out

10 Pointer-to-Shared Representation
UPC has three different kinds of pointers: block-cyclic, cyclic, and indefinite (always local)
A pointer needs a "phase" to keep track of where it is within a block
  A source of overhead for updating and dereferencing
  Consumes space in the pointer
Our runtime has special cases for:
  Phaseless pointers (cyclic and indefinite) – skip the phase update
  Indefinite pointers – also skip the thread id update
Pointer size/representation is easily reconfigured: 64 bits on small machines, 128 on large; word or struct
[Pointer fields: Address | Thread | Phase]
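A plain-C sketch of why phaseless pointers are cheaper to increment. The field widths, the address-as-local-byte-offset convention, and the update rules are assumptions for illustration, not the Berkeley runtime's actual code.

  #include <stdint.h>
  #include <stdio.h>

  typedef struct {
      uint64_t addr;     /* here: byte offset into the owning thread's part of the array */
      uint32_t thread;   /* owning thread id */
      uint32_t phase;    /* position within the current block (block-cyclic only) */
  } pshared_t;

  /* General block-cyclic advance by n >= 0 elements: phase, thread, and address all change */
  static pshared_t incr_blocked(pshared_t p, uint64_t n, uint64_t elemsz,
                                uint64_t B, uint32_t T) {
      uint64_t blockrow = (p.addr / elemsz) / B;    /* which local block we are in */
      uint64_t idx   = p.phase + n;
      uint64_t hops  = idx / B;                     /* whole blocks advanced */
      uint64_t wraps = (p.thread + hops) / T;       /* times we cycled past thread T-1 */
      p.phase  = (uint32_t)(idx % B);
      p.thread = (uint32_t)((p.thread + hops) % T);
      p.addr   = ((blockrow + wraps) * B + p.phase) * elemsz;
      return p;
  }

  /* Cyclic (phaseless): no phase update */
  static pshared_t incr_cyclic(pshared_t p, uint64_t n, uint64_t elemsz, uint32_t T) {
      uint64_t idx = p.thread + n;
      p.thread = (uint32_t)(idx % T);
      p.addr  += (idx / T) * elemsz;
      return p;
  }

  /* Indefinite (always local): neither phase nor thread changes -- as cheap as a C pointer */
  static pshared_t incr_indef(pshared_t p, uint64_t n, uint64_t elemsz) {
      p.addr += n * elemsz;
      return p;
  }

  int main(void) {
      pshared_t p = { 0, 0, 0 };
      p = incr_blocked(p, 13, sizeof(double), 4, 2);   /* block size 4, 2 threads */
      printf("thread %u, phase %u, offset %llu\n",
             (unsigned)p.thread, (unsigned)p.phase, (unsigned long long)p.addr);
      p = incr_cyclic((pshared_t){0, 0, 0}, 13, sizeof(double), 2);
      p = incr_indef(p, 13, sizeof(double));
      return 0;
  }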

11 Preliminary Performance
Testbed: Compaq AlphaServer with the Quadrics GASNet conduit; Compaq C compiler for the translated C code
Microbenchmarks measure the cost of UPC language features and constructs: shared pointer arithmetic, barrier, allocation, etc.; vector addition with no remote communication
NAS Parallel Benchmarks:
  EP: no communication
  IS: large bulk memory operations
  MG: bulk memput
  CG: fine-grained vs. bulk memput

12 Performance of Shared Pointer Arithmetic
Phaseless pointers are an important optimization: indefinite pointers are almost as fast as regular C pointers, while general block-cyclic pointers are about 7x slower for addition
Competitive with the HP compiler, which generates native code
Both compilers have known opportunities for improvement

13 Cost of Shared Memory Access
Local shared accesses are somewhat slower than private ones (note: HP UPC 1.7 has a known performance bug here; HP has improved local performance in a newer version)
Remote accesses are worse than local ones, as expected
The runtime/GASNet layering for portability is not a performance problem

14 NAS PB: EP
EP (Embarrassingly Parallel) has no communication
Serial performance via C code generation is not a problem

15 NAS PB: IS
IS (Integer Sort) is dominated by bulk communication
GASNet bulk communication adds no measurable overhead

16 NAS PB: MG
MG (Multigrid) involves medium-sized bulk copies
"Berkeley" reveals a slight serial performance degradation due to casts; Berkeley-C uses the original C code for the inner loops

17 Scaling MG on the T3E
Scalability of the language is shown here for the T3E compiler
Direct shared memory support is probably needed to be competitive on most current machines

18 Parallel Mesh Generation in UPC
2D Delaunay triangulation, based on the Triangle software by Shewchuk (UCB)
The parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting

19 UPC Interactions
UPC consortium: Tarek El-Ghazawi is coordinator; semi-annual meetings, ~daily
  Revised UPC Language Specification (IDA, GWU, …)
  UPC Collectives (MTU)
  UPC I/O Specifications (GWU, ANL-PModels)
Other implementations:
  HP (Alpha cluster and C+MPI compiler, with MTU)
  MTU (C+MPI compiler based on the HP compiler; memory model)
  Cray (X1 implementation)
  Intrepid (SGI implementation based on gcc)
  Etnus (debugging)
UPC book: T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick; goal is proofs by SC03
HPCS effort; recent interest from Sandia

20 Titanium
Based on Java, a cleaner C++: classes, automatic memory management, etc.; compiled to C and then to a native binary (no JVM)
Same parallelism model as UPC and CAF: SPMD with a global address space; dynamic Java threads are not supported
Optimizing compiler: a static (compile-time) optimizer, not a JIT; communication and memory optimizations; synchronization analysis (e.g., static barrier analysis); cache and other uniprocessor optimizations
Java is already being used for scientific computing, and we have personal experience with its development (as a research vehicle)
The Titanium project seeks to improve Java, making it a better language for scientific computing on parallel machines

21 Summary of Features Added to Java
Scalable parallelism (Java threads replaced)
Immutable ("value") classes
Multidimensional arrays with iterators
Checked synchronization
Operator overloading
Templates
Zone-based memory management (regions)
Libraries for collective communication, distributed arrays, bulk I/O

22 Immutable Classes in Titanium
For small objects, one would sometimes prefer to avoid a level of indirection and pass by value (copying the entire object), especially when the object is immutable – its fields are never modified
Example:
  immutable class Complex {
      Complex () { real = 0; imag = 0; }
      ...
  }
  Complex c1 = new Complex(7.1, 4.3);
  c1 = c1.add(c1);
Addresses both performance and programmability
  Similar to structs in C (not C++ classes) in terms of performance
  Adds support for complex types
  No inheritance => no polymorphism => static dispatch
  No aliasing problems – enables more aggressive optimizations

23 Multidimensional Arrays
Arrays in Java are objects: array bounds are checked, and multidimensional arrays are arrays-of-arrays
  Safe and general, but potentially slow: indexing requires several lookups, and potential row aliasing kills optimizations – hard to analyze effectively
A new kind of multidimensional array was added to Titanium
  Sub-arrays are supported (interior, boundary, etc.)
  Indexed by Points (tuples of ints)
  Combined with unordered iteration to enable optimizations:
    foreach (p within A.domain()) { A[p] ... }
  "A" could be multidimensional, an interior region, etc.

24 Communication
Titanium has explicit global communication: broadcast, reduction, etc., primarily used to set up distributed data structures
Most communication is implicit through the shared address space
  Dereferencing a global reference, g.x, can generate communication
  Arrays have copy operations, which generate bulk communication: A1.copy(A2)
  The copy automatically computes the intersection of A1's and A2's index sets (domains)

25 Distributed Data Structures
Building distributed arrays:
  Particle [1d] single [1d] allParticle =
      new Particle [0:Ti.numProcs-1][1d];
  Particle [1d] myParticle =
      new Particle [0:myParticleCount-1];
  allParticle.exchange(myParticle);
Now each processor has an array of pointers, one to each processor's chunk of particles (an all-to-all broadcast)
[Diagram: P0, P1, P2 each holding references to every processor's particle array]

26 Titanium Compiler Status
The Titanium compiler runs on almost any machine
  Requires a C compiler (and a decent C++ compiler to build the translator)
  Pthreads for shared memory
  A communication layer for distributed memory (or hybrid)
  Recently moved to live on GASNet: obtained GM and Elan support and an improved LAPI implementation; leverages other PModels work for maintenance
Recent language extensions
  Indexed array copy (scatter/gather style)
  Non-blocking array copy under development
Compiler optimizations
  Cache optimizations, loop optimizations
  Communication optimizations for overlap, pipelining, and scatter/gather under development

27 Applications in Titanium
Several benchmarks
  Fluid solvers with Adaptive Mesh Refinement (AMR)
  Conjugate Gradient
  3D Multigrid
  Unstructured mesh kernel: EM3D
  Dense linear algebra: LU, MatMul
  Tree-structured n-body code
  Finite element benchmark
  Genetics: micro-array selection
  SciMark serial benchmarks
Larger applications
  Heart simulation
  Ocean modeling with AMR (in progress)

28 Serial Performance (Pure Java)
Several optimizations were added to the Titanium compiler (tc) over the past year
These codes are all written in pure Java, without performance extensions

29 AMR for Ocean Modeling
Ocean modeling [Wen, Colella]
  Requires embedded boundaries to model the ocean floor and coastline, resulting in irregular data structures and array accesses
  Starting with an AMR solver this year for ocean flow
  Compiler and language support for this irregular problem is under design
[Figure: Titanium AMR gas dynamics [McCorquodale, Colella]; a shock wave enters at 45 degrees and reflects off the wall at the bottom of the picture; color shows areal density]

30 Heart Simulation
Immersed Boundary Method [Peskin/MacQueen]
  Fibers (e.g., heart muscles) are modeled by lists of fiber points
  The fluid space is modeled by a regular lattice
  The irregular fiber lists need to interact with the regular fluid lattice: a trade-off between load balancing the fibers and minimizing communication
  Memory and communication intensive
The current model can be used to design heart valves; ultimately, the goal is real-time clinical use
Random array access is the key performance problem; compiler optimizations were developed to improve it
Application effort funded by NSF/NPACI

31 Parallel Performance and Scalability
Poisson solver using the "Method of Local Corrections" [Balls, Colella], on an IBM SP and a Cray T3E
Communication is < 5% of the time; scaled speedup is nearly ideal (flat)
Scaled speedup: the problem size grows with the number of processors, though the problem (work) itself doesn't always scale exactly linearly

32 Titanium Interactions
GASNet interactions, in addition to the application collaborators:
  Charles Peskin and Dave McQueen (Courant Institute)
  Phil Colella and Tong Wen (LBNL)
  Scott Baden and Greg Balls (UCSD)
Involved in the Sun HPCS effort
The GASNet work is common to UPC and Titanium: a joint effort between U.C. Berkeley and LBNL (the UPC project is primarily at LBNL; Titanium is at U.C. Berkeley)
Collaboration with Nieplocha on the communication runtime
Participation in Global Address Space tutorials

33 The End

34 NAS PB: CG
CG (Conjugate Gradient) can be written naturally with fine-grained communication in the sparse matrix-vector product
This worked well on the T3E (and hopefully will on the X1); for other machines, a bulk version is required
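As a concrete illustration (not the NAS CG source), here is a UPC sketch of the two styles for a CSR sparse matrix-vector product y = A*x, where the source vector x is shared. The size N, the blocked layout, and the function names are assumptions; the static shared array assumes a static-THREADS compile with N a multiple of THREADS.

  #include <upc.h>
  #include <stddef.h>

  #define BLOCK 512
  #define N     4096                      /* assumed equal to BLOCK * THREADS */
  shared [BLOCK] double x[N];             /* blocked: one contiguous chunk per thread */

  /* Fine-grained version: every x[col[k]] may be a small remote read. */
  void spmv_fine(int nrows, const int *rowptr, const int *col,
                 const double *a, double *y) {
      for (int i = 0; i < nrows; i++) {
          double sum = 0.0;
          for (int k = rowptr[i]; k < rowptr[i+1]; k++)
              sum += a[k] * x[col[k]];
          y[i] = sum;
      }
  }

  /* Bulk version: fetch each thread's chunk of x once, then compute locally. */
  void spmv_bulk(int nrows, const int *rowptr, const int *col,
                 const double *a, double *y) {
      static double xloc[N];              /* private copy of the whole vector */
      for (int t = 0; t < THREADS; t++)
          upc_memget(&xloc[t * BLOCK], &x[t * BLOCK], BLOCK * sizeof(double));
      for (int i = 0; i < nrows; i++) {
          double sum = 0.0;
          for (int k = rowptr[i]; k < rowptr[i+1]; k++)
              sum += a[k] * xloc[col[k]];
          y[i] = sum;
      }
  }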

