UPC and Titanium: Open-source compilers and tools for scalable global address space computing
Kathy Yelick, University of California, Berkeley and Lawrence Berkeley National Laboratory
Center for Programming Models for Scalable Parallel Computing Review, March 13, 2003

Outline
- Global Address Languages in General
  - Distinction between languages and libraries
- UPC
  - Language overview
  - Berkeley UPC compiler status and microbenchmarks
  - Application benchmarks and plans
- Titanium
  - Language overview
  - Berkeley Titanium compiler status
  - Application benchmarks and plans

Global Address Space Languages
- Explicitly parallel programming model with SPMD parallelism
  - Fixed at program start-up, typically 1 thread per processor
- Global address space model of memory
  - Allows the programmer to directly represent distributed data structures
- Address space is logically partitioned
  - Local vs. remote memory (two-level hierarchy)
- Programmer control over performance-critical decisions
  - Data layout and communication
- Performance transparency and tunability are goals
  - Initial implementation can use fine-grained shared memory
- Suitable for current and future architectures
  - Either shared memory or lightweight messaging is key
- Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)

Global Address Space
- The languages share the global address space abstraction
  - Shared memory is partitioned by processors
  - Remote memory may stay remote: no automatic caching implied
  - One-sided communication through reads/writes of shared variables (see the sketch below)
  - Both individual and bulk memory copies
- They differ on details
  - Some models have a separate private memory area
  - Generality of distributed arrays and how they are constructed
[Figure: partitioned global address space, with a shared array X[0], X[1], ..., X[P] spread across the shared space and a private pointer (ptr) in each thread's private area]
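To make the one-sided style concrete, here is a minimal UPC sketch (the array grid, block size B, and function fetch_neighbor are illustrative, not from the talk): one thread reads a single possibly-remote element, then bulk-copies a neighbor's whole block with upc_memget.

```c
#include <upc_relaxed.h>

#define B 64
/* One contiguous block of B doubles per thread (block-cyclic, one block each). */
shared [B] double grid[B * THREADS];

void fetch_neighbor(void)
{
    int right = (MYTHREAD + 1) % THREADS;
    double halo[B];

    /* Individual one-sided access: read one (possibly remote) shared element. */
    double first = grid[right * B];

    /* Bulk one-sided copy: the neighbor's whole block into private memory. */
    upc_memget(halo, &grid[right * B], B * sizeof(double));

    (void)first; (void)halo;
}
```

The owning thread posts no matching receive; both accesses complete one-sidedly, which is the property the shared-variable model relies on.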

UPC Programming Model Features
- SPMD parallelism
  - Fixed number of images during execution
  - Images operate asynchronously
- Several kinds of array distributions (collected in the sketch below)
  - double a[n]: a private n-element array on each processor
  - shared double a[n]: an n-element shared array, with cyclic mapping
  - shared [4] double a[n]: a block-cyclic array with 4-element blocks
  - shared [0] double *a = (shared [0] double *) upc_alloc(n);  a shared array with all elements local
- Pointers for irregular data structures
  - shared double *sp: a pointer to shared data
  - double *lp: a pointer to private data
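The declarations listed above, gathered into one compilable sketch (the identifiers and size N are illustrative):

```c
#include <upc.h>

#define N 16

double            a_priv[N];   /* a private N-element array on every thread            */
shared double     a_cyc[N];    /* cyclic: element i has affinity to thread i % THREADS */
shared [4] double a_blk[N];    /* block-cyclic with 4-element blocks                   */

shared double *sp;             /* pointer to shared data (target may be remote)        */
double        *lp;             /* pointer to private data                              */

void setup(void)
{
    /* Indefinite block size: all N elements are local to the calling thread. */
    shared [0] double *a_loc = (shared [0] double *) upc_alloc(N * sizeof(double));

    sp = &a_cyc[MYTHREAD];     /* the first element with affinity to this thread              */
    lp = (double *) sp;        /* cast to private is valid only because the target is local   */
    (void)a_loc;
}
```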

UPC Programming Model Features
- Global synchronization (see the sketch below)
  - upc_barrier: traditional barrier
  - upc_notify/upc_wait: split-phase global synchronization
- Pair-wise synchronization
  - upc_lock/upc_unlock: traditional locks
- Memory consistency has two types of accesses
  - Strict: must be performed immediately and atomically; typically a blocking round-trip message if remote
  - Relaxed: must still preserve dependencies, but other processors may view these accesses as happening out of order
- Parallel I/O
  - Based on ideas in MPI I/O
  - Specification for UPC by Thakur, El-Ghazawi et al.
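A small sketch tying these features together (identifiers are illustrative): a lock-protected update, both barrier forms, and a strict shared flag.

```c
#include <upc_relaxed.h>      /* relaxed is the default consistency for this file     */

shared int counter = 0;       /* shared scalar, affinity to thread 0                  */
strict shared int ready = 0;  /* strict accesses are ordered/atomic as seen by all    */
upc_lock_t *lock;

void demo(void)
{
    lock = upc_all_lock_alloc();   /* collective: every thread gets the same lock     */

    upc_lock(lock);                /* pair-wise synchronization around the update     */
    counter += 1;
    upc_unlock(lock);

    upc_barrier;                   /* traditional global barrier                      */

    upc_notify;                    /* split-phase barrier: announce arrival ...       */
    /* ... purely local work can overlap here ...                                     */
    upc_wait;                      /* ... then wait for all other threads             */

    if (MYTHREAD == 0)
        ready = 1;                 /* strict write: a blocking, ordered access        */
}
```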

Berkeley UPC Compiler
- Compiler based on Open64
  - Recently merged Rice sources
  - Multiple front-ends, including gcc
  - Intermediate form called WHIRL
- Current focus on C backend
  - IA64 possible in future
- UPC Runtime
  - Pointer representation
  - Shared/distributed memory
- Communication in GASNet
  - Portable
  - Language-independent
[Figure: compilation flow: UPC source is lowered through Higher WHIRL and, via optimizing transformations, Lower WHIRL, then emitted as C + Runtime (or, in future, assembly for IA64, MIPS,… + Runtime)]

Design for Portability & Performance
- UPC-to-C translator: translates UPC to C, inserting runtime calls for parallel features
- UPC runtime: allocates shared data; implements pointers-to-shared
- GASNet: a uniform interface for low-level communication primitives
- Portability:
  - C is our intermediate language
  - GASNet is itself layered, with a small core as the essential part
- High performance:
  - Native C compiler optimizes serial code
  - Translator can perform communication optimizations
  - GASNet can access the network directly

Berkeley UPC Compiler Status
- UPC extensions added to the front-end
- Code generation complete
  - Some issues related to code quality (hints to backend compilers)
- GASNet communication layer
  - Running on Quadrics/Elan, IBM/LAPI, Myrinet/GM, and MPI
  - Optimized for small non-blocking messages and compiled code
  - Next step: strided and indexed put/get, leveraging ARMCI work
- UPC runtime layer
  - Developed and tested on all GASNet implementations
  - Supports multiple pointer representations
  - Next step: direct shared memory support
- Release scheduled for later this month
  - A glitch related to include files and usability remains to be ironed out

Pointer-to-Shared Representation
- UPC has three different kinds of pointers
  - Block-cyclic, cyclic, and indefinite (always local)
- A pointer needs a “phase” to keep track of where it is in a block (see the sketch below)
  - Source of overhead for updating and dereferencing
  - Consumes space in the pointer
- Our runtime has special cases for:
  - Phaseless (cyclic and indefinite): skip the phase update
  - Indefinite: also skip the thread id update
- Pointer size/representation easily reconfigured
  - 64 bits on small machines, 128 on large; word or struct
[Figure: pointer-to-shared fields: address, thread, phase]
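As a hedged illustration (not the actual Berkeley layout), a struct-style pointer-to-shared and its per-element increment look roughly like this; the general block-cyclic case must maintain the phase, while the phaseless cyclic case skips it, which is the optimization referred to above.

```c
#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint64_t addr;    /* local virtual address on the owning thread */
    uint32_t thread;  /* owning thread id                            */
    uint32_t phase;   /* position within the current block           */
} sptr_t;             /* illustrative "struct" pointer-to-shared     */

/* Generic block-cyclic increment by one element of elem bytes, block size
   blksz, nthreads threads: phase, thread, and address may all need updating,
   which is the overhead mentioned above. */
static sptr_t sptr_inc(sptr_t p, size_t elem, size_t blksz, uint32_t nthreads)
{
    p.phase++;
    p.addr += elem;
    if (p.phase == blksz) {                /* crossed a block boundary          */
        p.phase = 0;
        p.thread = (p.thread + 1) % nthreads;
        if (p.thread != 0)                 /* same block row on the next thread */
            p.addr -= blksz * elem;        /* rewind to the start of the block  */
    }
    return p;
}

/* Phaseless (cyclic, block size 1) increment: no phase to maintain. */
static sptr_t sptr_inc_cyclic(sptr_t p, size_t elem, uint32_t nthreads)
{
    p.thread = (p.thread + 1) % nthreads;
    if (p.thread == 0)
        p.addr += elem;
    return p;
}
```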

Preliminary Performance
- Testbed
  - Compaq AlphaServer, with the Quadrics GASNet conduit
  - Compaq C compiler for the translated C code
- Microbenchmarks
  - Measure the cost of UPC language features and constructs
  - Shared pointer arithmetic, barrier, allocation, etc.
  - Vector addition: no remote communication (see the sketch below)
- NAS Parallel Benchmarks
  - EP: no communication
  - IS: large bulk memory operations
  - MG: bulk memput
  - CG: fine-grained vs. bulk memput
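The vector-addition microbenchmark has roughly the shape below (size and names are illustrative); because the affinity expression &sum[i] runs iteration i on the thread owning sum[i], and v1, v2, and sum share the same cyclic layout, every access in the loop is local.

```c
#include <upc_relaxed.h>

#define N 1000
shared int v1[N], v2[N], sum[N];   /* default (cyclic) layout for all three */

void vadd(void)
{
    int i;
    /* The fourth clause is the affinity test: iteration i executes on the
       thread that owns sum[i], so every access below is local. */
    upc_forall (i = 0; i < N; i++; &sum[i])
        sum[i] = v1[i] + v2[i];
}
```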

Performance of Shared Pointer Arithmetic
- Phaseless pointers are an important optimization
  - Indefinite pointers are almost as fast as regular C pointers
  - The general block-cyclic pointer is 7x slower for addition
- Competitive with the HP compiler, which generates native code
  - Both compilers have known opportunities for improvement

Cost of Shared Memory Access
- Local shared accesses are somewhat slower than private ones
  - HP has improved local performance in a newer version
- Remote accesses are worse than local, as expected
  - The Runtime/GASNet layering for portability is not a problem

NAS PB: EP
- EP (Embarrassingly Parallel) has no communication
- Serial performance via C code generation is not a problem

NAS PB: IS
- IS (Integer Sort) is dominated by bulk communication
- GASNet bulk communication adds no measurable overhead

NAS PB: MG
- MG (Multigrid) involves medium bulk copies
- “Berkeley” reveals a slight serial performance degradation due to casts
- Berkeley-C uses the original C code for the inner loops

Scaling MG on the T3E
- Scalability of the language shown here for the T3E compiler
- Direct shared memory support is probably needed to be competitive on most current machines

Mesh Generation in UPC
- Parallel mesh generation in UPC: 2D Delaunay triangulation
- Based on Triangle software by Shewchuk (UCB)
- Parallel version from NERSC uses dynamic load balancing, software caching, and parallel sorting

Research in Optimizations
- Privatizing accesses to local memory (sketched below)
  - In conjunction with elimination of forall loop affinity tests
- Communication optimizations
  - Separate get/put from sync and exploit the split-phase barrier
  - Message aggregation (fine-grained to bulk)
  - Software caching
- Research problems:
  - Optimization selection based on a performance model
  - Language research in the UPC memory consistency model
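A hedged sketch of the privatization optimization mentioned above, written by hand (names and sizes are illustrative): when a blocked layout guarantees each thread's elements are local, the shared accesses and per-iteration affinity tests can be replaced by an ordinary C loop over private pointers.

```c
#include <upc_relaxed.h>

#define B 1024
shared [B] double a[B * THREADS], b[B * THREADS];   /* exactly one block per thread */

/* Straightforward version: an affinity test per iteration, shared accesses. */
void scale_add_shared(double alpha)
{
    int i;
    upc_forall (i = 0; i < B * THREADS; i++; &a[i])
        a[i] += alpha * b[i];
}

/* Privatized version: cast each local block to a private pointer once,
   then run an ordinary serial loop with no affinity tests or shared accesses. */
void scale_add_private(double alpha)
{
    double *la = (double *) &a[MYTHREAD * B];
    double *lb = (double *) &b[MYTHREAD * B];
    int i;
    for (i = 0; i < B; i++)
        la[i] += alpha * lb[i];
}
```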

Preliminary Performance Results
- UPC communication optimizations
  - Performed by hand
  - Remote fetch-and-increment (not random data)

UPC Interactions
- UPC consortium
  - Tarek El-Ghazawi is coordinator: semi-annual meetings, ~daily
  - Revised UPC Language Specification (IDA, GWU, …)
  - UPC Collectives (MTU)
  - UPC I/O Specifications (GWU, ANL-PModels)
- Other implementations
  - HP (Alpha cluster and C+MPI compiler, with MTU)
  - MTU (C+MPI compiler based on the HP compiler; memory model)
  - Cray (X1 implementation)
  - Intrepid (SGI implementation based on gcc)
  - Etnus (debugging)
- UPC book: T. El-Ghazawi, B. Carlson, T. Sterling, K. Yelick
  - Goal is proofs by SC03
- HPCS effort
  - Recent interest from Sandia

Titanium
- Based on Java, a cleaner C++
  - Classes, automatic memory management, etc.
  - Compiled to C and then to a native binary (no JVM)
- Same parallelism model as UPC and CAF
  - SPMD with a global address space
  - Dynamic Java threads are not supported
- Optimizing compiler
  - Static (compile-time) optimizer, not a JIT
  - Communication and memory optimizations
  - Synchronization analysis (e.g., static barrier analysis)
  - Cache and other uniprocessor optimizations

Summary of Features Added to Java
1. Scalable parallelism (Java threads replaced)
2. Immutable (“value”) classes
3. Multidimensional arrays with unordered iteration
4. Checked synchronization
5. Operator overloading
6. Templates
7. Zone-based memory management (regions)
8. Libraries for collective communication, distributed arrays, bulk I/O

Immutable Classes in Titanium
- For small objects, would sometimes prefer to avoid the level of indirection
  - Pass by value (copy the entire object)
  - Especially when immutable: fields are never modified
- Example:

    immutable class Complex {
        double real, imag;
        Complex() { real = 0; imag = 0; }
        Complex(double r, double i) { real = r; imag = i; }
        Complex operator+ (Complex c) { ... }
    }
    Complex c1 = new Complex(7.1, 4.3);
    c1 = c1 + c1;

- Addresses performance and programmability
  - Similar to structs in C (not C++ classes) in terms of performance
  - Adds support for complex types

Multidimensional Arrays
- Arrays in Java are objects
  - Array bounds are checked
  - Multidimensional arrays are arrays-of-arrays
  - Safe and general, but potentially slow
- A new kind of multidimensional array was added to Titanium
  - Sub-arrays are supported (interior, boundary, etc.)
  - Indexed by Points (tuples of ints)
  - Combined with unordered iteration to enable optimizations:

    foreach (p within A.domain()) { ... A[p] ... }

  - “A” could be multidimensional, an interior region, etc.

Communication
- Titanium has explicit global communication
  - Broadcast, reduction, etc.
  - Primarily used to set up distributed data structures
- Most communication is implicit through the shared address space
  - Dereferencing a global reference, g.x, can generate communication
- Arrays have copy operations, which generate bulk communication: A1.copy(A2)
  - Automatically computes the intersection of A1's and A2's index sets (domains)

Distributed Data Structures
- Building distributed arrays:

    Particle [1d] single [1d] allParticle =
        new Particle [0:Ti.numProcs()-1][1d];
    Particle [1d] myParticle =
        new Particle [0:myParticleCount-1];
    allParticle.exchange(myParticle);

- Now each processor has an array of pointers, one to each processor's chunk of particles
[Figure: all-to-all broadcast of the per-processor particle arrays among P0, P1, P2, ...]

Titanium Compiler Status
- The Titanium compiler runs on almost any machine
  - Requires a C compiler (and a decent C++ compiler to build the translator)
  - Pthreads for shared memory
  - Communication layer for distributed memory (or hybrid)
    - Recently moved to live on GASNet: obtained GM, Elan, and improved LAPI implementations
    - Leverages other PModels work for maintenance
- Recent language extensions
  - Indexed array copy (scatter/gather style)
  - Non-blocking array copy under development
- Compiler optimizations
  - Cache optimizations, for loop optimizations
  - Communication optimizations for overlap, pipelining, and scatter/gather under development

Applications in Titanium
- Several benchmarks
  - Fluid solvers with Adaptive Mesh Refinement (AMR)
  - Conjugate Gradient
  - 3D Multigrid
  - Unstructured mesh kernel: EM3D
  - Dense linear algebra: LU, MatMul
  - Tree-structured n-body code
  - Finite element benchmark
  - Genetics: micro-array selection
  - SciMark serial benchmarks
- Larger applications
  - Heart simulation
  - Ocean modeling with AMR (in progress)

Serial Performance (Pure Java)
- Several optimizations in the Titanium compiler (tc) over the past year
- These codes are all written in pure Java without performance extensions

AMR for Ocean Modeling
- Ocean modeling [Wen, Colella]
  - Requires embedded boundaries to model the ocean floor/coastline
  - Line vs. point relaxation to handle the aspect ratio: 1000 km x 10 km
  - Results in irregular data structures and array accesses
- Goal for this year: basin-scale AMR circulation model
  - Currently a non-adaptive implementation
  - Compiler and language support design
[Graphics from Titanium AMR Gas Dynamics, McCorquodale and Colella]

Immersed Boundary Method [Peskin/MacQueen]
- Fibers (e.g., heart muscles) are modeled by a list of fiber points
- Fluid space is modeled by a regular lattice
- Irregular fiber lists need to interact with the regular fluid lattice
  - Trade-off between load balancing of fibers and minimizing communication
  - Memory and communication intensive
- Heart simulation
  - Random array access is the key performance problem
  - Developed compiler optimizations to improve its performance
- Application effort funded by NSF/NPACI

Parallel Performance and Scalability
- Poisson solver using the “Method of Local Corrections” [Balls, Colella]
- Communication < 5%; scaled speedup nearly ideal (flat)
[Charts: scaled speedup on the IBM SP and the Cray T3E]

Titanium Interactions
- GASNet interactions, in addition to the application collaborators:
  - Charles Peskin and Dave McQueen, Courant Institute
  - Phil Colella and Tong Wen, LBNL
  - Scott Baden and Greg Balls, UCSD
- Involved in the Sun HPCS effort
- The GASNet work is common to UPC and Titanium
  - Joint effort between U.C. Berkeley and LBNL (the UPC project is primarily at LBNL; Titanium is at U.C. Berkeley)
- Collaboration with Nieplocha on the communication runtime
- Participation in Global Address Space tutorials

The End

NAS PB: CG
- CG (Conjugate Gradient) can be written naturally with fine-grained communication in the sparse matrix-vector product
- Worked well on the T3E (and hopefully will on the X1)
- For other machines, a bulk version is required

NAS MG in Titanium
- Preliminary performance for the MG code on the IBM SP