Unified Parallel C at LBNL/UCB UPC: A Portable High Performance Dialect of C Kathy Yelick Christian Bell, Dan Bonachea, Wei Chen, Jason Duell, Paul Hargrove, Parry Husbands, Costin Iancu, Wei Tu, Mike Welcome

Unified Parallel C at LBNL/UCB Parallelism on the Rise 1.8x annual performance increase -1.4x from improved technology and on-chip parallelism -1.3x in processor count for larger machines [Chart: aggregate systems performance (1M FLOPS to 1 PFLOPS) vs. single-CPU performance and CPU frequencies (10 MHz to 10 GHz) by year, spanning machines from the X-MP, VP-200, and S-810/20 through the CM-5, Paragon, T3D/T3E, ASCI Red/Blue/White/Q, and the Earth Simulator, illustrating increasing parallelism]

Unified Parallel C at LBNL/UCB Parallel Programming Models Parallel software is still an unsolved problem! Most parallel programs are written using either: -Message passing with a SPMD model -for scientific applications; scales easily -Shared memory with threads in OpenMP, pthreads, or Java -for non-scientific applications; easier to program Partitioned Global Address Space (PGAS) Languages -global address space like threads (programmability) -SPMD parallelism like MPI (performance) -local/global distinction, i.e., layout matters (performance)

Unified Parallel C at LBNL/UCB Partitioned Global Address Space Languages Explicitly-parallel programming model with SPMD parallelism -Fixed at program start-up, typically 1 thread per processor Global address space model of memory -Allows programmer to directly represent distributed data structures Address space is logically partitioned -Local vs. remote memory (two-level hierarchy) Programmer control over performance critical decisions -Data layout and communication Performance transparency and tunability are goals -Initial implementation can use fine-grained shared memory Base languages differ: UPC (C), CAF (Fortran), Titanium (Java)

Unified Parallel C at LBNL/UCB UPC Design Philosophy Unified Parallel C (UPC) is: -An explicit parallel extension of ISO C -A partitioned global address space language -Sometimes called a GAS language Similar to the C language philosophy -Concise and familiar syntax -Orthogonal extensions of semantics -Assume programmers are clever and careful -Give them control; possibly close to hardware -Even though they may get into trouble Based on ideas in Split-C, AC, and PCP

Unified Parallel C at LBNL/UCB A Quick UPC Tutorial

Unified Parallel C at LBNL/UCB Virtual Machine Model Global address space abstraction -Shared memory is partitioned over threads -Shared vs. private memory partition within each thread -Remote memory may stay remote: no automatic caching implied -One-sided communication through reads/writes of shared variables Build data structures using -Distributed arrays -Two kinds of pointers: local vs. global pointers ("pointers to shared") [Diagram: global address space with a shared partition holding X[0], X[1], ..., X[P] (one element per thread) and a private partition per thread holding local pointers ptr, for Thread 0, Thread 1, ..., Thread n]
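A minimal sketch of these two building blocks (declarations only; the names x, p, and lp are illustrative):
shared int x[THREADS];   /* distributed array: element i has affinity to thread i */
shared int *p;           /* pointer-to-shared ("global" pointer): may reference remote data */
int *lp;                 /* ordinary C pointer: private, refers to local data only */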

Unified Parallel C at LBNL/UCB UPC Execution Model Threads work independently in a SPMD fashion -Number of threads given by THREADS, set as a compile-time or runtime flag -MYTHREAD specifies the thread index (0..THREADS-1) -upc_barrier is a global synchronization: all wait Any legal C program is also a legal UPC program:
#include <upc_relaxed.h> /* needed for UPC extensions */
#include <stdio.h>
int main() {
    printf("Thread %d of %d: hello UPC world\n", MYTHREAD, THREADS);
}
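With the Berkeley UPC compiler, a program like this is typically built and launched along these lines (file name and thread count are illustrative):
upcc -o hello hello.upc     # source-to-source translate and compile
upcrun -n 4 ./hello         # run with THREADS = 4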

Unified Parallel C at LBNL/UCB Private vs. Shared Variables in UPC C variables and objects are allocated in the private memory space Shared variables are allocated only once, in thread 0's space:
shared int ours;
int mine;
Shared arrays are spread across the threads:
shared int x[2*THREADS];     /* cyclic, 1 element each, wrapped */
shared [2] int y[2*THREADS]; /* blocked, with block size 2 */
Shared variables may not occur in a function definition unless static [Diagram: the shared global address space holds ours, the cyclic elements of x (thread 0: x[0], x[n+1]; thread 1: x[1], x[n+2]; ...; thread n: x[n], x[2n]) and the blocked elements of y (y[0,1], y[2,3], ..., y[2n-1,2n]); each thread's private space holds its own copy of mine]
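A small sketch (assuming the declarations of x and y above, plus <upc.h> and <stdio.h>) that queries the resulting layout at runtime with the standard upc_threadof call:
if (MYTHREAD == 0)
    for (int i = 0; i < 2*THREADS; i++)
        printf("x[%d] lives on thread %d, y[%d] on thread %d\n",
               i, (int)upc_threadof(&x[i]), i, (int)upc_threadof(&y[i]));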

Unified Parallel C at LBNL/UCB Work Sharing with upc_forall() The cyclic-layout, owner-computes version written by hand:
shared int v1[N], v2[N], sum[N];
void main() {
    int i;
    for (i = 0; i < N; i++)
        if (MYTHREAD == i % THREADS)
            sum[i] = v1[i] + v2[i];
}
This owner-computes idiom is so common that UPC has upc_forall(init; test; loop; affinity) statement; The programmer indicates that the iterations are independent -Undefined if there are dependencies across threads -The affinity expression indicates which iterations to run -Integer: affinity%THREADS is MYTHREAD -Pointer: upc_threadof(affinity) is MYTHREAD
shared int v1[N], v2[N], sum[N];
void main() {
    int i;
    upc_forall(i = 0; i < N; i++; &v1[i])  /* plain i would also work as the affinity expression */
        sum[i] = v1[i] + v2[i];
}

Unified Parallel C at LBNL/UCB Memory Consistency in UPC Shared accesses are strict or relaxed, designated by: -A pragma, which affects all otherwise unqualified accesses -#pragma upc relaxed -#pragma upc strict -Usually done by including the standard upc_relaxed.h or upc_strict.h headers -A type qualifier in a declaration, which affects all accesses to that object -int strict shared flag; -A strict or relaxed cast can be used to override the current pragma or declared qualifier Informal semantics -Relaxed accesses must obey dependencies, but non-dependent accesses may appear reordered to other threads -Strict accesses appear in order: sequentially consistent

Unified Parallel C at LBNL/UCB Other Features of UPC Synchronization constructs -Global barriers -Variant with labels to document matching of barriers -Split-phase variant (upc_notify and upc_wait) -Locks -upc_lock, upc_lock_attempt, upc_unlock Collective communication library -Allows for asynchronous entry/exit:
shared [] int A[10];
shared [10] int B[10*THREADS];
// Initialize A.
upc_all_broadcast(B, A, sizeof(int)*NELEMS,
                  UPC_IN_MYSYNC | UPC_OUT_ALLSYNC);
Parallel I/O library
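A brief sketch of the synchronization constructs listed above (the shared counter and the local-work function are illustrative, not from the slides):
upc_lock_t *lk = upc_all_lock_alloc();  /* collective: every thread gets the same lock */
upc_lock(lk);
shared_counter += 1;                    /* critical section on shared data */
upc_unlock(lk);

upc_notify;                             /* split-phase barrier: signal arrival */
do_local_work();                        /* overlap independent local work */
upc_wait;                               /* complete the barrier */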

Unified Parallel C at LBNL/UCB The Berkeley UPC Compiler

Unified Parallel C at LBNL/UCB Goals of the Berkeley UPC Project Make UPC Ubiquitous on -Parallel machines -Workstations and PCs for development -A portable compiler: for future machines too Components of research agenda: 1.Runtime work for Partitioned Global Address Space (PGAS) languages in general 2.Compiler optimizations for parallel languages 3.Application demonstrations of UPC

Unified Parallel C at LBNL/UCB Berkeley UPC Compiler [Diagram: UPC source is lowered through Higher WHIRL and Lower WHIRL (where optimizing transformations run) to C + Runtime calls, or in the future to assembly (IA64, MIPS, ...) + Runtime] Compiler based on Open64 -Multiple front-ends, including gcc -Intermediate form called WHIRL -Current focus on C backend; IA64 possible in future UPC Runtime -Pointer representation -Shared/distributed memory -Communication in GASNet, which is portable and language-independent

Unified Parallel C at LBNL/UCB Optimizations In Berkeley UPC compiler -Pointer representation -Generating optimizable single processor code -Message coalescing (aka vectorization) Opportunities -forall loop optimizations (unnecessary iterations) -Irregular data set communication (Titanium) -Sharing inference -Automatic relaxation analysis and optimizations

Unified Parallel C at LBNL/UCB Pointer-to-Shared Representation UPC has three different kinds of pointers: -Block-cyclic, cyclic, and indefinite (always local) A pointer needs a "phase" to keep track of where it is within a block -Source of overhead for updating and de-referencing -Consumes space in the pointer Our runtime has special cases for: -Phaseless (cyclic and indefinite) – skip phase update -Indefinite – skip thread id update -Some machine-specific special cases for some memory layouts Pointer size/representation easily reconfigured -64 bits on small machines, 128 on large, word or struct [Diagram: pointer-to-shared fields: Address | Thread | Phase]
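A minimal sketch of what a struct-based pointer-to-shared might look like (the type name and field widths are illustrative, not the Berkeley UPC runtime's actual layout):
typedef struct {
    uint64_t addr;    /* address within the owning thread's shared region */
    uint32_t thread;  /* owning thread id */
    uint32_t phase;   /* offset within the current block; always 0 for phaseless pointers */
} ptr_to_shared_sketch;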

Unified Parallel C at LBNL/UCB Performance of Pointers to Shared Phaseless pointers are an important optimization -Indefinite pointers almost as fast as regular C pointers -General block-cyclic pointer 7x slower for addition Competitive with the HP compiler, which generates native code -Both compilers have improved since these were measured

Unified Parallel C at LBNL/UCB Generating Optimizable (Vectorizable) Code Translator-generated C code can be as efficient as the original C code Source-to-source translation is a good strategy for portable PGAS language implementations

Unified Parallel C at LBNL/UCB NAS CG: OpenMP style vs. MPI style GAS language outperforms MPI+Fortran (flat is good!) Fine-grained (OpenMP style) version still slower -the shared memory programming style leads to more overhead (redundant boundary computation) GAS languages can support both programming styles

Unified Parallel C at LBNL/UCB Message Coalescing Implemented in a number of parallel Fortran compilers (e.g., HPF) Idea: replace individual puts/gets with bulk calls Targets bulk calls and index/strided calls in the UPC runtime (new) Goal: ease programming by speeding up shared memory style Unoptimized loop:
shared [0] int *r;
...
for (i = L; i < U; i++)
    exp1 = exp2 + r[i];
Optimized loop:
int lr[U-L];
...
upcr_memget(lr, &r[L], U-L);
for (i = L; i < U; i++)
    exp1 = exp2 + lr[i-L];
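For comparison, the hand-written "bulk style" code referred to on the next slides would use the standard library call upc_memget, which takes a byte count (a sketch reusing the names above):
int lr[U-L];
upc_memget(lr, &r[L], (U-L) * sizeof(int));
for (i = L; i < U; i++)
    exp1 = exp2 + lr[i-L];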

Unified Parallel C at LBNL/UCB Message Coalescing vs. Fine-grained One thread per node Vector is 100K elements, number of rows is 100*threads Message-coalesced code is more than 100X faster Fine-grained code also does not scale well -Network overhead

Unified Parallel C at LBNL/UCB Message Coalescing vs. Bulk Message coalescing and bulk-style code have comparable performance -For indefinite arrays the generated code is identical -For cyclic arrays, coalescing is faster than manual bulk code on elan -memgets to each thread are overlapped -Points to the need for a language extension

Unified Parallel C at LBNL/UCB Automatic Relaxation Goal: simplify programming by giving programmers the illusion that the compiler and hardware are not reordering accesses When compiling sequential programs, reordering
x = expr1; y = expr2;   into   y = expr2; x = expr1;
is valid if y is not in expr1 and x is not in expr2 (roughly) When compiling parallel code, this test is not sufficient:
Initially flag = data = 0
Proc A: data = 1; flag = 1;
Proc B: while (flag != 1); ... = ...data...;
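A hedged sketch of how the flag/data idiom above is written safely in UPC with a strict flag (the two-thread assumption and thread numbering are illustrative):
#include <upc_relaxed.h>
#include <stdio.h>

strict shared int flag = 0;  /* strict: its accesses appear in program order to all threads */
shared int data = 0;         /* relaxed by default under upc_relaxed.h */

int main() {
    if (MYTHREAD == 0) {           /* producer (Proc A) */
        data = 1;                  /* relaxed write */
        flag = 1;                  /* strict write: cannot be reordered before the data write */
    } else if (MYTHREAD == 1) {    /* consumer (Proc B) */
        while (flag != 1) ;        /* strict read: spin until the producer signals */
        printf("data = %d\n", data);   /* guaranteed to print 1 */
    }
    return 0;
}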

Unified Parallel C at LBNL/UCB Cycle Detection: Dependence Analog Processors define a "program order" on accesses from the same thread -P is the union of these total orders The memory system defines an "access order" on accesses to the same variable -A is the access order (read/write and write/write pairs) A violation of sequential consistency is a cycle in P ∪ A Intuition: time cannot flow backwards [Diagram: for the flag/data program, P edges write data → write flag (Proc A) and read flag → read data (Proc B), joined by A edges between the flag accesses and between the data accesses, form a cycle]
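A worked instance for the flag/data program above:
P (program order):  Proc A: write data → write flag;   Proc B: read flag → read data
A (conflicting accesses to the same variable):  write flag ↔ read flag,  write data ↔ read data
The path write data → write flag ↔ read flag → read data ↔ write data is a cycle in P ∪ A, so both P edges land in the delay set: Proc A's two writes and Proc B's two reads must each stay in order, which is exactly what makes the busy-wait idiom work.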

Unified Parallel C at LBNL/UCB Cycle Detection Generalizes to arbitrary numbers of variables and processors Cycles may be arbitrarily long, but it is sufficient to consider only cycles with 1 or 2 consecutive stops per processor [Diagram: an example cycle through the accesses write x, write y / read y, read y / write x spread across processors]

Unified Parallel C at LBNL/UCB Static Analysis for Cycle Detection Approximate P by the control flow graph Approximate A by undirected "dependence" edges Let the "delay set" D be all edges from P that are part of a minimal cycle The execution order of D edges must be preserved; other P edges may be reordered (modulo the usual rules about serial code) Conclusions: -Cycle detection is possible for a small language -Synchronization analysis is critical -Open: is pointer/array analysis accurate enough for this to be practical? [Diagram: an example with accesses write z, read x / read y, write z / write y, read x and the delay-set edges that must be preserved]

Unified Parallel C at LBNL/UCB GASNet: Communication Layer for PGAS Languages

Unified Parallel C at LBNL/UCB GASNet Design Overview - Goals Language-independence: support multiple PGAS languages/compilers -UPC, Titanium, Co-array Fortran, possibly others.. -Hide UPC- or compiler-specific details such as pointer-to-shared representation Hardware-independence: variety of parallel arch., OS's & networks -SMP's, clusters of uniprocessors or SMPs -Current networks: -Native network conduits: Myrinet GM, Quadrics Elan, Infiniband VAPI, IBM LAPI -Portable network conduits: MPI 1.1, Ethernet UDP -Under development: Cray X-1, SGI/Cray Shmem, Dolphin SCI -Current platforms: -CPU: x86, Itanium, Opteron, Alpha, Power3/4, SPARC, PA-RISC, MIPS -OS: Linux, Solaris, AIX, Tru64, Unicos, FreeBSD, IRIX, HPUX, Cygwin, MacOS Ease of implementation on new hardware -Allow quick implementations -Allow implementations to leverage performance characteristics of hardware -Allow flexibility in message servicing paradigm (polling, interrupts, hybrids, etc) Want both portability & performance

Unified Parallel C at LBNL/UCB GASNet Design Overview - System Architecture 2-level architecture to ease implementation [Layer diagram: Compiler-generated code → Compiler-specific runtime system → GASNet Extended API → GASNet Core API → Network Hardware] Core API -Most basic required primitives, as narrow and general as possible -Implemented directly on each network -Based heavily on the active messages paradigm Extended API -Wider interface that includes more complicated operations -We provide a reference implementation of the extended API in terms of the core API -Implementors can choose to directly implement any subset for performance - leverage hardware support for higher-level operations -Currently includes: blocking and non-blocking puts/gets (all contiguous), flexible synchronization mechanisms, barriers -Just recently added non-contiguous extensions (coming up later)
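A minimal sketch of the kind of calls a compiler emits against the extended API, assuming the GASNet-1 put/get interface (node, addresses, sizes, and the handle variable are illustrative; declarations omitted):
#include <gasnet.h>

/* blocking one-sided get: copy nbytes from (node, src) into local dest */
gasnet_get(dest, node, src, nbytes);

/* non-blocking put, overlapped with independent computation, then synchronized */
gasnet_handle_t h = gasnet_put_nb(node, remote_dest, local_src, nbytes);
/* ... independent computation ... */
gasnet_wait_syncnb(h);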

Unified Parallel C at LBNL/UCB GASNet Performance Summary

Unified Parallel C at LBNL/UCB GASNet vs. MPI on Infiniband OSU MVAPICH is widely regarded as the "best" MPI implementation on Infiniband MVAPICH code is based on the FTG project MVICH (MPI over VIA) GASNet wins because it is fully one-sided: no tag matching or two-sided synchronization overheads MPI semantics provide two-sided synchronization, whether you want it or not

Unified Parallel C at LBNL/UCB GASNet vs. MPI on Infiniband GASNet significantly outperforms MPI at mid-range sizes - the cost of MPI tag matching The yellow line shows the cost of naïve bounce-buffer pipelining when the local side is not prepinned - memory registration is an important issue

Unified Parallel C at LBNL/UCB Applications in PGAS Languages

Unified Parallel C at LBNL/UCB PGAS Languages Scale Use of the memory model (relaxed/strict) for synchronization Medium sized messages done through array copies

Unified Parallel C at LBNL/UCB Performance Results Berkeley UPC FT vs MPI Fortran FT 80 Dual PIII-866MHz Nodes running Berkeley UPC (gm-conduit /Myrinet 2K, 33Mhz-64Bit bus)

Unified Parallel C at LBNL/UCB Challenging Applications Focus on the problems that are hard for MPI -Naturally fine-grained -Patterns of sharing/communication unknown until runtime Two examples -Adaptive Mesh Refinement (AMR) -Poisson problem in Titanium (low flops to memory/comm) -Hyperbolic problems in UPC (higher ratio, not adaptive so far) -Task parallel view (first) -Immersed boundary method simulation -Used for simulating the heart, cochlea, bacteria, insect flight,.. -Titanium version is a general framework –Specializations for the heart and cochlea -Particle methods with two structures: regular fluid mesh + list of materials

Unified Parallel C at LBNL/UCB Ghost Region Exchange in AMR Ghost regions exist even in the serial code -Algorithm decomposed as operations on grid patches -Nearest neighbors (7, 9, 27-point stencils, etc.) Adaptive mesh organized by levels -Nasty meta-data problem to find neighbors -They may exist only at a different level [Diagram: Grid 1 through Grid 4 with labeled ghost regions A, B, C exchanged between neighboring patches]

Unified Parallel C at LBNL/UCB Distributed Data Structures for AMR This shows just one level of the grid hierarchy Not a distributed array in any of the languages that support them Note: Titanium uses this structure even for regular arrays [Diagram: grid patches G1, G2, G3, G4 distributed across PROCESSOR 1 and PROCESSOR 2, linked by pointers P1, P2]

Unified Parallel C at LBNL/UCB Programmability Comparison in Titanium Ghost region exchange in AMR -37 lines of Titanium -327 lines of C++/MPI, of which 318 are MPI-related Speed (single processor, full solve) -The same algorithm, Poisson AMR on same mesh -C++/Fortran Chombo: 366 seconds -Titanium by Chombo programmer: 208 seconds -Titanium after expert optimizations: 155 seconds -The biggest optimization was avoiding copies of single element arrays, which required domain/performance expert to find/fix -Titanium is faster on this platform!

Unified Parallel C at LBNL/UCB Heart Simulation in Titanium Programming experience -Code existed in Fortran for vector machines -Complete rewrite in Titanium -Except the fast FFTs -Effort: roughly 3 GSR (graduate student researcher) years plus a postdoc year -# of numerical errors found along the way: about 1 every 2 weeks -Mostly due to missing code -# of race conditions: 1

Unified Parallel C at LBNL/UCB Scalability -< 1 second per timestep not possible -A 10x increase in bisection bandwidth would fix this

Unified Parallel C at LBNL/UCB Those Finicky Users How do we get people to use new languages? -Needs to be incremental -Start at the bottom, not at the top of the software stack Need to demonstrate advantages -Performance is the easiest: comes from ability to use great hardware -Productivity is harder: -Managers may be convinced by data -Programmers will vote by experience -Wait for programmer turnover Key: language must run well everywhere -As well as the hardware allows

Unified Parallel C at LBNL/UCB PGAS Languages are Not the End of the Story Flat parallelism model -Machines are not flat: vectors, streams, SIMD, VLIW, FPGAs, PIMs, SMP nodes, … No support for dynamic load balancing -Virtualize memory structure → moving load is easier -No virtualization of processor space → taskqueue library No fault tolerance -SPMD model is not a good fit if nodes fail frequently Little understanding of scientific problems -CAF and Titanium have multi-D arrays -A matrix and a grid are both arrays, but they're different -Next-level example: Immersed Boundary Method language

Unified Parallel C at LBNL/UCB To Virtualize or Not to Virtualize Why to virtualize -Portability -Fault tolerance -Load imbalance Why not to virtualize -Deep memory hierarchies -Expensive system overhead -Some problems match the hardware; don't want to pay the overhead -If we design only for these problems, we need to beat MPI, and HPC will always be a niche PGAS languages virtualize the memory structure but not the processor number Can we provide a virtualized machine, but still allow for control in the mapping (in separate code)?