1 PGAS Languages (Kathy Yelick). Compilation Technology for Computational Science. Kathy Yelick, Lawrence Berkeley National Laboratory and UC Berkeley. Joint work with the Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen; and the Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome

Kathy Yelick, 2 Outline Computer architecture trends Software trends Scientific computing: expertise in parallelism Performance is as important as parallelism Resource management is key to performance Open question: how much to virtualize machine? Parallel language problems & PGAS solutions Virtualize global address space Not shared virtual memory, not virtual processor space Parallel compiler problems/solutions

Kathy Yelick, 3 Parallelism Everywhere Single processor Moore’s Law effect is ending Power density limitations; device physics below 90nm Multicore is becoming the norm AMD, IBM, Intel, Sun all offering multicore Number of cores per chip likely to increase with density Fundamental software change Parallelism is exposed to software Performance is no longer solely a hardware problem What has the HPC community learned? Caveat: Scale and applications differ

Kathy Yelick, 4 High-end simulation in the physical sciences = 7 methods (Phillip Colella's "Seven dwarfs"): 1. Structured Grids (including Adaptive Mesh Refinement) 2. Unstructured Grids 3. Spectral Methods (FFTs, etc.) 4. Dense Linear Algebra 5. Sparse Linear Algebra 6. Particles 7. Monte Carlo Simulation. Add 4 for embedded to cover all 41 EEMBC benchmarks: 8. Search/Sort 9. Filter 10. Combinational Logic 11. Finite State Machine. Note: data sizes (8 bit to 32 bit) and types (integer, character) differ, but the algorithms are the same. Games/entertainment is close to scientific computing. Slide source: Phillip Colella, 2004 and Dave Patterson, 2006

Kathy Yelick, 5 Parallel Programming Models Parallel software is still an unsolved problem! Most parallel programs are written using either: message passing with an SPMD model (scientific applications; scales easily) or shared memory with threads in OpenMP, Threads, or Java (non-scientific applications; easier to program). Partitioned Global Address Space (PGAS) languages offer 3 features. Productivity: easy to understand and use. Performance: primary requirement in HPC. Portability: must run everywhere.

Kathy Yelick, 6 Partitioned Global Address Space Global address space: any thread/process may directly read/write data allocated by another. Partitioned: data is designated as local (near) or global (possibly far); the programmer controls layout. [Figure: a global address space spanning threads p0..pn, showing each thread's private and shared x/y data and its local (l) and global (g) pointers.] By default: object heaps are shared, program stacks are private. 3 current languages: UPC, CAF, and Titanium. Emphasis in this talk on UPC & Titanium (based on Java)

Kathy Yelick, 7 PGAS Language Overview Many common concepts, although specifics differ Consistent with base language Both private and shared data int x[10]; and shared int y[10]; Support for distributed data structures Distributed arrays; local and global pointers/references One-sided shared-memory communication Simple assignment statements: x[i] = y[i]; or t = *p; Bulk operations: memcpy in UPC, array ops in Titanium and CAF Synchronization Global barriers, locks, memory fences Collective Communication, IO libraries, etc.
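A minimal UPC sketch of these concepts (not part of the original slides; it assumes a UPC compiler such as Berkeley UPC, and the sizes are illustrative): private vs. shared data, a one-sided remote write, and a bulk one-sided read.

    #include <upc.h>
    #include <stdio.h>

    /* Shared array: each thread has affinity to one block of 10 elements. */
    shared [10] int y[10*THREADS];

    int main(void) {
        int x[10];                            /* private: one copy per thread */
        int right = (MYTHREAD + 1) % THREADS;

        /* One-sided write: assign directly into an element owned by the
           right-hand neighbor; no matching receive is required. */
        y[10*right] = MYTHREAD;

        upc_barrier;                          /* global synchronization */

        /* One-sided bulk read: fetch the neighbor's whole block into
           private memory (the UPC analog of a remote memcpy). */
        upc_memget(x, &y[10*right], 10 * sizeof(int));

        printf("thread %d read %d from its neighbor's block\n", MYTHREAD, x[0]);
        return 0;
    }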

Kathy Yelick, 8 Example: Titanium Arrays Ti arrays are created using Domains and indexed using Points:
  double [3d] gridA = new double [[0,0,0]:[10,10,10]];
foreach eliminates some loop bound errors:
  foreach (p in gridA.domain()) gridA[p] = gridA[p]*c + gridB[p];
A rich domain calculus allows slicing, subarrays, transposes, and other operations without data copies. Array copy operations automatically work on the intersection of the two domains:
  data[neighborPos].copy(mydata);
[Figure: mydata and data[neighborPos], showing "restrict"-ed (non-ghost) cells, ghost cells, and the copied intersection.]

Kathy Yelick, 9 Productivity: Line Count Comparison Comparison of the NAS Parallel Benchmarks: the UPC versions require modest programming effort relative to C; Titanium is even more compact, especially for MG, which uses multi-dimensional arrays. Caveat: Titanium FT has a user-defined Complex type and uses cross-language support to call FFTW for the serial 1D FFTs. UPC results from Tarek El-Ghazawi et al.; CAF from Chamberlain et al.; Titanium joint with Kaushik Datta & Dan Bonachea

Kathy Yelick, 10 Case Study 1: Block-Structured AMR Adaptive Mesh Refinement (AMR) is challenging Irregular data accesses and control from boundaries Mixed global/local view is useful AMR Titanium work by Tong Wen and Philip Colella Titanium AMR benchmarks available

Kathy Yelick, 11 AMR in Titanium C++/Fortran/MPI AMR: the Chombo package from LBNL uses bulk-synchronous communication, packing boundary data between processors. Titanium AMR is written entirely in Titanium, with finer-grained communication and no explicit pack/unpack code (automated in the runtime system). Code size in lines (C++/Fortran/MPI vs. Titanium) is compared for the AMR data structures, the AMR operations, and the elliptic PDE solver (4200* lines in C++/Fortran/MPI), showing a large reduction in lines of code for Titanium. *Somewhat more functionality in the PDE part of the Chombo code. Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

Kathy Yelick, 12 Performance of Titanium AMR Serial: Titanium is within a few % of C++/F; sometimes faster! Parallel: Titanium scaling is comparable with generic optimizations - additional optimizations (namely overlap) not yet implemented Comparable performance

Kathy Yelick, 13 Immersed Boundary Simulation in Titanium Modeling elastic structures in an incompressible fluid: blood flow in the heart, blood clotting, inner ear, embryo growth, and many more. Complicated parallelization: particle/mesh method, with "particles" connected into materials. Joint work with Ed Givelberg, Armando Solar-Lezama. [Chart: code size in lines, Fortran vs. Titanium.]

Kathy Yelick, 14 High Performance Strategy for acceptance of a new language within HPC: make it run faster than anything else. Approaches to high performance. Language support for performance: allow programmers sufficient control over resources for tuning (non-blocking data transfers, cross-language calls, etc.; control over layout, load balancing, and synchronization). Compiler optimizations reduce the need for hand tuning (automate non-blocking memory operations, relaxed memory, ...), and parallel analysis and optimizations yield productivity gains. Runtime support exposes the best possible performance: Berkeley UPC and Titanium use the GASNet communication layer, with dynamic optimizations based on runtime information.

Kathy Yelick, 15 One-Sided vs Two-Sided A one-sided put/get message can be handled directly by a network interface with RDMA support, avoiding interrupting the CPU or storing data from the CPU (preposts). A two-sided message needs to be matched with a receive to identify the memory address where the data should be put. Matching can be offloaded to the network interface in networks like Quadrics, but the match tables must be downloaded to the interface (from the host). [Figure: a one-sided put message carries the destination address and data payload; a two-sided message carries only a message id and payload, to be matched at the host CPU / network interface / memory.]
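To make the distinction concrete, here is a conceptual C sketch of the two message formats (illustrative only; not any real wire protocol or API):

    #include <stddef.h>
    #include <stdio.h>

    /* One-sided put: the header names the destination address, so an
       RDMA-capable NIC can deposit the payload directly into memory
       without involving the remote CPU. */
    struct one_sided_put {
        void   *dest_addr;     /* remote virtual address to write */
        size_t  len;
        char    payload[];     /* data follows the header */
    };

    /* Two-sided send: the header carries only a tag/id; the receiver must
       post a matching receive that supplies the destination address, which
       requires match tables (or host involvement) on the receiving side. */
    struct two_sided_msg {
        int     message_id;    /* matched against posted receives */
        size_t  len;
        char    payload[];
    };

    int main(void) {
        printf("put header: %zu bytes, send header: %zu bytes\n",
               sizeof(struct one_sided_put), sizeof(struct two_sided_msg));
        return 0;
    }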

Kathy Yelick, 16 Performance Advantage of One-Sided Communication: GASNet vs MPI Opteron/InfiniBand (Jacquard at NERSC): GASNet's vapi-conduit vs. OSU MPI MVAPICH. The half-power point (N1/2) differs by one order of magnitude. Joint work with Paul Hargrove and Dan Bonachea (up is good)

Kathy Yelick, 17 GASNet: Portability and High-Performance (down is good) GASNet better for latency across machines Joint work with UPC Group; GASNet design by Dan Bonachea

Kathy Yelick, 18 (up is good) GASNet at least as high (comparable) for large messages GASNet: Portability and High-Performance Joint work with UPC Group; GASNet design by Dan Bonachea

Kathy Yelick, 19 (up is good) GASNet excels at mid-range sizes: important for overlap GASNet: Portability and High-Performance Joint work with UPC Group; GASNet design by Dan Bonachea

Kathy Yelick, 20 Case Study 2: NAS FT Performance of the Exchange (Alltoall) is critical. 1D FFTs in each dimension, 3 phases; transpose after the first 2 for locality. The problem becomes bisection-bandwidth-limited as #procs grows. Three approaches. Exchange: wait for the 2nd-dimension FFTs to finish, send 1 message per processor pair. Slab: wait for a chunk of rows destined for 1 processor, send when ready. Pencil: send each row as it completes. Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Kathy Yelick, 21 Overlapping Communication Goal: make use of "all the wires all the time". Schedule communication to avoid network backup. Trade-off: overhead vs. overlap. Exchange has the fewest messages and less message overhead; slabs and pencils have more overlap, pencils the most. Example message sizes (Class D problem on 256 processors): Exchange (all data at once): 512 Kbytes; Slabs (contiguous rows that go to 1 processor): 64 Kbytes; Pencils (single row): 16 Kbytes. Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
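A sketch (not from the talk) of how the pencil strategy overlaps communication with computation, written against the UPC non-blocking transfer library (upc_nb.h, upc_memput_nb/upc_sync; availability varies by compiler, and the talk's experiments used Berkeley UPC's earlier bupc_* extensions). The buffer layout and sizes are invented for illustration:

    #include <upc.h>
    #include <upc_nb.h>    /* non-blocking transfer library (availability varies) */

    #define NROWS  64      /* rows produced locally per phase (illustrative) */
    #define ROWLEN 512     /* doubles per row (illustrative) */

    /* Receive buffer: one contiguous block per thread. */
    shared [NROWS*ROWLEN] double recvbuf[THREADS*NROWS*ROWLEN];

    double rows[NROWS][ROWLEN];        /* private rows computed locally */
    upc_handle_t h[NROWS];

    static void compute_row(int r) {   /* stand-in for the per-row 1D FFT */
        for (int i = 0; i < ROWLEN; i++) rows[r][i] = r + i;
    }

    int main(void) {
        int target = (MYTHREAD + 1) % THREADS;   /* simplified: single target */

        for (int r = 0; r < NROWS; r++) {
            compute_row(r);
            /* Pencil strategy: ship each row as soon as it is ready and keep
               computing the next row while the network moves this one. */
            h[r] = upc_memput_nb(&recvbuf[(size_t)target*NROWS*ROWLEN + r*ROWLEN],
                                 rows[r], ROWLEN * sizeof(double));
        }
        for (int r = 0; r < NROWS; r++)
            upc_sync(h[r]);            /* drain all outstanding puts */
        upc_barrier;                   /* all data is in place for the next phase */
        return 0;
    }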

Kathy Yelick, 22 NAS FT Variants Performance Summary Slab is always best for MPI; the small-message cost is too high. Pencil is always best for UPC; more overlap. (Peak shown: 0.5 Tflops.) Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Kathy Yelick, 23 Case Study 3: LU Factorization Direct methods have complicated dependencies Especially with pivoting (unpredictable communication) Especially for sparse matrices (dependence graph with holes) LU Factorization in UPC Use overlap ideas and multithreading to mask latency Multithreaded: UPC threads + user threads + threaded BLAS Panel factorization: Including pivoting Update to a block of U Trailing submatrix updates Status: Dense LU done: HPL-compliant Sparse version underway Joint work with Parry Husbands

Kathy Yelick, 24 UPC HPL Performance Comparison to ScaLAPACK on an Altix, 2 x 4 process grid: ScaLAPACK (block size 64) vs. UPC LU (block sizes 256 and 64; several block sizes tried), in GFlop/s. On a 4x4 process grid: ScaLAPACK (block size 64) vs. UPC (block size 200), in GFlop/s. MPI HPL numbers from the HPCC database. Large scaling: 2.2 TFlops on 512p, 4.4 TFlops on 1024p (Thunder). Joint work with Parry Husbands

Kathy Yelick, 25 Automating Support for Optimizations The previous examples are hand-optimized: non-blocking put/get on distributed memory, relaxed memory consistency on shared memory. What analyses are needed to optimize parallel codes? Concurrency analysis: determine which blocks of code could run in parallel. Alias analysis: determine which variables could access the same location. Synchronization analysis: align matching barriers, locks, ... Locality analysis: determine when a general (global) pointer is used only locally and can be converted to a cheaper local pointer. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 26 Reordering in Parallel Programs In parallel programs, a reordering can change the semantics even if no local dependencies exist. Initially, flag = data = 0.
  T1:  data = 1;  flag = 1;      (a reordering may execute flag = 1 before data = 1)
  T2:  f = flag;  d = data;
{f == 1, d == 0} is possible after reordering, but not in the original. Compiler, runtime, and hardware can all produce such reorderings. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 27 Memory Models Sequential consistency: a reordering is illegal if it can be observed by another thread. Relaxed consistency: reordering may be observed, but local dependencies and synchronization are preserved (roughly). Titanium, Java, & UPC are not sequentiallyconsistent; the perceived cost of enforcing it is too high. For Titanium and UPC, network latency is the cost; for Java, memory fences and code transformations are the cost. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 28 Compiler can reorder accesses as part of an optimization Example: copy propagation Logical fences inserted where reordering is illegal – optimizations respect these fences Hardware can reorder accesses Examples: out of order execution, remote accesses Fence instructions inserted into generated code – waits until all prior memory operations have completed Can cost a complete round trip time due to remote accesses Software and Hardware Reordering Joint work with Amir Kamil and Jimmy Su
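As a concrete illustration (not from the talk), here is what fencing the flag/data example might look like in generated code; __sync_synchronize() is used as a stand-in for whatever fence instruction or strict access the compiler and runtime actually emit:

    #include <pthread.h>
    #include <stdio.h>

    volatile int data = 0, flag = 0;

    void *producer(void *arg) {        /* T1 */
        data = 1;
        __sync_synchronize();          /* fence: data visible before flag */
        flag = 1;
        return NULL;
    }

    void *consumer(void *arg) {        /* T2 */
        int f = flag;
        __sync_synchronize();          /* fence: read of flag ordered before data */
        int d = data;
        printf("f=%d d=%d\n", f, d);   /* f==1 && d==0 should no longer be observable */
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, producer, NULL);
        pthread_create(&t2, NULL, consumer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }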

Kathy Yelick, 29 Conflicts Reordering of an access is observable only if it conflicts with some other access: the accesses can be to the same memory location, at least one access is a write, and the accesses can run concurrently. Fences (compiler and hardware) need to be inserted around accesses that conflict. Example:
  T1:  data = 1;  flag = 1;
  T2:  f = flag;  d = data;
The two accesses to data conflict, and the two accesses to flag conflict. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 30 Sequential Consistency in Titanium Goal: minimize the number of fences, allowing the same optimizations as the relaxed model. Concurrency analysis identifies concurrent accesses; it relies on Titanium's textual barriers and single-valued expressions. Alias analysis identifies accesses to the same location; it relies on the SPMD nature of Titanium. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 31 Barrier Alignment Many parallel languages make no attempt to ensure that barriers line up. Example code that is legal but will deadlock:
  if (Ti.thisProc() % 2 == 0)
    Ti.barrier(); // even ID threads
  else
    ;             // odd ID threads
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 32 Structural Correctness Aiken and Gay introduced structural correctness (POPL'98), which ensures that every thread executes the same number of barriers. Example of structurally correct code:
  if (Ti.thisProc() % 2 == 0)
    Ti.barrier(); // even ID threads
  else
    Ti.barrier(); // odd ID threads
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 33 Textual Barrier Alignment Titanium has textual barriers: all threads must execute the same textual sequence of barriers. This is a stronger guarantee than structural correctness; the following example is illegal:
  if (Ti.thisProc() % 2 == 0)
    Ti.barrier(); // even ID threads
  else
    Ti.barrier(); // odd ID threads
Single-valued expressions are used to enforce textual barriers. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 34 Single-Valued Expressions A single-valued expression has the same value on all threads when evaluated. Example: Ti.numProcs() > 1. All threads are guaranteed to take the same branch of a conditional guarded by a single-valued expression, and only single-valued conditionals may contain barriers. Example of legal barrier use:
  if (Ti.numProcs() > 1)
    Ti.barrier(); // multiple threads
  else
    ;             // only one thread total
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 35 Concurrency Analysis A graph is generated from the program as follows: a node is added for each code segment between barriers and single-valued conditionals, and edges are added to represent control flow between segments.
  // code segment 1
  if ([single])
    // code segment 2
  else
    // code segment 3
  // code segment 4
  Ti.barrier()
  // code segment 5
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 36 Concurrency Analysis (II) Two accesses can run concurrently if they are in the same node, or one access's node is reachable from the other access's node without hitting a barrier. Algorithm: remove barrier edges, then do a DFS. For the example graph, the concurrent segments are:
  segment 1: concurrent with 1, 2, 3, 4
  segment 2: concurrent with 1, 2, 4
  segment 3: concurrent with 1, 3, 4
  segment 4: concurrent with 1, 2, 3, 4
  segment 5: concurrent with 5
Joint work with Amir Kamil and Jimmy Su
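A small C sketch (not the compiler's actual implementation) of this reachability step, using the example graph from the previous slide with made-up node numbering:

    #include <stdio.h>

    #define MAXN 8

    /* Drop barrier edges, then DFS from each node; two segments may run
       concurrently if they are the same node or one reaches the other. */
    static int nnodes = 6;                 /* nodes 1..5 used; 0 unused */
    static int edge[MAXN][MAXN];           /* control-flow edge u -> v */
    static int is_barrier[MAXN][MAXN];     /* edge crosses a barrier: not traversed */
    static int reach[MAXN][MAXN];          /* v reachable from u without a barrier */

    static void dfs(int root, int u) {
        for (int v = 0; v < nnodes; v++) {
            if (edge[u][v] && !is_barrier[u][v] && !reach[root][v]) {
                reach[root][v] = 1;
                dfs(root, v);
            }
        }
    }

    static int may_run_concurrently(int a, int b) {
        return a == b || reach[a][b] || reach[b][a];
    }

    int main(void) {
        /* Example program: 1 -> {2,3} -> 4 -(barrier)-> 5 */
        edge[1][2] = edge[1][3] = edge[2][4] = edge[3][4] = edge[4][5] = 1;
        is_barrier[4][5] = 1;              /* the Ti.barrier() between 4 and 5 */

        for (int u = 1; u < nnodes; u++) dfs(u, u);

        printf("2 and 3: %d (separated by a single-valued conditional)\n",
               may_run_concurrently(2, 3));                 /* prints 0 */
        printf("2 and 4: %d\n", may_run_concurrently(2, 4)); /* prints 1 */
        printf("1 and 5: %d (separated by a barrier)\n",
               may_run_concurrently(1, 5));                 /* prints 0 */
        return 0;
    }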

Kathy Yelick, 37 Alias Analysis Allocation sites correspond to abstract locations (a-locs). All explicit and implicit program variables have points-to sets. A-locs are typed and have points-to sets for each field of the corresponding type; arrays have a single points-to set for all indices. The analysis is flow- and context-insensitive. An experimental call-site-sensitive version doesn't seem to help much. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 38 Thread-Aware Alias Analysis Two types of abstract locations: local and remote Local locations reside in local thread’s memory Remote locations reside on another thread Exploits SPMD property Results are a summary over all threads Independent of the number of threads at runtime Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 39 Alias Analysis: Allocation Allocation creates a new local abstract location; the result of an allocation must reside in local memory.
  class Foo { Object z; }
  static void bar() {
  L1: Foo a = new Foo();
      Foo b = broadcast a from 0;
      Foo c = a;
  L2: a.z = new Object();
  }
A-locs: 1, 2 (from the allocations at L1 and L2). Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 40 Alias Analysis: Assignment Assignment copies the source's abstract locations into the points-to set of the target.
  class Foo { Object z; }
  static void bar() {
  L1: Foo a = new Foo();
      Foo b = broadcast a from 0;
      Foo c = a;
  L2: a.z = new Object();
  }
Points-to sets: a -> {1}; c -> {1}; 1.z -> {2}. A-locs: 1, 2. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 41 Alias Analysis: Broadcast Broadcast produces both local and remote versions of the source abstract location; the remote a-loc points to the remote analog of what the local a-loc points to.
  class Foo { Object z; }
  static void bar() {
  L1: Foo a = new Foo();
      Foo b = broadcast a from 0;
      Foo c = a;
  L2: a.z = new Object();
  }
Points-to sets: a -> {1}; b -> {1, 1r}; c -> {1}; 1.z -> {2}; 1r.z -> {2r}. A-locs: 1, 2, 1r. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 42 Aliasing Results Two variables A and B may alias if there exists an x in pointsTo(A) such that x is in pointsTo(B). Two variables A and B may alias across threads if there exists an x in pointsTo(A) such that R(x) is in pointsTo(B), where R(x) is the remote counterpart of x. For the example, points-to sets: a -> {1}; b -> {1, 1r}; c -> {1}. Aliases [across threads]:
  a: b, c  [b]
  b: a, c  [a, c]
  c: a, b  [b]
Joint work with Amir Kamil and Jimmy Su
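A tiny C sketch (illustrative only; the bit-vector representation is invented, not the compiler's) of these two queries over points-to sets, checked against the example above:

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ALOCS 8

    typedef unsigned PtsSet;               /* bit i set => may point to a-loc i */

    /* May alias: the two points-to sets share some abstract location x. */
    static bool may_alias(PtsSet a, PtsSet b) {
        return (a & b) != 0;
    }

    /* May alias across threads: some x in pointsTo(a) has a remote
       counterpart R(x) that appears in pointsTo(b). */
    static bool may_alias_across_threads(PtsSet a, PtsSet b, const int remote_of[]) {
        for (int x = 0; x < MAX_ALOCS; x++)
            if ((a >> x) & 1) {
                int rx = remote_of[x];     /* R(x), or -1 if none */
                if (rx >= 0 && ((b >> rx) & 1))
                    return true;
            }
        return false;
    }

    int main(void) {
        /* a-locs: 0 stands for "1", 1 stands for "1r" (its remote analog). */
        int remote_of[MAX_ALOCS] = {1, -1, -1, -1, -1, -1, -1, -1};
        PtsSet a = 1u << 0;                /* a -> {1}     */
        PtsSet b = (1u << 0) | (1u << 1);  /* b -> {1, 1r} */
        printf("a,b alias: %d  across threads: %d\n",
               may_alias(a, b), may_alias_across_threads(a, b, remote_of));
        return 0;
    }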

Kathy Yelick, 43 Benchmarks (lines of code and description; line counts do not include the reachable portion of the 37,000-line Titanium/Java 1.0 libraries):
  pi (56): Monte Carlo integration
  demv (122): dense matrix-vector multiply
  sample-sort (321): parallel sort
  lu-fact (420): dense linear algebra
  3d-fft (614): Fourier transform
  gsrb (1090): computational fluid dynamics kernel
  gsrb* (1099): slightly modified version of gsrb
  spmv (1493): sparse matrix-vector multiply
  gas (8841): hyperbolic solver for gas dynamics
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 44 Analysis Levels We tested analyses of varying levels of precision:
  naïve: all heap accesses
  sharing: all shared accesses
  concur: concurrency analysis + type-based AA
  concur/saa: concurrency analysis + sequential AA
  concur/taa: concurrency analysis + thread-aware AA
  concur/taa/cycle: concurrency analysis + thread-aware AA + cycle detection
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 45 Static (Logical) Fences Percentages are the number of static fences reduced relative to naïve; fewer fences is good. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 46 Dynamic (Executed) Fences Percentages are the number of dynamic fences reduced relative to naïve; fewer fences is good. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 47 Dynamic Fences: gsrb gsrb relies on dynamic locality checks; a slight modification to remove the checks (gsrb*) greatly increases the precision of the analysis. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 48 Two Example Optimizations Consider two optimizations for GAS languages: 1. Overlap bulk memory copies. 2. Communication aggregation for irregular array accesses (i.e., a[b[i]]). Both optimizations reorder accesses, so sequential consistency can inhibit them. Both address network performance, so the potential payoff is high. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 49 Array Copies in Titanium Array copy operations are commonly used: dst.copy(src); The content in the domain intersection of the two arrays is copied from src to dst. Communication (possibly with packing) is required if the arrays reside on different threads, and the processor blocks until the operation is complete. Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 50 Non-Blocking Array Copy Optimization Automatically convert blocking array copies into non-blocking array copies Push sync as far down the instruction stream as possible to allow overlap with computation Interprocedural: syncs can be moved across method boundaries Optimization reorders memory accesses – may be illegal under sequential consistency Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 51 Communication Aggregation on Irregular Array Accesses (Inspector/Executor) A loop containing indirect array accesses is split into phases: the inspector examines the loop and computes the reference targets, the required remote data is gathered in a bulk operation, and the executor uses that data to perform the actual computation. Can be illegal under sequential consistency.
  // original loop
  for (...) {
    a[i] = remote[b[i]];
    // other accesses
  }
  // transformed into inspector/executor form
  schd = inspect(remote, b);
  tmp = get(remote, schd);
  for (...) {
    a[i] = tmp[i];
    // other accesses
  }
Joint work with Amir Kamil and Jimmy Su
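A runnable local-memory C sketch of this transformation (not from the talk): remote[] stands in for an array on another thread, and the names inspect, get, and schedule_t are illustrative, not a real library API.

    #include <stdio.h>
    #include <stdlib.h>

    typedef struct { int *idx; int n; } schedule_t;

    /* Inspector: record which remote elements the loop will touch. */
    static schedule_t inspect(const int *b, int n) {
        schedule_t s = { malloc(n * sizeof(int)), n };
        for (int i = 0; i < n; i++) s.idx[i] = b[i];
        return s;
    }

    /* Bulk gather: fetch all needed elements in one aggregated operation. */
    static void get(double *tmp, const double *remote, schedule_t s) {
        for (int i = 0; i < s.n; i++) tmp[i] = remote[s.idx[i]];
    }

    int main(void) {
        double remote[100], a[8], tmp[8];
        int b[8] = {3, 1, 4, 1, 5, 9, 2, 6};
        for (int i = 0; i < 100; i++) remote[i] = i * 0.5;

        schedule_t s = inspect(b, 8);
        get(tmp, remote, s);                        /* one aggregated transfer */
        for (int i = 0; i < 8; i++) a[i] = tmp[i];  /* executor loop */

        printf("a[0] = %g\n", a[0]);
        free(s.idx);
        return 0;
    }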

Kathy Yelick, 52 Relaxed + SC with 3 Analyses We tested performance using analyses of varying levels of precision:
  relaxed: uses Titanium's relaxed memory model
  naïve: sequential consistency, fences around every heap access
  sharing: sequential consistency, fences around every shared heap access
  concur/taa/cycle: sequential consistency, using our most aggressive analysis
Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 53 Dense Matrix Vector Multiply Non-blocking array copy optimization applied Strongest analysis is necessary: other SC implementations suffer relative to relaxed Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 54 Sparse Matrix Vector Multiply Inspector/executor optimization applied Strongest analysis is again necessary and sufficient Joint work with Amir Kamil and Jimmy Su

Kathy Yelick, 55 Portability of Titanium and UPC Titanium and the Berkeley UPC translator use a similar model: a source-to-source translator (generating ISO C), a runtime layer that implements global pointers, etc., and a common communication layer (GASNet). Both run on most PCs, SMPs, clusters & supercomputers. Supported operating systems: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX (the UPC translator is somewhat less portable: we provide an HTTP-based compile server). Supported CPUs: x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron. GASNet communication: Myrinet GM, Quadrics Elan, Mellanox InfiniBand VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI SHMEM, and (for portability) MPI and UDP. Specific supercomputer platforms: HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000. Underway: Cray XT3, BG/L (both run over MPI). Can be mixed with MPI, C/C++, Fortran. GASNet is also used by gcc/upc. Joint work with Titanium and UPC groups

Kathy Yelick, 56 Portability of PGAS Languages Other compilers also exist for PGAS languages. UPC: Gcc/UPC by Intrepid (runs on GASNet); HP UPC for AlphaServers, clusters, ...; MTU UPC uses the HP compiler on MPI (source to source); Cray UPC. Co-Array Fortran: Cray CAF compiler (X1, X1E); Rice CAF compiler (on ARMCI or GASNet), John Mellor-Crummey, source to source; processors: Pentium, Itanium2, Alpha, MIPS; networks: Myrinet, Quadrics, Altix, Origin, Ethernet; OS: Linux32 RedHat, IRIX, Tru64. NB: source-to-source requires cooperation by back-end compilers.

Kathy Yelick, 57 Summary PGAS languages offer a productivity advantage: an order of magnitude in line counts for grid-based code in Titanium, and decisions about whether to pack are pushed into the runtime for portability (an advantage of a language with a translator vs. a library approach); significant work in the compiler can make programming easier. PGAS languages offer performance advantages: a good match to RDMA support in networks; smaller messages may be faster, making better use of the network and postponing the bisection-bandwidth pain, and they can also prevent the cache thrashing caused by packing; locality advantages may help even on SMPs. Source-to-source translation is the way to ubiquity: it complements highly tuned machine-specific compilers.

58 PGAS Languages (Kathy Yelick) End of Slides

Kathy Yelick, 59 Productizing BUPC Recent Berkeley UPC release: supports the full 1.2 language spec, collectives (tuning ongoing), memory model compliance, and UPC I/O (naïve reference implementation). Large effort in quality assurance and robustness. Test suite: 600+ tests run nightly on 20+ platform configs, testing correct compilation & execution of UPC and GASNet; >30,000 UPC compilations and >20,000 UPC test runs per night; online reporting of results & hookup with the bug database. The test suite infrastructure has been extended to support any UPC compiler: now running nightly with GCC/UPC + UPCR; also supports HP-UPC, Cray UPC, ... Online bug reporting database: over 1100 reports since Jan '03, >90% fixed (excluding enhancement requests).

Kathy Yelick, 60 NAS FT: UPC Non-blocking MFlops The Berkeley UPC compiler supports non-blocking UPC extensions, which produce a 15-45% speedup over the best blocking UPC version. The non-blocking version requires about 30 extra lines of UPC code.

Kathy Yelick, 61 Benchmarking Next few UPC and MPI application benchmarks use the following systems Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2GHz InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2GHz Elan3: Quadrics QsNet1, Alpha 1GHz Elan4: Quadrics QsNet2, Itanium2 1.4GHz

Kathy Yelick, 62 PGAS Languages: Key to High Performance One way to gain acceptance of a new language Make it run faster than anything else Keys to high performance Parallelism: Scaling the number of processors Maximize single node performance Generate friendly code or use tuned libraries (BLAS, FFTW, etc.) Avoid (unnecessary) communication cost Latency, bandwidth, overhead Avoid unnecessary delays due to dependencies Load balance Pipeline algorithmic dependencies

Kathy Yelick, 63 Hardware Latency Network latency is not expected to improve significantly Overlapping communication automatically (Chen) Overlapping manually in the UPC applications (Husbands, Welcome, Bell, Nishtala) Language support for overlap (Bonachea)

Kathy Yelick, 64 Effective Latency Communication wait time from other factors Algorithmic dependencies Use finer-grained parallelism, pipeline tasks (Husbands) Communication bandwidth bottleneck Message time is: Latency + 1/Bandwidth * Size Too much aggregation hurts: wait for bandwidth term De-aggregation optimization: automatic (Iancu); Bisection bandwidth bottlenecks Spread communication throughout the computation (Bell)
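For a sense of scale, plugging the NAS FT message sizes from the earlier slide into this model with assumed (illustrative, not measured) values of Latency = 10 us and Bandwidth = 500 MB/s:
  Pencil (16 KB):    10 us +  ~33 us  =  ~43 us per message
  Slab (64 KB):      10 us + ~131 us  = ~141 us per message
  Exchange (512 KB): 10 us + ~1049 us = ~1.06 ms per message
Small messages pay proportionally more latency per byte, while heavily aggregated messages spend most of their time waiting on the bandwidth term, which is why overlap-friendly mid-size messages can win.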

Kathy Yelick, 65 Fine-grained UPC vs. Bulk-Synch MPI How to waste money on supercomputers Pack all communication into single message (spend memory bandwidth) Save all communication until the last one is ready (add effective latency) Send all at once (spend bisection bandwidth) Or, to use what you have efficiently: Avoid long wait times: send early and often Use “all the wires, all the time” This requires having low overhead!

Kathy Yelick, 66 What You Won’t Hear Much About Compiler/runtime/gasnet bug fixes, performance tuning, testing, … >13,000 messages regarding cvs checkins Nightly regression testing 25 platforms, 3 compilers (head, opt-branch, gcc-upc), Bug reporting 1177 bug reports, 1027 fixed Release scheduled for later this summer Beta is available Process significantly streamlined

Kathy Yelick, 67 Take-Home Messages Titanium offers tremendous gains in productivity: high-level, domain-specific array abstractions; Titanium is being used for real applications, not just toy problems. Titanium and UPC are both highly portable: they run on essentially any machine and are rigorously tested and supported. PGAS languages are faster than two-sided MPI: they are a better match to most HPC networks. The Berkeley UPC and Titanium benchmarks were designed from scratch with the one-sided PGAS model, with a focus on 2 scalability challenges: AMR and sparse LU.

Kathy Yelick, 68 Titanium Background Based on Java, a cleaner C++: classes, automatic memory management, etc.; compiled to C and then machine code, no JVM. Same parallelism model as UPC and CAF: SPMD parallelism; dynamic Java threads are not supported. Optimizing compiler: analyzes global synchronization; optimizes pointers, communication, memory.

Kathy Yelick, 69 Do these Features Yield Productivity? Joint work with Kaushik Datta, Dan Bonachea

Kathy Yelick, 70 GASNet/X1 Performance GASNet/X1 improves small message performance over shmem and MPI Leverages global pointers on X1 Highlights advantage of languages vs. library approach single word put single word get Joint work with Christian Bell, Wei Chen and Dan Bonachea

Kathy Yelick, 71 High Level Optimizations in Titanium Average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors. Irregular communication can be expensive, and the "best" strategy differs by data size/distribution and machine parameters (e.g., packing, sending bounding boxes, or fine-grained communication). Use of runtime optimizations: inspector-executor. Performance on sparse matrix-vector multiply: the best strategy differs even within one machine on a single matrix (~50% better). Speedup is relative to the MPI code (Aztec library). Joint work with Jimmy Su

Kathy Yelick, 72 Source to Source Strategy The source-to-source translation strategy has a tremendous portability advantage and can still perform significant optimizations, but it relies on high-quality back-end compilers and some coaxing in code generation (48x): use of "restrict" pointers in C; understanding multi-D array indexing (an Intel/Itanium issue); support for pragmas like IVDEP; robust vectorizers (X1, SSE, NEC, ...). On machines with integrated shared-memory hardware, we need access to shared memory operations. Joint work with Jimmy Su