Partitioned Global Address Space Languages
Kathy Yelick
Lawrence Berkeley National Laboratory and UC Berkeley

Joint work with:
- The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen
- The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome

The 3 P's of Parallel Computing
- Productivity
  - Global address space supports construction of complex shared data structures
  - High-level constructs (e.g., multidimensional arrays) simplify programming
- Performance
  - PGAS languages are faster than two-sided MPI
  - Some surprising hints on performance tuning
- Portability
  - These languages are nearly ubiquitous

Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local (near) or global (possibly far); the programmer controls layout
- [Figure: a global address space spanning processors p0..pn, showing private and shared variables (x, y, l, g) on each]
- By default: object heaps are shared, program stacks are private
- 3 current languages: UPC, CAF, and Titanium
- Emphasis in this talk on UPC and Titanium (based on Java)
- (A minimal UPC sketch of the model follows below.)

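A minimal UPC sketch of the shared-heap / private-stack split described above. This is illustrative code, not from the talk; the variable names and the neighbor-read pattern are assumptions.

```c
#include <upc.h>
#include <upc_relaxed.h>

int main(void) {
    /* Private: lives on this thread's stack; every thread has its own copy. */
    int x = MYTHREAD;

    /* Shared heap: one collective allocation of THREADS doubles, visible to
       all threads; element i has affinity to thread i (default cyclic layout). */
    shared double *g = (shared double *) upc_all_alloc(THREADS, sizeof(double));

    g[MYTHREAD] = x;                              /* write my own (local) element   */
    upc_barrier;
    double far = g[(MYTHREAD + 1) % THREADS];     /* possibly remote one-sided read */
    (void) far;

    upc_barrier;
    if (MYTHREAD == 0) upc_free(g);               /* one thread frees the shared block */
    return 0;
}
```
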
PGAS Language Overview
- Many common concepts, although specifics differ; consistent with the base language
- Both private and shared data: int x[10]; and shared int y[10];
- Support for distributed data structures: distributed arrays; local and global pointers/references
- One-sided shared-memory communication (see the sketch below)
  - Simple assignment statements: x[i] = y[i]; or t = *p;
  - Bulk operations: memcpy in UPC, array ops in Titanium and CAF
- Synchronization: global barriers, locks, memory fences
- Collective communication, I/O libraries, etc.

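A short UPC sketch of these pieces, assuming a cyclic shared array y and a blocked array z for the bulk copy; the arrays, sizes, and neighbor choice are illustrative, not from the slides.

```c
#include <upc.h>
#include <upc_relaxed.h>

int x[10];                        /* private: each thread has its own copy              */
shared int y[10];                 /* shared: 10 ints spread cyclically over the threads */
shared [10] int z[10 * THREADS];  /* shared, blocked: z[10*t .. 10*t+9] live on thread t */

int main(void) {
    /* Each thread initializes the elements of y it has affinity to. */
    upc_forall (int i = 0; i < 10; i++; &y[i])
        y[i] = i;
    upc_barrier;

    /* One-sided communication as plain assignment: each read of y[i]
       may turn into a remote get. */
    for (int i = 0; i < 10; i++)
        x[i] = y[i];

    /* Bulk operation (the UPC analogue of memcpy): fetch the 10-element
       block that lives on the next thread in a single call. */
    int next = (MYTHREAD + 1) % THREADS;
    upc_memget(x, &z[10 * next], 10 * sizeof(int));

    upc_barrier;                  /* global synchronization */
    return 0;
}
```
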
Example: Titanium Arrays
- Ti arrays are created using Domains and indexed using Points:
  double [3d] gridA = new double [[0,0,0]:[10,10,10]];
- foreach eliminates some loop bound errors:
  foreach (p in gridA.domain()) gridA[p] = gridA[p]*c + gridB[p];
- A rich domain calculus allows slicing, subarrays, transposes, and other operations without data copies
- Array copy operations automatically work on the intersection of the domains:
  data[neighborPos].copy(mydata);
- [Figure: mydata and data[neighborPos], showing "restrict"-ed (non-ghost) cells, ghost cells, and the copied intersection]

Productivity: Line Count Comparison
- Comparison of NAS Parallel Benchmarks
- The UPC version takes modest programming effort relative to C
- Titanium is even more compact, especially for MG, which uses multi-d arrays
- Caveat: Titanium FT has a user-defined Complex type and uses cross-language support to call FFTW for the serial 1D FFTs
- UPC results from Tarek El-Ghazawi et al.; CAF from Chamberlain et al.; Titanium joint with Kaushik Datta and Dan Bonachea

Case Study 1: Block-Structured AMR
- Adaptive Mesh Refinement (AMR) is challenging
  - Irregular data accesses and control flow arising from the boundaries
  - A mixed global/local view is useful
- AMR Titanium work by Tong Wen and Philip Colella
- Titanium AMR benchmarks are available

AMR in Titanium
- C++/Fortran/MPI AMR: the Chombo package from LBNL; bulk-synchronous communication, packing boundary data between procs
- Titanium AMR: entirely in Titanium; finer-grained communication, no explicit pack/unpack code (automated in the runtime system)

Code size in lines:           C++/Fortran/MPI   Titanium
  AMR data structures         35000             2000
  AMR operations              6500              1200
  Elliptic PDE solver         4200*             1500
10X reduction in lines of code!  (* Somewhat more functionality in the PDE part of the Chombo code)

Elliptic PDE solver running time (secs):   C++/Fortran/MPI   Titanium
  Serial                                   57                53
  Parallel (28 procs)                      113               126
Comparable running time.

- Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

Immersed Boundary Simulation in Titanium
- Modeling elastic structures in an incompressible fluid
  - Blood flow in the heart, blood clotting, the inner ear, embryo growth, and many more
- Complicated parallelization
  - Particle/mesh method; "particles" connected into materials
- Joint work with Ed Givelberg and Armando Solar-Lezama

Code size in lines:   Fortran   Titanium
                      8000      4000

The 3 P's of Parallel Computing
- Productivity
  - Global address space supports complex shared structures
  - High-level constructs simplify programming
- Performance
  - PGAS languages are faster than two-sided MPI
    - Better match to most HPC networks
  - Some surprising hints on performance tuning
    - Sending early and often is sometimes best
- Portability
  - These languages are nearly ubiquitous

PGAS Languages: High Performance
- Strategy for acceptance of a new language: make it run faster than anything else
- Keys to high performance:
  - Parallelism: scaling the number of processors
  - Maximize single-node performance
    - Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  - Avoid (unnecessary) communication cost: latency, bandwidth, overhead
    - Berkeley UPC and Titanium use the GASNet communication layer
  - Avoid unnecessary delays due to dependencies
    - Load balance; pipeline algorithmic dependencies

One-Sided vs. Two-Sided Communication
- A one-sided put/get message can be handled directly by a network interface with RDMA support
  - Avoids interrupting the CPU or staging data through the CPU (preposted receives)
- A two-sided message must be matched with a receive to identify the memory address where the data goes
  - Matching can be offloaded to the network interface on networks like Quadrics
  - Requires downloading match tables to the interface (from the host)
- [Figure: a one-sided put carries the destination address with its data payload; a two-sided message carries only a message id, so the host CPU must supply the address]
- (A hedged sketch contrasting the two styles follows below.)

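A minimal sketch contrasting the two styles, shown side by side for comparison; the buffer names, sizes, and neighbor pattern are illustrative assumptions. The UPC one-sided put names the remote destination directly, while the MPI two-sided exchange needs a matching receive to supply the destination address.

```c
#include <upc.h>
#include <upc_relaxed.h>
#include <mpi.h>

#define N 1024

shared [N] double dst[N * THREADS];   /* N doubles with affinity to each thread */
double src[N];

/* One-sided: the initiator names the remote memory; no receive is posted,
   and with RDMA support the target CPU is not involved in the transfer. */
void one_sided_put(void) {
    int neighbor = (MYTHREAD + 1) % THREADS;
    upc_memput(&dst[neighbor * N], src, N * sizeof(double));
    upc_barrier;   /* make sure the put is complete before the target reads dst */
}

/* Two-sided: the sender supplies the data, the receiver supplies the address,
   and the MPI library must match the two (here via tag 0). */
void two_sided_exchange(int rank, int size) {
    double recvbuf[N];
    int right = (rank + 1) % size, left = (rank - 1 + size) % size;
    MPI_Sendrecv(src, N, MPI_DOUBLE, right, 0,
                 recvbuf, N, MPI_DOUBLE, left, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```
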
Performance Advantage of One-Sided Communication: GASNet vs. MPI
- Opteron/InfiniBand (Jacquard at NERSC): GASNet's vapi conduit vs. OSU MPI 0.9.5 MVAPICH
- The half-power point (N1/2) differs by one order of magnitude
- [Plot: bandwidth vs. message size; up is good]
- Joint work with Paul Hargrove and Dan Bonachea

GASNet: Portability and High Performance
- GASNet has better latency across machines
- [Plot: small-message latency; down is good]
- Joint work with the UPC group; GASNet design by Dan Bonachea

GASNet: Portability and High Performance
- GASNet bandwidth is at least as high as (comparable to) MPI for large messages
- [Plot: large-message bandwidth; up is good]
- Joint work with the UPC group; GASNet design by Dan Bonachea

GASNet: Portability and High Performance
- GASNet excels at mid-range message sizes, which is important for overlap
- [Plot: mid-range message bandwidth; up is good]
- Joint work with the UPC group; GASNet design by Dan Bonachea

Case Study 2: NAS FT
- Performance of the Exchange (all-to-all) is critical
  - 1D FFTs in each dimension, 3 phases
  - Transpose after the first 2 for locality
  - Bisection bandwidth-limited: a growing problem as #procs grows
- Three approaches:
  - Exchange: wait for the 2nd-dimension FFTs to finish, send 1 message per processor pair
  - Slab: wait for a chunk of rows destined for 1 proc, send when ready
  - Pencil: send each row as it completes
- Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Overlapping Communication
- Goal: make use of "all the wires, all the time"
  - Schedule communication to avoid network backup
- Trade-off: overhead vs. overlap
  - Exchange has the fewest messages, so the least message overhead
  - Slabs and pencils have more overlap; pencils the most
- Message sizes (example: Class D problem on 256 processors):
    Exchange (all data at once)                      512 KB
    Slabs (contiguous rows that go to 1 processor)   64 KB
    Pencils (single row)                             16 KB
- Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

NAS FT Variants: Performance Summary
- Slab is always best for MPI; the small-message cost is too high
- Pencil is always best for UPC; more overlap
- [Plot: performance of the FT variants across machines; .5 Tflops marked]
- Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Case Study 3: LU Factorization
- Direct methods have complicated dependencies
  - Especially with pivoting (unpredictable communication)
  - Especially for sparse matrices (dependence graph with holes)
- LU factorization in UPC
  - Use the overlap ideas plus multithreading to mask latency
  - Multithreaded: UPC threads + user threads + threaded BLAS
    - Panel factorization (including pivoting), update to a block of U, trailing submatrix updates
- Status
  - Dense LU done: HPL-compliant
  - Sparse version underway
- Joint work with Parry Husbands

UPC HPL Performance
- Comparison to ScaLAPACK on an Altix, 2 x 4 process grid:
  - ScaLAPACK (block size 64): 25.25 GFlop/s (tried several block sizes)
  - UPC LU (block size 256): 33.60 GFlop/s; (block size 64): 26.47 GFlop/s
- n = 32000 on a 4 x 4 process grid:
  - ScaLAPACK (block size 64): 43.34 GFlop/s
  - UPC (block size 200): 70.26 GFlop/s
- MPI HPL numbers from the HPCC database
- Large scaling: 2.2 TFlops on 512p, 4.4 TFlops on 1024p (Thunder)
- Joint work with Parry Husbands

The 3 P's of Parallel Computing
- Productivity
  - Global address space supports complex shared structures
  - High-level constructs simplify programming
- Performance
  - PGAS languages are faster than two-sided MPI
  - Some surprising hints on performance tuning
- Portability
  - These languages are nearly ubiquitous
  - Source-to-source translators are key, combined with a portable communication layer
  - Specialized compilers are useful in some cases

Portability of Titanium and UPC
- Titanium and the Berkeley UPC translator use a similar model
  - Source-to-source translator (generates ISO C)
  - Runtime layer implements global pointers, etc.
  - Common communication layer (GASNet)
- Both run on most PCs, SMPs, clusters, and supercomputers
  - Supported operating systems: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX
    - The UPC translator is somewhat less portable: we provide an http-based compile server
  - Supported CPUs: x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron
  - GASNet communication: Myrinet GM, Quadrics Elan, Mellanox InfiniBand VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI SHMEM, and (for portability) MPI and UDP
  - Specific supercomputer platforms: HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000
  - Underway: Cray XT3, BG/L (both run over MPI)
- Can be mixed with MPI, C/C++, Fortran
- Also used by gcc/upc
- Joint work with the Titanium and UPC groups

Portability of PGAS Languages
- Other compilers also exist for PGAS languages
- UPC
  - Gcc/UPC by Intrepid: runs on GASNet
  - HP UPC for AlphaServers, clusters, …
  - MTU UPC uses the HP compiler on MPI (source to source)
  - Cray UPC
- Co-Array Fortran
  - Cray CAF compiler: X1, X1E
  - Rice CAF compiler (on ARMCI or GASNet), John Mellor-Crummey
    - Source to source
    - Processors: Pentium, Itanium2, Alpha, MIPS
    - Networks: Myrinet, Quadrics, Altix, Origin, Ethernet
    - OS: Linux32 RedHat, IRIX, Tru64
- NB: source-to-source requires cooperation from the backend compilers

Summary
- PGAS languages offer performance advantages
  - Good match to RDMA support in networks
  - Smaller messages may be faster: they make better use of the network
    - Postpone the bisection bandwidth pain
    - Can also prevent cache thrashing from packing
- PGAS languages offer productivity advantages
  - Order-of-magnitude reduction in line counts for grid-based code in Titanium
  - Push decisions about packing (or not) into the runtime for portability (an advantage of the language-plus-translator approach over a library approach)
- Source-to-source translation
  - The way to ubiquity
  - Complements highly tuned machine-specific compilers

End of Slides

Productizing BUPC
- Recent Berkeley UPC release
  - Supports the full 1.2 language spec
  - Supports collectives (tuning ongoing); memory model compliance
  - Supports UPC I/O (naive reference implementation)
- Large effort in quality assurance and robustness
  - Test suite: 600+ tests run nightly on 20+ platform configs
    - Tests correct compilation and execution of UPC and GASNet
    - >30,000 UPC compilations and >20,000 UPC test runs per night
    - Online reporting of results and hookup with the bug database
  - Test suite infrastructure extended to support any UPC compiler
    - Now running nightly with GCC/UPC + UPCR
    - Also supports HP UPC, Cray UPC, …
  - Online bug reporting database
    - Over 1100 reports since Jan 2003
    - >90% fixed (excluding enhancement requests)

Case Study in NAS CG
- The problems in NAS CG are different from FT
  - Reductions, including vector reductions
  - Highlights the need for processor-team reductions
  - Uses the one-sided, low-latency model
- Performance: comparable (slightly better) performance in UPC than in MPI/Fortran
- Current focus is on a more realistic CG
  - Real matrices
  - 1D layout
  - Optimize across iterations (BeBOP project)

NAS FT: UPC Non-blocking MFlops
- The Berkeley UPC compiler supports non-blocking UPC extensions
- These produce a 15-45% speedup over the best blocking UPC version
- The non-blocking version requires about 30 extra lines of UPC code
- (A hedged sketch of the non-blocking style follows below.)

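A sketch of the pencil-style overlap using Berkeley UPC's non-blocking memory-copy extensions as I recall them; the header name, the exact extension spellings, the layout of grid, and compute_row_fft are all assumptions, so treat this as a sketch of the idea rather than the benchmark's actual code.

```c
#include <upc.h>
#include <upc_relaxed.h>
#include <bupc_extensions.h>   /* assumed header for the Berkeley UPC async extensions */

#define NROWS  64
#define ROWLEN 512

/* Illustrative distributed layout: one ROWLEN-sized block per pencil. */
shared [ROWLEN] double grid[NROWS * ROWLEN * THREADS];
double myrows[NROWS][ROWLEN];

void compute_row_fft(double *row);   /* hypothetical 1D FFT on one pencil */

void send_pencils(void) {
    bupc_handle_t h[NROWS];

    /* Pencil style: start a non-blocking put for each row as soon as it is
       computed, so communication overlaps the FFTs of the remaining rows. */
    for (int r = 0; r < NROWS; r++) {
        compute_row_fft(myrows[r]);
        h[r] = bupc_memput_async(&grid[r * ROWLEN],        /* possibly remote block */
                                 myrows[r],
                                 ROWLEN * sizeof(double));
    }

    /* Drain all outstanding puts before the next phase. */
    for (int r = 0; r < NROWS; r++)
        bupc_waitsync(h[r]);

    upc_barrier;
}
```
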
Benchmarking
- The next few UPC and MPI application benchmarks use the following systems:
  - Myrinet: Myrinet 2000 PCI64B, P4 Xeon 2.2 GHz
  - InfiniBand: Mellanox Cougar 4X HCA, Opteron 2.2 GHz
  - Elan3: Quadrics QsNet1, Alpha 1 GHz
  - Elan4: Quadrics QsNet2, Itanium2 1.4 GHz

PGAS Languages: Key to High Performance
- One way to gain acceptance of a new language: make it run faster than anything else
- Keys to high performance:
  - Parallelism: scaling the number of processors
  - Maximize single-node performance
    - Generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  - Avoid (unnecessary) communication cost: latency, bandwidth, overhead
  - Avoid unnecessary delays due to dependencies
    - Load balance
    - Pipeline algorithmic dependencies

Hardware Latency
- Network latency is not expected to improve significantly
- Overlapping communication automatically (Chen)
- Overlapping manually in the UPC applications (Husbands, Welcome, Bell, Nishtala)
- Language support for overlap (Bonachea)

Effective Latency
- Communication wait time comes from other factors as well:
- Algorithmic dependencies
  - Use finer-grained parallelism, pipeline tasks (Husbands)
- Communication bandwidth bottleneck
  - Message time is: Latency + (1/Bandwidth) × Size
  - Too much aggregation hurts: you end up waiting on the bandwidth term
  - De-aggregation optimization: automatic (Iancu)
- Bisection bandwidth bottlenecks
  - Spread communication throughout the computation (Bell)
- (A small worked example of the model follows below.)

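A small worked example of this cost model; the latency and bandwidth numbers below are illustrative round figures, not measurements from the talk.

```latex
% Message time model: T(S) = L + S/B
% Illustrative numbers: L = 10 us, B = 1 GB/s (about 1000 bytes/us).
\[
  T(S) = L + \frac{S}{B}
\]
% One aggregated 512 KB message (the FT "exchange" size on the earlier slide):
\[
  T_{\mathrm{exchange}} = 10 + \tfrac{524288}{1000} \approx 534\ \mu\mathrm{s}
\]
% Thirty-two 16 KB pencils instead:
\[
  T_{\mathrm{pencil}} = 10 + \tfrac{16384}{1000} \approx 26\ \mu\mathrm{s}\ \text{per message}
\]
% The pencils pay more total overhead, but each one can be overlapped with the
% computation of the next row, so only the non-overlapped fraction is exposed.
```
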
Fine-Grained UPC vs. Bulk-Synchronous MPI
- How to waste money on supercomputers:
  - Pack all communication into a single message (spend memory bandwidth)
  - Save all communication until the last piece is ready (add effective latency)
  - Send it all at once (spend bisection bandwidth)
- Or, to use what you have efficiently:
  - Avoid long wait times: send early and often
  - Use "all the wires, all the time"
  - This requires having low overhead!

What You Won't Hear Much About
- Compiler/runtime/GASNet bug fixes, performance tuning, testing, …
  - >13,000 e-mail messages regarding CVS check-ins
- Nightly regression testing
  - 25 platforms, 3 compilers (head, opt-branch, gcc-upc)
- Bug reporting
  - 1177 bug reports, 1027 fixed
- Release scheduled for later this summer
  - Beta is available
  - Process significantly streamlined

Take-Home Messages
- Titanium offers tremendous gains in productivity
  - High-level, domain-specific array abstractions
  - Titanium is being used for real applications, not just toy problems
- Titanium and UPC are both highly portable
  - Run on essentially any machine
  - Rigorously tested and supported
- PGAS languages are faster than two-sided MPI
  - Better match to most HPC networks
- Berkeley UPC and Titanium benchmarks
  - Designed from scratch with the one-sided PGAS model
  - Focus on 2 scalability challenges: AMR and sparse LU

Titanium Background
- Based on Java, a cleaner C++
  - Classes, automatic memory management, etc.
  - Compiled to C and then to machine code; no JVM
- Same parallelism model as UPC and CAF
  - SPMD parallelism
  - Dynamic Java threads are not supported
- Optimizing compiler
  - Analyzes global synchronization
  - Optimizes pointers, communication, memory

Do These Features Yield Productivity?
- Joint work with Kaushik Datta, Dan Bonachea

GASNet/X1 Performance
- GASNet on the X1 improves small-message performance over SHMEM and MPI
- Leverages global pointers on the X1
- Highlights the advantage of the language vs. library approach
- [Plots: single-word put and single-word get performance]
- Joint work with Christian Bell, Wei Chen, and Dan Bonachea

High-Level Optimizations in Titanium
- [Chart: average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors]
- Irregular communication can be expensive
  - The "best" strategy differs by data size/distribution and machine parameters
  - E.g., packing, sending bounding boxes, or fine-grained communication
- Use of runtime optimizations: inspector-executor (a generic sketch follows below)
- Performance on sparse matrix-vector multiply
  - Result: the best strategy differs even within one machine on a single matrix (~50% better)
  - Speedup is relative to the MPI code (Aztec library)
- Joint work with Jimmy Su

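A generic inspector-executor sketch in plain C, not Titanium's actual runtime interface; every name here is illustrative. The inspector analyzes the irregular access pattern once and records a gather schedule; the executor reuses that schedule on every sparse matvec iteration.

```c
#include <stdlib.h>

/* Hypothetical schedule built by the inspector: which vector entries a
   sparse matvec will read, in the order the nonzeros need them. */
typedef struct {
    int     nfetch;    /* number of gathered elements               */
    int    *indices;   /* source indices to gather                  */
    double *buffer;    /* staging buffer, reused on every iteration */
} schedule_t;

/* Inspector: run once per sparsity pattern. In a PGAS setting this is where
   you would decide to pack, send bounding boxes, or stay fine-grained. */
schedule_t *inspect(const int *col_idx, int nnz) {
    schedule_t *s = malloc(sizeof *s);
    s->nfetch  = nnz;
    s->indices = malloc(nnz * sizeof *s->indices);
    s->buffer  = malloc(nnz * sizeof *s->buffer);
    for (int k = 0; k < nnz; k++)
        s->indices[k] = col_idx[k];
    return s;
}

/* Executor: run every iteration. Gather x through the precomputed schedule
   (a bulk or remote gather in the parallel case), then multiply locally. */
void execute(const schedule_t *s, const double *x, const double *vals,
             const int *row_ptr, int nrows, double *y) {
    for (int k = 0; k < s->nfetch; k++)
        s->buffer[k] = x[s->indices[k]];
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            sum += vals[k] * s->buffer[k];
        y[i] = sum;
    }
}
```
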
Source-to-Source Strategy
- Source-to-source translation strategy
  - Tremendous portability advantage
  - Can still perform significant optimizations
- Relies on high-quality back-end compilers and some coaxing in code generation
  - Use of "restrict" pointers in C (a small example follows below)
  - Understanding multi-D array indexing (an Intel/Itanium issue)
  - Support for pragmas like IVDEP
  - Robust vectorizers (X1, SSE, NEC, …)
  - [Chart annotation: 48x]
- On machines with integrated shared-memory hardware, the translator needs access to shared-memory operations
- Joint work with Jimmy Su

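A tiny example of the kind of C a translator wants the back-end compiler to handle well; this is illustrative code, not actual translator output. restrict promises the arrays do not alias, and an IVDEP-style pragma (spelling varies by compiler) asserts there are no loop-carried dependences so the loop can be vectorized.

```c
/* Illustrative inner loop resembling code a PGAS translator might emit. */
void scaled_add(double *restrict a,
                const double *restrict b,
                const double *restrict c,
                double alpha, int n)
{
    /* e.g. `#pragma ivdep` (Intel) or `#pragma GCC ivdep` (gcc) */
#pragma ivdep
    for (int i = 0; i < n; i++)
        a[i] = b[i] + alpha * c[i];
}
```
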