Slide 1: Partitioned Global Address Space Languages
Kathy Yelick, Lawrence Berkeley National Laboratory and UC Berkeley
Joint work with
- The Titanium Group: S. Graham, P. Hilfinger, P. Colella, D. Bonachea, K. Datta, E. Givelberg, A. Kamil, N. Mai, A. Solar, J. Su, T. Wen
- The Berkeley UPC Group: C. Bell, D. Bonachea, W. Chen, J. Duell, P. Hargrove, P. Husbands, C. Iancu, R. Nishtala, M. Welcome

Slide 2: The 3 P's of Parallel Computing
- Productivity
  - Global address space supports construction of complex shared data structures
  - High-level constructs (e.g., multidimensional arrays) simplify programming
- Performance
  - PGAS languages are faster than two-sided MPI
  - Some surprising hints on performance tuning
- Portability
  - These languages are nearly ubiquitous

Slide 3: Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by another
- Partitioned: data is designated as local (near) or global (possibly far); programmer controls layout
- [Figure: global address space across threads p0..pn, with shared (global) and private (local) variables and pointers]
- By default: object heaps are shared, program stacks are private
- 3 current languages: UPC, CAF, and Titanium
- Emphasis in this talk on UPC and Titanium (based on Java)
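
To make the model concrete, here is a minimal UPC sketch (not from the talk; names and sizes are invented): ordinary variables are private to each thread, while elements of a shared array are spread across threads and any thread can read or write them directly.

    #include <upc.h>
    #include <stdio.h>

    shared int counters[THREADS];   /* one element with affinity to each thread */
    int mine;                       /* private: every thread has its own copy   */

    int main(void) {
        mine = MYTHREAD;                 /* private write, invisible to others          */
        counters[MYTHREAD] = MYTHREAD;   /* write the shared element I have affinity to */
        upc_barrier;
        if (MYTHREAD == 0)               /* directly read data another thread wrote */
            printf("last counter = %d\n", counters[THREADS - 1]);
        return 0;
    }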

Slide 4: PGAS Language Overview
- Many common concepts, although specifics differ; consistent with the base language
- Both private and shared data: int x[10]; and shared int y[10];
- Support for distributed data structures: distributed arrays; local and global pointers/references
- One-sided shared-memory communication
  - Simple assignment statements: x[i] = y[i]; or t = *p;
  - Bulk operations: memcpy in UPC, array ops in Titanium and CAF
- Synchronization: global barriers, locks, memory fences
- Collective communication, I/O libraries, etc.
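
The following UPC sketch (illustrative only; the array layout and names are assumptions, not code from the talk) shows the pieces listed above in one place: private and shared data, a one-sided bulk put, a fine-grained remote read, a lock, and a barrier.

    #include <upc.h>

    #define N 8   /* elements per thread (illustrative) */

    int x[N];                      /* private: one copy per thread          */
    shared [N] int y[N*THREADS];   /* shared: a block of N ints per thread  */
    shared int total;              /* shared scalar (affinity: thread 0)    */
    upc_lock_t *lock;

    int main(void) {
        lock = upc_all_lock_alloc();          /* collective lock allocation */

        for (int i = 0; i < N; i++)
            x[i] = MYTHREAD;                  /* purely private work */

        /* one-sided bulk put: copy my private block into my right neighbor's
           block of the shared array -- no receive is posted anywhere         */
        int neighbor = (MYTHREAD + 1) % THREADS;
        upc_memput(&y[neighbor * N], x, N * sizeof(int));

        upc_barrier;                          /* global synchronization */

        /* one-sided fine-grained read of remote data via a plain expression */
        int t = y[neighbor * N];

        /* locks protect read-modify-write of shared data */
        upc_lock(lock);
        total += t;
        upc_unlock(lock);

        upc_barrier;
        return 0;
    }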

Slide 5: Example: Titanium Arrays
- Ti arrays are created using Domains and indexed using Points:
  double [3d] gridA = new double [[0,0,0]:[10,10,10]];
- Eliminates some loop bound errors using foreach:
  foreach (p in gridA.domain()) gridA[p] = gridA[p]*c + gridB[p];
- A rich domain calculus allows slicing, subarrays, transposes, and other operations without data copies
- Array copy operations automatically work on the intersection:
  data[neighborPos].copy(mydata);
- [Figure: mydata and data[neighborPos] grids, showing "restrict"-ed (non-ghost) cells, ghost cells, and the intersection (copied area)]

Slide 6: Productivity: Line Count Comparison
- Comparison of NAS Parallel Benchmarks
  - UPC version has modest programming effort relative to C
  - Titanium even more compact, especially for MG, which uses multi-d arrays
- Caveat: Titanium FT has a user-defined Complex type and uses cross-language support to call FFTW for serial 1D FFTs
- UPC results from Tarek El-Ghazawi et al.; CAF from Chamberlain et al.; Titanium joint with Kaushik Datta and Dan Bonachea

Slide 7: Case Study 1: Block-Structured AMR
- Adaptive Mesh Refinement (AMR) is challenging
  - Irregular data accesses and control from boundaries
  - Mixed global/local view is useful
- AMR Titanium work by Tong Wen and Philip Colella
- Titanium AMR benchmarks available

Slide 8: AMR in Titanium
- C++/Fortran/MPI AMR: Chombo package from LBNL
  - Bulk-synchronous communication: pack boundary data between procs
- Titanium AMR: entirely in Titanium
  - Finer-grained communication; no explicit pack/unpack code (automated in the runtime system)
- Code size in lines:          C++/Fortran/MPI   Titanium
    AMR data structures
    AMR operations
    Elliptic PDE solver        4200*
  X reduction in lines of code!
  * Somewhat more functionality in PDE part of Chombo code
- Elliptic PDE solver running time (secs):   C++/Fortran/MPI   Titanium
    Serial                                   57                53
    Parallel (28 procs)
  Comparable running time
- Work by Tong Wen and Philip Colella; communication optimizations joint with Jimmy Su

Slide 9: Immersed Boundary Simulation in Titanium
- Modeling elastic structures in an incompressible fluid
  - Blood flow in the heart, blood clotting, inner ear, embryo growth, and many more
- Complicated parallelization
  - Particle/mesh method; "particles" connected into materials
- Joint work with Ed Givelberg and Armando Solar-Lezama
- [Figure: code size in lines, Fortran vs. Titanium]

Slide 10: The 3 P's of Parallel Computing
- Productivity
  - Global address space supports complex shared structures
  - High-level constructs simplify programming
- Performance
  - PGAS languages are faster than two-sided MPI
    - Better match to most HPC networks
  - Some surprising hints on performance tuning
    - Send early and often is sometimes best
- Portability
  - These languages are nearly ubiquitous

Slide 11: PGAS Languages: High Performance
- Strategy for acceptance of a new language: make it run faster than anything else
- Keys to high performance
  - Parallelism: scaling the number of processors
  - Maximize single-node performance: generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  - Avoid (unnecessary) communication cost: latency, bandwidth, overhead
    - Berkeley UPC and Titanium use the GASNet communication layer
  - Avoid unnecessary delays due to dependencies: load balance; pipeline algorithmic dependencies

Slide 12: One-Sided vs. Two-Sided Communication
- A one-sided put/get message can be handled directly by a network interface with RDMA support
  - Avoids interrupting the CPU or storing data from the CPU (preposts)
- A two-sided message needs to be matched with a receive to identify the memory address where the data goes
  - Offloaded to the network interface in networks like Quadrics
  - Need to download match tables to the interface (from the host)
- [Figure: a one-sided put message carries the destination address with its data payload, while a two-sided message carries only a message id; matching involves the network interface, host CPU, and memory]
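
As a concrete illustration, a minimal UPC fragment (invented names and sizes): the sender already knows the destination address in the shared space, so a NIC with RDMA support can deposit the payload without the target thread posting any receive; a two-sided version of the same transfer would additionally need a matching receive on the target to supply that address.

    #include <upc.h>

    #define N 1024   /* doubles per block (illustrative) */

    shared [N] double buf[N*THREADS];   /* one block of N doubles per thread */
    double local[N];                    /* private source buffer */

    int main(void) {
        for (int i = 0; i < N; i++)
            local[i] = MYTHREAD;

        /* One-sided put: thread 0 writes straight into thread 1's block.  The
           initiator supplies the destination address (&buf[1*N]), so an RDMA-
           capable NIC can deposit the data; thread 1 posts no receive at all. */
        if (MYTHREAD == 0 && THREADS > 1)
            upc_memput(&buf[1 * N], local, N * sizeof(double));

        upc_barrier;   /* ensure the data has landed before anyone reads it */
        return 0;
    }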

Slide 13: Performance Advantage of One-Sided Communication: GASNet vs. MPI
- Opteron/InfiniBand (Jacquard at NERSC): GASNet's vapi-conduit and OSU MPI MVAPICH
- Half-power point (N1/2) differs by one order of magnitude
- [Figure: GASNet vs. MPI performance; up is good]
- Joint work with Paul Hargrove and Dan Bonachea

Slide 14: GASNet: Portability and High Performance
- [Figure: latency comparison; down is good]
- GASNet is better for latency across machines
- Joint work with the UPC group; GASNet design by Dan Bonachea

Slide 15: GASNet: Portability and High Performance
- [Figure: large-message performance; up is good]
- GASNet is at least as high as (comparable to) MPI for large messages
- Joint work with the UPC group; GASNet design by Dan Bonachea

Slide 16: GASNet: Portability and High Performance
- [Figure: mid-range message sizes; up is good]
- GASNet excels at mid-range sizes, which is important for overlap
- Joint work with the UPC group; GASNet design by Dan Bonachea

Slide 17: Case Study 2: NAS FT
- Performance of the Exchange (Alltoall) is critical
  - 1D FFTs in each dimension, 3 phases
  - Transpose after the first 2 for locality
  - Bisection bandwidth-limited; a problem as #procs grows
- Three approaches:
  - Exchange: wait for 2nd-dimension FFTs to finish, send 1 message per processor pair
  - Slab: wait for a chunk of rows destined for 1 proc, send when ready
  - Pencil: send each row as it completes
- Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Slide 18: Overlapping Communication
- Goal: make use of "all the wires, all the time"
  - Schedule communication to avoid network backup
- Trade-off: overhead vs. overlap
  - Exchange has the fewest messages, so less message overhead
  - Slabs and pencils have more overlap; pencils the most
- Example message sizes: Class D problem on 256 processors
    Exchange (all data at once):                      512 Kbytes
    Slabs (contiguous rows that go to 1 processor):    64 Kbytes
    Pencils (single row):                              16 Kbytes
- Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea
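
A rough UPC sketch of the pencil strategy (this is not the NAS FT code; the layout, sizes, and index mapping are simplified assumptions, and it assumes NROWS is a multiple of THREADS): each row is shipped with a one-sided put as soon as its 1D FFT finishes, instead of being aggregated into slabs or a single exchange.

    #include <upc.h>

    #define NROWS  32     /* rows in my slab (illustrative; assume NROWS % THREADS == 0) */
    #define ROWLEN 256    /* doubles per pencil (illustrative) */

    /* per-thread landing area for the transpose: NROWS pencils of ROWLEN doubles */
    shared [NROWS*ROWLEN] double recvbuf[THREADS][NROWS*ROWLEN];

    double row[ROWLEN];   /* private workspace for one pencil */

    static void fft_row(double *r, int i) {
        /* stand-in for the serial 1D FFT of row i (real code calls FFTW) */
        for (int j = 0; j < ROWLEN; j++) r[j] = i + j;
    }

    int main(void) {
        int per_dest = NROWS / THREADS;     /* pencils I send to each destination */

        for (int i = 0; i < NROWS; i++) {
            fft_row(row, i);

            /* Pencil strategy: ship each row as soon as it is ready rather than
               aggregating a slab or the whole exchange.  The slot keeps pencils
               from different source threads from colliding at the destination. */
            int    dest = i % THREADS;
            size_t slot = (size_t)MYTHREAD * per_dest + (size_t)(i / THREADS);
            upc_memput(&recvbuf[dest][slot * ROWLEN], row, ROWLEN * sizeof(double));
            /* a non-blocking put here would let the next row's FFT overlap this
               transfer -- see the non-blocking sketch later in the deck          */
        }
        upc_barrier;   /* all pencils delivered before the next FFT phase */
        return 0;
    }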

Slide 19: NAS FT Variants Performance Summary
- [Figure: performance of the NAS FT variants; chart annotation: .5 Tflops]
- Slab is always best for MPI; the small-message cost is too high
- Pencil is always best for UPC; more overlap
- Joint work with Chris Bell, Rajesh Nishtala, Dan Bonachea

Slide 20: Case Study 3: LU Factorization
- Direct methods have complicated dependencies
  - Especially with pivoting (unpredictable communication)
  - Especially for sparse matrices (dependence graph with holes)
- LU factorization in UPC
  - Use overlap ideas and multithreading to mask latency
  - Multithreaded: UPC threads + user threads + threaded BLAS
  - Task types: panel factorization (including pivoting), update to a block of U, trailing submatrix updates
- Status: dense LU done (HPL-compliant); sparse version underway
- Joint work with Parry Husbands

Slide 21: UPC HPL Performance
- Comparison to ScaLAPACK on an Altix, 2 x 4 process grid
  - ScaLAPACK (block size 64): GFlop/s (tried several block sizes)
  - UPC LU (block size 256): GFlop/s; (block size 64): GFlop/s
- On a 4 x 4 process grid, n =
  - ScaLAPACK: GFlop/s (block size = 64)
  - UPC: GFlop/s (block size = 200)
- MPI HPL numbers from the HPCC database
- Large scaling: 2.2 TFlops on 512p, 4.4 TFlops on 1024p (Thunder)
- Joint work with Parry Husbands

Slide 22: The 3 P's of Parallel Computing
- Productivity
  - Global address space supports complex shared structures
  - High-level constructs simplify programming
- Performance
  - PGAS languages are faster than two-sided MPI
  - Some surprising hints on performance tuning
- Portability
  - These languages are nearly ubiquitous
  - Source-to-source translators are key, combined with a portable communication layer
  - Specialized compilers are useful in some cases

Slide 23: Portability of Titanium and UPC
- Titanium and the Berkeley UPC translator use a similar model
  - Source-to-source translator (generates ISO C)
  - Runtime layer implements global pointers, etc.
  - Common communication layer (GASNet)
- Both run on most PCs, SMPs, clusters, and supercomputers
  - Supported operating systems: Linux, FreeBSD, Tru64, AIX, IRIX, HPUX, Solaris, Cygwin, MacOSX, Unicos, SuperUX
    - UPC translator somewhat less portable: we provide an http-based compile server
  - Supported CPUs: x86, Itanium, Alpha, Sparc, PowerPC, PA-RISC, Opteron
  - GASNet communication: Myrinet GM, Quadrics Elan, Mellanox InfiniBand VAPI, IBM LAPI, Cray X1, SGI Altix, Cray/SGI SHMEM, and (for portability) MPI and UDP
  - Specific supercomputer platforms: HP AlphaServer, Cray X1, IBM SP, NEC SX-6, Cluster X (Big Mac), SGI Altix 3000
  - Underway: Cray XT3, BG/L (both run over MPI)
- Can be mixed with MPI, C/C++, Fortran
- Also used by gcc/upc
- Joint work with the Titanium and UPC groups

Slide 24: Portability of PGAS Languages
- Other compilers also exist for PGAS languages
- UPC
  - Gcc/UPC by Intrepid: runs on GASNet
  - HP UPC for AlphaServers, clusters, ...
  - MTU UPC uses the HP compiler on MPI (source to source)
  - Cray UPC
- Co-Array Fortran
  - Cray CAF compiler: X1, X1E
  - Rice CAF compiler (on ARMCI or GASNet), John Mellor-Crummey
    - Source to source
    - Processors: Pentium, Itanium2, Alpha, MIPS
    - Networks: Myrinet, Quadrics, Altix, Origin, Ethernet
    - OS: Linux32 RedHat, IRIX, Tru64
- NB: source-to-source requires cooperation by backend compilers

Slide 25: Summary
- PGAS languages offer performance advantages
  - Good match to RDMA support in networks
  - Smaller messages may be faster: they make better use of the network (postpone bisection bandwidth pain) and can also prevent the cache thrashing caused by packing
- PGAS languages offer productivity advantages
  - Order of magnitude in line counts for grid-based code in Titanium
  - Push decisions about packing (or not) into the runtime for portability (advantage of a language with a translator vs. a library approach)
- Source-to-source translation
  - The way to ubiquity
  - Complements highly tuned machine-specific compilers

Slide 26: End of Slides

Slide 27: Productizing BUPC
- Recent Berkeley UPC release
  - Supports the full 1.2 language spec
  - Supports collectives (tuning ongoing); memory model compliance
  - Supports UPC I/O (naïve reference implementation)
- Large effort in quality assurance and robustness
  - Test suite: 600+ tests run nightly on 20+ platform configs
    - Tests correct compilation and execution of UPC and GASNet
    - >30,000 UPC compilations and >20,000 UPC test runs per night
    - Online reporting of results and hookup with the bug database
  - Test suite infrastructure extended to support any UPC compiler
    - Now running nightly with GCC/UPC + UPCR; also supports HP-UPC, Cray UPC, ...
  - Online bug reporting database
    - >1100 reports since Jan 03; >90% fixed (excluding enhancement requests)

Slide 28: Case Study in NAS CG
- Problems in NAS CG are different from FT
  - Reductions, including vector reductions
  - Highlights the need for processor-team reductions using a one-sided, low-latency model
- Performance: comparable (slightly better) performance in UPC than in MPI/Fortran
- Current focus on a more realistic CG
  - Real matrices, 1D layout
  - Optimize across iterations (BeBOP project)
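
For reference, a team-wide sum in UPC normally goes through the collectives library; the sketch below (illustrative names, no error handling) reduces one double per thread with upc_all_reduceD, using the signature as I recall it from the UPC collectives specification.

    #include <upc.h>
    #include <upc_collective.h>
    #include <stdio.h>

    shared double partial[THREADS];   /* one partial result per thread               */
    shared double total;              /* reduction destination (affinity: thread 0)  */

    int main(void) {
        partial[MYTHREAD] = 1.0 + MYTHREAD;   /* locally computed contribution */

        /* team-wide sum; the sync-mode flags ask for barriers on entry and exit */
        upc_all_reduceD(&total, partial, UPC_ADD,
                        THREADS /* nelems */, 1 /* block size of src */,
                        NULL    /* user function: unused for UPC_ADD */,
                        UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC);

        if (MYTHREAD == 0)
            printf("sum = %g\n", total);
        return 0;
    }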

Slide 29: NAS FT: UPC Non-blocking MFlops
- The Berkeley UPC compiler supports non-blocking UPC extensions
- These produce a 15-45% speedup over the best blocking UPC version
- The non-blocking version requires about 30 extra lines of UPC code
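
For flavor, here is a hedged sketch of what that overlap looks like. It uses the explicit-handle non-blocking puts later standardized in UPC 1.3's upc_nb.h (upc_memput_nb / upc_sync), which have roughly the same shape as the Berkeley bupc_* extensions used in this work; the names, sizes, and the single reused destination slot are simplifications of mine, not the NAS FT code.

    #include <upc.h>
    #include <upc_nb.h>    /* explicit-handle non-blocking transfers (UPC 1.3) */

    #define ROWLEN 256     /* doubles per pencil (illustrative) */
    #define NROWS  16      /* pencils to send (illustrative)    */

    shared [ROWLEN] double dest[THREADS][ROWLEN];   /* one landing slot per thread */
    double rowA[ROWLEN], rowB[ROWLEN];              /* double buffer */

    static void fft_row(double *r, int i) {
        for (int j = 0; j < ROWLEN; j++) r[j] = i + j;   /* stand-in for the FFT */
    }

    int main(void) {
        double *cur = rowA, *prev = rowB;
        upc_handle_t h;
        int neighbor = (MYTHREAD + 1) % THREADS;

        for (int i = 0; i < NROWS; i++) {
            fft_row(cur, i);          /* overlaps the put issued last iteration   */
            if (i > 0)
                upc_sync(h);          /* previous transfer done: prev is reusable */
            /* non-blocking one-sided put; a real FT would target a distinct
               slot per pencil instead of overwriting the same one            */
            h = upc_memput_nb(&dest[neighbor][0], cur, ROWLEN * sizeof(double));
            double *tmp = prev; prev = cur; cur = tmp;   /* protect in-flight buffer */
        }
        upc_sync(h);      /* complete the final transfer */
        upc_barrier;
        return 0;
    }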

Slide 30: Benchmarking
- The next few UPC and MPI application benchmarks use the following systems
  - Myrinet: Myrinet 2000 PCI64B, P4-Xeon 2.2 GHz
  - InfiniBand: IB Mellanox Cougar 4X HCA, Opteron 2.2 GHz
  - Elan3: Quadrics QsNet1, Alpha 1 GHz
  - Elan4: Quadrics QsNet2, Itanium2 1.4 GHz

Slide 31: PGAS Languages: Key to High Performance
- One way to gain acceptance of a new language: make it run faster than anything else
- Keys to high performance
  - Parallelism: scaling the number of processors
  - Maximize single-node performance: generate friendly code or use tuned libraries (BLAS, FFTW, etc.)
  - Avoid (unnecessary) communication cost: latency, bandwidth, overhead
  - Avoid unnecessary delays due to dependencies: load balance; pipeline algorithmic dependencies

Slide 32: Hardware Latency
- Network latency is not expected to improve significantly, so hide it:
  - Overlapping communication automatically (Chen)
  - Overlapping manually in the UPC applications (Husbands, Welcome, Bell, Nishtala)
  - Language support for overlap (Bonachea)

Slide 33: Effective Latency
- Communication wait time also comes from other factors
  - Algorithmic dependencies: use finer-grained parallelism, pipeline tasks (Husbands)
  - Communication bandwidth bottleneck: message time is Latency + Size/Bandwidth
    - Too much aggregation hurts: you wait on the bandwidth term
    - De-aggregation optimization: automatic (Iancu)
  - Bisection bandwidth bottlenecks: spread communication throughout the computation (Bell)
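
A worked example of the message-time model, with made-up but representative numbers (latency L = 10 microseconds, bandwidth B = 1 GB/s; these values are assumptions for illustration, not measurements from the talk):

    T(S) = L + S / B
    T(16 KB pencil)            ~= 10 us + 16 us  ~= 26 us
    T(512 KB exchange message) ~= 10 us + 524 us ~= 534 us

Aggregating 32 pencils into one 512 KB exchange message saves 31 latency terms, but nothing arrives for over half a millisecond and the send cannot start until all 32 rows are computed; sending each 16 KB pencil as it is produced lets its ~26 microsecond cost hide behind the computation of the next row.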

Slide 34: Fine-grained UPC vs. Bulk-Synchronous MPI
- How to waste money on supercomputers:
  - Pack all communication into a single message (spend memory bandwidth)
  - Save all communication until the last piece is ready (add effective latency)
  - Send it all at once (spend bisection bandwidth)
- Or, to use what you have efficiently:
  - Avoid long wait times: send early and often
  - Use "all the wires, all the time"
  - This requires having low overhead!

Slide 35: What You Won't Hear Much About
- Compiler/runtime/GASNet bug fixes, performance tuning, testing, ...
  - >13,000 messages regarding CVS checkins
- Nightly regression testing
  - 25 platforms, 3 compilers (head, opt-branch, gcc-upc)
- Bug reporting
  - 1177 bug reports, 1027 fixed
- Release scheduled for later this summer
  - Beta is available; process significantly streamlined

Slide 36: Take-Home Messages
- Titanium offers tremendous gains in productivity
  - High-level, domain-specific array abstractions
  - Titanium is being used for real applications, not just toy problems
- Titanium and UPC are both highly portable
  - Run on essentially any machine
  - Rigorously tested and supported
- PGAS languages are faster than two-sided MPI
  - Better match to most HPC networks
- Berkeley UPC and Titanium benchmarks
  - Designed from scratch with the one-sided PGAS model
  - Focus on 2 scalability challenges: AMR and sparse LU

Slide 37: Titanium Background
- Based on Java, a cleaner C++
  - Classes, automatic memory management, etc.
  - Compiled to C and then machine code; no JVM
- Same parallelism model as UPC and CAF
  - SPMD parallelism
  - Dynamic Java threads are not supported
- Optimizing compiler
  - Analyzes global synchronization
  - Optimizes pointers, communication, memory

Slide 38: Do These Features Yield Productivity?
- Joint work with Kaushik Datta and Dan Bonachea

Slide 39: GASNet/X1 Performance
- GASNet/X1 improves small-message performance over SHMEM and MPI
  - Leverages global pointers on the X1
  - Highlights the advantage of a language vs. a library approach
- [Figure: single-word put and single-word get performance]
- Joint work with Christian Bell, Wei Chen, and Dan Bonachea

Slide 40: High Level Optimizations in Titanium
- Irregular communication can be expensive
  - The "best" strategy differs by data size/distribution and machine parameters
  - E.g., packing, sending bounding boxes, or fine-grained communication
- Use of runtime optimizations: inspector-executor
- Performance on sparse matrix-vector multiply
  - [Figure: average and maximum speedup of the Titanium version relative to the Aztec version on 1 to 16 processors; speedup relative to MPI code (Aztec library)]
  - Results: the best strategy differs within the machine on a single matrix (~50% better)
- Joint work with Jimmy Su
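
A compact UPC sketch of the inspector-executor idea for sparse matrix-vector multiply (toy sizes and an invented data layout; real codes choose among packing, bounding boxes, or fine-grained access per target, as the slide notes): the inspector records, once, which remote pieces of x each thread needs, and the executor fetches each piece with a single bulk get before computing locally.

    #include <upc.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define NLOCAL 4            /* rows (and x entries) owned per thread -- toy size */
    #define NNZ_PER_ROW 3       /* nonzeros per row -- toy size                      */

    /* distributed vector x: entries [t*NLOCAL, (t+1)*NLOCAL) live on thread t */
    shared [NLOCAL] double x[NLOCAL*THREADS];

    int    col[NLOCAL][NNZ_PER_ROW];   /* my rows: global column indices */
    double val[NLOCAL][NNZ_PER_ROW];   /* my rows: values                */
    double y[NLOCAL];                  /* my part of the result          */

    int main(void) {
        int n = NLOCAL * THREADS;

        /* build a toy banded matrix and initialize x */
        for (int i = 0; i < NLOCAL; i++) {
            int gi = MYTHREAD * NLOCAL + i;
            for (int k = 0; k < NNZ_PER_ROW; k++) {
                col[i][k] = (gi + k * NLOCAL) % n;   /* touches my block and others */
                val[i][k] = 1.0;
            }
            x[gi] = 1.0;
        }
        upc_barrier;

        /* Inspector (run once, reused over many matvecs): for each owner thread,
           record the bounding box of x entries my rows will read.               */
        int *lo = malloc(THREADS * sizeof(int));
        int *hi = malloc(THREADS * sizeof(int));
        for (int t = 0; t < THREADS; t++) { lo[t] = n; hi[t] = -1; }
        for (int i = 0; i < NLOCAL; i++)
            for (int k = 0; k < NNZ_PER_ROW; k++) {
                int c = col[i][k], owner = c / NLOCAL;
                if (c < lo[owner]) lo[owner] = c;
                if (c > hi[owner]) hi[owner] = c;
            }

        /* Executor: fetch each bounding box with one bulk get, then compute
           entirely on private copies.                                        */
        double *xcache = malloc(n * sizeof(double));
        for (int t = 0; t < THREADS; t++)
            if (hi[t] >= lo[t])
                upc_memget(&xcache[lo[t]], &x[lo[t]],
                           (size_t)(hi[t] - lo[t] + 1) * sizeof(double));

        for (int i = 0; i < NLOCAL; i++) {
            y[i] = 0.0;
            for (int k = 0; k < NNZ_PER_ROW; k++)
                y[i] += val[i][k] * xcache[col[i][k]];
        }

        if (MYTHREAD == 0)
            printf("y[0] = %g (expect %d)\n", y[0], NNZ_PER_ROW);
        free(lo); free(hi); free(xcache);
        return 0;
    }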

Slide 41: Source-to-Source Strategy
- Source-to-source translation strategy
  - Tremendous portability advantage
  - Still can perform significant optimizations
- Relies on high-quality back-end compilers and some coaxing in code generation (differences of up to 48x)
  - Use of "restrict" pointers in C
  - Understanding multi-D array indexing (Intel/Itanium issue)
  - Support for pragmas like IVDEP
  - Robust vectorizers (X1, SSE, NEC, ...)
- On machines with integrated shared-memory hardware, need access to shared-memory operations
- Joint work with Jimmy Su