Christian Bell, Dan Bonachea, Kaushik Datta, Rajesh Nishtala, Paul Hargrove, Parry Husbands, Kathy Yelick
The Performance and Productivity Benefits of Global Address Space Languages


Global Address Space Languages
- Global Address Space (GAS) languages support global pointers and distributed arrays, user-controlled layout of data across nodes, and communication through implicit reads and writes of remote memory.
- Languages: UPC, Titanium, Co-Array Fortran.
- Productivity benefits of shared-memory programming, with competitive performance on distributed memory.
- Single Program Multiple Data (SPMD) control: a fixed number of compute threads with global synchronization, barriers, and collectives.
- Exploit fast one-sided communication, both individual accesses and bulk copies.
- The Berkeley implementations are built on GASNet.
(A small UPC sketch of this model appears after the portability list below.)

Poster sections: Global Address Space Languages, LU, Fourier Transform, Conjugate Gradient, Performance, Productivity, GASNet Portability.

GASNet Portability
- Native network hardware support:
  - Quadrics QsNet I (Elan3) and QsNet II (Elan4)
  - Cray X1 (Cray shmem)
  - SGI Altix (SGI shmem)
  - Dolphin SCI
  - InfiniBand (Mellanox VAPI)
  - Myricom Myrinet (GM-1 and GM-2)
  - IBM Colony and Federation (LAPI)
- Portable network support:
  - Ethernet (UDP): works over any TCP/IP network
  - MPI 1: a portable implementation for other HPC systems
- Berkeley UPC, Titanium, and GASNet are highly portable:
  - CPUs: x86, Itanium, Opteron, Athlon, Alpha, PowerPC, MIPS, PA-RISC, SPARC, T3E, X-1, SX-6, ...
  - OSes: Linux, FreeBSD, NetBSD, Tru64, AIX, IRIX, HPUX, Solaris, MS-Windows/Cygwin, Mac OS X, Unicos, SuperUX, Catamount, BlueGene, ...
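To make the shared-address-space model described above concrete, here is a minimal UPC sketch of our own (not taken from the poster): a shared array spread cyclically over the threads, local writes by each thread, a global barrier, and implicit one-sided remote reads by thread 0. The array size 8*THREADS and the stored values are purely illustrative.

#include <upc_relaxed.h>
#include <stdio.h>

/* Default (cyclic) layout: element i has affinity to thread i % THREADS. */
shared double x[8*THREADS];

int main(void) {
    /* Each thread writes only the elements it has affinity to: local stores. */
    for (int i = MYTHREAD; i < 8*THREADS; i += THREADS)
        x[i] = (double)MYTHREAD;

    upc_barrier;  /* global synchronization before anyone reads remote elements */

    if (MYTHREAD == 0) {
        double sum = 0.0;
        /* Ordinary array syntax; elements owned by other threads are
           fetched with implicit one-sided remote reads. */
        for (int i = 0; i < 8*THREADS; i++)
            sum += x[i];
        printf("sum = %g\n", sum);
    }
    return 0;
}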
Applications
- Mach-10 shock hitting a solid surface at an oblique angle
- Immersed Boundary Method simulation: human heart, cochlea
- Adaptive Mesh Refinement (AMR): AMR Poisson multigrid solvers, 3D AMR gas dynamics hyperbolic solver, AMR with line relaxation (high aspect ratio)
- Bioinformatics: microarray oligonucleotide selection
- Finite element benchmarks
- Tree-structured n-body kernels
- Dense linear algebra: LU, MatMul

Unified Parallel C
- UPC: a GAS extension to ISO C99 with shared types and pointers.
- Libraries for parallel operations: synchronization, collective data movement, file I/O.
- Many open-source and vendor implementations: Berkeley UPC, MuPC, HP UPC, Cray UPC (see UPC Booth #137).
- NAS benchmarks (FT, CG, MG, LU) written from scratch, using some Berkeley UPC extensions; all competitive with the best MPI versions.

Titanium
- Titanium: a GAS dialect of Java, compiled to a native executable (no JVM or JIT); highly portable and high performance.
- A high-productivity language with all of Java's productivity features: automatic memory management, object-oriented programming, etc.
- Built-in types for scientific computing: N-d points, rectangles, and domains.
- Flexible and efficient multidimensional arrays; high-performance templates; user-defined immutable (value) classes; explicitly unordered N-d loop iteration; operator overloading; efficient cross-language support.
- Allows elegant and concise programs.

High Performance
- The global address space is logically partitioned.
- Shared data is accessible from any thread; private data is local to each thread.
- Pointers are either local or global (and possibly remote).

2-D multigrid stencil code in Titanium (a UPC version of the same stencil is sketched after this panel):

    final Point<2> NORTH = [0,1], SOUTH = [0,-1], EAST = [1,0], WEST = [-1,0];
    foreach (p in gridA.domain()) {
      gridB[p] = S0 * gridA[p]
               + S1 * ( gridA[p+NORTH] + gridA[p+SOUTH]
                      + gridA[p+EAST]  + gridA[p+WEST] );
    }

AMR solver comparison (* the C++ code has a more general grid generator):

    AMR Solver         Titanium AMR    Chombo (C++/F/MPI)*
    Serial (small)     53.2 sec        57.5 sec
    Parallel (large)   134.5 sec       113.3 sec
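As promised above, here is a hedged UPC counterpart to the Titanium stencil, written for this transcript rather than taken from the poster. It assumes the static THREADS compilation environment (a fixed thread count known at compile time), so the grid dimensions need not involve THREADS; N, S0, and S1 are illustrative placeholders.

#include <upc_relaxed.h>

#define N  512      /* grid dimension (illustrative) */
#define S0 0.5      /* stencil weights (illustrative) */
#define S1 0.125

/* One row of N doubles per block; rows are dealt round-robin to threads.
   Assumes a static (compile-time) THREADS value, since the dimensions
   do not mention THREADS. */
shared [N] double gridA[N][N], gridB[N][N];

void stencil_sweep(void) {
    int i;
    /* The affinity expression &gridA[i][0] runs iteration i on the thread
       that owns row i; neighboring rows may live on other threads and are
       read with implicit one-sided remote loads. */
    upc_forall (i = 1; i < N - 1; i++; &gridA[i][0]) {
        for (int j = 1; j < N - 1; j++) {
            gridB[i][j] = S0 * gridA[i][j]
                        + S1 * (gridA[i-1][j] + gridA[i+1][j]
                              + gridA[i][j-1] + gridA[i][j+1]);
        }
    }
    upc_barrier;  /* make the sweep visible before the grids are swapped */
}

int main(void) { stencil_sweep(); return 0; }

The one-row-per-block layout is only one possible choice; the poster's AMR and NAS codes use more elaborate distributions.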
LU
- High ratio of computation to communication; scales very well to large machines (Top500).
- Non-trivial dependence patterns; the HPL code uses two-sided MPI messaging and is very difficult to tune.
- The UPC implementation of LU uses:
  - one-sided communication in the GAS model
  - lightweight multithreading (on top of SPMD)
  - memory-constrained lookahead to manage the amount of concurrency on each processor, making it highly adaptable to problem and machine sizes
  - natural latency tolerance
- Performance is comparable to HPL/MPI, and the UPC code is less than half the size.

Berkeley compiler stack (figure): the GAS language program code, a source-to-source Translator, the translator-generated C code, the language Runtime System, the GASNet communication system, and the network hardware; the layers are annotated as platform-, network-, compiler-, and language-independent.

Conjugate Gradient
- Solves Ax = b for x, where A is a sparse 2-D matrix.
- Computation is dominated by sparse matrix-vector multiplication (SpMV).
- Key communication operations for the 2-D decomposition: team sum-reduce-to-all (vector and scalar) and point-to-point exchange.
- The one-sided model needs explicit synchronization (a sketch of such an exchange appears after the Productivity panel below).
- Uses a tree-based team reduction library, tunable to network characteristics.
- Overlaps the vector reductions on the rows of the matrix with the SpMV computation.
- Outperforms the MPI implementation by up to 10%.

Performance
- Communication microbenchmarks show GASNet consistently matching or outperforming MPI.
- One-sided, lightweight communication semantics eliminate matching and rendezvous overheads, better match hardware capabilities, and leverage widely available support for remote direct memory access.

References: Datta, Bonachea, and Yelick, LCPC 2005; Wen and Colella, IPDPS 2005.

Productivity
- Titanium reduces code size and development time.
- Communication is more concise than MPI and less error-prone.
- Language features and libraries capture useful paradigms.
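Returning to the Conjugate Gradient panel's point that the one-sided model needs explicit synchronization, here is a hedged sketch (ours, not the benchmark's code) of a point-to-point exchange in UPC. Each thread deposits data directly into its right neighbor's landing buffer with a one-sided bulk put; because no receive is posted, a barrier (or some other agreed-upon signal) is what tells the target the data has arrived. HALO, sendbuf, and recvbuf are illustrative names.

#include <upc_relaxed.h>

#define HALO 64   /* doubles exchanged per neighbor (illustrative) */

/* One HALO-sized landing zone per thread: thread t owns
   recvbuf[t*HALO .. (t+1)*HALO - 1]. */
shared [HALO] double recvbuf[HALO*THREADS];

int main(void) {
    double sendbuf[HALO];
    for (int i = 0; i < HALO; i++)
        sendbuf[i] = MYTHREAD + 0.001 * i;

    int right = (MYTHREAD + 1) % THREADS;

    /* One-sided bulk put into the neighbor's landing zone: the target
       posts no matching receive and takes no action. */
    upc_memput(&recvbuf[right * HALO], sendbuf, HALO * sizeof(double));

    /* Explicit synchronization: only after the barrier may the target
       safely read the freshly deposited data. */
    upc_barrier;

    double first = recvbuf[MYTHREAD * HALO];  /* local read of the received data */
    (void)first;
    return 0;
}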
Fourier Transform
- 3-D FFT with 1-D partitioning across processors.
- Computation is 1-D FFTs using the FFTW library; communication is an all-to-all transpose.
- Traditionally a bandwidth-limited problem.
- Optimization approach in GAS languages:
  - aggressively overlap the transpose with the second FFT
  - send more, smaller messages to maximize overlap: per-target slabs or individual 1-D pencils
  - use low-overhead one-sided communication
- Consistently outperforms MPI-based implementations, with improvements of up to 2x.
- Communication overlap is less effective in MPI due to increased messaging overheads and higher per-message costs.
(A sketch of the transpose's one-sided puts follows this panel.)
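To illustrate the communication step described above, here is a hedged sketch of the all-to-all transpose expressed as one-sided slab puts in UPC. It is our illustration, not the NAS FT code: SLAB, pack_slab_for, and exch are made-up names, it assumes the static THREADS compilation environment so that THREADS may appear in the block size, and for clarity it uses a blocking upc_memput followed by a barrier. The implementations discussed on the poster instead overlap these puts with the remaining FFT work using non-blocking one-sided puts, which this sketch does not attempt to reproduce.

#include <upc_relaxed.h>

#define SLAB 4096   /* doubles per (source thread, target thread) slab (illustrative) */

/* Exchange buffer: thread t owns one block of SLAB*THREADS doubles, into
   which every thread deposits its slab destined for t. Assumes a static
   (compile-time) THREADS value so THREADS may appear in the block size. */
shared [SLAB*THREADS] double exch[SLAB*THREADS*THREADS];

/* Stand-in for gathering the strided elements headed to 'target' out of
   the locally transformed planes (made-up helper, not FFTW code). */
static void pack_slab_for(int target, double *out) {
    for (int i = 0; i < SLAB; i++)
        out[i] = target + 1e-6 * i;
}

int main(void) {
    double slab[SLAB];

    for (int t = 0; t < THREADS; t++) {
        pack_slab_for(t, slab);
        /* One-sided put of this thread's slab into slot MYTHREAD of
           thread t's region of the exchange buffer. */
        upc_memput(&exch[(size_t)t * SLAB * THREADS + (size_t)MYTHREAD * SLAB],
                   slab, SLAB * sizeof(double));
    }

    upc_barrier;  /* all slabs delivered; the next 1-D FFT pass may proceed */
    return 0;
}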