1 PGAS Languages and Halo Updates
Will Sawyer, CSCS
POMPA Kickoff Meeting, May 3-4, 2011

2 Important concepts and acronyms
- PGAS: Partitioned Global Address Space
- UPC: Unified Parallel C
- CAF: Co-Array Fortran
- Titanium: PGAS Java dialect
- MPI: Message-Passing Interface
- SHMEM: Shared Memory API (SGI)

3 Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by any other
- Partitioned: data is designated as local (with ‘affinity’) or global (possibly far); the programmer controls the layout
- By default: object heaps are shared, program stacks are private
[Figure: the global address space spans threads p0 ... pn; each thread holds private variables (x, y, l) plus its partition of the shared data (g)]
- Current languages: UPC, CAF, and Titanium
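To make the model concrete, here is a minimal UPC sketch (not from the talk; the array g, the private x and the neighbour read are purely illustrative): g lives in the shared space with one element affine to each thread, x is private to each thread, and reading another thread's element may generate communication.

#include <stdio.h>
#include <upc.h>

shared int g[THREADS];   /* shared: element i has affinity to thread i      */
int x;                   /* private: every thread has its own copy          */

int main(void) {
    x = MYTHREAD;                  /* purely local write                    */
    g[MYTHREAD] = x;               /* write to the locally-affine element   */
    upc_barrier;
    /* remote read: may become an RDMA get or a message under the hood      */
    int left = g[(MYTHREAD + THREADS - 1) % THREADS];
    printf("thread %d read %d from its left neighbour\n", MYTHREAD, left);
    return 0;
}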

4 Potential strengths of a PGAS language
- Interprocess communication intrinsic to the language
- Explicit support for distributed data structures (private and shared data)
- Conceptually the parallel formulation can be more elegant
- One-sided shared-memory communication (see the put/get sketch below)
  - Values are either ‘put’ or ‘got’ from remote images
  - Support for bulk messages, synchronization
  - Could be implemented with a message-passing library or through RDMA (remote direct memory access)
- PGAS hardware support available
  - Cray Gemini (XE6) interconnect supports RDMA
- Potential interoperability with existing C/Fortran/Java code
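As a sketch of that put/get style (illustrative only: NB, win, buf and the two routines are invented names, and bare barriers stand in for whatever synchronization a real code would use), UPC exposes one-sided bulk transfers as library calls:

#include <upc.h>

#define NB 1024                       /* illustrative block size             */
shared [NB] double win[THREADS*NB];   /* one block of NB doubles per thread  */
double buf[NB];                       /* private send/receive buffer         */

/* 'put' style: the producer pushes its buffer into the partner's block;
   after the barrier the incoming data sits in win[MYTHREAD*NB ..]           */
void exchange_put(int partner) {
    upc_memput(&win[partner*NB], buf, NB*sizeof(double));
    upc_barrier;
}

/* 'get' style: publish locally, synchronize, then pull the partner's block  */
void exchange_get(int partner) {
    upc_memput(&win[MYTHREAD*NB], buf, NB*sizeof(double));   /* local copy   */
    upc_barrier;
    upc_memget(buf, &win[partner*NB], NB*sizeof(double));
}

Whether these calls map onto RDMA or onto an underlying message-passing layer depends on the compiler and interconnect, which is exactly the distinction the XT5/XE6 results later in the talk illustrate.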

5 POP Halo Exchange with Co-Array Fortran
- Cray X1 had a single vector processor per node and hardware support for internode communication
- Co-Array Fortran (CAF) was driven by Numrich et al., also the authors of SHMEM
- Halo exchange programmed in MPI, CAF, and SHMEM
Reference: Worley and Levesque, "The Performance Evolution of the Parallel Ocean Program on the Cray X1", Cray User Group Meeting, 2004

6 Halo Exchange "Stencil 2D" Benchmark
Halo exchange and stencil operation over a square domain distributed over a 2-D virtual process topology
- Arbitrary halo ‘radius’ (number of halo cells in a given dimension, e.g. 3)
- MPI implementations:
  - Trivial: post all 8 MPI_Isend and MPI_Irecv
  - Sendrecv: MPI_Sendrecv between PE pairs
  - Halo: MPI_Isend/MPI_Irecv between PE pairs (see the sketch below)
- CAF implementations:
  - Trivial: simple copies to remote images
  - Put: reciprocal puts between image pairs
  - Get: reciprocal gets between image pairs
  - GetA: all images do inner region first, then all do block region (fine grain, no sync.)
  - GetH: half of images do inner region first, half do block region first (fine grain, no sync.)
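A sketch of what the 'Halo' MPI variant could look like for the East/West pair (hedged: the function and variable names are illustrative, not taken from the benchmark source):

#include <mpi.h>

/* Exchange packed East/West halo strips of 'count' doubles with the two
   neighbours in the 2-D virtual process topology.                           */
void halo_exchange_ew(double *sendW, double *sendE,
                      double *recvW, double *recvE,
                      int count, int west, int east, MPI_Comm comm)
{
    MPI_Request req[4];
    MPI_Irecv(recvW, count, MPI_DOUBLE, west, 0, comm, &req[0]);
    MPI_Irecv(recvE, count, MPI_DOUBLE, east, 1, comm, &req[1]);
    MPI_Isend(sendE, count, MPI_DOUBLE, east, 0, comm, &req[2]);
    MPI_Isend(sendW, count, MPI_DOUBLE, west, 1, comm, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
}

The same pattern repeats for the North/South and diagonal pairs; the 'Trivial' variant simply posts all eight sends and receives at once.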

7 Example code: Trivial CAF

real, allocatable, save :: V(:,:)[:,:]
:
allocate( V(1-halo:m+halo,1-halo:n+halo)[p,*] )
:
WW = myP-1 ; if (WW<1) WW = p
EE = myP+1 ; if (EE>p) EE = 1
SS = myQ-1 ; if (SS<1) SS = q
NN = myQ+1 ; if (NN>q) NN = 1
:
V(1:m,1:n) = dom(1:m,1:n)                                    ! internal region
V(1-halo:0, 1:n)[EE,myQ]        = dom(m-halo+1:m,1:n)        ! to East
V(m+1:m+halo, 1:n)[WW,myQ]      = dom(1:halo,1:n)            ! to West
V(1:m,1-halo:0)[myP,NN]         = dom(1:m,n-halo+1:n)        ! to North
V(1:m,n+1:n+halo)[myP,SS]       = dom(1:m,1:halo)            ! to South
V(1-halo:0,1-halo:0)[EE,NN]     = dom(m-halo+1:m,n-halo+1:n) ! to North-East
V(m+1:m+halo,1-halo:0)[WW,NN]   = dom(1:halo,n-halo+1:n)     ! to North-West
V(1-halo:0,n+1:n+halo)[EE,SS]   = dom(m-halo+1:m,1:halo)     ! to South-East
V(m+1:m+halo,n+1:n+halo)[WW,SS] = dom(1:halo,1:halo)         ! to South-West
sync all
!
! Now run a stencil filter over the internal region (the region unaffected by halo values)
!
do j=1,n
  do i=1,m
    sum = 0.
    do l=-halo,halo
      do k=-halo,halo
        sum = sum + stencil(k,l)*V(i+k,j+l)
      enddo
    enddo
    dom(i,j) = sum
  enddo
enddo

8 Stencil 2D Results on XT5, XE6, X2; Halo = 1
Using a fixed-size virtual PE topology, vary the size of the local square
- XT5: CAF puts/gets implemented through the message-passing library
- XE6, X2: RMA-enabled hardware support for PGAS, but transfers must still pass through a software stack

9 Stencil 2D Weak Scaling on XE6
Fixed local dimension, vary the PE virtual topology (take the optimal configuration)

10 SPIN: Transverse field Ising model (Sergei Isakov)
- No symmetries
- Any lattice with n sites: 2^n states
- Need n bits to encode the state; split this into two parts of m and n-m bits (see the sketch below)
  - First part is a core index: 2^m cores
  - Second part is a state index within the core: 2^(n-m) states
- Sparse matrix times dense vector
- Each process communicates (large vectors) only with m ‘neighbors’
- Similar to a halo update, but with a higher-dimensional state space
- Implementation in C with MPI_Irecv/Isend, MPI_Allreduce
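A sketch of that index splitting in plain C (helper names are invented; the original code's flip_state plays the role of flip_spin here):

#include <stdint.h>

/* n spins -> 2^n basis states; the top m bits select the owning core and the
   low n-m bits index the state within that core's local vector.             */
static inline int owner_core(uint64_t state, int n, int m) {
    return (int)(state >> (n - m));                    /* 2^m cores            */
}
static inline uint64_t local_index(uint64_t state, int n, int m) {
    return state & ((UINT64_C(1) << (n - m)) - 1);     /* 2^(n-m) local states */
}
/* Flipping spin k (the off-diagonal transverse-field term) may change the top
   bits, i.e. couple to a state owned by another core.                        */
static inline uint64_t flip_spin(uint64_t state, int k) {
    return state ^ (UINT64_C(1) << k);
}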

11 UPC Version "Elegant"

shared double *dotprod;            /* on thread 0 */
shared double shared_a[THREADS];
shared double shared_b[THREADS];

struct ed_s {
  ...
  shared double *v0, *v1, *v2;     /* vectors */
  shared double *swap;             /* for swapping vectors */
};
:
for (iter = 0; iter < max_iter; ++iter) {
  shared_b[MYTHREAD] = b;                       /* calculate beta */
  upc_all_reduceD( dotprod, shared_b, UPC_ADD, THREADS, 1, NULL,
                   UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
  ed->beta[iter] = sqrt(fabs(dotprod[0]));
  ib = 1.0 / ed->beta[iter];

  /* normalize v1 */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) )
    ed->v1[i] *= ib;
  upc_barrier(0);

  /* matrix vector multiplication: v2 = A * v1, over all threads */
  upc_forall (s = 0; s < ed->nlstates; ++s; &(ed->v1[s]) ) {
    ed->v2[s] = diag(s, ed->n, ed->j) * ed->v1[s];   /* diagonal part */
    for (k = 0; k < ed->n; ++k) {                    /* offdiagonal part */
      s1 = flip_state(s, k);
      ed->v2[s] += ed->gamma * ed->v1[s1];
    }
  }

  a = 0.0;                          /* calculate local conjugate term */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) ) {
    a += ed->v1[i] * ed->v2[i];
  }
  shared_a[MYTHREAD] = a;
  upc_all_reduceD( dotprod, shared_a, UPC_ADD, THREADS, 1, NULL,
                   UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
  ed->alpha[iter] = dotprod[0];

  b = 0.0;                          /* v2 = v2 - v0 * beta1 - v1 * alpha1 */
  upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v2[i]) ) {
    ed->v2[i] -= ed->v0[i] * ed->beta[iter] + ed->v1[i] * ed->alpha[iter];
    b += ed->v2[i] * ed->v2[i];
  }

  swap01(ed); swap12(ed);           /* "shift" vectors */
}

12 UPC "Inelegant1": reproduce existing messaging

- MPI:

MPI_Isend(ed->v1, ed->nlstates, MPI_DOUBLE, ed->to_nbs[0], k, MPI_COMM_WORLD, &req_send2);
MPI_Irecv(ed->vv1, ed->nlstates, MPI_DOUBLE, ed->from_nbs[0], ed->nm-1, MPI_COMM_WORLD, &req_recv);
:
MPI_Isend(ed->v1, ed->nlstates, MPI_DOUBLE, ed->to_nbs[neighb], k, MPI_COMM_WORLD, &req_send2);
MPI_Irecv(ed->vv2, ed->nlstates, MPI_DOUBLE, ed->from_nbs[neighb], k, MPI_COMM_WORLD, &req_recv2);
:

- UPC:

shared[NBLOCK] double vtmp[THREADS*NBLOCK];
:
for (i = 0; i < ed->nlstates; ++i) vtmp[i+(MYTHREAD*NBLOCK)] = ed->v1[i];
upc_barrier(1);
for (i = 0; i < ed->nlstates; ++i) ed->vv1[i] = vtmp[i+(ed->from_nbs[0]*NBLOCK)];
:
for (i = 0; i < ed->nlstates; ++i) ed->vv2[i] = vtmp[i+(ed->from_nbs[neighb]*NBLOCK)];
upc_barrier(2);
:

13 UPC "Inelegant3": use only PUT operations

shared[NBLOCK] double vtmp1[THREADS*NBLOCK];
shared[NBLOCK] double vtmp2[THREADS*NBLOCK];
:
upc_memput( &vtmp1[ed->to_nbs[0]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
upc_barrier(1);
:
if ( mode == 0 ) {
  upc_memput( &vtmp2[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
} else {
  upc_memput( &vtmp1[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
}
:
if ( mode == 0 ) {
  for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp1[i+MYTHREAD*NBLOCK]; }
  mode = 1;
} else {
  for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp2[i+MYTHREAD*NBLOCK]; }
  mode = 0;
}
upc_barrier(2);

14 But then: why not use the lightweight SHMEM protocol?

#include <shmem.h>
:
double *vtmp1, *vtmp2;
:
vtmp1 = (double *) shmalloc(ed->nlstates*sizeof(double));
vtmp2 = (double *) shmalloc(ed->nlstates*sizeof(double));
:
shmem_double_put(vtmp1, ed->v1, ed->nlstates, ed->from_nbs[0]);
/* Do local work */
shmem_barrier_all();
:
shmem_double_put(vtmp2, ed->v1, ed->nlstates, ed->from_nbs[0]);
:
for (i = 0; i < ed->nlstates; ++i) { ed->v2[i] += ed->gamma * vtmp1[i]; }
shmem_barrier_all();
swap(&vtmp1, &vtmp2);
:

15 Strong scaling: Cray XE6/Gemini, n = 22, 24; 10 iterations

16 Weak scaling: Cray XE6/Gemini, 10 iterations

17 Conclusions
- One-sided communication has conceptual benefits and can have real performance benefits (e.g., Cray T3E, X1, perhaps X2)
- On the XE6, the CAF/UPC formulation can reach SHMEM performance, but only by using explicit puts and gets; the ‘elegant’ implementations perform poorly
- If the domain decomposition is already properly formulated, why not use a simple, lightweight protocol like SHMEM?
- For the XE6 Gemini interconnect, a study of one-sided communication primitives (Tineo et al.) indicates that two-sided MPI communication is still the most effective; to do: test MPI-2 one-sided primitives (a sketch follows below)
- Still, the PGAS path should be kept open; possible task: a PGAS (CAF or SHMEM) implementation of the COSMO halo update?
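A minimal sketch of such an MPI-2 one-sided test, assuming fence synchronization (names are illustrative; a production halo update would create the window once and reuse it rather than recreating it per call):

#include <mpi.h>

/* Expose 'halo_buf' as an RMA window and push 'count' doubles into the
   neighbour's window with MPI_Put.                                          */
void halo_put_mpi2(double *halo_buf, double *send_buf,
                   int count, int neighbour, MPI_Comm comm)
{
    MPI_Win win;
    MPI_Win_create(halo_buf, (MPI_Aint)(count * sizeof(double)), sizeof(double),
                   MPI_INFO_NULL, comm, &win);

    MPI_Win_fence(0, win);                  /* open the access/exposure epoch */
    MPI_Put(send_buf, count, MPI_DOUBLE,
            neighbour, 0, count, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                  /* complete and expose all puts   */

    MPI_Win_free(&win);
}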