PGAS Languages and Halo Updates
Will Sawyer, CSCS
POMPA Kickoff Meeting, May 3-4, 2011

Important concepts and acronyms
- PGAS: Partitioned Global Address Space
- UPC: Unified Parallel C
- CAF: Co-Array Fortran
- Titanium: PGAS Java dialect
- MPI: Message-Passing Interface
- SHMEM: Shared Memory API (SGI)

Partitioned Global Address Space
- Global address space: any thread/process may directly read/write data allocated by any other
- Partitioned: data is designated as local (with ‘affinity’) or global (possibly far); the programmer controls the layout
- By default:
  - Object heaps are shared
  - Program stacks are private
- Current languages: UPC, CAF, and Titanium

[Figure: the global address space across processes p0, p1, ..., pn, with labels x, y, l, g]

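To make ‘affinity’ concrete: in UPC, an array declared as `shared [B] double a[N]` is dealt out to the threads round-robin in blocks of B elements. The following is a minimal sketch in plain C (not UPC) of that layout rule; the function names `owner_thread` and `local_offset` are hypothetical and exist only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that element i of `shared [block] T a[...]` has affinity to,
   following UPC's block-cyclic layout rule. */
size_t owner_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Position of element i inside the owning thread's local portion
   (hypothetical helper, not part of any UPC API). */
size_t local_offset(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    size_t round = i / (block * threads);   /* completed full rounds */
    return round * block + i % block;
}
```

With a block size of 2 and 4 threads, thread 0 holds elements 0, 1, 8, 9, so element 9 sits at local offset 3 on thread 0. Accesses where `owner_thread(i, ...)` differs from the calling thread are the ‘possibly far’ ones.
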
Potential strengths of a PGAS language
- Interprocess communication intrinsic to language
  - Explicit support for distributed data structures (private and shared data)
  - Conceptually the parallel formulation can be more elegant
- One-sided shared-memory communication
  - Values are either ‘put’ to or ‘got’ from remote images
  - Support for bulk messages, synchronization
  - Could be implemented with message-passing library or through RDMA (remote direct memory access)
- PGAS hardware support available
  - Cray Gemini (XE6) interconnect supports RDMA
- Potential interoperability with existing C/Fortran/Java code

POP Halo Exchange with Co-Array Fortran
- Worley, Levesque, The Performance Evolution of the Parallel Ocean Program on the Cray X1, Cray User Group Meeting, 2004
- Cray X1 had a single vector processor per node and hardware support for internode communication
- Co-Array Fortran (CAF) driven by Numrich, et al., also the authors of SHMEM
- Halo exchange programmed in MPI, CAF, SHMEM

Halo Exchange “Stencil 2D” Benchmark
Halo exchange and stencil operation over a square domain distributed over a 2-D virtual process topology
- Arbitrary halo ‘radius’ (number of halo cells in a given dimension, e.g. 3)
- MPI implementations:
  - Trivial: post all 8 MPI_Isend and MPI_Irecv
  - Sendrecv: MPI_Sendrecv between PE pairs
  - Halo: MPI_Isend/MPI_Irecv between PE pairs
- CAF implementations:
  - Trivial: simple copies to remote images
  - Put: reciprocal puts between image pairs
  - Get: reciprocal gets between image pairs
  - GetA: all images do inner region first, then all do block region (fine grain, no sync.)
  - GetH: half of images do inner region first, half do block region first (fine grain, no sync.)

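Every variant listed above first has to identify its 8 partners in the 2-D virtual topology. A minimal sketch in plain C, assuming a p x q topology with 0-based coordinates and the same periodic wrap-around that the Trivial CAF example applies to WW, EE, SS and NN; the type `pe_coord` and the functions `neighbour` and `all_neighbours` are hypothetical names for this illustration.

```c
#include <assert.h>

typedef struct { int p, q; } pe_coord;   /* position in the p x q topology */

/* Map any integer onto 0..n-1 (periodic boundary). */
static int wrap(int i, int n) { return ((i % n) + n) % n; }

/* Neighbour of PE (myp, myq) displaced by (dp, dq). */
pe_coord neighbour(int myp, int myq, int dp, int dq, int p, int q)
{
    assert(p > 0 && q > 0);
    pe_coord c = { wrap(myp + dp, p), wrap(myq + dq, q) };
    return c;
}

/* Fill nb[0..7] with the 4 edge and 4 corner neighbours. */
void all_neighbours(int myp, int myq, int p, int q, pe_coord nb[8])
{
    int k = 0;
    for (int dq = -1; dq <= 1; ++dq)
        for (int dp = -1; dp <= 1; ++dp)
            if (dp != 0 || dq != 0)          /* skip the PE itself */
                nb[k++] = neighbour(myp, myq, dp, dq, p, q);
}
```

For a halo of radius h on an m x n local domain, each edge partner exchanges h*n or h*m cells and each corner partner h*h cells, which is why the ‘Trivial’ variants post 8 transfers per PE.
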
Example code: Trivial CAF

    real, allocatable, save :: V(:,:)[:,:]
    :
    allocate( V(1-halo:m+halo,1-halo:n+halo)[p,*] )
    :
    WW = myP-1 ; if (WW<1) WW = p
    EE = myP+1 ; if (EE>p) EE = 1
    SS = myQ-1 ; if (SS<1) SS = q
    NN = myQ+1 ; if (NN>q) NN = 1
    :
    V(1:m,1:n) = dom(1:m,1:n)                                  ! internal region
    V(1-halo:0, 1:n)[EE,myQ] = dom(m-halo+1:m,1:n)             ! to East
    V(m+1:m+halo, 1:n)[WW,myQ] = dom(1:halo,1:n)               ! to West
    V(1:m,1-halo:0)[myP,NN] = dom(1:m,n-halo+1:n)              ! to North
    V(1:m,n+1:n+halo)[myP,SS] = dom(1:m,1:halo)                ! to South
    V(1-halo:0,1-halo:0)[EE,NN] = dom(m-halo+1:m,n-halo+1:n)   ! to North-East
    V(m+1:m+halo,1-halo:0)[WW,NN] = dom(1:halo,n-halo+1:n)     ! to North-West
    V(1-halo:0,n+1:n+halo)[EE,SS] = dom(m-halo+1:m,1:halo)     ! to South-East
    V(m+1:m+halo,n+1:n+halo)[WW,SS] = dom(1:halo,1:halo)       ! to South-West
    sync all
    !
    ! Now run a stencil filter over the local domain; points within
    ! 'halo' cells of the boundary read the halo values received above
    !
    do j=1,n
      do i=1,m
        sum = 0.
        do l=-halo,halo
          do k=-halo,halo
            sum = sum + stencil(k,l)*V(i+k,j+l)
          enddo
        enddo
        dom(i,j) = sum
      enddo
    enddo

Stencil 2D Results on XT5, XE6, X2; Halo = 1
Using a fixed-size virtual PE topology, vary the size of the local square
- XT5: CAF puts/gets implemented through message-passing library
- XE6, X2: RMA-enabled hardware support for PGAS, but still must pass through software stack

Stencil 2D Weak Scaling on XE6
Fixed local dimension, vary the PE virtual topology (take the optimal configuration)

SPIN: Transverse field Ising model
Sergei Isakov
- No symmetries
- Any lattice with n sites: 2^n states
- Need n bits to encode the state
  - Split this in two parts of m and n-m bits
  - First part is a core index: 2^m cores
  - Second part is a state index within the core: 2^(n-m) states
- Sparse matrix times dense vector
- Each process communicates (large vectors) only with m ‘neighbors’
- Similar to a halo update, but with higher dimensional state space
- Implementation in C with MPI_Irecv/Isend, MPI_Allreduce

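The split of the n-bit state into a core index and a local index can be sketched in plain C as follows. This is an illustration only: it assumes the ‘first part’ is the high-order m bits and that the off-diagonal term connects states differing in exactly one bit. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: high m bits select the core, low n-m bits index the
   state within that core. */
uint64_t core_of(uint64_t s, int n, int m)
{
    return s >> (n - m);
}

uint64_t local_of(uint64_t s, int n, int m)
{
    return s & ((UINT64_C(1) << (n - m)) - 1);
}

/* Core that owns the state obtained from s by flipping bit k. */
uint64_t core_after_flip(uint64_t s, int k, int n, int m)
{
    assert(0 < m && m < n && n < 64 && 0 <= k && k < n);
    return core_of(s ^ (UINT64_C(1) << k), n, m);
}

/* Number of single-bit flips of s that land on a different core. */
int count_remote_neighbours(uint64_t s, int n, int m)
{
    int count = 0;
    uint64_t home = core_of(s, n, m);
    for (int k = 0; k < n; ++k)
        if (core_after_flip(s, k, n, m) != home)
            ++count;
    return count;
}
```

Flipping any of the low n-m bits stays on the same core, while each of the m high bits leads to a distinct other core, so `count_remote_neighbours` returns m for every state. That is the sense in which each process has only m ‘neighbors’.
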
UPC Version “Elegant”

    shared double *dotprod;              /* on thread 0 */
    shared double shared_a[THREADS];
    shared double shared_b[THREADS];
    struct ed_s {
      ...
      shared double *v0, *v1, *v2;       /* vectors */
      shared double *swap;               /* for swapping vectors */
    };
    :
    for (iter = 0; iter < ed->max_iter; ++iter) {
      shared_b[MYTHREAD] = b;            /* calculate beta */
      upc_all_reduceD( dotprod, shared_b, UPC_ADD, THREADS, 1, NULL,
                       UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
      ed->beta[iter] = sqrt(fabs(dotprod[0]));
      ib = 1.0 / ed->beta[iter];
      /* normalize v1 */
      upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) )
        ed->v1[i] *= ib;
      upc_barrier(0);
      /* matrix vector multiplication: v2 = A * v1, over all threads */
      upc_forall (s = 0; s < ed->nlstates; ++s; &(ed->v1[s]) ) {
        ed->v2[s] = diag(s, ed->n, ed->j) * ed->v1[s];   /* diagonal part */
        for (k = 0; k < ed->n; ++k) {                    /* offdiagonal part */
          s1 = flip_state(s, k);
          ed->v2[s] += ed->gamma * ed->v1[s1];
        }
      }
      a = 0.0;                           /* calculate local conjugate term */
      upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v1[i]) ) {
        a += ed->v1[i] * ed->v2[i];
      }
      shared_a[MYTHREAD] = a;
      upc_all_reduceD( dotprod, shared_a, UPC_ADD, THREADS, 1, NULL,
                       UPC_IN_ALLSYNC | UPC_OUT_ALLSYNC );
      ed->alpha[iter] = dotprod[0];
      b = 0.0;                           /* v2 = v2 - v0 * beta1 - v1 * alpha1 */
      upc_forall (i = 0; i < ed->nlstates; ++i; &(ed->v2[i]) ) {
        ed->v2[i] -= ed->v0[i] * ed->beta[iter] + ed->v1[i] * ed->alpha[iter];
        b += ed->v2[i] * ed->v2[i];
      }
      swap01(ed); swap12(ed);            /* "shift" vectors */
    }

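The off-diagonal part of the matrix-vector product is where the ‘elegant’ version touches remote data, since `ed->v1[s1]` may have affinity to another thread. A serial sketch in plain C of that update is given below, assuming that `flip_state(s, k)` toggles bit k of s (the source does not show its definition); the function name `add_offdiagonal` is hypothetical and the diagonal term is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Serial analogue of the off-diagonal loop:
   v2[s] += gamma * v1[s with bit k flipped], for each of the n bits.
   v1 and v2 must each hold 2^n entries. */
void add_offdiagonal(const double *v1, double *v2, int n, double gamma)
{
    assert(n > 0 && n < 31);
    size_t nstates = (size_t)1 << n;
    for (size_t s = 0; s < nstates; ++s)
        for (int k = 0; k < n; ++k)
            v2[s] += gamma * v1[s ^ ((size_t)1 << k)];   /* assumed flip_state */
}
```

Each of the 2^n rows has exactly n off-diagonal entries, so the matrix is very sparse. In the distributed version, every access to a flipped high-order bit becomes a fine-grained remote read unless the transfers are aggregated by hand.
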
UPC “Inelegant1”: reproduce existing messaging

MPI:

    MPI_Isend(ed->v1, ed->nlstates, MPI_DOUBLE, ed->to_nbs[0], k,
              MPI_COMM_WORLD, &req_send2);
    MPI_Irecv(ed->vv1, ed->nlstates, MPI_DOUBLE, ed->from_nbs[0], ed->nm-1,
              MPI_COMM_WORLD, &req_recv);
    :
    MPI_Isend(ed->v1, ed->nlstates, MPI_DOUBLE, ed->to_nbs[neighb], k,
              MPI_COMM_WORLD, &req_send2);
    MPI_Irecv(ed->vv2, ed->nlstates, MPI_DOUBLE, ed->from_nbs[neighb], k,
              MPI_COMM_WORLD, &req_recv2);
    :

UPC:

    shared[NBLOCK] double vtmp[THREADS*NBLOCK];
    :
    /* each thread publishes v1 in its own block of vtmp */
    for (i = 0; i < ed->nlstates; ++i) vtmp[i+MYTHREAD*NBLOCK] = ed->v1[i];
    upc_barrier(1);
    /* then copies the neighbours' blocks into private buffers */
    for (i = 0; i < ed->nlstates; ++i) ed->vv1[i] = vtmp[i+(ed->from_nbs[0]*NBLOCK)];
    :
    for (i = 0; i < ed->nlstates; ++i) ed->vv2[i] = vtmp[i+(ed->from_nbs[neighb]*NBLOCK)];
    upc_barrier(2);
    :

UPC “Inelegant3”: use only PUT operations

    shared[NBLOCK] double vtmp1[THREADS*NBLOCK];
    shared[NBLOCK] double vtmp2[THREADS*NBLOCK];
    :
    upc_memput( &vtmp1[ed->to_nbs[0]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
    upc_barrier(1);
    :
    /* alternate between the two buffers from one neighbour to the next */
    if ( mode == 0 ) {
      upc_memput( &vtmp2[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
    } else {
      upc_memput( &vtmp1[ed->to_nbs[neighb]*NBLOCK], ed->v1, NBLOCK*sizeof(double) );
    }
    :
    if ( mode == 0 ) {
      for (i = 0; i < ed->nlstates; ++i) {
        ed->v2[i] += ed->gamma * vtmp1[i+MYTHREAD*NBLOCK];
      }
      mode = 1;
    } else {
      for (i = 0; i < ed->nlstates; ++i) {
        ed->v2[i] += ed->gamma * vtmp2[i+MYTHREAD*NBLOCK];
      }
      mode = 0;
    }
    upc_barrier(2);

Thursday, February 3, 2011, SCR discussion of HP2C projects

But then: why not use light weight SHMEM protocol?

    #include <shmem.h>    /* <mpp/shmem.h> on Cray systems */
    :
    double *vtmp1,*vtmp2;
    :
    vtmp1 = (double *) shmalloc(ed->nlstates*sizeof(double));
    vtmp2 = (double *) shmalloc(ed->nlstates*sizeof(double));
    :
    shmem_double_put(vtmp1,ed->v1,ed->nlstates,ed->from_nbs[0]);
    /* Do local work */
    shmem_barrier_all();
    :
    shmem_double_put(vtmp2,ed->v1,ed->nlstates,ed->from_nbs[0]);
    :
    for (i = 0; i < ed->nlstates; ++i) {
      ed->v2[i] += ed->gamma * vtmp1[i];
    }
    shmem_barrier_all();
    swap(&vtmp1, &vtmp2);
    :

Strong scaling: Cray XE6/Gemini, n = 22, 24; 10 iterations

Weak scaling: Cray XE6/Gemini, 10 iterations

Conclusions
- One-way communication has conceptual benefits and can have real ones (e.g., Cray T3E, X1, perhaps X2)
- On XE6, a CAF/UPC formulation can achieve SHMEM performance, but only by using explicit puts and gets; ‘elegant’ implementations perform poorly
- If the domain decomposition is already properly formulated, why not use a simple, light-weight protocol like SHMEM?
- For the XE6 Gemini interconnect: a study of one-sided communication primitives (Tineo, et al.) indicates that 2-sided MPI communication is still most effective. To do: test MPI-2 one-sided primitives
- Still: the PGAS path should be kept open; possible task: PGAS (CAF or SHMEM) implementation of the COSMO halo update?