Collective Communication
University of North Carolina - Chapel Hill
ITS - Research Computing
Instructor: Mark Reed
Collective Communication
- Communications involving a group of processes.
- Called by all processes in a communicator.
- Examples: barrier synchronization; broadcast, scatter, gather; global sum, global maximum, etc.
Characteristics of Collective Communication
- Collective action over a communicator: all processes must communicate.
- Synchronization may or may not occur.
- All collective operations are blocking.
- No message tags.
- Receive buffers must be exactly the right size; this restriction was made to simplify the standard (e.g. no receive status is required).
Why use collective operations?
- Efficiency
- Clarity
- Convenience
- Robustness
- Flexibility
- Programming style - all collective ops
Barrier Synchronization
int MPI_Barrier(MPI_Comm comm)
- All processes within the communicator, comm, are synchronized: no process returns from the call until every process has entered it.
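One common use is lining processes up before timing a region, so that the measurement reflects the slowest process rather than whichever one arrived first. A minimal sketch, assuming the work being timed goes where the comment is:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank;
  double t0, t1;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Barrier(MPI_COMM_WORLD);   /* everyone starts the clock together */
  t0 = MPI_Wtime();

  /* ... the work being timed would go here ... */

  MPI_Barrier(MPI_COMM_WORLD);   /* wait for the slowest process */
  t1 = MPI_Wtime();

  if (rank == 0) printf("elapsed time: %f seconds\n", t1 - t0);

  MPI_Finalize();
  return 0;
}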
Broadcast
(Diagram: before the call only the root holds the data item A0; afterwards every process holds a copy of A0.)
Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)
- buffer - address of data to broadcast
- count - number of items to send
- datatype - MPI datatype
- root - process from which data is sent
- comm - MPI communicator
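As a sketch of how the arguments fit together: the root (rank 0 here) sets a value and broadcasts it so every rank ends up with a copy. Note that every rank calls MPI_Bcast with the same root:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, n = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0) n = 100;    /* only the root has the value initially */

  /* after the call, n on every rank holds the root's value */
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("rank %d: n = %d\n", rank, n);

  MPI_Finalize();
  return 0;
}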
Gather/Scatter
(Diagram: scatter distributes A0, A1, A2, A3 from the root so that process i receives Ai; gather is the inverse, collecting the Ai from all processes back into the root's buffer.)
Gather
int MPI_Gather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- sendbuf - each process (including the root) sends this buffer to the root
- sendcount - number of elements in the send buffer
- recvbuf - address of the receive buffer; ignored on all non-root processes. The gather buffer should be large enough to hold the results from all processors.
- recvcount - number of elements for any single receive
- root - rank of the receiving process
Scatter
int MPI_Scatter(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)
- sendbuf - address of the send buffer; ignored on all non-root processes
- sendcount - number of elements sent to each process
- recvbuf - address of the receive buffer
- recvcount - number of elements in the receive buffer
- root - rank of the sending process
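A small sketch tying scatter and gather together: the root scatters one integer to each rank, each rank modifies its piece, and the root gathers the results back. The fixed buffer size assumes the job is run with exactly 4 processes; a real code would size the buffers from the communicator size.

#include <stdio.h>
#include "mpi.h"

#define NPROCS 4    /* assumed number of processes for this sketch */

int main(int argc, char *argv[])
{
  int rank, i, piece;
  int sendbuf[NPROCS], recvbuf[NPROCS];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  if (rank == 0)                       /* only the root's send buffer matters */
    for (i = 0; i < NPROCS; i++) sendbuf[i] = 10*i;

  /* rank i receives sendbuf[i] into its own variable "piece" */
  MPI_Scatter(sendbuf, 1, MPI_INT, &piece, 1, MPI_INT, 0, MPI_COMM_WORLD);

  piece = piece + rank;                /* each rank works on its piece */

  /* the root collects one integer from every rank, in rank order */
  MPI_Gather(&piece, 1, MPI_INT, recvbuf, 1, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0)
    for (i = 0; i < NPROCS; i++) printf("recvbuf[%d] = %d\n", i, recvbuf[i]);

  MPI_Finalize();
  return 0;
}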
Allgather
(Diagram: the processes start with A0, B0, C0, D0 respectively; after the call every process holds the full set A0 B0 C0 D0.)
Allgather
int MPI_Allgather(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
- sendbuf - starting address of the send buffer
- sendcount - number of elements in the send buffer
- recvbuf - address of the receive buffer
- recvcount - number of elements for any single receive
- Like gather, except that all processes receive the result.
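A sketch: every rank contributes its rank number and ends up with the full list; note there is no root argument. The receive buffer is sized from the communicator, so this runs with any number of processes (allocating with malloc is just one way to do it):

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, *all;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  all = (int *) malloc(nprocs * sizeof(int));

  /* each rank sends one int; every rank receives one int from every rank */
  MPI_Allgather(&rank, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);

  if (rank == 0)
    for (i = 0; i < nprocs; i++) printf("all[%d] = %d\n", i, all[i]);

  free(all);
  MPI_Finalize();
  return 0;
}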
Alltoall
(Diagram: each process starts with four blocks - A0 A1 A2 A3 on the first process, B0 B1 B2 B3 on the second, and so on - and sends its j-th block to process j; the result is a transpose of the layout, so the first process ends up with A0 B0 C0 D0, the second with A1 B1 C1 D1, etc.)
All to All Scatter/Gather
int MPI_Alltoall(void* sendbuf, int sendcount, MPI_Datatype sendtype, void* recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)
- sendbuf - starting address of the send buffer
- sendcount - number of elements sent to each process
- recvbuf - address of the receive buffer
- recvcount - number of elements received from any single process
- Extension of Allgather to the case where each process sends distinct data to each receiver.
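A sketch of the transpose-like exchange from the diagram: each rank fills one slot per destination, and after the call slot j on rank i holds what rank j placed in its slot i:

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, *sendbuf, *recvbuf;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  sendbuf = (int *) malloc(nprocs * sizeof(int));
  recvbuf = (int *) malloc(nprocs * sizeof(int));

  /* sendbuf[j] is destined for rank j; encode sender and slot for clarity */
  for (i = 0; i < nprocs; i++) sendbuf[i] = 100*rank + i;

  /* one int goes to each rank, one int arrives from each rank */
  MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);

  /* recvbuf[i] now holds 100*i + rank, i.e. the value rank i aimed at us */
  for (i = 0; i < nprocs; i++)
    printf("rank %d received %d from rank %d\n", rank, recvbuf[i], i);

  free(sendbuf); free(recvbuf);
  MPI_Finalize();
  return 0;
}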
Additional Routines
- Vector versions (the "v" variants, e.g. MPI_Gatherv, MPI_Scatterv) exist for all of these; they allow a varying count of data from each process. A minimal MPI_Gatherv sketch follows below.
- MPI_Alltoallw provides the most general form of communication and indeed generalizes all of the vector variants: it allows separate specification of count, byte displacement, and datatype for each block.
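For illustration, a minimal MPI_Gatherv sketch in which rank i contributes i+1 integers. The root supplies a per-rank count array and a displacement array saying where each contribution lands in the receive buffer; both arrays are significant only at the root.

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, j, total = 0;
  int *mydata, *counts = NULL, *displs = NULL, *recvbuf = NULL;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  /* rank i sends i+1 copies of its rank number */
  mydata = (int *) malloc((rank + 1) * sizeof(int));
  for (i = 0; i <= rank; i++) mydata[i] = rank;

  if (rank == 0) {
    counts = (int *) malloc(nprocs * sizeof(int));
    displs = (int *) malloc(nprocs * sizeof(int));
    for (i = 0; i < nprocs; i++) {
      counts[i] = i + 1;        /* how many elements rank i sends   */
      displs[i] = total;        /* where they start in recvbuf      */
      total += counts[i];
    }
    recvbuf = (int *) malloc(total * sizeof(int));
  }

  MPI_Gatherv(mydata, rank + 1, MPI_INT,
              recvbuf, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);

  if (rank == 0) {
    for (j = 0; j < total; j++) printf("%d ", recvbuf[j]);
    printf("\n");
    free(counts); free(displs); free(recvbuf);
  }

  free(mydata);
  MPI_Finalize();
  return 0;
}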
Identical send/receive buffers
- The send and receive data buffers for all collective operations are normally two distinct (disjoint) buffers.
- To reuse the same buffer, specify the MPI constant MPI_IN_PLACE instead of the send or receive buffer argument.
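Which argument MPI_IN_PLACE replaces depends on the call (the root's sendbuf for MPI_Reduce and MPI_Gather, every process's sendbuf for MPI_Allreduce, for example). A sketch with MPI_Allreduce, where the global sum overwrites each rank's own value:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, value;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  value = rank;   /* this buffer holds the input and will hold the result */

  /* MPI_IN_PLACE replaces the send buffer; "value" is reduced in place */
  MPI_Allreduce(MPI_IN_PLACE, &value, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d: global sum of ranks = %d\n", rank, value);

  MPI_Finalize();
  return 0;
}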
Global Reduction Operations
- Used to compute a result involving data distributed over a group of processes.
- Examples: global sum or product; global maximum or minimum; a global user-defined operation.
Predefined Reduction Operations
- MPI_MAX - maximum value
- MPI_MIN - minimum value
- MPI_SUM - sum
- MPI_PROD - product
- MPI_LAND, MPI_BAND - logical, bitwise AND
- MPI_LOR, MPI_BOR - logical, bitwise OR
- MPI_LXOR, MPI_BXOR - logical, bitwise exclusive OR
- MPI_MAXLOC - maximum value and its location
- MPI_MINLOC - minimum value and its location
User Defined Operators
- In addition to the predefined reduction operations, users may define their own.
- Use MPI_Op_create to register the operation with MPI.
- The operation must have the required signature:
  C: an MPI_User_function, i.e. void func(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
  Fortran: SUBROUTINE USER_FUNCTION(INVEC, INOUTVEC, LEN, TYPE)
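A sketch of registering and using a user-defined operation (a hand-written product here, simply to stand in for an operation MPI does not already provide). The function must combine len elements of invec into inoutvec:

#include <stdio.h>
#include "mpi.h"

/* user function: inoutvec[i] = invec[i] op inoutvec[i] for each element */
void my_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype)
{
  int i;
  double *in = (double *) invec, *inout = (double *) inoutvec;
  for (i = 0; i < *len; i++) inout[i] = in[i] * inout[i];
}

int main(int argc, char *argv[])
{
  int rank;
  double x, result;
  MPI_Op myop;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* register the function; the 1 (true) says the operation is commutative */
  MPI_Op_create(my_prod, 1, &myop);

  x = rank + 1.0;                       /* reduce 1*2*3*...*nprocs */
  MPI_Reduce(&x, &result, 1, MPI_DOUBLE, myop, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("product = %f\n", result);

  MPI_Op_free(&myop);
  MPI_Finalize();
  return 0;
}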
Reduce
(Diagram: with a sum operation, corresponding entries from the processes are combined - A0+A1+A2, B0+B1+B2, C0+C1+C2 - and the combined vector is delivered to the root.)
Reduce
int MPI_Reduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype type, MPI_Op op, int root, MPI_Comm comm)
- sendbuf - address of the send buffer
- recvbuf - address of the receive buffer
- count - number of elements in the send buffer
- op - reduce operation
- root - rank of the receiving process
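A minimal sketch: each rank contributes its own rank number and the sum lands on the root (rank 0); the receive buffer is only meaningful there:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, sum;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* the sum of all ranks ends up in "sum" on rank 0 only */
  MPI_Reduce(&rank, &sum, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

  if (rank == 0) printf("sum of ranks = %d\n", sum);

  MPI_Finalize();
  return 0;
}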
Variants of MPI_REDUCE
- MPI_ALLREDUCE - no root process; every process receives the result
- MPI_REDUCE_SCATTER - the result is scattered across the processes
- MPI_SCAN - "parallel prefix" reduction
Allreduce
(Diagram: the same combined results as Reduce - A0+A1+A2, B0+B1+B2, C0+C1+C2 - but every process receives the full result vector.)
Allreduce
int MPI_Allreduce(void* sendbuf, void* recvbuf, int count, MPI_Datatype type, MPI_Op op, MPI_Comm comm)
- sendbuf - address of the send buffer
- recvbuf - address of the receive buffer
- count - number of elements in the send buffer
- op - reduce operation
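A common use is a global dot product: each rank computes its local partial sum and MPI_Allreduce gives every rank the global value. A sketch with made-up local data (the vector length N and values are arbitrary):

#include <stdio.h>
#include "mpi.h"

#define N 100   /* local vector length, chosen arbitrarily for this sketch */

int main(int argc, char *argv[])
{
  int rank, i;
  double x[N], y[N], local = 0.0, global;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  for (i = 0; i < N; i++) { x[i] = 1.0; y[i] = 2.0; }   /* stand-in data */

  for (i = 0; i < N; i++) local += x[i]*y[i];           /* local piece   */

  /* every rank gets the full dot product */
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  printf("rank %d: global dot product = %f\n", rank, global);

  MPI_Finalize();
  return 0;
}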
Reduce - Scatter
(Diagram: the combined results are split up, so each process receives one piece - A0+A1+A2 on the first process, B0+B1+B2 on the second, C0+C1+C2 on the third.)
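The corresponding call is MPI_Reduce_scatter, which takes an array of receive counts (one per process) instead of a root. In this sketch each rank contributes a full vector of partial values and gets back just the one combined element it "owns":

#include <stdio.h>
#include <stdlib.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, nprocs, i, *recvcounts;
  double *partial, mine;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

  /* each rank contributes one partial value for every element 0..nprocs-1 */
  partial = (double *) malloc(nprocs * sizeof(double));
  for (i = 0; i < nprocs; i++) partial[i] = rank + i;

  /* every rank receives exactly one combined element */
  recvcounts = (int *) malloc(nprocs * sizeof(int));
  for (i = 0; i < nprocs; i++) recvcounts[i] = 1;

  /* element i of the global sum is delivered to rank i */
  MPI_Reduce_scatter(partial, &mine, recvcounts, MPI_DOUBLE, MPI_SUM,
                     MPI_COMM_WORLD);

  printf("rank %d owns combined element %f\n", rank, mine);

  free(partial); free(recvcounts);
  MPI_Finalize();
  return 0;
}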
Scan
(Diagram: with a sum operation, process i receives the prefix sums over processes 0..i - the first process gets A0, B0, C0; the second gets A0+A1, B0+B1, C0+C1; the third gets A0+A1+A2, B0+B1+B2, C0+C1+C2.)
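The call is MPI_Scan, whose argument list matches MPI_Allreduce. A classic use is computing each rank's offset into a global array from per-rank counts: the inclusive prefix sum minus the local count gives the starting index. A sketch with made-up counts:

#include <stdio.h>
#include "mpi.h"

int main(int argc, char *argv[])
{
  int rank, mycount, prefix, offset;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  mycount = rank + 1;            /* pretend rank i owns i+1 items */

  /* inclusive prefix sum: prefix = sum of counts on ranks 0..rank */
  MPI_Scan(&mycount, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

  offset = prefix - mycount;     /* starting index of this rank's items */
  printf("rank %d: items start at global index %d\n", rank, offset);

  MPI_Finalize();
  return 0;
}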
Example: Find the maximum value of a function over some range
- Let's pick a simple function: y = cos(2πx + π/4).
- The range of 2πx will be 0 → 2π, so x will vary from 0 → 1.
- Break the x range into evenly sized blocks, one block per processor.
Example: Find the maximum value of a function
/* program findmax */
/* find the maximum of the function y = cos(2*pi*x + pi/4) across processors */
#include <stdio.h>
#include <math.h>
#include <float.h>
#include "mpi.h"
#define numpts 100

int main(int argc, char* argv[])
{
findmax cont.
  int numprocs, i, myrank;
  float pi, twopi, phase, blksze, x, delx;
  struct {
    float data;
    int idx;
  } y[numpts], ymax, ymymax;

  /* Compute number of processes and myrank */
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
  MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
findmax cont.
  /* compute the block size and spacing of x points */
  blksze = 1.0/numprocs;   /* 1.0 is the max value of x */
  delx = blksze/numpts;

  /* define some constants */
  pi = acos(-1.0);
  phase = pi/4;
  twopi = 2*pi;

  /* note: x runs from 0 to 1-delx so that y covers a full 2*pi */
  /* initialize across all processors */
  for (i = 0; i < numpts; i++) {
    x = blksze*myrank + i*delx;
    y[i].idx = numpts*myrank + i;
    y[i].data = cos(twopi*x + phase);
  }
findmax cont.
  /* Now find the max on each local processor */
  ymymax.data = -FLT_MAX;
  ymymax.idx = 0;
  for (i = 0; i < numpts; i++) {
    if (y[i].data > ymymax.data) {
      ymymax.data = y[i].data;
      ymymax.idx = y[i].idx;
    }
  }

  /* Now find the max across the processors */
  MPI_Reduce(&ymymax, &ymax, 1, MPI_FLOAT_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);
findmax cont.
  /* now print out the answer */
  if (myrank == 0) {
    printf("The maximum value of the function is %f which occurs at x = %f \n",
           ymax.data, ymax.idx*delx);
  }

  /* call barrier and exit */
  MPI_Barrier(MPI_COMM_WORLD);
  MPI_Finalize();
  return 0;
}
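To try the program, the exact compile and launch commands depend on the local MPI installation; with the usual wrapper compiler and launcher a typical sequence looks something like the following (names, flags, and process counts may differ on your cluster):

mpicc -o findmax findmax.c -lm
mpirun -np 4 ./findmax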
Results
- Results for numpts = 10,000: the maximum value of the function is 1.0, which occurs at x = 0.875, i.e. at x = 7/8, or 2πx = 1¾π.
- In the previous example we found the maximum in data space and then reduced across processor space. We could have reversed this: reduce across processor space first and then find the maximum. What are the disadvantages of this second method?