1 Collective Communications

2 Overview
- All processes in a group participate in the communication, by calling the same function with matching arguments.
- Types of collective operations:
  - Synchronization: MPI_Barrier
  - Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  - Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan
- Collective routines are blocking:
  - Completion of the call means the communication buffer can be accessed.
  - It gives no indication of the completion status on other processes.
  - It may or may not have the effect of synchronizing the processes.

3 Overview
- Collective communications can use the same communicators as point-to-point (PtP) communications.
- MPI guarantees that messages from collective communications will not be confused with PtP messages.
- The key concept is the group of processes participating in the communication.
- If you want only a subgroup of processes involved in a collective communication, you need to create a sub-group/sub-communicator from MPI_COMM_WORLD.
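
As a hedged illustration of the last point, here is a minimal sketch of carving a sub-communicator out of MPI_COMM_WORLD with MPI_Comm_split; the even/odd color choice is only an example, not something from the slides.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int world_rank, sub_rank;
    MPI_Comm sub_comm;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);

    // example split: even-ranked and odd-ranked processes form two sub-communicators
    int color = world_rank % 2;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    MPI_Comm_rank(sub_comm, &sub_rank);
    // collective calls on sub_comm now involve only the processes with the same color
    MPI_Barrier(sub_comm);
    printf("world rank %d has rank %d in its sub-communicator\n", world_rank, sub_rank);

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}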

4 Barrier
- Blocks the calling process until all group members have called it.
- Affects performance; refrain from using it unless necessary.

int MPI_Barrier(MPI_Comm comm)

MPI_BARRIER(COMM, IERROR)
integer COMM, IERROR

…
MPI_Barrier(MPI_COMM_WORLD); // synchronization point
…
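
One legitimate use of a barrier is timing a code section; below is a minimal sketch (not from the slides) combining MPI_Barrier with MPI_Wtime.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double t0, t1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);   // make sure all processes start together
    t0 = MPI_Wtime();

    // ... code section to be timed ...

    MPI_Barrier(MPI_COMM_WORLD);   // wait for the slowest process
    t1 = MPI_Wtime();

    if (rank == 0) printf("elapsed time: %g seconds\n", t1 - t0);

    MPI_Finalize();
    return 0;
}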

5 Broadcast
- Broadcasts a message from the process with rank root to all processes in the group, including itself.
- comm and root must be the same on all processes.
- The amount of data sent must equal the amount of data received, pairwise between each process and the root.
- For now, this means count and datatype must be the same on all processes; they may differ when generalized datatypes are involved.

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm)

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
<type> BUFFER(*)
integer COUNT, DATATYPE, ROOT, COMM, IERROR

int num = -1;
if (my_rank == 0) num = 100;
…
MPI_Bcast(&num, 1, MPI_INT, 0, MPI_COMM_WORLD); // afterwards, num == 100 on every process
…

6 Gather
- Gathers messages to root; they are concatenated in rank order at the root process.
- recvbuf, recvcount, recvtype matter only at root; they are ignored on the other processes.
- root and comm must be identical on all processes.
- recvbuf and sendbuf cannot be the same buffer on the root process.
- The amount of data sent from a process must equal the amount of data received at root.
- For now, recvcount = sendcount and recvtype = sendtype.
- recvcount is the number of items received from each process, not the total number of items received, and not the size of the receive buffer!

int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

MPI_GATHER(SENDBUF, SENDCOUNT, SENDTYPE, RECVBUF, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR)
<type> SENDBUF(*), RECVBUF(*)
integer SENDCOUNT, SENDTYPE, RECVCOUNT, RECVTYPE, ROOT, COMM, IERROR

7 Gather Example

int rank, ncpus;
int root = 0;
int *data_received = NULL, data_send[100];
// assume running with 10 cpus
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Comm_size(MPI_COMM_WORLD, &ncpus);
if (rank == root)
    data_received = new int[100*ncpus]; // 100*10
MPI_Gather(data_send, 100, MPI_INT, data_received, 100, MPI_INT, root, MPI_COMM_WORLD); // ok
// MPI_Gather(data_send, 100, MPI_INT, data_received, 100*ncpus, MPI_INT, root, MPI_COMM_WORLD); // wrong

8 Gather to All
- The messages, concatenated in rank order, are received by all processes.
- recvcount is the number of items from each process, not the total number of items received.
- For now, sendcount = recvcount and sendtype = recvtype.

int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

int A[100], B[1000];
// assume 10 processors
MPI_Allgather(A, 100, MPI_INT, B, 100, MPI_INT, MPI_COMM_WORLD); // ok?
...
MPI_Allgather(A, 100, MPI_INT, B, 1000, MPI_INT, MPI_COMM_WORLD); // ok?

9 Scatter
- Inverse of MPI_Gather.
- Splits the message into ncpus equal segments; the n-th segment goes to the n-th process.
- sendbuf, sendcount, sendtype matter only at root; they are ignored on the other processes.
- sendcount is the number of items sent to each process, not the total number of items in sendbuf.

int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, int root, MPI_Comm comm)

10 Scatter Example

int A[1000], B[100];
... // initialize A etc.
// assume 10 processors
MPI_Scatter(A, 100, MPI_INT, B, 100, MPI_INT, 0, MPI_COMM_WORLD); // ok?
...
MPI_Scatter(A, 1000, MPI_INT, B, 100, MPI_INT, 0, MPI_COMM_WORLD); // ok?

11 All-to-All
- Important for distributed matrix transposition; critical to FFT-based algorithms.
- The most stressful communication pattern.
- sendcount is the number of items sent to each process, not the total number of items in sendbuf.
- recvcount is the number of items received from each process, not the total number of items received.
- For now, sendcount = recvcount and sendtype = recvtype.

int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype sendtype, void *recvbuf, int recvcount, MPI_Datatype recvtype, MPI_Comm comm)

12 All-to-All Example

(Figure: with 4 cpus, before the call the 4-element buffers A on cpus 0-3 hold
  cpu 0: 0 1 2 3
  cpu 1: 4 5 6 7
  cpu 2: 8 9 10 11
  cpu 3: 12 13 14 15
and after the all-to-all with count 1 the buffers B on cpus 0-3 hold
  cpu 0: 0 4 8 12
  cpu 1: 1 5 9 13
  cpu 2: 2 6 10 14
  cpu 3: 3 7 11 15)

double A[4], B[4];
... // assume 4 cpus
for (i = 0; i < 4; i++) A[i] = 4*my_rank + i;
MPI_Alltoall(A, 4, MPI_DOUBLE, B, 4, MPI_DOUBLE, MPI_COMM_WORLD); // ok?
MPI_Alltoall(A, 1, MPI_DOUBLE, B, 1, MPI_DOUBLE, MPI_COMM_WORLD); // ok?
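
A self-contained, hedged version of this example (assuming exactly 4 processes and the same illustrative data as the figure) that prints the buffers after the count-1 all-to-all:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int my_rank, ncpus, i;
    double A[4], B[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    if (ncpus != 4) {   // this sketch assumes exactly 4 processes
        if (my_rank == 0) printf("run with 4 processes\n");
        MPI_Finalize();
        return 1;
    }

    for (i = 0; i < 4; i++) A[i] = 4*my_rank + i;

    // each process sends A[k] to process k and receives B[k] from process k
    MPI_Alltoall(A, 1, MPI_DOUBLE, B, 1, MPI_DOUBLE, MPI_COMM_WORLD);

    printf("cpu %d: A = %g %g %g %g   B = %g %g %g %g\n", my_rank,
           A[0], A[1], A[2], A[3], B[0], B[1], B[2], B[3]);

    MPI_Finalize();
    return 0;
}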

13 Reduction
- Perform global reduction operations (sum, max, min, and, etc.) across processors.
- MPI_Reduce – return result to one processor
- MPI_Allreduce – return result to all processors
- MPI_Reduce_scatter – scatter reduction result across processors
- MPI_Scan – parallel prefix operation
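
MPI_Reduce_scatter is the only routine in this list without an example in the later slides; here is a minimal hedged sketch (the one-element-per-process split below is just an illustrative choice):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, ncpus, i, recvbuf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    // each process contributes a vector of length ncpus
    int *sendbuf = malloc(ncpus * sizeof(int));
    int *recvcounts = malloc(ncpus * sizeof(int));

    for (i = 0; i < ncpus; i++) {
        sendbuf[i] = rank + i;   // arbitrary illustrative data
        recvcounts[i] = 1;       // each process gets 1 element of the reduced vector
    }

    // element-wise sum across processes, then scatter: process i receives element i
    MPI_Reduce_scatter(sendbuf, &recvbuf, recvcounts, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("cpu %d: my piece of the reduced vector = %d\n", rank, recvbuf);

    free(sendbuf);
    free(recvcounts);
    MPI_Finalize();
    return 0;
}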

14 Reduction
- Element-wise, combine the data from the input buffers across processors using operation op; store the results in the output buffer on processor root.
- All processes must provide input/output buffers of the same length and data type.
- The operation op must be associative:
  - Pre-defined operations
  - User-defined operations

int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)

int rank, res;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Reduce(&rank, &res, 1, MPI_INT, MPI_MAX, 0, MPI_COMM_WORLD);
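
As a hedged illustration of the last bullet (user-defined operations), a minimal sketch using MPI_Op_create; the function name my_prod and the choice of element-wise product are only examples:

#include <mpi.h>
#include <stdio.h>

// user-defined reduction operation: element-wise product (commutative and associative)
void my_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype)
{
    int i;
    double *in = (double *)invec, *inout = (double *)inoutvec;
    for (i = 0; i < *len; i++) inout[i] = in[i] * inout[i];
}

int main(int argc, char **argv)
{
    int rank;
    double x, prod;
    MPI_Op op;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    x = rank + 1.0;                  // data to be reduced
    MPI_Op_create(my_prod, 1, &op);  // 1 = the operation is commutative
    MPI_Reduce(&x, &prod, 1, MPI_DOUBLE, op, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("product of (rank+1) over all ranks = %g\n", prod);

    MPI_Op_free(&op);
    MPI_Finalize();
    return 0;
}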

15 Pre-Defined Operations

MPI_MAX      maximum
MPI_MIN      minimum
MPI_SUM      sum
MPI_PROD     product
MPI_LAND     logical AND
MPI_LOR      logical OR
MPI_BAND     bitwise AND
MPI_BOR      bitwise OR
MPI_LXOR     logical XOR
MPI_BXOR     bitwise XOR
MPI_MAXLOC   max + location
MPI_MINLOC   min + location
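
A hedged sketch of MPI_MAXLOC, which reduces a (value, location) pair; the struct layout below matches the predefined MPI_DOUBLE_INT pair type, and the local values are only illustrative:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    struct { double val; int rank; } in, out;   // layout of MPI_DOUBLE_INT

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    in.val  = 1.0 / (rank + 1);   // arbitrary illustrative local value
    in.rank = rank;               // "location" carried along with the value

    // on root 0: out.val = global maximum, out.rank = rank that owns it
    MPI_Reduce(&in, &out, 1, MPI_DOUBLE_INT, MPI_MAXLOC, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("max value %g found on rank %d\n", out.val, out.rank);

    MPI_Finalize();
    return 0;
}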

16 All Reduce
- The reduction result is stored on all processors.

int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)

int rank, res;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
MPI_Allreduce(&rank, &res, 1, MPI_INT, MPI_MAX, MPI_COMM_WORLD);
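
A hedged sketch of a typical MPI_Allreduce use, a global dot product where every process needs the result (the vector length and data are illustrative, not from the slides):

#include <mpi.h>
#include <stdio.h>

#define NLOC 100   // local vector length (illustrative)

int main(int argc, char **argv)
{
    int rank, i;
    double x[NLOC], y[NLOC], local_dot = 0.0, global_dot;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    for (i = 0; i < NLOC; i++) { x[i] = 1.0; y[i] = 2.0; }   // illustrative data
    for (i = 0; i < NLOC; i++) local_dot += x[i] * y[i];     // local partial sum

    // sum the partial dot products; every process gets the global result
    MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

    printf("rank %d: global dot product = %g\n", rank, global_dot);

    MPI_Finalize();
    return 0;
}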

17 Scan
- Prefix reduction.
- On process j, returns the result of the reduction over the input buffers of processes 0, 1, …, j.

int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
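
A hedged sketch of a common MPI_Scan use, computing each process's offset into a global ordering from per-process counts (the counts are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, my_count, prefix_sum, my_offset;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    my_count = rank + 1;   // illustrative: each process owns a different number of items

    // inclusive prefix sum: prefix_sum = sum of counts on ranks 0..rank
    MPI_Scan(&my_count, &prefix_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    // exclusive offset: where this process's items start in the global ordering
    my_offset = prefix_sum - my_count;

    printf("rank %d: count = %d, offset = %d\n", rank, my_count, my_offset);

    MPI_Finalize();
    return 0;
}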

18 Example: Matrix Transpose
- A – NxN matrix, distributed on P cpus, row-wise decomposition.
- B = A^T, also distributed on P cpus, row-wise decomposition.
- Viewing A as PxP blocks A_ij, each an (N/P)x(N/P) matrix, B_ij = A_ji^T.
- Input: A[i][j] = 2*i + j.

(Figure, for P = 3: A = [A_11 A_12 A_13; A_21 A_22 A_23; A_31 A_32 A_33]. A local transpose of each block gives [A_11^T A_12^T A_13^T; A_21^T A_22^T A_23^T; A_31^T A_32^T A_33^T]; an all-to-all then exchanges the blocks to give B = [A_11^T A_21^T A_31^T; A_12^T A_22^T A_32^T; A_13^T A_23^T A_33^T].)

19 Example: Matrix Transpose
- On each cpu, the local piece of A is an (N/P)xN matrix; it first needs to be re-written as P blocks of (N/P)x(N/P) matrices, and then the local transpose can be done.
- (Figure: with N/P = 2 and N = 4, a local 2x4 matrix stored as 0 1 2 3 4 5 6 7 is re-written as two 2x2 blocks stored as 0 1 4 5 and 2 3 6 7.)
- After the all-to-all communication, each cpu has P blocks of (N/P)x(N/P) matrices, which need to be merged into an (N/P)xN matrix.
- Four steps:
  1. Divide A into blocks;
  2. Transpose each block locally;
  3. All-to-all communication;
  4. Merge blocks locally.

20 Matrix Transposition

#include <mpi.h>
#include <stdio.h>
#include <string.h>
#include "dmath.h"

#define DIM 1000 // global matrices A, B are DIM x DIM

int main(int argc, char **argv)
{
    int ncpus, my_rank, i, j, iblock;
    int Nx, Ny; // Nx=DIM/ncpus, Ny=DIM, local arrays: A[Nx][Ny], B[Nx][Ny]
    double **A, **B, *Ctmp, *Dtmp;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

    if(DIM%ncpus != 0) { // make sure DIM can be divided by ncpus
        if(my_rank==0) printf("ERROR: DIM cannot be divided by ncpus!\n");
        MPI_Finalize();
        return -1;
    }

    Nx = DIM/ncpus;
    Ny = DIM;
    A = DMath::newD(Nx, Ny); // allocate memory
    B = DMath::newD(Nx, Ny);
    Ctmp = DMath::newD(Nx*Ny); // work space
    Dtmp = DMath::newD(Nx*Ny); // work space

    for(i=0;i<Nx;i++)
        for(j=0;j<Ny;j++)
            A[i][j] = 2*(my_rank*Nx+i) + j;
    memset(&B[0][0], '\0', sizeof(double)*Nx*Ny); // zero out B

21

    // divide A into blocks --> Ctmp; A[i][iblock*Nx+j] --> Ctmp[iblock][i][j]
    for(i=0;i<Nx;i++)
        for(iblock=0;iblock<ncpus;iblock++)
            for(j=0;j<Nx;j++)
                Ctmp[iblock*Nx*Nx+i*Nx+j] = A[i][iblock*Nx+j];

    // local transpose of A --> Dtmp; Ctmp[iblock][i][j] --> Dtmp[iblock][j][i]
    for(iblock=0;iblock<ncpus;iblock++)
        for(i=0;i<Nx;i++)
            for(j=0;j<Nx;j++)
                Dtmp[iblock*Nx*Nx+i*Nx+j] = Ctmp[iblock*Nx*Nx+j*Nx+i];

    // All-to-all comm --> Ctmp
    MPI_Alltoall(Dtmp, Nx*Nx, MPI_DOUBLE, Ctmp, Nx*Nx, MPI_DOUBLE, MPI_COMM_WORLD);

    // merge blocks --> B; Ctmp[iblock][i][j] --> B[i][iblock*Nx+j]
    for(i=0;i<Nx;i++)
        for(iblock=0;iblock<ncpus;iblock++)
            for(j=0;j<Nx;j++)
                B[i][iblock*Nx+j] = Ctmp[iblock*Nx*Nx+i*Nx+j];

    // clean up
    DMath::del(A);
    DMath::del(B);
    DMath::del(Ctmp);
    DMath::del(Dtmp);

    MPI_Finalize();
    return 0;
}
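
A hedged way to sanity-check the result (not in the original slides): since the global matrix is A[I][J] = 2*I + J, its transpose must satisfy B[I][J] = 2*J + I, i.e. locally B[i][j] = 2*j + (my_rank*Nx + i). The helper below assumes the same variables as the program above and could be called just before the clean-up, e.g. check_transpose(B, Nx, my_rank);

// possible correctness check for the transposed matrix B
void check_transpose(double **B, int Nx, int my_rank)
{
    int i, j, errors = 0;
    for (i = 0; i < Nx; i++)
        for (j = 0; j < DIM; j++)
            if (B[i][j] != 2.0*j + (my_rank*Nx + i))
                errors++;
    printf("cpu %d: %d wrong entries in B\n", my_rank, errors);
}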

22 Project #1: FFT of 3D Matrix
- A: 3D matrix of real numbers, NxNxN.
- Distributed over P cpus:
  - 1D decomposition: x direction in C, z direction in FORTRAN;
  - (bonus) 2D decomposition: x and y directions in C, or y and z directions in FORTRAN.
- Compute the 3D FFT of this matrix using the fftw library (www.fftw.org).

(Figure: 1D decomposition – each cpu holds an (N/P) x N x N slab of the N x N x N domain, with axes labeled x, y, z.)

23 Project #1
- The FFTW library will be available on ITAP machines.
- The FFTW user's manual is available at www.fftw.org; refer to the manual on how to use the fftw functions.
- FFTW is serial.
  - It has an MPI-parallel version (fftw 2.1.5), suitable for 1D decomposition.
  - You cannot use the fftw MPI routines for this project.
- The 3D FFT can be done in several steps, e.g.:
  - First, real-to-complex FFT in the z direction
  - Then, complex FFT in the y direction
  - Then, complex FFT in the x direction
- When doing the FFT in a direction, e.g. the x direction, if the matrix is distributed/decomposed in that direction:
  - you need to first do a matrix transposition to get all data along that direction,
  - then call the fftw function to perform the FFT along that direction,
  - then you may/will need to transpose the matrix back.
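
A minimal sketch, assuming FFTW 3's plan/execute interface (the slides mention fftw 2.1.5 for the MPI version, so the installed API may differ), of one real-to-complex 1D transform, the building block that would be applied line by line along each direction:

#include <stdio.h>
#include <fftw3.h>

#define N 8   // illustrative transform length

int main(void)
{
    int k;
    double in[N];
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * (N/2 + 1));

    // plan a 1D real-to-complex transform of length N
    fftw_plan p = fftw_plan_dft_r2c_1d(N, in, out, FFTW_ESTIMATE);

    for (k = 0; k < N; k++) in[k] = k;   // illustrative data (FFTW_ESTIMATE planning does not touch the arrays)

    fftw_execute(p);   // out[0..N/2] holds the non-redundant complex coefficients

    for (k = 0; k <= N/2; k++)
        printf("out[%d] = %g + %gi\n", k, out[k][0], out[k][1]);

    fftw_destroy_plan(p);
    fftw_free(out);
    return 0;
}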

24 Project #1
- Write a parallel C, C++, or FORTRAN program to first compute the FFT of matrix A and store the result in matrix B, then compute the inverse FFT of B and store the result in C. Check the correctness of your code by comparing the data in A and C. Make sure your program is correct by testing with some small matrices, e.g. a 4x4x4 matrix.
- If you want the bonus points, you can implement only the 2D data decomposition; then let the number of cpus in one direction be 1, and your code will also be able to handle 1D data decompositions.
- Let A be a matrix of size 256x256x256, with A[i][j][k] = 3*i + 2*j + k.
- Run your code on 1, 2, 4, 8, 16 processors, and record the wall-clock time of the main code section doing the work (transpositions, FFTs, inverse FFTs, etc.) using MPI_Wtime().
- Compute the speedup factors, Sp = T1/Tp.
- Turn in:
  - Your source code + a compiled binary on hamlet or radon
  - A plot of speedup vs. number of CPUs for each data decomposition
  - A write-up of what you have learned from this project.
- Due: 10/30

25 (Figure: data-decomposition diagram labeled with dimensions N and N/P.)

