
1 Non-Blocking Communications

2 Example

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int my_rank, ncpus;
  int left_neighbor, right_neighbor;
  int data_received = -1;
  int tag = 101;
  MPI_Status statSend, statRecv;
  MPI_Request reqSend, reqRecv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  left_neighbor = (my_rank-1 + ncpus)%ncpus;
  right_neighbor = (my_rank+1)%ncpus;

  MPI_Isend(&my_rank, 1, MPI_INT, left_neighbor, tag, MPI_COMM_WORLD, &reqSend);        // comm start
  MPI_Irecv(&data_received, 1, MPI_INT, right_neighbor, tag, MPI_COMM_WORLD, &reqRecv);

  // maybe do something useful here

  MPI_Wait(&reqSend, &statSend);   // complete comm
  MPI_Wait(&reqRecv, &statRecv);

  printf("Among %d processes, process %d received from right neighbor: %d\n",
         ncpus, my_rank, data_received);

  // clean up
  MPI_Finalize();
  return 0;
}

Sample run:

mpirun -np 4 test_shift
Among 4 processes, process 3 received from right neighbor: 0
Among 4 processes, process 2 received from right neighbor: 3
Among 4 processes, process 0 received from right neighbor: 1
Among 4 processes, process 1 received from right neighbor: 2

3 Semantics etc

Purpose:
- Mechanism for overlapping communication and useful computation. Communication and computation may proceed concurrently: latency hiding.
- Deadlock avoidance.
- May avoid system buffering and memory-to-memory copying, and improve performance.

Structure of non-blocking calls:

Post communication request → non-blocking call, MPI_Isend ...
... // do some useful work
Complete communication call → MPI_Wait, MPI_Test, ...

4 Semantics etc

Non-blocking calls: MPI_Isend, MPI_Irecv, etc.
- Return immediately; they merely post a request to the system to initiate the communication.
- However, the communication is not completed yet.
- Cannot tamper with the memory provided in these calls until the communication is completed by calling MPI_Wait or MPI_Test etc.

(Figures: timelines of a non-blocking send and a non-blocking receive)

5 Non-blocking Send/Recv

int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_ISEND(BUF, COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR)
  <type> BUF(*)
  INTEGER COUNT, DATATYPE, DEST, TAG, COMM, REQUEST, IERROR

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source,
              int tag, MPI_Comm comm, MPI_Request *request)

MPI_IRECV(BUF, COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR)
  <type> BUF(*)
  INTEGER COUNT, DATATYPE, SOURCE, TAG, COMM, REQUEST, IERROR

These calls post send/recv requests to the MPI system. The calls return immediately, but do not access the memory pointed to by buf until the communication is completed.

MPI_Request request is a handle to an internal MPI object. Everything about that non-blocking communication is done through this handle. MPI_REQUEST_NULL is the null request.

MPI_Request req1, req2;
double A[10], B[5];
...
MPI_Isend(A, 10, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req1);
MPI_Irecv(B, 5, MPI_DOUBLE, rank, tag, MPI_COMM_WORLD, &req2);

6 Other Non-blocking Sends

4 communication modes, with the same semantics as the blocking sends:
- MPI_ISEND – standard mode
- MPI_IBSEND – buffered mode
- MPI_ISSEND – synchronous mode
- MPI_IRSEND – ready mode

Arguments are identical to MPI_Isend:

int MPI_Ibsend(void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Issend(void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irsend(void *buf, int count, MPI_Datatype datatype, int dest,
               int tag, MPI_Comm comm, MPI_Request *request)
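As a minimal sketch (not from the original slides), the complete program below uses the synchronous-mode variant MPI_Issend in the same ring pattern as the earlier example; a synchronous-mode send completes only after the matching receive has started, so MPI_Wait on the send request also confirms the receiver has begun receiving.

/* Sketch: synchronous-mode non-blocking send in a ring. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int my_rank, ncpus, tag = 7, recv_val = -1;
  MPI_Request req_send, req_recv;
  MPI_Status  stat_send, stat_recv;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  int right = (my_rank+1)%ncpus;
  int left  = (my_rank-1 + ncpus)%ncpus;

  // post a synchronous-mode send and a matching non-blocking receive
  MPI_Issend(&my_rank, 1, MPI_INT, right, tag, MPI_COMM_WORLD, &req_send);
  MPI_Irecv(&recv_val, 1, MPI_INT, left, tag, MPI_COMM_WORLD, &req_recv);

  MPI_Wait(&req_recv, &stat_recv);   // data from left neighbor has arrived
  MPI_Wait(&req_send, &stat_send);   // matching receive has started (synchronous mode)

  printf("process %d received %d from its left neighbor\n", my_rank, recv_val);
  MPI_Finalize();
  return 0;
}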

7 Completion

Use MPI_Wait or MPI_Test to complete a non-blocking communication.

Semantics: after MPI_Wait returns,
- for a standard send, the message data has been safely stored away and it is safe to access the buffer;
- for a receive, the data has been received.

8 MPI_Wait

int MPI_Wait(MPI_Request *request, MPI_Status *status)

MPI_WAIT(REQUEST, STATUS, IERROR)
  INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

request is a handle returned from MPI_Isend, MPI_Irecv, etc.

- Blocks until the communication completes (or fails).
- If the request is from MPI_Isend, MPI_Irecv, etc., it deallocates the request object and sets request to MPI_REQUEST_NULL.
- Returns the status information in status:
  - for MPI_Irecv, it holds additional information;
  - for MPI_Isend, there is not much to be used.

MPI_Request req;
MPI_Status stat;
...
MPI_Irecv(..., &req);
MPI_Wait(&req, &stat);

9 MPI_Test

int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)

MPI_TEST(REQUEST, FLAG, STATUS, IERROR)
  LOGICAL FLAG
  INTEGER REQUEST, STATUS(MPI_STATUS_SIZE), IERROR

- request – MPI_Request object from MPI_Isend, etc.
- flag – true if the communication is complete; false if not yet.
  - If true, the request object is deallocated and set to MPI_REQUEST_NULL.
- status – contains the status information if complete.
- Does not block; returns immediately.
- Provides a mechanism for overlapping communication and computation: do useful computation; periodically check the communication status; if not complete, go back to the computation.
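A hedged sketch of this polling pattern follows; buffer, count, source, tag and do_some_useful_work() are placeholder names, not part of the slides.

MPI_Request req;
MPI_Status  stat;
int flag = 0;

MPI_Irecv(buffer, count, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &req);
while (!flag) {
  MPI_Test(&req, &flag, &stat);   // returns immediately
  if (!flag)
    do_some_useful_work();        // placeholder for a chunk of computation
}
// communication complete: req is now MPI_REQUEST_NULL and buffer may be accessed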

10 Properties

- Order: non-overtaking, order preserved, according to the execution order of the non-blocking calls that initiate the communications.
- Progress: progress is guaranteed.
  - A receive completed by MPI_Wait will eventually return if there is a matching send.
  - A send completed by MPI_Wait will eventually return if there is a matching receive.

MPI_Comm_rank(comm, &rank);
if(rank==0) {
  MPI_Isend(A, 1, MPI_DOUBLE, 1, 99, comm, &req1);
  MPI_Isend(B, 1, MPI_DOUBLE, 1, 99, comm, &req2);
}
else if(rank==1) {
  MPI_Irecv(A, 1, MPI_DOUBLE, 0, MPI_ANY_TAG, comm, &req1);
  MPI_Irecv(B, 1, MPI_DOUBLE, 0, 99, comm, &req2);
}
MPI_Wait(&req1, &stat1);
MPI_Wait(&req2, &stat2);

11 MPI_Wait Variants

These deal with arrays of requests, e.g. MPI_Request req[4];

MPI_Waitall:
  int MPI_Waitall(int count, MPI_Request *request, MPI_Status *status)
  - Blocks until all active requests in the array complete; returns the statuses of all communications.
  - Deallocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Waitany:
  int MPI_Waitany(int count, MPI_Request *req, int *index, MPI_Status *stat)
  - Blocks until one of the active requests in the array completes; returns its index in the array and the status of the completing communication; deallocates that request object. If none of the requests is active, returns index=MPI_UNDEFINED.

MPI_Waitsome:
  int MPI_Waitsome(int incount, MPI_Request *req, int *outcount, int *array_indices, MPI_Status *array_status)
  - Blocks until at least one of the active communications completes; returns the associated indices and statuses of the completed communications; deallocates those request objects. If none of the requests is active, outcount=MPI_UNDEFINED.

MPI_Request req[2];
MPI_Status stat[2];
...
MPI_Isend(..., &req[0]);
MPI_Isend(..., &req[1]);
MPI_Waitall(2, req, stat);

MPI_Request req[2];
MPI_Status stat;
int index;
MPI_Isend(..., &req[0]);
MPI_Isend(..., &req[1]);
MPI_Waitany(2, req, &index, &stat);
...
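A possible use of MPI_Waitsome is sketched below, under the assumption that NREQ receive requests have already been posted into req[] (the posting code is omitted).

#define NREQ 4
MPI_Request req[NREQ];
MPI_Status  stat[NREQ];
int indices[NREQ], outcount, ndone = 0;

/* ... post NREQ MPI_Irecv calls, one per entry of req ... */

while (ndone < NREQ) {
  MPI_Waitsome(NREQ, req, &outcount, indices, stat);  // blocks until at least one completes
  for (int i = 0; i < outcount; i++) {
    /* indices[i] is the slot of a completed request; process its data here */
  }
  ndone += outcount;
}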

12 MPI_Test Variants

MPI_Testall:
  int MPI_Testall(int count, MPI_Request *array_req, int *flag, MPI_Status *array_stat)
  - Returns flag=true if all active requests have completed; returns flag=false otherwise.
  - If true, deallocates the request objects and sets them to MPI_REQUEST_NULL.

MPI_Testany:
  int MPI_Testany(int count, MPI_Request *array_req, int *index, int *flag, MPI_Status *stat)
  - If one of the active communications has completed, returns flag=true with the index and status of the completing communication, and deallocates that request object.
  - Returns flag=false, index=MPI_UNDEFINED if none has completed.
  - Returns flag=true, index=MPI_UNDEFINED if there are no active requests.

MPI_Testsome:
  int MPI_Testsome(int incount, MPI_Request *array_req, int *outcount, int *array_indices, MPI_Status *array_stat)
  - Returns in outcount the number of completed active communications, together with their indices and statuses; deallocates the completed request objects.
  - If none has completed, returns outcount=0.
  - If there are no active communications, outcount=MPI_UNDEFINED.
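A sketch of the polling variant with MPI_Testall, assuming one MPI_Isend and one MPI_Irecv have already been posted into req[0] and req[1]; do_some_useful_work() is a placeholder.

MPI_Request req[2];
MPI_Status  stat[2];
int all_done = 0;

/* ... post one MPI_Isend into req[0] and one MPI_Irecv into req[1] ... */

while (!all_done) {
  MPI_Testall(2, req, &all_done, stat);  // non-blocking check on both requests
  if (!all_done)
    do_some_useful_work();               // placeholder computation between checks
}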

13 Persistent Communication

Structure of the nonblocking calls:
- MPI_Ixxxx allocates an MPI_Request;
- MPI_Wait or MPI_Test completes the communication and deallocates the request object.

Often a communication with the same arguments is executed repeatedly, e.g. every time step or every iteration. One can create a persistent request that will not be deallocated by MPI_Wait, which reduces overhead:

Create persistent request → MPI_Send_init, MPI_Recv_init
Repeat:
  Start communication → MPI_Start
  ...
  Complete communication → MPI_Wait, MPI_Test
Free persistent request → MPI_Request_free

14 Creation

int MPI_Send_init(void *buf, int count, MPI_Datatype datatype, int dest,
                  int tag, MPI_Comm comm, MPI_Request *req)
int MPI_Recv_init(void *buf, int count, MPI_Datatype datatype, int source,
                  int tag, MPI_Comm comm, MPI_Request *req)

- Creates a persistent request object for the standard send mode (or for a receive).
- Binds to the arguments buf, count, datatype, dest, tag, comm; these arguments will not change in the subsequent communications.
- On creation the request is inactive – not associated with any active communication. The communication is initiated by MPI_Start.

MPI_Request req_send, req_recv;
double A[100], B[100];
int left_neighbor, right_neighbor, tag=999;
MPI_Status stat_send, stat_recv;
...
MPI_Send_init(A, 100, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_send);
MPI_Recv_init(B, 100, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_recv);

MPI_Start(&req_send);
MPI_Start(&req_recv);
... // do something else useful
MPI_Wait(&req_send, &stat_send);
MPI_Wait(&req_recv, &stat_recv);

MPI_Request_free(&req_send);
MPI_Request_free(&req_recv);

15 Start Communication, Free Request

int MPI_Start(MPI_Request *request)

MPI_START(REQUEST, IERROR)
  INTEGER REQUEST, IERROR

int MPI_Request_free(MPI_Request *request)

MPI_REQUEST_FREE(REQUEST, IERROR)
  INTEGER REQUEST, IERROR

- request is a persistent request created by MPI_Send_init, etc.
- MPI_Start starts the communication on the request object.
- The call returns immediately; it starts a non-blocking communication. Do not access the buffer after this call until completion.
- Complete the communication with MPI_Wait, MPI_Test, etc.
- MPI_Wait and MPI_Test will not deallocate the request upon completion of the communication.
- Deallocate the persistent request with MPI_Request_free at the end.

16 Example: Matrix-Vector Multiplication

AX = Y, where A is an NxN matrix and X, Y are vectors of dimension N.

With a block decomposition over 3 CPUs (cpu 0, cpu 1, cpu 2 own the row blocks of A and the sub-vectors X1/Y1, X2/Y2, X3/Y3):

Y1 = A11*X1 + A12*X2 + A13*X3
Y2 = A21*X1 + A22*X2 + A23*X3
Y3 = A31*X1 + A32*X2 + A33*X3

(Figure: the X blocks are rotated among the CPUs step by step – first (X1, X2, X3), then (X2, X3, X1), then (X3, X1, X2) – and at each step every CPU multiplies its row block of A against the X block it currently holds.)

17 Example: Matrix-Vector

Data on cpu 0:
- [A11 A12 A13] → N/3 x N matrix
- X1 → vector, length N/3
- Y1 → vector, length N/3

Data on cpu 1:
- [A21 A22 A23] → N/3 x N matrix
- X2 → vector, length N/3
- Y2 → vector, length N/3

Data on cpu 2:
- [A31 A32 A33] → N/3 x N matrix
- X3 → vector, length N/3
- Y3 → vector, length N/3

Need to communicate X1, X2, X3: upward shift, number of shifts = ncpus-1.

Assume: A[i][j] = i+j, X[i] = i.

18 Example (non-blocking comm)

#include <stdio.h>
#include <string.h>
#include <mpi.h>
#include "dmath.h"   // → ignore this for now

#define DIM 1000     // logical A[DIM][DIM], X[DIM], Y[DIM]

int main(int argc, char **argv)
{
  int ncpus, my_rank, left_neighbor, right_neighbor, tag=1001;
  int Nx, Ny;        // Ny=DIM, Nx=DIM/ncpus; on each cpu: A[Nx][Ny], X[Nx], Y[Nx]
  MPI_Request req_sr[2];
  MPI_Status stat_sr[2];
  double **A, *X, *Y, *Xt;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  if(DIM%ncpus != 0) {   // assume DIM divisible by ncpus
    if(my_rank==0) printf("ERROR: grid size cannot be divided by ncpus!\n");
    MPI_Finalize();
    return -1;
  }

  Nx = DIM/ncpus;    // again, on each cpu: A[Nx][Ny] etc.
  Ny = DIM;
  left_neighbor = (my_rank-1 + ncpus)%ncpus;   // top neighbor
  right_neighbor = (my_rank+1)%ncpus;          // bottom neighbor

  A = DMath::newD(Nx, Ny);   // allocate memory; ignore DMath – my own routine
  X = DMath::newD(Nx);
  Xt = DMath::newD(Nx);      // Xt – temporary space for receiving from neighbor
  Y = DMath::newD(Nx);

19 Example

  int i, j;
  for(i=0;i<Nx;i++) {   // initialize A, X
    for(j=0;j<Ny;j++) A[i][j] = (my_rank*Nx+i) + j;   // *** important ***
    X[i] = my_rank*Nx+i;
  }

  int count;              // loop counter
  int sindex, curr_block;
  memset(Y, '\0', sizeof(double)*Nx);   // zero out result vector Y first

  for(count=0;count<ncpus;count++){
    if(count < ncpus-1) {
      MPI_Irecv(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);  // receive from bottom neighbor
      MPI_Isend(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);    // send to top neighbor
    }

    // compute on current data
    curr_block = (my_rank+count)%ncpus;   // *** important ***
    sindex = curr_block*Nx;               // starting index of A[i][sindex+0 : sindex+Nx-1]
    for(i=0;i<Nx;i++)
      for(j=0;j<Nx;j++)
        Y[i] += A[i][sindex+j]*X[j];      // *** important ***

    // complete comm
    if(count<ncpus-1) {
      MPI_Waitall(2, req_sr, stat_sr);    // data now in Xt
      memcpy(X, Xt, sizeof(double)*Nx);   // copy data from Xt to X  *** important ***
    }
  }

20 Example

  // clean up, free memory
  DMath::del(A);   // ignore DMath for now
  DMath::del(X);
  DMath::del(Xt);
  DMath::del(Y);

  MPI_Finalize();
  return 0;
}

21 Example: Persistent Communication

...
MPI_Recv_init(Xt, Nx, MPI_DOUBLE, right_neighbor, tag, MPI_COMM_WORLD, &req_sr[0]);
MPI_Send_init(X, Nx, MPI_DOUBLE, left_neighbor, tag, MPI_COMM_WORLD, &req_sr[1]);

for(count=0;count<ncpus;count++){
  if(count < ncpus-1)
    MPI_Startall(2, req_sr);

  // compute on current data
  curr_block = (my_rank+count)%ncpus;
  sindex = curr_block*Nx;
  for(i=0;i<Nx;i++)
    for(j=0;j<Nx;j++)
      Y[i] += A[i][sindex+j]*X[j];

  // complete comm
  if(count<ncpus-1) {
    MPI_Waitall(2, req_sr, stat_sr);    // data now in Xt
    memcpy(X, Xt, sizeof(double)*Nx);   // copy data to X
  }
}

MPI_Request_free(&req_sr[0]);
MPI_Request_free(&req_sr[1]);
...

22 Example: Send-Recv

...
for(count=0;count<ncpus;count++){
  // compute on current data
  curr_block = (my_rank+count)%ncpus;
  sindex = curr_block*Nx;
  for(i=0;i<Nx;i++)
    for(j=0;j<Nx;j++)
      Y[i] += A[i][sindex+j]*X[j];

  // send-recv
  if(count<ncpus-1)
    MPI_Sendrecv_replace(X, Nx, MPI_DOUBLE, left_neighbor, tag,
                         right_neighbor, tag, MPI_COMM_WORLD, &stat_sr);
}
...

23 HWK#2: Matrix Multiplication

C = A*B, column-wise decomposition.

A, B, C – NxN matrices
P – number of processors
A1, A2, A3 – Nx(N/P) matrices (column blocks of A); C1, C2, C3 – column blocks of C
Bij – (N/P)x(N/P) blocks of B

C1 = A1*B11 + A2*B21 + A3*B31   (cpu 0)
C2 = A1*B12 + A2*B22 + A3*B32   (cpu 1)
C3 = A1*B13 + A2*B23 + A3*B33   (cpu 2)

Input: A[i][j] = 2*i + j, B[i][j] = 2*i - j

24 HWK #2

- Implement the above parallel matrix multiplication (column-wise data decomposition) in C, C++ or Fortran.
- Use non-blocking communication or persistent communication in MPI.
- Test your parallel implementation and make sure the result is correct: the result for matrix C on p CPUs must be identical to that on 1 CPU.
- Use a matrix size of 2048x2048 (double).
- Time the "multiplication section" of your code using the MPI_Wtime() routine for wall-clock time (a minimal timing sketch follows this list).
- Run your code on 1, 2, 4, 8, 16 CPUs and obtain the wall-clock times T1, T2, ..., T16.
- Compute the parallel speedup factors Sp = T1/Tp, e.g. S8 = T1/T8 for 8 CPUs.
- Plot Sp vs. number of CPUs.
- Turn in:
  - source code + compiled binary code on either hamlet or radon;
  - table of wall-clock time vs. number of CPUs;
  - plot of parallel speedup factors;
  - write-up of what you have learned from the implementation and timing results.
- Due date: Oct. 11
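A minimal timing sketch for the multiplication section, assuming my_rank has been obtained as in the earlier examples; this only illustrates MPI_Wtime() and is not the homework solution.

double t_start, t_end, t_local, t_max;

MPI_Barrier(MPI_COMM_WORLD);            // line the processes up before timing
t_start = MPI_Wtime();

/* ... multiplication section of the code ... */

t_end   = MPI_Wtime();
t_local = t_end - t_start;

// take the slowest process as the wall-clock time Tp
MPI_Reduce(&t_local, &t_max, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0) printf("wall-clock time: %g seconds\n", t_max);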

25 Collective Communications

26 Overview

- All processes in a group participate in the communication by calling the same function with matching arguments.
- Types of collective operations:
  - Synchronization: MPI_Barrier
  - Data movement: MPI_Bcast, MPI_Scatter, MPI_Gather, MPI_Allgather, MPI_Alltoall
  - Collective computation: MPI_Reduce, MPI_Allreduce, MPI_Scan (a short MPI_Reduce sketch follows this slide)
- Collective routines are blocking:
  - completion of the call means the communication buffer can be accessed;
  - there is no indication of the other processes' status of completion;
  - the call may or may not have the effect of synchronizing the processes.
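As an illustration of a collective computation (not from the original slides), the complete program below sums the ranks of all processes with MPI_Reduce; only the root receives the result.

/* Sketch: collective computation with MPI_Reduce. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int my_rank, ncpus;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
  MPI_Comm_size(MPI_COMM_WORLD, &ncpus);

  double local = (double)my_rank;   // each process's contribution
  double total = 0.0;

  // all processes call MPI_Reduce with matching arguments
  MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

  if (my_rank == 0)
    printf("sum of ranks 0..%d = %g\n", ncpus-1, total);

  MPI_Finalize();
  return 0;
}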

27 Overview

- Can use the same communicators as point-to-point communications; MPI guarantees that messages from collective communications will not be confused with point-to-point messages.
- The key is the group of processes partaking in the communication: if you want only a sub-group of processes involved in a collective communication, you need to create a sub-group / sub-communicator from MPI_COMM_WORLD (see the sketch below).
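A hedged sketch of creating such a sub-communicator with MPI_Comm_split; my_rank is assumed to have been obtained from MPI_COMM_WORLD as in the earlier examples.

MPI_Comm half_comm;
int color = my_rank % 2;                 // 0 = even world ranks, 1 = odd world ranks

// processes with the same color end up in the same sub-communicator
MPI_Comm_split(MPI_COMM_WORLD, color, my_rank, &half_comm);

// this broadcast involves only the processes of one color;
// rank 0 of each sub-communicator broadcasts its world rank
int value = my_rank;
MPI_Bcast(&value, 1, MPI_INT, 0, half_comm);

MPI_Comm_free(&half_comm);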

28 Barrier

int MPI_Barrier(MPI_Comm comm)

MPI_BARRIER(COMM, IERROR)
  INTEGER COMM, IERROR

- Blocks the calling process until all group members have called it.
- Decreases performance; refrain from using it explicitly.

...
MPI_Barrier(MPI_COMM_WORLD);   // synchronization point
...

29 Broadcast

int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root,
              MPI_Comm comm)

MPI_BCAST(BUFFER, COUNT, DATATYPE, ROOT, COMM, IERROR)
  <type> BUFFER(*)
  INTEGER COUNT, DATATYPE, ROOT, COMM, IERROR

- Broadcasts a message from the process with rank root to all processes in the group, including itself.
- comm and root must be the same in all processes.
- The amount of data sent must be equal to the amount of data received, pairwise between each process and the root.
- For now, this means count and datatype must be the same for all processes; they may differ when generalized datatypes are involved.
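A small usage sketch (not from the original slides): the root sets a parameter and broadcasts it so that every process ends up with the same value.

/* Sketch: broadcasting a parameter from the root. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
  int my_rank, nsteps = 0;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);

  if (my_rank == 0)
    nsteps = 1000;                 // e.g. a value read from an input file

  // every process calls MPI_Bcast with the same root, count and datatype
  MPI_Bcast(&nsteps, 1, MPI_INT, 0, MPI_COMM_WORLD);

  printf("process %d: nsteps = %d\n", my_rank, nsteps);
  MPI_Finalize();
  return 0;
}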