Collective Communication in MPI and Advanced Features


1 Collective Communication in MPI and Advanced Features
Message passing patterns and Collective MPI routines

2 What MPI functions are commonly used
For simple applications, these are common:
Startup: MPI_Init(), MPI_Finalize()
Information on the processes: MPI_Comm_rank(), MPI_Comm_size(), MPI_Get_processor_name()
Point-to-point communication: MPI_Send() & MPI_Recv(); MPI_Isend() & MPI_Irecv(), MPI_Wait()
Collective communication: MPI_Allreduce(), MPI_Bcast(), MPI_Allgather()

3 Blocking Message Passing
The call waits until the data transfer is done.
MPI_Send(): the sending process waits until all data are transferred to the system buffer.
MPI_Recv(): the receiving process waits until all data are transferred from the system buffer to the receive buffer.
Buffers can be freely reused afterwards.

4 Blocking Message Send
MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
buf    Specifies the starting address of the buffer
count  Indicates the number of buffer elements
dtype  Denotes the datatype of the buffer elements
dest   Specifies the rank of the destination process in the group associated with the communicator comm
tag    Denotes the message label
comm   Designates the communication context that identifies a group of processes

5 Blocking Message Send
Standard: MPI_Send(). The sending process returns when the system can buffer the message or when the message is received and the buffer is ready for reuse.
Buffered: MPI_Bsend(). The sending process returns when the message is buffered in an application-supplied buffer.
Synchronous: MPI_Ssend(). The sending process returns only if a matching receive is posted and the receiving process has started to receive the message.
Ready: MPI_Rsend(). The message is sent as soon as possible (ASAP); it requires that a matching receive has already been posted.
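To make these modes concrete, here is a small illustrative sketch (not taken from the slides; the buffer size, tags, and two-rank setup are assumptions made for the example) that sends the same array with the standard, synchronous, and buffered modes:

/* Illustrative sketch of the send modes; assumes at least 2 processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int rank, data[100] = {0};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Standard mode: MPI may buffer or block, at its discretion. */
        MPI_Send(data, 100, MPI_INT, 1, 0, MPI_COMM_WORLD);

        /* Synchronous mode: returns only after the matching receive has started. */
        MPI_Ssend(data, 100, MPI_INT, 1, 1, MPI_COMM_WORLD);

        /* Buffered mode: the user must attach a buffer first. */
        int bufsize = 100 * sizeof(int) + MPI_BSEND_OVERHEAD;
        void *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);
        MPI_Bsend(data, 100, MPI_INT, 1, 2, MPI_COMM_WORLD);
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (rank == 1) {
        MPI_Status status;
        MPI_Recv(data, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        MPI_Recv(data, 100, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
        MPI_Recv(data, 100, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
    return 0;
}

MPI_Rsend() is omitted from the sketch because it is only correct when the matching receive is already posted.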

6 Blocking Message Receive
MPI_Recv(void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Status *status);
buf     Specifies the starting address of the buffer
count   Indicates the number of buffer elements
dtype   Denotes the datatype of the buffer elements
source  Specifies the rank of the source process in the group associated with the communicator comm
tag     Denotes the message label
comm    Designates the communication context that identifies a group of processes
status  Returns information about the received message. Status information is useful when wildcards are used or the received message is smaller than expected. Status may also contain error codes.

7 Example
...
if (rank == 0) {
    for (i = 0; i < 10; i++) buffer[i] = i;
    MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
} else if (rank == 1) {
    for (i = 0; i < 10; i++) buffer[i] = -1;
    MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
    for (i = 0; i < 10; i++)
        if (buffer[i] != i)
            printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
}

More examples at http://mpi.deino.net/mpi_functions/index.htm

8 Non-blocking Message Passing
Returns immediately after the data transfer is initiated.
Allows overlapping computation with communication.
Need to be careful though: if a send or receive buffer is updated before the transfer is over, the result will be wrong.
MPI_Request represents a handle on a non-blocking operation. It can be passed to the wait routines MPI_Wait(), MPI_Waitall(), MPI_Waitany(), and MPI_Waitsome() to find out when the non-blocking operation it handles completes.

9 Non-blocking Message Passing
MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Request *req);
MPI_Wait(MPI_Request *req, MPI_Status *status);
A call to MPI_Wait returns when the operation identified by the request is complete.
req  Specifies the request handle, later used by a completion routine called by the application to complete the operation.
Blocking / non-blocking counterparts:
MPI_Send  -> MPI_Isend
MPI_Bsend -> MPI_Ibsend
MPI_Ssend -> MPI_Issend
MPI_Rsend -> MPI_Irsend
MPI_Recv  -> MPI_Irecv
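The wait routines are easiest to see with a request array. The fragment below is an illustrative sketch (buffer names, sizes, and the tag are assumptions; rank and nproc are assumed to come from MPI_Comm_rank/MPI_Comm_size, as in the ring example on the next slide):

/* Illustrative: post both non-blocking calls, compute, then wait on all. */
int right = (rank + 1) % nproc;
int left  = (rank + nproc - 1) % nproc;
MPI_Request reqs[2];
MPI_Status  stats[2];
int sendbuf[10] = {0}, recvbuf[10];

MPI_Irecv(recvbuf, 10, MPI_INT, left,  123, MPI_COMM_WORLD, &reqs[0]);
MPI_Isend(sendbuf, 10, MPI_INT, right, 123, MPI_COMM_WORLD, &reqs[1]);

/* ... computation that touches neither sendbuf nor recvbuf ... */

MPI_Waitall(2, reqs, stats);   /* both transfers are complete after this returns */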

10 Non-blocking meaning
MPI_Isend() and MPI_Irecv() are non-blocking, meaning that the function call returns before the communication is completed.
MPI_Isend() returns immediately, usually before the data is finished being sent. Thus, you must be extra careful not to change any of the data in the buffer that you passed as an argument to MPI_Isend().
MPI_Irecv() returns immediately, which allows the process to continue doing calculations while the data is still being received. However, before the receiving process actually uses the data it is receiving, it must call MPI_Wait() to ensure that the data is valid.
Deadlock from blocked sends and receives is avoided with non-blocking communication, but other precautions must be taken. In particular you need to be sure that, at a certain point, your data has actually arrived: you may need to place an MPI_Wait() call for each send and/or receive in order to be sure it has completed before advancing in the program.
In MPI, blocking communications are synchronous, while non-blocking communications are asynchronous. Asynchronous communication is often the key to achieving high performance with MPI: its biggest advantage is that the calls do not block, which allows processes to continue doing computations while communication with another process is still pending.

11 Non-blocking Message Passing
...
right = (rank + 1) % nproc;
left = rank - 1;
if (left < 0) left = nproc - 1;
MPI_Irecv(buffer, 10, MPI_INT, left, 123, MPI_COMM_WORLD, &request);
MPI_Isend(buffer2, 10, MPI_INT, right, 123, MPI_COMM_WORLD, &request2);
MPI_Wait(&request, &status);
MPI_Wait(&request2, &status);

12 Message passing patterns
[Figure: point-to-point send-receive pattern. Process 1 (source) calls MPI_Send(); Process 2 (destination) calls MPI_Recv(); a message containing the data travels between them.]
Implementation of the send-receive pattern with explicit MPI send and receive routines.

13 Collective message-passing routines
A program might require many point-to-point send-receive patterns. Certain groups of send-receive patterns appear frequently and become slightly higher-level patterns.
Collective patterns are those that involve multiple processes. One process (the root / master) is the source of data sent to the other processes, or the destination of data sent from the other processes.
Collective message-passing patterns are implemented in MPI with higher efficiency than separate point-to-point routines, although the collective routines are not strictly necessary.

14 Collective Communications
A single call handles the communication between all the processes in a communicator.
There are 3 types of collective communications:
Data movement (e.g. MPI_Bcast)
Reduction (e.g. MPI_Reduce)
Synchronization (e.g. MPI_Barrier)
You may find more examples at:

15 Broadcast pattern
Sends the same data to each of a group of processes.
[Figure: one source process sends the same data to all destination processes.]
A common pattern to get the same data to all processes, especially at the beginning of a computation.
Note: the patterns shown do not mean the implementation performs them this way. Only the final result is the same in any parallel implementation; the patterns do not describe the implementation.

16 Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
One process (root) sends data to all the other processes in the same communicator.
Must be called by all the processes with the same arguments.
[Figure: before MPI_Bcast only P1 holds A B C D; after MPI_Bcast, P1, P2, P3 and P4 all hold A B C D.]

17 MPI broadcast operation
Sending the same message to all processes in a communicator.
Notice the same routine is called by each process, with the same parameters. MPI processes usually execute the same program, so this is a handy construction.

18 MPI_Bcast parameters
All processes in the communicator must call MPI_Bcast with the same parameters.
Notice that there is no tag.
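A minimal usage sketch (the variable n and the value 100 are illustrative, not from the slides); note that every rank makes the identical call:

int n;                          /* will hold the broadcast value on every rank */
if (rank == 0) n = 100;         /* only the root has the value initially */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
/* now n == 100 on every process in MPI_COMM_WORLD */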

19 Creating a broadcast with individual sends and receives
if (rank == 0) {
    for (i = 1; i < P; i++)
        MPI_Send(buf, N, MPI_INT, i, tag, MPI_COMM_WORLD);
} else
    MPI_Recv(buf, N, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);

Complexity of doing it this way is O(N * P), where the number of bytes in the message is N and there are P processes.

20 Likely MPI_Bcast implementation
The number of processes that have the data doubles with each iteration.
[Figure: tree broadcast over log2 P steps; at each step, processes that already have the data send it to processes that have not yet received it.]
Complexity of the tree broadcast is O(N * log2 P): log2 P steps, each moving a message of N bytes.

21 MPI Scatter
MPI_Scatter() is a collective routine that is very similar to MPI_Bcast(): a root process sends data to all processes in a communicator.
The difference between MPI_Bcast() and MPI_Scatter() is small but important: MPI_Bcast() sends the same piece of data to all processes, while MPI_Scatter() sends different chunks of an array to different processes.

22 Scatter Pattern
Distributes a collection of data items to a group of processes.
A common pattern to get data to all processes. Usually the data sent are parts of an array.
[Figure: one source process sends different data to each destination process.]

23 Scatter Pattern MPI_Bcast takes a single data element at the root process (the red box) and copies it to all other processes. MPI_Scatter takes an array of elements and distributes the elements in the order of process rank. The first element (in red) goes to process zero, the second element (in green) goes to process one, and so on.

24 Basic MPI scatter operation
Sending one or more contiguous elements of an array in the root process to each separate process.
Notice the same routine is called by each process, with the same parameters. MPI processes usually execute the same program, so this is a handy construction.

25 MPI scatter parameters
Usually the number of elements sent to each process and received by each process is the same.
All processes in the communicator must call MPI_Scatter with the same parameters.
Notice that there is no tag.

26 Example
In the following code, the size of the send buffer is 100 * <number of processes> and 100 contiguous elements are sent to each process:

int main(int argc, char *argv[]) {
    int size, *sendbuf, recvbuf[100];   /* for each process */
    MPI_Init(&argc, &argv);             /* initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sendbuf = (int *)malloc(size * 100 * sizeof(int));
    .
    MPI_Scatter(sendbuf, 100, MPI_INT, recvbuf, 100, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize();                     /* terminate MPI */
}

27 Scatter Example (source:

28 Scattering contiguous groups of elements to each process

29 Gather Pattern
Essentially the reverse of a scatter: it receives data items from a group of processes.
A common pattern, especially at the end of a computation, to collect results.
[Figure: data from each source process is collected at the destination process into an array.]

30 MPI Gather
Having one process collect individual values from a set of processes (including itself).
As usual, the same routine is called by each process, with the same parameters.

31 Gather Pattern
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)
One process (root) collects data from all the other processes in the same communicator.
Must be called by all the processes with the same arguments.
[Figure: before MPI_Gather, P1, P2, P3, P4 hold A, B, C, D respectively; after MPI_Gather, the root P1 holds A B C D.]

32 Gather parameters
sendcnt is the number of elements sent from each process; recvcnt is the number of elements in any single receive.
Usually the number of elements sent by each process and received from each process is the same.
Note: all processes in the communicator must call MPI_Gather() with the same parameters.

33 Gather Example
To gather 10 data elements from each process into process 0, using dynamically allocated memory in the root process:

int data[10];                                        /* data to be gathered from processes */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);              /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size);        /* find group size */
    buf = (int *)malloc(grp_size * 10 * sizeof(int)); /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

Note: MPI_Gather() gathers from all processes, including the root.

34 Reduce Pattern
A common pattern to get data back to the master from all processes and then aggregate it by combining the collected data into one answer.
The reduction operation must be a binary operation that is commutative (changing the order of the operands does not change the result), which allows the implementation to do the operations in any order.
[Figure: data from each source process is collected at the destination and combined into one answer with a commutative operation.]

35 MPI Reduce
A gather operation combined with a specified arithmetic/logical operation. Example: values could be gathered and then added together by the root.
[Figure: every process calls MPI_Reduce(); the combined result appears at the root.]
As usual, the same routine is called by each process, with the same parameters.

36 Reduce parameters
Note: all processes in the communicator must call MPI_Reduce() with the same parameters.

37 Reduction
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
One process (root) collects data from all the other processes in the same communicator and performs an operation on the data.
Predefined operations: MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and a few more.
MPI_Op_create(): user-defined operator.
[Figure: P1, P2, P3, P4 hold A, B, C, D; after MPI_Reduce with MPI_SUM, the root P1 holds A+B+C+D.]

38 Implementation of reduction using a tree construction
[Figure: tree reduction example; leaf values 14, 39, 53, 120, 66, 29 are summed pairwise up the tree (53, 173, 95; then 226; then 321).]
O(log2 P) steps with P processes.

39 MPI_Reduce Example

for (i = 0; i < N; i++)
    table[i] = ((float) random()) / RAND_MAX * 100;

printf("<rank %d>:", rank);
for (i = 0; i < N; i++)
    printf("%4d", table[i]);
printf("\n");

MPI_Reduce(table, result, N, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

if (rank == 0) {
    printf("\nAnswer:\n");
    for (i = 0; i < N; i++)
        printf("%4d", result[i]);   /* result holds the element-wise sums */
    printf("\n");
}

40 Complete sample MPI program with collective routines
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char *argv[]) {
    int myid, numprocs, data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {                       /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
    }

    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

    x = MAXSIZE / numprocs;                /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++) myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);

    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);

    MPI_Finalize();
    return 0;
}

41 Summary

42 Combined Patterns

43 Gather to All
int MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)
Every process gathers the data from all the other processes in the same communicator.
Must be called by all the processes with the same arguments.
[Figure: before MPI_Allgather, P1, P2, P3, P4 hold A, B, C, D; afterwards every process holds A B C D.]
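An illustrative sketch of the call (rank and nproc are assumed to have been obtained with MPI_Comm_rank/MPI_Comm_size, and <stdlib.h> is assumed for malloc; each rank contributes a single int):

/* Illustrative: each rank contributes one int; afterwards every rank has all of them. */
int myval = rank;                           /* local contribution */
int *all = malloc(nproc * sizeof(int));
MPI_Allgather(&myval, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
/* all[i] == i on every rank */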

44 Reduction to All
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
All the processes collect data from all the other processes in the same communicator and perform an operation on the data.
Predefined operations: MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and a few more.
MPI_Op_create(): user-defined operator.
[Figure: before MPI_Allreduce, P1, P2, P3, P4 hold A, B, C, D; afterwards every process holds A+B+C+D.]
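An illustrative sketch (the local value is arbitrary, chosen for the example); the global sum ends up on every rank without a separate broadcast:

/* Illustrative: global sum available on every rank. */
int local = rank + 1, global_sum;
MPI_Allreduce(&local, &global_sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* global_sum == 1 + 2 + ... + nproc on every process */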

45 For more functions… http://www.mpi-forum.org
MPICH (
Open MPI (
MPI descriptions and examples are adapted from Stéphane Ethier (PPPL)'s PICSciE/PICASso Mini-Course slides.

46 Synchronization

47 MPI Collective Communication
Collective routines provide a higher-level way to organize a parallel program.
Each process executes the same communication operations.
Communication and computation are coordinated among a group of processes in a communicator.
Tags are not used.
No non-blocking collective operations (these were only added later, in MPI-3).
Three classes of operations:
Synchronization: processes wait until all members of the group have reached the synchronization point. MPI_Barrier blocks all MPI processes in the given communicator until they all call this routine.
Data movement: broadcast, scatter/gather, all-to-all.
Collective computation (reductions): one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.

48 Scope
Collective communication routines must involve all processes within the scope of a communicator.
All processes are, by default, members of the communicator MPI_COMM_WORLD. Additional communicators can be defined by the programmer.
Unexpected behavior, including program failure, can occur if even one task in the communicator doesn't participate.
It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operations.

49 Synchronization
int MPI_Barrier(MPI_Comm comm)
Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.

#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Hello, world. I am %d of %d\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}

50 Synchronization
MPI_Barrier(comm)
Blocks until all processes in the group of the communicator comm call it.
Not used often. Sometimes used in measuring performance and in load balancing.

51 Collective Data Movement
MPI_Bcast: data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
MPI_Scatter: data movement operation. Distributes distinct messages from a single source task to each task in the group.
MPI_Gather: data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
MPI_Allgather: data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcast within the group.
MPI_Reduce: collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
MPI_Allreduce: collective computation operation + data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

52 Collective Data Movement: Broadcast, Scatter, and Gather
[Figure: with four processes P0..P3, Broadcast copies A from P0 to all processes; Scatter distributes A, B, C, D from P0 one element per process; Gather is the reverse, collecting them back to P0.]

53 Broadcast Data belonging to a single process is sent to all of the processes in the communicator. Copyright © 2010, Elsevier Inc. All rights Reserved

54 Comments on Broadcast
All collective operations must be called by all processes in the communicator.
MPI_Bcast is called by both the sender (called the root process) and the processes that are to receive the broadcast. MPI_Bcast is not a "multi-send".
The "root" argument is the rank of the sender; this tells MPI which process originates the broadcast and which processes receive it.
If one process does not call the collective, the program hangs.

55 MPI_Reduce

56 Predefined reduction operators in MPI
Copyright © 2010, Elsevier Inc. All rights Reserved

57 MPI_Allreduce Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation. Copyright © 2010, Elsevier Inc. All rights Reserved

58 MPI Collective Routines: Summary
Many routines: Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast, Gather, Gatherv, Reduce, Reduce_scatter, Scan, Scatter, Scatterv.
"All" versions deliver results to all participating processes.
"V" versions allow the chunks to have variable sizes.
Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.
MPI-2 adds Alltoallw, Exscan, and intercommunicator versions of most routines.
Examples: Allgatherv gathers variable amounts of data from each processor; Allreduce gives everyone a copy of the sum.
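To illustrate the "V" versions, here is a hedged sketch of MPI_Gatherv in which rank i contributes i+1 integers (the counts, displacements, buffer sizes, and variable names are assumptions made up for the example; rank and nproc come from the usual MPI_Comm_rank/MPI_Comm_size calls, and <stdlib.h> is assumed):

/* Illustrative MPI_Gatherv: variable-size chunks gathered to rank 0. */
int mycount = rank + 1;
int mydata[16];                        /* assumes nproc <= 16 */
for (int i = 0; i < mycount; i++) mydata[i] = rank;

int *counts = NULL, *displs = NULL, *recvbuf = NULL;
if (rank == 0) {
    counts = malloc(nproc * sizeof(int));
    displs = malloc(nproc * sizeof(int));
    int total = 0;
    for (int i = 0; i < nproc; i++) {
        counts[i] = i + 1;             /* how much rank i will send */
        displs[i] = total;             /* where rank i's chunk starts in recvbuf */
        total += counts[i];
    }
    recvbuf = malloc(total * sizeof(int));
}
MPI_Gatherv(mydata, mycount, MPI_INT,
            recvbuf, counts, displs, MPI_INT, 0, MPI_COMM_WORLD);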

59 Example Extra: Self Study
MPI PI program

60 Example of MPI PI program using 6 Functions
Using basic MPI functions: MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK
Using MPI collectives: MPI_BCAST, MPI_REDUCE
Slide source: Bill Gropp, ANL, CS267 Lecture 2

61 Midpoint Rule
[Figure: midpoint rule for integrating f(x) over [a, b], evaluating f at the midpoint xm of each subinterval.]

62 Example: PI in C - 1
Input and broadcast parameters

#include "mpi.h"
#include <math.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;   /* pi to 25 digits */
    double mypi, pi, h, sum, x, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;

Slide source: Bill Gropp, ANL, CS267 Lecture 2

63 Example: PI in C - 2
Compute local pi values, then compute the summation

        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}

Slide source: Bill Gropp, ANL, CS267 Lecture 2

64 Collective vs. Point-to-Point Communications
All the processes in the communicator must call the same collective function.
Will this program work?

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&a, MPI_INT, MPI_SUM, 0, 0, MPI_COMM_WORLD);

Copyright © 2010, Elsevier Inc. All rights Reserved

65 Collective vs. Point-to-Point Communications
All the processes in the communicator must call the same collective function.
For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&a, MPI_INT, MPI_SUM, 0, 0, MPI_COMM_WORLD);

Copyright © 2010, Elsevier Inc. All rights Reserved

66 Collective vs. Point-to-Point Communications
The arguments passed by each process to an MPI collective communication must be "compatible."
Will this program work?

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

Copyright © 2010, Elsevier Inc. All rights Reserved

67 Collective vs. Point-to-Point Communications
The arguments passed by each process to an MPI collective communication must be "compatible."
For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.

if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);

Copyright © 2010, Elsevier Inc. All rights Reserved

68 Example of MPI_Reduce execution
Multiple calls to MPI_Reduce with MPI_SUM and Proc 0 as destination (root) Is b=3 on Proc 0 after two MPI_Reduce() calls? Is d=6 on Proc 0? Copyright © 2010, Elsevier Inc. All rights Reserved

69 Example: Output results
However, the names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce. The order of the calls determines the matching, so the value stored in b will be 4, and the value stored in d will be 5.
Copyright © 2010, Elsevier Inc. All rights Reserved

70 Example Parallel Matrix Vector Multiplication
Collective Communication Application Textbook p

71 Matrix-vector multiplication: y= A * x

72 Partitioning and Task graph for matrix-vector multiplication
yi= Row Ai * x

73 Execution Schedule and Task Mapping
yi= Row Ai * x

74 Data Partitioning and Mapping for y= A*x

75 SPMD Code for y= A*x

76 Evaluation: Parallel Time
Ignore the cost of local address calculation. Each task performs n additions and n multiplications, and each addition/multiplication costs ω. With the n row-tasks mapped evenly onto p processes, the parallel time is approximately as derived below.
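In LaTeX notation, a hedged reconstruction of the formula that was dropped from the transcript, assuming n/p rows per process, 2n operations per row at cost ω each, and counting communication separately:

T_{\mathrm{comp}} \approx \frac{n}{p}\cdot 2n\,\omega = \frac{2n^{2}\omega}{p},
\qquad
T_{\mathrm{parallel}} \approx \frac{2n^{2}\omega}{p} + T_{\mathrm{comm}}

where T_comm is the cost of redistributing x (the MPI_Allgather discussed on the following slides).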

77 How is initial data distributed?
Assume that initially matrix A and vector x are distributed evenly among processes.
We need to redistribute vector x to everybody in order to perform the parallel computation.
What MPI collective communication is needed?

78 Communication Pattern for Data Redistribution
Data requirement for Process 0 only: MPI_Gather
Data requirement for all processes: MPI_Allgather

79 MPI Code for Gathering Data
Data gather for Process 0 Repeat for all processes

80 Allgather
Concatenates the contents of each process's send_buf_p and stores this in each process's recv_buf_p.
As usual, recv_count is the amount of data being received from each process.
Copyright © 2010, Elsevier Inc. All rights Reserved

81 MPI SPMD Code for y=A*x

82 MPI SPMD Code for y=A*x
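The SPMD code on slides 75, 81, and 82 is shown as images that did not survive the transcript. The following is only a minimal sketch of the approach described above (block-row distribution of A, block distribution of x, MPI_Allgather to assemble the full x); the function and variable names are assumptions, n is assumed divisible by the number of processes, and <stdlib.h> is assumed for malloc:

/* Sketch: each process holds local_n = n/p rows of A and local_n entries of x and y. */
void mat_vect_mult(double *local_A, double *local_x, double *local_y,
                   int n, int local_n, MPI_Comm comm) {
    double *x = malloc(n * sizeof(double));

    /* every process assembles the full vector x */
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);

    for (int i = 0; i < local_n; i++) {
        local_y[i] = 0.0;
        for (int j = 0; j < n; j++)
            local_y[i] += local_A[i*n + j] * x[j];
    }
    free(x);
}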

83 Performance Evaluation of Matrix Vector Multiplication
Copyright © 2010, Elsevier Inc. All rights Reserved

84 How to measure elapsed parallel time
Use MPI_Wtime(), which returns the number of seconds that have elapsed since some time in the past.
Copyright © 2010, Elsevier Inc. All rights Reserved
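A common usage sketch (the barrier and the MPI_MAX reduction follow the advice on the slides below; rank and the code being timed are assumptions for the example):

/* Illustrative timing pattern: report the slowest process's elapsed time. */
double local_start, local_elapsed, elapsed;

MPI_Barrier(MPI_COMM_WORLD);          /* line everyone up before timing */
local_start = MPI_Wtime();

/* ... code being timed ... */

local_elapsed = MPI_Wtime() - local_start;
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (rank == 0) printf("Elapsed time = %e seconds\n", elapsed);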

85 Measure elapsed sequential time in Linux
This code works on Linux without using MPI functions.
Use GET_TIME(), which returns the elapsed time in seconds (with microsecond resolution) since some point in the past.
Sample code for GET_TIME():

#include <sys/time.h>
/* The argument now should be a double (not a pointer to a double) */
#define GET_TIME(now) { \
    struct timeval t; \
    gettimeofday(&t, NULL); \
    now = t.tv_sec + t.tv_usec / 1000000.0; \
}
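An illustrative use of the macro above in a serial program (the code being timed is a placeholder):

double start, finish;
GET_TIME(start);
/* ... serial code being timed ... */
GET_TIME(finish);
printf("Elapsed time = %e seconds\n", finish - start);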

86 Measure elapsed sequential time
Copyright © 2010, Elsevier Inc. All rights Reserved

87 Use MPI_Barrier() before time measurement
Don't start timing until every process in the communicator has reached the same point; MPI_Barrier() lines the processes up before the clock starts.

88 Run-times of serial and parallel matrix-vector multiplication
(Seconds) Copyright © 2010, Elsevier Inc. All rights Reserved

89 Speedup and Efficiency
Copyright © 2010, Elsevier Inc. All rights Reserved

90 Speedups of Parallel Matrix-Vector Multiplication
Copyright © 2010, Elsevier Inc. All rights Reserved

91 Efficiencies of Parallel Matrix-Vector Multiplication
Copyright © 2010, Elsevier Inc. All rights Reserved

92 Scalability A program is scalable if the problem size can be increased at a rate so that the efficiency doesn’t decrease as the number of processes increase. Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable. Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable. Copyright © 2010, Elsevier Inc. All rights Reserved

93 Safety Issues in MPI programs

94 Safety in MPI programs Is it a safe program? (Assume tag/process ID is assigned properly) Process 0 Send(1) Recv(1) Process 1 Send(0) Recv(0) ) Copyright © 2010, Elsevier Inc. All rights Reserved

95 Safety in MPI programs
Is it a safe program? (Assume tags/process IDs are assigned properly.)
Process 0: Send(1); Recv(1)
Process 1: Send(0); Recv(0)
It may be unsafe because the MPI standard allows MPI_Send to behave in two different ways: it can simply copy the message into an MPI-managed buffer and return, or it can block until the matching call to MPI_Recv starts.
Copyright © 2010, Elsevier Inc. All rights Reserved

96 Buffer a message implicitly during MPI_Send()
When you send data, where does it go? One possibility:
[Figure: user data on Process 0 is copied to a local buffer, crosses the network into a local buffer on Process 1, and is then copied into the user data area.]
This doubles the storage, but is safe.
Slide source: Bill Gropp, ANL, CS267 Lecture 2

97 Avoiding Buffering
Avoiding copies uses less memory, but may use more time: a time-space tradeoff.
[Figure: user data on Process 0 travels across the network directly into the user data area on Process 1.]
MPI_Send() waits until a matching receive is executed.
Slide source: Bill Gropp, ANL, CS267 Lecture 2

98 Safety in MPI programs
Many implementations of MPI set a threshold at which the system switches from buffering to blocking. Relatively small messages will be buffered by MPI_Send, while larger messages will cause it to block.
If the MPI_Send() executed by each process blocks, no process will be able to start executing a call to MPI_Recv, and the program will hang or deadlock: each process is blocked waiting for an event that will never happen.
Copyright © 2010, Elsevier Inc. All rights Reserved

99 Example of unsafe MPI code with possible deadlocks
Send a large message from process 0 to process 1. If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive).
Process 0: Send(1); Recv(1)
Process 1: Send(0); Recv(0)
Buffering doubles the memory but lets a process continue computing past the send().
This may be "unsafe" because it depends on the availability of system buffers in which to store the data sent until it can be received.
Slide source: Bill Gropp, ANL, CS267 Lecture 2

100 Safety in MPI programs A program that relies on MPI provided buffering is said to be unsafe. Such a program may run without problems for various sets of input, but it may hang or crash with other sets. Copyright © 2010, Elsevier Inc. All rights Reserved

101 How can we tell if a program is unsafe?
Replace MPI_Send() with MPI_Ssend().
The extra "s" stands for synchronous, and MPI_Ssend is guaranteed to block until the matching receive starts.
If the new program does not hang or crash, the original program is safe.
MPI_Send() and MPI_Ssend() have the same arguments.
Copyright © 2010, Elsevier Inc. All rights Reserved
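For example, a send in the program under test would be swapped as sketched below (the arguments are illustrative placeholders, not from the slides):

/* Original (may rely on MPI buffering):                 */
/*   MPI_Send(buf, n, MPI_INT, dest, tag, comm);          */
/* Safety test (same arguments, synchronous semantics):   */
MPI_Ssend(buf, n, MPI_INT, dest, tag, comm);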

102 Some Solutions to the “unsafe” Problem
Order the operations more carefully:
Process 0: Send(1); Recv(1)
Process 1: Recv(0); Send(0)
Or use a simultaneous send and receive in one call:
Process 0: Sendrecv(1)
Process 1: Sendrecv(0)
Trouble with Sendrecv: the calls need to be paired up and called at about the same time.
Slide source: Bill Gropp, ANL, CS267 Lecture 2

103 Restructuring communication in odd-even sort
Copyright © 2010, Elsevier Inc. All rights Reserved

104 Use MPI_Sendrecv() to conduct a blocking send and a receive in a single call.
Copyright © 2010, Elsevier Inc. All rights Reserved
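An illustrative ring-shift sketch with MPI_Sendrecv (buffer names, counts, and tag are assumptions; rank and nproc come from the usual MPI_Comm_rank/MPI_Comm_size calls):

/* Illustrative ring shift: no deadlock regardless of buffering. */
int right = (rank + 1) % nproc;
int left  = (rank + nproc - 1) % nproc;
MPI_Status status;

MPI_Sendrecv(sendbuf, 10, MPI_INT, right, 0,   /* what I send, and to whom  */
             recvbuf, 10, MPI_INT, left,  0,   /* what I receive, from whom */
             MPI_COMM_WORLD, &status);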

105 More Solutions to the “unsafe” Problem
Supply your own space as a buffer for the send:
Process 0: Bsend(1); Recv(1)
Process 1: Bsend(0); Recv(0)
Or use non-blocking operations:
Process 0: Isend(1); Irecv(1); Waitall
Process 1: Isend(0); Irecv(0); Waitall
Buffered send (Bsend): you supply the buffer space yourself.
Immediate send (Isend): you promise not to touch the buffer until done (need to call Waitall later).
Immediate receive (Irecv): returns right away; you promise not to touch the buffer until done (need to call Waitall later).
Riskier (race conditions possible) but allows more overlap.

106 Concluding Remarks (1) MPI works in C, C++, or Python
A communicator is a collection of processes that can send messages to each other. Many parallel programs use the SPMD approach. Most serial programs are deterministic: if we run the same program with the same input we’ll get the same output. Parallel programs often don’t possess this property. Collective communications involve all the processes in a communicator. Copyright © 2010, Elsevier Inc. All rights Reserved

107 Concluding Remarks (2) Performance evaluation
Use elapsed time or "wall clock time".
Speedup = sequential time / parallel time
Efficiency = Speedup / p
If it's possible to increase the problem size (n) so that the efficiency doesn't decrease as p is increased, a parallel program is said to be scalable.
An MPI program is unsafe if its correct behavior depends on the fact that MPI_Send is buffering its input.
Copyright © 2010, Elsevier Inc. All rights Reserved

