
1 Message Passing Programming Carl Tropper Department of Computer Science

2 Generalities Structure of message passing programs. Asynchronous SPMD (single program, multiple data) model. First look at the building blocks: send and receive operations, blocking and non-blocking versions. MPI (the standard) specifics.

3 Send and receive operations
send(void *sendbuf, int nelems, int dest)
receive(void *recvbuf, int nelems, int source)
nelems = # elements to be sent/received.
P0 sends data to P1:

    P0:                     P1:
    a = 100;                receive(&a, 1, 0);
    send(&a, 1, 1);         printf("%d\n", a);
    a = 0;

Good semantics: P1 receives 100. Bad semantics: P1 receives 0. The latter could happen because DMA and communication hardware could return before 100 is actually sent.

4 Blocking message passing operations Handshake: the sender asks to send, the receiver agrees to receive; then the sender sends and the receiver receives. Implemented without buffers.

5 Deadlocks in blocking, non-buffered send/receive

    P0:                     P1:
    send(&b, 1, 1);         send(&b, 1, 0);
    receive(&a, 1, 1);      receive(&a, 1, 0);

Both sends wait for the receives: DEADLOCK. Can cure this deadlock by reversing the send and receive ops (e.g. in P1). Ugh.

6 Send/Receive Blocking Buffered Buffers are used at the sender and the receiver, with dedicated communication hardware at both ends. If the sender has no buffer but the receiver does, it can still be made to work (right-hand side of the slide's figure).

7 The impact of non-infinite buffer space

    P0:                                 P1:
    for (i = 0; i < 1000; i++) {        for (i = 0; i < 1000; i++) {
        produce_data(&a);                   receive(&a, 1, 0);
        send(&a, 1, 1);                     consume_data(&a);
    }                                   }

If the consumer consumes more slowly than the producer produces, finite buffer space eventually fills and the sender is forced to block.

8 Deadlocks in Buffered Send/Receive

    P0:                     P1:
    receive(&a, 1, 1);      receive(&a, 1, 0);
    send(&b, 1, 1);         send(&b, 1, 0);

The receive operation still blocks, so deadlock can happen. Moral of the story: you still have to be careful to avoid deadlocks!

9 Non-blocking optimizations Blocking is safe but wastes time. Alternative: use non-blocking operations with a check-status operation. The process is free to perform any operation that does not depend on completion of the send or receive. Once the transfer is complete, the data can be used.

10 Non-blocking optimization

11 Possibilities

12 MPI Vendors all had their own message passing libraries. Enter MPI, the standard for C and Fortran. It defines the syntax and semantics of a core set of library routines (125 are defined).

13 Core set of routines for MPI
MPI_Init        Initializes MPI.
MPI_Finalize    Terminates MPI.
MPI_Comm_size   Determines the number of processes.
MPI_Comm_rank   Determines the label (rank) of the calling process.
MPI_Send        Sends a message.
MPI_Recv        Receives a message.

14 Starting and Terminating MPI
int MPI_Init(int *argc, char ***argv)
int MPI_Finalize()
MPI_Init is called prior to any other MPI routine; it initializes the MPI environment. MPI_Finalize is called at the end; it does the clean-up. The return code for both is MPI_SUCCESS on success. mpi.h contains the MPI constants and data structures.
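
A minimal sketch of this lifecycle (the explicit check against MPI_SUCCESS is illustrative; many programs simply rely on the default error handler aborting on failure):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        /* initialize the MPI environment before any other MPI call */
        if (MPI_Init(&argc, &argv) != MPI_SUCCESS) {
            fprintf(stderr, "MPI_Init failed\n");
            return 1;
        }

        /* ... program body: all communication happens here ... */

        MPI_Finalize();   /* clean up; no MPI calls are allowed after this */
        return 0;
    }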

15 Communicators Communication domain: the set of processes that communicate with one another. Communicators are variables of type MPI_Comm; they store information about communication domains. MPI_COMM_WORLD is the default communicator, containing all processes in the program.

16 Communicators
int MPI_Comm_size(MPI_Comm comm, int *size)
int MPI_Comm_rank(MPI_Comm comm, int *rank)
MPI_Comm_size returns the number of processes in the communicator. The rank identifies each process; ranks run from 0 to size - 1.

17 Hello world

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int npes, myrank;
        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &npes);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        printf("From process %d out of %d, Hello World!\n", myrank, npes);
        MPI_Finalize();
        return 0;
    }

Prints hello world from each process.

18 Sending/Receiving Messages
int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm)
int MPI_Recv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Status *status)
MPI_Send sends the data in buf; count = # entries of type datatype. The length of a message is specified as a number of entries, not a number of bytes, for portability. dest = rank of the destination process; tag = type of message. MPI_ANY_SOURCE lets any process be the source; MPI_ANY_TAG does the same for the tag. For MPI_Recv, buf is where the received message is stored; count and datatype specify the length of the buffer.
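
A minimal sketch of a one-message exchange (run with at least two processes; the tag value 7 and the wildcard receive are arbitrary choices for illustration):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;
            /* send one int with tag 7 to the process of rank 1 */
            MPI_Send(&value, 1, MPI_INT, 1, 7, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* accept a message from any sender carrying any tag */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d with tag %d\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
        MPI_Finalize();
        return 0;
    }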

19 Datatypes
MPI Datatype           C Datatype
MPI_CHAR               signed char
MPI_SHORT              signed short int
MPI_INT                signed int
MPI_LONG               signed long int
MPI_UNSIGNED_CHAR      unsigned char
MPI_UNSIGNED_SHORT     unsigned short int
MPI_UNSIGNED           unsigned int
MPI_UNSIGNED_LONG      unsigned long int
MPI_FLOAT              float
MPI_DOUBLE             double
MPI_LONG_DOUBLE        long double
MPI_BYTE               (no C equivalent)
MPI_PACKED             (no C equivalent)

20 Sending/Receiving The status variable is used to get information about the Recv operation. In C the status is stored as:

    typedef struct MPI_Status {
        int MPI_SOURCE;
        int MPI_TAG;
        int MPI_ERROR;
    };

int MPI_Get_count(MPI_Status *status, MPI_Datatype datatype, int *count)
returns the number of entries actually received in the count variable.
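
A small sketch of why MPI_Get_count is useful: the receiver posts a buffer larger than the message and then asks the status object how much actually arrived. This is a fragment assumed to run inside an already-initialized program, with rank 0 as the (assumed) sender and tag 0 chosen arbitrarily:

    int buf[100], n;
    MPI_Status status;
    /* receive *up to* 100 ints; the sender may have sent fewer */
    MPI_Recv(buf, 100, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    /* ask the status object how many entries actually arrived */
    MPI_Get_count(&status, MPI_INT, &n);
    printf("received %d ints from rank %d\n", n, status.MPI_SOURCE);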

21 Sending/Receiving MPI_Recv is a blocking receive op: it returns only after the message is in the buffer. MPI_Send has two possible implementations: (1) it returns after the matching MPI_Recv has been issued and the message has been sent; (2) it returns after MPI_Send has copied the message into a buffer, without waiting for the matching MPI_Recv to be issued.

22 Avoiding Deadlocks Process 0 sends 2 messages to process 1, which receives them in the reverse order.

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    }
    else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

If MPI_Send is implemented by blocking until the matching receive is issued, then process 0 waits for a receive of the tag-1 message while process 1 waits for process 0 to send the tag-2 message. Deadlock. "Solution": the programmer has to match the order in which sends and receives are issued. Ugh! There is no deadlock if MPI_Send is implemented with buffering.

23 Circular Deadlock Process i sends a message to process i + 1 and receives a message from process i - 1.

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    ...

Deadlock if MPI_Send is blocking; works if it is implemented using buffering. With two processes, this reduces to two processes trying to send each other messages at the same time.

24 Break the circle Break the circle into odd and even processes: odds first send and then receive, evens first receive and then send.

    int a[10], b[10], npes, myrank;
    MPI_Status status;
    ...
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank%2 == 1) {
        MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
        MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
    }
    else {
        MPI_Recv(b, 10, MPI_INT, (myrank-1+npes)%npes, 1, MPI_COMM_WORLD, &status);
        MPI_Send(a, 10, MPI_INT, (myrank+1)%npes, 1, MPI_COMM_WORLD);
    }
    ...

25 Break the circle, part II A simultaneous send/receive operation:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype senddatatype, int dest, int sendtag, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
Problem: it needs disjoint send and receive buffers. Solution: the MPI_Sendrecv_replace function, where the received data replaces the sent data in the same buffer:
int MPI_Sendrecv_replace(void *buf, int count, MPI_Datatype datatype, int dest, int sendtag, int source, int recvtag, MPI_Comm comm, MPI_Status *status)
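
A sketch of the ring example from the previous slides rewritten with MPI_Sendrecv_replace, so no manual odd/even ordering is needed (buffer a and tag 1 carried over from the earlier slides):

    int a[10], npes, myrank;
    MPI_Status status;
    MPI_Comm_size(MPI_COMM_WORLD, &npes);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    /* send a to the right neighbor and receive into the same buffer
       from the left neighbor; MPI handles the buffering internally */
    MPI_Sendrecv_replace(a, 10, MPI_INT,
                         (myrank + 1) % npes, 1,          /* dest, sendtag   */
                         (myrank - 1 + npes) % npes, 1,   /* source, recvtag */
                         MPI_COMM_WORLD, &status);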

26 Topologies and Embedding MPI sees processes arranged linearly, while parallel programs often communicate most naturally in higher-dimensional topologies (grids, tori). We need to map the linear ordering onto these topologies; several mappings are possible.

27 Solution MPI helps the programmer arrange processes into topologies by supplying library routines. The mapping to physical processors is done by the library without programmer intervention.

28 Cartesian topologies You can specify arbitrary topologies, but most topologies are grid-like (Cartesian). MPI_Cart_create takes the processes in comm_old and builds a virtual process topology:
int MPI_Cart_create(MPI_Comm comm_old, int ndims, int *dims, int *periods, int reorder, MPI_Comm *comm_cart)
The new topology information is in comm_cart. All processes belonging to comm_old must call MPI_Cart_create. ndims = # dimensions; dims = size of each dimension. The array periods specifies whether there are wraparound connections: periods[i] is true if there is a wrap in dimension i. reorder = true allows the processes to be re-ranked by MPI.
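
A short sketch creating a 2D wraparound grid (a torus); the 4 x 4 shape is an illustrative choice that assumes 16 processes:

    MPI_Comm comm_2d;
    int dims[2]    = {4, 4};   /* 4 x 4 process grid (16 processes assumed) */
    int periods[2] = {1, 1};   /* wraparound in both dimensions: a torus    */
    int reorder    = 1;        /* let MPI reorder ranks for a better mapping */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims, periods, reorder, &comm_2d);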

29 Process Naming Sources and destinations of messages are specified by ranks in MPI. MPI_Cart_rank takes the coordinates in the array coords and returns the rank. MPI_Cart_coord takes the rank of a process and returns its Cartesian coordinates in the array coords (maxdims is the length of the coords array).
int MPI_Cart_rank(MPI_Comm comm_cart, int *coords, int *rank)
int MPI_Cart_coord(MPI_Comm comm_cart, int rank, int maxdims, int *coords)

30 Shifting Want to shift data along a dimension of the topology?
int MPI_Cart_shift(MPI_Comm comm_cart, int dir, int s_step, int *rank_source, int *rank_dest)
dir = the dimension along which to shift; s_step = the size of the shift step. rank_source and rank_dest return the ranks of the processes to receive from and send to.
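
A sketch combining MPI_Cart_shift with MPI_Sendrecv_replace to push a value one step along dimension 0 of the comm_2d topology created two slides back (comm_2d and the variable x are carried over / illustrative):

    int rank_source, rank_dest, x = 0;
    MPI_Status status;
    /* neighbors one step away along dimension 0 (wraps if periodic) */
    MPI_Cart_shift(comm_2d, 0, 1, &rank_source, &rank_dest);
    /* push x to the "next" process, receive the replacement from the "previous" one */
    MPI_Sendrecv_replace(&x, 1, MPI_INT, rank_dest, 0,
                         rank_source, 0, comm_2d, &status);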

31 Overlapping communication with computation Blocking sends/receives do not permit overlap; we need non-blocking functions. MPI_Isend starts a send but returns before it is complete. MPI_Irecv starts a receive but returns before the data is received. MPI_Test tests whether a non-blocking operation has completed. MPI_Wait waits until a non-blocking operation finishes (don't say it).

32 More non-blocking
int MPI_Isend(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm, MPI_Request *request)
int MPI_Irecv(void *buf, int count, MPI_Datatype datatype, int source, int tag, MPI_Comm comm, MPI_Request *request)
Both allocate a request object and return a handle to it in request. The object is used as an argument to MPI_Test and MPI_Wait to identify the operation whose status we want to query or wait for.
int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
flag is set to true if the operation has finished.
int MPI_Wait(MPI_Request *request, MPI_Status *status)
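
A sketch of the overlap pattern: start both transfers, do work that does not touch the buffers, then wait. The neighbor ranks left/right and the routine do_independent_work() are hypothetical placeholders:

    int incoming, outgoing = myrank;
    MPI_Request send_req, recv_req;
    MPI_Status status;
    /* start both transfers; neither call blocks */
    MPI_Irecv(&incoming, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &recv_req);
    MPI_Isend(&outgoing, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &send_req);

    do_independent_work();   /* computation that does not use the buffers */

    /* now block until the data has really arrived (and the send buffer is reusable) */
    MPI_Wait(&recv_req, &status);
    MPI_Wait(&send_req, MPI_STATUS_IGNORE);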

33 Avoiding deadlocks Using non-blocking operations removes most deadlocks. The following code is not safe:

    int a[10], b[10], myrank;
    MPI_Status status;
    ...
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) {
        MPI_Send(a, 10, MPI_INT, 1, 1, MPI_COMM_WORLD);
        MPI_Send(b, 10, MPI_INT, 1, 2, MPI_COMM_WORLD);
    }
    else if (myrank == 1) {
        MPI_Recv(b, 10, MPI_INT, 0, 2, MPI_COMM_WORLD, &status);
        MPI_Recv(a, 10, MPI_INT, 0, 1, MPI_COMM_WORLD, &status);
    }
    ...

Replacing either the send or the receive operations with their non-blocking counterparts fixes this deadlock.

34 Collective Ops: communication and computation Collective communication operations (broadcast, reduction, etc.) are implemented by MPI. All of the operations take a communicator argument which defines the group of processes involved in the operation. The operations don't act like barriers: a process can go past the call without waiting for the other processes, but relying on that is not a great idea.

35 The collectives The barrier synchronization operation:
int MPI_Barrier(MPI_Comm comm)
The call returns only after all processes in the communicator have called the function.
The one-to-all broadcast operation:
int MPI_Bcast(void *buf, int count, MPI_Datatype datatype, int source, MPI_Comm comm)
The source sends the data in buf to all processes in the group.
The all-to-one reduction operation:
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int target, MPI_Comm comm)
Combines the elements in sendbuf of each process using op and returns the combined values in recvbuf of the process with rank target. If count is more than one, the op is applied element-wise.
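
A short sketch putting broadcast and reduction together: rank 0 broadcasts a parameter, every process computes a local value, and rank 0 collects the sum. some_local_computation() is a hypothetical helper:

    int n = 0, myrank;
    double local, total;
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
    if (myrank == 0) n = 1000;                      /* only the root knows n initially */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);   /* now every process does */

    local = some_local_computation(n, myrank);      /* hypothetical per-process work */

    /* combine the local values with MPI_SUM; the result lands only on rank 0 */
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myrank == 0) printf("total = %f\n", total);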

36 Pre-defined Reduction Types
MPI_MAX       Maximum                   C integers and floating point
MPI_MIN       Minimum                   C integers and floating point
MPI_SUM       Sum                       C integers and floating point
MPI_PROD      Product                   C integers and floating point
MPI_LAND      Logical AND               C integers
MPI_BAND      Bit-wise AND              C integers and byte
MPI_LOR       Logical OR                C integers
MPI_BOR       Bit-wise OR               C integers and byte
MPI_LXOR      Logical XOR               C integers
MPI_BXOR      Bit-wise XOR              C integers and byte
MPI_MAXLOC    Max value and location    Data pairs
MPI_MINLOC    Min value and location    Data pairs

37 More Reduction It is possible to define your own ops. The operation MPI_MAXLOC combines pairs of values (v_i, l_i) and returns the pair (v, l) such that v is the maximum among all the v_i and l is the corresponding l_i (if there is more than one, it is the smallest among all these l_i). MPI_MINLOC does the same, except for the minimum value of v_i.

38 Reduction We need MPI datatypes for the data pairs used with MPI_MAXLOC and MPI_MINLOC. MPI_2INT corresponds to the C datatype "pair of ints". The MPI_Allreduce op returns the result to all processes:
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
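
A sketch of MPI_MAXLOC with MPI_2INT: every process contributes a (value, rank) pair, and everyone learns the global maximum and which rank holds it. my_score and myrank are assumed to have been set earlier:

    struct { int value; int rank; } local, global;
    local.value = my_score;      /* hypothetical per-process value            */
    local.rank  = myrank;        /* the "location" paired with that value     */
    /* every process learns the maximum value and the rank that owns it */
    MPI_Allreduce(&local, &global, 1, MPI_2INT, MPI_MAXLOC, MPI_COMM_WORLD);
    printf("max %d is on rank %d\n", global.value, global.rank);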

39 Prefix Sum The prefix sum op is done via MPI_Scan: store the partial sum up to node i on node i.
int MPI_Scan(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
In the end, the receive buffer of the process with rank i stores the reduction of the send buffers of nodes 0 to i.
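
A tiny sketch: each process contributes myrank + 1, so after the scan process i holds 1 + 2 + ... + (i + 1). myrank is assumed to have been obtained as on the earlier slides:

    int mine = myrank + 1, prefix;
    /* inclusive prefix sum: prefix on rank i = sum of "mine" over ranks 0..i */
    MPI_Scan(&mine, &prefix, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("rank %d: prefix sum = %d\n", myrank, prefix);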

40 Gather Ops The gather operation is performed in MPI using:
int MPI_Gather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int target, MPI_Comm comm)
Each process sends the data in sendbuf to the target. The data is stored in recvbuf in rank order: the data from process i is stored starting at offset i*sendcount of recvbuf. MPI also provides the MPI_Allgather function, in which the data are gathered at all the processes:
int MPI_Allgather(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)
These ops assume all the arrays are the same size; there are vector versions of the calls (MPI_Gatherv, MPI_Allgatherv) which allow different-sized arrays.
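
A sketch of a gather: every process sends one int and rank 0 collects them in rank order. The receive buffer only matters on the target; its size of 64 is an illustrative bound on the assumed process count:

    int mine = myrank * myrank;
    int all[64];                 /* large enough for the assumed process count */
    /* recvcount is the count received FROM EACH process, not the total */
    MPI_Gather(&mine, 1, MPI_INT, all, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (myrank == 0)
        printf("value from rank 3: %d\n", all[3]);   /* assumes >= 4 processes */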

41 Scatter Op MPI_Scatter:
int MPI_Scatter(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, int source, MPI_Comm comm)
The source process sends a different part of sendbuf to each process. The received data is stored in recvbuf. A vector version, MPI_Scatterv, allows different amounts of data to be sent to different processes.
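
A sketch of the mirror-image operation: rank 0 hands each process one distinct int from a table it owns (npes and myrank assumed set as on earlier slides; 64 is again an illustrative bound):

    int table[64], mine;
    if (myrank == 0)
        for (int i = 0; i < npes; i++) table[i] = 100 + i;   /* data to distribute */
    /* process i receives table[i]; sendbuf is only significant on the source */
    MPI_Scatter(table, 1, MPI_INT, &mine, 1, MPI_INT, 0, MPI_COMM_WORLD);
    printf("rank %d got %d\n", myrank, mine);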

42 All-to-all Op The all-to-all personalized communication operation is performed by:
int MPI_Alltoall(void *sendbuf, int sendcount, MPI_Datatype senddatatype, void *recvbuf, int recvcount, MPI_Datatype recvdatatype, MPI_Comm comm)
Each process sends a different part of sendbuf to every other process: the sendcount elements starting at offset i*sendcount go to process i. The received data is stored in the recvbuf array. A vector variant (MPI_Alltoallv) exists, which allows different amounts of data to be sent.
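
A sketch with one int per destination: element j of sendbuf goes to process j, and element j of recvbuf is what process j sent to us (npes and myrank assumed set; 64 is an illustrative bound on npes):

    int sendbuf[64], recvbuf[64];            /* one slot per process assumed */
    for (int j = 0; j < npes; j++)
        sendbuf[j] = myrank * 100 + j;       /* "message from myrank to j"   */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* recvbuf[j] now holds j*100 + myrank, i.e. what process j sent us */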

43 Groups and communicators We might want to split a group of processes into subgroups:
int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)
Has to be called by all processes in the group. Partitions the processes in communicator comm into disjoint subgroups. color and key are input parameters: color defines the subgroups, key defines the rank within each subgroup. A new communicator is returned for each group in the newcomm parameter.
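
A sketch splitting MPI_COMM_WORLD into "rows" of 4 processes (4 is an illustrative row width), keeping the original rank order inside each row:

    MPI_Comm row_comm;
    int color = myrank / 4;   /* processes with the same color land in the same subgroup */
    int key   = myrank;       /* order ranks within the subgroup by the old rank          */
    MPI_Comm_split(MPI_COMM_WORLD, color, key, &row_comm);

    int row_rank, row_size;
    MPI_Comm_rank(row_comm, &row_rank);   /* rank within my row */
    MPI_Comm_size(row_comm, &row_size);   /* processes in my row */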

44 MPI_Comm_split

45 Splitting Cartesian Topologies MPI_Cart_sub splits a Cartesian topology into smaller topologies:
int MPI_Cart_sub(MPI_Comm comm_cart, int *keep_dims, MPI_Comm *comm_subcart)
The array keep_dims tells us how to break up the topology. The original topology is stored in comm_cart; comm_subcart stores the new sub-topologies.

46 Splitting Cartesian Topologies The array keep_dims tells us how: if keep_dims[i] is true, the i-th dimension is kept in the new sub-topology.
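
A sketch continuing the 2D example from the MPI_Cart_create slide: keep only dimension 1, so each row of the grid becomes its own 1D sub-topology (comm_2d is the communicator assumed created earlier):

    MPI_Comm row_comm;
    int keep_dims[2] = {0, 1};   /* drop dimension 0, keep dimension 1 */
    /* each row of the 2D grid gets its own 1D communicator in row_comm */
    MPI_Cart_sub(comm_2d, keep_dims, &row_comm);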

