Introduction to MPI Programming


1 Introduction to MPI Programming
(Part II) Michael Griffiths, Deniz Savas & Alan Real, January 2006

2 Overview
Review of point-to-point communications:
- Data types
- Data packing
Collective communication:
- Broadcast, scatter & gather of data
- Reduction operations
- Barrier synchronisation

3 Blocking operations
Relate to when the operation has completed: the call only returns from the subroutine when the operation has completed.

4 Non-blocking operations
Return straight away and allow the sub-program to continue and perform other work. At some later time the sub-program should test or wait for the completion of the non-blocking operation. A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation. Non-blocking operations are not the same as sequential subroutine calls, as the operation continues after the call has returned.

5 Sender modes and completion
Sender mode                MPI call (F/C)   Completion status
Synchronous send           MPI_Ssend        Only completes when the receive has completed.
Buffered send              MPI_Bsend        Always completes (unless an error occurs), irrespective of receiver.
Standard send              MPI_Send         Can be synchronous or buffered (often implementation dependent).
Ready send                 MPI_Rsend        Always completes (unless an error occurs), irrespective of whether the receive has completed.
Blocking send and receive  MPI_Sendrecv     Completes when the message has arrived and been received by the paired process.
Non-blocking send          MPI_Isend        Begins a non-blocking send.

6 Non-blocking communication
Separate communication into three phases:
1. Initiate non-blocking communication.
2. Do some work (perhaps involving other communications).
3. Wait for the non-blocking communication to complete.

7 Non-blocking send
[Diagram: processes in MPI_COMM_WORLD; one process initiates a Send and obtains a request handle, the partner posts a Receive, and the sender later Waits on the request.]
Send is initiated and returns straight away. The sending process can do other things, and can test later whether the operation has completed.

8 Non-blocking receive
[Diagram: processes in MPI_COMM_WORLD; one process initiates a Receive and obtains a request handle, the partner Sends, and the receiver later Waits on the request.]
Receive is initiated and returns straight away. The receiving process can do other things, and can test later whether the operation has completed.

9 The Request Handle
A non-blocking call takes the same arguments as the blocking call, plus an additional request handle:
- In C/C++ it is of type MPI_Request / MPI::Request
- In Fortran it is an INTEGER
The request handle is allocated when a communication is initiated and can be queried to test whether the non-blocking operation has completed.

10 Non-blocking synchronous send
Fortran:
CALL MPI_ISSEND(buf, count, datatype, dest, tag, comm, request, error)
CALL MPI_WAIT(request, status, error)
C:
MPI_Issend(&buf, count, datatype, dest, tag, comm, &request);
MPI_Wait(&request, &status);
C++:
request = comm.Issend(&buf, count, datatype, dest, tag);
request.Wait();

11 Non-blocking synchronous receive
Fortran:
CALL MPI_IRECV(buf, count, datatype, src, tag, comm, request, error)
CALL MPI_WAIT(request, status, error)
C:
MPI_Irecv(&buf, count, datatype, src, tag, comm, &request);
MPI_Wait(&request, &status);
C++:
request = comm.Irecv(&buf, count, datatype, src, tag);
request.Wait(status);
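A minimal C sketch combining the two calls above: two ranks exchange one integer using a non-blocking synchronous send and a non-blocking receive (the buffer names and the assumption of exactly two processes are illustrative).

/* Sketch: two ranks exchange one integer with non-blocking calls.
   Assumes exactly two processes; buffer names are illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, other, sendval, recvval;
    MPI_Request sreq, rreq;
    MPI_Status  status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other   = 1 - rank;            /* partner rank (0 <-> 1) */
    sendval = rank;

    /* Initiate the non-blocking receive and synchronous send */
    MPI_Irecv(&recvval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &rreq);
    MPI_Issend(&sendval, 1, MPI_INT, other, 0, MPI_COMM_WORLD, &sreq);

    /* ... other work could be done here ... */

    /* Wait for both operations to complete */
    MPI_Wait(&sreq, &status);
    MPI_Wait(&rreq, &status);

    printf("Rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}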

12 Blocking v Non-blocking
Send and receive can be blocking or non-blocking. A blocking send can be used with a non-blocking receive, and vice versa. Non-blocking sends can use any mode:

Operation         Fortran/C call   C++ call
Standard send     MPI_Send(…)      Comm.Send(…)
Synchronous send  MPI_Ssend(…)     Comm.Ssend(…)
Buffered send     MPI_Bsend(…)     Comm.Bsend(…)
Ready send        MPI_Rsend(…)     Comm.Rsend(…)
Receive           MPI_Recv(…)      Comm.Recv(…)

Synchronous mode affects completion, not initiation. A non-blocking call followed by an explicit wait is identical to the corresponding blocking communication.

13 Completion
Can either wait or test for completion:
Fortran (LOGICAL flag):
CALL MPI_WAIT(request, status, ierror)
CALL MPI_TEST(request, flag, status, ierror)
C (int flag):
MPI_Wait(&request, &status);
MPI_Test(&request, &flag, &status);
C++ (bool flag):
request.Wait();          flag = request.Test();          (for sends)
request.Wait(status);    flag = request.Test(status);    (for receives)
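A hedged C sketch of testing for completion while other work continues (the do_other_work routine is a hypothetical placeholder, not part of MPI):

/* Sketch: overlap work with a pending send by polling MPI_Test.
   do_other_work() is a hypothetical placeholder for useful computation. */
#include <mpi.h>

void send_with_overlap(int *buf, int count, int dest, int tag, MPI_Comm comm)
{
    MPI_Request request;
    MPI_Status  status;
    int flag = 0;

    MPI_Issend(buf, count, MPI_INT, dest, tag, comm, &request);
    while (!flag) {
        /* do_other_work();  placeholder for computation to overlap */
        MPI_Test(&request, &flag, &status);   /* flag != 0 once the send completes */
    }
}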

14 Other related wait and test routines
If multiple non-blocking calls are issued:
MPI_TESTANY: Tests whether any one of a list of requests (they could be send or receive requests) has completed.
MPI_WAITANY: Waits until any one of the list of requests has completed.
MPI_TESTALL: Tests whether all the requests in a list have completed.
MPI_WAITALL: Waits until all the requests in a list have completed (see the sketch below).
MPI_PROBE, MPI_IPROBE: Allow incoming messages to be checked for without actually receiving them. Note that MPI_PROBE is blocking: it waits until there is something to probe for.
MPI_CANCEL: Cancels a pending communication. A last-resort, clean-up operation!
These routines take an array of requests and can return an array of statuses; the 'any' routines return the index of the completed operation.
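A minimal C sketch of MPI_Waitall with an array of requests (the number of requests, sources and message lengths here are illustrative):

/* Sketch: issue several non-blocking receives and wait for all of them. */
#include <mpi.h>

#define NREQS 4

void receive_all(int bufs[NREQS][100], MPI_Comm comm)
{
    MPI_Request requests[NREQS];
    MPI_Status  statuses[NREQS];
    int i;

    for (i = 0; i < NREQS; i++)
        MPI_Irecv(bufs[i], 100, MPI_INT, MPI_ANY_SOURCE, 0, comm, &requests[i]);

    /* Blocks until every request in the array has completed */
    MPI_Waitall(NREQS, requests, statuses);
}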

15 Merging send and receive operations into a single unit
The following is the syntax of the MPI_Sendrecv command:
In C:
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype, int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype, int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
In Fortran:
<sendtype> sendbuf(:)
<recvtype> recvbuf(:)
INTEGER sendcount, sendtype, dest, sendtag, recvcount, recvtype
INTEGER source, recvtag, comm, status(MPI_STATUS_SIZE), ierror
MPI_SENDRECV(sendbuf, sendcount, sendtype, dest, sendtag, recvbuf, recvcount, recvtype, source, recvtag, comm, status, ierror)

16 Important Notes about MPI_Sendrecv
Beware! A message sent by MPI_Sendrecv is receivable by a regular receive operation if the destination and tag match.
MPI_PROC_NULL can be specified for the destination or source to allow one-directional working (useful for the very end-nodes in non-circular communication; see the sketch below). Any communication with MPI_PROC_NULL returns immediately with no effect, but as if the operation had been successful. This can make programming easier.
The send and receive buffers must not overlap; they must be separate memory locations. This restriction can be avoided by using the MPI_Sendrecv_replace routine.
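A C sketch of the non-circular case described above: each rank passes a value to its right-hand neighbour, with the end ranks using MPI_PROC_NULL so the same call works everywhere (the data being shifted is illustrative).

/* Sketch: shift one value "to the right" along a non-circular chain of ranks.
   The first and last ranks use MPI_PROC_NULL so one call works for every rank. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, left, right, sendval, recvval = -1;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;
    sendval = rank;

    /* Send to the right neighbour, receive from the left neighbour */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, &status);

    printf("Rank %d received %d\n", rank, recvval);
    MPI_Finalize();
    return 0;
}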

17 Data Packing
Up until now we have only seen contiguous data of pre-defined data types being communicated by MPI calls. This can be rather restricting if what we intend to transfer involves structures of data made up of mixtures of primitive data types, such as an integer count followed by a sequence of real numbers.
One solution to this problem is to use the MPI_PACK and MPI_UNPACK routines. The philosophy is similar to Fortran writes/reads to/from internal buffers and the scanf function in C.
MPI_PACK can be called consecutively to compress data into a send buffer; the resulting buffer of data can then be sent by using MPI_SEND (or equivalent) with the datatype set to MPI_PACKED. At the receiving end it can be received by using MPI_RECV with the datatype MPI_PACKED. The received data can then be unpacked by using MPI_UNPACK to recover the originally packed data.
This method of working can also improve communications efficiency by reducing the number of 'send-receive' calls: there are usually fixed overheads associated with setting up communications that would cause inefficiencies if the sent/received messages are too small.

18 MPI_Pack
Fortran:
<type> INBUF(:), OUTBUF(:)
INTEGER INCOUNT, DATATYPE, OUTSIZE, POSITION, COMM, IERROR
MPI_PACK(INBUF, INCOUNT, DATATYPE, OUTBUF, OUTSIZE, POSITION, COMM, IERROR)
C:
int MPI_Pack(void *inbuf, int incount, MPI_Datatype datatype, void *outbuf, int outsize, int *position, MPI_Comm comm)
Packs the message in inbuf, of type datatype and length incount, and stores it in outbuf. outsize is the maximum length of outbuf in bytes, rather than its actual size. On entry, position indicates the starting location in outbuf where data will be written. On exit, position points to the first free position in outbuf following the location occupied by the packed message. This can then be used directly as the position parameter for the next MPI_PACK call.

19 MPI_Unpack
Fortran:
<type> INBUF(:), OUTBUF(:)
INTEGER INSIZE, POSITION, OUTCOUNT, DATATYPE, COMM, IERROR
MPI_UNPACK(INBUF, INSIZE, POSITION, OUTBUF, OUTCOUNT, DATATYPE, COMM, IERROR)
C:
int MPI_Unpack(void *inbuf, int insize, int *position, void *outbuf, int outcount, MPI_Datatype datatype, MPI_Comm comm)
Unpacks the message which is in inbuf, as data of type datatype and length outcount, and stores it in outbuf. On entry, position indicates the starting location in inbuf from where data will be read. On exit, position points to the first position of the next set of data in inbuf. This can then be used directly as the position parameter for the next MPI_UNPACK call.
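A C sketch of the pack/send/receive/unpack pattern described on the Data Packing slide: an integer count followed by that many doubles is packed into one MPI_PACKED message (the buffer size, tag and helper function names are illustrative).

/* Sketch: pack an integer count followed by that many doubles into one
   MPI_PACKED message, send it, and unpack it on the receiver. */
#include <mpi.h>

#define BUFSIZE 1024   /* illustrative maximum packed size in bytes */

void send_packed(int n, double *vals, int dest, MPI_Comm comm)
{
    char buffer[BUFSIZE];
    int  position = 0;

    MPI_Pack(&n,   1, MPI_INT,    buffer, BUFSIZE, &position, comm);
    MPI_Pack(vals, n, MPI_DOUBLE, buffer, BUFSIZE, &position, comm);
    MPI_Send(buffer, position, MPI_PACKED, dest, 0, comm);
}

void recv_packed(int *n, double *vals, int source, MPI_Comm comm)
{
    char buffer[BUFSIZE];
    int  position = 0;
    MPI_Status status;

    MPI_Recv(buffer, BUFSIZE, MPI_PACKED, source, 0, comm, &status);
    MPI_Unpack(buffer, BUFSIZE, &position, n,    1,  MPI_INT,    comm);
    MPI_Unpack(buffer, BUFSIZE, &position, vals, *n, MPI_DOUBLE, comm);
}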

20 Collective Communication

21 Overview
Introduction & characteristics
Barrier synchronisation
Global reduction operations
Predefined operations
Broadcast
Scatter
Gather
Partial sums
Exercise

22 Collective communications
Are higher-level routines involving several processes at a time.
Can be built out of point-to-point communications.
Examples are: barriers, broadcast, reduction operations.

23 Collective Communication
Communications involving a group of processes, called by all processes in a communicator.
Examples:
- Broadcast, scatter, gather (data distribution)
- Global sum, global maximum, etc. (reduction operations)
- Barrier synchronisation
Characteristics:
- Collective communication will not interfere with point-to-point communication and vice versa.
- All processes must call the collective routine.
- Synchronisation is not guaranteed (except for barrier).
- No non-blocking collective communication.
- No tags.
- Receive buffers must be exactly the right size.

24 Collective Communications (one for all, all for one!!!)
Collective communication is defined as that which involves all the processes in a group. Collective communication routines can be divided into the following broad categories:
- Barrier synchronisation
- Broadcast from one to all
- Scatter from one to all
- Gather from all to one
- Scatter/gather from all to all
- Global reduction (distribute elementary operations)
IMPORTANT NOTE: Collective communication operations and the point-to-point operations we have seen earlier are invisible to each other and hence do not interfere with each other. This is important for avoiding deadlocks due to interference.

25 BARRIER SYNCHRONIZATION
[Diagram: timeline of seven processes reaching a BARRIER statement.] Here, there are seven processes running and three of them are waiting idle at the barrier statement for the other four to catch up.

26 Graphic Representations of Collective Communication Types
[Diagram: data movement across processes for BROADCAST, SCATTER, GATHER, ALLGATHER and ALLTOALL, showing how the data items are distributed among the processes in each case.]

27 Barrier Synchronisation
Each process in the communicator waits at the barrier until all processes encounter the barrier.
Fortran:
INTEGER comm, error
CALL MPI_BARRIER(comm, error)
C:
MPI_Barrier(MPI_Comm comm);
C++:
Comm.Barrier();
E.g.: MPI::COMM_WORLD.Barrier();
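A C sketch of one common use of a barrier, not taken from the slides: lining ranks up before timing a region with MPI_Wtime (the compute routine is a hypothetical placeholder).

/* Sketch: use MPI_Barrier to line processes up before timing a code region. */
#include <mpi.h>
#include <stdio.h>

void timed_region(MPI_Comm comm)
{
    double t0, t1;
    int rank;

    MPI_Comm_rank(comm, &rank);

    MPI_Barrier(comm);        /* all ranks start the clock together */
    t0 = MPI_Wtime();

    /* compute();  placeholder for the work being timed */

    MPI_Barrier(comm);        /* wait for the slowest rank before stopping */
    t1 = MPI_Wtime();
    if (rank == 0)
        printf("Elapsed time: %f seconds\n", t1 - t0);
}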

28 Global reduction operations
Used to compute a result involving data distributed over a group of processes:
- Global sum or product
- Global maximum or minimum
- Global user-defined operation

29 Predefined operations
MPI name (F/C)   MPI name (C++)   Function
MPI_MAX          MPI::MAX         Maximum
MPI_MIN          MPI::MIN         Minimum
MPI_SUM          MPI::SUM         Sum
MPI_PROD         MPI::PROD        Product
MPI_LAND         MPI::LAND        Logical AND
MPI_BAND         MPI::BAND        Bitwise AND
MPI_LOR          MPI::LOR         Logical OR
MPI_BOR          MPI::BOR         Bitwise OR
MPI_LXOR         MPI::LXOR        Logical exclusive OR
MPI_BXOR         MPI::BXOR        Bitwise exclusive OR
MPI_MAXLOC       MPI::MAXLOC      Maximum and location
MPI_MINLOC       MPI::MINLOC      Minimum and location

30 MPI_Reduce
Performs count operations (o) on individual elements of sendbuf across processes.
[Diagram: three ranks hold (A B C), (D E F) and (G H I); after MPI_REDUCE the root holds (AoDoG, BoEoH, CoFoI).]

31 MPI_Reduce syntax
Fortran:
INTEGER count, rtype, root, comm, error
CALL MPI_REDUCE(sbuf, rbuf, count, rtype, op, root, comm, error)
C:
MPI_Reduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm);
C++:
Comm::Reduce(const void* sbuf, void* recvbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op, int root);

32 MPI_Reduce example
Integer global sum:
Fortran:
INTEGER x, result, error
CALL MPI_REDUCE(x, result, 1, MPI_INTEGER, MPI_SUM, 0, MPI_COMM_WORLD, error)
C:
int x, result;
MPI_Reduce(&x, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
C++:
MPI::COMM_WORLD.Reduce(&x, &result, 1, MPI::INT, MPI::SUM, 0);
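A C sketch showing the element-wise behaviour from slide 30: with count = 3 the reduction is applied separately to each element of the send buffer (the local data values are illustrative).

/* Sketch: element-wise global sum of a 3-element array onto rank 0,
   i.e. result[i] = sum over ranks of x[i]. */
#include <mpi.h>

void array_sum(MPI_Comm comm)
{
    int rank;
    double x[3], result[3];

    MPI_Comm_rank(comm, &rank);
    x[0] = rank; x[1] = 2.0 * rank; x[2] = 3.0 * rank;   /* illustrative data */

    /* count = 3: the sum is formed separately for each element */
    MPI_Reduce(x, result, 3, MPI_DOUBLE, MPI_SUM, 0, comm);
}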

33 MPI_Allreduce
No root process: all processes get the result of the reduction operation.
[Diagram: three ranks hold (A B C), (D E F) and (G H I); after MPI_ALLREDUCE every rank holds the reduced values, e.g. AoDoG.]

34 MPI_Allreduce syntax
Fortran:
INTEGER count, rtype, comm, error
CALL MPI_ALLREDUCE(sbuf, rbuf, count, rtype, op, comm, error)
C:
MPI_Allreduce(void *sbuf, void *rbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm);
C++:
Comm.Allreduce(const void* sbuf, void* recvbuf, int count, const MPI::Datatype& datatype, const MPI::Op& op);
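A C sketch of a typical MPI_Allreduce use: a global dot product where every rank needs the answer, so one Allreduce replaces a Reduce followed by a Bcast (the local array arguments are illustrative).

/* Sketch: global dot product of two distributed vectors; every rank
   receives the full sum. */
#include <mpi.h>

double parallel_dot(const double *a, const double *b, int nlocal, MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    int i;

    for (i = 0; i < nlocal; i++)
        local += a[i] * b[i];

    /* Every process receives the reduced result */
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}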

35 Practice Session 3 Using reduction operations
This example shows the use of the continued fraction method of calculating pi and makes each processor calculate a different portion of the expansion series.
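One possible shape for such a program, as a hedged C sketch: each rank sums a different portion of a series for pi and MPI_Reduce combines the partial sums on rank 0. The Leibniz series used here is illustrative and not necessarily the exact expansion used in the exercise.

/* Sketch of the exercise's structure: distribute series terms across ranks,
   then reduce the partial sums. The series itself is illustrative. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, i;
    int nterms = 1000000;
    double partial = 0.0, pi = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank takes terms i = rank, rank+size, rank+2*size, ... */
    for (i = rank; i < nterms; i += size)
        partial += (i % 2 == 0 ? 4.0 : -4.0) / (2.0 * i + 1.0);

    MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("pi is approximately %.12f\n", pi);

    MPI_Finalize();
    return 0;
}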

36 Broadcast
Duplicates data from the root process to the other processes in the communicator.
[Diagram: rank 0 holds A; after Broadcast every rank holds A.]

37 Broadcast syntax
Fortran:
INTEGER count, datatype, root, comm, error
CALL MPI_BCAST(buffer, count, datatype, root, comm, error)
C:
MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
C++:
Comm.Bcast(void* buffer, int count, const MPI::Datatype& datatype, int root);
E.g. broadcasting 10 integers from rank 0:
int tenints[10];
MPI::COMM_WORLD.Bcast(&tenints, 10, MPI::INT, 0);

38 Scatter
Distributes data from the root process amongst the processes within the communicator.
[Diagram: rank 0 holds A B C D; after Scatter rank 0 holds A, rank 1 holds B, rank 2 holds C, rank 3 holds D.]

39 Scatter syntax
scount (and rcount) is the number of elements each process is sent (i.e. = number received).
Fortran:
INTEGER scount, stype, rcount, rtype, root, comm, error
CALL MPI_SCATTER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
C:
MPI_Scatter(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);
C++:
Comm.Scatter(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);
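A C sketch of a scatter with one element per process; the send buffer only has to be meaningful on the root (the values and function name are illustrative).

/* Sketch: rank 0 scatters one integer to every process in the communicator. */
#include <mpi.h>
#include <stdlib.h>

void scatter_one_each(MPI_Comm comm)
{
    int rank, size, mine, i;
    int *all = NULL;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    if (rank == 0) {                     /* root fills the send buffer */
        all = malloc(size * sizeof(int));
        for (i = 0; i < size; i++)
            all[i] = 100 + i;            /* illustrative data */
    }

    /* scount = rcount = 1: each process receives one element */
    MPI_Scatter(all, 1, MPI_INT, &mine, 1, MPI_INT, 0, comm);

    free(all);                           /* free(NULL) is safe on non-root ranks */
}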

40 Gather
Collects data distributed amongst processes in the communicator onto the root process (collection is done in rank order).
[Diagram: rank 0 holds A, rank 1 holds B, rank 2 holds C, rank 3 holds D; after Gather rank 0 holds A B C D.]

41 Gather syntax
Takes the same arguments as the Scatter operation.
Fortran:
INTEGER scount, stype, rcount, rtype, root, comm, error
CALL MPI_GATHER(sbuf, scount, stype, rbuf, rcount, rtype, root, comm, error)
C:
MPI_Gather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, int root, MPI_Comm comm);
C++:
Comm.Gather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype, int root);

42 All Gather
Collects all data on all processes in the communicator.
[Diagram: rank 0 holds A, rank 1 holds B, rank 2 holds C, rank 3 holds D; after Allgather every rank holds A B C D.]

43 All Gather syntax
As Gather, but no root is defined.
Fortran:
INTEGER scount, stype, rcount, rtype, comm, error
CALL MPI_ALLGATHER(sbuf, scount, stype, rbuf, rcount, rtype, comm, error)
C:
MPI_Allgather(void *sbuf, int scount, MPI_Datatype stype, void *rbuf, int rcount, MPI_Datatype rtype, MPI_Comm comm);
C++:
Comm.Allgather(const void* sbuf, int scount, const MPI::Datatype& stype, void* rbuf, int rcount, const MPI::Datatype& rtype);
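A C sketch of an allgather where every process contributes its rank number and every process ends up with the full list (the function name is illustrative).

/* Sketch: every process contributes one integer and every process
   receives the complete gathered array (no root). */
#include <mpi.h>
#include <stdlib.h>

void allgather_ranks(MPI_Comm comm)
{
    int rank, size;
    int *everyone;

    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    everyone = malloc(size * sizeof(int));

    /* After the call, everyone[i] == i on every rank */
    MPI_Allgather(&rank, 1, MPI_INT, everyone, 1, MPI_INT, comm);

    free(everyone);
}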

