1
Basic examples with MPI
M. Garbey
Reference: http://www.mcs.anl.gov/mpi/

Plan:
I. Introduction: Programming Model
II. Basic MPI Commands
III. Examples
IV. Collective Communications
V. More on Communication Modes
VI. References on MPI
2
I. Introduction - Definition: model for a sequential program
The program is executed by one and only one processor.
All variables and constants of the program are allocated in central memory.
[Figure: a single processing element (PE) attached to a memory, running one program.]
3
I. Introduction - Message Passing Programming Model
The program is written in a classical language (Fortran, C, C++, ...).
The computer is an ensemble of processors with an arbitrary interconnection topology.
Each processor has its own medium-size local memory.
Each processor executes its own program.
Processors communicate by message passing.
Any processor can send a message to any other processor.
There are no shared resources (CPU, memory, ...).
4
I. Introduction - Message Passing Programming Model
[Figure: processes 0-3, each with its own memory and program, connected by a network.]
5
I. Introduction - Message Passing Programming Model
[Figure: processes 0-3, each with its own memory, running a single program on multiple data (SPMD), connected by a network.]
6
I. Introduction - Execution model: SPMD (Single Program Multiple Data)
The same program is executed by all the processors.
Most computers can run this model.
It is a particular case of MPMD, but SPMD can emulate MPMD:
   if processor is in set A then do piece of code A
   if processor is in set B then do piece of code B
   ...
A minimal sketch of this rank-based branching follows below.
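The sketch below is an illustration (not from the original slides) of SPMD emulating MPMD: the same program runs everywhere and the process rank selects the piece of code to execute; splitting the ranks into a first and a second half is just an arbitrary choice for the example.

   program spmd_example
      implicit none
      include 'mpif.h'
      integer rank, nb_procs, code
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      if (rank .lt. nb_procs/2) then
         ! "set A": the first half of the processes runs piece of code A
         print *, 'process', rank, 'runs code A'
      else
         ! "set B": the second half of the processes runs piece of code B
         print *, 'process', rank, 'runs code B'
      end if
      call MPI_FINALIZE(code)
   end program spmd_example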
7
I. Introduction - Process = Basic Unit of Computation
A program written in a "standard" sequential language with library calls to implement message passing.
A process executes on a node; other processes may execute simultaneously on other nodes.
A process communicates and synchronizes with other processes via messages.
A process is uniquely identified by its label.
A process does not migrate.
8
I. Introduction
Processes communicate and synchronize with each other by sending and receiving messages (no global variables or shared memory).
Processes execute independently and asynchronously (no global synchronizing clock).
Processes may be unique and work on their own data set.
Any process may communicate with any other process (a priori, no limitation on message passing).
9
I. Introduction - Common Communication Patterns
One processor to one processor
One processor to many processors (input data)
Many processors to one processor (printing results, global operations)
Many processors to many processors (algorithm step: FFT, ...)
10
II. 6 basic functions of MPI
MPI_Init       initialize MPI
MPI_Comm_Size  gives the number of processes
MPI_Comm_Rank  gives the rank of the process
MPI_Send       send a message
MPI_Recv       receive a message
MPI_Finalize   end the MPI environment

   integer code
c -- start MPI
   call MPI_INIT(code)
   ...
   call MPI_FINALIZE(code)
c -- end MPI
11
II. 6 basic functions of MPI
MPI_Comm_Size  gives the number of processes
MPI_Comm_Rank  gives the rank of the process

   integer nb_procs, rank, code
c -- gives the number of processes running in the code:
   call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
c -- gives the rank of the process running this function:
   call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

NOTE: 0 <= rank <= nb_procs - 1
NOTE: MPI_COMM_WORLD is the communicator for the set of all processes running in the code.
12
II. 6 basic functions of MPI

   program who_i_am
      implicit none
      include 'mpif.h'
      integer nb_procs, rank, code
      call MPI_INIT(code)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nb_procs, code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      print *, 'I am the process ', rank, ' among ', nb_procs
      call MPI_FINALIZE(code)
   end program who_i_am

> mpirun -np 4 who_i_am
I am the process 3 among 4
I am the process 0 among 4
I am the process 2 among 4
I am the process 1 among 4
13
II. 6 basic functions of MPI
MPI_Send  send a message
MPI_Recv  receive a message
[Figure: six processes 0-5; process 1 sends the value 1000 to process 5.]
14
II. 6 basic functions of MPI
MPI_Send  send a message
MPI_Recv  receive a message

   program node_to_node
      implicit none
      include 'mpif.h'
      integer status(MPI_STATUS_SIZE)
      integer code, rank, value, tag
      parameter (tag=100)
      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)
      if (rank .eq. 1) then
         value = 1000
         call MPI_SEND(value, 1, MPI_INTEGER, 5, tag, MPI_COMM_WORLD, code)
      elseif (rank .eq. 5) then
         call MPI_RECV(value, 1, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, status, code)
      end if
      call MPI_FINALIZE(code)
   end program node_to_node
15
II. 6 basic functions of MPI
The message consists of 1 element of type MPI_INTEGER (the variable value); each message should have a tag.
This communication protocol is a synchronous send and a synchronous receive.
MPI_SEND(value, 1, MPI_INTEGER, 5, tag, MPI_COMM_WORLD, code) blocks the execution of the code until the send is completed: value can then be reused, but there is no guarantee that the message has been received.
MPI_RECV(value, 1, MPI_INTEGER, 1, tag, MPI_COMM_WORLD, status, code) blocks the execution of the code until the receive is completed.
NOTE: at the beginning, use print commands to check that things are OK!
16
II. 6 basic functions of MPI
For a communication to succeed:
- the sender must specify a valid destination rank
- the receiver must specify a valid source rank (may use the wildcard MPI_ANY_SOURCE)
- the communicator must be the same
- tags must match (may use the wildcard MPI_ANY_TAG)
- message types must match
- the receiver's buffer must be large enough
17
II. 6 basic functions of MPI
MPI Basic Datatypes in Fortran

MPI Datatype          Fortran Datatype
MPI_INTEGER           INTEGER
MPI_REAL              REAL
MPI_DOUBLE_PRECISION  DOUBLE PRECISION
MPI_COMPLEX           COMPLEX
MPI_LOGICAL           LOGICAL
MPI_CHARACTER         CHARACTER(1)
18
III. The matrix multiply example
Preliminary: TIMER
In Fortran: double precision MPI_WTIME()
Time is measured in seconds.
The time to perform a task is measured by consulting the timer before and after.
Modify your program to measure its execution time and print it out.
Example:
   tstart = MPI_WTIME()
   ... (work to be timed) ...
   tend = MPI_WTIME()
   print *, ' node ', myid, ', time=', tend-tstart, ' seconds '
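As an illustration, here is a minimal self-contained timing sketch along the lines of the fragment above (not from the original slides); the loop being timed is only a placeholder for real work.

   program timing
      implicit none
      include 'mpif.h'
      integer code, myid, i
      double precision tstart, tend, s
      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, code)
      tstart = MPI_WTIME()
      ! placeholder for the task to be timed
      s = 0.d0
      do i = 1, 1000000
         s = s + 1.d0/dble(i)
      end do
      tend = MPI_WTIME()
      print *, ' node ', myid, ', time=', tend-tstart, ' seconds '
      call MPI_FINALIZE(code)
   end program timing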
19
III. The matrix multiply example
Simple matrix multiply algorithm:
- Matrix A is copied to every processor j = 1..np.
- Matrix B is divided into blocks of columns and distributed to the processors.
- Each processor performs the matrix multiply between A and its block of B simultaneously.
- Output the solutions.
[Figure: A times the column blocks 1-4 of B gives the corresponding column blocks 1-4 of C.]
20
III. The matrix multiply example
Master: distribute the work to the workers, collect the results, and output the solution.
Master sends a copy of A to every worker:
   do dest = 1, numworkers
      call MPI_SEND(a, nra*nca, mpi_double_precision, dest, mtype, mpi_comm_world, ierr)
   end do
Worker: receive a copy of A from the master:
   call MPI_RECV(a, nra*nca, mpi_double_precision, master, mtype, mpi_comm_world, status, ierr)
21
III. The matrix multiply example
Master: distribute blocks of columns of B to the workers.
Master sends the number of columns (cols) and the starting column (offset):
   do dest = 1, numworkers
      call MPI_SEND(offset, 1, mpi_integer, dest, mtype, mpi_comm_world, ierr)
      call MPI_SEND(cols, 1, mpi_integer, dest, mtype, mpi_comm_world, ierr)
   end do
Master sends the corresponding values of B to the workers:
   do dest = 1, numworkers
      call MPI_SEND(b(1,offset), cols*nca, mpi_double_precision, dest, mtype, mpi_comm_world, ierr)
   end do
22
III. The matrix multiply example
Workers receive the data:
   call MPI_RECV(offset, 1, mpi_integer, master, mtype, mpi_comm_world, status, ierr)
   call MPI_RECV(cols, 1, mpi_integer, master, mtype, mpi_comm_world, status, ierr)
   call MPI_RECV(b, cols*nca, mpi_double_precision, master, mtype, mpi_comm_world, status, ierr)
Workers do the matrix multiply on their block:
   do k = 1, cols
      do i = 1, nra
         c(i,k) = 0.d0
         do j = 1, nca
            c(i,k) = c(i,k) + a(i,j) * b(j,k)
         end do
      end do
   end do
23
III. The matrix multiply example
Workers send the results for their block back to the master:
   call MPI_SEND(c, cols*nra, mpi_double_precision, master, mtype, mpi_comm_world, ierr)
Master receives the results from the workers:
   do i = 1, numworkers
      call MPI_RECV(c(1,offset), cols*nra, mpi_double_precision, i, mtype, mpi_comm_world, status, ierr)
   end do
Remark: Fortran is not case sensitive.
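To tie the fragments on the preceding slides together, here is a condensed, self-contained sketch of the whole master/worker matrix multiply. It is a reconstruction under simplifying assumptions (fixed matrix sizes, ncb divisible by the number of workers, at least one worker, and each worker sending its offset back just before its block of C), not the author's exact program.

   program mat_mult
      implicit none
      include 'mpif.h'
      integer, parameter :: nra=4, nca=4, ncb=8, master=0, mtype=1
      double precision a(nra,nca), b(nca,ncb), c(nra,ncb)
      integer status(MPI_STATUS_SIZE)
      integer ierr, rank, nprocs, numworkers, dest, src
      integer offset, cols, i, j, k
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, nprocs, ierr)
      ! run with at least 2 processes; ncb must be divisible by numworkers
      numworkers = nprocs - 1
      cols = ncb / numworkers
      if (rank .eq. master) then
         do j = 1, nca                  ! initialize A and B on the master
            do i = 1, nra
               a(i,j) = dble(i+j)
            end do
         end do
         do j = 1, ncb
            do i = 1, nca
               b(i,j) = dble(i*j)
            end do
         end do
         offset = 1
         do dest = 1, numworkers        ! send A, the starting column, and the B block
            call MPI_SEND(a, nra*nca, MPI_DOUBLE_PRECISION, dest, mtype, MPI_COMM_WORLD, ierr)
            call MPI_SEND(offset, 1, MPI_INTEGER, dest, mtype, MPI_COMM_WORLD, ierr)
            call MPI_SEND(b(1,offset), cols*nca, MPI_DOUBLE_PRECISION, dest, mtype, MPI_COMM_WORLD, ierr)
            offset = offset + cols
         end do
         do src = 1, numworkers         ! collect the C blocks back, worker by worker
            call MPI_RECV(offset, 1, MPI_INTEGER, src, mtype, MPI_COMM_WORLD, status, ierr)
            call MPI_RECV(c(1,offset), cols*nra, MPI_DOUBLE_PRECISION, src, mtype, MPI_COMM_WORLD, status, ierr)
         end do
         print *, 'c(1,1) =', c(1,1), ' c(nra,ncb) =', c(nra,ncb)
      else
         call MPI_RECV(a, nra*nca, MPI_DOUBLE_PRECISION, master, mtype, MPI_COMM_WORLD, status, ierr)
         call MPI_RECV(offset, 1, MPI_INTEGER, master, mtype, MPI_COMM_WORLD, status, ierr)
         call MPI_RECV(b, cols*nca, MPI_DOUBLE_PRECISION, master, mtype, MPI_COMM_WORLD, status, ierr)
         do k = 1, cols                 ! local multiply: C block = A * B block
            do i = 1, nra
               c(i,k) = 0.d0
               do j = 1, nca
                  c(i,k) = c(i,k) + a(i,j)*b(j,k)
               end do
            end do
         end do
         call MPI_SEND(offset, 1, MPI_INTEGER, master, mtype, MPI_COMM_WORLD, ierr)
         call MPI_SEND(c, cols*nra, MPI_DOUBLE_PRECISION, master, mtype, MPI_COMM_WORLD, ierr)
      end if
      call MPI_FINALIZE(ierr)
   end program mat_mult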
24
IV. Collective Communications
Collective calls:
- substitute for a more complex sequence of point-to-point calls
- involve all the processes in a process group
- are called by all processes in a communicator
- block until they are locally complete
- need receive buffers of exactly the right size
- need no message tags
Collective calls are divided into three subsets: synchronization, data movement, and global computation.
25
IV. Collective Communications: Barrier Synchronization Routines
To synchronize all processes within a communicator:
   call MPI_BARRIER(comm, ierr)
A communicator is a group of processes and a context of communication.
The base group is the group that contains all processes; it is associated with the MPI_COMM_WORLD communicator.
A node calling MPI_BARRIER will be blocked until all nodes within the group have called it.
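A minimal barrier example (an illustration, not from the original slides): no process reaches the second print before every process has entered the barrier.

   program barrier_demo
      implicit none
      include 'mpif.h'
      integer ierr, rank
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      print *, 'process', rank, 'before the barrier'
      ! nobody passes this point until every process has reached it
      call MPI_BARRIER(MPI_COMM_WORLD, ierr)
      print *, 'process', rank, 'after the barrier'
      call MPI_FINALIZE(ierr)
   end program barrier_demo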
26
IV. Collective Communications: Broadcast and Scatter
Broadcast: one processor sends some data to all processors in a group.
   call MPI_BCAST(buffer, count, datatype, root, comm, ierr)
MPI_BCAST must be called by each node in the group, specifying the same communicator and root. The message is sent from the root process to all processes in the group, including the root process.
Scatter: data are distributed into n equal segments, where the ith segment is sent to the ith process in the group, which has n processes.
   call MPI_SCATTER(sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm, ierr)
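A small self-contained sketch combining both calls (an illustration, assumed to run with mpirun -np 4): the root broadcasts one integer, then scatters a 16-element array in segments of 4.

   program bcast_scatter
      implicit none
      include 'mpif.h'
      integer, parameter :: root = 0
      integer ierr, rank, n, i
      real sbuf(16), rbuf(4)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      ! broadcast: one integer goes from the root to every process
      n = 0
      if (rank .eq. root) n = 100
      call MPI_BCAST(n, 1, MPI_INTEGER, root, MPI_COMM_WORLD, ierr)
      ! scatter: the root's 16 reals are cut into segments of 4,
      ! segment i going to process i (hence the 4-process assumption)
      if (rank .eq. root) then
         do i = 1, 16
            sbuf(i) = real(i)
         end do
      end if
      call MPI_SCATTER(sbuf, 4, MPI_REAL, rbuf, 4, MPI_REAL, root, MPI_COMM_WORLD, ierr)
      print *, 'process', rank, 'got n =', n, ' and segment', rbuf
      call MPI_FINALIZE(ierr)
   end program bcast_scatter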
27
IV. Collective Communications: Gather
Gather: data are collected into a specified process in the order of process rank; gather is the reverse of scatter.
   call MPI_GATHER(sbuf, scount, sdatatype, rbuf, rcount, rdatatype, root, comm, ierr)
Example: the data in Proc. 0 are {1,2}, in Proc. 1: {3,4}, in Proc. 2: {5,6}, ..., in Proc. 5: {11,12}; then
   real sbuf(2), rbuf(12)
   call MPI_GATHER(sbuf, 2, MPI_REAL, rbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
will bring {1,2,3,4,5,6,...,11,12} into Proc. 3. Similarly, the inverse transfer is:
   call MPI_SCATTER(rbuf, 2, MPI_REAL, sbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
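A self-contained version of this gather example (an illustrative reconstruction, to be run with mpirun -np 6; the receive buffer is only significant on the root, process 3):

   program gather6
      implicit none
      include 'mpif.h'
      integer ierr, rank
      real sbuf(2), rbuf(12)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      ! the process of rank r holds {2r+1, 2r+2}, as in the slide
      sbuf(1) = real(2*rank + 1)
      sbuf(2) = real(2*rank + 2)
      ! gather all segments, in rank order, into rbuf on process 3
      call MPI_GATHER(sbuf, 2, MPI_REAL, rbuf, 2, MPI_REAL, 3, MPI_COMM_WORLD, ierr)
      if (rank .eq. 3) print *, 'gathered on 3:', rbuf
      call MPI_FINALIZE(ierr)
   end program gather6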
28
IV. Collective Communications
Two more MPI functions, MPI_Allgather and MPI_Alltoall:
   MPI_ALLTOALL(sbuf, scount, stype, rbuf, rcount, rtype, comm, ierr)
sbuf: starting address of send buffer
scount: number of elements sent to each process
stype: data type of send buffer elements
rbuf: starting address of receive buffer
rcount: number of elements received from any process
rtype: data type of receive buffer elements
comm: communicator
To summarize:
[Figure: the data movement of Broadcast, Scatter, Gather, Allgather and All-to-All, illustrated with two processes p0 and p1.]
29
IV. Collective Communications: Global Reduction Routines
The partial result in each process in the group is combined together using some desired function. The operation function passed to a global computation routine is either a predefined MPI function or a user-supplied function.
Examples: global sum or product, global maximum or minimum, global user-defined operation.
   MPI_REDUCE(sbuf, rbuf, count, stype, op, root, comm, ierr)
   MPI_ALLREDUCE(sbuf, rbuf, count, stype, op, comm, ierr)
30
IV. Collective Communications: Global Reduction Routines
sbuf: address of send buffer
rbuf: address of receive buffer
count: the number of elements in the send buffer
stype: the data type of the elements of the send buffer
op: the reduce operation function, predefined or user-defined
root: the rank of the root process
comm: communicator
MPI_REDUCE returns the result to a single process; MPI_ALLREDUCE returns the result to all processes in the group.
31
IV. Collective Communications: Global Reduction Routines
[Figure: element-wise reduction of a buffer of integers held by p0, p1 and p2.
 MPI_Reduce(sendbuf, recvbuf, 4, MPI_INT, MPI_MAX, 0, comm) leaves the element-wise maximum on process 0 only;
 MPI_Allreduce(sendbuf, recvbuf, 4, MPI_INT, MPI_SUM, comm) leaves the element-wise sum on every process.]
32
IV. Collective Communications: Global Reduction Routines - Example
c A subroutine that computes the dot product of two vectors that are distributed across
c a group of processes and returns the answer at node zero:
   subroutine PAR_BLAS1(N, a, b, scalar_product, comm)
      include 'mpif.h'
      integer N, comm, ierr, I
      real a(N), b(N), sum, scalar_product
      sum = 0.0
      do I = 1, N
         sum = sum + a(I) * b(I)
      end do
      call MPI_REDUCE(sum, scalar_product, 1, MPI_REAL, MPI_SUM, 0, comm, ierr)
      return
   end
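A possible driver for this subroutine (an illustration; the local length and the vector values are made up, and the corrected subroutine above is assumed to be compiled with it): with p processes, each local slice contributes 2*N, so node 0 prints 2*N*p.

   program dot_driver
      implicit none
      include 'mpif.h'
      integer, parameter :: N = 100        ! local length on each process
      real a(N), b(N), scalar_product
      integer ierr, rank, i
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      do i = 1, N                          ! each process fills its own slice
         a(i) = 1.0
         b(i) = 2.0
      end do
      call PAR_BLAS1(N, a, b, scalar_product, MPI_COMM_WORLD)
      ! the answer is significant only on node 0
      if (rank .eq. 0) print *, 'dot product =', scalar_product
      call MPI_FINALIZE(ierr)
   end program dot_driver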
33
IV. Collective Communications: Global Reduction Routines
Predefined Reduce Operations

MPI Name   Function       MPI Name   Function
MPI_MAX    Maximum        MPI_LOR    Logical OR
MPI_MIN    Minimum        MPI_LAND   Logical AND
MPI_SUM    Sum            MPI_PROD   Product
34
V. More on Communication Modes
So far we have seen the standard SEND and RECEIVE functions; however, we need to know more in order to overlap communications with computations, and more generally to optimize the code.
Blocking Calls
A blocking send or receive call suspends execution of the user's program until the message buffer being sent/received is safe to use.
In the case of a blocking send, this means the data to be sent have been copied out of the send buffer, but they have not necessarily been received by the receiving task. The contents of the send buffer can then be modified without affecting the message that was sent.
A blocking receive implies that the data in the receive buffer are valid.
35
V. More on Communication Modes
Blocking Communication Modes:
Synchronous Send, MPI_SSEND: returns when the message buffer can be safely reused. The sending task tells the receiver that a message is ready for it and waits for the receiver to acknowledge. System overhead: buffer to network and vice versa. Synchronization overhead: handshake plus waiting. Safe and portable.
Buffered Send, MPI_BSEND: returns when the message has been copied to the system buffer.
Standard Send, MPI_SEND: either synchronous or buffered, implemented by the vendor to give good performance for most programs. In MPICH we do have a buffered send.
36
V. More on Communication Modes
Non-Blocking Calls
Non-blocking calls return immediately after initiating the communication. In order to reuse the send message buffer, the programmer must check its status. The programmer can choose to block before the message buffer is used, or to test for the status of the message buffer.
A blocking or non-blocking send can be paired with a blocking or non-blocking receive.
Syntax:
   call MPI_ISEND(buf, count, datatype, dest, tag, comm, handle, ierr)
   call MPI_IRECV(buf, count, datatype, src, tag, comm, handle, ierr)
37
V. More on Communication Modes
Non-Blocking Calls
The programmer can block or check the status of the message buffer:
MPI_WAIT(request, status): this routine blocks until the communication associated with the handle request has completed. It is useful when the data in the communication buffer are about to be re-used.
MPI_TEST(request, flag, status): this routine does not block. The request handle will have been returned by an earlier call to a non-blocking communication routine. The routine queries completion of the communication, and the result (true or false) is returned in flag.
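As an illustration of these calls (a sketch, not from the original slides; run with mpirun -np 2), two processes exchange a buffer with MPI_ISEND/MPI_IRECV and only block in MPI_WAIT at the point where the buffers are actually needed; independent computation could be placed between the posts and the waits.

   program nonblocking
      implicit none
      include 'mpif.h'
      integer ierr, rank, other, req_s, req_r
      integer status(MPI_STATUS_SIZE)
      real sbuf(1000), rbuf(1000)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      other = mod(rank+1, 2)        ! the partner process, assuming 2 processes
      sbuf = real(rank)
      ! post the receive and the send; both calls return immediately
      call MPI_IRECV(rbuf, 1000, MPI_REAL, other, 1, MPI_COMM_WORLD, req_r, ierr)
      call MPI_ISEND(sbuf, 1000, MPI_REAL, other, 1, MPI_COMM_WORLD, req_s, ierr)
      ! ... computation that does not touch sbuf or rbuf could go here ...
      ! block only when the buffers are actually needed again
      call MPI_WAIT(req_r, status, ierr)
      call MPI_WAIT(req_s, status, ierr)
      print *, 'process', rank, 'received', rbuf(1)
      call MPI_FINALIZE(ierr)
   end program nonblocking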
38
V. More on Communication Modes
Deadlock
- All tasks are waiting for events that haven't been initiated.
- Common in SPMD programs with blocking communication, e.g. every task sends, but none receives.
- Occurs when insufficient system buffer space is available.
Remedies:
- Arrange for one task to receive first.
- Use MPI_Sendrecv.
- Use non-blocking communication.
39
V. More on Communication Modes
Examples: Deadlock
c Improper use of blocking calls results in deadlock; run on two nodes
c author: Roslyn Leibensperger (CTC)
   program deadlock
      implicit none
      include 'mpif.h'
      integer MSGLEN, ITAG_A, ITAG_B
      parameter (MSGLEN = 2048, ITAG_A = 100, ITAG_B = 200)
      real rmsg1(MSGLEN), rmsg2(MSGLEN)
      integer irank, idest, isrc, istag, iretag, istatus(MPI_STATUS_SIZE), ierr, I
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, irank, ierr)
      do I = 1, MSGLEN
         rmsg1(I) = 100
         rmsg2(I) = -100
      end do
40
V. More on Communication Modes
Example: Deadlock (cont'd)
      if (irank .eq. 0) then
         idest  = 1
         isrc   = 1
         istag  = ITAG_A
         iretag = ITAG_B
      else if (irank .eq. 1) then
         idest  = 0
         isrc   = 0
         istag  = ITAG_B
         iretag = ITAG_A
      end if
      print *, ' Task ', irank, ' has sent the message '
      call MPI_SSEND(rmsg1, MSGLEN, MPI_REAL, idest, istag, MPI_COMM_WORLD, ierr)
      call MPI_RECV(rmsg2, MSGLEN, MPI_REAL, isrc, iretag, MPI_COMM_WORLD, istatus, ierr)
      print *, ' Task ', irank, ' has received the message '
      call MPI_FINALIZE(ierr)
   end
41
V. More on Communication Modes
Examples: Deadlock (fixed)
c Solution program showing the use of a non-blocking send to eliminate deadlock
c author: Roslyn Leibensperger (CTC)
   program fixed
      implicit none
      include 'mpif.h'
      -----------------------
      print *, ' Task ', irank, ' has started the send '
      call MPI_ISEND(rmsg1, MSGLEN, MPI_REAL, idest, istag, MPI_COMM_WORLD, irequest, ierr)
      call MPI_RECV(rmsg2, MSGLEN, MPI_REAL, isrc, iretag, MPI_COMM_WORLD, irstatus, ierr)
      call MPI_WAIT(irequest, isstatus, ierr)
      print *, ' Task ', irank, ' has completed the send '
      call MPI_FINALIZE(ierr)
   end
42
V. More on Communication Modes
Sendrecv
- Useful for executing a shift operation across a chain of processes.
- The system takes care of the possible deadlock due to the blocking calls.
   call MPI_SENDRECV(sbuf, scount, stype, dest, stag, rbuf, rcount, rtype, source, rtag, comm, status, ierr)
- sbuf (rbuf): initial address of the send (receive) buffer
- scount (rcount): number of elements in the send (receive) buffer
- stype (rtype): type of the elements in the send (receive) buffer
- stag (rtag): send (receive) tag
- dest: rank of destination
- source: rank of source
- comm: communicator
- status: status object
43
V. More on Communication Modes

   program sendrecv
      implicit none
      include 'mpif.h'
      integer, dimension(MPI_STATUS_SIZE) :: status
      integer, parameter :: tag = 100
      integer :: rank, value, num_proc, code

      call MPI_INIT(code)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, code)

      ! we suppose that we have only two processes
      num_proc = mod(rank+1, 2)

      call MPI_SENDRECV(rank+1000, 1, MPI_INTEGER, num_proc, tag, value, 1, MPI_INTEGER, num_proc, tag, MPI_COMM_WORLD, status, code)

      print *, 'me, process', rank, ', i have received', value, ' from process', num_proc
      call MPI_FINALIZE(code)
   end program sendrecv

> mpirun -np 2 sendrecv
me, process 1, i have received 1000 from process 0
me, process 0, i have received 1001 from process 1

Remark: if blocking MPI_SEND calls were used in this code, we would have a deadlock, because each process would wait for a matching receive that is never posted!
44
V. More on Communication Modes
Optimizations
Optimization must be a main concern when communication time becomes a significant part of the computation time.
Optimization of communications may be accomplished at different levels; the main ones are:
1. Overlap communication with computation.
2. Avoid, if possible, copies of the message into temporary memory (buffering).
3. Minimize the additional costs induced by calling communication subroutines too often, as in the sketch below.
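As a sketch of point 3 (an illustration, not from the original slides; run with mpirun -np 2), the same data are sent once as many one-element messages and once as a single message; the single message avoids the repeated subroutine-call and latency costs.

   program aggregate
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10000
      integer ierr, rank, i
      integer status(MPI_STATUS_SIZE)
      double precision buf(n), t1, t2, t3
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, rank, ierr)
      buf = 1.d0
      ! n messages of one element each: many calls, many latencies
      t1 = MPI_WTIME()
      if (rank .eq. 0) then
         do i = 1, n
            call MPI_SEND(buf(i), 1, MPI_DOUBLE_PRECISION, 1, 1, MPI_COMM_WORLD, ierr)
         end do
      else if (rank .eq. 1) then
         do i = 1, n
            call MPI_RECV(buf(i), 1, MPI_DOUBLE_PRECISION, 0, 1, MPI_COMM_WORLD, status, ierr)
         end do
      end if
      t2 = MPI_WTIME()
      ! the same data as a single message: one call and one latency on each side
      if (rank .eq. 0) then
         call MPI_SEND(buf, n, MPI_DOUBLE_PRECISION, 1, 2, MPI_COMM_WORLD, ierr)
      else if (rank .eq. 1) then
         call MPI_RECV(buf, n, MPI_DOUBLE_PRECISION, 0, 2, MPI_COMM_WORLD, status, ierr)
      end if
      t3 = MPI_WTIME()
      if (rank .eq. 1) print *, n, ' small messages:', t2-t1, ' s, one large message:', t3-t2, ' s'
      call MPI_FINALIZE(ierr)
   end program aggregate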