its.unc.edu 1 Point to Point Communication
University of North Carolina - Chapel Hill
ITS Research Computing
Instructor: Mark Reed

its.unc.edu 2 Send Types
• Synchronous sends: provide information about the completion of the message
• Asynchronous sends: only know when the message has left the sender

its.unc.edu 3 Blocking Operations
• Relate to when the operation has completed.
• Only return from the function call when the operation has completed.

its.unc.edu 4 Non-Blocking Operations
• Return immediately and allow the sub-program to continue and perform other work.
• At some later time the sub-program can test or wait for the completion of the non-blocking operation.
• These are used to overlap communication and computation.

its.unc.edu 5 Non-Blocking Operations (cont'd)
• All non-blocking operations should have matching wait operations. Some systems cannot free resources until wait has been called.
• A non-blocking operation immediately followed by a matching wait is equivalent to a blocking operation.
• Non-blocking operations are not the same as sequential subroutine calls, as the operation continues after the call has returned.

its.unc.edu 6 Point-to-Point Communication
• Communication between exactly two processes.
• Source process sends message to destination process.
• Communication takes place within a communicator.
• Destination process is identified by its rank in the communicator.

its.unc.edu 7 For a communication to succeed:
• Sender must specify a valid destination rank.
• Receiver must specify a valid source rank.
• The communicator must be the same.
• Tags must match.
• Message types must match.
• Receiver's buffer must be large enough.
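
As a minimal illustration in C (a sketch assuming at least two ranks in MPI_COMM_WORLD; the tag 42 and the payload are arbitrary choices):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* rank 0 sends one int with tag 42; rank 1 posts the matching receive:
       same communicator, matching tag, matching type, buffer large enough */
    if (rank == 0) {
        value = 123;
        MPI_Send(&value, 1, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 42, MPI_COMM_WORLD, &status);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}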

its.unc.edu 8 Wildcarding
• Receiver can wildcard.
• To receive from any source: MPI_ANY_SOURCE
• To receive with any tag: MPI_ANY_TAG
• Actual source and tag are returned in the receiver's status parameter.
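
For example, a sketch in C of a rank-0 collector that accepts messages from the other ranks in whatever order they arrive (the tags and variable names are arbitrary):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        /* every non-zero rank sends its rank number, tagged with its rank */
        MPI_Send(&rank, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);
    } else {
        int i, value;
        MPI_Status status;
        for (i = 1; i < size; i++) {
            /* accept the messages in whatever order they arrive */
            MPI_Recv(&value, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d with tag %d\n",
                   value, status.MPI_SOURCE, status.MPI_TAG);
        }
    }

    MPI_Finalize();
    return 0;
}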

its.unc.edu 9 Rules of PTP communication
• Messages are non-overtaking: order is preserved.
• Progress: one of a pair of matching send/receives is guaranteed to complete, independently of other system actions.
• Fairness: there is no guarantee of fairness; an individual send or receive may never be satisfied due to the messages generated by a third process.

its.unc.edu 10 Communication modes
• Synchronous send: only completes when the receive has begun
• Buffered send: always completes (unless an error occurs), irrespective of receiver
• Standard send: either synchronous or buffered
• Ready send: always completes (unless an error occurs), irrespective of receiver
• Receive: completes when a message has arrived

its.unc.edu 11 Synchronous Send
• MPI_Ssend
• Call does not complete until an acknowledgement is returned by the receiver (i.e. a "handshake")
• "safer" - system is not overloaded with undelivered messages; more predictable; synchronized
• can be slower or faster: sender idles - reduces comp/comm overlap; non-buffered - can write directly out of memory

its.unc.edu 12 Buffered Send
• MPI_Bsend
• Copies message to a system buffer for later transmission - completes immediately
• not synchronized, local semantics
• User must explicitly allocate buffer space by calling MPI_Buffer_attach

its.unc.edu 13 MPI_BUFFER_ATTACH
• int MPI_Buffer_attach (void *buffer, int size)
  buffer - unused array allocated by user
  size - size of buffer in bytes
• int MPI_Buffer_detach (void *buffer_addr, int *size)
  all communications already using the buffer are allowed to complete before the buffer is detached
  does not deallocate memory
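
A C sketch of the attach / Bsend / detach sequence (assuming at least two ranks; the message, tag, and variable names are invented for illustration):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* buffer must hold the message plus MPI's per-message overhead */
        int bufsize = sizeof(int) + MPI_BSEND_OVERHEAD;
        char *buf = malloc(bufsize);
        MPI_Buffer_attach(buf, bufsize);

        value = 7;
        MPI_Bsend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* completes locally */

        /* detach waits for buffered messages to drain and returns the pointer
           and size, but does not free the memory itself */
        MPI_Buffer_detach(&buf, &bufsize);
        free(buf);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}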

its.unc.edu 14 Ready Send
• MPI_Rsend
• assumes receiver is ready - a posted receive is the user's responsibility
• behavior is undefined if the receive is not ready
• removes handshake, allows for potentially faster sends on some systems
• completes immediately
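
One way to satisfy the "receive already posted" requirement is sketched below in C: the receiver pre-posts its receive and then sends a small ready message back (the ranks, tags, and variable names are illustrative assumptions):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, data = 0, ack = 0;
    MPI_Request recv_req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        /* post the receive first, then tell rank 0 we are ready */
        MPI_Irecv(&data, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, &recv_req);
        MPI_Send(&ack, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    } else if (rank == 0) {
        /* the ack proves the matching receive is already posted,
           so a ready send is now legal */
        MPI_Recv(&ack, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        data = 99;
        MPI_Rsend(&data, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}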

its.unc.edu 15 Combined send/receive
• MPI_Sendrecv
• blocking send and receive
• avoids deadlock from lack of buffering
• send and receive buffers differ
• use MPI_Sendrecv_replace to reuse the same buffer

its.unc.edu 16 MPI_Sendrecv syntax
int MPI_Sendrecv(void *sendbuf, int sendcount, MPI_Datatype sendtype,
                 int dest, int sendtag,
                 void *recvbuf, int recvcount, MPI_Datatype recvtype,
                 int source, int recvtag,
                 MPI_Comm comm, MPI_Status *status)
• sendbuf, sendcount, sendtype - send triplet as in standard send
• dest - rank of destination
• sendtag - send tag
• recvbuf, recvcount, recvtype - receive triplet as in standard receive
• source - rank of source
• recvtag - receive tag
• comm - communicator
• status - status object
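
For example, a ring shift sketched in C (assuming MPI_COMM_WORLD and an arbitrary tag of 0):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;          /* destination */
    int left  = (rank + size - 1) % size;   /* source */
    sendval = rank;

    /* shift one value around the ring; MPI pairs the send and receive
       internally, so this cannot deadlock the way back-to-back blocking
       send/recv calls can */
    MPI_Sendrecv(&sendval, 1, MPI_INT, right, 0,
                 &recvval, 1, MPI_INT, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    printf("rank %d received %d from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}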

its.unc.edu 17 Non-Blocking Communications
• Separate communication into three phases:
  1) Initiate non-blocking communication.
  2) Do some work.
  3) Wait for non-blocking communication to complete.

its.unc.edu 18 Initiate non-blocking communication
• Immediate modes:
  MPI_Isend - standard send
  MPI_Issend - synchronous send
  MPI_Ibsend - buffered send
  MPI_Irsend - ready send
  MPI_Irecv - receive
• can mix and match blocking and non-blocking sends and receives
• additional argument: a request is returned by the system to identify the communication

its.unc.edu 19 Testing for completion
• int MPI_Wait(MPI_Request *request, MPI_Status *status)
  blocks until the request has completed
  other variants: MPI_Waitany, MPI_Waitsome, MPI_Waitall
• int MPI_Test(MPI_Request *request, int *flag, MPI_Status *status)
  returns immediately
  flag is true if the request has completed, else false
  other variants: MPI_Testany, MPI_Testsome, MPI_Testall
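
The three phases from slide 17 might look like this in C, sketched as a non-blocking ring exchange (the neighbor arithmetic, tag, and variable names are illustrative; the "do some work" step is left empty):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, sendval, recvval;
    MPI_Request reqs[2];
    MPI_Status  stats[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;
    sendval = rank;

    /* 1) initiate the non-blocking communication */
    MPI_Irecv(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* 2) do some work here while the messages are in flight */

    /* 3) wait for completion before reusing either buffer */
    MPI_Waitall(2, reqs, stats);

    printf("rank %d received %d\n", rank, recvval);

    MPI_Finalize();
    return 0;
}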

its.unc.edu 20 Persistent Communication
• used for communication requests repeatedly executed with the same arguments
• non-blocking communication
• rationale is to offer a (modest) optimization for calls with the same message signature
• like a "half-channel" since it just binds on one side (local semantics)
• for more see MPI_Send_init, MPI_Recv_init, MPI_Start, MPI_Startall
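
A sketch of the persistent pattern in C (the ring exchange, tag, and iteration count are illustrative assumptions, not from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, iter, sendval, recvval;
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int right = (rank + 1) % size;
    int left  = (rank + size - 1) % size;

    /* bind the message arguments once ("half-channel" on each side) ... */
    MPI_Recv_init(&recvval, 1, MPI_INT, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Send_init(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    for (iter = 0; iter < 10; iter++) {
        sendval = rank * 100 + iter;
        /* ... then just restart the same requests every iteration */
        MPI_Startall(2, reqs);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    }
    printf("rank %d last received %d\n", rank, recvval);

    /* persistent requests must be freed explicitly */
    MPI_Request_free(&reqs[0]);
    MPI_Request_free(&reqs[1]);

    MPI_Finalize();
    return 0;
}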

its.unc.edu 21 Example: 1-D smoothing
• Consider a simple smoothing operation used in image processing
• Replace each element of an image (here a 1-D array) with the average of itself and its two neighbors
• domain decomposition
• repeat for some number of iterations
• use "ghost cells" for the two boundary cells (i.e. the ends of the subarray)

its.unc.edu 22 Domain Decomposition
(figure: the 1-D array split into blocks across processors 0, 1, ..., with "ghost cells" for the boundaries of each subarray)

its.unc.edu 23 Initial Algorithm
for (iterations)
  update all cells
  send boundary values to neighbors
  receive ghost cells from neighbors

its.unc.edu 24 Smooth.f

      program smooth
      implicit none
      include "mpif.h"
      real, allocatable :: image(:), overimage(:)
      real tinv
      integer myblock, npts, npes, mype, msgtag, iprev, inext
      integer i, istat, irecvstat(MPI_STATUS_SIZE)
      integer iter, numiters
c simple 1-D smoothing for image processing.  Replace each
c element with the average of itself and each adjacent element
c initialize mpi, set the npes and mype variables
      call MPI_Init(istat)
      if (istat.ne.0) stop "ERROR: can't initialize mpi"
      call MPI_Comm_Size(MPI_COMM_WORLD, npes, istat)
      call MPI_Comm_Rank(MPI_COMM_WORLD, mype, istat)

its.unc.edu 25 Smooth.f

c enter the size of the block on each processor
c this ensures all are the same size
c note msgtag is defined for all processors
      msgtag = 100
      if (mype.eq.0) then
         print *, "enter the block size"
         read (*,*) myblock
c now distribute it to all the processors
         do i=1,npes-1
            call MPI_Send (myblock,1,mpi_integer,i,msgtag,
     $           mpi_comm_world,istat)
         enddo
      else
         call MPI_Recv (myblock,1,mpi_integer,0,msgtag,
     $           mpi_comm_world,irecvstat,istat)
      endif

its.unc.edu 26 Smooth.f

c initialize variables on all processors and allocate arrays
      npts = myblock*npes
      allocate (image(myblock), overimage(0:myblock+1))
      numiters = 3
      msgtag = 105
      iprev = mod(mype+npes-1,npes)
      inext = mod(mype+npes+1,npes)
c do initialization independent of num processors it runs on
c but it does depend on the total number of points
c initialize to 0 or 1 for even or odd
      do i=1,myblock
         image(i) = mod(mype*myblock+i,2)
         overimage(i) = image(i)
      enddo
      overimage(0) = mod(mype*myblock,2)
      overimage(myblock+1) = mod((mype+1)*myblock+1,2)

its.unc.edu 27 Smooth.f

      tinv = 1.0/3.0
      do iter=1,numiters
c smooth the data
      do i=1,myblock
         image(i) = (overimage(i-1)+overimage(i)+overimage(i+1))*tinv
      enddo
c update overimage, excluding the ghost cells
      do i=1,myblock
         overimage(i) = image(i)
      enddo

its.unc.edu 28 Smooth.f

c now send and receive the ghost cells
c note: this can DEADLOCK depending on implementation
c everyone but the end sends to the next pe
c everyone but the first sends to the previous processor
      if (mype.ne.npes-1) then
         call mpi_send (image(myblock),1,mpi_real,inext,msgtag,
     &        mpi_comm_world,istat)
         call mpi_recv (overimage(myblock+1),1,mpi_real,inext,msgtag,
     &        mpi_comm_world,irecvstat,istat)
      endif
      if (mype.ne.0) then
         call mpi_send (image(1),1,mpi_real,iprev,msgtag,
     &        mpi_comm_world,istat)
         call mpi_recv (overimage(0),1,mpi_real,iprev,msgtag,
     &        mpi_comm_world,irecvstat,istat)
      endif
      enddo                     ! end do iter
c call barrier and exit
      call mpi_barrier(MPI_COMM_WORLD,istat)
      call mpi_finalize (istat)
      end

its.unc.edu 29 Sample Results of Smoothing
(figures: the input image and the output after 1000 iterations)

its.unc.edu 30 Drawbacks of this Algorithm
• Deadlock can occur depending on the implementation of MPI (i.e. if MPI_SEND blocks)
• There is no communication/computation overlap, that is, no latency hiding
• Final processor must receive first and then propagate sends back down the line

its.unc.edu 31 Drawbacks
(timeline figure, ranks 0-3 vs. time: the last rank must receive first, and the sends then propagate back down the line before each rank can begin its next iteration)

its.unc.edu 32 Solution to this puzzle?
• Use non-blocking communication, e.g. immediate sends and receives
• avoids deadlock
• allows overlap of computation and communication

its.unc.edu 33 Improved Algorithm
for (iterations)
  update boundary (ghost) cells
  initiate sending of boundary cells to neighbors
  initiate receipt of ghost cells from neighbors
  update non-boundary values
  wait for completion of sending of boundary values
  wait for completion of receipt of ghost values

its.unc.edu 34 Smooth.immed.f

      program smooth
      implicit none
      include "mpif.h"
      real, allocatable :: image(:), overimage(:)
      real tinv
      integer myblock, npts, npes, mype, msgtag, iprev, inext
      integer i, istat, irecvstat(MPI_STATUS_SIZE)
      integer iter, numiters
      integer nsendreq, psendreq, nrecvreq, precvreq
c simple 1-D smoothing for image processing
c replace each element with the average of itself and each
c adjacent element
c initialize mpi, set the npes and mype variables
      call MPI_Init(istat)
      if (istat.ne.0) stop "ERROR: can't initialize mpi"
      call MPI_Comm_Size(MPI_COMM_WORLD, npes, istat)
      call MPI_Comm_Rank(MPI_COMM_WORLD, mype, istat)

its.unc.edu 35 Smooth.immed.f

c enter the size of the block on each processor
c this ensures all are the same size
c note msgtag is defined for all processors
      msgtag = 100
      if (mype.eq.0) then
         print *, "enter the block size"
         read (*,*) myblock
c now distribute it to all the pe's
         do i=1,npes-1
            call MPI_Send (myblock,1,mpi_integer,i,msgtag,
     $           mpi_comm_world,istat)
         enddo
      else
         call MPI_Recv (myblock,1,mpi_integer,0,msgtag,
     $           mpi_comm_world,irecvstat,istat)
      endif

its.unc.edu 36 Smooth.immed.f

c initialize variables on all processors and allocate arrays
      npts = myblock*npes
      allocate (image(myblock), overimage(0:myblock+1))
      numiters = 3
      msgtag = 105
      iprev = mod(mype+npes-1,npes)
      inext = mod(mype+npes+1,npes)
c make initialization independent of num processors it
c runs on - but it does depend on the total number of points
c initialize to 0 or 1 for even or odd
      do i=1,myblock
         image(i) = mod(mype*myblock+i,2)
         overimage(i) = image(i)
      enddo
      overimage(0) = mod(mype*myblock,2)
      overimage(myblock+1) = mod((mype+1)*myblock+1,2)

its.unc.edu 37 Smooth.immed.f

      tinv = 1.0/3.0
c smooth the data on the boundary cells
      do iter=1,numiters
      image(1) = (overimage(0)+overimage(1)+overimage(2))*tinv
      image(myblock) = (overimage(myblock-1)+overimage(myblock)+
     $     overimage(myblock+1))*tinv

its.unc.edu 38 Smooth.immed.f

c now send and receive the ghost cells
c everyone but the end sends to the next processor
c everyone but the first sends to the previous processor
      if (mype.ne.npes-1) then
         call mpi_isend (image(myblock),1,mpi_real,inext,msgtag,
     &        mpi_comm_world,nsendreq,istat)
      endif
      if (mype.ne.0) then
         call mpi_isend (image(1),1,mpi_real,iprev,msgtag,
     &        mpi_comm_world,psendreq,istat)
      endif
      if (mype.ne.npes-1) then
         call mpi_irecv (overimage(myblock+1),1,mpi_real,inext,
     $        msgtag,mpi_comm_world,nrecvreq,istat)
      endif
      if (mype.ne.0) then
         call mpi_irecv (overimage(0),1,mpi_real,iprev,msgtag,
     &        mpi_comm_world,precvreq,istat)
      endif

its.unc.edu 39 Smooth.immed.f

c smooth data excluding the ghost cells which are done already
      do i=2,myblock-1
         image(i) = (overimage(i-1)+overimage(i)+overimage(i+1))*tinv
      enddo
c update overimage, excluding the ghost cells
      do i=1,myblock
         overimage(i) = image(i)
      enddo
c wait only on the requests this rank actually started
      if (mype.ne.npes-1) then
         call mpi_wait(nsendreq, irecvstat, istat)
         call mpi_wait(nrecvreq, irecvstat, istat)
      endif
      if (mype.ne.0) then
         call mpi_wait(psendreq, irecvstat, istat)
         call mpi_wait(precvreq, irecvstat, istat)
      endif
      enddo                     ! end of do iter
c call barrier and exit
      call mpi_barrier(MPI_COMM_WORLD,istat)
      call mpi_finalize (istat)
      end