Asynchronous I/O with MPI Anthony Danalis

Basic Non-Blocking API
 MPI_Isend()
 MPI_Irecv()
 MPI_Wait()
 MPI_Waitall()
 MPI_Waitany()
 MPI_Test()
 MPI_Testall()
 MPI_Testany()
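
A minimal, self-contained C sketch of the basic pattern (the lecture's examples use the Fortran bindings; the neighbor choice and buffer size here are illustrative assumptions): each rank posts MPI_Irecv and MPI_Isend, then completes both with MPI_Waitall.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;
    double sendbuf[10], recvbuf[10];
    MPI_Request reqs[2];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int to   = (rank + 1) % size;           /* illustrative neighbor choice */
    int from = (rank + size - 1) % size;

    for (int i = 0; i < 10; i++) sendbuf[i] = rank + i;

    /* Post the receive first, then the send; neither call blocks. */
    MPI_Irecv(recvbuf, 10, MPI_DOUBLE, from, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, 10, MPI_DOUBLE, to,   0, MPI_COMM_WORLD, &reqs[1]);

    /* Complete both operations. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    printf("rank %d received first element %.1f from rank %d\n",
           rank, recvbuf[0], from);
    MPI_Finalize();
    return 0;
}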

API: sending
MPI_ISEND(buf, count, datatype, dest, tag, comm, request)
  [IN  buf]       initial address of send buffer (choice)
  [IN  count]     number of elements in send buffer (integer)
  [IN  datatype]  datatype of each send buffer element (handle)
  [IN  dest]      rank of destination (integer)
  [IN  tag]       message tag (integer)
  [IN  comm]      communicator (handle)
  [OUT request]   communication request (handle)
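
For reference, the corresponding C binding (the listing above uses the MPI standard's language-independent notation; the const qualifier on buf appears in MPI-3 and later):

int MPI_Isend(const void *buf, int count, MPI_Datatype datatype,
              int dest, int tag, MPI_Comm comm, MPI_Request *request);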

API: receiving
MPI_IRECV(buf, count, datatype, source, tag, comm, request)
  [OUT buf]       initial address of receive buffer (choice)
  [IN  count]     number of elements in receive buffer (integer)
  [IN  datatype]  datatype of each receive buffer element (handle)
  [IN  source]    rank of source (integer)
  [IN  tag]       message tag (integer)
  [IN  comm]      communicator (handle)
  [OUT request]   communication request (handle)
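
And the corresponding C binding for the non-blocking receive:

int MPI_Irecv(void *buf, int count, MPI_Datatype datatype,
              int source, int tag, MPI_Comm comm, MPI_Request *request);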

API: blocking/polling
MPI_WAIT(request, status)
  [INOUT request]  request (handle)
  [OUT   status]   status object (Status)

MPI_TEST(request, flag, status)
  [INOUT request]  communication request (handle)
  [OUT   flag]     true if operation completed (logical)
  [OUT   status]   status object (Status)
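
A small C sketch of the polling pattern, assuming buf, count, source, and tag are set up as in the receive listing above; do_some_work() is a hypothetical placeholder. MPI_Test lets the caller interleave useful work until the request completes, whereas MPI_Wait blocks.

MPI_Request req;
MPI_Status  status;
int done = 0;

MPI_Irecv(buf, count, MPI_DOUBLE, source, tag, MPI_COMM_WORLD, &req);

/* Poll: keep computing until the message has arrived. */
while (!done) {
    MPI_Test(&req, &done, &status);
    if (!done)
        do_some_work();          /* hypothetical unit of computation */
}
/* Or simply block until completion: MPI_Wait(&req, &status); */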

Why N-B I/O? Example 1

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank .EQ. 0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    CALL MPI_WAIT(request, status, ierr)
ELSE
    CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
    CALL MPI_WAIT(request, status, ierr)
END IF

Why N-B I/O? Example 2

CALL MPI_COMM_RANK(comm, rank, ierr)
IF (rank .EQ. 0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
ELSE
    CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
END IF

Why N-B I/O? Deadlocks(?)

Blocking:
IF (rank .EQ. 0) THEN
    CALL MPI_SEND()
    CALL MPI_RECV()
ELSE
    CALL MPI_SEND()
    CALL MPI_RECV()
END IF

Non-Blocking:
IF (rank .EQ. 0) THEN
    CALL MPI_ISEND()
    CALL MPI_IRECV()
ELSE
    CALL MPI_ISEND()
    CALL MPI_IRECV()
END IF
CALL MPI_WAITALL()
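
A minimal C sketch of why the non-blocking variant sidesteps the deadlock, assuming two ranks exchanging N doubles (buffer names are illustrative): both ranks post their send and their receive without blocking, then wait on both.

/* Both ranks execute the same code. */
int peer = (rank == 0) ? 1 : 0;
MPI_Request reqs[2];

MPI_Isend(sendbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
MPI_Irecv(recvbuf, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

/* Neither call above blocks, so both receives get posted and both sends
   can complete; unlike the blocking SEND-before-RECV version, this
   exchange cannot hang here. */
MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);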

Why N-B I/O? Efficiency (painless)

for(i=1; i<NP; i++){
    sender   = (NP+myRank-i)%NP;
    receiver = (myRank+i)%NP;
    MPI_Irecv(sender, ...)
    MPI_Isend(receiver, ...)
}
MPI_Waitall()
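
A fuller C sketch of the same exchange, under illustrative assumptions (NP, myRank, CHUNK, and per-peer buffers sendbuf/recvbuf are set up elsewhere; <mpi.h> and <stdlib.h> are included): every send and receive is posted up front and completed with one MPI_Waitall, which leaves the MPI library free to order the transfers for efficiency.

MPI_Request *reqs = malloc(2 * (NP - 1) * sizeof(MPI_Request));
int nreq = 0;

for (int i = 1; i < NP; i++) {
    int sender   = (NP + myRank - i) % NP;   /* who sends to me this step */
    int receiver = (myRank + i) % NP;        /* whom I send to this step  */
    MPI_Irecv(recvbuf[sender],   CHUNK, MPI_DOUBLE, sender,   0,
              MPI_COMM_WORLD, &reqs[nreq++]);
    MPI_Isend(sendbuf[receiver], CHUNK, MPI_DOUBLE, receiver, 0,
              MPI_COMM_WORLD, &reqs[nreq++]);
}

/* All 2*(NP-1) operations complete here, in whatever order MPI finds best. */
MPI_Waitall(nreq, reqs, MPI_STATUSES_IGNORE);
free(reqs);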

Why N-B I/O? Overlapping
 Communication / Computation
 Communication / Communication

Asynchronous I/O Overlap

Non-Blocking VS Asynchronous
 Non-Blocking => the call returns immediately, whether or not the transfer has completed
 Asynchronous => the CPU does other useful work while the transfer proceeds

IF (rank .EQ. 0) THEN
    CALL MPI_ISEND(a(1), 10, MPI_REAL, 1, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
ELSE
    CALL MPI_IRECV(a(1), 15, MPI_REAL, 0, tag, comm, request, ierr)
    **** do some computation to mask latency ****
    CALL MPI_WAIT(request, status, ierr)
END IF
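
Non-blocking does not automatically mean asynchronous: some MPI implementations only move data while the application is inside an MPI call. A common, hedged workaround is to break the masking computation into pieces and nudge the library with MPI_Test between pieces. In the sketch below, a, tag, NSTEPS, and chunk_of_computation() are illustrative placeholders, and whether this is necessary depends on the implementation and interconnect.

MPI_Request req;
int done = 0;

MPI_Isend(a, 10, MPI_FLOAT, 1, tag, MPI_COMM_WORLD, &req);

/* Compute in pieces, giving MPI a chance to progress the transfer. */
for (int step = 0; step < NSTEPS; step++) {
    chunk_of_computation(step);              /* hypothetical work unit */
    if (!done)
        MPI_Test(&req, &done, MPI_STATUS_IGNORE);
}
if (!done)
    MPI_Wait(&req, MPI_STATUS_IGNORE);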

Multidimensional FFT

[Figure: 2-D FFT decomposed across processes P1 and P2: a 1-D FFT along Dimension X, I/O (data exchange) between the processes, then a 1-D FFT along Dimension Y.]

FFT code: MPI_ALLTOALL()

DO IZ = 1, (NZ/NPROC)
  DO IY = 1, NYP
    CALL REAL_TO_COMPLEX( REAL_IN=AP(:,IY,IZ), COMPLEX_OUT=A3 )
    DO IX = 1, (NX/2)
      IPROC = (IX-1)/((NX/2)/NPROC)
      IXEFF = IX - IPROC*((NX/2)/NPROC)
      CAT2(IXEFF,IY,IZ,IPROC+1) = A3(IX)
    ENDDO

NT = NYP*(NZ/NPROC)*((NX/2)/NPROC)
CALL MPI_BARRIER( MPI_COMM_WORLD, ERR )
CALL MPI_ALLTOALL( CAT2(1,1,1,1), NT, MPI_DOUBLE_COMPLEX, &
                   CAT(1,1,1,1), NT, MPI_DOUBLE_COMPLEX, MPI_COMM_WORLD, ERR )

FFT code: MPI_SEND() + MPI_RECV()

DO IZ = 1, (NZ/NPROC)
  DO IY = 1, NYP
    CALL REAL_TO_COMPLEX( REAL_IN=AP(:,IY,IZ), COMPLEX_OUT=A3 )
    DO IPROC = 1, NPROC-1
      REAL_RECEIVER = MOD(IPROC+MYNUM, NPROC)
      CHUNCK = ((NX/2)/NPROC)*REAL_RECEIVER+1
      CALL MPI_SEND(A3(CHUNCK), ((NX/2)/NPROC), MPI_DOUBLE_COMPLEX, &
                    REAL_RECEIVER, 0, MPI_COMM_WORLD, ERR)
      REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
      CALL MPI_RECV(DAN_A3(1), ((NX/2)/NPROC), MPI_DOUBLE_COMPLEX, &
                    REAL_SUBMITER, 0, MPI_COMM_WORLD, STATUS, ERR )
      DO IX = 1, ((NX/2)/NPROC)
        TARG_IZ = IZ + REAL_SUBMITER*(NZ/NPROC)
        AUX(TARG_IZ,IY,IX) = DAN_A3(IX);
      ENDDO

    DO IX = 1, ((NX/2)/NPROC)
      TARG_IZ = IZ + MYNUM*(NZ/NPROC)
      AUX(TARG_IZ,IY,IX) = A3( IX+((NX/2)/NPROC)*MYNUM );
    ENDDO

FFT code: MPI_iSEND() + MPI_iRECV() race?
(The question mark likely points at a send-buffer hazard: A3 is refilled by REAL_TO_COMPLEX at the top of each IY iteration, before the MPI_WAITALL that completes the previous iteration's ISENDs reading from A3, so the send buffer can be overwritten while those sends are still in flight.)

DO IZ = 1, (NZ/NPROC)
  DO IY = 1, NYP
    CALL REAL_TO_COMPLEX( REAL_IN=AP(:,IY,IZ), COMPLEX_OUT=A3 )
    CALL MPI_WAITALL( (2*(NPROC-1)), REQST(1), STATS, ERR)
    IF( IY > 1 ) THEN
      DO IPROC = 1, NPROC-1
        REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
        DO IX = 1, ((NX/2)/NPROC)
          TARG_IZ = IZ + REAL_SUBMITER*(NZ/NPROC)
          AUX(TARG_IZ,IY-1,IX) = DAN_A3(IX, IPROC);
        ENDDO
    ENDIF
    DO IPROC = 1, NPROC-1
      REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC,NPROC)
      CALL MPI_IRECV( DAN_A3(1,IPROC), ((NX/2)/NPROC), MPI_DOUBLE_COMPLEX, REAL_SUBMITER, 0, &
                      MPI_COMM_WORLD, REQST((2*IPROC-1)), ERR)
      REAL_RECEIVER = MOD(IPROC+MYNUM, NPROC)
      CHUNCK = ((NX/2)/NPROC)*REAL_RECEIVER+1
      CALL MPI_ISEND(A3(CHUNCK), ((NX/2)/NPROC), MPI_DOUBLE_COMPLEX, REAL_RECEIVER, 0, &
                     MPI_COMM_WORLD, REQST((2*IPROC)), ERR)
    ENDDO
    DO IX = 1, ((NX/2)/NPROC)
      TARG_IZ = IZ + MYNUM*(NZ/NPROC)
      AUX(TARG_IZ,IY,IX) = A3( IX+((NX/2)/NPROC)*MYNUM );
    ENDDO

  CALL MPI_WAITALL( (2*(NPROC-1)), REQST(1), STATS, ERR)
  DO IPROC = 1, NPROC-1
    REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
    TARG_IZ = IZ + REAL_SUBMITER*(NZ/NPROC)
    DO IX = 1, ((NX/2)/NPROC)
      AUX(TARG_IZ,NYP,IX) = DAN_A3(IX, IPROC);
    ENDDO

FFT code: MPI_iSEND() + MPI_iRECV() (K)

DO IZ = 1, (NZ/NPROC) !{
  DO IY = 1, NYP !{
    CALL REAL_TO_COMPLEX( REAL_IN=AP(:,IY,IZ), COMPLEX_OUT=A3(1, 1+MOD(IY-1,K_LOOPS), 1+MOD((IY-1)/K_LOOPS,2)) )
    IF( (IY > K_LOOPS) .AND. (MOD(IY,K_LOOPS) == 0) ) THEN !{
      CALL MPI_WAITALL( (2*(NPROC-1)), REQST(1), STATS, ERR)
      DO IPROC = 1, NPROC-1 !{
        REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
        TARG_IZ = IZ + REAL_SUBMITER*(NZ/NPROC)
        DO IIK = 1, K_LOOPS !{
          OLDIY = IY-2*K_LOOPS+IIK
          DO IX = 1, ((NX/2)/NPROC) !{
            AUX(TARG_IZ,OLDIY,IX) = DAN_RECV(IX, IIK, IPROC);
          ENDDO !}
    ENDIF !}
    IF( MOD(IY,K_LOOPS) == 0 ) THEN !{
      DO IPROC = 1, NPROC-1 !{
        REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC,NPROC)
        CALL MPI_IRECV( DAN_RECV(1,1,IPROC), K_LOOPS*((NX/2)/NPROC), MPI_DOUBLE_COMPLEX, &
                        REAL_SUBMITER, 0, MPI_COMM_WORLD, REQST((2*IPROC-1)), ERR)
        REAL_RECEIVER = MOD(IPROC+MYNUM, NPROC)
        CHNK_STR = ((NX/2)/NPROC)*REAL_RECEIVER+1
        CHNK_END = CHNK_STR+((NX/2)/NPROC)-1
        CALL MPI_ISEND(A3( CHNK_STR:CHNK_END, :, 1+MOD((IY-1)/K_LOOPS,2) ), K_LOOPS*((NX/2)/NPROC), &
                       MPI_DOUBLE_COMPLEX, REAL_RECEIVER, 0, MPI_COMM_WORLD, REQST((2*IPROC)), ERR)
      ENDDO !}
    ENDIF !}
    DO IX = 1, ((NX/2)/NPROC) !{
      TARG_IZ = IZ + MYNUM*(NZ/NPROC)
      AUX(TARG_IZ,IY,IX) = A3( IX+((NX/2)/NPROC)*MYNUM, 1+MOD(IY-1,K_LOOPS), 1+MOD((IY-1)/K_LOOPS,2) );
    ENDDO !}

FFT code: MPI_iSEND() + MPI_iRECV() (K) cont.

LEFTOVERS = MOD(NYP,K_LOOPS)
IF( LEFTOVERS /= 0 ) THEN !{
  DO IPROC = 1, NPROC-1 !{
    REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
    CALL MPI_IRECV(LEFTOVERS_RECV(1,1,IPROC), LEFTOVERS*(NZ/NPROC), MPI_DOUBLE_COMPLEX, &
                   REAL_SUBMITER, 0, MPI_COMM_WORLD, LASTREQST((2*IPROC-1)), ERR)
    REAL_RECEIVER = MOD(IPROC+MYNUM, NPROC)
    CHNK_STR = ((NX/2)/NPROC)*REAL_RECEIVER+1
    CHNK_END = CHNK_STR+((NX/2)/NPROC)-1
    CALL MPI_ISEND(A3( CHNK_STR:CHNK_END, 1:LEFTOVERS, 1+MOD((NYP-1)/K_LOOPS,2) ), LEFTOVERS*((NX/2)/NPROC), &
                   MPI_DOUBLE_COMPLEX, REAL_RECEIVER, 0, MPI_COMM_WORLD, LASTREQST((2*IPROC)), ERR)
  ENDDO !}
ENDIF !}

!!!!! deal with the chunks where IY = K_LOOPS*(NYP/K_LOOPS)-K_LOOPS+1 : K_LOOPS*(NYP/K_LOOPS)
CALL MPI_WAITALL( (2*(NPROC-1)), REQST(1), STATS, ERR)
DO IPROC = 1, NPROC-1 !{
  REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
  TARG_IZ = IZ + REAL_SUBMITER*(NZ/NPROC)
  DO IIK = 1, K_LOOPS !{
    OLDIY = NYP-LEFTOVERS-K_LOOPS+IIK
    DO IX = 1, ((NX/2)/NPROC) !{
      AUX(TARG_IZ,OLDIY,IX) = DAN_RECV(IX, IIK, IPROC);
    ENDDO !}

FFT code: MPI_iSEND() + MPI_iRECV() (K) cont.

IF( LEFTOVERS /= 0 ) THEN !{
  !!!!! deal with the chunks where IY = NYP-MOD(NYP,K_LOOPS)+1 : NYP
  CALL MPI_WAITALL( (2*(NPROC-1)), LASTREQST(1), STATS, ERR)
  DO IPROC = 1, NPROC-1 !{
    REAL_SUBMITER = MOD(NPROC+MYNUM-IPROC, NPROC)
    TARG_IZ = IZ + REAL_SUBMITER*(NZ/NPROC)
    DO IIK = 1, LEFTOVERS !{
      OLDIY = NYP-LEFTOVERS+IIK
      DO IX = 1, ((NX/2)/NPROC) !{
        AUX(TARG_IZ,OLDIY,IX) = LEFTOVERS_RECV(IX, IIK, IPROC);
      ENDDO !}
ENDIF !}
ENDDO !}

Pseudocode: MPI_iSEND() + MPI_iRECV() (K&D)

do i = 1, NX
    do j = 1, NY
        kernel(indata[j], outdata[j])
        if( (j > D*K) && (j%K == 0) ) then
            MPI_WAITALL(request[j-D])
        endif
        if( j%K == 0 ) then
            MPI_iSEND(outdata[j-K], SIZE=K)
            MPI_iRECV(recv_data, request[j])
        endif
    enddo
    do m = 0, D-1
        MPI_WAITALL(request[NY-D+m])
    enddo
    other_computation()
enddo
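
A hedged C sketch of the same K-and-D idea: aggregate K iterations' worth of results into one message, and keep up to D such messages in flight before waiting on the oldest. All names here (kernel(), NY, K, D, CHUNK, peer) are illustrative stand-ins for the pseudocode above, not the original implementation.

#include <mpi.h>

enum { NY = 64, K = 4, D = 2, CHUNK = 256 };

/* Hypothetical compute step standing in for the slide's kernel(). */
extern void kernel(int j, double out[CHUNK]);

void pipelined_exchange(int peer, MPI_Comm comm)
{
    static double outdata[NY][CHUNK];     /* K rows per outgoing message */
    static double recvbuf[D][K][CHUNK];   /* D in-flight receive slots   */
    MPI_Request reqs[D][2];
    int nbatches = NY / K;                /* assume K divides NY         */

    for (int s = 0; s < D; s++)
        reqs[s][0] = reqs[s][1] = MPI_REQUEST_NULL;

    for (int b = 0; b < nbatches; b++) {
        /* Compute K iterations' worth of data. */
        for (int j = b * K; j < (b + 1) * K; j++)
            kernel(j, outdata[j]);

        int slot = b % D;
        /* Before reusing this slot, drain the batch issued D batches ago
           (a no-op for the first D batches, thanks to MPI_REQUEST_NULL). */
        MPI_Waitall(2, reqs[slot], MPI_STATUSES_IGNORE);

        /* Post the receive, then ship the K freshly computed rows. */
        MPI_Irecv(recvbuf[slot], K * CHUNK, MPI_DOUBLE, peer, 0,
                  comm, &reqs[slot][0]);
        MPI_Isend(outdata[b * K], K * CHUNK, MPI_DOUBLE, peer, 0,
                  comm, &reqs[slot][1]);
    }

    /* Drain whatever is still in flight (at most the last D batches). */
    for (int s = 0; s < D; s++)
        MPI_Waitall(2, reqs[s], MPI_STATUSES_IGNORE);
}

K trades message count for message size, while D controls how much communication stays in flight behind the computation.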

Performance gain

Comput/Commun Overlapping

[Figure: timeline sketch showing communication and computation making progress concurrently over time.]