MPI TIPS AND TRICKS
Dr. David Cronk
Innovative Computing Lab, University of Tennessee

Course Outline
Day 1
- Morning - Lecture: Point-to-point communication modes; Collective operations; Derived datatypes
- Afternoon - Lab: Hands-on exercises demonstrating collective operations and derived datatypes

Course Outline (cont)
Day 2
- Morning - Lecture: Finish lecture on derived datatypes; MPI analysis using VAMPIR; Performance analysis and tuning
- Afternoon - Lab: Finish Day 1 exercises; VAMPIR demo

Course Outline (cont)
Day 3
- Morning - Lecture: MPI-I/O
- Afternoon - Lab: MPI-I/O exercises

Point-to-Point Communication Modes
Standard Mode, blocking:
- MPI_SEND (buf, count, datatype, dest, tag, comm)
- MPI_RECV (buf, count, datatype, source, tag, comm, status)
- Generally use these ONLY if you cannot post the call earlier AND there is no other work that can be done!
- The standard ONLY states that buffers can be reused once the calls return; when a blocking call returns is implementation dependent.
- Blocking sends MAY block until a matching receive is posted. This is not required behavior, but the standard does not prohibit it either. Further, a blocking send may have to wait for system resources such as system-managed message buffers.
- Be VERY careful of deadlock when using blocking calls!
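A minimal sketch (the two-process setup and tag value are illustrative, not from the slides) of the classic deadlock risk: if both ranks call MPI_SEND first and the implementation does not buffer the message, neither call returns. Ordering the calls by rank avoids the problem.

/* Sketch: pairwise exchange that avoids send-send deadlock by ordering on rank. */
#include <mpi.h>
int main(int argc, char **argv) {
    int rank, other, sendval, recvval, tag = 0;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    other = 1 - rank;                 /* assumes exactly 2 processes */
    sendval = rank;
    if (rank == 0) {                  /* rank 0 sends first, then receives */
        MPI_Send(&sendval, 1, MPI_INT, other, tag, MPI_COMM_WORLD);
        MPI_Recv(&recvval, 1, MPI_INT, other, tag, MPI_COMM_WORLD, &status);
    } else {                          /* rank 1 receives first, then sends */
        MPI_Recv(&recvval, 1, MPI_INT, other, tag, MPI_COMM_WORLD, &status);
        MPI_Send(&sendval, 1, MPI_INT, other, tag, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}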

Point-to-Point Communication Modes (cont)
Standard Mode, non-blocking (immediate) sends/receives:
- MPI_ISEND (buf, count, datatype, dest, tag, comm, request)
- MPI_IRECV (buf, count, datatype, source, tag, comm, request)
- MPI_WAIT (request, status)
- MPI_TEST (request, flag, status)
- Allows communication calls to be posted early, which may improve performance:
  - Overlap computation and communication
  - Latency tolerance
  - Less (or no) buffering
- You MUST either complete these calls (with wait or test) or call MPI_REQUEST_FREE
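A minimal sketch of the posted-early pattern, assuming a ring of processes (the buffer size and neighbor scheme are illustrative): the receive is posted before the send, independent work can run while the message is in flight, and the receive buffer is only touched after the wait.

/* Sketch: overlap computation with a ring exchange using non-blocking calls. */
#include <mpi.h>
#define N 1024
int main(int argc, char **argv) {
    int rank, size, left, right;
    double inbuf[N], outbuf[N];
    MPI_Request req;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    left  = (rank - 1 + size) % size;
    right = (rank + 1) % size;
    for (int i = 0; i < N; i++) outbuf[i] = rank;
    MPI_Irecv(inbuf, N, MPI_DOUBLE, left, 0, MPI_COMM_WORLD, &req);   /* post early */
    MPI_Send(outbuf, N, MPI_DOUBLE, right, 0, MPI_COMM_WORLD);
    /* ... computation that does not touch inbuf can run here ... */
    MPI_Wait(&req, &status);          /* inbuf is valid only after the wait completes */
    MPI_Finalize();
    return 0;
}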

Point-to-Point Communication Modes (cont)
Non-standard mode communication
- Only used by the sender! (MPI uses the push communication model)
- Buffered mode - a buffer must be provided by the application
- Synchronous mode - completes only after a matching receive has been posted
- Ready mode - may only be called when a matching receive has already been posted

Point-to-Point Communication Modes: Buffered
- MPI_BSEND (buf, count, datatype, dest, tag, comm)
- MPI_IBSEND (buf, count, dtype, dest, tag, comm, req)
- MPI_BUFFER_ATTACH (buff, size)
- MPI_BUFFER_DETACH (buff, size)
- Buffered sends do not rely on system buffers: the user supplies a buffer that MUST be large enough for all messages
- The user need not worry about calls blocking while waiting for system buffer space
- The buffer is managed by MPI, but the user MUST ensure there is no buffer overflow

Buffered Sends

Seg violation (the attach buffer was never allocated):
#define BUFFSIZE 1000
char *buff;
char b1[500], b2[500];
MPI_Buffer_attach (buff, BUFFSIZE);

Buffer overflow (1000 bytes of message data plus the Bsend overhead exceed BUFFSIZE):
#define BUFFSIZE 1000
char *buff;
char b1[500], b2[500];
buff = (char*) malloc(BUFFSIZE);
MPI_Buffer_attach (buff, BUFFSIZE);
MPI_Bsend (b1, 500, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
MPI_Bsend (b2, 500, MPI_CHAR, 2, 1, MPI_COMM_WORLD);

Safe (buffer sized with MPI_Pack_size plus MPI_BSEND_OVERHEAD for each message):
int size;
char *buff;
char b1[500], b2[500];
MPI_Pack_size (500, MPI_CHAR, MPI_COMM_WORLD, &size);
size += MPI_BSEND_OVERHEAD;
size = size * 2;
buff = (char*) malloc(size);
MPI_Buffer_attach (buff, size);
MPI_Bsend (b1, 500, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
MPI_Bsend (b2, 500, MPI_CHAR, 2, 1, MPI_COMM_WORLD);
MPI_Buffer_detach (&buff, &size);

Point-to-Point Communication Modes: Synchronous
- MPI_SSEND (buf, count, datatype, dest, tag, comm)
- MPI_ISSEND (buf, count, dtype, dest, tag, comm, req)
- Can be started (called) at any time
- Does not complete until a matching receive has been posted and the receive operation has been started (this does NOT mean the matching receive has completed)
- Can be used in place of sending and receiving acknowledgements
- Can be more efficient when used appropriately, since buffering may be avoided
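A small sketch (two processes and the tag are assumed for illustration) of using a synchronous send as an implicit acknowledgement: when MPI_Ssend returns, the sender knows the receiver has at least started the matching receive, so no separate ack message is needed.

/* Sketch: synchronous send as an implicit acknowledgement (2 processes assumed). */
#include <mpi.h>
int main(int argc, char **argv) {
    int rank;
    double data[100] = {0};
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Ssend(data, 100, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        /* Ssend has completed: rank 1 has started the matching receive. */
    } else if (rank == 1) {
        MPI_Recv(data, 100, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, &status);
    }
    MPI_Finalize();
    return 0;
}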

Point-to-Point Communication Modes: Ready Mode
- MPI_RSEND (buf, count, datatype, dest, tag, comm)
- MPI_IRSEND (buf, count, dtype, dest, tag, comm, req)
- May ONLY be started (called) if a matching receive has already been posted; if a matching receive has not been posted, the results are undefined
- May be the most efficient mode when appropriate (removal of the handshake operation)
- Should only be used with extreme caution

Ready Mode

UNSAFE (the master's Rsend may start before the slave has posted its receive):
SLAVE:
while (!done) {
    MPI_Send (NULL, 0, MPI_INT, MASTER, 1, MPI_COMM_WORLD);
    MPI_Recv (buff, count, datatype, MASTER, 2, MPI_COMM_WORLD, &status);
    ...
}
MASTER:
while (!done) {
    MPI_Recv (NULL, 0, MPI_INT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
    source = status.MPI_SOURCE;
    get_work (.....);
    MPI_Rsend (buff, count, datatype, source, 2, MPI_COMM_WORLD);
    if (no more work) done = TRUE;
}

SAFE (the slave posts its receive before the synchronous request, so the matching receive is guaranteed to be posted when the master calls Rsend):
SLAVE:
while (!done) {
    MPI_Irecv (buff, count, datatype, MASTER, 2, MPI_COMM_WORLD, &request);
    MPI_Ssend (NULL, 0, MPI_INT, MASTER, 1, MPI_COMM_WORLD);
    MPI_Wait (&request, &status);
    ...
}

Point-to-Point Communication Modes: Performance Issues
- Non-blocking calls are almost always the way to go
  - Communication can be carried out during blocking system calls
  - Computation and communication can be overlapped if there is special-purpose communication hardware
  - Less likely to have errors that lead to deadlock
- Standard mode is usually sufficient, but buffered mode can offer advantages
  - Particularly if there are frequent, large messages being sent
  - Or if the user is unsure the system provides sufficient buffer space
- Synchronous mode can be more efficient if acks are needed
  - It also tells the system that buffering is not required

Collective Communication
- The amount of data sent must exactly match the amount of data received
- Collective routines are collective across an entire communicator and must be called in the same order from all processes within the communicator
- Collective routines are all blocking
  - This simply means buffers can be re-used upon return
- Collective routines return as soon as the calling process's participation is complete
  - This says nothing about the other processes
- Collective routines may or may not be synchronizing
- No mixing of collective and point-to-point communication

Collective Communication
Barrier: MPI_BARRIER (comm)
- The only collective routine which provides explicit synchronization
- Returns at any process only after all processes have entered the call

Collective Communication
Collective Communication Routines:
- Except broadcast, each routine has 2 variants:
  - Standard variant: all messages are the same size
  - Vector variant: each item is a vector of possibly varying length
- If there is a single origin or destination, it is referred to as the root
- Each routine (except broadcast) has distinct send and receive arguments
- Send and receive buffers must be disjoint
- Each can use MPI_IN_PLACE, which allows the user to specify that data contributed by the caller is already in its final location

Collective Communication: Bcast
MPI_BCAST (buffer, count, datatype, root, comm)
- Strictly in place
- MPI-1 insists on using an intra-communicator; MPI-2 allows use of an inter-communicator
- REMEMBER: a broadcast need not be synchronizing. Returning from a broadcast tells you nothing about the status of the other processes involved. Furthermore, though MPI does not require MPI_BCAST to be synchronizing, neither does it prohibit synchronous behavior.

BCAST

OOPS! (a broadcast is not matched by point-to-point receives, so the non-root processes never get the value):
if (myrank == root) {
    fp = fopen (filename, "r");
    fscanf (fp, "%d", &iters);
    fclose (fp);
    MPI_Bcast (&iters, 1, MPI_INT, root, MPI_COMM_WORLD);
} else {
    MPI_Recv (&iters, 1, MPI_INT, root, tag, MPI_COMM_WORLD, &status);
}

THAT'S BETTER (every process calls MPI_Bcast):
if (myrank == root) {
    fp = fopen (filename, "r");
    fscanf (fp, "%d", &iters);
    fclose (fp);
}
MPI_Bcast (&iters, 1, MPI_INT, root, MPI_COMM_WORLD);

Collective Communication: Gather
MPI_GATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Receive arguments are only meaningful at the root
- Each process must send the same amount of data
- The root can use MPI_IN_PLACE for sendbuf: its data is assumed to be in the correct place in recvbuf
(figure: MPI_GATHER collecting data from P1 = root, P2, P3)

MPI_Gather

WORKS (the root contributes in place with MPI_IN_PLACE; receive arguments are ignored on non-root ranks):
int tmp[20];
int res[320];
for (i = 0; i < 20; i++) {
    /* do some computation */
    if (myrank == 0)
        res[i] = some value;
    else
        tmp[i] = some value;
}
if (myrank == 0) {
    MPI_Gather (MPI_IN_PLACE, 20, MPI_INT, res, 20, MPI_INT, 0, MPI_COMM_WORLD);
    /* write out results */
} else {
    MPI_Gather (tmp, 20, MPI_INT, tmp, 320, MPI_REAL, 0, MPI_COMM_WORLD);
}

A OK (every process fills tmp and the root gathers into res):
for (i = 0; i < 20; i++) {
    /* do some computation */
    tmp[i] = some value;
}
MPI_Gather (tmp, 20, MPI_INT, res, 20, MPI_INT, 0, MPI_COMM_WORLD);
if (myrank == 0)
    /* write out results */

Collective Communication: Gatherv
MPI_GATHERV (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm)
- Vector variant of MPI_GATHER
- Allows a varying amount of data from each process
- Allows the root to specify where the data from each process goes
- No portion of the receive buffer may be written more than once
- MPI_IN_PLACE may be used by the root

Collective Communication: Gatherv (cont)
(figure: MPI_GATHERV with P1 = root, P2, P3, P4; counts = 1, 2, 3, 4; displs = 9, 7, 4, 0)

Collective Communication: Gatherv (cont)
stride = 105;
root = 0;
for (i = 0; i < nprocs; i++) {
    displs[i] = i*stride;
    counts[i] = 100;
}
MPI_Gatherv (sbuff, 100, MPI_INT, rbuff, counts, displs, MPI_INT, root, MPI_COMM_WORLD);

Collective Communication: Scatter
MPI_SCATTER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Opposite of MPI_GATHER
- Send arguments are only meaningful at the root
- The root can use MPI_IN_PLACE for recvbuf
(figure: MPI_SCATTER distributing buffer B on the root to P1, P2, P3, P4)

MPI_SCATTER
IF (MYPE .EQ. ROOT) THEN
   OPEN (25, FILE='filename')
   READ (25, *) nprocs, nboxes
   READ (25, *) ((mat(i,j), i=1,nboxes), j=1,nprocs)
   CLOSE (25)
ENDIF
CALL MPI_BCAST (nboxes, 1, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
CALL MPI_SCATTER (mat, nboxes, MPI_INTEGER, lboxes, nboxes, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)

Collective Communication: Scatterv
MPI_SCATTERV (sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
- Opposite of MPI_GATHERV
- Send arguments are only meaningful at the root
- The root can use MPI_IN_PLACE for recvbuf
- No location of the sendbuf may be read more than once

Collective Communication: Scatterv (cont)
(figure: MPI_SCATTERV distributing buffer B on the root to P1, P2, P3, P4; counts = 1, 2, 3, 4; displs = 9, 7, 4, 0)

MPI_SCATTERV
C     mnb = max number of boxes
IF (MYPE .EQ. ROOT) THEN
   OPEN (25, FILE='filename')
   READ (25, *) nprocs
   READ (25, *) (nboxes(I), I=1,nprocs)
   READ (25, *) ((mat(I,J), I=1,nboxes(J)), J=1,nprocs)
   CLOSE (25)
   DO I = 1,nprocs
      displs(I) = (I-1)*mnb
   ENDDO
ENDIF
CALL MPI_SCATTER (nboxes, 1, MPI_INTEGER, nb, 1, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)
CALL MPI_SCATTERV (mat, nboxes, displs, MPI_INTEGER, lboxes, nb, MPI_INTEGER, ROOT, MPI_COMM_WORLD, ierr)

Collective Communication: Allgather
MPI_ALLGATHER (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
- Same as MPI_GATHER, except all processes get the result
- MPI_IN_PLACE may be used for sendbuf on all processes
- Equivalent to a gather followed by a bcast
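A minimal sketch (one int per process, with a made-up rank-squared contribution): every process contributes one element and every process ends up with the complete gathered array.

/* Sketch: MPI_Allgather of one int per process. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int *all = malloc(size * sizeof(int));
    int mine = rank * rank;                  /* this process's contribution */
    MPI_Allgather(&mine, 1, MPI_INT, all, 1, MPI_INT, MPI_COMM_WORLD);
    if (rank == 0)
        for (int i = 0; i < size; i++) printf("all[%d] = %d\n", i, all[i]);
    free(all);
    MPI_Finalize();
    return 0;
}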

Collective Communication: Allgatherv
MPI_ALLGATHERV (sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm)
- Same as MPI_GATHERV, except all processes get the result
- MPI_IN_PLACE may be used for sendbuf on all processes
- Equivalent to a gatherv followed by a bcast

Collective Communication: Alltoall (scatter/gather)
MPI_ALLTOALL (sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
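A small sketch (one int per destination, made-up values): each process sends its j-th block to process j and receives one block from every process, so the operation amounts to a global transpose of the blocks.

/* Sketch: MPI_Alltoall with one int per destination. */
#include <mpi.h>
#include <stdlib.h>
int main(int argc, char **argv) {
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int *sendbuf = malloc(size * sizeof(int));
    int *recvbuf = malloc(size * sizeof(int));
    for (int j = 0; j < size; j++)
        sendbuf[j] = rank * 100 + j;         /* element destined for process j */
    MPI_Alltoall(sendbuf, 1, MPI_INT, recvbuf, 1, MPI_INT, MPI_COMM_WORLD);
    /* recvbuf[i] now holds i*100 + rank, the block that process i sent to us */
    free(sendbuf);
    free(recvbuf);
    MPI_Finalize();
    return 0;
}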

Collective Communication: Alltoallv
MPI_ALLTOALLV (sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm)
- The vector variant of MPI_ALLTOALL
- Can specify how many blocks to send to each process, the location of the blocks to send, how many blocks to receive from each process, and where to place the received blocks

Collective Communication: Alltoallw
MPI_ALLTOALLW (sendbuf, sendcounts, sdispls, sendtypes, recvbuf, recvcounts, rdispls, recvtypes, comm)
- Same as MPI_ALLTOALLV, except different datatypes can be specified for the data scattered as well as the data gathered
- Can specify how many blocks to send to each process, the location of the blocks to send, how many blocks to receive from each process, and where to place the received blocks
- Displacements are now in terms of bytes rather than types

Collective Communication: Reduction
- Global reduction across all members of a group
- Can use predefined operations or user-defined operations
- Can be used on single elements or arrays of elements
- Counts and types must be the same on all processes
- Operations are assumed to be associative
- User-defined operations can be different on each process, but this is not recommended

Collective Communication: Reduction (reduce)
MPI_REDUCE (sendbuf, recvbuf, count, datatype, op, root, comm)
- recvbuf is only meaningful on the root
- Combines elements (on an element-by-element basis) in sendbuf according to op
- Results of the reduction are returned to the root in recvbuf
- MPI_IN_PLACE can be used for sendbuf on the root

MPI_REDUCE
REAL a(n), b(n,m), c(m)
REAL sum(m)
DO j = 1,m
   sum(j) = 0.0
   DO i = 1,n
      sum(j) = sum(j) + a(i)*b(i,j)
   ENDDO
ENDDO
CALL MPI_REDUCE(sum, c, m, MPI_REAL, MPI_SUM, 0, MPI_COMM_WORLD, ierr)

Collective Communication: Reduction (cont)
MPI_ALLREDUCE (sendbuf, recvbuf, count, datatype, op, comm)
- Same as MPI_REDUCE, except all processes get the result
MPI_REDUCE_SCATTER (sendbuf, recvbuf, recvcounts, datatype, op, comm)
- Acts like a reduce followed by a scatterv

MPI_REDUCE_SCATTER
DO j = 1,nprocs
   counts(j) = n
ENDDO
DO j = 1,m
   sum(j) = 0.0
   DO i = 1,n
      sum(j) = sum(j) + a(i)*b(i,j)
   ENDDO
ENDDO
CALL MPI_REDUCE_SCATTER(sum, c, counts, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)

Collective Communication: Prefix Reduction
MPI_SCAN (sendbuf, recvbuf, count, datatype, op, comm)
- Performs an inclusive element-wise prefix reduction
MPI_EXSCAN (sendbuf, recvbuf, count, datatype, op, comm)
- Performs an exclusive prefix reduction
- Results are undefined at process 0

MPI_SCAN
MPI_SCAN (sbuf, rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
MPI_EXSCAN (sbuf, rbuf, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD)
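An illustration of the difference between the two calls, assuming four processes whose sbuf values are 1, 2, 3 and 4 (these inputs are made up for the example):

/* With sbuf = 1, 2, 3, 4 on ranks 0..3 and op = MPI_SUM:              */
/*   MPI_Scan   (inclusive): rbuf = 1, 3, 6, 10 on ranks 0..3          */
/*   MPI_Exscan (exclusive): rbuf = undefined, 1, 3, 6 on ranks 0..3   */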

Collective Communication: Reduction - user-defined ops
MPI_OP_CREATE (function, commute, op)
- If commute is true, the operation is assumed to be commutative
- function is a user-defined function with 4 arguments:
  - invec: input vector
  - inoutvec: input and output values
  - len: number of elements
  - datatype: handle to the datatype
- For i = 0..len-1, it stores invec[i] op inoutvec[i] back into inoutvec[i]
MPI_OP_FREE (op)
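A minimal sketch of a user-defined reduction operation following the four-argument signature above; the complex-product operation and the type/op names are made up for illustration.

/* Sketch: user-defined MPI_Op multiplying complex numbers stored as pairs of doubles. */
#include <mpi.h>
typedef struct { double re, im; } Complex;

void complex_prod(void *invec, void *inoutvec, int *len, MPI_Datatype *dtype) {
    Complex *in = (Complex *)invec, *inout = (Complex *)inoutvec;
    for (int i = 0; i < *len; i++) {
        Complex c;
        c.re = in[i].re * inout[i].re - in[i].im * inout[i].im;
        c.im = in[i].re * inout[i].im + in[i].im * inout[i].re;
        inout[i] = c;                     /* result is accumulated into inoutvec */
    }
}

/* Usage sketch:
     MPI_Op myop;
     MPI_Datatype ctype;
     MPI_Type_contiguous(2, MPI_DOUBLE, &ctype);
     MPI_Type_commit(&ctype);
     MPI_Op_create(complex_prod, 1, &myop);    (1 means commutative)
     MPI_Reduce(local, result, count, ctype, myop, 0, MPI_COMM_WORLD);
     MPI_Op_free(&myop);
*/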

Collective Communication: Performance Issues
- Collective operations should perform much better than simply sending the messages directly
  - Broadcast may make use of a broadcast tree (or another mechanism)
  - All collective operations can potentially use a tree (or other) mechanism to improve performance
- It is important to use the simplest collective operations which still achieve the needed results
- Use MPI_IN_PLACE whenever appropriate
  - Reduces unnecessary memory usage and redundant data movement

Derived Datatypes
- A derived datatype is a sequence of primitive datatypes and displacements
- Derived datatypes are created by building on primitive datatypes
- A derived datatype's typemap is the sequence of (primitive type, disp) pairs that defines the derived datatype; these displacements need not be positive, unique, or increasing
- A datatype's type signature is just the sequence of primitive datatypes
- A message's type signature is the type signature of the datatype being sent, repeated count times

Derived Datatypes (cont)
Typemap = (MPI_INT, 0) (MPI_INT, 12) (MPI_INT, 16) (MPI_INT, 20) (MPI_INT, 36)
Type Signature = {MPI_INT, MPI_INT, MPI_INT, MPI_INT, MPI_INT}
In collective communication, the type signature of data sent must match the type signature of data received!

Derived Datatypes (cont)
- Lower bound: the lowest displacement of an entry of this datatype
- Upper bound: relative address of the last byte occupied by entries of this datatype, rounded up to satisfy alignment requirements
- Extent: the span from lower to upper bound
MPI_TYPE_GET_EXTENT (datatype, lb, extent)
MPI_TYPE_SIZE (datatype, size)
MPI_GET_ADDRESS (location, address)
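A small sketch of querying these values for a built-in type (the printed numbers are platform dependent):

/* Sketch: query size and extent of MPI_DOUBLE. */
#include <mpi.h>
#include <stdio.h>
int main(int argc, char **argv) {
    MPI_Aint lb, extent;
    int size;
    MPI_Init(&argc, &argv);
    MPI_Type_get_extent(MPI_DOUBLE, &lb, &extent);
    MPI_Type_size(MPI_DOUBLE, &size);
    printf("MPI_DOUBLE: lb=%ld extent=%ld size=%d\n", (long)lb, (long)extent, size);
    MPI_Finalize();
    return 0;
}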

Datatype Constructors
MPI_TYPE_DUP (oldtype, newtype)
- Simply duplicates an existing type
- Not useful to regular users
MPI_TYPE_CONTIGUOUS (count, oldtype, newtype)
- Creates a new type representing count contiguous occurrences of oldtype
- ex: MPI_TYPE_CONTIGUOUS (2, MPI_INT, 2INT) creates a new datatype 2INT which represents an array of 2 integers
(figure: datatype 2INT)

CONTIGUOUS DATATYPE
P1 sends 100 integers to P2

P1:
int buff[100];
MPI_Datatype dtype;
...
MPI_Type_contiguous (100, MPI_INT, &dtype);
MPI_Type_commit (&dtype);
MPI_Send (buff, 1, dtype, 2, tag, MPI_COMM_WORLD);

P2:
int buff[100];
MPI_Recv (buff, 100, MPI_INT, 1, tag, MPI_COMM_WORLD, &status);

Datatype Constructors (cont)
MPI_TYPE_VECTOR (count, blocklength, stride, oldtype, newtype)
- Creates a datatype representing count regularly spaced occurrences of blocklength contiguous oldtypes
- stride is in terms of elements of oldtype
- ex: MPI_TYPE_VECTOR (4, 2, 3, 2INT, AINT)
(figure: datatype AINT)
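A small sketch of a common use of MPI_Type_vector (this particular column example is not from the slides): describing one column of a row-major C matrix so it can be sent with a single call.

/* Sketch: send column 2 of a 10x10 row-major int matrix as one message. */
#include <mpi.h>
int main(int argc, char **argv) {
    int a[10][10], rank;
    MPI_Datatype coltype;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    for (int i = 0; i < 10; i++)
        for (int j = 0; j < 10; j++)
            a[i][j] = 10 * i + j;
    /* 10 blocks of 1 int each, 10 ints between block starts = one column */
    MPI_Type_vector(10, 1, 10, MPI_INT, &coltype);
    MPI_Type_commit(&coltype);
    if (rank == 0) {
        MPI_Send(&a[0][2], 1, coltype, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int col[10];
        MPI_Recv(col, 10, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Type_free(&coltype);
    MPI_Finalize();
    return 0;
}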

Datatype Constructors (cont)
MPI_TYPE_HVECTOR (count, blocklength, stride, oldtype, newtype)
- Identical to MPI_TYPE_VECTOR, except stride is given in bytes rather than elements
- ex: MPI_TYPE_HVECTOR (4, 2, 20, 2INT, BINT)
(figure: datatype BINT)

EXAMPLE
REAL a(100,100), b(100,100)
CALL MPI_COMM_RANK (MPI_COMM_WORLD, myrank, ierr)
CALL MPI_TYPE_SIZE (MPI_REAL, sizeofreal, ierr)
CALL MPI_TYPE_VECTOR (100, 1, 100, MPI_REAL, rowtype, ierr)
CALL MPI_TYPE_CREATE_HVECTOR (100, 1, sizeofreal, rowtype, xpose, ierr)
CALL MPI_TYPE_COMMIT (xpose, ierr)
CALL MPI_SENDRECV (a, 1, xpose, myrank, 0, b, 100*100, MPI_REAL, myrank, 0, MPI_COMM_WORLD, status, ierr)

Datatype Constructors (cont)
MPI_TYPE_INDEXED (count, blocklengths, displs, oldtype, newtype)
- Allows specification of a non-contiguous data layout
- Good for irregular problems
- ex: MPI_TYPE_INDEXED (3, lengths, displs, 2INT, CINT) with lengths = (2, 4, 3) and displs = (0, 3, 8)
(figure: datatype CINT)
- Most often, block sizes are all the same (typically 1); MPI-2 introduced a new constructor for this case

Datatype Constructors (cont)
MPI_TYPE_CREATE_INDEXED_BLOCK (count, blocklength, displs, oldtype, newtype)
- Same as MPI_TYPE_INDEXED, except all blocks are the same length (blocklength)
- ex: MPI_TYPE_CREATE_INDEXED_BLOCK (7, 1, displs, MPI_INT, DINT) with displs = (1, 3, 4, 6, 9, 13, 14)
(figure: datatype DINT)

Datatype Constructors (cont)
MPI_TYPE_CREATE_HINDEXED (count, blocklengths, displs, oldtype, newtype)
- Identical to MPI_TYPE_INDEXED except displacements are in bytes rather than elements
MPI_TYPE_CREATE_STRUCT (count, lengths, displs, types, newtype)
- Used mainly for sending arrays of structures
- count is the number of fields in the structure
- lengths is the number of elements in each field
- displs should be calculated rather than hard-coded (portability)

MPI_TYPE_CREATE_STRUCT
struct s1 {
    char class;
    double d[6];
    char b[7];
};
struct s1 sarray[100];

Non-portable (guesses the displacements from sizeof, ignoring alignment and padding):
MPI_Datatype stype;
MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int lens[3] = {1, 6, 7};
MPI_Aint displs[3] = {0, sizeof(double), 7*sizeof(double)};
MPI_Type_create_struct (3, lens, displs, types, &stype);

Semi-portable (measures the displacements with MPI_Get_address):
MPI_Datatype stype;
MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int lens[3] = {1, 6, 7};
MPI_Aint displs[3];
MPI_Get_address (&sarray[0].class, &displs[0]);
MPI_Get_address (&sarray[0].d, &displs[1]);
MPI_Get_address (&sarray[0].b, &displs[2]);
for (i = 2; i >= 0; i--) displs[i] -= displs[0];
MPI_Type_create_struct (3, lens, displs, types, &stype);

MPI_TYPE_CREATE_STRUCT
int i;
char c[100];
float f[3];
int a;
MPI_Aint disp[4];
int lens[4] = {1, 100, 3, 1};
MPI_Datatype types[4] = {MPI_INT, MPI_CHAR, MPI_FLOAT, MPI_INT};
MPI_Datatype stype;
MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Get_address (f, &disp[2]);
MPI_Get_address (&a, &disp[3]);
MPI_Type_create_struct (4, lens, disp, types, &stype);
MPI_Type_commit (&stype);
MPI_Send (MPI_BOTTOM, 1, stype, ……..);

Derived Datatypes (cont)
MPI_TYPE_CREATE_RESIZED (oldtype, lb, extent, newtype)
- Sets a new lower bound and extent for oldtype
- Does NOT change the amount of data sent in a message; it only changes the data access pattern

MPI_TYPE_CREATE_RESIZED
struct s1 {
    char class;
    double d[6];
    char b[7];
};
struct s1 sarray[100];

Really portable (the struct's true extent, including any padding, is set explicitly):
MPI_Datatype stype, ntype;
MPI_Datatype types[3] = {MPI_CHAR, MPI_DOUBLE, MPI_CHAR};
int lens[3] = {1, 6, 7};
MPI_Aint displs[3];
MPI_Get_address (&sarray[0].class, &displs[0]);
MPI_Get_address (&sarray[0].d, &displs[1]);
MPI_Get_address (&sarray[0].b, &displs[2]);
for (i = 2; i >= 0; i--) displs[i] -= displs[0];
MPI_Type_create_struct (3, lens, displs, types, &stype);
MPI_Type_create_resized (stype, 0, sizeof(struct s1), &ntype);

Datatype Constructors (cont)
MPI_TYPE_CREATE_SUBARRAY (ndims, sizes, subsizes, starts, order, oldtype, newtype)
- Creates a newtype which represents a contiguous subsection of an array with ndims dimensions
- This sub-array is only contiguous conceptually; it may not be stored contiguously in memory!
- Arrays are assumed to be indexed starting at zero!!!
- order must be MPI_ORDER_C or MPI_ORDER_FORTRAN
  - C programs may specify Fortran ordering, and vice-versa

Datatype Constructors: Subarrays
(figure: a 10x10 array with corners (1,1) and (10,10), showing the selected 6x6 sub-array)
MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10)
subsizes = (6, 6)
starts = (3, 3)

Datatype Constructors: Subarrays
(figure: a 10x10 array with corners (1,1) and (10,10), showing the selected 6x6 sub-array)
MPI_TYPE_CREATE_SUBARRAY (2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_INT, sarray)
sizes = (10, 10)
subsizes = (6, 6)
starts = (2, 2)

Datatype Constructors: Darrays
MPI_TYPE_CREATE_DARRAY (size, rank, dims, gsizes, distribs, dargs, psizes, order, oldtype, newtype)
- Used with arrays that are distributed in HPF-like fashion on Cartesian process grids
- Generates datatypes corresponding to the sub-arrays stored on each process
- Returns in newtype a datatype specific to the sub-array stored on process rank

Datatype Constructors (cont)
- Derived datatypes must be committed before they can be used
  - MPI_TYPE_COMMIT (datatype)
  - Performs a "compilation" of the datatype description into an efficient representation
- Derived datatypes should be freed when they are no longer needed
  - MPI_TYPE_FREE (datatype)
  - Does not affect datatypes derived from the freed datatype or communication currently in progress

Pack and Unpack
MPI_PACK (inbuf, incount, datatype, outbuf, outsize, position, comm)
MPI_UNPACK (inbuf, insize, position, outbuf, outcount, datatype, comm)
MPI_PACK_SIZE (incount, datatype, comm, size)
- Packed messages must be sent with the type MPI_PACKED
- Packed messages can be received with any matching datatype
- Unpacked messages can be received with the type MPI_PACKED
- Receives must use type MPI_PACKED if the messages are to be unpacked with MPI_UNPACK

Pack and Unpack

Sender using a derived datatype:
int i;
char c[100];
MPI_Aint disp[2];
int lens[2] = {1, 100};
MPI_Datatype types[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype type1;
MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Type_create_struct (2, lens, disp, types, &type1);
MPI_Type_commit (&type1);
MPI_Send (MPI_BOTTOM, 1, type1, 1, 0, MPI_COMM_WORLD);

Receiver using a derived datatype:
char c[100];
MPI_Status status;
int i;
MPI_Comm comm;
MPI_Aint disp[2];
int lens[2] = {1, 100};
MPI_Datatype types[2] = {MPI_INT, MPI_CHAR};
MPI_Datatype type1;
MPI_Get_address (&i, &disp[0]);
MPI_Get_address (c, &disp[1]);
MPI_Type_create_struct (2, lens, disp, types, &type1);
MPI_Type_commit (&type1);
comm = MPI_COMM_WORLD;
MPI_Recv (MPI_BOTTOM, 1, type1, 0, 0, comm, &status);

Sender using MPI_Pack:
int i;
char c[100];
char buf[110];
int pos = 0;
MPI_Pack (&i, 1, MPI_INT, buf, 110, &pos, MPI_COMM_WORLD);
MPI_Pack (c, 100, MPI_CHAR, buf, 110, &pos, MPI_COMM_WORLD);
MPI_Send (buf, pos, MPI_PACKED, 1, 0, MPI_COMM_WORLD);

Receiver using MPI_Unpack:
int i;
char c[100];
char buf[110];
MPI_Status status;
MPI_Comm comm;
int pos = 0;
comm = MPI_COMM_WORLD;
MPI_Recv (buf, 110, MPI_PACKED, 0, 0, comm, &status);
MPI_Unpack (buf, 110, &pos, &i, 1, MPI_INT, comm);
MPI_Unpack (buf, 110, &pos, c, 100, MPI_CHAR, comm);

Derived Datatypes: Performance Issues
- May allow the user to send fewer or smaller messages
  - How well this works is system dependent
- May significantly reduce memory copies
  - Can make I/O much more efficient
- Data packing may be more efficient if it reduces the number of send operations by packing meta-data at the front of the message
  - This is often possible (and advantageous) for data layouts that are runtime dependent

DAY 2
- Morning - Lecture: Performance analysis and tuning
- Afternoon - Lab: VAMPIR demo

Performance analysis and tuning
- It is typically much more difficult to debug and tune parallel programs
- Programmers often have no idea where to begin searching for possible bottlenecks
- A tool that gives the programmer a quick overview of the program's execution can aid in beginning this search

VAMPIR
- Vampir is a visualization program for trace data generated by Vampirtrace
- Vampirtrace is an instrumented MPI library to link with user code for automatic tracefile generation on parallel platforms

Vampir and Vampirtrace

Vampir Features
- Tool for converting tracefile data for MPI programs into a variety of graphical views
- Highly configurable
- Timeline display with zooming and scrolling capabilities
- Profiling and communications statistics
- Source-code clickback
- OpenMP support under development

Vampirtrace
- Profiling library for MPI applications
- Produces tracefiles that can be analyzed with the Vampir performance analysis tool or the Dimemas performance prediction tool
- Merely linking your application with Vampirtrace enables tracing of all MPI calls; on some platforms, calls to user-level subroutines are also recorded
- API for controlling profiling and for defining and tracing user-defined activities

Running and Analyzing Vampirtrace-instrumented Programs
- Programs linked with Vampirtrace are started in the same way as ordinary MPI programs
- Use Vampir to analyze the resulting tracefile
- Vampir uses a configuration file (VAMPIR2.cnf) in $HOME/.VAMPIR_defaults
  - You can copy $VAMPIR_ROOT/etc/VAMPIR2.cnf
  - If no configuration file is available, Vampir will create one with default values

Getting Started
- If your path is set up correctly, simply enter "vampir"
- To open a trace file, select "File" followed by "Open Tracefile"
- Select the trace file to open or enter a known tracefile name
- Once the tracefile is loaded, Vampir starts with a global timeline display; this is the analysis starting point

Vampir Global Timeline Display

Global Timeline Display
- The context menu is activated with a right mouse click inside any display window
- Zoom in by selecting the start of the desired region with the left button held, dragging the mouse to the end of the region, and releasing
- You can zoom in to unlimited depth
- Step out of zooms from the context menu

Zoomed Global Timeline Display

Displays
- Timeline
- Activity Chart
- Summary Chart
- Message Statistics
- File I/O Statistics
- Parallelism
Activity Charts
- Default is a pie chart, but histograms or table mode can also be used
- Can select different activities to be shown
- Can hide some activities
- Can change the scale in histograms

Global Activity Chart Display (single file)

Global Activity Chart Display (modular)

Global Activity Chart with All Symbols Displayed

Global Activity Chart with MPI Activities Displayed

Global Activity Chart with Application Activity Displayed

Global Activity Chart with Timeline Portion Displayed

Process Activity Chart Display

Process Activity Chart Display (mpi)

Process Activity Chart Display (hide max)

Process Activity Chart Histogram Display

Process Activity Chart Histogram Display (log display)

Process Activity Chart Table Display

Summary Charts
- Shows total time spent on each activity
- Can be the sum over all processes or the average per process
- Similar context menu options as activity charts
- Default display is a horizontal histogram, but can also be a vertical histogram, pie chart, or table

Global Summaric Chart Displays (all symbols)

Global Summaric Chart Displays (per process)

Global Summaric Chart Displays (mpi)

Global Summaric Chart Displays (timeline)

Global Summaric Chart Displays

Communication Statistics
- Shows a matrix of communication statistics
- Can show total bytes, total messages, average message size, longest, shortest, and transmission rates
- Can zoom into sub-matrices
- Can get length statistics
- Can filter messages by type (tag) and communicator

Global Communication Statistics Display (total bytes)

Global Communication Statistics Display (total bytes zoomed in)

Global Communication Statistics Display (average size)

Global Communication Statistics Display (total messages)

Global Communication Statistics Display using Timeline Portion

Global Communication Statistics Display (length statistics)

Global Communication Statistics Display (Filter Dialog)

Global Parallelism Display

Tracefile Size
- Often, the trace file from a fully instrumented code grows to an unmanageable size
- Can limit the problem size for analysis
- Can limit the number of iterations
- Can use the Vampirtrace API to limit size:
  - vttraceoff(): disables tracing
  - vttraceon(): re-enables tracing
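A hedged sketch of restricting tracing to the region of interest using the two calls named above; the surrounding routines are placeholders, and the required header and exact spelling of the tracing calls depend on the Vampirtrace release in use.

/* Sketch: trace only the main iteration loop. */
vttraceoff();                      /* do not trace setup */
initialize_problem();              /* placeholder setup routine */
vttraceon();                       /* tracing on for the interesting region */
for (iter = 0; iter < niters; iter++)
    do_timestep();                 /* placeholder compute/communication step */
vttraceoff();                      /* do not trace teardown */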

Performance Analysis and Tuning
- First, make sure there is available speedup in the MPI routines
  - Use a profiling tool such as Vampir
  - If the total time spent in MPI routines is a small fraction of total execution time, there is probably not much use tuning the message passing code
  - BEWARE: profiling tools can miss compute cycles used due to non-blocking calls!

Performance Analysis and Tuning
- If MPI routines account for a significant portion of your execution time:
  - Try to identify communication hot-spots
  - Will changing the order of communication reduce the hot-spot problem?
  - Will changing the data distribution reduce communication without increasing computation?
  - Sending more data is better than sending more messages

Performance Analysis and Tuning
- Are you using non-blocking calls?
  - Post sends/receives as soon as possible, but don't wait for their completion if there is still work you can do!
  - If you are waiting for long periods of time for completion of non-blocking sends, this may be an indication of small system buffers; consider using buffered mode.

Performance Analysis and Tuning
- Are you sending lots of small messages?
  - Message passing has significant overhead (latency), and latency accounts for a large proportion of the transmission time of small messages
  - Consider marshaling values into larger messages where appropriate
- If you are using derived datatypes, check whether the MPI implementation handles these types efficiently
- Consider using MPI_PACK where appropriate, e.g. for dynamic data layouts or when the sender needs to send the receiver meta-data

Performance Analysis and Tuning
- Use collective operations when appropriate
  - Many collective operations use mechanisms such as broadcast trees to achieve better performance
- Is your computation-to-communication ratio too small?
  - You may be running on too many processors for the problem size

DAY 3
- Morning - Lecture: MPI-I/O
- Afternoon - Lab: MPI-I/O exercises

MPI-I/O
Introduction
- What is parallel I/O
- Why do we need parallel I/O
- What is MPI-I/O
MPI-I/O
- Terms and definitions
- File manipulation
- Derived datatypes and file views

OUTLINE (cont)
MPI-I/O (cont)
- Data access
  - Non-collective access
  - Collective access
  - Split collective access
- File interoperability
- Gotchas - consistency and semantics

INTRODUCTION
What is parallel I/O?
- Multiple processes accessing a single file

INTRODUCTION
What is parallel I/O?
- Multiple processes accessing a single file
- Often, both data and file access are non-contiguous
  - Ghost cells cause non-contiguous data access
  - Block or cyclic distributions cause non-contiguous file access

Non-Contiguous Access
(figure: non-contiguous file layout mapped to non-contiguous local memory)

INTRODUCTION
What is parallel I/O?
- Multiple processes accessing a single file
- Often, both data and file access are non-contiguous
  - Ghost cells cause non-contiguous data access
  - Block or cyclic distributions cause non-contiguous file access
- Want to access data and files with as few I/O calls as possible

INTRODUCTION (cont)
Why use parallel I/O?
- Many users do not have time to learn the complexities of I/O optimization

INTRODUCTION (cont)
Integer dim
parameter (dim=10000)
Integer*4 out_array(dim)

Sequential unformatted write:
OPEN (fh, FILE=filename, FORM='UNFORMATTED')
WRITE (fh) (out_array(I), I=1,dim)

Direct-access write:
rl = 4*dim
OPEN (fh, FILE=filename, ACCESS='DIRECT', RECL=rl)
WRITE (fh, REC=1) out_array

INTRODUCTION (cont)
Why use parallel I/O?
- Many users do not have time to learn the complexities of I/O optimization
- Use of parallel I/O can simplify coding
  - Single read/write operation vs. multiple read/write operations

INTRODUCTION (cont)
Why use parallel I/O?
- Many users do not have time to learn the complexities of I/O optimization
- Use of parallel I/O can simplify coding
  - Single read/write operation vs. multiple read/write operations
- Parallel I/O potentially offers significant performance improvement over traditional approaches

INTRODUCTION (cont)
Traditional approaches:
- Each process writes to a separate file
  - Often requires an additional post-processing step
  - Without post-processing, restarts must use the same number of processors
- Results are sent to a master processor, which collects them and writes them out to disk
- Each processor calculates its position in the file and writes individually

INTRODUCTION (cont)
What is MPI-I/O?
- MPI-I/O is a set of extensions to the original MPI standard
- It is an interface specification: it does NOT give implementation specifics
- It provides routines for file manipulation and data access
- Calls to MPI-I/O routines are portable across a large number of architectures

MPI-I/O Terms and Definitions
- Displacement - number of bytes from the beginning of the file
- etype - unit of data access within the file
- filetype - datatype used to express the access pattern of the file
- file view - definition of the access pattern of the file

MPI-I/O Terms and Definitions
- Offset - position in the file, relative to the current view, expressed in terms of number of etypes
- File pointers - offsets into the file maintained by MPI
  - Individual file pointer - local to the process that opened the file
  - Shared file pointer - shared (and manipulated) by the group of processes that opened the file

FILE MANIPULATION
MPI_FILE_OPEN (Comm, filename, mode, info, fh, ierr)
- Opens the file identified by filename on each process in communicator Comm
- Collective over this group of processes
- Each process must use the same value for mode and reference the same file
- info is used to give hints about access patterns
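A minimal C sketch of a collective open and close (the filename and mode flags are illustrative):

/* Sketch: collectively open a file for writing, then close it. */
#include <mpi.h>
int main(int argc, char **argv) {
    MPI_File fh;
    MPI_Init(&argc, &argv);
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY,
                  MPI_INFO_NULL, &fh);
    /* ... set a view and read/write data here ... */
    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}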

FILE MANIPULATION (cont)
MPI_FILE_CLOSE (fh)
- Synchronizes the file state and then closes the file
- The user must ensure all I/O routines have completed before closing the file
- This is a collective routine (but not synchronizing)

DERIVED DATATYPES & VIEWS
- Derived datatypes are not part of MPI-I/O, but they are used extensively in conjunction with it
- A filetype is really a datatype expressing the access pattern of a file
- Filetypes are used to set file views

DERIVED DATATYPES & VIEWS
Non-contiguous memory access: MPI_TYPE_CREATE_SUBARRAY
- NDIMS - number of dimensions
- ARRAY_OF_SIZES - number of elements in each dimension of the full array
- ARRAY_OF_SUBSIZES - number of elements in each dimension of the sub-array
- ARRAY_OF_STARTS - starting position of the sub-array in the full array, in each dimension
- ORDER - MPI_ORDER_C or MPI_ORDER_FORTRAN
- OLDTYPE - datatype stored in the full array
- NEWTYPE - handle to the new datatype

NONCONTIGUOUS MEMORY ACCESS
(figure: a 102x102 array with corners (0,0), (0,101), (101,0), (101,101); the interior region spans (1,1) to (100,100))

NONCONTIGUOUS MEMORY ACCESS
INTEGER sizes(2), subsizes(2), starts(2), dtype, ierr
sizes(1) = 102
sizes(2) = 102
subsizes(1) = 100
subsizes(2) = 100
starts(1) = 1
starts(2) = 1
CALL MPI_TYPE_CREATE_SUBARRAY(2, sizes, subsizes, starts, MPI_ORDER_FORTRAN, MPI_REAL8, dtype, ierr)

NONCONTIGUOUS FILE ACCESS
MPI_FILE_SET_VIEW (FH, DISP, ETYPE, FILETYPE, DATAREP, INFO, IERROR)
(figure: a file with a 100-byte header followed by data, and the corresponding memory layout)

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective
MPI_TYPE_CONTIGUOUS (NUM, OLD, NEW, IERR)
- NUM - number of contiguous elements
- OLD - old datatype
- NEW - new datatype
MPI_TYPE_CREATE_RESIZED (OLD, LB, EXTENT, NEW, IERR)
- LB - lower bound
- EXTENT - new size

'Holes' in the file
(figure: memory layout; file layout of 2 ints followed by 3 ints)
CALL MPI_TYPE_CONTIGUOUS(2, MPI_INTEGER, CTYPE, IERR)
DISP = 4
LB = 0
EXTENT = 5*4
CALL MPI_TYPE_CREATE_RESIZED(CTYPE, LB, EXTENT, FTYPE, IERR)
CALL MPI_TYPE_COMMIT(FTYPE, IERR)
CALL MPI_FILE_SET_VIEW(FH, DISP, MPI_INTEGER, FTYPE, 'native', MPI_INFO_NULL, IERR)

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective
- Example: a block-cyclic data distribution

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective
- Example: a block-cyclic data distribution
MPI_TYPE_VECTOR (COUNT, BLOCKLENGTH, STRIDE, OLDTYPE, NEWTYPE)
- COUNT - number of blocks
- BLOCKLENGTH - number of elements per block
- STRIDE - elements between the start of each block
- OLDTYPE - old datatype
- NEWTYPE - new datatype

Block-cyclic distribution
(figure: file laid out as blocks of 4 ints distributed cyclically to P0, P1, P2, P3)
CALL MPI_TYPE_VECTOR(3, 4, 16, MPI_INTEGER, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
DISP = 4 * 4 * MYRANK
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INTEGER, FILETYPE, 'native', MPI_INFO_NULL, IERR)

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective
- A block-cyclic data distribution
- Multi-dimensional array access

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective
- A block-cyclic data distribution
- Multi-dimensional array access: MPI_TYPE_CREATE_SUBARRAY()

Distributed array access
(figure: a 200x200 array with corners (0,0), (0,199), (199,0), (199,199), divided into 100x100 blocks)

Distributed array access
sizes(1) = 200
sizes(2) = 200
subsizes(1) = 100
subsizes(2) = 100
starts(1) = 0
starts(2) = 0
CALL MPI_TYPE_CREATE_SUBARRAY(2, SIZES, SUBSIZES, STARTS, MPI_ORDER_FORTRAN, MPI_INTEGER, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT(FILETYPE, IERR)
CALL MPI_FILE_SET_VIEW(FH, 0, MPI_INTEGER, FILETYPE, 'native', MPI_INFO_NULL, IERR)

NONCONTIGUOUS FILE ACCESS
- The file has 'holes' in it from the processor's perspective
- A block-cyclic data distribution
- A multi-dimensional array distributed with a block distribution
- Irregularly distributed arrays

Irregularly distributed arrays
MPI_TYPE_CREATE_INDEXED_BLOCK (COUNT, LENGTH, MAP, OLD, NEW)
- COUNT - number of blocks
- LENGTH - elements per block
- MAP - array of displacements
- OLD - old datatype
- NEW - new datatype

Irregularly distributed arrays
MAP_ARRAY = 0 1 2 4 7 11 12 15 20 22

Irregularly distributed arrays
CALL MPI_TYPE_CREATE_INDEXED_BLOCK (10, 1, FILE_MAP, MPI_INTEGER, FILETYPE, IERR)
CALL MPI_TYPE_COMMIT (FILETYPE, IERR)
DISP = 0
CALL MPI_FILE_SET_VIEW (FH, DISP, MPI_INTEGER, FILETYPE, 'native', MPI_INFO_NULL, IERR)

DATA ACCESS
Data access routines vary along three independent dimensions:
- Positioning: explicit offsets, individual file pointers, or shared file pointers
- Synchronism: blocking or non-blocking
- Coordination: non-collective or collective

COLLECTIVE I/O
(figure: memory layout on 4 processors combined through an MPI temporary memory buffer into the file layout)

EXPLICIT OFFSETS
Parameters:
- FH - file handle
- OFFSET - location in the file to start
- BUF - buffer to write from/read to
- COUNT - number of elements
- DATATYPE - type of each element
- STATUS - return status (blocking)
- REQUEST - request handle (non-blocking, non-collective)

EXPLICIT OFFSETS (cont)
I/O Routines:
- MPI_FILE_(READ/WRITE)_AT ()
- MPI_FILE_(READ/WRITE)_AT_ALL ()
- MPI_FILE_I(READ/WRITE)_AT ()
- MPI_FILE_(READ/WRITE)_AT_ALL_BEGIN ()
- MPI_FILE_(READ/WRITE)_AT_ALL_END (FH, BUF, STATUS)

EXPLICIT OFFSETS
(figure: file with a 50-byte header, viewed as 5 blocks of 2 ints with a stride of 4 ints)
int buff[3];
int count = 5, blocklen = 2, stride = 4;
MPI_Type_vector (count, blocklen, stride, MPI_INT, &ftype);
MPI_Type_commit (&ftype);
disp = 58;
MPI_File_open (MPI_COMM_WORLD, filename, MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_set_view (fh, disp, MPI_INT, ftype, "native", MPI_INFO_NULL);
MPI_File_write_at (fh, 5, buff, 3, MPI_INT, &status);

INDIVIDUAL FILE POINTERS
Parameters:
- FH - file handle
- BUF - buffer to write to/read from
- COUNT - number of elements to be read/written
- DATATYPE - type of each element
- STATUS - return status (blocking)
- REQUEST - request handle (non-blocking, non-collective)

INDIVIDUAL FILE POINTERS
I/O Routines:
- MPI_FILE_(READ/WRITE) ()
- MPI_FILE_(READ/WRITE)_ALL ()
- MPI_FILE_I(READ/WRITE) ()
- MPI_FILE_(READ/WRITE)_ALL_BEGIN ()
- MPI_FILE_(READ/WRITE)_ALL_END (FH, BUF, STATUS)

INDIVIDUAL FILE POINTERS
(figure: interleaved file blocks accessed through per-process file pointers fp0 and fp1)
int buff[12];
int count = 6, blocklen = 2, stride = 4;
MPI_Type_vector (count, blocklen, stride, MPI_INT, &ftype);
MPI_Type_commit (&ftype);
disp = 50 + myrank*8;
MPI_File_open (MPI_COMM_WORLD, filename, MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
MPI_File_set_view (fh, disp, MPI_INT, ftype, "native", MPI_INFO_NULL);
MPI_File_write (fh, buff, 6, MPI_INT, &status);

SHARED FILE POINTERS
- All processes must have the same view
Parameters:
- FH - file handle
- BUF - buffer
- COUNT - number of elements
- DATATYPE - type of the elements
- STATUS - return status (blocking)
- REQUEST - request handle (non-blocking, non-collective)

SHARED FILE POINTERS
I/O Routines:
- MPI_FILE_(READ/WRITE)_SHARED ()
- MPI_FILE_I(READ/WRITE)_SHARED ()
- MPI_FILE_(READ/WRITE)_ORDERED ()
- MPI_FILE_(READ/WRITE)_ORDERED_BEGIN ()
- MPI_FILE_(READ/WRITE)_ORDERED_END (FH, BUF, STATUS)

SHARED FILE POINTERS
(figure: processes P0, P1, P2, ..., Pn-1 appending to a shared log file)

Event logging with the shared file pointer:
comm = MPI_COMM_WORLD;
MPI_Comm_rank (comm, &rank);
amode = MPI_MODE_CREATE | MPI_MODE_WRONLY;
.....
MPI_File_open (comm, logfile, amode, MPI_INFO_NULL, &fh);
/* do some computing */
if (some event occurred) {
    sprintf (buf, "Process %d: %s\n", rank, event);
    size = strlen (buf);
    MPI_File_write_shared (fh, buf, size, MPI_CHAR, &status);
}

Rank-ordered output:
int buff[100];
MPI_File_open (comm, logfile, amode, MPI_INFO_NULL, &fh);
MPI_File_write_ordered (fh, buff, 100, MPI_INT, &status);

FILE INTEROPERABILITY
- MPI puts no constraints on how an implementation should store files
- If a file is not stored as a linear byte stream, there must be a utility for converting the file into a linear byte stream
- Data representation aids interoperability

FILE INTEROPERABILITY (cont)
Data representation:
- native - data is stored exactly as it is in memory
- internal - data may be converted, but can always be read by the same MPI implementation, even on different architectures
- external32 - this representation is defined by MPI; files written in external32 format can be read by any MPI on any machine

FILE INTEROPERABILITY (cont)
- In some MPI-I/O implementations (e.g. ROMIO), the files created are no different from those created by the underlying file system
  - Normal POSIX commands (cp, rm, etc.) work with files created by these implementations
  - Non-MPI programs can read these files

GOTCHAS - Consistency & Semantics
- Collective routines are NOT synchronizing
- Output data may be buffered: just because a process has completed a write does not mean the data is available to other processes
- Three ways to ensure file consistency:
  - MPI_FILE_SET_ATOMICITY ()
  - MPI_FILE_SYNC ()
  - MPI_FILE_CLOSE ()

CONSISTENCY & SEMANTICS
MPI_FILE_SET_ATOMICITY ()
- Causes all writes to be immediately written to disk; this is a collective operation
MPI_FILE_SYNC ()
- Collective operation which forces buffered data to be written to disk
MPI_FILE_CLOSE ()
- Writes any buffered data to disk before closing the file
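A hedged sketch of the usual sync-barrier-sync construct for making one process's writes visible to another process without enabling atomic mode; the ranks, offsets and counts are illustrative, and fh, wbuf and rbuf are assumed to be set up as in the earlier examples.

/* Sketch: rank 0 writes, rank 1 then reads the same region. */
if (rank == 0)
    MPI_File_write_at(fh, 0, wbuf, 100, MPI_INT, &status);
MPI_File_sync(fh);                /* flush the writer's data (collective)       */
MPI_Barrier(MPI_COMM_WORLD);      /* order the write before the read            */
MPI_File_sync(fh);                /* make the flushed data visible to all ranks */
if (rank == 1)
    MPI_File_read_at(fh, 0, rbuf, 100, MPI_INT, &status);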

GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 100, …)
CALL MPI_FILE_READ_AT (FH, 0, …)

Process 1:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 0, …)
CALL MPI_FILE_READ_AT (FH, 100, …)

GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 100, …)
CALL MPI_BARRIER ()
CALL MPI_FILE_READ_AT (FH, 0, …)

Process 1:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_SET_ATOMICITY (FH)
CALL MPI_FILE_WRITE_AT (FH, 0, …)
CALL MPI_BARRIER ()
CALL MPI_FILE_READ_AT (FH, 100, …)

GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_WRITE_AT (FH, 100, …)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_FILE_READ_AT (FH, 0, …)

Process 1:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_WRITE_AT (FH, 0, …)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_FILE_READ_AT (FH, 100, …)

GOTCHA!!!
Process 0:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_WRITE_AT (FH, 100, …)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_BARRIER ()
CALL MPI_FILE_READ_AT (FH, 0, …)

Process 1:
CALL MPI_FILE_OPEN (…, FH)
CALL MPI_FILE_WRITE_AT (FH, 0, …)
CALL MPI_FILE_CLOSE (FH)
CALL MPI_BARRIER ()
CALL MPI_FILE_READ_AT (FH, 100, …)

CONCLUSIONS
- MPI-I/O potentially offers significant improvement in I/O performance
- This improvement can be attained with minimal effort on the part of the user
- Simpler programming with fewer calls to I/O routines
- Easier program maintenance due to the simple API

Recommended references
- MPI - The Complete Reference, Volume 1: The MPI Core
- MPI - The Complete Reference, Volume 2: The MPI Extensions
- Using MPI: Portable Parallel Programming with the Message-Passing Interface
- Using MPI-2: Advanced Features of the Message-Passing Interface