1
Computational Methods in Astrophysics
Dr Rob Thacker (AT319E)
thacker@ap.smu.ca
2
This week
Part 1: MPI-2: RMA, Parallel I/O
Part 2: (Wed) Odds and ends; wrapping up MPI: extensions & mixed-mode
3
Part 1: MPI-2 RMA & I/O
One-sided communication is a significant step forward in functionality for MPI
Introduced in the MPI-2 standard; MPI-3 includes updates
The ability to retrieve remote data without cooperative message passing allows more time to be spent in computation
One-sided comms are sensitive to OS/machine optimizations, though
4
MPI-2 RMA
Is it worthwhile? It depends entirely on your algorithm
If you have a high ratio of computation to communication, probably not
RMA thus sits in a strange place:
Technically non-trivial
Not all codes need it
5
Standard message passing
[Diagram: Host A and Host B, each with CPU, memory/buffer, and NIC; MPI_Send on A, MPI_Recv on B]
Packet transmission is mediated directly by the CPUs on both machines; multiple buffer copies may be necessary
6
Traditional message passing
Both sender and receiver must cooperate:
The send must address the buffer to be sent
The sender specifies the destination and tag
The receive must specify its own buffer
The receive must specify the origin and tag
In blocking mode this is a very expensive operation: both sender and receiver must cooperate and stop any computation they may be doing
7
Sequence of operations to 'get' data
Suppose process A wants to retrieve a section of an array from process B (process B is unaware of what is required):
Process A executes MPI_Send to B with details of what it requires
Process B executes MPI_Recv from A and determines the data required by A
Process B executes MPI_Send to A with the required data
Process A executes MPI_Recv from B
That is 4 MPI-1 commands; additionally, process B has to be aware of the incoming message
This requires frequent polling for messages – potentially highly wasteful
[Diagram: Process A and Process B exchanging two MPI_SEND/MPI_RECV pairs; a sketch of the pattern follows]
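A minimal C sketch of this four-call request/reply pattern (the ranks A and B, tags, buffer names, and index payload are illustrative, not from the slide):

    /* Process A asks B for a slice of B's array; B must cooperate. */
    int want[2] = {first, count};      /* the slice A needs (known only to A) */
    if (rank == A) {
        MPI_Send(want, 2, MPI_INT, B, TAG_REQ, MPI_COMM_WORLD);
        MPI_Recv(recvbuf, count, MPI_DOUBLE, B, TAG_DATA,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    } else if (rank == B) {
        MPI_Recv(want, 2, MPI_INT, A, TAG_REQ,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&array[want[0]], want[1], MPI_DOUBLE, A, TAG_DATA,
                 MPI_COMM_WORLD);
    }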
8
Even worse example
Suppose you need to read a remote list to figure out what data you need – the sequence of operations is then:
GET LIST:  A: MPI_Send (get list)      B: MPI_Recv (list request)
           A: MPI_Recv (list returned) B: MPI_Send (list info)
GET DATA:  A: MPI_Send (get data)      B: MPI_Recv (data request)
           A: MPI_Recv (data returned) B: MPI_Send (data info)
9
Coarse versus fine graining
The expense of message passing implicitly suggests MPI-1 programs should be coarse grained
The unit of messaging in NUMA systems is the cache line
What about an API for (fast network) distributed memory systems that is optimized for smaller messages?
e.g. ARMCI: http://www.emsl.pnl.gov/docs/parsoft/armci
This would enable distributed memory systems to have moderately high performance fine-grained parallelism
A number of applications are suited to this style of parallelism (especially those with irregular data structures)
APIs supporting fine-grained parallelism use one-sided communication for efficiency – no handshaking to take processes away from computation
10
Puts and Gets in MPI-2
In one-sided communication the number of operations is reduced by at least a factor of 2
For our earlier example, 4 MPI operations can be replaced with a single MPI_Get
This circumvents the need to forward information to the remote CPU specifying what data is required
MPI_Send + MPI_Recv pairs are replaced by three possibilities:
MPI_Get: retrieve a section of a remote array
MPI_Put: place a section of a local array into remote memory
MPI_Accumulate: remote update combining an operator and local data
However, the programmer must be aware of the possibility of remote processes changing local arrays! (A minimal sketch of the single-call replacement follows.)
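A minimal C sketch of the single-call replacement, assuming a window win has already been created over B's array (the ranks A and B, the count n, and the displacement first are illustrative):

    /* Origin process A fetches n doubles starting at element 'first' of
       B's window; B posts no matching receive. */
    MPI_Win_fence(0, win);                  /* open an access epoch        */
    if (rank == A)
        MPI_Get(localbuf, n, MPI_DOUBLE,    /* where the data lands        */
                B,                          /* target rank                 */
                (MPI_Aint) first,           /* displacement in the window  */
                n, MPI_DOUBLE, win);
    MPI_Win_fence(0, win);                  /* data is valid only after this */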
11
RMA illustrated
[Diagram: Host A and Host B, each with CPU and memory/buffer; the NICs (with RDMA engines) move data directly between the memory buffers]
12
Benefits of one-sided communication
No matching operation is required on the remote process
All parameters of the operation are specified by the origin process
Allows very flexible communication patterns
Communication and synchronization are separated; synchronization is now implied by the access epoch
Removes the need to poll for incoming messages
Significantly improves performance of applications with irregular and unpredictable data movement
13
Windows: the fundamental construct for one-sided comms
One-sided comms may only write into memory regions ("windows") set aside for communication
Access to the windows must occur within a specific access epoch
All processes may agree on an access epoch, or just a pair of processes may cooperate
[Diagram: origin process performing a one-sided put into a memory window on the target]
14
Creating a window
MPI_Win_create(base, size, disp_unit, info, comm, win, ierr)
base: address of the window
size: size of the window in BYTES
disp_unit: local unit size for displacements, in bytes (e.g. 4)
info: argument describing the type of operations that may occur on the window
win: window object returned by the call
The window should also be freed with MPI_Win_free(win, ierr)
Window performance is always better when the base aligns on a word boundary
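A short C sketch of window creation and teardown (the element count n is illustrative; MPI_Alloc_mem is used here as the standard recommends for RMA buffers, though a statically declared array would also work for active target):

    /* Expose a local array of n doubles as an RMA window. */
    double *a;
    MPI_Win win;
    MPI_Alloc_mem(n * sizeof(double), MPI_INFO_NULL, &a);
    MPI_Win_create(a, n * sizeof(double),   /* base and size in bytes */
                   sizeof(double),          /* displacement unit      */
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    /* ... access epochs and communication go here ... */
    MPI_Win_free(&win);
    MPI_Free_mem(a);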
15
Options to info
Vendors are allowed to include options to improve window performance under certain circumstances
MPI_INFO_NULL is always valid
If win_lock is not going to be used, then this information can be passed as an info argument:

    MPI_Info info;
    MPI_Info_create(&info);
    MPI_Info_set(info, "no_locks", "true");
    MPI_Win_create(..., info, ...);
    MPI_Info_free(&info);
16
Access epochs
Although communication is mediated by GETs and PUTs, these calls do not guarantee message completion
All communication must occur within an access epoch
Communication is only guaranteed to have completed when the epoch is finished
This is to optimize messaging – the implementation does not have to worry about completion until the access epoch is ended
Two ways of coordinating access:
Active target: the remote process governs completion
Passive target: the origin process governs completion
17
Access epochs: active target
Active target communication is usually expressed as a collective operation
All processes agree on the beginning of the window
Communication occurs
Communication is then guaranteed to have completed when the second Win_fence is called

    !Processes agree on fence (origin, target, and all other processes)
    call MPI_Win_fence
    !Put remote data
    call MPI_Put(..)
    !Collective fence
    call MPI_Win_fence
    !Message is guaranteed to complete after win_fence
    !on the remote process completes
18
Access epochs: passive target
For passive target communication, the origin process controls all aspects of communication
The target process is oblivious to the communication epoch
MPI_Win_lock/MPI_Win_unlock facilitate the communication

    !Lock remote process window
    call MPI_Win_lock
    !Put remote data
    call MPI_Put(..)
    !Unlock remote process window
    call MPI_Win_unlock
    !Message is guaranteed to complete after win_unlock
19
Non-collective active target
Win_fence is collective over the communicator of the window
A similar construct over process groups is available
See Using MPI-2 for more details

    !Origin:
    call MPI_Win_start(group,..)
    !Put remote data
    call MPI_Put(..)
    call MPI_Win_complete(win)
    !Target:
    call MPI_Win_post(group,..)
    call MPI_Win_wait(win)
    !Message is guaranteed to complete after the waits finish
20
Rules for memory areas assigned to windows
Memory regions for windows involved in active target synchronization may be statically declared
Memory regions for windows involved in passive target access epochs may have to be dynamically allocated (depends on the implementation)
In Fortran this requires the definition of Cray-like pointers to arrays
MPI_Alloc_mem(size, MPI_INFO_NULL, pointer, ierr)
Must be paired with the freeing call MPI_Free_mem(array, ierr)

    double precision u
    pointer (p, u(0:50,0:20))
    integer (kind=MPI_ADDRESS_KIND) size
    integer sizeofdouble, ierr
    call MPI_Sizeof(u, sizeofdouble, ierr)
    size = 51*21*sizeofdouble
    call MPI_Alloc_mem(size, MPI_INFO_NULL, p, ierr)
    ... can now refer to u ...
    call MPI_Free_mem(u, ierr)
21
More on passive target access
This is the closest idea to shared memory operation on a distributed system
Very flexible communication model
Multiple origin processes must negotiate on access to locks
[Diagram: timeline across processes A, B and C; origin processes lock a target's window (e.g. lock A), perform a locked put into that window, and unlock it, with several such epochs targeting A and B over time]
22
MPI_Get/Put/Accumulate
These are non-blocking operations
MPI_Get(origin address, count, datatype, target, target displ, target count, target datatype, win, ierr)
Must specify information about both origin and remote datatypes – more arguments
No need to specify a communicator – it is contained in the window
target displ is the displacement from the beginning of the target window
Note the remote datatype cannot resolve to overlapping entries
MPI_Put has the same interface
MPI_Accumulate additionally requires the reduction operator to be specified (the argument before the window)
Same operators as MPI_REDUCE, but user-defined functions cannot be used
Note MPI_Accumulate is really "MPI_Put_accumulate"; there is no get functionality (must do that by hand)
23
Don't forget datatypes
In one-sided comms, datatypes play an extremely important role
They specify explicitly the unpacking on the remote node
The origin node must know precisely what the required remote datatype is
[Diagram: a contiguous origin datatype mapping onto a sparse target datatype; a sketch of this idea follows]
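A short C sketch of a strided (sparse) target datatype used in a get, assuming a window win of doubles on rank target (the count and stride are illustrative):

    /* Fetch every 4th double from the target window into a contiguous
       local buffer.  The vector target datatype has no overlapping
       entries, as RMA requires. */
    MPI_Datatype strided;
    MPI_Type_vector(COUNT, 1, 4, MPI_DOUBLE, &strided); /* COUNT blocks of 1, stride 4 */
    MPI_Type_commit(&strided);
    MPI_Win_fence(0, win);
    MPI_Get(localbuf, COUNT, MPI_DOUBLE,   /* contiguous at the origin     */
            target, 0,                     /* displacement 0 in the window */
            1, strided,                    /* one strided element remotely */
            win);
    MPI_Win_fence(0, win);
    MPI_Type_free(&strided);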
24
MPI_Accumulate
Extremely powerful operation: "put + op"
Question marks over implementations though: who actually implements the "op" side of things?
If on the remote node, then there must be an extra thread to do this operation
If on the local node, then accumulate becomes a get, followed by the operation, followed by a put
Many computations involve summing values into fields; MPI_Accumulate provides the perfect command for this
For scientific computation it is frequently more useful than MPI_Put (see the sketch below)
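A minimal C sketch of summing local contributions into a remote field, assuming a window win over the field on rank target (the count n and displacement disp are illustrative):

    /* Add n local partial sums into the target's field starting at
       window displacement disp; MPI applies MPI_SUM element-wise. */
    MPI_Win_fence(0, win);
    MPI_Accumulate(contrib, n, MPI_DOUBLE,
                   target, disp,
                   n, MPI_DOUBLE,
                   MPI_SUM, win);
    MPI_Win_fence(0, win);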
25
Use PUTs rather than GETs
Although both PUTs and GETs are non-blocking, it is desirable to use PUTs whenever possible
GETs imply an inherent wait for data arrival and only complete once the returned message has been fully decoded
PUTs can be thought of as "fire and forget"
26
MPI_Win_fence (active target sync.)
MPI_Win_fence(info, win, ierr)
The info (assert) argument allows the user to specify constants that may improve performance (default of 0):
MPI_MODE_NOSTORE: no local stores
MPI_MODE_NOPUT: no puts will occur within the window (don't have to watch for remote updates)
MPI_MODE_NOPRECEDE: no earlier epochs of communication (optimize assumptions about window variables)
MPI_MODE_NOSUCCEED: no epochs of communication will follow this fence
NOPRECEDE and NOSUCCEED must be called collectively
Multiple messages sent to the same target between fences may be concatenated to improve performance
(A brief sketch of the assert flags follows.)
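A brief C sketch of how the assert flags might be used around a single RMA phase (the window and the puts in between are assumed to exist already):

    /* First fence of the phase: no RMA epoch preceded it. */
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
    /* ... MPI_Put calls into remote windows ... */
    /* Last fence: no further RMA epochs will follow on this window. */
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);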
27
MPI_Win_(un)lock (passive target sync.)
MPI_Win_lock(lock_type, target, info, win, ierr)
Lock types:
MPI_LOCK_SHARED – use only for concurrent reads
MPI_LOCK_EXCLUSIVE – use when updates are necessary
Although called a lock, it actually isn't one (a very poor naming convention) – think "MPI_begin/end_passive_target_epoch"
Only on the local process does MPI_Win_lock act as a lock; otherwise it is non-blocking
It provides a mechanism to ensure that the communication epoch is completed
It says nothing about the order in which other competing message updates will occur on the target (the consistency model is not specified)
(A minimal sketch follows.)
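A minimal C sketch of a passive-target update, assuming a window win on rank target (the count n and displacement disp are illustrative):

    /* Exclusive lock because we are updating the target's window. */
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target, 0, win);
    MPI_Put(val, n, MPI_DOUBLE,
            target, disp,
            n, MPI_DOUBLE, win);
    MPI_Win_unlock(target, win);   /* put is complete only after this returns */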
28
Subtleties of nonblocking 'locking' and messaging
Suppose we wanted to implement a fetch-and-add:

    int one=1;
    MPI_Win_create(...,&win);
    ...
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Get(&value, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    MPI_Accumulate(&one, 1, MPI_INT, 0, 0, 1, MPI_INT, MPI_SUM, win);
    MPI_Win_unlock(0, win);

The code is erroneous for two reasons:
1. You cannot read and update the same memory location in the same access epoch
2. Even if you could, the communication is nonblocking and can complete in any order
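For reference, MPI-3 (mentioned earlier as updating the RMA interface) added atomic read-modify-write calls that express this in one operation; a hedged sketch, assuming the same window with the counter at displacement 0 on rank 0:

    /* MPI-3: atomically fetch the old value and add one to it. */
    int one = 1, oldval;
    MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
    MPI_Fetch_and_op(&one, &oldval, MPI_INT,
                     0,                /* target rank             */
                     0,                /* displacement in window  */
                     MPI_SUM, win);
    MPI_Win_unlock(0, win);
    /* oldval now holds the counter's value before the increment. */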
29
Simple example
exchng2 for the 2-d Poisson problem. No gets are required – just put your own data into the other processes' memory windows.

      subroutine exchng2( a, sx, ex, sy, ey, win,
     *                    left_nbr, right_nbr, top_nbr, bot_nbr,
     *                    right_ghost_disp, left_ghost_disp,
     *                    top_ghost_disp, coltype, right_coltype, left_coltype )
      include 'mpif.h'
      integer sx, ex, sy, ey, win, ierr
      integer left_nbr, right_nbr, top_nbr, bot_nbr
      integer coltype, right_coltype, left_coltype
      double precision a(sx-1:ex+1,sy-1:ey+1)
      integer (kind=MPI_ADDRESS_KIND) right_ghost_disp,
     *        left_ghost_disp, top_ghost_disp, bot_ghost_disp
      integer nx

      nx = ex - sx + 1
      call MPI_WIN_FENCE( 0, win, ierr )
C Put bottom edge into bottom neighbor's top ghost cells
      call MPI_PUT( a(sx,sy), nx, MPI_DOUBLE_PRECISION, bot_nbr,
     *              top_ghost_disp, nx, MPI_DOUBLE_PRECISION,
     *              win, ierr )
C Put top edge into top neighbor's bottom ghost cells
      bot_ghost_disp = 1
      call MPI_PUT( a(sx,ey), nx, MPI_DOUBLE_PRECISION, top_nbr,
     *              bot_ghost_disp, nx, MPI_DOUBLE_PRECISION,
     *              win, ierr )
C Put right edge into right neighbor's left ghost cells
      call MPI_PUT( a(ex,sy), 1, coltype,
     *              right_nbr, left_ghost_disp, 1, right_coltype,
     *              win, ierr )
C Put left edge into the left neighbor's right ghost cells
      call MPI_PUT( a(sx,sy), 1, coltype,
     *              left_nbr, right_ghost_disp, 1, left_coltype,
     *              win, ierr )
      call MPI_WIN_FENCE( 0, win, ierr )
      return
      end
30
Problems with passive target access
Window creation must be collective over the communicator – expensive and time consuming
MPI_Alloc_mem may be required
Race conditions on a single window location under concurrent get/put must be handled by the user (see section 6.4 in Using MPI-2)
Local and remote operations on a remote window cannot occur concurrently, even if different parts of the window are being accessed at the same time – local processes must execute MPI_Win_lock as well
Multiple windows may overlap, but you must ensure concurrent operations on different windows do not lead to race conditions on the overlap
You cannot access (via MPI_Get, for example) and update (via a put back) the same location in the same access epoch (whether between fences or lock/unlock)
31
Drawbacks of one-sided comms in general
No evidence for advantage except on:
SMP machines
Cray distributed memory systems
Infiniband (which has an RDMA engine)
Slow acceptance – just how mature is it?
Unclear how many applications actually benefit from this model
Not entirely clear whether nonblocking normal send/recvs can achieve similar speed for some applications
32
Case Study: Matrix transpose
See the Sun documentation
Need to transpose elements across processor space
Could do one element at a time (bad idea!)
Aggregate as much local data as possible and send large messages (requires a lot of local data movement)
Or send medium-sized contiguous packets of elements (there is some contiguity in the data layout)
33
Parallel Issues
[Diagram: a 4x4 matrix (elements 1-16) and its transpose, shown both as 2-d layouts and in 1-d storage order, with storage split across two processors P0 and P1]
34
Possible parallel algorithm
[Diagram: local permutation, then send data (all-to-all exchange), then a second local permutation produces the transposed layout]
35
Program 1
This code aggregates data locally and uses the two-sided MPI_Alltoall collective operation. Data is then rearranged using a subroutine called DTRANS().

      include "mpif.h"
      real(8), allocatable, dimension(:) :: a, b, c, d
      real(8) t0, t1, t2, t3
! initialize parameters
      call init(me,np,n,nb)
! allocate matrices
      allocate(a(nb*np*nb))
      allocate(b(nb*nb*np))
      allocate(c(nb*nb*np))
      allocate(d(nb*np*nb))
! initialize matrix
      call initialize_matrix(me,np,nb,a)
! timing
      do itime = 1, 10
        call MPI_Barrier(MPI_COMM_WORLD,ier)
        t0 = MPI_Wtime()
! first local transpose
        do k = 1, nb
          do j = 0, np - 1
            ioffa = nb * ( j + np * (k-1) )
            ioffb = nb * ( (k-1) + nb * j )
            do i = 1, nb
              b(i+ioffb) = a(i+ioffa)
            enddo
          enddo
        enddo
        t1 = MPI_Wtime()
! global all-to-all
        call MPI_Alltoall(b, nb*nb, MPI_REAL8, &
             c, nb*nb, MPI_REAL8, MPI_COMM_WORLD, ier)
        t2 = MPI_Wtime()
! second local transpose
        call dtrans('o', 1.d0, c, nb, nb*np, d)
        call MPI_Barrier(MPI_COMM_WORLD,ier)
        t3 = MPI_Wtime()
        if ( me .eq. 0 ) &
          write(6,'(f8.3," seconds; breakdown on proc 0 = ",3f10.3)') &
          t3 - t0, t1 - t0, t2 - t1, t3 - t2
      enddo
! check
      call check_matrix(me,np,nb,d)
      deallocate(a)
      deallocate(b)
      deallocate(c)
      deallocate(d)
      call MPI_Finalize(ier)
      end
36
Version 2 – one sided
No local aggregation is used, and communication is mediated via MPI_Put. Data is then rearranged using a subroutine called DTRANS().

      include "mpif.h"
      integer(kind=MPI_ADDRESS_KIND) nbytes
      integer win
      real(8) c(*)
      pointer (cptr,c)
      real(8), allocatable, dimension(:) :: a, b, d
      real(8) t0, t1, t2, t3
! initialize parameters
      call init(me,np,n,nb)
! allocate matrices
      allocate(a(nb*np*nb))
      allocate(b(nb*nb*np))
      allocate(d(nb*np*nb))
      nbytes = 8 * nb * nb * np
      call MPI_Alloc_mem(nbytes, MPI_INFO_NULL, cptr, ier)
      if ( ier .eq. MPI_ERR_NO_MEM ) stop
! create window
      call MPI_Win_create(c, nbytes, 1, MPI_INFO_NULL, MPI_COMM_WORLD, win, ier)
! initialize matrix
      call initialize_matrix(me,np,nb,a)
! timing
      do itime = 1, 10
        call MPI_Barrier(MPI_COMM_WORLD,ier)
        t0 = MPI_Wtime()
        t1 = t0
! combined local transpose with global all-to-all
        call MPI_Win_fence(0, win, ier)
        do ip = 0, np - 1
          do ib = 0, nb - 1
            nbytes = 8 * nb * ( ib + nb * me )
            call MPI_Put(a(1+nb*ip+nb*np*ib), nb, MPI_REAL8, ip, nbytes, &
                 nb, MPI_REAL8, win, ier)
          enddo
        enddo
        call MPI_Win_fence(0, win, ier)
        t2 = MPI_Wtime()
! second local transpose
        call dtrans('o', 1.d0, c, nb, nb*np, d)
        call MPI_Barrier(MPI_COMM_WORLD,ier)
        t3 = MPI_Wtime()
        if ( me .eq. 0 ) &
          write(6,'(f8.3," seconds; breakdown on proc 0 = ",3f10.3)') &
          t3 - t0, t1 - t0, t2 - t1, t3 - t2
      enddo
! check
      call check_matrix(me,np,nb,d)
! deallocate matrices and free the window
      call MPI_Win_free(win, ier)
      deallocate(a)
      deallocate(b)
      deallocate(d)
      call MPI_Free_mem(c, ier)
      call MPI_Finalize(ier)
      end
37
Parallel I/O
Motivations
Review of different I/O strategies
Far too many issues to put into one lecture – plenty of web resources provide more details if you need them
Draws heavily from Using MPI-2 – see the excellent discussion presented therein for further details
38
Non-parallel I/O
The simplest way to do I/O – a number of factors may lead to this kind of model:
May have to implement it this way because only one process is capable of I/O
May have to use a serial I/O library
I/O may be enhanced for large writes
Easiest file handling – only one file to keep track of
Arguments against: strongly limiting in terms of throughput if the underlying file system does permit parallel I/O
[Diagram: all processes funnel their data to one process, which writes a single file]
39
Additional argument for a parallel I/O standard
Standard UNIX I/O is not portable
Endianness becomes a serious problem across different machines
Writing wrappers to perform byte conversion is tedious
40
Simple parallel I/O
Multiple independent files are written (the same style of parallel I/O allowed under OpenMP)
May still be able to use sequential I/O libraries
Significant increases in throughput are possible
Drawbacks are potentially serious:
Now have multiple files to keep track of (concatenation may not be an option)
May be non-portable if the application reading these files must have the same number of processes
[Diagram: each process writes its own file]
41
Simple parallel I/O (no MPI calls)
I/O operations are completely independent across the processes
Append the rank to each file name to specify the different files
Need to be very careful about performance – individual files may be too small for good throughput

    #include "mpi.h"
    #include <stdio.h>
    #define BUFSIZE 100
    int main(int argc, char *argv[])
    {
        int i, myrank, buf[BUFSIZE];
        char filename[128];
        FILE *myfile;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        for (i = 0; i < BUFSIZE; i++)
            buf[i] = myrank * BUFSIZE + i;
        sprintf(filename, "testfile.%d", myrank);
        myfile = fopen(filename, "w");
        fwrite(buf, sizeof(int), BUFSIZE, myfile);
        fclose(myfile);
        MPI_Finalize();
        return 0;
    }
42
Simple parallel I/O (using MPI calls)
Rework the previous example to use MPI calls
Note the file pointer has been replaced by a variable of type MPI_File
Under MPI I/O, open, write (read) and close statements are provided
Note MPI_COMM_SELF denotes a communicator over the local process only

    #include "mpi.h"
    #include <stdio.h>
    #define BUFSIZE 100
    int main(int argc, char *argv[])
    {
        int i, myrank, buf[BUFSIZE];
        char filename[128];
        MPI_File myfile;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        for (i = 0; i < BUFSIZE; i++)
            buf[i] = myrank * BUFSIZE + i;
        sprintf(filename, "testfile.%d", myrank);
        MPI_File_open(MPI_COMM_SELF, filename,
                      MPI_MODE_WRONLY | MPI_MODE_CREATE,
                      MPI_INFO_NULL, &myfile);
        MPI_File_write(myfile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&myfile);
        MPI_Finalize();
        return 0;
    }
43
MPI_File_open arguments
MPI_File_open(comm, filename, accessmode, info, filehandle, ierr)
comm – you can choose immediately whether file access is collective or local: MPI_COMM_WORLD or MPI_COMM_SELF
The access mode is specified by or'ing together flags such as MPI_MODE_CREATE and MPI_MODE_WRONLY (same as Unix open)
The file handle is passed back from this call to be used later in MPI_File_write
44
MPI_File_write
MPI_File_write(filehandle, buff, count, datatype, status, ierr)
Very similar to the message passing interface – imagine the file handle as providing the destination
"Address, count, datatype" interface
Specification of non-contiguous writes would be done via a user-defined datatype
MPI_STATUS_IGNORE can be passed in the status field: this informs the system not to fill the field since the user will ignore it, which may slightly improve I/O performance when the status is not needed
45
True parallel MPI I/O
Processes must all now agree on opening a single file
Each process must have its own pointer within the file
The file clearly must reside on a single file system
The file can be read with a different number of processes than it was written with
[Diagram: all processes write in parallel to a single shared file]
46
Parallel I/O to a single file using MPI calls
Rework the previous example to write to a single file
The write is now collective: MPI_COMM_SELF has been replaced by MPI_COMM_WORLD
All processes agree on a collective name for the file, "testfile"
Access to given parts of the file is specifically controlled via MPI_File_set_view – displacements are shifted according to the local rank

    #include "mpi.h"
    #include <stdio.h>
    #define BUFSIZE 100
    int main(int argc, char *argv[])
    {
        int i, myrank, buf[BUFSIZE];
        MPI_File thefile;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        for (i = 0; i < BUFSIZE; i++)
            buf[i] = myrank * BUFSIZE + i;
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_WRONLY | MPI_MODE_CREATE,
                      MPI_INFO_NULL, &thefile);
        MPI_File_set_view(thefile, myrank * BUFSIZE * sizeof(int),
                          MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
        MPI_File_write(thefile, buf, BUFSIZE, MPI_INT, MPI_STATUS_IGNORE);
        MPI_File_close(&thefile);
        MPI_Finalize();
        return 0;
    }
47
File view
A file view defines which portion of a file is "visible" to a given process
On first opening a file the entire file is visible
Data is described at the byte level initially
MPI_File_set_view provides information (datatypes) to enable reading of the data and specifies which parts of the file should be skipped
48
MPI_File_set_view
MPI_File_set_view(filehandle, displ, etype, filedatatype, datarep, info, ierr)
displ controls the BYTE offset from the beginning of the file
The displacement is of type MPI_Offset, larger than a normal MPI_INT, to allow for 64-bit addressing (byte offsets can easily exceed the 32-bit limit)
etype is the datatype of the buffer
filedatatype is the corresponding datatype in the file, which must be either etype or derived from it (must use MPI_Type_create etc.)
datarep:
native: data is stored in the file as in memory
internal: implementation-specific format that may provide a level of portability
external32: 32-bit big-endian IEEE format (defined for all MPI implementations); only use it if portability is required (conversion may be necessary)
The view is the portion of the file visible to a given process
49
etype and filetype in action
Suppose we have a buffer with an etype of MPI_INT
The filetype is defined to be 2 MPI_INTs followed by an offset of 4 MPI_INTs (extent = 6)
With a displacement of 5 MPI_INTs:

    MPI_File_set_view(fh, 5*sizeof(int), etype, filetype, "native", MPI_INFO_NULL);
    MPI_File_write(fh, buf, 1000, MPI_INT, MPI_STATUS_IGNORE);

[Diagram: after the 5-int displacement the file is tiled with copies of the filetype; only the 2-int data region of each 6-int tile is written]
(A sketch of how such a filetype can be constructed follows.)
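One way to build such a filetype, as a hedged sketch (the construction is not shown on the slide; the names match the call above):

    /* Filetype: 2 contiguous ints of data followed by a gap of 4 ints,
       giving a total extent of 6 ints. */
    MPI_Datatype contig, filetype;
    MPI_Type_contiguous(2, MPI_INT, &contig);
    MPI_Type_create_resized(contig, 0, 6 * sizeof(int), &filetype);
    MPI_Type_commit(&filetype);
    /* ... use filetype in MPI_File_set_view as above, then free ... */
    MPI_Type_free(&filetype);
    MPI_Type_free(&contig);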
50
Fortran issues
Two levels of support:
basic – just include 'mpif.h' (designed for f77 backwards compatibility)
extended – need to use the f90 module (use mpi)
Use the extended set whenever possible
Note that the MPI_FILE type is an integer in Fortran

      PROGRAM main
      include 'mpif.h'    ! Should really use "use mpi"
      integer ierr, i, myrank, BUFSIZE, thefile
      parameter (BUFSIZE=100)
      integer buf(BUFSIZE)
      integer(kind=MPI_OFFSET_KIND) disp
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, myrank, ierr)
      do i = 1, BUFSIZE
         buf(i) = myrank * BUFSIZE + i
      end do
      call MPI_File_open(MPI_COMM_WORLD, 'testfile', &
              MPI_MODE_WRONLY + MPI_MODE_CREATE, &
              MPI_INFO_NULL, thefile, ierr)
      disp = myrank * BUFSIZE * 4
      call MPI_File_set_view(thefile, disp, MPI_INTEGER, &
              MPI_INTEGER, 'native', &
              MPI_INFO_NULL, ierr)
      call MPI_File_write(thefile, buf, BUFSIZE, &
              MPI_INTEGER, MPI_STATUS_IGNORE, ierr)
      call MPI_File_close(thefile, ierr)
      call MPI_Finalize(ierr)
      END PROGRAM main
51
Reading a single file with an unknown number of processors
New function: MPI_File_get_size
Returns the size of an open file; need to use a 64-bit int (MPI_Offset)
Check how much data has been read by using the status handle: pass it to MPI_Get_count to determine the number of datatypes that were read
If the number of read items is less than expected then (hopefully) EOF has been reached

    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>
    int main(int argc, char *argv[])
    {
        int myrank, numprocs, bufsize, *buf, count;
        MPI_File thefile;
        MPI_Status status;
        MPI_Offset filesize;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &myrank);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &thefile);
        MPI_File_get_size(thefile, &filesize);
        filesize = filesize / sizeof(int);      /* in number of ints      */
        bufsize  = filesize / numprocs + 1;     /* local number to read   */
        buf = (int *) malloc(bufsize * sizeof(int));
        MPI_File_set_view(thefile, myrank * bufsize * sizeof(int),
                          MPI_INT, MPI_INT, "native", MPI_INFO_NULL);
        MPI_File_read(thefile, buf, bufsize, MPI_INT, &status);
        MPI_Get_count(&status, MPI_INT, &count);
        printf("process %d read %d ints\n", myrank, count);
        MPI_File_close(&thefile);
        MPI_Finalize();
        return 0;
    }
52
Using individual file pointers by hand
We assume a known filesize in this case
Pointers must be explicitly moved with MPI_File_seek
The offset of the position to move the pointer to is the second argument, of type MPI_Offset
Good practice to resolve the calculation into such a variable and then call using this variable

    #include "mpi.h"
    #include <stdlib.h>
    #define FILESIZE (1024*1024)
    int main(int argc, char *argv[])
    {
        int *buf, rank, nprocs, nints, bufsize;
        MPI_File fh;
        MPI_Status status;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
        bufsize = FILESIZE / nprocs;
        buf = (int *) malloc(bufsize);
        nints = bufsize / sizeof(int);
        MPI_File_open(MPI_COMM_WORLD, "testfile",
                      MPI_MODE_RDONLY, MPI_INFO_NULL, &fh);
        MPI_File_seek(fh, rank * bufsize, MPI_SEEK_SET);
        MPI_File_read(fh, buf, nints, MPI_INT, &status);
        MPI_File_close(&fh);
        MPI_Finalize();
        return 0;
    }
53
Using explicit offsets
MPI_File_read/write are individual-file-pointer functions
Both use the current location of the pointer to determine where to read/write
Must perform a seek to arrive at the correct region of data
Explicit-offset functions don't use an individual file pointer
The file offset is passed directly as an argument to the function, so a seek is not required
MPI_File_read_at / MPI_File_write_at
Must use this version in a multithreaded environment
54
Revised code for explicit offsets
No need to apply seeks – no specific movement of the file pointer is applied; instead the offset is passed as an argument
Remember offsets must be of kind MPI_OFFSET_KIND
Same issue here with precalculating the offset and resolving it into a variable of the appropriate type

      PROGRAM main
      include 'mpif.h'
      integer FILESIZE, MAX_BUFSIZE, INTSIZE
      parameter (FILESIZE=1048576, MAX_BUFSIZE=1048576)
      parameter (INTSIZE=4)
      integer buf(MAX_BUFSIZE), rank, ierr, fh, nprocs, nints
      integer status(MPI_STATUS_SIZE), count
      integer(kind=MPI_OFFSET_KIND) offset
      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      call MPI_File_open(MPI_COMM_WORLD, 'testfile', &
              MPI_MODE_RDONLY, &
              MPI_INFO_NULL, fh, ierr)
      nints = FILESIZE / (nprocs * INTSIZE)
      offset = rank * nints * INTSIZE
      call MPI_File_read_at(fh, offset, buf, nints, &
              MPI_INTEGER, status, ierr)
      call MPI_Get_count(status, MPI_INTEGER, count, ierr)
      print *, 'process ', rank, ' read ', count, ' ints'
      call MPI_File_close(fh, ierr)
      call MPI_Finalize(ierr)
      END PROGRAM main
55
Dealing with multidimensional arrays
Storage formats differ for C versus Fortran (row major versus column major)
MPI_Type_create_darray and MPI_Type_create_subarray are used to specify derived datatypes
These datatypes can then be used to resolve local regions within a linearized global array
Specifically, they deal with the noncontiguous nature of the domain decomposition
See the Using MPI-2 book for more details (a hedged sketch of the subarray approach follows)
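A hedged C sketch of the subarray approach for a 2-d block decomposition (the global sizes NX/NY, process-grid sizes PX/PY, coords array, buffer localbuf, and file name are illustrative, not from the slide):

    /* Each process describes its block of an NX x NY global array of
       doubles and sets a file view so that a plain read touches only
       that block of the linearized global array. */
    int gsizes[2] = {NX, NY};                     /* global dimensions */
    int lsizes[2] = {NX/PX, NY/PY};               /* local block       */
    int starts[2] = {coords[0]*lsizes[0], coords[1]*lsizes[1]};
    MPI_Datatype filetype;
    MPI_File fh;
    MPI_Type_create_subarray(2, gsizes, lsizes, starts,
                             MPI_ORDER_C, MPI_DOUBLE, &filetype);
    MPI_Type_commit(&filetype);
    MPI_File_open(MPI_COMM_WORLD, "testfile", MPI_MODE_RDONLY,
                  MPI_INFO_NULL, &fh);
    MPI_File_set_view(fh, 0, MPI_DOUBLE, filetype, "native", MPI_INFO_NULL);
    MPI_File_read(fh, localbuf, lsizes[0]*lsizes[1], MPI_DOUBLE,
                  MPI_STATUS_IGNORE);
    MPI_File_close(&fh);
    MPI_Type_free(&filetype);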