Presentation transcript: "Porting MPI Programs to the IBM Cluster 1600" (Peter Towers, March 2004)

Slide 1: Porting MPI Programs to the IBM Cluster 1600
Peter Towers, March 2004

Slide 2: Topics
- The current hardware switch
- Parallel Environment (PE)
- Issues with Standard Sends/Receives
- Use of non-blocking communications
- Debugging MPI programs
- MPI tracing
- Profiling MPI programs
- Tasks per Node
- Communications Optimisation
- The new hardware switch
- Third Practical

Slide 3: The current hardware switch
- Designed for a previous generation of IBM hardware
- Referred to as the Colony switch
- 2 switch adaptors per logical node
  - 8 processors share 2 adaptors
  - called a dual-plane switch
- Adaptors are multiplexed
  - software stripes large messages across both adaptors
- Minimum latency 21 microseconds
- Maximum bandwidth approx 350 MBytes/s
  - about 45 MB/s per task when all tasks go off node together

Slide 4: Parallel Environment (PE)
- MPI programs are managed by the IBM PE
- IBM documentation refers to PE and POE
  - POE stands for Parallel Operating Environment
  - many environment variables to tune the parallel environment
  - describes launching parallel jobs interactively
- ECMWF uses LoadLeveler for batch jobs
  - PE usage becomes almost transparent

Slide 5: Issues with Standard Sends/Receives
- The MPI standard can be implemented in different ways
- Programs may not be fully portable across platforms
- Standard sends and receives can cause problems
  - potential for deadlocks
  - need to understand blocking vs non-blocking communications
  - need to understand eager versus rendezvous protocols
- IFS had to be modified to run on the IBM

Slide 6: Blocking Communications
- MPI_Send is a blocking routine
- It returns when it is safe to re-use the buffer being sent
  - the send buffer can then be overwritten
- The MPI layer may have copied the data elsewhere
  - using internal buffer/mailbox space
  - the message is then in transit but not yet received
  - this is called an eager protocol
  - good for short messages
- The MPI layer may have waited for the receiver
  - the data is copied from send to receive buffer directly
  - lower overhead transfer
  - this is called a rendezvous protocol
  - good for large messages

Slide 7: MPI_Send on the IBM
- Uses the eager protocol for short messages
  - by default "short" means up to 4096 bytes
  - the higher the task count, the lower the value
- Uses the rendezvous protocol for long messages
- Potential for send/send deadlocks
  - tasks block in mpi_send

    if (me .eq. 0) then
       him = 1
    else
       him = 0
    endif
    call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
    call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)
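For reference, a minimal self-contained sketch of the deadlock-prone pattern on this slide follows. It is not from the original slides: the program name, tag and buffer size are illustrative, and the message is deliberately made larger than the eager limit so that both tasks block in mpi_send when run with exactly two tasks.

    program send_send_deadlock
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000   ! large enough to force the rendezvous protocol
      real*8  :: sbuff(n), rbuff(n)
      integer :: me, him, tag, ierror
      integer :: stat(MPI_STATUS_SIZE)

      call mpi_init(ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
      tag = 1
      if (me .eq. 0) then
         him = 1
      else
         him = 0
      endif
      sbuff = me

      ! Both tasks call mpi_send first: under the rendezvous protocol
      ! neither send returns until the matching receive is posted,
      ! so the program hangs here.
      call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
      call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)

      call mpi_finalize(ierror)
    end program send_send_deadlock

Reversing the send/receive order on one task, or using one of the alternatives on the following slides, removes the deadlock.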

Slide 8: Solutions to Send/Send deadlocks
- Pair up sends and receives
- Use MPI_SENDRECV
- Use a buffered send: MPI_BSEND
- Use asynchronous sends/receives: MPI_ISEND/MPI_IRECV

Slide 9: Paired Sends and Receives
- More complex code
- Requires close synchronisation

    if (me .eq. 0) then
       him = 1
       call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
       call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)
    else
       him = 0
       call mpi_recv(rbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, stat, ierror)
       call mpi_send(sbuff, n, MPI_REAL8, him, tag, MPI_COMM_WORLD, ierror)
    endif

Slide 10: MPI_SENDRECV
- Easier to code
- Still implies close synchronisation

    call mpi_sendrecv(sbuff, n, MPI_REAL8, him, 1, &
                      rbuff, n, MPI_REAL8, him, 1, &
                      MPI_COMM_WORLD, stat, ierror)

Slide 11: MPI_BSEND
- Performs a send using an additional buffer
  - the buffer is allocated by the program via MPI_BUFFER_ATTACH
  - done once as part of the program initialisation
- Typically quick to implement
  - add the mpi_buffer_attach call (how big to make the buffer?)
  - change MPI_SEND to MPI_BSEND everywhere
- But introduces an additional memory copy
  - extra overhead
  - not recommended for production codes
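The slides mention MPI_BUFFER_ATTACH without showing it, so here is a hedged sketch of the buffered-send variant of the same two-task exchange. Sizing the buffer with MPI_BSEND_OVERHEAD is standard MPI; the message length, program name and layout are illustrative rather than anything prescribed by the deck.

    program bsend_example
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100000
      real*8  :: sbuff(n), rbuff(n)
      real*8, allocatable :: attach_buf(:)
      integer :: me, him, bufsize, ierror
      integer :: stat(MPI_STATUS_SIZE)

      call mpi_init(ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)

      ! Attach a buffer once at initialisation: large enough for the
      ! biggest outstanding message plus the MPI bookkeeping overhead.
      bufsize = 8*n + MPI_BSEND_OVERHEAD
      allocate(attach_buf((bufsize + 7)/8))
      call mpi_buffer_attach(attach_buf, bufsize, ierror)

      him = 1 - me          ! assumes exactly two tasks
      sbuff = me

      ! mpi_bsend copies the message into the attached buffer and returns
      ! immediately, so the send-before-receive ordering cannot deadlock.
      call mpi_bsend(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, ierror)
      call mpi_recv (rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, stat, ierror)

      call mpi_buffer_detach(attach_buf, bufsize, ierror)
      call mpi_finalize(ierror)
    end program bsend_example

The extra copy into attach_buf is the overhead the slide warns about.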

Slide 12: MPI_IRECV / MPI_ISEND
- Use non-blocking communications
- The routines return without completing the operation
  - the operations run asynchronously
  - must NOT reuse the buffer until it is safe to do so
- Later, test that the operation completed
  - via an integer identification handle passed to MPI_WAIT
- "I" stands for immediate
  - the call returns immediately

    call mpi_irecv(rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, request, ierror)
    call mpi_send(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, ierror)
    call mpi_wait(request, stat, ierror)

- Alternatively, MPI_ISEND and MPI_RECV could have been used

Slide 13: Non-Blocking Communications
- Routines include:
  - MPI_ISEND
  - MPI_IRECV
  - MPI_WAIT
  - MPI_WAITALL
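MPI_WAITALL is listed above but not shown anywhere in the deck. A minimal sketch of posting a non-blocking receive and send and then completing both with a single call, assuming a simple two-task exchange with an illustrative buffer size:

    program waitall_example
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10000
      real*8  :: sbuff(n), rbuff(n)
      integer :: me, him, ierror
      integer :: requests(2)
      integer :: stats(MPI_STATUS_SIZE, 2)

      call mpi_init(ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
      him = 1 - me          ! assumes exactly two tasks
      sbuff = me

      ! Post the receive and the send without blocking ...
      call mpi_irecv(rbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, requests(1), ierror)
      call mpi_isend(sbuff, n, MPI_REAL8, him, 1, MPI_COMM_WORLD, requests(2), ierror)

      ! ... then wait for both to complete before touching either buffer.
      call mpi_waitall(2, requests, stats, ierror)

      call mpi_finalize(ierror)
    end program waitall_example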

Slide 14: Debugging MPI Programs
- The Universal Debug Tool and Totalview

Slide 15: The Universal Debug Tool
- The print/write statement
- Recommend the use of call flush(unit_number)
  - ensures output is not left in runtime buffers
- Recommend the use of separate output files, e.g.:

    unit_number = 100 + mytask
    write(unit_number,*) ......
    call flush(unit_number)

- Or set the environment variable MP_LABELIO=yes
- Do not output too much
- Use as few processors as possible
- Think carefully.....
- Discuss the problem with a colleague
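As a worked example of the advice above, a hedged sketch of a per-task debug output routine: mytask is obtained from mpi_comm_rank, and the explicit file name format is an illustrative choice, not something prescribed by the slides.

    subroutine debug_write(msg)
      implicit none
      include 'mpif.h'
      character(len=*), intent(in) :: msg
      character(len=32) :: fname
      integer :: mytask, unit_number, ierror

      call mpi_comm_rank(MPI_COMM_WORLD, mytask, ierror)
      unit_number = 100 + mytask
      write(fname, '(a,i4.4)') 'debug_task.', mytask   ! one file per task
      open(unit_number, file=fname, position='append')
      write(unit_number, *) msg
      call flush(unit_number)   ! make sure the output reaches the file
      close(unit_number)
    end subroutine debug_write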

Slide 16: Totalview
- Assumes you can launch X-Windows remotely
- Run totalview as part of a LoadLeveler job

    export DISPLAY=.......
    poe totalview -a a.out

- But you have to wait for the job to run.....
- Use only a few processors
  - minimises the queuing time
  - minimises the waste of resource while thinking....

Slide 17: MPI Trace Tools
- Identify message-passing hot spots
- Just link with /usr/local/lib/trace/libmpiprof.a
  - low-overhead timer for all MPI routine calls
- Produces output files named mpi_profile.N
  - where N is the task number
- Examples of the output follow

Slide 18: (example MPI trace output, shown as an image in the original slides)

Slide 19: (example MPI trace output, shown as an image in the original slides)

Slide 20: Profiling MPI Programs
- The same as for serial codes
- Use the -pg flag at compile and/or link time
- Produces multiple gmon.out.N files
  - N is the task number
- gprof a.out gmon.out.*
- The routine .kickpipes often appears high up the profile
  - an internal MPI library routine
  - where the MPI library spins waiting for something, e.g. for a message to be sent, or in a barrier

Slide 21: Tasks Per Node (1 of 2)
- Try both 7 and 8 tasks per node for multi-node jobs
  - 7 tasks may run faster than 8
  - depends on the frequency of communications
- 7 tasks leave a processor spare
  - used by the OS and background daemons such as those for GPFS
  - MPI tasks run with minimal scheduling interference
- 8 tasks are subject to scheduling interference
  - by default MPI tasks cpu-spin in kickpipes
  - they may spin waiting for a task that has been scheduled out
  - the OS has to schedule cpu time for background processes
  - random interference across nodes is cumulative

Slide 22: Tasks Per Node (2 of 2)
- Also try 8 tasks per node with MP_WAIT_MODE=sleep
  - export MP_WAIT_MODE=sleep
  - tasks give up the cpu instead of spinning
  - increases latency but reduces interference
  - effect varies from application to application
- Mixed-mode MPI/OpenMP works well
  - the master OpenMP thread does the message passing
  - while the slave OpenMP threads go to sleep
  - cpu cycles are freed up for background processes
  - used by the IFS to good effect: 2 tasks each of 4 threads per node
  - success probably depends on the parallel granularity
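A hedged sketch of the mixed-mode pattern described above. This is not IFS code: it assumes the MPI library accepts MPI_INIT_THREAD with MPI_THREAD_FUNNELED, the computation is only indicated by comments, and it again assumes a simple two-task exchange.

    program mixed_mode_sketch
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10000
      real*8  :: sbuff(n), rbuff(n)
      integer :: me, him, provided, ierror
      integer :: stat(MPI_STATUS_SIZE)

      ! Ask for funneled threading: only the master thread makes MPI calls.
      call mpi_init_thread(MPI_THREAD_FUNNELED, provided, ierror)
      call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
      him = 1 - me
      sbuff = me

    !$omp parallel
      ! ... threaded computation filling sbuff would go here ...
    !$omp master
      ! Only the master thread communicates; the other threads idle
      ! (or sleep, depending on the OpenMP wait policy) meanwhile.
      call mpi_sendrecv(sbuff, n, MPI_REAL8, him, 1, &
                        rbuff, n, MPI_REAL8, him, 1, &
                        MPI_COMM_WORLD, stat, ierror)
    !$omp end master
    !$omp barrier
      ! ... threaded computation using rbuff would go here ...
    !$omp end parallel

      call mpi_finalize(ierror)
    end program mixed_mode_sketch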

Slide 23: Communications Optimisation
- Communications costs often impact parallel speedup
- Concatenate messages
  - fewer, larger messages are better
  - reduces the effect of latency
- Increase MP_EAGER_LIMIT
  - export MP_EAGER_LIMIT=65536
  - the maximum size for messages sent with the eager protocol
- Use collective routines
- Use ISEND/IRECV
- Remove barriers
- Experiment with tasks per node
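As an illustration of the "concatenate messages" advice, a small sketch that packs two arrays into one buffer and sends a single, larger message. The use of MPI_REAL8 follows the earlier slides, but the routine, its arguments and the array names are illustrative.

    subroutine send_concatenated(a, b, n, dest, comm)
      implicit none
      include 'mpif.h'
      integer, intent(in) :: n, dest, comm
      real*8,  intent(in) :: a(n), b(n)
      real*8  :: packed(2*n)
      integer :: ierror

      ! One message of length 2n instead of two messages of length n:
      ! the per-message latency is paid once and the larger message
      ! makes better use of the available bandwidth.
      packed(1:n)     = a
      packed(n+1:2*n) = b
      call mpi_send(packed, 2*n, MPI_REAL8, dest, 1, comm, ierror)
    end subroutine send_concatenated

The receiver posts a matching receive of length 2*n and unpacks the two halves.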

Slide 24: The new hardware switch
- Designed for the Cluster 1600
- Referred to as the Federation switch
- 2 switch adaptors per physical node
  - 2 links of 2 GB/s each per adaptor
  - 32 processors share 4 links
- Adaptors/links are NOT multiplexed
- Minimum latency 10 microseconds
- Maximum bandwidth approx 2000 MBytes/s
  - about 250 MB/s per task when all tasks go off node together
- Up to 5 times better performance
- 32-processor nodes
  - will affect how we schedule and run jobs

Slide 25: Third Practical
- Contained in the directory /home/ectrain/trx/mpi/exercise3 on hpca
- Parallelising the computation of PI
- See the README for details

