Slide 1: Porting MPI Programs to the IBM Cluster 1600
Peter Towers, ECMWF
March 2004

Slide 2: Topics
- The current hardware switch
- Parallel Environment (PE)
- Issues with standard sends/receives
- Use of non-blocking communications
- Debugging MPI programs
- MPI tracing
- Profiling MPI programs
- Tasks per node
- Communications optimisation
- The new hardware switch
- Third practical

Slide 3: The current hardware switch
- Designed for a previous generation of IBM hardware
- Referred to as the Colony switch
- 2 switch adaptors per logical node
  - 8 processors share 2 adaptors
  - called a dual-plane switch
- Adaptors are multiplexed
  - software stripes large messages across both adaptors
- Minimum latency 21 microseconds
- Maximum bandwidth approx. 350 MB/s
  - about 45 MB/s per task when all tasks go off node together

Slide 4: Parallel Environment (PE)
- MPI programs are managed by the IBM PE
- IBM documentation refers to PE and POE
  - POE stands for Parallel Operating Environment
  - many environment variables to tune the parallel environment
  - talks about launching parallel jobs interactively
- ECMWF uses LoadLeveler for batch jobs
  - PE usage becomes almost transparent

Slide 5: Issues with Standard Sends/Receives
- The MPI standard can be implemented in different ways
- Programs may not be fully portable across platforms
- Standard sends and receives can cause problems
  - potential for deadlocks
  - need to understand blocking vs. non-blocking communications
  - need to understand eager versus rendezvous protocols
- IFS had to be modified to run on the IBM

Slide 6: Blocking Communications
- MPI_Send is a blocking routine
- It returns when it is safe to re-use the buffer being sent
  - the send buffer can then be overwritten
- The MPI layer may have copied the data elsewhere
  - using internal buffer/mailbox space
  - the message is then in transit but not yet received
  - this is called an eager protocol
  - good for short messages
- The MPI layer may have waited for the receiver
  - the data is copied from send to receive buffer directly
  - lower overhead transfer
  - this is called a rendezvous protocol
  - good for large messages

Slide 7: MPI_Send on the IBM
- Uses the eager protocol for short messages
  - by default, short means up to 4096 bytes
  - the higher the task count, the lower the value
- Uses the rendezvous protocol for long messages
- Potential for send/send deadlocks
  - tasks block in mpi_send

  if (me.eq.0) then
     him=1
  else
     him=0
  endif
  call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
  call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)

Slide 8: Solutions to Send/Send deadlocks
- Pair up sends and receives
- Use MPI_SENDRECV
- Use a buffered send: MPI_BSEND
- Use asynchronous sends/receives: MPI_ISEND/MPI_IRECV

Slide 9: Paired Sends and Receives
- More complex code
- Requires close synchronisation

  if (me.eq.0) then
     him=1
     call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
     call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
  else
     him=0
     call mpi_recv(rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)
     call mpi_send(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
  endif

Slide 10: MPI_SENDRECV
- Easier to code
- Still implies close synchronisation

  call mpi_sendrecv(sbuff,n,MPI_REAL8,him,1, &
                    rbuff,n,MPI_REAL8,him,1, &
                    MPI_COMM_WORLD,stat,ierror)

Slide 11: MPI_BSEND
- This performs a send using an additional buffer
  - the buffer is allocated by the program via MPI_BUFFER_ATTACH
  - done once as part of the program initialisation
- Typically quick to implement
  - add the mpi_buffer_attach call (see the sketch below)
    - how big to make the buffer?
  - change MPI_SEND to MPI_BSEND everywhere
- But introduces an additional memory copy
  - extra overhead
  - not recommended for production codes
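A minimal, self-contained sketch of the buffered-send approach, assuming exactly two tasks; the program name, buffer sizing and the variable attach_buf are illustrative and not from the slides:

  program bsend_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000
    real*8  :: sbuff(n), rbuff(n)
    real*8, allocatable :: attach_buf(:)
    integer :: me, him, tag, ierror, stat(MPI_STATUS_SIZE)
    integer :: bufwords, detach_size

    call mpi_init(ierror)
    call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
    tag = 1
    him = 1 - me                        ! assumes exactly 2 tasks
    sbuff = dble(me)

    ! Attach a user buffer once at start-up: big enough for the largest
    ! outstanding message plus MPI_BSEND_OVERHEAD bytes.
    bufwords = n + MPI_BSEND_OVERHEAD/8 + 1
    allocate(attach_buf(bufwords))
    call mpi_buffer_attach(attach_buf, 8*bufwords, ierror)

    ! mpi_bsend copies the message into the attached buffer and returns,
    ! so the send/send ordering can no longer deadlock.
    call mpi_bsend(sbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,ierror)
    call mpi_recv (rbuff,n,MPI_REAL8,him,tag,MPI_COMM_WORLD,stat,ierror)

    call mpi_buffer_detach(attach_buf, detach_size, ierror)
    call mpi_finalize(ierror)
  end program bsend_sketch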

Slide 12: MPI_IRECV / MPI_ISEND
- Uses non-blocking communications
- Routines return without completing the operation
  - the operations run asynchronously
  - must NOT reuse the buffer until it is safe to do so
- Later, test that the operation completed
  - via an integer identification handle passed to MPI_WAIT
- I stands for immediate
  - the call returns immediately

  call mpi_irecv(rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
  call mpi_send (sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,ierror)
  call mpi_wait(request,stat,ierr)

- Alternatively could have used MPI_ISEND and MPI_RECV (sketched below)
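The alternative mentioned above, using the same variable names as the slide (a sketch rather than slide content): post the send immediately, do a blocking receive, then wait for the send to complete before reusing sbuff.

  call mpi_isend(sbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,request,ierror)
  call mpi_recv (rbuff,n,MPI_REAL8,him,1,MPI_COMM_WORLD,stat,ierror)
  call mpi_wait (request,stat,ierr)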

Slide 13: Non Blocking Communications
- Routines include
  - MPI_ISEND
  - MPI_IRECV
  - MPI_WAIT
  - MPI_WAITALL (see the sketch below)
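A minimal sketch of MPI_WAITALL in a ring-style neighbour exchange; the program and variable names (waitall_sketch, sleft, requests, ...) are illustrative, not from the slides:

  program waitall_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 100
    real*8  :: sleft(n), sright(n), rleft(n), rright(n)
    integer :: me, ncpus, left, right, ierror
    integer :: requests(4), statuses(MPI_STATUS_SIZE,4)

    call mpi_init(ierror)
    call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
    call mpi_comm_size(MPI_COMM_WORLD, ncpus, ierror)
    left  = mod(me - 1 + ncpus, ncpus)
    right = mod(me + 1, ncpus)
    sleft  = dble(me)
    sright = dble(me)

    ! Post the receives first, then the sends; none of these calls block.
    call mpi_irecv(rleft,  n, MPI_REAL8, left,  1, MPI_COMM_WORLD, requests(1), ierror)
    call mpi_irecv(rright, n, MPI_REAL8, right, 2, MPI_COMM_WORLD, requests(2), ierror)
    call mpi_isend(sright, n, MPI_REAL8, right, 1, MPI_COMM_WORLD, requests(3), ierror)
    call mpi_isend(sleft,  n, MPI_REAL8, left,  2, MPI_COMM_WORLD, requests(4), ierror)

    ! ... computation could be overlapped with the communication here ...

    ! Complete all four operations with a single call.
    call mpi_waitall(4, requests, statuses, ierror)

    call mpi_finalize(ierror)
  end program waitall_sketch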

Slide 14: Debugging MPI Programs
- The Universal Debug Tool and Totalview

Slide 15: The Universal Debug Tool
- The print/write statement
- Recommend the use of call flush(unit_number)
  - ensures output is not left in runtime buffers
- Recommend the use of separate output files, e.g.:

  unit_number=100+mytask
  write(unit_number,*)
  call flush(unit_number)

- Or set the environment variable MP_LABELIO=yes
- Do not output too much
- Use as few processors as possible
- Think carefully...
- Discuss the problem with a colleague
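A self-contained sketch of the per-task output pattern above; the unit numbering follows the slide, while the program name and message text are illustrative:

  program debug_sketch
    implicit none
    include 'mpif.h'
    integer :: mytask, unit_number, ierror

    call mpi_init(ierror)
    call mpi_comm_rank(MPI_COMM_WORLD, mytask, ierror)

    ! One unit, and hence one output file, per task keeps the debug
    ! writes from different tasks from interleaving.
    unit_number = 100 + mytask
    write(unit_number,*) 'task', mytask, 'reached checkpoint A'
    call flush(unit_number)    ! push the line out of the runtime buffers

    call mpi_finalize(ierror)
  end program debug_sketch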

Slide 16: Totalview
- Assumes you can launch X-Windows remotely
- Run totalview as part of a LoadLeveler job:

  export DISPLAY=
  poe totalview -a a.out

- But you have to wait for the job to run...
- Use only a few processors
  - minimises the queuing time
  - minimises the waste of resource while thinking...

Slide 17: MPI Trace Tools
- Identify message-passing hot spots
- Just link with /usr/local/lib/trace/libmpiprof.a
  - low-overhead timer for all MPI routine calls
- Produces output files named mpi_profile.N
  - where N is the task number
- Examples of the output follow

Slide 18: [example mpi_profile trace output]

Slide 19: [example mpi_profile trace output]

Slide 20: Profiling MPI programs
- The same as for serial codes
- Use the -pg flag at compile and/or link time
- Produces multiple gmon.out.N files
  - N is the task number
  - gprof a.out gmon.out.*
- The routine .kickpipes often appears high up the profile
  - an internal MPI library routine
  - where the MPI library spins waiting for something, e.g. for a message to be sent or in a barrier

Slide 21: Tasks Per Node (1 of 2)
- Try both 7 and 8 tasks per node for multi-node jobs
  - 7 tasks may run faster than 8
  - depends on the frequency of communications
- 7 tasks leaves a processor spare
  - used by the OS and background daemons such as for GPFS
  - MPI tasks run with minimal scheduling interference
- 8 tasks are subject to scheduling interference
  - by default MPI tasks cpu-spin in kickpipes
  - they may spin waiting for a task that has been scheduled out
  - the OS has to schedule cpu time for background processes
  - random interference across nodes is cumulative

Slide 22: Tasks Per Node (2 of 2)
- Also try 8 tasks per node and MP_WAIT_MODE=sleep
  - export MP_WAIT_MODE=sleep
  - tasks give up the cpu instead of spinning
  - increases latency but reduces interference
  - effect varies from application to application
- Mixed-mode MPI/OpenMP works well (see the sketch below)
  - master OpenMP thread does the message passing
  - while slave OpenMP threads go to sleep
  - cpu cycles are freed up for background processes
  - used by the IFS to good effect: 2 tasks, each of 4 threads, per node
  - suspect success depends on the parallel granularity
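A minimal sketch of the mixed-mode pattern, assuming exactly two MPI tasks and an MPI library that permits calls from the master OpenMP thread; program and variable names are illustrative, not from the slides:

  program mixed_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000
    real*8  :: sbuff(n), rbuff(n)
    integer :: me, him, i, ierror, stat(MPI_STATUS_SIZE)

    call mpi_init(ierror)
    call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
    him = 1 - me                       ! assumes exactly 2 tasks
    sbuff = dble(me)

  !$omp parallel private(i)
  !$omp do
    do i = 1, n                        ! threaded computation on local data
       sbuff(i) = sbuff(i) + 1.0d0
    end do
  !$omp end do

  !$omp master
    ! Only the master thread does the message passing; the other
    ! threads are idle and free cpu cycles for background processes.
    call mpi_sendrecv(sbuff, n, MPI_REAL8, him, 1, &
                      rbuff, n, MPI_REAL8, him, 1, &
                      MPI_COMM_WORLD, stat, ierror)
  !$omp end master
  !$omp barrier                        ! other threads wait for the exchange
  !$omp end parallel

    call mpi_finalize(ierror)
  end program mixed_sketch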

Slide 23: Communications Optimisation
- Communications costs often impact parallel speedup
- Concatenate messages (see the sketch below)
  - fewer, larger messages are better
  - reduces the effect of latency
- Increase MP_EAGER_LIMIT
  - export MP_EAGER_LIMIT=
  - maximum size for messages sent with the eager protocol
- Use collective routines
- Use ISEND/IRECV
- Remove barriers
- Experiment with tasks per node
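A minimal sketch of message concatenation, assuming two tasks: two small arrays are copied into one buffer and exchanged as a single message, paying the latency cost once instead of twice (names are illustrative, not from the slides):

  program concat_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 500
    real*8  :: a(n), b(n), packed(2*n), unpacked(2*n)
    integer :: me, him, ierror, stat(MPI_STATUS_SIZE)

    call mpi_init(ierror)
    call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
    him = 1 - me                       ! assumes exactly 2 tasks
    a = dble(me)
    b = dble(me) + 0.5d0

    ! One exchange of 2*n values rather than two exchanges of n values.
    packed(1:n)     = a
    packed(n+1:2*n) = b
    call mpi_sendrecv(packed,   2*n, MPI_REAL8, him, 1, &
                      unpacked, 2*n, MPI_REAL8, him, 1, &
                      MPI_COMM_WORLD, stat, ierror)

    call mpi_finalize(ierror)
  end program concat_sketch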

Slide 24: The new hardware switch
- Designed for the Cluster 1600
- Referred to as the Federation switch
- 2 switch adaptors per physical node
  - 2 links, each 2 GB/s, per adaptor
  - 32 processors share 4 links
- Adaptors/links are NOT multiplexed
- Minimum latency 10 microseconds
- Maximum bandwidth approx. 2000 MB/s
  - about 250 MB/s per task when all tasks go off node together
- Up to 5 times better performance
- 32-processor nodes
  - will affect how we schedule and run jobs

Slide 25: Third Practical
- Contained in the directory /home/ectrain/trx/mpi/exercise3 on hpca
- Parallelising the computation of PI (one possible approach is sketched below)
- See the README for details
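One common way to parallelise the PI computation is sketched below; this is an illustration of the usual midpoint-rule approach, not the actual exercise or its README:

  program pi_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: nsteps = 1000000
    integer :: me, ncpus, i, ierror
    real*8  :: h, x, partial, pi

    call mpi_init(ierror)
    call mpi_comm_rank(MPI_COMM_WORLD, me, ierror)
    call mpi_comm_size(MPI_COMM_WORLD, ncpus, ierror)

    ! Integrate 4/(1+x**2) over [0,1]; each task takes every ncpus-th interval.
    h = 1.0d0 / nsteps
    partial = 0.0d0
    do i = me + 1, nsteps, ncpus
       x = h * (dble(i) - 0.5d0)
       partial = partial + 4.0d0 / (1.0d0 + x*x)
    end do
    partial = partial * h

    ! Combine the partial sums across all tasks.
    call mpi_allreduce(partial, pi, 1, MPI_REAL8, MPI_SUM, MPI_COMM_WORLD, ierror)
    if (me .eq. 0) write(*,*) 'pi is approximately', pi

    call mpi_finalize(ierror)
  end program pi_sketch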