Performance Oriented MPI
Jeffrey M. Squyres and Andrew Lumsdaine
NERSC/LBNL and U. Notre Dame
Overview
- Overview and history of MPI
- Performance-oriented point to point
- Collectives, datatypes
- Diagnostics and tuning
- Rules of thumb and gotchas
Scope of This Talk
- Beginning to intermediate users
- General principles and rules of thumb
- When and where performance might be available
- Omits (advanced) low-level issues
Overview and History of MPI
- Library (not language) specification
- Goals:
  - Portability
  - Efficiency
  - Functionality (small and large)
- Safety (communicators)
- Conservative (current best practices)
Performance in MPI
- MPI includes many performance-oriented features
- These features are only potentially high-performance
- The standard seeks not to preclude performance, but it does not mandate it
- Progress might only be made during MPI function calls
(Potential) Performance Features
- Non-blocking operations
- Persistent operations
- Collective operations
- MPI datatypes
Basic Point to Point
- “Six function MPI” includes:
  - MPI_Send()
  - MPI_Recv()
- These are useful, but there is more
Basic Point to Point

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
    }
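For context, a minimal self-contained program using this pattern might look like the following sketch (run with at least two ranks); rank 0 sends one integer to rank 1, and the names value and TAG are illustrative rather than taken from the slides.

    #include <mpi.h>
    #include <stdio.h>

    #define TAG 0   /* illustrative tag */

    int main(int argc, char **argv)
    {
        int rank, value = 42;   /* illustrative payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 plays the sender (dest = 1) from the slide */
            MPI_Send(&value, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 receives from src = 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }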
Non-Blocking Operations
- MPI_Isend()
- MPI_Irecv()
- “I” is for immediate
- Paired with MPI_Test() / MPI_Wait()
Non-Blocking Operations

    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
        /* Do some computation */
        MPI_Wait(&request, &status);
    } else {
        MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
        /* Do some computation */
        MPI_Wait(&request, &status);
    }
Persistent Operations
- MPI_Send_init()
- MPI_Recv_init()
- Creates a request but does not start it
- MPI_Start() begins the communication
- A single request can be re-used with multiple calls to MPI_Start()
Persistent Operations

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
    else
        MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);

    /* … */

    for (i = 0; i < n; i++) {
        MPI_Start(&request);
        /* Do some work */
        MPI_Wait(&request, &status);
    }
Collective Operations
- May be layered on point to point
- May use tree communication patterns for efficiency
- Synchronization! (No non-blocking collectives)
Collective Operations

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

(Diagram: a linear, point-to-point reduction takes O(P) steps; a tree-based reduction takes O(log P) steps.)
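As a concrete illustration (a sketch, not code from the slides), the classic pi-integration kernel uses exactly this reduction: every rank computes a partial sum and a single MPI_Reduce combines them at rank 0, typically in O(log P) communication steps. The interval count n and the variable names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, n = 100000;      /* illustrative interval count */
        double h, x, mypi = 0.0, pi = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank sums a strided subset of the midpoint-rule terms */
        h = 1.0 / n;
        for (i = rank; i < n; i += size) {
            x = h * (i + 0.5);
            mypi += 4.0 / (1.0 + x * x);
        }
        mypi *= h;

        /* One collective call replaces P-1 explicit point-to-point messages */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi is approximately %.12f\n", pi);

        MPI_Finalize();
        return 0;
    }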
MPI Datatypes
- May allow MPI to send a message directly from memory
- May avoid copying/packing
- (General) high-performance implementations not widely available

(Diagram: data sent directly from user memory to the network versus copied/packed into an intermediate buffer first.)
Quiz: MPI_Send()
- After I call MPI_Send():
  - The recipient has received the message?
  - I have sent the message?
  - I can write to the message buffer without corrupting the message?
- Answer: I can write to the message buffer
Sidenote: MPI_Ssend()
- MPI_Ssend() has the (perhaps) expected semantics
- When MPI_Ssend() returns, the recipient has received the message
- Useful for debugging (replace MPI_Send() with MPI_Ssend(); see the sketch below)
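One hedged sketch of applying this debugging trick project-wide without editing every call site is a compile-time substitution. Note that redefining an MPI_-prefixed name is not sanctioned by the standard, so treat this strictly as a temporary diagnostic.

    /* Temporary diagnostic: compile with -DDEBUG_SYNCHRONOUS_SENDS to make
     * every standard-mode send synchronous.  If the program then deadlocks,
     * it was silently relying on the implementation buffering MPI_Send. */
    #ifdef DEBUG_SYNCHRONOUS_SENDS
    #define MPI_Send MPI_Ssend
    #endif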
Quiz: MPI_Isend()
- After I call MPI_Isend():
  - The recipient has started to receive the message?
  - I have started to send the message?
  - I can write to the message buffer without corrupting the message?
- Answer: none of the above (I must call MPI_Test() or MPI_Wait() first)
Quiz: MPI_Isend()
- True or false: I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait()
- Answer: false (in many/most cases)
Communication is Still Computation
- A CPU, usually the main one, must do the communication work:
  - Part of your process (inside MPI calls)
  - Another process on the main CPU
  - Another thread on the main CPU
  - Another processor
No Free Lunch
- Part of your process (most common)
  - Fast, but no overlap
- Another process (daemons)
  - Overlap, but slow (extra copies)
- Another thread (rare)
  - Overlap and fast, but difficult
- Another processor (emerging)
  - Overlap and fast, but more hardware
  - E.g., Myrinet/GM, VIA
How Do I Get Performance?
- Minimize time spent communicating
  - Minimize data copies
- Minimize synchronization
  - I.e., time spent waiting for communication
Minimizing Communication Time
- Bandwidth
- Latency
Minimizing Latency
- Collect small messages together (if you can; see the sketch below)
  - One 1024-byte message instead of 1024 one-byte messages
- Minimize other overhead (e.g., copying)
- Overlap with computation (if you can)
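A hedged sketch of the aggregation idea (NVALS, TAG, dest, and the function names are illustrative): send the values as one message rather than as many tiny messages.

    #include <mpi.h>

    #define NVALS 1024   /* illustrative number of values */
    #define TAG   99     /* illustrative tag */

    /* Slow: the per-message latency is paid NVALS times */
    void send_values_one_at_a_time(int *values, int dest)
    {
        for (int i = 0; i < NVALS; i++)
            MPI_Send(&values[i], 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    }

    /* Faster: the latency is paid once and bandwidth does the rest */
    void send_values_aggregated(int *values, int dest)
    {
        MPI_Send(values, NVALS, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    }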
Example: Domain Decomposition
Naïve Approach

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++)
            MPI_Send(…);
        for (i = 0; i < 4; i++)
            MPI_Recv(…);
    }
Naïve Approach
- Deadlock! (Maybe)
- Can fix with careful coordination of receiving versus sending on alternate processes (sketched below)
- But this can still serialize
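One hedged sketch of that coordination, for a simple pairwise exchange: even ranks send first and then receive, odd ranks do the opposite, so a blocking send never waits on a partner that is itself blocked in a send. The names exchange_with_partner, partner, and TAG are illustrative.

    #include <mpi.h>

    #define TAG 17   /* illustrative tag */

    /* Even ranks send first and then receive; odd ranks do the reverse,
     * so a blocking MPI_Send never waits on a partner that is also
     * blocked in MPI_Send. */
    void exchange_with_partner(double *sendbuf, double *recvbuf,
                               int count, int partner, MPI_Comm comm)
    {
        int rank;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        if (rank % 2 == 0) {
            MPI_Send(sendbuf, count, MPI_DOUBLE, partner, TAG, comm);
            MPI_Recv(recvbuf, count, MPI_DOUBLE, partner, TAG, comm, &status);
        } else {
            MPI_Recv(recvbuf, count, MPI_DOUBLE, partner, TAG, comm, &status);
            MPI_Send(sendbuf, count, MPI_DOUBLE, partner, TAG, comm);
        }
    }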
MPI_Sendrecv()

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++) {
            MPI_Sendrecv(…);
        }
    }
Immediate Operations

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++) {
            MPI_Isend(…);
            MPI_Irecv(…);
        }
        MPI_Waitall(…);
    }
Receive Before Sending

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++)
            MPI_Irecv(…);
        for (i = 0; i < 4; i++)
            MPI_Isend(…);
        MPI_Waitall(…);
    }
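A hedged, slightly more concrete version of the exchange above (buffer and parameter names are illustrative): all four receives are posted before any send, and a single MPI_Waitall completes the eight requests.

    #include <mpi.h>

    #define TAG 42   /* illustrative tag */

    /* Post all four receives, then all four sends, then wait once.
     * sendbuf[i]/recvbuf[i] are the halo buffers exchanged with
     * neighbors[i]; count is the halo size in doubles. */
    void exchange(double *sendbuf[4], double *recvbuf[4], int count,
                  const int neighbors[4], MPI_Comm comm)
    {
        MPI_Request requests[8];
        MPI_Status  statuses[8];
        int i;

        for (i = 0; i < 4; i++)
            MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, neighbors[i],
                      TAG, comm, &requests[i]);
        for (i = 0; i < 4; i++)
            MPI_Isend(sendbuf[i], count, MPI_DOUBLE, neighbors[i],
                      TAG, comm, &requests[4 + i]);

        MPI_Waitall(8, requests, statuses);
    }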
Persistent Operations

    for (i = 0; i < 4; i++) {
        MPI_Recv_init(…);
        MPI_Send_init(…);
    }

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        MPI_Startall(…);
        MPI_Waitall(…);
    }
Overlapping

    while (!done) {
        MPI_Startall(…);            /* Start exchanges */
        do_inner_red(D);            /* Internal computation */
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);         /* As information arrives */
            do_received_red(D);     /* Process */
        }
        MPI_Startall(…);
        do_inner_black(D);
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);
            do_received_black(D);
        }
    }
Advanced Overlap

    MPI_Startall(…);                /* Start all receives */
    /* … */
    while (!done) {
        MPI_Startall(…);            /* Start sends */
        do_inner_red(D);            /* Internal computation */
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);         /* Wait on receives */
            if (received) {
                do_received_red(D); /* Process */
                MPI_Start(…);       /* Restart receive */
            }
        }
        /* Repeat for black */
    }
MPI Data Types
- MPI_Type_vector (see the sketch below)
- MPI_Type_struct
- Etc.
- MPI_Pack might be better

(Diagram: derived-datatype data sent directly from memory versus copied/packed into a contiguous buffer first.)
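A hedged sketch of a derived datatype in action: a column of a row-major N x N matrix is N doubles separated by a stride of N elements, which MPI_Type_vector can describe so the column is sent without an explicit packing loop. N, TAG, and send_column are illustrative names.

    #include <mpi.h>

    #define N   512   /* illustrative matrix dimension */
    #define TAG 7     /* illustrative tag */

    /* Send column `col` of a row-major N x N matrix without first
     * copying it into a contiguous scratch buffer. */
    void send_column(double matrix[N][N], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column_t;

        /* N blocks of 1 double, with a stride of N doubles between blocks */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        MPI_Send(&matrix[0][col], 1, column_t, dest, TAG, comm);

        MPI_Type_free(&column_t);
    }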
Minimizing Synchronization
- At a synchronization point (e.g., a collective communication), all processes must arrive at the collective call
- Processes can spend lots of time waiting
- This is often an algorithmic issue
  - E.g., check for convergence every 5 iterations instead of every iteration (see the sketch below)
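A hedged sketch of the every-k-iterations idea (CHECK_INTERVAL, TOLERANCE, and the local_sweep/local_residual helpers are hypothetical): the synchronizing MPI_Allreduce runs only every CHECK_INTERVAL sweeps instead of on every sweep.

    #include <mpi.h>

    #define CHECK_INTERVAL 5       /* check convergence every 5 sweeps */
    #define TOLERANCE      1e-6    /* illustrative convergence threshold */

    void   local_sweep(void);      /* hypothetical: one local iteration */
    double local_residual(void);   /* hypothetical: this rank's residual */

    void iterate(MPI_Comm comm)
    {
        int iter = 0, done = 0;

        while (!done) {
            local_sweep();
            iter++;

            /* The synchronizing all-reduce happens only every
             * CHECK_INTERVAL iterations instead of on every sweep */
            if (iter % CHECK_INTERVAL == 0) {
                double local = local_residual(), global;
                MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
                done = (global < TOLERANCE);
            }
        }
    }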
Gotchas
- MPI_Probe
  - Guarantees an extra memory copy
- MPI_ANY_SOURCE
  - Can cause additional (internal) looping
- MPI_Alltoall
  - All pairs must communicate
  - Synchronization (avoid in general)
Diagnostic Tools
- TotalView
- Prism
- Upshot
- XMPI
Summary
- Receive before sending
- Collect small messages together
- Overlap (if possible)
- Use immediate operations
- Use persistent operations
- Use diagnostic tools