Performance Oriented MPI
Jeffrey M. Squyres and Andrew Lumsdaine
NERSC/LBNL and U. Notre Dame
Overview
- Overview and history of MPI
- Performance-oriented point to point
- Collectives, datatypes
- Diagnostics and tuning
- Rules of thumb and gotchas
Scope of This Talk
- Beginning to intermediate users
- General principles and rules of thumb
- When and where performance might be available
- Omits (advanced) low-level issues
Overview and History of MPI
- Library (not language) specification
- Goals:
  - Portability
  - Efficiency
  - Functionality (small and large)
- Safety (communicators)
- Conservative (current best practices)
Performance in MPI
- MPI includes many performance-oriented features
- These features are only potentially high-performance
- The standard seeks not to preclude performance, but it does not mandate it
- Progress might only be made during MPI function calls
(Potential) Performance Features
- Non-blocking operations
- Persistent operations
- Collective operations
- MPI datatypes
Basic Point to Point
- “Six function MPI” includes:
  - MPI_Send()
  - MPI_Recv()
- These are useful, but there is more
Basic Point to Point

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Send(&work, 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    } else {
        MPI_Recv(&result, 1, MPI_INT, src, TAG, MPI_COMM_WORLD, &status);
    }
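For context, a minimal self-contained program using this pattern might look like the following sketch (run with at least two ranks); rank 0 sends one integer to rank 1, and the names value and TAG are illustrative rather than taken from the slides.

    #include <mpi.h>
    #include <stdio.h>

    #define TAG 0   /* illustrative tag */

    int main(int argc, char **argv)
    {
        int rank, value = 42;   /* illustrative payload */
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* rank 0 plays the sender (dest = 1) from the slide */
            MPI_Send(&value, 1, MPI_INT, 1, TAG, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* rank 1 receives from src = 0 */
            MPI_Recv(&value, 1, MPI_INT, 0, TAG, MPI_COMM_WORLD, &status);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }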
Non-Blocking Operations
- MPI_Isend()
- MPI_Irecv()
- “I” is for immediate
- Paired with MPI_Test() / MPI_Wait()
Non-Blocking Operations

    MPI_Comm_rank(comm, &rank);
    if (rank == 0) {
        MPI_Isend(sendbuf, count, MPI_REAL, 1, tag, comm, &request);
        /* Do some computation */
        MPI_Wait(&request, &status);
    } else {
        MPI_Irecv(recvbuf, count, MPI_REAL, 0, tag, comm, &request);
        /* Do some computation */
        MPI_Wait(&request, &status);
    }
Persistent Operations
- MPI_Send_init()
- MPI_Recv_init()
- Creates a request but does not start it
- MPI_Start() begins the communication
- A single request can be re-used with multiple calls to MPI_Start()
Persistent Operations

    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        MPI_Send_init(sndbuf, count, MPI_REAL, 1, tag, comm, &request);
    else
        MPI_Recv_init(rcvbuf, count, MPI_REAL, 0, tag, comm, &request);

    /* … */

    for (i = 0; i < n; i++) {
        MPI_Start(&request);
        /* Do some work */
        MPI_Wait(&request, &status);
    }
Collective Operations
- May be layered on point to point
- May use tree communication patterns for efficiency
- Synchronization! (No non-blocking collectives)
Collective Operations

    MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, comm);

(Diagram: a linear, point-to-point reduction takes O(P) steps; a tree-based reduction takes O(log P) steps.)
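As a concrete illustration (a sketch, not code from the slides), the classic pi-integration kernel uses exactly this reduction: every rank computes a partial sum and a single MPI_Reduce combines them at rank 0, typically in O(log P) communication steps. The interval count n and the variable names are illustrative.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size, i, n = 100000;      /* illustrative interval count */
        double h, x, mypi = 0.0, pi = 0.0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Each rank sums a strided subset of the midpoint-rule terms */
        h = 1.0 / n;
        for (i = rank; i < n; i += size) {
            x = h * (i + 0.5);
            mypi += 4.0 / (1.0 + x * x);
        }
        mypi *= h;

        /* One collective call replaces P-1 explicit point-to-point messages */
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("pi is approximately %.12f\n", pi);

        MPI_Finalize();
        return 0;
    }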
MPI Datatypes
- May allow MPI to send a message directly from memory
- May avoid copying/packing
- (General) high-performance implementations not widely available

(Diagram: data sent directly from user memory to the network versus copied/packed into an intermediate buffer first.)
Quiz: MPI_Send()
- After I call MPI_Send():
  - The recipient has received the message?
  - I have sent the message?
  - I can write to the message buffer without corrupting the message?
- Answer: I can write to the message buffer
Sidenote: MPI_Ssend()
- MPI_Ssend() has the (perhaps) expected semantics
- When MPI_Ssend() returns, the recipient has received the message
- Useful for debugging (replace MPI_Send() with MPI_Ssend(); see the sketch below)
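One hedged sketch of applying this debugging trick project-wide without editing every call site is a compile-time substitution. Note that redefining an MPI_-prefixed name is not sanctioned by the standard, so treat this strictly as a temporary diagnostic.

    /* Temporary diagnostic: compile with -DDEBUG_SYNCHRONOUS_SENDS to make
     * every standard-mode send synchronous.  If the program then deadlocks,
     * it was silently relying on the implementation buffering MPI_Send. */
    #ifdef DEBUG_SYNCHRONOUS_SENDS
    #define MPI_Send MPI_Ssend
    #endif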
Quiz: MPI_Isend()
- After I call MPI_Isend():
  - The recipient has started to receive the message?
  - I have started to send the message?
  - I can write to the message buffer without corrupting the message?
- Answer: none of the above (I must call MPI_Test() or MPI_Wait() first)
Quiz: MPI_Isend()
- True or false: I can overlap communication and computation by putting some computation between MPI_Isend() and MPI_Test()/MPI_Wait()
- Answer: false (in many/most cases)
Communication is Still Computation
- A CPU, usually the main one, must do the communication work:
  - Part of your process (inside MPI calls)
  - Another process on the main CPU
  - Another thread on the main CPU
  - Another processor
No Free Lunch
- Part of your process (most common)
  - Fast, but no overlap
- Another process (daemons)
  - Overlap, but slow (extra copies)
- Another thread (rare)
  - Overlap and fast, but difficult
- Another processor (emerging)
  - Overlap and fast, but more hardware
  - E.g., Myrinet/GM, VIA
How Do I Get Performance?
- Minimize time spent communicating
  - Minimize data copies
- Minimize synchronization
  - I.e., time spent waiting for communication
Minimizing Communication Time
- Bandwidth
- Latency
Minimizing Latency
- Collect small messages together (if you can; see the sketch below)
  - One 1024-byte message instead of 1024 one-byte messages
- Minimize other overhead (e.g., copying)
- Overlap with computation (if you can)
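A hedged sketch of the aggregation idea (NVALS, TAG, dest, and the function names are illustrative): send the values as one message rather than as many tiny messages.

    #include <mpi.h>

    #define NVALS 1024   /* illustrative number of values */
    #define TAG   99     /* illustrative tag */

    /* Slow: the per-message latency is paid NVALS times */
    void send_values_one_at_a_time(int *values, int dest)
    {
        for (int i = 0; i < NVALS; i++)
            MPI_Send(&values[i], 1, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    }

    /* Faster: the latency is paid once and bandwidth does the rest */
    void send_values_aggregated(int *values, int dest)
    {
        MPI_Send(values, NVALS, MPI_INT, dest, TAG, MPI_COMM_WORLD);
    }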
Example: Domain Decomposition
Naïve Approach

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++)
            MPI_Send(…);
        for (i = 0; i < 4; i++)
            MPI_Recv(…);
    }
Naïve Approach
- Deadlock! (Maybe)
- Can fix with careful coordination of receiving versus sending on alternate processes (sketched below)
- But this can still serialize
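One hedged sketch of that coordination, for a simple pairwise exchange: even ranks send first and then receive, odd ranks do the opposite, so a blocking send never waits on a partner that is itself blocked in a send. The names exchange_with_partner, partner, and TAG are illustrative.

    #include <mpi.h>

    #define TAG 17   /* illustrative tag */

    /* Even ranks send first and then receive; odd ranks do the reverse,
     * so a blocking MPI_Send never waits on a partner that is also
     * blocked in MPI_Send. */
    void exchange_with_partner(double *sendbuf, double *recvbuf,
                               int count, int partner, MPI_Comm comm)
    {
        int rank;
        MPI_Status status;

        MPI_Comm_rank(comm, &rank);
        if (rank % 2 == 0) {
            MPI_Send(sendbuf, count, MPI_DOUBLE, partner, TAG, comm);
            MPI_Recv(recvbuf, count, MPI_DOUBLE, partner, TAG, comm, &status);
        } else {
            MPI_Recv(recvbuf, count, MPI_DOUBLE, partner, TAG, comm, &status);
            MPI_Send(sendbuf, count, MPI_DOUBLE, partner, TAG, comm);
        }
    }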
MPI_Sendrecv()

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++) {
            MPI_Sendrecv(…);
        }
    }
Immediate Operations

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++) {
            MPI_Isend(…);
            MPI_Irecv(…);
        }
        MPI_Waitall(…);
    }
Receive Before Sending

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        for (i = 0; i < 4; i++)
            MPI_Irecv(…);
        for (i = 0; i < 4; i++)
            MPI_Isend(…);
        MPI_Waitall(…);
    }
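A hedged, slightly more concrete version of the exchange above (buffer and parameter names are illustrative): all four receives are posted before any send, and a single MPI_Waitall completes the eight requests.

    #include <mpi.h>

    #define TAG 42   /* illustrative tag */

    /* Post all four receives, then all four sends, then wait once.
     * sendbuf[i]/recvbuf[i] are the halo buffers exchanged with
     * neighbors[i]; count is the halo size in doubles. */
    void exchange(double *sendbuf[4], double *recvbuf[4], int count,
                  const int neighbors[4], MPI_Comm comm)
    {
        MPI_Request requests[8];
        MPI_Status  statuses[8];
        int i;

        for (i = 0; i < 4; i++)
            MPI_Irecv(recvbuf[i], count, MPI_DOUBLE, neighbors[i],
                      TAG, comm, &requests[i]);
        for (i = 0; i < 4; i++)
            MPI_Isend(sendbuf[i], count, MPI_DOUBLE, neighbors[i],
                      TAG, comm, &requests[4 + i]);

        MPI_Waitall(8, requests, statuses);
    }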
Persistent Operations

    for (i = 0; i < 4; i++) {
        MPI_Recv_init(…);
        MPI_Send_init(…);
    }

    while (!done) {
        exchange(D, neighbors, myrank);
        dored(D);
        exchange(D, neighbors, myrank);
        doblack(D);
    }

    void exchange(Array D, int *neighbors, int myrank)
    {
        MPI_Startall(…);
        MPI_Waitall(…);
    }
Overlapping

    while (!done) {
        MPI_Startall(…);            /* Start exchanges */
        do_inner_red(D);            /* Internal computation */
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);         /* As information arrives */
            do_received_red(D);     /* Process */
        }
        MPI_Startall(…);
        do_inner_black(D);
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);
            do_received_black(D);
        }
    }
Advanced Overlap

    MPI_Startall(…);                /* Start all receives */
    /* … */
    while (!done) {
        MPI_Startall(…);            /* Start sends */
        do_inner_red(D);            /* Internal computation */
        for (i = 0; i < 4; i++) {
            MPI_Waitany(…);         /* Wait on receives */
            if (received) {
                do_received_red(D); /* Process */
                MPI_Start(…);       /* Restart receive */
            }
        }
        /* Repeat for black */
    }
MPI Data Types
- MPI_Type_vector (see the sketch below)
- MPI_Type_struct
- Etc.
- MPI_Pack might be better

(Diagram: derived-datatype data sent directly from memory versus copied/packed into a contiguous buffer first.)
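A hedged sketch of a derived datatype in action: a column of a row-major N x N matrix is N doubles separated by a stride of N elements, which MPI_Type_vector can describe so the column is sent without an explicit packing loop. N, TAG, and send_column are illustrative names.

    #include <mpi.h>

    #define N   512   /* illustrative matrix dimension */
    #define TAG 7     /* illustrative tag */

    /* Send column `col` of a row-major N x N matrix without first
     * copying it into a contiguous scratch buffer. */
    void send_column(double matrix[N][N], int col, int dest, MPI_Comm comm)
    {
        MPI_Datatype column_t;

        /* N blocks of 1 double, with a stride of N doubles between blocks */
        MPI_Type_vector(N, 1, N, MPI_DOUBLE, &column_t);
        MPI_Type_commit(&column_t);

        MPI_Send(&matrix[0][col], 1, column_t, dest, TAG, comm);

        MPI_Type_free(&column_t);
    }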
Minimizing Synchronization
- At a synchronization point (e.g., a collective communication), all processes must arrive at the collective call
- Processes can spend lots of time waiting
- This is often an algorithmic issue
  - E.g., check for convergence every 5 iterations instead of every iteration (see the sketch below)
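A hedged sketch of the every-k-iterations idea (CHECK_INTERVAL, TOLERANCE, and the local_sweep/local_residual helpers are hypothetical): the synchronizing MPI_Allreduce runs only every CHECK_INTERVAL sweeps instead of on every sweep.

    #include <mpi.h>

    #define CHECK_INTERVAL 5       /* check convergence every 5 sweeps */
    #define TOLERANCE      1e-6    /* illustrative convergence threshold */

    void   local_sweep(void);      /* hypothetical: one local iteration */
    double local_residual(void);   /* hypothetical: this rank's residual */

    void iterate(MPI_Comm comm)
    {
        int iter = 0, done = 0;

        while (!done) {
            local_sweep();
            iter++;

            /* The synchronizing all-reduce happens only every
             * CHECK_INTERVAL iterations instead of on every sweep */
            if (iter % CHECK_INTERVAL == 0) {
                double local = local_residual(), global;
                MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_MAX, comm);
                done = (global < TOLERANCE);
            }
        }
    }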
Gotchas
- MPI_Probe
  - Guarantees an extra memory copy
- MPI_ANY_SOURCE
  - Can cause additional (internal) looping
- MPI_Alltoall
  - All pairs must communicate
  - Synchronization (avoid in general)
Diagnostic Tools
- TotalView
- Prism
- Upshot
- XMPI
Summary
- Receive before sending
- Collect small messages together
- Overlap (if possible)
- Use immediate operations
- Use persistent operations
- Use diagnostic tools