CS 484
Message Passing
- Based on a multi-processor model
  - A set of independent processors
  - Connected via some communication network
- All communication between processes is done via messages sent from one to the other
MPI
- Message Passing Interface
- A computation is made of:
  - One or more processes
  - That communicate by calling library routines
- MIMD programming model
  - SPMD is the most common style
MPI
- Processes use point-to-point communication operations
- Collective communication operations are also available
- Communication can be modularized by the use of communicators
  - MPI_COMM_WORLD is the base communicator
  - Communicators are used to identify subsets of processes
MPI
- Complex, but most problems can be solved using the 6 basic functions:
  - MPI_Init
  - MPI_Finalize
  - MPI_Comm_size
  - MPI_Comm_rank
  - MPI_Send
  - MPI_Recv
MPI Basics
- Almost all calls require a communicator handle as an argument
  - MPI_COMM_WORLD
- MPI_Init and MPI_Finalize
  - Don't require a communicator handle
  - Used to begin and end an MPI program
  - MUST be called to begin and end
MPI Basics
- MPI_Comm_size
  - Determines the number of processes in the communicator group
- MPI_Comm_rank
  - Determines the integer identifier assigned to the current process
  - Zero-based
MPI Basics

    #include <stdio.h>
    #include "mpi.h"

    int main(int argc, char *argv[])
    {
        int iproc, nproc;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &nproc);  /* number of processes */
        MPI_Comm_rank(MPI_COMM_WORLD, &iproc);  /* my zero-based rank */
        printf("I am processor %d of %d\n", iproc, nproc);
        MPI_Finalize();
        return 0;
    }
MPI Communication
- MPI_Send
  - Sends an array of a given type
  - Requires a destination node, size, and type
- MPI_Recv
  - Receives an array of a given type
  - Same requirements as MPI_Send
  - Extra parameter: an MPI_Status variable
MPI Basics
- Made for both FORTRAN and C
- Standards for C
  - MPI_ prefix on all calls
  - First letter of the function name is capitalized
  - Returns MPI_SUCCESS or an error code
  - MPI_Status structure
  - MPI data types for each C type
  - OUT parameters are passed using the & operator
Using MPI
- Based on rsh or ssh
  - Requires a .rhosts file or ssh key setup
    - hostname login
- Path to compiler (CS open labs)
  - MPI_HOME /users/faculty/snell/mpich
  - MPI_CC MPI_HOME/bin/mpicc
- Marylou5
  - Use mpicc
- mpicc hello.c -o hello
Using MPI
- Write program
- Compile using mpicc
- Write process file (linux cluster)
  - host nprocs full_path_to_prog
  - 0 for nprocs on first line, 1 for all others
- Run program (linux cluster)
  - prog -p4pg process_file args
  - mpirun -np #procs -machinefile machines prog
- Run program (scheduled on marylou5 using pbs)
  - mpirun -np #procs -machinefile $PBS_NODEFILE prog
  - mpiexec prog
    #include "mpi.h"
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define MAXSIZE 1000

    int main(int argc, char *argv[])
    {
        int myid, numprocs;
        int data[MAXSIZE], i, x, low, high, myresult = 0, result;
        char fn[255];
        FILE *fp;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        if (myid == 0) {
            /* Open input file and initialize data */
            strcpy(fn, getenv("HOME"));
            strcat(fn, "/MPI/rand_data.txt");
            if ((fp = fopen(fn, "r")) == NULL) {
                printf("Can't open the input file: %s\n\n", fn);
                /* Abort every rank; exit(1) alone would leave the others waiting in MPI_Bcast */
                MPI_Abort(MPI_COMM_WORLD, 1);
            }
            for (i = 0; i < MAXSIZE; i++)
                fscanf(fp, "%d", &data[i]);
            fclose(fp);
        }

        /* Broadcast data */
        MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

        /* Add my portion of data (assumes numprocs divides MAXSIZE evenly) */
        x = MAXSIZE / numprocs;
        low = myid * x;
        high = low + x;
        for (i = low; i < high; i++)
            myresult += data[i];
        printf("I got %d from %d\n", myresult, myid);

        /* Compute global sum */
        MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("The sum is %d.\n", result);

        MPI_Finalize();
        return 0;
    }
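The summation program splits the array into equal chunks of `n / nprocs` elements, which silently skips the last `n % nprocs` elements when the division is not exact. One common way around this, sketched below, is to give the first few ranks one extra element each. The helper name `block_range` is hypothetical and not part of MPI or of the original program.

```c
/* Hypothetical helper (not an MPI call): compute the half-open index range
 * [*low, *high) owned by `rank` when `n` items are divided as evenly as
 * possible among `nprocs` ranks. The first (n % nprocs) ranks each take one
 * extra item, so no element is dropped when nprocs does not divide n. */
static void block_range(int n, int nprocs, int rank, int *low, int *high)
{
    int base = n / nprocs;   /* minimum number of items per rank */
    int extra = n % nprocs;  /* leftover items to hand out one by one */

    *low = rank * base + (rank < extra ? rank : extra);
    *high = *low + base + (rank < extra ? 1 : 0);
}
```

In the program above, each process could call `block_range(MAXSIZE, numprocs, myid, &low, &high)` in place of the three lines that compute `x`, `low`, and `high`.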
MPI
- Message passing programs are non-deterministic because of concurrency
  - Consider 2 processes sending messages to a third
- MPI only guarantees that 2 messages sent from a single process to another will arrive in order
- It is the programmer's responsibility to ensure computation determinism
MPI & Determinism
- MPI
  - A process may specify the source of the message
  - A process may specify the type (tag) of the message
- Non-determinism
  - MPI_ANY_SOURCE or MPI_ANY_TAG
Example

    for (n = 0; n < nproc/2; n++) {
        MPI_Send(buff, BSIZE, MPI_FLOAT, rnbor, 1, MPI_COMM_WORLD);
        MPI_Recv(buff, BSIZE, MPI_FLOAT, MPI_ANY_SOURCE, 1, MPI_COMM_WORLD, &status);
        /* Process the data */
    }
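The loop above receives from MPI_ANY_SOURCE, so the program does not say which sender each message should come from. The sketch below assumes the processes form a ring and that `rnbor` is the right-hand neighbor (the slide does not define it). Under that assumption every message arrives from the left-hand neighbor, and passing that rank as the source argument of MPI_Recv pins down the sender. The helper names are hypothetical.

```c
/* Hypothetical helpers for a ring of nproc processes (ranks 0..nproc-1). */

/* Rank this process sends to: the next rank, wrapping around at the end. */
static int right_neighbor(int rank, int nproc)
{
    return (rank + 1) % nproc;
}

/* Rank this process receives from: the previous rank, wrapping at 0.
 * Adding nproc before taking the remainder keeps the result non-negative. */
static int left_neighbor(int rank, int nproc)
{
    return (rank - 1 + nproc) % nproc;
}
```

With these, the receive would name `left_neighbor(rank, nproc)` instead of MPI_ANY_SOURCE.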
Global Operations
- Coordinated communication involving multiple processes
- Can be implemented by the programmer using sends and receives
- For convenience, MPI provides a suite of collective communication functions
- All participating processes must call the same function
Collective Communication
- Barrier
  - Synchronize all processes
- Broadcast
- Gather
  - Gather data from all processes to one process
- Scatter
- Reduction
  - Global sums, products, etc.
Collective Communication
- Distribute problem size
- Distribute input data
- Exchange boundary values
- Find max error
- Collect results
MPI_Reduce

    MPI_Reduce(inbuf, outbuf, count, type, op, root, comm)
MPI_Allreduce

    MPI_Allreduce(inbuf, outbuf, count, type, op, comm)
MPI Collective Routines
- Several routines:
  - MPI_ALLGATHER, MPI_ALLGATHERV, MPI_ALLREDUCE, MPI_ALLTOALL, MPI_ALLTOALLV
  - MPI_BCAST, MPI_GATHER, MPI_GATHERV
  - MPI_REDUCE, MPI_REDUCE_SCATTER, MPI_SCAN
  - MPI_SCATTER, MPI_SCATTERV
- "All" versions deliver results to all participating processes
- "V" versions allow the chunks to have different sizes
- MPI_ALLREDUCE, MPI_REDUCE, MPI_REDUCE_SCATTER, and MPI_SCAN take both built-in and user-defined combination functions
Built-In Collective Computation Operations
Example: PI in C

    #include "mpi.h"
    #include <stdio.h>
    #include <math.h>

    int main(int argc, char *argv[])
    {
        int done = 0, n, myid, numprocs, i, rc;
        double PI25DT = 3.141592653589793238462643;  /* pi to 25 digits, for the error report */
        double mypi, pi, h, sum, x, a;

        MPI_Init(&argc, &argv);
        MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
        MPI_Comm_rank(MPI_COMM_WORLD, &myid);

        while (!done) {
            if (myid == 0) {
                printf("Enter the number of intervals: (0 quits) ");
                scanf("%d", &n);
            }
            /* Every process needs n, so the root broadcasts it */
            MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
            if (n == 0) break;

            /* Midpoint rule: each process takes every numprocs-th interval */
            h = 1.0 / (double) n;
            sum = 0.0;
            for (i = myid + 1; i <= n; i += numprocs) {
                x = h * ((double)i - 0.5);
                sum += 4.0 / (1.0 + x*x);
            }
            mypi = h * sum;

            /* Add the partial results together on process 0 */
            MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
            if (myid == 0)
                printf("pi is approximately %.16f, Error is %.16f\n",
                       pi, fabs(pi - PI25DT));
        }
        MPI_Finalize();
        return 0;
    }
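To see why the cyclic loop `for (i = myid + 1; i <= n; i += numprocs)` covers every interval exactly once, it can help to run the same arithmetic serially. The sketch below is a plain C model with no MPI calls: `partial_pi` repeats the per-process loop from the example, and `simulate_reduce_sum` (a hypothetical name) stands in for MPI_Reduce with MPI_SUM by adding the partial results of every rank.

```c
/* Partial midpoint-rule sum computed by one rank: it handles intervals
 * myid+1, myid+1+numprocs, myid+1+2*numprocs, ... up to n. */
static double partial_pi(int n, int myid, int numprocs)
{
    double h = 1.0 / (double)n;  /* width of one interval */
    double sum = 0.0;

    for (int i = myid + 1; i <= n; i += numprocs) {
        double x = h * ((double)i - 0.5);  /* midpoint of interval i */
        sum += 4.0 / (1.0 + x * x);
    }
    return h * sum;
}

/* Serial stand-in for MPI_Reduce(..., MPI_SUM, ...): add the partial
 * results that each of the numprocs ranks would have contributed. */
static double simulate_reduce_sum(int n, int numprocs)
{
    double pi = 0.0;

    for (int rank = 0; rank < numprocs; rank++)
        pi += partial_pi(n, rank, numprocs);
    return pi;
}
```

The total approximates pi for any process count, although the last few digits can differ between counts because floating-point addition is performed in a different order.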
SOME OTHER THINGS
MPI Datatypes
- Data in messages are described by:
  - Address, Count, Datatype
- MPI predefines many datatypes
  - MPI_INT, MPI_FLOAT, MPI_DOUBLE, etc.
  - There is an analog for each primitive type
- Can also construct custom data types for structured data
MPI_Recv
- Blocks until a message is received
- Message is matched based on source & tag
- The MPI_Status argument gets filled with information about the message
  - Source & tag
- Receiving fewer elements than specified is OK
  - Receiving more elements is an error
- Use MPI_Get_count to get the number of elements received
MPI_Recv

    int recvd_tag, recvd_from, recvd_count;
    MPI_Status status;

    MPI_Recv(..., MPI_ANY_SOURCE, MPI_ANY_TAG, ..., &status);
    recvd_tag = status.MPI_TAG;
    recvd_from = status.MPI_SOURCE;
    MPI_Get_count(&status, datatype, &recvd_count);
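The fragment above can be fleshed out into a small complete program. This is a minimal sketch, not the only way to do it: the buffer size `MAXLEN`, the tag value 7, and the 5-element payload are arbitrary choices made for illustration. It needs an MPI installation and at least two processes, for example `mpicc status.c -o status` followed by `mpirun -np 2 status`.

```c
#include <mpi.h>
#include <stdio.h>

#define MAXLEN 100  /* illustrative receive-buffer capacity */

int main(int argc, char *argv[])
{
    int rank, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    if (nproc < 2) {
        if (rank == 0)
            fprintf(stderr, "run with at least 2 processes\n");
        MPI_Finalize();
        return 1;
    }

    if (rank == 1) {
        /* Send fewer elements than the receiver's buffer can hold. */
        int payload[5] = {1, 2, 3, 4, 5};
        MPI_Send(payload, 5, MPI_INT, 0, 7, MPI_COMM_WORLD);
    } else if (rank == 0) {
        int buf[MAXLEN];
        int count;
        MPI_Status status;

        /* Accept a message from any sender with any tag. */
        MPI_Recv(buf, MAXLEN, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                 MPI_COMM_WORLD, &status);

        /* The status records who sent it, with which tag, and how much. */
        MPI_Get_count(&status, MPI_INT, &count);
        fprintf(stderr, "[0] got %d ints from rank %d with tag %d\n",
                count, status.MPI_SOURCE, status.MPI_TAG);
    }

    MPI_Finalize();
    return 0;
}
```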
Non-blocking communication
- MPI_Send and MPI_Recv are blocking
  - MPI_Send does not complete until the buffer is available to be modified
  - MPI_Recv does not complete until the buffer is filled
- Blocking communication can lead to deadlocks

    for (int p = 0; p < nproc; p++) {
        MPI_Send(… p ….)
        MPI_Recv(… p ….)
    }
Non-blocking communication
- MPI_Isend & MPI_Irecv return immediately (non-blocking)

    MPI_Request request;
    MPI_Status status;

    MPI_Isend(start, count, datatype, dest, tag, comm, &request);
    MPI_Irecv(start, count, datatype, src, tag, comm, &request);
    MPI_Wait(&request, &status);

- Used to overlap communication with computation
- Anywhere you use MPI_Send or MPI_Recv, you can use the pair MPI_Isend/MPI_Wait or MPI_Irecv/MPI_Wait
- Can also use MPI_Waitall, MPI_Waitany, MPI_Waitsome
- Can also check whether you have any messages without actually receiving them: MPI_Probe & MPI_Iprobe
  - MPI_Probe blocks until there is a message
  - MPI_Iprobe sets a flag
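As one possible illustration of the calls listed above, the sketch below has every process pass its rank to its right-hand neighbor in a ring. The ring pattern and the variable names are assumptions chosen for the example. Because both the receive and the send are posted before anyone waits, no process blocks on a partner that has not yet reached its matching call. It needs an MPI installation, for example `mpicc ring.c -o ring` followed by `mpirun -np 4 ring`.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, nproc;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int right = (rank + 1) % nproc;         /* neighbor we send to */
    int left = (rank - 1 + nproc) % nproc;  /* neighbor we receive from */
    int sendval = rank;
    int recvval = -1;
    MPI_Request reqs[2];

    /* Post both operations; neither call waits for the transfer. */
    MPI_Irecv(&recvval, 1, MPI_INT, left, 0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(&sendval, 1, MPI_INT, right, 0, MPI_COMM_WORLD, &reqs[1]);

    /* Independent computation could go here, as long as it does not
     * touch sendval or recvval before the wait below. */

    /* Block until both the receive and the send have completed. */
    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    fprintf(stderr, "[%d] received %d from rank %d\n", rank, recvval, left);

    MPI_Finalize();
    return 0;
}
```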
Communicators
- All MPI communication is based on a communicator, which contains a context and a group
- Contexts define a safe communication space for message passing
  - Contexts can be viewed as system-managed tags
  - Contexts allow different libraries to co-exist
- The group is just a set of processes
  - Processes are always referred to by their unique rank in the group
Uses of MPI_COMM_WORLD
- Contains all processes available at the time the program was started
- Provides an initial safe communication space
- Simple programs communicate with MPI_COMM_WORLD
  - Even complex programs will use MPI_COMM_WORLD for most communications
- Complex programs duplicate and subdivide copies of MPI_COMM_WORLD
  - Provides a global communicator for forming smaller groups or subsets of processors for specific tasks
Subdividing a Communicator with MPI_COMM_SPLIT

    int MPI_Comm_split(MPI_Comm comm, int color, int key, MPI_Comm *newcomm)

    MPI_COMM_SPLIT(COMM, COLOR, KEY, NEWCOMM, IERR)
    INTEGER COMM, COLOR, KEY, NEWCOMM, IERR

- MPI_COMM_SPLIT partitions the group associated with the given communicator into disjoint subgroups
- Each subgroup contains all processes having the same value for the argument color
- Within each subgroup, processes are ranked in the order defined by the value of the argument key, with ties broken according to their rank in the old communicator
Subdividing a Communicator
- To divide a communicator into two non-overlapping groups:

    color = (rank < size/2) ? 0 : 1;
    MPI_Comm_split(comm, color, 0, &newcomm);
Subdividing a Communicator
- To divide a communicator such that
  - all processes with even ranks are in one group
  - all processes with odd ranks are in the other group
  - the reverse order by rank is maintained

    color = (rank % 2 == 0) ? 0 : 1;
    key = size - rank;
    MPI_Comm_split(comm, color, key, &newcomm);
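The ranking rule of MPI_Comm_split (group by color, order by key, break ties by old rank) can be checked with a small serial model. The function below is a hypothetical stand-in written for this purpose, not an MPI routine. It takes the color and key that every old rank would pass and returns the rank a given process would get in its new communicator. The special color MPI_UNDEFINED is not modeled.

```c
/* Hypothetical serial model of MPI_Comm_split ranking.
 * color[r] and key[r] are the values old rank r would pass.
 * Returns the new rank of old rank `me` within its color group:
 * the number of same-color processes that sort before it, where
 * ordering is by key and ties are broken by old rank. */
static int split_rank(int me, int size, const int color[], const int key[])
{
    int new_rank = 0;

    for (int r = 0; r < size; r++) {
        if (r == me || color[r] != color[me])
            continue;  /* skip self and processes in other groups */
        if (key[r] < key[me] || (key[r] == key[me] && r < me))
            new_rank++;  /* r comes before me in the new communicator */
    }
    return new_rank;
}
```

For 6 processes with `color = rank % 2` and `key = size - rank`, old rank 4 becomes new rank 0 of the even group and old rank 0 becomes new rank 2, which is the reverse ordering described above.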
      program main
      include 'mpif.h'
      integer ierr, row_comm, col_comm
      integer myrank, size, P, Q, myrow, mycol
      P = 4
      Q = 3
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
      call MPI_COMM_SIZE(MPI_COMM_WORLD, size, ierr)
C     Determine row and column position
      myrow = myrank/Q
      mycol = mod(myrank,Q)
C     Split comm into row and column comms
      call MPI_Comm_split(MPI_COMM_WORLD, myrow, mycol, row_comm, ierr)
      call MPI_Comm_split(MPI_COMM_WORLD, mycol, myrow, col_comm, ierr)
      print*, "My coordinates are[",myrank,"] ",myrow, mycol
      call MPI_Finalize(ierr)
      stop
      end
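Process layout for P = 4, Q = 3, shown as rank (row, col):

     0 (0,0)    1 (0,1)    2 (0,2)
     3 (1,0)    4 (1,1)    5 (1,2)
     6 (2,0)    7 (2,1)    8 (2,2)
     9 (3,0)   10 (3,1)   11 (3,2)

- MPI_COMM_WORLD contains all 12 processes
- Each row of the grid forms a row_comm
- Each column of the grid forms a col_comm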
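The layout is row-major: rank = row * Q + col. The helpers below restate that mapping in C; their names are hypothetical. Because the Fortran program passes mycol as the key when building row_comm and myrow as the key when building col_comm, a process's rank inside row_comm equals its column and its rank inside col_comm equals its row.

```c
/* Hypothetical helpers for a row-major grid with Q columns. */

/* Row of a process: the integer quotient of rank by Q. */
static int grid_row(int rank, int Q)
{
    return rank / Q;
}

/* Column of a process: the remainder of rank divided by Q. */
static int grid_col(int rank, int Q)
{
    return rank % Q;
}

/* Inverse mapping: rank in MPI_COMM_WORLD from grid coordinates. */
static int grid_rank(int row, int col, int Q)
{
    return row * Q + col;
}
```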
DEBUGGING
An ounce of prevention…
- Defensive programming
  - Check function return codes
  - Verify send and receive sizes
- Incremental programming
- Modular programming
  - Test modules and keep the test code in place
- Identify all shared data and think carefully about how it is accessed
- Correctness first, then speed
Debugging
- Characterize the bug
  - Run the code serially
  - Run in parallel on one core (2-4 processes)
  - Run in parallel (2-4 processes on 2-4 cores)
  - Play around with inputs and other data and data sizes
  - Find the smallest data size that exposes the bug
  - Remove as much non-determinism as you can
- Print statements: use stderr (unbuffered)
  - Print before and after communication or shared variable access
  - Print all information: source, sizes, data, tag, etc.
  - Identify the process number first thing in the print (helps sorting)
  - Leave the prints in your code, guarded by #ifdef
Debugging
- Learn about the C constructs __FILE__, __LINE__, and __FUNCTION__
- Make one logical change at a time and then test
- Learn how to attach debuggers
  - You will probably need some sort of stall code, i.e. wait for input on the master and then do a barrier, while all others just do the barrier
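The printing advice (stderr, rank first, location information, kept in the code behind #ifdef) can be combined into one macro. The sketch below is one possible shape. `DBG`, `debug_prefix`, and the `DEBUG` symbol are hypothetical names, and it uses `__func__`, the standard C99 spelling of `__FUNCTION__`.

```c
#include <stdio.h>

/* Hypothetical helper: write "[rank] file:line (func): " into buf.
 * Putting the rank first makes it easy to sort interleaved output.
 * Returns the number of characters snprintf wanted to write. */
static int debug_prefix(char *buf, size_t len, int rank,
                        const char *file, int line, const char *func)
{
    return snprintf(buf, len, "[%d] %s:%d (%s): ", rank, file, line, func);
}

#ifdef DEBUG
/* Print a tagged message to stderr, which is unbuffered by default. */
#define DBG(rank, ...)                                              \
    do {                                                            \
        char prefix_[256];                                          \
        debug_prefix(prefix_, sizeof prefix_, (rank),               \
                     __FILE__, __LINE__, __func__);                 \
        fprintf(stderr, "%s", prefix_);                             \
        fprintf(stderr, __VA_ARGS__);                               \
        fputc('\n', stderr);                                        \
    } while (0)
#else
/* Without -DDEBUG the calls stay in the source but compile to nothing. */
#define DBG(rank, ...) do { } while (0)
#endif
```

A call such as `DBG(myid, "sending %d ints to %d, tag %d", count, dest, tag);` placed before and after each communication call then costs nothing unless the program is compiled with `-DDEBUG`.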
Common problems
- Not all processes call a collective call
  - Be very careful about putting collective calls inside conditionals
  - Be sure the communicator is correct
- Deadlock (everybody on recv)
  - Use non-blocking calls
  - Use MPI_Sendrecv
- Process waiting for data that is never sent
  - Use collective calls where you can
  - Use simple communication patterns
Best Advice
- Program incrementally and modularly
- Characterize the bug, and leave yourself time to walk away from it and think about it
- Never underestimate the value of a second set of eyes
  - Sometimes just explaining your code to someone else helps you help yourself