MPI - A User level Message-passing Interface K. Ramakrishna Shenai
History of MPI Principal branches of MPI MPI 1.2 Version MPI 2.0 Version An MPI Process in normally written in C (C++) or fortran and linked with MPI Libraries.
MPI Basic functionality MPI Program consists of A set of processes occupying a separate and unshared address space Communication of data by invoking MPI procedures MPI 2, has the added ability of adding and destroying processes.
MPI Basic functionality (contd.) MPI Processes can be Can have the same executable file running in different address spaces Individual Processes are instances of different executables from different source files, linked together Pass different command line arguments to different processes Rank in a MPI_COMM_WORLD communicator Each process has a unique rank in the range of [0.. p-1]
/* Example Program: Hello world */ # include int main(int argc, char *argv[]){ int rank, size, partner; int namelen; char name[MPI_MAX_PROCESSOR_NAME]; char greeting[sizeof(name) + 100]; MPI_Init(&argc, 7argv); /* Initialize MPI */ MPI_Comm_size(MPI_COMM_WORLD, &size); /* Which one am I */ MPI_Comm_rank(MPI_COMM_WORLD, &rank); /* Where am I running*/ MPI_Get_processor(name, &namelen); sprintf(greeting, “hello world %d of %d running on %s \n”, rank, size, name); if(rank==0) { fputs(greeting, stdout); for(partner=1; partner < size; partner++) { MPI_Status stat; MPI_Recv(greeting, sizeof(greeting), MPI_BYTE, partner, 1, MPI_COMM_WORLD, &stat); fputs(greeting, stdout); } }else { MPI_Send(greeting, strlen(greeting)+1, MPI_BYTE, 0, 1, MPI_COMM_WORLD); } MPI_Finalize(); exit(0); }
Basic Concepts MPI_Init and MPI_finalize In C, MPI_init should be passed the address of the argc and argv. This procedure must be called before any other MPI Procedure MPI_Finalize: This is a terminating procedure after which any other MPI procedure must not be implemented
Basic Concepts Communicators Requirement the caller specify a communicator argument, which define the context in which communication takes place In C, Communicators have the type MPI_Comm Every process within a communicator has size MPI_Comm_size and the process rank with MPI_Comm_rank. MPI_Get_processor_name A procedure that is used to determine what physical processor the current process is running. Usually returns the system’s Internet Hostname
Basic Concepts: MPI_Send and MPI_Receive They are point to point communication primitives Transmission of messages directly from sender to receiver’s memory. c prototypes int MPI_Send(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) int MPI_Receive(void *buf, int count, MPI_Datatype datatype, int dest, int tag, MPI_Comm comm) Mainly Blocking Mode of Communication, However in certain cases a non-blocking mode of Communication is possible. Other communication modes (buffered, synchronous and ready exist)
Basic Concepts MPI_Status Structure can be examined to determine the source and tag of any message Loosely Synchronous Operations Basically means that all processes engaged in communication operation must call the corresponding MPI procedure at the same logical point in execution, with respect to all other MPI communication calls.
Basic Concepts Balancing Sends and Receives /* InCorrect MAY Deadlock depending upon buffering */ MPI_Send(out, count, type, partner, tag, comm); MPI_Recv(in, count, type, partner, tag, comm, &stat); /* Correct Send must match receive exactly */ If(rank > partner) { MPI_Send(out, count, type, partner, tag, comm); MPI_Recv(in, count, type, partner, tag, comm, &stat);} else if(rank < partner){ MPI_Recv(in, count, type, partner, tag, comm, &stat); MPI_Send(out, count, type, partner, tag, comm);} else { /* rank==partner */ MPI_Type_size(type, &sz); memcpy(in, out, count, * sz); /*content of stat not set */ /* Also Correct use only MPI_Sendrecv Procedure */ Mpi_Sendrecv(out, count, type, partner, tag, in, count, type, partner, tag, comm, &stat);
Basic Concepts MPI Data types: Elementary MPI data types in C MPI_CHAR, MPI_SHORT, MPI_INT, MPI_LONG, MPI_UNSIGNED_CHAR, MPI_UNSIGNED_SHORT, MPI_UNSIGNED, MPI_UNISIGNED_LONG, MPI_FLOAT, MPI_DOUBLE, MPI_LONG_DOUBLE, MPI_BYTE Use of Standard Libraries MPI does not alter any Standard Libraries. Calls to external libraries (especially I/O libraries) should be made from a distinguished process, the rank=0 process. LAM and MPICH are both quite flexible, when it comes to handling such libraries. Errors in MPI
Compiling using MPI on the Beowulf MPICH is installed in a public location like /usr/local/mpich, also called All user oriented utilities are stored in /bin, header files in /include and libraries in /lib/LINUX/ch_p4/ /bin contain mpicc, the script files for compiling C programs with certain options. Ways of Compiling a C program to give the executable MPIROOT=/usr/local/mpi mpicc hello.c –o hello cc –o hello –I$MPIROOT/include hello.c – L$MPIROOT/LINUX/ch_p4 -lmpich
Running an MPI Program mpirun utility Options -np : starts the program with the specifies number of processors -machinefile ; explicitly specify the list of processors. -nolocal force the first process to execute on the first processor specified in the processor_list mpirun will start process 0 on the processor that called mpirun.
Parallel Data Structures with MPI MPI provides a simple memory model over which Parallel data structures can be created. Every symbol is actually a parallel object that exists on every process. The symbol can take different values for different processes. Example Implementation: A Parallel Array parray structure Typedef struct { int nelem_local; /*the number in local memory */ void *data; MPI_Comm comm; int first_index_local; /*global index of the first element of the local data (whose local index is 0) */ int nelem_global; /* aggregate number of elements */ size_t elem_size; int commrank, commsize; }parray;
/* procedure to create a self consistent parallel array from information provided by each process about its own local data */ Void pacreate(parray *pa, void *data, int nelem, size_t size, MPI_Comm comm) { int last_index; pa->nelem_local = nelem; pa->data = data; pa->comm = comm; pa->elem_size = size; /* My first element is the sum of the number of elements in lower ranked procs */ MPI_Scan(&pa->nelem_local, &last_index, 1, MPI_INT, MPI_SUM, pa->comm); pa->first_index_local = last_index – pa->nelem_local; MPI_Comm_size(pa->comm, &pa->commsize); MPI_Comm_rank(pa->comm, &pa->commrank); /* The global element count is the last index in the highest rank processor (commsize – 1) use bcast to distribute it */ Mpi_Bcast(&last_index, 1, MPI_INT, pa->commsize-1, pa->comm); Pa->nelem_global = last_index; }
Example: A Parallel Array The parray structure automatically exists on every process, when defined. Collective Communication MPI_Broadcast sends the content of the buffer on one process MPI_Scan performs operation (like ‘+’) on values supplied by each processor, like the result returned on process I is the result of the operation applied to values supplied by every process of rank 0 to i. MPI predefined Reductions/scan operations MPI_MAX, MPI_SUM, MPI_PROD, MPI_MAXLOC (returns extremum, like an array index or rank) etc.
Example: A Parallel Array (contd.) Use of MPI_Allreduce computes the result of an operation applied to a value supplied by every process. The result is returned to every process. Can be equivalent to MPI_Scan MPI_Gather, MPI_Scatter are combination of two or more operations like MPI_Allreduce MPI_Reduce + MPI_Broadcast
/* procedure to compute the mean of the elements in a parallel array */ double parraymean(struct parray *pa) { double sum, sumall; int i; /* Compute the local sum of elements */ sum = 0; for(i=0; i nelem_mine; i++) { sum = sum + pa->data[i]; } /* Use all reduce to get the global sum */ MPI_Allreduce(&sum, &sumall, 1, MPI_DOUBLE, MPI_SUM, pa->comm); return sumall/pa->nelem_global; }
A One dimensional Cellular Automaton One Dimensional Cellular Automata (CA), are simple dynamic systems, where the placement of data is meaningful, requiring point to point communication. A CA with half-width hw --> It is an array of values with update rule that states that the next value in location i depends only on the previous value in locations (i-hw, ,i ,i_hw). The values of a Cellular Automaton can be integers, or bits etc. The update rule can be an arbitrary function of the 2hw+1 input values. Special cases include linear functions and functions like “parity” which count the number of values in an input domain. Structure definition of the the one dimension cellular automaton typedef struct ca_s { unsigned char *state; /* size ncells */ int A; int hw; int ncells; unsigned char *old; /* size ncells + 2*hw */ /* other fields that control updateRule */ }CA_t;
/* updating a cellular automaton */ void Caiterate(CA_t *ca, int ntime){ int i; for(i=0;i<ntimes; i++) { Cacopystate(ca); Caupdate(ca); } } static void Caupdate(CA_t *ca) { int n = ca->ncells; unsigned char *oldcenter = ca->old; unsigned char *new = ca->state; while(n-->0) { *new++ = updateRule(ca, oldcenter++); } } static void { Cacopystate(CA_t *ca){ memcpy(ca->old, ca->state, ca->ncells); /* Now for periodic boundary conditions */ memcpy(ca->old-ca->hw, ca->state+(ca->ncells - ca->hw), ca->hw); memcpy(ca->old+ca->ncells, ca->state, ca->hw); }
A One dimensional Cellular Automaton (contd) The two operations in the CAiterate The CAcopystate involves copying the contents of the state array to the old array so that we can write new values into the state. The second phase involves computing the new state depending on the values of the old array. The periodic boundary conditions are imposed by padding the old array on both the ends by hw values, copied from the opposite end of the state. A B C D E F G H F G H A B C D E F G H A B C The arrow indicates a call to memcpy
Parallelizing the problem ! General guideline : Keep a maximum amount of the sequential code intact. The order of the elements and the relationship between elements stored in different processors is important. Boundary data should be exchanged between neighbours Modification Adding a MPI_COMM Element to the CA Structure calls to memcpy has been replaced to MPI_Sendrecv in CA Copystate typedef struct ca_s { unsigned char *state; /* size ncells */ int A; int hw; int ncells; unsigned char *old; /* size ncells + 2*hw */ MPI_Comm comm; /* other fields that control updateRule */ }CA_t;
Parallelizing the problem ! static void { Cacopystate(CA_t *ca){ int up_neighbor, down_neighbor, myrank, nproc;_ memcpy(ca->old, ca->state, ca->ncells); MPI_Comm_rank (ca->comm, &myrank); MPI_Comm_size (ca->comm, &nproc); up_neighbor = (myrank + 1)% nproc; down_neighbor = (myrank+nproc-1)%nproc; MPI_Sendrecv(ca->state + (ca->ncells - ca->hw), ca->hw, MPI_BYTE, up_neighbor, CA_UPTAG, ca->old, ca->hw, MPI_BYTE, down_neighbor, CA_UPTAG, ca->comm, &stat); MPI_Sendrecv(ca->state, ca->hw, MPI_BYTE, down_neighbor, CA_DOWNTAG, ca->old+(ca->ncells+ca->hw), ca->hw, MPI_BYTE, up_neighbor, CA_DOWNTAG, ca->comm, &stat);
A B C D E F G HI J K L MN O P X Y Z A B C D E F G H I J KF G H I J K L MN O P Q R S In the Parallel version, when the state of CA is copied to old, it is padded at both ends by data from neighboring processors. Two processors are shown, The diagonal arrow represent data transfers by MPI_Sendrecv, and the dotted arrow represents a memcpy
Domain Decomposition and boundaries: Make a parallel version of a problem as a collection of near-copies of a sequential implementation with special boundary conditions derived by communication with neighbors. Starting with a good sequential Implementation --> Fundamental Algorithms are basically designed sequentially by default, hence changes required for parallelization should be as small as possible. For the CA, it is the matter of obtaining boundary conditions from neighboring processors. Avoid changing fundamental sequential parts of a code as far as possible. CONSIDERATIONS
Communication should be treated as a separate phase from local computation. It has a major effect on performance Performance Considerations --> Communication is Slow. Communication latency will dominate the time consideration. Despite our local computation being fast, eventual performance will reduce due to time spent in communication, CONSIDERATIONS
Time taken to complete an iteration in N steps with P processors t step = 2 t latency + (N/P)t update = (N t update /P) + ( 1 + (2 t latency /(N/P)t update ) ) first Term is just the time taken on one processor, divided by P, a perfect speedup result. Last term represents how worse than perfect is the actual implementation --> it is the ration of the time spent on one processor to communication, 2 t latency to the time spent in CAupdate, (N/P)t update CONSIDERATIONS
MPI Advanced Features Blocking and Non-blocking Calls and Alternative Sending Modes Blocking Calls always wait for the requested action to complete before returning control to the caller. Non blocking calls are Initiate background process, only used to avoid unnecessary copying between system buffers and user memory. Standard Mode sends Buffered Mode sends Synchronous mode sends
MPI Advanced Features Derived data types MPI includes a powerful set of procedures that can be used to refer no-contiguous data; example--> refer data of heterogeneous type, and integer followed by a set of floating point values. Latency is a major problem on Beowulf systems, specially in transfer of data, an alternative is to pack multiple logical messages into a single physical message and to incur the startup cost only once. Intercommunicators Basically the MPI_COMM_WORLD, the group of all active processes and communicators are derived by selecting subsets from this communicators, the resulting logical structures are insufficient. It is desirable to construct a new communicator, from the union of the two existing communicators and these are called INTERCOMMUNICATORS. Used in application relevant to functional parallelism.
MPI 2 Process creation and management = Processes can be dynamically created and destroyed. One sided communication = Allowing either the sender or the receiver set the parameters for communication. Reduces complexity and can exploit fast communication mechanisms. Extended collective operations = New procedures for creating collective operations and Intercommunicators. External interfaces = Users and library designers can access internals of MPI’s objects, allows for new functionality to be layered atop MPI’s internal data structures. New Language bindings for C++ and fortran. I/0 MPI-I0 describes the application side of an abstract model of a parallel I/O, how to define requests to a parallel abstract I/O device from a distributed memory parallel program.