High Performance Parallel Programming

Dirk van der Knijff
Advanced Research Computing Information Division

Lecture 4: Message Passing Interface

So far...
- Messages: source, dest, data, tag, communicator
- Communicators: MPI_COMM_WORLD
- Point-to-point communications
  - different modes: standard, synchronous, buffered, ready
  - blocking vs non-blocking
- Derived datatypes: construct, then commit

Ping-pong exercise: program

    /**********************************************************************
     * This file has been written as a sample solution to an exercise in a
     * course given at the Edinburgh Parallel Computing Centre. It is made
     * freely available with the understanding that every copy of this file
     * must include this header and that EPCC takes no responsibility for
     * the use of the enclosed teaching material.
     *
     * Authors: Joel Malard, Alan Simpson
     * Contact:
     * Purpose: A program to experiment with point-to-point
     *          communications.
     * Contents: C source code.
     ********************************************************************/

    #include <stdio.h>
    #include <mpi.h>

    #define proc_A 0
    #define proc_B 1
    #define ping 101
    #define pong 101

    float buffer[100000];
    long float_size;

    void processor_A(void), processor_B(void);

    int main(int argc, char *argv[])
    {
        int rank, type_size;

        MPI_Init(&argc, &argv);

        /* MPI_Type_extent was removed in MPI-3.0 and never took a long *;
           MPI_Type_size gives the number of bytes in one MPI_FLOAT. */
        MPI_Type_size(MPI_FLOAT, &type_size);
        float_size = type_size;

        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Ranks 0 and 1 play ping-pong; any other ranks do nothing. */
        if (rank == proc_A)
            processor_A();
        else if (rank == proc_B)
            processor_B();

        MPI_Finalize();
        return 0;
    }

    void processor_A(void)
    {
        int i, length;
        MPI_Status status;
        double start, finish, time;
        extern float buffer[100000];
        extern long float_size;

        printf("Length\tTotal Time\tTransfer Rate\n");

        /* The loop's upper bound is missing on the slide; the size of
           buffer (100000) is assumed here, since length must not exceed it. */
        for (length = 1; length <= 100000; length += 1000) {
            start = MPI_Wtime();
            /* 100 round trips = 200 messages of length floats each */
            for (i = 1; i <= 100; i++) {
                MPI_Ssend(buffer, length, MPI_FLOAT, proc_B, ping,
                          MPI_COMM_WORLD);
                MPI_Recv(buffer, length, MPI_FLOAT, proc_B, pong,
                         MPI_COMM_WORLD, &status);
            }
            finish = MPI_Wtime();
            time = finish - start;

            /* columns: length, time per message (s), bytes per second */
            printf("%d\t%f\t%f\n", length, time / 200.,
                   (float)(2 * float_size * 100 * length) / time);
        }
    }

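    void processor_B(void)
    {
        int i, length;
        MPI_Status status;
        extern float buffer[100000];

        /* The loop's upper bound is missing on the slide; the size of
           buffer (100000) is assumed here. It must match processor_A. */
        for (length = 1; length <= 100000; length += 1000) {
            for (i = 1; i <= 100; i++) {
                /* mirror image of processor_A: receive first, then send back */
                MPI_Recv(buffer, length, MPI_FLOAT, proc_A, ping,
                         MPI_COMM_WORLD, &status);
                MPI_Ssend(buffer, length, MPI_FLOAT, proc_A, pong,
                          MPI_COMM_WORLD);
            }
        }
    }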
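
The two numbers that processor_A prints are plain arithmetic on the measured time. A minimal sketch in plain C (no MPI needed); the function names time_per_message and transfer_rate are hypothetical and do not appear in the EPCC program:

```c
#include <assert.h>

/* `reps` round trips means 2 * reps one-way messages,
   so this is the average time for one message. */
double time_per_message(double total_seconds, int reps)
{
    assert(reps > 0);
    return total_seconds / (2.0 * reps);
}

/* Bytes moved per second over the whole timed loop:
   2 messages per round trip, each of `length` elements. */
double transfer_rate(long elem_size, int length, int reps, double total_seconds)
{
    assert(total_seconds > 0.0);
    return (2.0 * elem_size * reps * length) / total_seconds;
}
```

With 100 round trips of 1000 four-byte floats taking 2 seconds in total, this gives 0.01 s per message and 400000 bytes per second.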

Ping-pong exercise: results

Ping-pong exercise: results 2

Running ping-pong
- compile: mpicc ping_pong.c -o ping_pong
- submit: qsub with a job script containing
      #PBS -q exclusive
      #PBS -l nodes=2
      cd <your sub_directory>
      mpirun ping_pong

Collective communication
- Communications involving a group of processes
- Called by all processes in a communicator
  - for sub-groups, need to form a new communicator
- Examples
  - Barrier synchronisation
  - Broadcast, Scatter, Gather
  - Global sum, Global maximum, etc.

Characteristics
- Collective action over a communicator
- All processes must communicate
- Synchronisation may or may not occur
- All collective operations are blocking (MPI-3 later added non-blocking variants)
- No tags
- Receive buffers must be exactly the right size
- Collective communications and point-to-point communications cannot interfere

MPI_Barrier
- Blocks each calling process until all other members have also called it.
- Generally used to synchronise between phases of a program
- Only one argument, no data is exchanged
      MPI_Barrier(comm)

Broadcast
- Copies data from a specified root process to all other processes in the communicator
  - all processes must specify the same root
  - other arguments are the same as for point-to-point
  - datatypes and sizes must match
      MPI_Bcast(buffer, count, datatype, root, comm)
- Note: MPI does not support a multicast function

Scatter, Gather
- Scatter and Gather are inverse operations
- Note that all processes partake, even root
- Scatter (diagram): before, the root holds a b c d e; after, each process holds one item

Gather
- Gather (diagram): before, each process holds one of a b c d e; after, the root holds a b c d e

MPI_Scatter, MPI_Gather
    MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
    MPI_Gather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm)
- sendcount in scatter and recvcount in gather refer to the size of each individual message
  (sendtype = recvtype => sendcount = recvcount)
- total type signatures must match
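
The point that the counts describe each individual message, not the whole buffer, can be sketched without MPI. This is a serial model in plain C, assuming int data; scatter_chunk and gather_place are hypothetical helper names, not MPI calls:

```c
#include <assert.h>
#include <string.h>

/* Chunk of the root's sendbuf that a scatter delivers to `rank`
   when every process receives `count` elements. */
const int *scatter_chunk(const int *root_sendbuf, int count, int rank)
{
    assert(count >= 0 && rank >= 0);
    return root_sendbuf + (size_t)rank * count;
}

/* Where a gather places the `count` elements sent by `rank`
   inside the root's recvbuf. */
void gather_place(int *root_recvbuf, const int *rank_sendbuf,
                  int count, int rank)
{
    assert(count >= 0 && rank >= 0);
    memcpy(root_recvbuf + (size_t)rank * count, rank_sendbuf,
           (size_t)count * sizeof(int));
}
```

Scattering a buffer and gathering the chunks back with the same count reproduces the original buffer, which is the sense in which the two operations are inverses.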

Example
    MPI_Comm comm;
    int gsize, sendarray[100];
    int root, myrank, *rbuf;
    MPI_Datatype rtype;
    ...
    MPI_Comm_rank(comm, &myrank);
    MPI_Comm_size(comm, &gsize);
    /* rtype describes one incoming block of 100 ints */
    MPI_Type_contiguous(100, MPI_INT, &rtype);
    MPI_Type_commit(&rtype);
    /* only the root needs a receive buffer */
    if (myrank == root) {
        rbuf = (int *)malloc(gsize*100*sizeof(int));
    }
    MPI_Gather(sendarray, 100, MPI_INT, rbuf, 1, rtype, root, comm);

More routines
    MPI_Allgather(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
    MPI_Alltoall(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, comm)
(Diagrams: MPI_Allgather on items a to e; MPI_Alltoall on a 5 x 5 block of items a to y.)
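
The data movement of MPI_Alltoall amounts to a block transpose across processes. A serial model in plain C, assuming int data and all ranks' buffers stored back to back in one array; model_alltoall is a hypothetical name:

```c
#include <assert.h>

/* send and recv each hold the buffers of all nprocs ranks back to back
   (nprocs * nprocs * count ints). Block q of rank p's send buffer ends
   up as block p of rank q's receive buffer. */
void model_alltoall(const int *send, int *recv, int count, int nprocs)
{
    assert(count > 0 && nprocs > 0);
    for (int p = 0; p < nprocs; p++)
        for (int q = 0; q < nprocs; q++)
            for (int k = 0; k < count; k++)
                recv[(q * nprocs + p) * count + k] =
                    send[(p * nprocs + q) * count + k];
}
```

With count = 1 the result is simply the transpose of the nprocs x nprocs matrix whose row p is the send buffer of rank p.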

Vector routines
    MPI_Scatterv(sendbuf, sendcounts, displs, sendtype, recvbuf, recvcount, recvtype, root, comm)
    MPI_Gatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, root, comm)
    MPI_Allgatherv(sendbuf, sendcount, sendtype, recvbuf, recvcounts, displs, recvtype, comm)
    MPI_Alltoallv(sendbuf, sendcounts, sdispls, sendtype, recvbuf, recvcounts, rdispls, recvtype, comm)
- The counts and displacements are arrays with one entry per process
- Allow send/recv to be from/to non-contiguous locations in an array
- Useful when different processes send or receive different counts
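
The vector routines take one count and one displacement per process. A minimal sketch in plain C of how those two arrays might be filled for an uneven split; even_counts and packed_displs are hypothetical helpers, and the packed layout (each block starting right after the previous one) is only one possible choice:

```c
#include <assert.h>

/* Split n items over nprocs ranks as evenly as possible:
   the first n % nprocs ranks get one extra item. */
void even_counts(int n, int nprocs, int *counts)
{
    assert(n >= 0 && nprocs > 0);
    for (int i = 0; i < nprocs; i++)
        counts[i] = n / nprocs + (i < n % nprocs ? 1 : 0);
}

/* Fill displs so that block i starts right after block i - 1.
   Returns the total number of elements covered. */
int packed_displs(const int *counts, int *displs, int nprocs)
{
    assert(nprocs > 0);
    int total = 0;
    for (int i = 0; i < nprocs; i++) {
        displs[i] = total;   /* offset of block i, in elements */
        total += counts[i];
    }
    return total;
}
```

The resulting arrays are what one would pass as the counts and displs arguments at the root of MPI_Scatterv or MPI_Gatherv. Displacements are measured in elements of the given datatype, not in bytes.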

Global reduction routines
- Used to compute a result which depends on data distributed over a number of processes
- Examples:
  - global sum or product
  - global maximum or minimum
  - global user-defined operation
- Operation should be associative
  - aside: remember floating-point operations technically aren’t associative
  - we usually don’t care, but it can affect results in parallel programs
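
The aside about floating-point associativity is easy to see in plain C. The sketch below sums the same four numbers in two groupings: left to right, as one process would, and as two partial sums that are then combined, as a two-process reduction might. The function names and the test values are illustrative assumptions:

```c
#include <assert.h>

/* Sum x[0..n) left to right, as a single process would. */
double sum_serial(const double *x, int n)
{
    assert(n >= 0);
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Sum as two "processes" would: each sums its own half,
   then the two partial sums are combined. */
double sum_two_parts(const double *x, int n)
{
    int half = n / 2;
    return sum_serial(x, half) + sum_serial(x + half, n - half);
}
```

For the input {1e30, 1.0, -1e30, 1.0} the serial sum is 1.0, while the two-part sum is 0.0, because each 1.0 is absorbed when added to a value of magnitude 1e30.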

Global reduction (cont.)
    MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm)
- combines count elements from each sendbuf using op and leaves the results in recvbuf on process root
- e.g. MPI_Reduce(&s, &r, 2, MPI_INT, MPI_SUM, 1, comm)
(Diagram: the five processes hold s = (2 3), (1 1), (3 2), (1 1), (1 2); after the call, r = (8 9) on the root, process 1.)
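
What the root ends up with can be checked by hand using the s values from the slide's diagram. A serial model in plain C of a sum reduction over ints, with all send buffers stored back to back; model_reduce_sum is a hypothetical name:

```c
#include <assert.h>

/* sendbufs holds nprocs buffers of `count` ints back to back;
   result receives the element-wise sum that the root's recvbuf
   would hold after a reduction with MPI_SUM. */
void model_reduce_sum(const int *sendbufs, int *result, int count, int nprocs)
{
    assert(count > 0 && nprocs > 0);
    for (int k = 0; k < count; k++) {
        result[k] = 0;
        for (int p = 0; p < nprocs; p++)
            result[k] += sendbufs[p * count + k];
    }
}
```

Element k of the result depends only on element k of each send buffer; the reduction never mixes different positions.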

Reduction operators
    MPI_MAX      Maximum
    MPI_MIN      Minimum
    MPI_SUM      Sum
    MPI_PROD     Product
    MPI_LAND     Logical AND
    MPI_BAND     Bitwise AND
    MPI_LOR      Logical OR
    MPI_BOR      Bitwise OR
    MPI_LXOR     Logical XOR
    MPI_BXOR     Bitwise XOR
    MPI_MAXLOC   Max value and location
    MPI_MINLOC   Min value and location
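
MPI_MAXLOC works on (value, location) pairs, where the location is typically the rank. A serial model in plain C of how two pairs are combined, including the tie rule (the smaller location wins); the names val_loc, maxloc and maxloc_all are hypothetical:

```c
#include <assert.h>

/* Value paired with the rank that owns it. MPI_DOUBLE_INT describes
   a struct with this layout (a double followed by an int). */
struct val_loc {
    double val;
    int    loc;
};

/* Keep the larger value; on a tie keep the smaller location. */
struct val_loc maxloc(struct val_loc a, struct val_loc b)
{
    if (a.val > b.val) return a;
    if (b.val > a.val) return b;
    return a.loc < b.loc ? a : b;
}

/* Fold maxloc over one value per rank, ranks 0 .. nprocs - 1. */
struct val_loc maxloc_all(const double *vals, int nprocs)
{
    assert(nprocs > 0);
    struct val_loc best = { vals[0], 0 };
    for (int r = 1; r < nprocs; r++) {
        struct val_loc cur = { vals[r], r };
        best = maxloc(best, cur);
    }
    return best;
}
```

MPI_MINLOC follows the same pattern with the comparison reversed.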

User-defined operators
- In C the operator is defined as a function of type
      typedef void MPI_User_function(void *invec, void *inoutvec, int *len, MPI_Datatype *datatype);
- In Fortran, write a subroutine of the form
      subroutine <user_function>(invec, inoutvec, len, type)
  where invec(*) and inoutvec(*) are arrays
- The function has the following schema:
      for (i = 1 to len)
          inoutvec(i) = invec(i) op inoutvec(i)
- Then MPI_Op_create(user_function, commute, op) returns a handle op of type MPI_Op
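
A sketch of a function with this shape, here multiplying complex numbers element by element. To keep it buildable without MPI, the last parameter uses a stand-in typedef named Datatype; in a real program it would be MPI_Datatype *. The Complex struct and the name complex_prod are assumptions for illustration. The MPI standard specifies the operand order as invec op inoutvec, which matters only when op is not commutative:

```c
#include <assert.h>

typedef struct { double re, im; } Complex;  /* hypothetical element type */
typedef int Datatype;  /* stand-in for MPI_Datatype, so this builds without MPI */

/* Same parameter shape as MPI_User_function:
   for each i, inoutvec[i] = invec[i] * inoutvec[i]. */
void complex_prod(void *invec, void *inoutvec, int *len, Datatype *datatype)
{
    Complex *in = (Complex *)invec;
    Complex *inout = (Complex *)inoutvec;
    (void)datatype;  /* not needed here: only one element type is handled */
    assert(*len >= 0);
    for (int i = 0; i < *len; i++) {
        Complex c;
        c.re = in[i].re * inout[i].re - in[i].im * inout[i].im;
        c.im = in[i].re * inout[i].im + in[i].im * inout[i].re;
        inout[i] = c;  /* the result overwrites the in/out operand */
    }
}
```

In an MPI program the function would be registered with MPI_Op_create(complex_prod, 1, &op), where the 1 declares the operation commutative, and op would then be passed to MPI_Reduce.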

Variants
    MPI_Allreduce(sendbuf, recvbuf, count, datatype, op, comm)
- All processes involved receive identical results
    MPI_Reduce_scatter(sendbuf, recvbuf, recvcounts, datatype, op, comm)
- Acts as if a reduce was performed and then each process receives recvcounts(myrank) elements of the result.

Reduce-scatter
    int *s, *r;
    int rc[5] = { 1, 2, 0, 1, 1 };
    int rank, gsize;
    ...
    MPI_Reduce_scatter(s, r, rc, MPI_INT, MPI_SUM, comm);
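
The "reduce, then scatter by counts" behaviour can be modelled serially. The sketch below, in plain C with int data and MPI_SUM semantics, computes the elements that one given rank would receive; model_reduce_scatter_sum is a hypothetical name, and all send buffers are assumed to be stored back to back:

```c
#include <assert.h>

/* sendbufs: nprocs buffers of `total` ints back to back, where total
   is the sum of recvcounts. Writes into out the recvcounts[rank]
   summed elements that `rank` receives, and returns how many. */
int model_reduce_scatter_sum(const int *sendbufs, const int *recvcounts,
                             int nprocs, int rank, int *out)
{
    assert(nprocs > 0 && rank >= 0 && rank < nprocs);
    int total = 0, start = 0;
    for (int p = 0; p < nprocs; p++) {
        if (p < rank)
            start += recvcounts[p];  /* elements owned by lower ranks */
        total += recvcounts[p];
    }
    for (int k = 0; k < recvcounts[rank]; k++) {
        out[k] = 0;
        for (int p = 0; p < nprocs; p++)
            out[k] += sendbufs[p * total + start + k];
    }
    return recvcounts[rank];
}
```

With counts 1, 2, 0, 1, 1 every process contributes five elements; rank 1 receives elements 1 and 2 of the reduced vector and rank 2 receives nothing.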

Scan
    MPI_Scan(sendbuf, recvbuf, count, datatype, op, comm)
- Performs a prefix reduction on data across the group
- On process myrank, recvbuf holds op applied to the sendbuf values of processes 0, ..., myrank (inclusive)
- e.g. MPI_Scan(&s, &r, 5, MPI_INT, MPI_SUM, comm);
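
A prefix reduction is a running total across ranks. A serial model in plain C for one int per process and MPI_SUM semantics; model_scan_sum is a hypothetical name, and the arrays s and r stand in for the per-process send and receive values:

```c
#include <assert.h>

/* r[k] = s[0] + s[1] + ... + s[k]; the sum includes rank k itself. */
void model_scan_sum(const int *s, int *r, int nprocs)
{
    assert(nprocs > 0);
    int running = 0;
    for (int k = 0; k < nprocs; k++) {
        running += s[k];
        r[k] = running;
    }
}
```

Rank 0 receives its own value unchanged, and the last rank receives the same result a full reduction would give.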

Further topics
- Error-handling
  - Errors are handled by an error handler
  - MPI_ERRORS_ARE_FATAL: default for MPI_COMM_WORLD
  - MPI_ERRORS_RETURN: MPI state is undefined
  - MPI_Error_string(errorcode, string, resultlen)
- Message probing
  - Messages can be probed
  - Note: wildcard reads may receive a different message
  - blocking and non-blocking
- Persistent communications

Assignment 2
- Write a general procedure to multiply 2 matrices.
- Start with the harness for last year's assignment.
- Last year I asked them to optimise first; this year, just parallelize.
- Next Tuesday I will discuss strategies. That doesn’t mean don’t start now...
- Ideas are available in various places...

