CS4402 – Parallel Computing
Lecture 3: MPI – Collective Communication; Parallel Processing on Arrays
Basic Blocking P2P Operations
MPI_Send – basic blocking send; returns only after the application buffer in the sending task is free for reuse.
MPI_Send(&buf, count, datatype, dest, tag, comm)
MPI_Recv – receives a message and blocks until the requested data is available.
MPI_Recv(&buf, count, datatype, source, tag, comm, &status)
MPI_Ssend – synchronous blocking send.
MPI_Ssend(&buf, count, datatype, dest, tag, comm)
MPI_Bsend – buffered blocking send.
MPI_Bsend(&buf, count, datatype, dest, tag, comm)
MPI_Rsend – blocking ready send.
MPI_Rsend(&buf, count, datatype, dest, tag, comm)
All of these send modes are matched on the receiving side by the same MPI_Recv.
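As a concrete illustration (a minimal sketch, not taken from the slides), the following program sends one integer from rank 0 to rank 1 with the basic blocking routines; run it with at least two processes:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, value = 0;
    MPI_Status status;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        value = 42;
        /* blocking send: returns once the buffer value can be reused */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* blocking receive: waits until the message has arrived */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("process 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}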
Basic Non-Blocking P2P Operations
MPI_Isend – immediate (non-blocking) send; it must later be completed with MPI_Wait or MPI_Test.
MPI_Isend(&buf, count, datatype, dest, tag, comm, &request)
MPI_Irecv – immediate (non-blocking) receive.
MPI_Irecv(&buf, count, datatype, source, tag, comm, &request)
MPI_Issend – immediate synchronous send; MPI_Wait() or MPI_Test() indicates when the destination process has started to receive the message.
MPI_Issend(&buf, count, datatype, dest, tag, comm, &request)
MPI_Ibsend – non-blocking buffered send.
MPI_Ibsend(&buf, count, datatype, dest, tag, comm, &request)
MPI_Irsend – non-blocking ready send.
MPI_Irsend(&buf, count, datatype, dest, tag, comm, &request)
MPI_Test – checks whether a specified non-blocking send or receive operation has completed; MPI_Wait blocks until it has.
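A minimal sketch (again not from the slides) of the non-blocking pattern: post MPI_Isend/MPI_Irecv, do other work, then complete the operation with MPI_Wait.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, out = 7, in = 0;
    MPI_Request send_req, recv_req;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) {
        MPI_Isend(&out, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &send_req);
        /* ... useful computation can overlap the communication here ... */
        MPI_Wait(&send_req, MPI_STATUS_IGNORE);   /* out may be reused only after this */
    } else if (rank == 1) {
        MPI_Irecv(&in, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &recv_req);
        /* ... useful computation can overlap the communication here ... */
        MPI_Wait(&recv_req, MPI_STATUS_IGNORE);   /* in is valid only after this */
        printf("process 1 received %d\n", in);
    }
    MPI_Finalize();
    return 0;
}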
P2P Routines MUST be executed by only 2 processors
MPI_Init(…);
MPI_Comm_size(…);
MPI_Comm_rank(…);
…
if (rank == sender)        MPI_Send(…);   // executed by the sender only
else if (rank == receiver) MPI_Recv(…);   // executed by the receiver only
MPI_Finalize(…);
Collective Communication
Collective = communication that involves all processes in a communicator (e.g. MPI_COMM_WORLD).
Types of collective operations:
- Synchronization: processes wait until all members of the group have reached the synchronization point.
- Data movement: broadcast/reduce, scatter/gather, and their variants.
- Collective computation (reductions): one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.
Collective Communication
Some other properties:
- Collective operations are blocking.
- Collective communication routines do not take message tag arguments.
- They accept both MPI predefined and derived datatypes.
- In the variable-block variants (MPI_Scatterv, MPI_Gatherv, ...) the blocks can have different sizes.
Collective Routines MUST be executed by all processors
MPI_Init(…);
MPI_Comm_size(…);
MPI_Comm_rank(…);
…
MPI_Bcast(…);   // executed by all processors in the communicator
MPI_Finalize(…);
MPI_Bcast
int MPI_Bcast(
    void* buffer,           // the address of the buffer
    int count,              // number of elements to broadcast
    MPI_Datatype datatype,  // the MPI datatype of the elements
    int root,               // the root of the broadcast
    MPI_Comm comm           // the communicator to broadcast over
);
The message in buffer on root is broadcast to every process in comm.
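A minimal usage sketch (not from the slides): rank 0 broadcasts an integer parameter to all processes.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, n = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0) n = 100;                        /* only the root knows the value */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);  /* every process calls MPI_Bcast */
    printf("process %d now has n = %d\n", rank, n);
    MPI_Finalize();
    return 0;
}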
MPI_Reduce
int MPI_Reduce(
    void* sendbuf,          // the buffer to reduce on each process
    void* recvbuf,          // the destination buffer (significant on root only)
    int count,              // number of elements in the buffer
    MPI_Datatype datatype,  // data type of the elements
    MPI_Op op,              // operation used in the reduction
    int root,               // the root of the reduction
    MPI_Comm comm           // the communicator
);
The values from sendbuf on all processes are combined with the operation op and the result is stored in recvbuf on root. The MPI_Op can be MPI_SUM, MPI_PROD, MPI_MAX, MPI_MIN, etc. There are several other reduction routines, e.g. MPI_Allreduce.
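A minimal sketch of the MPI_Allreduce variant mentioned above (not from the slides): every process contributes one value and every process receives the sum.

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
    int rank, total = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* like MPI_Reduce, but the result ends up on every process (no root argument) */
    MPI_Allreduce(&rank, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
    printf("process %d: sum of all ranks = %d\n", rank, total);
    MPI_Finalize();
    return 0;
}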
MPI_Scatter
int MPI_Scatter(
    void* sendbuf,          // the buffer of the message to be sent (significant on root only)
    int sendcount,          // the number of elements sent to each process
    MPI_Datatype sendtype,  // the MPI_Datatype of the sent elements
    void* recvbuf,          // the receiving buffer
    int recvcount,          // the number of elements received by each process
    MPI_Datatype recvtype,  // the MPI_Datatype of the received elements
    int root,               // the root of the scattering
    MPI_Comm comm           // the communicator
);
The message in sendbuf on root is scattered in consecutive blocks of sendcount elements, one block per process in comm. Similar methods: MPI_Reduce_scatter, MPI_Alltoall, etc.
MPI_Gather
int MPI_Gather(
    void* sendbuf,          // the buffer of the message to be sent
    int sendcount,          // the number of elements sent by each process
    MPI_Datatype sendtype,  // the MPI_Datatype of the sent elements
    void* recvbuf,          // the receiving buffer (significant on root only)
    int recvcount,          // the number of elements received from each process
    MPI_Datatype recvtype,  // the MPI_Datatype of the received elements
    int root,               // the root of the gathering
    MPI_Comm comm           // the communicator
);
Gathers the blocks from sendbuf on every process into recvbuf on root. Similar method: MPI_Allgather.
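A minimal sketch (not from the slides): every process contributes its rank, and rank 0 gathers all of them into one array.

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int size, rank, i;
    int *all_ranks = NULL;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)                                   /* only the root needs the receive buffer */
        all_ranks = (int *) calloc(size, sizeof(int));
    MPI_Gather(&rank, 1, MPI_INT, all_ranks, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        for (i = 0; i < size; i++) printf("%d ", all_ranks[i]);
        printf("\n");
        free(all_ranks);
    }
    MPI_Finalize();
    return 0;
}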
Other Variants of Bcast, Reduce, Gather, Scatter
Variants of the collective communication routines:
1. "All" collective communication: MPI_Allgather, MPI_Allreduce, MPI_Alltoall.
2. Variable-block collective communication: MPI_Scatterv, MPI_Gatherv, MPI_Allgatherv.
3. MPI_Reduce_scatter.
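As an illustration of the variable-block variants (a minimal sketch, not from the slides), MPI_Scatterv lets the root send a different number of elements to each process:

#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int size, rank, i, mycount;
    int *data = NULL, *counts = NULL, *displs = NULL, *mypart;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    mycount = rank + 1;                             /* process r receives r+1 elements */
    mypart = (int *) calloc(mycount, sizeof(int));
    if (rank == 0) {
        int total = size * (size + 1) / 2;          /* 1 + 2 + ... + size elements in all */
        data   = (int *) calloc(total, sizeof(int));
        counts = (int *) calloc(size,  sizeof(int));
        displs = (int *) calloc(size,  sizeof(int));
        for (i = 0; i < total; i++) data[i] = i;
        for (i = 0; i < size; i++) {
            counts[i] = i + 1;                                         /* block size per process */
            displs[i] = (i == 0) ? 0 : displs[i-1] + counts[i-1];      /* block offsets */
        }
    }
    MPI_Scatterv(data, counts, displs, MPI_INT,
                 mypart, mycount, MPI_INT, 0, MPI_COMM_WORLD);
    printf("process %d received %d elements\n", rank, mycount);
    MPI_Finalize();
    return 0;
}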
Example: Compute the Overall Execution Time
Each processor knows its own execution time, e.g. the double variable time. Processor 0 then reduces these times with MPI_MAX to find the overall execution time.
Example:
time1 = MPI_Wtime();
// section to time
time2 = MPI_Wtime();
time = time2 - time1;
MPI_Reduce(&time, &overall_time, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
Working with Pointers
Consider the following declarations:
type var = value;        var is a variable carrying a value; &var is the address of the variable.
type * pointer = value;  pointer is the address of a location of type; *pointer is the value stored at that location.
Methods to allocate memory dynamically:
type * pointer = (type *) malloc(sizeof(type));      one location of type
type * pointer = (type *) calloc(n, sizeof(type));   n locations of type, initialised to zero
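A small self-contained sketch (not from the slides) of these allocation calls:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 5, i;
    double *single = (double *) malloc(sizeof(double));    /* one uninitialised double */
    double *block  = (double *) calloc(n, sizeof(double)); /* n doubles, set to 0.0    */
    *single = 3.14;
    for (i = 0; i < n; i++) block[i] = i * (*single);
    printf("single = %f, block[4] = %f\n", *single, block[4]);
    free(single);   /* every malloc/calloc needs a matching free */
    free(block);
    return 0;
}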
Working with Arrays
Static declaration: type array[1000];
array is a pointer: array == &array[0]
array + i is a pointer too: array + i == &array[i]
array itself cannot be changed (the array name is not assignable).
Dynamic declaration: type * array = (type *) calloc(n, sizeof(type));
array is a pointer; *(array + i) == array[i] is the content of the i-th element.
array itself can be changed (reassigned).
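A small sketch (not from the slides) checking these identities:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int fixed[1000];                                    /* static declaration  */
    int *dynamic = (int *) calloc(1000, sizeof(int));   /* dynamic declaration */
    int i = 10;
    printf("%d\n", fixed + i == &fixed[i]);             /* prints 1: same address */
    printf("%d\n", *(dynamic + i) == dynamic[i]);       /* prints 1: same element */
    dynamic = dynamic + 1;   /* legal: a pointer variable can be changed ... */
    /* fixed = fixed + 1;       ... but an array name cannot */
    free(dynamic - 1);       /* free must receive the original pointer */
    return 0;
}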
General MPI Program Structure:
include declarations (the MPI header)
the main function:
- initialise the MPI environment
- get the basic MPI elements: size, rank, etc.
- initialise the global data / array
- do the parallel work:
  - scatter the data to the local data
  - perform the computation on the local data
  - find the overall result
- terminate the MPI environment
some other functions
Operations on Arrays: Scatter + Compute + Reduce
Basic operation: find the sum, product, max, or min of an array with many elements.
Step 1. If rank == 0, read/load/generate the elements of the array.
Step 2. Scatter the array to the processors into a scattered (local) array.
Step 3. Compute the sum, product, max, or min of the scattered array.
Step 4. Reduce all the partial results to the global result.
Remark: which array operations really need parallel computing? For example, search and sorting.
#include "mpi.h" #include <stdio.h> int main(int argc, char * argv []) { int size, rank; int sum = 0, final_sum = 0; double time, final_time; int * scattered_array, * array; int n = ; int i; MPI_Init (&argc,&argv); MPI_Comm_size (MPI_COMM_WORLD, &size); MPI_Comm_rank (MPI_COMM_WORLD, &rank); scattered_array=(int *)calloc(n/size, sizeof(int)); array = (int *)calloc(n, sizeof(int)); if (rank == 0) for(i=0;i<n;i++)array[i]=1.; time=MPI_Wtime(); MPI_Scatter(array, n/size, MPI_INT, scattered_array, n/size, MPI_INT, 0, MPI_COMM_WORLD); for(i=0;i<n/size;i++)sum+=scattered_array[i]; MPI_Reduce(&sum,&final_sum,1,MPI_INT,MPI_SUM,0,MPI_COMM_WORLD); MPI_Reduce(&time,&final_time,1,MPI_DOUBLE,MPI_MAX,0,MPI_COMM_WORLD); time=MPI_Wtime()-time; if(rank==0) printf("the final sum is %d and the execution time %lf \n\n", final_sum, final_time); MPI_Finalize (); }
MPI Methods on Arrays
The root processor has an array with n components to process.
Objective: write an MPI function to process the array.
Any such MPI method must:
- take the inputs as plain arguments and the output as a pointer argument;
- take a root and an MPI_Comm communicator as arguments;
- return an error code: MPI_SUCCESS on success, an MPI error code otherwise;
- follow the simple scheme:
  1. Find rank and size.
  2. Scatter + Compute + Reduce.
  3. Check/return the error code after each MPI operation.
MPI_Array_sum
int MPI_Array_sum(int n, int *array, int *final_sum, int root, MPI_Comm comm)
{
    int size, rank, sum = 0, i;
    int rc;
    int *scattered_array;

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
    scattered_array = (int *) calloc(n / size, sizeof(int));

    rc = MPI_Scatter(array, n / size, MPI_INT, scattered_array, n / size, MPI_INT, root, comm);
    if (rc != MPI_SUCCESS) return rc;

    for (i = 0; i < n / size; i++) sum += scattered_array[i];

    rc = MPI_Reduce(&sum, final_sum, 1, MPI_INT, MPI_SUM, root, comm);
    if (rc != MPI_SUCCESS) return rc;

    free(scattered_array);
    return MPI_SUCCESS;
}
Remarks:
1. Call it as MPI_Array_sum(n, array, &final_sum, 0, MPI_COMM_WORLD);
2. Similar methods can be developed for the other operations (prod, max, min).
Scatter + Compute + Reduce: Complexity
Complexity analysis must consider:
- the time of computation: Tcom, the time per computation unit;
- the time of communication: Tcomm, the time per communication unit;
- the start-up time of a communication routine: Tstartup, the time per routine start-up.
MPI_Array_sum has the following elements (a cost sketch follows below):
- one scatter with n/size elements per process;
- the sum computation of n/size elements;
- one reduce with 1 element;
- OVERALL TIME: the sum of the three terms above.
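One way to fill in these blanks, as a sketch under the simplifying assumption that scatter and reduce are implemented as size - 1 sequential point-to-point messages from or to the root (a tree-based implementation replaces the factor size - 1 by roughly log2(size)):

T_scatter ≈ (size - 1) * (Tstartup + (n/size) * Tcomm)
T_sum     ≈ (n/size) * Tcom
T_reduce  ≈ (size - 1) * (Tstartup + Tcomm)
OVERALL   ≈ T_scatter + T_sum + T_reduce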
Example: Simple Sorting in Parallel
How to use Scatter + Sort + Combine for sorting:
- Scatter the array onto the processors.
- Sort each sub-array.
- Gather the sorted sub-arrays on processor 0.
- On processor 0, combine the sorted chunks. (How do we merge the sorted chunks?)
MPI_Array_sort
int MPI_Array_sort(int n, int *array, int root, MPI_Comm comm)
{
    int size, rank;
    int *scattered_array;

    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);
    scattered_array = (int *) calloc(n / size, sizeof(int));

    MPI_Scatter(array, n / size, MPI_INT, scattered_array, n / size, MPI_INT, root, comm);
    sort(n / size, scattered_array);   /* sequential sort of the local chunk */
    MPI_Gather(scattered_array, n / size, MPI_INT, array, n / size, MPI_INT, root, comm);
    if (rank == root) {
        // merge the size sorted chunks now stored consecutively in array
    }
    free(scattered_array);
    return MPI_SUCCESS;
}
Remarks: How can we merge the sorted chunks on root? What is the complexity?
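One possible answer to the merging question (a sketch, not the official solution from the slides): after MPI_Gather, array on root holds size sorted chunks of length n/size, and a simple k-way merge repeatedly picks the smallest head element among the chunks.

#include <stdlib.h>

static void merge_sorted_chunks(int n, int size, int *array)
{
    int chunk = n / size;
    int *out = (int *) malloc(n * sizeof(int));
    int *pos = (int *) calloc(size, sizeof(int));   /* next unread index in each chunk */
    int k, c;
    for (k = 0; k < n; k++) {
        int best = -1;                              /* chunk holding the current smallest head */
        for (c = 0; c < size; c++) {
            if (pos[c] < chunk &&
                (best < 0 || array[c*chunk + pos[c]] < array[best*chunk + pos[best]]))
                best = c;
        }
        out[k] = array[best*chunk + pos[best]];
        pos[best]++;
    }
    for (k = 0; k < n; k++) array[k] = out[k];      /* copy the merged result back */
    free(out);
    free(pos);
}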
Problem Solving
Write an MPI function for linear search:
- the root processor keeps the array;
- the array is scattered onto the processors;
- each processor does a linear search on its chunk;
- the search results are gathered on root;
- if rank == root, find the overall result from the gathered partial results.
ANALYSE THE COMPLEXITY
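A minimal sketch of such a function (my own illustration of the scheme above, not the official solution): it returns in *position the global index of the first occurrence of key, or -1 if key is not present, assuming n is a multiple of the number of processes.

#include "mpi.h"
#include <stdlib.h>

int MPI_Array_search(int n, int *array, int key, int *position, int root, MPI_Comm comm)
{
    int size, rank, i, local = -1, *found = NULL;
    MPI_Comm_size(comm, &size);
    MPI_Comm_rank(comm, &rank);

    int chunk = n / size;
    int *scattered_array = (int *) calloc(chunk, sizeof(int));
    MPI_Scatter(array, chunk, MPI_INT, scattered_array, chunk, MPI_INT, root, comm);

    /* local linear search: record the global index of the first local match */
    for (i = 0; i < chunk; i++)
        if (scattered_array[i] == key) { local = rank * chunk + i; break; }

    /* gather the per-process results on root */
    if (rank == root) found = (int *) calloc(size, sizeof(int));
    MPI_Gather(&local, 1, MPI_INT, found, 1, MPI_INT, root, comm);

    /* on root, the overall result is the smallest non-negative gathered index */
    if (rank == root) {
        *position = -1;
        for (i = 0; i < size; i++)
            if (found[i] >= 0 && (*position < 0 || found[i] < *position))
                *position = found[i];
        free(found);
    }
    free(scattered_array);
    return MPI_SUCCESS;
}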
Reading: from the LLNL MPI tutorial:
- the sections on collective communication;
- the full list of collective routines.