Collective Communication in MPI and Advanced Features


Collective Communication in MPI and Advanced Features Message passing patterns and Collective MPI routines

What MPI functions are commonly used? For simple applications, these are common:
Startup: MPI_Init(), MPI_Finalize()
Information on the processes: MPI_Comm_rank(), MPI_Comm_size(), MPI_Get_processor_name()
Point-to-point communication: MPI_Send() & MPI_Recv(); MPI_Isend() & MPI_Irecv(), MPI_Wait()
Collective communication: MPI_Allreduce(), MPI_Bcast(), MPI_Allgather()
http://mpitutorial.com/mpi-broadcast-and-collective-communication/

Blocking Message Passing
The call waits until the data transfer is done.
MPI_Send(): the sending process waits until all data are transferred to the system buffer.
MPI_Recv(): the receiving process waits until all data are transferred from the system buffer to the receive buffer.
Buffers can be freely reused afterwards.

Blocking Message Send
MPI_Send(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm);
buf: starting address of the buffer
count: number of buffer elements
dtype: datatype of the buffer elements
dest: rank of the destination process in the group associated with the communicator comm
tag: message label
comm: communication context that identifies a group of processes

Blocking Message Send
Standard: MPI_Send(). The sending process returns when the system can buffer the message or when the message is received and the buffer is ready for reuse.
Buffered: MPI_Bsend(). The sending process returns when the message is buffered in an application-supplied buffer.
Synchronous: MPI_Ssend(). The sending process returns only if a matching receive is posted and the receiving process has started to receive the message.
Ready: MPI_Rsend(). The message is sent as soon as possible (ASAP); it requires that a matching receive has already been posted.
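
As an illustration of the buffered mode, here is a minimal sketch (not from the original slides) of attaching a user buffer and sending with MPI_Bsend(); the message content and buffer sizing are illustrative assumptions.

/* Minimal buffered-send sketch (illustrative); run with at least 2 processes. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
int main(int argc, char *argv[]) {
    int rank, data = 42, bufsize;
    char *attach_buf;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    /* Space for one int message plus the per-message overhead MPI requires. */
    MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &bufsize);
    bufsize += MPI_BSEND_OVERHEAD;
    attach_buf = (char *)malloc(bufsize);
    MPI_Buffer_attach(attach_buf, bufsize);
    if (rank == 0) {
        MPI_Bsend(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD); /* returns once copied into attach_buf */
    } else if (rank == 1) {
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", data);
    }
    MPI_Buffer_detach(&attach_buf, &bufsize); /* completes any pending buffered sends */
    free(attach_buf);
    MPI_Finalize();
    return 0;
}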

Blocking Message Receive
MPI_Recv(void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Status *status);
buf: starting address of the buffer
count: number of buffer elements
dtype: datatype of the buffer elements
source: rank of the source process in the group associated with the communicator comm
tag: message label
comm: communication context that identifies a group of processes
status: returns information about the received message. Status information is useful when wildcards are used or the received message is smaller than expected. Status may also contain error codes.
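
Since the slide mentions wildcards, here is a small sketch (not from the original slides) of inspecting the MPI_Status after a wildcard receive; it assumes MPI has been initialized and that some process posts a matching send of at most 100 ints.

MPI_Status status;
int buf[100], actual_count;
MPI_Recv(buf, 100, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &status);
/* Who sent it, with which tag, and how many elements actually arrived? */
printf("source=%d tag=%d\n", status.MPI_SOURCE, status.MPI_TAG);
MPI_Get_count(&status, MPI_INT, &actual_count); /* may be less than 100 */
printf("received %d ints\n", actual_count);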

Example
if (rank == 0) {
    for (i = 0; i < 10; i++) buffer[i] = i;
    MPI_Send(buffer, 10, MPI_INT, 1, 123, MPI_COMM_WORLD);
}
else if (rank == 1) {
    for (i = 0; i < 10; i++) buffer[i] = -1;
    MPI_Recv(buffer, 10, MPI_INT, 0, 123, MPI_COMM_WORLD, &status);
    for (i = 0; i < 10; i++)
        if (buffer[i] != i)
            printf("Error: buffer[%d] = %d but is expected to be %d\n", i, buffer[i], i);
}
More examples at http://mpi.deino.net/mpi_functions/index.htm

Non-blocking Message Passing
Returns immediately after the data transfer is initiated.
Allows overlapping computation with communication.
Need to be careful though: if a send or receive buffer is updated before the transfer is over, the result will be wrong.
An MPI_Request object represents a handle on a non-blocking operation. It can be passed to MPI_Wait(), MPI_Waitall(), MPI_Waitany(), or MPI_Waitsome() to find out when the non-blocking operation it handles completes.

Non-blocking Message Passing
MPI_Isend(void *buf, int count, MPI_Datatype dtype, int dest, int tag, MPI_Comm comm, MPI_Request *req);
MPI_Irecv(void *buf, int count, MPI_Datatype dtype, int source, int tag, MPI_Comm comm, MPI_Request *req);
MPI_Wait(MPI_Request *req, MPI_Status *status);
A call to MPI_Wait returns when the operation identified by the request is complete.
req: the request handle used by a completion routine (such as MPI_Wait) to complete the operation.
Blocking: MPI_Send, MPI_Bsend, MPI_Ssend, MPI_Rsend, MPI_Recv
Non-blocking: MPI_Isend, MPI_Ibsend, MPI_Issend, MPI_Irsend, MPI_Irecv

What non-blocking means
MPI_Isend() and MPI_Irecv() are non-blocking, meaning that the function call returns before the communication is completed.
MPI_Isend() returns immediately, usually before the data has finished being sent. Thus, you must be extra careful not to change any of the data in the buffer that you passed as an argument to MPI_Isend().
MPI_Irecv() returns immediately, which allows the process to continue doing calculations while the data is still being received. However, before the receiving process actually uses the data, it must call MPI_Wait() to ensure that the data is valid.
Deadlock from blocked sends and receives is avoided with non-blocking communication, but other precautions must be taken: in particular, you need to be sure that at a certain point your data has effectively arrived. You may need to place an MPI_Wait() call for each send and/or receive to be sure it has completed before advancing in the program.
In this sense, blocking communications behave synchronously while non-blocking communications behave asynchronously. Asynchronous communication is often the key to achieving high performance with MPI: the biggest advantage of the non-blocking functions is that they allow processes to continue doing computations while communication with another process is still pending.
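
A hedged sketch (not from the slides) of the overlap idea; it assumes MPI is initialized, and left_neighbor, N, do_local_work(), and use_halo() are placeholder names for the application's own data and routines.

MPI_Request recv_req;
MPI_Status status;
double halo[N];
/* Post the receive early ... */
MPI_Irecv(halo, N, MPI_DOUBLE, left_neighbor, 0, MPI_COMM_WORLD, &recv_req);
/* ... compute on data that does not depend on the incoming message ... */
do_local_work();
/* ... and wait only when the received values are actually needed. */
MPI_Wait(&recv_req, &status);
use_halo(halo);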

Non-blocking Message Passing
…
right = (rank + 1) % nproc;
left = rank - 1;
if (left < 0) left = nproc - 1;
MPI_Irecv(buffer, 10, MPI_INT, left, 123, MPI_COMM_WORLD, &request);
MPI_Isend(buffer2, 10, MPI_INT, right, 123, MPI_COMM_WORLD, &request2);
MPI_Wait(&request, &status);
MPI_Wait(&request2, &status);

Message passing patterns
Point-to-point send-receive: a source process (e.g. Process 1) calls MPI_Send() and a destination process (e.g. Process 2) calls MPI_Recv(); a message containing the data travels from source to destination. This is the implementation of the send-receive pattern with explicit MPI send and receive routines.

Collective message-passing routines
A program might require many point-to-point send-receive patterns. Certain groups of send-receive patterns appear frequently and become slightly higher-level patterns.
Collective patterns are those that involve multiple processes. One process (the root / master) is the source of data sent to other processes, or the destination of data sent from other processes.
Collective message-passing patterns are implemented in MPI with higher efficiency than separate point-to-point routines, although such routines are not absolutely necessary.

Collective Communications
A single call handles the communication between all the processes in a communicator.
There are 3 types of collective communications:
Data movement (e.g. MPI_Bcast)
Reduction (e.g. MPI_Reduce)
Synchronization (e.g. MPI_Barrier)
You may find more examples at: http://mpi.deino.net/mpi_functions/index.htm

Broadcast pattern
Sends the same data from one source to each of a group of destination processes. A common pattern to get the same data to all processes, especially at the beginning of a computation.
Note: the patterns shown do not mean the implementation performs them exactly as drawn; only the final result is the same in any parallel implementation. Patterns do not describe the implementation.

Broadcast
int MPI_Bcast(void *buffer, int count, MPI_Datatype datatype, int root, MPI_Comm comm);
One process (the root) sends its data to all the other processes in the same communicator. (In the diagram, P1 holds A B C D before MPI_Bcast; afterwards P1, P2, P3, and P4 all hold A B C D.)
Must be called by all the processes with the same arguments.

MPI broadcast operation Sending same message to all processes in communicator Notice same routine called by each process, with same parameters. MPI processes usually execute the same program so this is a handy construction.

MPI_Bcast parameters
The root parameter identifies the source process; the buffer is the send buffer on the root and the receive buffer on every other process. Notice that there is no tag. All processes in the communicator must call MPI_Bcast() with the same parameters.
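
A minimal hedged sketch (not from the slides) of broadcasting a parameter from rank 0 to every rank; it assumes MPI is initialized, and the value 100 is an arbitrary example.

int rank, n;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);
if (rank == 0) n = 100;                          /* only the root has a meaningful value here */
MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);    /* every rank calls the same routine */
/* After the call, n == 100 on every process in MPI_COMM_WORLD. */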

Creating a broadcast with individual sends and receives
if (rank == 0) {
    for (i = 1; i < P; i++)
        MPI_Send(buf, N, MPI_INT, i, tag, MPI_COMM_WORLD);
} else
    MPI_Recv(buf, N, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);
Complexity of doing that is O(N * P), where N is the number of bytes in the message and P is the number of processors.

Likely MPI_Bcast implementation: a tree in which the number of processes that have the data doubles with each iteration (those that already have the data send it to those that receive it), so all P processes have the data after about log2 P steps. Complexity of the tree broadcast is therefore O(N * log2 P).
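
The fragment below is a hedged sketch of such a doubling (binomial) tree built from point-to-point calls, not the actual MPI_Bcast implementation; it assumes rank, size, buf, and N are already set up and that rank 0 holds the data initially.

/* Binomial-tree broadcast from rank 0: in each step, every rank that already
   has the data sends it to the rank 'step' positions higher. */
int step;
for (step = 1; step < size; step *= 2) {
    if (rank < step) {
        if (rank + step < size)
            MPI_Send(buf, N, MPI_INT, rank + step, 0, MPI_COMM_WORLD);
    } else if (rank < 2 * step) {
        MPI_Recv(buf, N, MPI_INT, rank - step, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
}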

MPI Scatter
MPI_Scatter() is a collective routine that is very similar to MPI_Bcast(): a root process sends data to all processes in a communicator.
The difference between MPI_Bcast() and MPI_Scatter() is small but important: MPI_Bcast() sends the same piece of data to all processes, while MPI_Scatter() sends different chunks of an array to different processes.

Scatter Pattern
Distributes a collection of data items from a source to a group of destination processes, with different data sent to each destination. A common pattern to get data to all processes; usually the data sent are parts of an array.

Scatter Pattern MPI_Bcast takes a single data element at the root process (the red box) and copies it to all other processes. MPI_Scatter takes an array of elements and distributes the elements in the order of process rank. The first element (in red) goes to process zero, the second element (in green) goes to process one, and so on.

Basic MPI scatter operation
Sending one or more contiguous elements of an array in the root process to a separate process. Notice the same routine is called by each process, with the same parameters. MPI processes usually execute the same program so this is a handy construction.

MPI scatter parameters
MPI_Scatter(sendbuf, sendcount, sendtype, recvbuf, recvcount, recvtype, root, comm): root identifies the source process. Usually the number of elements sent to each process and received by each process is the same. All processes in the communicator must call MPI_Scatter() with the same parameters. Notice that there is no tag.

Example
In the following code, the size of the send buffer is 100 * <number of processes> and 100 contiguous elements are sent to each process:
int main (int argc, char *argv[]) {
    int size, *sendbuf, recvbuf[100]; /* for each process */
    MPI_Init(&argc, &argv); /* initialize MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    sendbuf = (int *)malloc(size*100*sizeof(int));
    ...
    MPI_Scatter(sendbuf, 100, MPI_INT, recvbuf, 100, MPI_INT, 0, MPI_COMM_WORLD);
    MPI_Finalize(); /* terminate MPI */
}

Scatter Example (source: http://www.mpi-forum.org)

Scattering contiguous groups of elements to each process

Gather Pattern
Essentially the reverse of a scatter: it receives data items from a group of source processes at a single destination, where the data are collected into an array. A common pattern, especially at the end of a computation, to collect results.

MPI Gather Having one process collect individual values from set of processes (includes itself). As usual same routine called by each process, with same parameters.

Gather Pattern
int MPI_Gather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, int root, MPI_Comm comm)
One process (the root) collects data from all the other processes in the same communicator. (In the diagram, P1, P2, P3, P4 hold A, B, C, D respectively; after MPI_Gather the root P1 holds A B C D.)
Must be called by all the processes with the same arguments.

Gather parameters
sendcnt is the number of elements sent from each process; recvcnt is the number of elements received in any single receive (i.e. from each process, not the total). Usually the number of elements sent by each process and received from each process is the same. Note: all processes in the communicator must call MPI_Gather() with the same parameters.

Gather Example
To gather 10 data elements from each process into process 0, using dynamically allocated memory in the root process:
int data[10]; /* data to be gathered from processes */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank); /* find rank */
if (myrank == 0) {
    MPI_Comm_size(MPI_COMM_WORLD, &grp_size); /* find group size */
    buf = (int *)malloc(grp_size*10*sizeof(int)); /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);
…
Note: MPI_Gather() gathers from all processes, including the root.

Reduce Pattern
A common pattern to get data back to the master from all processes and then aggregate it by combining the collected data into one answer. The data are collected at the destination and combined with a reduction operation to produce a single result.
The reduction operation must be a binary operation that is commutative (changing the order of the operands does not change the result); this allows the implementation to perform the operations in any order.

MPI Reduce
A gather operation combined with a specified arithmetic/logical operation. Example: values could be gathered and then added together by the root with MPI_Reduce(). As usual, the same routine is called by each process, with the same parameters.

Reduce parameters
MPI_Reduce(sendbuf, recvbuf, count, datatype, op, root, comm). Note: all processes in the communicator must call MPI_Reduce() with the same parameters.

Reduction
int MPI_Reduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, int root, MPI_Comm comm)
One process (the root) collects data from all the other processes in the same communicator and performs an operation on the data. (In the diagram, P1, P2, P3, P4 contribute A, B, C, D; the root P1 ends up with A+B+C+D.)
Predefined operations include MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and a few more.
MPI_Op_create(): user-defined operator.

Implementation of reduction using a tree construction: for example, the leaves 14, 39, 53, 120, 66, 29 are summed pairwise (14+39 = 53, 53+120 = 173, 66+29 = 95), then 53+173 = 226, and finally 226+95 = 321. The reduction takes O(log2 P) steps with P processes.

MPI_Reduce Example
for (i = 0; i < N; i++)
    table[i] = ((float) random()) / RAND_MAX * 100;
printf("<rank %d>:", rank);
for (i = 0; i < N; i++)
    printf("%4d", table[i]);
printf("\n");
MPI_Reduce(table, result, N, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
if (rank == 0) {
    printf("\nAnswer:\n");
    for (i = 0; i < N; i++)
        printf("%4d", result[i]);
}

Complete sample MPI program with collective routines
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#define MAXSIZE 1000
int main(int argc, char *argv[]) {
    int myid, numprocs, data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    if (myid == 0) { /* Open input file and initialize data */
        strcpy(fn, getenv("HOME"));
        strcat(fn, "/MPI/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++) fscanf(fp, "%d", &data[i]);
    }
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD); /* broadcast data */
    x = MAXSIZE / numprocs; /* Add my portion of data */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++)
        myresult += data[i];
    printf("I got %d from %d\n", myresult, myid);
    /* Compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) printf("The sum is %d.\n", result);
    MPI_Finalize();
    return 0;
}

Summary

Combined Patterns

Gather to All
int MPI_Allgather(void *sendbuf, int sendcnt, MPI_Datatype sendtype, void *recvbuf, int recvcnt, MPI_Datatype recvtype, MPI_Comm comm)
Every process gathers data from all the other processes in the same communicator: after the call, each of P1, P2, P3, P4 holds the concatenation A B C D of everyone's contributions.
Must be called by all the processes with the same arguments.

Reduction to All
int MPI_Allreduce(void *sendbuf, void *recvbuf, int count, MPI_Datatype datatype, MPI_Op op, MPI_Comm comm)
Every process collects data from all the other processes in the same communicator and performs an operation on the data: after the call, each of P1, P2, P3, P4 holds A+B+C+D.
Predefined operations include MPI_SUM, MPI_MIN, MPI_MAX, MPI_PROD, logical AND, OR, XOR, and a few more.
MPI_Op_create(): user-defined operator.
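
A hedged sketch (not from the slides) of a global sum with MPI_Allreduce; it assumes MPI is initialized and uses the rank as a stand-in for a real partial result.

double local_sum = (double)rank;   /* stand-in for each rank's partial result */
double global_sum;
/* Every rank contributes local_sum and every rank receives the total. */
MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
/* global_sum now holds the same value on all ranks, so each can continue
   the larger computation without a separate MPI_Bcast. */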

For more functions…
http://www.mpi-forum.org
http://www.llnl.gov/computing/tutorials/mpi/
http://www.nersc.gov/nusers/help/tutorials/mpi/intro/
http://www-unix.mcs.anl.gov/mpi/tutorial/gropp/talk.html
http://www-unix.mcs.anl.gov/mpi/tutorial/
MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/)
Open MPI (http://www.open-mpi.org/)
MPI descriptions and examples are referred from http://mpi.deino.net/mpi_functions/index.htm and Stéphane Ethier (PPPL)'s PICSciE/PICASso Mini-Course slides.

Synchronization

MPI Collective Communication
Collective routines provide a higher-level way to organize a parallel program: each process executes the same communication operations, and communication and computation are coordinated among a group of processes in a communicator. Tags are not used, and there are no non-blocking collective operations (MPI-3 later added them).
Three classes of operations:
Synchronization: processes wait until all members of the group have reached the synchronization point. MPI_Barrier blocks all MPI processes in the given communicator until they all call this routine.
Data movement: broadcast, scatter/gather, all-to-all.
Collective computation (reductions): one member of the group collects data from the other members and performs an operation (min, max, add, multiply, etc.) on that data.

Scope: collective communication routines must involve all processes within the scope of a communicator. All processes are, by default, members of the communicator MPI_COMM_WORLD; additional communicators can be defined by the programmer. Unexpected behavior, including program failure, can occur if even one task in the communicator doesn't participate. It is the programmer's responsibility to ensure that all processes within a communicator participate in any collective operation.

Synchronization
int MPI_Barrier(MPI_Comm comm)
Synchronization operation. Creates a barrier synchronization in a group. Each task, when reaching the MPI_Barrier call, blocks until all tasks in the group reach the same MPI_Barrier call. Then all tasks are free to proceed.
#include <mpi.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
    printf("Hello, world. I am %d of %d\n", rank, nprocs);
    MPI_Finalize();
    return 0;
}

Synchronization
MPI_Barrier(comm) blocks until all processes in the group of the communicator comm call it. Not used often; sometimes used in measuring performance and load balancing.

Collective Data Movement:
MPI_Bcast: data movement operation. Broadcasts (sends) a message from the process with rank "root" to all other processes in the group.
MPI_Scatter: data movement operation. Distributes distinct messages from a single source task to each task in the group.
MPI_Gather: data movement operation. Gathers distinct messages from each task in the group to a single destination task. This routine is the reverse operation of MPI_Scatter.
MPI_Allgather: data movement operation. Concatenation of data to all tasks in a group. Each task in the group, in effect, performs a one-to-all broadcasting operation within the group.
MPI_Reduce: collective computation operation. Applies a reduction operation on all tasks in the group and places the result in one task.
MPI_Allreduce: collective computation operation plus data movement. Applies a reduction operation and places the result in all tasks in the group. This is equivalent to an MPI_Reduce followed by an MPI_Bcast.

Collective Data Movement: Broadcast, Scatter, and Gather (diagram of processes P0-P3 exchanging items A, B, C, D via Broadcast, Scatter, and Gather)

Broadcast Data belonging to a single process is sent to all of the processes in the communicator. Copyright © 2010, Elsevier Inc. All rights Reserved

Comments on Broadcast
All collective operations must be called by all processes in the communicator.
MPI_Bcast is called by both the sender (called the root process) and the processes that are to receive the broadcast; MPI_Bcast is not a "multi-send".
The "root" argument is the rank of the sender; this tells MPI which process originates the broadcast and which ones receive it.
If one process does not call the collective, the program hangs.

MPI_Reduce

Predefined reduction operators in MPI include MPI_MAX, MPI_MIN, MPI_SUM, MPI_PROD, MPI_LAND, MPI_BAND, MPI_LOR, MPI_BOR, MPI_LXOR, MPI_BXOR, MPI_MAXLOC, and MPI_MINLOC. Copyright © 2010, Elsevier Inc. All rights Reserved

MPI_Allreduce Useful in a situation in which all of the processes need the result of a global sum in order to complete some larger computation. Copyright © 2010, Elsevier Inc. All rights Reserved

MPI Collective Routines: Summary
Many routines: Allgather, Allgatherv, Allreduce, Alltoall, Alltoallv, Bcast, Gather, Gatherv, Reduce, Reduce_scatter, Scan, Scatter, Scatterv.
"All" versions deliver results to all participating processes. "V" versions allow the chunks to have variable sizes.
Allreduce, Reduce, Reduce_scatter, and Scan take both built-in and user-defined combiner functions.
MPI-2 adds Alltoallw, Exscan, and intercommunicator versions of most routines.
Allgatherv: variable amounts of data on each processor. Allreduce: everyone gets a copy of the sum.

Extra Example (Self Study): MPI PI program

Example of an MPI PI program using 6 functions
Using basic MPI functions: MPI_INIT, MPI_FINALIZE, MPI_COMM_SIZE, MPI_COMM_RANK
Using MPI collectives: MPI_BCAST, MPI_REDUCE
Slide source: Bill Gropp, ANL

Midpoint rule for approximating the integral of f(x) over [a, b] using the midpoint xm.
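
For reference, the rule the PI program on the next slides applies (reconstructed from that code, not from the slide image) is the composite midpoint approximation of pi:

\pi = \int_0^1 \frac{4}{1+x^2}\,dx \approx h \sum_{i=1}^{n} \frac{4}{1 + x_i^2},
\qquad x_i = h\left(i - \tfrac{1}{2}\right), \quad h = \frac{1}{n}.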

Example: PI in C - 1 (input and broadcast parameters)
#include "mpi.h"
#include <math.h>
#include <stdio.h>
int main(int argc, char *argv[]) {
    int done = 0, n, myid, numprocs, i, rc;
    double PI25DT = 3.141592653589793238462643;
    double mypi, pi, h, sum, x, a;
    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
    while (!done) {
        if (myid == 0) {
            printf("Enter the number of intervals: (0 quits) ");
            scanf("%d", &n);
        }
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (n == 0) break;
Slide source: Bill Gropp, ANL

Example: PI in C - 2 (compute local pi values, then the summation)
        h = 1.0 / (double) n;
        sum = 0.0;
        for (i = myid + 1; i <= n; i += numprocs) {
            x = h * ((double)i - 0.5);
            sum += 4.0 / (1.0 + x*x);
        }
        mypi = h * sum;
        MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (myid == 0)
            printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    }
    MPI_Finalize();
    return 0;
}
Slide source: Bill Gropp, ANL

Collective vs. Point-to-Point Communications
All the processes in the communicator must call the same collective function. Will this program work?
if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&a, MPI_INT, MPI_SUM, 0, 0, MPI_COMM_WORLD);
Copyright © 2010, Elsevier Inc. All rights Reserved

Collective vs. Point-to-Point Communications
All the processes in the communicator must call the same collective function. For example, a program that attempts to match a call to MPI_Reduce on one process with a call to MPI_Recv on another process is erroneous, and, in all likelihood, the program will hang or crash.
if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Recv(&a, MPI_INT, MPI_SUM, 0, 0, MPI_COMM_WORLD);
Copyright © 2010, Elsevier Inc. All rights Reserved

Collective vs. Point-to-Point Communications
The arguments passed by each process to an MPI collective communication must be "compatible." Will this program work?
if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);
Copyright © 2010, Elsevier Inc. All rights Reserved

Collective vs. Point-to-Point Communications
The arguments passed by each process to an MPI collective communication must be "compatible." For example, if one process passes in 0 as the dest_process and another passes in 1, then the outcome of a call to MPI_Reduce is erroneous, and, once again, the program is likely to hang or crash.
if (my_rank == 0)
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
else
    MPI_Reduce(&a, &b, 1, MPI_INT, MPI_SUM, 1, MPI_COMM_WORLD);
Copyright © 2010, Elsevier Inc. All rights Reserved

Example of MPI_Reduce execution
Multiple calls to MPI_Reduce with MPI_SUM and Proc 0 as destination (root). Suppose every process has a = 1 and c = 2, but the processes call MPI_Reduce(&a, &b, ...) and MPI_Reduce(&c, &d, ...) in different orders (e.g. Proc 1 reduces c first, then a). Is b = 3 on Proc 0 after the two MPI_Reduce() calls? Is d = 6 on Proc 0?

Example: Output results
The names of the memory locations are irrelevant to the matching of the calls to MPI_Reduce; the order of the calls determines the matching. So the value stored in b will be 1+2+1 = 4, and the value stored in d will be 2+1+2 = 5. Copyright © 2010, Elsevier Inc. All rights Reserved

Example: Parallel Matrix-Vector Multiplication (collective communication application; textbook pp. 113-116)

Matrix-vector multiplication: y= A * x

Partitioning and task graph for matrix-vector multiplication: yi = (row i of A) · x

Execution schedule and task mapping: yi = (row i of A) · x

Data Partitioning and Mapping for y= A*x

SPMD Code for y= A*x

Evaluation: Parallel Time
Ignore the cost of local address calculation. Each task performs n additions and n multiplications, and each addition/multiplication costs ω, so each task takes approximately 2nω. With the n row-tasks mapped evenly onto p processes, the parallel time is approximately 2n²ω / p.

How is initial data distributed? Assume initially matrix A and vector x are distributed evenly among processes Need to redistribute vector x to everybody in order to perform parallel computation! What MPI collective communication is needed?

Communication Pattern for Data Redistribution
The data requirement for Process 0 alone corresponds to MPI_Gather; the data requirement for all processes corresponds to MPI_Allgather.

MPI Code for Gathering Data
Gather the data for Process 0, then repeat for all processes (which is what MPI_Allgather accomplishes in a single call).

Allgather
Concatenates the contents of each process' send_buf_p and stores this in each process' recv_buf_p. As usual, recv_count is the amount of data being received from each process. Copyright © 2010, Elsevier Inc. All rights Reserved

MPI SPMD Code for y=A*x

MPI SPMD Code for y=A*x
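
The code for these two slides is not reproduced in the transcript. Below is a hedged sketch of what an SPMD y = A*x with a block-row distribution and MPI_Allgather might look like; the function name, argument names, and the assumption that n is divisible by the number of processes are illustrative choices, not the textbook's exact code.

#include <mpi.h>
#include <stdlib.h>
/* Each process owns n_local = n / comm_sz rows of A (stored row-major in local_A),
   n_local entries of x in local_x, and computes its n_local entries of y in local_y. */
void mat_vect_mult(double local_A[], double local_x[], double local_y[],
                   int n_local, int n, MPI_Comm comm) {
    double *x = (double *)malloc(n * sizeof(double)); /* full vector, gathered on every rank */
    int i, j;
    /* Redistribute x so every process has the whole vector. */
    MPI_Allgather(local_x, n_local, MPI_DOUBLE, x, n_local, MPI_DOUBLE, comm);
    /* Each process computes its own block of rows of y. */
    for (i = 0; i < n_local; i++) {
        local_y[i] = 0.0;
        for (j = 0; j < n; j++)
            local_y[i] += local_A[i * n + j] * x[j];
    }
    free(x);
}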

Performance Evaluation of Matrix Vector Multiplication Copyright © 2010, Elsevier Inc. All rights Reserved

How to measure elapsed parallel time Use MPI_Wtime() that returns the number of seconds that have elapsed since some time in the past. Copyright © 2010, Elsevier Inc. All rights Reserved

Measure elapsed sequential time in Linux
This code works for Linux without using MPI functions. Use GET_TIME(), which returns the time in seconds elapsed from some point in the past. Sample code for GET_TIME():
#include <sys/time.h>
/* The argument now should be a double (not a pointer to a double) */
#define GET_TIME(now) { \
    struct timeval t; \
    gettimeofday(&t, NULL); \
    now = t.tv_sec + t.tv_usec/1000000.0; \
}

Measure elapsed sequential time Copyright © 2010, Elsevier Inc. All rights Reserved

Use MPI_Barrier() before time measurement
Don't start timing until every process in the communicator has reached the same point: call MPI_Barrier() first so all processes begin timing from (roughly) the same time stamp.
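
A hedged sketch of this timing pattern (close in spirit to the textbook's, but with illustrative variable names); it assumes MPI is initialized and my_rank is known, and it reports the time of the slowest process.

double local_start, local_elapsed, elapsed;
MPI_Barrier(MPI_COMM_WORLD);              /* line all processes up first */
local_start = MPI_Wtime();
/* ... code being timed ... */
local_elapsed = MPI_Wtime() - local_start;
/* The parallel time is the time of the slowest process. */
MPI_Reduce(&local_elapsed, &elapsed, 1, MPI_DOUBLE, MPI_MAX, 0, MPI_COMM_WORLD);
if (my_rank == 0)
    printf("Elapsed time = %e seconds\n", elapsed);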

Run-times of serial and parallel matrix-vector multiplication (Seconds) Copyright © 2010, Elsevier Inc. All rights Reserved

Speedup and Efficiency
Speedup S(n, p) = T_serial(n) / T_parallel(n, p); Efficiency E(n, p) = S(n, p) / p. Copyright © 2010, Elsevier Inc. All rights Reserved

Speedups of Parallel Matrix-Vector Multiplication Copyright © 2010, Elsevier Inc. All rights Reserved

Efficiencies of Parallel Matrix-Vector Multiplication Copyright © 2010, Elsevier Inc. All rights Reserved

Scalability
A program is scalable if the problem size can be increased at a rate such that the efficiency doesn't decrease as the number of processes increases.
Programs that can maintain a constant efficiency without increasing the problem size are sometimes said to be strongly scalable.
Programs that can maintain a constant efficiency if the problem size increases at the same rate as the number of processes are sometimes said to be weakly scalable.
Copyright © 2010, Elsevier Inc. All rights Reserved

Safety Issues in MPI programs

Safety in MPI programs
Is it a safe program? (Assume tags/process IDs are assigned properly.)
Process 0: Send(1); Recv(1)
Process 1: Send(0); Recv(0)
Copyright © 2010, Elsevier Inc. All rights Reserved

Safety in MPI programs
Process 0: Send(1); Recv(1)
Process 1: Send(0); Recv(0)
Is it a safe program? (Assume tags/process IDs are assigned properly.) It may be unsafe because the MPI standard allows MPI_Send to behave in two different ways: it can simply copy the message into an MPI-managed buffer and return, or it can block until the matching call to MPI_Recv starts. Copyright © 2010, Elsevier Inc. All rights Reserved

Buffering a message implicitly during MPI_Send()
When you send data, where does it go? One possibility: the user data on Process 0 is copied into a local buffer, travels over the network into a local buffer on Process 1, and from there into the user data. This doubles the storage, but is safe.
Slide source: Bill Gropp, ANL

Avoiding Buffering
Avoiding copies uses less memory but may use more time: the user data on Process 0 goes over the network directly into the user data on Process 1, and MPI_Send() waits until a matching receive is executed. This is a time-space tradeoff.
Slide source: Bill Gropp, ANL

Safety in MPI programs
Many implementations of MPI set a threshold at which the system switches from buffering to blocking: relatively small messages will be buffered by MPI_Send, while larger messages will cause it to block.
If the MPI_Send() executed by each process blocks, no process will be able to start executing a call to MPI_Recv, and the program will hang or deadlock: each process is blocked waiting for an event that will never happen.
Copyright © 2010, Elsevier Inc. All rights Reserved

Example of unsafe MPI code with possible deadlocks
Process 0: Send(1); Recv(1)
Process 1: Send(0); Recv(0)
Send a large message from process 0 to process 1. If there is insufficient storage at the destination, the send must wait for the user to provide the memory space (through a receive). Buffering doubles the memory but lets the sender continue computing past the send(). This may be "unsafe" because it depends on the availability of system buffers in which to store the data sent until it can be received.
Slide source: Bill Gropp, ANL

Safety in MPI programs A program that relies on MPI provided buffering is said to be unsafe. Such a program may run without problems for various sets of input, but it may hang or crash with other sets. Copyright © 2010, Elsevier Inc. All rights Reserved

How can we tell if a program is unsafe Replace MPI_Send() with MPI_Ssend() The extra “s” stands for synchronous and MPI_Ssend is guaranteed to block until the matching receive starts. If the new program does not hang/crash, the original program is safe. MPI_Send() and MPI_Ssend() have the same arguments Copyright © 2010, Elsevier Inc. All rights Reserved

Some Solutions to the "unsafe" Problem
Order the operations more carefully:
Process 0: Send(1); Recv(1)
Process 1: Recv(0); Send(0)
Or use a simultaneous send and receive in one call:
Process 0: Sendrecv(1)
Process 1: Sendrecv(0)
Trouble with Sendrecv: the calls need to be paired up and called at about the same time.
Slide source: Bill Gropp, ANL

Restructuring communication in odd-even sort Copyright © 2010, Elsevier Inc. All rights Reserved

Use MPI_Sendrecv() to conduct a blocking send and a receive in a single call. Copyright © 2010, Elsevier Inc. All rights Reserved
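
A hedged sketch (not the textbook's odd-even sort code) of a ring exchange with MPI_Sendrecv(); the neighbor computation, payload, and tag value are illustrative, and rank and nprocs are assumed to have been obtained already.

/* Exchange one int with both ring neighbors, deadlock-free, in a single call. */
int right = (rank + 1) % nprocs;
int left  = (rank - 1 + nprocs) % nprocs;
int send_val = rank, recv_val;
MPI_Sendrecv(&send_val, 1, MPI_INT, right, 0,   /* what we send, and to whom */
             &recv_val, 1, MPI_INT, left,  0,   /* where we receive, and from whom */
             MPI_COMM_WORLD, MPI_STATUS_IGNORE);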

More Solutions to the "unsafe" Problem
Supply your own space as the buffer for the send (buffered send, Bsend: you supply the buffer space yourself):
Process 0: Bsend(1); Recv(1)
Process 1: Bsend(0); Recv(0)
Or use non-blocking operations:
Process 0: Isend(1); Irecv(1); Waitall
Process 1: Isend(0); Irecv(0); Waitall
Immediate send: you promise not to touch the buffer until done (need to call Wait/Waitall later). Immediate receive: returns right away; you promise not to touch the buffer until done (need to call Wait/Waitall later). Riskier (race conditions possible) but allows more overlap.

Concluding Remarks (1) MPI works in C, C++, or Python A communicator is a collection of processes that can send messages to each other. Many parallel programs use the SPMD approach. Most serial programs are deterministic: if we run the same program with the same input we’ll get the same output. Parallel programs often don’t possess this property. Collective communications involve all the processes in a communicator. Copyright © 2010, Elsevier Inc. All rights Reserved

Concluding Remarks (2) Performance evaluation Use elapsed time or “wall clock time”. Speedup = sequential/parallel time Efficiency = Speedup/ p If it’s possible to increase the problem size (n) so that the efficiency doesn’t decrease as p is increased, a parallel program is said to be scalable. An MPI program is unsafe if its correct behavior depends on the fact that MPI_Send is buffering its input. Copyright © 2010, Elsevier Inc. All rights Reserved