
1 Message-Passing Computing Collective patterns and MPI routines 1 - Transferring data
ITCS 4/5145 Parallel Computing, UNC-Charlotte, B. Wilkinson, Feb 15, 2014.

2 Recap
[Figure: point-to-point send-receive - Process 1 (source) calls MPI_Send(), Process 2 (destination) calls MPI_Recv(), and a message containing the data travels from source to destination.]
Implementation of send-receive pattern with explicit MPI send and receive routines

3 Collective patterns A program might require many point-to-point send-receive patterns. Certain groups of send-receive patterns appear frequently and become slightly higher level patterns. Collective patterns are those that involve multiple processes, one of which is source of data sent to other processes or final destination of data received from other processes. Implemented in MPI with higher efficiency than separate point-to-point routines although collective routines not absolutely necessary for the programmer.

4 Broadcast Pattern Sends same data to each of a group of processes. A common pattern to get same data to all processes, especially at beginning of a computation. [Figure: one source sends the same data to all destinations.] Note: Patterns given do not mean the implementation does them as shown. Only the final result is the same in any parallel implementation. Patterns do not describe the implementation.

5 Broadcast in MPI Sending same data to all processes in communicator.
Notice same routine called by each process, with same parameters. MPI processes execute the same program so this is a handy construction.

6 MPI_Bcast parameters Also receive buffer in destinations All processes in the Communicator must call the MPI_Bcast with the same parameters
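The slide's parameter diagram is not reproduced above, so for reference here is the standard C prototype of MPI_Bcast (the buffer acts as the send buffer on the root and as the receive buffer on every other process):

int MPI_Bcast(
   void *buffer,           /* address of data to broadcast, or to receive into */
   int count,              /* number of items in buffer                        */
   MPI_Datatype datatype,  /* type of each item, e.g. MPI_INT                  */
   int root,               /* rank of the broadcasting process                 */
   MPI_Comm comm);         /* communicator, e.g. MPI_COMM_WORLD                */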

7 Using MPI Broadcast routine
Broadcast an array data[N] (N = 1000) from process 0 (rank = 0) to every process in the group.

#include "mpi.h"
#include <stdio.h>
#define N 1000
int main(int argc, char **argv)
{
   int rank, P;
   int data[N];      /* name of the array is a pointer to it - the source (and destination) buffer */
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   MPI_Bcast(data, N, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}

Notice that there is no tag.

8 Implementation - Using a series of point-to-point send-receive patterns to create a broadcast pattern
[Figure: the source process sends to each destination process in turn.]
Requires P - 1 sequential send-receive patterns with P processes: O(P×N) communication time complexity with N data items being broadcast, or O(P) communication time complexity for constant data size

9 MPI code for broadcast with sends and receives
if (rank == 0) {
   for (int i = 1; i < P; i++)
      MPI_Send(data, N, MPI_INT, i, tag, MPI_COMM_WORLD);
} else
   MPI_Recv(data, N, MPI_INT, 0, tag, MPI_COMM_WORLD, &status);

Communication time complexity of doing that is O(N * P), where the message contains N data items and there are P processes.

10 Likely implementation in MPI – using tree distribution pattern
[Figure: tree distribution to destination processes - the number of processes that have the data doubles with each iteration.]
Communication time complexity = O(N log2 P) with N data items being broadcast, or O(log2 P) time complexity for constant data size, assuming all processes operate in synchronism and can send messages at the same time as implied in the figure.
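The slides give no code for this tree distribution, but purely as an illustration (a sketch, not MPI_Bcast's actual implementation), such a doubling pattern could be built from the point-to-point calls already shown; rank, P, data, N, tag and status are assumed to be set up as in the earlier examples:

for (int step = 1; step < P; step *= 2) {    /* number of holders doubles each step */
   if (rank < step) {                        /* this process already has the data   */
      int partner = rank + step;             /* process it passes the data to       */
      if (partner < P)
         MPI_Send(data, N, MPI_INT, partner, tag, MPI_COMM_WORLD);
   } else if (rank < 2 * step) {             /* this process receives the data now  */
      MPI_Recv(data, N, MPI_INT, rank - step, tag, MPI_COMM_WORLD, &status);
   }
}

With P processes this takes about log2 P communication steps, matching the O(N log2 P) figure above.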

11 Scatter Pattern Distributes a collection of data items to a group of processes. A common pattern to get data to all processes. Usually data sent are parts of an array. [Figure: the source sends different data to each destination.]

12 Scattering contiguous elements of an array
[Figure: the array in the root process is divided into groups of contiguous data items, and one group is sent to each destination process.]
Scatter usually associated with sending one or more contiguous elements of a 1-D array to each destination process

13 Implementation – using tree distribution pattern
[Figure: tree distribution to destination processes.] P0 first sends half of the array to P4, keeping the other half. Then P0 and P4 each divide their half in half again and send one part to P2 and P6 respectively. Finally P0, P2, P4, and P6 each divide their data in half again and send one part to P1, P3, P5, and P7 respectively.

14 Communication time complexity
To discuss

15 MPI Scatter routine Sending each element of an array in root process to a separate process. Contents of ith location of array sent to ith process (includes root). Notice same routine called by each process, with same parameters. MPI processes execute the same program so this is a handy construction.

16 MPI scatter parameters
All processes in the Communicator must call the MPI_Scatter with the same parameters
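The parameter diagram itself is not reproduced; for reference, the standard C prototype of MPI_Scatter is:

int MPI_Scatter(
   void *sendbuf,          /* data to distribute (significant only at the root) */
   int sendcount,          /* number of items sent to each process              */
   MPI_Datatype sendtype,  /* type of the items sent                            */
   void *recvbuf,          /* where each process receives its portion           */
   int recvcount,          /* number of items received by each process          */
   MPI_Datatype recvtype,  /* type of the items received                        */
   int root,               /* rank of the sending (root) process                */
   MPI_Comm comm);         /* communicator                                      */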

17 Using MPI scatter routine
Scatter one element of array senddata[100] from process 0 (rank = 0) to each process in the group. This code assumes a fixed number of processes, 100 processes. See later for cases when the number of processes is not known at compile time.

#include "mpi.h"
#include <stdio.h>
#define N 100
int main(int argc, char **argv)
{
   int rank, P;
   int senddata[N], recvdata;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   MPI_Scatter(senddata, 1, MPI_INT, &recvdata, 1, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}

Process 0 is the source. The single received integer must be given by its address (&recvdata). Notice that there is no tag.

18 Scattering contiguous groups of elements to each process

19 Using MPI scatter routine
Scatter 10 elements of array senddata[1000] from process 0 (rank = 0) to each process in the group (100 processes).

#include "mpi.h"
#include <stdio.h>
#define N 1000
int main(int argc, char **argv)
{
   int rank, P;
   int senddata[N], recvdata[10];
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   MPI_Scatter(senddata, 10, MPI_INT, recvdata, 10, MPI_INT, 0, MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}

Usually the number of elements sent to each process and received by each process is the same.

20 There is a version of scatter called MPI_Scatterv that can jump over parts of the array:
MPI_Scatterv Example (source:
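The slide's example is not reproduced above; as a small sketch of the idea (hypothetical counts and displacements, assuming P = 4 and that rank has been obtained as before), MPI_Scatterv takes per-process counts and displacements, so parts of the send array can be skipped:

int sendcounts[4] = {3, 2, 4, 1};   /* number of elements for ranks 0..3 (assumed P = 4)  */
int displs[4]     = {0, 4, 8, 14};  /* starting index in sendbuf for each rank; the gaps  */
                                    /* "jump over" parts of the array                     */
int sendbuf[20];                    /* only meaningful on the root                        */
int recvbuf[4];                     /* large enough for this rank's share                 */
MPI_Scatterv(sendbuf, sendcounts, displs, MPI_INT,
             recvbuf, sendcounts[rank], MPI_INT, 0, MPI_COMM_WORLD);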

21 Scattering Rows of a Matrix
Since C stores multi-dimensional arrays in row-major order, scattering rows of a matrix is easy. Suppose P0 is to have the first 3 rows, P1 the next 3 rows etc.: P0 P1 P2 P3

22 Scatter Example

#include "mpi.h"
#include <stdio.h>
#include <math.h>
#define N 10
int main(int argc, char **argv)
{
   int rank, P, i, j;
   int table[N][N];
   int row[N][N];
   int blksz;
   MPI_Init(&argc, &argv);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
   MPI_Comm_size(MPI_COMM_WORLD, &P);
   blksz = (int) ceil((double) N / P);

Code and results from Dr. Ferner

23 (Scatter Example continued)

   if (rank == 0) {
      printf("<rank %d initial>: blksz = %d\n", rank, blksz);
      for (i = 0; i < N; i++)              // Initialize with multiplication table
         for (j = 0; j < N; j++)
            table[i][j] = i*j;
      for (i = 0; i < N; i++) {            // Display the table
         printf("<rank %d initial>: ", rank);
         for (j = 0; j < N; j++) {
            printf("%4d", table[i][j]);
         }
         printf("\n");
      }
   }

   // All processes do this
   MPI_Scatter(table, blksz*N, MPI_INT, row, blksz*N, MPI_INT, 0, MPI_COMM_WORLD);

24 (Scatter Example continued)

   // All processes print what they get
   for (i = 0; i < blksz && rank * blksz + i < N; i++) {
      printf("<rank %d>:", rank);
      for (j = 0; j < N; j++) {
         printf("%4d", row[i][j]);
      }
      printf("\n");
   }
   if (rank * blksz >= N)
      printf("<rank %d>: no data\n", rank);

   MPI_Finalize();
   return 0;
}

Don't go beyond N. Some processes may get fewer than blksz rows.

25 Scatter Example Output (N=10, P=8)
<rank 0>: blksz = 2
[Output: ten "<rank 0 initial>:" lines listing the rows of the 10×10 multiplication table, i.e. row i contains i*0 .. i*9.]

26 Scatter Example Output (N=10, P=8)
[Output: ranks 0 through 4 each print the two rows of the table they received (two "<rank r>:" lines each); ranks 5, 6, and 7 print "<rank r>: no data".]
Since each process gets 2 rows, only the first 5 processes get any data (because 5*2 >= 10).

27 Scatter Example Output (N=10, P=4)
<rank 0>: blksz = 3
[Output: rank 0 again prints the ten "<rank 0 initial>:" rows of the 10×10 multiplication table.]

28 Scatter Example Output (N=10, P=4)
[Output: ranks 0, 1, and 2 each print the three rows they received, and rank 3 prints the one remaining row; the lines appear interleaved across ranks.]

29 Scattering Columns of a Matrix
What if we want to scatter columns? PE 0 PE 1 PE 2 PE 3

30 Scattering Columns of a Matrix
Could use MPI_Datatype and MPI_Type_vector features of MPI (not covered here) OR An easier solution -- transpose matrix, then scatter rows (although transpose incurs an overhead especially for large matrices).
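As a rough sketch of the second (easier) approach only - not code from the slides, and assuming an N×N int matrix a, P dividing N, and buffers oversized as in the earlier row-scatter example - the root can transpose and then use the ordinary row scatter:

int a[N][N], at[N][N], mycols[N][N];   /* input matrix, its transpose, and this process's share */
int blksz = N / P;                     /* columns per process (assumes P divides N)             */
if (rank == 0)
   for (int i = 0; i < N; i++)
      for (int j = 0; j < N; j++)
         at[j][i] = a[i][j];           /* transpose: column j of a becomes row j of at          */
MPI_Scatter(at, blksz*N, MPI_INT, mycols, blksz*N, MPI_INT, 0, MPI_COMM_WORLD);
/* each process now holds blksz original columns, stored as consecutive rows of mycols */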

31 Gather pattern Essentially the reverse of the scatter pattern: it receives data items from a group of processes. [Figure: data from each source is collected at the destination in an array.] Common pattern especially at the end of a computation to collect results.

32 Gathering contiguous elements of an array
Root process One group of data items received from each source process Gather Source processes Array Gather usually associated with sending one or more contiguous elements of a 1-D array from each source process

33 MPI Gather routine Having one process collect individual values from set of processes. MPI_Gather() gathers from all processes, including root. Same routine called by each process, with same parameters.

34 Gather parameters
The receive count is the number of items in any single receive (i.e. from each process, not the total). All processes in the Communicator must call the MPI_Gather with the same parameters
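For reference (the parameter diagram is not reproduced), the standard C prototype of MPI_Gather mirrors that of MPI_Scatter:

int MPI_Gather(
   void *sendbuf,          /* data contributed by this process                        */
   int sendcount,          /* number of items sent by each process                    */
   MPI_Datatype sendtype,  /* type of the items sent                                  */
   void *recvbuf,          /* receive buffer (significant only at the root)           */
   int recvcount,          /* number of items received from each process (not total)  */
   MPI_Datatype recvtype,  /* type of the items received                              */
   int root,               /* rank of the gathering (root) process                    */
   MPI_Comm comm);         /* communicator                                            */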

35 Gather Example To gather 10 data elements from each process in the group into process 0, using dynamically allocated memory in the root process to allow a variable number of processes:

int data[10];                                    /* data to be gathered from processes */
MPI_Comm_rank(MPI_COMM_WORLD, &myrank);          /* find rank */
if (myrank == 0) {
   MPI_Comm_size(MPI_COMM_WORLD, &grp_size);     /* find group size */
   buf = (int *)malloc(grp_size*10*sizeof(int)); /* allocate memory */
}
MPI_Gather(data, 10, MPI_INT, buf, 10, MPI_INT, 0, MPI_COMM_WORLD);

The first count (10) is the number of items sent to the root by each process; the second is the number of items received by the root from each process. Generally the number of elements received by the root from each process and sent by each process is the same.

36 Reduce Pattern A common pattern to get data back to the master from all processes and then aggregate it by combining the collected data into one answer. [Figure: data from each source is collected at the destination and combined with a commutative operation to get one answer.] The reduction operation must be a binary operation that is commutative (changing the order of the operands does not change the result), which allows the implementation to do the operations in any order.

37 Rationale for providing the reduce pattern is that it is common and can be implemented efficiently in parallel. Could be implemented using a tree structure: [Figure: source processes P0 P2 P4 P6 and P1 P3 P5 P7 are combined pairwise in a tree toward the destination process.] First the data of P1 is combined with that of P0, for example added together for an add reduction. At the same time, the data of P3 is combined with that of P2, the data of P5 is combined with that of P4, and the data of P7 is combined with that of P6. Then the accumulated data of P2 is combined with that of P0, and the accumulated data of P6 is combined with that of P4. Finally the accumulated data of P4 is combined with that of P0, which holds the result.
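The slides give no code for this tree; purely as an illustration (a sketch, not how MPI_Reduce is necessarily implemented), an add-reduction onto rank 0 following the combining order just described might be coded with point-to-point calls, assuming each process holds one int in mydata and that rank, P, tag and status are set up as before:

int accum = mydata;                    /* each process's local value                        */
for (int step = 1; step < P; step *= 2) {
   if (rank % (2 * step) == step) {    /* senders this step: P1,P3,P5,P7; then P2,P6; then P4 */
      MPI_Send(&accum, 1, MPI_INT, rank - step, tag, MPI_COMM_WORLD);
      break;                           /* this process has passed on its partial result     */
   } else if (rank % (2 * step) == 0 && rank + step < P) {
      int partial;
      MPI_Recv(&partial, 1, MPI_INT, rank + step, tag, MPI_COMM_WORLD, &status);
      accum += partial;                /* combine with the commutative operation (here +)   */
   }
}
/* rank 0 now holds the sum of all processes' values in accum */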

38 Communication time complexity
To discuss

39 MPI Reduce routine Gather operation combined with specified arithmetic/logical operation. Example: Values could be gathered and then added together by root. [Figure: every process calls MPI_Reduce().] As usual, same routine called by each process, with same parameters.

40 Reduce parameters All processes in the Communicator must call the MPI_Reduce with the same parameters

41 Reduce - operations
MPI_Reduce(*sendbuf, *recvbuf, count, datatype, op, root, comm)
Parameters:
*sendbuf      send buffer address
*recvbuf      receive buffer address
count         number of send buffer elements
datatype      data type of send elements
op            reduce operation. Several operations, including:
              MPI_MAX   Maximum
              MPI_MIN   Minimum
              MPI_SUM   Sum
              MPI_PROD  Product
root          root process rank for result
comm          communicator

42 Reduction example using a tree construction
[Figure: add-reduction tree over PE0..PE5 holding 14, 39, 53, 120, 66, 29. First level: 14+39 = 53, 53+120 = 173, 66+29 = 95; next level: 53+173 = 226; final result: 226+95 = 321.] O(log2 P) steps with P processes.

43 MPI_Reduce Example

   for (i = 0; i < N; i++) {
      table[i] = ((float) random()) / RAND_MAX * 100;
   }
   printf("<rank %d>:", rank);
   for (i = 0; i < N; i++)
      printf("%4d", table[i]);
   printf("\n");

   MPI_Reduce(table, result, N, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

   if (rank == 0) {
      printf("\nAnswer:\n");
      for (i = 0; i < N; i++)
         printf("%4d", result[i]);
      printf("\n");
   }

Code and results from Dr. Ferner

44 MPI_Reduce Result
[Output: ranks 0, 3, 1, and 2 each print their N random values on a "<rank r>:" line; rank 0 then prints "Answer:" followed by the element-wise sums.]

45 Complete sample MPI program with broadcast and reduce routines
#include "mpi.h"
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#define MAXSIZE 1000

int main(int argc, char **argv)
{
   int myid, numprocs, data[MAXSIZE], i, x, low, high, myresult = 0, result;
   char fn[255];
   FILE *fp;

   MPI_Init(&argc, &argv);
   MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
   MPI_Comm_rank(MPI_COMM_WORLD, &myid);

   if (myid == 0) {                          /* Open input file and initialize data */
      strcpy(fn, getenv("HOME"));
      strcat(fn, "/MPI/rand_data.txt");
      if ((fp = fopen(fn, "r")) == NULL) {
         printf("Can't open the input file: %s\n\n", fn);
         exit(1);
      }
      for (i = 0; i < MAXSIZE; i++)
         fscanf(fp, "%d", &data[i]);
   }

   MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);   /* broadcast data */

   x = MAXSIZE/numprocs;                     /* Add my portion of data */
   low = myid * x;
   high = low + x;
   for (i = low; i < high; i++)
      myresult += data[i];
   printf("I got %d from %d\n", myresult, myid);

   /* Compute global sum */
   MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
   if (myid == 0) printf("The sum is %d.\n", result);

   MPI_Finalize();
   return 0;
}

46 Combined patterns

47 Combined Broadcast-Gather Pattern
Often data is broadcast or scattered to processes at the beginning of a computation and results are collected with a gather or reduce pattern at the end. This leads to the broadcast-gather pattern (and also scatter-gather, broadcast-reduce, and scatter-reduce patterns). The actual computation done by the slaves is specified in some fashion, but at this level it is unspecified. [Figure: the master (source/destination) sends the same data to each slave process; the slaves perform computations; results are collected from each slave.] The general broadcast-gather pattern is not directly implemented in MPI - one has to use separate broadcast and gather routines.

48 Generalized All-to-All Pattern
Every process communicates with every other process In the generalized all-to-all pattern, each process can communicate with each other process. Each process participating in the pattern is both source and a destination. Applications include the N-body problem, see later. MPI has specific types of all-to-all patterns based upon scatter and gather. (Seeds has a higher level iterative all-to-all pattern.)

49 Collective all-to-all broadcast
Sources and destinations are the same processes. A common all-to-all pattern, often used within a computation, is to send data from all processes to all processes. Every process sends data to every other process (one-way). Versions of this can be found in MPI, see next.

50 All-gather pattern
Each process performs a gather operation, gathering elements of an array from a group of processes. Finally each process has the same array composed of the data from each of the processes. [Figure: processes P0, P1, ..., Pn-1 each gather from every process.] The all-gather pattern is found in MPI as MPI_Allgather().
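As a minimal usage sketch (not from the slides; the array sizes are hypothetical and MAX_P is an assumed upper bound on the number of processes), each process contributes 10 ints and every process ends up with all of them in rank order:

#define MAX_P 128                /* assumed maximum number of processes                      */
int mydata[10];                  /* this process's contribution (assumed already filled in)  */
int alldata[MAX_P * 10];         /* receives 10 ints from every process, in rank order       */
MPI_Allgather(mydata, 10, MPI_INT, alldata, 10, MPI_INT, MPI_COMM_WORLD);
/* every process now holds the same alldata array */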

51 Scatter-gather pattern (all-to-all)
Each process performs a scatter operation. The 1st process scatters its elements to the 1st location of each destination, the 2nd process scatters its elements to the 2nd location of each destination, and so on. The effect is a gather operation at each process, although this happens implicitly with the multiple scatters. (Could be described in terms of multiple gathers.) [Figure: processes P0, P1, ..., Pn-1 each scatter, and each implicitly gathers.] This pattern is found in MPI as MPI_Alltoall().

52 Effect Combines multiple scatters:
This is essentially a matrix transposition: rows become columns.
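As a small illustrative sketch of that transposition effect (not from the slides; assumes exactly P processes, each initially holding one row of a P×P int matrix, with the variable-length arrays supported by the compiler):

/* Each of the P processes holds one row of a P x P matrix; after MPI_Alltoall
   each process holds one column instead.                                      */
int sendrow[P];   /* row 'rank' of the matrix (assumed already filled in)      */
int recvcol[P];   /* after the call: column 'rank' of the matrix               */
/* Element i of sendrow goes to process i; the item received from process i
   is stored in element i of recvcol.                                          */
MPI_Alltoall(sendrow, 1, MPI_INT, recvcol, 1, MPI_INT, MPI_COMM_WORLD);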

53 MPI_Alltoall parameters
Similar to the scatter parameters, but there is no root: MPI_Alltoall has no root parameter.

54 Some other combined patterns in MPI
MPI_Reduce_scatter() - combines values and scatters the results
MPI_Allreduce() - combines values from all processes and distributes the result back to all processes
MPI_Sendrecv() - sends and receives a message
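For example (a minimal sketch, not from the slides), MPI_Allreduce behaves like MPI_Reduce followed by a broadcast of the result, so no root is specified:

int myval = rank;          /* some local value per process (placeholder) */
int total;
MPI_Allreduce(&myval, &total, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);
/* every process now has the global sum in 'total' */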

55 MPI Collective (Data Transfer) Routines General features
Performed on a group of processes, identified by a communicator
Substitutes for a sequence of point-to-point calls
Communications are locally blocking
Synchronization is not guaranteed (implementation dependent)
Most routines use a root process to originate or receive all data (broadcast, scatter, gather, reduce, ...)
Data amounts must exactly match
Many variations to basic categories
No message tags needed
From

56 Synchronization Generally MPI collective data transfer operations have same semantics as if individual MPI_ send()s and MPI_recv()’s were used (according to the MPI standard). i.e. both sends and recv’s are locally blocking, sends will return after sending message, recv’s will wait for messages. However we need to know the exact implementation to really figure out when each process will return.

57 Questions

