2.1 Collective Communication
Involves a set of processes, defined by an intra-communicator. Message tags are not used. Principal collective operations:
MPI_BCAST() - Broadcast from root to all other processes
MPI_GATHER() - Gather values from a group of processes
MPI_SCATTER() - Scatter a buffer in parts to a group of processes
MPI_REDUCE() - Combine values from all processes into a single value
2.2 Broadcast
Sending same message to all processes concerned with problem. Multicast - sending same message to defined group of processes.
[Figure: action and MPI code form of broadcast - processes 0, 1, ..., p - 1 each call bcast() on their data buffer.]
2.3 Broadcast Illustrated
2.4 Broadcast (MPI_BCAST)
One-to-all communication: the same data are sent from the root process to all others in the communicator.
MPI_BCAST(data, count, MPI_Datatype, root, comm)
All processes must specify the same root and communicator.
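A minimal sketch of how MPI_Bcast is typically called from C (the variable name value and the choice of rank 0 as root are illustrative, not taken from the slides):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0)
        value = 42;              /* only the root has the data initially */

    /* every process calls MPI_Bcast with the same root and communicator */
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Process %d now has value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}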
2.5 Reduction (MPI_REDUCE)
The reduction operation allows you to:
–Collect data from each process
–Reduce the data to a single value
–Store the result on the root process (MPI_REDUCE) or on all processes (MPI_ALLREDUCE)
The reduction function works element-wise on arrays. Operations: sum, product, min, max, and, ...
2.6 Reduction Operation (SUM)
2.7 MPI_REDUCE
MPI_REDUCE(snd_buf, rcv_buf, count, type, op, root, comm, ierr)
–snd_buf input array of type type containing the local values
–rcv_buf output array of type type containing the global result
–count (INTEGER) number of elements in snd_buf and rcv_buf
–type (INTEGER) MPI type of snd_buf and rcv_buf
–op (INTEGER) parallel operation to be performed
–root (INTEGER) MPI rank of the process storing the result
–comm (INTEGER) communicator of the processes involved in the operation
–ierr (INTEGER) output, error code (if ierr=0 no error occurs)
MPI_ALLREDUCE(snd_buf, rcv_buf, count, type, op, comm, ierr)
–The root argument is missing; the result is stored on all processes.
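A minimal C sketch of MPI_Allreduce, assuming each process contributes a single local integer and every rank needs the global sum (variable names are illustrative):

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, local, global;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    local = rank + 1;   /* each process contributes its own value */

    /* no root argument: every process receives the global sum */
    MPI_Allreduce(&local, &global, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    printf("Process %d: global sum = %d\n", rank, global);

    MPI_Finalize();
    return 0;
}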
2.8 Predefined Reduction Operations
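The MPI standard's predefined reduction operators (usable as the op argument above) include:
MPI_MAX - maximum
MPI_MIN - minimum
MPI_SUM - sum
MPI_PROD - product
MPI_LAND - logical AND
MPI_BAND - bitwise AND
MPI_LOR - logical OR
MPI_BOR - bitwise OR
MPI_LXOR - logical exclusive OR
MPI_BXOR - bitwise exclusive OR
MPI_MAXLOC - maximum value and its location
MPI_MINLOC - minimum value and its location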
2.9 Example
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <mpi.h>
#define MAXSIZE 100

int main(int argc, char **argv)
{
    int myid, numprocs;
    int data[MAXSIZE], i, x, low, high, myresult = 0, result;
    char fn[255];
    FILE *fp;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);
2.10 Example cont.
    if (myid == 0) {            /* open input file and initialize data */
        strcpy(fn, getenv("PWD"));
        strcat(fn, "/rand_data.txt");
        if ((fp = fopen(fn, "r")) == NULL) {
            printf("Can't open the input file: %s\n\n", fn);
            exit(1);
        }
        for (i = 0; i < MAXSIZE; i++) {
            fscanf(fp, "%d", &data[i]);
        }
        fclose(fp);
    }
2.11 Example cont.
    /* broadcast data */
    MPI_Bcast(data, MAXSIZE, MPI_INT, 0, MPI_COMM_WORLD);

    /* add portion of data */
    x = MAXSIZE / numprocs;     /* must be an integer */
    low = myid * x;
    high = low + x;
    for (i = low; i < high; i++) {
        myresult += data[i];
    }
    printf("I got %d from %d\n", myresult, myid);

    /* compute global sum */
    MPI_Reduce(&myresult, &result, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
    if (myid == 0) {
        printf("The sum is %d.\n", result);
    }

    MPI_Finalize();
    return 0;
}
2.12 MPI_Scatter
One-to-all communication: different data are sent from the root process to all others in the communicator.
MPI_SCATTER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)
–Argument definitions are like those of the other MPI subroutines
–sndcount is the number of elements sent to each process, not the size of sndbuf, which should be sndcount times the number of processes in the communicator
–The send arguments are significant only at the root
2.13 MPI_Scatter Example
Suppose there are four processes, including the root (process 0). A 16-element array on the root should be distributed among the processes, four elements to each. Every process should include the following line:
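A call consistent with this setup, with 16/4 = 4 elements per process (the buffer names sendbuf and recvbuf are illustrative, not from the original slide):

/* root's sendbuf holds 16 ints; each process receives 4 ints into recvbuf */
MPI_Scatter(sendbuf, 4, MPI_INT, recvbuf, 4, MPI_INT, 0, MPI_COMM_WORLD);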
2.14 MPI_Gather
Different data are collected by the root process from all other processes in the communicator. It is the opposite of scatter.
MPI_GATHER(sndbuf, sndcount, sndtype, rcvbuf, rcvcount, rcvtype, root, comm, ierr)
–Argument definitions are like those of the other MPI subroutines
–rcvcount is the number of elements collected from each process, not the size of rcvbuf, which should be rcvcount times the number of processes in the communicator
–The receive arguments are significant only at the root
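Mirroring the scatter sketch above (same illustrative buffer names), the root could collect the four 4-element pieces back into the 16-element array with:

/* each process sends its 4 ints; the root reassembles all 16 into sendbuf */
MPI_Gather(recvbuf, 4, MPI_INT, sendbuf, 4, MPI_INT, 0, MPI_COMM_WORLD);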
2.15 Scatter/Gather
2.16 Scatter/Gather Example
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int myid, j, data[100], tosum[25], sums[4];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    if (myid == 0) {
        for (j = 0; j < 100; j++)
            data[j] = j + 1;
        printf("The data to sum : ");
        for (j = 0; j < 100; j++)
            printf(" %d", data[j]);
        printf("\n");
    }
2.17 Scatter/Gather Example
    MPI_Scatter(data, 25, MPI_INT, tosum, 25, MPI_INT, 0, MPI_COMM_WORLD);

    printf("Node %d has numbers to sum :", myid);
    for (j = 0; j < 25; j++)
        printf(" %d", tosum[j]);
    printf("\n");

    sums[myid] = 0;
    for (j = 0; j < 25; j++)
        sums[myid] += tosum[j];
    printf("Node %d computes the sum %d\n", myid, sums[myid]);
2.18 Scatter/Gather Example
    MPI_Gather(&sums[myid], 1, MPI_INT, sums, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (myid == 0) {    /* after the gather, sums contains the four sums */
        printf("The four sums : ");
        printf("%d", sums[0]);
        for (j = 1; j < 4; j++)
            printf(" + %d", sums[j]);
        for (j = 1; j < 4; j++)
            sums[0] += sums[j];
        printf(" = %d, which should be 5050.\n", sums[0]);
    }

    MPI_Finalize();
    return 0;
}
2.19 MPI_Barrier()
Stops processes until all processes within a communicator reach the barrier.
Almost never required in a parallel program.
Occasionally useful in measuring performance and load balancing.
C:
–int MPI_Barrier(MPI_Comm comm)
2.20 Barrier
2.21 Barrier routine A means of synchronizing processes by stopping each one until they all have reached a specific “barrier” call.
2.22 Barrier example
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

/* some time consuming functionality */
int function(int t)
{
    sleep(3*t + 1);
    return 0;
}
2.23 Barrier example cont.
int main(int argc, char **argv)
{
    int MyRank;
    double s_time, l_time, g_time;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &MyRank);

    s_time = MPI_Wtime();
    function(MyRank);
    l_time = MPI_Wtime();

    /* wait for all to come together */
    MPI_Barrier(MPI_COMM_WORLD);
    g_time = MPI_Wtime();

    printf("Processor %d: LocalTime = %lf,GlobalTime = %lf\n",
           MyRank, l_time - s_time, g_time - s_time);

    MPI_Finalize();
    return 0;
}
2.24 Evaluating Parallel Programs
2.25 Sequential execution time, t_s: estimate by counting the computational steps of the best sequential algorithm.
Parallel execution time, t_p: in addition to the number of computational steps, t_comp, need to estimate the communication overhead, t_comm:
t_p = t_comp + t_comm
2.26 Elapsed parallel time
MPI_Wtime() returns the number of seconds that have elapsed since some arbitrary time in the past.
2.27 Communication Time
Many factors, including network structure and network contention. As a first approximation, use
t_comm = t_startup + n * t_data
t_startup is the startup time, essentially the time to send a message with no data; assumed to be constant.
t_data is the transmission time to send one data word, also assumed constant, and there are n data words.
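A quick worked example with assumed (hypothetical) parameter values, just to show how the formula is applied: suppose t_startup = 10 µs, t_data = 0.1 µs per data word, and n = 1000 words. Then
t_comm = t_startup + n * t_data = 10 µs + 1000 × 0.1 µs = 110 µs
so the startup term contributes roughly 9% of the total here, but it dominates when messages are short.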
2.28 Idealized Communication Time
[Figure: communication time versus number of data items (n); the line's intercept is the startup time.]
2.29 Benchmark Factors
With t_s, t_comp, and t_comm, we can establish the speedup factor and the computation/communication ratio for a particular algorithm/implementation; both are functions of the number of processors, p, and the number of data elements, n.
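The standard definitions, in terms of the quantities above:
Speedup factor: S(p) = t_s / t_p = t_s / (t_comp + t_comm)
Computation/communication ratio: t_comp / t_comm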
2.30 Factors give indication of scalability of parallel solution with increasing number of processors and problem size. Computation/communication ratio will highlight effect of communication with increasing problem size and system size.