Lecture 6 CSS314 Parallel Computing Book: “An Introduction to Parallel Programming” by Peter Pacheco http://instructor.sdu.edu.kz/~moodle/ M.Sc. Bogdanchikov Andrey, Suleyman Demirel University
Content
The trapezoidal rule
Parallelizing the trapezoidal rule
Dealing with I/O
Collective communication
MPI_Reduce
MPI_Allreduce
Broadcast
Data distributions
MPI_Scatter
The trapezoidal rule Recall that we can use the trapezoidal rule to approximate the area between the graph of a function, $y = f(x)$, two vertical lines, and the x-axis. The basic idea is to divide the interval on the x-axis into n equal subintervals and to approximate the area over each subinterval by a trapezoid. If the endpoints of a subinterval are $x_i$ and $x_{i+1}$, then the length of the subinterval is $h = x_{i+1} - x_i$. Also, if the lengths of the two vertical segments are $f(x_i)$ and $f(x_{i+1})$, then the area of the trapezoid is $\frac{h}{2}\,[f(x_i) + f(x_{i+1})]$. Self-study: learn the formula for the sum of multiple consecutive trapezoids (Chapter 3.2.1).
The trapezoidal rule: (a) area to be estimated and (b) approximate area using trapezoids
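Summing the areas of n consecutive trapezoids of width h gives the estimate $h\,[f(x_0)/2 + f(x_1) + \cdots + f(x_{n-1}) + f(x_n)/2]$. A minimal serial sketch of that sum is shown here. The integrand f(x) = x*x and the parameter names are assumptions chosen for illustration, not something the slides prescribe.

```c
/* Example integrand (an assumption for illustration): f(x) = x^2. */
double f(double x) {
    return x * x;
}

/* Serial trapezoidal rule on [left_endpt, right_endpt] using trap_count
   trapezoids, each of width base_len = (right_endpt - left_endpt) / trap_count. */
double Trap(double left_endpt, double right_endpt, int trap_count, double base_len) {
    /* The two end points are weighted by 1/2 ... */
    double estimate = (f(left_endpt) + f(right_endpt)) / 2.0;

    /* ... and every interior point is weighted by 1. */
    for (int i = 1; i <= trap_count - 1; i++) {
        estimate += f(left_endpt + i * base_len);
    }
    return estimate * base_len;
}
```

With this integrand, Trap(0.0, 3.0, 1024, 3.0/1024) should come out very close to 9, the exact value of the integral of x^2 over [0, 3].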
Parallelizing the trapezoidal rule Recall that we can design a parallel program using four basic steps:
1. Partition the problem solution into tasks.
2. Identify the communication channels between the tasks.
3. Aggregate the tasks into composite tasks.
4. Map the composite tasks to cores.
Pseudo-code Let’s make the simplifying assumption that comm_sz evenly divides n. Then pseudo-code for the program might look something like the following:
    Get a, b, n;
    h = (b-a)/n;
    local_n = n/comm_sz;
    local_a = a + my_rank*local_n*h;
    local_b = local_a + local_n*h;
    local_integral = Trap(local_a, local_b, local_n, h);
Pseudo-code (continued)
    if (my_rank != 0)
        Send local_integral to process 0;
    else { /* my_rank == 0 */
        total_integral = local_integral;
        for (proc = 1; proc < comm_sz; proc++) {
            Receive local_integral from proc;
            total_integral += local_integral;
        }
    }
    if (my_rank == 0)
        print result;
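To check that this decomposition reproduces the serial answer, the pseudo-code can be simulated in a single ordinary C function that loops over the ranks instead of running real processes. This is only a sketch of the arithmetic: it makes no MPI calls, and the function name Simulated_parallel_trap and the integrand f(x) = x*x are assumptions for illustration.

```c
/* Example integrand (an assumption for illustration): f(x) = x^2. */
double f(double x) {
    return x * x;
}

/* Serial trapezoidal rule on [left, right] with trap_count trapezoids of width h. */
double Trap(double left, double right, int trap_count, double h) {
    double estimate = (f(left) + f(right)) / 2.0;
    for (int i = 1; i <= trap_count - 1; i++) {
        estimate += f(left + i * h);
    }
    return estimate * h;
}

/* Simulate comm_sz processes one after another: each "rank" integrates its
   own subinterval, and the loop plays the role of process 0 adding up the
   local integrals. Assumes comm_sz evenly divides n. */
double Simulated_parallel_trap(double a, double b, int n, int comm_sz) {
    double h = (b - a) / n;          /* h is the same on every process */
    int local_n = n / comm_sz;       /* trapezoids per process */
    double total_integral = 0.0;

    for (int my_rank = 0; my_rank < comm_sz; my_rank++) {
        double local_a = a + my_rank * local_n * h;
        double local_b = local_a + local_n * h;
        total_integral += Trap(local_a, local_b, local_n, h);
    }
    return total_integral;
}
```

For example, Simulated_parallel_trap(0.0, 3.0, 1024, 4) gives each of four simulated processes 256 trapezoids, and the total should agree with Trap(0.0, 3.0, 1024, 3.0/1024) up to floating-point rounding.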
Discussion Notice that in our choice of identifiers, we try to differentiate between local and global variables. Local variables are variables whose contents are significant only on the process that’s using them. Some examples from the trapezoidal rule program are local_a, local_b, and local_n. Variables whose contents are significant to all the processes are sometimes called global variables. Some examples from the trapezoidal rule are a, b, and n.
trapezoid.c part 1
trapezoid.c part 2
Dealing with I/O Of course, the current version of the parallel trapezoidal rule has a serious deficiency: it will only compute the integral over the interval [0, 3] using 1024 trapezoids. We can edit the code and recompile, but this is quite a bit of work compared to simply typing in three new numbers. We need to address the problem of getting input from the user. While we’re talking about input to parallel programs, it might be a good idea to also take a look at output.
Output In both the “greetings” program and the trapezoidal rule program we’ve assumed that process 0 can write to stdout, that is, its calls to printf behave as we might expect. Although the MPI standard doesn’t specify which processes have access to which I/O devices, virtually all MPI implementations allow all the processes in MPI_COMM_WORLD full access to stdout and stderr, so most MPI implementations allow all processes to execute printf and fprintf(stderr, ...).
Input Unlike output, most MPI implementations only allow process 0 in MPI_COMM_WORLD access to stdin. In order to write MPI programs that can use scanf, we need to branch on process rank, with process 0 reading in the data and then sending it to the other processes. For example, we might write the Get_input function shown on the next slide for our parallel trapezoidal rule program. In this function, process 0 simply reads in the values for a, b, and n and sends all three values to each of the other processes.
Get_input function
Collective communication If we pause for a moment and think about our trapezoidal rule program, we can find several things that we might be able to improve on. One of the most obvious is the “global sum” after each process has computed its part of the integral: process 0 does all the work of adding up the results while the other processes do nothing. If we hire eight workers to, say, build a house, we might feel that we weren’t getting our money’s worth if seven of the workers told the first what to do, and then the seven collected their pay and went home. Sometimes it does happen that this is the best we can do in a parallel program, but we have seen already that this problem can be optimized.
Tree-structured communication
An alternative tree-structured global sum
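The pairing pattern of a tree-structured sum can be sketched without MPI by keeping one value per simulated process in an array. The sketch below uses the pairing in which neighbouring ranks are combined first (0 with 1, 2 with 3, and so on); the alternative tree differs only in which ranks are paired in each phase. The function name Tree_sum is hypothetical.

```c
/* Simulated tree-structured global sum.
   vals[rank] is the local value held by each simulated process.
   On return vals[0] holds the global sum, and the return value is the
   number of communication phases that were needed. */
int Tree_sum(double vals[], int comm_sz) {
    int phases = 0;

    /* In each phase the distance between partners doubles. */
    for (int stride = 1; stride < comm_sz; stride *= 2) {
        /* Ranks that are multiples of 2*stride "receive" from rank + stride. */
        for (int rank = 0; rank + stride < comm_sz; rank += 2 * stride) {
            vals[rank] += vals[rank + stride];
        }
        phases++;
    }
    return phases;
}
```

With eight processes this takes three phases, compared with the seven receives and seven additions that process 0 performs in the original loop.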
MPI_Reduce With virtually limitless possibilities, it’s unreasonable to expect each MPI programmer to write an optimal global-sum function, so MPI specifically protects programmers against this trap of endless optimization by requiring that MPI implementations include implementations of global sums. This places the burden of optimization on the developer of the MPI implementation, rather than the application developer. The assumption here is that the developer of the MPI implementation should know enough about both the hardware and the system software so that she can make better decisions about implementation details.
Point-to-point communications Now, a “global-sum function” will obviously require communication. However, unlike the MPI_Send-MPI_Recv pair, the global-sum function may involve more than two processes. In fact, in our trapezoidal rule program it will involve all the processes in MPI_COMM_WORLD. In MPI parlance, communication functions that involve all the processes in a communicator are called collective communications. To distinguish between collective communications and functions such as MPI_Send and MPI_Recv, MPI_Send and MPI_Recv are often called point-to-point communications.
MPI_Reduce In fact, global sum is just a special case of an entire class of collective communications. For example, it might happen that instead of finding the sum of a collection of comm_sz numbers distributed among the processes, we want to find the maximum or the minimum or the product or any one of many other possibilities. MPI generalizes the global-sum function so that any one of these possibilities can be implemented with a single function, MPI_Reduce.
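Description The key to the generalization is the fifth argument, operator. It has type MPI_Op, which is a predefined MPI type like MPI_Datatype and MPI_Comm. There are a number of predefined values of this type, such as MPI_SUM, MPI_PROD, MPI_MAX, and MPI_MIN. For the trapezoidal rule program the operator we want is MPI_SUM, so the loop of sends and receives can be replaced with a single call:
    MPI_Reduce(&local_int, &total_int, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);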
Collective vs. point-to-point communications It’s important to remember that collective communications differ in several ways from point-to-point communications:
- All the processes in the communicator must call the same collective function.
- The arguments passed by each process to an MPI collective communication must be “compatible.” For example, every process must pass the same dest_process.
- The output_data_p argument is only used on dest_process, although every process still has to pass an argument in that position.
- Point-to-point communications are matched on the basis of tags and communicators. Collective communications don’t use tags, so they’re matched solely on the basis of the communicator and the order in which they’re called.
Self-Study: Chapter 3.4.3
MPI_Allreduce In our trapezoidal rule program, we just print the result, so it’s perfectly natural for only one process to get the result of the global sum. However, it’s not difficult to imagine a situation in which all of the processes need the result of a global sum in order to complete some larger computation. In this situation, we encounter some of the same problems we encountered with our original global sum. For example, if we use a tree to compute a global sum, we might “reverse” the branches to distribute the global sum.
Alternative Alternatively, we might have the processes exchange partial results instead of using one-way communications. Such a communication pattern is sometimes called a butterfly. Once again, we don’t want to have to decide on which structure to use, or how to code it for optimal performance.
A butterfly-structured global sum
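The butterfly exchange can also be simulated serially. In the sketch below each simulated rank pairs with the rank whose number differs in exactly one bit (rank XOR mask), both partners add the other's partial result, and after log2(comm_sz) phases every rank holds the full sum. The function name Butterfly_sum and the MAX_PROCS limit are assumptions, and the sketch only handles a power-of-two number of processes.

```c
#define MAX_PROCS 64

/* Simulated butterfly-structured global sum.
   vals[rank] is the local value of each simulated process; on success every
   entry of vals holds the global sum. Returns 0 on success, or -1 if comm_sz
   is not a power of two in the range 1..MAX_PROCS. */
int Butterfly_sum(double vals[], int comm_sz) {
    double next[MAX_PROCS];

    if (comm_sz < 1 || comm_sz > MAX_PROCS || (comm_sz & (comm_sz - 1)) != 0) {
        return -1;
    }

    for (int mask = 1; mask < comm_sz; mask <<= 1) {
        /* Every rank exchanges partial sums with its partner in this phase. */
        for (int rank = 0; rank < comm_sz; rank++) {
            int partner = rank ^ mask;
            next[rank] = vals[rank] + vals[partner];
        }
        /* All exchanges in a phase happen "at the same time", so the old
           values are only overwritten once the whole phase is done. */
        for (int rank = 0; rank < comm_sz; rank++) {
            vals[rank] = next[rank];
        }
    }
    return 0;
}
```

Unlike the tree-structured sum, no second pass is needed to distribute the result, which is the behaviour a function like MPI_Allreduce provides.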
MPI_Allreduce MPI provides a variant of MPI_Reduce that will store the result on all the processes in the communicator, MPI_Allreduce. The argument list is identical to that for MPI_Reduce except that there is no dest_process, since all the processes get the result.
Broadcast If we can improve the performance of the global sum in our trapezoidal rule program by replacing a loop of receives on process 0 with a tree-structured communication, we ought to be able to do something similar with the distribution of the input data. In fact, if we simply “reverse” the communications in the tree-structured global sum from the previous slides, we obtain the tree-structured communication shown on the next slide, and we can use this structure to distribute the input data.
A tree-structured broadcast
MPI_Bcast A collective communication in which data belonging to a single process is sent to all of the processes in the communicator is called a broadcast, and you’ve probably guessed that MPI provides a broadcast function, MPI_Bcast. The process with rank source_proc sends the contents of the memory referenced by data_p to all the processes in the communicator comm.
A version of Get_input that uses MPI_Bcast
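Data distributions Suppose we want to write a function that computes a vector sum: $\mathbf{x} + \mathbf{y} = (x_0, x_1, \ldots, x_{n-1}) + (y_0, y_1, \ldots, y_{n-1}) = (x_0 + y_0, x_1 + y_1, \ldots, x_{n-1} + y_{n-1}) = (z_0, z_1, \ldots, z_{n-1}) = \mathbf{z}$. If the vectors are stored as arrays of doubles, serial code for the sum is a single loop that adds corresponding components.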
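A minimal serial sketch of the vector sum, assuming the vectors are arrays of double with n components each:

```c
/* Serial vector sum: z[i] = x[i] + y[i] for i = 0, ..., n-1. */
void Vector_sum(const double x[], const double y[], double z[], int n) {
    for (int i = 0; i < n; i++) {
        z[i] = x[i] + y[i];
    }
}
```

Each iteration is independent of the others, which is what makes this loop easy to divide among processes.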
Data distribution How could we implement this using MPI? The work consists of adding the individual components of the vectors, so we might specify that the tasks are just the additions of corresponding components. Then there is no communication between the tasks, and the problem of parallelizing vector addition boils down to aggregating the tasks and assigning them to the cores.
Data distribution If the number of components is n and we have comm_sz cores or processes, let’s assume that comm_sz evenly divides n and define local_n = n / comm_sz. Then we can simply assign blocks of local_n consecutive components to each process. This is often called a block partition of the vector. An alternative to a block partition is a cyclic partition. In a cyclic partition, we assign the components in a round-robin fashion.
Different Partitions of a 12-Component Vector among Three Processes A third alternative is a block-cyclic partition. The idea here is that instead of using a cyclic distribution of individual components, we use a cyclic distribution of blocks of components.
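The three partitions can be described by which process owns a given global component index i. The helper names below are hypothetical, and the block partition assumes that comm_sz evenly divides n.

```c
/* Block partition: process q owns components q*local_n ... (q+1)*local_n - 1. */
int Block_owner(int i, int n, int comm_sz) {
    int local_n = n / comm_sz;   /* assumes comm_sz evenly divides n */
    return i / local_n;
}

/* Cyclic partition: components are dealt out round-robin, one at a time. */
int Cyclic_owner(int i, int comm_sz) {
    return i % comm_sz;
}

/* Block-cyclic partition: blocks of blocksize consecutive components
   are dealt out round-robin. */
int Block_cyclic_owner(int i, int blocksize, int comm_sz) {
    return (i / blocksize) % comm_sz;
}
```

For n = 12, comm_sz = 3, and a block size of 2 these functions give:
Process 0: block 0 1 2 3; cyclic 0 3 6 9; block-cyclic 0 1 6 7
Process 1: block 4 5 6 7; cyclic 1 4 7 10; block-cyclic 2 3 8 9
Process 2: block 8 9 10 11; cyclic 2 5 8 11; block-cyclic 4 5 10 11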
A parallel implementation of vector addition Once we’ve decided how to partition the vectors, it’s easy to write a parallel vector addition function: each process simply adds its assigned components. Furthermore, regardless of the partition, each process will have local_n components of the vector, and, in order to save on storage, we can just store these on each process as an array of local_n elements.
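Scatter Now suppose we want to test our vector addition function. It would be convenient to be able to read the dimension of the vectors and then read in the vectors x and y. Process 0 could read in the vectors and broadcast them to the other processes, but this can be very wasteful: if there are 10 processes and the vectors have 10,000 components, then each process will need to allocate storage for vectors with 10,000 components when it is only operating on subvectors with 1000 components. With a block distribution, it would be better for process 0 to send only components 1000 to 1999 to process 1, components 2000 to 2999 to process 2, and so on. Using this approach, processes 1 to 9 would only need to allocate storage for the components they’re actually using.
MPI_Scatter MPI provides a function for exactly this communication, MPI_Scatter. It divides the data referenced by send_buf_p into comm_sz pieces and sends the first piece to process 0, the second to process 1, and so on. Perhaps surprisingly, send_count should also be local_n, just like recv_count: send_count is the amount of data going to each process; it’s not the amount of data in the memory referred to by send_buf_p.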
A function for reading and distributing a vector
Gather Of course, our test program will be useless unless we can see the result of our vector addition, so we need to write a function for printing out a distributed vector. Our function can collect all of the components of the vector onto process 0, and then process 0 can print all of the components. The communication in this function can be carried out by MPI_Gather.
MPI_Gather The data stored in the memory referred to by send_buf_p on process 0 is stored in the first block in recv_buf_p, the data stored in the memory referred to by send_buf_p on process 1 is stored in the second block referred to by recv_buf_p, and so on. So, if we’re using a block distribution, we can implement our distributed vector print function as shown on the next slide.
A function for printing a distributed vector
Allgather This is for self-study: Chapter 3.4.9
Homework Exercise 3.3 Exercise 3.6 Exercise 3.7 Exercise 3.13
THE END Thank you