Algorithms on Rings of Processors Chapter 4, CLR Textbook
Algorithms on Rings of Processors When using message passing, it is common to abstract away from the physical network and to choose a convenient logical network instead. This chapter presents several algorithms intended for the logical ring network studied earlier. Coverage of how logical networks map onto physical networks is deferred to Sections 4.6 and 4.7. Rings are a linear interconnection network, ideal for a first look at distributed memory algorithms. Each processor has a single predecessor and a single successor.
Matrix-Vector Multiplication The first unidirectional ring algorithm will be the multiplication $y = Ax$ of an $n \times n$ matrix A by a vector x of dimension n.
    for i = 0 to n-1 do
        y_i ← 0
        for j = 0 to n-1 do
            y_i ← y_i + A_{i,j} x_j
Each iteration of the outer loop (over i) computes the scalar product of one row of A by the vector x. These scalar products can be performed in any order, so they will be distributed among the processors and computed in parallel.
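The sequential loop above can be written as a small runnable sketch. The function name `mat_vec` and the list-of-rows representation are assumptions made for illustration; they do not come from the textbook.

```python
def mat_vec(A, x):
    """Return y = A x for a square matrix A given as a list of rows."""
    n = len(x)
    y = [0.0] * n
    for i in range(n):            # one scalar product per row of A
        for j in range(n):
            y[i] += A[i][j] * x[j]
    return y


if __name__ == "__main__":
    print(mat_vec([[1, 2], [3, 4]], [5, 6]))
```

Because each pass of the outer loop touches only row i and writes only y[i], the passes are independent, which is what allows them to be spread over processors.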
Matrix-Vector Multiplication (cont.) We assume that n is divisible by p and let r = n/p. Each processor stores r contiguous rows of matrix A, called a block row, and computes the r corresponding scalar products. The corresponding r components of the vectors y and x are also stored with each processor. Each processor $P_q$ therefore stores rows $qr$ to $(q+1)r - 1$ of matrix A, a block of dimension $r \times n$, and components $qr$ to $(q+1)r - 1$ of vectors x and y. For simplicity, we ignore the case where n is not divisible by p. However, this case can be handled by temporarily adding rows of zeros to the matrix and zeros to the vector x so that the resulting number of rows is divisible by p.
Matrix-Vector Multiplication (cont.) Declarations needed: var A: array[0..r-1, 0..n-1] of real; var x, y: array[0..r-1] of real; Then A[0,0] on $P_0$ corresponds to $A_{0,0}$, but on $P_1$ it corresponds to $A_{r,0}$. Note that the subscripts are global while the array indices are local. Also, global index $(i, j)$ corresponds to local index $(i - \lfloor i/r \rfloor r,\ j)$ on processor $P_k$, where $k = \lfloor i/r \rfloor$. The next figure illustrates how the rows and vectors are partitioned among the processors.
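The global-to-local index correspondence can be checked with a short sketch. The helper names `global_to_local` and `local_to_global` are hypothetical, and integer (floor) division is assumed for the owner computation.

```python
def global_to_local(i, j, r):
    """Map global index (i, j) to (owner k, local row, local column)."""
    k = i // r                    # P_k owns global rows k*r .. (k+1)*r - 1
    return k, i - k * r, j


def local_to_global(k, local_i, j, r):
    """Inverse mapping: local row local_i on P_k is global row k*r + local_i."""
    return k * r + local_i, j


if __name__ == "__main__":
    r = 4
    print(local_to_global(1, 0, 0, r))   # A[0,0] on P_1 is the global element A_{r,0}
    print(global_to_local(5, 2, r))
```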
Matrix-Vector Multiplication (cont.) The partitioning of the data makes it possible to solve larger problems in parallel: the parallel algorithm can solve a problem roughly p times larger than the sequential algorithm. Algorithm 4.1 is given on the next slide. In each iteration of the loop in Algorithm 4.1, each processor $P_q$ multiplies an $r \times r$ sub-matrix of its block row by a vector of size r. This is a partial result. The values of the components of y assigned to $P_q$ are obtained by adding all of these partial results together.
Matrix-Vector Multiplication (cont.) In the first pass through the loop, the x-components in $P_q$ are the ones originally assigned. During each pass through the loop, $P_q$ computes the product of the appropriate part of the q-th block row of A with its current components of x. Concurrently, each $P_q$ sends its block of x-values to $P_{(q+1) \bmod p}$ and receives a new block of x-values from $P_{(q-1) \bmod p}$. At the conclusion, each $P_q$ again holds its original block of x-values and has calculated the correct values for its y-components. These steps are illustrated in Figure 4.2.
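The passes described above can be imitated in a single process. The following is only a sequential simulation of the ring, not the textbook's Algorithm 4.1: the function name `ring_mat_vec` is hypothetical, the p "processors" are list entries, and the circular shift is a list rotation rather than real message passing.

```python
def ring_mat_vec(A, x, p):
    """Simulate the ring algorithm; returns (y, final x blocks). Assumes p divides n."""
    n = len(x)
    r = n // p
    # Block-row distribution: P_q holds rows q*r..(q+1)*r-1 and the matching x, y blocks.
    A_loc = [A[q * r:(q + 1) * r] for q in range(p)]
    x_loc = [x[q * r:(q + 1) * r] for q in range(p)]
    y_loc = [[0.0] * r for _ in range(p)]
    for step in range(p):
        for q in range(p):
            # After `step` shifts, P_q holds the x block that started on P_{(q-step) mod p}.
            src = (q - step) % p
            for i in range(r):
                for j in range(r):
                    y_loc[q][i] += A_loc[q][i][src * r + j] * x_loc[q][j]
        # Circular shift: P_q sends its x block to P_{(q+1) mod p}.
        x_loc = [x_loc[(q - 1) % p] for q in range(p)]
    return [v for block in y_loc for v in block], x_loc


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    print(ring_mat_vec(A, [1, 0, 2, 1], 2))
```

After p shifts every block of x has visited every processor once, so each processor is back to its original x block, as the slide states.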
Analysis of Matrix-Vector Multiplication There are p identical steps. Each step involves three activities: compute, send, and receive. The send and the receive take the same time and are concurrent, so the execution time is $T(p) = p \max\{r^2 w,\ L + rb\}$, where w is the computation time for multiplying a vector component by a matrix component and adding two products, b is the inverse of the bandwidth, and L is the communication startup cost. As r = n/p, the computation cost becomes asymptotically larger than the communication cost as n increases, since $r^2 w$ grows quadratically in r while $L + rb$ grows only linearly.
Matrix-Vector Multiplication Analysis (cont.) Next, we calculate various metrics and their complexity, writing the sequential time as $t_s = cn^2$. For large n, $T(p) = p \cdot r^2 w = n^2 w / p$, which is $O(n^2/p)$, or $O(n^2)$ if p is constant. The cost is $p \cdot T(p) = n^2 w$, or $O(n^2)$. The speedup is $t_s / T(p) = cn^2 \cdot p/(n^2 w) = (c/w)\,p$, or $O(p)$; however, if p is constant or small, the speedup is only $O(1)$. The efficiency is $t_s/\text{cost} = cn^2/(n^2 w) = c/w$, or $O(1)$; equivalently, efficiency $= t_s/(p \cdot T(p))$. Note that if vector x were duplicated across all processors, there would be no need for any communication and the parallel efficiency would be $O(1)$ for all values of n. However, there would be an increased memory cost.
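The performance model T(p) = p · max{r²w, L + rb} and the efficiency t_s / (p · T(p)) can be evaluated numerically. This is a sketch of the model only, not a measurement: the function names are hypothetical, the parameter values in the example are arbitrary, and the sequential constant c is assumed equal to w unless given.

```python
def mv_time(n, p, w, b, L):
    """Modelled time of the ring matrix-vector product; assumes p divides n."""
    r = n // p
    return p * max(r * r * w, L + r * b)


def mv_efficiency(n, p, w, b, L, c=None):
    """Efficiency t_s / (p * T(p)) with t_s = c * n^2 (c defaults to w, an assumption)."""
    c = w if c is None else c
    return c * n * n / (p * mv_time(n, p, w, b, L))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, mv_efficiency(n, 10, w=1, b=1, L=50))
```

With these arbitrary values, communication dominates for small n and computation dominates for large n, which is the behaviour the analysis describes.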
Matrix-Matrix Multiplication Using matrix-vector multiplication, this is easy. Let $C = AB$, where all are $n \times n$ matrices. The multiplication consists of computing $n^2$ scalar products:
    for i = 0 to n-1 do
        for j = 0 to n-1 do
            C_{i,j} ← 0
            for k = 0 to n-1 do
                C_{i,j} ← C_{i,j} + A_{i,k} B_{k,j}
We distribute the matrices over the p processors by block rows, giving the first processor the first r = n/p rows, and so on. Declaration: var A, B, C: array[0..r-1, 0..n-1] of real;
Matrix-Matrix Multiplication & Analysis This algorithm is very similar to the one for matrix-vector multiplication: scalar products are replaced by sub-matrix multiplications, and circular shifting of a vector is replaced by circular shifting of matrix rows. Analysis: each step lasts as long as the longest of the three activities performed during the step (compute, send, and receive), so $T(p) = p \max\{n r^2 w,\ L + nrb\}$. As before, the asymptotic parallel efficiency is 1 when n is large.
Matrix-Matrix Multiplication Analysis Naïve algorithm: matrix-matrix multiplication could be achieved by executing matrix-vector multiplication n times. Analysis of the naïve algorithm: the execution time is just the time for matrix-vector multiplication, multiplied by n, so $T'(p) = p \max\{n r^2 w,\ nL + nrb\}$. The only difference between T and T' is that the term L has become nL. In the naïve approach, processors exchange vectors of size r, while in the algorithm developed in this section they exchange matrices of size $r \times n$. This does not change the asymptotic efficiency. However, sending data in bulk can significantly reduce the communication overhead.
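The effect of the L versus nL term can be seen by evaluating both models side by side. The sketch below only evaluates the two formulas, T(p) = p · max{nr²w, L + nrb} and T'(p) = p · max{nr²w, nL + nrb}; the function names and the sample parameter values are assumptions chosen so that communication dominates.

```python
def mm_time(n, p, w, b, L):
    """Modelled time when r x n blocks are shifted in bulk; assumes p divides n."""
    r = n // p
    return p * max(n * r * r * w, L + n * r * b)


def mm_time_naive(n, p, w, b, L):
    """Modelled time of n successive matrix-vector products (n startups per step)."""
    r = n // p
    return p * max(n * r * r * w, n * L + n * r * b)


if __name__ == "__main__":
    print(mm_time(8, 4, w=1, b=1, L=100), mm_time_naive(8, 4, w=1, b=1, L=100))
```

When computation dominates both maxima, the two models coincide, which is why the asymptotic efficiency is unchanged.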
Stencil Applications Stencil applications are popular applications that operate on a discrete domain consisting of cells. Each cell holds some value(s) and has neighbor cells. The application uses an algorithm that applies pre-defined rules to update the value(s) of a cell using the values of the neighbor cells. The location of the neighbor cells and the function used to update cell values constitute a stencil that is applied to all cells in the domain. These types of applications arise in many areas of science and engineering. Examples include image processing, approximate solutions to differential equations, and simulation of complex cellular automata (e.g., Conway’s Game of Life).
A Simple Sequential Algorithm We consider a stencil application on a 2D domain of size $n \times n$. Each cell has 8 neighbors, as shown below:
    NW  N  NE
    W   c  E
    SW  S  SE
The algorithm we consider updates the value of cell c based on the already updated values of its West and North neighbors. The stencil is shown on the next slide and can be formalized as $c_{new} \leftarrow \mathrm{UPDATE}(c_{old}, W_{new}, N_{new})$.
A Simple Sequential Algorithm (cont.) This simple stencil is similar to those of important applications, such as the Gauss-Seidel numerical method and the Smith-Waterman biological string comparison algorithm. This stencil cannot be applied as-is to cells in the top row or left column, which lack a North or West neighbor. These cells are handled by the update function: to indicate that no neighbor exists for a cell update, we pass a Nil argument to UPDATE.
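A sequential version of this stencil can be sketched as follows. The slides do not specify the UPDATE rule, so the `update` function below (sum of the old value and the existing neighbors) is purely a placeholder, and Python's `None` stands in for Nil.

```python
def update(c_old, w_new, n_new):
    """Placeholder rule (an assumption): add the neighbours that exist."""
    west = 0 if w_new is None else w_new
    north = 0 if n_new is None else n_new
    return c_old + west + north


def sequential_stencil(A):
    """Update A in place, row by row, using already-updated West and North values."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            west = A[i][j - 1] if j > 0 else None    # None plays the role of Nil
            north = A[i - 1][j] if i > 0 else None
            A[i][j] = update(A[i][j], west, north)
    return A


if __name__ == "__main__":
    print(sequential_stencil([[1] * 3 for _ in range(3)]))
```

The row-major loop order guarantees that the West and North neighbors of a cell have already been updated when the cell itself is processed.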
Greedy Parallel Algorithm for Stencil Consider a ring of p processors, $P_0, P_1, \ldots, P_{p-1}$. We must decide how to allocate cells among processors, balancing the computational load without creating overly expensive communications. Assume initially that p is equal to n. We allocate row i of domain A to the i-th processor, $P_i$. Declaration needed: var A: array[0..n-1] of real; As soon as $P_i$ has computed a cell value, it sends that value to $P_{i+1}$ ($0 \le i < p-1$). Initially, only $A_{0,0}$ can be computed. Once $A_{0,0}$ is computed, $A_{1,0}$ and $A_{0,1}$ can be computed. The computation proceeds in steps: at step k, all values on the k-th anti-diagonal are computed.
General Steps of Greedy Algorithm At time i+j, processor Pi performs the following operations: It receives A(i-1,j) from Pi-1 It computes A(i,j) Then it sends A(i,j) to Pi+1 Exceptions: P0 does not need to receive cell values to update its cells. Pp-1 does not send its cell values after updating its cells. Above exceptions do not influence algorithm performance. This algorithm is captured in Algorithm 4.3 on next slide.
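The timing claim, that cell (i, j) is handled at time i + j, can be checked by a small dependency simulation. This sketch only computes a schedule; it assumes p = n, one cell update per step, and that a value sent in one step is usable by the successor in the next step. The function name `greedy_schedule` is hypothetical.

```python
def greedy_schedule(n):
    """Return step[i][j], the step at which P_i can compute cell (i, j), with p = n."""
    step = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            ready = []
            if j > 0:
                ready.append(step[i][j - 1] + 1)   # P_i has finished its previous cell
            if i > 0:
                ready.append(step[i - 1][j] + 1)   # North value arrives from P_{i-1}
            step[i][j] = max(ready, default=0)     # cell (0, 0) has no dependency
    return step


if __name__ == "__main__":
    for row in greedy_schedule(4):
        print(row)
```

All cells with the same value of i + j share a step, so each step processes one anti-diagonal and the whole domain takes 2n - 1 steps under these assumptions.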
Tracing Steps in Preceding Algorithm Re-read pp. 72-73 of CLR on send and receive for synchronous rings; see also slides 35-40, especially 37-38, in the slides on synchronous networks. Steps 1-3 are performed by all processors: each processor obtains an array A of n reals, its ID number, and the number of processors. Steps 4-6 are performed only by $P_0$. In Step 5, $P_0$ updates the cell $A_{0,0}$ in the top-left (NW) corner. In Step 6, $P_0$ sends the contents of A[0], the value of cell $A_{0,0}$, to its successor $P_1$. Steps 7-8 are executed only by $P_1$, since it is the only processor receiving a message. (Note this is not a blocking “receive”, as that would block all $P_i$ for i > 1.) In Step 8, $P_1$ stores the updated value of $A_{0,0}$ received from $P_0$ in address v. In Step 9, $P_1$ uses the value in v to update the value in A[0] of cell $A_{1,0}$.
Tracing Steps in Algorithm (cont.) Steps 12-13 are executed by $P_0$ to update the value A[j] of its next cell $A_{0,j}$ in the top row and send that value to $P_1$. Steps 14-16 are executed only by $P_{n-1}$, on the bottom row, to update the value A[j] of its next cell $A_{n-1,j}$. This value will be used by $P_{n-1}$ to update its next cell in the next round. $P_{n-1}$ does not send a value since its row is the last one. Only $P_i$ for 0 < i < n-1 can execute Steps 18-19. On the j-th iteration, the processors executing Steps 18-19 are further restricted to those receiving a message (i.e., a blocking “receive”). In Step 18, $P_i$ executes the send and the receive in parallel. In Step 19, $P_i$ uses the received value to update the value A[j] of its next cell $A_{i,j}$.
Algorithm for Fewer Processors Typically, there are far fewer processors than rows. Without loss of generality, assume p divides n. If n/p contiguous rows are assigned to each processor, then at least n/p steps must occur before $P_0$ can send a value to $P_1$. This situation repeats for each $P_i$ and $P_{i+1}$, severely restricting parallelism. Instead, we assign rows to processors cyclically, with row j assigned to $P_{j \bmod p}$. Each processor has the following declaration: var A: array[0..n/p-1, 0..n-1] of real; This is a contiguous array of rows, but these rows are not contiguous in the original domain. Algorithm 4.4 for the stencil application on a ring of processors using a cyclic data distribution is given next.
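The cyclic assignment of rows can be made concrete with two small helpers. The names `cyclic_owner` and `rows_of` are hypothetical, and the local row index (global row divided by p, rounded down) is an assumption about how the local array is laid out.

```python
def cyclic_owner(row, p):
    """Return (processor index, local row index) for a global row."""
    return row % p, row // p


def rows_of(q, n, p):
    """Global rows stored by P_q, in local order."""
    return list(range(q, n, p))


if __name__ == "__main__":
    n, p = 8, 4
    for q in range(p):
        print(q, rows_of(q, n, p))
```

With this layout, consecutive global rows always sit on consecutive processors of the ring, so a value computed in row j is needed by the immediate successor.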
Cyclic Stencil Algorithm Execution Time Let T(n,p) be the execution time of the preceding algorithm. We assume that “receiving” is blocking while “sending” is not. The sending of a message in step k is followed by the reception of that message in step k+1. The time needed to perform one algorithm step is $w + b + L$, where w is the time needed to update a cell, b is the time to communicate one cell value (the inverse of the bandwidth), and L is the startup cost. The computation terminates when $P_{p-1}$ finishes computing the rightmost cell value of its last row of cells.
Cyclic Stencil Algorithm Run Time (cont.) The number of algorithm steps is $p - 1 + n^2/p$: $P_{p-1}$ is idle for the first p-1 steps, and once it starts computing, it computes one cell per step until the computation is completed. There are $n^2$ cells, split evenly between the processors, so each processor is assigned $n^2/p$ cells. With w the time to update a cell, this yields $T(n,p) = (p - 1 + n^2/p)(w + b + L)$. An additional problem: the algorithm was designed to minimize the time between a cell update and its reception by the next processor. However, it performs many communications of small data items, and L can be orders of magnitude larger than b if the cell value is small.
Cyclic Stencil Algorithm Run Time (cont.) Stencil application characteristics: the cell value is often as small as an integer or a real number, and the computations to update the cells may involve only a few operations, so w may also be small. For many computations, most of the execution time could be due to the L term in the equation for T(n,p). Spending a large amount of time in communication overhead reduces the parallel efficiency considerably. Note that $E_p(n) = T_{seq}(n) / (p\,T_{par}(n)) = n^2 w / (p\,T_{par}(n))$. With a per-step cost of $w + b + L$ over $p - 1 + n^2/p$ steps, this reduces to $E_p(n) = \frac{n^2 w}{p\,(p - 1 + n^2/p)(w + b + L)}$, which tends to $w/(w + b + L)$ as n increases. Even for large n, the efficiency may therefore remain well below 1.
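The weight of the startup term L can be illustrated by evaluating the model numerically. The sketch assumes the step count p - 1 + n²/p and a per-step cost of w + b + L, with w the cell update time; the function names and the parameter values are arbitrary choices for illustration.

```python
def cyclic_time(n, p, w, b, L):
    """Modelled run time of the cyclic stencil algorithm; assumes p divides n."""
    return (p - 1 + n * n / p) * (w + b + L)


def cyclic_efficiency(n, p, w, b, L):
    """Efficiency n^2 w / (p * T(n, p)) under the same model."""
    return n * n * w / (p * cyclic_time(n, p, w, b, L))


if __name__ == "__main__":
    # With L eight times larger than w and b, efficiency stays near 1/10 even for large n.
    for n in (100, 1000, 10000):
        print(n, cyclic_efficiency(n, 10, w=1, b=1, L=8))
```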
Augmenting Granularity of Algorithm The communication overhead due to startup latencies can be decreased by sending fewer, larger messages. Let each processor compute k contiguous cell values in each row during each step, instead of just 1 value. To simplify the analysis, we assume k divides n, so each row has n/k segments of k contiguous cells. To avoid this assumption, one can let the last incomplete segment of a row spill over to the next row; the last segment of the last row may then have fewer than k elements. With this algorithm, cell values are communicated in bulk, k at a time.
Augmenting Granularity of Algorithm (cont.) Effect of communicating k items in bulk on the algorithm: larger values of k produce less communication overhead. However, larger values of k increase the time between a cell value update and its reception by the next processor. In this algorithm, processors will start computing cell values later, leading to more idle time for processors. This approach is illustrated in the next diagram.
Block-Cyclic Allocation of Cells A second way to reduce communication costs is to decrease the number of cell values that are being communicated. This is done by allocating blocks from r consecutive rows to processors cyclically. To simplify the analysis, we assume rp divides n. This idea of a block-cyclic allocation is very useful, and is illustrated below:
Block-Cyclic Allocation of Cells (cont.) Each processor computes k contiguous cells in each row from a block of r rows. At each step, each processor now computes rk cells. Note that blocks are $r \times k$ (r rows, k columns) in size. Note: only those values on the edges of the block have to be sent to other processors. This general approach can dramatically decrease the number of cells whose updates have to be sent to other processors. The algorithm for this allocation is similar to the one shown for the cyclic row assignment scheme in Figure 4.6: simply replace “rows” by “blocks of rows”. A processor calculates all cell values in its first block of rows in n/k steps of the algorithm.
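The block-cyclic assignment of rows can be sketched as an index mapping. The helper name `block_cyclic_owner` and the local row numbering (local blocks stored one after another) are assumptions made for illustration.

```python
def block_cyclic_owner(row, r, p):
    """Return (processor index, local row index) for a global row, with blocks of r rows."""
    block = row // r                       # index of the block of r consecutive rows
    owner = block % p                      # blocks are dealt out cyclically
    local_row = (block // p) * r + row % r
    return owner, local_row


if __name__ == "__main__":
    n, r, p = 8, 2, 2
    for row in range(n):
        print(row, block_cyclic_owner(row, r, p))
```

Only the last row of each block is needed by the next processor, which is why one of every r rows is communicated in this scheme.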
Block-Cyclic Allocations (cont.) Processor $P_{p-1}$ sends its k cell values to $P_0$ after p algorithm steps. $P_0$ needs these values to compute its second “block of rows”. As a result, we need $n \ge kp$ in order to keep processors busy. If $n > kp$, then processors must temporarily store received cell values while they finish computing their block of rows for the previous step. Recall that processors only have to exchange data at the boundaries between blocks. Using r rows per block, the amount of data communicated is r times smaller than in the previous algorithm.
Block-Cyclic Allocations (cont.) Processor activities in computing a block: receive k cell values from the predecessor, compute kr cell values, and send k cell values to the successor. Again we assume “receives” are blocking while “sends” are not. The time required to perform one step of the algorithm is $krw + kb + L$. The computation finishes when processor $P_{p-1}$ finishes computing the rightmost segment in its last block of rows of cells. $P_{p-1}$ computes one segment of a block row in each step.
Optimizing Block-Cyclic Allocations There are $n^2/(kr)$ such segments, so p processors can compute them in $n^2/(pkr)$ steps. It takes p-1 algorithm steps before processor $P_{p-1}$ can start doing any computation. Afterwards, $P_{p-1}$ computes one segment at each step. Overall, the algorithm runs for $p - 1 + n^2/(pkr)$ steps, each costing $krw + kb + L$, for a total execution time of $T(n,p,r,k) = (p - 1 + n^2/(pkr))(krw + kb + L)$. The efficiency of this algorithm is $E(n,p,r,k) = \frac{n^2 w}{p\,(p - 1 + n^2/(pkr))(krw + kb + L)}$.
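Optimizing Block-Cyclic Allocations (cont.) As n grows, the p-1 start-up steps become negligible, which gives the asymptotic efficiency $\frac{krw}{krw + kb + L} = \frac{w}{w + b/r + L/(rk)}$. Note that by increasing r and k, which reduces communication overhead, it is possible to achieve significantly higher efficiency. However, larger values of r and k also delay the moment at which the next processor can start computing. The text also outlines how to determine optimal values for k and r using a fair amount of mathematics.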
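The influence of r and k on efficiency can be explored numerically. The sketch below assumes the model stated in these slides, p - 1 + n²/(pkr) steps of cost krw + kb + L each; the closed form in `asymptotic_efficiency` is derived here from that model by letting n grow, and all function names and sample values are illustrative assumptions.

```python
def block_cyclic_time(n, p, r, k, w, b, L):
    """Modelled run time; assumes r*p divides n and k divides n."""
    steps = p - 1 + n * n / (p * k * r)
    return steps * (k * r * w + k * b + L)


def block_cyclic_efficiency(n, p, r, k, w, b, L):
    """Efficiency n^2 w / (p * T) under the same model."""
    return n * n * w / (p * block_cyclic_time(n, p, r, k, w, b, L))


def asymptotic_efficiency(r, k, w, b, L):
    """Limit of the efficiency as n grows: w / (w + b/r + L/(r*k))."""
    return w / (w + b / r + L / (r * k))


if __name__ == "__main__":
    for r, k in ((1, 1), (4, 8), (16, 32)):
        print(r, k, asymptotic_efficiency(r, k, w=1, b=1, L=8))
```

Setting r = k = 1 recovers the plain cyclic algorithm, so the printed values show how much of the startup cost is amortized by larger blocks in this model.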
Implementing Logical Topologies Designers of parallel algorithms should choose the logical topology. In Section 4.5, switching the topology from a unidirectional ring to a bidirectional ring made the program much simpler and lowered the communication time. Message passing libraries, such as the ones implemented for MPI, allow communication between any two processors using the Send and Recv functions. Using a logical topology restricts communications to only a few paths, which usually makes the algorithm design simpler. The logical topology can be implemented by creating a set of functions that allows each processor to identify its neighbors. A unidirectional ring only needs NextNode(P); a bidirectional ring would also need PreviousNode(P).
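The neighbor functions for a ring reduce to modular arithmetic on processor ranks. The sketch below assumes processors are numbered 0 to p-1; the Python names `next_node` and `previous_node` mirror NextNode and PreviousNode but take the ring size p as an explicit extra argument, which is an assumption of this example.

```python
def next_node(q, p):
    """Rank of the successor of processor q on a ring of p processors."""
    return (q + 1) % p


def previous_node(q, p):
    """Rank of the predecessor of processor q on a ring of p processors."""
    return (q - 1) % p


if __name__ == "__main__":
    p = 4
    print([(q, previous_node(q, p), next_node(q, p)) for q in range(p)])
```

An algorithm written only in terms of these two functions never needs to know how the ring is laid out on the physical network.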
Logical Topologies (cont.) Some systems (e.g., many modern supercomputers) provide many physical networks, but sometimes the creation of logical topologies is left to the user. A difficult task is matching the logical topology to the physical topology for the application. The common wisdom is that a logical topology that resembles the physical topology should produce good performance for the application. Sometimes the reason for using a logical topology is to hide the complexity of the physical topology. Often extensive benchmarking is required to determine the best topology for a given algorithm on a given platform. The logical topologies studied in this chapter and the next are known to be useful in the majority of scenarios.
Distributed vs Centralized Implementations In the CLR text, the data is already distributed among the processors at the start of the execution. One may wonder how the data was distributed to the processors and whether that should also be part of the algorithm. There are two approaches: distributed and centralized. In the centralized approach, one assumes that the data resides in a single “master” location, such as a single processor, or a file on a disk if the data size is large. The CLR book takes the distributed approach. The Akl book usually takes the distributed approach, but occasionally takes the centralized approach.
Distributed vs Centralized (cont.) An advantage of the centralized approach is that the library routine can choose the data distribution scheme to enforce. The best performance requires that the choice for each algorithm consider its underlying topology, which cannot be done in advance. Often the library developer will provide multiple versions with different data distributions, and the user can then choose the version that best fits the underlying platform. This choice may be difficult without extensive benchmarking. The main disadvantage of the centralized approach appears when the user applies successive algorithms to the same data: the data will be repeatedly distributed and undistributed. This causes most library developers to opt for the distributed option.
Summary of Algorithmic Principles (For Asynchronous Message Passing) Although used here only for the ring topology, the principles below are general. Unfortunately, they often conflict with each other. Sending data in bulk: reduces communication overhead due to network latencies. Sending data early: sending data as early as possible allows other processors to start computing as early as possible.
Summary of Algorithmic Principles (For Asynchronous Message Passing) -- Continued -- Overlapping communication and computation If both can be performed at the same time, the communication cost is often hidden Block Data Distribution Having processors assigned blocks of contiguous data elements reduces the amount of communication Cyclic Data Distribution Having data elements interleaved among processors makes it possible to reduce idle time and achieve a better load balance