CS 584
Dense Matrix Algorithms There are two types of Matrices Dense (Full) Sparse We will consider matrices that are Dense Square
Mapping Matrices How do we partition a matrix for parallel processing? There are two basic ways Striped partitioning Block partitioning
Striped Partitioning P0 P1 P2 P3 P0 P1 P2 P3 P0 P1 P2 P3 Block striping Cyclic striping
Block Partitioning P0P1 P2P3 P0P1P2P3 P4P5P6P7 P0P1P2P3 P4P5P6P7 Block checkerboard Cyclic checkerboard
Block vs. Striped Partitioning Scalability? Striping is limited to n processors Checkerboard is limited to n x n processors Complexity? Striping is easy Block could introduce more dependencies
Dense Matrix Algorithms Transposition Matrix - Vector Multiplication Matrix - Matrix Multiplication Solving Systems of Linear Equations Gaussian Elimination
Matrix Transposition The transpose of A is A T such that A T [i,j] = A[j,i] All elements below the diagonal move above the diagonal and vice-versa If we assume unit time to exchange: Transpose takes (n 2 - n)/2
Transpose Consider case where each processor has more than one element. Hypothesis: The transpose of the full matrix can be done by first sending the multiple element messages to their destination and then transposing the contents of the message.
Transpose (Striped Partitioning)
Transpose (Block Partitioning)
Matrix Multiplication
One Dimensional Decomposition Each processor "owns" black portion To compute the owned portion of the answer, each processor requires all of A P N ttPT ws 2 )1(
Two Dimensional Decomposition Requires less data per processor Algorithm can be performed stepwise.
Broadcast an A sub- matrix to the other processors in row. Compute Rotate the B sub- matrix upwards
Algorithm Set B' = B local for j = 0 to sqrt(P) -2 in each row I the [(I+j) mod sqrt(P)]th task broadcasts A' = A local to the other tasks in the row accumulate A' * B' send B' to upward neighbor done P N tt P PT ws log 1
Cannon’s Algorithm Broadcasting a submatrix to all who need it is costly. Suggestion: Shift both submatrices P N ttPT ws 2 12
Divide and Conquer A pp A pq A qp A qq B pp B pq B qp B qq P0 = App * Bpp P1 = Apq * Bpq P2 = App * Bpq P3 = Aqp * Bqq P4 = Aqp * Bpp P5 = Aqq * Bqp P6 = Aqp * Bpq P7 = Aqq * Bqq P0 + P1P2 + P3 P4 + P5P6 + P7 =x
Systems of Linear Equations A linear equation in n variables has the form A set of linear equations is called a system. A solution exists for a system iff the solution satisfies all equations in the system. Many scientific and engineering problems take this form. a 0 x 0 + a 1 x 1 + … + a n-1 x n-1 = b
Solving Systems of Equations Many such systems are large. Thousands of equations and unknowns a 0,0 x 0 + a 0,1 x 1 + … + a 0,n-1 x n-1 = b 0 a 1,0 x 0 + a 1,1 x 1 + … + a 1,n-1 x n-1 = b 1 a n-1,0 x 0 + a n-1,1 x 1 + … + a n-1,n-1 x n-1 = b n-1
Solving Systems of Equations A linear system of equations can be represented in matrix form a 0,0 a 0,1 … a 0,n-1 x 0 b 0 a 1,0 a 1,1 … a 1,n-1 x 1 b 1 a n-1,0 a n-1,1 … a n-1,n-1 x n-1 b n-1 = Ax = b
Solving Systems of Equations Solving a system of linear equations is done in two steps: Reduce the system to upper-triangular Use back-substitution to find solution These steps are performed on the system in matrix form. Gaussian Elimination, etc.
Solving Systems of Equations Reduce the system to upper-triangular form Use back-substitution a 0,0 a 0,1 … a 0,n-1 x 0 b 0 0 a 1,1 … a 1,n-1 x 1 b … a n-1,n-1 x n-1 b n-1 =
Reducing the System Gaussian elimination systematically eliminates variable x[k] from equations k+1 to n-1. Reduces the coefficients to zero This is done by subtracting a appropriate multiple of the k th equation from each of the equations k+1 to n-1
Procedure GaussianElimination(A, b, y) for k = 0 to n-1 /* Division Step */ for j = k + 1 to n - 1 A[k,j] = A[k,j] / A[k,k] y[k] = b[k] / A[k,k] A[k,k] = 1 /* Elimination Step */ for i = k + 1 to n - 1 for j = k + 1 to n - 1 A[i,j] = A[i,j] - A[i,k] * A[k,j] b[i] = b[i] - A[i,k] * y[k] A[i,k] = 0 endfor end
Parallelizing Gaussian Elim. Use domain decomposition Rowwise striping Division step requires no communication Elimination step requires a one-to-all broadcast for each equation. No agglomeration Initially map one to to each processor
Communication Analysis Consider the algorithm step by step Division step requires no communication Elimination step requires one-to-all bcast only bcast to other active processors only bcast active elements Final computation requires no communication.
Communication Analysis One-to-all broadcast log 2 q communications q = n - k - 1 active processors Message size q active processors q elements required T = (t s + t w q)log 2 q
Computation Analysis Division step q divisions Elimination step q multiplications and subtractions Assuming equal time --> 3q operations
Computation Analysis In each step, the active processor set is reduced by one resulting in: 2/)1( nnCompTime kn n k
Can we do better? Previous version is synchronous and parallelism is reduced at each step. Pipeline the algorithm Run the resulting algorithm on a linear array of processors. Communication is nearest-neighbor Results in O(n) steps of O(n) operations
Pipelined Gaussian Elim. Basic assumption: A processor does not need to wait until all processors have received a value to proceed. Algorithm If processor p has data for other processors, send the data to processor p+1 If processor p can do some computation using the data it has, do it. Otherwise, wait to receive data from processor p-1
Conclusion Using a striped partitioning method, it is natural to pipeline the Gaussian elimination algorithm to achieve best performance. Pipelined algorithms work best on a linear array of processors. Or something that can be linearly mapped Would it be better to block partition? How would it affect the algorithm?