Dense Matrix Algorithms
CS 524 – High-Performance Computing (Wi 2003/04), Asim Karim @ LUMS

Definitions
- p = number of processors (numbered 0 to p-1)
- n = dimension of array/matrix (indices 0 to n-1)
- q = number of blocks along one dimension (0 to q-1)
- tc = computation time for one flop
- ts = communication startup time
- tw = communication transfer time per word
- Interconnection network: crossbar switch with bidirectional links

Uniform Striped Partitioning [figure]

Checkerboard Partitioning [figure]

Matrix Transpose (MT)
A^T(i, j) = A(j, i) for all i and j
Sequential (in-place) algorithm:

do i = 0, n-1
  do j = i+1, n-1
    tmp = A(i, j)
    A(i, j) = A(j, i)
    A(j, i) = tmp
  end do
end do

Run time is (n² - n)/2 swaps, roughly n²/2 (one swap per element pair above the diagonal)

MT – Checkerboard Partitioning (1) [figure]

MT – Checkerboard Partitioning (2) [figure]

MT – Striped Partitioning [figure]

Matrix-Vector Multiplication (MVM)
MVM: y = Ax

do i = 0, n-1
  y(i) = 0
  do j = 0, n-1
    y(i) = y(i) + A(i, j)*x(j)
  end do
end do

The sequential algorithm requires n² multiplications and n² additions. Assuming one flop takes tc time, the sequential run time is 2tc n².

Row-wise Striping – p = n (1) [figure]

Row-wise Striping – p = n (2)
- Data partitioning: Pi has row i of A and element i of x
- Communication: each processor broadcasts its element of x
- Computation: each processor performs n multiplications and n additions
- Parallel run time: Tp = 2n tc + p(ts + tw) = 2n tc + n(ts + tw)
- The algorithm is cost-optimal, as both the parallel and serial costs are O(n²)

Row-wise Striping – p < n
- Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x
- Communication: each processor broadcasts its n/p elements of x
- Computation: each processor performs n²/p multiplications and additions
- Parallel run time: Tp = 2tc n²/p + p[ts + (n/p)tw]
- The algorithm is cost-optimal for p = O(n)

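A minimal MPI sketch of the row-striped algorithm above (it covers p = n as the special case of one row per process). The function name, buffer layout, and the assumption that p divides n are illustrative, not from the slides; the p broadcasts of the x pieces are realized with a single MPI_Allgather:

/* Row-striped matrix-vector multiply: each of the p processes owns
   n/p consecutive rows of A (row-major) and the matching n/p entries
   of x. Assumes p divides n. */
#include <mpi.h>
#include <stdlib.h>

void mvm_rowstriped(int n, const double *Arows, const double *xloc,
                    double *yloc, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* Communication: assemble the full x on every process
       (equivalent to every process broadcasting its piece). */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Computation: n/p dot products of length n, i.e. n²/p
       multiplications and additions per process. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += Arows[i * n + j] * x[j];
        yloc[i] = sum;
    }
    free(x);
}
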
Checkerboard Partitioning – p = n² (1) [figure]

Checkerboard Partitioning – p = n² (2)
- Data partitioning: each processor has one element of A; only processors in the last column have an element of x
- Communication:
  - one element of x is sent from each last-column processor to the diagonal processor of its row
  - broadcast from the diagonal processor to all processors in its column
  - global sum of the partial y values from all processors in a row to the last processor
- Computation: one multiplication and one addition per processor
- Parallel run time: Tp = 2tc + 3(ts + tw)
- The algorithm is cost-optimal, as both serial and parallel costs are O(n²)
- On a bus network, the communication time is 3n(ts + tw); the system is then not cost-optimal, as the cost is O(n³)

Checkerboard Partitioning – p < n²
- Data partitioning: each processor has an (n/√p) × (n/√p) block of A; processors in the last column have n/√p elements of x
- Communication:
  - n/√p elements of x from each last-column processor to the diagonal processor of its row
  - broadcast from the diagonal processor to all processors in its column
  - global sum of the partial y values from all processors in a row to the last processor
- Computation: n²/p multiplications and additions per processor
- Parallel run time: Tp = 2tc n²/p + 3(ts + tw n/√p)
- The algorithm is cost-optimal only if p = O(n²)

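The four communication/computation steps above map directly onto MPI row and column subcommunicators. Everything in this sketch (names, the MPI_Comm_split layout, the assumptions that p is a perfect square and √p divides n) is illustrative:

/* Checkerboard matrix-vector multiply on a q x q grid, q = sqrt(p).
   Process (i,j) owns an nb x nb block of A, nb = n/q; the last
   column owns the pieces of x, and y is produced in the last column.
   xpiece must be an nb-sized buffer on every process. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

void mvm_checkerboard(int n, const double *Ablk, double *xpiece,
                      double *ypiece, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q, nb = n / q;

    MPI_Comm rowc, colc;                 /* rank in rowc = col, in colc = row */
    MPI_Comm_split(comm, row, col, &rowc);
    MPI_Comm_split(comm, col, row, &colc);

    /* 1. Last column sends its piece of x to the diagonal process of its row. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(xpiece, nb, MPI_DOUBLE, row, 0, rowc);
    if (row == col && col != q - 1)
        MPI_Recv(xpiece, nb, MPI_DOUBLE, q - 1, 0, rowc, MPI_STATUS_IGNORE);

    /* 2. The diagonal process of column j broadcasts the piece down the column. */
    MPI_Bcast(xpiece, nb, MPI_DOUBLE, col, colc);

    /* 3. Local block product: n²/p multiplications and additions. */
    double *partial = malloc(nb * sizeof(double));
    for (int i = 0; i < nb; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < nb; j++)
            partial[i] += Ablk[i * nb + j] * xpiece[j];
    }

    /* 4. Row-wise global sum of the partial results into the last column. */
    MPI_Reduce(partial, ypiece, nb, MPI_DOUBLE, MPI_SUM, q - 1, rowc);

    free(partial);
    MPI_Comm_free(&rowc);
    MPI_Comm_free(&colc);
}
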
Matrix-Matrix Multiplication (MMM)
C = A × B, where A, B, C are n × n matrices
Block matrix multiplication performs the algebraic operations on submatrices (blocks) of the matrices; this view of MMM aids parallelization. With q × q blocks of dimension n/q:

do i = 0, q-1
  do j = 0, q-1
    do k = 0, q-1
      Ci,j = Ci,j + Ai,k × Bk,j
    end do
  end do
end do

Number of multiplications and additions = n³ each. Sequential run time = 2tc n³.

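The block view translates directly into sequential C, shown below as a sketch (row-major storage and q dividing n are assumed). The three outer loops are the slide's pseudocode; the inner loops expand one block operation Ci,j = Ci,j + Ai,k × Bk,j:

/* Block matrix-matrix multiply: C = C + A*B for n x n row-major
   matrices partitioned into q x q blocks of size nb = n/q.
   C should be zeroed by the caller first. */
void mmm_blocked(int n, int q, const double *A, const double *B, double *C)
{
    int nb = n / q;
    for (int bi = 0; bi < q; bi++)          /* block row index i */
        for (int bj = 0; bj < q; bj++)      /* block column index j */
            for (int bk = 0; bk < q; bk++)  /* one block op: Ci,j += Ai,k * Bk,j */
                for (int i = bi * nb; i < (bi + 1) * nb; i++)
                    for (int k = bk * nb; k < (bk + 1) * nb; k++) {
                        double a = A[i * n + k];
                        for (int j = bj * nb; j < (bj + 1) * nb; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
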
Checkerboard Partitioning – q = √p
- Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension (n/√p) × (n/√p)
- Communication: each processor broadcasts its submatrix Ai,j to all processors in its row, and its submatrix Bi,j to all processors in its column
- Computation: each processor performs n · (n/√p) · (n/√p) = n³/p multiplications and additions
- Parallel run time: Tp = 2tc n³/p + 2√p[ts + (n²/p)tw]
- The algorithm is cost-optimal only if p = O(n²)

Cannon's Algorithm (1)
- A memory-efficient version of the checkerboard-partitioned block MMM: at any time, each processor holds only one block of A and one block of B
- Blocks are cycled after each computation step in such a way that after √p steps the multiplication is complete for every Ci,j
- The initial distribution of the matrices is the same as in checkerboard partitioning
- Communication:
  - Initial alignment: block Ai,j is moved left by i steps (with wraparound); block Bi,j is moved up by j steps (with wraparound)
  - In each of the subsequent √p - 1 steps: each block of A is moved left by one step and each block of B up by one step (both with wraparound)
- After √p computation and communication steps, the multiplication is complete for Ci,j

Cannon's Algorithm (2) [figure]

Cannon's Algorithm (3) [figure]

Cannon's Algorithm (4)
- Communication:
  - √p point-to-point communications of size n²/p along rows
  - √p point-to-point communications of size n²/p along columns
- Computation: over the √p steps, each processor performs n³/p multiplications and additions
- Parallel run time: Tp = 2tc n³/p + 2√p[ts + (n²/p)tw]
- The algorithm is cost-optimal if p = O(n²)

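A compact MPI sketch of Cannon's algorithm as described on the preceding slides. The cartesian-grid setup, function names, and assumptions (p a perfect square, √p dividing n, row-major blocks, C zeroed on entry) are illustrative; each MPI_Sendrecv_replace is one of the point-to-point block shifts:

/* Cannon's algorithm on a q x q process grid, q = sqrt(p). Each
   process holds one nb x nb block of A, B, and C, nb = n/q. */
#include <mpi.h>
#include <math.h>

static void block_multiply_add(int nb, const double *A, const double *B,
                               double *C)
{
    /* C += A*B for one nb x nb block pair. */
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++) {
            double a = A[i * nb + k];
            for (int j = 0; j < nb; j++)
                C[i * nb + j] += a * B[k * nb + j];
        }
}

void cannon_mmm(int nb, double *Ablk, double *Bblk, double *Cblk,
                MPI_Comm comm)
{
    int p, rank, coords[2], src, dst;
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);

    /* A 2-D torus, so every shift wraps around. */
    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);  /* coords[0] = i, coords[1] = j */

    /* Initial alignment: Ai,j moves left by i, Bi,j moves up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(Ablk, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);

    /* sqrt(p) compute steps; between them, sqrt(p)-1 single shifts:
       A one step left and B one step up, both with wraparound. */
    for (int step = 0; step < q; step++) {
        block_multiply_add(nb, Ablk, Bblk, Cblk);
        if (step < q - 1) {
            MPI_Cart_shift(grid, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(Ablk, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
    }
    MPI_Comm_free(&grid);
}

After the loop, the A and B blocks are left rotated relative to their original positions; a complete implementation would restore them with one more alignment step.
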
Fox's Algorithm (1)
- Another memory-efficient version of the checkerboard-partitioned block MMM
- The initial distribution of the matrices is the same as in checkerboard partitioning; at any time, each processor holds one block of A and one block of B
- Steps (repeated √p times):
  1. Broadcast Ai,i to all processors in row i
  2. Multiply the received block of A with the resident block of B
  3. Send the block of B up one step (with wraparound)
  4. Select block Ai,(j+1) mod √p (the block one to the right of the previous one, with wraparound) and broadcast it to all processors in the row; go to step 2

Fox's Algorithm (2) [figure]

Fox's Algorithm (3)
- Communication:
  - √p broadcasts of size n²/p along rows
  - √p point-to-point communications of size n²/p along columns
- Computation: each processor performs n³/p multiplications and additions
- Parallel run time: Tp = 2tc n³/p + 2√p[ts + (n²/p)tw]
- The algorithm is cost-optimal if p = O(n²)

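Fox's algorithm in the same MPI style, again as a sketch (row/column communicators via MPI_Comm_split, p a perfect square, names illustrative). The root index (row + s) mod q implements the "select the next diagonal block" rule; the three labeled steps match the slide:

/* Fox's algorithm on a q x q grid, q = sqrt(p). Each process holds
   one nb x nb row-major block of A, B, and C (C zeroed on entry). */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

void fox_mmm(int nb, const double *Ablk, double *Bblk, double *Cblk,
             MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q;

    MPI_Comm rowc, colc;                 /* rank in rowc = col, in colc = row */
    MPI_Comm_split(comm, row, col, &rowc);
    MPI_Comm_split(comm, col, row, &colc);

    double *Abuf = malloc(nb * nb * sizeof(double));

    for (int s = 0; s < q; s++) {
        /* Steps 1/4: in row i, the owner of block Ai,(i+s) mod q
           broadcasts it along the row. */
        int root = (row + s) % q;
        if (col == root)
            memcpy(Abuf, Ablk, nb * nb * sizeof(double));
        MPI_Bcast(Abuf, nb * nb, MPI_DOUBLE, root, rowc);

        /* Step 2: multiply the received A block by the resident B block. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++) {
                double a = Abuf[i * nb + k];
                for (int j = 0; j < nb; j++)
                    Cblk[i * nb + j] += a * Bblk[k * nb + j];
            }

        /* Step 3: send B up one step in the column, with wraparound. */
        int dst = (row - 1 + q) % q, src = (row + 1) % q;
        MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                             colc, MPI_STATUS_IGNORE);
    }
    free(Abuf);
    MPI_Comm_free(&rowc);
    MPI_Comm_free(&colc);
}
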
Solving a System of Linear Equations
- System of linear equations: Ax = b
  - A is a dense n × n matrix of coefficients
  - b is an n × 1 vector of right-hand-side values
  - x is an n × 1 vector of unknowns
- Solving for x is usually done in two stages:
  - First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix [U(i, j) = 0 if i > j; otherwise U(i, j) ≠ 0, and U(i, i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination.
  - Second, the unknowns are solved in reverse order, starting from x(n-1). This stage is called back-substitution.

Gaussian Elimination (1)

do k = 0, n-1
  do j = k+1, n-1
    A(k, j) = A(k, j)/A(k, k)
  end do
  y(k) = b(k)/A(k, k)
  A(k, k) = 1
  do i = k+1, n-1
    do j = k+1, n-1
      A(i, j) = A(i, j) - A(i, k)*A(k, j)
    end do
    b(i) = b(i) - A(i, k)*y(k)
    A(i, k) = 0
  end do
end do

Gaussian Elimination (2)
- Computations:
  - approximately n²/2 divisions
  - approximately n³/3 - n²/2 multiplications and subtractions
- Approximate sequential run time: Ts = 2tc n³/3

Striped Partitioning – p = n (1)
- Data partitioning: each processor has one row of matrix A
- Communication during iteration k (outermost loop): broadcast of the active part of the kth row (size n - k - 1) to processors k+1 through n-1
- Computation during iteration k (outermost loop):
  - n - k - 1 divisions at processor Pk
  - n - k - 1 multiplications and subtractions at each processor Pi (k < i < n)
- Parallel run time: Tp = (3/2)n(n-1)tc + n ts + 0.5n(n-1)tw
- The algorithm is cost-optimal, as both serial and parallel costs are O(n³)

Striped Partitioning – p = n (2) [figure]

Striped Partitioning – p = n (3) [figure]

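Before moving to the pipelined version, here is a minimal MPI sketch of the synchronous striped algorithm just described, for p = n: process k owns row k of A and entry k of b, and iteration k broadcasts the active part of row k together with y(k). No pivoting is done; the packing scheme and names are illustrative:

/* Gaussian elimination, 1-D striping with p = n: process k owns row k
   of A (length n) and entry k of b. On return, row holds row k of U
   and *yk holds y(k). */
#include <mpi.h>
#include <stdlib.h>

void ge_striped(int n, double *row, double *bk, double *yk, MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);
    double *buf = malloc(n * sizeof(double));   /* active row part + y(k) */

    for (int k = 0; k < n; k++) {
        if (me == k) {
            /* Division step on the active part of row k. */
            for (int j = k + 1; j < n; j++) row[j] /= row[k];
            *yk = *bk / row[k];
            row[k] = 1.0;
            for (int j = k + 1; j < n; j++) buf[j - k - 1] = row[j];
            buf[n - k - 1] = *yk;
        }
        /* Broadcast the n-k-1 active entries of row k plus y(k). */
        MPI_Bcast(buf, n - k, MPI_DOUBLE, k, comm);
        if (me > k) {
            /* Elimination step on my row. */
            for (int j = k + 1; j < n; j++)
                row[j] -= row[k] * buf[j - k - 1];
            *bk -= row[k] * buf[n - k - 1];
            row[k] = 0.0;
        }
    }
    free(buf);
}
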
Pipelined Version (Striped Partitioning)
- In the non-pipelined (synchronous) version, the outer loop k is executed strictly in order:
  - while Pk is performing the division step, all other processors are idle
  - during the elimination step, only processors k+1 through n-1 are active; the rest are idle
- In the pipelined version, the division step, communication, and elimination step are overlapped. Each processor communicates if it has data to communicate, computes if it has computations to do, or waits if it can do neither.
- The pipelined version is cost-optimal on linear array, mesh, and hypercube interconnection networks, where communicating processors are directly connected.

Pipelined Version (2) [figure]

Pipelined Version (3) [figure]

Striped Partitioning – p < n (1) [figure]

Striped Partitioning – p < n (2) [figure]

Checkerboard Partitioning – p = n² (1) [figure]

Checkerboard Partitioning – p = n² (2)
- Data partitioning: Pi,j has element A(i, j) of matrix A
- Communication during iteration k (outermost loop):
  - broadcast of A(k, k) to processors (k, k+1) through (k, n-1) in the kth row
  - broadcast of the modified A(i, k) along the ith row, for k ≤ i < n
  - broadcast of the modified A(k, j) along the jth column, for k ≤ j < n
- Computation during iteration k (outermost loop):
  - one division at Pk,k
  - one multiplication and subtraction at each processor Pi,j (k < i, j < n)
- Parallel run time: Tp = (3/2)n(n-1)tc + n[ts + 0.5(n-1)tw]
- The algorithm is cost-optimal, as serial and parallel costs are O(n³)

Back-Substitution
Solution of Ux = y, where U is a unit upper triangular matrix:

do k = n-1, 0, -1
  x(k) = y(k)
  do i = k-1, 0, -1
    y(i) = y(i) - x(k)*U(i, k)
  end do
end do

Computation: approximately n²/2 multiplications and subtractions.
The parallel algorithm is similar to that for the Gaussian elimination stage.

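The same loop as runnable C, shown as a sketch (row-major U is assumed; y is copied so the caller's vector is preserved):

/* Back-substitution for Ux = y with unit upper-triangular U (row-major). */
#include <stdlib.h>
#include <string.h>

void back_substitute(int n, const double *U, const double *y, double *x)
{
    double *w = malloc(n * sizeof(double));  /* working copy of y */
    memcpy(w, y, n * sizeof(double));
    for (int k = n - 1; k >= 0; k--) {
        x[k] = w[k];                         /* U(k, k) = 1 */
        for (int i = k - 1; i >= 0; i--)
            w[i] -= x[k] * U[i * n + k];     /* update remaining y entries */
    }
    free(w);
}
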