Dense Matrix Algorithms
CS 524 – High-Performance Computing (Wi 2003/04), Asim Karim @ LUMS

Definitions
- p = number of processors (numbered 0 to p-1)
- n = dimension of array/matrix (indices 0 to n-1)
- q = number of blocks along one dimension (0 to q-1)
- tc = computation time for one flop
- ts = communication startup time
- tw = communication transfer time per word
- Interconnection network: crossbar switch with bidirectional links

Uniform Striped Partitioning [figure]

Checkerboard Partitioning [figure]

Matrix Transpose (MT)
A^T(i, j) = A(j, i) for all i and j
Sequential (in-place) algorithm:

do i = 0, n-1
  do j = i+1, n-1
    tmp = A(i, j)
    A(i, j) = A(j, i)
    A(j, i) = tmp
  end do
end do

Run time is (n² - n)/2 swaps, roughly n²/2 (one swap per element pair above the diagonal)

MT – Checkerboard Partitioning (1) [figure]

MT – Checkerboard Partitioning (2) [figure]

MT – Striped Partitioning [figure]

Matrix-Vector Multiplication (MVM)
MVM: y = Ax

do i = 0, n-1
  y(i) = 0
  do j = 0, n-1
    y(i) = y(i) + A(i, j)*x(j)
  end do
end do

The sequential algorithm requires n² multiplications and n² additions. Assuming one flop takes tc time, the sequential run time is 2tc n².

Row-wise Striping – p = n (1) [figure]

Row-wise Striping – p = n (2)
- Data partitioning: Pi has row i of A and element i of x
- Communication: each processor broadcasts its element of x
- Computation: each processor performs n multiplications and n additions
- Parallel run time: Tp = 2n tc + p(ts + tw) = 2n tc + n(ts + tw)
- The algorithm is cost-optimal, as both the parallel and serial costs are O(n²)

Row-wise Striping – p < n
- Data partitioning: each processor has n/p rows of A and the corresponding n/p elements of x
- Communication: each processor broadcasts its n/p elements of x
- Computation: each processor performs n²/p multiplications and additions
- Parallel run time: Tp = 2tc n²/p + p[ts + (n/p)tw]
- The algorithm is cost-optimal for p = O(n)

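A minimal MPI sketch of the row-striped algorithm above (it covers p = n as the special case of one row per process). The function name, buffer layout, and the assumption that p divides n are illustrative, not from the slides; the p broadcasts of the x pieces are realized with a single MPI_Allgather:

/* Row-striped matrix-vector multiply: each of the p processes owns
   n/p consecutive rows of A (row-major) and the matching n/p entries
   of x. Assumes p divides n. */
#include <mpi.h>
#include <stdlib.h>

void mvm_rowstriped(int n, const double *Arows, const double *xloc,
                    double *yloc, MPI_Comm comm)
{
    int p;
    MPI_Comm_size(comm, &p);
    int nloc = n / p;

    /* Communication: assemble the full x on every process
       (equivalent to every process broadcasting its piece). */
    double *x = malloc(n * sizeof(double));
    MPI_Allgather(xloc, nloc, MPI_DOUBLE, x, nloc, MPI_DOUBLE, comm);

    /* Computation: n/p dot products of length n, i.e. n²/p
       multiplications and additions per process. */
    for (int i = 0; i < nloc; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += Arows[i * n + j] * x[j];
        yloc[i] = sum;
    }
    free(x);
}
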
Checkerboard Partitioning – p = n² (1) [figure]

Checkerboard Partitioning – p = n² (2)
- Data partitioning: each processor has one element of A; only processors in the last column have an element of x
- Communication:
  - one element of x is sent from each last-column processor to the diagonal processor of its row
  - broadcast from the diagonal processor to all processors in its column
  - global sum of the partial y values from all processors in a row to the last processor
- Computation: one multiplication and one addition per processor
- Parallel run time: Tp = 2tc + 3(ts + tw)
- The algorithm is cost-optimal, as both serial and parallel costs are O(n²)
- On a bus network, the communication time is 3n(ts + tw); the system is then not cost-optimal, as the cost is O(n³)

Checkerboard Partitioning – p < n²
- Data partitioning: each processor has an (n/√p) × (n/√p) block of A; processors in the last column have n/√p elements of x
- Communication:
  - n/√p elements of x from each last-column processor to the diagonal processor of its row
  - broadcast from the diagonal processor to all processors in its column
  - global sum of the partial y values from all processors in a row to the last processor
- Computation: n²/p multiplications and additions per processor
- Parallel run time: Tp = 2tc n²/p + 3(ts + tw n/√p)
- The algorithm is cost-optimal only if p = O(n²)

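The four communication/computation steps above map directly onto MPI row and column subcommunicators. Everything in this sketch (names, the MPI_Comm_split layout, the assumptions that p is a perfect square and √p divides n) is illustrative:

/* Checkerboard matrix-vector multiply on a q x q grid, q = sqrt(p).
   Process (i,j) owns an nb x nb block of A, nb = n/q; the last
   column owns the pieces of x, and y is produced in the last column.
   xpiece must be an nb-sized buffer on every process. */
#include <mpi.h>
#include <stdlib.h>
#include <math.h>

void mvm_checkerboard(int n, const double *Ablk, double *xpiece,
                      double *ypiece, MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q, nb = n / q;

    MPI_Comm rowc, colc;                 /* rank in rowc = col, in colc = row */
    MPI_Comm_split(comm, row, col, &rowc);
    MPI_Comm_split(comm, col, row, &colc);

    /* 1. Last column sends its piece of x to the diagonal process of its row. */
    if (col == q - 1 && row != q - 1)
        MPI_Send(xpiece, nb, MPI_DOUBLE, row, 0, rowc);
    if (row == col && col != q - 1)
        MPI_Recv(xpiece, nb, MPI_DOUBLE, q - 1, 0, rowc, MPI_STATUS_IGNORE);

    /* 2. The diagonal process of column j broadcasts the piece down the column. */
    MPI_Bcast(xpiece, nb, MPI_DOUBLE, col, colc);

    /* 3. Local block product: n²/p multiplications and additions. */
    double *partial = malloc(nb * sizeof(double));
    for (int i = 0; i < nb; i++) {
        partial[i] = 0.0;
        for (int j = 0; j < nb; j++)
            partial[i] += Ablk[i * nb + j] * xpiece[j];
    }

    /* 4. Row-wise global sum of the partial results into the last column. */
    MPI_Reduce(partial, ypiece, nb, MPI_DOUBLE, MPI_SUM, q - 1, rowc);

    free(partial);
    MPI_Comm_free(&rowc);
    MPI_Comm_free(&colc);
}
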
Matrix-Matrix Multiplication (MMM)
C = A × B, where A, B, C are n × n matrices
Block matrix multiplication performs the algebraic operations on submatrices (blocks) of the matrices; this view of MMM aids parallelization. With q × q blocks of dimension n/q:

do i = 0, q-1
  do j = 0, q-1
    do k = 0, q-1
      Ci,j = Ci,j + Ai,k × Bk,j
    end do
  end do
end do

Number of multiplications and additions = n³ each. Sequential run time = 2tc n³.

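The block view translates directly into sequential C, shown below as a sketch (row-major storage and q dividing n are assumed). The three outer loops are the slide's pseudocode; the inner loops expand one block operation Ci,j = Ci,j + Ai,k × Bk,j:

/* Block matrix-matrix multiply: C = C + A*B for n x n row-major
   matrices partitioned into q x q blocks of size nb = n/q.
   C should be zeroed by the caller first. */
void mmm_blocked(int n, int q, const double *A, const double *B, double *C)
{
    int nb = n / q;
    for (int bi = 0; bi < q; bi++)          /* block row index i */
        for (int bj = 0; bj < q; bj++)      /* block column index j */
            for (int bk = 0; bk < q; bk++)  /* one block op: Ci,j += Ai,k * Bk,j */
                for (int i = bi * nb; i < (bi + 1) * nb; i++)
                    for (int k = bk * nb; k < (bk + 1) * nb; k++) {
                        double a = A[i * n + k];
                        for (int j = bj * nb; j < (bj + 1) * nb; j++)
                            C[i * n + j] += a * B[k * n + j];
                    }
}
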
Checkerboard Partitioning – q = √p
- Data partitioning: Pi,j has blocks Ai,j and Bi,j of A and B, each of dimension (n/√p) × (n/√p)
- Communication: each processor broadcasts its submatrix Ai,j to all processors in its row, and its submatrix Bi,j to all processors in its column
- Computation: each processor performs n · (n/√p) · (n/√p) = n³/p multiplications and additions
- Parallel run time: Tp = 2tc n³/p + 2√p[ts + (n²/p)tw]
- The algorithm is cost-optimal only if p = O(n²)

Cannon's Algorithm (1)
- A memory-efficient version of the checkerboard-partitioned block MMM: at any time, each processor holds only one block of A and one block of B
- Blocks are cycled after each computation step in such a way that after √p steps the multiplication is complete for every Ci,j
- The initial distribution of the matrices is the same as in checkerboard partitioning
- Communication:
  - Initial alignment: block Ai,j is moved left by i steps (with wraparound); block Bi,j is moved up by j steps (with wraparound)
  - In each of the subsequent √p - 1 steps: each block of A is moved left by one step and each block of B up by one step (both with wraparound)
- After √p computation and communication steps, the multiplication is complete for Ci,j

Cannon's Algorithm (2) [figure]

Cannon's Algorithm (3) [figure]

Cannon's Algorithm (4)
- Communication:
  - √p point-to-point communications of size n²/p along rows
  - √p point-to-point communications of size n²/p along columns
- Computation: over the √p steps, each processor performs n³/p multiplications and additions
- Parallel run time: Tp = 2tc n³/p + 2√p[ts + (n²/p)tw]
- The algorithm is cost-optimal if p = O(n²)

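A compact MPI sketch of Cannon's algorithm as described on the preceding slides. The cartesian-grid setup, function names, and assumptions (p a perfect square, √p dividing n, row-major blocks, C zeroed on entry) are illustrative; each MPI_Sendrecv_replace is one of the point-to-point block shifts:

/* Cannon's algorithm on a q x q process grid, q = sqrt(p). Each
   process holds one nb x nb block of A, B, and C, nb = n/q. */
#include <mpi.h>
#include <math.h>

static void block_multiply_add(int nb, const double *A, const double *B,
                               double *C)
{
    /* C += A*B for one nb x nb block pair. */
    for (int i = 0; i < nb; i++)
        for (int k = 0; k < nb; k++) {
            double a = A[i * nb + k];
            for (int j = 0; j < nb; j++)
                C[i * nb + j] += a * B[k * nb + j];
        }
}

void cannon_mmm(int nb, double *Ablk, double *Bblk, double *Cblk,
                MPI_Comm comm)
{
    int p, rank, coords[2], src, dst;
    MPI_Comm_size(comm, &p);
    int q = (int)(sqrt((double)p) + 0.5);

    /* A 2-D torus, so every shift wraps around. */
    int dims[2] = {q, q}, periods[2] = {1, 1};
    MPI_Comm grid;
    MPI_Cart_create(comm, 2, dims, periods, 0, &grid);
    MPI_Comm_rank(grid, &rank);
    MPI_Cart_coords(grid, rank, 2, coords);  /* coords[0] = i, coords[1] = j */

    /* Initial alignment: Ai,j moves left by i, Bi,j moves up by j. */
    MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
    MPI_Sendrecv_replace(Ablk, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
    MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, dst, 0, src, 0, grid,
                         MPI_STATUS_IGNORE);

    /* sqrt(p) compute steps; between them, sqrt(p)-1 single shifts:
       A one step left and B one step up, both with wraparound. */
    for (int step = 0; step < q; step++) {
        block_multiply_add(nb, Ablk, Bblk, Cblk);
        if (step < q - 1) {
            MPI_Cart_shift(grid, 1, -1, &src, &dst);
            MPI_Sendrecv_replace(Ablk, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
            MPI_Cart_shift(grid, 0, -1, &src, &dst);
            MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                                 grid, MPI_STATUS_IGNORE);
        }
    }
    MPI_Comm_free(&grid);
}

After the loop, the A and B blocks are left rotated relative to their original positions; a complete implementation would restore them with one more alignment step.
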
Fox's Algorithm (1)
- Another memory-efficient version of the checkerboard-partitioned block MMM
- The initial distribution of the matrices is the same as in checkerboard partitioning; at any time, each processor holds one block of A and one block of B
- Steps (repeated √p times):
  1. Broadcast Ai,i to all processors in row i
  2. Multiply the received block of A with the resident block of B
  3. Send the block of B up one step (with wraparound)
  4. Select block Ai,(j+1) mod √p (the block one to the right of the previous one, with wraparound) and broadcast it to all processors in the row; go to step 2

Fox's Algorithm (2) [figure]

Fox's Algorithm (3)
- Communication:
  - √p broadcasts of size n²/p along rows
  - √p point-to-point communications of size n²/p along columns
- Computation: each processor performs n³/p multiplications and additions
- Parallel run time: Tp = 2tc n³/p + 2√p[ts + (n²/p)tw]
- The algorithm is cost-optimal if p = O(n²)

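Fox's algorithm in the same MPI style, again as a sketch (row/column communicators via MPI_Comm_split, p a perfect square, names illustrative). The root index (row + s) mod q implements the "select the next diagonal block" rule; the three labeled steps match the slide:

/* Fox's algorithm on a q x q grid, q = sqrt(p). Each process holds
   one nb x nb row-major block of A, B, and C (C zeroed on entry). */
#include <mpi.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

void fox_mmm(int nb, const double *Ablk, double *Bblk, double *Cblk,
             MPI_Comm comm)
{
    int p, rank;
    MPI_Comm_size(comm, &p);
    MPI_Comm_rank(comm, &rank);
    int q = (int)(sqrt((double)p) + 0.5);
    int row = rank / q, col = rank % q;

    MPI_Comm rowc, colc;                 /* rank in rowc = col, in colc = row */
    MPI_Comm_split(comm, row, col, &rowc);
    MPI_Comm_split(comm, col, row, &colc);

    double *Abuf = malloc(nb * nb * sizeof(double));

    for (int s = 0; s < q; s++) {
        /* Steps 1/4: in row i, the owner of block Ai,(i+s) mod q
           broadcasts it along the row. */
        int root = (row + s) % q;
        if (col == root)
            memcpy(Abuf, Ablk, nb * nb * sizeof(double));
        MPI_Bcast(Abuf, nb * nb, MPI_DOUBLE, root, rowc);

        /* Step 2: multiply the received A block by the resident B block. */
        for (int i = 0; i < nb; i++)
            for (int k = 0; k < nb; k++) {
                double a = Abuf[i * nb + k];
                for (int j = 0; j < nb; j++)
                    Cblk[i * nb + j] += a * Bblk[k * nb + j];
            }

        /* Step 3: send B up one step in the column, with wraparound. */
        int dst = (row - 1 + q) % q, src = (row + 1) % q;
        MPI_Sendrecv_replace(Bblk, nb * nb, MPI_DOUBLE, dst, 0, src, 0,
                             colc, MPI_STATUS_IGNORE);
    }
    free(Abuf);
    MPI_Comm_free(&rowc);
    MPI_Comm_free(&colc);
}
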
Solving a System of Linear Equations
- System of linear equations: Ax = b
  - A is a dense n × n matrix of coefficients
  - b is an n × 1 vector of right-hand-side values
  - x is an n × 1 vector of unknowns
- Solving for x is usually done in two stages:
  - First, Ax = b is reduced to Ux = y, where U is a unit upper triangular matrix [U(i, j) = 0 if i > j; otherwise U(i, j) ≠ 0, and U(i, i) = 1 for 0 ≤ i < n]. This stage is called Gaussian elimination.
  - Second, the unknowns are solved in reverse order, starting from x(n-1). This stage is called back-substitution.

Gaussian Elimination (1)

do k = 0, n-1
  do j = k+1, n-1
    A(k, j) = A(k, j)/A(k, k)
  end do
  y(k) = b(k)/A(k, k)
  A(k, k) = 1
  do i = k+1, n-1
    do j = k+1, n-1
      A(i, j) = A(i, j) - A(i, k)*A(k, j)
    end do
    b(i) = b(i) - A(i, k)*y(k)
    A(i, k) = 0
  end do
end do

Gaussian Elimination (2)
- Computations:
  - approximately n²/2 divisions
  - approximately n³/3 - n²/2 multiplications and subtractions
- Approximate sequential run time: Ts = 2tc n³/3

Striped Partitioning – p = n (1)
- Data partitioning: each processor has one row of matrix A
- Communication during iteration k (outermost loop): broadcast of the active part of the kth row (size n - k - 1) to processors k+1 through n-1
- Computation during iteration k (outermost loop):
  - n - k - 1 divisions at processor Pk
  - n - k - 1 multiplications and subtractions at each processor Pi (k < i < n)
- Parallel run time: Tp = (3/2)n(n-1)tc + n ts + 0.5n(n-1)tw
- The algorithm is cost-optimal, as both serial and parallel costs are O(n³)

Striped Partitioning – p = n (2) [figure]

Striped Partitioning – p = n (3) [figure]

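Before moving to the pipelined version, here is a minimal MPI sketch of the synchronous striped algorithm just described, for p = n: process k owns row k of A and entry k of b, and iteration k broadcasts the active part of row k together with y(k). No pivoting is done; the packing scheme and names are illustrative:

/* Gaussian elimination, 1-D striping with p = n: process k owns row k
   of A (length n) and entry k of b. On return, row holds row k of U
   and *yk holds y(k). */
#include <mpi.h>
#include <stdlib.h>

void ge_striped(int n, double *row, double *bk, double *yk, MPI_Comm comm)
{
    int me;
    MPI_Comm_rank(comm, &me);
    double *buf = malloc(n * sizeof(double));   /* active row part + y(k) */

    for (int k = 0; k < n; k++) {
        if (me == k) {
            /* Division step on the active part of row k. */
            for (int j = k + 1; j < n; j++) row[j] /= row[k];
            *yk = *bk / row[k];
            row[k] = 1.0;
            for (int j = k + 1; j < n; j++) buf[j - k - 1] = row[j];
            buf[n - k - 1] = *yk;
        }
        /* Broadcast the n-k-1 active entries of row k plus y(k). */
        MPI_Bcast(buf, n - k, MPI_DOUBLE, k, comm);
        if (me > k) {
            /* Elimination step on my row. */
            for (int j = k + 1; j < n; j++)
                row[j] -= row[k] * buf[j - k - 1];
            *bk -= row[k] * buf[n - k - 1];
            row[k] = 0.0;
        }
    }
    free(buf);
}
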
Pipelined Version (Striped Partitioning)
- In the non-pipelined (synchronous) version, the outer loop k is executed strictly in order:
  - while Pk is performing the division step, all other processors are idle
  - during the elimination step, only processors k+1 through n-1 are active; the rest are idle
- In the pipelined version, the division step, communication, and elimination step are overlapped. Each processor communicates if it has data to communicate, computes if it has computations to do, or waits if it can do neither.
- The pipelined version is cost-optimal on linear array, mesh, and hypercube interconnection networks, where communicating processors are directly connected.

Pipelined Version (2) [figure]

Pipelined Version (3) [figure]

Striped Partitioning – p < n (1) [figure]

Striped Partitioning – p < n (2) [figure]

Checkerboard Partitioning – p = n² (1) [figure]

Checkerboard Partitioning – p = n² (2)
- Data partitioning: Pi,j has element A(i, j) of matrix A
- Communication during iteration k (outermost loop):
  - broadcast of A(k, k) to processors (k, k+1) through (k, n-1) in the kth row
  - broadcast of the modified A(i, k) along the ith row, for k ≤ i < n
  - broadcast of the modified A(k, j) along the jth column, for k ≤ j < n
- Computation during iteration k (outermost loop):
  - one division at Pk,k
  - one multiplication and subtraction at each processor Pi,j (k < i, j < n)
- Parallel run time: Tp = (3/2)n(n-1)tc + n[ts + 0.5(n-1)tw]
- The algorithm is cost-optimal, as serial and parallel costs are O(n³)

Back-Substitution
Solution of Ux = y, where U is a unit upper triangular matrix:

do k = n-1, 0, -1
  x(k) = y(k)
  do i = k-1, 0, -1
    y(i) = y(i) - x(k)*U(i, k)
  end do
end do

Computation: approximately n²/2 multiplications and subtractions.
The parallel algorithm is similar to that for the Gaussian elimination stage.

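The same loop as runnable C, shown as a sketch (row-major U is assumed; y is copied so the caller's vector is preserved):

/* Back-substitution for Ux = y with unit upper-triangular U (row-major). */
#include <stdlib.h>
#include <string.h>

void back_substitute(int n, const double *U, const double *y, double *x)
{
    double *w = malloc(n * sizeof(double));  /* working copy of y */
    memcpy(w, y, n * sizeof(double));
    for (int k = n - 1; k >= 0; k--) {
        x[k] = w[k];                         /* U(k, k) = 1 */
        for (int i = k - 1; i >= 0; i--)
            w[i] -= x[k] * U[i * n + k];     /* update remaining y entries */
    }
    free(w);
}
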