Download presentation
Presentation is loading. Please wait.
Published byClaire Wade Modified over 9 years ago
1
Dense Linear Algebra Sathish Vadhiyar
2
Gaussian Elimination - Review Version 1 for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n add a multiple of row i to row j for k = i to n A(j, k) = A(j, k) – A(j, i)/A(i, i) * A(i, k) 000000000000 0000000000 i i i,i X x j k
3
Gaussian Elimination - Review Version 2 – Remove A(j, i)/A(i, i) from inner loop for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n m = A(j, i) / A(i, i) for k = i to n A(j, k) = A(j, k) – m* A(i, k) 000000000000 0000000000 i i i,i X x j k
4
Gaussian Elimination - Review Version 3 – Don’t compute what we already know for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n m = A(j, i) / A(i, i) for k = i+1 to n A(j, k) = A(j, k) – m* A(i, k) 000000000000 0000000000 i i i,i X x j k
5
Gaussian Elimination - Review Version 4 – Store multipliers m below diagonals for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n A(j, i) = A(j, i) / A(i, i) for k = i+1 to n A(j, k) = A(j, k) – A(j, i)* A(i, k) 000000000000 0000000000 i i i,i X x j k
6
GE - Runtime Divisions Multiplications / subtractions Total 1+ 2 + 3 + … (n-1) = n 2 /2 (approx.) 1 2 + 2 2 + 3 2 + 4 2 +5 2 + …. (n-1) 2 = n 3/ 3 – n 2 /2 2n 3 /3
7
Parallel GE 1 st step – 1-D block partitioning along blocks of n columns by p processors 000000000000 0000000000 i i i,i X x j k
8
1D block partitioning - Steps 1. Divisions 2. Broadcast 3. Multiplications and Subtractions Runtime: n 2 /2 xlog(p) + ylog(p-1) + zlog(p-3) + … log1 < n 2 logp (n-1)n/p + (n-2)n/p + …. 1x1 = n 3 /p (approx.) < n 2 /2 +n 2 logp + n 3 /p
9
2-D block To speedup the divisions 000000000000 0000000000 i i i,i X x j k P Q
10
2D block partitioning - Steps 1. Broadcast of (k,k) 2. Divisions 3. Broadcast of multipliers logQ n 2 /Q (approx.) xlog(P) + ylog(P-1) + zlog(P-2) + …. = n 2 /Q logP 4. Multiplications and subtractions n 3 /PQ (approx.)
11
Problem with block partitioning for GE Once a block is finished, the corresponding processor remains idle for the rest of the execution Solution? -
12
Onto cyclic The block partitioning algorithms waste processor cycles. No load balancing throughout the algorithm. Onto cyclic 0123023023110 cyclic 1-D block-cyclic2-D block-cyclic Load balance Load balance, block operations, but column factorization bottleneck Has everything
13
Block cyclic Having blocks in a processor can lead to block-based operations (block matrix multiply etc.) Block based operations lead to high performance Operations can be split into 3 categories based on number of operations per memory reference Referred to BLAS levels
14
Basic Linear Algebra Subroutines (BLAS) – 3 levels of operations Memory hierarchy efficiently exploited by higher level BLAS BLASMemo ry Refs. FlopsFlops/ Memor y refs. Level-1 (vector) y=y+ax Z=y.x 3n2n2/3 Level-2 (Matrix-vector) y=y+Ax A = A+(alpha) xy T n2n2 2n 2 2 Level-3 (Matrix-Matrix) C=C+AB 4n 2 2n 3 n/2
15
Gaussian Elimination - Review Version 4 – Store multipliers m below diagonals for each column i zero it out below the diagonal by adding multiples of row i to later rows for i= 1 to n-1 for each row j below row i for j = i+1 to n A(j, i) = A(j, i) / A(i, i) for k = i+1 to n A(j, k) = A(j, k) – A(j, i)* A(i, k) 000000000000 0000000000 i i i,i X x j k What GE really computes: for i=1 to n-1 A(i+1:n, i) = A(i+1:n, i) / A(i, i) A(i+1:n, i+1:n) -= A(i+1:n, i)*A(i, i+1:n) A(i,i) A(j,i)A(j,k) A(i,k) Finished multipliers i i A(i+1:n, i) A(i, i+1:n) A(i+1:n, i+1:n) BLAS1 and BLAS2 operations
16
Converting BLAS2 to BLAS3 Use blocking for optimized matrix- multiplies (BLAS3) Matrix multiplies by delayed updates Save several updates to trailing matrices Apply several updates in the form of matrix multiply
17
Modified GE using BLAS3 Courtesy: Dr. Jack Dongarra for ib = 1 to n-1 step b /* process matrix b columns at a time */ end = ib+b-1; /* Apply BLAS 2 version of GE to get A(ib:n, ib:end) factored. Let LL denote the strictly lower triangular portion of A(ib:end, ib:end)+1 */ A(ib:end, end+1:n) = LL -1 A(ib:end, end+1:n) /* update next b rows of U */ A(end+1:n, end+1:n) = A(end+1:n, end+1:n) - A(ib:n, ib:end)* A(ib:end, end+1:n) /* Apply delayed updates with single matrix multiply */ b ibend ib end b Completed part of L Completed part of U A(ib:end, ib:end) A(end+1:n, ib:end) A(end+1:n, end+1:n)
18
GE: Miscellaneous GE with Partial Pivoting 1D block-column partitioning: which is better? Column or row pivoting 2D block partitioning: Can restrict the pivot search to limited number of columns Column pivoting does not involve any extra steps since pivot search and exchange are done locally on each processor. O(n-i-1) The exchange information is passed to the other processes by piggybacking with the multiplier information Row pivoting Involves distributed search and exchange – O(n/P)+O(logP)
19
Triangular Solve – Unit Upper triangular matrix Sequential Complexity - Complexity of parallel algorithm with 2D block partitioning (P 0.5 *P 0.5 ) O(n 2 ) O(n 2 )/P 0.5 Thus (parallel GE / parallel TS) < (sequential GE / sequential TS) Overall (GE+TS) – O(n 3 /P)
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.