Dense Linear Algebra
Sathish Vadhiyar
Gaussian Elimination - Review: Version 1
For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

    for i = 1 to n-1
        for j = i+1 to n            /* for each row j below row i */
            for k = i to n          /* add a multiple of row i to row j */
                A(j, k) = A(j, k) - A(j, i)/A(i, i) * A(i, k)
Gaussian Elimination - Review: Version 2 - remove A(j, i)/A(i, i) from the inner loop
For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j, i) / A(i, i)
            for k = i to n
                A(j, k) = A(j, k) - m * A(i, k)
Gaussian Elimination - Review: Version 3 - don't compute what we already know
For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

    for i = 1 to n-1
        for j = i+1 to n
            m = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - m * A(i, k)
Gaussian Elimination - Review: Version 4 - store the multipliers m below the diagonal
For each column i, zero it out below the diagonal by adding multiples of row i to later rows:

    for i = 1 to n-1
        for j = i+1 to n
            A(j, i) = A(j, i) / A(i, i)
            for k = i+1 to n
                A(j, k) = A(j, k) - A(j, i) * A(i, k)
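Translated to 0-indexed Python, Version 4 looks like this. A minimal sketch, assuming no pivoting is needed (every A(i, i) it divides by stays nonzero); the function name ge_v4 is mine:

```python
def ge_v4(A):
    """In-place Gaussian elimination, Version 4: the multipliers are
    stored below the diagonal, so on return A holds U on and above the
    diagonal and L (with an implicit unit diagonal) below it.
    0-indexed translation of the slide's pseudocode; no pivoting."""
    n = len(A)
    for i in range(n - 1):                    # for each column i
        for j in range(i + 1, n):             # for each row j below row i
            A[j][i] = A[j][i] / A[i][i]       # store the multiplier
            for k in range(i + 1, n):
                A[j][k] -= A[j][i] * A[i][k]
    return A
```

For A = [[2, 1], [4, 5]] this stores the multiplier 2 in position (1, 0) and leaves U = [[2, 1], [0, 3]] on and above the diagonal.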
GE - Runtime

    Divisions:       1 + 2 + ... + (n-1) ≈ n^2/2
    Multiplications: 1^2 + 2^2 + ... + (n-1)^2 ≈ n^3/3 - n^2/2, and likewise subtractions
    Total:           ≈ 2n^3/3 flops
Parallel GE
1st step: 1-D block partitioning - the n columns are divided into p contiguous blocks of n/p columns, one block per processor.
1-D block partitioning - Steps

    1. Divisions:                        1 + 2 + ... + (n-1) ≈ n^2/2
    2. Broadcasts:                       x1*log(p) + x2*log(p-1) + ... + log(1) < n^2*log(p)
                                         (x1, x2, ... are the broadcast message sizes)
    3. Multiplications and subtractions: (n-1)*n/p + (n-2)*n/p + ... + 1*1 ≈ n^3/p

    Runtime < n^2/2 + n^2*log(p) + n^3/p
2-D block partitioning
To speed up the divisions, the matrix is partitioned in both dimensions over a P x Q grid of processors.
2-D block partitioning - Steps

    1. Broadcast of A(k, k):             log(Q)
    2. Divisions:                        ≈ n^2/Q
    3. Broadcast of multipliers:         x1*log(P) + x2*log(P-1) + ... ≈ (n^2/Q)*log(P)
    4. Multiplications and subtractions: ≈ n^3/(P*Q)
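Treating the step costs on this and the previous slide as additive terms gives a toy runtime model. A sketch assuming unit cost per flop and per word communicated, base-2 logarithms, and a simple sum of the listed terms; the function names are mine:

```python
import math

def t_1d(n, p):
    """Modeled cost of 1-D block GE from the slides:
    divisions + broadcasts + multiply/subtract updates."""
    return n**2 / 2 + n**2 * math.log2(p) + n**3 / p

def t_2d(n, P, Q):
    """Modeled cost of 2-D block GE on a P x Q grid: pivot-entry
    broadcast, divisions, multiplier broadcast, trailing update."""
    return math.log2(Q) + n**2 / Q + (n**2 / Q) * math.log2(P) + n**3 / (P * Q)
```

For example, t_1d(1000, 16) gives 6.7e7 time units, while t_2d(1000, 4, 4) on the same 16 processors gives about 6.3e7: the 2-D layout wins mainly by dividing the n^2 division and broadcast terms by Q.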
Problem with block partitioning for GE
Once a block is finished, the corresponding processor remains idle for the rest of the execution. Solution?
Onto cyclic
The block partitioning algorithm wastes processor cycles; there is no load balancing throughout the algorithm. The fix is a cyclic distribution, with columns assigned round-robin: 0, 1, 2, 3, 0, 1, 2, 3, ...

    cyclic:            load balance
    1-D block-cyclic:  load balance and block operations, but column factorization is a bottleneck
    2-D block-cyclic:  has everything
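The cyclic and block-cyclic assignments can be stated as a small ownership function. A sketch (0-indexed; the function name owner is mine):

```python
def owner(j, nb, p):
    """Process that owns global column j under a 1-D block-cyclic
    distribution with block size nb over p processes (0-indexed).
    nb = 1 gives the pure cyclic distribution; nb >= ceil(n/p)
    degenerates to the pure block distribution."""
    return (j // nb) % p
```

With nb = 1 this reproduces the round-robin 0, 1, 2, 3, 0, 1, 2, 3, ... assignment above; larger nb keeps whole blocks on one process so that block operations remain possible while columns still rotate among processes.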
Block cyclic
Keeping whole blocks on a processor enables block-based operations (e.g., block matrix multiply), and block-based operations lead to high performance. Such operations can be split into 3 categories based on the number of operations per memory reference, referred to as BLAS levels.
Basic Linear Algebra Subroutines (BLAS) - 3 levels of operations
The memory hierarchy is exploited more efficiently by the higher-level BLAS:

    BLAS level                Examples                            Memory refs   Flops   Flops / memory ref
    Level-1 (vector)          y = y + a*x,  z = y.x               3n            2n      2/3
    Level-2 (matrix-vector)   y = y + A*x,  A = A + alpha*x*y^T   n^2           2n^2    2
    Level-3 (matrix-matrix)   C = C + A*B                         4n^2          2n^3    n/2
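Naive pure-Python stand-ins for one operation from each level can make the table concrete. Real BLAS implementations are optimized native libraries; these only mirror the operation shapes and counts in the table, and the function names merely echo the conventional BLAS ones:

```python
def axpy(a, x, y):
    """Level-1: y = y + a*x.  Reads x and y, writes y: ~3n memory
    references; one multiply and one add per element: 2n flops,
    so the ratio is 2/3."""
    return [yi + a * xi for xi, yi in zip(x, y)]

def gemv(A, x, y):
    """Level-2: y = y + A*x.  Dominated by reading A once: ~n^2
    memory references for 2n^2 flops, ratio ~2."""
    return [yi + sum(aij * xj for aij, xj in zip(row, x))
            for row, yi in zip(A, y)]

def gemm(A, B, C):
    """Level-3: C = C + A*B.  ~4n^2 memory references (read A, B, C;
    write C) for 2n^3 flops, ratio ~n/2: data reuse grows with n,
    which is what blocked algorithms exploit."""
    n = len(A)
    return [[C[i][j] + sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]
```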
Gaussian Elimination - Review: what GE really computes
Version 4 (multipliers stored below the diagonal), rewritten with array slices:

    for i = 1 to n-1
        A(i+1:n, i) = A(i+1:n, i) / A(i, i)
        A(i+1:n, i+1:n) = A(i+1:n, i+1:n) - A(i+1:n, i) * A(i, i+1:n)

The column scaling A(i+1:n, i) / A(i, i) is a BLAS1 operation; the rank-1 update of the trailing submatrix A(i+1:n, i+1:n) is a BLAS2 operation. The finished multipliers accumulate below the diagonal in columns 1..i.
Converting BLAS2 to BLAS3
Use blocking to obtain optimized matrix multiplies (BLAS3). The matrix multiplies come from delayed updates: save several updates to the trailing matrix, then apply them together as a single matrix multiply.
Modified GE using BLAS3 (courtesy: Dr. Jack Dongarra)

    for ib = 1 to n-1 step b    /* process the matrix b columns at a time */
        end = ib + b - 1
        /* apply the BLAS2 version of GE to factor A(ib:n, ib:end);
           let LL denote the strictly lower triangular portion of
           A(ib:end, ib:end), plus 1s on the diagonal */
        A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)
            /* update the next b rows of U */
        A(end+1:n, end+1:n) = A(end+1:n, end+1:n)
                              - A(end+1:n, ib:end) * A(ib:end, end+1:n)
            /* apply the delayed updates with a single matrix multiply */
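The blocked loop above can be sketched in plain Python, with LL^-1 applied by forward substitution rather than an explicit inverse. A 0-indexed sketch without pivoting; here end is exclusive, and the function name is mine:

```python
def lu_blocked(A, b):
    """Right-looking blocked LU (no pivoting), following the slide:
    factor a panel of b columns with the BLAS2 algorithm, apply the
    panel's unit lower triangle LL to the next b rows of U by forward
    substitution, then apply the delayed updates as one matrix
    multiply.  In-place; pure-Python sketch, 0-indexed."""
    n = len(A)
    for ib in range(0, n, b):
        end = min(ib + b, n)                  # exclusive end of the panel
        # BLAS2 factorization of the panel A[ib:n, ib:end]
        for i in range(ib, end):
            for j in range(i + 1, n):
                A[j][i] /= A[i][i]
                for k in range(i + 1, end):
                    A[j][k] -= A[j][i] * A[i][k]
        # U block: A[ib:end, end:n] = LL^-1 * A[ib:end, end:n]
        for j in range(end, n):
            for i in range(ib, end):          # forward substitution
                for k in range(ib, i):
                    A[i][j] -= A[i][k] * A[k][j]
        # delayed update: A[end:n, end:n] -= A[end:n, ib:end] * A[ib:end, end:n]
        for i in range(end, n):
            for j in range(end, n):
                for k in range(ib, end):
                    A[i][j] -= A[i][k] * A[k][j]
    return A
```

With b = n this degenerates to the unblocked Version 4; in a real implementation the last loop nest would be a single optimized BLAS3 gemm call, which is where the performance comes from.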
Parallel GE with 2-D block-cyclic layout & pivoting - the ScaLAPACK steps

    for ib = 1 to n-1 step b
        end = min(ib+b-1, n)
        for i = ib to end
            (1)  find pivot row k, column broadcast
            (2)  swap rows k and i in the block column, broadcast row k
            (3)  A(i+1:n, i) /= A(i, i)
            (4)  A(i+1:n, i+1:end) -= A(i+1:n, i) * A(i, i+1:end)
        end for
        (5)  communicate the swap information to the left and right
        (6)  apply all row swaps to the other columns
        (7)  broadcast LL right
        (8)  A(ib:end, end+1:n) = LL^-1 * A(ib:end, end+1:n)
        (9)  broadcast A(ib:end, end+1:n) down
        (10) broadcast A(end+1:n, ib:end) right
        (11) eliminate A(end+1:n, end+1:n)
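Dropping the distribution and broadcasts, steps (1)-(4) amount to sequential LU with partial pivoting. A pure-Python sketch (0-indexed, unblocked; the names are mine):

```python
def lu_partial_pivot(A):
    """Sequential LU with partial pivoting, mirroring the per-column
    work of steps (1)-(4): pick the largest-magnitude entry in the
    current column as pivot, swap rows, then eliminate.  Returns the
    factored matrix and the row permutation piv."""
    n = len(A)
    piv = list(range(n))
    for i in range(n - 1):
        # (1) find pivot row k in column i
        k = max(range(i, n), key=lambda r: abs(A[r][i]))
        # (2) swap rows k and i, record the swap
        A[i], A[k] = A[k], A[i]
        piv[i], piv[k] = piv[k], piv[i]
        for j in range(i + 1, n):
            # (3) scale the column below the pivot
            A[j][i] /= A[i][i]
            # (4) rank-1 update of the trailing submatrix
            for m in range(i + 1, n):
                A[j][m] -= A[j][i] * A[i][m]
    return A, piv
```

piv records the row permutation; in the distributed algorithm this is exactly the swap information that steps (5)-(6) communicate and apply to the rest of the matrix.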
GE: Miscellaneous
GE with partial pivoting: with 1-D block-column partitioning, which is better, column pivoting or row pivoting?
- Column pivoting does not involve any extra steps, since the pivot search and exchange are done locally on each processor: O(n-i-1). The exchange information is passed to the other processes by piggybacking it on the multiplier information.
- Row pivoting involves a distributed search and exchange: O(n/P) + O(log P).
With 2-D block partitioning, the pivot search can be restricted to a limited number of columns.
Triangular Solve - unit upper triangular matrix
- Sequential complexity: O(n^2)
- Complexity of the parallel algorithm with 2-D block partitioning (sqrt(P) x sqrt(P) grid): O(n^2)/sqrt(P)
Thus (parallel GE / parallel TS) < (sequential GE / sequential TS): the triangular solve gains less from parallelization than GE does. Overall (GE + TS): O(n^3/P).
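The sequential O(n^2) solve is plain back substitution, and the unit diagonal removes the divisions. A minimal sketch (0-indexed; the function name is mine):

```python
def unit_upper_solve(U, b):
    """Back substitution for U x = b with U unit upper triangular
    (diagonal entries are 1, so no divisions are needed).
    Sequential cost: ~n^2/2 multiply-adds, i.e. O(n^2)."""
    n = len(b)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = b[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))
    return x
```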
2-D partitioning with pipelining - demonstration of the advancing front
(Figure: successive snapshots of the 2-D process grid, with each cell labeled by the step at which its communication and computation occur; the labels form a diagonal front that sweeps from the top-left corner toward the bottom-right.)
2-D Partitioning with Pipelining
Communication and computation move as a "front". The completion time is only O(n). This "front", or wave, propagation is used in other algorithms as well.
Parallel solution of linear systems with special matrices: Tridiagonal Matrices

    | a1 h1          |   | x1 |   | b1 |
    | g2 a2 h2       |   | x2 |   | b2 |
    |    g3 a3 h3    | * | x3 | = | b3 |
    |        ...     |   | .. |   | .. |
    |          gn an |   | xn |   | bn |

In general, equation i reads:

    g(i) x(i-1) + a(i) x(i) + h(i) x(i+1) = b(i)

Substituting for x(i-1) and x(i+1) in terms of {x(i-2), x(i)} and {x(i), x(i+2)} respectively (using equations i-1 and i+1) gives an equation coupling every other unknown:

    G(i) x(i-2) + A(i) x(i) + H(i) x(i+2) = B(i)
Tridiagonal Matrices
After the substitution, the coefficient matrix couples each unknown to its neighbors two positions away:

    | A1       H1          |   | x1 |   | B1 |
    |    A2       H2       |   | x2 |   | B2 |
    | G3    A3       H3    | * | x3 | = | B3 |
    |    G4    A4       H4 |   | x4 |   | B4 |
    |         ...          |   | .. |   | .. |
    |       G(n)      A(n) |   | xn |   | Bn |

Reordering (even-numbered unknowns first, then odd-numbered):
Tridiagonal Matrices
The reordered system splits into two independent half-size tridiagonal systems:

    | T_even    0    |   | x_even |   | B_even |
    |   0     T_odd  | * | x_odd  | = | B_odd  |

where x_even = (x2, x4, ..., xn) with T_even tridiagonal,

    | A2 H2        |
    | G4 A4 H4     |
    |     ...      |
    |    G(n) A(n) |

and x_odd = (x1, x3, ..., x(n-1)) with T_odd tridiagonal,

    | A1 H1              |
    | G3 A3 H3           |
    |       ...          |
    |    G(n-1) A(n-1)   |

(assuming n even, as in the slide's layout).
Tridiagonal Systems
Thus the problem of size n has been split into independent even and odd subsystems, each of size n/2. This is odd-even reduction. For parallelization, each process can divide its problem into smaller subproblems and solve the subproblems: a divide-and-conquer technique.
Tridiagonal Systems - Parallelization
At each stage one representative process is chosen from the domain of processes assigned to a problem. This representative performs the odd-even reduction of a problem of size i into two problems of size i/2, and the two subproblems are distributed to two new representatives. (Figure: equations 1..8 split into the even set 2, 4, 6, 8 and the odd set 1, 3, 5, 7, and so on recursively: problem sizes n, n/2, n/4, n/8.)
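One level of the reduction can be written out directly. A sketch using the slides' recurrence with the conventions g(1) = 0 and h(n) = 0, a direct Thomas solve for the reduced system, and back-substitution into the eliminated equations to recover the remaining unknowns. All function names are mine, and everything runs sequentially; a parallel version would instead recurse and hand the two subproblems to different representative processes:

```python
def thomas(g, a, h, b):
    """Direct tridiagonal solve (Thomas algorithm), used here as the
    base solver.  g, a, h are the sub-, main-, super-diagonals of the
    rows, 0-indexed, with g[0] and h[-1] unused (zero)."""
    n = len(a)
    a, b = a[:], b[:]                 # keep the inputs intact
    for i in range(1, n):             # forward elimination
        m = g[i] / a[i - 1]
        a[i] -= m * h[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / a[-1]
    for i in range(n - 2, -1, -1):    # back substitution
        x[i] = (b[i] - h[i] * x[i + 1]) / a[i]
    return x

def odd_even_halve(g, a, h, b):
    """One odd-even reduction step: substitute the neighboring
    equations into rows 1, 3, 5, ... (0-indexed; the slides' even-
    numbered equations), leaving a half-size tridiagonal system
    G(i) x(i-2) + A(i) x(i) + H(i) x(i+2) = B(i) in those unknowns."""
    n = len(a)
    G, A, H, B = [], [], [], []
    for i in range(1, n, 2):
        al = g[i] / a[i - 1]          # eliminate x[i-1]
        Ai = a[i] - al * h[i - 1]
        Bi = b[i] - al * b[i - 1]
        Gi = -al * g[i - 1]           # couples to x[i-2] (0 at the edge)
        Hi = 0.0
        if i + 1 < n:
            be = h[i] / a[i + 1]      # eliminate x[i+1]
            Ai -= be * g[i + 1]
            Bi -= be * b[i + 1]
            if i + 2 < n:
                Hi = -be * h[i + 1]   # couples to x[i+2]
        G.append(Gi); A.append(Ai); H.append(Hi); B.append(Bi)
    return G, A, H, B

def solve_by_reduction(g, a, h, b):
    """Solve the full system via one reduction level: solve the
    half-size system, then recover the eliminated unknowns from
    their original equations."""
    n = len(a)
    G, A, H, B = odd_even_halve(g, a, h, b)
    xo = thomas(G, A, H, B)           # x[1], x[3], ...
    x = [0.0] * n
    for j, i in enumerate(range(1, n, 2)):
        x[i] = xo[j]
    for i in range(0, n, 2):          # back-substitute the rest
        rhs = b[i]
        if i - 1 >= 0:
            rhs -= g[i] * x[i - 1]
        if i + 1 < n:
            rhs -= h[i] * x[i + 1]
        x[i] = rhs / a[i]
    return x
```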