
1 High Performance Computing 1 Numerical Linear Algebra An Introduction

2 High Performance Computing 1 Levels of multiplication: vector-vector a[i]*b[i]; matrix-vector A[i,j]*b[j]; matrix-matrix A[i,k]*B[k,j]
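As a hedged illustration (the array names and the common length/size n are made up for the sketch, not taken from the slide), the first two levels correspond to the following C loops; the level-3 triple loop appears on the next slide.

    /* Level 1: vector-vector, e.g. a dot product s = sum_i a[i]*b[i] */
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += a[i] * b[i];

    /* Level 2: matrix-vector, y = A*b with A stored row-major as A[i*n+j] */
    for (int i = 0; i < n; i++) {
        y[i] = 0.0;
        for (int j = 0; j < n; j++)
            y[i] += A[i*n + j] * b[j];
    }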

3 High Performance Computing 1 Matrix-Matrix
for (i=0; i<n; i++)
  for (j=0; j<n; j++) {
    C[i,j] = 0.0;
    for (k=0; k<n; k++)
      C[i,j] = C[i,j] + A[i,k]*B[k,j];
  }
Note: O(n^3) work
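For concreteness, a minimal compilable C version of this triple loop, assuming square n-by-n matrices stored flat in row-major order (the function name and the storage convention are choices made for this sketch, not part of the slide):

    /* C = A*B for n-by-n row-major matrices: C[i*n+j] holds C(i,j). */
    void matmul(int n, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++) {
                double sum = 0.0;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;   /* 2n flops per entry => O(n^3) total */
            }
    }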

4 High Performance Computing 1 Block matrix Serial version - same pseudocode, but interpret i, j as indices of subblocks, and A*B as block matrix multiplication. Let n be a power of two and generate a recursive algorithm; it terminates with an explicit formula for the elementary 2x2 multiplications. Allows for parallelism. Can get O(n^2.8) work. A recursive sketch follows below.
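A minimal recursive sketch, assuming n is a power of two, flat row-major storage with leading dimension ld, and a caller that zeroes C first (all names are invented for the example). This plain recursion still performs 8 sub-multiplications per level and hence O(n^3) work; the O(n^2.8) figure on the slide comes from Strassen-style schemes that replace the 8 recursive multiplications with 7.

    /* C += A*B for n-by-n blocks inside row-major matrices with
       leading dimension ld; n is assumed to be a power of two. */
    void block_mm(int n, int ld, const double *A, const double *B, double *C)
    {
        if (n <= 2) {               /* base case: explicit small multiply */
            for (int i = 0; i < n; i++)
                for (int j = 0; j < n; j++)
                    for (int k = 0; k < n; k++)
                        C[i*ld + j] += A[i*ld + k] * B[k*ld + j];
            return;
        }
        int h = n / 2;              /* split each matrix into 2x2 subblocks */
        const double *A11 = A,        *A12 = A + h,
                     *A21 = A + h*ld, *A22 = A + h*ld + h;
        const double *B11 = B,        *B12 = B + h,
                     *B21 = B + h*ld, *B22 = B + h*ld + h;
        double       *C11 = C,        *C12 = C + h,
                     *C21 = C + h*ld, *C22 = C + h*ld + h;

        block_mm(h, ld, A11, B11, C11);  block_mm(h, ld, A12, B21, C11);
        block_mm(h, ld, A11, B12, C12);  block_mm(h, ld, A12, B22, C12);
        block_mm(h, ld, A21, B11, C21);  block_mm(h, ld, A22, B21, C21);
        block_mm(h, ld, A21, B12, C22);  block_mm(h, ld, A22, B22, C22);
    }

Recursive calls that update different C blocks are independent, which is where the opportunity for parallelism comes from.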

5 High Performance Computing 1 Pipeline method
Pump data left to right and top to bottom:
recv(&A, P[i,j-1]);
recv(&B, P[i-1,j]);
C = C + A*B;
send(&A, P[i,j+1]);
send(&B, P[i+1,j]);
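A hedged MPI sketch of one pipeline step on process (i,j), assuming each process owns one b-by-b block of C and that neighbour ranks have already been worked out; the function and variable names (pipeline_step, block_madd, left/up/right/down) are inventions for this sketch, not from the slides. On the boundary, MPI_PROC_NULL makes the corresponding send/recv a no-op.

    #include <mpi.h>

    /* Local accumulate Cblk += Ablk * Bblk for b-by-b row-major blocks. */
    static void block_madd(int b, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < b; i++)
            for (int j = 0; j < b; j++)
                for (int k = 0; k < b; k++)
                    C[i*b + j] += A[i*b + k] * B[k*b + j];
    }

    /* One pipeline step: receive an A block from the left and a B block
       from above, accumulate locally, then forward them right and down. */
    void pipeline_step(int b, double *Ablk, double *Bblk, double *Cblk,
                       int left, int up, int right, int down, MPI_Comm comm)
    {
        MPI_Recv(Ablk, b*b, MPI_DOUBLE, left, 0, comm, MPI_STATUS_IGNORE);
        MPI_Recv(Bblk, b*b, MPI_DOUBLE, up,   1, comm, MPI_STATUS_IGNORE);

        block_madd(b, Ablk, Bblk, Cblk);

        MPI_Send(Ablk, b*b, MPI_DOUBLE, right, 0, comm);
        MPI_Send(Bblk, b*b, MPI_DOUBLE, down,  1, comm);
    }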

6 High Performance Computing 1 Pipeline method
[Figure: a 4x4 grid of processes holding blocks C[0,0] through C[3,3]; rows A[0,*]..A[3,*] stream in from the left and columns B[*,0]..B[*,3] stream in from the top.]

7 High Performance Computing 1 Pipeline method A similar method works for matrix-vector multiplication, but you lose some of the cache reuse.

8 High Performance Computing 1 A sense of speed – vector ops

Loop   Flops per pass   Operation per pass               Operation
1      2                v1(i) = v1(i) + a*v2(i)          update
2      8                v1(i) = v1(i) + Σ sk*vk(i)       4-fold vector update
3      1                v1(i) = v1(i) / v2(i)            divide
4      2                v1(i) = v1(i) + s*v2(ind(i))     update + gather
5      2                v1(i) = v2(i) - v3(i)*v1(i-1)    bidiagonal
6      2                s = s + v1(i)*v2(i)              inner product
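As a hedged illustration, the six loops written out as C (the vector names follow the table; the auxiliary names s1..s4, w1..w4 and the loop bounds are assumptions for the sketch):

    /* Loop 1: update */
    for (i = 0; i < n; i++) v1[i] = v1[i] + a*v2[i];

    /* Loop 2: 4-fold vector update (4 multiplies + 4 adds per pass) */
    for (i = 0; i < n; i++)
        v1[i] = v1[i] + s1*w1[i] + s2*w2[i] + s3*w3[i] + s4*w4[i];

    /* Loop 3: divide */
    for (i = 0; i < n; i++) v1[i] = v1[i] / v2[i];

    /* Loop 4: update + gather (indirect addressing through ind) */
    for (i = 0; i < n; i++) v1[i] = v1[i] + s*v2[ind[i]];

    /* Loop 5: bidiagonal - a first-order recurrence, hard to vectorize */
    for (i = 1; i < n; i++) v1[i] = v2[i] - v3[i]*v1[i-1];

    /* Loop 6: inner product (a reduction) */
    s = 0.0;
    for (i = 0; i < n; i++) s = s + v1[i]*v2[i];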

9 High Performance Computing 1 A sense of speed – vector ops

        J90 cft77 (100 nsec clock)     T90 1 processor cft77 (2.2 nsec clock)
Loop    r_∞         n_1/2              r_∞         n_1/2
1       97          19                 1428        159
2       163         29                 1245        50
3       21          6                  219         43
4       72          21                 780         83
5       4.9         2                  25          9
6       120         20                 2474        164

10 High Performance Computing 1 Observations
Simple do loops are not effective.
Cache and the memory hierarchy are the bottlenecks.
For better performance:
– combine loops
– minimize memory transfer
(A small loop-fusion sketch follows below.)
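A minimal sketch of what "combine loops" means in practice, using made-up vector updates (not from the slides): two separate passes over the data become one, halving the number of times v1 travels through the memory hierarchy while doing the same flops.

    /* Two separate loops: v1 is read and written twice. */
    for (i = 0; i < n; i++) v1[i] = v1[i] + a*v2[i];
    for (i = 0; i < n; i++) v1[i] = v1[i] + b*v3[i];

    /* Fused loop: same flops, but v1 traverses memory only once. */
    for (i = 0; i < n; i++) v1[i] = v1[i] + a*v2[i] + b*v3[i];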

11 High Performance Computing 1 LINPACK A library of subroutines for solving linear algebra problems; for example, LU decomposition and the subsequent system solve (dgefa and dgesl, respectively). In turn, built on the BLAS; see netlib.org. A calling sketch follows below.
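A hedged sketch of the classic calling sequence from C, assuming the Fortran LINPACK routines are visible with the common trailing-underscore convention and all arguments passed by reference; the wrapper name solve is invented for the example, and the argument lists should be checked against your installation.

    #include <stdlib.h>

    /* LINPACK routines, assumed Fortran linkage (trailing underscore). */
    extern void dgefa_(double *a, int *lda, int *n, int *ipvt, int *info);
    extern void dgesl_(double *a, int *lda, int *n, int *ipvt,
                       double *b, int *job);

    /* Solve A*x = b; a is n-by-n column-major, b holds x on return. */
    void solve(int n, double *a, double *b)
    {
        int lda = n, info = 0, job = 0;        /* job = 0: solve A*x = b   */
        int *ipvt = malloc(n * sizeof(int));   /* pivot indices from dgefa */

        dgefa_(a, &lda, &n, ipvt, &info);      /* LU with partial pivoting */
        if (info == 0)
            dgesl_(a, &lda, &n, ipvt, b, &job);
        free(ipvt);
    }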

12 High Performance Computing 1 BLAS Basic Linear Algebra Subprograms: a library of subroutines designed to provide efficient computation of commonly used linear algebra operations, such as dot products, matrix-vector multiplies, and matrix-matrix multiplies. The naming convention is not unlike that of other libraries: the first letter indicates precision, and the rest gives a hint (maybe) of what the routine does, e.g. SAXPY, DGEMM. The BLAS are divided into 3 levels: vector-vector, matrix-vector, and matrix-matrix. The biggest speed-up is usually at level 3.
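A hedged illustration using the C interface to the BLAS (CBLAS); the routine names follow the convention described above (first letter = precision), while the wrapper name and the assumption of n-by-n row-major matrices are made up for the sketch.

    #include <cblas.h>

    /* One level-1 and one level-3 call, for vectors/matrices of size n. */
    void blas_demo(int n, double a, const double *x, double *y,
                   double alpha, const double *A, const double *B,
                   double beta, double *C)
    {
        /* Level 1: DAXPY, y = a*x + y */
        cblas_daxpy(n, a, x, 1, y, 1);

        /* Level 3: DGEMM, C = alpha*A*B + beta*C */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A, n, B, n, beta, C, n);
    }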

13 High Performance Computing 1 BLAS Level 1

14 High Performance Computing 1 BLAS Level 2

15 High Performance Computing 1 BLAS Level 3

16 High Performance Computing 1 How efficient is the BLAS?

Level    Routine    load/store        float ops    refs/ops
1        SAXPY      3N                2N           3/2
2        SGEMV      MN + N + 2M       2MN          1/2
3        SGEMM      2MN + MK + KN     2MNK         2/N

17 High Performance Computing 1 Matrix-vector
read x(1:n) into fast memory
read y(1:n) into fast memory
for i = 1:n
    read row i of A into fast memory
    for j = 1:n
        y(i) = y(i) + A(i,j)*x(j)
write y(1:n) back to slow memory

18 High Performance Computing 1 Matrix-vector
m = # slow memory refs = n^2 + 3n
f = # arithmetic ops = 2n^2
q = f/m ≈ 2
Matrix-vector multiply is limited by slow memory.

19 High Performance Computing 1 Matrix-matrix

20 High Performance Computing 1 Matrix Multiply - unblocked
for i = 1 to n
    read row i of A into fast memory
    for j = 1 to n
        read C(i,j) into fast memory
        read column j of B into fast memory
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        write C(i,j) back to slow memory

21 High Performance Computing 1 Matrix Multiply unblocked
Number of slow memory references on unblocked matrix multiply:
m =   n^3     read each column of B n times
    + n^2     read each row of A once
    + 2*n^2   read and write each element of C once
  = n^3 + 3*n^2
So q = f/m = (2*n^3)/(n^3 + 3*n^2) ≈ 2 for large n: no improvement over matrix-vector multiply.

22 High Performance Computing 1 Matrix Multiply blocked
Consider A, B, C to be N by N matrices of b by b subblocks, where b = n/N is called the blocksize.
for i = 1 to N
    for j = 1 to N
        read block C(i,j) into fast memory
        for k = 1 to N
            read block A(i,k) into fast memory
            read block B(k,j) into fast memory
            C(i,j) = C(i,j) + A(i,k) * B(k,j)   {do a matrix multiply on blocks}
        write block C(i,j) back to slow memory
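A minimal C sketch of this blocked loop ordering, assuming the block size b divides n and flat row-major storage (the function and variable names are invented for the example):

    /* Blocked C = C + A*B for n-by-n row-major matrices, block size b. */
    void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
    {
        for (int i = 0; i < n; i += b)
            for (int j = 0; j < n; j += b)
                for (int k = 0; k < n; k += b)
                    /* multiply b-by-b blocks A(i,k)*B(k,j) into C(i,j);
                       all three blocks are meant to stay in cache */
                    for (int ii = i; ii < i + b; ii++)
                        for (int jj = j; jj < j + b; jj++) {
                            double sum = C[ii*n + jj];
                            for (int kk = k; kk < k + b; kk++)
                                sum += A[ii*n + kk] * B[kk*n + jj];
                            C[ii*n + jj] = sum;
                        }
    }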

23 High Performance Computing 1 Matrix Multiply blocked
Number of slow memory references on blocked matrix multiply:
m =   N*n^2   read each block of B N^3 times (N^3 * n/N * n/N)
    + N*n^2   read each block of A N^3 times
    + 2*n^2   read and write each block of C once
  = (2*N + 2)*n^2
So q = f/m = 2*n^3 / ((2*N + 2)*n^2) ≈ n/N = b for large n.

24 High Performance Computing 1 Matrix Multiply blocked
So we can improve performance by increasing the blocksize b.
Can be much faster than matrix-vector multiply (q = 2).
Limit: all three blocks from A, B, C must fit in fast memory (cache), so we cannot make the blocks arbitrarily large: 3*b^2 <= M, so q ≈ b <= sqrt(M/3).
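A hedged worked example (the cache size is made up for illustration): if fast memory holds M = 32768 doubles (a 256 KB cache with 8-byte words), then b <= sqrt(32768/3) ≈ 104, so blocks of roughly 100 by 100 keep all three operands resident and give q ≈ 100, compared with q ≈ 2 for the unblocked loop.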

25 High Performance Computing 1 More on BLAS
Industry standard interface (evolving)
Vendors, others supply optimized implementations
History
– BLAS1 (1970s): vector operations: dot product, saxpy; m = 2*n, f = 2*n, q ~ 1 or less
– BLAS2 (mid 1980s): matrix-vector operations; m = n^2, f = 2*n^2, q ~ 2; less overhead, somewhat faster than BLAS1

26 High Performance Computing 1 More on BLAS
– BLAS3 (late 1980s): matrix-matrix operations: matrix-matrix multiply, etc.; m >= 4*n^2, f = O(n^3), so q can be as large as n, so BLAS3 is potentially much faster than BLAS2
Good algorithms use BLAS3 when possible (LAPACK).
www.netlib.org/blas, www.netlib.org/lapack
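A hedged illustration of the LAPACK point, using its C interface (LAPACKE); the driver shown solves A*x = b via a blocked LU factorization that spends most of its time in BLAS3 routines. The wrapper name and the row-major layout are choices made for this sketch.

    #include <stdlib.h>
    #include <lapacke.h>

    /* Solve A*x = b for an n-by-n system; on return b holds x.
       LAPACKE_dgesv factors A with blocked LU (BLAS3 underneath). */
    int solve_system(int n, double *A /* row-major n*n */, double *b)
    {
        lapack_int *ipiv = malloc(n * sizeof(lapack_int));   /* pivots */
        lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, n, 1,
                                        A, n, ipiv, b, 1);
        free(ipiv);
        return (int)info;   /* 0 on success */
    }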

27 High Performance Computing 1 BLAS on an IBM RS6000/590
[Figure: Mflop rate vs. problem size for BLAS 3 (n-by-n matrix-matrix multiply) vs BLAS 2 (n-by-n matrix-vector multiply) vs BLAS 1 (saxpy of n vectors); peak speed = 266 Mflops.]

