Numerical Linear Algebra


1 Numerical Linear Algebra
Karen Walters CS 521 March 2, 2000

2 Standard Linear Algebra Problems
Solving linear systems of equations. Example: solving ordinary or partial differential equations numerically.
Least squares problems. Example: fitting curves or surfaces to experimental data.
Eigenvalue problems. Example: analyzing vibrations.

Many computational science problems can be reduced to solving one or more of these standard linear algebra problems. In a linear system of equations we are given a square matrix A and a right-hand side b, and we want to solve Ax=b. In least squares problems the matrix A is m by n. If m>n, we have more equations than unknowns, so the system is overdetermined and we generally can’t solve Ax=b exactly. If m<n, the system is underdetermined and can have infinitely many solutions. In eigenvalue problems, we are given the matrix A and we want to find a nonzero vector x and a scalar lambda that satisfy Ax = lambda x.

3 Choice of Algorithm
Structure of A:
  dense: Gaussian elimination
  SPD: Cholesky
  triangular: substitution
  certain PDEs: multigrid
Desired accuracy:
  Gaussian elimination: more accurate, more expensive
  Conjugate gradients: less accurate, less expensive
Computer system: architecture, available languages, compiler quality, libraries?

The choice of algorithm depends on the structure of the matrix A, the accuracy desired, and the computer system. If A is dense, Gaussian elimination takes 2/3 n^3 + O(n^2) flops. If A is symmetric positive definite (SPD), we can use Cholesky (a variation of Gaussian elimination suitable for SPD matrices), which cuts the cost in half. If A is triangular, substitution takes only O(n^2) flops, which is much faster than the previous two methods. Finally, if the matrix A results from certain elliptic partial differential equations (such as Poisson’s equation), we can get costs down to O(n) using multigrid. Our choice of algorithm also depends on the desired accuracy. For example, Gaussian elimination is more accurate than the conjugate gradient method, but it is also more expensive. Finally, the computer architecture, available programming languages and compiler quality can make a large difference in how we implement a particular algorithm. We cannot depend entirely on libraries, since it is impossible to anticipate all possible structures of A and write corresponding libraries. Also, high performance computer architectures, programming languages and compilers have been changing so quickly that library writers can’t keep up.

4 Memory Hierarchies
registers (fastest, smallest, most expensive)
cache
main memory
disks
tape (slowest, largest, cheapest)

Computer memories are built as hierarchies, with a series of different kinds of memory ranging from very fast, expensive and small memory at the top of the hierarchy down to slow, cheap and very large memory at the bottom. For example, the memory may have the hierarchy listed on this slide. Very large matrices cannot fit in the registers. When work needs to be done on data, that data has to move up into the registers. Moving data between levels takes time, and more time is needed the farther down the hierarchy the data sits. One data movement takes much longer than one floating point operation, so we want to minimize the number of memory moves even if it means using more flops.

5 BLAS = Basic Linear Algebra Subprograms
The table on this slide gives the speeds on a single processor of a Cray Y-MP and on 8 processors. The first row gives the maximum possible speed of the machine in megaflops. The next rows give the speeds of various linear algebra subroutines. These were written in assembly language for the Cray Y-MP to minimize data movement. Since operations like matrix-matrix multiplication are so common, computer manufacturers have standardized them as the Basic Linear Algebra Subprograms (or BLAS) and have optimized them for their machines.

The fifth row gives the performance of a Cholesky subroutine from the LINPACK subroutine library, which is a library of Fortran 77 linear algebra routines. LINPACK was written before we had vector and parallel computers. The performance of this subroutine is quite poor because it was not designed to minimize memory movement on machines such as the Cray Y-MP. The last two rows give the performance of a different version of Cholesky taken from the LAPACK library. LAPACK reorganizes the algorithms of LINPACK and EISPACK (for eigenvalue problems) to call the BLAS in their innermost loops, where most of the work is done. You can see that it is much faster than LINPACK alone. Although this approach cannot be used straightforwardly(*) on newer architectures (especially distributed memory machines), many of the same ideas still apply.

(*) 1. The memory hierarchy is deeper. 2. Languages and compilers are still evolving, so there are many more possible ways to store a matrix.

6 The BLAS
Level 1 BLAS
Level 2 BLAS
Level 3 BLAS

Table 2 gives the number of floating point operations and memory references for three subroutines. “q” is the ratio of flops to memory references; it tells us how much useful work we can do compared to the time spent moving data. Clearly we want to use operations with higher values of q in our algorithms.
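The table itself did not survive in this transcript, so the following is only a sketch of how q behaves, using the standard operation counts for one representative routine per BLAS level. The choice of routines and the exact memory-reference counts are assumptions for illustration, not a reproduction of Table 2.

```python
def blas_ratios(n):
    """Return {name: (flops, memory_refs, q)} for vectors of length n and n-by-n matrices.

    Assumed representative operations (not taken from the slide):
      saxpy  (Level 1): y = a*x + y
      matvec (Level 2): y = A*x + y
      matmul (Level 3): C = A*B + C
    """
    counts = {
        "saxpy": (2 * n, 3 * n + 1),            # read x, y and a; write y
        "matvec": (2 * n * n, n * n + 3 * n),   # read A, x and y; write y
        "matmul": (2 * n ** 3, 4 * n * n),      # read A, B and C; write C
    }
    return {name: (f, m, f / m) for name, (f, m) in counts.items()}


if __name__ == "__main__":
    for name, (flops, refs, q) in blas_ratios(1000).items():
        print(f"{name:7s} flops={flops:>12d} refs={refs:>9d} q={q:.3f}")
```

Under these counts q stays near 2/3 for Level 1, near 2 for Level 2, and grows like n/2 for Level 3, which is why algorithms built on Level 3 operations can hide the cost of data movement.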

7 Matrix Multiplication: Shared Memory
Unblocked Matrix Multiplication
One row of A and one entry of C kept in fast memory: q is approximately 2
Cannot keep a full row of A in fast memory: reduced to inner products (Level 1 BLAS)

Here we’re going to look at matrix multiplication on shared memory machines. This algorithm is called unblocked matrix multiplication. We assume that fast memory holds M words and that the matrices are n by n. If M is between n and n^2, then one row of A and one entry of C are kept in fast memory. In this case we get a value of q of approximately 2, which is no better than Level 2 BLAS. If M is much smaller than n, we can’t keep a full row of A in fast memory. In this case we’re reduced to taking inner products, which is Level 1 BLAS.

8 Matrix Multiplication: Shared Memory
Column-blocked Matrix Multiplication
Fast memory contains 1 column of A and 1 column block each from B and C

In column-blocked matrix multiplication, assuming a large enough M, we can fit one column block each from B and C into fast memory along with 1 column of A. Here q is approximately M/n, so M needs to grow with n to keep q large. (N denotes how many column blocks we have.)

9 Matrix Multiplication: Shared Memory
Rectangular-blocked Matrix Multiplication
One block each from A, B, and C fits into fast memory

If we break A, B and C into N by N arrays of blocks, each of size n/N by n/N, and M is greater than or equal to 3(n/N)^2, then we can fit one block each from A, B and C into fast memory. Here q is approximately sqrt(M/3), which is much better than the previous 2 algorithms.
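The blocked scheme can be sketched as a tiled triple loop. The code below is a minimal illustration, not the slide's algorithm listing: the function name and the simple traffic model (each block is counted once per time it would be loaded into or stored from fast memory) are assumptions made for this example.

```python
def blocked_matmul(A, B, block):
    """Compute C = A*B tile by tile and estimate q = flops / words moved.

    Assumes square n-by-n matrices stored as lists of rows and a block size
    that divides n. The traffic count is a model, not a measurement.
    """
    n = len(A)
    assert n % block == 0, "this sketch assumes the block size divides n"
    C = [[0.0] * n for _ in range(n)]
    words_moved = 0
    for i0 in range(0, n, block):
        for j0 in range(0, n, block):
            words_moved += block * block          # load block C^(ij)
            for k0 in range(0, n, block):
                words_moved += 2 * block * block  # load blocks A^(ik) and B^(kj)
                for i in range(i0, i0 + block):
                    for j in range(j0, j0 + block):
                        s = C[i][j]
                        for k in range(k0, k0 + block):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
            words_moved += block * block          # store block C^(ij)
    flops = 2 * n ** 3
    return C, flops / words_moved
```

With N = n/block blocks per dimension, this model gives q = n/(N+1), which is close to the block size n/N for large N. Choosing the largest block with 3(n/N)^2 <= M then gives a q of about sqrt(M/3), matching the estimate on the slide.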

10 Matrix Multiplication: Distributed Memory
Cost model is more complicated due to data layout
Data layout determines: amount of parallelism, cost of communication
Goal: utilize essentially every communication link and floating point unit

Matrix multiplication on distributed memory machines has a more complicated cost model because of the data layout. The data layout determines the amount of parallelism and the cost of communication. Our goal is to utilize essentially every communication link and floating point unit.

11 Matrix Multiplication: Distributed Memory
Cannon’s Matrix Multiplication Algorithm

For example, if we have our processors laid out in a square mesh so that each processor communicates most efficiently with the 4 processors immediately north, south, east and west of itself, we can use an algorithm such as Cannon’s, given on this slide, to multiply matrices A and B. This algorithm moves blocks of A and B around to different processors. Block C^(ij) is computed on processor ij.

12 Matrix Multiplication: Distributed Memory
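Whenever A^(ik) and B^(kj) are stored on processor ij, they are multiplied and accumulated in C^(ij). For example, whenever A^(1k) and B^(k2) are stored on processor 12, they are multiplied and accumulated in C^(12).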
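The block movement can be simulated sequentially. The sketch below assumes an s by s mesh with wraparound links and, to stay short, treats each block as a single number; the function name and the 0-based processor indices are choices made for this illustration.

```python
def cannon_multiply(A, B):
    """Simulate Cannon's algorithm on an s-by-s mesh, one entry ("block") per processor.

    a[i][j] and b[i][j] are the blocks currently held by processor (i, j);
    c[i][j] accumulates block C^(ij) and never moves.
    """
    s = len(A)
    # Initial skew: shift row i of A left by i and column j of B up by j,
    # so processor (i, j) starts with A^(ik) and B^(kj) for k = (i + j) mod s.
    a = [[A[i][(j + i) % s] for j in range(s)] for i in range(s)]
    b = [[B[(i + j) % s][j] for j in range(s)] for i in range(s)]
    c = [[0] * s for _ in range(s)]
    for _ in range(s):
        # Every processor multiplies the pair of blocks it currently holds.
        for i in range(s):
            for j in range(s):
                c[i][j] += a[i][j] * b[i][j]
        # Shift A one step west and B one step north, with wraparound.
        a = [[a[i][(j + 1) % s] for j in range(s)] for i in range(s)]
        b = [[b[(i + 1) % s][j] for j in range(s)] for i in range(s)]
    return c


if __name__ == "__main__":
    print(cannon_multiply([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

After the skew, every shift brings a new matching pair A^(ik), B^(kj) to processor (i, j), so after s steps each processor has accumulated its full block of C using only nearest-neighbor communication.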

13 Matrix Multiplication: Distributed Memory
Algorithm is scalable to larger machines and larger problems
Optimized only for a particular architecture

The advantage of an algorithm such as Cannon’s is that it is scalable to larger machines and larger problems. Unfortunately, the algorithm is optimized only for a particular architecture, so we would have to rewrite it for other architectures.

14 Data Layouts on Distributed Memory Machines
Choose a data layout that:
  Permits highly parallel implementation of a variety of matrix algorithms
  Limits communication cost as much as possible
  Scales to larger matrices and machines

Our ideal data layout on distributed memory machines would permit highly parallel implementation of a variety of algorithms, would limit communication costs and would scale to larger matrices and machines.

15 Block Layout
Layout suitable for BLAS
For left-to-right algorithms, load balance is poor

One example of a data layout is the block layout. Here the matrix A is broken up into subblocks, one per processor. This layout makes using the BLAS on each processor attractive. However, for straightforward matrix algorithms that process the matrix from left to right, such as Gaussian elimination and QR decomposition, the leftmost processors will become idle early in the computation, which makes the load balance poor.

16 Scatter Layout
Optimizes load balance
Inhibits use of BLAS locally in each processor

This next layout is called the scatter layout. It optimizes the load balance, since the matrix entries stored on a single processor are distributed uniformly throughout the matrix. However, this layout inhibits the use of the BLAS locally in each processor, since the data owned by a processor are not contiguous from the point of view of the matrix.

17 Block-Scatter Layout
Trades off load balance and applicability of the BLAS

Finally, the block-scatter layout combines the previous two layouts. This layout offers a trade-off between load balance and the applicability of the BLAS.
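The three layouts differ only in the rule that maps a matrix index to its owning processor. The sketch below shows a one-dimensional version over column indices; the function names are hypothetical, and the slides presumably apply the same rule to both rows and columns on a two-dimensional processor grid.

```python
def block_owner(j, n, p):
    """Block layout: n columns split into p contiguous chunks."""
    chunk = -(-n // p)  # ceil(n / p)
    return j // chunk


def scatter_owner(j, p):
    """Scatter layout: consecutive columns go to consecutive processors."""
    return j % p


def block_scatter_owner(j, b, p):
    """Block-scatter layout: blocks of b columns are dealt out cyclically."""
    return (j // b) % p


if __name__ == "__main__":
    n, p = 8, 2
    print("block        ", [block_owner(j, n, p) for j in range(n)])
    print("scatter      ", [scatter_owner(j, p) for j in range(n)])
    print("block-scatter", [block_scatter_owner(j, 2, p) for j in range(n)])
```

In a left-to-right algorithm, the block rule leaves processor 0 with nothing to do once the first half of the columns is finished, while the scatter rule keeps both processors busy but gives each one only isolated columns. The block size b in the block-scatter rule tunes between those two extremes.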

18 Gaussian Elimination on Dense Matrices
Solving Ax=b:
  Use Gaussian elimination to factor A into PA=LU
  Solve the triangular systems Ly=Pb and Ux=y for x (Level 2 BLAS)

Now let’s take a look at an algorithm used to solve the linear system Ax=b, where A is a dense matrix. We first use Gaussian elimination to factor A into PA=LU, where P is a permutation matrix that permutes the rows of A to ensure stability, L is a lower triangular matrix and U is an upper triangular matrix. Next we solve the two triangular systems to get our final answer x.

19 Gaussian Elimination on Dense Matrices
Row-Oriented Gaussian Elimination (kij-LU decomposition)

The simplest version of the Gaussian elimination algorithm adds multiples of one row of A to the other rows to zero out the subdiagonal entries, overwriting A with L and U. The algorithm consists of three nested loops, and there are 6 ways of ordering them. We ultimately want to use algorithms that preserve locality. That is, if A is stored by rows, we want the version whose inner loop accesses rows.
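The factor-then-solve procedure can be sketched with the kij loop ordering as follows. This is a minimal dense illustration with partial pivoting, not the listing from the slide; the function names `lu_factor` and `lu_solve` and the packed storage of L and U in one array are choices made for this example.

```python
def lu_factor(A):
    """Factor PA = LU in kij order.

    Returns (LU, perm): L (unit lower triangular, diagonal not stored) and U
    are packed into one matrix, and perm[i] is the original index of row i.
    """
    n = len(A)
    LU = [row[:] for row in A]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda r: abs(LU[r][k]))
        if LU[p][k] == 0:
            raise ValueError("matrix is singular")
        LU[k], LU[p] = LU[p], LU[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            LU[i][k] /= LU[k][k]              # multiplier, stored as L[i][k]
            for j in range(k + 1, n):         # inner loop runs along row i
                LU[i][j] -= LU[i][k] * LU[k][j]
    return LU, perm


def lu_solve(LU, perm, b):
    """Solve Ax = b from the packed factors: Ly = Pb, then Ux = y."""
    n = len(LU)
    y = [0.0] * n
    for i in range(n):                        # forward substitution
        y[i] = b[perm[i]] - sum(LU[i][j] * y[j] for j in range(i))
    x = [0.0] * n
    for i in reversed(range(n)):              # back substitution
        x[i] = (y[i] - sum(LU[i][j] * x[j] for j in range(i + 1, n))) / LU[i][i]
    return x
```

Because the innermost loop walks along row i, this ordering suits a matrix stored by rows, as Python lists of rows are. Swapping the i and j loops would give a column-oriented variant.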

20 Gaussian Elimination: Shared Memory
Can parallelize innermost loop: each entry can be updated independently
Use column blocking to compute LU in fast memory, then use Level 3 BLAS to update remainder

On shared memory machines, we can parallelize the innermost loop, since each A_ij can be updated independently. Column blocking is used to compute the LU factorization of a block of columns in fast memory, and then Level 3 BLAS are used to update the remainder of the matrix.

21 Gaussian Elimination: Distributed Memory
Use block-scatter mapping:
  Factorize each block and broadcast pivot information to other processors
  Optimal choice of block size depends on cost of communication vs. computation

On distributed memory machines, block-scatter mapping is used. The optimal choice of block size depends on the cost of communication vs. computation. For example, if the communication required to do a pivot search and swapping of rows is expensive, then the number of rows per block should be large.

22 Gaussian Elimination on Sparse Matrices
Sparse matrix: contains enough zero entries to be advantageous in reducing the storage and work required in solving a linear system
Difficulties:
  More overhead
  Arithmetic operations slower
Practical requirements:
  O(n) nonzeros
  Systematic patterns

Up to now we’ve been talking about dense matrices. Here I want to say a word about sparse matrices. A sparse matrix is a matrix with so few nonzero entries that we can reduce the storage and work required in solving a linear system. The difficulties arising from sparse data structures are that they involve more overhead than the simple arrays used for dense matrices (to store indices as well as the numerical values of nonzero matrix entries), and that arithmetic operations on the data stored in them usually cannot be performed as rapidly (because of indirect addressing of operands). A practical requirement for a family of matrices to be ‘usefully’ sparse is that they have only O(n) nonzero entries, i.e., a (small) constant number of nonzeros per row or column, independent of the matrix dimension. In addition to the number of nonzeros, their particular location, or pattern, in the matrix also has a major effect on how well sparsity can be exploited.
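Both difficulties show up in even the simplest sparse data structure. The sketch below uses compressed sparse row (CSR) storage, a common format chosen here as an assumption; the slides do not name a specific format, and the function names are hypothetical.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays.

    values[p] is a nonzero, col_index[p] is its column, and row i occupies
    positions row_start[i] .. row_start[i + 1] - 1.
    """
    values, col_index, row_start = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_index.append(j)   # index overhead: one integer per nonzero
        row_start.append(len(values))
    return values, col_index, row_start


def csr_matvec(values, col_index, row_start, x):
    """Compute y = A*x touching only the stored nonzeros."""
    y = []
    for i in range(len(row_start) - 1):
        s = 0
        for p in range(row_start[i], row_start[i + 1]):
            s += values[p] * x[col_index[p]]   # indirect addressing of x
        y.append(s)
    return y
```

The `col_index` and `row_start` arrays are the extra storage overhead, and the lookup `x[col_index[p]]` is the indirect addressing that makes each arithmetic operation slower than in a dense array.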

23 Gaussian Elimination on Sparse Matrices
Modification may introduce fill
Dense first column will cause fill in all remaining columns
Dense last column causes no fill at all
Can select sparsity-preserving ordering in advance for SPD matrices

Modification of a sparse matrix may cause fill, in which zero entries are changed to nonzeros. If the first column of a matrix is dense, and the factorization works on the columns from left to right, then the remaining columns will completely fill in with nonzeros during the factorization. If this dense column is permuted to be the last column of the matrix, then it will cause no fill at all. Symmetric positive definite matrices have the advantage that no pivoting is needed for stability, so the location of fill depends only on the pattern of nonzeros and not on their numerical values. Thus, for any particular ordering of the columns, we can anticipate the location of the fill elements in advance, set up a data structure to accommodate them, and choose an ordering that preserves the sparsity of the matrix.
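The effect of ordering on fill can be checked symbolically, working only with the positions of the nonzeros. The sketch below assumes a symmetric "arrow" pattern (a diagonal plus one dense row and column) and assumes that no nonzero is ever cancelled to zero; the function names are hypothetical.

```python
def count_fill(pattern, n):
    """Eliminate columns left to right on a nonzero pattern and count fill.

    pattern is a set of (row, column) positions; the return value is the
    number of zero positions that become nonzero during elimination.
    """
    nz = set(pattern)
    original = len(nz)
    for k in range(n):
        rows = [i for i in range(k + 1, n) if (i, k) in nz]
        cols = [j for j in range(k + 1, n) if (k, j) in nz]
        # Eliminating column k updates entry (i, j) for every such pair.
        for i in rows:
            for j in cols:
                nz.add((i, j))
    return len(nz) - original


def arrow_pattern(n, dense_index):
    """Diagonal plus one dense row and column at position dense_index."""
    nz = {(i, i) for i in range(n)}
    nz |= {(i, dense_index) for i in range(n)}
    nz |= {(dense_index, j) for j in range(n)}
    return nz


if __name__ == "__main__":
    n = 5
    print("dense first:", count_fill(arrow_pattern(n, 0), n))
    print("dense last: ", count_fill(arrow_pattern(n, n - 1), n))
```

With the dense row and column first, every remaining position fills in; with them last, no position does, even though both matrices hold the same number of nonzeros.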

24 Sparse Cholesky Factorization
Three algorithms, depending on outer loop index:
  row-Cholesky
  column-Cholesky
  submatrix-Cholesky

A special case of Gaussian elimination for symmetric positive definite matrices is Cholesky factorization. In Cholesky factorization, we factor a symmetric positive definite matrix A into the product of a lower triangular matrix L and its transpose, A = L L^T. There are three basic types of algorithm, depending on which of the three loop indices is placed in the outer loop.

25 Sparse Cholesky Factorization
In row-Cholesky, successive rows of L are computed one by one, with the inner loops solving a triangular system for each new row in terms of the previously computed rows. It is seldom used, because it is difficult to provide a row-oriented data structure that can be accessed efficiently and difficult to parallelize the triangular solutions.

Column-Cholesky is the most commonly used variant. Here successive columns of L are computed one by one, with the inner loops computing a matrix-vector product that gives the effect of the previously computed columns on the column currently being computed.

In submatrix-Cholesky, as soon as a particular column is computed, its effects on all subsequent columns are computed immediately.
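The column-oriented variant can be sketched as follows. For brevity the example works on a dense matrix stored as a list of rows, which is an assumption made here; a sparse implementation would store and visit only the nonzero entries of each column.

```python
import math


def column_cholesky(A):
    """Return lower triangular L with A = L * L^T, one column at a time."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Start from the lower part of column j of A.
        col = [A[i][j] for i in range(j, n)]
        # Matrix-vector product: subtract the effect of columns 0 .. j-1.
        for k in range(j):
            for i in range(j, n):
                col[i - j] -= L[i][k] * L[j][k]
        if col[0] <= 0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(col[0])
        for i in range(j + 1, n):
            L[i][j] = col[i - j] / L[j][j]
    return L


if __name__ == "__main__":
    for row in column_cholesky([[4, 12, -16], [12, 37, -43], [-16, -43, 98]]):
        print(row)
```

Column j is only read from columns to its left and is finished in one pass. In submatrix-Cholesky the same updates would instead be pushed to all later columns as soon as column j is complete.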

26 Iterative Methods for Linear Systems
Basic time-consuming computational kernels of iterative schemes:
  inner products
  vector updates
  matrix-vector products
  preconditioning

Up to now we’ve talked about direct methods of solving linear systems. Iterative methods start with an initial guess and update that guess until it is within a certain error bound of the desired solution. Iterative schemes have 4 basic time-consuming computational kernels, listed on this slide.
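To see where the four kernels appear, here is a minimal sketch of one iterative scheme, the conjugate gradient method with a diagonal (Jacobi) preconditioner. Both the choice of method and the choice of preconditioner are assumptions for illustration; the slide does not single out a scheme. The matrix is assumed symmetric positive definite.

```python
def dot(u, v):
    """Kernel 1: inner product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def axpy(alpha, x, y):
    """Kernel 2: vector update alpha*x + y."""
    return [alpha * xi + yi for xi, yi in zip(x, y)]


def matvec(A, x):
    """Kernel 3: matrix-vector product."""
    return [dot(row, x) for row in A]


def jacobi_pcg(A, b, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradients for SPD A, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                                    # residual b - A*x for x = 0
    z = [r[i] / A[i][i] for i in range(n)]      # Kernel 4: preconditioning
    p = z[:]
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol:             # stop once the residual is small
            break
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = axpy(alpha, p, x)
        r = axpy(-alpha, Ap, r)
        z = [r[i] / A[i][i] for i in range(n)]
        rz_new = dot(r, z)
        p = axpy(rz_new / rz, p, z)
        rz = rz_new
    return x
```

Each iteration costs one matrix-vector product, one preconditioner application, a few inner products and a few vector updates, so the parallel performance of the whole method is determined by how well those four kernels parallelize.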

27 Inner Products
Easily parallelized: each processor computes the inner product of its segments of the two vectors
Distributed memory:
  Local inner products sent to other processors
  Hard to find iterative algorithms which overlap communication and computation
Shared memory: parallel inner product computation easily implemented

In general, inner products are easily parallelized. On distributed memory systems, the local inner products have to be sent to other processors to form the global inner product, and it is hard to find iterative algorithms that overlap this communication with computation. Inner product computation works well on shared memory machines.
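The two phases of a parallel inner product, local work and then a global combination, can be sketched by simulating p processors in a single loop. The function name and the even split of indices are assumptions made for this example.

```python
def segmented_dot(x, y, p):
    """Inner product of x and y computed as p local inner products plus a reduction.

    Returns (global_value, local_values). Each entry of local_values stands
    for the partial result one processor would compute on its own segment.
    """
    n = len(x)
    bounds = [n * r // p for r in range(p + 1)]   # segment r is bounds[r]:bounds[r+1]
    local = [
        sum(x[i] * y[i] for i in range(bounds[r], bounds[r + 1]))
        for r in range(p)
    ]
    # On a distributed memory machine this sum is the communication step:
    # every local value must reach the other processors.
    return sum(local), local
```

The local phase needs no communication at all. The final sum is short, but on a distributed memory machine every processor has to wait for it before the iteration can continue.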

28 Vector Updates
Parallelization is trivial if entire vector fits in memory
Can have communication problems on distributed memory machines

As for vector updates, the parallelization is trivial if the entire vector fits into memory. Depending on how the vectors are stored, we can have communication problems on distributed memory machines.

29 Matrix-Vector Products
Distributed memory: problems if vector doesn’t fit into memory
Shared memory: easily parallelized

Matrix-vector products are easily parallelized on shared memory machines by splitting the matrix into strips corresponding to the vector segments. Each processor takes care of the matrix-vector product of one strip. Again, we run into problems with distributed memory.
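The strip decomposition on a shared memory machine can be sketched with a thread pool, where every worker reads the shared vector x and computes the rows of one strip. The function name and the use of Python threads are assumptions for illustration. In CPython the threads mostly show the structure rather than a real speedup, because the interpreter runs only one thread of Python code at a time.

```python
from concurrent.futures import ThreadPoolExecutor


def strip_matvec(A, x, workers=2):
    """Compute y = A*x by giving each worker one horizontal strip of A."""
    n = len(A)
    bounds = [n * r // workers for r in range(workers + 1)]

    def multiply_strip(r):
        # Rows bounds[r] .. bounds[r+1]-1 of A times the shared vector x.
        strip = A[bounds[r]:bounds[r + 1]]
        return [sum(a * xj for a, xj in zip(row, x)) for row in strip]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        pieces = list(pool.map(multiply_strip, range(workers)))  # results stay in strip order
    return [yi for piece in pieces for yi in piece]
```

No worker writes to data that another worker reads, so no synchronization is needed beyond waiting for all strips to finish. On a distributed memory machine the same split would require every processor to hold, or receive, the parts of x that its strip touches.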

30 Preconditioning
Can be the most problematic when parallelizing
Approaches:
  Reordering the computations
  Reordering the unknowns
  Forced parallelism

To speed up the rate of convergence of an iterative method, a preconditioner is used to transform the matrix A. Preconditioning can be the most problematic computation when parallelizing. Some of the approaches to solving the associated problems are reordering the computations, reordering the unknowns, and forced parallelism, where couplings to unknowns on other processors are neglected. Each approach has pros and cons.

31 Netlib
Large body of numerical software
Email netlib@ornl.gov:
  ‘send index’
  ‘send subroutine_name from library_name’
X-window interface Xnetlib: ‘send Xnetlib.shar from netlib’

A lot of the routines discussed today can be found in Netlib, which is a large body of numerical software. You can access the software by sending an email to netlib@ornl.gov with the request ‘send subroutine_name from library_name’. The request ‘send index’ will give you a list of the available software. You could also use the X-window interface, Xnetlib. Again, email netlib with ‘send xnetlib.shar from netlib’ as the text of the message.

