
Slide 1: CS 267 Dense Linear Algebra: Possible Class Projects
James Demmel
www.cs.berkeley.edu/~demmel/cs267_Spr09
03/04/2009, CS267 Lecture 12a

Slide 2: Kinds of class projects
- Try tuning existing (widely used) codes in LAPACK, ScaLAPACK, or possible future versions
  - Possible impact: help many people run faster
- Add missing functionality to these libraries
  - Possible impact: lots of users want it
- Experiment with algorithms on new architectures
  - Possible impact: What do we need to do differently for performance on these platforms? Are there any bottlenecks or other problems in the architecture? Could they be fixed?
- Experiment with new software approaches
  - Possible impact: Is it easier to write these algorithms while still getting most of the performance? Should we produce future versions of the libraries this way?
- Experiment with new algorithms
  - Possible impact: find a better one!

Slide 3: Challenges to Libraries (and parallel SW in general)
- Minimizing communication costs
  - The cost of bandwidth and latency (to main memory or over a network) is growing exponentially relative to arithmetic
- Heterogeneous platforms
  - Different communication costs depending on destination: same chip vs. different socket vs. different board ...
  - CPU + GPU perform different operations at very different rates
- Dynamic scheduling and load balancing
  - Can't always assume each core/processor makes constant progress on your task
  - May be faster to grab the next available task than to use a predesigned "perfectly balanced" schedule
  - The OS may give or take away resources on the fly
- Fault tolerance: how to recover when one processor fails

Slide 4: Strassen's Matmul on Multicore or GPU
- Why is there no Strassen in most libraries?
  - See "Baleful Effect of Benchmarks..." by Prof. Kahan
- Likely to be faster for modest-to-large matrix sizes
  - Where is the crossover? May want a hybrid: switch to the O(n^3) algorithm below a certain size
  - Autotuning?
- Lots of "blocking" opportunities, as for standard matmul
  - What is the least amount of data movement possible?
- How well does it work for the rectangular matmuls in LU, QR, and Cholesky?
  - Do we need to modify LU, QR, or Cholesky to take advantage of Strassen (by using a variant that multiplies different-size matrices)?
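To make the crossover/hybrid point concrete, here is a minimal NumPy sketch (not from the slides): a recursive Strassen multiply that falls back to the classical O(n^3) kernel below an assumed CROSSOVER size, and also for odd dimensions instead of padding. The CROSSOVER value is a placeholder that autotuning would pick per machine.

import numpy as np

CROSSOVER = 128   # assumed crossover size; in practice chosen by autotuning

def strassen(A, B):
    # Square-matrix Strassen multiply; switch to the classical kernel
    # below the crossover or when the dimension is odd.
    n = A.shape[0]
    if n <= CROSSOVER or n % 2 != 0:
        return A @ B
    m = n // 2
    A11, A12, A21, A22 = A[:m, :m], A[:m, m:], A[m:, :m], A[m:, m:]
    B11, B12, B21, B22 = B[:m, :m], B[:m, m:], B[m:, :m], B[m:, m:]
    # Seven recursive products instead of eight
    M1 = strassen(A11 + A22, B11 + B22)
    M2 = strassen(A21 + A22, B11)
    M3 = strassen(A11, B12 - B22)
    M4 = strassen(A22, B21 - B11)
    M5 = strassen(A11 + A12, B22)
    M6 = strassen(A21 - A11, B11 + B12)
    M7 = strassen(A12 - A22, B21 + B22)
    C = np.empty((n, n), dtype=np.result_type(A, B))
    C[:m, :m] = M1 + M4 - M5 + M7
    C[:m, m:] = M3 + M5
    C[m:, :m] = M2 + M4
    C[m:, m:] = M1 - M2 + M3 + M6
    return C

# Quick check
A = np.random.rand(512, 512); B = np.random.rand(512, 512)
print(np.allclose(strassen(A, B), A @ B))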

Slide 5: Review: alternative recursive GE formulation
- Toledo (1997)
- Described without pivoting for simplicity
- "Do left half of matrix, then right half"

function [L,U] = RLU(A)                   ... assume A is m by n
  if (n = 1)
    L = A / A(1,1), U = A(1,1)
  else
    [L1,U1] = RLU( A(1:m, 1:n/2) )        ... do left half of A
                                          ... let L11 denote the top n/2 rows of L1
    A(1:n/2, n/2+1:n) = L11^(-1) * A(1:n/2, n/2+1:n)
                                          ... update top n/2 rows of right half of A
    A(n/2+1:m, n/2+1:n) = A(n/2+1:m, n/2+1:n) - A(n/2+1:m, 1:n/2) * A(1:n/2, n/2+1:n)
                                          ... update rest of right half of A
    [L2,U2] = RLU( A(n/2+1:m, n/2+1:n) )  ... do right half of A
  return [ L1, [0; L2] ] and [ U1, [ A(.,.); U2 ] ]      ... so that A = L * U
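For concreteness, here is a small runnable NumPy transcription of the recursion above (not from the slides). It assumes m >= n and nonzero pivots, and returns L and U explicitly instead of overwriting A; it is meant to illustrate the recursion, not to be a robust factorization.

import numpy as np

def rlu(A):
    # Recursive LU without pivoting, following the RLU recursion on the slide.
    # A is m-by-n with m >= n; returns L (m-by-n, unit lower trapezoidal)
    # and U (n-by-n, upper triangular). Illustration only.
    A = np.array(A, dtype=float)
    m, n = A.shape
    if n == 1:
        return A / A[0, 0], A[:1, :1].copy()
    k = n // 2
    L1, U1 = rlu(A[:, :k])                       # do left half of A
    L11 = L1[:k, :]                              # top k rows of L1
    A[:k, k:] = np.linalg.solve(L11, A[:k, k:])  # update top k rows of right half
    A[k:, k:] -= L1[k:, :] @ A[:k, k:]           # update rest of right half (Schur complement)
    L2, U2 = rlu(A[k:, k:])                      # do right half of A
    L = np.block([[L11, np.zeros((k, n - k))],
                  [L1[k:, :], L2]])
    U = np.block([[U1, A[:k, k:]],
                  [np.zeros((n - k, k)), U2]])
    return L, U

A = np.random.rand(6, 4)
L, U = rlu(A)
print(np.allclose(L @ U, A))   # True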

Slide 6: Register-file-resident Linear Algebra on GPUs
- Vasily's results for LU, QR, and Cholesky on the GPU target single large matrices that are too large to fit entirely in the GPU's "fast memory" (shared memory + registers)
- There is also demand for solving many smaller problems in parallel, e.g. A(i) * x(i) = b(i) for many different A(1),...,A(k) and b(1),...,b(k)
- Project: design linear algebra algorithms that operate on many different matrices in parallel, each small enough to fit in the 64 KB register set of each multiprocessor
  - e.g. a single-precision square matrix of dimension n = 128
- Question: Does the possible need to branch differently on each multiprocessor (because of different pivot orders) matter? If so, is QR better than LU?
- Question: Do we need BLAS3 code versions on such small matrices, or is BLAS2 enough?
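As a reference point for the "many small systems" interface, here is a minimal NumPy sketch (CPU only, not the GPU design the slide asks for): it solves A(i) x(i) = b(i) for a whole batch at once by broadcasting the LU solve over the leading batch dimension. The batch size and the diagonal shift that keeps the random matrices well conditioned are illustrative choices.

import numpy as np

k, n = 1000, 128                           # batch size and matrix dimension (n = 128 as on the slide)
rng = np.random.default_rng(0)
A = rng.random((k, n, n)) + n * np.eye(n)  # shift the diagonal so every A(i) is well conditioned
b = rng.random((k, n))

# Batched solve: NumPy factors and solves each A(i) independently;
# a GPU version would instead keep each A(i) in one multiprocessor's registers.
x = np.linalg.solve(A, b[..., None])[..., 0]

print(np.allclose(np.einsum('kij,kj->ki', A, x), b))   # residual check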

Slide 7: Extend Vasily's GPU analysis and code to ATI
- Vasily's Best Student Paper Award from SC08 had two parts:
  - Analyzed bottlenecks and speedup possibilities in the NVIDIA architecture
  - Applied the lessons to a reorganization of LU, QR, and Cholesky
- What about the ATI GPU?
  - Both aspects above are interesting
  - An ATI GPU is available in ParLab
  - What are the pros and cons of the ATI and NVIDIA architectures? Others?
  - Do we need to reorganize the algorithms differently for each, or does one algorithm (perhaps with different block sizes and other parameters) work for both (which would be simpler)?
- Other BLAS-like operations on the GPU
  - Needed for finite-element analysis

Slide 8: Missing Drivers in Sca/LAPACK

Linear Equations (LU; Cholesky; LDL^T)
  LAPACK:    xGESV, xPOSV, xSYSV
  ScaLAPACK: PxGESV, PxPOSV, missing

Least Squares (QR; QR+pivot; SVD/QR; SVD/D&C; SVD/MRRR; QR + iterative refinement)
  LAPACK:    xGELS, xGELSY, xGELSS, xGELSD, missing (oops), missing
  ScaLAPACK: PxGELS, missing driver, missing (intent), missing (oops), missing

Generalized LS (LS + equality constraints; generalized LM; above + iterative refinement)
  LAPACK:    xGGLSE, xGGGLM, missing

Slide 9: More missing drivers

Symmetric EVD (QR / Bisection+Invit; D&C; MRRR)
  LAPACK:    xSYEV / X, xSYEVD, xSYEVR
  ScaLAPACK: PxSYEV / X, missing (intent), missing

Nonsymmetric EVD (Schur form; vectors too)
  LAPACK:    xGEES / X, xGEEV / X
  ScaLAPACK: missing driver

SVD (QR; D&C; MRRR; Jacobi)
  LAPACK:    xGESVD, xGESDD, missing (oops), xGESVJ
  ScaLAPACK: PxGESVD, missing (intent), missing (oops), missing

Generalized Symmetric EVD (QR / Bisection+Invit; D&C; MRRR)
  LAPACK:    xSYGV / X, xSYGVD, missing
  ScaLAPACK: PxSYGV / X, missing (intent), missing

Generalized Nonsymmetric EVD (Schur form; vectors too)
  LAPACK:    xGGES / X, xGGEV / X
  ScaLAPACK: missing

Generalized SVD (Kogbetliantz; MRRR)
  LAPACK:    xGGSVD, missing (oops)
  ScaLAPACK: missing (intent), missing (oops)

Slide 10: Missing matrix types in ScaLAPACK
- Symmetric, Hermitian, triangular: band, packed
- Positive definite: packed
- Orthogonal, unitary: packed

Slide 11: Tuning the data layout
- The layout depends on the block size b and the processor grid Pr x Pc
- Simple layouts are easy for the user, but bad for performance
- Times obtained on 60 processors (dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory)
- Speedups from using a 2D processor grid range from 2x to 8x
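For reference, the 2D block-cyclic distribution that ScaLAPACK uses can be written down in a few lines. This little NumPy sketch (not from the slides) just shows which process owns each matrix entry for a given block size b and process grid Pr x Pc, which is exactly the choice the tuning above is about; the example sizes are arbitrary.

import numpy as np

def block_cyclic_owner(i, j, b, Pr, Pc):
    # Process coordinates (pr, pc) owning global entry (i, j) in a
    # 2D block-cyclic layout with square blocks of size b on a Pr x Pc grid.
    return (i // b) % Pr, (j // b) % Pc

# Example: 8x8 matrix, block size 2, 2x2 process grid.
n, b, Pr, Pc = 8, 2, 2, 2
owners = np.array([[Pc * block_cyclic_owner(i, j, b, Pr, Pc)[0]
                    + block_cyclic_owner(i, j, b, Pr, Pc)[1]
                    for j in range(n)] for i in range(n)])
print(owners)   # flat process id owning each entry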

Slide 12: Cost of tuning the data layout, compared to runtime
- Times obtained on 60 processors (dual AMD Opteron 1.4 GHz cluster with Myrinet interconnect, 2 GB memory)
- The cost of redistributing the matrix to the optimal layout is small
- Possible project: build a "wrapper" that chooses the fastest layout, decides whether to convert back and forth, and hides the details from the user

Slide 13: Parallel Eigenvalue Algorithms on GPU
- Harder to use all BLAS3 than for solving Ax = b or least squares
- Symmetric eigenvalue problem for A = A^T (SVD is similar):
  - Find orthogonal Q to transform A = Q T Q^T, where T = T^T is tridiagonal (nonzero only on the main diagonal and immediately above and below it)
  - Find eigenvalues Λ = diag(λ1, ..., λn) and orthogonal eigenvectors U of T, so T = U Λ U^T
    - Good parallel algorithms exist; cheaper than the first step
  - Then A = (QU) Λ (QU)^T, so the orthogonal eigenvectors are QU and the eigenvalues are Λ
- Computing A = Q T Q^T is the proposed challenge
  - Use "Successive Band Reduction" (Sun, Bischof et al.)
  - Go from A to a wide band matrix B via A = V B V^T, V orthogonal
    - All BLAS3, fast on the GPU
  - Go from B to tridiagonal T via B = W T W^T, W orthogonal
    - BLAS1 and BLAS2; do it on the CPU
  - Find T = U Λ U^T as above; then A = (VWU) Λ (VWU)^T
- Prospect of minimizing communication in theory
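The overall two-step structure (reduce A to tridiagonal T with an orthogonal Q, then solve the small tridiagonal eigenproblem and multiply the orthogonal factors back together) can be demonstrated on the CPU with SciPy. This is only a sketch of the math above, not the GPU successive-band-reduction scheme the slide proposes.

import numpy as np
from scipy.linalg import hessenberg, eigh_tridiagonal

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 6))
A = (X + X.T) / 2                                # symmetric test matrix

# Step 1: A = Q T Q^T with Q orthogonal; for symmetric A the
# Hessenberg form T is tridiagonal.
T, Q = hessenberg(A, calc_q=True)

# Step 2: tridiagonal eigenproblem T = U diag(lam) U^T.
lam, U = eigh_tridiagonal(np.diag(T), np.diag(T, 1))

# Combine: eigenvalues of A are lam, eigenvectors are Q U.
V = Q @ U
print(np.allclose(A @ V, V * lam))               # True
print(np.allclose(V @ np.diag(lam) @ V.T, A))    # True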

Slide 14: Experiment with PLASMA for Multicore
- PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  - icl.cs.utk.edu/plasma/

Slide 15: Fork-Join vs. Dynamic Execution on Multicore (figure; source: Jack Dongarra)
- Fork-join execution (parallel BLAS) vs. DAG-based dynamic scheduling of tile tasks
- The DAG-based schedule finishes earlier ("time saved" in the figure)
- Experiments on Intel's quad-core Clovertown with 2 sockets (8 threads)
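To make the DAG idea concrete, here is a toy Python sketch (emphatically not PLASMA itself) of a tiled Cholesky factorization in which every tile kernel is a task that starts as soon as the tasks it depends on have finished, instead of running in fork-join phases. Kernels are plain NumPy, dependencies come from a simple "last writer of each tile" rule, and the scheduler is a small thread pool; the block size, worker count, and the restriction to n divisible by b are illustrative simplifications.

import numpy as np
import threading
from functools import partial
from concurrent.futures import ThreadPoolExecutor

def tiled_cholesky_dag(A, b, workers=4):
    # Lower-triangular Cholesky of SPD matrix A, with n divisible by b.
    n = A.shape[0]
    nt = n // b
    T = [[A[i*b:(i+1)*b, j*b:(j+1)*b].copy() for j in range(nt)] for i in range(nt)]

    # Tile kernels (each writes exactly one tile).
    def potrf(k):        T[k][k] = np.linalg.cholesky(T[k][k])
    def trsm(k, i):      T[i][k] = np.linalg.solve(T[k][k], T[i][k].T).T
    def update(k, i, j): T[i][j] = T[i][j] - T[i][k] @ T[j][k].T

    # Build the task DAG: a task depends on the last writer of every tile it touches.
    tasks, deps, last_writer = [], [], {}
    def add(kernel, reads, writes):
        deps.append({last_writer[t] for t in reads + [writes] if t in last_writer})
        tasks.append(kernel); last_writer[writes] = len(tasks) - 1
    for k in range(nt):
        add(partial(potrf, k), [], (k, k))
        for i in range(k + 1, nt):
            add(partial(trsm, k, i), [(k, k)], (i, k))
        for i in range(k + 1, nt):
            for j in range(k + 1, i + 1):
                add(partial(update, k, i, j), [(i, k), (j, k)], (i, j))

    # Dynamic scheduler: run any task whose predecessors have all finished.
    remaining = [len(d) for d in deps]
    children = [[] for _ in tasks]
    for t, d in enumerate(deps):
        for p in d:
            children[p].append(t)
    lock, all_done, pending = threading.Lock(), threading.Event(), len(tasks)

    with ThreadPoolExecutor(max_workers=workers) as pool:
        def run(t):
            nonlocal pending
            tasks[t]()
            ready = []
            with lock:
                pending -= 1
                if pending == 0:
                    all_done.set()
                for c in children[t]:
                    remaining[c] -= 1
                    if remaining[c] == 0:
                        ready.append(c)
            for c in ready:
                pool.submit(run, c)
        for t, r in enumerate(remaining):
            if r == 0:
                pool.submit(run, t)
        all_done.wait()

    L = np.zeros_like(A)
    for i in range(nt):
        for j in range(i + 1):
            L[i*b:(i+1)*b, j*b:(j+1)*b] = T[i][j]
    return L

# Quick check on a random SPD matrix.
X = np.random.rand(256, 256)
A = X @ X.T + 256 * np.eye(256)
L = tiled_cholesky_dag(A, b=64)
print(np.allclose(L @ L.T, A))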

Slide 16: Experiment with PLASMA for Multicore
- PLASMA is an experimental system for writing and scheduling linear algebra algorithms as Directed Acyclic Graphs (DAGs)
  - icl.cs.utk.edu/plasma/
- Experiment with PLASMA:
  - Implement other factorizations
  - Compare performance to LAPACK with parallel BLAS and to ScaLAPACK
  - Evaluate expressiveness for eigenvalue problems
  - Study the interaction of its scheduler with the higher-level scheduler being designed in ParLab
    - Can PLASMA "gracefully" accept and give up resources?
- Perform analogous experiments with UPC, Titanium, or other PGAS languages

Slide 17: Investigate the role of the "Dense Motif" in ParLab Apps
- An initial study showed dense linear algebra in the Image, Speech, and Music applications
- Determine what is really needed: functions, problem sizes, performance requirements
- What do we still need to optimize?

