1 Jim Demmel EECS & Math Departments, UC Berkeley Minimizing Communication in Numerical Linear Algebra

1 Jim Demmel, EECS & Math Departments, UC Berkeley, demmel@cs.berkeley.edu, www.cs.berkeley.edu/~demmel
Minimizing Communication in Numerical Linear Algebra
Communication Lower Bounds for Direct Linear Algebra

2 Outline
Recall motivation
Lower bound proof for Matmul, by Irony/Toledo/Tiskin
How to generalize it to linear algebra without orthogonal transformations
How to generalize it to linear algebra with orthogonal transformations
Summary of linear algebra problems for which the lower bound is attainable
Summary of extensions to Strassen

3 Collaborators
Grey Ballard, UCB EECS
Ioana Dumitriu, U. Washington
Laura Grigori, INRIA
Ming Gu, UCB Math
Mark Hoemmen, UCB EECS
Olga Holtz, UCB Math & TU Berlin
Julien Langou, U. Colorado Denver
Marghoob Mohiyuddin, UCB EECS
Oded Schwartz, TU Berlin
Hua Xiang, INRIA
Kathy Yelick, UCB EECS & NERSC
BeBOP group, bebop.cs.berkeley.edu
Summer School Lecture 3

4 Motivation (1/2)
Two kinds of costs:
  Arithmetic (FLOPS)
  Communication: moving data
    between levels of a memory hierarchy (sequential case)
    over a network connecting processors (parallel case)
[Figure: a CPU with cache and DRAM (sequential case); several CPUs, each with its own DRAM (parallel case)]

5 Motivation (2/2)
Running time of an algorithm is the sum of 3 terms:
  # flops * time_per_flop
  # words moved / bandwidth (communication)
  # messages * latency (communication)
Exponentially growing gaps between
  Time_per_flop << 1/Network BW << Network Latency, improving 59%/year vs 26%/year vs 15%/year
  Time_per_flop << 1/Memory BW << Memory Latency, improving 59%/year vs 23%/year vs 5.5%/year
Goal: reorganize linear algebra to avoid communication
  Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
  Not just hiding communication (speedup $\le 2\times$)
  Arbitrary speedups possible
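The three-term running-time model above can be sketched as a small function. This is only an illustration: the function name and the machine parameters in the usage lines are hypothetical and are not taken from the slides.

```python
def modeled_runtime(flops, words, messages, time_per_flop, bandwidth, latency):
    """Three-term model: arithmetic cost + bandwidth cost + latency cost, in seconds."""
    arithmetic = flops * time_per_flop   # # flops * time_per_flop
    bandwidth_cost = words / bandwidth   # # words moved / bandwidth
    latency_cost = messages * latency    # # messages * latency
    return arithmetic + bandwidth_cost + latency_cost


# Hypothetical machine and workload, chosen only to exercise the formula.
example_time = modeled_runtime(
    flops=2 * 1000**3,    # 2n^3 flops for n = 1000
    words=10**7,
    messages=10**4,
    time_per_flop=1e-10,  # seconds per flop (assumed)
    bandwidth=1e9,        # words per second (assumed)
    latency=1e-6,         # seconds per message (assumed)
)
```

Separating the three terms makes it easy to see which one dominates for a given algorithm and machine, which is the comparison the rest of the lecture relies on.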

6 Direct linear algebra: Prior Work on Matmul
Assume $n^3$ algorithm (i.e. not Strassen-like)
Sequential case, with fast memory of size M
  Lower bound on #words moved to/from slow memory = $\Omega(n^3 / M^{1/2})$ [Hong & Kung (1981)]
  Attained using blocked or recursive (“cache-oblivious”) algorithm
Parallel case on P processors:
  Let NNZ be total memory needed; assume load balanced
  Lower bound on #words communicated = $\Omega(n^3 / (P \cdot \mathrm{NNZ})^{1/2})$ [Irony, Tiskin & Toledo (2004)]
  When NNZ = $3n^2$ (“2D alg”), get $\Omega(n^2 / P^{1/2})$
    Attained by Cannon’s Algorithm
  For “3D alg” (NNZ = $O(n^2 P^{1/3})$), get $\Omega(n^2 / P^{2/3})$
    Attainable too (details later)

7 Direct linear algebra: Generalizing Prior Work
Sequential case: # words moved = $\Omega(n^3 / M^{1/2})$ = $\Omega(\#\text{flops} / (\text{fast\_memory\_size})^{1/2})$
Parallel case: # words moved = $\Omega(n^3 / (P \cdot \mathrm{NNZ})^{1/2})$ = $\Omega((n^3/P) / (\mathrm{NNZ}/P)^{1/2})$ = $\Omega(\#\text{flops\_per\_proc} / (\text{memory\_per\_processor})^{1/2})$
  (true for at least one processor, assuming “balance” in either flops or in memory)
In both cases, we let M = memory size, and write
  #words_moved by at least one processor = $\Omega(\#\text{flops} / M^{1/2})$
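The step from the Irony/Tiskin/Toledo form to the per-processor form is a one-line rearrangement, written out here for reference using the slide's own symbols (P processors, NNZ total memory, so NNZ/P is the memory per processor):

```latex
\frac{n^3}{(P \cdot \mathrm{NNZ})^{1/2}}
  = \frac{n^3}{\bigl(P^2 \cdot \mathrm{NNZ}/P\bigr)^{1/2}}
  = \frac{n^3}{P \cdot (\mathrm{NNZ}/P)^{1/2}}
  = \frac{n^3 / P}{(\mathrm{NNZ}/P)^{1/2}}
```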

8 Lower bound for all “direct” linear algebra
#words_moved by at least one processor = $\Omega(\#\text{flops} / M^{1/2})$
#messages_sent by at least one processor = $\Omega(\#\text{flops} / M^{3/2})$
Holds for
  Matmul, BLAS, LU, QR, eig, SVD
    Need to explain model of computation
  Some whole programs (sequences of these operations, no matter how they are interleaved)
  Dense and sparse matrices (where #flops << $n^3$)
  Sequential and parallel algorithms
  Some graph-theoretic algorithms

9 Proof of Communication Lower Bound on C = A·B (1/5)
Proof from Irony/Toledo/Tiskin (2004)
Original proof, then generalization
Think of instruction stream being executed
  Looks like “ … add, load, multiply, store, load, add, …”
  Each load/store moves a word between fast and slow memory, or between local memory and remote memory (another processor)
We want to count the number of loads and stores, given that we are multiplying n-by-n matrices C = A·B using the usual $2n^3$ flops, possibly reordered assuming addition is commutative/associative
Assuming that at most M words can be stored in fast memory
Outline:
  Break instruction stream into segments, each with M loads and stores
  Somehow bound the maximum number of flops that can be done in each segment, call it F
  So $F \cdot \#\text{segments} \ge T$ = total flops = $2n^3$, so $\#\text{segments} \ge T / F$
  So # loads & stores = $M \cdot \#\text{segments} \ge M \cdot T / F$

10 Illustrating Segments, for M=3
[Figure: timeline of Load, Store and FLOP instructions, divided into Segment 1, Segment 2, Segment 3, ...]

12 Proof of Communication Lower Bound on C = A·B (2/5)
Given segment of instruction stream with M loads & stores, how many adds & multiplies (F) can we do?
  At most 2M entries of C, 2M entries of A and/or 2M entries of B can be accessed
Use geometry:
  Represent $n^3$ multiplications by n x n x n cube
  One n x n face represents A
    each 1 x 1 subsquare represents one A(i,k)
  One n x n face represents B
    each 1 x 1 subsquare represents one B(k,j)
  One n x n face represents C
    each 1 x 1 subsquare represents one C(i,j)
  Each 1 x 1 x 1 subcube represents one C(i,j) += A(i,k) · B(k,j)
    May be added directly to C(i,j), or to temporary accumulator

13 Proof of Communication Lower Bound on C = A·B (3/5)
[Figure: n x n x n cube with axes i, j, k and an “A face”, “B face” and “C face”; labeled entries A(1,1), A(1,2), A(1,3), A(2,1), B(1,1), B(2,1), B(3,1), B(1,3), C(1,1), C(2,3); highlighted cube representing C(1,1) += A(1,3)·B(3,1)]
If we have at most 2M “A squares”, 2M “B squares”, and 2M “C squares” on faces, how many cubes can we have?

14 Proof of Communication Lower Bound on C = A·B (4/5)
[Figure: black box with side lengths x, y, z on axes i, j, k, with its “A shadow”, “B shadow” and “C shadow”]
# cubes in black box with side lengths x, y and z = Volume of black box = $x \cdot y \cdot z = (xz \cdot zy \cdot yx)^{1/2} = (\#\text{A squares} \cdot \#\text{B squares} \cdot \#\text{C squares})^{1/2}$
(i,k) is in “A shadow” if (i,j,k) in 3D set
(j,k) is in “B shadow” if (i,j,k) in 3D set
(i,j) is in “C shadow” if (i,j,k) in 3D set
Thm (Loomis & Whitney, 1949): # cubes in 3D set = Volume of 3D set $\le (\text{area(A shadow)} \cdot \text{area(B shadow)} \cdot \text{area(C shadow)})^{1/2}$

15 Proof of Communication Lower Bound on C = A·B (5/5)
Consider one “segment” of instructions with M loads, stores
Can be at most 2M entries of A, B, C available in one segment
Volume of set of cubes representing possible multiply/adds in one segment is $\le (2M \cdot 2M \cdot 2M)^{1/2} = (2M)^{3/2} \equiv F$
# Segments $\ge \lfloor 2n^3 / F \rfloor$
# Loads & Stores = $M \cdot \#\text{Segments} \ge M \cdot \lfloor 2n^3 / F \rfloor \ge n^3 / (2M)^{1/2} - M = \Omega(n^3 / M^{1/2})$
Parallel Case: apply reasoning to one processor out of P
  # Adds and Muls $\ge 2n^3 / P$ (at least one proc does this)
  $M = n^2 / P$ (each processor gets equal fraction of matrix)
  # “Load & Stores” = # words moved from or to other procs $\ge M \cdot (2n^3/P) / F = M \cdot (2n^3/P) / (2M)^{3/2} = n^2 / (2P)^{1/2}$
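The counting argument on this slide can be evaluated numerically. The sketch below assumes the segment convention used in the proof (each segment has M loads and stores, and at most F = (2M)^(3/2) flops); the function names are hypothetical.

```python
import math


def matmul_words_lower_bound(n, M):
    """M * floor(2n^3 / F), with F = (2M)^(3/2) flops possible per segment."""
    total_flops = 2 * n**3
    s = math.sqrt(2 * M)
    F = s * s * s                          # (2M)^(3/2)
    segments = math.floor(total_flops / F)
    return M * segments


def closed_form_bound(n, M):
    """The simplified expression n^3 / (2M)^(1/2) - M from the slide."""
    return n**3 / math.sqrt(2 * M) - M


# Example: n = 1024 and fast memory of M = 2048 words (illustrative values).
example_bound = matmul_words_lower_bound(1024, 2048)
```

For these values F = 4096^(3/2) = 262144, so there are at least 8192 segments and at least 2048 · 8192 = 16777216 words moved, slightly above the closed form 16775168.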

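16 Proof of Loomis-Whitney inequality
T = 3D set of 1x1x1 cubes on lattice
N = |T| = #cubes
$T_x$ = projection of T onto x=0 plane
$N_x = |T_x|$ = #squares in $T_x$, same for $T_y$, $N_y$, etc
Goal: $N \le (N_x \cdot N_y \cdot N_z)^{1/2}$
T(x=i) = subset of T with x=i
T(x=i | y) = projection of T(x=i) onto y=0 plane
N(x=i) = |T(x=i)| etc
$N = \sum_i N(x{=}i) = \sum_i N(x{=}i)^{1/2} \cdot N(x{=}i)^{1/2}$
  $\le \sum_i (N_x)^{1/2} \cdot N(x{=}i)^{1/2}$   (since $N(x{=}i) \le N_x$)
  $\le (N_x)^{1/2} \cdot \sum_i \bigl(N(x{=}i \mid y) \cdot N(x{=}i \mid z)\bigr)^{1/2}$   (since $N(x{=}i) \le N(x{=}i \mid y) \cdot N(x{=}i \mid z)$)
  $= (N_x)^{1/2} \cdot \sum_i N(x{=}i \mid y)^{1/2} \cdot N(x{=}i \mid z)^{1/2}$
  $\le (N_x)^{1/2} \cdot \bigl(\sum_i N(x{=}i \mid y)\bigr)^{1/2} \cdot \bigl(\sum_i N(x{=}i \mid z)\bigr)^{1/2}$   (Cauchy-Schwarz)
  $= (N_x)^{1/2} \cdot (N_y)^{1/2} \cdot (N_z)^{1/2}$
[Figure: the set T on axes x, y, z, one slice T(x=i), and its projections T(x=i | y) and T(x=i | z), illustrating $N(x{=}i) \le N(x{=}i \mid y) \cdot N(x{=}i \mid z)$]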
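As a sanity check on the inequality (not a substitute for the proof), the sketch below builds the three shadows of a finite set of lattice cubes and tests N^2 ≤ N_A · N_B · N_C on random sets. The (i, j, k) indexing follows the matmul cube; the function name and the random test sets are illustrative assumptions.

```python
import random


def loomis_whitney_holds(cubes):
    """cubes: set of (i, j, k) lattice points. Checks N^2 <= |A shadow| * |B shadow| * |C shadow|."""
    a_shadow = {(i, k) for (i, j, k) in cubes}  # projection along j
    b_shadow = {(k, j) for (i, j, k) in cubes}  # projection along i
    c_shadow = {(i, j) for (i, j, k) in cubes}  # projection along k
    n = len(cubes)
    return n * n <= len(a_shadow) * len(b_shadow) * len(c_shadow)


rng = random.Random(0)
random_sets = [
    {(rng.randrange(6), rng.randrange(6), rng.randrange(6)) for _ in range(40)}
    for _ in range(200)
]
all_random_sets_ok = all(loomis_whitney_holds(s) for s in random_sets)

# A full 2 x 3 x 4 box attains equality: 24^2 = 8 * 12 * 6.
box = {(i, j, k) for i in range(2) for j in range(3) for k in range(4)}
box_ok = loomis_whitney_holds(box)
```

The box case shows why the bound is tight: for a rectangular block the volume is exactly the square root of the product of the three face areas.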

17 Homework
Prove more general statement of Loomis-Whitney
Suppose T is d-dimensional
N = |T| = #d-dimensional cubes in T
T(i) is projection of T onto hyperplane x(i)=0
N(i) = (d-1)-dimensional volume of T(i)
Show $N \le \prod_{i=1}^{d} N(i)^{1/(d-1)}$

18 How to generalize this lower bound (1/4)
It doesn’t depend on C(i,j) being a matrix entry, just a unique memory location (same for A(i,k) and B(k,j))
It doesn’t depend on C and A not overlapping (or C and B, or A and B)
  Some reorderings may change answer; still get a lower bound for all reorderings
It doesn’t depend on C(i,j) just being a sum of products
  $C(i,j) = \sum_k A(i,k) \cdot B(k,j)$
  For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$
It doesn’t depend on doing $n^3$ multiply/adds

19 How to generalize this lower bound (2/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
It does assume the presence of an operand generating a load and/or store; how could this not happen?
  Mem(b(k,j)) could be reused in many more $g_{ijk}$ than (P) allows
    Ex: Compute $C^{(m)} = A * (B.\hat{}\,m)$ (Matlab notation) for m = 1 to t
    Can move many fewer words than $\Omega(\#\text{flops} / M^{1/2})$
  We might generate a result during a segment, use it, and discard it, without generating any memory traffic
    Turns out QR, eig, SVD all may do this
    Need a different analysis for them later…

20 How to generalize this lower bound (3/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
Need to distinguish Sources, Destinations of each operand in fast memory during a segment:
Possible Sources:
  S1: Already in fast memory at start of segment, or read; at most 2M
  S2: Created during segment; no bound without more information
Possible Destinations:
  D1: Left in fast memory at end of segment, or written; at most 2M
  D2: Discarded; no bound without more information
Need to assume no S2/D2 arguments; at most 4M of others

21 How to generalize this lower bound (4/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
Theorem: To evaluate (P) with memory of size M, where
  $f_{ij}$ and $g_{ijk}$ are “nontrivial” functions of their arguments
  G is the total number of $g_{ijk}$’s
  No S2/D2 arguments
requires at least $G / (8M)^{1/2} - M$ slow memory references
  Simpler: #words_moved = $\Omega(\#\text{flops} / M^{1/2})$
Corollary: To evaluate (P) requires at least $G / (8^{1/2} M^{3/2}) - 1$ messages (simpler: $\Omega(\#\text{flops} / M^{3/2})$)
  Proof: maximum message size is M
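The corollary follows from the theorem in one step, because no message can carry more than M words. Written out with the slide's symbols (G for the number of $g_{ijk}$ evaluations, M for the memory size):

```latex
\#\text{messages} \;\ge\; \frac{\#\text{words\_moved}}{M}
  \;\ge\; \frac{1}{M}\left(\frac{G}{(8M)^{1/2}} - M\right)
  \;=\; \frac{G}{8^{1/2}\, M^{3/2}} - 1
```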

22 Some Corollaries of this lower bound (1/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
#words_moved = $\Omega(\#\text{flops} / M^{1/2})$
Theorem applies to dense or sparse, parallel or sequential:
MatMul, including $A^T A$ or $A^2$
Triangular solve $C = A^{-1} \cdot X$
  $C(i,j) = \bigl(X(i,j) - \sum_{k<i} A(i,k) \cdot C(k,j)\bigr) / A(i,i)$ … A lower triangular
  C plays double role of b and c in Model (P)
LU factorization (any pivoting, LU or “ILU”)
  $L(i,j) = \bigl(A(i,j) - \sum_{k<j} L(i,k) \cdot U(k,j)\bigr) / U(j,j)$, $U(i,j) = A(i,j) - \sum_{k<i} L(i,k) \cdot U(k,j)$
  L (and U) play double role as c and a (c and b) in Model (P)
Cholesky (any diagonal pivoting, C or “IC”)
  $L(i,j) = \bigl(A(i,j) - \sum_{k<j} L(i,k) \cdot L^T(k,j)\bigr) / L(j,j)$, $L(j,j) = \bigl(A(j,j) - \sum_{k<j} L(j,k) \cdot L^T(k,j)\bigr)^{1/2}$
  L (and $L^T$) play triple role as a, b and c

23 Some Corollaries of this lower bound (2/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
#words_moved = $\Omega(\#\text{flops} / M^{1/2})$
Applies to “simplified” operations
  Ex 1: Compute $\|A \cdot B\|_F$ where A(i,k) = f(i,k) and B(k,j) = g(k,j), so no inputs and 1 output; assume each f(i,k) and g(k,j) evaluated once
  Ex 2: Compute determinant of A(i,j) = f(i,j) using LU
Apply lower bound by “imposing reads and writes”
  Ex 1: Every time a final value of (A·B)(i,j) is computed, write it; every time f(i,k), g(k,j) evaluated, insert a read
  Ex 2: Every time a final value of L(i,j), U(i,j) computed, write it; every time f(i,j) evaluated, insert a read
Still get $\Omega(\#\text{flops} / M^{1/2} - 3n^2)$ words_moved, by subtracting “imposed” reads/writes (sparse case analogous)

24 Some Corollaries of this lower bound (3/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
#words_moved = $\Omega(\#\text{flops} / M^{1/2})$
Applies to compositions of operations
  Ex: Compute $A^t$ by repeated multiplication, only input A, output $A^t$
  Reorganize composition to match (P): $[A^2, A^3, \ldots, A^t] = [A, A^2, \ldots, A^{t-1}] \cdot A$
  Impose writes of intermediate results $A^2, \ldots, A^{t-1}$
  Still get #words_moved = $\Omega(\#\text{flops} / M^{1/2} - (t-2)\, n^2)$
  Holds for any interleaving of operations
Homework: apply to repeated squaring

25 Some Corollaries of this lower bound (4/4)
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
#words_moved = $\Omega(\#\text{flops} / M^{1/2})$
Applies to some graph algorithms
Ex: All-Pairs-Shortest-Path by Floyd-Warshall:
  Initialize all path(i, j) = length of edge from node i to j (or $\infty$ if none)
  for k := 1 to n
    for i := 1 to n
      for j := 1 to n
        path(i, j) = min ( path(i, j), path(i, k) + path(k, j) );
Matches (P) with $g_{ijk}$ = “+” and $f_{ij}$ = “min”
Get #words_moved = $\Omega(n^3 / M^{1/2})$, for dense graphs
Homework: state result for sparse graphs; how does $n^3$ change?
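A runnable version of the Floyd-Warshall loop on this slide, as a minimal sketch. The comments mark where “+” plays the role of g_ijk and “min” the role of f_ij. The example graph is made up for illustration, and setting the diagonal to 0 is an assumption not stated on the slide.

```python
import math


def floyd_warshall(length):
    """All-pairs shortest paths; length[i][j] is the edge length, or math.inf if no edge."""
    n = len(length)
    path = [row[:] for row in length]  # initialize path(i, j) = edge length
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = path[i][k] + path[k][j]  # g_ijk = "+"
                if via_k < path[i][j]:           # f_ij = "min"
                    path[i][j] = via_k
    return path


INF = math.inf
example_graph = [
    [0, 3, INF, 7],
    [8, 0, 2, INF],
    [5, INF, 0, 1],
    [2, INF, INF, 0],
]
example_dist = floyd_warshall(example_graph)
```

The triple loop has the same n x n x n iteration space as matrix multiplication, which is why the matmul lower-bound argument carries over for dense graphs.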

26 “3D” parallel matrix multiplication C = C + A·B
p procs arranged in $p^{1/3} \times p^{1/3} \times p^{1/3}$ grid, with tree networks along “fibers”
Each A(i,k), B(k,j), C(i,j) is $n/p^{1/3} \times n/p^{1/3}$
Initially
  Proc (i,j,0) owns C(i,j)
  Proc (i,0,k) owns A(i,k)
  Proc (0,j,k) owns B(k,j)
Algorithm ($\alpha$ = latency, $\beta$ = reciprocal bandwidth)
  For all (i,k), broadcast A(i,k) to proc (i,j,k) $\forall j$ … comm. cost = $\log p^{1/3} \cdot \alpha + \log p^{1/3} \cdot n^2/p^{2/3} \cdot \beta$
  For all (j,k), broadcast B(k,j) to proc (i,j,k) $\forall i$ … same comm. cost
  For all (i,j,k), Tmp(i,j,k) = A(i,k) · B(k,j) … cost = $2(n/p^{1/3})^3 = 2n^3/p$ flops
  For all (i,j), reduce C(i,j) = $\sum_k$ Tmp(i,j,k) … same comm. cost
Total comm. cost = $O(\log(p) \cdot n^2/p^{2/3} \cdot \beta + \log(p) \cdot \alpha)$
Lower bound = $\Omega\bigl((n^3/p)/(n^2/p^{2/3})^{1/2} \cdot \beta + (n^3/p)/(n^2/p^{2/3})^{3/2} \cdot \alpha\bigr) = \Omega(n^2/p^{2/3} \cdot \beta + \alpha)$
Lower bound also ~attainable for all $n^2/p^{2/3} \ge M = n^2/p^{x} \ge n^2/p$ [Toledo et al]
[Figure: $p^{1/3} \times p^{1/3} \times p^{1/3}$ processor cube with axes i, j; labeled blocks A(1,1), A(1,2), A(1,3), A(2,1), B(1,1), B(1,2), B(2,1), B(3,1), C(1,1), C(2,2)]
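The simplification of the lower bound on this slide can be checked term by term. The notation here is an assumption about the symbols lost from the transcript: $\alpha$ is taken to be the latency and $\beta$ the reciprocal bandwidth, with memory per processor $M = n^2/p^{2/3}$ and $n^3/p$ flops per processor.

```latex
\frac{n^3/p}{M^{1/2}}
  = \frac{n^3/p}{n/p^{1/3}}
  = \frac{n^2}{p^{2/3}},
\qquad
\frac{n^3/p}{M^{3/2}}
  = \frac{n^3/p}{n^3/p}
  = 1
```

So the bandwidth term of the lower bound is $n^2/p^{2/3} \cdot \beta$ and the latency term is a constant times $\alpha$, which the algorithm's cost matches up to the $\log p$ factors.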

27 So when doesn’t Model (P) apply?
For all $(i,j) \in S$: $\mathrm{Mem}(c(i,j)) = f_{ij}\bigl(g_{ijk}(\mathrm{Mem}(a(i,k)), \mathrm{Mem}(b(k,j)))\ \text{for}\ k \in S_{ij},\ \text{some other arguments}\bigr)$   (P)
#words_moved = $\Omega(\#\text{flops} / M^{1/2})$
Ex: for r = 1 to t, $C^{(r)} = A * (B.\hat{}\,(1/r))$ … Matlab notation
(P) does not apply
  With A(i,k) and B(k,j) in memory, we can do t operations, not just 1 $g_{ijk}$
  Can’t apply (P) with arguments b(k,j) indexed in one-to-one fashion
Can still analyze using segments, Loomis-Whitney
  #flops/segment $\le 8 (t M^3)^{1/2}$
  #segments $\ge (t\, n^3) / \bigl(8 (t M^3)^{1/2}\bigr)$
  #words_moved = $\Omega(t^{1/2} \cdot n^3 / M^{1/2})$, not $\Omega(t \cdot n^3 / M^{1/2})$
Attainable, using variation of usual blocked matmul algorithm
  Homework! (Hint: need to recompute $(B(j,k))^{1/r}$ when needed)
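The last step on this slide, from the segment count to the word count, can be written out as follows. It assumes, as in the matmul proof, that each segment contains M loads and stores:

```latex
\#\text{words\_moved} \;\ge\; M \cdot \#\text{segments}
  \;\ge\; M \cdot \frac{t\, n^3}{8\,(t M^3)^{1/2}}
  \;=\; \frac{t\, n^3\, M}{8\, t^{1/2} M^{3/2}}
  \;=\; \frac{t^{1/2}\, n^3}{8\, M^{1/2}}
```

The factor $t^{1/2}$ rather than $t$ reflects that each pair A(i,k), B(k,j) held in fast memory can serve up to t operations.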

28 Lower bounds for Orthogonal Transformations (1/4)
Needed for QR, eig, SVD, …
Analyze Blocked Householder Transformations
  $\prod_{j=1}^{b} (I - 2 u_j u_j^T) = I - U T U^T$ where $U = [u_1, \ldots, u_b]$
  Treat Givens as 2x2 Householder
Model details and assumptions
  Write $(I - U T U^T) A = A - U (T U^T A) = A - U Z$ where $Z = T (U^T A)$
  Only count operations in all $A - UZ$ operations
    “Generically” a large fraction of the work
  Assume “Forward Progress”, that each successive Householder transformation leaves previously created zeros zero; OK for
    QR decomposition
    Reductions to condensed forms (Hessenberg, tri/bidiagonal)
      Possible exception: bulge chasing in banded case
    One sweep of (block) Hessenberg QR iteration

29 Lower bounds for Orthogonal Transformations (2/4)
Perform many $A - UZ$ where $Z = T (U^T A)$
First challenge to applying theorem: need to collect all $A - UZ$ into one big set to which model (P) applies
  Write them all as $\{ A(i,j) = A(i,j) - \sum_k U(i,k)\, Z(k,j) \}$ where k = index of k-th transformation
  k not necessarily = index of column of A it comes from
Second challenge: All Z(k,j) may be S2/D2
  Recall: S2/D2 means computed on the fly and discarded
  Ex: $A_{n \times 2n} = Q_{n \times n} \cdot R_{n \times 2n}$ where $2n^2 = M$ so A fits in cache
    Represent Q as n(n-1)/2 2x2 Householder (Givens) transforms
    There are $n^2(n-1)/2 = \Theta(M^{3/2})$ nonzero Z(k,j), not O(M)
    Still only do $O(M^{3/2})$ flops during segment
    But can’t use Loomis-Whitney to prove it!
Need a new idea…

30 Lower bounds for Orthogonal Transformations (3/4)
Dealing with Z(k,j) being S2/D2
How to bound #flops in $A(i,j) = A(i,j) - \sum_k U(i,k)\, Z(k,j)$ ?
Neither A nor U is S2/D2
  A either turned into R or U, which are output
  So at most 2M of each during segment
#flops $\le$ (#U(i,k)) · (#columns A(:,j) present)
  $\le$ (#U(i,k)) · (#A(i,j) / min #nonzeros per column of A)
  $\le h \cdot O(M) / r$ where h = O(M) … How small can r be?
To store h ≡ #U(i,k) Householder vector entries in r rows, there can be at most r in the first column, r-1 in the second, etc. (to maintain “Forward Progress”), so $r(r-1)/2 \ge h$, so $r \ge h^{1/2}$
# flops $\le h \cdot O(M) / r \le O(M) \cdot h^{1/2} = O(M^{3/2})$ as desired

31 Lower bounds for Orthogonal Transformations (4/4)
Theorem: #words_moved by QR decomposition using (blocked) Householder transformations = $\Omega(\#\text{flops} / M^{1/2})$
Theorem: #words_moved by reduction to Hessenberg form, tridiagonal form, bidiagonal form, or one sweep of QR iteration (or block versions of any of these) = $\Omega(\#\text{flops} / M^{1/2})$
  Assuming Forward Progress (created zeros remain zero)
  Model: Merge left and right orthogonal transformations:
    $A(i,j) = A(i,j) - \sum_{k_L} U_L(i,k_L) \cdot Z_L(k_L,j) - \sum_{k_R} Z_R(i,k_R) \cdot U_R(j,k_R)$

32 Decompose a matrix into a low rank component and a sparse component: M = L + S Robust PCA (Candes, 2009) Homework: Apply lower bound result

33 Can we attain these lower bounds?
Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  Mostly not
If not, are there other algorithms that do?
  Yes
Several goals for algorithms:
  Minimize bandwidth ($\Omega(\#\text{flops} / M^{1/2})$) and latency ($\Omega(\#\text{flops} / M^{3/2})$)
  Multiple memory hierarchy levels (attain bound for each level?)
  Explicit dependence on (multiple) M(s), vs “cache oblivious”
  Fewest flops when fits in smallest memory
What about sparse algorithms?

34 Recall: Minimizing latency requires new data structures
To minimize latency, need to load/store whole rectangular subblock of matrix with one “message”
  Incompatible with conventional columnwise (rowwise) storage
  Ex: Rows (columns) not in contiguous memory locations
Blocked storage: store as matrix of b x b blocks, each block stored contiguously
  OK for one level of memory hierarchy, what if more?
Recursive blocked storage: store each block using subblocks
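One way to picture blocked storage is as an index map from a matrix entry to its position in memory. The sketch below is one possible convention, not necessarily the layout used in the cited work: 0-based indices, blocks ordered row-major, entries inside each block also row-major. The function name is hypothetical.

```python
def blocked_offset(i, j, n, b):
    """Offset of entry (i, j) of an n x n matrix stored as contiguous b x b blocks."""
    assert n % b == 0, "this sketch assumes b divides n"
    blocks_per_row = n // b
    block_row, row_in_block = divmod(i, b)
    block_col, col_in_block = divmod(j, b)
    block_index = block_row * blocks_per_row + block_col  # which block
    return block_index * b * b + row_in_block * b + col_in_block


# For n = 4, b = 2, the four entries of each 2 x 2 block occupy 4 consecutive offsets.
example_offsets = [[blocked_offset(i, j, 4, 2) for j in range(4)] for i in range(4)]
```

With this layout a whole b x b block is one contiguous run of b·b words, so it can be moved with a single message, which is the property the slide asks for.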

35 Recall: Blocked vs Cache-Oblivious Algorithms
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size
  … Break A (n x n), B (n x n), C (n x n) into b x b blocks labeled A(i,j), etc
  … b chosen so 3 b x b blocks fit in cache
  for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
    C(i,j) = C(i,j) + A(i,k)·B(k,j)   … b x b matmul
  … another level of memory would need 3 more loops
Cache-oblivious Matmul C = A·B is independent of cache
  Function C = MM(A,B)
    If A and B are 1x1
      C = A · B
    else
      … Break A (n x n), B (n x n), C (n x n) into (n/2) x (n/2) blocks labeled A(i,j), etc
      for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
        C(i,j) = C(i,j) + MM( A(i,k), B(k,j) )   … n/2 x n/2 matmul
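Both variants on this slide can be sketched in plain Python on lists of lists. This is only meant to show the loop structure, not performance. The function names are hypothetical, and the recursive version assumes n is a power of two and passes block offsets instead of copying subblocks.

```python
def blocked_matmul(A, B, b):
    """C = A*B with explicit b x b blocking; b would be chosen from the cache size."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    for i0 in range(0, n, b):
        for j0 in range(0, n, b):
            for k0 in range(0, n, b):
                # b x b block update: C(i,j) = C(i,j) + A(i,k) * B(k,j)
                for i in range(i0, min(i0 + b, n)):
                    for k in range(k0, min(k0 + b, n)):
                        a_ik = A[i][k]
                        for j in range(j0, min(j0 + b, n)):
                            C[i][j] += a_ik * B[k][j]
    return C


def _mm(A, B, C, ai, aj, bi, bj, ci, cj, size):
    """Recursive step: C block += A block * B block, blocks given by (row, col) offsets."""
    if size == 1:
        C[ci][cj] += A[ai][aj] * B[bi][bj]
        return
    h = size // 2
    for i in (0, h):
        for j in (0, h):
            for k in (0, h):
                _mm(A, B, C, ai + i, aj + k, bi + k, bj + j, ci + i, cj + j, h)


def cache_oblivious_matmul(A, B):
    """C = A*B by recursive halving; no block size parameter. Assumes n is a power of two."""
    n = len(A)
    C = [[0] * n for _ in range(n)]
    _mm(A, B, C, 0, 0, 0, 0, 0, 0, n)
    return C
```

The blocked version needs the parameter b (and three more loops per extra memory level), while the recursive version reaches every block size on its way down to 1 x 1, which is why it needs no knowledge of the cache.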

36 Summary of dense sequential algorithms attaining communication lower bounds
Algorithms shown minimizing # Messages assume (recursive) block layout
Many references (see reports), only some shown, plus ours
Older references may or may not include analysis
Cache-oblivious are underlined, Green are ours, ? is unknown/future work
Columns: Algorithm | 2 Levels of Memory (#Words Moved; #Messages) | Multiple Levels of Memory (#Words Moved; #Messages)
BLAS-3 | Usual blocked or recursive algorithms | Usual blocked algorithms (nested), or recursive [Gustavson,97]
Cholesky | LAPACK (with $b = M^{1/2}$) [Gustavson,97] [BDHS09]; [Gustavson,97] [Ahmed,Pingali,00] [BDHS09] | (←same)
LU with pivoting | LAPACK (rarely) [Toledo,97] [GDX08]; [GDX08] | [Toledo,97] [GDX08]?
QR | LAPACK (rarely) [Elmroth,Gustavson,98] [DGHL08]; [Frens,Wise,03] [DGHL08] | [Elmroth,Gustavson,98] [DGHL08]?; [Frens,Wise,03] [DGHL08]?
Eig, SVD | Not LAPACK [BDD10]

37 Summary of dense 2D parallel algorithms attaining communication lower bounds
Assume n x n matrices on P processors, memory per processor = $O(n^2 / P)$
Many references (see reports), Green are ours
ScaLAPACK assumes best block size b chosen
Recall lower bounds: #words_moved = $\Omega(n^2 / P^{1/2})$ and #messages = $\Omega(P^{1/2})$
Columns: Algorithm | Reference | Factor exceeding lower bound for #words_moved | Factor exceeding lower bound for #messages
Matrix multiply | [Cannon, 69] | 1 | 1
Cholesky | ScaLAPACK | $\log P$
LU | [GDX08], ScaLAPACK | $\log P$; $\log P \cdot N / P^{1/2}$
QR | [DGHL08], ScaLAPACK | $\log P$; $\log^3 P$; $\log P \cdot N / P^{1/2}$
Sym Eig, SVD | [BDD10], ScaLAPACK | $\log P$; $\log^3 P$; $N / P^{1/2}$
Nonsym Eig | [BDD10], ScaLAPACK | $\log P$; $\log P \cdot P^{1/2}$; $\log^3 P$; $\log P \cdot N$

38 Recent Communication Optimal Algorithms
QR with column pivoting
Cholesky with diagonal pivoting
LU with “complete” pivoting
LDL’ with “complete” pivoting
Sparse Cholesky
  For matrices with “good separators”
  For “most sparse matrices” (as hard as dense case)

39 Homework – Extend lower bound
To algorithms using (block) Givens rotations instead of (block) Householder transformations
To QR done with Gram-Schmidt orthogonalization
  CGS and MGS
To heterogeneous collections of processors
  Suppose processor k
    Does $\gamma(k)$ flops per second
    Has reciprocal bandwidth $\beta(k)$
    Has latency $\alpha(k)$
  What is a lower bound on solution time, for any fractions of work assigned to each processor (i.e. not all equal)?
To a homogeneous parallel shared memory machine
  All data initially resides in large shared memory
  Processors communicate by data written to/read from slow memory
  Each processor has local fast memory size M

40 EXTRA SLIDES

