Algorithms for Supercomputers Introduction Oded Schwartz Seminar: Sunday, 12-2pm, B410 Workshop: ? High performance Fault tolerance
Model & Motivation
Two kinds of costs:
  Arithmetic (FLOPs)
  Communication: moving data
Running time = γ · #FLOPs + β · #Words (+ α · #Messages)
[Figure: the sequential model (CPU, fast/local memory of size M, RAM) and the distributed model (P processors, each a CPU with local memory of size M)]
Communication-minimizing algorithms: save time.
Communication-minimizing algorithms: save energy.
[Figure: annual improvement rates of 23%/year, 26%/year, and 59%/year]

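A minimal sketch of the cost model above, assuming per-FLOP, per-word, and per-message cost coefficients (called `gamma`, `beta`, and `alpha` here). The machine parameters and the operation counts are hypothetical, chosen only to show that two runs with identical arithmetic can differ in estimated time because of data movement.

```python
def running_time(flops, words, messages, gamma, beta, alpha):
    """Estimated time = gamma * #FLOPs + beta * #Words + alpha * #Messages."""
    return gamma * flops + beta * words + alpha * messages


# Hypothetical machine parameters: seconds per flop, per word, per message.
GAMMA, BETA, ALPHA = 1e-9, 1e-8, 1e-6

n = 1000
flops = 2 * n**3  # same arithmetic in both runs

# Illustrative (made-up) communication volumes for two implementations.
t_heavy = running_time(flops, n**3, n**2, GAMMA, BETA, ALPHA)
t_light = running_time(flops, n**3 // 100, n**2 // 100, GAMMA, BETA, ALPHA)
```

With these assumed numbers the communication-heavy run is dominated by the `beta * words` term, which is the motivation for minimizing communication.
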
The Fastest Matrix Multiplication in the West [BDHLS SPAA’12]
Franklin (Cray XT4), strong scaling, n = …
24%-184% faster than previous algorithms
[Plot legend: Ours [2012] Strassen based [1995] ScaLAPACK Classical based [Solomonik-Demmel 2011]]

The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing’12]
Speedup example: rectangular matrix multiplication, 64 x k x 64
Benchmarked on a 32-core shared-memory Intel machine
“…Highest Performance and Scalability across Past, Present & Future Processors…” (from: Intel’s Math Kernel Library)
[Plot legend: Intel’s Math Kernel Library; Our Algorithm]

The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing’12]
Speedup example: rectangular matrix multiplication, dimensions 192 x … x 192
Benchmarked on Hopper (Cray XE6), Lawrence Berkeley National Laboratory
[Strong-scaling plot. Legend: Our Algorithm; Machine Peak; ScaLAPACK]

Lower bounds and Algorithms
(1) Proving bounds on these communication costs
(2) Developing faster algorithms by minimizing communication
[Figure: the distributed model (several CPUs, each with its own RAM) and the sequential model (CPU, fast memory of size M, RAM)]

Today:
  I. Model, Motivation
  II. Graph expansion analysis
  III. Optimal parallel algorithms
Maybe:
  IV. Energy
  V. Obliviousness
  VI. Scaling bounds
  VII. Geometric arguments

Example: Classical Matrix Multiplication
Naïve implementation:
  For i = 1 to n
    For j = 1 to n
      For k = 1 to n
        C(i, j) = C(i, j) + A(i, k) · B(k, j)
Bandwidth cost: BW = Θ(n^3)
Can we do better?
[Figure: C = A · B for n x n matrices, with row i of A and column j of B highlighted; sequential model (CPU, fast memory of size M, RAM)]

Example: Classical Matrix Multiplication
Yes. Compute block-wise:
  Sequential: [BLAS]
  Parallel: [Cannon 69] for M = Θ(n^2/P); [McColl Tiskin 99, Solomonik Demmel 11] for any M
  (M is the fast/local memory size, M < n^2)
Can we do better?

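A rough sketch of block-wise multiplication in the sequential model, counting words moved between slow and fast memory. It assumes one block each of A, B, and C must fit in fast memory at once (block side b with 3·b² ≤ M) and that a C block stays in fast memory until it is finished; the function name and the counting convention are illustrative, not taken from BLAS.

```python
import math


def blocked_matmul(A, B, M):
    """Multiply n x n matrices block-wise; return (C, words moved).

    Assumes three b x b blocks fit in fast memory, i.e. 3*b*b <= M.
    """
    n = len(A)
    b = max(1, math.isqrt(M // 3))  # largest block side with 3*b*b <= M
    C = [[0] * n for _ in range(n)]
    words = 0
    for i0 in range(0, n, b):
        bi = min(b, n - i0)
        for j0 in range(0, n, b):
            bj = min(b, n - j0)
            for k0 in range(0, n, b):
                bk = min(b, n - k0)
                # Read one block of A and one block of B from slow memory.
                words += bi * bk + bk * bj
                for i in range(i0, i0 + bi):
                    for j in range(j0, j0 + bj):
                        s = C[i][j]
                        for k in range(k0, k0 + bk):
                            s += A[i][k] * B[k][j]
                        C[i][j] = s
            # Write the finished C block back once.
            words += bi * bj
    return C, words
```

Under this counting the traffic is about 2·n³/b + n² words, which is O(n³/M^(1/2)) for b ≈ (M/3)^(1/2), compared with roughly 2·n³ reads if the naive loop reuses nothing.
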
Lower bounds: “3-nested loops”
  [Hong & Kung 81]: sequential Mat-Mul
  [Irony, Toledo, Tiskin 04]: sequential and parallel Mat-Mul
No!

Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
  1. Reduction [Ballard, Demmel, Holtz, S. 2009]
  2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
  3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

LU and Cholesky decompositions
LU decomposition of A: A = P·L·U, where
  U is upper triangular
  L is lower triangular
  P is a permutation matrix
Cholesky decomposition of A: if A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be > 0).
Efficient parallel and sequential algorithms?

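For concreteness, a small unblocked Cholesky factorization written as a plain triple loop. This is a textbook sketch, not one of the communication-efficient algorithms discussed in the seminar; the function name is illustrative, and the input is assumed to be symmetric positive definite.

```python
import math


def cholesky(A):
    """Return lower-triangular L with A = L * L^T.

    Assumes A is a real symmetric positive definite matrix (list of lists).
    """
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # Diagonal entry: what remains of A[j][j] after earlier columns.
        d = A[j][j] - sum(L[j][k] ** 2 for k in range(j))
        if d <= 0:
            raise ValueError("matrix is not positive definite")
        L[j][j] = math.sqrt(d)
        # Entries below the diagonal in column j.
        for i in range(j + 1, n):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = (A[i][j] - s) / L[j][j]
    return L
```

The diagonal of the returned L is positive, which is the normalization that makes the factor unique.
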
Lower bounds for matrix multiplication
Bandwidth:
  [Hong & Kung 81]: sequential, Ω(n^3 / M^(1/2))
  [Irony, Toledo, Tiskin 04]: sequential and parallel, Ω(n^3 / (P·M^(1/2))) on P processors
Latency: divide by M.

Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]
Thm: Cholesky and LU decompositions are (communication-wise) as hard as matrix multiplication.
Proof: By a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic.
Cor: Any classical O(n^3) algorithm for Cholesky and LU decomposition requires:
  Bandwidth: Ω(n^3 / M^(1/2))
  Latency: Ω(n^3 / M^(3/2))
(A similar corollary holds for the parallel model.)

Some Sequential Classical Cholesky Algorithms
Algorithm                              Data layout         Bandwidth                     Latency
Lower bound                                                Ω(n^3/M^(1/2))                Ω(n^3/M^(3/2))
Naive                                                      Θ(n^3)                        Θ(n^3/M)
LAPACK                                 Column-major        O(n^3/M^(1/2))                O(n^3/M)
                                       Contiguous blocks   O(n^3/M^(1/2))                O(n^3/M^(3/2))
Rectangular Recursive [Toledo 97]      Column-major        O(n^3/M^(1/2) + n^2 log n)    Ω(n^3/M)
                                       Contiguous blocks   O(n^3/M^(1/2) + n^2 log n)    Ω(n^2)
Square Recursive [Ahmad, Pingali 00]   Column-major        O(n^3/M^(1/2))                O(n^3/M)
                                       Contiguous blocks   O(n^3/M^(1/2))                O(n^3/M^(3/2))
[Column "Cache Oblivious": marks lost]

Easier example: LU
It is easy to reduce matrix multiplication to LU decomposition:
  LU factorization can be used to perform matrix multiplication.
  Hence the communication lower bound for matrix multiplication applies to LU.

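One standard way to carry out this reduction, sketched here under an assumption because the slide's own matrix did not survive: embed A and B in the 3n x 3n matrix T = [[I, 0, -B], [A, I, 0], [0, 0, I]]. Its LU factorization without pivoting has U = [[I, 0, -B], [0, I, A·B], [0, 0, I]], so the product can be read off U. The helper names below are illustrative, and exact rational arithmetic is used to keep the check free of rounding.

```python
from fractions import Fraction


def lu_no_pivot(T):
    """Doolittle LU without pivoting: T = L * U with L unit lower triangular."""
    n = len(T)
    U = [[Fraction(x) for x in row] for row in T]
    L = [[Fraction(int(i == j)) for j in range(n)] for i in range(n)]
    for k in range(n):
        if U[k][k] == 0:
            raise ZeroDivisionError("zero pivot; pivoting would be required")
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]
            L[i][k] = m
            for j in range(k, n):
                U[i][j] -= m * U[k][j]
    return L, U


def matmul_via_lu(A, B):
    """Compute A * B by factoring T = [[I, 0, -B], [A, I, 0], [0, 0, I]]."""
    n = len(A)
    T = [[0] * (3 * n) for _ in range(3 * n)]
    for i in range(3 * n):
        T[i][i] = 1  # identity blocks on the diagonal
    for i in range(n):
        for j in range(n):
            T[i][2 * n + j] = -B[i][j]  # top-right block is -B
            T[n + i][j] = A[i][j]       # middle-left block is A
    _, U = lu_no_pivot(T)
    # The middle-right block of U holds A * B.
    return [[U[n + i][2 * n + j] for j in range(n)] for i in range(n)]
```

All pivots of T equal 1, so no pivoting is needed, and any algorithm that factors T has in effect multiplied A by B.
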
Can we do the same for Cholesky?
Is it easy to reduce matrix multiplication to Cholesky decomposition?
Problems: A·A^T appears in T. Perhaps all of the guaranteed communication takes place while computing A·A^T.

So, can we do the same for Cholesky?
[Ballard, Demmel, Holtz, S. 2009]: Yes, but not as easily as for LU.
Proof: Later.

Lower bounds for the Hierarchy model
To reduce the hierarchy model to the sequential model: observe the communication between two levels.
Conclusion:
  Any communication lower bound for the sequential model translates to a lower bound for the hierarchy model.
  Any CA (communication-avoiding) algorithm for the sequential model translates to a CA algorithm for the hierarchy model.
  But not the other way round.
[Figure: the sequential model (CPU, cache, RAM); the hierarchy model with memory levels M_1, M_2, M_3, ..., M_k; and the reduction between them]

Next time:
Proving that your algorithm/implementation is as good as it gets.
Approaches:
  1. Reduction [Ballard, Demmel, Holtz, S. 2009]
  2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
  3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Matrix multiplication form: ∀ (i,j) ∈ n x n, C(i,j) = Σ_k A(i,k)·B(k,j)
Thm: If an algorithm agrees with this form (regardless of the order of computation), then
  BW = Ω(n^3 / M^(1/2))
  BW = Ω(n^3 / (P·M^(1/2))) in the P-parallel model.

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
[Figure: a timeline of Read, Write, and FLOP operations partitioned into segments S_1, S_2, S_3, ...; example of a partition, M = 3]
For a given run (algorithm, machine, input):
  1. Partition the computation into segments of M reads/writes each.
  2. Any segment S has at most 3M inputs/outputs.
  3. Show that #multiplications in S ≤ k.
  4. The total communication is BW = (BW of one segment) · #segments ≥ M · #mults / k.

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Volume of a box: V = x·y·z = (xz · zy · yx)^(1/2)
Thm (Loomis & Whitney, 1949): Volume of a 3D set V ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2)
[Figure: a box with sides x, y, z and a general 3D set V, each with its projections ("A shadow", "B shadow", "C shadow") onto the A, B, and C faces]
Matrix multiplication form: ∀ (i,j) ∈ n x n, C(i,j) = Σ_k A(i,k)·B(k,j)

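A quick numerical check of the inequality on finite sets of lattice points, viewing each point (i, j, k) as one multiplication A(i,k)·B(k,j) that contributes to C(i,j). The function names are illustrative; the squared form |V|² ≤ |A shadow|·|B shadow|·|C shadow| is compared so that only integers are involved.

```python
def shadow_sizes(points):
    """Return (|V|, |A shadow|, |B shadow|, |C shadow|) for points (i, j, k)."""
    V = set(points)
    shadow_A = {(i, k) for (i, j, k) in V}  # entries A(i,k) touched
    shadow_B = {(k, j) for (i, j, k) in V}  # entries B(k,j) touched
    shadow_C = {(i, j) for (i, j, k) in V}  # entries C(i,j) touched
    return len(V), len(shadow_A), len(shadow_B), len(shadow_C)


def loomis_whitney_holds(points):
    """Check |V|^2 <= |A shadow| * |B shadow| * |C shadow|."""
    v, a, b, c = shadow_sizes(points)
    return v * v <= a * b * c
```

For a full x-by-y-by-z box the two sides are equal, matching the "volume of a box" line; for other sets the inequality is generally strict.
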
Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
[Figure: a timeline of Read, Write, and FLOP operations partitioned into segments S_1, S_2, S_3, ...; example of a partition, M = 3]
For a given run (algorithm, machine, input):
  1. Partition the computation into segments of M reads/writes each.
  2. Any segment S has at most 3M inputs/outputs.
  3. Show that #multiplications in S ≤ k.
  4. The total communication is BW = (BW of one segment) · #segments ≥ M · #mults / k = M · n^3 / k.
  5. By Loomis-Whitney, k ≤ (3M)^(3/2), so BW ≥ M · n^3 / (3M)^(3/2) = Ω(n^3 / M^(1/2)).

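The arithmetic in steps 4 and 5 can be sketched as a short calculation. The function name is hypothetical, and the sketch ignores rounding (a partial final segment) and lower-order terms, so it tracks the leading term of the bound rather than an exact count.

```python
import math


def matmul_bandwidth_lower_bound(n, M):
    """Leading term of the segment-argument bound: M * n^3 / (3M)^(3/2)."""
    mults = n**3             # multiplications in classical n x n mat-mul
    k = (3 * M) ** 1.5       # max multiplications per segment (Loomis-Whitney)
    segments = mults / k     # at least this many segments are needed
    return M * segments      # each segment moves M words


# Algebraically equivalent closed form: n^3 / (3^(3/2) * M^(1/2)).
def closed_form(n, M):
    return n**3 / (3**1.5 * math.sqrt(M))
```

Quadrupling the fast memory size M halves the bound, which is the M^(-1/2) dependence stated in the theorem.
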
From Sequential Lower Bound to Parallel Lower Bound
We showed: Any classical O(n^3) algorithm for matrix multiplication on the sequential model requires:
  Bandwidth: Ω(n^3 / M^(1/2))
  Latency: Ω(n^3 / M^(3/2))
Cor: Any classical O(n^3) algorithm for matrix multiplication on a P-processor machine (with balanced workload) requires:
               General M                 2D-layout: M = O(n^2/P)
  Bandwidth:   Ω(n^3 / (P·M^(1/2)))      Ω(n^2 / P^(1/2))
  Latency:     Ω(n^3 / (P·M^(3/2)))      Ω(P^(1/2))

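The 2D-layout entries follow from the general-M bounds by substitution. A worked version, assuming the local memory is exactly M = c·n²/P for some constant c > 0 (the constant c is introduced here only for illustration):

```latex
% Assumption: 2D layout with M = c n^2 / P for a constant c > 0.
\begin{align*}
\text{Bandwidth: } \frac{n^3}{P\,M^{1/2}}
  &= \frac{n^3}{P \cdot c^{1/2}\, n / P^{1/2}}
   = \frac{n^2}{c^{1/2}\,P^{1/2}}
   = \Omega\!\left(\frac{n^2}{P^{1/2}}\right), \\
\text{Latency: } \frac{n^3}{P\,M^{3/2}}
  &= \frac{n^3}{P \cdot c^{3/2}\, n^3 / P^{3/2}}
   = \frac{P^{1/2}}{c^{3/2}}
   = \Omega\!\left(P^{1/2}\right).
\end{align*}
```

Both general bounds decrease as M grows, so the same conclusion holds for any M ≤ c·n²/P.
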
From Sequential Lower Bound to Parallel Lower Bound
Proof: Observe one processor.
[Figure: a 3D set with its "A shadow", "B shadow", and "C shadow" projections onto the A, B, and C faces]
Is it always true? Let Alg be an algorithm with communication lower bound B = B(n, M). Does every parallel implementation of Alg then have a communication lower bound B'(n, M, p) = B(n, M)/p?

Proof of Loomis-Whitney inequality
T = 3D set of 1x1x1 cubes on a lattice
N = |T| = #cubes
T_x = projection of T onto the x=0 plane
N_x = |T_x| = #squares in T_x; same for T_y, N_y, etc.
Goal: N ≤ (N_x · N_y · N_z)^(1/2)
T(x=i) = subset of T with x=i
T(x=i | y) = projection of T(x=i) onto the y=0 plane
N(x=i) = |T(x=i)|, etc.
N = Σ_i N(x=i)
  = Σ_i N(x=i)^(1/2) · N(x=i)^(1/2)
  ≤ Σ_i (N_x)^(1/2) · N(x=i)^(1/2)                                   [since N(x=i) ≤ N_x]
  ≤ (N_x)^(1/2) · Σ_i (N(x=i | y) · N(x=i | z))^(1/2)                [since N(x=i) ≤ N(x=i | y) · N(x=i | z)]
  = (N_x)^(1/2) · Σ_i N(x=i | y)^(1/2) · N(x=i | z)^(1/2)
  ≤ (N_x)^(1/2) · (Σ_i N(x=i | y))^(1/2) · (Σ_i N(x=i | z))^(1/2)    [Cauchy-Schwarz]
  = (N_x)^(1/2) · (N_y)^(1/2) · (N_z)^(1/2)
[Figure: the slice T(x=i) of T, with its projections T(x=i | y) and T(x=i | z) and the counts N(x=i), N(x=i | y), N(x=i | z)]

Summary
  Motivation & Model
  Lower bounds techniques

Topics Will Include
Recent progress and open problems in fault tolerance and communication minimization, from theoretical and practical perspectives.
Matrix multiplication (classical and Strassen-like), dense and sparse linear algebra, FFT, sorting, graph algorithms, and dynamic programming.

Topics Will Include
Theory:
  Communication lower bounds
  Communication-minimizing algorithms
  Algorithmic approaches for ascertaining fault tolerance
Practice:
  Software (e.g., automating the construction of new algorithms with specific properties)
  Hardware (e.g., what technological trends mean for algorithm design, and vice versa)
  Applications (e.g., improving applications’ performance using known algorithms, and application-motivated algorithmic research)
