
1 Algorithms for Supercomputers: Introduction
Oded Schwartz
Seminar: Sunday, 12-2pm, B410. Workshop: ?
High performance, Fault tolerance
3.1.2015

2 Model & Motivation
Two kinds of costs: arithmetic (FLOPs) and communication (moving data).
Running time = γ·#FLOPs + β·#Words (+ α·#Messages)
Sequential model: a CPU with a fast/local memory of size M, attached to a large slow memory (RAM).
Distributed model: P processors, each with a local memory of size M, communicating over a network.
Communication-minimizing algorithms save time and save energy: hardware trends improve arithmetic speed (roughly 59%/year) much faster than communication (roughly 23%-26%/year).
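For concreteness, a minimal sketch of this cost model in Python; the machine parameters gamma, beta, alpha below are illustrative placeholders (not measurements of any real machine), and the word/message counts in the example use the lower-bound expressions derived later in the deck, up to constant factors.

```python
def estimated_runtime(flops, words, messages,
                      gamma=1e-10,  # seconds per FLOP (placeholder value)
                      beta=1e-9,    # seconds per word moved (placeholder value)
                      alpha=1e-6):  # seconds per message (placeholder value)
    """Running time = gamma*#FLOPs + beta*#Words + alpha*#Messages."""
    return gamma * flops + beta * words + alpha * messages

# Example: classical n^3 matrix multiplication with fast memory of size M,
# using n^3/M^(1/2) words and n^3/M^(3/2) messages (the lower bounds, up to
# constants) as rough word/message counts.
n, M = 4096, 2**20
print(estimated_runtime(flops=2 * n**3,
                        words=n**3 / M**0.5,
                        messages=n**3 / M**1.5))
```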

3 The Fastest Matrix Multiplication in the West [BDHLS SPAA'12]
Franklin (Cray XT4), strong scaling, n = 94080: 24%-184% faster than previous algorithms.
(Plot legend: ours [2012]; Strassen-based [1995]; ScaLAPACK; classical-based [Solomonik-Demmel 2011].)

4 The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing'12]
Speedup example: rectangular matrix multiplication, 64 x k x 64, benchmarked on a 32-core shared-memory Intel machine.
"…Highest Performance and Scalability across Past, Present & Future Processors…" (Intel's Math Kernel Library, from http://software.intel.com/en-us/intel-mkl)
(Plot: Intel's Math Kernel Library vs. our algorithm.)

5 The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing'12]
Speedup example: rectangular matrix multiplication, dimensions 192 x 6291456 x 192, benchmarked on Hopper (Cray XE6) at Lawrence Berkeley National Laboratory. Strong-scaling plot.
(Plot: our algorithm vs. machine peak vs. ScaLAPACK.)

6 Lower bounds and Algorithms
(1) Proving lower bounds on these communication costs.
(2) Developing faster algorithms by minimizing communication.
(Diagrams: the distributed model, CPUs each with their own RAM; the sequential model, a CPU with fast memory M and RAM.)

7 Today:
I. Model, motivation
II. Graph expansion analysis
III. Optimal parallel algorithms
Maybe:
IV. Energy
V. Obliviousness
VI. Scaling bounds
VII. Geometric arguments

8 Example: Classical Matrix Multiplication (sequential model: CPU, fast memory of size M, RAM)
Naive implementation:
For i = 1 to n
  For j = 1 to n
    For k = 1 to n
      C(i, j) = C(i, j) + A(i, k)·B(k, j)
Bandwidth cost: BW = Θ(n^3). Can we do better?
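A directly runnable version of the naive triple loop above (plain Python, purely for illustration; no attention to performance or blocking):

```python
def naive_matmul(A, B):
    """Classical O(n^3) matrix multiplication, following the triple loop on the
    slide. A and B are n x n matrices given as lists of lists."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

# Tiny usage example.
A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
print(naive_matmul(A, B))  # [[19.0, 22.0], [43.0, 50.0]]
```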

9 Example: Classical Matrix Multiplication
Yes: compute block-wise, with blocks sized to the fast/local memory M (M < n^2).
[BLAS] blocked algorithms; [Cannon 69] for M = Θ(n^2/P); [McColl-Tiskin 99, Solomonik-Demmel 11] for any M.
Can we do better?
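A minimal sketch of the block-wise idea, using a hypothetical block size b chosen so that three b x b tiles fit in fast memory (roughly b ≈ sqrt(M/3)); this illustrates the principle and is not the BLAS, Cannon, or Solomonik-Demmel implementation.

```python
import numpy as np

def blocked_matmul(A, B, b):
    """Blocked (tiled) matrix multiplication. Each inner update touches three
    b x b tiles, so with b ~ sqrt(M/3) the working set fits in fast memory and
    the bandwidth cost drops from Theta(n^3) to O(n^3 / sqrt(M))."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

# Usage: check against NumPy's own product.
n, b = 256, 32
A, B = np.random.rand(n, n), np.random.rand(n, n)
assert np.allclose(blocked_matmul(A, B, b), A @ B)
```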

10 No!
Lower bounds for "3-nested loops" algorithms:
[Hong & Kung 81] sequential Mat-Mul;
[Irony, Toledo, Tiskin 04] sequential and parallel Mat-Mul.

11 Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

12 LU and Cholesky decompositions
LU decomposition of A: A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix.
Cholesky decomposition of A: if A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we require its diagonal elements to be positive).
Efficient parallel and sequential algorithms?
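A quick sanity check of both definitions using standard library routines (NumPy and SciPy; just a sketch to make the definitions concrete, unrelated to the communication-avoiding algorithms discussed in this deck):

```python
import numpy as np
from scipy.linalg import lu

A = np.array([[4.0, 2.0],
              [2.0, 3.0]])  # real, symmetric, positive definite

# Cholesky: A = L L^T with L lower triangular and positive diagonal.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# LU with partial pivoting: A = P L U.
P, L2, U = lu(A)
assert np.allclose(P @ L2 @ U, A)
```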

13 Lower bounds for matrix multiplication
Bandwidth: [Hong & Kung 81] sequential: BW = Ω(n^3 / M^{1/2}); [Irony, Toledo, Tiskin 04] sequential, and BW = Ω(n^3 / (P·M^{1/2})) in parallel on P processors.
Latency: divide the bandwidth bound by M (a single message carries at most M words): Ω(n^3 / M^{3/2}) sequential, Ω(n^3 / (P·M^{3/2})) parallel.

14 Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]
Thm: Cholesky and LU decompositions are (communication-wise) as hard as matrix multiplication.
Proof: by a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic.
Cor: any classical O(n^3) algorithm for Cholesky and LU decomposition requires
Bandwidth: Ω(n^3 / M^{1/2})
Latency: Ω(n^3 / M^{3/2})
(A similar corollary holds for the parallel model.)

15 Some Sequential Classical Cholesky Algorithms (data layout; bandwidth; latency; cache-oblivious?)
Lower bound: Ω(n^3/M^{1/2}); Ω(n^3/M^{3/2})
Naive: Θ(n^3); Θ(n^3/M); cache-oblivious
LAPACK: column-major or contiguous blocks; O(n^3/M^{1/2}); O(n^3/M) (column-major) or O(n^3/M^{3/2}) (contiguous blocks); not cache-oblivious
Rectangular Recursive [Toledo 97]: column-major or contiguous blocks; O(n^3/M^{1/2} + n^2 log n); Θ(n^3/M) (column-major) or Θ(n^2) (contiguous blocks); cache-oblivious
Square Recursive [Ahmad, Pingali 00]: column-major or contiguous blocks; O(n^3/M^{1/2}); O(n^3/M) (column-major) or O(n^3/M^{3/2}) (contiguous blocks); cache-oblivious
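For intuition, a short recursive Cholesky sketch in the spirit of the square-recursive algorithms in the table above (an illustration of the divide-and-conquer structure, not the cited implementations; the cutoff parameter is an arbitrary choice here):

```python
import numpy as np

def rec_cholesky(A, cutoff=64):
    """Divide-and-conquer Cholesky: factor the leading block, do a triangular
    solve for the off-diagonal block, then recurse on the Schur complement.
    Returns lower-triangular L with A = L @ L.T."""
    n = A.shape[0]
    if n <= cutoff:
        return np.linalg.cholesky(A)
    m = n // 2
    A11, A21, A22 = A[:m, :m], A[m:, :m], A[m:, m:]
    L11 = rec_cholesky(A11, cutoff)
    L21 = np.linalg.solve(L11, A21.T).T            # solve L21 @ L11.T = A21
    L22 = rec_cholesky(A22 - L21 @ L21.T, cutoff)  # Schur complement
    L = np.zeros_like(A)
    L[:m, :m], L[m:, :m], L[m:, m:] = L11, L21, L22
    return L

# Usage: factor a random symmetric positive definite matrix.
n = 300
B = np.random.rand(n, n)
A = B @ B.T + n * np.eye(n)
L = rec_cholesky(A)
assert np.allclose(L @ L.T, A)
```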

16 Easier example: LU
It is easy to reduce matrix multiplication to LU decomposition:
⇒ LU factorization can be used to perform matrix multiplication
⇒ the communication lower bound for matrix multiplication applies to LU
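One standard construction behind this reduction (shown here as a sketch, since the slide itself does not spell it out): embed A and B in a 3n x 3n matrix whose LU factorization exposes the product A·B in its upper-triangular factor,

\[
\begin{pmatrix} I & 0 & -B \\ A & I & 0 \\ 0 & 0 & I \end{pmatrix}
=
\begin{pmatrix} I & 0 & 0 \\ A & I & 0 \\ 0 & 0 & I \end{pmatrix}
\cdot
\begin{pmatrix} I & 0 & -B \\ 0 & I & A\,B \\ 0 & 0 & I \end{pmatrix},
\]

so any LU algorithm applied to the left-hand side computes A·B, and therefore inherits the matrix-multiplication communication lower bound (up to constant factors).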

17 Can we do the same for Cholesky?
Is it easy to reduce matrix multiplication to Cholesky decomposition?
Problems: the product A·A^T appears in the constructed matrix T, so perhaps all of the communication that the lower bound guarantees already takes place while computing A·A^T, telling us nothing about the Cholesky step itself.
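To make the discussion concrete, here is a sketch of the kind of construction involved (the slide defers the actual proof from [Ballard, Demmel, Holtz, S. 2009] to a later lecture, so treat this only as an illustration of where A·A^T shows up):

\[
T=\begin{pmatrix} I & A^T & -B \\ A & I+AA^T & 0 \\ -B^T & 0 & D \end{pmatrix},
\qquad
\mathrm{Cholesky}(T)=\begin{pmatrix} I & 0 & 0 \\ A & I & 0 \\ -B^T & (AB)^T & L_D \end{pmatrix},
\]

for any symmetric D that makes T positive definite (then \(L_D L_D^T = D - B^T B - (AB)^T AB\)). Reading off the (3,2) block of the Cholesky factor yields A·B, but the input T already contains A·A^T, which is exactly the problem raised above.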

18 So, can we do the same for Cholesky?
[Ballard, Demmel, Holtz, S. 2009]: yes, but it is not as easy as for LU. Proof: later.

19 Lower bounds for the hierarchy model
To reduce the hierarchy model (memory levels M_1, M_2, M_3, ..., M_k = ∞) to the sequential model: observe the communication between any two adjacent levels.
Conclusion: any communication lower bound for the sequential model translates to a lower bound for the hierarchy model, and any communication-avoiding (CA) algorithm for the sequential model translates to a CA algorithm for the hierarchy model. But not the other way round.
(Diagram: the sequential CPU/cache/RAM model vs. a multi-level hierarchy, and the reduction between them.)

20 Next time:
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.

21 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Matrix multiplication form: ∀ (i,j) ∈ [n] x [n], C(i,j) = Σ_k A(i,k)·B(k,j).
Thm: if an algorithm agrees with this form (regardless of the order of computation), then
BW = Ω(n^3 / M^{1/2}) in the sequential model, and
BW = Ω(n^3 / (P·M^{1/2})) in the P-parallel model.

22 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
(Diagram: the run's trace of reads, writes, and FLOPs over time, partitioned into segments S_1, S_2, S_3, ...; example of a partition with M = 3.)
For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs available to it.
3. Show that the number of multiplications in S is at most k.
4. The total communication is then BW = (BW of one segment) · #segments ≥ M · #mults / k.

23 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Matrix multiplication form: ∀ (i,j) ∈ [n] x [n], C(i,j) = Σ_k A(i,k)·B(k,j).
Volume of a box: V = x·y·z = (xz · zy · yx)^{1/2}.
Thm (Loomis & Whitney, 1949): the volume of a 3D set V satisfies
V ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2},
where the A, B, and C shadows are the projections of V onto the three coordinate planes.
(Diagram: a box with sides x, y, z and its three shadows.)
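A quick numeric check of the box case (my own example, not from the slide): for a 2 x 3 x 4 box the three shadows have areas 8, 12, and 6, and indeed

\[
V = 2 \cdot 3 \cdot 4 = 24 = \sqrt{8 \cdot 12 \cdot 6} = \sqrt{576}.
\]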

24 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
(Diagram: same segment partition as before, M = 3.)
For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs available to it.
3. Show that the number of multiplications in S is at most k.
4. The total communication is BW = (BW of one segment) · #segments ≥ M · #mults / k = M · n^3 / k.
5. By Loomis-Whitney, k ≤ (3M)^{3/2}, so BW ≥ M · n^3 / (3M)^{3/2}.
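Carrying out the arithmetic in steps 4-5 (a worked version of the bound on the slide):

\[
\mathrm{BW} \;\ge\; M \cdot \frac{n^3}{(3M)^{3/2}} \;=\; \frac{n^3}{3^{3/2}\,M^{1/2}} \;=\; \Omega\!\left(\frac{n^3}{\sqrt{M}}\right),
\]

since during a segment at most 3M entries of A, B, and C are available, so by Loomis-Whitney a segment performs at most (3M)^{3/2} multiplications, and all n^3 multiplications must be covered by segments.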

25 From a Sequential Lower Bound to a Parallel Lower Bound
We showed: any classical O(n^3) algorithm for matrix multiplication on the sequential model requires
Bandwidth: Ω(n^3 / M^{1/2})
Latency: Ω(n^3 / M^{3/2})
Cor: any classical O(n^3) algorithm for matrix multiplication on a P-processor machine (with balanced workload) requires, for a 2D layout with M = O(n^2/P):
Bandwidth: Ω(n^3 / (P·M^{1/2})) = Ω(n^2 / P^{1/2})
Latency: Ω(n^3 / (P·M^{3/2})) = Ω(P^{1/2})
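The substitution behind the two right-hand sides (plugging M = Θ(n^2/P) into the per-processor bounds):

\[
\frac{n^3}{P\,M^{1/2}} = \frac{n^3}{P\,(n^2/P)^{1/2}} = \frac{n^2}{P^{1/2}},
\qquad
\frac{n^3}{P\,M^{3/2}} = \frac{n^3}{P\,(n^2/P)^{3/2}} = P^{1/2}.
\]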

26 From a Sequential Lower Bound to a Parallel Lower Bound
Proof: observe one processor (with a balanced workload, some processor performs at least 1/P of the multiplications, and the geometric argument applies to its share).
Is it always true? That is, if Alg is an algorithm with communication lower bound B = B(n, M), does every parallel implementation of Alg have a communication lower bound B'(n, M, P) = B(n, M)/P?
(Diagram: the A, B, and C shadows of the computation assigned to one processor.)

27 Proof of the Loomis-Whitney inequality
T = a 3D set of 1x1x1 cubes on the lattice; N = |T| = #cubes.
T_x = projection of T onto the x=0 plane; N_x = |T_x| = #squares in T_x; similarly T_y, N_y, T_z, N_z.
Goal: N ≤ (N_x · N_y · N_z)^{1/2}.
T(x=i) = the subset (slab) of T with x=i; T(x=i|y) = projection of T(x=i) onto the y=0 plane; N(x=i) = |T(x=i)|, etc.
Two facts: N(x=i) ≤ N_x, and N(x=i) ≤ N(x=i|y) · N(x=i|z).
N = Σ_i N(x=i)
  = Σ_i (N(x=i))^{1/2} · (N(x=i))^{1/2}
  ≤ Σ_i (N_x)^{1/2} · (N(x=i))^{1/2}
  ≤ (N_x)^{1/2} · Σ_i (N(x=i|y) · N(x=i|z))^{1/2}
  = (N_x)^{1/2} · Σ_i (N(x=i|y))^{1/2} · (N(x=i|z))^{1/2}
  ≤ (N_x)^{1/2} · (Σ_i N(x=i|y))^{1/2} · (Σ_i N(x=i|z))^{1/2}   [Cauchy-Schwarz]
  = (N_x)^{1/2} · (N_y)^{1/2} · (N_z)^{1/2}
(Diagram: the slab T(x=i) and its projections T(x=i|y), T(x=i|z).)

28 Summary
Motivation & model; lower bound techniques.

29 Topics Will Include
Recent progress and open problems in fault tolerance and communication minimization, from theoretical and practical perspectives: matrix multiplication (classical and Strassen-like), dense and sparse linear algebra, FFT, sorting, graph algorithms, and dynamic programming.

30 Topics Will Include
Theory:
- Communication lower bounds
- Communication-minimizing algorithms
- Algorithmic approaches for ascertaining fault tolerance
Practice:
- Software (e.g., automating the construction of new algorithms with specific properties)
- Hardware (e.g., what technological trends mean for algorithm design, and vice versa)
- Applications (e.g., improving applications' performance using known algorithms, and application-motivated algorithmic research)

31 Algorithms for Supercomputers: Introduction
Oded Schwartz
Seminar: Sunday, 12-2pm, B410. Workshop: ?
High performance, Fault tolerance
3.1.2015

