
1 Algorithms for Supercomputers
Upper bounds: from sequential to parallel
Oded Schwartz
Seminar: Sunday, 12-2pm, B410
Workshop: Sunday, 2-5pm
High performance
Fault tolerance
March 15, 2015

2 Model & Motivation
Two kinds of costs:
- Arithmetic (FLOPs)
- Communication: moving data
Running time = γ · #FLOPs + β · #Words (+ α · #Messages)
Sequential model: a CPU with a fast/local memory of size M, backed by slow RAM.
Distributed model: P processors, each with a local memory of size M, communicating over a network.
Communication-minimizing algorithms save time and save energy.
[Figure: annual hardware improvement rates, roughly 59%/year for arithmetic versus 26%/year and 23%/year for communication, so communication increasingly dominates running time.]
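As a concrete (and entirely illustrative) reading of this model, here is a minimal Python sketch; the machine parameters and the blocked-matmul cost expressions are assumptions for illustration, not measurements:

```python
# A minimal sketch of the alpha-beta-gamma cost model from the slide.

def running_time(flops, words, messages, gamma, beta, alpha):
    """Running time = gamma*#FLOPs + beta*#Words + alpha*#Messages."""
    return gamma * flops + beta * words + alpha * messages

gamma = 1e-11   # seconds per FLOP (~100 GFLOP/s), assumed
beta  = 1e-9    # seconds per word moved, assumed
alpha = 1e-6    # seconds per message, assumed

# Blocked classic O(n^3) matmul with a fast memory of M words:
n, M = 4096, 2**20
flops    = 2 * n**3
words    = n**3 / M**0.5    # its bandwidth cost
messages = n**3 / M**1.5    # its latency cost

t = running_time(flops, words, messages, gamma, beta, alpha)
print(f"estimated time: {t:.3f} s")
```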

3 Communication Lower Bounds – to be continued…
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

4 Recall: Strassen's Fast Matrix Multiplication
[Strassen 69] Compute 2 x 2 matrix multiplication using only 7 multiplications (instead of 8). Apply recursively (block-wise), splitting each matrix into four n/2 x n/2 blocks A11, A12, A21, A22 (and similarly for B and C):
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
T(n) = 7 · T(n/2) + Θ(n²)
T(n) = Θ(n^(log₂ 7)) ≈ Θ(n^2.81)
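For concreteness, a minimal runnable sketch of this recursion (not part of the lecture; it assumes n is a power of 2, uses numpy, and falls back to the classic algorithm below a cutoff):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices via Strassen's recursion.
    Assumes A, B are n x n with n a power of 2."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B                      # classic multiply at the base
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(128, 128)
B = np.random.rand(128, 128)
assert np.allclose(strassen(A, B), A @ B)
```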

5 Strassen-like algorithms
Compute n₀ x n₀ matrix multiplication using only n₀^ω₀ multiplications (instead of n₀³). Apply recursively (block-wise):
T(n) = n₀^ω₀ · T(n/n₀) + Θ(n²)
T(n) = Θ(n^ω₀)
Subsequently…
ω₀ ≈ 2.81 [Strassen 69], [Strassen-Winograd 71]
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan Romani, Coppersmith Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith Winograd 90]
2.38 [Cohn Kleinberg Szegedy Umans 05] (group-theoretic approach)
2.3730 [Stothers 10]
2.3728642 [Vassilevska Williams 12]
2.3728639 [Le Gall 14]
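The exponent comes straight from the recurrence by the master theorem: ω₀ = log_n₀(#multiplications). A quick sanity check in Python (Pan's base case of 70 x 70 blocks with 143640 multiplications is a standard figure, added here for illustration):

```python
import math

def omega0(n0, multiplications):
    """Exponent of a Strassen-like algorithm: multiplying n0 x n0 blocks
    with the given number of scalar multiplications yields a
    Theta(n^omega0) algorithm, omega0 = log_{n0}(multiplications)."""
    return math.log(multiplications, n0)

print(omega0(2, 8))          # classic: 3.0
print(omega0(2, 7))          # Strassen: ~2.807
print(omega0(70, 143640))    # [Pan 78]: ~2.795
```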

6 Communication cost lower bounds for matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]: sequential and parallel, via a novel graph expansion proof. With ω = log₂ 8 for the classic (cubic) algorithm, ω = log₂ 7 for Strassen's, and ω = ω₀ for Strassen-like algorithms:
Sequential: BW = Ω(n^ω / M^(ω/2 - 1))
Distributed: BW = Ω(n^ω / (P · M^(ω/2 - 1)))

7 Implications for sequential architectural scaling
Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n² > M:
- Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
- Time to multiply the two largest locally storable square matrices exceeds the latency.

CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic        | M^(1/2) · γ ≥ β        | M^(3/2) · γ ≥ α
Strassen-like  | M^(ω₀/2 - 1) · γ ≥ β   | M^(ω₀/2) · γ ≥ α

Strassen-like algorithms do fewer flops and less communication, but are more demanding on the hardware. If ω₀ → 2, it is all about communication.
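A sketch of these conditions as code (the machine parameters are invented; only the inequalities come from the table above):

```python
def meets_scaling(M, gamma, beta, alpha, omega0=3.0):
    """Check the two requirements in the table for a fast memory of M
    words: bandwidth  M^(omega0/2 - 1) * gamma >= beta
           latency    M^(omega0/2)     * gamma >= alpha
    omega0 = 3 gives the classic bounds."""
    return (M**(omega0 / 2 - 1) * gamma >= beta,
            M**(omega0 / 2) * gamma >= alpha)

# Invented machine parameters, for illustration only:
gamma, beta, alpha = 1e-10, 8e-9, 1e-6
for M in (2**10, 2**16, 2**22):
    print(M, "classic:", meets_scaling(M, gamma, beta, alpha, 3.0),
             "Strassen-like:", meets_scaling(M, gamma, beta, alpha, 2.81))
```

Note that for the same M, a Strassen-like exponent ω₀ < 3 makes both inequalities harder to satisfy, which is the "more demanding on the hardware" point above.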

8 The Computation Directed Acyclic Graph
Vertices V: inputs/outputs and intermediate values. Edges: dependencies.
For a segment S of the computation, R_S denotes the values S reads in and W_S the values it writes out.
How can we estimate R_S and W_S? By bounding the expansion of the graph!

9 Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let G = (V, E) be a d-regular undirected graph, and let A be its normalized adjacency matrix, with eigenvalues λ₁ = 1 ≥ λ₂ ≥ … ≥ λₙ. Define the spectral gap λ = 1 - max{λ₂, |λₙ|} and the edge expansion h(G) = min over S ⊆ V, |S| ≤ |V|/2, of |E(S, V \ S)| / (d · |S|).
Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: h(G) ≥ λ/2.
Small-sets expansion: h_s(G) = min over S ⊆ V, |S| ≤ s, of |E(S, V \ S)| / (d · |S|).
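A small numerical illustration of the lower-bound direction, on a toy graph (not from the lecture; brute force over subsets is feasible only because the example is tiny):

```python
import itertools
import numpy as np

# Toy d-regular graph: the odd cycle C_9 (2-regular, non-bipartite).
n, d = 9, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 1.0 / d   # normalized adjacency

eig = np.sort(np.linalg.eigvalsh(A))[::-1]   # lambda_1 = 1 >= ... >= lambda_n
lam = 1 - max(eig[1], abs(eig[-1]))          # spectral gap

def edge_expansion():
    """Brute-force h(G) = min over |S| <= n/2 of |E(S, V\\S)| / (d|S|)."""
    best = float("inf")
    for k in range(1, n // 2 + 1):
        for S in itertools.combinations(range(n), k):
            S = set(S)
            cut = sum(1 for i in S for j in ((i + 1) % n, (i - 1) % n)
                      if j not in S)
            best = min(best, cut / (d * len(S)))
    return best

h = edge_expansion()
print(f"lambda/2 = {lam/2:.4f} <= h(G) = {h:.4f}")   # 0.0302 <= 0.2500
```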

10 Communication Cost is Graph Expansion
The computation directed acyclic graph again: vertices V are inputs/outputs and intermediate values, edges are dependencies. The reads R_S and writes W_S of a segment S are bounded from below by the (small-sets) expansion of the graph.

11 What is the computation graph of Strassen? Can we compute its expansion?

12 The DAG of Strassen, n = 2
Enc₁A computes the seven linear combinations of the entries of A, and Enc₁B those of B; the middle layer computes the seven products M1, …, M7; Dec₁C decodes the four entries of C:
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

13 The DAG of Strassen, n = 4
One recursive level: each vertex of Enc₁A, Enc₁B, and Dec₁C splits into four, and each of the seven multiplication vertices becomes a multiplication of 2 x 2 blocks, i.e., a copy of the n = 2 graph.

14 The DAG of Strassen: further recursive steps
After lg n levels, Enc_lg n A and Enc_lg n B each encode n² inputs into n^ω₀ products (ω₀ = lg 7), and Dec_lg n C decodes them, through lg n layers, into the n² outputs.
Recursive construction. Given Dec_i C, construct Dec_(i+1) C:
1. Duplicate it 4 times.
2. Connect the copies with a cross-layer of Dec₁C.

15 The Expansion of the Computation Graph
Methods for analyzing the expansion of a recursively constructed graph:
- Combinatorial: estimate the edge/vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the Zig-Zag analysis [Reingold, Vadhan, Wigderson 00]).
Main technical challenges:
- Two types of vertices (with/without recursion).
- The graph is not regular.

16 Estimating the Edge Expansion, Combinatorially
Dec₁C is a consistency gadget: a mixed copy (one with vertices both in S and outside S) pays at least 1/12 of its edges to the cut. The fraction of S-vertices is consistent between the 1st level and the four 2nd levels; deviations pay linearly.
[Figure: the recursive graph Enc_lg n A, Enc_lg n B, Dec_lg n C (n² inputs, n^ω₀ products, ω₀ = lg 7), with vertices marked In S / Not in S / Mixed.]

17 Is Strassen's Graph a Good Expander?
The expansion deteriorates with recursion depth. For n-by-n matrices, the decoder graph Dec_lg n C has edge expansion Ω((4/7)^(lg n)). For M^(1/2)-by-M^(1/2) matrices, the decoder has Θ(M^(ω₀/2)) vertices, so a set S containing a constant fraction of them has Ω((4/7)^(lg √M) · M^(ω₀/2)) = Ω(4^(lg √M)) = Ω(M) boundary edges; the same holds for M^(1/2)-by-M^(1/2) sub-matrices (or other small subsets) of a larger computation.
Summing up over all segments gives the partition argument (next slide).
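A quick numeric check of that arithmetic (illustrative only; constants are suppressed):

```python
from math import log2

# Verifies the identity above: a set of Theta(M^(omega0/2)) decoder
# vertices with edge expansion (4/7)^k has about
# (4/7)^k * 7^k = 4^k = M boundary edges, where k = lg sqrt(M).
for M in (2**10, 2**16, 2**20):
    k = log2(M) / 2
    print(M, (4 / 7)**k * 7**k)   # equals M, up to floating point
```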

18 The Partitioning Argument
For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.
2. Partition G into segments S of Θ(M^(ω/2)) vertices (corresponding to adjacency in time / location).
3. Show that every segment S has ≥ 3M vertices with incoming/outgoing edges, hence performs ≥ M reads/writes.
4. The total communication bandwidth is
BW = (BW of one segment) · #segments = Ω(M) · O(n^ω) / Θ(M^(ω/2)) = Ω(n^ω / M^(ω/2 - 1)).
[Figure: a timeline of reads, writes, and FLOPs partitioned into segments S₁, S₂, S₃, …, each using a fast memory of size M.]
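A tiny sketch instantiating the bound from step 4 (illustrative; constants suppressed):

```python
from math import log2

def bw_lower_bound(n, M, omega):
    """Bandwidth lower bound Omega(n^omega / M^(omega/2 - 1)) from the
    partitioning argument (constants suppressed)."""
    return n**omega / M**(omega / 2 - 1)

n, M = 2**14, 2**20
print("classic :", bw_lower_bound(n, M, 3.0))      # Omega(n^3 / sqrt(M))
print("Strassen:", bw_lower_bound(n, M, log2(7)))  # smaller, since omega < 3
```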

19 Subsequently… more bounds, for example:
- For rectangular fast matrix multiplication algorithms [MedAlg'12].
- For fast numerical linear algebra [EECS-Techreport'12]: e.g., solving linear systems, least squares, eigenproblems, …, with the same arithmetic and communication costs, and numerical stability.
- How much extra memory is useful, and how far we can get perfect strong scaling [SPAA'12b].
New parallel algorithm…

20 Algorithms for Supercomputers
Lower bounds by graph expansion
Oded Schwartz
Seminar: Sunday, 12-2pm, B410
Workshop: Sunday, 2-5pm
High performance
Fault tolerance
March 15, 2015

