
1 How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm
Part III: Graph Analysis
Oded Schwartz, CS294, Lecture #10, Fall 2011
Communication-Avoiding Algorithms, www.cs.berkeley.edu/~odedsc/CS294
Based on: G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Graph expansion and communication costs of fast matrix multiplication.

2 Previous talk on lower bounds
Communication lower bounds. Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Goal: proving that your algorithm/implementation is as good as it gets.

3 Previous talk on lower bounds: algorithms with the “flavor” of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]
The bounds cover:
- BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values, i.e., essentially all direct methods of linear algebra.
- Dense or sparse matrices (in the sparse case, bandwidth is a function of NNZ).
- Bandwidth and latency costs.
- Sequential, hierarchical, and parallel models (distributed and shared memory).
- Compositions of linear algebra operations.
- Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11].
- Tensor contractions.

4 Geometric embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
(1) Generalized form: for all $(i,j) \in S$,
$$C(i,j) = f_{ij}\big(g_{i,j,k_1}(A(i,k_1), B(k_1,j)),\; g_{i,j,k_2}(A(i,k_2), B(k_2,j)),\; \ldots\big), \qquad k_1, k_2, \ldots \in S_{ij},$$
possibly with other arguments.
But many algorithms just don't fit the generalized form! For example: Strassen's fast matrix multiplication.
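For intuition, a worked instance of form (1): classical matrix multiplication fits with $f_{ij}$ the sum, $g_{i,j,k}$ the scalar product, and $S_{ij} = \{1, \ldots, n\}$:
$$C(i,j) = \sum_{k=1}^{n} A(i,k) \cdot B(k,j).$$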

5 Beyond 3 nested loops
What about the communication costs of algorithms that have a more complex structure?

6 Communication lower bounds. Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Goal: proving that your algorithm/implementation is as good as it gets.

7 Recall: Strassen's fast matrix multiplication
[Strassen 69] Compute a 2 × 2 block matrix multiplication using only 7 multiplications (instead of 8), and apply recursively (block-wise):
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \qquad \text{blocks of size } n/2 \times n/2$$
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 − B22)
M4 = A22 · (B21 − B11)
M5 = (A11 + A12) · B22
M6 = (A21 − A11) · (B11 + B12)
M7 = (A12 − A22) · (B21 + B22)
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
$$T(n) = 7\,T(n/2) + O(n^2) \;\implies\; T(n) = \Theta\big(n^{\log_2 7}\big)$$
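Not on the slides: a minimal NumPy sketch of this recursion. It assumes square matrices whose size is a power of two; the `cutoff` switch to the classical product is an assumed tuning knob (practical implementations cut over well above 2 × 2).

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Strassen's recursive matrix multiplication (sketch).

    Assumes square matrices whose size is a power of 2; falls back to
    the classical product below `cutoff`.
    """
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7   # C11
    C[:h, h:] = M3 + M5             # C12
    C[h:, :h] = M2 + M4             # C21
    C[h:, h:] = M1 - M2 + M3 + M6   # C22
    return C
```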

8 Strassen-like algorithms
Compute an n0 × n0 matrix multiplication using only $n_0^{\omega_0}$ multiplications (instead of $n_0^3$), and apply recursively (block-wise) to blocks of size n/n0 × n/n0:
$$T(n) = n_0^{\omega_0}\, T(n/n_0) + O(n^2) \;\implies\; T(n) = \Theta\big(n^{\omega_0}\big)$$
ω0 = log2 7 ≈ 2.81 [Strassen 69] (works fast in practice)
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan, Romani; Coppersmith, Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith, Winograd 90]
2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)
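Unrolling the recurrence confirms the closed form (standard master-theorem algebra; since $\omega_0 > 2$ the leaf level dominates):
$$T(n) \;=\; O(n^2)\sum_{i=0}^{\log_{n_0} n}\big(n_0^{\,\omega_0-2}\big)^{i} \;=\; O\big(n^2 \cdot n^{\,\omega_0-2}\big) \;=\; \Theta\big(n^{\omega_0}\big).$$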

9 New lower bound for Strassen's fast matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]: the communication bandwidth lower bound for Strassen's algorithm is
$$BW = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_2 7} \cdot M\right).$$
Recall, for the cubic (classical) algorithm:
$$BW = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\log_2 8} \cdot M\right),$$
and for Strassen-like algorithms:
$$BW = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right).$$
The parallel lower bound applies to 2D: $M = \Theta(n^2/P)$, and 2.5D: $M = \Theta(c\,n^2/P)$.
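Equivalently (simple algebra, added for readability), the Strassen bound can be rewritten as
$$BW = \Omega\!\left(\frac{n^{\log_2 7}}{M^{\log_2 7/2\, -\, 1}}\right),$$
compared with the classical $\Omega(n^3/\sqrt{M})$ obtained by setting the exponent to $\log_2 8 = 3$.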

10 Are these lower bounds attained?
For sequential? Hierarchy? Yes, existing implementations do!
For parallel 2D? Parallel 2.5D? Yes: new algorithms.

11 Sequential and new 2D and 2.5D parallel Strassen-like algorithms
Sequential and hierarchy cases: attained by the natural recursive implementation. Also: LU, QR, … (black-box use of fast matrix multiplication) [Ballard, Demmel, Holtz, S., Rom 2011].
New 2D parallel Strassen-like algorithm: attains the lower bound.
New 2.5D parallel Strassen-like algorithm: a $c^{\,\omega_0/2-1}$ parallel communication speedup over the 2D implementation (where $c \cdot 3n^2 = M \cdot P$).
[Ballard, Demmel, Holtz, S. 2011b]: this is as good as it gets.
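To see where the $c^{\,\omega_0/2-1}$ factor comes from (our own one-line check, combining the per-processor bandwidth bound $BW = \Omega\big(n^{\omega_0}/(P\,M^{\omega_0/2-1})\big)$ with the two memory sizes from slide 9): replacing $M = \Theta(n^2/P)$ by $M = \Theta(c\,n^2/P)$ gives
$$\frac{BW_{2D}}{BW_{2.5D}} = \left(\frac{c\,n^2/P}{n^2/P}\right)^{\omega_0/2-1} = c^{\,\omega_0/2-1}.$$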

12 Implications for sequential architectural scaling
Requirements so that “most” time is spent doing arithmetic on n × n dense matrices, n² > M:
- Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
- Time to multiply the 2 largest locally storable square matrices exceeds the latency.
With γ = time per flop, β = time per word (reciprocal bandwidth), α = time per message (latency):
CA matrix multiplication algorithm:
- Classic: scaling bandwidth requirement M^{1/2} · γ ≥ β; scaling latency requirement M^{3/2} · γ ≥ α.
- Strassen-like: scaling bandwidth requirement M^{ω0/2−1} · γ ≥ β; scaling latency requirement M^{ω0/2} · γ ≥ α.
Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. If ω0 → 2, it is all about communication.

13 Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let G = (V, E) be a d-regular graph, and let A be its normalized adjacency matrix, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$. Define the spectral gap $\lambda \triangleq 1 - \max\{\lambda_2, |\lambda_n|\}$ and the edge expansion
$$h(G) = \min_{S \subseteq V,\ |S| \le |V|/2} \frac{|E(S, V \setminus S)|}{d\,|S|}.$$
Thm [Alon-Milman 84, Dodziuk 84, Alon 86] (the discrete Cheeger inequality):
$$\frac{\lambda}{2} \;\le\; h(G) \;\le\; \sqrt{2\lambda}.$$
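A small numerical illustration (not from the slides): given the adjacency matrix of a d-regular graph, one can compute the spectral gap and the two Cheeger bounds on h(G) directly. The complete graph K4 used below is an assumed example chosen for brevity.

```python
import numpy as np

def cheeger_bounds(adj, d):
    """Bound the edge expansion h(G) of a d-regular graph via its
    spectral gap, using the discrete Cheeger inequality."""
    A = adj / d                                  # normalized adjacency matrix
    eig = np.sort(np.linalg.eigvalsh(A))[::-1]   # eigenvalues, descending
    lam = 1 - max(eig[1], abs(eig[-1]))          # spectral gap
    return lam / 2, np.sqrt(2 * lam)             # lower/upper bounds on h(G)

# Example: K4 is 3-regular; its true edge expansion is 2/3,
# which indeed lies between the two returned bounds.
K4 = np.ones((4, 4)) - np.eye(4)
print(cheeger_bounds(K4, 3))   # (0.333..., 1.154...)
```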

14 Expansion (3rd approach): the computation directed acyclic graph
Communication cost is graph expansion.
[Figure: the CDAG with vertex set V; a segment S with its read set R_S and write set W_S; legend: input/output, intermediate value, dependency.]

15 Expansion (3rd approach)
For a given run (algorithm, machine, input):
1. Consider the computation DAG G = (V, E): V = set of computations and inputs, E = dependencies.
2. Partition G into segments S of Θ(M^{ω/2}) vertices (corresponding to adjacency in time/location).
3. Show that every segment S has ≥ 3M vertices with incoming/outgoing edges, hence performs ≥ M reads/writes.
4. The total communication bandwidth is BW = (BW of one segment) × (#segments) = Ω(M) · O(n^ω)/Θ(M^{ω/2}) = Ω(n^ω / M^{ω/2−1}).
[Figure: the CDAG partitioned into segments S1, S2, S3, … along the time axis, each with read set R_S and write set W_S; reads, writes, and FLOPs marked per segment.]
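Why step 3 yields ≥ M transfers (the standard accounting, spelled out here for completeness): of the ≥ 3M boundary vertices, at most M can already reside in fast memory when the segment begins, and at most M computed values can remain in fast memory for later segments, so
$$\#\{\text{reads/writes during } S\} \;\ge\; 3M - M - M \;=\; M.$$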

16 Is it a good expander?
Break G into edge-disjoint subgraphs, each corresponding to the algorithm run on M^{1/2} × M^{1/2} matrices, and consider the expansion of S in each part (the contributions sum up).
We need to show that a set of M^{ω/2} vertices expands to Ω(M):
h(G(n)) = Ω(M / M^{ω/2}) for n = Θ(M^{1/2}).
Namely, for every n: h(G(n)) = Ω(n² / n^ω) = Ω((4/7)^{lg n}).
Then BW = Ω(T(n)) · h(G(M^{1/2})).
[Figure: segments S1, …, S5 across the recursive structure Enc_{lg n} A, Enc_{lg n} B, Dec_{lg n} C, with n² inputs/outputs per operand and n^ω product vertices at depth lg n.]
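The last equality is just exponent arithmetic:
$$\frac{n^2}{n^{\lg 7}} \;=\; n^{2-\lg 7} \;=\; \big(2^{2-\lg 7}\big)^{\lg n} \;=\; \left(\frac{4}{7}\right)^{\lg n}.$$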

17 What is the CDAG of Strassen's algorithm?

18 The DAG of Strassen, n = 2
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 − B22)
M4 = A22 · (B21 − B11)
M5 = (A11 + A12) · B22
M6 = (A21 − A11) · (B11 + B12)
M7 = (A12 − A22) · (B21 + B22)
C11 = M1 + M4 − M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 − M2 + M3 + M6
[Figure: the encoding graphs Enc1 A and Enc1 B map the four entries (1,1), (1,2), (2,1), (2,2) of A and B to the seven products M1, …, M7; the decoding graph Dec1 C maps the products to the four entries of C.]
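The connectivity of this one-level CDAG can be read straight off the formulas; here it is as adjacency lists (our encoding; the dict layout is an assumption, but the edges are exactly those of the equations above, with signs dropped since the CDAG ignores coefficients):

```python
# One level of Strassen's CDAG (n = 2).
# ENC1_A / ENC1_B: which entries of A and B feed each product M_j;
# DEC1_C: which products are combined into each entry of C.
ENC1_A = {1: ["A11", "A22"], 2: ["A21", "A22"], 3: ["A11"], 4: ["A22"],
          5: ["A11", "A12"], 6: ["A21", "A11"], 7: ["A12", "A22"]}
ENC1_B = {1: ["B11", "B22"], 2: ["B11"], 3: ["B12", "B22"], 4: ["B21", "B11"],
          5: ["B22"], 6: ["B11", "B12"], 7: ["B21", "B22"]}
DEC1_C = {"C11": [1, 4, 5, 7], "C12": [3, 5], "C21": [2, 4], "C22": [1, 2, 3, 6]}
```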

19 The DAG of Strassen, n = 4
One recursive level: each vertex splits into four, and the seven products become block multiplications.
[Figure: Enc1 A and Enc1 B feed the seven block products 1–7; each block product is itself a copy of the n = 2 CDAG (Enc1 A, Enc1 B, Dec1 C); the results pass through Dec1 C to the four output blocks (1,1), (1,2), (2,1), (2,2).]

20 The DAG of Strassen: further recursive steps
Recursive construction: given Dec_i C, construct Dec_{i+1} C:
1. Duplicate Dec_i C 4 times.
2. Connect the copies with a cross-layer of Dec_1 C.
A runnable sketch of this rule appears below.
[Figure: the full recursion Enc_{lg n} A, Enc_{lg n} B, Dec_{lg n} C, with n² inputs per operand and n^ω product vertices at depth lg n.]
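A sketch of the duplicate-and-connect rule (the tuple-of-branch-choices vertex labeling is our assumption; the edge structure follows the rule above, with one Dec_1 C gadget per top vertex of Dec_i C whose four outputs land in the four copies):

```python
# Which products M_j feed each output quadrant of C (from the n = 2 formulas):
PROD = {0: [1, 4, 5, 7],   # C11 = M1 + M4 - M5 + M7
        1: [3, 5],         # C12 = M3 + M5
        2: [2, 4],         # C21 = M2 + M4
        3: [1, 2, 3, 6]}   # C22 = M1 - M2 + M3 + M6

def dec_c(i):
    """Build Dec_i C as (top_vertices, edges, bottom_vertices).

    Edges are directed from product vertices toward output vertices.
    """
    if i == 1:
        tops = [(j,) for j in range(1, 8)]
        bottoms = [(q,) for q in range(4)]
        edges = [((j,), (q,)) for q in range(4) for j in PROD[q]]
        return tops, edges, bottoms
    tops, edges, bottoms = dec_c(i - 1)
    # Step 1: duplicate Dec_{i-1} C four times, one copy per output quadrant.
    new_edges = [((q,) + u, (q,) + v) for q in range(4) for (u, v) in edges]
    new_bottoms = [(q,) + v for q in range(4) for v in bottoms]
    # Step 2: cross-layer -- one Dec_1 C gadget per old top vertex t,
    # whose four outputs are vertex t in each of the four copies.
    new_tops = [t + (j,) for t in tops for j in range(1, 8)]
    for t in tops:
        for q in range(4):
            for j in PROD[q]:
                new_edges.append((t + (j,), (q,) + t))
    return new_tops, new_edges, new_bottoms

# Sanity check: Dec_i C has 7^i product vertices and 4^i output vertices.
tops, edges, bottoms = dec_c(2)
assert len(tops) == 49 and len(bottoms) == 16
```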

21 The DAG of Strassen
1. Compute weighted sums of A's elements (Enc_{lg n} A).
2. Compute weighted sums of B's elements (Enc_{lg n} B).
3. Compute the multiplications m1, m2, …, m_{n^ω}.
4. Compute weighted sums of m1, m2, …, m_{n^ω} to obtain C (Dec_{lg n} C).
[Figure: A and B (n² entries each) feed the encoders of depth lg n; the n^ω products feed the decoder, producing C.]

22 Expansion of a segment
Two methods to compute the expansion of the recursively constructed graph:
- Combinatorial: estimate the edge/vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).

23 Expansion of a segment
Main technical challenges:
- Two types of vertices: with/without recursion.
- The graph is not regular.
[Figure: the n = 2 CDAG again (Enc1 A, Enc1 B, products 1–7, Dec1 C), highlighting the two vertex types.]

24 Estimating the edge expansion, combinatorially
Partition a segment S into its parts S1, S2, …, Sk across the recursive copies. Dec_1 C acts as a consistency gadget: a “mixed” copy (neither fully inside nor fully outside S) pays ≥ 1/12 of its edges. The fraction of S-vertices must be consistent between the 1st level and the four 2nd levels; deviations pay linearly.
[Figure: copies S1, …, Sk with vertices marked “in S”, “not in S”, and “mixed”.]

25 Communication lower bounds. Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Goal: proving that your algorithm/implementation is as good as it gets.

26 Open problems
- Find algorithms that attain the lower bounds: sparse matrix algorithms for sequential and parallel models that auto-tune or are cache-oblivious.
- Address complex heterogeneous hardware: lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11].
- Extend the techniques to other algorithms and algorithmic tools: non-uniform recursive structure.
- Characterize a communication lower bound for a problem rather than for an algorithm.

27 How to Compute and Prove Lower Bounds on the Communication Costs of Your Algorithm
Part III: Graph Analysis
Oded Schwartz, CS294, Lecture #10, Fall 2011
Communication-Avoiding Algorithms
Based on: G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Graph expansion and communication costs of fast matrix multiplication.
Thank you!

