How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm Part III: Graph analysis Oded Schwartz CS294, Lecture #10 Fall, 2011 Communication-Avoiding Algorithms Based on: G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Graph expansion and communication costs of fast matrix multiplication.
2 Previous talk on lower bounds
Communication lower bounds: proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
3 Previous talk on lower bounds: algorithms with the “flavor” of 3 nested loops
[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]
The bounds cover:
  BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalues and singular values, i.e., essentially all direct methods of linear algebra.
  Dense or sparse matrices. In the sparse case, the bandwidth bound is a function of NNZ (the number of nonzeros).
  Bandwidth and latency.
  Sequential, hierarchical, and parallel (distributed and shared memory) models.
  Compositions of linear algebra operations.
  Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11].
  Tensor contraction.
4 Geometric Embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
(1) Generalized form: for all $(i,j) \in S$,
$$C(i,j) = f_{ij}\bigl(g_{i,j,k_1}(A(i,k_1), B(k_1,j)),\; g_{i,j,k_2}(A(i,k_2), B(k_2,j)),\; \ldots,\; \text{other arguments}\bigr), \qquad k_1, k_2, \ldots \in S_{ij}.$$
But many algorithms just don’t fit the generalized form! For example: Strassen’s fast matrix multiplication.
5 Beyond 3-nested loops How about the communication costs of algorithms that have a more complex structure?
6 Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric Embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph Analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
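7 Recall: Strassen’s Fast Matrix Multiplication [Strassen 69]
Compute 2 x 2 matrix multiplication using only 7 multiplications (instead of 8). Apply recursively (block-wise), with $A$, $B$ and $C$ each partitioned into four $n/2 \times n/2$ blocks:
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$
$$\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &= (A_{21} + A_{22})\,B_{11} \\
M_3 &= A_{11}\,(B_{12} - B_{22}) \\
M_4 &= A_{22}\,(B_{21} - B_{11}) \\
M_5 &= (A_{11} + A_{12})\,B_{22} \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}$$
$$\begin{aligned}
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{aligned}$$
$$T(n) = 7\,T(n/2) + O(n^2) \quad\Rightarrow\quad T(n) = \Theta(n^{\log_2 7})$$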
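The seven products and four output sums above can be checked mechanically. The following is a minimal sketch in plain Python, assuming square matrices stored as lists of lists with a power-of-two dimension; the helper names (`add`, `sub`, `split`, `naive`) and the one-element `counter` list are illustrative choices, not part of the lecture.

```python
def add(X, Y):
    return [[x + y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def sub(X, Y):
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def split(X):
    """Return the four h x h blocks X11, X12, X21, X22 of X."""
    h = len(X) // 2
    return ([r[:h] for r in X[:h]], [r[h:] for r in X[:h]],
            [r[:h] for r in X[h:]], [r[h:] for r in X[h:]])

def strassen(A, B, counter):
    """Multiply A and B recursively; counter[0] counts scalar multiplications."""
    if len(A) == 1:
        counter[0] += 1
        return [[A[0][0] * B[0][0]]]
    A11, A12, A21, A22 = split(A)
    B11, B12, B21, B22 = split(B)
    M1 = strassen(add(A11, A22), add(B11, B22), counter)
    M2 = strassen(add(A21, A22), B11, counter)
    M3 = strassen(A11, sub(B12, B22), counter)
    M4 = strassen(A22, sub(B21, B11), counter)
    M5 = strassen(add(A11, A12), B22, counter)
    M6 = strassen(sub(A21, A11), add(B11, B12), counter)
    M7 = strassen(sub(A12, A22), add(B21, B22), counter)
    C11 = add(sub(add(M1, M4), M5), M7)
    C12 = add(M3, M5)
    C21 = add(M2, M4)
    C22 = add(add(sub(M1, M2), M3), M6)
    top = [r + s for r, s in zip(C11, C12)]
    bottom = [r + s for r, s in zip(C21, C22)]
    return top + bottom

def naive(A, B):
    """Classical triple-loop product, used only as a reference."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

if __name__ == "__main__":
    A = [[i * 4 + j for j in range(4)] for i in range(4)]
    B = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]
    counter = [0]
    print(strassen(A, B, counter) == naive(A, B), counter[0])
```

Recursing all the way down to 1 x 1 blocks is done here only to make the multiplication count visible; a practical implementation would switch to the classical algorithm below some block size.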
8 Strassen-like algorithms
Compute $n_0 \times n_0$ matrix multiplication using only $n_0^{\omega_0}$ multiplications (instead of $n_0^3$). Apply recursively (block-wise).
$$T(n) = n_0^{\omega_0}\, T(n/n_0) + O(n^2) \quad\Rightarrow\quad T(n) = \Theta(n^{\omega_0})$$

  $\omega_0$   Reference
  2.81   [Strassen 69] (works fast in practice)
  2.79   [Pan 78]
  2.78   [Bini 79]
  2.55   [Schönhage 81]
  2.50   [Pan Romani, Coppersmith Winograd 84]
  2.48   [Strassen 87]
  2.38   [Coppersmith Winograd 90]
  2.38   [Cohn Kleinberg Szegedy Umans 05] (group-theoretic approach)
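The step from the recurrence to the closed form can be spelled out by unrolling. This is a sketch under the assumptions that $\omega_0 > 2$, that $n$ is a power of $n_0$, and that the $O(n^2)$ term is written as $c\,n^2$ for some constant $c$ (the names $c$, $a$ and $k$ are introduced here only for the derivation).

```latex
\begin{align*}
\text{Let } a &= n_0^{\omega_0}, \qquad k = \log_{n_0} n . \\
T(n) &= a\,T(n/n_0) + c\,n^2
      = a^{k}\,T(1) + c\,n^2 \sum_{j=0}^{k-1} \left(\frac{a}{n_0^{2}}\right)^{j} . \\
a^{k} &= n_0^{\omega_0 \log_{n_0} n} = n^{\omega_0},
\qquad
\left(\frac{a}{n_0^{2}}\right)^{k} = \frac{n^{\omega_0}}{n^{2}} . \\
\text{Since } a > n_0^{2}:\quad
\sum_{j=0}^{k-1} \left(\frac{a}{n_0^{2}}\right)^{j} &= \Theta\!\left(\frac{n^{\omega_0}}{n^{2}}\right)
\;\Longrightarrow\;
T(n) = \Theta\!\left(n^{\omega_0}\right).
\end{align*}
```

For Strassen, $n_0 = 2$ and $a = 7$, which gives $\omega_0 = \log_2 7 \approx 2.81$.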
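9 New lower bound for Strassen’s fast matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]: the communication bandwidth lower bound is
  Strassen-like: $BW = \Omega\bigl((n/\sqrt{M})^{\omega_0} \cdot M\bigr)$
  Recall for cubic: $BW = \Omega\bigl((n/\sqrt{M})^{\log_2 8} \cdot M\bigr)$
  For Strassen’s: $BW = \Omega\bigl((n/\sqrt{M})^{\log_2 7} \cdot M\bigr)$
With $P$ processors the same expressions are divided by $P$. The parallel lower bound applies to
  2D: $M = \Theta(n^2/P)$
  2.5D: $M = \Theta(c \cdot n^2/P)$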
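To get a feel for how the exponent changes the bound, the expression $(n/\sqrt{M})^{\omega_0} \cdot M / P$ can simply be evaluated. The sketch below drops the hidden constant, so the numbers are only meaningful relative to each other; the function name `bandwidth_lower_bound` and the sample values of `n` and `M` are assumptions made for illustration.

```python
import math

def bandwidth_lower_bound(n, M, omega0, P=1):
    """Evaluate (n / sqrt(M))**omega0 * M / P; the Omega() constant is omitted."""
    return (n / math.sqrt(M)) ** omega0 * M / P

if __name__ == "__main__":
    n, M = 1024, 4096  # hypothetical matrix dimension and fast-memory size in words
    classic = bandwidth_lower_bound(n, M, math.log2(8))
    strassen = bandwidth_lower_bound(n, M, math.log2(7))
    print(classic, strassen, classic / strassen)
```

With $\omega_0 = \log_2 8 = 3$ the expression reduces to the familiar $n^3/\sqrt{M}$.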
10 Are the lower bounds attained?
For sequential? For a memory hierarchy? Yes, existing implementations do!
For parallel 2D? For parallel 2.5D? Yes: new algorithms.
11 Sequential and new 2D and 2.5D parallel Strassen-like algorithms
Sequential and hierarchy cases: attained by the natural recursive implementation. Also: LU, QR, … (black-box use of fast matrix multiplication).
[Ballard, Demmel, Holtz, S., Rom 2011]:
  New 2D parallel Strassen-like algorithm. Attains the lower bound.
  New 2.5D parallel Strassen-like algorithm: $c^{\omega_0/2-1}$ parallel communication speedup over the 2D implementation ($c \cdot 3n^2 = M \cdot P$).
[Ballard, Demmel, Holtz, S. 2011b]: this is as good as it gets.
12 Implications for sequential architectural scaling
Requirements so that “most” time is spent doing arithmetic on n x n dense matrices, $n^2 > M$:
  The time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
  The time to multiply the 2 largest locally storable square matrices exceeds the latency.
Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. As $\omega_0$ approaches 2, it is all about communication.

  CA matrix multiplication algorithm   Scaling bandwidth requirement   Scaling latency requirement
  Classic                              $M^{1/2}$                       $M^{3/2}$
  Strassen-like                        $M^{\omega_0/2-1}$              $M^{\omega_0/2}$
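13 Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let $G = (V,E)$ be a $d$-regular graph.
$A$ is the normalized adjacency matrix, with eigenvalues $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_n$.
$\lambda \equiv 1 - \max\{\lambda_2, |\lambda_n|\}$
Thm [Alon-Milman84, Dodziuk84, Alon86]: relates the edge expansion $h(G)$ to the spectral gap.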
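For a small $d$-regular graph the edge expansion can be computed by brute force, which makes the quantity concrete. The sketch below assumes the usual definition $h(G) = \min_{|S| \le |V|/2} |E(S, V \setminus S)| / (d\,|S|)$, which is not spelled out in the surviving slide text, and uses the 3-dimensional hypercube as a hypothetical example. It compares the result with the discrete Cheeger inequality in the form $(1-\lambda_2)/2 \le h(G) \le \sqrt{2(1-\lambda_2)}$, using the known fact that the normalized adjacency matrix of the 3-dimensional hypercube has second-largest eigenvalue $1/3$.

```python
import math
from fractions import Fraction
from itertools import combinations

def hypercube_edges(dim):
    """Undirected edges of the dim-dimensional hypercube on vertices 0 .. 2**dim - 1."""
    return [(u, u ^ (1 << b)) for u in range(2 ** dim)
            for b in range(dim) if u < u ^ (1 << b)]

def edge_expansion(num_vertices, edges, d):
    """Brute-force min over nonempty S, |S| <= |V|/2, of |E(S, V-S)| / (d |S|)."""
    best = None
    for size in range(1, num_vertices // 2 + 1):
        for subset in combinations(range(num_vertices), size):
            s = set(subset)
            cut = sum((u in s) != (v in s) for u, v in edges)
            ratio = Fraction(cut, d * size)
            if best is None or ratio < best:
                best = ratio
    return best

if __name__ == "__main__":
    h = edge_expansion(8, hypercube_edges(3), 3)
    gap = 1 - Fraction(1, 3)  # 1 - lambda_2 for the 3-dimensional hypercube
    print(h, gap / 2 <= h <= math.sqrt(2 * gap))
```

The enumeration is exponential in $|V|$, so this is only a sanity check for toy graphs; the point of the spectral route is precisely to avoid it.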
14 Expansion (3rd approach): the computation directed acyclic graph
Communication cost is graph expansion.
[Figure: a CDAG on vertex set $V$ with a segment $S$, its read set $R_S$ and its write set $W_S$. Legend: input / output, intermediate value, dependency.]
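15 Expansion (3rd approach)
For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG $G = (V, E)$: $V$ = set of computations and inputs, $E$ = dependencies.
2. Partition $G$ into segments $S$ of $\Theta(M^{\omega_0/2})$ vertices (corresponding to time / location adjacency).
3. Show that every $S$ has $\ge 3M$ vertices with incoming / outgoing edges. At most $M$ of them can already be in fast memory when the segment starts and at most $M$ can remain there when it ends, so the segment performs $\ge M$ reads/writes.
4. The total communication is $BW$ = ($BW$ of one segment) $\times$ (#segments) $= \Omega(M) \cdot \Theta(n^{\omega_0}) / \Theta(M^{\omega_0/2}) = \Omega(n^{\omega_0} / M^{\omega_0/2-1})$.
[Figure: the execution timeline of reads, writes and FLOPs cut into segments $S_1, S_2, S_3, \ldots$, each labelled $M$, next to a segment $S$ of the CDAG on $V$ with its read set $R_S$ and write set $W_S$.]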
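The arithmetic in step 4 is short but easy to get wrong, so here it is written out. It writes $\omega_0$ for the exponent of the algorithm ($\log_2 7$ for Strassen, $\log_2 8 = 3$ for the classical algorithm) and shows that the result is the same expression as the $(n/\sqrt{M})^{\omega_0} \cdot M$ form of the bound.

```latex
\begin{align*}
BW &= \underbrace{\Omega(M)}_{\text{per segment}} \cdot
      \underbrace{\frac{\Theta(n^{\omega_0})}{\Theta(M^{\omega_0/2})}}_{\#\text{segments}}
    = \Omega\!\left(\frac{n^{\omega_0}}{M^{\omega_0/2-1}}\right)
    = \Omega\!\left(\left(\frac{n}{\sqrt{M}}\right)^{\omega_0} \cdot M\right). \\[4pt]
\omega_0 = 3 &: \quad BW = \Omega\!\left(\frac{n^{3}}{\sqrt{M}}\right), \\
\omega_0 = \log_2 7 &: \quad BW = \Omega\!\left(\frac{n^{\log_2 7}}{M^{(\log_2 7)/2-1}}\right).
\end{align*}
```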
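16 Is it a Good Expander?
Break $G$ into edge-disjoint graphs, corresponding to the algorithm on $M^{1/2} \times M^{1/2}$ matrices. Consider the expansions of $S$ in each part (they sum up).
We need to show that $M^{\omega_0/2}$ expands to $\Omega(M)$:
  $h(G(n)) = \Omega(M / M^{\omega_0/2})$ for $n = \Theta(M^{1/2})$.
  Namely, for every $n$, $h(G(n)) = \Omega(n^2/n^{\omega_0}) = \Omega((4/7)^{\lg n})$.
$$BW = \Omega(T(n)) \cdot h(G(M^{1/2}))$$
[Figure: the CDAG $G(n)$: $\mathrm{Enc}_{\lg n} A$ and $\mathrm{Enc}_{\lg n} B$, each with $n^2$ inputs, feed $n^{\omega_0}$ multiplications, followed by $\mathrm{Dec}_{\lg n} C$ with $n^2$ outputs; each part has $\lg n$ levels. The segment is split into parts $S_1, \ldots, S_5$.]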
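The $(4/7)^{\lg n}$ form is specific to Strassen’s exponent $\lg 7$; the identity behind it, and its value at $n = \sqrt{M}$, can be written out as follows.

```latex
\[
\frac{n^{2}}{n^{\lg 7}} = \frac{4^{\lg n}}{7^{\lg n}} = \left(\frac{4}{7}\right)^{\lg n},
\qquad
h\bigl(G(\sqrt{M})\bigr) = \Omega\!\left(\frac{M}{M^{(\lg 7)/2}}\right)
                        = \Omega\!\left(\left(\frac{4}{7}\right)^{\lg \sqrt{M}}\right).
\]
```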
17 What is the CDAG of Strassen’s algorithm?
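18 The DAG of Strassen, n = 2
$$\begin{aligned}
M_1 &= (A_{11} + A_{22})(B_{11} + B_{22}) \\
M_2 &= (A_{21} + A_{22})\,B_{11} \\
M_3 &= A_{11}\,(B_{12} - B_{22}) \\
M_4 &= A_{22}\,(B_{21} - B_{11}) \\
M_5 &= (A_{11} + A_{12})\,B_{22} \\
M_6 &= (A_{21} - A_{11})(B_{11} + B_{12}) \\
M_7 &= (A_{12} - A_{22})(B_{21} + B_{22})
\end{aligned}$$
$$\begin{aligned}
C_{11} &= M_1 + M_4 - M_5 + M_7 \\
C_{12} &= M_3 + M_5 \\
C_{21} &= M_2 + M_4 \\
C_{22} &= M_1 - M_2 + M_3 + M_6
\end{aligned}$$
[Figure: the CDAG for n = 2: the encoding graphs $\mathrm{Enc}_1 A$ and $\mathrm{Enc}_1 B$ feed the seven multiplications, followed by the decoding graph $\mathrm{Dec}_1 C$; the input and output vertices of each are labelled 1,1 1,2 2,1 2,2.]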
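19 The DAG of Strassen, n = 4
One recursive level: each vertex splits into four; multiply blocks.
[Figure: the CDAG for n = 4, built from copies of $\mathrm{Enc}_1 A$, $\mathrm{Enc}_1 B$ and $\mathrm{Dec}_1 C$; the outputs of $\mathrm{Dec}_1 C$ are labelled 1,1 1,2 2,1 2,2.]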
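20 The DAG of Strassen: further recursive steps
Recursive construction. Given $\mathrm{Dec}_i C$, construct $\mathrm{Dec}_{i+1} C$:
1. Duplicate $\mathrm{Dec}_i C$ 4 times.
2. Connect the copies with a cross-layer of $\mathrm{Dec}_1 C$.
[Figure: the full CDAG: $\mathrm{Enc}_{\lg n} A$ and $\mathrm{Enc}_{\lg n} B$, each with $n^2$ inputs, feed $n^{\omega_0}$ multiplications, followed by $\mathrm{Dec}_{\lg n} C$ with $n^2$ outputs; each part has $\lg n$ levels. Inset: $\mathrm{Dec}_1 C$ with outputs 1,1 1,2 2,1 2,2.]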
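The two-step construction can be turned into a small graph builder, which makes the layer sizes easy to inspect. This is a sketch under stated assumptions: the cross-layer is taken to consist of $7^{i}$ copies of $\mathrm{Dec}_1 C$, output $q$ of the $t$-th copy is identified with input $t$ of the $q$-th copy of $\mathrm{Dec}_i C$, and a single vertex serves as $\mathrm{Dec}_0 C$. The ordering of vertices and the names `DEC1` and `build_dec` are illustrative choices.

```python
from itertools import count

# Dec_1 C for Strassen: which products M1..M7 (0-based) feed each output block.
DEC1 = {
    0: (0, 3, 4, 6),  # C11 = M1 + M4 - M5 + M7
    1: (2, 4),        # C12 = M3 + M5
    2: (1, 3),        # C21 = M2 + M4
    3: (0, 1, 2, 5),  # C22 = M1 - M2 + M3 + M6
}

def build_dec(k, ids=None):
    """Return (layers, edges) for Dec_k C.

    layers[0] holds the 7**k input vertices (the products), layers[-1] the
    4**k output vertices; edges are (u, v) pairs pointing towards the outputs.
    """
    if ids is None:
        ids = count()
    if k == 0:
        return [[next(ids)]], []
    # Step 1: four copies of Dec_{k-1} C.
    copies = [build_dec(k - 1, ids) for _ in range(4)]
    edges = [e for _, sub_edges in copies for e in sub_edges]
    # Step 2: a cross-layer of 7**(k-1) copies of Dec_1 C on top of them.
    top = []
    for t in range(7 ** (k - 1)):
        products = [next(ids) for _ in range(7)]
        top.extend(products)
        for q, used in DEC1.items():
            target = copies[q][0][0][t]  # input t of the q-th copy
            edges.extend((products[m], target) for m in used)
    lower = [sum((layers[j] for layers, _ in copies), []) for j in range(k)]
    return [top] + lower, edges

if __name__ == "__main__":
    layers, edges = build_dec(3)
    print([len(layer) for layer in layers], len(edges))
```

Under these assumptions layer $j$ of $\mathrm{Dec}_k C$ has $7^{k-j} 4^{j}$ vertices, so the graph narrows from $7^k$ products to $4^k$ outputs.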
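21 The DAG of Strassen
1. Compute weighted sums of A’s elements.
2. Compute weighted sums of B’s elements.
3. Compute the multiplications $m_1, m_2, \ldots$
4. Compute weighted sums of $m_1, m_2, \ldots$ to obtain C.
[Figure: the full CDAG: $\mathrm{Enc}_{\lg n} A$ and $\mathrm{Enc}_{\lg n} B$, each with $n^2$ inputs, feed $n^{\omega_0}$ multiplications, followed by $\mathrm{Dec}_{\lg n} C$ with $n^2$ outputs; each part has $\lg n$ levels.]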
22 Expansion of a Segment
Two methods to compute the expansion of the recursively constructed graph:
  Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
  Spectral: compute the edge expansion via the spectral gap (in the spirit of the Zig-Zag analysis [Reingold, Vadhan, Wigderson 00]).
23 Expansion of a Segment
Main technical challenges:
  Two types of vertices: with / without recursion.
  The graph is not regular.
[Figure: the n = 2 CDAG with $\mathrm{Enc}_1 A$, $\mathrm{Enc}_1 B$ and $\mathrm{Dec}_1 C$.]
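24 Estimating the edge expansion combinatorially
$\mathrm{Dec}_1 C$ is a consistency gadget: a mixed copy (partly in $S$, partly not) pays at least 1/12 of its edges.
The fraction of $S$ vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).
[Figure: the segment split into parts $S_1, S_2, S_3, \ldots, S_k$ across the recursive levels. Legend: in S, not in S, mixed.]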
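The 1/12 figure can be checked directly on $\mathrm{Dec}_1 C$, which has 12 edges. The sketch below reads the edges off Strassen’s four output formulas and enumerates every mixed vertex subset (neither empty nor everything), reporting the smallest number of edges that cross between the subset and its complement. The names `DEC1_EDGES` and `min_cut_over_mixed_subsets` are illustrative.

```python
from itertools import product

# Dec_1 C read off the output formulas: (product, output block it feeds).
DEC1_EDGES = [
    ("M1", "C11"), ("M4", "C11"), ("M5", "C11"), ("M7", "C11"),
    ("M3", "C12"), ("M5", "C12"),
    ("M2", "C21"), ("M4", "C21"),
    ("M1", "C22"), ("M2", "C22"), ("M3", "C22"), ("M6", "C22"),
]

def min_cut_over_mixed_subsets(edges):
    """Smallest number of cut edges over all mixed vertex subsets."""
    vertices = sorted({v for e in edges for v in e})
    best = len(edges)
    for mask in product((False, True), repeat=len(vertices)):
        if all(mask) or not any(mask):
            continue  # all in S or all outside S: not mixed
        in_s = dict(zip(vertices, mask))
        cut = sum(in_s[u] != in_s[v] for u, v in edges)
        best = min(best, cut)
    return best

if __name__ == "__main__":
    print(min_cut_over_mixed_subsets(DEC1_EDGES), len(DEC1_EDGES))
```

The minimum is nonzero because $\mathrm{Dec}_1 C$ is connected, so any mixed subset must cut at least one of the 12 edges.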
26 Open Problems
Find algorithms that attain the lower bounds:
  Sparse matrix algorithms, for sequential and parallel models, that auto-tune or are cache oblivious.
Address complex heterogeneous hardware:
  Lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11].
Extend the techniques to other algorithms and algorithmic tools:
  Non-uniform recursive structure.
Characterize a communication lower bound for a problem rather than for an algorithm.
Thank you!