Algorithms for Supercomputers
Upper bounds: from sequential to parallel
Oded Schwartz
Seminar: Sunday, 12-2pm, B410 | Workshop: Sunday, 2-5pm
High performance, Fault tolerance
March 15, 2015
Model & Motivation

Two kinds of costs:
- Arithmetic (FLOPs)
- Communication: moving data

Running time ≈ #FLOPs · (time per flop) + #Words · (time per word) (+ #Messages · (time per message))

Two machine models:
- Sequential: a CPU with a fast/local memory of size M attached to slow memory (RAM); communication counts words moved between fast and slow memory.
- Distributed: P processors, each with a local memory of size M; communication counts words sent between processors.

Communication-minimizing algorithms save time and save energy. Floating-point speed improves roughly 59% per year while bandwidth improves only about 23-26% per year, so data movement becomes relatively more expensive over time.
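For concreteness, a minimal sketch of this cost model. The parameter names gamma, beta, alpha (per-flop, per-word, and per-message times) and all numeric values below are illustrative assumptions, not figures from the slides:

```python
# Sketch of the running-time model: time ~ gamma*#FLOPs + beta*#Words + alpha*#Messages.
# gamma, beta, alpha and the sample machine parameters are illustrative placeholders.

def running_time(flops, words, messages, gamma=1e-11, beta=1e-9, alpha=1e-6):
    """Estimated running time (seconds) under the flop/bandwidth/latency cost model."""
    return gamma * flops + beta * words + alpha * messages

# Classical blocked multiplication of two n x n matrices with a fast memory of size M
# performs Theta(n^3) flops and moves Theta(n^3 / sqrt(M)) words; at best the data
# moves in messages of about M words each.
n, M = 4096, 2**20
flops = 2 * n**3
words = n**3 / M**0.5
print(running_time(flops, words, messages=words / M))
```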
Communication Lower Bounds - to be continued...

Goal: prove that your algorithm/implementation is as good as it gets.

Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Recall: Strassen's Fast Matrix Multiplication [Strassen 69]

Compute a 2 x 2 matrix multiplication using only 7 multiplications (instead of 8), then apply recursively (block-wise) to the n/2 x n/2 blocks of A, B, and C = A·B:

M1 = (A11 + A22)(B11 + B22)
M2 = (A21 + A22) B11
M3 = A11 (B12 - B22)
M4 = A22 (B21 - B11)
M5 = (A11 + A12) B22
M6 = (A21 - A11)(B11 + B12)
M7 = (A12 - A22)(B21 + B22)

C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6

T(n) = 7 T(n/2) + Θ(n^2)   ⟹   T(n) = Θ(n^(log2 7))
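A minimal runnable sketch of this recursion (NumPy; assumes n is a power of two and falls back to the classical product below a cutoff — the cutoff value is an arbitrary tuning choice, not part of the algorithm):

```python
import numpy as np

def strassen(A, B, cutoff=64):
    """Multiply square matrices whose size is a power of two using Strassen's
    seven recursive multiplications; below `cutoff`, use the classical product."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]

    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)

    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```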
Subsequently... Strassen-like algorithms

Compute an n0 x n0 matrix multiplication using only n0^ω0 multiplications (instead of n0^3), and apply recursively (block-wise) to the n/n0 x n/n0 blocks:

T(n) = n0^ω0 · T(n/n0) + Θ(n^2)   ⟹   T(n) = Θ(n^ω0)

The exponent ω0 over the years:
2.81      [Strassen 69], [Strassen-Winograd 71]
2.79      [Pan 78]
2.78      [Bini 79]
2.55      [Schönhage 81]
2.50      [Pan Romani, Coppersmith Winograd 84]
2.48      [Strassen 87]
2.38      [Coppersmith Winograd 90]
2.38      [Cohn Kleinberg Szegedy Umans 05] (group-theoretic approach)
2.3730    [Stothers 10]
2.3728642 [Vassilevska Williams 12]
2.3728639 [Le Gall 14]
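The exponent ω0 follows directly from the recurrence: an algorithm that multiplies n0 x n0 matrices with m scalar multiplications has ω0 = log_{n0} m (assuming ω0 > 2, so the Θ(n^2) addition term does not dominate). A quick check against the table above; the 70 x 70 / 143640-multiplication figures attributed to Pan 78 are from memory and only illustrative:

```python
from math import log

def omega0(n0, mults):
    """Exponent of a Strassen-like algorithm: T(n) = mults*T(n/n0) + Theta(n^2)
    solves to Theta(n**omega0) with omega0 = log base n0 of mults."""
    return log(mults, n0)

print(omega0(2, 8))        # classical cubic algorithm: 3.0
print(omega0(2, 7))        # Strassen: log2(7) ~ 2.807
print(omega0(70, 143640))  # Pan 78 (figures from memory): ~ 2.795
```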
Communication cost lower bounds for matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]: sequential and parallel, via a novel graph expansion proof.

For an algorithm with exponent ω0 (classic cubic: ω0 = log2 8 = 3; Strassen's: ω0 = log2 7; Strassen-like: general ω0):
- Sequential (fast memory of size M):          #words = Ω( n^ω0 / M^(ω0/2 - 1) )
- Distributed (P processors, local memory M):  #words = Ω( n^ω0 / (P · M^(ω0/2 - 1)) )
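A small sketch that evaluates these bounds, with the Ω-constants dropped and arbitrary sample values of n, M, P, to compare how many words a classical versus a Strassen-based algorithm must move:

```python
from math import log

def words_moved_lower_bound(n, M, P=1, omega0=log(7, 2)):
    """Words moved per processor, up to constants: (n / sqrt(M))**omega0 * M / P.
    omega0 = 3 gives the classical bound; omega0 = log2(7) the Strassen bound."""
    return (n / M**0.5) ** omega0 * M / P

n, M, P = 2**14, 2**20, 64
print(words_moved_lower_bound(n, M, P, omega0=3.0))  # classical
print(words_moved_lower_bound(n, M, P))              # Strassen-based
```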
Implications for sequential architectural scaling

Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n^2 > M:
- Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
- Time to multiply the two largest locally storable square matrices exceeds the latency.

CA (communication-avoiding) algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic                               | M^(1/2)                       | M^(3/2)
Strassen-like                         | M^(ω0/2 - 1)                  | M^(ω0/2)

Strassen-like algorithms do fewer flops and less communication, but are more demanding on the hardware. If ω0 = 2, it is all about communication.
The Computation Directed Acyclic Graph

G = (V, E): the vertices are the inputs/outputs and intermediate values; the edges are dependencies.

For a segment S of the computation, R_S denotes the vertices of S with an incoming edge from outside S (values that must be read), and W_S the vertices of S with an outgoing edge leaving S (values that must be written).

How can we estimate R_S and W_S? By bounding the expansion of the graph!
Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]

Let G = (V, E) be a d-regular undirected graph and A its normalized adjacency matrix, with eigenvalues λ1 = 1 ≥ λ2 ≥ ... ≥ λn. The spectral gap is 1 - max{λ2, |λn|}.

Edge expansion: h(G) = min over S ⊆ V with |S| ≤ |V|/2 of |E(S, V \ S)| / (d · |S|).

Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: the edge expansion and the spectral gap bound each other (a discrete Cheeger-type inequality).

Small-set expansion: the same minimum, restricted to sets of size |S| ≤ s.
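A sketch of the spectral quantity defined above: build the normalized adjacency matrix of a d-regular graph and read off 1 - max{λ2, |λn|}. The example graphs, a cycle and a complete graph, are illustrative only, chosen as a poor and a good expander respectively:

```python
import numpy as np

def spectral_gap(adj, d):
    """1 - max{lambda_2, |lambda_n|} for the normalized adjacency matrix A/d of a
    d-regular undirected graph given by its 0/1 adjacency matrix `adj`."""
    eig = np.sort(np.linalg.eigvalsh(adj / d))[::-1]   # eigenvalues, descending
    return 1.0 - max(eig[1], abs(eig[-1]))

n = 17
# Cycle on n vertices (2-regular): a poor expander, gap close to 0.
cycle = np.zeros((n, n))
for i in range(n):
    cycle[i, (i + 1) % n] = cycle[(i + 1) % n, i] = 1
# Complete graph on n vertices ((n-1)-regular): an excellent expander.
complete = np.ones((n, n)) - np.eye(n)

print(spectral_gap(cycle, 2), spectral_gap(complete, n - 1))
```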
Communication Cost is Graph (Small-Set) Expansion

The same computation DAG: vertices are inputs/outputs and intermediate values, edges are dependencies. The words a segment S must read (R_S) and write (W_S) correspond to edges crossing the boundary of S, so the communication cost is governed by the small-set expansion of the graph.
What is the Computation Graph of Strassen? Can we Compute its Expansion?
The DAG of Strassen, n = 2

The graph has an encoding layer Enc1A (combining the entries A11, A12, A21, A22 into the 7 left operands), an encoding layer Enc1B (combining B11, B12, B21, B22 into the 7 right operands), seven multiplication vertices M1, ..., M7, and a decoding layer Dec1C producing C11, C12, C21, C22, all per the equations on the Strassen slide above.
The DAG of Strassen, n = 4

One recursive level: each vertex of the n = 2 graph splits into four. The middle layer now multiplies 2 x 2 blocks, and each block multiplication is itself an Enc1A / Enc1B / Dec1C gadget.
The DAG of Strassen: further recursive steps

After lg n recursive steps: Enc_{lg n}A and Enc_{lg n}B each take the n^2 entries of A and B, feed n^ω0 multiplication vertices (ω0 = lg 7), and Dec_{lg n}C decodes them into the n^2 entries of C; each encoder/decoder has depth lg n.

Recursive construction: given Dec_i C, construct Dec_{i+1} C by
1. duplicating it 4 times, and
2. connecting the copies with a cross-layer of Dec1C gadgets.
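A hedged sketch of this recursive construction, counting layer sizes rather than building the graph: duplicating Dec_i C four times scales every existing layer by 4, and the cross-layer of Dec1C gadgets adds a new top layer 7 times the previous one, so layer j of Dec_i C has 4^j · 7^(i-j) vertices. This closed form is my reading of the construction above, not stated on the slide:

```python
# Layer sizes of Dec_i C in Strassen's computation DAG, following the recursive
# construction above (assumption: duplicating 4 times scales every existing layer
# by 4, and the cross-layer of Dec_1 C adds a top layer 7x the previous top layer).

def dec_layer_sizes(i):
    """Layer j of Dec_i C has 4**j * 7**(i - j) vertices, j = 0 (top) .. i (bottom)."""
    return [4**j * 7**(i - j) for j in range(i + 1)]

for i in range(1, 5):
    layers = dec_layer_sizes(i)
    n = 2**i
    # Top layer: n^(lg 7) multiplication results; bottom layer: the n^2 entries of C.
    print(f"n = {n:2d}: layers = {layers}, total vertices = {sum(layers)}")
```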
The Expansion of the Computation Graph

Methods for analyzing the expansion of a recursively constructed graph:
- Combinatorial: estimate the edge/vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).

Main technical challenges:
- Two types of vertices (with/without recursion).
- The graph is not regular.
Estimating the edge expansion combinatorially

Dec1C is a consistency gadget: a "mixed" copy, one containing both vertices in S and vertices not in S, pays at least 1/12 of its edges. Consequently the fraction of S-vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).

(Figure: vertices of Enc_{lg n}A, Enc_{lg n}B, Dec_{lg n}C classified as "in S", "not in S", or "mixed"; n^2 inputs/outputs, n^ω0 multiplication vertices, depth lg n, ω0 = lg 7.)
Is Strassen's Graph a Good Expander?

The expansion is estimated at three scales:
- for the DAG of a full n-by-n multiplication,
- for the DAG of an M^(1/2)-by-M^(1/2) multiplication, and
- for M^(1/2)-by-M^(1/2) sub-DAGs (or other small subsets).

Summing up over the segments S1, S2, S3, ... (the partition argument, next slide) yields the communication lower bound.
The partitioning argument

For a given run (algorithm, machine, input):
1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.
2. Partition G into segments S of Θ(M^(ω0/2)) vertices (corresponding to adjacency in time/location).
3. Show that every segment S has at least 3M vertices with incoming/outgoing edges crossing the segment, and hence performs at least M reads/writes (at most 2M of those values can already reside in fast memory at the start, or remain there at the end, of the segment).
4. The total communication is then
   BW ≥ (BW of one segment) x (#segments) = Ω(M) · Θ(n^ω0) / Θ(M^(ω0/2)) = Ω(n^ω0 / M^(ω0/2 - 1)).

(Figure: the execution timeline of reads, writes, and FLOPs partitioned into segments S1, S2, S3, ..., each working within a fast memory of size M.)
Subsequently... more bounds, for example:
- For rectangular fast matrix multiplication algorithms [MedAlg'12].
- For fast numerical linear algebra [EECS-Techreport'12]: e.g., solving linear systems, least squares, eigenproblems, ..., with the same arithmetic and communication costs, and numerically stably.
- How much extra memory is useful, and how far we can achieve perfect strong scaling [SPAA'12b].

Next: a new parallel algorithm...
Algorithms for Supercomputers
Lower bounds by graph expansion
Oded Schwartz
Seminar: Sunday, 12-2pm, B410 | Workshop: Sunday, 2-5pm
High performance, Fault tolerance
March 15, 2015