How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm. Part II: Geometric Embedding. Oded Schwartz, CS294, Lecture #3.


How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm Part II: Geometric embedding Oded Schwartz CS294, Lecture #3 Fall, 2011 Communication-Avoiding Algorithms Based on: D. Irony, S. Toledo, and A. Tiskin: Communication lower bounds for distributed-memory matrix multiplication. G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Minimizing communication in linear algebra.

2 Last time: the models. Two kinds of costs: arithmetic (FLOPs), and communication: moving data between levels of a memory hierarchy (sequential case) or over a network connecting processors (parallel case). [Figure: the sequential model (CPU, cache, RAM), the parallel model (CPUs with local RAM connected by a network), and a memory hierarchy M_1, M_2, M_3, …, M_k = ∞.]

3 Last time: Communication Lower Bounds. Approaches: 1. Reduction [Ballard, Demmel, Holtz, S. 2009]; 2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]; 3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]. Goal: proving that your algorithm/implementation is as good as it gets.

4 Last time: lower bounds for matrix multiplication. Bandwidth: Ω(n^3 / M^{1/2}) in the sequential model [Hong & Kung 81]; Ω(n^3 / (P·M^{1/2})) per processor in the sequential and parallel models [Irony, Toledo, Tiskin 04]. Latency: divide by M, i.e., Ω(n^3 / M^{3/2}) and Ω(n^3 / (P·M^{3/2})).
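A short worked version of the "divide by M" step, under the model's assumption that a single message carries at most M words (the size of fast memory):

\[
\#\text{messages} \;\ge\; \frac{\#\text{words moved}}{M}
\;\ge\; \frac{1}{M}\cdot\Omega\!\left(\frac{n^3}{M^{1/2}}\right)
\;=\; \Omega\!\left(\frac{n^3}{M^{3/2}}\right),
\]

and likewise Ω(n^3 / (P·M^{3/2})) per processor in the parallel model.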

5 Last time: Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]. Thm: Cholesky and LU decompositions are (communication-wise) as hard as matrix multiplication. Proof: by a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic. Cor: any classical O(n^3) algorithm for Cholesky or LU decomposition requires Bandwidth: Ω(n^3 / M^{1/2}) and Latency: Ω(n^3 / M^{3/2}). (A similar corollary holds for the parallel model.)

6 Today: Communication Lower Bounds. Approaches: 1. Reduction [Ballard, Demmel, Holtz, S. 2009]; 2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]; 3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]. Goal: proving that your algorithm/implementation is as good as it gets.

7 Lower bounds for matrix multiplication via geometric embedding: Ω(n^3 / M^{1/2}) in the sequential model [Hong & Kung 81]; Ω(n^3 / (P·M^{1/2})) per processor, sequential and parallel [Irony, Toledo, Tiskin 04]. Now: prove both, using the geometric embedding approach of [Irony, Toledo, Tiskin 04].

8 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. Matrix multiplication form: ∀(i,j) ∈ [n]×[n], C(i,j) = Σ_k A(i,k)·B(k,j). Thm: if an algorithm agrees with this form (regardless of the order of computation), then BW = Ω(n^3 / M^{1/2}), and BW = Ω(n^3 / (P·M^{1/2})) in the P-parallel model.

9 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. [Figure: a run's sequence of reads, writes, and FLOPs over time, partitioned into segments S_1, S_2, S_3, …; example of a partition with M = 3.] For a given run (algorithm, machine, input): 1. Partition the computation into segments of M reads/writes each. 2. Any segment S has at most 3M inputs/outputs. 3. Show that the number of multiplications in S is at most k. 4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · #mults / k.

10 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. Matrix multiplication form: ∀(i,j) ∈ [n]×[n], C(i,j) = Σ_k A(i,k)·B(k,j). Volume of a box: V = x·y·z = (xz · zy · yx)^{1/2}. Thm (Loomis & Whitney, 1949): the volume of a 3D set V satisfies |V| ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2}. [Figure: a box with side lengths x, y, z and a general 3D set V, with their shadows on the three coordinate planes labeled "A shadow", "B shadow", "C shadow".]

11 Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. [Figure: a run's sequence of reads, writes, and FLOPs over time, partitioned into segments; example of a partition with M = 3.] For a given run (algorithm, machine, input): 1. Partition the computation into segments of M reads/writes each. 2. Any segment S has at most 3M inputs/outputs. 3. Show that the number of multiplications in S is at most k. 4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · #mults / k = M · n^3 / k. 5. By Loomis-Whitney, k ≤ (3M)^{3/2}, so BW ≥ M · n^3 / (3M)^{3/2} = Ω(n^3 / M^{1/2}).
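Spelling out the arithmetic behind steps 4 and 5 (a sketch; constants are not optimized, and M = O(n^2) is assumed):

\[
\#\text{mults per segment} \le (3M)^{3/2},\qquad
\#\text{complete segments} \ge \left\lfloor \frac{n^3}{(3M)^{3/2}} \right\rfloor,
\]
\[
BW \;\ge\; M\left(\frac{n^3}{(3M)^{3/2}} - 1\right)
\;=\; \frac{n^3}{3^{3/2}\,M^{1/2}} - M
\;=\; \Omega\!\left(\frac{n^3}{M^{1/2}}\right).
\]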

12 From sequential lower bound to parallel lower bound. We showed: any classical O(n^3) algorithm for matrix multiplication in the sequential model requires Bandwidth: Ω(n^3 / M^{1/2}) and Latency: Ω(n^3 / M^{3/2}). Cor: any classical O(n^3) algorithm for matrix multiplication on a P-processor machine (with balanced workload) requires, for a 2D layout M = O(n^2 / P): Bandwidth: Ω(n^3 / (P·M^{1/2})) = Ω(n^2 / P^{1/2}); Latency: Ω(n^3 / (P·M^{3/2})) = Ω(P^{1/2}).
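The last step of the corollary is just substituting the 2D layout M = Θ(n^2 / P) into the per-processor bounds:

\[
\Omega\!\left(\frac{n^3}{P\,M^{1/2}}\right)
= \Omega\!\left(\frac{n^3}{P\,(n^2/P)^{1/2}}\right)
= \Omega\!\left(\frac{n^2}{P^{1/2}}\right),
\qquad
\Omega\!\left(\frac{n^3}{P\,M^{3/2}}\right)
= \Omega\!\left(\frac{n^3}{P\,(n^2/P)^{3/2}}\right)
= \Omega\!\left(P^{1/2}\right).
\]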

13 From sequential lower bound to parallel lower bound. Proof: observe a single processor; with a balanced workload it performs a Θ(1/P) fraction of the work, and the sequential argument applies to it. Is this always true? That is: let Alg be an algorithm with communication lower bound B = B(n, M). Does every parallel implementation of Alg have a communication lower bound B'(n, M, P) = B(n, M) / P? [Figure: a 3D set with its shadows "A shadow", "B shadow", "C shadow" on the three coordinate planes.]

14 Proof of the Loomis-Whitney inequality. Let T be a 3D set of 1x1x1 cubes on the lattice and N = |T| the number of cubes. Let T_x be the projection of T onto the plane x = 0 and N_x = |T_x| the number of squares in it; define T_y, N_y, T_z, N_z similarly. Goal: N ≤ (N_x · N_y · N_z)^{1/2}. Let T(x=i) be the subset of T with x = i and N(x=i) = |T(x=i)|; let T(x=i | y) be the projection of T(x=i) onto the plane y = 0, with N(x=i | y) = |T(x=i | y)|, and similarly for T(x=i | z). Note that N(x=i) ≤ N_x and N(x=i) ≤ N(x=i | y) · N(x=i | z). Then:
N = Σ_i N(x=i)
  = Σ_i N(x=i)^{1/2} · N(x=i)^{1/2}
  ≤ Σ_i N_x^{1/2} · N(x=i)^{1/2}
  ≤ N_x^{1/2} · Σ_i (N(x=i | y) · N(x=i | z))^{1/2}
  = N_x^{1/2} · Σ_i N(x=i | y)^{1/2} · N(x=i | z)^{1/2}
  ≤ N_x^{1/2} · (Σ_i N(x=i | y))^{1/2} · (Σ_i N(x=i | z))^{1/2}   (by Cauchy-Schwarz)
  = N_x^{1/2} · N_y^{1/2} · N_z^{1/2}.
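As a sanity check (not part of the proof), here is a small brute-force script that verifies N ≤ (N_x · N_y · N_z)^{1/2} on randomly sampled lattice sets; the universe size and sampling scheme are arbitrary choices for illustration.

```python
import itertools
import math
import random

def shadows(T):
    """Return (N, Nx, Ny, Nz) for a set T of integer (x, y, z) lattice cubes."""
    Nx = len({(y, z) for (x, y, z) in T})  # shadow on the x = 0 plane
    Ny = len({(x, z) for (x, y, z) in T})  # shadow on the y = 0 plane
    Nz = len({(x, y) for (x, y, z) in T})  # shadow on the z = 0 plane
    return len(T), Nx, Ny, Nz

if __name__ == "__main__":
    random.seed(0)
    universe = list(itertools.product(range(6), repeat=3))
    for _ in range(1000):
        T = set(random.sample(universe, random.randint(1, len(universe))))
        N, Nx, Ny, Nz = shadows(T)
        assert N <= math.sqrt(Nx * Ny * Nz) + 1e-9
    print("Loomis-Whitney held on all sampled sets")
```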

15 Communication Lower Bounds. Approaches: 1. Reduction [Ballard, Demmel, Holtz, S. 2009]; 2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]; 3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]. Goal: proving that your algorithm/implementation is as good as it gets.

16 How to generalize this lower bound. Matrix multiplication form: ∀(i,j) ∈ [n]×[n], C(i,j) = Σ_k A(i,k)·B(k,j). (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij, and f_ij and g_{i,j,k} are "nontrivial" functions. Here C(i,j) denotes any unique memory location, and similarly for A(i,k) and B(k,j); A, B, and C may overlap. The lower bound holds for all reorderings of the computation, including incorrect ones. It does assume that each operand generates a load/store; it turns out that QR, eig, and SVD may all violate this (operands can be created and discarded on the fly) and need a different analysis. Not today.

17 Geometric Embedding (2nd approach). (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij. Thm [Ballard, Demmel, Holtz, S. 2011a]: if an algorithm agrees with the generalized form, then BW = Ω(G / M^{1/2}), where G = |{ g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij }|, and BW = Ω(G / (P·M^{1/2})) in the P-parallel model.

18 Example: application to Cholesky decomposition. (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij.
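To make the mapping onto Form (1) concrete, here is a generic unblocked Cholesky sketch (a textbook formulation, not the specific algorithm analyzed in the paper) with the g_{i,j,k} multiplications marked in comments; there are about n^3/6 of them, so the theorem gives BW = Ω(n^3 / M^{1/2}).

```python
import math

def cholesky(A):
    """Unblocked Cholesky factorization: returns lower-triangular L with A = L L^T.
    A is a symmetric positive definite matrix given as a list of lists."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # f_jj: subtract the products g_{j,j,k} = L(j,k) * L(j,k), then take a square root
        s = A[j][j] - sum(L[j][k] * L[j][k] for k in range(j))
        L[j][j] = math.sqrt(s)
        for i in range(j + 1, n):
            # f_ij: subtract the products g_{i,j,k} = L(i,k) * L(j,k), then divide by L(j,j)
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = s / L[j][j]
    return L
```

Note that here both g operands come from the output L itself, i.e., the arrays "A", "B", and "C" of Form (1) overlap, which the generalized form explicitly allows.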

19 From sequential lower bound to parallel lower bound. We showed: any algorithm that agrees with Form (1) in the sequential model requires Bandwidth: Ω(G / M^{1/2}) and Latency: Ω(G / M^{3/2}), where G is the number of g_{i,j,k} operations. Cor: any algorithm that agrees with Form (1), run on a P-processor machine where at least two processors each perform Ω(1/P) of G, requires Bandwidth: Ω(G / (P·M^{1/2})) and Latency: Ω(G / (P·M^{3/2})).

20 Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a], follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. Lower bounds for algorithms with the "flavor" of 3 nested loops: BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalue and SVD computations, i.e., essentially all direct methods of linear algebra; dense or sparse matrices (in the sparse case the bandwidth bound is a function of NNZ, the number of nonzeros); bandwidth and latency; sequential, hierarchical, and parallel (distributed- and shared-memory) models; compositions of linear algebra operations; certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11], [Ballard, Demmel, S. 11]; tensor contractions. For dense matrices, G = Θ(n^3), giving bandwidth Ω(n^3 / M^{1/2}) and latency Ω(n^3 / M^{3/2}).

21 Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds? Mostly not. Are there other algorithms that do? Mostly yes.
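For matrix multiplication, one sequential algorithm that attains the Ω(n^3 / M^{1/2}) bandwidth bound is blocked (tiled) multiplication with block size b ≈ (M/3)^{1/2}, so one block of each of A, B, and C fits in fast memory at a time. A minimal sketch (the word counter models an idealized cache that holds exactly those three blocks; the block-size choice and counting are illustrative assumptions):

```python
import numpy as np

def blocked_matmul(A, B, M):
    """Blocked matrix multiplication. Returns (C, estimated words moved)."""
    n = A.shape[0]
    b = max(1, int((M / 3) ** 0.5))   # choose b so that 3*b*b <= M
    C = np.zeros((n, n))
    words = 0
    for i in range(0, n, b):
        for j in range(0, n, b):
            words += 2 * C[i:i+b, j:j+b].size          # read and write the C block once
            for k in range(0, n, b):
                words += A[i:i+b, k:k+b].size          # read one block of A
                words += B[k:k+b, j:j+b].size          # read one block of B
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    # words ≈ 2*n^2 + 2*n^3/b = Θ(n^3 / M^{1/2})
    return C, words
```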

22 Dense linear algebra, sequential model: lower bounds vs. attaining algorithms. Lower bounds (all algorithms below): bandwidth Ω(n^3 / M^{1/2}) and latency Ω(n^3 / M^{3/2}) [Ballard, Demmel, Holtz, S. 11]. Attaining algorithms:
Matrix multiplication: [Frigo, Leiserson, Prokop, Ramachandran 99].
Cholesky: [Ahmad, Pingali 00], [Ballard, Demmel, Holtz, S. 09].
LU: [Toledo 97] (bandwidth), [DGX08].
QR: [EG98] (bandwidth), [DGHL08a].
Symmetric eigenvalues: [Ballard, Demmel, Dumitriu 10].
SVD: [Ballard, Demmel, Dumitriu 10].
(Generalized) nonsymmetric eigenvalues: [Ballard, Demmel, Dumitriu 10].

23 Dense 2D parallel algorithms. Assume n x n matrices on P processors, memory per processor = O(n^2 / P); ScaLAPACK assumes the best block size b is chosen; many references (see reports). Recall the lower bounds: #words_moved = Ω(n^2 / P^{1/2}) and #messages = Ω(P^{1/2}). Factors by which each algorithm exceeds these lower bounds (#words_moved, #messages):
Matrix multiply [Cannon, 69]: 1, 1.
Cholesky, ScaLAPACK: log P, log P.
LU [GDX08]: log P, log P; ScaLAPACK: log P, (N / P^{1/2}) · log P.
QR [DGHL08]: log P, log^3 P; ScaLAPACK: log P, (N / P^{1/2}) · log P.
Sym Eig, SVD [BDD10]: log P, log^3 P; ScaLAPACK: log P, N / P^{1/2}.
Nonsym Eig [BDD10]: log P, log^3 P; ScaLAPACK: P^{1/2} · log P, N · log P.
Relaxing the memory assumption: 2.5D algorithms [Solomonik & Demmel '11].
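For the 2D parallel bound, Cannon's algorithm moves Θ(n^2 / P^{1/2}) words per processor in Θ(P^{1/2}) messages. Below is a serial simulation sketch of its block shifts (it assumes √P divides n and counts words received per processor; it is not an MPI implementation):

```python
import numpy as np

def cannon(A, B, p):
    """Simulate Cannon's algorithm on a p x p processor grid (P = p^2).
    Returns (C, estimated words received per processor)."""
    n = A.shape[0]
    b = n // p                                   # block size; assumes p divides n
    Ablk = [[A[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Bblk = [[B[i*b:(i+1)*b, j*b:(j+1)*b] for j in range(p)] for i in range(p)]
    Cblk = [[np.zeros((b, b)) for _ in range(p)] for _ in range(p)]
    # initial skew: processor (i,j) holds A block (i, i+j) and B block (i+j, j)
    Ablk = [[Ablk[i][(i + j) % p] for j in range(p)] for i in range(p)]
    Bblk = [[Bblk[(i + j) % p][j] for j in range(p)] for i in range(p)]
    words = 2 * b * b                            # blocks received during the skew
    for _ in range(p):
        for i in range(p):
            for j in range(p):
                Cblk[i][j] += Ablk[i][j] @ Bblk[i][j]
        # shift A blocks one step left along rows, B blocks one step up along columns
        Ablk = [[Ablk[i][(j + 1) % p] for j in range(p)] for i in range(p)]
        Bblk = [[Bblk[(i + 1) % p][j] for j in range(p)] for i in range(p)]
        words += 2 * b * b                       # one A block and one B block per step
    return np.block(Cblk), words                 # words = Θ(p * b^2) = Θ(n^2 / P^{1/2})
```

For instance, with random 8x8 matrices and p = 2, np.allclose(cannon(A, B, 2)[0], A @ B) should hold.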

24 Geometric Embedding (2nd approach). [Figure: a run's sequence of reads, writes, and FLOPs over time, partitioned into segments; example of a partition with M = 3.] For a given run (algorithm, machine, input): 1. Partition the computation into segments of M reads/writes each. 2. Any segment S has at most 3M inputs/outputs. 3. Show that S performs at most k FLOPs g_{i,j,k}. 4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · G / k, where G is the number of g_{i,j,k}.

25 Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a], follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij. Volume of a box: V = x·y·z = (xz · zy · yx)^{1/2}. Thm (Loomis & Whitney, 1949): the volume of a 3D set V satisfies |V| ≤ (area(A shadow) · area(B shadow) · area(C shadow))^{1/2}. [Figure: a box and a general 3D set with their three shadows.]

26 Geometric Embedding (2nd approach). [Figure: a run's sequence of reads, writes, and FLOPs over time, partitioned into segments; example of a partition with M = 3.] For a given run (algorithm, machine, input): 1. Partition the computation into segments of M reads/writes each. 2. Any segment S has at most 3M inputs/outputs. 3. Show that S performs at most k FLOPs g_{i,j,k}. 4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · G / k, where G is the number of g_{i,j,k}. 5. By Loomis-Whitney, k ≤ (3M)^{3/2}, so BW ≥ M · G / (3M)^{3/2} = Ω(G / M^{1/2}).

27 Applications. BW = Ω(G / M^{1/2}), where G = |{ g_{i,j,k} : (i,j) ∈ S, k ∈ S_ij }|, and BW = Ω(G / (P·M^{1/2})) in the P-parallel model. (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij.

28 Geometric Embedding (2nd approach) [Ballard, Demmel, Holtz, S. 2011a], follows [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]. (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij. But many algorithms just don't fit the generalized form! For example: Strassen's fast matrix multiplication.

29 Beyond 3-nested loops How about the communication costs of algorithms that have a more complex structure?

30 Communication Lower Bounds, to be continued. Approaches: 1. Reduction [Ballard, Demmel, Holtz, S. 2009]; 2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]; 3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]. Goal: proving that your algorithm/implementation is as good as it gets.

31 Further reduction techniques: imposing reads and writes. Example: computing ||A·B||, where each matrix element is given by a formula and is computed only once. Problem: the inputs/outputs do not agree with Form (1). Solution: impose writes/reads of the (computed) entries of A and B, and impose writes of the entries of C. The new algorithm has lower bound Ω(n^3 / M^{1/2}). The imposed reads/writes add only O(n^2) communication, so the same bound holds for the original algorithm, i.e., whenever n^3 / M^{1/2} = Ω(n^2), that is, for M = O(n^2) (which we assume anyway). (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij.

32 Further reduction techniques: imposing reads and writes. The previous example generalizes to other "black-box" uses of algorithms that fit Form (1). Consider a more general class of algorithms: some arguments of the generalized form may be computed "on the fly" and discarded immediately after use.

33 Recall: [Figure: a run's sequence of reads, writes, and FLOPs over time, partitioned into segments; example of a partition with M = 3.] For a given run (algorithm, machine, input): 1. Partition the computation into segments of M reads/writes each. 2. Any segment S has at most 3M inputs/outputs. 3. Show that S performs at most G(3M) FLOPs g_{i,j,k}. 4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · G / G(3M). But now some operands inside a segment may be computed on the fly and discarded, so no read/write is performed for them.

34 How to generalize this lower bound: how to deal with on-the-fly generated operands. We need to distinguish the sources and destinations of each operand in fast memory during a segment. Possible sources: R1: already in fast memory at the start of the segment, or read during it (at most 2M); R2: created during the segment (no bound without more information). Possible destinations: D1: left in fast memory at the end of the segment, or written during it (at most 2M); D2: discarded (no bound without more information).

35 How to generalize this lower bound: how to deal with on-the-fly generated operands. There are at most 4M operands of the types R1/D1, R1/D2, R2/D1. We need to assume/prove that there are not too many R2/D2 arguments; then we can apply Loomis-Whitney and obtain the lower bound of Form (1). Bounding the number of R2/D2 operands is sometimes quite subtle.

36 Composition of algorithms. Many algorithms and applications use compositions of other (linear algebra) algorithms. How do we compute lower and upper bounds for such cases? Example: dense matrix powering. Compute A^n by repeated squaring (log n times): A → A^2 → A^4 → … → A^n. Each squaring step agrees with Form (1). Do we get a lower bound of Ω(n^3 log n / M^{1/2}), or is there a way to reorder (interleave) the computations to reduce communication?
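The phased computation referred to above, as a sketch (assuming the exponent e is a power of two so that exactly log_2 e squarings suffice; each squaring is a full matrix multiplication, hence Θ(n^3 / M^{1/2}) words per phase):

```python
import numpy as np

def power_by_repeated_squaring(A, e):
    """Compute A**e for an n x n matrix A by repeated squaring: A -> A^2 -> A^4 -> ...
    Assumes e is a power of two."""
    X = A.copy()
    for _ in range(int(round(np.log2(e)))):
        X = X @ X    # one phase: a full n x n matrix multiplication
    return X
```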

37 Communication hiding vs. communication avoiding. Q: The model assumes that computation and communication do not overlap. Is this a realistic assumption? Can't we gain time by overlapping them? A: Yes, we can. This is called communication hiding; it is done in practice, is ignored in our model, and may save up to a factor of 2 in running time. Note that the speedup gained by avoiding (minimizing) communication is typically larger than a constant factor.

38 Two nested loops: when the input/output size dominates. Q: Do two-nested-loops algorithms fall into the paradigm of Form (1)? For example, what lower bound do we obtain for matrix-vector multiplication? A: Yes, but the lower bound we obtain is Ω(n^2 / M^{1/2}), whereas just reading the input already costs Θ(n^2). More generally, the communication cost lower bound for algorithms that agree with Form (1) is Ω(max(LW, #inputs + #outputs)), where LW is the bound obtained from the geometric embedding and #inputs + #outputs is the size of the inputs and outputs. For some algorithms LW dominates; for others #inputs + #outputs dominates.

39 Composition of algorithms. Claim: any implementation of A^n by repeated squaring (log n times) requires communication Ω(n^3 log n / M^{1/2}). Therefore we cannot reduce communication by more than a constant factor (compared to log n separate calls to matrix multiplication) by reordering the computations. (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij.

40 Composition of algorithms. Proof: by imposing reads/writes on each entry of every intermediate matrix. The total number of g_{i,j,k} operations is Θ(n^3 log n). The total number of imposed reads/writes is Θ(n^2 log n). Hence the lower bound for the original algorithm is Ω(n^3 log n / M^{1/2}) - O(n^2 log n) = Ω(n^3 log n / M^{1/2}) for M = O(n^2). (1) Generalized form: ∀(i,j) ∈ S, C(i,j) = f_ij( g_{i,j,k1}(A(i,k1), B(k1,j)), g_{i,j,k2}(A(i,k2), B(k2,j)), …, other arguments ), where k1, k2, … ∈ S_ij.

41 Composition of algorithms: when interleaving does matter. Example 1: Input: A, v_1, v_2, …, v_n. Output: Av_1, Av_2, …, Av_n. The phased solution (n separate matrix-vector multiplications) costs Θ(n · n^2) = Θ(n^3) words. But we already know that we can save an M^{1/2} factor: set B = (v_1, v_2, …, v_n) and compute A·B; then the cost is Θ(n^3 / M^{1/2}). Other examples?
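A sketch of the two implementations in Example 1; the point is only that stacking the vectors as the columns of a matrix B turns n matrix-vector products into a single matrix-matrix product, which a blocked implementation then performs with Θ(n^3 / M^{1/2}) words instead of Θ(n^3):

```python
import numpy as np

def phased(A, vs):
    # n separate matrix-vector products: A is streamed through fast memory once per vector,
    # so the total communication is Θ(n * n^2) = Θ(n^3) words
    return [A @ v for v in vs]

def interleaved(A, vs):
    # set B = (v_1, ..., v_n) and perform one matrix-matrix product instead
    B = np.column_stack(vs)
    C = A @ B
    return [C[:, k] for k in range(C.shape[1])]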

42 Composition of algorithms: when interleaving does matter. Example 2: Input: A, B, t. Output: C^{(k)} = A · B^{(k)} for k = 1, 2, …, t, where B^{(k)}_{i,j} = B_{i,j}^{1/k}. Phased solution: upper bound Θ(t · n^3 / M^{1/2}) (by adding up the bandwidth cost of t matrix multiplication calls); lower bound Ω(t · n^3 / M^{1/2}) for the phased implementation (by imposing writes/reads between phases).

43 Composition of algorithms: when interleaving does matter. Example 2: Input: A, B, t. Output: C^{(k)} = A · B^{(k)} for k = 1, 2, …, t, where B^{(k)}_{i,j} = B_{i,j}^{1/k}. Can we do better than the phased cost of Θ(t · n^3 / M^{1/2})?

44 Composition of algorithms: when interleaving does matter. Example 2: Input: A, B, t. Output: C^{(k)} = A · B^{(k)} for k = 1, 2, …, t, where B^{(k)}_{i,j} = B_{i,j}^{1/k}. Can we do better than the phased cost? Yes. Claim: there exists an implementation of the above computation with asymptotically lower communication cost, and the lower and upper bounds are tight.

45 Composition of algorithms: when interleaving does matter. Example 2: Input: A, B, t. Output: C^{(k)} = A · B^{(k)} for k = 1, 2, …, t, where B^{(k)}_{i,j} = B_{i,j}^{1/k}. Proof idea: Upper bound: having both A_{i,k} and B_{k,j} in fast memory lets us do up to t evaluations of g_{i,j,k}. Lower bound: the union of all these t·n^3 operations does not match Form (1), since the inputs B_{k,j} cannot be indexed in a one-to-one fashion; we need a more careful argument bounding the number of g_{i,j,k} operations in a segment as a function of the number of accessed elements of A, B, and C^{(k)}.

46 Composition of algorithms: when interleaving does matter Can you think of natural examples where reordering / interleaving of known algorithms may improve the communication costs, compared to the phased implementation?

47 Summary. How to compute an upper bound on the communication costs of your algorithm? Typically straightforward, but not always. How to compute and prove a lower bound on the communication costs of your algorithm? 1. Reductions: from another algorithm/problem, or from another model of computing. 2. By using the generalized form (the "flavor" of 3 nested loops) and imposing reads/writes, either black-box-wise or by bounding the number of R2/D2 operands. 3. By carefully composing the lower bounds of the building blocks. Next time: by graph analysis.

48 Open problems. Find algorithms that attain the lower bounds: sparse matrix algorithms; algorithms that auto-tune or are cache-oblivious; cache-obliviousness for the parallel (distributed-memory) setting; cache-oblivious parallel matrix multiplication? (Cilk++?). Address complex heterogeneous hardware: lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11].

How to Compute and Prove Lower Bounds on the Communication Costs of your Algorithm Oded Schwartz CS294, Lecture #2 Fall, 2011 Communication-Avoiding Algorithms Based on: D. Irony, S. Toledo, and A. Tiskin: Communication lower bounds for distributed-memory matrix multiplication. G. Ballard, J. Demmel, O. Holtz, and O. Schwartz: Minimizing communication in linear algebra. Thank you!