How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm, Part III: Graph Analysis. Oded Schwartz, CS294, Lecture #10.

Slide 1: How to Compute and Prove Lower and Upper Bounds on the Communication Costs of Your Algorithm, Part III: Graph Analysis
Oded Schwartz, CS294 (Communication-Avoiding Algorithms), Lecture #10, Fall 2011.
Based on: G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, "Graph expansion and communication costs of fast matrix multiplication."

Slide 2: Previous talk on lower bounds
Communication lower bounds, three approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Goal: proving that your algorithm/implementation is as good as it gets.

Slide 3: Previous talk on lower bounds: algorithms with the "flavor" of three nested loops
[Ballard, Demmel, Holtz, S. 2009], [Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04]. The technique covers:
- BLAS, LU, Cholesky, LDL^T, and QR factorizations, eigenvalue and singular value computations, i.e., essentially all direct methods of linear algebra.
- Dense or sparse matrices (in the sparse case, the bandwidth bound is a function of the number of nonzeros, NNZ).
- Bandwidth and latency costs.
- Sequential, hierarchical, and parallel (distributed- and shared-memory) models.
- Compositions of linear algebra operations.
- Certain graph optimization problems [Demmel, Pearson, Poloni, Van Loan 11].
- Tensor contractions.

Slide 4: Geometric embedding (2nd approach)
[Ballard, Demmel, Holtz, S. 2011a], following [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49].
Generalized form: $\forall (i,j) \in S$,
$$C(i,j) = f_{ij}\big(g_{i,j,k_1}(A(i,k_1), B(k_1,j)),\; g_{i,j,k_2}(A(i,k_2), B(k_2,j)),\; \ldots\big), \qquad k_1, k_2, \ldots \in S_{ij},$$
possibly with other arguments.
But many algorithms just don't fit this generalized form, for example Strassen's fast matrix multiplication.

Slide 5: Beyond three nested loops
What about the communication costs of algorithms that have a more complex structure?

Slide 6: Communication lower bounds
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b] (this talk)
Goal: proving that your algorithm/implementation is as good as it gets.

Slide 7: Recall Strassen's fast matrix multiplication [Strassen 69]
Compute a 2 x 2 (block) matrix multiplication using only 7 multiplications instead of 8, and apply recursively (block-wise):
$$\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}, \qquad \text{blocks of size } n/2 \times n/2.$$
$M_1 = (A_{11} + A_{22})(B_{11} + B_{22})$
$M_2 = (A_{21} + A_{22})\,B_{11}$
$M_3 = A_{11}(B_{12} - B_{22})$
$M_4 = A_{22}(B_{21} - B_{11})$
$M_5 = (A_{11} + A_{12})\,B_{22}$
$M_6 = (A_{21} - A_{11})(B_{11} + B_{12})$
$M_7 = (A_{12} - A_{22})(B_{21} + B_{22})$
$C_{11} = M_1 + M_4 - M_5 + M_7$
$C_{12} = M_3 + M_5$
$C_{21} = M_2 + M_4$
$C_{22} = M_1 - M_2 + M_3 + M_6$
Arithmetic cost: $T(n) = 7\,T(n/2) + O(n^2)$, so $T(n) = \Theta(n^{\log_2 7})$.
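As a concrete illustration of the recursion above, here is a minimal Python sketch (not from the slides: it assumes n is a power of two, uses numpy for the block arithmetic, and the cutoff parameter for switching to the classical product is an illustrative tuning knob of my own):

```python
import numpy as np

def strassen(A, B, cutoff=32):
    """Multiply square matrices A and B by Strassen's recursion.
    Assumes n is a power of two; falls back to numpy below the cutoff."""
    n = A.shape[0]
    if n <= cutoff:
        return A @ B
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    # the seven recursive products of Strassen's formulas
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    # combine the products into the four output blocks
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C
```

For example, strassen(np.random.rand(256, 256), np.random.rand(256, 256)) agrees with the classical product A @ B up to floating-point roundoff.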

Slide 8: Strassen-like algorithms
Compute an $n_0 \times n_0$ matrix multiplication using only $n_0^{\omega_0}$ multiplications instead of $n_0^3$, and apply recursively (block-wise): $T(n) = n_0^{\omega_0}\,T(n/n_0) + O(n^2)$, so $T(n) = \Theta(n^{\omega_0})$.
Progress on the exponent $\omega_0$:
- 2.81 [Strassen 69] (works fast in practice)
- 2.79 [Pan 78]
- 2.78 [Bini 79]
- 2.55 [Schönhage 81]
- 2.50 [Pan, Romani; Coppersmith, Winograd 84]
- 2.48 [Strassen 87]
- 2.38 [Coppersmith, Winograd 90]
- 2.38 [Cohn, Kleinberg, Szegedy, Umans 05] (group-theoretic approach)
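As a worked step (a standard master-theorem calculation, not spelled out on the slide), the recurrence unrolls to

$$T(n) = n_0^{\omega_0}\,T(n/n_0) + O(n^2) = \sum_{i=0}^{\log_{n_0} n} n_0^{i\,\omega_0}\; O\big((n/n_0^{i})^2\big) = O(n^2)\sum_{i=0}^{\log_{n_0} n} \big(n_0^{\omega_0-2}\big)^{i} = \Theta\big(n^{\omega_0}\big) \quad \text{for } \omega_0 > 2,$$

since for $\omega_0 > 2$ the geometric sum is dominated by its last term.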

Slide 9: New lower bound for Strassen's fast matrix multiplication
[Ballard, Demmel, Holtz, S. 2011b]: the communication bandwidth lower bound is
- For Strassen's: $BW = \Omega\big((n/M^{1/2})^{\log_2 7} \cdot M\big)$
- For Strassen-like: $BW = \Omega\big((n/M^{1/2})^{\omega_0} \cdot M\big)$
- Recall, for cubic (classical): $BW = \Omega\big((n/M^{1/2})^{\log_2 8} \cdot M\big)$
The parallel lower bound applies to 2D algorithms ($M = \Theta(n^2/P)$) and 2.5D algorithms ($M = \Theta(c \cdot n^2/P)$).
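To see how much the exponent matters, a small Python check (illustrative only: the constants hidden by the Omegas are dropped, and the problem sizes n and M below are made up):

```python
from math import log2

def bw_bound(n, M, omega):
    """Evaluate (n / sqrt(M))^omega * M, the bandwidth lower bound
    with the hidden constant dropped."""
    return (n / M**0.5) ** omega * M

n, M = 2**14, 2**20                                       # hypothetical sizes
print(f"classic : {bw_bound(n, M, log2(8)):.2e} words")   # equals n^3 / sqrt(M)
print(f"strassen: {bw_bound(n, M, log2(7)):.2e} words")
```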

Slide 10: Are these lower bounds attainable?
For sequential and hierarchical models? Yes, existing implementations do.
For parallel 2D and parallel 2.5D? Yes: new algorithms.

Slide 11: Sequential and new 2D and 2.5D parallel Strassen-like algorithms
Sequential and hierarchy cases: attained by the natural recursive implementation; also LU, QR, ... (black-box use of fast matrix multiplication) [Ballard, Demmel, Holtz, S., Rom 2011].
New 2D parallel Strassen-like algorithm: attains the lower bound.
New 2.5D parallel Strassen-like algorithm: a $c^{\omega_0/2 - 1}$ parallel communication speedup over the 2D implementation (where $c \cdot 3n^2 = M \cdot P$).
[Ballard, Demmel, Holtz, S. 2011b]: this is as good as it gets.

Slide 12: Implications for sequential architectural scaling
Requirements so that "most" time is spent doing arithmetic on $n \times n$ dense matrices, $n^2 > M$:
- The time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
- The time to multiply the two largest locally storable square matrices exceeds the latency.
With $\gamma$ = time per flop, $\beta$ = time per word moved (reciprocal bandwidth), and $\alpha$ = time per message (latency):

CA (communication-avoiding) matmul algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic | $\gamma \cdot M^{1/2} \ge \beta$ | $\gamma \cdot M^{3/2} \ge \alpha$
Strassen-like | $\gamma \cdot M^{\omega_0/2 - 1} \ge \beta$ | $\gamma \cdot M^{\omega_0/2} \ge \alpha$

Strassen-like algorithms do fewer flops and less communication, but are more demanding on the hardware. If $\omega_0 \to 2$, it is all about communication.
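A hedged sketch of how one might evaluate the table's balance conditions numerically (the alpha/beta/gamma names follow the usual cost model above; the machine numbers are my own illustrative choices, not from the slide):

```python
def balanced(M, gamma, beta, alpha, omega):
    """Check the scaling requirements from the table: the flop time of one
    segment must dominate both its word-move time and its message time."""
    bw_ok  = gamma * M ** (omega / 2 - 1) >= beta   # bandwidth requirement
    lat_ok = gamma * M ** (omega / 2)     >= alpha  # latency requirement
    return bw_ok, lat_ok

# hypothetical machine: 10 Gflop/s, 1 GB/s (8-byte words), 1 microsecond latency
gamma, beta, alpha = 1e-10, 8e-9, 1e-6
M = 2**20                                           # words of fast memory
print("classic      :", balanced(M, gamma, beta, alpha, 3.0))
print("strassen-like:", balanced(M, gamma, beta, alpha, 2.81))
```

Note that the classic row of the table is just the omega = 3 case of the same two inequalities.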

Slide 13: Expansion (3rd approach)
[Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81].
Let $G = (V, E)$ be a $d$-regular graph, and let $A$ be its normalized adjacency matrix, with eigenvalues $1 = \lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_n$. Define the spectral gap $\lambda \triangleq 1 - \max\{\lambda_2, |\lambda_n|\}$ and the edge expansion $h(G) = \min_{S \subseteq V,\ |S| \le |V|/2} |E(S, V \setminus S)| / (d\,|S|)$.
Thm [Alon-Milman 84, Dodziuk 84, Alon 86] (the discrete Cheeger inequality): the spectral gap controls the edge expansion, $(1 - \lambda_2)/2 \le h(G) \le \sqrt{2(1 - \lambda_2)}$; in particular $h(G) \ge \lambda/2$.
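A quick numerical sanity check of these definitions in Python (a sketch: it brute-forces h(G) over all cuts, so it only runs on tiny graphs; the 7-cycle is chosen because odd cycles are non-bipartite, which keeps the gap $\lambda$ positive):

```python
import itertools
import numpy as np

def gap_and_expansion(adj, d):
    """Return (lambda/2, h(G), sqrt(2*(1 - lambda_2))) for a d-regular graph."""
    n = len(adj)
    A = np.array(adj, dtype=float) / d           # normalized adjacency matrix
    eig = np.sort(np.linalg.eigvalsh(A))         # ascending; eig[-1] = 1
    lam = 1 - max(eig[-2], abs(eig[0]))          # 1 - max{lambda_2, |lambda_n|}
    h = min(                                     # edge expansion, brute force
        sum(adj[u][v] for u in S for v in range(n) if v not in S) / (d * len(S))
        for k in range(1, n // 2 + 1)
        for S in itertools.combinations(range(n), k)
    )
    return lam / 2, h, np.sqrt(2 * (1 - eig[-2]))

n = 7                                            # the 7-cycle: 2-regular
adj = [[1 if (i - j) % n in (1, n - 1) else 0 for j in range(n)] for i in range(n)]
lo, h, hi = gap_and_expansion(adj, 2)
print(f"{lo:.3f} <= h = {h:.3f} <= {hi:.3f}")    # Cheeger sandwich holds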

Slide 14: The computation directed acyclic graph
Expansion (3rd approach): communication cost relates to graph expansion.
[Figure: the CDAG $V$, with vertices for inputs/outputs and intermediate values and edges for dependencies; a segment $S$ is shown together with its read set $R_S$ and write set $W_S$.]

Slide 15: Expansion (3rd approach), the segment argument
For a given run (algorithm, machine, input):
1. Consider the computation DAG $G = (V, E)$: $V$ = the set of computations and inputs, $E$ = the dependencies.
2. Partition $G$ into segments $S$ of $\Theta(M^{\omega_0/2})$ vertices (corresponding to adjacency in time / location).
3. Show that every $S$ has $\ge 3M$ vertices with incoming or outgoing edges, hence performs $\ge M$ reads/writes.
4. The total communication bandwidth is then
$$BW = (\text{BW of one segment}) \times (\#\text{segments}) = \Omega(M) \cdot \frac{O(n^{\omega_0})}{\Theta(M^{\omega_0/2})} = \Omega\!\left(\frac{n^{\omega_0}}{M^{\omega_0/2 - 1}}\right).$$
[Figure: a timeline of reads, writes, and flops divided into segments $S_1, S_2, S_3, \ldots$, each holding at most $M$ words in fast memory.]

Slide 16: Is it a good expander?
Break $G$ into edge-disjoint subgraphs, each corresponding to the algorithm run on $M^{1/2} \times M^{1/2}$ matrices, and consider the expansion of $S$ in each part (the contributions sum up).
We need to show that a set of $M^{\omega_0/2}$ vertices expands to $\Omega(M)$, i.e., $h(G(n)) = \Omega(M / M^{\omega_0/2})$ for $n = \Theta(M^{1/2})$. Namely, for every $n$, $h(G(n)) = \Omega(n^2 / n^{\omega_0})$, which for Strassen ($\omega_0 = \log_2 7$) is $\Omega((4/7)^{\lg n})$. Then
$$BW = \Omega\big(T(n) \cdot h(G(M^{1/2}))\big).$$
[Figure: the CDAG with encoders $Enc_{\lg n} A$ and $Enc_{\lg n} B$ ($n^2$ inputs each) feeding $n^{\omega_0}$ products, decoded by $Dec_{\lg n} C$; a segment intersects several of the edge-disjoint parts $S_1, \ldots, S_5$.]
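The $(4/7)^{\lg n}$ form is just a rewriting of the ratio $n^2/n^{\log_2 7}$, shown here as a one-line check:

$$\frac{n^2}{n^{\log_2 7}} \;=\; n^{\,2 - \log_2 7} \;=\; 2^{(\log_2 4 - \log_2 7)\,\lg n} \;=\; \left(\tfrac{4}{7}\right)^{\lg n}.$$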

Slide 17: What is the CDAG of Strassen's algorithm?

Slide 18: The DAG of Strassen, n = 2
$M_1 = (A_{11} + A_{22})(B_{11} + B_{22})$, $M_2 = (A_{21} + A_{22})B_{11}$, $M_3 = A_{11}(B_{12} - B_{22})$, $M_4 = A_{22}(B_{21} - B_{11})$, $M_5 = (A_{11} + A_{12})B_{22}$, $M_6 = (A_{21} - A_{11})(B_{11} + B_{12})$, $M_7 = (A_{12} - A_{22})(B_{21} + B_{22})$;
$C_{11} = M_1 + M_4 - M_5 + M_7$, $C_{12} = M_3 + M_5$, $C_{21} = M_2 + M_4$, $C_{22} = M_1 - M_2 + M_3 + M_6$.
[Figure: $Enc_1 A$ and $Enc_1 B$ encode the four entries (1,1), (1,2), (2,1), (2,2) of $A$ and of $B$ into the seven products $M_1, \ldots, M_7$; $Dec_1 C$ decodes the products into the four entries of $C$.]

Slide 19: The DAG of Strassen, n = 4
One recursive level: each vertex splits into four, and the seven products become block multiplications.
[Figure: a top $Dec_1 C$ layer whose vertices (1,1), (1,2), (2,1), (2,2) have each split into four; below it, seven recursive $Enc_1 A$ / $Enc_1 B$ / $Dec_1 C$ gadgets, one per block multiplication.]

Slide 20: The DAG of Strassen, further recursive steps
Recursive construction: given $Dec_i C$, construct $Dec_{i+1} C$ as follows (a code sketch is given below):
1. Duplicate $Dec_i C$ four times.
2. Connect the copies with a cross-layer of $Dec_1 C$ gadgets.
[Figure: the full CDAG, with encoders $Enc_{\lg n} A$ and $Enc_{\lg n} B$ ($n^2$ vertices each), $n^{\omega_0}$ product vertices, and the decoder $Dec_{\lg n} C$.]
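A Python sketch of this construction (illustrative; the vertex naming and the edge-list representation are my own, not the paper's). Each level duplicates the previous decoder four times and attaches one fresh $Dec_1 C$ gadget per old input vertex, so $Dec_i C$ has $7^i$ inputs and $4^i$ outputs:

```python
# Which products m1..m7 feed which of the four outputs of Dec1 C,
# read off Strassen's formulas (C11 = M1 + M4 - M5 + M7, etc.).
DEC1 = {1: [1, 4, 5, 7], 2: [3, 5], 3: [2, 4], 4: [1, 2, 3, 6]}

def dec_c(i):
    """Return (inputs, edges) of Dec_i C; edges are directed from the
    product side toward the entries of C."""
    if i == 0:
        return ["x"], []                      # base case: a single vertex
    inputs, edges = dec_c(i - 1)
    # step 1: duplicate Dec_{i-1} C four times
    new_edges = [(f"{q}.{u}", f"{q}.{v}") for q in "1234" for (u, v) in edges]
    # step 2: cross-layer of Dec1 C gadgets, one per old input vertex,
    # feeding its four copies from seven fresh product vertices
    new_inputs = []
    for v in inputs:
        for out, ms in DEC1.items():
            new_edges += [(f"m{m}.{v}", f"{out}.{v}") for m in ms]
        new_inputs += [f"m{m}.{v}" for m in range(1, 8)]
    return new_inputs, new_edges

ins, es = dec_c(2)
print(len(ins), len(es))    # 49 inputs (7^2) and 132 edges at two levels
```

Running dec_c(2) gives 49 = 7^2 input vertices and 4^2 = 16 sink vertices, matching a two-level Strassen decoder.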

Slide 21: The DAG of Strassen, overall structure
1. Compute weighted sums of $A$'s elements ($Enc_{\lg n} A$).
2. Compute weighted sums of $B$'s elements ($Enc_{\lg n} B$).
3. Compute the multiplications $m_1, m_2, \ldots, m_{n^{\omega_0}}$.
4. Compute weighted sums of $m_1, m_2, \ldots, m_{n^{\omega_0}}$ to obtain $C$ ($Dec_{\lg n} C$).
[Figure: $A$ and $B$ ($n^2$ elements each) feed the two encoders; the $n^{\omega_0}$ products feed the decoder, which outputs $C$.]

Slide 22: Expansion of a segment
Two methods to compute the expansion of the recursively constructed graph:
- Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).

Slide 23: Expansion of a segment, main technical challenges
- Two types of vertices: with and without recursion.
- The graph is not regular.
[Figure: the $n = 2$ DAG again: $Enc_1 A$, $Enc_1 B$, and $Dec_1 C$ over the entries (1,1), (1,2), (2,1), (2,2).]

Slide 24: Estimating the edge expansion, combinatorially
$Dec_1 C$ is a consistency gadget: a "mixed" copy (one that is neither fully inside nor fully outside $S$) pays $\ge 1/12$ of its edges.
The fraction of $S$-vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).
[Figure: the duplicated parts $S_1, S_2, S_3, \ldots, S_k$, with vertices marked "in $S$", "not in $S$", and "mixed".]

Slide 25: Communication lower bounds (recap)
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Goal: proving that your algorithm/implementation is as good as it gets.

Slide 26: Open problems
- Find algorithms that attain the lower bounds: sparse matrix algorithms for sequential and parallel models that auto-tune or are cache-oblivious.
- Address complex heterogeneous hardware: lower bounds and algorithms [Demmel, Volkov 08], [Ballard, Demmel, Gearhart 11].
- Extend the techniques to other algorithms and algorithmic tools, e.g., non-uniform recursive structure.
- Characterize a communication lower bound for a problem rather than for an algorithm.

Slide 27: How to Compute and Prove Lower Bounds on the Communication Costs of Your Algorithm, Part III: Graph Analysis
Oded Schwartz, CS294 (Communication-Avoiding Algorithms), Lecture #10, Fall 2011.
Based on: G. Ballard, J. Demmel, O. Holtz, and O. Schwartz, "Graph expansion and communication costs of fast matrix multiplication."
Thank you!