Algorithms for Supercomputers: Upper Bounds, from Sequential to Parallel
Oded Schwartz
Seminar: Sunday, 12-2pm, B410. Workshop: Sunday, 2-5pm.
High performance. Fault tolerance.
March 15, 2015

Model & Motivation
Two kinds of costs: arithmetic (FLOPs) and communication (moving data).
Running time = γ · #FLOPs + β · #Words (+ α · #Messages)
Sequential model: a CPU with a fast/local memory of size M and a large slow memory (RAM). Distributed model: P processors, each with a local memory of size M.
Communication-minimizing algorithms save time and save energy.
[Figure: hardware trends, with arithmetic throughput improving roughly 59%/year against roughly 26%/year and 23%/year for the communication parameters.]
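
Below is a minimal sketch of this cost model in Python. The machine parameters γ, β, α and the communication counts for blocked matrix multiplication are illustrative assumptions, not measurements from the lecture.

```python
# A minimal sketch of the alpha-beta-gamma cost model.
# gamma: seconds per flop, beta: seconds per word moved,
# alpha: seconds per message. All parameter values are illustrative.

def model_time(flops, words, messages, gamma=1e-11, beta=1e-9, alpha=1e-6):
    """Estimated running time = gamma*#FLOPs + beta*#Words + alpha*#Messages."""
    return gamma * flops + beta * words + alpha * messages

# Example: classic O(n^3) matmul with a fast memory of size M words.
# Flops = 2n^3; a communication-optimal blocked algorithm moves
# Theta(n^3 / sqrt(M)) words in Theta(n^3 / M^(3/2)) messages.
n, M = 4096, 2**20
t = model_time(flops=2 * n**3,
               words=n**3 / M**0.5,
               messages=n**3 / M**1.5)
print(f"modeled time: {t:.3f} s")
```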

Communication Lower Bounds – to be continued…
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]
Proving that your algorithm/implementation is as good as it gets.

Recall: Strassen's Fast Matrix Multiplication
[Strassen 69] Compute 2 x 2 matrix multiplication using only 7 multiplications (instead of 8). Apply recursively (block-wise), with n/2 x n/2 blocks:
[C11 C12; C21 C22] = [A11 A12; A21 A22] · [B11 B12; B21 B22]
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
T(n) = 7 · T(n/2) + Θ(n^2), hence T(n) = Θ(n^(log₂ 7)).
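
As a concrete illustration, here is a short Python sketch of the recursion above. It assumes the dimension is a power of two; the `cutoff` switch-over to classic multiplication is a standard practical tweak, not something on the slide.

```python
# A minimal sketch of Strassen's recursion using numpy.
# Assumes n is a power of 2; a practical version would also pad
# odd dimensions.
import numpy as np

def strassen(A, B, cutoff=64):
    n = A.shape[0]
    if n <= cutoff:
        return A @ B  # base case: classic multiplication
    h = n // 2
    A11, A12, A21, A22 = A[:h, :h], A[:h, h:], A[h:, :h], A[h:, h:]
    B11, B12, B21, B22 = B[:h, :h], B[:h, h:], B[h:, :h], B[h:, h:]
    M1 = strassen(A11 + A22, B11 + B22, cutoff)
    M2 = strassen(A21 + A22, B11, cutoff)
    M3 = strassen(A11, B12 - B22, cutoff)
    M4 = strassen(A22, B21 - B11, cutoff)
    M5 = strassen(A11 + A12, B22, cutoff)
    M6 = strassen(A21 - A11, B11 + B12, cutoff)
    M7 = strassen(A12 - A22, B21 + B22, cutoff)
    C = np.empty_like(A)
    C[:h, :h] = M1 + M4 - M5 + M7
    C[:h, h:] = M3 + M5
    C[h:, :h] = M2 + M4
    C[h:, h:] = M1 - M2 + M3 + M6
    return C

# Sanity check against numpy's classic multiplication
A = np.random.rand(256, 256)
B = np.random.rand(256, 256)
assert np.allclose(strassen(A, B), A @ B)
```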

Strassen-like algorithms
Compute n₀ x n₀ matrix multiplication using only n₀^ω₀ multiplications (instead of n₀^3). Apply recursively (block-wise), with n/n₀ x n/n₀ blocks:
T(n) = n₀^ω₀ · T(n/n₀) + Θ(n^2), hence T(n) = Θ(n^ω₀).
Subsequently…
ω₀ ≈ 2.81 [Strassen 69], [Strassen-Winograd 71]
2.79 [Pan 78]
2.78 [Bini 79]
2.55 [Schönhage 81]
2.50 [Pan Romani, Coppersmith Winograd 84]
2.48 [Strassen 87]
2.38 [Coppersmith Winograd 90]
2.38 [Cohn Kleinberg Szegedy Umans 05] (group-theoretic approach)
Further improvements: [Stothers 10], [Vassilevska Williams 12], [Le Gall 14]
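
For completeness, unrolling the recurrence confirms the claimed bound; this is a routine recursion-tree calculation, assuming ω₀ > 2:

```latex
% Unrolling T(n) = n_0^{\omega_0}\, T(n/n_0) + \Theta(n^2)
% over k = \log_{n_0} n levels:
\begin{aligned}
T(n) &= \sum_{i=0}^{k-1} \left(n_0^{\omega_0}\right)^{i}
        \Theta\!\left(\left(\tfrac{n}{n_0^{\,i}}\right)^{2}\right)
        + n_0^{\omega_0 k}\, T(1) \\
     &= \Theta(n^2) \sum_{i=0}^{k-1} \left(n_0^{\omega_0-2}\right)^{i}
        + \Theta\!\left(n^{\omega_0}\right)
      = \Theta\!\left(n^{\omega_0}\right),
\end{aligned}
% since \omega_0 > 2 makes the geometric sum dominated by its last
% term, of order n_0^{(\omega_0-2)(k-1)} = \Theta(n^{\omega_0-2}).
```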

Communication cost lower bounds for matrix multiplication [Ballard, Demmel, Holtz, S. 2011b]: sequential and parallel, via a novel graph expansion proof. With ω = log₂ 8 for the classic (cubic) algorithm, ω = log₂ 7 for Strassen's, and ω = ω₀ for Strassen-like algorithms:
Sequential:  #Words = Ω( (n / M^(1/2))^ω · M )
Distributed: #Words = Ω( (n / M^(1/2))^ω · M / P )

Implications for sequential architectural scaling
Requirements so that "most" time is spent doing arithmetic on n x n dense matrices, n^2 > M:
- Time to add two rows of the largest locally storable square matrix exceeds the reciprocal bandwidth.
- Time to multiply the two largest locally storable square matrices exceeds the latency.

CA matrix multiplication algorithm | Scaling bandwidth requirement | Scaling latency requirement
Classic                            | M^(1/2) · γ ≥ β               | M^(3/2) · γ ≥ α
Strassen-like                      | M^(ω₀/2 - 1) · γ ≥ β          | M^(ω₀/2) · γ ≥ α

Strassen-like algorithms do fewer flops and less communication but are more demanding on the hardware. If ω₀ → 2, it is all about communication.
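
A small sketch that evaluates these requirements for hypothetical machine parameters; all numbers below are illustrative assumptions.

```python
# Check the scaling requirements above for made-up machine parameters.
gamma = 1e-11   # s per flop
beta  = 1e-9    # s per word (reciprocal bandwidth)
alpha = 1e-6    # s per message (latency)
M     = 2**20   # fast-memory size in words
omega0 = 2.81   # exponent of the Strassen-like algorithm

checks = {
    "classic bandwidth":       gamma * M**0.5            >= beta,
    "classic latency":         gamma * M**1.5            >= alpha,
    "Strassen-like bandwidth": gamma * M**(omega0/2 - 1) >= beta,
    "Strassen-like latency":   gamma * M**(omega0/2)     >= alpha,
}
for name, ok in checks.items():
    print(f"{name:24s} {'OK' if ok else 'communication-bound'}")
```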

The Computation Directed Acyclic Graph
Vertices V: inputs/outputs and intermediate values. Edges: dependencies.
[Figure: a subset S of V, with R_S the vertices read into S and W_S the vertices written out of S.]
How can we estimate R_S and W_S? By bounding the expansion of the graph!

Expansion [Ballard, Demmel, Holtz, S. 2011b], in the spirit of [Hong & Kung 81]
Let G = (V, E) be a d-regular undirected graph. Its edge expansion is
h(G) = min over S ⊆ V, |S| ≤ |V|/2 of |E(S, V \ S)| / (d · |S|).
Let A be the normalized adjacency matrix of G, with eigenvalues λ₁ = 1 ≥ λ₂ ≥ … ≥ λₙ, and spectral gap δ = 1 - max{λ₂, |λₙ|}.
Thm [Alon-Milman 84, Dodziuk 84, Alon 86]: δ/2 ≤ h(G) ≤ (2δ)^(1/2).
Small-sets expansion: h_s(G) = min over S ⊆ V, |S| ≤ s of |E(S, V \ S)| / (d · |S|).
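
The spectral bound is easy to evaluate numerically. A minimal sketch, assuming a small hand-built 3-regular example graph:

```python
# Spectral gap of a d-regular graph via numpy, giving the
# Cheeger-style lower bound h(G) >= delta/2.
import numpy as np

def spectral_gap(adj, d):
    """adj: 0/1 adjacency matrix of a d-regular undirected graph."""
    eig = np.sort(np.linalg.eigvalsh(adj / d))  # eigenvalues of A, ascending
    lam2, lam_n = eig[-2], eig[0]
    return 1.0 - max(lam2, abs(lam_n))

# Example: the 8-cycle plus antipodal chords i <-> i+4 (3-regular).
n, d = 8, 3
adj = np.zeros((n, n))
for i in range(n):
    for j in (i + 1, i + 4):
        adj[i, j % n] = adj[j % n, i] = 1
delta = spectral_gap(adj, d)
print(f"spectral gap {delta:.3f}; edge expansion h(G) >= {delta/2:.3f}")
```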

The Computation Directed Acyclic Graph: Communication Cost is (Small-Sets) Graph Expansion
Vertices V: inputs/outputs and intermediate values. Edges: dependencies.
[Figure: a segment S of the DAG, with its read set R_S and write set W_S.]

What is the Computation Graph of Strassen? Can we compute its expansion?

The DAG of Strassen, n = 2
M1 = (A11 + A22) · (B11 + B22)
M2 = (A21 + A22) · B11
M3 = A11 · (B12 - B22)
M4 = A22 · (B21 - B11)
M5 = (A11 + A12) · B22
M6 = (A21 - A11) · (B11 + B12)
M7 = (A12 - A22) · (B21 + B22)
C11 = M1 + M4 - M5 + M7
C12 = M3 + M5
C21 = M2 + M4
C22 = M1 - M2 + M3 + M6
[Figure: a three-layer graph. Encoding layers Enc₁A and Enc₁B map the entries A11, A12, A21, A22 and B11, B12, B21, B22 to the seven products M1, …, M7; the decoding layer Dec₁C maps the products to C11, C12, C21, C22.]

The DAG of Strassen, n = 4
One recursive level: each vertex splits into four; blocks are multiplied in the middle.
[Figure: the n = 2 graph with each vertex of Enc₁A, Enc₁B, and Dec₁C expanded fourfold, and a copy of the n = 2 gadget for each of the seven block products.]

The DAG of Strassen: further recursive steps
Recursive construction. Given Dec_i C, construct Dec_{i+1} C:
1. Duplicate it 4 times.
2. Connect the copies with a cross-layer of Dec₁C.
[Figure: after lg n levels, Enc_{lg n}A and Enc_{lg n}B encode the n^2 entries of A and B into n^ω₀ products, where ω₀ = lg 7, and Dec_{lg n}C decodes them into the n^2 entries of C.]

The Expansion of the Computation Graph
Methods for analyzing the expansion of recursively constructed graphs:
- Combinatorial: estimate the edge / vertex expansion directly (in the spirit of [Alon, S., Shapira 08]), or
- Spectral: compute the edge expansion via the spectral gap (in the spirit of the zig-zag analysis [Reingold, Vadhan, Wigderson 00]).
Main technical challenges: two types of vertices (with/without recursion), and the graph is not regular.

Estimating the edge expansion, combinatorially
Dec₁C is a consistency gadget: a mixed copy (with vertices both in S and outside S) pays ≥ 1/12 of its edges. Consequently, the fraction of S vertices is consistent between the 1st level and the four 2nd levels (deviations pay linearly).
[Figure: the recursive DAG with vertices marked "in S", "not in S", and "mixed".]

Is Strassen's Graph a Good Expander?
The expansion is bounded at three scales, then combined:
For n-by-n matrices: …
For M^(1/2)-by-M^(1/2) matrices: …
For M^(1/2)-by-M^(1/2) sub-matrices (or other small subsets): …
Summing up: the partition argument.
[Figure: the DAG partitioned into segments S1, …, S5; the bound for each scale was given as a formula on the slide.]

The partitioning argument
For a given run (Algorithm, Machine, Input):
1. Consider the computation DAG G = (V, E): V = the set of computations and inputs, E = the dependencies.
2. Partition G into segments S of Θ(M^(ω/2)) vertices (corresponding to adjacency in time / location).
3. Show that every S has ≥ 3M vertices with incoming / outgoing edges, so each segment performs ≥ M reads/writes.
4. The total communication is BW = (BW of one segment) · #segments = Ω(M) · Θ(n^ω) / Θ(M^(ω/2)) = Ω(n^ω / M^(ω/2 - 1)).
[Figure: a timeline of reads, writes, and FLOPs split into segments S1, S2, S3, …, each operating within a fast memory of size M.]
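
Spelling out the arithmetic in step 4, with |V| = Θ(n^ω) total vertices (an assumption matching the flop count of the algorithm):

```latex
% With segments of \Theta(M^{\omega/2}) vertices each,
\#\text{segments} = \frac{\Theta(n^{\omega})}{\Theta(M^{\omega/2})},
\qquad
BW \;\ge\; \Omega(M)\cdot\#\text{segments}
      \;=\; \Omega\!\left(\frac{n^{\omega}}{M^{\omega/2-1}}\right).
```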

Subsequently… more bounds, for example:
- For rectangular fast matrix multiplication algorithms [MedAlg'12].
- For fast numerical linear algebra [EECS-Techreport'12], e.g., solving linear systems, least squares, and eigenproblems, with the same arithmetic and communication costs, and numerically stably.
- How much extra memory is useful, and how far perfect strong scaling can go [SPAA'12b].
New parallel algorithm…

Algorithms for Supercomputers: Lower Bounds by Graph Expansion
Oded Schwartz
Seminar: Sunday, 12-2pm, B410. Workshop: Sunday, 2-5pm.
High performance. Fault tolerance.
March 15, 2015