Algorithms for Supercomputers: Introduction. Oded Schwartz. Seminar: Sunday, 12-2pm, B410. Workshop: ? High performance. Fault tolerance. 3.1.2015.

Presentation transcript:

Algorithms for Supercomputers: Introduction. Oded Schwartz. Seminar: Sunday, 12-2pm, B410. Workshop: ? High performance. Fault tolerance.

Model & Motivation
Two kinds of costs:
- Arithmetic (FLOPs)
- Communication: moving data
Running time = γ · #FLOPs + β · #Words ( + α · #Messages )
Sequential model: CPU with a fast/local memory of size M, plus RAM.
Distributed model: P processors, each with its own local memory of size M.
[Chart: hardware trends; flop rates improve at about 59%/year, while bandwidths improve at only 23%-26%/year.]
Communication-minimizing algorithms: save time.
Communication-minimizing algorithms: save energy.
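To make the cost model concrete, here is a minimal Python sketch that estimates running time from the three machine parameters; the parameter values below are hypothetical placeholders, not measurements of any particular machine.

```python
# Minimal sketch of the cost model: time = gamma*#FLOPs + beta*#words + alpha*#messages.
def predicted_time(n_flops, n_words, n_messages,
                   gamma=1e-11,  # seconds per FLOP (hypothetical value)
                   beta=1e-9,    # seconds per word moved (hypothetical value)
                   alpha=1e-6):  # seconds per message (hypothetical value)
    return gamma * n_flops + beta * n_words + alpha * n_messages

# Classical n^3 matrix multiplication, assuming a communication-optimal sequential
# implementation that moves Theta(n^3 / M^(1/2)) words (see the bounds below),
# packed into messages of at most M words each.
n, M = 4096, 2**20
flops = 2 * n**3
words = n**3 / M**0.5
print(predicted_time(flops, words, n_messages=words / M))
```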

The Fastest Matrix Multiplication in the West [BDHLS SPAA'12]
[Plot: strong scaling on Franklin (Cray XT4), n = …; our algorithm is 24%-184% faster than previous algorithms. Legend: Ours [2012]; Strassen-based [1995]; ScaLAPACK; Classical-based [Solomonik-Demmel 2011].]

The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing'12]
[Plot: speedup example, rectangular matrix multiplication, 64 x k x 64, benchmarked on a 32-core shared-memory Intel machine. Legend: Intel's Math Kernel Library ("…Highest Performance and Scalability across Past, Present & Future Processors…", from Intel's Math Kernel Library); Our Algorithm.]

The Fastest Matrix Multiplication in the West [DEFKLSS, SuperComputing'12]
[Strong-scaling plot: speedup example, rectangular matrix multiplication, dimensions 192 x … x 192, benchmarked on Hopper (Cray XE6) at Lawrence Berkeley National Laboratory. Legend: Our Algorithm; Machine Peak; ScaLAPACK.]

Lower bounds and Algorithms
(1) Proving lower bounds on these communication costs.
(2) Developing faster algorithms by minimizing communication.
[Diagram: sequential model (CPU, fast memory M, RAM) and distributed model (CPUs, each with its own RAM).]

Today:
I. Model, motivation
II. Graph expansion analysis
III. Optimal parallel algorithms
Maybe:
IV. Energy
V. Obliviousness
VI. Scaling bounds
VII. Geometric arguments

Example: Classical Matrix Multiplication
Naïve implementation:
For i = 1 to n
  For j = 1 to n
    For k = 1 to n
      C(i, j) = C(i, j) + A(i, k) · B(k, j)
Bandwidth cost: BW = Θ(n³).
Can we do better?
[Diagram: the n x n matrices A, B, C; sequential model with CPU, fast memory M, RAM.]
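For reference, a direct (and deliberately naive) Python transcription of the triple loop above; it performs the same n³ multiply-adds, and once n² exceeds the fast-memory size its data movement grows like n³.

```python
def matmul_naive(A, B):
    """Classical O(n^3) matrix multiplication: the triple loop from the slide above."""
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                C[i][j] += A[i][k] * B[k][j]
    return C

assert matmul_naive([[1, 2], [3, 4]], [[5, 6], [7, 8]]) == [[19.0, 22.0], [43.0, 50.0]]
```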

Example: Classical Matrix Multiplication
Yes. Compute block-wise:
[BLAS]; [Cannon 69] for M = Θ(n²/P); [McColl-Tiskin 99, Solomonik-Demmel 11] for any M.
(M is the fast/local memory size, M < n².)
Can we do better?
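As an illustration of the block-wise idea in the sequential model, here is a minimal tiled matrix multiplication sketch; the block size b stands in for roughly (M/3)^(1/2), so that one block each of A, B and C fits in fast memory at a time (the particular b used below is an arbitrary choice for the test, not a tuned value).

```python
import numpy as np

def matmul_blocked(A, B, b):
    """Tiled classical matrix multiplication: work on b-by-b blocks, so that three
    blocks (one each of A, B, C) fit in fast memory of size M, i.e. b ~ (M/3)^(1/2).
    This moves O(n^3 / b) = O(n^3 / M^(1/2)) words instead of Theta(n^3)."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

# Sanity check against NumPy's own multiplication.
rng = np.random.default_rng(0)
A, B = rng.standard_normal((256, 256)), rng.standard_normal((256, 256))
assert np.allclose(matmul_blocked(A, B, b=64), A @ B)
```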

No!
Lower bounds for "3-nested-loops" algorithms:
[Hong & Kung 81]: sequential mat-mul.
[Irony, Toledo, Tiskin 04]: sequential and parallel mat-mul.

Communication Lower Bounds
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

LU and Cholesky decompositions
LU decomposition of A: A = P·L·U, where U is upper triangular, L is lower triangular, and P is a permutation matrix.
Cholesky decomposition of A: if A is a real symmetric positive definite matrix, then there exists a real lower triangular L such that A = L·L^T (L is unique if we restrict its diagonal elements to be positive).
Efficient parallel and sequential algorithms?
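A minimal numerical illustration of the Cholesky factorization, using NumPy's built-in routine (the matrix below is a hypothetical example, constructed to be symmetric positive definite):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((4, 4))
A = X @ X.T + 4 * np.eye(4)        # symmetric positive definite by construction

L = np.linalg.cholesky(A)          # lower-triangular Cholesky factor
assert np.allclose(L @ L.T, A)     # A = L * L^T
assert np.all(np.diag(L) > 0)      # this is the unique factor with positive diagonal
```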

Lower bounds for matrix multiplication
Bandwidth: [Hong & Kung 81], sequential: BW = Ω(n³ / M^(1/2)); [Irony, Toledo, Tiskin 04], sequential and parallel: BW = Ω(n³ / (P·M^(1/2))).
Latency: divide by M, e.g. Ω(n³ / M^(3/2)) in the sequential model.

Reduction (1st approach) [Ballard, Demmel, Holtz, S. 2009a]
Thm: Cholesky and LU decompositions are (communication-wise) as hard as matrix multiplication.
Proof: by a reduction (from matrix multiplication) that preserves communication bandwidth, latency, and arithmetic.
Cor: any classical O(n³) algorithm for Cholesky and LU decomposition requires:
Bandwidth: Ω(n³ / M^(1/2))
Latency: Ω(n³ / M^(3/2))
(A similar corollary holds for the parallel model.)

Some Sequential Classical Cholesky Algorithms

Algorithm (data layout)                                   Bandwidth                 Latency          Cache-oblivious
Lower bound                                               Ω(n³/M^(1/2))             Ω(n³/M^(3/2))    -
Naive                                                     Θ(n³)                     Θ(n³/M)          yes
LAPACK (column-major)                                     O(n³/M^(1/2))             O(n³/M)          no
LAPACK (contiguous blocks)                                O(n³/M^(1/2))             O(n³/M^(3/2))    no
Rectangular recursive [Toledo 97] (column-major)          O(n³/M^(1/2) + n² log n)  Ω(n³/M)          yes
Rectangular recursive [Toledo 97] (contiguous blocks)     O(n³/M^(1/2) + n² log n)  Ω(n²)            yes
Square recursive [Ahmad, Pingali 00] (column-major)       O(n³/M^(1/2))             O(n³/M)          yes
Square recursive [Ahmad, Pingali 00] (contiguous blocks)  O(n³/M^(1/2))             O(n³/M^(3/2))    yes

Easier example: LU
It is easy to reduce matrix multiplication to LU decomposition, e.g. via the block identity
[I 0 -B; A I 0; 0 0 I] = [I 0 0; A I 0; 0 0 I] · [I 0 -B; 0 I A·B; 0 0 I],
so the product A·B can be read off the U factor (a numerical check follows below).
⇒ LU factorization can be used to perform matrix multiplication.
⇒ The communication lower bound for matrix multiplication applies to LU.
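A small NumPy check of that block identity, with hypothetical random A and B: the triangular factors are written down explicitly, and A·B appears as a block of U.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3
A, B = rng.standard_normal((n, n)), rng.standard_normal((n, n))
I, Z = np.eye(n), np.zeros((n, n))

# T = [I 0 -B; A I 0; 0 0 I] factors as T = L * U with
# L = [I 0 0; A I 0; 0 0 I] and U = [I 0 -B; 0 I A*B; 0 0 I].
T = np.block([[I, Z, -B], [A, I, Z], [Z, Z, I]])
L = np.block([[I, Z, Z], [A, I, Z], [Z, Z, I]])
U = np.block([[I, Z, -B], [Z, I, A @ B], [Z, Z, I]])

assert np.allclose(L @ U, T)                  # the factorization holds
assert np.allclose(U[n:2*n, 2*n:], A @ B)     # and A*B can be read off the U factor
```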

Can we do the same for Cholesky?
Is it easy to reduce matrix multiplication to Cholesky decomposition?
Problem: A·A^T appears in T (the matrix we would construct for the reduction), so perhaps all of the communication guaranteed by the lower bound already takes place while computing A·A^T.

So, can we do the same for Cholesky?
[Ballard, Demmel, Holtz, S. 2009]: Yes, but it is not as easy as for LU.
Proof: later.

Lower bounds for the hierarchy model
To reduce the hierarchy model to the sequential model: observe the communication between two adjacent levels.
Conclusion: any communication lower bound for the sequential model translates to a lower bound for the hierarchy model, and any CA algorithm for the sequential model translates to a CA algorithm for the hierarchy model; but not the other way around.
[Diagram: sequential model (CPU, cache, RAM) vs. a memory hierarchy with levels M₁, M₂, M₃, …, M_k = ∞; the reduction observes two consecutive levels.]

Next time:
Proving that your algorithm/implementation is as good as it gets.
Approaches:
1. Reduction [Ballard, Demmel, Holtz, S. 2009]
2. Geometric embedding [Irony, Toledo, Tiskin 04], [Ballard, Demmel, Holtz, S. 2011a]
3. Graph analysis [Hong & Kung 81], [Ballard, Demmel, Holtz, S. 2011b]

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Matrix multiplication form: ∀ (i, j) ∈ [n] x [n], C(i, j) = Σ_k A(i, k)·B(k, j).
Thm: if an algorithm agrees with this form (regardless of the order of computation), then
BW = Ω(n³ / M^(1/2)) in the sequential model, and
BW = Ω(n³ / (P·M^(1/2))) in the P-parallel model.

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
[Diagram: a run laid out along the time axis and partitioned into segments S₁, S₂, S₃, … of reads, writes, and FLOPs; example of a partition with M = 3.]
For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that #multiplications in S ≤ k.
4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · #mults / k.

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
Matrix multiplication form: ∀ (i, j), C(i, j) = Σ_k A(i, k)·B(k, j).
Volume of a box: V = x·y·z = (xz · zy · yx)^(1/2).
Thm (Loomis & Whitney, 1949): the volume of a 3D set V satisfies
V ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2).
[Diagram: a 3D set V and its three shadows ("A shadow", "B shadow", "C shadow") on the coordinate planes.]
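A quick numerical illustration of the Loomis-Whitney inequality on a random set of unit lattice cubes (the set itself is a hypothetical example): the number of cubes never exceeds the square root of the product of the three shadow areas.

```python
import numpy as np

rng = np.random.default_rng(0)
# A random 3D set T of unit lattice cubes inside a 10 x 10 x 10 box.
cubes = {tuple(c) for c in rng.integers(0, 10, size=(200, 3))}

# Its three 2D shadows (projections onto the coordinate planes).
shadow_yz = {(y, z) for (x, y, z) in cubes}
shadow_xz = {(x, z) for (x, y, z) in cubes}
shadow_xy = {(x, y) for (x, y, z) in cubes}

volume = len(cubes)
bound = (len(shadow_yz) * len(shadow_xz) * len(shadow_xy)) ** 0.5
assert volume <= bound     # Loomis-Whitney: |T| <= (|T_x| * |T_y| * |T_z|)^(1/2)
print(volume, bound)
```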

Geometric Embedding (2nd approach) [Irony, Toledo, Tiskin 04], based on [Loomis & Whitney 49]
[Diagram: a run partitioned along the time axis into segments S₁, S₂, S₃, … of reads, writes, and FLOPs; example of a partition with M = 3.]
For a given run (algorithm, machine, input):
1. Partition the computation into segments of M reads/writes each.
2. Any segment S has at most 3M inputs/outputs.
3. Show that #multiplications in S ≤ k.
4. The total communication is BW = (BW of one segment) · (#segments) ≥ M · #mults / k = M · n³ / k.
5. By Loomis-Whitney, k ≤ (3M)^(3/2), so BW ≥ M · n³ / (3M)^(3/2) = Ω(n³ / M^(1/2)).

From sequential lower bound to parallel lower bound
We showed: any classical O(n³) algorithm for matrix multiplication on the sequential model requires:
Bandwidth: Ω(n³ / M^(1/2))
Latency: Ω(n³ / M^(3/2))
Cor: any classical O(n³) algorithm for matrix multiplication on a P-processor machine (with balanced workload), using a 2D layout with M = O(n²/P), requires:
Bandwidth: Ω(n³ / (P·M^(1/2))) = Ω(n² / P^(1/2))
Latency: Ω(n³ / (P·M^(3/2))) = Ω(P^(1/2))
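A tiny helper that plugs numbers into those parallel bounds, ignoring constant factors and assuming the 2D layout M = n²/P; the values are only indicative.

```python
def parallel_matmul_lower_bounds(n, p):
    """Asymptotic (constant-free) communication lower bounds for classical O(n^3)
    matrix multiplication on p processors with a 2D layout, M = n^2 / p."""
    m = n * n / p                        # words of local memory per processor
    bandwidth = n**3 / (p * m**0.5)      # = n^2 / sqrt(p) words per processor
    latency = n**3 / (p * m**1.5)        # = sqrt(p) messages per processor
    return bandwidth, latency

print(parallel_matmul_lower_bounds(n=8192, p=1024))   # about 2.1e6 words, 32 messages
```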

From sequential lower bound to parallel lower bound
Proof: observe one processor.
Is it always true? Let Alg be an algorithm with communication lower bound B = B(n, M). Does every parallel implementation of Alg have a communication lower bound B'(n, M, P) = B(n, M) / P?
[Diagram: the A, B, and C shadows of the work assigned to one processor.]

Proof of the Loomis-Whitney inequality
T = 3D set of 1x1x1 cubes on the lattice; N = |T| = #cubes.
T_x = projection of T onto the x = 0 plane; N_x = |T_x| = #squares in T_x; same for T_y, N_y, etc.
Goal: N ≤ (N_x · N_y · N_z)^(1/2).
T(x=i) = subset of T with x = i; N(x=i) = |T(x=i)|, etc.
T(x=i | y) = projection of T(x=i) onto the y = 0 plane, and similarly T(x=i | z); note that N(x=i) ≤ N(x=i | y) · N(x=i | z).
N = Σ_i N(x=i)
  = Σ_i (N(x=i))^(1/2) · (N(x=i))^(1/2)
  ≤ Σ_i (N_x)^(1/2) · (N(x=i))^(1/2)
  ≤ (N_x)^(1/2) · Σ_i (N(x=i | y) · N(x=i | z))^(1/2)
  = (N_x)^(1/2) · Σ_i (N(x=i | y))^(1/2) · (N(x=i | z))^(1/2)
  ≤ (N_x)^(1/2) · (Σ_i N(x=i | y))^(1/2) · (Σ_i N(x=i | z))^(1/2)     [Cauchy-Schwarz]
  = (N_x)^(1/2) · (N_y)^(1/2) · (N_z)^(1/2).
[Diagram: the slab T(x=i) and its projections T(x=i | y) and T(x=i | z).]

Summary
- Motivation & model
- Lower bound techniques

Topics Will Include
Recent progress and open problems in fault tolerance and communication minimization, from theoretical and practical perspectives: matrix multiplication (classical and Strassen-like), dense and sparse linear algebra, FFT, sorting, graph algorithms, and dynamic programming.

Topics Will Include
Theory: communication lower bounds; communication-minimizing algorithms; algorithmic approaches for ensuring fault tolerance.
Practice: software (e.g., automating the construction of new algorithms with specific properties); hardware (e.g., what technological trends mean for algorithm design, and vice versa); applications (e.g., improving applications' performance using known algorithms, and application-motivated algorithmic research).

Algorithms for Supercomputers: Introduction. Oded Schwartz. Seminar: Sunday, 12-2pm, B410. Workshop: ? High performance. Fault tolerance.