Antisocial Parallelism: Avoiding, Hiding and Managing Communication Kathy Yelick Associate Laboratory Director of Computing Sciences Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley

Extra Work Can Improve Efficiency! Optimizing Sparse Matrix-Vector Multiply (fill) Example: 3x3 blocking – Logical grid of 3x3 cells – Fill in explicit zeros – Unroll 3x3 block multiplies – “Fill ratio” = 1.5 – Takes advantage of registers On Pentium III: 1.5x speedup! – Actual Mflop rate is 2.25x higher (the 1.5x speedup times the 1.5x fill ratio, since the raw rate counts the extra flops on filled zeros) See Eun-Jin Im's PhD thesis (Sparsity library) and Rich Vuduc's PhD thesis (OSKI library)
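As a rough illustration of the idea (a minimal sketch, not the Sparsity/OSKI code): store the matrix in 3x3 block-CSR format, padding partially full blocks with explicit zeros, and multiply with the 3x3 block fully unrolled so the block entries and the three x values can stay in registers.

import numpy as np

def spmv_bcsr_3x3(block_vals, block_cols, row_ptr, x):
    """SpMV y = A*x with A in 3x3 block-CSR (BCSR) format.
    block_vals[b] is a dense 3x3 block (explicit zeros included),
    block_cols[b] is its block-column index, and row_ptr[i]:row_ptr[i+1]
    lists the blocks in block-row i. The 3x3 multiply is unrolled by hand,
    mimicking the register-blocked inner kernel."""
    n_block_rows = len(row_ptr) - 1
    y = np.zeros(3 * n_block_rows)
    for i in range(n_block_rows):
        y0 = y1 = y2 = 0.0
        for b in range(row_ptr[i], row_ptr[i + 1]):
            V = block_vals[b]
            j = 3 * block_cols[b]
            x0, x1, x2 = x[j], x[j + 1], x[j + 2]   # reused 3 times each
            y0 += V[0, 0] * x0 + V[0, 1] * x1 + V[0, 2] * x2
            y1 += V[1, 0] * x0 + V[1, 1] * x1 + V[1, 2] * x2
            y2 += V[2, 0] * x0 + V[2, 1] * x1 + V[2, 2] * x2
        y[3 * i], y[3 * i + 1], y[3 * i + 2] = y0, y1, y2
    return y

Only one column index is stored per 3x3 block, which is where the index-overhead savings comes from despite the extra flops on the filled zeros.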

SpMV Performance (simple parallelization): out-of-the-box SpMV performance on a suite of 14 matrices. Scalability isn’t great. Is this performance good? [Bar chart comparing the naïve serial and naïve Pthreads versions; scale up to 4 GFlop/s.] See Sam Williams' PhD thesis and papers.

Auto-tuned SpMV Performance (architecture-specific optimizations): fully auto-tuned SpMV performance across the suite of matrices, including an SPE/local-store optimized version. Why do some optimizations work better on some architectures? [Bar chart legend: +Cache/LS/TLB blocking, +Matrix compression, +SW prefetching, +NUMA/affinity, Pthreads naïve, Naïve; scale up to 16 GFlop/s.]

The Roofline Performance Model (see Sam Williams' PhD thesis). [Roofline plot for a generic machine: attainable GFlop/s vs. actual flop:byte ratio, with compute ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP) and bandwidth ceilings (peak BW, w/out SW prefetch, w/out NUMA).] The roof structure is determined by the machine. The locations of the "posts" are determined by algorithmic intensity, which varies across algorithms and with bandwidth-reducing optimizations such as better cache reuse (tiling) and compression techniques. The bandwidth roof can use DRAM, network, disk, …
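A hedged sketch of the model's arithmetic (the peak numbers below are made-up placeholders, not measurements of any real machine): attainable performance is the minimum of the compute roof and bandwidth times intensity.

def roofline(ai, peak_gflops=75.0, peak_bw_gbs=20.0):
    """Attainable GFlop/s for a kernel with arithmetic intensity `ai`
    (flops per byte moved): min(compute roof, bandwidth * intensity).
    The peak values here are illustrative placeholders."""
    return min(peak_gflops, peak_bw_gbs * ai)

# SpMV-like kernels sit far left on the roofline (~0.1-0.25 flop/byte) and are
# bandwidth bound; dense matmul sits far right and is compute bound.
for ai in [1/8, 1/4, 1/2, 1, 2, 4, 8, 16]:
    print(f"ai={ai:6.3f} flop/byte -> attainable {roofline(ai):6.1f} GFlop/s")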

Roofline model for SpMV (matrix compression). Inherent FMA. Register blocking improves ILP, DLP, the flop:byte ratio, and the FP fraction of instructions. [Four roofline plots of attainable GFlop/s vs. flop:DRAM byte ratio, one per platform: Intel Xeon E5345 (Clovertown), AMD Opteron 2356 (Barcelona), Sun T2+ T5140 (Victoria Falls), IBM QS20 Cell Blade; each is annotated with its in-core ceilings (peak DP, mul/add imbalance, w/out SIMD, w/out ILP, w/out FMA, 25%/12% FP) and bandwidth ceilings (w/out SW prefetch, w/out NUMA, dataset fits in snoop filter, bank conflicts).]

Roofline model for SpMV (matrix compression). [The same four roofline plots, highlighting that SpMV lands on the bandwidth-limited part of each roof.] Performance is bandwidth limited: SpMV should run close to memory bandwidth, since the time to read the matrix is the major cost. Can we do better? Can we compute A^k·x with one read of A? If so, this would reduce the number of messages and reduce memory bandwidth.

Autotuning: Write Code Generators. Autotuners are code generators plus search. This avoids two unsolved compiler problems: dependence analysis and accurate performance models. Popular in libraries: ATLAS, FFTW, OSKI, … Work by Williams, Oliker, Shalf, Madduri, Kamil, Im, Ethier, … [Roofline plots for Xeon X5550 (Nehalem) and NVIDIA C2050 (Fermi): peak compute vs. algorithmic intensity (flops/word), with single-precision peak, double-precision peak, DP add-only, STREAM bandwidth, and device bandwidth ceilings.]
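A minimal sketch of the generate-and-search pattern (purely illustrative; real autotuners such as OSKI generate C variants offline and combine models with benchmarks): enumerate candidate block sizes, time each variant on the actual input, and keep the best.

import time
import numpy as np

def blocked_matmul(A, B, b):
    """One parameterized code variant: matmul tiled with block size b."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for i in range(0, n, b):
        for j in range(0, n, b):
            for k in range(0, n, b):
                C[i:i+b, j:j+b] += A[i:i+b, k:k+b] @ B[k:k+b, j:j+b]
    return C

def autotune(A, B, candidates=(16, 32, 64, 128)):
    """Empirical search: run every variant, return the fastest block size."""
    best_b, best_t = None, float("inf")
    for b in candidates:
        t0 = time.perf_counter()
        blocked_matmul(A, B, b)
        t = time.perf_counter() - t0
        if t < best_t:
            best_b, best_t = b, t
    return best_b

n = 256
A, B = np.random.rand(n, n), np.random.rand(n, n)
print("best block size:", autotune(A, B))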

Finding Good Performance is like Finding the Needle in a Haystack. OSKI sparse matrix library: offline search plus online evaluation; adding zeros can reduce storage in the blocked format. Heuristic: Effective_MFlops(r,c) = MFlops_dense(r,c) / Fill(r,c), i.e., the offline dense-profile performance of each r x c block size divided by the fill ratio measured on the user's matrix (here the TSOPF example). [Heat maps over r (number of rows) and c (number of columns); RB(5x7) selected for the sample matrix.] Work by Im, Vuduc, Williams, Kamil, Ho, Demmel, Yelick, …
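A sketch of that selection heuristic under stated assumptions (the dense Mflop/s profile would really come from an offline benchmark of each block size; here it is an arbitrary placeholder dictionary, and the sparsity pattern is a toy example):

def estimate_fill(coords, r, c):
    """Fill ratio for r x c register blocking: stored entries after padding
    blocks with explicit zeros, divided by true nonzeros. `coords` is a
    list of (row, col) nonzero positions."""
    nnz = len(coords)
    blocks = {(i // r, j // c) for i, j in coords}
    return len(blocks) * r * c / nnz

def pick_block_size(coords, dense_mflops):
    """Choose (r, c) maximizing MFlops_dense(r,c) / Fill(r,c)."""
    return max(dense_mflops,
               key=lambda rc: dense_mflops[rc] / estimate_fill(coords, *rc))

# Placeholder dense-profile numbers and a toy tridiagonal-like pattern.
dense_mflops = {(1, 1): 100, (2, 2): 180, (3, 3): 220, (4, 4): 240}
coords = [(i, i) for i in range(64)] + [(i, i + 1) for i in range(63)]
print("selected block size:", pick_block_size(coords, dense_mflops))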

OSKI and pOSKI: Auto-tuning Sparse Matrix Kernels. Our approach uses empirical modeling and search. [Flow diagram:] Off-line auto-tuning at install time: (1) build for the target architecture, (2) benchmark the generated code variants with a sample dense matrix to produce benchmark data. Run-time auto-tuning (analysis and modeling): (1) evaluate heuristic models using the user's matrix, the user's hints, history, and the workload from program monitoring, (2) select the data structure and code (r,c). To the user: a matrix handle for kernel calls.

SEJITS: Selective Embedded Just-in-Time Specialization (beyond Perl code generators). Armando Fox's group, including Shoaib Kamil and Michael Driscoll. [Framework diagram: a productivity app (.py) calls f() and B.h(); a DSEL compiler lowers them to .c, which cc/ld builds into an .so with caching; unspecialized code falls back to the interpreter on the same OS/HW.] ASP is SEJITS for Python. Python provides: front-end parsing, tools for building a strongly-typed IR, the visitor pattern for IR transformation to a backend AST, code "templates" (as in web apps), code generation, external compiler support (C++, CUDA, OpenMP, pthreads, MPI, Scala), and caching.

Lessons Learned. Optimizations (not all in OSKI): register blocking, loop unrolling, cache blocking, thread blocking, reordering, index compression, SIMDization, manual prefetch, NUMA ("PGAS" on node), matrix splitting, switch-to-dense, sparse/bit-masked register blocks; see http://bebop.berkeley.edu for papers. Straight-line code failed to work on SPICE ~10 years ago: with 64-bit instructions it needed 1 load (x), 1 store (y), and 1 op per element, versus 1 op plus only a fraction of a load/store depending on reuse. Good news: autotuning helps save programmer time. But the operation is bandwidth limited, even with hardware-oriented optimizations (NUMA, prefetch, SIMDization, threading); the rest is about matrix compression, a problem for both local memory and the network.

Avoiding Communication in Iterative Solvers. Consider sparse iterative methods for Ax=b, i.e., Krylov subspace methods: GMRES, CG, … Solve time is dominated by: sparse matrix-vector multiply (SpMV), which even on one processor is dominated by the "communication" time to read the matrix; and global collectives (reductions), which are latency limited. Can we lower the communication costs? Latency: reduce the number of messages by computing multiple reductions at once. Bandwidth to memory: compute Ax, A²x, …, A^k x with one read of A. Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin; see their two PhD theses for details.

Communication Avoiding Kernels: the Matrix Powers Kernel [Ax, A²x, …, A^k x]. Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]. Idea: pick up the parts of A and x that fit in fast memory and compute each of the k products there. Example: A tridiagonal (a 1D "grid"), n = 32, k = 3. [Figure: the dependency diagram for levels x, A·x, A²·x, A³·x over the 32 grid points.] The general idea works for any "well-partitioned" A.

Communication Avoiding Kernels (sequential case): the Matrix Powers Kernel [Ax, A²x, …, A^k x]. Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]. Sequential algorithm, example: A tridiagonal, n = 32, k = 3. [Figure: the grid is swept in four steps; each step loads one block of A and x and computes all k levels for that block.] Saves bandwidth (one read of A and x for k steps) and saves latency (fewer independent read events).
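A rough sequential sketch of this kernel for the tridiagonal case (an illustration, not the implementation from the cited theses): each block of rows is visited once, together with a ghost region of width k, and all k levels are computed for that block before moving on.

import numpy as np

def matrix_powers_tridiag(lower, diag, upper, x, k, block):
    """Compute V[j] = A^(j+1) x for j = 0..k-1, with tridiagonal A given by
    its three diagonals (lower/upper have length n-1). Each block of rows of
    A is read once; a ghost region of width k supplies the neighbors."""
    n = len(x)
    V = np.zeros((k, n))
    for lo in range(0, n, block):
        hi = min(lo + block, n)
        glo, ghi = max(lo - k, 0), min(hi + k, n)   # ghosted row range
        cur = x[glo:ghi].copy()
        for j in range(k):
            nxt = np.zeros_like(cur)
            for i in range(glo, ghi):               # local tridiagonal SpMV
                s = diag[i] * cur[i - glo]
                if i > 0 and i > glo:
                    s += lower[i - 1] * cur[i - 1 - glo]
                if i < n - 1 and i < ghi - 1:
                    s += upper[i] * cur[i + 1 - glo]
                nxt[i - glo] = s
            cur = nxt
            # Entries in [lo, hi) are exact at every level because the ghost
            # region is at least k wide (or clipped at a true boundary).
            V[j, lo:hi] = cur[lo - glo:hi - glo]
    return V

# Check against k explicit SpMVs (which read A k times).
n, k = 32, 3
rng = np.random.default_rng(0)
lower, diag, upper = rng.random(n - 1), rng.random(n), rng.random(n - 1)
A = np.diag(diag) + np.diag(lower, -1) + np.diag(upper, 1)
x = rng.random(n)
V = matrix_powers_tridiag(lower, diag, upper, x, k, block=8)
ref = x.copy()
for j in range(k):
    ref = A @ ref
    assert np.allclose(V[j], ref)
print("matrix powers kernel matches", k, "explicit SpMVs")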

Communication Avoiding Kernels (parallel case): the Matrix Powers Kernel [Ax, A²x, …, A^k x]. Replace k iterations of y = A·x with [Ax, A²x, …, A^k x]. Parallel algorithm, example: A tridiagonal, n = 32, k = 3. [Figure: the grid is split among Proc 1–4; each processor works on an (overlapping) trapezoid.] Saves latency (number of messages) but not bandwidth, and it adds redundant computation.

Matrix Powers Kernel on a General Matrix. Saves communication for "well-partitioned" matrices. Serial: O(1) moves of data vs. O(k). Parallel: O(log p) messages vs. O(k log p). For implicit memory management (caches), it uses a TSP algorithm for layout. Joint work with Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin.

A^k x has higher performance than Ax. [Figure: speedups on Intel Clovertown (8 core).] Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick.

Minimizing Communication of GMRES to solve Ax=b. GMRES: find x in span{b, Ab, …, A^k b} minimizing ||Ax − b||₂.

Standard GMRES:
  for i = 1 to k
    w = A · v(i-1)             … SpMV
    MGS(w, v(0), …, v(i-1))    … Modified Gram-Schmidt
    update v(i), H
  endfor
  solve LSQ problem with H

Communication-avoiding GMRES:
  W = [v, Av, A²v, …, A^k v]
  [Q, R] = TSQR(W)             … "Tall Skinny QR"
  build H from R
  solve LSQ problem with H

Sequential case: #words moved decreases by a factor of k. Parallel case: #messages decreases by a factor of k. Oops: W comes from the power method, so precision is lost! Mark Hoemmen, PhD thesis.

TSQR: An Architecture-Dependent Algorithm. W is split into row blocks W0, W1, W2, W3; each block gets a local QR, and the small R factors are combined by a reduction tree whose shape depends on the architecture. [Diagrams: Parallel – a binary tree (R00, R10, R20, R30 → R01, R11 → R02); Sequential – a streaming chain (R00 → R01 → R02 → R03); Dual core – a hybrid of the two.] The reduction tree can be chosen dynamically. Multicore / multisocket / multirack / multisite / out-of-core: ? Work by Laura Grigori, Jim Demmel, Mark Hoemmen, Julien Langou.
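A minimal numpy sketch of the one-level (flat) reduction variant, just to show the structure; the real algorithm keeps the implicit Q factors and picks the tree shape to match the machine.

import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of a tall-skinny W via block-local QRs plus one combine step.
    Each local QR could run on a different processor; only the small
    (ncols x ncols) R factors need to be communicated."""
    blocks = np.array_split(W, nblocks, axis=0)
    local_Rs = [np.linalg.qr(B, mode='r') for B in blocks]   # independent local QRs
    return np.linalg.qr(np.vstack(local_Rs), mode='r')       # reduce the R factors

W = np.random.rand(1000, 8)                  # tall and skinny
R_tsqr = tsqr_r(W)
R_ref = np.linalg.qr(W, mode='r')
# R is unique up to the signs of its rows, so compare absolute values.
print(np.allclose(np.abs(R_tsqr), np.abs(R_ref)))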

Matrix Powers Kernel (and TSQR) in GMRES. Jim Demmel, Mark Hoemmen, Marghoob Mohiyuddin, Kathy Yelick.

Communication-Avoiding Krylov Method (GMRES): performance on an 8-core Clovertown.

Towards Communication-Avoiding Compilers: Deconstructing 2.5D Matrix Multiply. [Figure: the i, j, k iteration space drawn as a cube with axes x, y, z.] Matrix multiplication code has a 3D iteration space, and each point in the space is a constant amount of computation (one multiply-add): for i, for j, for k: C[i,j] += A[i,k] · B[k,j]. These algorithms are not just "avoiding," they are "communication-optimal."

Generalizing Communication-Optimal Transformations to Arbitrary Loop Nests. 1.5D N-Body: replicate and reduce. The same idea (replicate and reduce) can be used on (direct) N-body code: a 1D decomposition becomes "1.5D." [Figure: speedup of 1.5D N-body over 1D.] Does this work in general? Yes, for certain loops and array expressions; it relies on a basic result in group theory. Compiler work TBD. "A Communication-Optimal N-Body Algorithm for Direct Interactions," Driscoll et al., IPDPS'13.

Generalizing Communication Lower Bounds and Optimal Algorithms. For serial matmul, we know #words_moved = Ω(n³ / M^(1/2)), attained by tile sizes M^(1/2) × M^(1/2). Where do all the 1/2's come from? Thm (Christ, Demmel, Knight, Scanlon, Yelick): for any program that "smells like" nested loops, accessing arrays with subscripts that are linear functions of the loop indices, #words_moved = Ω(#iterations / M^e) for some e we can determine. Thm (C/D/K/S/Y): under some assumptions, we can determine the optimal tile sizes. Long-term goal: all compilers should generate communication-optimal code from nested loops.

Communication Overlap Complements Avoidance. Even with communication-optimal algorithms (minimized bandwidth), there are still benefits to overlap and other techniques that speed up networks. "Communication Avoiding and Overlapping for Numerical Linear Algebra," Georganas et al., SC12.

Optimality of Communication Lower bounds, (matching) upper bounds (algorithms) and a question: Can we train compilers to do this? See:

Beyond Domain Decomposition: 2.5D Matrix Multiply on BG/P, 16K nodes / 64K cores, c = 16 copies. EuroPar'11 (Solomonik, Demmel); SC'11 paper (Solomonik, Bhatele, Demmel). Surprise: even matrix multiply had room for improvement. Idea: make copies of the C matrix (as in the prior 3D algorithm, but not as many). The result is provably optimal in communication. Lesson: never waste fast memory. Can we generalize this for compiler writers?

Towards Communication-Avoiding Compilers: Deconstructing 2.5D Matrix Multiply. Tiling the iteration space: compute a subcube; you will need the data on its faces (projections of the cube, i.e., subarrays). For s loops in the nest the iteration space is s-dimensional; for x-dimensional arrays, project to an x-dimensional space. [Figure: the i, j, k iteration cube.] Matrix multiplication code has a 3D iteration space, and each unit cube in the space is a constant amount of computation (one multiply-add): for i, for j, for k: C[i,j] += A[i,k] · B[k,j].

Deconstructing 2.5D Matrix Multiply (Solomonik & Demmel). Tiling in the k dimension: the k loop has dependencies because C (on top) is a left-hand-side variable (C += …). Advantages of tiling in k: more parallelism and less synchronization, and less communication. What happens to these dependencies? All dependencies are vertical, along the k dimension (updating the C matrix). [Figure: the i, j, k iteration cube with vertical dependence arrows.] Serial case: compute each vertical block column in order. Parallel case: a 2D algorithm (and compilers) never chops the k dimension; 2.5D or 3D algorithms assume + is associative and chop k, which implies replication of the C matrix.

Beyond Domain Decomposition. Much of the work on compilers is based on owner-computes; for MM that means dividing C into chunks and scheduling the movement of A and B. In this case domain decomposition becomes replication. Ways to compute a C "pencil" (using x for C[i,j] here, x += …): 1. serially; 2. parallel reduction; 3. parallel asynchronous (atomic) updates; 4. or any hybrid of these (see the sketch below for option 2). For what types / operators does this work? "+" must be associative for 1 and 2, with the rest of the RHS "simple", and also commutative for 3. This is the standard vectorization trick.
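A toy illustration of option 2 under the associativity assumption (not the 2.5D algorithm itself): the k range is split into c chunks, each chunk produces a partial copy of C independently, and the copies are summed at the end.

import numpy as np

def matmul_chop_k(A, B, c=4):
    """Chop the k dimension into c pieces; each piece computes a partial C
    (the replicated copy of C, in 2.5D terms), then reduce by summing.
    Correct because + is associative (and commutative)."""
    k_chunks = np.array_split(np.arange(A.shape[1]), c)
    partials = [A[:, idx] @ B[idx, :] for idx in k_chunks]  # independent work
    return sum(partials)                                    # the reduction

A, B = np.random.rand(64, 64), np.random.rand(64, 64)
print(np.allclose(matmul_chop_k(A, B), A @ B))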

Lower Bound Idea on C = A·B (Irony, Toledo, Tiskin). [Figure: a box of unit cubes with side lengths x, y, z and its three shadows.] The number of unit cubes in a box with side lengths x, y, and z is its volume, x·y·z = (#A squares · #B squares · #C squares)^(1/2) = (xz · zy · yx)^(1/2). For a general 3D set of points (i,j,k): (i,k) is in the "A shadow", (j,k) is in the "B shadow", and (i,j) is in the "C shadow" if (i,j,k) is in the set. Thm (Loomis & Whitney, 1949): #cubes in the 3D set = volume of the 3D set ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2).

Lower Bound: What is the minimum amount of communication required? Proof from Irony/Toledo/Tiskin (2004). Assume a fast memory of size M. Outline (big-O reasoning): [Figure: the instruction stream of loads, stores, and flops over time, cut into segments.] Segment the instruction stream so that each segment contains M loads/stores. Somehow bound the maximum number of flops that can be done in each segment; call it F. Then F · #segments ≥ T = total flops = 2·n³, so #segments ≥ T / F. So #loads & stores = M · #segments ≥ M · T / F. How much work (F) can we do with O(M) data?
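For matmul the answer follows from the Loomis-Whitney bound on the previous slide; a brief worked version of that step (constants dropped):

Within one segment at most O(M) entries of A, B, and C are available, so each shadow has area O(M) and
    F ≤ (area(A shadow) · area(B shadow) · area(C shadow))^(1/2) = O(M^(3/2)).
Plugging into the outline:
    #loads & stores ≥ M · T / F = M · 2n³ / O(M^(3/2)) = Ω(n³ / M^(1/2)).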

Recall optimal sequential matmul.

Naïve code:
  for i = 1:n, for j = 1:n, for k = 1:n,
    C(i,j) += A(i,k)*B(k,j)

"Blocked" code (the inner three loops are a b x b matmul):
  for i1 = 1:b:n, for j1 = 1:b:n, for k1 = 1:b:n,
    for i2 = 0:b-1, for j2 = 0:b-1, for k2 = 0:b-1,
      i = i1+i2, j = j1+j2, k = k1+k2
      C(i,j) += A(i,k)*B(k,j)

Thm: picking b = M^(1/2) attains the lower bound #words_moved = Ω(n³ / M^(1/2)). Where does the 1/2 come from? Can we compute these exponents for arbitrary programs?
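A small sketch of the traffic model behind that theorem (an analytic count, not a measurement; the factor of 3 for co-resident tiles is an assumption of this sketch):

def words_moved_blocked(n, M):
    """Blocked matmul traffic model: choose b so three b x b tiles fit in fast
    memory (3*b*b <= M); each of the (n/b)^3 tile multiplies touches ~3*b*b
    words, giving ~3*n^3/b = O(n^3 / M^(1/2)) words moved in total."""
    b = int((M / 3) ** 0.5)
    return 3 * (n / b) ** 3 * b * b

n = 4096
for M in (2**10, 2**14, 2**18):
    print(f"M={M:7d}  words moved ~ {words_moved_blocked(n, M):.3e}")

Growing M by 16x cuts the modeled traffic by about 4x, the 1/sqrt(M) scaling in the bound.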

Generalizing Communication Lower Bounds and Optimal Algorithms. For serial matmul, we know #words_moved = Ω(n³ / M^(1/2)), attained by tile sizes M^(1/2) × M^(1/2). Thm (Christ, Demmel, Knight, Scanlon, Yelick): for any program that "smells like" nested loops, accessing arrays with subscripts that are linear functions of the loop indices, #words_moved = Ω(#iterations / M^e) for some e we can determine. Thm (C/D/K/S/Y): under some assumptions, e.g., index expressions are just subsets of the indices, we can determine the optimal tile sizes. Long-term goal: all compilers should generate communication-optimal code from nested loops.

New Theorem applied to Matmul:
  for i=1:n, for j=1:n, for k=1:n, C(i,j) += A(i,k)*B(k,j)
Record the array indices in a matrix Δ (columns i, j, k):
  Δ = [ 1 0 1 ]  (A: i,k)
      [ 0 1 1 ]  (B: k,j)
      [ 1 1 0 ]  (C: i,j)
Solve the LP for x = [xi, xj, xk]^T: max 1^T x s.t. Δx ≤ 1. Result: x = [1/2, 1/2, 1/2]^T, 1^T x = 3/2 = s_HBL. Thm: #words_moved = Ω(n³ / M^(s_HBL − 1)) = Ω(n³ / M^(1/2)), attained by block sizes M^xi, M^xj, M^xk = M^(1/2), M^(1/2), M^(1/2).
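A hedged sketch of that LP using scipy (linprog minimizes, so the objective is negated); the Δ below is the matmul one from this slide, and the same few lines apply to the Δ matrices on the following slides.

import numpy as np
from scipy.optimize import linprog

# Rows of Delta are array references, columns are loop indices (i, j, k):
# A(i,k) -> [1,0,1], B(k,j) -> [0,1,1], C(i,j) -> [1,1,0].
Delta = np.array([[1, 0, 1],
                  [0, 1, 1],
                  [1, 1, 0]])

# maximize 1^T x  s.t.  Delta x <= 1, x >= 0   (linprog minimizes -1^T x)
res = linprog(c=-np.ones(3), A_ub=Delta, b_ub=np.ones(3), bounds=(0, None))
x = res.x
s_hbl = x.sum()
print("x =", np.round(x, 3), " s_HBL =", s_hbl)   # expect [0.5 0.5 0.5], 1.5
# Lower bound: #words_moved = Omega(n^3 / M^(s_HBL - 1)) = Omega(n^3 / M^(1/2))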

New Theorem applied to Direct N-Body:
  for i=1:n, for j=1:n, F(i) += force(P(i), P(j))
Record the array indices in Δ (columns i, j):
  Δ = [ 1 0 ]  (F: i)
      [ 1 0 ]  (P(i): i)
      [ 0 1 ]  (P(j): j)
Solve the LP for x = [xi, xj]^T: max 1^T x s.t. Δx ≤ 1. Result: x = [1, 1], 1^T x = 2 = s_HBL. Thm: #words_moved = Ω(n² / M^(s_HBL − 1)) = Ω(n² / M^1), attained by block sizes M^xi, M^xj = M^1, M^1.

New Theorem applied to Random Code:
  for i1=1:n, for i2=1:n, …, for i6=1:n
    A1(i1,i3,i6) += func1(A2(i1,i2,i4), A3(i2,i3,i5), A4(i3,i4,i6))
    A5(i2,i6)    += func2(A6(i1,i4,i5), A3(i3,i4,i6))
Record the array indices in Δ (columns i1 … i6; note A3 appears with two different subscript patterns, the second shared with A4):
              i1 i2 i3 i4 i5 i6
  A1        [  1  0  1  0  0  1 ]
  A2        [  1  1  0  1  0  0 ]
  A3        [  0  1  1  0  1  0 ]
  A3,A4     [  0  0  1  1  0  1 ]
  A5        [  0  1  0  0  0  1 ]
  A6        [  1  0  0  1  1  0 ]
Solve the LP for x = [x1, …, x6]^T: max 1^T x s.t. Δx ≤ 1. Result: x = [2/7, 3/7, 1/7, 2/7, 3/7, 4/7], 1^T x = 15/7 = s_HBL. Thm: #words_moved = Ω(n⁶ / M^(s_HBL − 1)) = Ω(n⁶ / M^(8/7)), attained by block sizes M^(2/7), M^(3/7), M^(1/7), M^(2/7), M^(3/7), M^(4/7).

General Communication Bound. Given S ⊆ Z^k and group homomorphisms φ_1, φ_2, …, bound |S| in terms of |φ_1(S)|, |φ_2(S)|, …, |φ_m(S)|. Def (Hölder-Brascamp-Lieb LP, HBL-LP) for s_1, …, s_m: for all subgroups H < Z^k, rank(H) ≤ Σ_j s_j · rank(φ_j(H)). Thm (Christ/Tao/Carbery/Bennett): given such s_1, …, s_m, |S| ≤ Π_j |φ_j(S)|^(s_j). Thm: given a program with array references given by the φ_j, choose the s_j to minimize s_HBL = Σ_j s_j subject to the HBL-LP; then #words_moved = Ω(#iterations / M^(s_HBL − 1)).

Comments. Attainability depends on loop dependencies; the best case is none, or associative operators (matmul, n-body). Thm: when every φ_j selects a subset of the indices, the dual of the HBL-LP gives the optimal tile sizes. HBL-LP: minimize 1^T s s.t. s^T Δ ≥ 1^T. Dual HBL-LP: maximize 1^T x s.t. Δx ≤ 1. Then, for the sequential algorithm, tile index i_j by M^(x_j). Ex (matmul): s = [1/2, 1/2, 1/2]^T = x. Generality: extends to unimodular transforms of the indices; does not require arrays (as long as the data structures are injective containers); does not require loops, as long as they model the computation.

Conclusions. Communication is expensive and its (relative) cost is growing: avoid bandwidth (data volume); hide latency or reduce the number of messages. Conceptual model: think of the computation as a set of points in a volume in d-space (d = number of loops in the nest); what is the maximum amount of work you can do for a fixed surface area? Theory: lower bounds are useful for understanding limits; many programs (index expressions) are still open for upper bounds.

Bonus Slide #1: Beyond UPC. DAG scheduling in a distributed (partitioned) memory context: the assignment of work is static; the schedule is dynamic. Ordering needs to be imposed on the schedule; the critical-path operation is panel factorization. General issue: dynamic scheduling in partitioned memory can deadlock in memory allocation, hence "memory-constrained" lookahead. [DAG figure, some edges omitted.] Uses a Berkeley extension to UPC to synchronize remotely.

Bonus slide #2: Emerging Fast Forward Exascale Node Architecture. System-on-chip (SoC) design is coming into focus. [Node diagram: memory stacks on package (low capacity, high bandwidth); latency-optimized fat cores; DRAM/DIMM memory (high capacity, low bandwidth); NIC on board; NVRAM as burst buffers / rack-local storage.] Slide from John Shalf.