CS4230 Parallel Programming Lecture 8: Dense Linear Algebra and Locality Optimizations
Mary Hall, September 13, 2012

Presentation transcript:

CS4230 Parallel Programming Lecture 8: Dense Linear Algebra and Locality Optimizations. Mary Hall, September 13, 2012.

Administrative
- I will be on travel again Tuesday, September 18
- Axel will provide the algorithm description for Singular Value Decomposition, which is our next programming assignment

Today's Lecture
- Dense Linear Algebra, from video
- Locality
  - Data reuse vs. data locality
  - Reordering transformations for locality
- Sources for this lecture:
  - Notes on website

Back to basics: Why avoiding communication is important (1/2)
Algorithms have two costs:
1. Arithmetic (FLOPs)
2. Communication: moving data between
  - levels of a memory hierarchy (sequential case)
  - processors over a network (parallel case)
[Figure: a sequential machine (CPU, cache, DRAM) and a parallel machine (CPUs with local DRAM connected over a network)]
Slide source: Jim Demmel, CS267 Lecture 11

Why avoiding communication is important (2/2)
Running time of an algorithm is the sum of 3 terms:
- # flops * time_per_flop
- # words moved / bandwidth    (communication)
- # messages * latency         (communication)
Time_per_flop << 1/bandwidth << latency, and the gaps are growing exponentially with time.

Annual improvements:
  Time_per_flop: 59%
  Network:  bandwidth 26%, latency 15%
  DRAM:     bandwidth 23%, latency 5%

Goal: organize linear algebra to avoid communication
- Between all memory hierarchy levels: L1, L2, DRAM, network, etc.
- Not just hiding communication (overlapping it with arithmetic), which gives at most ~2x speedup
- Arbitrary speedups possible
Slide source: Jim Demmel, CS267
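To make the three-term model concrete, here is a minimal C sketch of the cost formula above (the function name and any parameter values a caller might pass are illustrative, not measured numbers):

  // time = #flops * time_per_flop + #words_moved / bandwidth + #messages * latency
  double model_time(double flops, double words_moved, double messages,
                    double time_per_flop, double bandwidth, double latency)
  {
      double arithmetic    = flops * time_per_flop;
      double communication = words_moved / bandwidth + messages * latency;
      return arithmetic + communication;
  }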

Review: Naïve Sequential MatMul: C = C + A*B

  for i = 1 to n
    {read row i of A into fast memory}          ... n^2 reads
    for j = 1 to n
      {read C(i,j) into fast memory}            ... n^2 reads
      {read column j of B into fast memory}     ... n^3 reads
      for k = 1 to n
        C(i,j) = C(i,j) + A(i,k) * B(k,j)
      {write C(i,j) back to slow memory}        ... n^2 writes

[Figure: C(i,j) = C(i,j) + A(i,:) * B(:,j)]
n^3 + O(n^2) reads/writes altogether
Slide source: Jim Demmel, CS267 Lecture 11
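A minimal C sketch of the same naïve triple loop (0-indexed, row-major storage; the function name is illustrative):

  // Naive matrix multiply: C = C + A*B for n x n row-major matrices.
  // The inner k loop walks a column of B with stride n, so B gets
  // almost no cache reuse, matching the n^3 reads of B on the slide.
  void matmul_naive(int n, const double *A, const double *B, double *C)
  {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++) {
              double cij = C[i*n + j];            // read C(i,j)
              for (int k = 0; k < n; k++)
                  cij += A[i*n + k] * B[k*n + j];
              C[i*n + j] = cij;                   // write C(i,j) back
          }
  }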

Less Communication with Blocked Matrix Multiply
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

  ... Break A (n x n), B (n x n), C (n x n) into b x b blocks labeled A(i,j), etc.
  ... b chosen so 3 b x b blocks fit in cache
  for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
    C(i,j) = C(i,j) + A(i,k)·B(k,j)   ... b x b matmul, 4b^2 reads/writes

(n/b)^3 · 4b^2 = 4n^3/b reads/writes altogether
Minimized when 3b^2 = cache size = M, yielding O(n^3 / M^(1/2)) reads/writes
What if we had more levels of memory (L1, L2, cache, etc.)? Would need 3 more nested loops per level.
Slide source: Jim Demmel, CS267 Lecture 11
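A C sketch of the blocked version (the function name is illustrative; for brevity it assumes the block size b divides n):

  // Blocked matrix multiply: C = C + A*B using b x b blocks.
  // Each (ii, jj, kk) iteration does a b x b matmul whose three blocks
  // fit in cache when 3*b*b doubles fit, so data is reused from fast memory.
  void matmul_blocked(int n, int b, const double *A, const double *B, double *C)
  {
      for (int ii = 0; ii < n; ii += b)
          for (int jj = 0; jj < n; jj += b)
              for (int kk = 0; kk < n; kk += b)
                  for (int i = ii; i < ii + b; i++)          // b x b matmul
                      for (int j = jj; j < jj + b; j++) {
                          double cij = C[i*n + j];
                          for (int k = kk; k < kk + b; k++)
                              cij += A[i*n + k] * B[k*n + j];
                          C[i*n + j] = cij;
                      }
  }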

Blocked vs Cache-Oblivious Algorithms
Blocked Matmul C = A·B explicitly refers to subblocks of A, B and C of dimensions that depend on cache size

  ... Break A (n x n), B (n x n), C (n x n) into b x b blocks labeled A(i,j), etc.
  ... b chosen so 3 b x b blocks fit in cache
  for i = 1 to n/b, for j = 1 to n/b, for k = 1 to n/b
    C(i,j) = C(i,j) + A(i,k)·B(k,j)   ... b x b matmul
  ... another level of memory would need 3 more loops

Cache-oblivious Matmul C = A·B is independent of cache

  Function C = RMM(A,B)   ... R for recursive
    If A and B are 1x1
      C = A · B
    else
      ... Break A (n x n), B (n x n), C (n x n) into (n/2)x(n/2) blocks labeled A(i,j), etc.
      for i = 1 to 2, for j = 1 to 2, for k = 1 to 2
        C(i,j) = C(i,j) + RMM( A(i,k), B(k,j) )   ... n/2 x n/2 matmul

Slide source: Jim Demmel, CS267 Lecture 11
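A compact recursive C sketch of the cache-oblivious idea (a simplifying assumption: n is a power of two; ld is the row stride of the full matrices so subblocks can be addressed without copying; names are illustrative):

  // Cache-oblivious recursive matmul: C = C + A*B on an n x n submatrix
  // of row-major matrices with leading dimension ld.
  void rmm(int n, int ld, const double *A, const double *B, double *C)
  {
      if (n == 1) {                 // 1x1 base case
          C[0] += A[0] * B[0];
          return;
      }
      int h = n / 2;
      // Subblock (i,j) of an n x n matrix starts at offset i*h*ld + j*h.
      for (int i = 0; i < 2; i++)
          for (int j = 0; j < 2; j++)
              for (int k = 0; k < 2; k++)
                  rmm(h, ld, A + i*h*ld + k*h,
                             B + k*h*ld + j*h,
                             C + i*h*ld + j*h);
  }

For the full matrices the call would be rmm(n, n, A, B, C); a practical version stops recursing at a small block size rather than 1x1.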

Communication Lower Bounds: Prior Work on Matmul
Assume an n^3 algorithm (i.e. not Strassen-like)
- Sequential case, with fast memory of size M
  - Lower bound on #words moved to/from slow memory = Ω(n^3 / M^(1/2))  [Hong & Kung, 1981]
  - Attained using blocked or cache-oblivious algorithms
- Parallel case on p processors; let M be memory per processor, and assume load balance
  - Lower bound on #words moved = Ω(n^3 / (p · M^(1/2)))  [Irony, Tiskin, Toledo, 2004]
  - If M = 3n^2/p (one copy of each matrix), then the lower bound = Ω(n^2 / p^(1/2))
  - Attained by SUMMA and Cannon's algorithm
Slide source: Jim Demmel, CS267 Lecture 11

New lower bound for all "direct" linear algebra
Let M = "fast" memory size per processor = cache size (sequential case) or O(n^2/p) (parallel case), and #flops = number of flops done per processor. Then:
- #words_moved per processor = Ω(#flops / M^(1/2))
- #messages_sent per processor = Ω(#flops / M^(3/2))
Holds for
- Matmul, BLAS, LU, QR, eig, SVD, tensor contractions, ...
- Some whole programs (sequences of these operations, no matter how they are interleaved, e.g. computing A^k)
- Dense and sparse matrices (where #flops << n^3)
- Sequential and parallel algorithms
- Some graph-theoretic algorithms (e.g. Floyd-Warshall)
Proof later
Slide source: Jim Demmel, CS267 Lecture 11

New lower bound for all "direct" linear algebra
With M = "fast" memory size per processor, #words_moved per processor = Ω(#flops / M^(1/2)) and #messages_sent per processor = Ω(#flops / M^(3/2)):
- Sequential case, dense n x n matrices, so O(n^3) flops
  - #words_moved = Ω(n^3 / M^(1/2))
  - #messages_sent = Ω(n^3 / M^(3/2))
- Parallel case, dense n x n matrices
  - Load balanced, so O(n^3/p) flops per processor
  - One copy of data, load balanced, so M = O(n^2/p) per processor
  - #words_moved = Ω(n^2 / p^(1/2))
  - #messages_sent = Ω(p^(1/2))
Slide source: Jim Demmel, CS267 Lecture 11
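The parallel bounds on this slide follow from the general formulas by substitution (shown here, in the slide's notation, only as a worked check):

  #words_moved   = Ω( (n^3/p) / M^(1/2) ) = Ω( (n^3/p) / (n^2/p)^(1/2) ) = Ω( n^2 / p^(1/2) )
  #messages_sent = Ω( (n^3/p) / M^(3/2) ) = Ω( (n^3/p) / (n^2/p)^(3/2) ) = Ω( p^(1/2) )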

Can we attain these lower bounds?
- Do conventional dense algorithms as implemented in LAPACK and ScaLAPACK attain these bounds?
  - Mostly not
- If not, are there other algorithms that do?
  - Yes
- Goals for algorithms:
  - Minimize #words_moved
  - Minimize #messages_sent
    - Need new data structures
  - Minimize for multiple memory hierarchy levels
    - Cache-oblivious algorithms would be simplest
  - Fewest flops when matrix fits in fastest memory
    - Cache-oblivious algorithms don't always attain this
- Attainable for nearly all dense linear algebra
  - Just a few prototype implementations so far (class projects!)
  - Only a few sparse algorithms so far (e.g. Cholesky)

Data Dependence and Related Definitions
- Definition: Two memory accesses are involved in a data dependence if they may refer to the same memory location and one of the references is a write.
- A data dependence can either be between two distinct program statements or two different dynamic executions of the same program statement.
- Source: "Optimizing Compilers for Modern Architectures: A Dependence-Based Approach", Allen and Kennedy, 2002.
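A small illustrative C loop (not from the slides) makes the definition concrete: the write of A[i] in one iteration and the read of A[i-1] in the next may refer to the same location, so two dynamic executions of the same statement are involved in a dependence.

  // Iteration i reads A[i-1], which iteration i-1 wrote: a flow
  // dependence between dynamic executions of the same statement.
  for (int i = 1; i < n; i++)
      A[i] = A[i-1] + B[i];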

Fundamental Theorem of Dependence
- Theorem 2.2: Any reordering transformation that preserves every dependence in a program preserves the meaning of that program.

In this course, we consider two kinds of reordering transformations
- Parallelization
  - Computations that execute in parallel between synchronization points are potentially reordered. Is that reordering safe? According to our definition, it is safe if it preserves the dependences in the code.
- Locality optimizations
  - Suppose we want to modify the order in which a computation accesses memory so that it is more likely to be in cache. This is also a reordering transformation, and it is safe if it preserves the dependences in the code.
- Reduction computations
  - We have to relax this rule for reductions. It is safe to reorder reductions for commutative and associative operations (see the sketch after this slide).
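A minimal C sketch of a reduction (illustrative, not from the slides): every iteration reads and writes the same accumulator, so strictly every pair of iterations is dependent, yet any reordering computes the same sum because addition is commutative and associative (up to floating-point rounding).

  // Summation reduction: safe to reorder or parallelize the iterations,
  // even though they all update the shared accumulator "sum".
  double sum_reduce(int n, const double *x)
  {
      double sum = 0.0;
      for (int i = 0; i < n; i++)
          sum += x[i];
      return sum;
  }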

Targets of Memory Hierarchy Optimizations
- Reduce memory latency
  - The latency of a memory access is the time (usually in cycles) between a memory request and its completion
- Maximize memory bandwidth
  - Bandwidth is the amount of useful data that can be retrieved over a time interval
- Manage overhead
  - Cost of performing optimization (e.g., copying) should be less than the anticipated gain

Reuse and Locality
- Consider how data is accessed
  - Data reuse:
    - Same or nearby data used multiple times
    - Intrinsic in computation
  - Data locality:
    - Data is reused and is present in "fast memory"
    - Same data or same data transfer
- If a computation has reuse, what can we do to get locality?
  - Appropriate data placement and layout
  - Code reordering transformations

Exploiting Reuse: Locality Optimizations
- We will study a few loop transformations that reorder memory accesses to improve locality. These transformations are also useful for parallelization (to be discussed later).
- Two key questions:
  - Safety:
    - Does the transformation preserve dependences?
  - Profitability:
    - Is the transformation likely to be profitable?
    - Will the gain be greater than the overheads (if any) associated with the transformation?

Loop Transformations: Loop Permutation
Permute the order of the loops to modify the traversal order:

  for (j=0; j<6; j++)
    for (i=0; i<3; i++)
      A[i][j+1] = A[i][j] + B[j];

  for (i=0; i<3; i++)
    for (j=0; j<6; j++)
      A[i][j+1] = A[i][j] + B[j];

[Figure: iteration-space diagrams over i and j showing the new traversal order]
NOTE: C multi-dimensional arrays are stored in row-major order, Fortran in column-major.

Tiling (Blocking): Another Loop Reordering Transformation
- Blocking reorders loop iterations to bring iterations that reuse data closer in time
- Goal is to retain data in cache/register/scratchpad (or other constrained memory structure) between reuses
[Figure: iteration space over I and J, untiled vs. tiled traversal order]

Tiling Example

Original:
  for (j=1; j<M; j++)
    for (i=1; i<N; i++)
      D[i] = D[i] + B[j][i];

Strip mine:
  for (j=1; j<M; j++)
    for (ii=1; ii<N; ii+=s)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];

Permute:
  for (ii=1; ii<N; ii+=s)
    for (j=1; j<M; j++)
      for (i=ii; i<min(ii+s,N); i++)
        D[i] = D[i] + B[j][i];
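A compilable C version of the permuted (tiled) loop nest (0-indexed, with B stored row-major as a flat array; the min helper and function name are illustrative):

  static inline int imin(int a, int b) { return a < b ? a : b; }

  // Tiled update: each strip of s elements of D stays in cache while all
  // M rows of B stream past it, so D is reloaded once per strip instead
  // of once per (j, i) pair.
  void tiled_update(int M, int N, int s, double *D, const double *B)
  {
      for (int ii = 0; ii < N; ii += s)
          for (int j = 0; j < M; j++)
              for (int i = ii; i < imin(ii + s, N); i++)
                  D[i] += B[j*N + i];          // B[j][i] in row-major
  }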

Unroll, Unroll-and-Jam
- Unroll simply replicates the statements in a loop, with the number of copies called the unroll factor
- As long as the copies don't go past the iterations in the original loop, it is always safe
  - May require "cleanup" code
- Unroll-and-jam involves unrolling an outer loop and fusing together the copies of the inner loop (not always safe)
- One of the most effective optimizations there is, but there is a danger in unrolling too much

Original:
  for (i=0; i<4; i++)
    for (j=0; j<8; j++)
      A[i][j] = B[j+1][i];

Unroll j:
  for (i=0; i<4; i++)
    for (j=0; j<8; j+=2) {
      A[i][j]   = B[j+1][i];
      A[i][j+1] = B[j+2][i];
    }

Unroll-and-jam i:
  for (i=0; i<4; i+=2)
    for (j=0; j<8; j++) {
      A[i][j]   = B[j+1][i];
      A[i+1][j] = B[j+1][i+1];
    }

How does Unroll-and-Jam benefit locality?
- Temporal reuse of B in registers (e.g., B[j+1][i+1] appears in two of the unrolled statements)
- More if the i loop is unrolled further

Original:
  for (i=0; i<4; i++)
    for (j=0; j<8; j++)
      A[i][j] = B[j+1][i] + B[j+1][i+1];

Unroll-and-jam i and j loops:
  for (i=0; i<4; i+=2)
    for (j=0; j<8; j+=2) {
      A[i][j]     = B[j+1][i]   + B[j+1][i+1];
      A[i+1][j]   = B[j+1][i+1] + B[j+1][i+2];
      A[i][j+1]   = B[j+2][i]   + B[j+2][i+1];
      A[i+1][j+1] = B[j+2][i+1] + B[j+2][i+2];
    }

Other advantages of Unroll-and-Jam
- Less loop control
- Independent computations for instruction-level parallelism

Original:
  for (i=0; i<4; i++)
    for (j=0; j<8; j++)
      A[i][j] = B[j+1][i] + B[j+1][i+1];

Unroll-and-jam i and j loops:
  for (i=0; i<4; i+=2)
    for (j=0; j<8; j+=2) {
      A[i][j]     = B[j+1][i]   + B[j+1][i+1];
      A[i+1][j]   = B[j+1][i+1] + B[j+1][i+2];
      A[i][j+1]   = B[j+2][i]   + B[j+2][i+1];
      A[i+1][j+1] = B[j+2][i+1] + B[j+2][i+2];
    }

[Figure: four independent A = B + B computations in the unrolled loop body]

How to determine safety of reordering transformations
- Informally
  - Must preserve relative order of dependence source and sink
  - So, cannot reverse their order
- Formally
  - Tracking dependences

Safety of Permutation
- Cannot reverse any dependences
- OK to permute?

  for (i=0; i<3; i++)
    for (j=0; j<6; j++)
      A[i][j+1] = A[i][j] + B[j];

  for (i=0; i<3; i++)
    for (j=1; j<6; j++)
      A[i+1][j-1] = A[i][j] + B[j];

Safety of Tiling
- Tiling = strip-mine and permutation
  - Strip-mine does not reorder iterations
  - Permutation must be legal, OR
  - strip size less than dependence distance

Safety of Unroll-and-Jam
- Unroll-and-jam = tile + unroll
  - Permutation must be legal, OR
  - unroll factor less than dependence distance

Unroll-and-jam = tile + unroll?

Original:
  for (i=0; i<4; i++)
    for (j=0; j<8; j++)
      A[i][j] = B[j+1][i];

Tile i loop:
  for (ii=0; ii<4; ii+=2)
    for (j=0; j<8; j++)
      for (i=ii; i<ii+2; i++)
        A[i][j] = B[j+1][i];

Unroll i tile:
  for (ii=0; ii<4; ii+=2)
    for (j=0; j<8; j++) {
      A[ii][j]   = B[j+1][ii];
      A[ii+1][j] = B[j+1][ii+1];
    }