DEPENDENCE-DRIVEN LOOP MANIPULATION
Based on notes by David Padua, University of Illinois at Urbana-Champaign

DEPENDENCES

DEPENDENCES (continued)

DEPENDENCE AND PARALLELIZATION (SPREADING)

OpenMP Implementation

RENAMING: To Remove Memory-Related Dependences

DEPENDENCES IN LOOPS

DEPENDENCES IN LOOPS (continued)

DEPENDENCE ANALYSIS

DEPENDENCE ANALYSIS (continued)

LOOP PARALLELIZATION AND VECTORIZATION
A loop whose dependence graph is cycle-free can be parallelized or vectorized. The reason is that if there are no cycles in the dependence graph, there will be no races in the parallel loop.
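
A minimal sketch of such a loop (not from the original slides; the array and function names are illustrative). Every value read by S2 is produced by S1 in the same iteration, so the dependence graph has no cycles and the loop can run as a parallel (or vector) loop:

  /* Sketch only: cycle-free dependence graph, hypothetical arrays. */
  void parallel_loop_sketch(int n, double *a, double *d,
                            const double *b, const double *c, const double *x)
  {
      #pragma omp parallel for        /* safe: no cross-iteration dependences */
      for (int i = 0; i < n; i++) {
          a[i] = b[i] + c[i];         /* S1 */
          d[i] = a[i] + x[i];         /* S2 reads a[i] from the same iteration */
      }
  }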

ALGORITHM REPLACEMENT
Some patterns occur frequently in programs. Once recognized, they can be replaced with an equivalent parallel algorithm.
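
As an illustration (assuming a sum reduction is the recognized pattern; names are hypothetical), the sequential accumulation below can be replaced by a parallel reduction:

  /* The sequential pattern a compiler might recognize: a recurrence on s. */
  double sum_sequential(int n, const double *a)
  {
      double s = 0.0;
      for (int i = 0; i < n; i++)
          s += a[i];
      return s;
  }

  /* A parallel replacement using an OpenMP reduction. */
  double sum_parallel(int n, const double *a)
  {
      double s = 0.0;
      #pragma omp parallel for reduction(+:s)
      for (int i = 0; i < n; i++)
          s += a[i];
      return s;
  }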

LOOP DISTRIBUTION
To isolate these patterns, we can decompose a loop into several loops, one for each strongly connected component (π-block) in the dependence graph.
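
A sketch of distribution by π-blocks (illustrative names, not from the slides). S1 forms a dependence cycle with itself (a recurrence), while S2 depends on S1 but is not part of any cycle, so distribution yields a sequential loop followed by a parallel one:

  void distribute_before(int n, double *a, double *c,
                         const double *b, const double *d)
  {
      for (int i = 1; i < n; i++) {
          a[i] = a[i-1] + b[i];       /* S1: cyclic (self) dependence */
          c[i] = a[i] + d[i];         /* S2: uses S1's result, no cycle */
      }
  }

  void distribute_after(int n, double *a, double *c,
                        const double *b, const double *d)
  {
      for (int i = 1; i < n; i++)     /* π-block 1: must stay sequential */
          a[i] = a[i-1] + b[i];

      #pragma omp parallel for        /* π-block 2: now parallel/vectorizable */
      for (int i = 1; i < n; i++)
          c[i] = a[i] + d[i];
  }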

LOOP DISTRIBUTION (continued)

LOOP INTERCHANGING
The dependence information determines whether or not the loop headers can be interchanged. The headers of the loop on the original slide can be interchanged.
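
The loop from the slide is not reproduced in this transcript. As an illustration (hypothetical array a), here is a nest whose only dependence has distance vector (1,1); the vector stays lexicographically positive after interchange, so swapping the i and j headers is legal:

  void interchangeable(int n, double a[n][n])
  {
      for (int i = 1; i < n; i++)
          for (int j = 1; j < n; j++)
              a[i][j] = a[i-1][j-1] + 1.0;   /* dependence distance (1,1) */
  }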

LOOP INTERCHANGING (continued)
The headers of the loop on this slide cannot be interchanged.
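
Again, the slide's loop is not in the transcript. Here is an illustrative nest (hypothetical array a) whose dependence has distance vector (1,-1); after interchange the vector would become (-1,1), which is lexicographically negative, so the headers cannot be swapped:

  void not_interchangeable(int n, double a[n][n])
  {
      for (int i = 1; i < n; i++)
          for (int j = 0; j < n - 1; j++)
              a[i][j] = a[i-1][j+1] + 1.0;   /* dependence distance (1,-1) */
  }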

DEPENDENCE REMOVAL: Scalar Expansion
Some cycles in the dependence graph can be eliminated by using elementary transformations such as scalar expansion.
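
A sketch of scalar expansion (illustrative names). The scalar t creates anti- and output dependences between iterations; expanding it into an array removes them and breaks the cycle:

  void expand_before(int n, double *a, const double *b, const double *c)
  {
      double t;
      for (int i = 0; i < n; i++) {
          t = b[i] + c[i];            /* every iteration reuses the same t */
          a[i] = t * t;
      }
  }

  /* t is a caller-provided temporary array of length n. */
  void expand_after(int n, double *a, const double *b, const double *c, double *t)
  {
      #pragma omp parallel for        /* t expanded to t[i]: iterations independent */
      for (int i = 0; i < n; i++) {
          t[i] = b[i] + c[i];
          a[i] = t[i] * t[i];
      }
  }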

Induction variable recognition
An induction variable is a variable that is increased or decreased by a fixed amount on every iteration of a loop.
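
A sketch of the idea (illustrative names; a must have room for index 2*n). The variable k is a recurrence, but because it changes by a fixed amount it can be rewritten as a closed-form function of i, removing the cross-iteration dependence:

  void induction_before(int n, double *a, const double *b)
  {
      int k = 0;
      for (int i = 0; i < n; i++) {
          k = k + 2;                  /* induction variable: +2 each iteration */
          a[k] = b[i];
      }
  }

  void induction_after(int n, double *a, const double *b)
  {
      #pragma omp parallel for        /* k replaced by 2*(i+1): no recurrence */
      for (int i = 0; i < n; i++)
          a[2 * (i + 1)] = b[i];
  }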

More about the DO to PARALLEL DO transformation
When the dependence graph inside a loop has no cross-iteration dependences, the loop can be transformed into a PARALLEL loop. Both of the following loops qualify; in the second, S2 reads a(i) written by S1 in the same iteration, so the dependence is loop-independent rather than cross-iteration.

  do i = 1, n
S1:   a(i) = b(i) + c(i)
S2:   d(i) = x(i) + 1
  end do

  do i = 1, n
S1:   a(i) = b(i) + c(i)
S2:   d(i) = a(i) + 1
  end do

Loop Alignment
When there are cross-iteration dependences but no cycles, DO loops can be aligned so that they can be transformed into parallel loops (DOALLs).
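
A sketch of alignment (illustrative names). S2 reads a value S1 produced in the previous iteration, a cross-iteration dependence but not a cycle; shifting S2 by one iteration (with a prologue and epilogue) leaves a loop body with no cross-iteration dependences:

  void align_before(int n, double *a, double *d,
                    const double *b, const double *c)
  {
      for (int i = 1; i < n; i++) {
          a[i] = b[i] + c[i];         /* S1 */
          d[i] = a[i-1] * 2.0;        /* S2: reads a[] from iteration i-1 */
      }
  }

  void align_after(int n, double *a, double *d,
                   const double *b, const double *c)
  {
      d[1] = a[0] * 2.0;              /* prologue: first S2 instance */
      #pragma omp parallel for        /* aligned body: no cross-iteration deps */
      for (int i = 1; i < n - 1; i++) {
          a[i] = b[i] + c[i];
          d[i+1] = a[i] * 2.0;        /* S2 shifted: uses a[i] from this iteration */
      }
      a[n-1] = b[n-1] + c[n-1];       /* epilogue: last S1 instance */
  }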

Loop Distribution
Loop distribution is another method for eliminating cross-iteration dependences.
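
A sketch (illustrative names). The only dependence crosses iterations (S2 reads what S1 wrote one iteration earlier) but there is no cycle, so distributing the loop gives two loops that are each DOALLs, and running them in order preserves the original dependence:

  void dist_before(int n, double *a, double *c,
                   const double *b, const double *d)
  {
      for (int i = 1; i < n; i++) {
          a[i] = b[i] + 1.0;          /* S1 */
          c[i] = a[i-1] * d[i];       /* S2: cross-iteration use of a[] */
      }
  }

  void dist_after(int n, double *a, double *c,
                  const double *b, const double *d)
  {
      #pragma omp parallel for        /* all S1 instances first */
      for (int i = 1; i < n; i++)
          a[i] = b[i] + 1.0;

      #pragma omp parallel for        /* then all S2 instances */
      for (int i = 1; i < n; i++)
          c[i] = a[i-1] * d[i];
  }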

Loop Coalescing for DOALL loops
Consider a perfectly nested DOALL loop. It can be trivially transformed into a singly nested loop whose index is a tuple of the original loop variables. This coalescing transformation is convenient for scheduling and can reduce the overhead involved in starting DOALL loops.
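
A sketch of coalescing (illustrative names). The (i, j) iteration space of a perfectly nested DOALL is flattened into a single index k, so only one parallel loop has to be started and scheduled:

  void coalesce(int n, int m, double a[n][m], const double b[n][m])
  {
      #pragma omp parallel for
      for (int k = 0; k < n * m; k++) {
          int i = k / m;              /* recover the original index tuple */
          int j = k % m;
          a[i][j] = 2.0 * b[i][j];
      }
  }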

Extra Slides

Why Loop Interchange: Matrix Multiplication Example
A classic example for locality-aware programming is matrix multiplication:

  for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
      for (k = 0; k < N; k++)
        c[i][j] += a[i][k] * b[k][j];

There are 6 possible orders for the three loops: i-j-k, i-k-j, j-i-k, j-k-i, k-i-j, k-j-i. Each order corresponds to a different access pattern for the matrices. Let's focus on the innermost loop, since it is the one executed most often.

Inner Loop Memory Accesses
Each matrix element can be accessed in one of three modes in the inner loop:
  - Constant: does not depend on the inner loop's index
  - Sequential: contiguous addresses
  - Strided: non-contiguous addresses (N elements apart)

For the statement c[i][j] += a[i][k] * b[k][j], the access modes (c, a, b) for each loop order are:

  i-j-k: Constant,   Sequential, Strided
  i-k-j: Sequential, Constant,   Sequential
  j-i-k: Constant,   Sequential, Strided
  j-k-i: Strided,    Strided,    Constant
  k-i-j: Sequential, Constant,   Sequential
  k-j-i: Strided,    Strided,    Constant

Loop Order and Performance
  - Constant access is better than sequential access: it is always good to have constants in loops because they can be kept in registers.
  - Sequential access is better than strided access: sequential access utilizes the cache better.

With this in mind, let's go back to the previous slide.

Best Loop Ordering?
For c[i][j] += a[i][k] * b[k][j], the access modes (c, a, b) are:

  i-j-k: Constant,   Sequential, Strided
  i-k-j: Sequential, Constant,   Sequential
  j-i-k: Constant,   Sequential, Strided
  j-k-i: Strided,    Strided,    Constant
  k-i-j: Sequential, Constant,   Sequential
  k-j-i: Strided,    Strided,    Constant

k-i-j and i-k-j should have the best performance (no strided accesses),
i-j-k and j-i-k should be worse (1 strided access), and
j-k-i and k-j-i should be the worst (2 strided accesses).