CMPUT 680 - Compiler Design and Optimization
CMPUT680 - Winter 2006, Topic B: Loop Restructuring
José Nelson Amaral

Slide 2: Reading
- Wolfe, Michael, High Performance Compilers for Parallel Computing, Addison-Wesley, 1996, Chapter 9.
- Allen, Randy and Kennedy, Ken, Optimizing Compilers for Modern Architectures, Morgan Kaufmann, 2002, Chapter 8.

Slide 3: Unswitching
Removes a loop-independent conditional from a loop.

Before unswitching:
  for i=1 to N do
    for j=2 to N do
      if T[i] > 0 then
        A[i,j] = A[i,j-1]*T[i] + B[i]
      else
        A[i,j] = 0.0
      endif
    endfor
  endfor

After unswitching:
  for i=1 to N do
    if T[i] > 0 then
      for j=2 to N do
        A[i,j] = A[i,j-1]*T[i] + B[i]
      endfor
    else
      for j=2 to N do
        A[i,j] = 0.0
      endfor
    endif
  endfor
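As a concrete check of the transformation above, here is a sketch in plain Python standing in for the slide's pseudocode; the values of N, T, and B are made up. Both versions compute the same A, since the test on T[i] does not depend on j:

```python
# Illustrative sketch of unswitching; concrete values of N, T, B are made up.
N = 4
T = [None, 1.0, -2.0, 3.0, -4.0]   # T[1..N]; index 0 unused
B = [None, 0.5, 0.5, 0.5, 0.5]

def before(A):
    # conditional tested on every (i, j) iteration
    for i in range(1, N + 1):
        for j in range(2, N + 1):
            if T[i] > 0:
                A[i][j] = A[i][j - 1] * T[i] + B[i]
            else:
                A[i][j] = 0.0
    return A

def after(A):
    # loop-independent conditional hoisted out of the j loop
    for i in range(1, N + 1):
        if T[i] > 0:
            for j in range(2, N + 1):
                A[i][j] = A[i][j - 1] * T[i] + B[i]
        else:
            for j in range(2, N + 1):
                A[i][j] = 0.0
    return A

def fresh():
    return [[1.0] * (N + 1) for _ in range(N + 1)]

assert before(fresh()) == after(fresh())
```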

Slide 4: Unswitching
Constraints: the conditional tested must be completely independent of the loop.
Legality: it is always legal.
Advantage: reduces the frequency of execution of the conditional statement.
Disadvantages: the loop structure becomes more complex; code size expands; it might prevent data reuse.

Slide 5: Loop Peeling
Removes the first (or last) iteration of the loop into separate code.

Before peeling:
  for i=1 to N do
    A[i] = (X+Y)*B[i]
  endfor

After peeling:
  if N >= 1 then
    A[1] = (X+Y)*B[1]
    for i=2 to N do
      A[i] = (X+Y)*B[i]
    endfor
  endif
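The peeling above can be sketched in Python (the values of N, X, Y, and B are made up):

```python
# Sketch of peeling the first iteration out of the loop.
N = 5
X, Y = 2.0, 3.0
B = [None] + [float(k) for k in range(1, N + 1)]   # B[1..N]; index 0 unused

def before():
    A = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = (X + Y) * B[i]
    return A

def after():
    A = [0.0] * (N + 1)
    if N >= 1:                    # zero-trip test guarding the peeled iteration
        A[1] = (X + Y) * B[1]     # peeled first iteration
        for i in range(2, N + 1):
            A[i] = (X + Y) * B[i]
    return A

assert before() == after()
```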

Slide 6: Loop Peeling
Constraints: if the compiler does not know that the trip count is always positive, the peeled code must be protected by a zero-trip test.
Advantages: used to enable loop fusion or to remove conditionals on the index variable from inside the loop; allows loop-invariant code to execute only in the first iteration.
Disadvantage: code size expansion.

Slide 7: Index Set Splitting
Divides the index set into two portions.

Before set splitting:
  for i=1 to 100 do
    A[i] = B[i] + C[i]
    if i > 10 then
      D[i] = A[i] + A[i-10]
    endif
  endfor

After set splitting:
  for i=1 to 10 do
    A[i] = B[i] + C[i]
  endfor
  for i=11 to 100 do
    A[i] = B[i] + C[i]
    D[i] = A[i] + A[i-10]
  endfor
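A Python sketch of the same split (the contents of B and C are made up); the conditional is always false in the first loop and always true in the second, so it disappears:

```python
# Sketch of index set splitting.
def before():
    A = [0.0] * 101; D = [0.0] * 101
    B = [float(i) for i in range(101)]; C = [1.0] * 101
    for i in range(1, 101):
        A[i] = B[i] + C[i]
        if i > 10:                      # tested on every iteration
            D[i] = A[i] + A[i - 10]
    return A, D

def after():
    A = [0.0] * 101; D = [0.0] * 101
    B = [float(i) for i in range(101)]; C = [1.0] * 101
    for i in range(1, 11):              # i = 1..10: condition is always false
        A[i] = B[i] + C[i]
    for i in range(11, 101):            # i = 11..100: condition is always true
        A[i] = B[i] + C[i]
        D[i] = A[i] + A[i - 10]
    return A, D

assert before() == after()
```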

Slide 8: Index Set Splitting
Advantages: used to enable loop fusion; removes conditionals that test the index variable from inside the loop.
Disadvantage: code size expansion.

Slide 9: Scalar Expansion

  for i=1 to N do
    T = A[i] + B[i]
    C[i] = T + 1/T
  endfor

In this loop, the scalar variable T creates:
(1) a flow dependence from the first assignment to the second;
(2) a loop-carried anti-dependence from the second assignment to the first.
This anti-dependence can prevent some loop transformations.

Slide 10: Scalar Expansion
Breaks anti-dependence relations by expanding, or promoting, a scalar into an array.

Before scalar expansion:
  for i=1 to N do
    T = A[i] + B[i]
    C[i] = T + 1/T
  endfor

After scalar expansion:
  if N >= 1 then
    allocate Tx(1:N)
    for i=1 to N do
      Tx[i] = A[i] + B[i]
      C[i] = Tx[i] + 1/Tx[i]
    endfor
    T = Tx[N]
  endif
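A Python sketch of the expansion (the contents of A and B are made up, chosen nonzero so 1/T is defined); note the copy-back of Tx[N] into T, in case T is live after the loop:

```python
# Sketch of scalar expansion: the scalar T is promoted to the array Tx.
N = 4
A = [None, 1.0, 2.0, 3.0, 4.0]
B = [None, 1.0, 1.0, 1.0, 1.0]

def before():
    C = [0.0] * (N + 1)
    T = 0.0
    for i in range(1, N + 1):
        T = A[i] + B[i]
        C[i] = T + 1 / T
    return C, T

def after():
    C = [0.0] * (N + 1)
    T = 0.0
    if N >= 1:
        Tx = [0.0] * (N + 1)          # the expanded array Tx(1:N)
        for i in range(1, N + 1):
            Tx[i] = A[i] + B[i]       # no loop-carried anti-dependence now
            C[i] = Tx[i] + 1 / Tx[i]
        T = Tx[N]                     # copy back: T may be live after the loop
    return C, T

assert before() == after()
```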

Slide 11: Scalar Expansion
Constraints: the loop must be countable and the scalar must have no upward-exposed uses; flow dependences for the scalar within the loop must be loop independent; if the scalar is live on loop exit, the last value assigned to the array must be copied back into the scalar upon exit.
Advantages: eliminates anti-dependences and output dependences.
Disadvantage: in nested loops the size of the array might be prohibitive.

Slide 12: Loop Fusion
Takes two adjacent loops and generates a single loop.

Before loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (3) endfor
  (4) for i=1 to N do
  (5)   C[i] = A[i] / 2
  (6) endfor
  (7) for i=1 to N do
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

After loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (5)   C[i] = A[i] / 2
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

But, is this fusion legal?

Slide 13: Loop Fusion

  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (3) endfor
  (4) for i=1 to N do
  (5)   C[i] = A[i] / 2
  (6) endfor
  (7) for i=1 to N-1 do
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

Assume N=4 and B = (1, 3, 5, 7):
  after the first loop:  A = (2, 4, 6, 8)
  after the second loop: C = (1, 2, 3, 4)
  after the third loop:  D = (1/2, 1/3, 1/4)

Slide 14: Loop Fusion

  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (5)   C[i] = A[i] / 2
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

Assume again N=4 and B = (1, 3, 5, 7), with the arrays initially zero. In the fused loop, iteration i reads C[i+1] before statement (5) has computed it, so D is built from the initial contents of C rather than from the values A[i+1]/2, and the result differs from the unfused version.

Slide 15: Loop Fusion
To be legal, a loop fusion must preserve all the dependence relations of the original loops.

Before loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (3) endfor
  (4) for i=1 to N do
  (5)   C[i] = A[i] / 2
  (6) endfor
  (7) for i=1 to N do
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

The original loops have the flow dependences:
  S2 ->f S5
  S5 ->f S8

After loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (5)   C[i] = A[i] / 2
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

What are the dependences in the fused loop?

Slide 16: Loop Fusion
The original loops have the flow dependences:
  S2 ->f S5
  S5 ->f S8

After loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (5)   C[i] = A[i] / 2
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

In the fused loop, the dependences are:
  S2 ->f S5
  S8 ->a S5
Fusion reversed the direction of the dependence between S5 and S8, so this fusion is illegal.

Slide 17: Loop Fusion
Takes two adjacent loops and generates a single loop.

Before loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (3) endfor
  (4) for i=1 to N do
  (5)   C[i] = A[i] / 2
  (6) endfor
  (7) for i=1 to N do
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

After loop fusion:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (5)   C[i] = A[i] / 2
  (6) endfor
  (7) for i=1 to N do
  (8)   D[i] = 1 / C[i+1]
  (9) endfor

This is a legal fusion!
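A Python sketch contrasting the illegal full fusion with the legal partial fusion; the B values and the -1.0 sentinel in C are made up, and the D loop uses the N-1 bound from the worked example so every C read is defined:

```python
# Sketch: fusing all three loops violates the S5 ->f S8 dependence;
# fusing only the first two is legal.
N = 4
B = [None, 1.0, 3.0, 5.0, 7.0]

def original():
    A = [0.0] * (N + 2); C = [-1.0] * (N + 2); D = [0.0] * (N + 1)
    for i in range(1, N + 1): A[i] = B[i] + 1
    for i in range(1, N + 1): C[i] = A[i] / 2
    for i in range(1, N):     D[i] = 1 / C[i + 1]
    return D

def fused_all():        # illegal: S8 reads C[i+1] before S5 writes it
    A = [0.0] * (N + 2); C = [-1.0] * (N + 2); D = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = B[i] + 1
        C[i] = A[i] / 2
        if i < N:
            D[i] = 1 / C[i + 1]   # C[i+1] still holds its initial value!
    return D

def fused_legal():      # fuse only the first two loops
    A = [0.0] * (N + 2); C = [-1.0] * (N + 2); D = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = B[i] + 1
        C[i] = A[i] / 2
    for i in range(1, N):
        D[i] = 1 / C[i + 1]
    return D

assert fused_legal() == original()
assert fused_all() != original()
```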

Slide 18: Loop Fusion
Initially only data-independent loops were fused. Now compilers also try to fuse data-dependent loops to increase data locality and benefit from caches. Loop fusion increases the size of the loop body, which reduces instruction temporal locality (a problem only on machines with tiny instruction caches). Larger loop bodies also enable more effective scalar optimizations, such as common-subexpression elimination and instruction scheduling.

Slide 19: Loop Fusion (Complications)
To be fused, two loops must be compatible, i.e.:
(1) they iterate the same number of times;
(2) they are adjacent, or can be reordered to become adjacent;
(3) the compiler must be able to use the same induction variable in both loops.
Compilers use other transformations to make loops meet these conditions.

Slide 20: Loop Fusion (Another Example)

Original loops (different trip counts):
  (1) for i=1 to 99 do
  (2)   A[i] = B[i] + 1
  (3) endfor
  (4) for i=1 to 98 do
  (5)   C[i] = A[i+1] * 2
  (6) endfor

Peel the first iteration of the first loop:
  (2) A[1] = B[1] + 1
  (1) for i=2 to 99 do
  (2)   A[i] = B[i] + 1
  (3) endfor
  (4) for i=1 to 98 do
  (5)   C[i] = A[i+1] * 2
  (6) endfor

Fuse using a common induction variable ib:
  (1) i = 1
  (2) A[i] = B[i] + 1
      for ib=0 to 97 do
  (1)   i = ib+2
  (2)   A[i] = B[i] + 1
  (4)   i = ib+1
  (5)   C[i] = A[i+1] * 2
  (6) endfor
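A Python sketch of this peel-then-fuse sequence (the B values are made up); after peeling, both loops run 98 times and can share the induction variable ib:

```python
# Sketch: aligning two loops by peeling, then fusing them.
B = [None] + [float(i) for i in range(1, 100)]   # B[1..99]

def before():
    A = [0.0] * 101; C = [0.0] * 99
    for i in range(1, 100):  A[i] = B[i] + 1      # i = 1..99
    for i in range(1, 99):   C[i] = A[i + 1] * 2  # i = 1..98
    return A, C

def after():
    A = [0.0] * 101; C = [0.0] * 99
    A[1] = B[1] + 1                 # peeled first iteration of the first loop
    for ib in range(0, 98):         # ib = 0..97
        i = ib + 2
        A[i] = B[i] + 1             # A[i+1] is ready before C[i] needs it
        i = ib + 1
        C[i] = A[i + 1] * 2
    return A, C

assert before() == after()
```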

Slide 21: Loop Fission (or Loop Distribution)
Breaks a loop into two or more smaller loops.

Original loop:
  (1) for i=1 to N do
  (2)   A[i] = A[i] + B[i-1]
  (3)   B[i] = C[i-1]*X + Z
  (4)   C[i] = 1/B[i]
  (5)   D[i] = sqrt(C[i])
  (6) endfor

Dependence graph (read off the loop): S3 ->f S4 through B[i]; S4 ->f S3 through C[i-1] (loop carried); S3 ->f S2 through B[i-1] (loop carried); S4 ->f S5 through C[i].

Slide 22: Loop Fission (or Loop Distribution)
Breaks a loop into two or more smaller loops.

Original loop:
  (1) for i=1 to N do
  (2)   A[i] = A[i] + B[i-1]
  (3)   B[i] = C[i-1]*X + Z
  (4)   C[i] = 1/B[i]
  (5)   D[i] = sqrt(C[i])
  (6) endfor

After loop fission:
  (1) for ib=0 to N-1 do
  (3)   B[ib+1] = C[ib]*X + Z
  (4)   C[ib+1] = 1/B[ib+1]
  (6) endfor
  (1) for ib=0 to N-1 do
  (2)   A[ib+1] = A[ib+1] + B[ib]
  (6) endfor
  (1) for ib=0 to N-1 do
  (5)   D[ib+1] = sqrt(C[ib+1])
  (6) endfor
  (1) i = N+1
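A Python sketch of this fission (the values of X, Z, and the initial array contents are made up). The S3/S4 recurrence stays in one loop, and the loops for S2 and S5 run after it:

```python
# Sketch of loop fission: one loop split into three.
import math

N = 4
X, Z = 2.0, 1.0

def original():
    A = [1.0] * (N + 1); B = [1.0] * (N + 1)
    C = [1.0] * (N + 1); D = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = A[i] + B[i - 1]
        B[i] = C[i - 1] * X + Z
        C[i] = 1 / B[i]
        D[i] = math.sqrt(C[i])
    return A, B, C, D

def fissioned():
    A = [1.0] * (N + 1); B = [1.0] * (N + 1)
    C = [1.0] * (N + 1); D = [0.0] * (N + 1)
    for ib in range(0, N):          # the S3/S4 recurrence stays together
        B[ib + 1] = C[ib] * X + Z
        C[ib + 1] = 1 / B[ib + 1]
    for ib in range(0, N):          # S2 reads B written one iteration earlier
        A[ib + 1] = A[ib + 1] + B[ib]
    for ib in range(0, N):
        D[ib + 1] = math.sqrt(C[ib + 1])
    return A, B, C, D

assert original() == fissioned()
```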

Slide 23: Loop Fission (or Loop Distribution)
All statements that form a strongly connected component in the dependence graph of the original loop must remain in the same loop after fission. When finding strongly connected components for loop fission, the compiler can ignore loop-carried anti-dependences and output dependences for scalars that are expanded by the fission.

Slide 24: Loop Fission (or Loop Distribution)
To find a legal order for the loops after fission, we compute the acyclic condensation of the dependence graph. For the loop above, the cycle between S3 and S4 collapses into the single condensation node {S3, S4}, and the edges {S3, S4} -> S2 and {S3, S4} -> S5 dictate that the loop containing S3 and S4 must run before the loops containing S2 and S5.
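The SCC grouping behind the condensation can be sketched with a mutual-reachability test (fine for a four-node graph; a real compiler would use Tarjan's algorithm). The edge set below is read off the original fission loop and is an assumption of this sketch:

```python
# Sketch: grouping statements into strongly connected components.
edges = {
    "S2": set(),
    "S3": {"S2", "S4"},   # B[i-1] -> S2 (loop carried), B[i] -> S4
    "S4": {"S3", "S5"},   # C[i-1] -> S3 (loop carried), C[i] -> S5
    "S5": set(),
}

def reachable(g, s):
    # all nodes reachable from s (not including s unless on a cycle)
    seen, stack = set(), [s]
    while stack:
        for v in g[stack.pop()]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen

sccs = []
for u in edges:
    for scc in sccs:
        w = next(iter(scc))
        if u in reachable(edges, w) and w in reachable(edges, u):
            scc.add(u)          # u and w lie on a common cycle
            break
    else:
        sccs.append({u})

# S3 and S4 form the only nontrivial strongly connected component
assert sorted(map(sorted, sccs)) == [["S2"], ["S3", "S4"], ["S5"]]
```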

Slide 25: Loop Fission (or Loop Distribution)
Uses of loop fission:
- it can improve cache use on machines with very small caches;
- it can be required to enable other transformations, such as loop interchanging.

Slide 26: Loop Reversal
Runs a loop backward; all dependence directions are reversed. Reversal is only legal for loops that have no loop-carried dependences. It can be used to allow fusion:

Original loops:
  (1) for i=1 to N do
  (2)   A[i] = B[i] + 1
  (3)   C[i] = A[i]/2
  (4) endfor
  (5) for i=1 to N do
  (6)   D[i] = 1/C[i+1]
  (7) endfor

After reversal:
  (1) for i=N downto 1 do
  (2)   A[i] = B[i] + 1
  (3)   C[i] = A[i]/2
  (4) endfor
  (5) for i=N downto 1 do
  (6)   D[i] = 1/C[i+1]
  (7) endfor

After fusion:
  (1) for i=N downto 1 do
  (2)   A[i] = B[i] + 1
  (3)   C[i] = A[i]/2
  (6)   D[i] = 1/C[i+1]
  (7) endfor
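A Python sketch of reverse-then-fuse (the B values are made up, and C is preinitialized to 1.0 so the boundary read of C[N+1] is defined). Counting down, C[i+1] is always computed one iteration before D[i] reads it, so the fused loop matches the original:

```python
# Sketch: reversing both loops makes the fusion of slide 12 legal.
N = 4
B = [None, 1.0, 3.0, 5.0, 7.0]

def original():
    A = [0.0] * (N + 2); C = [1.0] * (N + 2); D = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = B[i] + 1
        C[i] = A[i] / 2
    for i in range(1, N + 1):
        D[i] = 1 / C[i + 1]
    return A, C, D

def reversed_fused():
    A = [0.0] * (N + 2); C = [1.0] * (N + 2); D = [0.0] * (N + 1)
    for i in range(N, 0, -1):         # both loops reversed, then fused
        A[i] = B[i] + 1
        C[i] = A[i] / 2
        D[i] = 1 / C[i + 1]           # C[i+1] was computed one iteration earlier
    return A, C, D

assert original() == reversed_fused()
```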

Slide 27: Loop Interchanging
Reverses the nesting order of nested loops. If the outer loop iterates many times and the inner loop iterates only a few times, interchanging reduces the startup cost of the original inner loop. Interchanging can also change the spatial locality of memory references.

Slide 28: Loop Interchanging

Before interchange:
  (1) for j=2 to M do
  (2)   for i=1 to N do
  (3)     A[i,j] = A[i,j-1] + B[i,j]
  (4)   endfor
  (5) endfor

After interchange:
  (1) for i=1 to N do
  (2)   for j=2 to M do
  (3)     A[i,j] = A[i,j-1] + B[i,j]
  (4)   endfor
  (5) endfor
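A Python sketch of this interchange (N, M, and the array contents are made up). The only dependence is the recurrence on j through A[i,j-1], which stays within a single i, so the interchange is legal and both orders compute the same A:

```python
# Sketch of loop interchange on the slide's doubly nested loop.
N, M = 3, 4
Bm = [[1.0] * (M + 1) for _ in range(N + 1)]

def j_outer():
    A = [[0.5] * (M + 1) for _ in range(N + 1)]
    for j in range(2, M + 1):
        for i in range(1, N + 1):
            A[i][j] = A[i][j - 1] + Bm[i][j]
    return A

def i_outer():
    A = [[0.5] * (M + 1) for _ in range(N + 1)]
    for i in range(1, N + 1):         # interchanged: the j recurrence is now inner
        for j in range(2, M + 1):
            A[i][j] = A[i][j - 1] + Bm[i][j]
    return A

assert j_outer() == i_outer()
```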

Slide 29: Other Loop Restructuring
Loop Skewing: unnormalizes iteration vectors to change the shape of the iteration space and allow loop interchanging.
Strip Mining: decomposes a single loop into two nested loops (the inner loop computes a strip of the data computed by the original loop); used for vector processors.
Loop Tiling: the iteration space is divided into tiles, with the tile boundaries parallel to the iteration-space axes.
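Strip mining, at least, is simple enough for a short sketch (Python; N, the strip size S, and the B values are made up):

```python
# Sketch of strip mining: one loop becomes an outer loop over strips
# plus an inner loop within each strip.
N = 10
B = [None] + [float(i) for i in range(1, N + 1)]

def original():
    A = [0.0] * (N + 1)
    for i in range(1, N + 1):
        A[i] = B[i] * 2
    return A

def strip_mined(S=4):
    A = [0.0] * (N + 1)
    for lo in range(1, N + 1, S):                 # outer loop walks the strips
        for i in range(lo, min(lo + S, N + 1)):   # inner loop covers one strip
            A[i] = B[i] * 2
    return A

assert original() == strip_mined()
```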