ECE 1754 Loop Transformations by: Eric LaForest

Motivation
Improving loop behaviour/performance:
- usually parallelism
- sometimes memory and register usage
Why focus on loops? They contain most of the busy code.
- informal proof: otherwise, program size would be proportional to data size

Data-Flow-Based Transformations

Strength Reduction (Loop-Based)
Replaces an expression in a loop with an equivalent one that uses a less expensive operator.

Before:
  do i = 1, n
    a[i] = a[i] + c*i
  end do

After:
  T = c
  do i = 1, n
    a[i] = a[i] + T
    T = T + c
  end do

Similar transformations apply to exponentiation, sign reversal, and division, where economical.
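
The same idea as a minimal, runnable C sketch (0-based indexing; the function name and signature are illustrative, not from the slides):

  /* Before: one multiply per iteration:
     for (int i = 0; i < n; i++) a[i] = a[i] + c * i;  */
  void strength_reduce(float *a, float c, int n) {
      float t = 0.0f;                /* invariant: t == c * i at loop top */
      for (int i = 0; i < n; i++) {
          a[i] = a[i] + t;           /* multiply replaced by ... */
          t = t + c;                 /* ... a running add */
      }
  }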

Induction Variable Elimination
Frees the register used by the induction variable and reduces the bookkeeping operations in the loop.

Before:
  for (i = 0; i < n; i++) {
    a[i] = a[i] + c;
  }

After:
  A = &a[0];
  T = &a[0] + n;
  while (A < T) {
    *A = *A + c;
    A++;
  }

Loop-Invariant Code Motion
A specific case of code hoisting. Needs a register to hold the invariant value.
Examples of invariant expressions: multi-dimensional indices, pointers, structure fields.

Before:
  do i = 1, n
    a[i] = a[i] + sqrt(x)
  end do

After:
  if (n > 0) C = sqrt(x)
  do i = 1, n
    a[i] = a[i] + C
  end do

Loop Unswitching
What: moving loop-invariant conditionals outside of a loop.
How: replicate the loop inside each branch.
Benefits:
- no conditional test on each iteration
- smaller loops
- exposes parallelism

Loop Unswitching
Before:
  do i = 2, n
    a[i] = a[i] + c
    if (x < 7) then
      b[i] = a[i] * c[i]
    else
      b[i] = a[i-1] * b[i-1]
    end if
  end do

After:
  if (n > 2) then
    if (x < 7) then
      do all i = 2, n
        a[i] = a[i] + c
        b[i] = a[i] * c[i]
      end do all
    else
      do i = 2, n
        a[i] = a[i] + c
        b[i] = a[i-1] * b[i-1]
      end do
    end if
  end if
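
A runnable C sketch of the same transformation (the scalar added in the loop is renamed k to avoid clashing with the array c; names are illustrative):

  /* Before: the loop-invariant test (x < 7) runs on every iteration. */
  void unswitch_before(float *a, float *b, const float *c, float k, float x, int n) {
      for (int i = 1; i < n; i++) {
          a[i] = a[i] + k;
          if (x < 7.0f)
              b[i] = a[i] * c[i];
          else
              b[i] = a[i-1] * b[i-1];
      }
  }

  /* After: the test is hoisted and the loop replicated per branch; the
     first copy has no loop-carried dependence and can run in parallel. */
  void unswitch_after(float *a, float *b, const float *c, float k, float x, int n) {
      if (x < 7.0f) {
          for (int i = 1; i < n; i++) {
              a[i] = a[i] + k;
              b[i] = a[i] * c[i];
          }
      } else {
          for (int i = 1; i < n; i++) {
              a[i] = a[i] + k;
              b[i] = a[i-1] * b[i-1];
          }
      }
  }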

Loop Reordering Transformations (changing the relative iteration order of nested loops)

Loop Interchange
Exchanges loops in a perfect nest.
Benefits:
- enables vectorization
- improves parallelism by increasing granularity
- reduces stride (and thus improves cache behaviour)
- moves loop-invariant expressions to the inner loop
Legal when: the new dependences and loop bounds are legal.

Loop Interchange
Before:
  do i = 1, n
    do j = 1, n
      b[i] = b[i] + a[i,j]
    end do
  end do

After:
  do j = 1, n
    do i = 1, n
      b[i] = b[i] + a[i,j]
    end do
  end do
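
The stride argument flips between languages: Fortran is column-major, C is row-major, so in C the j-inner form is the fast one. A minimal sketch using C99 variable-length array parameters (illustrative names):

  /* Poor: i innermost walks a[][] down a column, stride n in memory. */
  void sum_rows_slow(int n, float b[n], float a[n][n]) {
      for (int j = 0; j < n; j++)
          for (int i = 0; i < n; i++)
              b[i] += a[i][j];
  }

  /* After interchange: j innermost gives unit-stride, cache-friendly access. */
  void sum_rows_fast(int n, float b[n], float a[n][n]) {
      for (int i = 0; i < n; i++)
          for (int j = 0; j < n; j++)
              b[i] += a[i][j];
  }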

Loop Interchange
  do i = 2, n
    do j = 1, n-1
      a[i,j] = a[i-1,j+1]
    end do
  end do

This nest cannot be interchanged, due to the (1,-1) dependence: the interchanged loop would end up reading a value before it had been computed.

Loop Skewing
Used in wavefront computations.
How: add the outer loop index, multiplied by a skew factor f, to the bounds of the inner iteration variable, then subtract the same quantity from every use of the inner variable.
Always legal: the subtraction restores the original accesses.
Benefit: allows the inner loop (once interchanged) to execute in parallel.

Loop Skewing
Before:                               distance vectors {(1,0),(0,1)}
  do i = 2, n-1
    do j = 2, m-1
      a[i,j] = ( a[i-1,j] + a[i,j-1] + a[i+1,j] + a[i,j+1] )/4
    end do
  end do

After skewing (f = 1):                distance vectors {(1,1),(0,1)}
  do i = 2, n-1
    do j = i+2, i+m-1
      a[i,j-i] = ( a[i-1,j-i] + a[i,j-1-i] + a[i+1,j-i] + a[i,j+1-i] )/4
    end do
  end do

Loop Skewing
After skewing (f = 1):                distance vectors {(1,1),(0,1)}
  do i = 2, n-1
    do j = i+2, i+m-1
      a[i,j-i] = ( a[i-1,j-i] + a[i,j-1-i] + a[i+1,j-i] + a[i,j+1-i] )/4
    end do
  end do

After interchange:                    distance vectors {(1,0),(1,1)}
  do j = 4, m+n-2
    do i = max(2, j-m+1), min(n-1, j-2)
      a[i,j-i] = ( a[i-1,j-i] + a[i,j-1-i] + a[i+1,j-i] + a[i,j+1-i] )/4
    end do
  end do
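
A 0-based C rendering of the final skewed-and-interchanged form, as a hedged sketch: interior points are assumed to be 1..n-2 and 1..m-2, t = i + j numbers the wavefronts, and every point on one wavefront is independent (so the inner i loop could run in parallel).

  static int imax(int x, int y) { return x > y ? x : y; }
  static int imin(int x, int y) { return x < y ? x : y; }

  void wavefront(int n, int m, float a[n][m]) {
      for (int t = 2; t <= (n - 2) + (m - 2); t++) {        /* outer: serial */
          for (int i = imax(1, t - (m - 2)); i <= imin(n - 2, t - 1); i++) {
              int j = t - i;                                 /* recover the column */
              a[i][j] = (a[i-1][j] + a[i][j-1] + a[i+1][j] + a[i][j+1]) / 4.0f;
          }
      }
  }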

Loop Reversal
Changes the iteration direction.
- making the iteration variable run down to zero reduces the loop-exit test to a BNEZ
- may eliminate temporary arrays (see array statement scalarization, later)
- also helps enable loop interchange
Legal when the resulting dependence vectors remain lexicographically positive.

Loop Reversal
Before:                               dependence (1,-1)
  do i = 1, n
    do j = 1, n
      a[i,j] = a[i-1,j+1] + 1
    end do
  end do

After:                                dependence (1,1)
  do i = 1, n
    do j = n, 1, -1
      a[i,j] = a[i-1,j+1] + 1
    end do
  end do

Strip Mining
Adjusts the granularity of an operation:
- usually for vectorization
- also for controlling array size or grouping operations
Often requires other transformations first.

Strip Mining
Before:
  do i = 1, n
    a[i] = a[i] + c
  end do

After:
  TN = (n/64)*64
  do TI = 1, TN, 64
    a[TI:TI+63] = a[TI:TI+63] + c
  end do
  do i = TN+1, n
    a[i] = a[i] + c
  end do
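
Since C has no array-section syntax, the same shape in C is a pair of nested loops plus a cleanup loop; a minimal sketch (illustrative name, strip size 64 as on the slide):

  /* Strip-mined form of: for (i = 0; i < n; i++) a[i] += c;
     Each full 64-element strip is a candidate for one vector operation. */
  void strip_mine(float *a, float c, int n) {
      int i, ti;
      for (ti = 0; ti + 64 <= n; ti += 64)   /* full strips only */
          for (i = ti; i < ti + 64; i++)
              a[i] += c;
      for (i = ti; i < n; i++)               /* cleanup: n mod 64 leftovers */
          a[i] += c;
  }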

Cycle Shrinking
A specialization of strip mining: parallelize when the dependence distance is greater than 1.
Legal when: the distance is constant and positive.

Cycle Shrinking
Before:
  do i = 1, n
    a[i+k] = b[i]
    b[i+k] = a[i] + c[i]
  end do

After:
  do TI = 1, n, k
    do all i = TI, TI+k-1
      a[i+k] = b[i]
      b[i+k] = a[i] + c[i]
    end do all
  end do

Loop Tiling
The multidimensional specialization of strip mining.
Goal: improve cache reuse.
Adjacent loops can be tiled if they can be interchanged.

Loop Tiling
Before:
  do i = 1, n
    do j = 1, n
      a[i,j] = b[j,i]
    end do
  end do

After:
  do TI = 1, n, 64
    do TJ = 1, n, 64
      do i = TI, min(TI+63, n)
        do j = TJ, min(TJ+63, n)
          a[i,j] = b[j,i]
        end do
      end do
    end do
  end do
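
The same tiling in C, as a minimal sketch (0-based, tile size TS = 64; function and constant names are illustrative):

  enum { TS = 64 };

  /* Tiled transpose-copy: each TS-by-TS block of a and b is finished
     while it is still resident in cache. */
  void tiled_copy(int n, float a[n][n], float b[n][n]) {
      for (int ti = 0; ti < n; ti += TS) {
          for (int tj = 0; tj < n; tj += TS) {
              int iend = ti + TS < n ? ti + TS : n;   /* clip partial tiles */
              int jend = tj + TS < n ? tj + TS : n;
              for (int i = ti; i < iend; i++)
                  for (int j = tj; j < jend; j++)
                      a[i][j] = b[j][i];
          }
      }
  }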

Loop Fission (a.k.a. Loop Distribution)
Divides the statements of a loop into separate, similar loops.
Benefits:
- creates perfect loop nests
- reduces dependences and memory use
- improves locality and register reuse
Legal when the sub-loops are placed in the same dependence order as the original statements.

Loop Fission
Before:
  do i = 1, n
    a[i] = a[i] + c
    x[i+1] = x[i]*7 + x[i+1] + a[i]
  end do

After:
  do all i = 1, n
    a[i] = a[i] + c
  end do all
  do i = 1, n
    x[i+1] = x[i]*7 + x[i+1] + a[i]
  end do

Loop Fusion (a.k.a. loop jamming)
The inverse of fission: merges adjacent loops into one.
Legal when the loop bounds are identical and fusing does not induce a dependence that runs backwards, from the statements of the (originally) second loop to those of the first.
Benefits:
- reduced loop overhead
- improved parallelism and locality
- can fix load balance
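
The slide gives no example, so here is a minimal C sketch of the idea (names are illustrative):

  /* Before: two loops over identical bounds, each paying its own overhead. */
  void fuse_before(float *a, float *b, float c, int n) {
      for (int i = 0; i < n; i++)
          a[i] = a[i] + c;
      for (int i = 0; i < n; i++)
          b[i] = b[i] + a[i];        /* reads the a[i] produced above */
  }

  /* After: fused. Legal because b[i] still reads a[i] after its update;
     a[i] is now reused while it is still in a register. */
  void fuse_after(float *a, float *b, float c, int n) {
      for (int i = 0; i < n; i++) {
          a[i] = a[i] + c;
          b[i] = b[i] + a[i];
      }
  }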

Restructuring Transformations (alters the structure, but not the computations or iteration order)

Loop Unrolling
Replicates the loop body. Always legal.
Benefits:
- reduces loop overhead
- increases ILP (especially on VLIW)
- improves locality (consecutive elements)

Loop Unrolling
Before:
  do i = 2, n-1
    a[i] = a[i] + a[i-1] * a[i+1]
  end do

After:
  do i = 2, n-2, 2
    a[i] = a[i] + a[i-1] * a[i+1]
    a[i+1] = a[i+1] + a[i] * a[i+2]
  end do
  if (mod(n-2,2) = 1) then
    a[n-1] = a[n-1] + a[n-2] * a[n]
  end if
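
The same unroll-by-2 with cleanup in C, as a minimal 0-based sketch (illustrative name):

  /* Unrolled form of: for (i = 1; i < n-1; i++) a[i] += a[i-1]*a[i+1]; */
  void unroll2(float *a, int n) {
      int i;
      for (i = 1; i + 1 < n - 1; i += 2) {
          a[i]   = a[i]   + a[i-1] * a[i+1];
          a[i+1] = a[i+1] + a[i]   * a[i+2];
      }
      if (i < n - 1)                 /* odd trip count: one iteration left */
          a[i] = a[i] + a[i-1] * a[i+1];
  }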

Software Pipelining
Before:
  do i = 1, n
    a[i] = a[i] + c
  end do

After (approx.):
  do i = 1, n, 3
    a[i] = a[i] + c
    a[i+1] = a[i+1] + c
    a[i+2] = a[i+2] + c
  end do

Note: assumes a 2-way superscalar CPU.
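
The slide's "after" is closer to plain unrolling; genuine software pipelining overlaps the load/add/store stages of different iterations. A hedged C sketch of the prologue-kernel-epilogue shape it produces (illustrative name, assumes nothing beyond the loop shown):

  /* Pipelined form of: for (i = 0; i < n; i++) a[i] = a[i] + c;
     In the kernel, the store of iteration i-2, the add of i-1, and the
     load of i are all in flight together. */
  void pipelined(float *a, float c, int n) {
      if (n < 2) {
          if (n == 1) a[0] += c;
          return;
      }
      float loaded = a[0];               /* prologue: fill the pipeline */
      float summed = loaded + c;
      loaded = a[1];
      for (int i = 2; i < n; i++) {      /* kernel: three stages overlap */
          a[i-2] = summed;
          summed = loaded + c;
          loaded = a[i];
      }
      a[n-2] = summed;                   /* epilogue: drain the pipeline */
      a[n-1] = loaded + c;
  }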

Loop Coalescing
Combines a loop nest into a single loop with a single induction variable.
Always legal: does not change the iteration order.
Improves load balancing on parallel machines.

Loop Coalescing
Before:
  do all i = 1, n
    do all j = 1, m
      a[i,j] = a[i,j] + c
    end do all
  end do all

After:
  do all T = 1, n*m
    i = ((T-1) / m) + 1
    j = mod(T-1, m) + 1
    a[i,j] = a[i,j] + c
  end do all

Note: assumes n and m are slightly larger than P, the number of processors.
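
The index-recovery arithmetic is the crux, so here it is in 0-based C as a minimal sketch (illustrative name):

  /* Coalesced nest: one flat index t; i and j are recovered by division
     and remainder, so the n*m independent iterations divide evenly among
     threads regardless of n and m individually. */
  void coalesced(int n, int m, float a[n][m], float c) {
      for (int t = 0; t < n * m; t++) {
          int i = t / m;
          int j = t % m;
          a[i][j] = a[i][j] + c;
      }
  }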

Loop Collapsing
Reduces the number of loop dimensions, eliminating the overhead of nested or multidimensional loops.
Works best when the stride is constant.

Loop Collapsing
Before:
  do all i = 1, n
    do all j = 1, m
      a[i,j] = a[i,j] + c
    end do all
  end do all

After:
  real TA[n*m]
  equivalence(TA, a)
  do all T = 1, n*m
    TA[T] = TA[T] + c
  end do all

Loop Peeling
Extracts a number of iterations at the start or end of a loop. Always legal.
Reduces dependences and allows adjusting bounds for later loop fusion.

Loop Peeling
Before:
  do i = 2, n
    b[i] = b[i] + b[2]
  end do
  do all i = 3, n
    a[i] = a[i] + c
  end do all

After (first iteration peeled, then the loops fused):
  if (2 <= n) then
    b[2] = b[2] + b[2]
  end if
  do all i = 3, n
    b[i] = b[i] + b[2]
    a[i] = a[i] + c
  end do all

Loop Normalization
Converts induction variables to a canonical form, typically running from 1 (or 0) with a step of 1:
  do i = 1, n, 1
Makes analysis and dependence testing easier.

Loop Normalization
Before:
  do i = 1, n
    a[i] = a[i] + c
  end do
  do i = 2, n+1
    b[i] = a[i-1] * b[i]
  end do

After:
  do i = 1, n
    a[i] = a[i] + c
  end do
  do i = 1, n
    b[i+1] = a[i] * b[i+1]
  end do

Note: the new loops can now be fused.

Loop Spreading
Moves some statements from the second of two loops into the first.
Enables ILP by stepping over dependences: delay the second loop's statements by the maximum dependence distance between the statements of the two loops, plus 1.

Loop Spreading
Before:
  do i = 1, n/2
    a[i+1] = a[i+1] + a[i]
  end do
  do i = 1, n-3
    b[i+1] = b[i+1] + b[i] * a[i+3]
  end do

After:
  do i = 1, n/2
    COBEGIN
      a[i+1] = a[i+1] + a[i]
      if (i > 3) then
        b[i-2] = b[i-2] + b[i-3] * a[i]
      end if
    COEND
  end do
  do i = n/2-3, n-3
    b[i+1] = b[i+1] + b[i] * a[i+3]
  end do

Replacement Transformations (these change everything)

Reduction Recognition
Before:
  do i = 1, n
    s = s + a[i]
  end do

After:
  real TS[64]
  TS[1:64] = 0.0
  do TI = 1, n, 64
    TS[1:64] = TS[1:64] + a[TI:TI+63]
  end do
  do TI = 1, 64
    s = s + TS[TI]
  end do

Note: legal only if the operation is *fully* associative (watch out for FP ops).
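
A runnable C sketch of the same blocked reduction, with the leftover elements the slide glosses over (illustrative name):

  /* Blocked sum: 64 independent partial sums (vectorizable), then a short
     serial combine. Reordering the additions assumes associativity; for
     floating point this can change rounding. */
  float blocked_sum(const float *a, int n) {
      float ts[64] = {0.0f};
      int i = 0;
      for (; i + 64 <= n; i += 64)
          for (int k = 0; k < 64; k++)
              ts[k] += a[i + k];
      float s = 0.0f;
      for (; i < n; i++)                  /* leftover n mod 64 elements */
          s += a[i];
      for (int k = 0; k < 64; k++)
          s += ts[k];
      return s;
  }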

Array Statement Scalarization
What do you do when you can't vectorize in hardware?

Before:
  a[2:n-1] = a[2:n-1] + a[1:n-2]

After (wrong: the loop reads a[i-1] values it has already updated, while the array statement reads the originals):
  do i = 2, n-1
    a[i] = a[i] + a[i-1]
  end do

Array Statement Scalarization
Before:
  a[2:n-1] = a[2:n-1] + a[1:n-2]

After (wrong):
  do i = 2, n-1
    a[i] = a[i] + a[i-1]
  end do

After (right: a temporary preserves the original values):
  do i = 2, n-1
    T[i] = a[i] + a[i-1]
  end do
  do i = 2, n-1
    a[i] = T[i]
  end do

Array Statement Scalarization
After (right):
  do i = 2, n-1
    T[i] = a[i] + a[i-1]
  end do
  do i = 2, n-1
    a[i] = T[i]
  end do

After (even better: loop reversal removes the need for the temporary):
  do i = n-1, 2, -1
    a[i] = a[i] + a[i-1]
  end do

Array Statement Scalarization
However:
  a[2:n-1] = a[2:n-1] + a[1:n-2] + a[3:n]
must use a temporary, because:
  do i = 2, n-1
    a[i] = a[i] + a[i-1] + a[i+1]
  end do
has an antidependence in either iteration direction.

Memory Access Transformations (love your DRAM)

Array Padding
Before:
  real a[8,512]
  do i = 1, 512
    a[1,i] = a[1,i] + c
  end do

After:
  real a[9,512]
  do i = 1, 512
    a[1,i] = a[1,i] + c
  end do

Note: assumes 8 banks of memory; similar considerations apply to cache and TLB sets.

Scalar Expansion
Converts scalars to vectors, removing the antidependences caused by scalar temporaries.
Must be done when vectorizing.

Scalar Expansion
Before:
  do i = 1, n
    c = b[i]
    a[i] = a[i] + c
  end do

After:
  real T[n]
  do all i = 1, n
    T[i] = b[i]
    a[i] = a[i] + T[i]
  end do all

Array Contraction
Before:
  real T[n,n]
  do i = 1, n
    do all j = 1, n
      T[i,j] = a[i,j]*3
      b[i,j] = T[i,j] + b[i,j]/T[i,j]
    end do all
  end do

After:
  real T[n]
  do i = 1, n
    do all j = 1, n
      T[j] = a[i,j]*3
      b[i,j] = T[j] + b[i,j]/T[j]
    end do all
  end do

Scalar Replacement
Before:
  do i = 1, n
    do j = 1, n
      total[i] = total[i] + a[i,j]
    end do
  end do

After:
  do i = 1, n
    T = total[i]
    do j = 1, n
      T = T + a[i,j]
    end do
    total[i] = T
  end do

Transformations for Parallel Machines (sharing the load)

Guard Introduction
Before:
  do i = 1, n
    a[i] = a[i] + c
    b[i] = b[i] + c
  end do

After:
  LBA = (n/Pnum)*Pid + 1
  UBA = (n/Pnum)*(Pid + 1)
  LBB = (n/Pnum)*Pid + 1
  UBB = (n/Pnum)*(Pid + 1)
  do i = 1, n
    if (LBA <= i .and. i <= UBA) a[i] = a[i] + c
    if (LBB <= i .and. i <= UBB) b[i] = b[i] + c
  end do

Redundant Guard Elimination
Before:
  LBA = (n/Pnum)*Pid + 1
  UBA = (n/Pnum)*(Pid + 1)
  LBB = (n/Pnum)*Pid + 1
  UBB = (n/Pnum)*(Pid + 1)
  do i = 1, n
    if (LBA <= i .and. i <= UBA) a[i] = a[i] + c
    if (LBB <= i .and. i <= UBB) b[i] = b[i] + c
  end do

After:
  LB = (n/Pnum)*Pid + 1
  UB = (n/Pnum)*(Pid + 1)
  do i = 1, n
    if (LB <= i .and. i <= UB) then
      a[i] = a[i] + c
      b[i] = b[i] + c
    end if
  end do

Bounds Reduction
Before:
  LB = (n/Pnum)*Pid + 1
  UB = (n/Pnum)*(Pid + 1)
  do i = 1, n
    if (LB <= i .and. i <= UB) then
      a[i] = a[i] + c
      b[i] = b[i] + c
    end if
  end do

After:
  LB = (n/Pnum)*Pid + 1
  UB = (n/Pnum)*(Pid + 1)
  do i = LB, UB
    a[i] = a[i] + c
    b[i] = b[i] + c
  end do
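
A 0-based C sketch of the reduced-bounds worker; unlike the slide's version, which assumes Pnum divides n, this one lets the last thread pick up the n mod Pnum remainder (names are illustrative):

  /* Each thread pid of pnum touches only its own block, with no guard. */
  void my_block(float *a, float *b, float c, int n, int pid, int pnum) {
      int lb = (n / pnum) * pid;
      int ub = (pid == pnum - 1) ? n : (n / pnum) * (pid + 1);
      for (int i = lb; i < ub; i++) {
          a[i] += c;
          b[i] += c;
      }
  }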

That's all folks!