Improving Locality through Loop Transformations
COMP 512, Rice University, Spring 2011

Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved.

The Opportunities

Compilers have always focused on loops:
- Higher execution counts
- Repeated, related operations
- Much of the real work takes place in loops (linear algebra)

Several effects to attack:
- Overhead: decrease the control-structure cost per iteration
- Locality:
  - Spatial locality: use of co-resident data
  - Temporal locality: reuse of the same data
- Parallelism: move a loop with independent iterations to the inner (or outer) position

Remember Fortran H.

Eliminating Overhead

Loop unrolling (the oldest trick in the book): to reduce overhead, replicate the loop body.

    do i = 1 to 100 by 1
      a(i) = a(i) + b(i)
    end

becomes (unroll by 4)

    do i = 1 to 100 by 4
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
    end

Sources of improvement:
- Less overhead per useful operation
- Longer basic blocks for local optimization
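
The same transformation rendered in C, as a minimal sketch (the function wrapper and names are ours; the trip count of 100 is a multiple of 4, so no cleanup code is needed):

    /* Unroll by 4: one test and one increment now serve
       four useful operations. */
    void add4(double a[100], const double b[100]) {
        for (int i = 0; i < 100; i += 4) {
            a[i]   += b[i];
            a[i+1] += b[i+1];
            a[i+2] += b[i+2];
            a[i+3] += b[i+3];
        }
    }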

Eliminating Overhead

Loop unrolling with unknown bounds:
- Generate guard loops
- The while loop needs an explicit update
- Used in this form in the BLAS & in BitBlt

    do i = 1 to n by 1
      a(i) = a(i) + b(i)
    end

becomes (unroll by 4)

    i = 1
    do while (i+3 <= n)
      a(i)   = a(i)   + b(i)
      a(i+1) = a(i+1) + b(i+1)
      a(i+2) = a(i+2) + b(i+2)
      a(i+3) = a(i+3) + b(i+3)
      i = i + 4
    end
    do while (i <= n)
      a(i) = a(i) + b(i)
      i = i + 1
    end
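
A C rendering of the same idea, as a hedged sketch (names are ours):

    /* Unrolled main loop plus a guard (cleanup) loop for the
       leftover 0-3 iterations when n is not a multiple of 4. */
    void addn(double *a, const double *b, int n) {
        int i = 0;
        while (i + 3 < n) {          /* i+3 is the last index used */
            a[i]   += b[i];
            a[i+1] += b[i+1];
            a[i+2] += b[i+2];
            a[i+3] += b[i+3];
            i += 4;
        }
        while (i < n) {              /* guard loop */
            a[i] += b[i];
            i += 1;
        }
    }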

Eliminating Overhead

One other use for loop unrolling: eliminating copies at the end of a loop.

    t1 = b(0)
    do i = 1 to 100
      t2 = b(i)
      a(i) = a(i) + t1 + t2
      t1 = t2
    end

becomes (unroll + rename)

    t1 = b(0)
    do i = 1 to 100 by 2
      t2 = b(i)
      a(i) = a(i) + t1 + t2
      t1 = b(i+1)
      a(i+1) = a(i+1) + t2 + t1
    end

More complex cases:
- Multiple cross-iteration copy cycles: unroll by the LCM of the cycle lengths
- The result has been rediscovered many times [Ken's thesis]
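
The same trick in C, as a minimal sketch (array sizes and names are ours, mirroring the slide's 1-based bounds):

    /* Unroll by 2 and rename: the copy t1 = t2 at the bottom of
       the original loop disappears because the two unrolled
       bodies swap the roles of t1 and t2. */
    void pairwise(double a[101], const double b[101]) {
        double t1 = b[0], t2;
        for (int i = 1; i <= 100; i += 2) {
            t2 = b[i];
            a[i] += t1 + t2;
            t1 = b[i + 1];
            a[i + 1] += t2 + t1;
        }
    }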

Eliminating Overhead

Loop unswitching: hoist invariant control flow out of the loop nest.
- Replicate the loop & specialize it
- No tests or branches in the loop body
- Longer segments of straight-line code

    do i = 1 to 100
      a(i) = a(i) + b(i)
      if (expression) then d(i) = 0
    end

becomes (unswitch)

    if (expression) then
      do i = 1 to 100
        a(i) = a(i) + b(i)
        d(i) = 0
      end
    else
      do i = 1 to 100
        a(i) = a(i) + b(i)
      end

Does this happen in real code? If so, it's worth doing.

See also: Cytron, Lowry, & Zadeck in 13th POPL (1986)
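
In C, the same unswitching looks like this (a sketch; 'flag' stands in for the slide's loop-invariant expression):

    /* The invariant test moves out of the loop; each specialized
       copy of the loop body is branch-free. */
    void unswitched(double *a, const double *b, double *d, int n, int flag) {
        if (flag) {
            for (int i = 0; i < n; i++) {
                a[i] += b[i];
                d[i] = 0.0;
            }
        } else {
            for (int i = 0; i < n; i++)
                a[i] += b[i];
        }
    }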

Eliminating Overhead & Helping Locality

Loop fusion: two loops over the same iteration space become one loop.

Advantages:
- Fewer total operations (statically & dynamically)
- Longer basic blocks for local optimization & scheduling
- Can convert inter-loop reuse to intra-loop reuse

    do i = 1 to n
      c(i) = a(i) + b(i)
    end
    do j = 1 to n
      d(j) = a(j) * e(j)
    end

becomes (fuse)

    do i = 1 to n
      c(i) = a(i) + b(i)
      d(i) = a(i) * e(i)
    end

In the separate loops, for big enough arrays, a(i) will not be in the cache on its second use; in the fused loop, a(i) will be found in the cache.

Fusion is safe if it does not change the values used or defined by any statement in either loop.
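
A C sketch of the fused form (names are ours):

    /* After fusion, the second use of a[i] hits in cache (or a
       register) instead of missing after the first loop has
       swept the entire array. */
    void fused(const double *a, const double *b, double *c,
               double *d, const double *e, int n) {
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
            d[i] = a[i] * e[i];   /* intra-loop reuse of a[i] */
        }
    }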

Increasing Overhead & Helping Locality

Loop distribution (or fission): a single loop with independent statements becomes multiple loops.

Advantages:
- Enables other transformations (e.g., vectorization)
- Each resulting loop has a smaller cache footprint, so more reuses hit in the cache

    do i = 1 to n                  { reads b, c, e, f, h, & k; writes a, d, & g }
      a(i) = b(i) + c(i)
      d(i) = e(i) * f(i)
      g(i) = h(i) - k(i)
    end

becomes (fission)

    do i = 1 to n                  { reads b & c; writes a }
      a(i) = b(i) + c(i)
    end
    do i = 1 to n                  { reads e & f; writes d }
      d(i) = e(i) * f(i)
    end
    do i = 1 to n                  { reads h & k; writes g }
      g(i) = h(i) - k(i)
    end

Fission is safe if all the statements that form a cycle in the dependence graph end up in the same loop (see COMP 515 from Ken).

Reordering Loops for Locality

Loop interchange: swap the inner & outer loops to rearrange the iteration space.

    do i = 1 to 50
      do j = 1 to 100
        a(i,j) = b(i,j) * c(i,j)
      end
    end

becomes (interchange)

    do j = 1 to 100
      do i = 1 to 50
        a(i,j) = b(i,j) * c(i,j)
      end
    end

Effect:
- Improves reuse by using more elements per cache line
- The goal is to get as much reuse into the inner loop as possible

In Fortran's column-major order, the original nest may use as little as one element per cache line; after interchange, the direction of iteration changes and the inner loop runs down a cache line. In row-major order, the opposite loop ordering causes the same effects.

[Figure: layout of a(4,4) across cache lines, before and after interchange]
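
The C analogue, as a sketch (C is row-major, so the roles of the two orders are mirrored):

    /* In C, a[i][j] and a[i][j+1] are adjacent, so the j loop
       belongs innermost; with i innermost each access would
       stride by 100 doubles and touch a new cache line. */
    void scale(double a[50][100], double b[50][100], double c[50][100]) {
        for (int i = 0; i < 50; i++)
            for (int j = 0; j < 100; j++)    /* stride-1 inner loop */
                a[i][j] = b[i][j] * c[i][j];
    }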

Reordering Loops for Locality

Loop permutation:
- Interchange is the degenerate case: two perfectly nested loops
- The more general problem is called permutation

Safety: a permutation is safe iff no data dependences are reversed, i.e., the flow of data from definitions to uses is preserved.

Effects:
- Changes the order of access & the order of computation
- Moves accesses closer in time: increases temporal locality
- Moves computations farther apart: covers pipeline latencies
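
To make the safety condition concrete, here is an illustrative C nest of our own devising where interchange is illegal:

    /* Iteration (i,j) reads the value written by iteration
       (i-1, j+1): a dependence with distance vector (1,-1).
       After interchanging i and j, the reader would run before
       the writer, reversing the dependence, so the permutation
       is unsafe for this nest. */
    void wavefront(double a[100][100]) {
        for (int i = 1; i < 100; i++)
            for (int j = 0; j < 99; j++)
                a[i][j] = a[i-1][j+1] + 1.0;
    }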

Reordering Loops for Locality

Strip mining: splits a loop into two loops.

    do j = 1 to 100
      do i = 1 to 50
        a(i,j) = b(i,j) * c(i,j)
      end
    end

becomes (strip mine)

    do j = 1 to 100
      do ii = 1 to 50 by 8
        do i = ii to min(ii+7, 50)
          a(i,j) = b(i,j) * c(i,j)
        end
      end
    end

Effects (may slow down the code):
- Used to match loop bounds to vector length
- Used as a prelude to loop permutation (for tiling)

Strip mining is always safe.
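
A one-dimensional C sketch (the strip length of 8 matches the slide; names are ours):

    /* Strip mining alone just splits the iteration space into
       strips; its payoff comes when the strip loop is later
       moved outward for tiling or matched to a vector length. */
    void strip8(double *a, const double *b, int n) {
        for (int ii = 0; ii < n; ii += 8) {
            int hi = (ii + 8 < n) ? ii + 8 : n;   /* min(ii+8, n) */
            for (int i = ii; i < hi; i++)
                a[i] *= b[i];
        }
    }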

Reordering Loops for Locality

Loop tiling (or blocking): a combination of strip mining and interchange.

    do i = 1 to m
      do k = 1 to n
        do j = 1 to n
          c(j,i) = c(j,i) + a(k,i) * b(j,k)
        end
      end
    end

becomes (tiling: strip mine & interchange)

    do kk = 1 to n by tk
      do jj = 1 to n by tj
        do i = 1 to m
          do k = kk to min(kk+tk-1, n)
            do j = jj to min(jj+tj-1, n)
              c(j,i) = c(j,i) + a(k,i) * b(j,k)
            end
          end
        end
      end
    end

Effects:
- Reduces the volume of data touched between reuses; works on one "tile" at a time (the tile is tk by tj)
- The choice of tile size is crucial
- The interchange must be safe
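
For comparison, a tiled matrix multiply in C (row-major indexing, so the subscripts are transposed relative to the slide; TK and TJ are assumed tile sizes, and c must be zeroed first):

    #define TK 32
    #define TJ 32

    /* Each (kk,jj) tile of b stays cache-resident while every
       row of c and a streams past it. */
    void matmul_tiled(int m, int n, double c[m][n],
                      const double a[m][n], const double b[n][n]) {
        for (int kk = 0; kk < n; kk += TK) {
            int kmax = (kk + TK < n) ? kk + TK : n;   /* min(kk+TK, n) */
            for (int jj = 0; jj < n; jj += TJ) {
                int jmax = (jj + TJ < n) ? jj + TJ : n;
                for (int i = 0; i < m; i++)
                    for (int k = kk; k < kmax; k++)
                        for (int j = jj; j < jmax; j++)
                            c[i][j] += a[i][k] * b[k][j];
            }
        }
    }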

Rewriting Loops for Better Register Allocation

Scalar replacement: register allocators never keep a subscripted reference like c(i) in a register, but we can trick the allocator by rewriting the references.

The plan:
- Locate patterns of consistent reuse
- Make the loads and stores use a temporary scalar variable
- Replace the array references with the temporary's name
- May need copies at the end of the loop to keep reused values straight: if the reuse spans more than one iteration, we need to "pipeline" it

Rewriting Loops for Better Register Allocation

Scalar replacement:

    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end

becomes (scalar replacement)

    do i = 1 to n
      t = a(i)
      do j = 1 to n
        t = t + b(j)
      end
      a(i) = t
    end

Effects:
- Decreases the number of loads and stores
- Keeps reused values in names that can be allocated to registers
- In essence, this exposes the reuse of a(i) to subsequent passes
- Almost any register allocator can get t into a register
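
The C version, as a sketch:

    /* The accumulator for a[i] lives in a local scalar, so the
       inner loop does one load (b[j]) and no stores; a[i] is
       loaded and stored once per outer iteration. */
    void rowsum(double *a, const double *b, int n) {
        for (int i = 0; i < n; i++) {
            double t = a[i];
            for (int j = 0; j < n; j++)
                t += b[j];
            a[i] = t;
        }
    }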

Rewriting Loops for Better Register Allocation

What if we are not in Fortran? What about C?

Register promotion (PLDI: Lu 1997; Sastry & Ju 1998; Chow et al. 1998):
- Promotes pointer-based references into scalar temporaries
- Requires data-flow information on pointers, plus a transformation
- The equivalent of scalar replacement for pointer-based values

Lu's formulation:
- Perform interprocedural analysis to disambiguate pointers
- Find the loops & solve an intraprocedural problem for each loop
- Rewrite the code based on the results of the analysis
- His work relied on ILOC's memory tags

Rewriting Loops for Better Register Allocation

Register promotion: every memory operation has a set of tags, textual names that describe which memory locations it can address.
- |tag set| = 1 means the reference is unambiguous
- |tag set| > 1 means the reference is ambiguous
- Interprocedural pointer analysis shrinks the tag sets

Initial data-flow information:
- B_Explicit(b) contains all tags referenced by an explicit memory operation in b
- B_Ambiguous(b) contains all tags referenced by a memory operation with multiple tags, or by a procedure call, in b

The algorithm computes B_Explicit and B_Ambiguous for each block.

Rewriting Loops for Better Register Allocation

Register promotion: solve the equations

    L_Explicit(L)   = ∪ { B_Explicit(b)  : b in loop L }
    L_Ambiguous(L)  = ∪ { B_Ambiguous(b) : b in loop L }
    L_Promotable(L) = L_Explicit(L) - L_Ambiguous(L)
    L_Lift(L)       = L_Promotable(L)                     if L is an outermost loop
                    = L_Promotable(L) - L_Promotable(S)   where S is the loop surrounding L

For each promotable tag t in a loop:
- Create a virtual register v
- Rewrite each reference to t with a copy from v

The loop-by-loop approach is motivated by the fact that we care more about loops than about other parts of the program.
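
A small sketch of how these sets might be represented and combined; the bit-vector encoding and the Loop struct are our assumptions for illustration, not Lu's data structures:

    #include <stdint.h>

    /* Illustrative only: at most 64 tags, one bit per tag. */
    typedef uint64_t TagSet;

    typedef struct Loop {
        struct Loop *surrounding;  /* NULL for an outermost loop */
        TagSet explicit_tags;      /* L_Explicit:  union of B_Explicit(b)  */
        TagSet ambiguous_tags;     /* L_Ambiguous: union of B_Ambiguous(b) */
    } Loop;

    TagSet promotable(const Loop *L) {
        /* L_Promotable = L_Explicit - L_Ambiguous */
        return L->explicit_tags & ~L->ambiguous_tags;
    }

    TagSet lift(const Loop *L) {
        TagSet p = promotable(L);
        /* Lift only what the surrounding loop could not already promote. */
        return L->surrounding ? (p & ~promotable(L->surrounding)) : p;
    }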

Rewriting Loops for Better Register Allocation

Register promotion: an example from the paper. B has one tag, so the reference B[i] is unambiguous and promotable:

    for (i = 0; i < DIM_X; i++) {
      B[i] = 0;
      for (j = 0; j < DIM_Y; j++)
        B[i] += A[i][j];
    }

becomes

    for (i = 0; i < DIM_X; i++) {
      rb = 0;
      for (j = 0; j < DIM_Y; j++)
        rb += A[i][j];
      B[i] = rb;
    }

With this simple scheme:
- 0 to 16% of the loads were removed in the test codes
- 0 to 50% of the stores were removed in the test codes

Other authors tried PRE-based extensions to Lu's work.

Balancing Memory-Bound Loops

Balance is the ratio of memory accesses to flops:
- Machine balance: β_M = accesses per cycle / flops per cycle
- Loop balance:    β_L = accesses per iteration / flops per iteration

Loops run better if they are balanced or compute bound:
- If β_L > β_M, the loop is memory bound
- If β_L = β_M, the loop is balanced
- If β_L < β_M, the loop is compute bound

Making memory-bound loops into balanced or compute-bound loops:
- We need more reuse, to decrease the number of accesses
- Combine scalar replacement, unrolling, and fusion

Rewriting Loops for Better Register Allocation

Example of loop balance:

    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end

Original loop nest: β_L = 3 accesses / 1 flop per iteration (load a(i), load b(j), store a(i)).

    do i = 1 to n
      t = a(i)
      do j = 1 to n
        t = t + b(j)
      end
      a(i) = t
    end

After scalar replacement: β_L = 1 access / 1 flop per iteration. As memory accesses cost more, this gets better!

Balancing Memory Loops

Unroll and jam:
- Unroll the outer loop
- Fuse (jam) the resulting inner loops

    do i = 1 to n
      do j = 1 to n
        a(i) = a(i) + b(j)
      end
    end

becomes (unroll & jam)

    do i = 1 to n by 2
      do j = 1 to n
        a(i)   = a(i)   + b(j)
        a(i+1) = a(i+1) + b(j)
      end
    end

Effects:
- Increases reuse in the inner loop: each loaded b(j) is now used twice
- Decreases overhead, too
- Combine with scalar replacement for the full benefit

Balancing Memory Loops

Adding scalar replacement rewrites all three cases of reuse:

    do i = 1 to n by 2
      do j = 1 to n
        a(i)   = a(i)   + b(j)
        a(i+1) = a(i+1) + b(j)
      end
    end

becomes (scalar replacement)

    do i = 1 to n by 2
      t1 = a(i)
      t2 = a(i+1)
      do j = 1 to n
        t3 = b(j)          { t3 captures the reuse of b(j) }
        t1 = t1 + t3
        t2 = t2 + t3
      end
      a(i) = t1
      a(i+1) = t2
    end

Impact:
- The original loop had β_L = 3 accesses / 1 flop per iteration
- Scalar replacement alone got β_L down to 1 access / 1 flop
- Unroll & jam + scalar replacement produced β_L = 1 access / 2 flops
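
The final, balanced loop in C, as a sketch (assumes n is even; a cleanup iteration would handle odd n):

    /* Unroll-and-jam by 2 plus scalar replacement: the inner
       loop performs one load of b[j] for two flops, so beta_L
       falls from 3 accesses per flop to 1 access per 2 flops. */
    void balanced(double *a, const double *b, int n) {
        for (int i = 0; i < n; i += 2) {
            double t1 = a[i], t2 = a[i + 1];
            for (int j = 0; j < n; j++) {
                double t3 = b[j];   /* the only memory access in the body */
                t1 += t3;
                t2 += t3;
            }
            a[i] = t1;
            a[i + 1] = t2;
        }
    }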

Balancing Memory Loops

The big picture: scalar replacement + unroll & jam helps, by factors of 2 to 6 on matrix multiply (matrix300).

What does it take to do this in a real compiler?
- Data dependence analysis (see COMP 515)
- A method to discover consistent reuse patterns (see Carr)
- A way to choose the unroll amounts:
  - Constrained by the available registers
  - Needs heuristics to predict the allocator's behavior