Download presentation
Presentation is loading. Please wait.
Published byCalvin Lucas Modified over 9 years ago
1
Improving Locality through Loop Transformations Copyright 2011, Keith D. Cooper & Linda Torczon, all rights reserved. Students enrolled in Comp 512 at Rice University have explicit permission to make copies of these materials for their personal use. Faculty from other educational institutions may use these materials for nonprofit educational purposes, provided this copyright notice is preserved. Comp 512 Spring 2011 1COMP 512, Rice University
2
2 The Opportunities Compilers have always focused on loops Higher execution counts Repeated, related operations Much of real work takes place in loops ( linear algebra ) Several effects to attack Overhead Decrease control-structure cost per iteration Locality Spatial locality use of co-resident data Temporal locality reuse of same data Parallelism Move loop w/independent iterations to inner (outer) position Remember Fortran H *
3
COMP 512, Rice University3 Eliminating Overhead Loop unrolling (the oldest trick in the book) To reduce overhead, replicate the loop body Sources of Improvement Less overhead per useful operation Longer basic blocks for local optimization do i = 1 to 100 by 1 a(i) = a(i) + b(i) end do i = 1 to 100 by 4 a(i) = a(i) + b(i) a(i+ 1 ) = a(i+ 1 ) + b(i+ 1 ) a(i+ 2 ) = a(i+ 2 ) + b(i+ 2 ) a(i+ 3 ) = a(i+ 3 ) + b(i+ 3 ) end becomes (unroll by 4)
4
COMP 512, Rice University4 Eliminating Overhead Loop unrolling with unknown bounds Generate guard loops While loop needs an explicit update Used in this form in the B LAS & in BitBlt do i = 1 to n by 1 a(i) = a(i) + b(i) end i = 1 do while (i+ 3 < n) a(i) = a(i) + b(i) a(i+ 1 ) = a(i+ 1 ) + b(i+ 1 ) a(i+ 2 ) = a(i+ 2 ) + b(i+ 2 ) a(i+ 3 ) = a(i+ 3 ) + b(i+ 3 ) i = i + 4 end do while (i < n) a(i) = a(i) + b(i) i = i + 1 end becomes (unroll by 4)
5
COMP 512, Rice University5 Eliminating Overhead One other use for loop unrolling Eliminate copies at the end of a loop More complex cases Multiple cross-iteration copy cycles use LCM of cycle lengths Result has been rediscovered many times [ Ken’s thesis ] t 1 = b( 0 ) do i = 1 to 100 t 2 = b( i ) a(i) = a(i) + t 1 + t 2 t 1 = t 2 end becomes (unroll + rename) t 1 = b( 0 ) do i = 1 to 100 by 2 t 2 = b( i ) a(i) = a(i) + t 1 + t 2 t 1 = b( i+ 1 ) a(i+ 1 ) = a(i+ 1 ) + t 2 + t 1 end
6
COMP 512, Rice University6 Eliminating Overhead Loop unswitching Hoist invariant control-flow out of loop nest Replicate the loop & specialize it No tests, branches in loop body Longer segments of straight-line code Does this happen in real code? If so, its worth doing do i = 1 to 100 a(i) = a(i) + b(i) if (expression) then d(i) = 0 end becomes (unswitch) if (expression) then do i = 1 to 100 a(i) = a(i) + b(i) d(i) = 0 end else do i = 1 to 100 a(i) = a(i) + b(i) end See also: Cytron, Lowry, & Zadeck in 13 th POPL (1986) *
7
COMP 512, Rice University7 Eliminating Overhead & Helping Locality Loop fusion Two loops over same iteration space one loop Advantages Fewer total operations ( statically & dynamically ) Longer basic blocks for local optimization & scheduling Can convert inter-loop reuse to intra-loop reuse do i = 1 to n c(i) = a(i) + b(i) end do j = 1 to n d(j) = a(j) * e(j) end becomes (fuse) do i = 1 to n c(i) = a(i) + b(i) d(i) = a(i) * e(i) end This is safe if it does not change the values used or defined by any statement in either loop * a(i) will be found in the cache For big enough arrays, a(i) will not be in the cache
8
COMP 512, Rice University8 Increasing Overhead & Helping Locality Loop distribution (or fission) Single loop with independent statements multiple loops Advantages Enables other transformations ( e.g., vectorization ) Each resulting loop has a smaller cache footprint More reuse hits in the cache do i = 1 to n a(i) = b(i) + c(i) end do i = 1 to n d(i) = e(i) * f(i) end do i = 1 to n g(i) = h(i) - k(i) end becomes (fission) do i = 1 to n a(i) = b(i) + c(i) d(i) = e(i) * f(i) g(i) = h(i) - k(i) end } Reads b & c Writes a } Reads e & f Writes d } Reads h & k Writes g { Reads b, c, e, f, h, & k Writes a, d, & g * It is safe if all the statements that form a cycle in the dependence graph end up in the same loop ( see COMP 515 from Ken )
9
COMP 512, Rice University9 do i = 1 to 50 do j = 1 to 100 a(i,j) = b(i,j) * c(i,j) end do j = 1 to 100 do i = 1 to 50 a(i,j) = b(i,j) * c(i,j) end becomes (interchange) Reordering Loops for Locality Loop interchange Swap inner & outer loops to rearrange iteration space Effect Improves reuse by using more elements per cache line Goal is to get as much reuse into inner loop as possible After interchange, direction of Iteration is changed cache line Runs down cache line In row-major order, the opposite loop ordering causes the same effects In Fortran’s column-major order, a(4,4) would lay out as cache line As little as 1 used element per line *
10
COMP 512, Rice University10 Reordering Loops for Locality Loop permutation Interchange is degenerate case Two perfectly nested loops More general problem is called permutation Safety Permutation is safe iff no data dependences are reversed The flow of data from definitions to uses is preserved Effects Change order of access & order of computation Move accesses closer in time increase temporal locality Move computations farther apart cover pipeline latencies
11
COMP 512, Rice University11 Reordering Loops for Locality Strip mining Splits a loop into two loops Effects ( may slow down the code ) Used to match loop bounds to vector length Used as prelude to loop permutation ( for tiling ) do j = 1 to 100 do i = 1 to 50 a(i,j) = b(i,j) * c(i,j) end becomes (strip mine) do j = 1 to 100 do ii = 1 to 50 by 8 do i = ii to min(ii+ 7, 50 ) a(i,j) = b(i,j) * c(i,j) end This is always safe
12
COMP 512, Rice University12 Reordering Loops for Locality Loop tiling (or blocking) Combination of strip mining and interchange Effects Reduces volume of data between reuses Works on one “tile” at a time ( tile size is tk by tj ) Choice of tile size is crucial do i = 1 to m do k = 1 to n do j = 1 to n c(j,i) = c(j,i) + a(k,i) * b(j,k) end do kk = 1 to n by tk do jj = 1 to n by tj do i = 1 to m do k = kk to min(kk+tk-1,n) do j = jj to min(jj+tj-1,n) c(j,i) = c(j,i) + a(k,i) * b(j,k) end becomes (tiling) Interchange must be safe Strip mine & interchange
13
COMP 512, Rice University13 Rewriting Loops for Better Register Allocation Scalar Replacement Allocators never keep c(i) in a register We can trick the allocator by rewriting the references The plan Locate patterns of consistent reuse Make loads and stores use temporary scalar variable Replace references with temporary’s name May need copies at end of loop to keep reused values straight If reuse spans more than one iteration, need to “pipeline” it
14
COMP 512, Rice University14 Rewriting Loops for Better Register Allocation Scalar Replacement Effects Decreases number of loads and stores Keeps reused values in names that can be allocated to registers In essence, this exposes the reuse of a(i) to subsequent passes do i = 1 to n do j = 1 to n a(i) = a(i) + b(j) end do i = 1 to n t = a(i) do j = 1 to n t = t + b(j) end a(i) = t end becomes (scalar replacement) Almost any register allocator can get t into a register
15
COMP 512, Rice University15 Rewriting Loops for Better Register Allocation What if we are not in Fortran? What about C? Register Promotion ( PLDI: Lu 1997, Sastry & Ju 1998, Chow et al 1998 ) Promote pointer-based references into scalar temporaries Requires data-flow information on pointers + a transformation Equivalent of scalar promotion for pointer-based values Lu’s formulation Perform interprocedural analysis to disambiguate pointers Find loops & solve intraprocedural problem for each loop Rewrite code based on results of analysis His work relied on ILOC’s memory tags
16
COMP 512, Rice University16 Rewriting Loops for Better Register Allocation Register Promotion Every memory operation has a set of tags -- textual names that describe which memory locations it can address | tag set | = 1 means the reference is unambiguous | tag set | > 1 means the reference is ambiguous Interprocedural pointer analysis shrinks tag sets … Initial data-flow information B_Explicit(b) contains all tags referenced by an explicit memory operation in b B_Ambiguous(b) contains all tags referenced by a memory operation with multiple tags or by a procedure call in b Algorithm computes B_Explicit and B_Ambiguous for each block
17
COMP 512, Rice University17 Rewriting Loops for Better Register Allocation Register Promotion Solve the equations For each promotable tag, t, in a loop: Create a virtual register, v Rewrite each reference to t with a copy from v L_Explicit( L ) = b in loop L B_Explicit(b) L_Ambiguous( L ) = b in loop L B_Ambiguous(b) L_Promotable( L ) = L_Explicit( L ) - L_Ambiguous( L ) L_Lift( L ) = L_Promotable( L ) if L is an outermost loop L_Promotable( L ) - L_Promotable( S ), S surrounds L Loop-by-loop approach motivated by fact that we really care about loops more than other parts of the program
18
COMP 512, Rice University18 Rewriting Loops for Better Register Allocation Register Promotion - Example from the paper With this simple scheme 0 to 16% of loads removed in test codes 0 to 50% of stores removed in test codes Other authors tried PRE-based extensions to Lu’s work for (i= 0; i < DIM_X; i++) { B[i] = 0; for (j=0; j < DIM_Y; j++) { B[i] += A[i][j]; } for (i= 0; i < DIM_X; i++) { r b = 0; for (j=0; j < DIM_Y; j++) { r b += A[i][j] ; } B[i] = r b ; } B has 1 tag
19
COMP 512, Rice University19 Balancing Memory Bound Loops Balance is the ratio of memory accesses to flops Machine balance is M = Loop balance is L = Loops run better if they are balanced or compute bound If L > M, the loop is memory bound If L = M, the loop is balanced If L < M, the loop is compute bound Making memory bound loops into balanced or compute-bound loops Need more reuse to decrease number of accesses Combine scalar replacement, unrolling, and fusion accesses/cycle flops/cycle accesses/iteration flops/iteration
20
COMP 512, Rice University20 Rewriting Loops for Better Register Allocation Example of Loop Balance do i = 1 to n do j = 1 to n a(i) = a(i) + b(j) end Original loop nest do i = 1 to n t = a(i) do j = 1 to n t = t + b(j) end a(i) = t end After scalar replacement L = 3 accesses/iteration 1 flop/iteration L = 1 access/iteration 1 flop/iteration As memory accesses cost more, this gets better !
21
COMP 512, Rice University21 Balancing Memory Loops Unroll and Jam Unroll the outer loop Fuse the resulting inner loops Effect Increases reuse in the inner loop * Decreases overhead, too Combine with scalar replacement for full benefits do i = 1 to n do j = 1 to n a(i) = a(i) + b(j) end do i = 1 to n by 2 do j = 1 to n a(i) = a(i) + b(j) a(i+ 1 ) = a(i +1) + b(j) end becomes (unroll & jam) *
22
1 access 2 flops Impact Original loop had L = Scalar replacement got L down to Unroll & jam + scalar replacement produced L = 3 accesses/iteration 1 flop/iteration 1 access/iteration 1 flop/iteration 1 access/iteration 2 flop/iteration COMP 512, Rice University22 Balancing Memory Loops Adding Scalar Replacement Rewrites all three cases of reuse do i = 1 to n by 2 t 1 = a(i) t 2 = a(i+ 1 ) do j = 1 to n t 3 = b(j) t 1 = t 1 + t 3 t 2 = t 2 + t 3 end a(i) = t 1 a(i+ 1 ) = t 2 end becomes (scalar replacement) do i = 1 to n by 2 do j = 1 to n a(i) = a(i) + b(j) a(i+ 1 ) = a(i +1) + b(j) end captures the reuse
23
COMP 512, Rice University23 Balancing Memory Loops The Big Picture Scalar replacement + unroll & jam helps Factors of 2 to 6 for matrix multiply (matrix300) What does it take to do this in a real compiler? Data dependence analysis (see COMP 515) Method to discover consistent reuse patterns (see Carr) Way to choose the unroll amount Constrained by available registers Need heuristics to predict allocator’s behavior
Similar presentations
© 2025 SlidePlayer.com. Inc.
All rights reserved.