Improving Register Usage: Chapter 8, Section 8.5 to End. Omer Yehezkely

Agenda
- Last Lecture at a Glance
- Loop Interchange for Register Reuse
- Loop Fusion for Register Reuse
- Putting It All Together
- Complex Loop Nests
- Summary

Last lecture at a glance (1)
Assumption 1: Most compilers can handle register allocation for scalars (using a node-coloring algorithm); however, they do not know how to handle vectors.
Assumption 2: We are dealing with RISC processors: all CPU operations need their data in registers (except for load and store operations).
Assumption 3: Memory hierarchy: accessing a register is much faster than a cache hit, which is much faster than a cache miss that goes to main memory, which in turn is much faster than accessing virtual memory (the swap file).

Last lecture at a glance (2)
Therefore our strategy will be: apply transformations that “expose” vector entries as scalars, and then let the good old compiler do the register allocation. We benefit by avoiding unnecessary load/store operations.

Last lecture at a glance (3)
Example (Scalar Replacement):

Original:
  DO I = 1, N
    DO J = 1, M
      A(I) = A(I) + B(J)
    ENDDO
  ENDDO

After scalar replacement:
  DO I = 1, N
    T = A(I)
    DO J = 1, M
      T = T + B(J)
    ENDDO
    A(I) = T
  ENDDO

Last lecture at a glance (4)
Dependences to consider:
- True dependence:    A(I) = …  followed by  … = A(I)
- Output dependence:  A(I) = …  followed by  A(I) = …
- Antidependence:     … = A(I)  followed by  A(I) = …
- Input dependence:   … = A(I)  followed by  … = A(I)
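As a quick illustration (our own hedged sketch, not from the original slides), a single loop can exhibit all four kinds of dependence at once:

  DO I = 2, N
    A(I) = B(I) + 1.0       ! S1
    C(I) = A(I) + A(I-1)    ! S2: loop-independent true dependence on S1
                            !     via A(I); loop-carried true dependence
                            !     on S3 via A(I-1)
    A(I) = C(I) * 2.0       ! S3: output dependence with S1 (A(I) written
                            !     twice); antidependence with S2 (S2 reads
                            !     A(I) before S3 rewrites it)
    D(I) = B(I)             ! S4: input dependence with S1 (both read B(I))
  ENDDO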

Last lecture at a glance (5)
We should also consider loop-carried and loop-independent dependences. In general, the more dependences the merrier, because more dependences mean more opportunities for register reuse. We will use the dependences to decide whether and how to “expose” the vectors as scalars.

Last lecture at a glance (6)
We saw:
- Scalar Replacement (see the first example): this is the actual “exposure”.
- Unroll and Jam: unrolling of loops in order to bring dependences that are carried by an outer loop into the inner loop. This can benefit register reuse if we apply Scalar Replacement afterwards.

Last lecture at a glance (7)
Example (Unroll and Jam):

Original code:
  DO I = 1, N*2
    DO J = 1, M
      A(I) = A(I) + B(J)
    ENDDO
  ENDDO

Unroll and Jam:
  DO I = 1, N*2, 2
    DO J = 1, M
      A(I) = A(I) + B(J)
      A(I+1) = A(I+1) + B(J)
    ENDDO
  ENDDO

Scalar Replacement:
  DO I = 1, N*2, 2
    s0 = A(I)
    s1 = A(I+1)
    DO J = 1, M
      t = B(J)
      s0 = s0 + t
      s1 = s1 + t
    ENDDO
    A(I) = s0
    A(I+1) = s1
  ENDDO

Agenda
- Last Lecture at a Glance
- Loop Interchange for Register Reuse
- Loop Fusion for Register Reuse
- Putting It All Together
- Complex Loop Nests
- Summary

Loop Interchange (1)
Loop nesting is not always optimal with regard to register reuse. For example, on CPUs with no vector engines, the following code (matrix initialization):

  DO I = 2, N
    A(1:M, I) = A(1:M, I-1)
  ENDDO

will be converted into:

  DO I = 2, N
    DO J = 1, M
      A(J, I) = A(J, I-1)
    ENDDO
  ENDDO

Loop Interchange (2)
which will be implemented in the following way:

  DO I = 2, N
    DO J = 1, M
      R1 = A(J, I-1)
      A(J, I) = R1
    ENDDO
  ENDDO

This is not too clever, since it performs (N-1)*M load and store operations. If we change the order of the loops, we can get a better implementation.

Loop Interchange (3)

Original code:
  DO I = 2, N
    DO J = 1, M
      A(J, I) = A(J, I-1)
    ENDDO
  ENDDO

Loop interchange:
  DO J = 1, M
    DO I = 2, N
      A(J, I) = A(J, I-1)
    ENDDO
  ENDDO

Scalar replacement:
  DO J = 1, M
    R1 = A(J, 1)
    DO I = 2, N
      A(J, I) = R1
    ENDDO
  ENDDO

This implementation still requires (N-1)*M store operations (we cannot escape those), but it only requires M load operations, which can make the running time considerably shorter.

Loop Interchange (4)
Considerations for loop interchange:
- The basic idea is to move the loop that carries the most dependences to the innermost position.
- Register reuse for an outer loop usually cannot be achieved, due to limited register resources.
- We use the conventional direction matrix for the loop nest.

Loop Interchange (5)
Example:

  DO J = 1, N
    DO K = 1, N
      DO I = 1, 256
        A(I, J, K) = A(I, J-1, K) + A(I, J-1, K-1) + A(I, J, K-1)
      ENDDO
    ENDDO
  ENDDO

There are 3 true dependences, which result in the following direction matrix (columns J, K, I):

  <  =  =    (from A(I, J-1, K))
  <  <  =    (from A(I, J-1, K-1))
  =  <  =    (from A(I, J, K-1))

Loop Interchange (6)
Example (cont.): If we select the J loop to be the innermost we get:

  DO K = 1, N
    DO I = 1, 256
      DO J = 1, N
        A(I, J, K) = A(I, J-1, K) + A(I, J-1, K-1) + A(I, J, K-1)
      ENDDO
    ENDDO
  ENDDO

and after scalar replacement:

  DO K = 1, N
    DO I = 1, 256
      R1 = A(I, 0, K)
      DO J = 1, N
        R1 = R1 + A(I, J-1, K-1) + A(I, J, K-1)
        A(I, J, K) = R1
      ENDDO
    ENDDO
  ENDDO

We saved a load operation in each iteration. It is possible to interchange the two outer loops and get further optimization.

Loop Interchange (7)
Loop Interchange Algorithm:
1. Form the direction matrix for the loop nest and use it to identify the loops, other than the scalarization loop, that can legally be moved to the innermost position.
2. For each such loop L, let count(L) be the number of rows of the direction matrix that have “<” in the position corresponding to L and “=” in every other position.
3. Pick the loop L that maximizes the product of count(L) and the iteration count of loop L.
Some assumptions must be made when the bounds of a loop are unknown at compile time. Loop interchange should also be weighed against cache efficiency (next chapter). A sketch of step 2 appears below.
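To make step 2 concrete, here is a minimal hedged sketch of ours (not code from the slides) that computes count(L) over the direction matrix of the previous example:

  PROGRAM INTERCHANGE_COUNTS
    IMPLICIT NONE
    INTEGER, PARAMETER :: NROWS = 3, NLOOPS = 3
    CHARACTER :: DM(NROWS, NLOOPS)   ! direction matrix; columns = loops J, K, I
    INTEGER :: CNT(NLOOPS), R, L, P
    LOGICAL :: OK
    DM(1,:) = (/ '<', '=', '=' /)    ! from A(I, J-1, K)
    DM(2,:) = (/ '<', '<', '=' /)    ! from A(I, J-1, K-1)
    DM(3,:) = (/ '=', '<', '=' /)    ! from A(I, J, K-1)
    DO L = 1, NLOOPS
      CNT(L) = 0
      DO R = 1, NROWS
        ! a row counts for loop L iff it is '<' in L's column
        ! and '=' in every other column
        OK = (DM(R, L) == '<')
        DO P = 1, NLOOPS
          IF (P /= L .AND. DM(R, P) /= '=') OK = .FALSE.
        ENDDO
        IF (OK) CNT(L) = CNT(L) + 1
      ENDDO
      PRINT *, 'count(loop', L, ') =', CNT(L)
    ENDDO
  END PROGRAM INTERCHANGE_COUNTS

Step 3 would then multiply each count by the loop's iteration count (N, N, and 256 here) and pick the largest product among the loops that can legally move inward.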

Loop Interchange (8)
(Figure: a numeric example of step 3. Each candidate loop's count is multiplied by its iteration count, e.g. 100 * 2 = 200 and 1,000 * 0 = 0; the loop with the largest product, here the outermost loop with 100 * 2, should become the innermost loop.)

Agenda
- Last Lecture at a Glance
- Loop Interchange for Register Reuse
- Loop Fusion for Register Reuse
- Putting It All Together
- Complex Loop Nests
- Summary

Loop Fusion (1)
Example: On CPUs with no vector engines, the following code:

  A(1:N) = C(1:N) + D(1:N)
  B(1:N) = C(1:N) - D(1:N)

will be transformed into:

  DO I = 1, N
    A(I) = C(I) + D(I)
  ENDDO
  DO I = 1, N
    B(I) = C(I) - D(I)
  ENDDO

Loop Fusion (2)
Using loop fusion (Chapter 6) we get:

  DO I = 1, N
    A(I) = C(I) + D(I)
    B(I) = C(I) - D(I)
  ENDDO

Using scalar replacement, we can save on the fetch time of C(I) and D(I):

  DO I = 1, N
    R1 = C(I)
    R2 = D(I)
    A(I) = R1 + R2
    B(I) = R1 - R2
  ENDDO

Loop Fusion (3)
Profitable Loop Fusion for Register Reuse
Just because a loop fusion is safe does not mean it is profitable. There are two cases where the fusion may be profitable:
- The fusion results in a loop-independent dependence (as we just saw).
- The fusion results in a forward loop-carried dependence.

Loop Fusion (4)
Example (forward loop-carried dependence):

  DO J = 1, N
    DO I = 1, M
      A(I,J) = C(I,J) + D(I,J)
    ENDDO
    DO I = 1, M
      B(I,J) = A(I,J-1) - E(I,J)
    ENDDO
  ENDDO

Fusion:
  DO J = 1, N
    DO I = 1, M
      A(I,J) = C(I,J) + D(I,J)
      B(I,J) = A(I,J-1) - E(I,J)
    ENDDO
  ENDDO

Loop Fusion (5)

Fusion:
  DO J = 1, N
    DO I = 1, M
      A(I,J) = C(I,J) + D(I,J)
      B(I,J) = A(I,J-1) - E(I,J)
    ENDDO
  ENDDO

Loop interchange:
  DO I = 1, M
    DO J = 1, N
      A(I,J) = C(I,J) + D(I,J)
      B(I,J) = A(I,J-1) - E(I,J)
    ENDDO
  ENDDO

Statement order reversal:
  DO I = 1, M
    DO J = 1, N
      B(I,J) = A(I,J-1) - E(I,J)
      A(I,J) = C(I,J) + D(I,J)
    ENDDO
  ENDDO

Scalar replacement:
  DO I = 1, M
    R1 = A(I, 0)
    DO J = 1, N
      B(I,J) = R1 - E(I,J)
      R1 = C(I,J) + D(I,J)
      A(I,J) = R1
    ENDDO
  ENDDO

Loop Fusion (6)
Loop Alignment for Fusion
Reminder: blocking dependences cause problems for loop fusion.

  DO I = 1, M
    DO J = 1, N
      A(J,I) = B(J,I)
    ENDDO
    DO J = 1, N
      C(J,I) = A(J+1,I)
    ENDDO
  ENDDO

We cannot simply fuse the two J loops, because doing so would introduce a backward-carried antidependence.

Loop Fusion (7) We can overcome this problem by aligning the loops: DO I = 1, M DO J = 0, N-1 A(J+1,I) = B(J,I+1) ENDDO DO J = 1, N C(J,I) = A(J+1,I) ENDDO We can now fuse the two loops on their common iteration range while peeling a single iteration from the beginning of the first loop and one iteration from the end of the second loop.

Loop Fusion (8)
Hence we get:

  DO I = 1, M
    A(1,I) = B(1,I)
    DO J = 1, N-1
      A(J+1,I) = B(J+1,I)
      C(J,I) = A(J+1,I)
    ENDDO
    C(N,I) = A(N+1,I)
  ENDDO

Scalar replacement:
  DO I = 1, M
    A(1,I) = B(1,I)
    DO J = 1, N-1
      R1 = B(J+1,I)
      A(J+1,I) = R1
      C(J,I) = R1
    ENDDO
    C(N,I) = A(N+1,I)
  ENDDO

Loop Fusion (9)
Definition: Let δ be a dependence between loops. The alignment threshold of δ is defined as follows:
- If δ is loop independent after merging, threshold(δ) = 0.
- If δ is forward carried after merging, threshold(δ) is the negative of the resulting dependence threshold.
- If δ is fusion preventing, threshold(δ) is the threshold of the merged dependence.
Aligning by the largest threshold allows fusion.

Loop Fusion (10)
Example:

  DO I = 1, N
    A(I) = B(I)
  ENDDO
  DO I = 1, N
    C(I) = A(I+1) + A(I-1)
  ENDDO

We have two dependences:
1. Forward carried with a threshold of 1, because of the reference A(I-1): alignment threshold of -1.
2. Backward carried with a threshold of 1, because of the reference A(I+1): alignment threshold of +1.

Loop Fusion (11)
Since (+1) > (-1), we should align by the larger alignment threshold, (+1), and so we get:

  DO I = 0, N-1
    A(I+1) = B(I+1)
  ENDDO
  DO I = 1, N
    C(I) = A(I+1) + A(I-1)
  ENDDO

From here we can proceed to fuse the loops and then “scalar replace” A(I+1).

Loop Fusion (12)
Fusion Mechanics
Assuming we have a collection of aligned loops, how do we fuse them?
1. Sort the lower bounds of the loops into a nondecreasing sequence L1, L2, …, Ln, and sort the upper bounds of the loops into a nondecreasing sequence H1, H2, …, Hn.
2. Produce a sequence of fusion loops with lower bounds L1, L2, …, L(n-1) and respective upper bounds L2 - 1, L3 - 1, …, Ln - 1.
3. Produce the central fused loop with a lower bound of Ln and an upper bound of H1.
4. Produce a sequence of fusion loops with lower bounds H1 + 1, H2 + 1, …, H(n-1) + 1 and respective upper bounds H2, H3, …, Hn.
A concrete two-loop sketch appears after the figure below.

Loop Fusion (13)
(Figure: three aligned loops, Loop 1, Loop 2, and Loop 3, with staggered bounds; each color marks one fusion loop produced by the mechanics above.)
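To ground the mechanics, here is a hedged two-loop sketch of ours (the loop bodies are illustrative, not from the slides; assume N >= 2). The aligned input loops are DO I = 1, N with body A(I) = B(I), and DO I = 2, N+1 with body C(I) = A(I-1); sorted lower bounds are {1, 2} and upper bounds {N, N+1}:

  SUBROUTINE FUSE_TWO(A, B, C, N)
    IMPLICIT NONE
    INTEGER :: N, I
    REAL :: A(N), B(N), C(N+1)
    ! Prologue loop (step 2): bounds L1 .. L2-1 = 1 .. 1;
    ! only the first loop's body runs here.
    DO I = 1, 1
      A(I) = B(I)
    ENDDO
    ! Central loop (step 3): bounds Ln .. H1 = 2 .. N; both bodies fused.
    DO I = 2, N
      A(I) = B(I)
      C(I) = A(I-1)
    ENDDO
    ! Epilogue loop (step 4): bounds H1+1 .. H2 = N+1 .. N+1;
    ! only the second loop's body runs here.
    DO I = N+1, N+1
      C(I) = A(I-1)
    ENDDO
  END SUBROUTINE FUSE_TWO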

Loop Fusion (14)
The Weighted Fusion Problem
The last thing to do is to form the collections of loops to be fused, and we need to do it in a profitable manner.

Example:
  L1: DO I = 1, 1000
        A(I) = B(I) + X(I)
      ENDDO
  L2: DO I = 1, 1000
        C(I) = A(I) + Y(I)
      ENDDO
  S:  Z = FOO(A(1:1000))
  L3: DO I = 1, 500
        A(I) = C(I) + Z
      ENDDO

(Figure: the fusion graph over L1, L2, S, and L3 with weighted edges, e.g. weight 1,000 between L1 and L2 and weight 500 toward L3.)

Loop Fusion (15)
Definition: A mixed-directed graph is a graph G = (V, E = Ed ∪ Eu) where (V, Ed) forms a directed graph, (V, Eu) forms an undirected graph, and Ed and Eu are disjoint. G is acyclic if (V, Ed) is acyclic. w is a successor or predecessor of v if it is such in (V, Ed); w is a neighbor of v if it is such in (V, Eu).

Loop Fusion (16)
Problem definition: Let G be an acyclic mixed-directed graph, W a weight function on E, B a set of bad vertices, and Eb a set of bad edges. The weighted loop fusion problem is the problem of finding vertex sets {V1, V2, …, Vn} such that:
- {V1, V2, …, Vn} partitions V.
- Each vertex set Vi either contains no bad vertices or consists of a single bad vertex.
- Given two vertices v and w in Vi, there is no path from v to w (in Ed) that leaves Vi.
- Given v and w in Vi, there is no bad edge between v and w.
- The induced graph on the vertex sets is acyclic.
The goal: maximize the total weight of the edges between vertices in the same vertex set.

Loop Fusion (17)
Unfortunately, the weighted fusion problem is NP-hard, so we have to resort to heuristic algorithms. A fast and simple one is the fast greedy algorithm for weighted fusion, which was developed by Kennedy.

The algorithm:
1. Initialize all the quantities and compute the initial successor, predecessor, and neighbor sets.
2. Topologically sort the vertices of the directed acyclic graph.
(Continued…)

Loop Fusion (18)
The algorithm (continued):
3. Process the vertices in V to compute, for each vertex v, the set pathFrom[v], which contains all vertices that can be reached by a path from v, and the set badPathFrom[v], the subset of pathFrom[v] containing the vertices that can be reached from v by a path that contains a bad vertex or a bad edge.
4. Invert the sets pathFrom and badPathFrom to produce the sets pathTo[v] and badPathTo[v] for each vertex v in the graph. The set pathTo[v] contains the vertices from which there is a path to v; the set badPathTo[v] contains the vertices from which v can be reached via a bad path.
(Continued…)

Loop Fusion (19) 5. Insert each of the edges into a priority queue edgeHeap by weight. 6. While edgeHeap is nonempty, select and remove the heaviest edge (v,w) from it. If w is in badPathFrom[v] then do not fuse – repeat step 6. Otherwise do the following: Collapse v, w, and every edge on the directed path between them. After each collapse, adjust the sets pathFrom, badPathFrom, pathTo, and badPathTo to reflect the new graph. That is, the composite node will now be reached from every vertex that reached a vertex in the composite, and it will reach any vertex that is reached by a vertex in the composite. After each vertex collapse, recompute successor, predecessor, and neighbor sets for the composite vertex, and recompute weights between the composite vertex and other vertices as appropriate. The running time of the algorithm is: O(EV + V 2 )

Loop Fusion (20)
(Figure: the fusion graph from the previous example, with its edge weights.) In the previous example the greedy algorithm will fuse L1 and L2, which is the optimal solution.

Loop Fusion (21)
However, the algorithm is not optimal. Consider the following example:
(Figure: a fusion graph on vertices a, b, c, d, e, and f with weighted edges and a bad vertex.)

Loop Fusion (22) Since the edge (a,f) is the heaviest, the greedy algorithm will fuse the vertices a,b,c,d,f together: ab c e d f Bad vertex 1 a This solution weight is 16.

Loop Fusion (23) However, fusing c,d,e,f and a,b produce a better result: ab c e d f Bad vertex 1 a This solution weight is 23.

Loop Fusion (24)
Multilevel Loop Fusion
When dealing with multiply-nested loops, the strategy is simple: first align and fuse the outermost loops, then recursively repeat the process for the bodies of the resulting loops. Starting by fusing the inner loops is at best inefficient, since we will not be able to fuse all of them; and if we insist on fusing them, we might generate wrong code, because the outer loops might need alignment, which would change the references in the inner loops. A small sketch follows.
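For instance (a hedged sketch of ours, not from the slides), given two nests that both traverse A:

  Original:
  DO J = 1, N
    DO I = 1, M
      A(I,J) = B(I,J) + 1.0
    ENDDO
  ENDDO
  DO J = 1, N
    DO I = 1, M
      C(I,J) = A(I,J) * 2.0
    ENDDO
  ENDDO

  First fuse the outer J loops:
  DO J = 1, N
    DO I = 1, M
      A(I,J) = B(I,J) + 1.0
    ENDDO
    DO I = 1, M
      C(I,J) = A(I,J) * 2.0
    ENDDO
  ENDDO

  Then recursively fuse the inner I loops, after which A(I,J) can be scalar-replaced:
  DO J = 1, N
    DO I = 1, M
      A(I,J) = B(I,J) + 1.0
      C(I,J) = A(I,J) * 2.0
    ENDDO
  ENDDO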

Agenda
- Last Lecture at a Glance
- Loop Interchange for Register Reuse
- Loop Fusion for Register Reuse
- Putting It All Together
- Complex Loop Nests
- Summary

Putting It All Together (1)
In which order should the transformations be applied? The recommended order is as follows:
1. Loop Interchange.
2. Loop Alignment and Fusion.
3. Unroll and Jam.
4. Scalar Replacement.
But why?

Putting It All Together (2)
1. Loop Interchange: fusion might interfere with loop interchange, so interchange should be done first.
2. Loop Alignment and Fusion: this can achieve extra reuse across loops.
3. Unroll and Jam: this can achieve outer-loop reuse when, after interchange is finished, there are dependences carried by loops other than the inner loop.
4. Scalar Replacement: as we already noted, this is the actual “exposure”, so it must be the last transformation.
A small end-to-end sketch appears below.
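As a small end-to-end illustration, here is a hedged sketch of ours (not an example from the slides); assume A has a column 0 holding initial values and that M is even:

  Original (the dependence on A is carried by the outer J loop):
  DO J = 1, N
    DO I = 1, M
      A(I,J) = A(I,J-1) + B(J)
    ENDDO
  ENDDO

  Step 1, loop interchange (move the dependence-carrying J loop inward; there is no fusion opportunity for step 2):
  DO I = 1, M
    DO J = 1, N
      A(I,J) = A(I,J-1) + B(J)
    ENDDO
  ENDDO

  Step 3, unroll and jam the I loop by 2 so both copies share B(J):
  DO I = 1, M, 2
    DO J = 1, N
      A(I,J) = A(I,J-1) + B(J)
      A(I+1,J) = A(I+1,J-1) + B(J)
    ENDDO
  ENDDO

  Step 4, scalar replacement (one load of B(J) and no loads of A in the inner loop):
  DO I = 1, M, 2
    T0 = A(I,0)
    T1 = A(I+1,0)
    DO J = 1, N
      S = B(J)
      T0 = T0 + S
      A(I,J) = T0
      T1 = T1 + S
      A(I+1,J) = T1
    ENDDO
  ENDDO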

Agenda
- Last Lecture at a Glance
- Loop Interchange for Register Reuse
- Loop Fusion for Register Reuse
- Putting It All Together
- Complex Loop Nests
- Summary

Complex Loop Nests (1)
Loops with If Statements
Consider the following example:

  DO I = 1, N
    IF (M(I) .LT. 0) THEN
      A(I) = B(I) + C
    ENDIF
    D(I) = A(I) + E
  ENDDO

Scalar replacement:
  DO I = 1, N
    IF (M(I) .LT. 0) THEN
      a0 = B(I) + C
      A(I) = a0
    ENDIF
    D(I) = a0 + E
  ENDDO

Error: a0 may not be initialized.

Complex Loop Nests (2)
We can overcome this problem in the following way:

  DO I = 1, N
    IF (M(I) .LT. 0) THEN
      a0 = B(I) + C
      A(I) = a0
    ELSE
      a0 = A(I)
    ENDIF
    D(I) = a0 + E
  ENDDO

Note: we did not increase the running time.

Complex Loop Nests (3)
Given a control-flow graph of the loop, and assuming that each If statement has a (possibly empty) Else branch:
- We insert an initialization at the beginning of block b if the variable is used in b but not initialized on any path to b.
- We insert an initialization at the end of block b if the variable has not been initialized on any path to the block, it is live on exit from the block, and it is used at some successor of the block (as done in the example).

Complex Loop Nests (4)
Triangular Unroll and Jam
Consider the following example:

  DO I = 2, 99
    DO J = 1, I-1
      A(I,J) = A(I,I) + A(J,J)
    ENDDO
  ENDDO

Naive unroll and jam:
  DO I = 2, 99, 2
    DO J = 1, I-1
      A(I,J) = A(I,I) + A(J,J)
    ENDDO
  ENDDO

Error: we miss an assignment (the unrolled copy's iteration J = I). We can solve the problem by applying unroll and jam step by step, using the loop fusion mechanics.

Complex Loop Nests (5)

Original code:
  DO I = 2, 99
    DO J = 1, I-1
      A(I,J) = A(I,I) + A(J,J)
    ENDDO
  ENDDO

Unroll:
  DO I = 2, 99, 2
    DO J = 1, I-1
      A(I,J) = A(I,I) + A(J,J)
    ENDDO
    DO J = 1, I
      A(I+1,J) = A(I+1,I+1) + A(J,J)
    ENDDO
  ENDDO

Jam (fusion):
  DO I = 2, 99, 2
    DO J = 1, I-1
      A(I,J) = A(I,I) + A(J,J)
      A(I+1,J) = A(I+1,I+1) + A(J,J)
    ENDDO
    A(I+1,I) = A(I+1,I+1) + A(I,I)
  ENDDO

Scalar replacement:
  DO I = 2, 99, 2
    tI = A(I,I)
    tI1 = A(I+1,I+1)
    DO J = 1, I-1
      tJ = A(J,J)
      A(I,J) = tI + tJ
      A(I+1,J) = tI1 + tJ
    ENDDO
    A(I+1,I) = tI1 + tI
  ENDDO

Complex Loop Nests (6)
Note: it is also possible to unroll by a factor larger than 2, using the same techniques.

Complex Loop Nests (7)
Trapezoidal Unroll and Jam
The same technique can be used for general trapezoidal loops, for example (a part of a convolution code):

  DO I = 0, N
    DO J = I, I+N2
      F3(I) = F3(I) + F1(J)*W(I-J)
    ENDDO
    F3(I) = F3(I)*DT
  ENDDO

Unroll:
  DO I = 0, N, 2
    DO J = I, I+N2
      F3(I) = F3(I) + F1(J)*W(I-J)
    ENDDO
    F3(I) = F3(I)*DT
    DO J = I+1, I+N2+1
      F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
    ENDDO
    F3(I+1) = F3(I+1)*DT
  ENDDO

Complex Loop Nests (8)

Jam (fusion):
  DO I = 0, N, 2
    F3(I) = F3(I) + F1(I)*W(0)
    DO J = I+1, I+N2
      F3(I) = F3(I) + F1(J)*W(I-J)
      F3(I+1) = F3(I+1) + F1(J)*W(I-J+1)
    ENDDO
    F3(I+1) = F3(I+1) + F1(I+N2+1)*W(-N2)
    F3(I) = F3(I)*DT
    F3(I+1) = F3(I+1)*DT
  ENDDO

(The fused loop starts at J = I+1 because the peeled statement before it already covers the first loop's iteration J = I.)

Applying scalar replacement then gave a speedup of 2.22 on a MIPS M120.

Agenda
- Last Lecture at a Glance
- Loop Interchange for Register Reuse
- Loop Fusion for Register Reuse
- Putting It All Together
- Complex Loop Nests
- Summary

Summary (1)
In this lecture we covered:
1. Loop Interchange: this gives us more dependences in the innermost loop, which we can utilize for more register reuse.
2. Loop Fusion and Alignment: these bring uses together so they can share registers.
3. Complex Loops: how to overcome some of the problems in real-world programs.