Compiler Improvement of Register Usage, Part 1: Chapter 8, through Section 8.4. Anastasia Braginsky

Roadmap
- Introduction
- Scalar Replacement
- Unroll-and-Jam

Introduction
- Processor cycle time keeps decreasing, while memory access time remains almost the same.
- Goal: better usage of the register set (especially important for RISC architectures).
- No cache discussion here.

Register Allocation Algorithms
Allocate a register to a single "live range" of a variable:
- Start of the live range: load from memory into a register.
- Within the live range: uses of the register.
- End of the live range: store from the register back to some memory location.

Register Allocation Algorithms
1. Define a live range per variable.
2. Build an interference graph modeling which pairs of live ranges cannot be assigned to the same register.
3. Run a fast heuristic coloring algorithm.
4. If coloring fails, spill at least one live range out of the registers and repeat from step 3.
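As a rough illustration of steps 2 to 4, here is a minimal Python sketch (not the slides' algorithm; the graph encoding, K, and the highest-degree spill heuristic are assumptions) of a Chaitin-style simplify/spill loop:

    # Sketch of steps 2-4: color an interference graph with K registers,
    # spilling one live range and retrying when coloring fails.
    # All names here are illustrative, not from the slides.
    def color(interference, K):
        """interference: dict mapping each live range to the set of live
        ranges it conflicts with. Returns ({live_range: register}, spilled)."""
        spilled = set()
        while True:
            graph = {v: ns - spilled for v, ns in interference.items()
                     if v not in spilled}
            stack, work = [], dict(graph)
            while work:
                # Simplify: remove a node with degree < K (trivially colorable).
                easy = next((v for v, ns in work.items()
                             if len(ns & work.keys()) < K), None)
                if easy is None:
                    # Coloring is stuck: spill the highest-degree live range
                    # (a real allocator would weigh spill costs) and retry.
                    spilled.add(max(work, key=lambda v: len(work[v])))
                    break
                stack.append(easy)
                del work[easy]
            else:
                # Select: pop nodes in reverse order, give each the lowest
                # register not used by an already-colored neighbor.
                colors = {}
                for v in reversed(stack):
                    used = {colors[n] for n in graph[v] if n in colors}
                    colors[v] = min(c for c in range(K) if c not in used)
                return colors, spilled

    ranges = {'a': {'b', 'c'}, 'b': {'a', 'c'}, 'c': {'a', 'b', 'd'}, 'd': {'c'}}
    print(color(ranges, K=2))   # spills 'c'; a, b, d then fit in two registers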

Register Allocation Algorithms – so what's the problem?
- Only non-array variables are assigned to registers.
- Almost no optimization for floating-point registers, which are typically used to temporarily hold individual elements of array variables.
- We want to eliminate the unneeded loads and stores.

Introduction – example:

DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
ENDDO

Iteration J:
1. Load A(I) from memory into register RA.
2. Load B(J) from memory into register RB.
3. RA = RA + RB
4. Store RA back to A(I).

Iteration J+1:
1. Load A(I) from memory into register RA.
2. Load B(J+1) from memory into register RB.
3. RA = RA + RB
4. Store RA back to A(I).

The load and store of A(I) are repeated in every iteration of the inner loop.

Introduction – example (cont.):

DO I = 1, N
  T = A(I)
  DO J = 1, M
    T = T + B(J)
  ENDDO
  A(I) = T
ENDDO

- All loads and stores to A in the inner loop have been saved.
- High chance of T being allocated a register by the coloring algorithm.
- This is a source-to-source transformation.

Data Dependence for Register Reuse
Two applications of dependence:
- To determine the correctness of different transformations.
- To determine the transformations that improve the performance of particular memory accesses.

Types of dependences
A true (flow) dependence: assignment to a variable, then a use of it.
S1: V = ...   (store the register representing V to V's memory location)
S2: ... = V   (load V into the register representing V)
The load at S2 can be saved: the value is still in a register (and the store can be placed after the use). A cache miss can also be saved.

Types of dependences
An antidependence: use of a variable, then an assignment to it.
S1: ... = V   (load V into the register representing V)
S2: V = ...   (store the register representing V to V's memory location)
Nothing can be done to improve register usage (if V is used only once), but a cache miss can be saved.

Types of dependences
An output dependence: the assignment to a variable repeats.
S1: V = ...   (store the register representing V to V's memory location)
S2: V = ...   (store the register representing V to V's memory location)
The first store is not needed.

Types of dependences - example
S1: A(I) = ...   (store)
S2: ... = A(I)   (load)
S3: A(I) = ...   (store)
True dependence between S1 and S2; output dependence between S1 and S3.

Types of dependences
An input dependence: the use of a variable repeats.
S1: ... = V   (load V into the register representing V)
S2: ... = V   (load V into the register representing V)
The second load is not needed.

Types of dependences
- Loop-independent dependence.
- Loop-carried dependence:
  - Consistent dependence: a loop-carried dependence with a constant dependence distance throughout the loop.
  - Inconsistent dependence: one whose distance varies.

Dependence graph modification
S1: A(I) = ...   (store to A(I))
S2: ... = A(I)   (load from A(I))
S3: ... = A(I)   (load from A(I))
Two true dependences: S1 to S2 (load can be saved) and S1 to S3 (load can be saved); an input dependence from S2 to S3 (load can be saved). It is not possible to save all three memory references, so the dependence graph should be pruned.

Sum reduction example (again)

DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
ENDDO

Dependence | Pattern            | Savings
True       | V = ... ; ... = V  | Save the load
Anti       | ... = V ; V = ...  | Save nothing
Output     | V = ... ; V = ...  | Save the first store
Input      | ... = V ; ... = V  | Save the second load

Sum reduction example (again)

DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
ENDDO

...
Load from A(I) to R1
Load from B(J) to R2
R1 = R1 + R2
Store R1 to A(I)
Load from A(I) to R1
Load from B(J+1) to R2
R1 = R1 + R2
Store R1 to A(I)
...
Load from A(I) to R1

Scalar Register Allocation

DO I = 1, N
  T = A(I)
  DO J = 1, M
    T = T + B(J)
  ENDDO
  A(I) = T
ENDDO

Roadmap
- Introduction
- Scalar Replacement
- Unroll-and-Jam

Loop Independent Dependence
Simple replacement:

DO I = 1, N
  A(I) = B(I) + C
  X(I) = A(I) + Q
ENDDO

becomes

DO I = 1, N
  t = B(I) + C
  X(I) = t + Q
  A(I) = t
ENDDO

Loop Carried Dependences
Spanning a single iteration:

DO I = 1, N
  A(I) = B(I-1)
  B(I) = A(I) + C(I)
ENDDO

I=1: A(1) = B(0);  B(1) = A(1) + C(1)
I=2: A(2) = B(1);  B(2) = A(2) + C(2)
I=3: A(3) = B(2);  B(3) = A(3) + C(3)
...

B(I) written in one iteration is read as B(I-1) in the next: a loop-carried true dependence.

Loop Carried Dependences
Spanning a single iteration. Replacing each occurrence of B with the scalar T:

I=1: A(1) = B(0) or T;  B(1) or T = A(1) + C(1)
I=2: A(2) = B(1) or T;  B(2) or T = A(2) + C(2)
I=3: A(3) = B(2) or T;  B(3) or T = A(3) + C(3)
...

T = B(0)
DO I = 1, N
  A(I) = T
  T = A(I) + C(I)
  B(I) = T
ENDDO

Loop Carried Dependences

DO I = 1, N
  A(I) = B(I-1)
  B(I) = A(I) + C(I)
ENDDO

Taking care of the loop-independent dependence using the scalar tA:

DO I = 1, N
  tA = B(I-1)
  B(I) = tA + C(I)
  A(I) = tA
ENDDO

Loop Carried Dependences
Taking care of the loop-carried dependence using the scalar tB, with an initialization of tB inserted before the loop:

DO I = 1, N
  tA = B(I-1)
  B(I) = tA + C(I)
  A(I) = tA
ENDDO

becomes

tB = B(0)
DO I = 1, N
  tA = tB
  tB = tA + C(I)
  A(I) = tA
  B(I) = tB
ENDDO

(A version with two scalars, tB1 for B(I-1) and tB2 for B(I), would need a copy tB1 = tB2 at the end of every iteration; merging them into the single scalar tB avoids that copy.)

Loop Carried Dependences
Compare with the initial state:

DO I = 1, N
  A(I) = B(I-1)
  B(I) = A(I) + C(I)
ENDDO

tB = B(0)
DO I = 1, N
  tA = tB
  A(I) = tA
  tB = tA + C(I)
  B(I) = tB
ENDDO

After forward substitution of tA:

t = B(0)
DO I = 1, N
  A(I) = t
  t = t + C(I)
  B(I) = t
ENDDO
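As a quick sanity check, the three versions can be mirrored in Python and compared (an illustrative harness, not from the slides; the array sizes and values are arbitrary):

    # Cross-check: the original loop, the tA/tB version, and the
    # forward-substituted version must compute identical A and B.
    N = 6
    C = [0.0] + [float(i) for i in range(1, N + 1)]   # C[1..N]
    B0 = 2.5                                          # initial B(0)

    def original():
        A, B = [0.0] * (N + 1), [B0] + [0.0] * N
        for I in range(1, N + 1):
            A[I] = B[I - 1]
            B[I] = A[I] + C[I]
        return A, B

    def with_tA_tB():
        A, B = [0.0] * (N + 1), [B0] + [0.0] * N
        tB = B0                       # tB = B(0)
        for I in range(1, N + 1):
            tA = tB
            tB = tA + C[I]
            A[I] = tA
            B[I] = tB
        return A, B

    def substituted():
        A, B = [0.0] * (N + 1), [B0] + [0.0] * N
        t = B0                        # t = B(0)
        for I in range(1, N + 1):
            A[I] = t
            t = t + C[I]
            B[I] = t
        return A, B

    assert original() == with_tA_tB() == substituted()
    print("all three versions agree")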

Dependences Spanning Multiple Iterations
The distance of the dependence is 2 iterations:

DO I = 1, N
  A(I) = B(I-1) + B(I+1)
ENDDO

B(I+1) read in one iteration is read again as B(I-1) two iterations later: an input dependence. Use 3 different scalar temporaries, defined as follows:
t1 = B(I-1), t2 = B(I), t3 = B(I+1)

Dependences Spanning Multiple Iterations
First introduce the temporaries inside the loop, with end-of-iteration copies that pass the values to the next iteration:

DO I = 1, N
  t1 = B(I-1)
  t2 = B(I)
  t3 = B(I+1)
  A(I) = t1 + t3
  t1 = t2
  t2 = t3
ENDDO

With the copies in place, the loads of t1 and t2 become redundant after the first iteration.

Dependences Spanning Multiple Iterations
Hoisting the loads of t1 and t2 out of the loop:

t1 = B(0)
t2 = B(1)
DO I = 1, N
  t3 = B(I+1)
  A(I) = t1 + t3
  t1 = t2
  t2 = t3
ENDDO

But 2 pipeline cycles/slots are wasted in each iteration on the register-to-register copies!


Dependences Spanning Multiple Iterations
Unrolling by 3 makes the copies unnecessary; the temporaries rotate through the three roles by themselves:

t1 = B(0)
t2 = B(1)
DO I = 1, N, 3
  t3 = B(I+1)
  A(I) = t1 + t3
  t1 = B(I+2)
  A(I+1) = t2 + t1
  t2 = B(I+3)
  A(I+2) = t3 + t2
ENDDO

I = 1: A(1) = B(0) + B(2)
I = 2: A(2) = B(1) + B(3)
I = 3: A(3) = B(2) + B(4)
I = 4: A(4) = B(3) + B(5)
...

Eliminating Scalar Copies
The pre-loop is the same as the previous solution; the main loop computes three sums per iteration:

t1 = B(0)
t2 = B(1)
mN3 = MOD(N,3)
DO I = 1, mN3
  t3 = B(I+1)
  A(I) = t1 + t3
  t1 = t2
  t2 = t3
ENDDO
DO I = mN3 + 1, N, 3
  t3 = B(I+1)
  A(I) = t1 + t3
  t1 = B(I+2)
  A(I+1) = t2 + t1
  t2 = B(I+3)
  A(I+2) = t3 + t2
ENDDO
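A small Python mirror of the pre-loop plus unrolled main loop (an illustrative harness, not from the slides) confirms it computes the same values as the original A(I) = B(I-1) + B(I+1):

    # Sanity check: mirror the Fortran pre-loop + unrolled main loop and
    # compare against the original loop. Same additions in the same order,
    # so exact float equality is expected.
    N = 10
    B = [float(i) for i in range(N + 2)]    # B[0..N+1], used 1-based as in Fortran

    A_ref = [0.0] * (N + 1)                 # original: A(I) = B(I-1) + B(I+1)
    for I in range(1, N + 1):
        A_ref[I] = B[I - 1] + B[I + 1]

    A = [0.0] * (N + 1)
    t1, t2 = B[0], B[1]
    mN3 = N % 3
    for I in range(1, mN3 + 1):             # pre-loop, with register copies
        t3 = B[I + 1]
        A[I] = t1 + t3
        t1, t2 = t2, t3
    for I in range(mN3 + 1, N + 1, 3):      # main loop, three sums per iteration
        t3 = B[I + 1]
        A[I] = t1 + t3
        t1 = B[I + 2]
        A[I + 1] = t2 + t1
        t2 = B[I + 3]
        A[I + 2] = t3 + t2

    assert A == A_ref, "transformed loop must match the original"
    print("OK:", A[1:])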

Scalar Replacement
1. Loop-independent dependence: simple replacement.
2. Handling loop-carried dependences spanning one iteration.
3. Dependences spanning multiple iterations.
4. Eliminating scalar copies.
5. Pruning the dependence graph.
6. Moderating register pressure.

Break?

Pruning the dependence graph

DO I = 1, N
  A(I+1) = A(I-1) + B(I-1)
  A(I) = A(I) + B(I) + B(I+1)
ENDDO

I = 1: A(2) = A(0) + B(0);  A(1) = A(1) + B(1) + B(2)
I = 2: A(3) = A(1) + B(1);  A(2) = A(2) + B(2) + B(3)
I = 3: A(4) = A(2) + B(2);  A(3) = A(3) + B(3) + B(4)
...

Dependence | Pattern            | Savings
True       | V = ... ; ... = V  | Save the load
Anti       | ... = V ; V = ...  | Save nothing
Output     | V = ... ; V = ...  | Save the first store
Input      | ... = V ; ... = V  | Save the second load

Pruning the dependence graph
Two kinds of edges to be pruned:
- True and input dependence edges that are killed by an intervening assignment to the same location.
- Input dependence edges that are redundant.

Three-phase algorithm for pruning edges
Phase 1: Eliminate killed dependences.
- The killed dependences can be true or input dependences.
- Look for a store to the location involved in the dependence, occurring between the dependence's endpoints.

Three-phase algorithm for pruning edges
Phase 1: Eliminate killed dependences.
- True dependence: look for an output dependence from the source to an assignment before the sink (there should be a true dependence from that assignment to the sink).

S1: A(I+1) = ...
S2: A(I) = ...
S3: ... = A(I)

Output dependence from S1 to S2, true dependence from S2 to S3: the true dependence from S1 to S3 is killed.

Three-phase algorithm for pruning edges
Phase 1: Eliminate killed dependences.
- Input dependence: look for an antidependence from the source to an assignment before the sink (there should be a true dependence from that assignment to the sink).

S1: ... = A(I+1)
S2: A(I) = ...
S3: ... = A(I)

Antidependence from S1 to S2, true dependence from S2 to S3: the input dependence from S1 to S3 is killed.

Three-phase algorithm for pruning edges
Phase 2: Identify generators.
- A generator is a reference with at least one true or input dependence starting from it AND no input or true dependence into it.
- In other words, in the subgraph containing only the true and input dependences, the generators are the sources.

Three-phase algorithm for pruning edges
Phase 3: Find name partitions and eliminate input dependences.
- A name partition is a set of references that can be replaced by a reference to a single scalar variable, e.g. a group of A references such as A(I+2), A(I+1), A(I), A(I-1), or a group of B references.

Three-phase algorithm for pruning edges
Phase 3: Find name partitions and eliminate input dependences.
- A name partition is a set of references that can be replaced by a reference to a single scalar variable.
- Starting at each generator, mark each reference reachable from the generator by a flow or input dependence as part of the name partition for that variable (similar to the typed fusion problem).
- Eliminate input dependences within the same name partition, unless the source is the generator.

Three-phase algorithm for pruning edges
Phase 3 and a half: Antidependences.
- Note that antidependences cannot directly give rise to register reuse, and they are always pruned as the last step of graph pruning.
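A compact Python sketch of the three phases, under an assumed edge encoding (src, dst, kind); this is one possible rendering of the slides' description, not the book's implementation:

    # Edges are (src, dst, kind), kind in {"true", "anti", "input", "output"}.
    def prune(edges):
        def has(kind, a, b):
            return (a, b, kind) in edges

        # Phase 1: a true dependence (s, t) is killed by an intervening
        # assignment w when s -output-> w and w -true-> t; an input
        # dependence is killed when s -anti-> w and w -true-> t.
        def killed(s, t, kind):
            mid = "output" if kind == "true" else "anti"
            mids = {e[1] for e in edges if e[0] == s and e[2] == mid}
            return any(has("true", w, t) for w in mids)
        edges = [e for e in edges
                 if e[2] not in ("true", "input") or not killed(*e)]

        # Phase 2: generators are the sources of the true/input subgraph.
        ti = [e for e in edges if e[2] in ("true", "input")]
        gens = {e[0] for e in ti} - {e[1] for e in ti}

        # Phase 3: grow a name partition from each generator along
        # true/input edges; drop input edges inside a partition unless
        # their source is the generator itself.
        part = {}
        for g in gens:
            stack = [g]
            while stack:
                v = stack.pop()
                if v not in part:
                    part[v] = g
                    stack += [e[1] for e in ti if e[0] == v]
        edges = [e for e in edges
                 if not (e[2] == "input"
                         and part.get(e[0]) is not None
                         and part.get(e[0]) == part.get(e[1])
                         and e[0] not in gens)]

        # Phase 3.5: antidependences never enable reuse; prune them last.
        return [e for e in edges if e[2] != "anti"], part

    # The modification example above: A(I)=..., ...=A(I), ...=A(I).
    E = [("S1", "S2", "true"), ("S1", "S3", "true"), ("S2", "S3", "input")]
    print(prune(E))   # the S2->S3 input edge is pruned; S1 is the generator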

Pruning the dependence graph

DO I = 1, N
  A(I+1) = A(I-1) + B(I-1)
  A(I) = A(I) + B(I) + B(I+1)
ENDDO

Phase 1:
- Try to eliminate true dependences.
- Try to eliminate input dependences.
Phase 2:
- Identify generators; candidates are the sources of true and input dependences.

Pruning the dependence graph

DO I = 1, N
  A(I+1) = A(I-1) + B(I-1)
  A(I) = A(I) + B(I) + B(I+1)
ENDDO

Phase 3:
- Starting at each generator, mark each reference reachable from the generator by an input or true dependence as part of the name partition for that variable.
Phase 3.5:
- Prune the antidependences.

Pruning the dependence graph

DO I = 1, N
  A(I+1) = A(I-1) + B(I-1)
  A(I) = A(I) + B(I) + B(I+1)
ENDDO

How to start scalar replacement from here?
1. Generators: A(I+1), A(I), B(I+1); the remaining references A(I-1), B(I-1), B(I) fall into the resulting name partitions.
2. The number of scalars needed for each generator = how many iterations are spanned by the dependence.

Pruning the dependence graph

DO I = 1, N
  A(I+1) = A(I-1) + B(I-1)
  A(I) = A(I) + B(I) + B(I+1)
ENDDO

References:
A(I-1) = tA1,  A(I) = tA2,  A(I+1) = tA3
B(I-1) = tB1,  B(I) = tB2,  B(I+1) = tB3

Pruning the dependence graph
With tA1 = A(I-1), tA2 = A(I), tA3 = A(I+1) and tB1 = B(I-1), tB2 = B(I), tB3 = B(I+1):

tA1 = A(0)
tA2 = A(1)
tB1 = B(0)
tB2 = B(1)
DO I = 1, N
  tA3 = tA1 + tB1
  tB3 = B(I+1)
  tA1 = tA2 + tB2 + tB3   ! tA1 is dead here; reuse it for the new A(I)
  A(I) = tA1
  tA2 = tA3
  tB1 = tB2
  tB2 = tB3
ENDDO
A(N+1) = tA3

Only one load and one store remain per iteration.

Pruning the dependence graph
Without scalar replacement, each iteration takes five loads and two stores:

DO I = 1, N
  A(I+1) = A(I-1) + B(I-1)
  A(I) = A(I) + B(I) + B(I+1)
ENDDO

1. Load A(I-1) to R1
2. Load B(I-1) to R2
   R1 = R1 + R2
3. Store R1 to A(I+1)
4. Load A(I) to R1
5. Load B(I) to R2
   R1 = R1 + R2
6. Load B(I+1) to R2
   R1 = R1 + R2
7. Store R1 to A(I)

Pruning the dependence graph
Special cases: loop-invariant references in the loop. Assumption: the loop executes at least once.

DO I = 1, N
  A(J) = B(I) + C(I,J)
  C(I,J) = A(J) + D(I)
ENDDO

becomes

DO I = 1, N
  tA = B(I) + C(I,J)
  C(I,J) = tA + D(I)
ENDDO
A(J) = tA

Pruning the dependence graph
Special cases: forcing stores and loads.

DO I = 1, N
  A(I) = A(I-1) + B(I)
  A(J) = A(J) + A(I)
ENDDO

I=1: A(1) = A(0) + B(1);  A(J) = A(J) + A(1)
I=2: A(2) = A(1) + B(2);  A(J) = A(J) + A(2)
I=3: ...

Pruning the dependence graph
Special cases: forcing stores and loads. Replacing A(I) with the scalar tAI:

DO I = 1, N
  tAI = A(I-1) + B(I)
  A(I) = tAI
  A(J) = A(J) + tAI
ENDDO

Pruning the dependence graph
Special cases: forcing stores and loads. Replacing A(I-1) with tAI as well, initialized before the loop:

tAI = A(0)
DO I = 1, N
  tAI = tAI + B(I)   ! tAI was A(I-1)
  A(J) = A(J) + tAI
  A(I) = tAI
ENDDO

But what if J = I at some iteration?

Special cases: forcing stores and loads. If A(J) may be the same location as A(I), split the index set at I = J:

tAI = A(0)
DO I = 1, N
  tAI = tAI + B(I)
  A(J) = A(J) + tAI
  A(I) = tAI
ENDDO

becomes

tAI = A(0)
tAJ = A(J)
JU = MAX(J-1, 0)
DO I = 1, JU
  tAI = tAI + B(I)
  tAJ = tAJ + tAI
  A(I) = tAI
ENDDO
IF (J .GT. 0 .AND. J .LE. N) THEN   ! here I = J
  tAI = tAI + B(J)
  tAJ = tAJ + tAI
  A(J) = tAI
  tAI = tAJ
ENDIF
DO I = JU + 2, N   ! I starting from J+1
  tAI = tAI + B(I)
  tAJ = tAJ + tAI
  A(I) = tAI
ENDDO
A(J) = tAJ

Pruning the dependence graph
Special cases: inconsistent dependence.

DO I = 1, N
  A(2*I) = A(I) + B(I)
ENDDO

The dependence distance varies with I; this is a bad edge in the typed fusion framework above.

Moderating Register Pressure
- Scalar replacement may produce too many scalar quantities.
- We have to choose the name partitions for scalar replacement that make the best use of the registers.

Moderating Register Pressure
Given m name partitions R1, R2, R3, ..., Rm:
- The value of a name partition, v(Ri): the number of memory loads and stores saved by replacing it.
- The cost of a name partition, c(Ri): the number of registers needed to hold all its scalar values.

Moderating Register Pressure
The desired solution, given a limit n on the number of register-resident scalars: choose a sub-collection of the name partitions, i.e. x1, x2, ..., xm with each xi = 0 or 1, such that

  Σi xi·v(Ri) is maximized, subject to Σi xi·c(Ri) ≤ n.

This is exactly the 0/1 knapsack problem.

Dynamic programming solution of the 0/1 knapsack problem
Let the costs be c1, ..., cm and the corresponding values v1, ..., vm. We wish to maximize total value subject to the constraint that total cost is at most n. For each j ≤ m and i ≤ n, define A(j, i) to be the maximum value attainable using only the first j partitions with total cost at most i; A(m, n) is the solution to the problem. Define A recursively as follows:
  A(0, i) = 0
  A(j, i) = A(j-1, i) if cj > i
  A(j, i) = max( A(j-1, i), vj + A(j-1, i-cj) ) if cj ≤ i
Tabulating the results from A(0, ·) up through A(m, ·) gives the solution. Since each entry takes constant time and there are m·n entries, the running time of the dynamic programming solution is O(nm). (The one-dimensional recurrence A(i) = max{ vj + A(i-cj) | cj ≤ i } instead solves the unbounded knapsack, where a partition could be picked more than once.)

0/1 knapsack problem solutions
- Dynamic programming solution: O(nm).
- Heuristic solution:
  - Order the name partitions in decreasing order of the ratio v(R)/c(R).
  - Select elements from the front of the list until the registers are full.
A sketch of both follows.
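Both solutions are easy to sketch in Python (illustrative; values and costs are per name partition, n is the register budget):

    def knapsack_01(values, costs, n):
        """Max total value of a subset with total cost <= n registers.
        A[i] = best value at capacity i; scanning i downward per item
        keeps each item 0/1 (used at most once)."""
        A = [0] * (n + 1)
        for v, c in zip(values, costs):
            for i in range(n, c - 1, -1):
                A[i] = max(A[i], v + A[i - c])
        return A[n]

    def greedy(values, costs, n):
        """Heuristic: take partitions in decreasing v/c ratio until full."""
        order = sorted(range(len(values)),
                       key=lambda j: values[j] / costs[j], reverse=True)
        total, left, picked = 0, n, []
        for j in order:
            if costs[j] <= left:
                picked.append(j)
                total += values[j]
                left -= costs[j]
        return total, picked

    v, c = [10, 6, 6], [4, 3, 3]    # saved memory ops / registers needed
    print(knapsack_01(v, c, n=6))   # 12: partitions 1 and 2 together beat 0
    print(greedy(v, c, n=6))        # (10, [0]): the ratio heuristic picks only 0

The example also shows why the heuristic is only a heuristic: the best ratio does not always lead to the best packing.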

Scalar replacement algorithm
For loops that do not include conditional flow of control:
1. Prune the dependence graph (apply typed fusion); obtain the set of name partitions as part of the result.
2. Select a set of name partitions using register pressure moderation.

Scalar replacement algorithm
For loops that do not include conditional flow of control:
3. For each selected partition, replace it with references to scalars:
   A) If non-cyclic, replace using a set of temporaries:
      - A true dependence is replaced by a set of temporaries; the number of temporaries is determined by how many iterations the dependence spans.
      - Output dependence: move the store after the end of the loop.
      - Input dependence: move the load before the beginning of the loop.

Scalar replacement algorithm
For loops that do not include conditional flow of control:
   B) If cyclic, replace the references with a single temporary.
   C) For each inconsistent dependence, use index set splitting or insert loads and stores.
4. Unroll the loop to eliminate scalar copies.

Experimental Data
- Speedup = running time of the original program divided by the running time of the version after scalar replacement.
- LL = Livermore loops.

Roadmap
- Introduction
- Scalar Replacement
- Unroll-and-Jam

Recall the introduction example:

DO I = 1, N
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
ENDDO

Unroll-and-Jam
After the transformation called unroll-and-jam (assume N mod 2 = 0):

DO I = 1, N, 2
  DO J = 1, M
    A(I) = A(I) + B(J)
    A(I+1) = A(I+1) + B(J)
  ENDDO
ENDDO

This is unroll-and-jam to factor 2. Each load of B(J) now serves two uses.

Improving the efficiency of a pipelined functional unit

DO J = 1, 2M
  DO I = 1, N
    A(I,J) = A(I+1,J) + A(I-1,J)
  ENDDO
ENDDO

A(1,J) = A(2,J) + A(0,J)
A(2,J) = A(3,J) + A(1,J)
A(3,J) = A(4,J) + A(2,J)
...

Each addition needs the result of the previous one, so consecutive additions cannot overlap in the pipeline.

Improving the efficiency of a pipelined functional unit

[Figure: the classic five-stage pipeline: Instruction Fetch; Instruction Decode / Register Fetch (register file); Execute; Memory Access; Write Back to register file, with a forwarding unit feeding Execute-stage results back to the Execute stage.]

Improving the efficiency of a pipelined functional unit

DO J = 1, 2M
  DO I = 1, N
    A(I,J) = A(I+1,J) + A(I-1,J)
  ENDDO
ENDDO

Column J:                    Column J+1:
A(1,J) = A(2,J) + A(0,J)     A(1,J+1) = A(2,J+1) + A(0,J+1)
A(2,J) = A(3,J) + A(1,J)     A(2,J+1) = A(3,J+1) + A(1,J+1)
A(3,J) = A(4,J) + A(2,J)     A(3,J+1) = A(4,J+1) + A(2,J+1)
...                          ...

The computations for columns J and J+1 are independent of each other.

Improving the efficiency of a pipelined functional unit

DO J = 1, 2M, 2
  DO I = 1, N
    A(I,J) = A(I+1,J) + A(I-1,J)
    A(I,J+1) = A(I+1,J+1) + A(I-1,J+1)
  ENDDO
ENDDO

The two statements are independent, so their additions can be interleaved in the pipeline.

Legality of Unroll-and-Jam

DO I = 1, 2N, 2
  DO J = 1, M
    A(I+1, J-1) = A(I,J) + B(I,J)
    A(I+2, J-1) = A(I+1,J) + B(I+1,J)
  ENDDO
ENDDO

Compare against the original iteration order:
I=1, J=1: A(2, 0) = A(1, 1) + B(1, 1)
I=1, J=2: A(2, 1) = A(1, 2) + B(1, 2)
I=1, J=3: A(2, 2) = A(1, 3) + B(1, 3)
...
I=2, J=1: A(3, 0) = A(2, 1) + B(2, 1)
I=2, J=2: A(3, 1) = A(2, 2) + B(2, 2)
I=2, J=3: A(3, 2) = A(2, 3) + B(2, 3)
...

Legality of Unroll-and-Jam
In the original order, A(2,1) is written at I=1, J=2 and only then read at I=2, J=1. After jamming, the copy for I=2 runs at J=1, before A(2,1) has been written: the transformation breaks the dependence.

Legality of Unroll-and-Jam

DO I = 1, 2N
  DO J = 1, M
    A(I+1, J-1) = A(I,J) + B(I,J)
  ENDDO
ENDDO

The direction vector of the dependence is (<, >). Swapping the loops would make it (>, <), which is illegal. Should we assume that unroll-and-jam is illegal whenever loop interchange is illegal?

Legality of Unroll-and-Jam

DO I = 1, 2N
  DO J = 1, M
    A(I+2, J-1) = A(I,J) + B(I,J)
  ENDDO
ENDDO

The direction vector is still (<, >), yet unroll-and-jam to factor 2 is possible: the outer-loop distance is 2.

Legality of Unroll-and-Jam
Definition: An unroll-and-jam to factor n consists of:
- unrolling the outer loop n-1 times to create n copies of the inner loop, and
- fusing those copies together.

Unroll-and-Jam to factor n=2
Recall the introduction example:

DO I = 1, 2N
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
ENDDO

Unrolling the outer loop n-1 = 1 times creates n = 2 copies of the inner loop:

DO I = 1, 2N
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
ENDDO

Unroll-and-Jam to factor n=2
Adjusting the step of the outer loop and the subscripts of the second copy:

DO I = 1, 2N, 2
  DO J = 1, M
    A(I) = A(I) + B(J)
  ENDDO
  DO J = 1, M
    A(I+1) = A(I+1) + B(J)
  ENDDO
ENDDO


Unroll-and-Jam to factor n=2
And fusing those copies together:

DO I = 1, 2N, 2
  DO J = 1, M
    A(I) = A(I) + B(J)
    A(I+1) = A(I+1) + B(J)
  ENDDO
ENDDO

Legality of Unroll-and-Jam
Theorem: An unroll-and-jam to factor n is legal if and only if there exists no dependence with direction vector (<, >) whose distance for the outer loop is less than n.
To check this, note that the full dependence graph must be used, not the pruned one.

Legality of Unroll-and-Jam
What happens if there exists a dependence with direction vector (<, >) but the distance for the outer loop is exactly n? (Note that the unroll-and-jam was to factor n.) By the theorem it is still legal: source and sink land in different iterations of the unrolled outer loop, so their order is preserved.
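The theorem translates directly into a small test; a hedged Python sketch with an assumed dependence encoding:

    # Each dependence is (outer_dir, inner_dir, outer_distance), with
    # directions "<", "=", ">" and outer_distance an int (None if unknown).
    def unroll_and_jam_legal(deps, n):
        """Unroll-and-jam to factor n is legal iff no dependence with
        direction vector (<, >) has outer-loop distance < n. Run this on
        the full dependence graph, not the pruned one."""
        for outer, inner, dist in deps:
            if outer == "<" and inner == ">":
                if dist is None or dist < n:
                    return False
        return True

    # A(I+1, J-1) = A(I, J) + ... : direction (<, >), outer distance 1.
    print(unroll_and_jam_legal([("<", ">", 1)], n=2))   # False: illegal
    # A(I+2, J-1) = A(I, J) + ... : direction (<, >), outer distance 2.
    print(unroll_and_jam_legal([("<", ">", 2)], n=2))   # True: legal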

Unroll-and-Jam to factor m
Algorithm:
1. Create a preloop.
2. Unroll the main loop to create m copies of its body.
3. Apply typed fusion to the loops within the body of the unrolled loop.
4. Apply unroll-and-jam recursively to the inner nested loops.

Unroll-and-Jam example

DO I = 1, N
  DO K = 1, N
    A(I) = A(I) + X(I,K)
  ENDDO
  DO J = 1, M
    DO K = 1, N
      B(J,K) = B(J,K) + A(I)
    ENDDO
  ENDDO
  DO J = 1, M
    C(J,I) = B(J,N)/A(I)
  ENDDO
ENDDO

After unroll-and-jam to factor 2 on I (Nmod2 = MOD(N,2); the preloop handling the first Nmod2 iterations is not shown; the two J loops fuse):

DO I = Nmod2+1, N, 2
  DO K = 1, N
    A(I) = A(I) + X(I,K)
    A(I+1) = A(I+1) + X(I+1,K)
  ENDDO
  DO J = 1, M
    DO K = 1, N
      B(J,K) = B(J,K) + A(I)
      B(J,K) = B(J,K) + A(I+1)
    ENDDO
    C(J,I) = B(J,N)/A(I)
    C(J,I+1) = B(J,N)/A(I+1)
  ENDDO
ENDDO

Applying unroll-and-jam recursively, now to factor 2 on J (Mmod2 = MOD(M,2)):

DO I = Nmod2+1, N, 2
  DO K = 1, N
    A(I) = A(I) + X(I,K)
    A(I+1) = A(I+1) + X(I+1,K)
  ENDDO
  DO J = 1, Mmod2
    DO K = 1, N
      B(J,K) = B(J,K) + A(I)
      B(J,K) = B(J,K) + A(I+1)
    ENDDO
    C(J,I) = B(J,N)/A(I)
    C(J,I+1) = B(J,N)/A(I+1)
  ENDDO
  DO J = Mmod2+1, M, 2
    DO K = 1, N
      B(J,K) = B(J,K) + A(I)
      B(J,K) = B(J,K) + A(I+1)
      B(J+1,K) = B(J+1,K) + A(I)
      B(J+1,K) = B(J+1,K) + A(I+1)
    ENDDO
    C(J,I) = B(J,N)/A(I)
    C(J,I+1) = B(J,N)/A(I+1)
    C(J+1,I) = B(J+1,N)/A(I)
    C(J+1,I+1) = B(J+1,N)/A(I+1)
  ENDDO
ENDDO

Effectiveness of Unroll-and-Jam
