EECS 583 – Class 15 Register Allocation

Slides:

Advertisements

Similar presentations

Register Allocation Consists of two parts: Goal : minimize spills

Advertisements

Compiler Support for Superscalar Processors. Loop Unrolling Assumption: Standard five stage pipeline Empty cycles between instructions before the result.

Register Usage Keep as many values in registers as possible Register assignment Register allocation Popular techniques – Local vs. global – Graph coloring.

P3 / 2004 Register Allocation. Kostis Sagonas 2 Spring 2004 Outline What is register allocation Webs Interference Graphs Graph coloring Spilling Live-Range.

Register allocation Morgensen, Torben. "Register Allocation." Basics of Compiler Design. pp from (

Register Allocation CS 320 David Walker (with thanks to Andrew Myers for most of the content of these slides)

EECS 583 – Class 13 Software Pipelining University of Michigan October 24, 2011.

Architecture-dependent optimizations Functional units, delay slots and dependency analysis.

Coalescing Register Allocation CS153: Compilers Greg Morrisett.

Register Allocation Mooly Sagiv Schrierber Wed 10:00-12:00 html://

COMPILERS Register Allocation hussein suleman uct csc305w 2004.

EECC551 - Shaaban #1 Fall 2003 lec# Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining increases performance by overlapping.

Stanford University CS243 Winter 2006 Wei Li 1 Register Allocation.

Register Allocation CS 671 March 27, CS 671 – Spring Register Allocation - Motivation Consider adding two numbers together: Advantages: Fewer.

Program Representations. Representing programs Goals.

Carnegie Mellon Lecture 6 Register Allocation I. Introduction II. Abstraction and the Problem III. Algorithm Reading: Chapter Before next class:

1 CS 201 Compiler Construction Lecture 12 Global Register Allocation.

Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.

EECC551 - Shaaban #1 Fall 2005 lec# Pipelining and Instruction-Level Parallelism. Definition of basic instruction block Increasing Instruction-Level.

Prof. Bodik CS 164 Lecture 171 Register Allocation Lecture 19.

Register Allocation (via graph coloring)

U NIVERSITY OF M ASSACHUSETTS, A MHERST Department of Computer Science Emery Berger University of Massachusetts, Amherst Advanced Compilers CMPSCI 710.

Register Allocation (via graph coloring). Lecture Outline Memory Hierarchy Management Register Allocation –Register interference graph –Graph coloring.

1 Liveness analysis and Register Allocation Cheng-Chia Chen.

Improving Code Generation Honors Compilers April 16 th 2002.

EECC551 - Shaaban #1 Spring 2004 lec# Definition of basic instruction blocks Increasing Instruction-Level Parallelism & Size of Basic Blocks.

Improving code generation. Better code generation requires greater context Over expressions: optimal ordering of subtrees Over basic blocks: Common subexpression.

4/29/09Prof. Hilfinger CS164 Lecture 381 Register Allocation Lecture 28 (from notes by G. Necula and R. Bodik)

CS745: Register Allocation© Seth Copen Goldstein & Todd C. Mowry Register Allocation.

Supplementary Lecture – Register Allocation EECS 483 University of Michigan.

EECS 583 – Class 14 Modulo Scheduling Reloaded University of Michigan October 31, 2011.

U NIVERSITY OF D ELAWARE C OMPUTER & I NFORMATION S CIENCES D EPARTMENT Optimizing Compilers CISC 673 Spring 2009 Register Allocation John Cavazos University.

CMPE 511 Computer Architecture A Faster Optimal Register Allocator Betül Demiröz.

ANALYSIS AND IMPLEMENTATION OF GRAPH COLORING ALGORITHMS FOR REGISTER ALLOCATION By, Sumeeth K. C Vasanth K.

EECS 583 – Class 15 Register Allocation University of Michigan November 2, 2011.

Register Usage Keep as many values in registers as possible Keep as many values in registers as possible Register assignment Register assignment Register.

Register Allocation CS 471 November 12, CS 471 – Fall 2007 Register Allocation - Motivation Consider adding two numbers together: Advantages: Fewer.

2/22/2016© Hal Perkins & UW CSEP-1 CSE P 501 – Compilers Register Allocation Hal Perkins Winter 2008.

Carnegie Mellon Lecture 8 Software Pipelining I. Introduction II. Problem Formulation III. Algorithm Reading: Chapter 10.5 – 10.6 M. LamCS243: Software.

Global Register Allocation Based on

Topic Register Allocation

The compilation process

Register Allocation Hal Perkins Autumn 2009

Henk Corporaal TUEindhoven 2009

CSC D70: Compiler Optimization Register Allocation

Register Allocation Hal Perkins Autumn 2011

EECS 583 – Class 13 Software Pipelining

Instruction Scheduling Hal Perkins Winter 2008

Register Allocation Hal Perkins Summer 2004

Register Allocation Hal Perkins Autumn 2005

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

EECS 583 – Class 14 Modulo Scheduling Reloaded

EECS 583 – Class 13 Modulo Scheduling

EECS 583 – Class 7 Static Single Assignment Form

Lecture 16: Register Allocation

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

EECS 583 – Class 15 Register Allocation

EECS 583 – Class 7 Static Single Assignment Form

EECS 583 – Class 12 Superblock Scheduling Intro. to Loop Scheduling

Compiler Construction

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Lecture 17: Register Allocation via Graph Colouring

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

CMSC 611: Advanced Computer Architecture

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Target Code Generation

Fall Compiler Principles Lecture 13: Summary

(via graph coloring and spilling)

CS 201 Compiler Construction

Presentation transcript:

EECS 583 – Class 15 Register Allocation University of Michigan November 5, 2012

Announcements + Reading Material Did your group submit a project proposal? If not, get it in today Midterm exam: Nov 12, 14, or 19? Class vote Today’s class reading “Register Allocation and Spilling Via Graph Coloring,” G. Chaitin, Proc. 1982 SIGPLAN Symposium on Compiler Construction, 1982. Next class reading “Revisiting the Sequential Programming Model for Multi-Core,” M. J. Bridges, N. Vachharajani, Y. Zhang, T. Jablin, and D. I. August, Proc 40th IEEE/ACM International Symposium on Microarchitecture, December 2007.

Homework Problem – Answers in Red latencies: add=1, mpy=3, ld = 2, st = 1, br = 1 How many resources of each type are required to achieve an II=1 schedule? For II=1, each operation needs a dedicated resource, so: 3 ALU, 2 MEM, 1 BR If the resources are non-pipelined, how many resources of each type are required to achieve II=1 Instead of 1 ALU to do the multiplies, 3 are needed, and instead of 1 MEM to do the loads, 2 are needed. Hence: 5 ALU, 3 MEM, 1 BR Assuming pipelined resources, generate the II=1 modulo schedule. See next few slides for (j=0; j<100; j++) b[j] = a[j] * 26 LC = 99 Loop: 1: r3 = load(r1) 2: r4 = r3 * 26 3: store (r2, r4) 4: r1 = r1 + 4 5: r2 = r2 + 4 7: brlc Loop

HW continued Assume II=1 so resources are: 3 ALU, 2 MEM, 1 BR Dependence graph (same as example in class) DSA converted code below (same as example in class) 1,1 RecMII = 1 RESMII = 1 MII = MAX(1,1) = 1 1 LC = 99 2,0 0,0 2 Loop: 1: r3[-1] = load(r1[0]) 2: r4[-1] = r3[-1] * 26 3: store (r2[0], r4[-1]) 4: r1[-1] = r1[0] + 4 5: r2[-1] = r2[0] + 4 remap r1, r2, r3, r4 7: brlc Loop 3,0 3 Priorities 1: H = 5 2: H = 3 3: H = 0 4: H = 4 5: H = 0 7: H = 0 1,1 1,1 0,0 4 5 1,1 1,1 7

HW continued LC = 99 Unrolled Schedule Rolled Schedule Loop: resources: 3 alu, 2 mem, 1 br latencies: add=1, mpy=3, ld = 2, st = 1, br = 1 LC = 99 Unrolled Schedule Rolled Schedule Loop: 1: r3[-1] = load(r1[0]) 2: r4[-1] = r3[-1] * 26 3: store (r2[0], r4[-1]) 4: r1[-1] = r1[0] + 4 5: r2[-1] = r2[0] + 4 remap r1, r2, r3, r4 7: brlc Loop stage 1 1 4 stage 2 1 7 1 4 2 3 5 stage 3 2 2 stage 4 3 stage 5 4 stage 6 5 3 5 7 6 Scheduling steps: Schedule brlc at time II-1 Schedule op1 at time 0 Schedule op4 at time 0 Schedule op2 at time 2 Schedule op3 at time 5 Schedule op5 at time 5 Schedule op7 at time 5 alu0 alu1 alu2 m1 m2 br X X X X X X MRT

HW continued The final loop consists of a single MultiOp containing 6 operations, each predicated on the appropriate staging predicate. Note register allocation still needs to be performed. LC = 99 Loop: r3[-1] = load(r1[0]) if p1[0]; r4[-1] = r3[-1] * 26 if p1[2]; store (r2[0], r4[-1]) if p1[5]; r1[-1] = r1[0] + 4 if p1[0]; r2[-1] = r2[0] + 4 if p1[5]; brf Loop

What if We Don’t Have Hardware Support? No predicates Predicates enable kernel-only code by selectively enabling/disabling operations to create prolog/epilog Now must create explicit prolog/epilog code segments No rotating registers Register names not automatically changed each iteration Must unroll the body of the software pipeline, explicitly rename Consider each register lifetime i in the loop Kmin = min unroll factor = MAXi (ceiling((Endi – Starti) / II)) Create Kmin static names to handle maximum register lifetime Apply modulo variable expansion

No Predicates Kernel-only code with rotating registers and B A A B A Kernel-only code with rotating registers and predicates, II = 1 prolog C B A D C B A C B B kernel E D C B A D C B D C C E D C B D E D C epilog E D E Without predicates, must create explicit prolog and epilogs, but no explicit renaming is needed as rotating registers take care of this

No Predicates and No Rotating Registers Assume Kmin = 4 for this example A1 prolog B1 A2 B1 C1 B2 A3 C1 B2 C1 D1 C2 B3 A4 D1 C2 B3 D1 C2 D1 E1 D2 C3 B4 A1 E2 D3 C4 B1 A2 unrolled kernel E3 D4 C1 B2 A3 E4 D1 C2 B3 A4 E1 D2 C3 B4 E4 D1 C2 B3 E3 D4 C1 B2 E2 D3 C4 B1 E2 D3 C4 E1 D2 C3 E4 D1 C2 E3 D4 C1 epilog E3 D4 E2 D3 E1 D2 E4 D1 E4 E3 E2 E1

Register Allocation: Problem Definition Through optimization, assume an infinite number of virtual registers Now, must allocate these infinite virtual registers to a limited supply of hardware registers Want most frequently accessed variables in registers Speed, registers much faster than memory Direct access as an operand Any VR that cannot be mapped into a physical register is said to be spilled Questions to answer What is the minimum number of registers needed to avoid spilling? Given n registers, is spilling necessary Find an assignment of virtual registers to physical registers If there are not enough physical registers, which virtual registers get spilled?

Live Range Value = definition of a register Live range = Set of operations 1 more or values connected by common uses A single VR may have several live ranges Very similar to the web being constructed for HW3 Live ranges are constructed by taking the intersection of reaching defs and liveness Initially, a live range consists of a single definition and all ops in a function in which that definition is live

Example – Constructing Live Ranges {liveness}, {rdefs} 1: x = {}, {1} {x}, {1} 2: x = 3: {x}, {2} {x}, {1} 4: = x Each definition is the seed of a live range. Ops are added to the LR where both the defn reaches and the variable is live {}, {1,2} 5: x = {}, {5} {}, {5,6} {x}, {5} 6: x = LR1 for def 1 = {1,3,4} LR2 for def 2 = {2,4} LR3 for def 5 = {5,7,8} LR4 for def 6 = {6,7,8} {x}, {6} 7: = x {x}, {5,6} 8: = x

Merging Live Ranges If 2 live ranges for the same VR overlap, they must be merged to ensure correctness LRs replaced by a new LR that is the union of the LRs Multiple defs reaching a common use Conservatively, all LRs for the same VR could be merged Makes LRs larger than need be, but done for simplicity We will not assume this r1 = r1 = = r1

Example – Merging Live Ranges {liveness}, {rdefs} 1: x = LR1 for def 1 = {1,3,4} LR2 for def 2 = {2,4} LR3 for def 5 = {5,7,8} LR4 for def 6 = {6,7,8} {}, {1} {x}, {1} 2: x = 3: {x}, {2} {x}, {1} 4: = x {}, {1,2} 5: x = Merge LR1 and LR2, LR3 and LR4 LR5 = {1,2,3,4} LR6 = {5,6,7,8} {}, {5} {}, {5,6} {x}, {5} 6: x = {x}, {6} 7: = x {x}, {5,6} 8: = x

Class Problem Compute the LRs 1: y = for each def 2: x = y merge overlapping 1: y = 2: x = y 3: = x 4: y = 5: = y 6: y = 7: z = 8: x = 9: = y 10: = z

Interference Two live ranges interfere if they share one or more ops in common Thus, they cannot occupy the same physical register Or a live value would be lost Interference graph Undirected graph where Nodes are live ranges There is an edge between 2 nodes if the live ranges interfere What’s not represented by this graph Extent of interference between the LRs Where in the program is the interference

Example – Interference Graph lr(a) = {1,2,3,4,5,6,7,8} lr(b) = {2,3,4,6} lr(c) = {1,2,3,4,5,6,7,8,9} lr(d) = {4,5} lr(e) = {5,7,8} lr(f) = {6,7} lr{g} = {8,9} 1: a = load() 2: b = load() 3: c = load() 4: d = b + c 5: e = d - 3 6: f = a * b 7: e = f + c a b c d 8: g = a + e 9: store(g) e f g

Graph Coloring A graph is n-colorable if every node in the graph can be colored with one of the n colors such that 2 adjacent nodes do not have the same color Model register allocation as graph coloring Use the fewest colors (physical registers) Spilling is necessary if the graph is not n-colorable where n is the number of physical registers Optimal graph coloring is NP-complete for n > 2 Use heuristics proposed by compiler developers “Register Allocation Via Coloring”, G. Chaitin et al, 1981 “Improvement to Graph Coloring Register Allocation”, P. Briggs et al, 1989 Observation – a node with degree < n in the interference can always be successfully colored given its neighbors colors

Coloring Algorithm 1. While any node, x, has < n neighbors Remove x and its edges from the graph Push x onto a stack 2. If the remaining graph is non-empty Compute cost of spilling each node (live range) For each reference to the register in the live range Cost += (execution frequency * spill cost) Let NB(x) = number of neighbors of x Remove node x that has the smallest cost(x) / NB(x) Push x onto a stack (mark as spilled) Go back to step 1 While stack is non-empty Pop x from the stack If x’s neighbors are assigned fewer than R colors, then assign x any unsigned color, else leave x uncolored

Example – Finding Number of Needed Colors How many colors are needed to color this graph? B Try n=1, no, cannot remove any nodes Try n=2, no again, cannot remove any nodes Try n=3, Remove B Then can remove A, C Then can remove D, E Thus it is 3-colorable A E C D

Example – Do a 3-Coloring lr(a) = {1,2,3,4,5,6,7,8} refs(a) = {1,6,8} lr(b) = {2,3,4,6} refs(b) = {2,4,6} lr(c) = {1,2,3,4,5,6,7,8,9} refs(c) = {3,4,7} lr(d) = {4,5} refs(d) = {4,5} lr(e) = {5,7,8} refs(e) = {5,7,8} lr(f) = {6,7} refs(f) = {6,7} lr{g} = {8,9} refs(g) = {8,9} a b Profile freqs 1,2 = 100 3,4,5 = 75 6,7 = 25 8,9 = 100 Assume each spill requires 1 operation c d e f g a b c d e f g cost 225 200 175 150 200 50 200 neighbors 6 4 5 4 3 4 2 cost/n 37.5 50 35 37.5 66.7 12.5 100

Example – Do a 3-Coloring (2) Remove all nodes < 3 neighbors So, g can be removed Stack g a b a b c d c d e e f f g

Example – Do a 3-Coloring (3) Now must spill a node Choose one with the smallest cost/NB  f is chosen Stack f (spilled) g a b a b c c d d e e f

Example – Do a 3-Coloring (4) Remove all nodes < 3 neighbors So, e can be removed Stack e f (spilled) g a b a b c c d d e

Example – Do a 3-Coloring (5) Now must spill another node Choose one with the smallest cost/NB  c is chosen Stack c (spilled) e f (spilled) g a b a b d c d

Example – Do a 3-Coloring (6) Remove all nodes < 3 neighbors So, a, b, d can be removed Stack d b a c (spilled) e f (spilled) g a b Null d

Example – Do a 3-Coloring (7) Stack d b a c (spilled) e f (spilled) g a b c d e f g Have 3 colors: red, green, blue, pop off the stack assigning colors only consider conflicts with non-spilled nodes already popped off stack d  red b  green (cannot choose red) a  blue (cannot choose red or green) c  no color (spilled) e  green (cannot choose red or blue) f  no color (spilled) g  red (cannot choose blue)

Example – Do a 3-Coloring (8) d  red b  green a  blue c  no color e  green f  no color g  red 1: blue = load() 2: green = load() 3: spill1 = load() 4: red = green + spill1 5: green = red - 3 6: spill2 = blue * green 7: green = spill2 + spill1 Notes: no spills in the blocks executed 100 times. Most spills in the block executed 25 times. Longest lifetime (c) also spilled 8: red = blue + green 9: store(red)

Homework Problem do a 2-coloring 1: y = compute cost matrix 2: x = y draw interference graph color graph 1: y = 2: x = y 1 3: = x 10 90 4: y = 5: = y 6: y = 7: z = 8: x = 9: = y 99 1 10: = z

It’s not that easy – Iterative Coloring You can’t spill without creating more live ranges Need regs for the stack ptr, value spilled, [offset] Can’t color before taking this into account 1: blue = load() 2: green = load() 3: spill1 = load() 4: red = green + spill1 5: green = red - 3 6: spill2 = blue * green 7: green = spill2 + spill1 8: red = blue + green 9: store(red)

Iterative Coloring (1) 0: c = 15: store(c, sp) 1: a = load() 2: b = load() 3: c = load() 10: store(c, sp) 11: i = load(sp) 4: d = b + i 5: e = d - 3 6: f = a * b 12: store(f, sp + 4) 13: j = load(sp + 4) 14: k = load(sp) 7: e = k + j 8: g = a + e 9: store(g) After spilling, assign variables to a stack location, insert loads/stores

Iterative Coloring (2) Update live ranges - Don’t need to recompute! 15: store(c, sp) lr(a) = {1,2,3,4,5,6,7,8,10,11,12,13,14} refs(a) = {1,6,8} lr(b) = {2,3,4,6,10,11} lr(c) = {3,10} (This was big) lr(d) = … lr(e) = … lr(f) = … lr(g) = … lr(i) = {4,11} lr(j) = {7,13,14} lr(k) = {7,14} lr(sp) = … 1: a = load() 2: b = load() 3: c = load() 10: store(c, sp) 11: i = load(sp) 4: d = b + i 5: e = d - 3 6: f = a * b 12: store(f, sp + 4) 13: j = load(sp + 4) 14: k = load(sp) 7: e = k + j 8: g = a + e 9: store(g) Update live ranges - Don’t need to recompute!

Iterative Coloring (3) Update interference graph b i c d j e k f g Update interference graph Nuke edges between spilled LRs

Iterative Coloring (4) Add edges for new/spilled LRs j k i a b c d e f g Add edges for new/spilled LRs Stack ptr (almost) always interferes with everything so ISAs usually just reserve a reg for it. 4. Recolor and repeat until no new spill is generated