Static Code Scheduling CS 671 April 1, 2008
Code Scheduling
  - Scheduling (reordering) instructions to improve performance and/or guarantee correctness
  - Important for dynamically-scheduled architectures
  - Crucial (assumed!) for statically-scheduled architectures, e.g. VLIW or EPIC
  - Takes anticipated latencies into account
  - Machine-specific, so it is performed late in the optimization pipeline
  - How does this contrast with our earlier exploration of code motion?
Why Must the Compiler Schedule?
  - Many machines are pipelined and expose some aspects of pipelining to the user (compiler). Examples:
      - Branch delay slots
      - Memory-access delays
      - Multi-cycle operations
  - Some machines don't have scheduling hardware
Example
  Assume loads take 2 cycles and branches have a delay slot.
  instruction        start time
  r2 <- [r1]
  r3 <- [r1+4]
  r4 <- r2 + r3
  r5 <- r2 + 1
  goto L1
  nop
Example
  Assume loads take 2 cycles and branches have a delay slot.
  instruction        start time
  r2 <- [r1]
  r3 <- [r1+4]
  r5 <- r2 + 1
  goto L1
  r4 <- r2 + r3
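The start-time columns of the two examples above can be worked out mechanically. The following is a minimal sketch, assuming an in-order, single-issue machine in which a load's result is usable 2 cycles after the load starts and every other instruction takes 1 cycle. The helper name start_times and the tuple encoding are illustrative, not part of the lecture.

```python
# Hypothetical helper: start times on an in-order, single-issue machine.
# Each instruction is (dest, sources, latency), where latency is the number
# of cycles before the result may be used (assumed: loads 2, all others 1).

def start_times(instrs):
    ready = {}      # register -> first cycle in which its value may be read
    cycle = 0       # start time of the previous instruction
    times = []
    for dest, srcs, latency in instrs:
        # Start one cycle after the previous instruction, or later if an
        # operand is not available yet (registers never written are ready).
        cycle = max([cycle + 1] + [ready.get(r, 1) for r in srcs])
        times.append(cycle)
        if dest is not None:
            ready[dest] = cycle + latency
    return times

LOAD, ALU = 2, 1
original = [
    ("r2", ["r1"], LOAD),         # r2 <- [r1]
    ("r3", ["r1"], LOAD),         # r3 <- [r1+4]
    ("r4", ["r2", "r3"], ALU),    # r4 <- r2 + r3
    ("r5", ["r2"], ALU),          # r5 <- r2 + 1
    (None, [], ALU),              # goto L1
    (None, [], ALU),              # nop in the delay slot
]
scheduled = [
    ("r2", ["r1"], LOAD),         # r2 <- [r1]
    ("r3", ["r1"], LOAD),         # r3 <- [r1+4]
    ("r5", ["r2"], ALU),          # r5 <- r2 + 1
    (None, [], ALU),              # goto L1
    ("r4", ["r2", "r3"], ALU),    # r4 <- r2 + r3 fills the delay slot
]

print(start_times(original))      # [1, 2, 4, 5, 6, 7]
print(start_times(scheduled))     # [1, 2, 3, 4, 5]
```

Under these assumptions the reordered sequence removes the load stall and the nop, finishing two cycles earlier.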
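Code Scheduling Strategy
  - Get resources operating in parallel
      - Integer data path
      - Integer multiply / divide hardware
      - FP adder, multiplier, divider
  - Method
      - Fill the cycles between the start of an operation and the first use of its result with computations that need neither that result nor the same hardware resources
  - Drawbacks
      - Highly hardware dependent
  [Figure: timeline from "Start Op" to "Use Op"; the cycles in between are marked "Try to fill"]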
Scheduling Approaches
  - Local
      - Branch scheduling
      - Basic-block scheduling
  - Global
      - Cross-block scheduling
      - Software pipelining
      - Trace scheduling
      - Percolation scheduling
  - Basic-block and branch scheduling: simplest approach, 10% speedup
  - Cross-block scheduling considers a tree of blocks at once and may move instructions from one block to another
  - Software pipelining operates specifically on loops, 2X speedup
  - Trace and percolation scheduling are global approaches that work well for high-degree superscalar and VLIW machines
  - All are among the last optimizations performed (except the last two, which enable other transformations)
Branch Scheduling
  - Two problems:
      - Branches often take some number of cycles to complete
      - There can be a delay between a compare and its associated branch
  - A compiler will try to fill these slots with valid instructions (rather than nops)
  - Delay slots: present in PA-RISC, SPARC, MIPS
  - Condition delay: PowerPC, Pentium
Recall from Architecture...
  IF: Instruction Fetch
  ID: Instruction Decode
  EX: Execute
  MA: Memory Access
  WB: Write Back
  Three consecutive instructions in the pipeline:
    IF ID EX MA WB
       IF ID EX MA WB
          IF ID EX MA WB
Control Hazards
  Taken Branch         IF ID EX MA WB
  Instr + 1               IF -- -- -- --
  Branch Target              IF ID EX MA WB
  Branch Target + 1             IF ID EX MA WB
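Data Dependences
  - If two operations access the same register and at least one of them writes it, they are dependent
  - Types of data dependences
      - Flow (read after write):     r1 = r2 + r3
                                     r4 = r1 * 6
      - Anti (write after read):     r1 = r2 + r3
                                     r2 = r5 * 6
      - Output (write after write):  r1 = r2 + r3
                                     r1 = r4 * 6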
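The three dependence types can be told apart by comparing the destination and source registers of the two operations. The following is a minimal sketch; the function name classify and the (dest, sources) encoding are illustrative assumptions, and memory dependences are not considered.

```python
def classify(first, second):
    """Return the register dependences from `first` to `second`.

    Each instruction is a (dest, sources) pair, and `first` comes
    earlier in program order than `second`.
    """
    d1, s1 = first
    d2, s2 = second
    deps = set()
    if d1 in s2:
        deps.add("flow")      # second reads what first wrote
    if d2 in s1:
        deps.add("anti")      # second overwrites what first read
    if d1 == d2:
        deps.add("output")    # both write the same register
    return deps

add = ("r1", ["r2", "r3"])                 # r1 = r2 + r3
print(classify(add, ("r4", ["r1"])))       # {'flow'}    r4 = r1 * 6
print(classify(add, ("r2", ["r5"])))       # {'anti'}    r2 = r5 * 6
print(classify(add, ("r1", ["r4"])))       # {'output'}  r1 = r4 * 6
```

A pair of instructions can carry more than one dependence at once, which is why the function returns a set.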
Data Hazards
  Memory latency: data not ready
  lw  R1,0(R2)     IF ID EX MA WB
  add R3,R1,R4        IF ID stall EX MA WB
Data Hazards
  Instruction latency: execute takes > 1 cycle
  addf R3,R1,R2    IF ID EX EX MA WB
  addf R3,R3,R4       IF ID stall EX EX MA WB
  Assumes floating-point ops take 2 execute cycles
Multi-cycle Instructions
  - Scheduling is particularly important for multi-cycle operations
  - Alpha instructions with > 1 cycle latency (partial list)
      instruction                                latency (cycles)
      mull (32-bit integer multiply)              8
      mulq (64-bit integer multiply)             16
      addt (fp add)                               4
      mult (fp multiply)                          4
      divs (fp single-precision divide)          10
      divt (fp double-precision divide)          23
Avoiding Data Hazards
  - Move loads earlier and stores later (assuming this does not violate correctness)
  - Other stalls may require more sophisticated reordering, e.g. ((a+b)+c)+d becomes (a+b)+(c+d)
  - How can we do this in a systematic way?
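To see why the reassociated form helps, compare the longest chain of dependent additions in each expression. The following is a small sketch; the nested-tuple encoding and the name depth are assumptions made for illustration. Note that for floating-point values reassociation can change the rounded result, so a compiler may only apply it when that is permitted.

```python
def depth(expr):
    """Length of the longest chain of dependent additions in `expr`.

    A leaf is a variable name (str); an addition is a (left, right) tuple.
    """
    if isinstance(expr, str):
        return 0
    left, right = expr
    return 1 + max(depth(left), depth(right))

chain = ((("a", "b"), "c"), "d")       # ((a+b)+c)+d
balanced = (("a", "b"), ("c", "d"))    # (a+b)+(c+d)

print(depth(chain))      # 3: each add waits for the previous one
print(depth(balanced))   # 2: a+b and c+d are independent of each other
```

Both forms perform three additions, but in the balanced form two of them have no dependence between them, so they can overlap in the pipeline.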
Example: Without Scheduling
  Start Time   Code
               lw   r1, w
               add  r1,r1,r1
               lw   r2,x
               mult r1,r1,r2
               lw   r2,y
               lw   r2,z
               sw   r1, a
  Assume:
    - memory instrs take 3 cycles
    - mult takes 2 cycles (to have result in register)
    - rest take 1 cycle
  ____ cycles
Basic Block Dependence DAGs
  - Nodes: instructions
  - Edges: dependence between I1 and I2
  - When we cannot determine whether there is a dependence, we must assume there is one
  a) lw R2, (R1)
  b) lw R3, 4(R1)
  c) R4 <- R2 + R3
  d) R5 <- R2 - 1
  DAG (edge labels in parentheses): a -> c (2), b -> c (2), a -> d (2)
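The DAG for the four-instruction block above can be built by comparing every instruction with each earlier one. The following is a minimal sketch under stated assumptions: only register dependences are tracked (no memory dependences), a flow edge is labeled with the producer's latency, anti and output edges get an assumed delay of 1, and no attempt is made to prune redundant edges. The name build_dag and the tuple encoding are illustrative.

```python
def build_dag(instrs, latency):
    """Return {(pred, succ): delay} for one basic block.

    `instrs` is a list of (name, op, dest, sources) in program order;
    `latency` maps op -> cycles until its result is usable.
    """
    edges = {}
    for j, (nj, _, dj, sj) in enumerate(instrs):
        for ni, opi, di, si in instrs[:j]:
            if di in sj:                      # flow dependence
                edges[(ni, nj)] = latency[opi]
            elif dj in si or dj == di:        # anti or output dependence
                edges[(ni, nj)] = 1           # assumed: ordering only
    return edges

block = [
    ("a", "lw",  "R2", ["R1"]),           # lw R2, (R1)
    ("b", "lw",  "R3", ["R1"]),           # lw R3, 4(R1)
    ("c", "add", "R4", ["R2", "R3"]),     # R4 <- R2 + R3
    ("d", "sub", "R5", ["R2"]),           # R5 <- R2 - 1
]
latency = {"lw": 2, "add": 1, "sub": 1}   # assumed: loads take 2 cycles

print(build_dag(block, latency))
# {('a', 'c'): 2, ('b', 'c'): 2, ('a', 'd'): 2}
```

The three edges and their labels match the DAG on the slide; a production scheduler would also add edges for memory operations that may alias.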
Example: Build the DAG
  Code
  a   lw   r1, w
  b   add  r1,r1,r1
  c   lw   r2,x
  d   mult r1,r1,r2
  e   lw   r2,y
  f
  g   lw   r2,z
  h
  i   sw   r1, a
  Assume:
    - memory instrs = 3 cycles
    - mult = 2 cycles (to have result in register)
    - rest = 1 cycle
Creating a Schedule
  - Create a DAG of dependences
  - Determine priority
  - Schedule instructions with
      - Ready operands
      - Highest priority
  - Heuristics: if there are multiple possibilities, fall back on other priority functions
Operation Priority
  - Priority: need a mechanism to decide which ops to schedule first (when you have choices)
  - Common priority functions
      - Height: distance from exit node
          - Gives priority to amount of work left to do
      - Slackness: inversely proportional to slack
          - Gives priority to ops on the critical path
      - Register use: priority to nodes with more source operands and fewer destination operands
          - Reduces number of live registers
      - Uncover: high priority to nodes with many children
          - Frees up more nodes
      - Original order: when all else fails
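Computing Priorities
  - Height of a node n in the dependence DAG:
    $$\mathrm{Height}(n) = \begin{cases} \mathrm{exec}(n) & \text{if } n \text{ is a leaf} \\ \max_{m \in \mathrm{succ}(n)} \mathrm{Height}(m) + \mathrm{exec}(n) & \text{otherwise} \end{cases}$$
  - Critical path(s) = path(s) through the dependence DAG with the longest latency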
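The height recurrence translates directly into a memoized recursion. The following is a sketch that applies it to the small a/b/c/d DAG from the dependence-DAG slide, assuming exec = 2 for the loads and 1 for the arithmetic ops. The names heights, critical_path, succs and exec_time are illustrative.

```python
from functools import lru_cache

def heights(succs, exec_time):
    """Height(n) = exec(n) + max Height(m) over successors m (0 for a leaf)."""
    @lru_cache(maxsize=None)
    def h(n):
        return exec_time[n] + max((h(m) for m in succs.get(n, ())), default=0)
    return {n: h(n) for n in exec_time}

def critical_path(succs, exec_time):
    """Follow the tallest successor from the tallest node (ties: first found)."""
    hs = heights(succs, exec_time)
    node = max(hs, key=hs.get)
    path = [node]
    while succs.get(node):
        node = max(succs[node], key=hs.get)
        path.append(node)
    return path

succs = {"a": ["c", "d"], "b": ["c"]}            # edges a->c, a->d, b->c
exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}     # assumed execution times

print(heights(succs, exec_time))   # {'a': 3, 'b': 3, 'c': 1, 'd': 1}
print(critical_path(succs, exec_time))
```

This DAG has several equally long critical paths, so critical_path returns just one of them; its total execution time equals the largest height.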
Example: Determine Height and CP
  Code
  a   lw   r1, w
  b   add  r1,r1,r1
  c   lw   r2,x
  d   mult r1,r1,r2
  e   lw   r2,y
  f
  g   lw   r2,z
  h
  i   sw   r1, a
  Assume:
    - memory instrs = 3 cycles
    - mult = 2 cycles (to have result in register)
    - rest = 1 cycle
  Critical path: _______
Example: List Scheduling
  Code
  a   lw   r1, w
  b   add  r1,r1,r1
  c   lw   r2,x
  d   mult r1,r1,r2
  e   lw   r2,y
  f
  g   lw   r2,z
  h
  i   sw   r1, a
  start   Schedule
  _____ cycles
Scheduling vs. Register Allocation
  Code
  a   lw  r1 <- (r12)
  b   lw  r2 <- (r12+4)
  c   r1 <- r1+r2
  d   stw (r12) <- r1
  e   lw  r1 <- (r12+8)
  f   lw  r2 <- (r12+12)
  g   r2 <- r1+r2
Register Renaming
  Code
  a   lw  r1 <- (r12)
  b   lw  r2 <- (r12+4)
  c   r3 <- r1+r2
  d   stw (r12) <- r3
  e   lw  r4 <- (r12+8)
  f   lw  r5 <- (r12+12)
  g   r6 <- r4+r5
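The renaming shown above can be produced by giving a fresh register to every definition that reuses a register already written in the block, then rewriting later uses to match. The following is a minimal sketch under stated assumptions: instructions are (dest, sources) pairs, fresh names start at a number chosen so they do not collide with registers in use, and values that must remain in a particular register at block exit are ignored. The name rename is illustrative.

```python
from itertools import count

def rename(instrs, first_fresh):
    """Rename redefinitions within one basic block.

    `instrs` is a list of (dest, sources); dest is None for stores.
    The first write to a register keeps its name; every later write
    gets a fresh register, and subsequent reads follow the new name.
    """
    fresh = count(first_fresh)
    current = {}                  # original name -> name holding its value
    out = []
    for dest, srcs in instrs:
        srcs = [current.get(r, r) for r in srcs]
        if dest is not None:
            new = f"r{next(fresh)}" if dest in current else dest
            current[dest] = new
            dest = new
        out.append((dest, srcs))
    return out

block = [
    ("r1", ["r12"]),          # a  lw  r1 <- (r12)
    ("r2", ["r12"]),          # b  lw  r2 <- (r12+4)
    ("r1", ["r1", "r2"]),     # c  r1 <- r1+r2
    (None, ["r12", "r1"]),    # d  stw (r12) <- r1
    ("r1", ["r12"]),          # e  lw  r1 <- (r12+8)
    ("r2", ["r12"]),          # f  lw  r2 <- (r12+12)
    ("r2", ["r1", "r2"]),     # g  r2 <- r1+r2
]
for dest, srcs in rename(block, first_fresh=3):
    print(dest, srcs)
```

After renaming, instructions e and f no longer have anti or output dependences on a through d, so the scheduler is free to hoist the second pair of loads.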
VLIW
  - Very Long Instruction Word
  - Compiler determines exactly what is issued every cycle (before the program is run)
  - Schedules also account for latencies
  - All hardware changes result in a compiler change
  - Usually embedded systems (hence simple HW)
  - Itanium is actually an EPIC-style machine (the compiler accounts for most parallelism, not latencies)
Sample VLIW Code
  VLIW processor: 5 issue
    - 2 Add/Sub units (1 cycle)
    - 1 Mul/Div unit (2 cycle, unpipelined)
    - 1 LD/ST unit (2 cycle, pipelined)
    - 1 Branch unit (no delay slots)
  Add/Sub      Add/Sub      Mul/Div      Ld/St         Branch
  c = a + b    d = a - b    e = a * b    ld j = [x]    nop
  g = c + d    h = c - d    nop          ld k = [y]    nop
  nop          nop          i = j * c    ld f = [z]    br g
Multi-Issue Scheduling Example
  Machine: 2 issue, 1 memory port, 1 ALU
  Memory port = 2 cycles, non-pipelined
  ALU = 1 cycle
  [Figure: dependence DAG over ops 1, 2m, 3m, 4, 5m, 6, 7m, 8, 9, 10 (m marks a memory op)]
  [Worksheet: RU_map with columns time, ALU, MEM; Schedule with columns time, Ready, Placed; rows for times 1 through 9]
Earliest Latest Sets
  Machine: 2 issue, 1 memory port, 1 ALU
  Memory port = 2 cycles, pipelined
  ALU = 1 cycle
  [Figure: dependence DAG over ops 1m, 2m, 3, 4m, 5, 6, 7, 8, 9m, 10 (m marks a memory op)]
List Scheduling Algorithm
  Build dependence graph, calculate priority
  Add all ops to UNSCHEDULED set
  time = 0
  while (UNSCHEDULED is not empty)
      time++
      READY = UNSCHEDULED ops whose incoming deps have been satisfied
      Sort READY using priority function
      For each op in READY (highest to lowest priority)
          op can be scheduled at current time? (resources free?)
              Yes: schedule it, op.issue_time = time
                   Mark resources busy in RU_map relative to issue time
                   Remove op from UNSCHEDULED/READY sets
              No:  continue
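The pseudocode above maps closely onto a short Python function. The following is a sketch, not a reference implementation: the data layout (ops, preds, priority, units) and the name list_schedule are assumptions, issue width is taken to equal the number of functional units, and a dependence counts as satisfied once the predecessor's latency has elapsed. The example reuses the a/b/c/d DAG from the dependence-DAG slide on a machine with one non-pipelined 2-cycle memory port and one ALU, with heights as priorities.

```python
def list_schedule(ops, preds, priority, units):
    """Greedy list scheduling for one basic block.

    ops:      {name: (unit, latency, busy)} where busy is the number of
              cycles the unit stays occupied (1 if pipelined)
    preds:    {name: [predecessor names]}
    priority: {name: number}, larger is scheduled first
    units:    {unit: number of copies}
    """
    issue = {}                        # op -> issue cycle
    ru_map = {}                       # (unit, cycle) -> copies in use
    unscheduled = set(ops)
    time = 0
    while unscheduled:
        time += 1
        ready = [o for o in unscheduled
                 if all(p in issue and issue[p] + ops[p][1] <= time
                        for p in preds.get(o, ()))]
        ready.sort(key=lambda o: (-priority[o], o))   # ties: by name
        for o in ready:
            unit, _, busy = ops[o]
            span = range(time, time + busy)
            if all(ru_map.get((unit, t), 0) < units[unit] for t in span):
                for t in span:        # mark the unit busy in RU_map
                    ru_map[(unit, t)] = ru_map.get((unit, t), 0) + 1
                issue[o] = time
                unscheduled.remove(o)
    return issue

ops = {"a": ("MEM", 2, 2), "b": ("MEM", 2, 2),
       "c": ("ALU", 1, 1), "d": ("ALU", 1, 1)}
preds = {"c": ["a", "b"], "d": ["a"]}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}     # heights of the nodes
units = {"MEM": 1, "ALU": 1}

print(list_schedule(ops, preds, priority, units))
```

With the non-pipelined memory port, b cannot issue until cycle 3, so the result is a at 1, b and d at 3, and c at 5. Setting busy to 1 for the loads models a pipelined port.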
Improving Basic Block Scheduling
  - Loop unrolling: creates longer basic blocks
  - Register renaming: can change register usage in blocks to remove immediate reuse of registers
Summary
  - Static scheduling complements (or replaces) dynamic scheduling by the hardware