University of Massachusetts, Amherst
Department of Computer Science

CMPSCI 710 Advanced Compilers, Spring 2003
Instruction Scheduling
Emery Berger, University of Massachusetts, Amherst

Modern Architectures
- Lots of features to increase performance and hide memory latency:
  - Superscalar
    - Multiple logic units
    - Multiple issue: 2 or more instructions issued per cycle
  - Speculative execution
    - Branch predictors
    - Speculative loads
  - Deep pipelines

Instruction Scheduling
- Challenges to achieving instruction-level parallelism:
  - Structural hazards: insufficient resources to exploit parallelism
  - Data hazards: instruction depends on result of previous instruction still in pipeline
  - Control hazards: branches & jumps modify the PC
    - Affect which instructions should be in the pipeline

Scheduling for Pipelined Architectures
- Compiler reorders ("schedules") instructions to maximize ILP = minimize stalls ("bubbles") in the pipeline
- Performed after code generation & register allocation
- First approach: [Hennessy & Gross 1983], O(n^4), n = instructions in basic block
- Today: [Gibbons & Muchnick 1986], O(n^2)

Gibbons & Muchnick, I
- Assumptions:
  - Hardware hazard detection: algorithm not required to introduce nops
  - Each memory location referenced via offset of a single base register
  - Pointer may reference all of memory
  - Load followed by add creates interlock (stall)
  - Hazards only take a single cycle

Gibbons & Muchnick, II
- For each basic block:
  - Construct a directed acyclic graph (DAG) using dependences between statements
    - Node = statement / instruction
    - Edge (a,b) = statement a must execute before b
  - Schedule instructions using the DAG

Dependence DAG
- Cannot reorder two dependent instructions
- Data dependencies:
  - True dependence (RAW = read-after-write)
    - Instruction can't be executed until all required operands are available
  - Anti-dependence (WAR = write-after-read)
    - Write must not occur before read
  - Output dependence (WAW = write-after-write)
    - Earlier write cannot overwrite later one
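
To make these dependence kinds concrete, here is a small illustrative sketch (not from the paper) that classifies dependence edges within a basic block, assuming a hypothetical Instr record carrying defs/uses register sets; memory and control dependences are ignored for brevity:

```python
from collections import namedtuple

# Hypothetical instruction representation: the registers it defines and uses.
Instr = namedtuple("Instr", ["name", "defs", "uses"])

def dependence_edges(block):
    """Return edges (i, j, kind) meaning block[i] must execute before block[j]."""
    edges = []
    for j in range(len(block)):
        for i in range(j):
            a, b = block[i], block[j]
            if a.defs & b.uses:
                edges.append((i, j, "RAW"))   # true dependence
            if a.uses & b.defs:
                edges.append((i, j, "WAR"))   # anti-dependence
            if a.defs & b.defs:
                edges.append((i, j, "WAW"))   # output dependence
    return edges

# Example block: r1 = load [r2]; r3 = r1 + 1; r1 = 5
block = [Instr("load", defs={"r1"}, uses={"r2"}),
         Instr("add",  defs={"r3"}, uses={"r1"}),
         Instr("move", defs={"r1"}, uses=set())]
print(dependence_edges(block))   # [(0, 1, 'RAW'), (0, 2, 'WAW'), (1, 2, 'WAR')]
```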

Scheduling Example
1. r8 = [r12+8](4)
2. r1 = r8 + 1
3. r2 = 2
4. call r14,r31
5. nop
6. r9 = r1 + 1

Scheduling Example
1. r8 = [r12+8](4)
2. r1 = r8 + 1
3. r2 = 2
4. call r14,r31
5. nop
6. r9 = r1 + 1
- We can reschedule to remove the nop in the delay slot: (1,3,4,2,6)

Scheduling Algorithm
- Construct dependence DAG on basic block
- Put roots in candidate set
- Use scheduling heuristics (in order) to select an instruction
  - Take into account terminating instructions of predecessor basic blocks
- While candidate set not empty (see the sketch below):
  - Evaluate all candidates and select the best one
  - Delete scheduled instruction from candidate set
  - Add newly-exposed candidates
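
As a rough sketch (assuming the dependence DAG is already built and represented by preds/succs dictionaries of node sets), the loop above might look like this; the heuristic choice is abstracted into a best(cands) function, illustrated after the next slide:

```python
def list_schedule(nodes, preds, succs, best):
    """Greedy list scheduling over a dependence DAG.
    nodes: instruction ids; preds/succs: dicts mapping node -> set of nodes;
    best(cands): heuristic that picks the next instruction to schedule."""
    remaining = {n: len(preds[n]) for n in nodes}   # unscheduled predecessor counts
    cands = {n for n in nodes if not preds[n]}      # roots of the DAG
    schedule = []
    while cands:
        n = best(cands)                             # evaluate all candidates, pick one
        schedule.append(n)
        cands.remove(n)
        for m in succs[n]:                          # expose newly-ready successors
            remaining[m] -= 1
            if remaining[m] == 0:
                cands.add(m)
    return schedule
```

A real scheduler would also track CurTime and each node's earliest start time (ETime, introduced on the later slides) so that best can prefer instructions that will not stall.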

Instruction Scheduling Heuristics
- NP-complete ⇒ we need heuristics
- Bias scheduler to prefer instructions that:
  - Interlock with DAG successors
    - Allows other operations to proceed
  - Have many successors
    - More flexibility in scheduling
  - Progress along critical path
  - Free registers
    - Reduces register pressure
  - etc. (see ACDI p. 542)
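
One plausible way to encode these preferences (an illustration, not the paper's code) is a lexicographic key over per-node properties that are assumed to be precomputed elsewhere:

```python
def make_best(info):
    """info[n] is assumed to hold precomputed fields for node n:
    'interlocks_succ' (bool), 'delay' (int), 'num_succs' (int), 'frees_regs' (int)."""
    def best(cands):
        return max(cands, key=lambda n: (info[n]["interlocks_succ"],  # let later ops proceed behind it
                                         info[n]["delay"],            # progress along the critical path
                                         info[n]["num_succs"],        # expose more candidates
                                         info[n]["frees_regs"]))      # reduce register pressure
    return best
```

The resulting best function can be passed straight to the list_schedule sketch above; ties after all four criteria are broken arbitrarily.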

Scheduling Algorithm, Example
- ExecTime(n): cycles to execute statement n
- Let ExecTime(6) = 2, ExecTime(others) = 1; assume instruction latency = 1
- Compute Delay(n):
  - Delay(n) = ExecTime(n), if n is a leaf
  - Delay(n) = max_{m ∈ Succ(n)} (Delay(m) + 1), otherwise
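
A direct transcription of this definition (assuming the uniform latency of 1 stated on the slide) might be:

```python
from functools import lru_cache

def make_delay(succs, exec_time, latency=1):
    """Delay(n): worst-case cycles from issuing n to the end of the block."""
    @lru_cache(maxsize=None)
    def delay(n):
        if not succs[n]:                                  # leaf: just its own cost
            return exec_time[n]
        return max(delay(m) + latency for m in succs[n])  # longest path toward a leaf
    return delay
```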

Scheduling Algorithm, Example
- Start at CurTime = 0
- ETime(n): earliest time node n should be scheduled to avoid a stall
  - Initially 0
- Cands = {1,3}
- MCands: set of candidates with max delay time to end of block
- ECands: set whose earliest start time is at most the current time
- MCands = ECands = {3}
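
One plausible reading of how MCands and ECands narrow the choice (the slide does not spell out the fallback order, and drawing ECands from MCands is an assumption consistent with Muchnick's presentation):

```python
def narrow_candidates(cands, cur_time, delay, etime):
    """Prefer candidates on the critical path that can issue without stalling."""
    max_delay = max(delay(n) for n in cands)
    mcands = {n for n in cands if delay(n) == max_delay}   # max delay to end of block
    ecands = {n for n in mcands if etime[n] <= cur_time}   # could start by CurTime
    return ecands if ecands else mcands                    # fall back if all would stall
```

Any remaining ties would then be broken by the heuristics on the previous slide.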

Scheduling Algorithm, Example
- Scheduled node 3
- Cands = {1}
- CurTime = 1
- ETime(4) = 1

Scheduling Algorithm, Example
- Scheduled node 1
- Cands = {2}
- CurTime = 2
- ETime(2) = 1, ETime(4) = 4

Scheduling Algorithm, Example
- Scheduled node 2
- Cands = {4}
- CurTime = 3
- ETime(4) = 4

Scheduling Algorithm, Example
- Scheduled node 4
- Cands = {5,6}
- CurTime = 4
- ETime(5) = 6, ETime(6) = 4
- MaxDelay = 2: MCands = {6}
  - Want to progress along the critical path

Scheduling Algorithm, Example
- Scheduled node 6
- Cands = {5}
- CurTime = 5
- ETime(5) = 6
- Only one left…

Scheduling Algorithm, Example
- Scheduled node 5
- Resulting schedule: [3,1,2,4,6,5]
  - Requires 6 cycles – optimal!
- A version of this algorithm is (p+1)-competitive
  - p = number of pipelines
  - Average case much better

Scheduling Algorithm Complexity
- Time complexity: O(n^2)
  - n = max number of instructions in a basic block
- Building the dependence DAG: worst-case O(n^2)
  - Each instruction must be compared to every other instruction
- Scheduling then requires each instruction to be inspected at each step = O(n^2)
- Average case: small constant (e.g., 3)

Empirical Results
- Scheduling: always a win (1-13% on PA-RISC)
- Results same as Hennessy & Gross for most benchmarks
- However: removes only 5/16 stalls in sieve, at most 10/16 with better alias information

Next Time
- We've assumed no cache misses!
- Next time: balanced scheduling
- Read Kerns & Eggers