University of Michigan Electrical Engineering and Computer Science
FLASH: Foresighted Latency-Aware Scheduling Heuristic for Processors with Customized Datapaths
Manjunath Kudlur, Kevin Fan, Michael Chu, Rajiv Ravindran, Nathan Clark, Scott Mahlke
Advanced Computer Architecture Laboratory
University of Michigan
Introduction
Bypass network: an important component of the datapath
Allows data forwarding to reduce pipeline stalls
Full bypass: any FU can bypass from any other FU and from any pipeline stage
Cost of full bypass increases quadratically with the number of FUs
$$\#\,\text{paths} = (\#\,\text{FU})^2 \times (\text{bypassable stages}) \times (\text{input ports per FU}) \times (\text{output ports per FU})$$
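As a rough illustration of the quadratic growth, the sketch below reads the path-count formula above as a product of its four factors. The function name and the machine parameters (2 bypassable stages, 2 input ports, 1 output port per FU) are assumptions chosen for illustration, not values from the slides.

```python
def full_bypass_paths(num_fus: int, bypass_stages: int,
                      in_ports_per_fu: int, out_ports_per_fu: int) -> int:
    """Count point-to-point paths in a full bypass network (product reading)."""
    return num_fus ** 2 * bypass_stages * in_ports_per_fu * out_ports_per_fu


# Hypothetical parameters: 2 bypassable stages, 2 input ports, 1 output port.
for n in (2, 4, 8):
    print(n, full_bypass_paths(n, 2, 2, 1))
```

Under these assumed parameters, doubling the number of FUs quadruples the number of paths.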
Case for Bypass Customization
Only a few bypasses are heavily utilized
The heavily utilized bypasses vary widely from application to application
Customize the bypass network of an application-specific processor by removing under-utilized paths
Implications of Bypass Customization
[Figure: DFG with operations A and B, mapped onto a datapath with Execute Stage, Pipeline Latch, Memory Stage, Pipeline Latch, and Register File; across the builds the latency is shown as 1 cycle, 2 cycles, and 3 cycles]
Latency depends on
–Which FU the operation is scheduled on
–Which FU the operation's consumer is scheduled on
Latency of an operation is no longer constant
–Varies per consumer
Bypass customization introduces non-uniform operation latencies
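The non-uniform latencies described above can be modeled as a lookup keyed by the producer's FU and the consumer's FU. The following is a minimal sketch: the FU names and the entries of the `BYPASS` table are hypothetical, and the 3-cycle register-file fallback is the value the later examples in this deck assume.

```python
RF_LATENCY = 3  # cycles to pass a value through the register file (assumed)

# Hypothetical bypass table: (producer FU, consumer FU) -> latency in cycles.
BYPASS = {
    ("A", "A"): 1,
    ("A", "B"): 1,
    ("B", "B"): 1,
    ("B", "C"): 2,
    ("C", "C"): 1,
}


def latency(producer_fu: str, consumer_fu: str) -> int:
    """Latency seen by a consumer; falls back to the register file."""
    return BYPASS.get((producer_fu, consumer_fu), RF_LATENCY)


print(latency("A", "B"), latency("B", "C"), latency("C", "A"))
```

The same producer therefore has a different effective latency for each consumer placement, which is what a scheduler for such a machine has to account for.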
Effects on List Scheduler (LS)
Used widely in many compilation systems
Assigns each operation to a free FU at the earliest time (greedy!)
When more than one free FU is available, picks one arbitrarily
WHILE (Readylist is non-empty) DO
    op ← next unscheduled operation in priority order ;
    stime ← earliest time when op can be scheduled ;
    WHILE (no free resource available to execute op at stime) DO
        stime ← stime + 1 ;
    END
    res ← free resource capable of executing op ;
    schedule (op, res, stime) ;
END
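The list-scheduling loop above can be sketched in Python as follows. This is an illustrative simplification, not the authors' implementation: it assumes every FU can execute every operation, a uniform latency, and that `ops` is already in priority order with producers ahead of consumers. The four-operation DFG in the usage lines is hypothetical.

```python
def list_schedule(ops, preds, fus, latency=1):
    """Greedy list scheduling.

    ops:   operations in priority order (producers before consumers)
    preds: op -> list of producer ops
    fus:   names of the functional units
    Returns op -> (fu, start_cycle).
    """
    schedule = {}
    busy = set()  # (fu, cycle) slots already taken
    for op in ops:
        # Earliest time: every producer's result is available.
        stime = max((schedule[p][1] + latency for p in preds.get(op, [])),
                    default=0)
        # Advance until at least one FU is free at stime.
        while all((fu, stime) in busy for fu in fus):
            stime += 1
        # Arbitrary choice: take the first free FU.
        res = next(fu for fu in fus if (fu, stime) not in busy)
        schedule[op] = (res, stime)
        busy.add((res, stime))
    return schedule


# Hypothetical DFG: ops 1 and 2 feed op 3, which feeds op 4.
result = list_schedule([1, 2, 3, 4], {3: [1, 2], 4: [3]}, ["A", "B"])
print(result)
```

Note that the FU is picked without looking at where the consumers will run, which is harmless only when all latencies are uniform.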
LS on Full Bypass Machine
[Figure: DFG and a schedule table with columns Cycle, A, B, C]
Operations have 1-cycle latency. Machine with full bypass network.
Schedule length = 3 cycles
Choice of FU does not affect schedule length in a machine with full bypass.
LS on Partial Bypass Machine
[Figure: DFG and schedule tables with columns Cycle, A, B, C]
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
One FU assignment gives a schedule length of 3 cycles; another gives a schedule length of 5 cycles
Choice of FU affects schedule length drastically in a machine with partial bypass. Arbitrary choice is no good!
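The 3-cycle versus 5-cycle outcome can be reproduced with a small calculation. The sketch below assumes a chain of three dependent 1-cycle operations and a hypothetical set of bypass paths; the actual DFG and bypass network on the slide are not recoverable from this transcript, so both are stand-ins.

```python
RF_LATENCY = 3  # cycles via the register file when no bypass path exists


def chain_length(assignment, bypass):
    """Schedule length of a chain of dependent 1-cycle ops.

    assignment: FU name for each op in the chain, in dependence order
    bypass:     set of (producer FU, consumer FU) pairs with a 1-cycle path
    """
    cycle = 1  # the first op issues in cycle 1
    for prod, cons in zip(assignment, assignment[1:]):
        cycle += 1 if (prod, cons) in bypass else RF_LATENCY
    return cycle


# Hypothetical partial bypass network over FUs A, B, C.
bypass = {("A", "A"), ("B", "B"), ("C", "C"), ("A", "C")}

print(chain_length(["A", "A", "A"], bypass))  # every hop has a bypass path
print(chain_length(["A", "B", "B"], bypass))  # A -> B goes via register file
```

A single hop without a bypass path is enough to stretch the chain from 3 to 5 cycles under these assumptions.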
Greediness of LS
Operations have 1-cycle latency. Assume 3 cycles are required to transmit a value via the register file if a bypass path does not exist.
[Figure: partial DFG with ops 1 to 4, and schedule tables with columns Cycle, A, B, C over cycles i to i+4]
Consider scheduling op 1 at its earliest time
–Schedule 1: op 1 at i+1, op 2 at i+2, ops 3 and 4 at i+4. Greedily scheduling op 1 at cycle i+1 delays ops 3 and 4
–Schedule 2: op 1 at i+1, ops 2 and 3 at i+2, op 4 at i+4. Greedily scheduling op 1 at cycle i+1 delays op 4
–Schedule 3: op 1 delayed 1 cycle to i+2, ops 2, 3, and 4 at i+3
Delaying ops could improve the schedule. Being greedy is no good!
FLASH: Goals
Keep the list scheduling framework; it is fast and widely used
Effectively deal with non-uniform latencies
–Intelligently select from among multiple co-equal choices
–Avoid greedy choices by delaying schedule slots
Observation I
Consider FU choices for operation A:
[Figure: A, its consumer B, and B's consumer C placed on candidate FUs]
–Choice with a 3-cycle delay to the consumer: no good!
–Choice with no delay to the consumer: good!
–Choice with no delay to B but a 3-cycle delay further down the chain to C: good???
–Choice with no delay along the chain: better!
An FU with a low-latency path to a consumer FU is good
–Thus, the consumer operation won't be delayed
The same observation extends to the consumer's consumer, and so on
An FU which does not delay the consumers is a good choice
Observation II
Consider FU choices for operation A:
[Figure: DFG with operations A, B, C, D; one consumer of A has slack 1, the other slack 0]
–Choice with no delay to one consumer and a 3-cycle delay to the other: good???
Not all consumers are equal
–It's better to delay a non-critical consumer (criticality)
–Better: no delay to the critical consumer, 3-cycle delay to the consumer with 1 cycle of slack
An FU which does not delay a critical chain of consumers is a good choice
The FLASH Technique
Compute the merit (FLASH_RANK) of each FU choice for an operation
–FLASH_RANK: weighted estimate of the schedule lengths of the dependence chains of an operation
Schedule the operation on the FU with the best FLASH_RANK
Avoid greediness by delaying the schedule slot, if necessary
$$\text{FLASH\_RANK}(op, FU) = \max_{c} \left[ (\text{estimated schedule length of } c) \times \frac{1}{\text{Slack}(c) + 1} \right]$$
where c is a dependence chain of op
FLASH_RANK Example
[Figure: DFG with operations A, B, C, D; the two dependence chains of A have slack 1 and slack 0]
FLASH_RANK(A, Green FU) = MAX(0.5 × 1, 1 × 4) = MAX(0.5, 4) = 4
–Slack-1 chain: estimated at cycle 1, weight 1/(1 + 1) = 0.5
–Slack-0 chain: estimated at cycle 4, weight 1/(0 + 1) = 1
FLASH_RANK(A, Yellow FU) = MAX(0.5 × 1, 1 × 2) = MAX(0.5, 2) = 2
–Slack-1 chain: estimated at cycle 1, weight 0.5
–Slack-0 chain: estimated at cycle 2, weight 1
Choose Yellow FU for op A (the lower FLASH_RANK)
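The example can be checked with a few lines of Python. This is a minimal sketch of the FLASH_RANK formula as it appears on the slides; the function name, the data layout (one `(estimated_length, slack)` pair per dependence chain), and the rule of picking the smallest rank are assumptions made here, the last one inferred from the example choosing Yellow with rank 2 over Green with rank 4.

```python
def flash_rank(chains):
    """Max over chains of estimated schedule length times 1 / (slack + 1).

    chains: iterable of (estimated_schedule_length, slack) pairs
    """
    return max(length / (slack + 1) for length, slack in chains)


# Per-FU chain estimates taken from the example: (length, slack).
candidates = {
    "Green": [(1, 1), (4, 0)],
    "Yellow": [(1, 1), (2, 0)],
}

ranks = {fu: flash_rank(chains) for fu, chains in candidates.items()}
best = min(ranks, key=ranks.get)  # lower rank = shorter weighted estimate
print(ranks, best)
```

Dividing by slack plus one discounts chains that can absorb a delay, so the critical (slack 0) chain dominates the rank.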
Some Practical Considerations
Impractical to estimate the schedule length of an entire dependence chain (a few tens of operations)
–Truncate dependence chains to a manageable depth, say 2 or 3 (the look-ahead depth)
Impractical to calculate the schedule lengths of all dependence chains together
–Many dependence chains originate from an operation
–Consider dependence chains independently
–Ignore resource constraints between dependence chains
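Truncating chains to a look-ahead depth and treating them independently can be sketched as a bounded depth-first enumeration. The function name, the successor-map representation, and the small DFG below (including node E) are hypothetical; the slides do not say how chains are enumerated in the actual implementation.

```python
def truncated_chains(op, succs, depth):
    """Enumerate dependence chains starting at op, cut off after
    `depth` consumers. Each chain is returned as a list of ops."""
    if depth == 0 or not succs.get(op):
        return [[op]]
    chains = []
    for consumer in succs[op]:
        for tail in truncated_chains(consumer, succs, depth - 1):
            chains.append([op] + tail)
    return chains


# Hypothetical DFG: A feeds B and C, both feed D, D feeds E.
succs = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"]}

print(truncated_chains("A", succs, 1))
print(truncated_chains("A", succs, 2))
```

Each returned chain would then be estimated on its own, ignoring any resource conflicts with the other chains, as the slide describes.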
Experiments
Implemented in the TRIMARAN compiler framework
Evaluated on MediaBench and SPECint2000
Machine is a 9-wide VLIW (4I, 2F, 2M, 1B)
Application-specific bypass network [Fan ’03]
–30% of the cost of a full bypass network
Comparisons
Baseline is the performance achieved by the traditional list scheduler
Global Resource Preference (GRP) algorithm [Fan ’03]
–A global pre-scheduling phase assigns FU preferences to operations based on Bottom-Up Greedy (BUG) schedule estimates
–The list scheduler uses these preferences as hints while scheduling
[Figure: FLASH vs. GRP]
[Figure: Bypass Utilization]
Conclusion
Developed an effective scheduling heuristic for machines with customized bypass interconnect
–Intelligent FU choice
–Avoids greediness
Average performance improvement of 25% over baseline
–Bypass paths are better utilized
Could be applied to other cases of non-uniform latencies
Questions