Advanced Microarchitecture Lecture 8: Data-Capture Instruction Schedulers
Out-of-Order Execution: The goal is to execute instructions in dataflow order rather than the sequential order specified by the ISA. The renamer plays a critical role by removing all of the false register dependencies. The scheduler is responsible for (1) detecting, for each instruction, when all of its dependencies have been satisfied (so that it is ready to execute), and (2) propagating readiness information between instructions.
Out-of-Order Execution (2): A pictorial view of the different steps. The static program is fetched into a dynamic instruction stream, renamed into a renamed instruction stream, and scheduled into dynamically scheduled instructions. Out-of-order = out of the original sequential order.
Superscalar != Out-of-Order: Example code: A: R1 = Load 16[R2]; B: R3 = R1 + R4; C: R6 = Load 8[R9]; D: R5 = R2 - 4; E: R7 = Load 20[R5]; F: R4 = R4 - 1; G: BEQ R4, #0. [Figure: schedules of A through G on four machines, with a cache miss marked after A. 1-wide in-order: 10 cycles. 2-wide in-order: 8 cycles. 1-wide out-of-order (issue order A, C, D, F, B, E, G): 7 cycles. 2-wide out-of-order: 5 cycles.] Superscalar/in-order is not uncommon; out-of-order/single-issue is not common (but possible).
Data-Capture Scheduler: At dispatch, instructions read all available operands from the register files and store a copy in the scheduler. Unavailable operands will be “captured” from the functional unit outputs. When ready, instructions can issue directly from the scheduler without reading additional operands from any other register files. P-Pro (Pentium Pro) family processors use data-capture-style schedulers. Dispatch is usually the same as the Allocate (or just “alloc”) stage(s). [Diagram components: Fetch & Dispatch, ARF, PRF/ROB, Data-Capture Scheduler, Functional Units, Bypass, Physical register update.]
Non-Data-Capture Scheduler: E.g., MIPS R10000, Alpha 21264 (and others). More on this next lecture! [Diagram components: Fetch & Dispatch, Scheduler, ARF, PRF, Functional Units, Physical register update.]
Components of a Scheduler: (1) A buffer for unexecuted instructions, called the “Scheduler Entries”, “Issue Queue” (IQ), or “Reservation Stations” (RS). (2) A method for tracking the state of dependencies (resolved or not). (3) A method for choosing between multiple ready instructions competing for the same resource (the arbiter). (4) A method for notification of dependency resolution. The scheduler is sometimes also called the “Instruction Window”, but that can be a little confusing, as the instruction window sometimes refers to all instructions in flight in the processor, which can include those that have issued/executed and left the IQ/RS. [Diagram: a buffer holding instructions A through G, with an arbiter granting B.]
Lather, Rinse, Repeat… (Scheduling Loop, or Wakeup-Select Loop): Wake-up part: instructions selected for execution notify their dependents (children) that the dependency has been resolved. For each instruction, check whether all input dependencies have been resolved; if so, the instruction is “woken up”. Select part: choose which instructions get to execute. If three add instructions are ready at the same time but there are only two adders, one of them must wait and try again in the next cycle (and again, and again, until selected).
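The loop above can be sketched in a few lines of Python. This is only a behavioral model under simplifying assumptions that are not from the slides: every instruction has single-cycle latency, any ready instruction can use any issue slot, and selection is by buffer position. The names `Entry` and `schedule` are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)  # eq=False: entries compare by identity, not by field values
class Entry:
    name: str
    dst: str                                    # destination tag
    waiting: set = field(default_factory=set)   # source tags not yet ready

def schedule(entries, issue_width):
    """Run the wakeup-select loop; return the names issued in each cycle."""
    pending = list(entries)
    trace = []
    while pending:
        # Select: grant up to issue_width ready entries (positional priority).
        ready = [e for e in pending if not e.waiting]
        if not ready:
            raise ValueError("no ready instruction: unresolved dependency")
        granted = ready[:issue_width]
        trace.append([e.name for e in granted])
        pending = [e for e in pending if e not in granted]
        # Wakeup: broadcast the destination tags of the selected instructions.
        for g in granted:
            for e in pending:
                e.waiting.discard(g.dst)
    return trace

# Three ready instructions compete for two issue slots; D depends on A.
example = [
    Entry("A", "T1"),
    Entry("B", "T2"),
    Entry("C", "T3"),
    Entry("D", "T4", {"T1"}),
]
print(schedule(example, issue_width=2))
```

In this example C loses the first arbitration and bids again in the next cycle, where it issues alongside the newly woken D.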
Scalar Scheduler (Issue Width = 1): [Diagram: a single tag broadcast bus is compared (=) against the source tags of every scheduler entry; the select logic picks one ready entry and sends it to the execute logic. Example tags shown: T6, T8, T14, T15, T16, T17, T39, T42.] The boxes that turn green indicate the readiness of each of the operands.
Superscalar Scheduler (detail of one entry): Same logic as the previous slide, but shown for a single entry and for four-way issue. [Diagram: four tag broadcast buses, each compared (=) against both source tags of the entry; per-operand fields Src, Val, and Rdy (labeled SrcL, ValL, RdyL), plus Dst and Issued; the tags and ready logic raise a bid to the select logic, which returns grants.]
Interaction with Execution: The scheduler is typically broken up into the CAM-based scheduling part and a RAM-based “payload” part that holds the actual values (and the instruction opcode and any other information required for execution) that get sent to the actual functional units/ALUs. Legend: D = Destination Tag, SL = Left Source Tag, SR = Right Source Tag, ValL = Left Operand Value, ValR = Right Operand Value. [Diagram: the select logic reads the payload RAM entry of the selected instruction A, which holds its opcode, ValL, and ValR.]
Again, But Superscalar: The scheduler captures the data, hence “Data-Capture”. Legend: D = Destination Tag, SL = Left Source Tag, SR = Right Source Tag, ValL = Left Operand Value, ValR = Right Operand Value. The animation shows the wakeup process on the left, and then illustrates how the same tag matches used in the wakeup can also be used to enable the capturing of output values on the data-path side. [Diagram: two selected instructions, A and B, each with D, SL, SR, opcode, ValL, and ValR.]
Issue Width: The maximum number of instructions selected for execution each cycle is the issue width. The previous slide showed an issue width of two; the earlier Superscalar Scheduler slide showed the details of a scheduler entry for an issue width of four. Hardware requirements: typically, an issue width of N requires N tag broadcast buses. This is not always true: the design can be specialized such that, for example, one “issue slot” can only handle branches.
Pipeline Timing: [Timing diagram for instructions A, B, and C over cycles i and i+1. A: Select, Payload, Execute, then tag broadcast and result broadcast. B: Wakeup, Capture (enabled on tag match), Select, Payload, Execute, then tag broadcast. C: Wakeup, with capture enabled.] This is the simple case with minimal pipelining: dependent instructions can execute in back-to-back cycles, but the achievable clock speed will be slow because each cycle contains too much work (i.e., select, payload read, execute, bypass, and capture).
Pipelined Timing: Faster clock speed. The payload RAM can’t be read and written at the same time, so the results may need to be bypassed. [Timing diagram for A, B, and C over cycles i to i+3. A: Select, Payload, Execute, then tag broadcast and result broadcast. B: Wakeup, Capture (enabled on tag match), Select, Payload, Execute, then tag broadcast. C: Wakeup, Capture, Select, Payload, Execute.]
Pipelined Timing (2): The previous slide placed the pipeline boundary at the writing of the ready bits; this slide shows a pipeline where latches are placed right before the tag broadcast. [Timing diagram for A and B over cycles i to i+2. A: Wakeup, Select, Payload, Execute, then tag broadcast and result broadcast. B: Capture (enabled on tag match), Wakeup, Select, Payload, Execute.] A variety of factors affect the decision of where the cycle boundary should be placed. Some of this has to do with inserting the instructions into the RS, and whether newly inserted instructions can be considered for scheduling immediately or have to wait until the next cycle.
More Pipelined Timing: No simultaneous read/write! A second level of bypassing is needed. [Timing diagram for A, B, and C over cycles i to i+3. A: Select, Payload, Execute, then tag broadcast, result broadcast, and bypass. B: Wakeup, Capture (enabled on tag match), Select, Payload, Execute. C: Wakeup and Capture on the tag match for its first operand, then Wakeup and Capture again on the tag match for its second operand (now C is ready), then Select, Payload, Exec.]
More-er Pipelined Timing: Dependent instructions cannot execute in back-to-back cycles! [Timing diagram for A, B, C, and D over cycles i to i+5, each going through Wakeup, Capture, Select, Payload, Execute.] A and B are both ready, but only A is selected, so B bids again. A→C and C→D must be bypassed, but there is no bypass for B→D. This is very aggressive pipelining, but now with a greater IPC penalty due to not being able to issue dependent instructions in back-to-back cycles. Good segue to the many research papers on aggressive and/or speculative pipelining of the scheduler (quite a few of these appeared in ISCA/MICRO/HPCA in the early 2000s).
Critical Loops: The Wakeup-Select Loop cannot be trivially pipelined while maintaining back-to-back execution of dependent instructions. Without back-to-back scheduling, the worst-case IPC reduction is by ½. It shouldn’t be that bad in practice (the previous slide had an IPC of 4/3); studies indicate a 10-15% IPC penalty (“Loose Loops Sink Chips”, Borch et al.). [Figure: the chain A, B, C under regular scheduling versus no back-to-back scheduling.]
IPC vs. Frequency: A 10-15% IPC drop doesn’t seem bad if we can double the clock frequency: splitting a 1000ps stage into two 500ps stages takes 2.0 IPC at 1GHz (2 BIPS) to 1.7 IPC at 2GHz (3.4 BIPS). This ideal performance (double the clock speed combined with a 10-15% IPC hit) is not likely achievable due to many other issues. Frequency doesn’t double: latch/pipeline overhead and unbalanced stages (figure labels: 900ps, 450ps, 350, 550, 1.5GHz). Other sources of IPC penalties: branches (↑ pipe depth, ↓ predictor size, ↑ predict-to-update latency) and caches/memory (same time in seconds, ↑ frequency, more cycles). Power limitations: more logic, higher frequency (P = ½CV²f).
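The arithmetic on this slide is just throughput = IPC × frequency. The short Python sketch below reproduces the slide's two data points; the helper names `freq_ghz` and `bips` are hypothetical, and the last line pairs the slide's 1.7 IPC with its 1.5GHz figure purely as an illustrative assumption (the slide does not state that combination).

```python
def freq_ghz(cycle_time_ps):
    """Clock frequency in GHz for a given cycle time in picoseconds."""
    return 1000.0 / cycle_time_ps

def bips(ipc, ghz):
    """Billions of instructions per second = IPC x clock frequency (GHz)."""
    return ipc * ghz

baseline = bips(2.0, freq_ghz(1000))   # 2.0 IPC at 1 GHz
ideal = bips(1.7, freq_ghz(500))       # 1.7 IPC at 2 GHz
# Assumption for illustration: same 1.7 IPC, but the clock only reaches 1.5 GHz.
limited = bips(1.7, 1.5)

print(f"baseline {baseline:.2f} BIPS, ideal {ideal:.2f} BIPS, limited {limited:.2f} BIPS")
```

Even before counting the additional IPC penalties listed on the slide, a clock that falls short of doubling gives up a large part of the ideal gain.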
Select Logic: Goal: minimize DFG height (execution time). This is the NP-Hard Precedence Constrained Scheduling Problem: even if you were given the entire DFG, you still couldn’t efficiently find the optimal schedule. It is even harder here because the entire DFG is not known at scheduling time (scheduling decisions made now may affect the scheduling of instructions not even fetched yet), and because instruction latencies are not constant (i.e., changing the schedule can change whether certain loads hit or miss). So use heuristics: for performance, and for ease of implementation. The Select Logic is also sometimes called a “picker”.
Simple Select Logic: Grant0 = 1; Grant1 = !Bid0; Grant2 = !Bid0 & !Bid1; Grant3 = !Bid0 & !Bid1 & !Bid2; in general, Grant(n-1) = !Bid0 & … & !Bid(n-2), with xi = Bidi. A linear chain over S entries yields O(S) gate delay; the tree-structured version is labeled O(log S) gates. This just selects the first ready instruction, where “first” is simply determined by physical location in the scheduler (the top entry has the highest priority). In this example, a grant may be seen by an entry that is not ready, in which case the grant is simply ignored (the circuit will ensure that only one ready entry will ever receive a grant). [Diagram: scheduler entries with bid inputs x0 to x8 and grant outputs grant0 to grant9.]
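The grant equations can be checked with a small Python model. It follows the equations as written (grant i is the AND of the inverted bids above it), so a non-ready entry can see a grant and ignore it; the function names `simple_select` and `winner` are hypothetical.

```python
def simple_select(bids):
    """Positional-priority select: grant[i] = not (bid[0] or ... or bid[i-1])."""
    grants = []
    higher_bid_seen = False
    for bid in bids:
        grants.append(not higher_bid_seen)
        higher_bid_seen = higher_bid_seen or bid
    return grants

def winner(bids):
    """Index of the single entry that both bids and is granted, or None."""
    for i, (bid, grant) in enumerate(zip(bids, simple_select(bids))):
        if bid and grant:
            return i
    return None

bids = [False, False, True, False, True]
print(simple_select(bids))  # entries 0 and 1 see a grant but are not bidding
print(winner(bids))         # the first bidding entry wins
```

The loop is the O(S) ripple form; real hardware would compute the same prefix-OR with a tree to get logarithmic depth.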
Simple Select Logic: Instructions may be located in scheduler entries in no particular order; the first ready entry may be the oldest, the youngest, or anywhere in between. Simple select therefore results in a “random” schedule. The schedule is still “correct” in that no dependencies are violated; it just may be far from optimal.
Oldest First Select: Intuition: an instruction that has just entered the scheduler will likely have few, if any, dependents (only intra-group). Similarly, the instructions that have been in the scheduler the longest (the oldest) are likely to have the most dependents. Selecting the oldest instructions has a higher chance of satisfying more dependencies, thus making more instructions ready to execute and exposing more parallelism. The older instructions also have a higher chance of blocking up resource allocation (e.g., a full ROB, LDQ, etc.).
Implementing Oldest First Select: Write instructions into the scheduler in program order and compress up as entries are vacated, so that physical position reflects age. The Alpha 21264 used a compressing scheduler. [Animation: the scheduler holds A through L in program order; the fading instructions are those that have been selected and sent off to execution; the remaining instructions compress up, and newly dispatched instructions are written below them.]
Implementing Oldest First Select (2): Compressing buffers are very complex: gates, wiring, area, power. Example: a 4-wide machine needs to shift entries by up to 4. That means lots of multiplexing and huge amounts of wiring; remember that any arbitrary four slots may be vacated in any given cycle. Each shift moves an entire instruction’s worth of data: tags, opcodes, immediates, readiness, etc.
Implementing Oldest First Select (3): Age-Aware Select Logic. Each box in the select logic is a MIN operation, and passes the lower timestamp (older instruction) onward to the right. Non-ready instructions effectively present a timestamp of ∞ to the select logic. At the root of the tree, the last timestamp is that of the oldest AND ready instruction, which receives the grant. [Diagram: a MIN tree over entries A through H with their timestamps.]
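A minimal Python sketch of the MIN tree follows. The entry names and timestamps in the example are made up for illustration (they are not the values in the slide's figure), and `oldest_ready` is a hypothetical name.

```python
import math

def oldest_ready(entries):
    """entries: list of (name, timestamp, ready). Return the oldest ready name, or None."""
    if not entries:
        return None
    # Non-ready entries present an infinite timestamp to the tree.
    level = [(ts if ready else math.inf, name) for name, ts, ready in entries]
    # Each pass is one level of pairwise MIN boxes; the root holds the overall minimum.
    while len(level) > 1:
        level = [min(level[i:i + 2]) for i in range(0, len(level), 2)]
    ts, name = level[0]
    return None if ts == math.inf else name

# Hypothetical scheduler contents: B is the oldest but is not ready.
entries = [("A", 6, True), ("B", 1, False), ("D", 3, True), ("F", 5, True)]
print(oldest_ready(entries))
```

Because the tree halves the candidates at each level, S entries need about log2(S) levels of MIN boxes, at the cost of carrying a timestamp through every level.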
Handling Multi-Cycle Instructions: Each instruction goes through Sched, PayLd, Exec. With a single-cycle producer (Add R1 = R2 + R3 followed by Xor R4 = R1 ^ R5), the consumer can be scheduled right behind the producer. With a multi-cycle producer (Mul R1 = R2 × R3 followed by Add R4 = R1 + R5), the Add attempts to execute too early! The result is not ready for another two cycles.
Delayed Tag Broadcast: Mul R1 = R2 × R3 (Sched, PayLd, Exec, Exec, Exec) delays its tag broadcast so that Add R4 = R1 + R5 (Sched, PayLd, Exec) executes just as the result becomes available. Assume the pipeline is arranged such that the tag broadcast occurs at a cycle boundary. It works, but… the scheduler must make sure a tag broadcast bus will be available N cycles in the future when needed, and bypass and data-capture potentially get more complex.
Delayed Tag Broadcast (2): Assume the issue width equals 2. Mul R1 = R2 × R3 (Sched, PayLd, Exec, Exec, Exec) delays its broadcast for the dependent Add R4 = R1 + R5 (Sched, PayLd, Exec). Meanwhile Sub R7 = R8 - #1 and Xor R9 = R9 ^ R6 (each Sched, PayLd, Exec) issue later and reach their broadcast in the same cycle. In this cycle, three instructions need to broadcast their tags! This just illustrates that if you delay the tag broadcast, you need some mechanism to ensure that sufficient broadcast buses are available when it finally comes time to do the broadcast.
Delayed Tag Broadcast (3): Possible solutions: (1) Have one select for issuing and another select for tag broadcast; this messes up the timing of data-capture. (2) Pre-reserve the bus; the select logic becomes more complicated, since it must track usage in future cycles in addition to the current cycle. (3) Hold the issue slot from initial launch until tag broadcast (sch, payl, exec, exec, exec); the issue width is effectively reduced by one for three cycles. I’m not sure how this is actually implemented in current processors, but my best guess is option #2. The scheduler has to keep a couple of tables/scoreboards/data structures for tracking resource availability, but it probably already has to do this anyway for the functional units, so it would just be some more scoreboards (for the tag broadcast buses, for the bypass buses, etc.).
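Option #2 (pre-reserving the bus) can be sketched as a tiny scoreboard in Python. This is a hedged illustration, not a description of any real processor: the class name `BroadcastBusScoreboard`, the method `try_issue`, and the simplification that the broadcast happens exactly `latency` cycles after issue are all assumptions.

```python
class BroadcastBusScoreboard:
    """Tracks how many tag broadcast buses are already reserved in each future cycle."""

    def __init__(self, num_buses):
        self.num_buses = num_buses
        self.reserved = {}  # cycle number -> buses reserved in that cycle

    def try_issue(self, issue_cycle, latency):
        """Reserve a bus for the broadcast cycle; refuse the issue if none is left."""
        broadcast_cycle = issue_cycle + latency
        in_use = self.reserved.get(broadcast_cycle, 0)
        if in_use >= self.num_buses:
            return False  # select logic must pick something else or retry later
        self.reserved[broadcast_cycle] = in_use + 1
        return True

# Issue width 2, as on the previous slide: a 3-cycle Mul, then two 1-cycle ops.
sb = BroadcastBusScoreboard(num_buses=2)
print(sb.try_issue(issue_cycle=0, latency=3))  # Mul reserves a bus for cycle 3
print(sb.try_issue(issue_cycle=2, latency=1))  # Sub also broadcasts in cycle 3
print(sb.try_issue(issue_cycle=2, latency=1))  # Xor would be the third: refused
```

The refused instruction simply bids again in a later cycle, which is the extra complication the slide attributes to the select logic.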
Delayed Wakeup: Push the delay to the consumer. Example: the tag broadcast for R1 = R2 × R3 arrives at the consumer R5 = R1 + R4 (R4 is already ready), but the consumer waits three cycles before acknowledging it. The consumer therefore also needs to know its parent’s latency. This is just another possibility, but I don’t think anyone would actually do this… variable-latency instructions would probably cause an immense headache.
Non-Deterministic Latencies: The problem with the previous approaches is that they assume all instruction latencies are known at the time of scheduling. Unknown latencies make things uglier for the delayed broadcast, and pretty much kill the delayed wakeup approach. Examples: load instructions, whose latency is one of {L1_lat, L2_lat, L3_lat, DRAM_lat} (and DRAM_lat is not a constant either, because of queuing delays). There are also some architecture-specific cases: the PowerPC 603 has an “early out” for multiplication with a low-bit-width multiplicand, and the Intel Core 2’s divider also has an early out.
The Wait-and-See Approach: Just wait and see whether a load hits or misses in the cache. Example: R1 = 16[$sp] (Sched, PayLd, Exec, Exec, Exec) followed by R2 = R1 + #4 (Sched, PayLd, Exec). Only once the cache hit is known can the load broadcast its tag, so the load-to-use latency increases by 2 cycles (a 3-cycle load appears as 5). It may be possible to design the cache such that hit/miss is known before the data (the scheduler sees the DL1 tags before the DL1 data), which reduces the penalty to 1 cycle.
Load-Hit Speculation: Caches work pretty well: hit rates are high, otherwise caches wouldn’t be very useful. So just assume all loads will hit in the cache. Example: R1 = 16[$sp] (Sched, PayLd, Exec, Exec, Exec) delays its broadcast by the DL1 latency, so that R2 = R1 + #4 (Sched, PayLd, Exec) executes just as the cache hits and the data is forwarded. Um, OK, what happens when there’s a load miss?
Load Miss Scenario: Cache miss detected! The value at the cache output is bogus. The load (Sched, PayLd, Exec, Exec, Exec, Exec, …) had its broadcast delayed by the DL1 latency, so its dependent (Sched, PayLd, Exec) has already been scheduled: invalidate the instruction (its ALU output is ignored). The load’s broadcast is then delayed by the L2 latency, and the dependent is rescheduled (Sched, PayLd, Exec) assuming a hit at the DL2 cache, which is what happens in this example. Each mis-scheduling wastes an issue slot: the tag broadcast bus, payload RAM read port, writeback/bypass bus, etc. could have been used for another instruction. There could be a miss at the L2, and again at the L3 cache, so a single load can waste multiple issuing opportunities.
Scheduler Deallocation: Normally, as soon as an instruction issues, it can vacate its scheduler entry. The sooner an entry is deallocated, the sooner another instruction can reuse that entry, which leads to that instruction executing earlier. In the case of a load, the load must hang around in the scheduler until it can be guaranteed that it will not have to rebroadcast its destination tag. This decreases the effective size of the scheduler: even though loads usually hit, they need to stick around just in case they need to be rescheduled. Another option is discussed in the next set of slides.
“But wait, there’s more!”: On a DL1 miss, not only the children get squashed; there may be grandchildren to squash as well. [Timing diagram: the load (Sched, PayLd, Exec, Exec, Exec) misses, and successive generations of dependents (each Sched, PayLd, Exec) are squashed.] The longer the schedule-to-execute latency, the more “generations” of dependent instructions could be speculatively mis-scheduled. All of them waste issue slots, all must be rescheduled, all waste power, and none may leave the scheduler until the load hit is known.
Squashing: The number of cycles’ worth of dependents that must be squashed is equal to the cache-miss-known latency minus one. In the previous example, the miss-known latency is 3 cycles, so there are two cycles’ worth of mis-scheduled dependents. Early miss detection helps reduce the number of mis-scheduled instructions. A load may have many children, but the issue width limits how many can possibly be mis-scheduled: Max = Issue-width × (Miss-known-lat - 1).
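As a quick worked instance of the bound, the Python snippet below evaluates it for the slide's miss-known latency of 3 cycles. The issue width of 4 is an assumed value for illustration, and `max_mis_scheduled` is a hypothetical helper name.

```python
def max_mis_scheduled(issue_width, miss_known_latency):
    """Upper bound on mis-scheduled dependents: issue width x (miss-known latency - 1)."""
    return issue_width * (miss_known_latency - 1)

# Miss-known latency of 3 cycles (from the slide); issue width of 4 is assumed.
print(max_mis_scheduled(issue_width=4, miss_known_latency=3))
# Detecting the miss one cycle earlier halves the bound in this case.
print(max_mis_scheduled(issue_width=4, miss_known_latency=2))
```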
Squashing (2): Simple approach: squash anything “in-flight” between schedule and execute. This may include non-dependent instructions. All instructions must stay in the scheduler for a few extra cycles to make sure they will not be rescheduled due to a squash. [Timing diagram: the load (Sched, PayLd, Exec, Exec, Exec) and three following instructions (each Sched, PayLd, Exec).]
Squashing (3): Selective squashing: use “load colors”. Each load is assigned a unique color, and every dependent “inherits” its parents’ colors. On a load miss, the load broadcasts its color and anyone in the same color group gets squashed. An instruction may end up with many colors, so explicitly tracking each color would require a huge number of comparisons.
Squashing (4): The “colors” can be listed in unary (bit-vector) form. Each instruction’s color vector is the bitwise OR of its parents’ vectors. A load miss now squashes only the dependent instructions. Hardware cost increases quadratically with the number of load instructions. Example: Load R1 = 16[R2]; Add R3 = R1 + R4; Load R5 = 12[R7]; Load R8 = 0[R1]; Load R7 = 8[R4]; Add R6 = R8 + R7. The first Add carries one color bit, Load R8 carries two, and the last Add carries three. Hardware can still be pretty expensive; consider a modern Core 2 with a 32-entry LDQ… you would need 32 load colors to track everything (and there are some weird corner cases where a color cannot be reallocated before all “consumers” of that color have also executed).
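The bit-vector scheme can be modeled in Python using the slide's six-instruction example. This is a sketch under assumptions: colors are handed out in program order and never recycled, the instruction labels (`ld1`, `add1`, …) are hypothetical, and the squash list contains every instruction carrying the color, whereas hardware would only need to squash those already mis-scheduled.

```python
def assign_colors(program):
    """program: list of (name, dst, srcs, is_load). Returns (colors, load_bit)."""
    reg_colors = {}   # register -> color vector of its most recent producer
    colors = {}       # instruction name -> color bit-vector (as an int)
    load_bit = {}     # load name -> bit position of its own color
    for name, dst, srcs, is_load in program:
        vec = 0
        for src in srcs:
            vec |= reg_colors.get(src, 0)      # inherit the parents' colors
        if is_load:
            load_bit[name] = len(load_bit)     # next unused color
            vec |= 1 << load_bit[name]
        colors[name] = vec
        reg_colors[dst] = vec
    return colors, load_bit

def squashed_on_miss(colors, load_bit, load):
    """Names of all instructions (other than the load) carrying the load's color."""
    mask = 1 << load_bit[load]
    return [n for n, vec in colors.items() if vec & mask and n != load]

program = [
    ("ld1", "R1", ["R2"], True),          # Load R1 = 16[R2]
    ("add1", "R3", ["R1", "R4"], False),  # Add  R3 = R1 + R4
    ("ld2", "R5", ["R7"], True),          # Load R5 = 12[R7]
    ("ld3", "R8", ["R1"], True),          # Load R8 = 0[R1]
    ("ld4", "R7", ["R4"], True),          # Load R7 = 8[R4]
    ("add2", "R6", ["R8", "R7"], False),  # Add  R6 = R8 + R7
]
colors, load_bit = assign_colors(program)
for name, vec in colors.items():
    print(f"{name}: {vec:04b}")
print(squashed_on_miss(colors, load_bit, "ld1"))
```

A miss on the first load squashes its direct child, the dependent load, and the final Add, while the two unrelated loads are untouched.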
Allocation: Allocate in-order, deallocate in-order (a circular buffer with head and tail pointers). Very simple! The drawback is a smaller effective scheduler size: instructions may have already executed out-of-order, but their RS entries cannot be reused. This can be very bad if a load goes to main memory. This is specific to the allocation of the RS entries.
Allocation (2): With arbitrary placement, entries are much better utilized. The allocator is more complex: it must scan the entry-availability bit-vector and find N free entries. The write logic is more complex: it must route N instructions to arbitrary entries of the scheduler.
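A behavioral sketch of the arbitrary-placement allocator in Python: scan the availability bit-vector and claim the first N free entries. The function name `allocate` and the first-free policy are assumptions; hardware would do the scan with parallel priority logic rather than a loop.

```python
def allocate(free, n):
    """free: list of bools (True = entry is free). Claim up to n free entries.

    Returns the indices claimed and marks them as used in place.
    """
    picked = [i for i, is_free in enumerate(free) if is_free][:n]
    for i in picked:
        free[i] = False
    return picked

# Hypothetical 8-entry scheduler with scattered free entries.
free = [False, True, False, True, True, False, True, True]
print(allocate(free, 4))  # four instructions routed to arbitrary free entries
print(allocate(free, 4))  # only one entry is left for the next group
```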
Allocation (3): Segment the entries: only one entry per segment may be allocated per cycle. Instead of a 4-of-16 allocation (previous slide), each allocator only does 1-of-4, and the write logic is simplified as well. Still possible inefficiencies: a full segment leads to allocating fewer than N instructions per cycle. Free RS entries exist, just not in the correct segment. [Figure: four allocators, one per segment, placing instructions A through D; one allocation fails (X) because its segment is full.]
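The segmented scheme and its inefficiency can be illustrated in Python. The sketch assumes equal-sized segments and a first-free policy within each segment; `allocate_segmented` is a hypothetical name.

```python
def allocate_segmented(free, num_segments):
    """Claim at most one free entry per segment; return the indices claimed."""
    seg_size = len(free) // num_segments
    picked = []
    for seg in range(num_segments):
        # Each segment's allocator only performs a 1-of-seg_size choice.
        for i in range(seg * seg_size, (seg + 1) * seg_size):
            if free[i]:
                free[i] = False
                picked.append(i)
                break
    return picked

# 16 entries in 4 segments; segment 2 (entries 8 to 11) is completely full.
free = [True] * 16
for i in range(8, 12):
    free[i] = False
print(allocate_segmented(free, num_segments=4))
print(sum(free))  # free entries remain, just not in the full segment
```

Only three of the four instructions in the group get an entry even though many entries are still free, which is the inefficiency the slide points out.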