Published by Brendan Cobb. Modified over 9 years ago.
1 Practical Selective Replay for Reduced-Tag Schedulers
Dan Ernst and Todd Austin
Advanced Computer Architecture Lab
The University of Michigan
June 8th, 2003
Workshop on Duplicating, Deconstructing and Debunking Pre-
2 Dynamic Scheduling Complexity
Broadcast-based dynamic scheduler circuits are:
– Highly complex
– Power-hungry
– Poorly scalable
This logic is also difficult to pipeline
– Palacharla et al. stated that instruction wakeup and select appear to be an atomic operation
– Work on pipelining this logic has been done, but there’s always a price:
  More complex structures
  IPC penalties
Tag Elimination approach: [ISCA ’02]
Make the critical path faster by reducing the circuit complexity.
3 Tag Elimination
Specialize the window entries
Reservation stations have a varying number of tags to compare against
– Instructions are placed into stations that can track the required number of in-flight tags
Advantages:
– Fewer comparators on the broadcast bus
– Bus wire length can be shorter
Disadvantage:
– New condition on being placed into the scheduler window: is there an entry of the correct size?
  Makes for a few more front-end stalls than usual
  Slightly less parallelism can be extracted
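The placement condition on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `Entry`, `make_window`, and `dispatch` are hypothetical, the argument order of `make_window` is an arbitrary choice, and the "smallest adequate entry first" policy is an assumption the slides do not state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    comparators: int      # tag comparators wired to this entry (0, 1, or 2)
    busy: bool = False


def make_window(two_tag: int, one_tag: int, zero_tag: int) -> List[Entry]:
    """Build a window from counts of 2-, 1-, and 0-comparator entries (hypothetical helper)."""
    sizes = [2] * two_tag + [1] * one_tag + [0] * zero_tag
    return [Entry(comparators=s) for s in sizes]


def dispatch(window: List[Entry], in_flight_tags: int) -> Optional[int]:
    """Place an instruction in a free entry with enough comparators.

    Returns the chosen entry index, or None when the front end must stall.
    Assumption: the smallest adequate entry is preferred.
    """
    candidates = [
        (e.comparators, i)
        for i, e in enumerate(window)
        if not e.busy and e.comparators >= in_flight_tags
    ]
    if not candidates:
        return None               # no entry of a suitable size: front-end stall
    _, idx = min(candidates)      # smallest adequate entry, lowest index on ties
    window[idx].busy = True
    return idx
```

With one entry of each size, a second instruction that still waits on two tags finds no suitable entry and stalls, even though smaller entries are free. That is the extra front-end stall condition named above.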
4 Tag Elimination Example
[Figure: two scheduler windows on the destination tag bus, labeled 8/0/0 and 2/4/2, with tag comparators drawn as "="]
5 Last-Tag Speculation
If it is known which of an instruction’s operands will be the last to arrive, the first-arriving operand can be ignored.
– This can be predicted with high accuracy
Advantage:
– Reduces the maximum number of tags per instruction to 1
Disadvantage:
– Possibly further IPC loss due to last-tag misprediction penalties
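The slides do not say how the last-arriving operand is predicted. As a rough illustration only, the sketch below uses a generic PC-indexed table of 2-bit saturating counters; the class name `LastTagPredictor`, the table size, and the counter scheme are all assumptions, not details from the talk.

```python
class LastTagPredictor:
    """PC-indexed 2-bit counters; a value >= 2 predicts the right operand arrives last.

    Hypothetical scheme for illustration; the talk does not describe its predictor.
    """

    def __init__(self, entries: int = 1024) -> None:
        self.entries = entries
        self.counters = [1] * entries   # start weakly predicting "left arrives last"

    def _index(self, pc: int) -> int:
        return pc % self.entries

    def predict_right_last(self, pc: int) -> bool:
        return self.counters[self._index(pc)] >= 2

    def update(self, pc: int, right_was_last: bool) -> None:
        """Train on the observed arrival order, saturating at 0 and 3."""
        i = self._index(pc)
        if right_was_last:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)
```

The scheduler entry would then watch only the tag of the predicted-last operand. A wrong prediction has to be detected and repaired later, which is where the misprediction penalty listed above comes from.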
6 Scheduler Replay
Latency prediction:
– Handles non-deterministic latency of memory ops
Selective replay only flushes instructions depending on the mis-speculated load
Kim & Lipasti (ISCA ’03) state that: “…implementing tag elimination on a machine with selective recovery is impractical.”
7 Parent-Child Broadcast Replay
Concept: If a latency is mispredicted, flush the instructions and reschedule from the window
The scheduler window dependence broadcasts are overloaded to also propagate information about an instruction’s “ancestors”
Tag Elimination is not compatible with this mechanism
– Dependence information cannot all be tracked or reset when not all tags are on the broadcast buses
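8 Broadcast-Based Replay Description
[Figure (a): a window entry with tagL and tagR comparators ("==") on the wakeup bus, a dep matrix L and a dep matrix R, and a kill bus; the wakeup bus carries the tag plus dependence info]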
9 Dependence Propagation
[Figure (b): dep matrix L and dep matrix R, shown as bit matrices that are shifted down every cycle; an instruction is invalidated if the kill bus bits match bits in the bottom row; the two matrices are merged and the instruction marks itself before the result is sent to its children]
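The operations named in this figure (shift down, merge, mark self, kill-bus match) can be sketched as follows. The layout is an assumed reading of the figure, not something the slide states: each matrix is taken to have one row per cycle since issue and one column per issue slot, and the kill bus is taken to carry one bit per slot. All function names are hypothetical.

```python
from typing import List

Matrix = List[List[int]]  # assumed layout: rows = cycles since issue, columns = issue slots


def empty(depth: int, width: int) -> Matrix:
    return [[0] * width for _ in range(depth)]


def shift_down(m: Matrix) -> Matrix:
    """Age the matrix one cycle: a zero row enters at the top, the bottom row drops off."""
    width = len(m[0])
    return [[0] * width] + [row[:] for row in m[:-1]]


def merge(left: Matrix, right: Matrix) -> Matrix:
    """Combine the ancestor sets of the left and right operands (bitwise OR)."""
    return [[a | b for a, b in zip(lr, rr)] for lr, rr in zip(left, right)]


def mark_self(m: Matrix, slot: int) -> Matrix:
    """Add this instruction to the ancestor set before broadcasting to its children."""
    out = [row[:] for row in m]
    out[0][slot] = 1
    return out


def is_killed(m: Matrix, kill_bus: List[int]) -> bool:
    """Invalidate when any kill-bus bit matches a bit in the bottom row."""
    return any(a & b for a, b in zip(m[-1], kill_bus))
```

Under these assumptions, a parent issued in slot 2 reaches the bottom row of a 3-deep matrix after two shifts, and a kill for slot 2 arriving at that point invalidates every instruction that inherited the mark.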
10 U.S. Patent 6,212,626 - Filed Dec. 1998 (Held by Intel Corp.)
11 Timed Queue Replay
Concept: If a latency is mispredicted, keep the same schedule already computed and insert an extra latency to make it correct
– The scheduler window is never involved in resolving ancestry
A ‘checker’ later in the pipeline verifies the schedule
– Table of register ready bits
Tag elimination is compatible with this mechanism
– Extra dependence tracking is not necessary
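A toy model of the checker idea is sketched below. It is heavily simplified and rests on assumptions: one instruction is checked per cycle, a replayed instruction simply rejoins the back of the queue, and `Inst`, `run_checker`, and the `arrivals` map (cycle to registers that become ready, for example a load miss returning) are hypothetical names.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Inst:
    name: str
    srcs: Tuple[int, ...]   # source register numbers
    dest: int               # destination register number


def run_checker(
    schedule: List[Inst],
    ready_regs: Iterable[int],
    arrivals: Dict[int, List[int]],
    max_cycles: int = 100,
) -> List[Tuple[int, str, str]]:
    """Check issued instructions against a table of register ready bits."""
    ready = set(ready_regs)                  # the ready-bit table
    queue: Deque[Inst] = deque(schedule)     # already-computed schedule, in order
    log: List[Tuple[int, str, str]] = []
    for cycle in range(max_cycles):
        ready.update(arrivals.get(cycle, ()))    # late values arrive
        if not queue:
            break
        inst = queue.popleft()
        if all(r in ready for r in inst.srcs):
            ready.add(inst.dest)                 # safe: result becomes available
            log.append((cycle, inst.name, "safe"))
        else:
            queue.append(inst)                   # replay: same order, extra latency
            log.append((cycle, inst.name, "replay"))
    return log
```

In this model the scheduler window is never consulted: the relative order of the two dependent instructions is preserved and both are simply delayed until the late value shows up.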
12 Timed Queue Replay Description
[Figure: pipeline of Instruction Scheduler, Replay Check, Registers/Execution, and Writeback stages, with signals labeled "stop scheduler", "replay", and "safe"]
13 Methodology: Architecture
Architectural model
– Derived from SimpleScalar 3.0
– More sophisticated scheduling support
  Separated ROB and RS
  Variable-length pipelines
Scheduler replay on memory latency misprediction
– Flush replay
– Broadcast-based selective
– Timed queue selective
– Support for specialized windows and last-tag prediction
14 Last-Tag with Selective Replay 4-wide
15 Last-Tag with Selective Replay 8-wide
16 Reducing Scheduler Pressure
After instructions are issued from the scheduler…
– Parent-Child Broadcast: Instructions must remain in the window for D cycles in case a replay is initiated
– Timed Queue: Instructions do not need to be kept
17 Benefits of Station Release
18 Comparing Replay Mechanisms
ILP
– Timed queue replay has much less instruction window pressure (WIB is based on a similar idea, ISCA ’02)
Complexity and power
– Parent-Child broadcasts have high circuit complexity: W*(W x D) + W extra broadcast bits (roughly)
– 328 vs. 64 in our example
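The slide does not give the W and D behind the 328 figure. As a quick arithmetic check, the sketch below evaluates the rough formula; W = 8 and D = 5 are assumed values chosen because they reproduce 328, and the function name is hypothetical. The slide does not explain how the 64 is derived, so it is not modeled here.

```python
def parent_child_extra_bits(width: int, depth: int) -> int:
    """Rough extra broadcast bits from the slide's formula W*(W x D) + W."""
    return width * (width * depth) + width


# Assumed parameters (not stated on the slide): 8-wide issue, D = 5.
print(parent_child_extra_bits(8, 5))  # 8 * 40 + 8 = 328
```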
19 Conclusions
Tag elimination and selective replay can work together
– If the replay mechanism is largely external to the scheduler
Timed queue replay mechanisms are more efficient
– Allows the scheduler to remove some dependence information
– Less pressure in the scheduler window
– No extra broadcasts
  Lower circuit complexity
  Lower power