Advanced Microarchitecture ECE/CS 752 Fall 2017
Prof. Mikko H. Lipasti University of Wisconsin-Madison Lecture notes by Ilhyun Kim Updated by Mikko Lipasti
Readings
Read on your own:
- I. Kim and M. Lipasti, “Understanding Scheduling Replay Schemes,” in Proceedings of the 10th International Symposium on High-Performance Computer Architecture (HPCA-10), February 2004.
- S. Srinivasan, R. Rajwar, H. Akkary, A. Gandhi, and M. Upton, “Continual Flow Pipelines,” in Proceedings of ASPLOS 2004, October 2004.
- A. S. Al-Zawawi, V. K. Reddy, E. Rotenberg, and H. H. Akkary, “Transparent Control Independence,” in Proceedings of ISCA-34, 2007.
To be discussed in class:
- Review by 11/6/2017: T. Sha, M. Martin, and A. Roth, “NoSQ: Store-Load Communication without a Store Queue,” in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39), 2006.
- Review by 11/8/2017: A. Sembrant et al., “Long Term Parking (LTP): Criticality-Aware Resource Allocation in OOO Processors,” in Proceedings of MICRO-48, December 2015.
- Review by 11/10/2017: A. Perais and A. Seznec, “EOLE: Paving the Way for an Effective Implementation of Value Prediction,” in Proceedings of the 41st Annual International Symposium on Computer Architecture (ISCA ’14), 2014.
Outline
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence
Memory Dataflow
Scalable Load/Store Queues
Load queue/store queue
- A large instruction window means many loads and stores must be buffered (roughly 25%/15% of the instruction mix)
- Expensive searches: positional-associative searches in the SQ; associative lookups in the LQ for coherence and speculative load scheduling
- Power/area/delay are prohibitive
Store Queue/Load Queue Scaling
- Multilevel store queue [Akkary et al., MICRO 03]
- Bloom filters as a quick check for independence [Sethumadhavan et al., MICRO 03] (see the sketch below)
- Eliminate the associative load queue via replay [Cain, ISCA 2004]:
  - Issue loads again at commit, in order, and check whether the same value is returned
  - Filter the load checks for efficiency: most loads don't issue out of order (no speculation), and most loads don't coincide with coherence traffic
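To make the filtering idea concrete, here is a minimal sketch of a Bloom-style membership filter gating the associative store-queue search; the filter size, hash functions, and API are illustrative assumptions, not the design from Sethumadhavan et al.

```python
BLOOM_BITS = 1024

def bloom_hashes(addr):
    # Two cheap, illustrative hash functions over the word-aligned address.
    h1 = (addr >> 2) % BLOOM_BITS
    h2 = (((addr >> 2) * 2654435761) >> 16) % BLOOM_BITS
    return h1, h2

class MembershipFilter:
    def __init__(self):
        self.bits = [False] * BLOOM_BITS

    def insert(self, addr):              # e.g., when a store's address resolves
        for h in bloom_hashes(addr):
            self.bits[h] = True

    def may_contain(self, addr):         # checked when a load issues
        return all(self.bits[h] for h in bloom_hashes(addr))

# A load pays for the expensive associative SQ search only when the filter
# says a matching store address MAY be present. A false positive costs a
# needless search; it can never cause incorrect forwarding.
sq_filter = MembershipFilter()
sq_filter.insert(0x1000)
print(sq_filter.may_contain(0x1000))     # True: must search the SQ
print(sq_filter.may_contain(0x2000))     # typically False: skip the search
```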
Store Vulnerability Window (SVW)
[Roth, ISCA 05]
- Assign sequence numbers (SNs) to stores, and track writes to the cache with those sequence numbers
- Efficiently filter out safe loads/stores by checking only against writes in the vulnerability window:
  - At dispatch, a load captures the SN of the oldest uncommitted store
  - At commit, if the cache's SN for the load's address is older, the load is safe; otherwise the load is replayed at commit
- Stores write their SNs into a Bloom filter to make the check cheaper (see the sketch below)
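A minimal sketch of the SVW commit-time check, assuming a direct per-address table of store sequence numbers in place of the Bloom-filtered version the slide mentions; all names here are illustrative.

```python
store_sn = 0                # global store sequence number
last_writer_sn = {}         # cache-line address -> SN of the last store to it

def commit_store(addr):
    global store_sn
    store_sn += 1
    last_writer_sn[addr] = store_sn    # stores stamp the cache with their SN
    return store_sn

class Load:
    def __init__(self, addr, oldest_uncommitted_store_sn):
        self.addr = addr
        # Captured at dispatch: the start of this load's vulnerability window.
        self.window_start = oldest_uncommitted_store_sn

def safe_at_commit(load):
    # If no store inside the vulnerability window wrote this address,
    # the load cannot have read a stale value: it commits without replay.
    writer = last_writer_sn.get(load.addr, 0)
    return writer < load.window_start

ld = Load(addr=0x40, oldest_uncommitted_store_sn=store_sn + 1)
commit_store(0x40)                       # a vulnerable store hits the same line
print("safe" if safe_at_commit(ld) else "replay at commit")   # -> replay
```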
NoSQ [Sha et al., MICRO 06]
- Rely on load/store alias prediction to directly connect dependent store/load pairs
- Memory cloaking [Moshovos/Sohi, ISCA 1997], with a more accurate, path-dependent predictor
- Use the SVW technique to check: replay a load only if necessary, and train the load/store alias predictor
Store/Load Optimizations
Several other similar concurrent proposals:
- DMDC [Castro et al., MICRO 06]
- Fire-and-Forget [Subramanian/Loh, MICRO 06]
Weakness: the predictor still fails
- Glass jaw: the design should fail gracefully, not fall off a cliff
- Risk: a new or unknown workload may be unpredictable
Outline
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence
Memory-Level Parallelism
[Glew, ASPLOS 98 “Wild and Crazy Ideas Session”]
- Tolerate/overlap memory latency: once the first miss is encountered, find another one
- Naïve solution: implement a very large ROB, IQ, and LSQ; power/area/delay make this infeasible
- Instead, build a virtual instruction window
Runahead Execution
- Use poison bits to eliminate the miss-dependent load program slice (see the sketch below)
- Forward load-slice processing is a very old idea: Massive Memory Machine [Garcia-Molina et al. 84], DataScalar [Burger, Kaxiras, Goodman 97]
- Runahead proposed by [Dundas, Mudge 97]: checkpoint state and keep running beyond the miss; when the miss completes, return to the checkpoint
- May need a runahead cache for store/load communication [Mutlu et al., HPCA 03]
- All runahead activity is wasted: everything is re-executed
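To make the poison-bit idea concrete, here is a toy sketch of how runahead marks and skips the miss-dependent slice while independent work runs ahead; the 3-tuple instruction format and register names are assumptions for illustration.

```python
poisoned = set()   # registers holding bogus (miss-dependent) values

def runahead_step(inst):
    op, dest, srcs = inst
    if op == "load_miss":
        poisoned.add(dest)            # miss: mark the result invalid, don't wait
        return "poisoned"
    if any(s in poisoned for s in srcs):
        poisoned.add(dest)            # forward slice of the miss: skip it
        return "poisoned"
    poisoned.discard(dest)            # an independent inst overwrites cleanly
    return "executed"                 # may itself miss and prefetch (the point!)

trace = [("load_miss", "r1", []),        # the blocking miss
         ("add",  "r2", ["r1"]),         # dependent: poisoned
         ("load", "r3", ["r5"]),         # independent: runs ahead, warms caches
         ("mul",  "r4", ["r2", "r3"])]   # transitively dependent: poisoned
for inst in trace:
    print(inst[0], "->", runahead_step(inst))
# On miss completion the checkpoint is restored and everything here
# re-executes: the wasted work that Mutlu et al. later attack.
```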
Waiting Instruction Buffer
[Lebeck et al., ISCA 2002]
- Capture the forward load slice in a separate buffer, propagating poison bits to identify the slice; this relieves pressure on the issue queue
- Reinsert the instructions when the load completes
- Very similar to the Intel Pentium 4 replay mechanism, which was not publicly known at the time
- Makes recovery from load-latency mispredicts easier and cheaper
- Scope is still limited by the ROB size
Continual Flow Pipelines
[Srinivasan et al., ASPLOS 2004]
- A slice-buffer extension of the WIB: operands are stored in the slice buffer as well, freeing ROB/buffer entries in the OOO window
- Also relieves pressure on rename/physical registers
- Applicable to data-capture machines (Intel P6) or physical-register-file machines (Pentium 4)
- iCFP extends the idea to in-order CPUs [Hilton et al., HPCA 09]
- Challenge: how to buffer loads/stores; see [Gandhi et al., ISCA 05]
Long Term Parking [Sembrant et al., MICRO 2015]
- Proactively defers allocation of microarchitectural resources to non-critical instructions
- WIB and CFP are reactive (they act only after the miss occurs)
- Relies on a criticality predictor and the LTP structure
Outline
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence
Register Dataflow
Instruction scheduling
A process of mapping a series of instructions onto execution resources: it decides when and where each instruction is executed.
[Figure: a data dependence graph of instructions 1 through 6 mapped onto two FUs (FU0, FU1) over cycles n through n+3.]
Instruction scheduling
A set of wakeup and select operations.
Wakeup:
- Broadcasts the tags of selected parent instructions
- A dependent instruction matches those tags to determine whether its source operands are ready
- Resolves true data dependences
Select:
- Picks instructions to issue from the pool of ready instructions
- Resolves resource conflicts: issue bandwidth and the limited number of functional units / memory ports
Scheduling loop: basic wakeup and select operations.
[Figure: wakeup logic, an array of entries whose tagL/tagR comparators match broadcast tags and set readyL/readyR, feeding select logic, a request/grant arbiter; the selected instruction issues to an FU and its tag is broadcast back, closing the scheduling loop.]
Wakeup and Select
Walking the earlier dependence graph through the two-FU scheduler:
- Cycle n: ready = {1}; select 1
- Cycle n+1: ready = {2, 3, 4}; select 2, 3; wakeup 5, 6
- Cycle n+2: ready = {4, 5}; select 4, 5; wakeup 6
- Cycle n+3: ready = {6}; select 6
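This behavior is easy to mimic in software. The toy simulator below reproduces the schedule above, under the assumption (consistent with the trace) that instructions 2, 3, and 4 all depend on 1, 5 depends on 2 and 3, and 6 depends on 4 and 5.

```python
# Parents of each instruction (assumed from the slide's example).
deps = {1: [], 2: [1], 3: [1], 4: [1], 5: [2, 3], 6: [4, 5]}
FU_WIDTH = 2

done, issued = set(), set()
cycle = 0
while len(issued) < len(deps):
    # Wakeup: an instruction is ready once all its parents have broadcast.
    ready = [i for i in deps
             if i not in issued and all(p in done for p in deps[i])]
    # Select: grant up to FU_WIDTH ready instructions, oldest first.
    grant = sorted(ready)[:FU_WIDTH]
    issued.update(grant)
    print(f"cycle n+{cycle}: ready={sorted(ready)} select={grant}")
    # Atomicity: grants broadcast their tags in the SAME cycle, so
    # dependents can be selected in the very next cycle.
    done.update(grant)
    cycle += 1
```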
Scheduling Atomicity
- Operations in the scheduling loop must occur within a single clock cycle for back-to-back execution of dependent instructions
- Atomic scheduling (for a chain 1 -> {2, 3} -> 4): select 1 at cycle n and wake 2, 3; select 2, 3 at n+1 and wake 4; select 4 at n+2
- Non-atomic 2-cycle scheduling: the same chain needs a bubble between dependent instructions, so 2, 3 are selected at n+2 and 4 at n+4
Implication of scheduling atomicity
- Pipelining is a standard way to improve clock frequency, but it is hard to pipeline instruction scheduling logic without losing ILP: roughly 10% IPC loss with 2-cycle scheduling and roughly 19% with 3-cycle scheduling
- This is a major obstacle to building high-frequency microprocessors
Scheduling atomicity & non-data-capture scheduler
Multi-cycle scheduling loop:
- Scheduling atomicity is not maintained: wakeup and select are separated by extra pipeline stages (Dispatch, RF)
- Unable to issue dependent instructions consecutively
- Solution: speculative scheduling
[Figure: an atomic Sched/Exe pipeline (Fetch, Decode, Sched/Exe, Writeback, Commit) contrasted with a deep pipeline (Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback, Commit) whose wakeup/select loop sits several stages before execution.]
Speculative Scheduling
- Speculatively wake up dependent instructions even before the parent instruction starts execution; this keeps the scheduling loop within a single clock cycle
- But nobody knows what will happen in the future: the source of uncertainty in instruction scheduling is loads, where cache hits/misses, bank conflicts, and store-to-load aliasing all affect timing
- The scheduler assumes that every instruction type has a pre-determined, fixed latency; loads are assumed to take the DL1 hit latency, the common case (over 90% in general)
- If the assumption is incorrect, subsequent (dependent) instructions are replayed (see the sketch below)
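A minimal sketch of the idea, with illustrative latencies: dependents are woken a fixed number of cycles after a load issues, before the hit/miss outcome is known, and a miss forces a replay.

```python
ASSUMED_LOAD_LAT = 3     # illustrative DL1 hit latency, in cycles
ALU_LAT = 1

def wakeup_cycle(inst, issue_cycle):
    """Cycle at which this instruction's dependents are speculatively woken."""
    lat = ASSUMED_LOAD_LAT if inst["op"] == "load" else ALU_LAT
    return issue_cycle + lat             # a guess for loads, exact for ALU ops

load = {"op": "load", "dest": "r1"}
add  = {"op": "add",  "dest": "r2", "srcs": ["r1"]}

t = wakeup_cycle(load, issue_cycle=0)
print(f"dependent {add['dest']}-producer woken at cycle {t}")  # back-to-back if hit

actual_latency = 12                      # the DL1 later reports a miss
if actual_latency != ASSUMED_LOAD_LAT:
    print("latency mispredicted: replay the add and its dependents")
```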
Speculative Scheduling
Overview
[Figure: pipeline (Fetch, Decode, Schedule, Dispatch, RF, Exe, Writeback/Recover, Commit) with a speculative wakeup/select loop at the Schedule stage; speculatively issued instructions are re-scheduled when a latency mispredict (invalid input value) is detected.]
Unlike the original Tomasulo's algorithm:
- Instructions are scheduled BEFORE actual execution occurs
- Instructions are assumed to have pre-determined, fixed latencies: ALU operations have fixed latency, and loads are assumed to take the DL1 hit latency (the common case)
Speaker notes: This is a very generic out-of-order pipeline based on Tomasulo's original algorithm. That algorithm assumes that as soon as a result value becomes available, dependent instructions are woken up and issued atomically, so back-to-back execution is possible. However, today's microprocessors have deeper pipelines for higher clock speed and throughput, and the distance between the writeback stage and the issue stage has grown. If we executed two dependent instructions in such a pipeline under the original scheme, we could not do back-to-back execution and would lose a great deal of ILP. So, instead of waking up instructions from the writeback stage, there is a feedback loop in which issued instructions speculatively wake up dependent instructions based on assumed instruction latency: the previous instruction issues, speculatively wakes up its dependent, and the two execute in back-to-back cycles. If, for example, the first instruction is a load and a cache miss occurs, the load does not have the correct memory value and needs more cycles to get it from memory, so the load and its dependents are put back into the scheduler to wait for the next issue slot. Unlike atomic wakeup/select scheduling, speculative scheduling assumes fixed instruction latencies and schedules dependents based on them. Also, since resource use is determined at schedule time, even if we discover in the execute stage that a load does not need to access the cache, it is impossible to deallocate the scheduled port and use it for another load.
Scheduling replay
- Speculation needs verification and recovery; there's no free lunch
- If the actual load latency is longer than speculated (e.g., a cache miss), the best solution (disregarding complexity) is to replay the data-dependent instructions issued under the load shadow
[Figure: pipeline (Fetch, Decode, Rename, Queue, Sched, Disp, RF, Exe, Retire/WB, Commit); a cache miss detected at Exe drives a verification flow back against the instruction flow.]
Wavefront propagation
[Figure: pipeline (Fetch, Decode, Rename, Queue, Sched, Disp, RF, Exe, Retire/WB, Commit) with the speculative execution wavefront at Sched and the real execution wavefront at Exe, linked along dependence edges; the data verification flow runs backward from Exe.]
- Speculative execution wavefront: the speculative image of execution, from the scheduler's perspective
- Both wavefronts propagate along dependence edges at the same rate (1 level/cycle); the real wavefront runs behind the speculative one
- The load resolution loop delay complicates recovery: a scheduling miss is signaled a couple of clock cycles after issue
Load resolution feedback delay in instruction scheduling
[Figure: instruction N in broadcast/wakeup/select while instructions N-1 through N-4 occupy the downstream Dispatch/Payload, RF, and Execution stages; instructions issued in this window must be recovered.]
- Scheduling runs multiple clock cycles ahead of execution, but instructions can keep track of only one level of dependence at a time (via source operand identifiers)
- Synchronizing scheduling across this feedback delay:
  - Correct scheduling: release the issue queue entry, mark source operands non-speculatively ready
  - Incorrect scheduling: stop issuing instructions down the wrong path, return source operands to the unready state, and handle the instructions already issued into the pipeline behind the load
Issues in scheduling replay
[Figure: timeline from Sched/Issue to the Exe-stage checker across cycles n through n+3; the cache-miss signal arrives only after dependents have issued.]
- The speculative wavefront's propagation cannot be stopped in time: both wavefronts propagate at the same rate
- Dependent instructions are unnecessarily issued under load misses
Requirements of scheduling replay
- Propagation of recovery status should be faster than speculative wavefront propagation
- Recovery should be performed on the transitive closure of the load's dependent instructions
- Conditions for ideal scheduling replay: all mis-scheduled dependent instructions are invalidated instantly, and independent instructions are unaffected
- Multiple levels of dependence tracking are needed (e.g., "Am I dependent on the current cache miss?"); the longer the load resolution loop delay, the more levels must be tracked
Scheduling replay schemes
Alpha 21264: non-selective replay
- Replays all dependent and independent instructions issued under the load shadow
- Analogous to squashing recovery for branch mispredictions
- Simple, but high performance penalty: independent instructions are unnecessarily replayed (see the sketch below)
[Figure: as a load miss is detected at Exe, ALL instructions in the load shadow (LD, ADD, OR, AND, BR) are invalidated and replayed once the miss resolves.]
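A toy sketch of the squashing policy, with illustrative cycle counts: on a miss, everything issued during the load's shadow is pulled back into the scheduler, dependent or not.

```python
MISS_DELAY = 2          # cycles from load issue until hit/miss is known

issue_log = []          # (cycle, inst) for every speculatively issued inst

def issue(cycle, inst):
    issue_log.append((cycle, inst))

def on_load_miss(load_issue_cycle):
    # Replay every instruction issued in the load's shadow, regardless
    # of whether it actually depends on the load.
    lo, hi = load_issue_cycle + 1, load_issue_cycle + MISS_DELAY
    return [i for c, i in issue_log if lo <= c <= hi]

issue(0, "LD  r1")
issue(1, "ADD r2, r1")   # dependent
issue(1, "OR  r5, r6")   # independent: squashed anyway
issue(2, "AND r3, r2")   # dependent
issue(2, "BR  r7")       # independent: squashed anyway
print("replay:", on_load_miss(0))   # simple, but wastes independent work
```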
Position-based selective replay
- Ideal selective recovery: replay dependent instructions only
- Dependence tracking is managed in matrix form: column = load issue slot, row = pipeline stage
- Each cycle an instruction merges its parents' matrices and shifts them; when a cache miss is detected, a kill bus broadcast invalidates exactly the instructions whose bits match in the last row (see the simplified sketch below)
[Figure: integer and memory pipelines (width 2) with tag, dependence-info, and kill buses; an ADD that inherited an LD's dependence bits is invalidated when the LD's miss is detected.]
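The sketch below substitutes Python sets of load tags for the hardware's position-based bit matrices, so it is a functional simplification rather than the paper's encoding; it shows the selective behavior, replaying exactly the transitive closure of the missing load.

```python
dep_loads = {}   # inst -> set of in-flight load tags it depends on

def issue(inst, srcs, producing_load=None):
    tags = set()
    for s in srcs:                        # srcs name producer instructions
        tags |= dep_loads.get(s, set())   # merge the parents' load vectors
    if producing_load is not None:
        tags.add(producing_load)          # a load depends on itself
    dep_loads[inst] = tags

def on_cache_miss(load_tag):
    # Kill exactly the transitive closure of the missing load.
    return [i for i, tags in dep_loads.items() if load_tag in tags]

def on_cache_hit(load_tag):
    for tags in dep_loads.values():
        tags.discard(load_tag)            # dependents become non-speculative

issue("LD1", srcs=[], producing_load="LD1")
issue("ADD", srcs=["LD1"])
issue("OR",  srcs=[])          # independent: untouched by the miss
issue("XOR", srcs=["ADD"])     # transitively dependent
print("replay:", on_cache_miss("LD1"))   # -> LD1, ADD, XOR only
```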
Outline
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence
Definition (from the MICRO 2017 Test of Time Retrospective)
What is value prediction? Broadly, three salient attributes:
- Generate a speculative value (predict)
- Consume the speculative value (execute)
- Verify the speculative value (compare/recover)
This subsumes branch prediction; the focus here is on operand values.
Speaker notes: The 3rd point is key in my mind: you must verify the value itself, not some attribute of the value. A branch outcome is not an operand; it is an attribute of the branch instruction!
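As a concrete (and deliberately minimal) example of predict/execute/verify, here is a last-value predictor sketch; indexing the table by PC and the recovery policy are the usual simplifications, not any specific machine's design.

```python
last_value = {}                 # PC -> last result seen at that instruction

def predict(pc):
    return last_value.get(pc)   # None = no prediction available

def verify_and_train(pc, actual):
    pred = last_value.get(pc)
    last_value[pc] = actual     # train on the actual outcome regardless
    # False => consumers that used the prediction must be squashed/replayed.
    return pred is None or pred == actual

pc = 0x400
print(predict(pc))                        # None: nothing to speculate on yet
verify_and_train(pc, 7)
print(predict(pc))                        # 7: consumers may execute early
ok = verify_and_train(pc, 7)              # actual result matches
print("verified" if ok else "recover")
```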
Some History
"Classical" value prediction was independently invented by 4 groups:
- AMD (NexGen): L. Widigen and E. Sowadsky; patent filed March 1996, invented March 1995
- Technion: F. Gabbay and A. Mendelson; invented sometime in 1995, TR 11/96, US patent Sep 1997
- CMU: M. Lipasti, C. Wilkerson, J. Shen; invented Oct 1995, ASPLOS paper submitted March 1996, MICRO paper June 1996
- Wisconsin: Y. Sazeides, J. Smith; Summer 1996
NOTE: the ASPLOS paper was submitted as 9 paper copies using FedEx!
Why? Possible explanations:
- Natural evolution from branch prediction
- Natural evolution from memoization
- Natural evolution from rampant speculation: cache hit speculation, memory independence speculation, speculative address generation [Zero-Cycle Loads, Austin/Sohi]
- Improvements in tracing/simulation technology: values, not just instructions & addresses; TRIP6000 [A. Martin-de-Nicolas, IBM]
- "There's a lot of zeroes out there." (C. Wilkerson)
Speaker notes: David Wood said this to me at ASPLOS! You have to have lived through the early-to-mid 90s to realize the damning nature of this remark; I was crushed! Actually, I didn't even learn about memoization (dating back to the 70s) when I was doing my due-diligence legwork for my Ph.D. proposal. Further, the instruction reuse work came out the following year (ISCA 1997); one could argue that IR derived from memoization. Certainly my thinking was influenced by work on address prediction; apparently so was AMD's. The last one is the big one: machines were slow, tracing was really slow, and no one thought they could afford to trace values in addition to addresses and instructions. When it finally happened, lots of new research and new ideas were enabled.
What Happened?
- Considerable academic interest: dozens of research groups, papers, and proposals
- No industry uptake for a long time: Intel (x86), IBM (PowerPC), and HAL (SPARC) efforts all failed
- Why?
  - Modest performance benefit (< 10%)
  - Power consumption: dynamic power for the extra activity, static power (area) for the prediction tables
  - Complexity and correctness (risk): subtle memory ordering issues [MICRO '01], misprediction recovery [HPCA '04]
Speaker notes: Intel: Chris Wilkerson. IBM Power4: me. HAL: acquired by Fujitsu, killed off.
Performance?
- Relationship between timely fetch and value prediction benefit [Gabbay & Mendelson, ISCA '98]: value prediction doesn't help when the result can be computed before the consumer instruction is fetched
- Accurate, high-bandwidth fetch helped: wide trace caches were studied in the late 1990s, and branch prediction is much better today (neural, TAGE)
- Industry was pursuing frequency, not ILP (the GHz race)
Speaker notes: I'm going to brutally summarize Gabbay's excellent paper in one short sentence: a high peak fetch-to-execute rate fixes this problem. How do we fetch the right instructions? I don't know.
Future Adoption?
- Classical value prediction will only make it in the context of a very different microarchitecture, one that explicitly and aggressively exposes ILP
- Promising trends: the deep-pipelining craze is over, the high-frequency mania is over, and architects are pursuing ILP once again
- Value prediction may have another opportunity: rumors of 4 design teams considering it
Speaker notes: No, I don't mean Itanium! So, buck up. But the story gets better…
Some Recent Interest
- VTAGE [Perais/Seznec, HPCA 14]: solves many practical problems in the predictor
- EOLE [Perais/Seznec, ISCA 14]: value-predicted operands reduce the need for OoO; execute some ops early and some late, outside the OoO engine; smaller, faster OoO window
- Load Value Approximation [San Miguel/Badr/Enright Jerger, MICRO-47] [Thwaites et al., PACT 2014]
- DLVP [Sheikh/Cain/Damodaran, MICRO-50]
Introducing Early Execution
[Figure: front-end pipeline (Fetch, Decode, Rename, Dispatch) feeding the out-of-order engine, with a value predictor (VPred) and an Early Execution stage alongside Rename.]
Execute ready single-cycle instructions in parallel with Rename, in order; do not dispatch them to the IQ.
Speaker notes: First, we propose early execution, to execute single-cycle ready instructions in order in the front-end. Since this execution is in order, it can be done in parallel with Rename. Early-executed instructions are not dispatched to the IQ; their results simply have to be written into the PRF, like regular predictions.
Arthur Perais & André Seznec, ISCA 2014
Early Execution Hardware
[Figure: a rank of simple ALUs between Decode and Dispatch, with operands arriving from Decode (immediates), the value predictor, and the bypass network.]
Execute what you can; write results into the PRF using the ports provisioned for VP.
Speaker notes: The hardware required is a rank of simple ALUs and the associated bypass network. ALU stages can be chained, but we found that a single stage is sufficient. Operands come from Decode, the value predictor, or the bypass network. Once early-executed, results are written into the PRF using the ports provisioned for value prediction.
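A rough functional sketch of early execution, under assumed operand encodings: at rename, a single-cycle instruction whose operands are all already known (immediates, predicted values, or results just produced on the in-rank bypass) is computed in order and never enters the IQ.

```python
ALU = {"add": lambda a, b: a + b, "mov": lambda a: a}   # single-cycle ops only

def try_early_execute(inst, known_values):
    """known_values: operand -> value already available at rename time
    (decoded immediates, value-predicted results, in-rank bypass)."""
    ops = []
    for src in inst["srcs"]:
        if src not in known_values:
            return None                  # not ready: dispatch to the IQ as usual
        ops.append(known_values[src])
    result = ALU[inst["op"]](*ops)
    known_values[inst["dest"]] = result  # feeds the bypass for later instructions
    return result                        # written to the PRF via the VP ports

known = {"imm5": 5, "r1": 10}            # r1's value came from the predictor
i1 = {"op": "add", "dest": "r2", "srcs": ["r1", "imm5"]}
i2 = {"op": "add", "dest": "r3", "srcs": ["r2", "imm5"]}  # chains off i1
for inst in (i1, i2):
    print(inst["dest"], "=", try_early_execute(inst, known))   # 15, then 20
```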
Introducing Late Execution
[Figure: out-of-order engine followed by a Validation/Late Execution block (CMP) before Retire, fed in program order by a prediction FIFO queue from VPredict.]
Execute single-cycle predicted instructions just before retirement, in order; do not dispatch them to the IQ either.
Speaker notes: Second, as discussed before, the execution of predicted instructions becomes non-critical, since dependents can execute using the prediction instead of the actual result. Thus we propose late execution, to execute single-cycle predicted instructions at retirement time, just before the prediction is validated. Like early-executed instructions, late-executed instructions are not dispatched to the IQ.
Late Execution Hardware
[Figure: late-execution ALUs and validation comparators (CMP: I1 correct? I2 correct?) reading the PRF, placed between the out-of-order engine and retirement, draining the prediction FIFO queue.]
Execute just before validation and retirement, leveraging the ports provisioned for validation.
Speaker notes: To late-execute instructions, we add a rank of simple ALUs after the out-of-order engine and before validation. To read operands, we can use the read ports we assumed were there for validation. However, to ensure smooth late execution, we might require more read ports: validation required one port per instruction, while late execution requires two, as there can be two operands to read.
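A matching sketch of late execution, again with assumed encodings: a value-predicted single-cycle instruction bypasses the IQ, waits in a program-order FIFO, and is executed and validated just before retirement.

```python
from collections import deque

ALU = {"add": lambda a, b: a + b}
prediction_fifo = deque()      # (inst, predicted_result), in program order

def dispatch_predicted(inst, predicted):
    prediction_fifo.append((inst, predicted))    # bypasses the IQ entirely

def retire_one(prf):
    inst, predicted = prediction_fifo.popleft()
    actual = ALU[inst["op"]](*(prf[s] for s in inst["srcs"]))  # late-execute
    prf[inst["dest"]] = actual
    # The validation compare: a mispredict squashes younger instructions.
    return "retire" if actual == predicted else "squash and refetch"

prf = {"r1": 4, "r2": 6}
dispatch_predicted({"op": "add", "dest": "r3", "srcs": ["r1", "r2"]},
                   predicted=10)
print(retire_one(prf))         # the prediction matched: retire quietly
```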
{Early | OoO | Late} Execution: EOLE
Many fewer instructions enter the IQ, so we may be able to reduce the issue width:
- Simpler IQ
- Fewer ports on the PRF
- Less bypass
- Simpler OoO engine
Non-critical predictions become useful, since those instructions can be late-executed. What about the hardware cost?
Speaker notes: With EOLE, fewer instructions enter the IQ. Therefore we could, for instance, reduce the issue width, and hence the number of ports required by OoO execution. We get a simpler scheduler, fewer ports on the register file, and less bypass network, so we could get an overall simpler out-of-order engine. Moreover, even non-critical predictions become useful, since those instructions can be late-executed; that is, they no longer use any resources in the out-of-order engine. But what about the hardware cost of this first proposition?
Outline
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence
Instruction Flow
Motivation – Loop Evolution
[Figure: evolution of hardware loop support across designs, including the AMD 29K, Intel Core 2 / ARM Cortex-A9, Intel Nehalem / ARM Cortex-A15, Intel Sandybridge, and AMD Jaguar / Qualcomm Krait, leading toward in-place execution.]
Motivation – Loop Opportunity
In-Place Loop Execution
Execute loops in place:
- Eliminates fetch/branch/dispatch overheads
- Reuses back-end structures
Necessary modifications:
- Loop detection / dispatch logic
- Dependence linking
- Reusable back-end structures: IQ entries, LSQ entries, physical registers
[Figure: the front-end dispatches the loop body once; iterations N through N+7 then execute in place in the back end.]
Frontend Loop Logic
Primary responsibilities:
- Identify loops and their resource requirements
- Dispatch loops
- Incorporate feedback
Loop identification (see the sketch below):
- Triggered by a backward branch
- Handles unlimited control flow
- Utilizes a simple state machine and registers
- Details in [Hayenga, HPCA 2014]
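A toy sketch of backward-branch loop detection in the spirit of Revolver; the two-phase "train then capture" logic and the threshold are simplifying assumptions, not the exact state machine from [Hayenga, HPCA 2014].

```python
TRAIN_THRESHOLD = 3     # consecutive iterations before capturing the loop

state = {}   # branch PC -> {"count": n, "target": pc}

def retire_branch(branch_pc, target_pc, taken):
    backward = taken and target_pc < branch_pc   # candidate loop-closing branch
    if not backward:
        state.pop(branch_pc, None)               # a loop exit breaks the streak
        return "normal"
    s = state.setdefault(branch_pc, {"count": 0, "target": target_pc})
    s["count"] += 1
    if s["count"] >= TRAIN_THRESHOLD:
        # Loop body = [target_pc, branch_pc]; dispatch it for in-place execution.
        return f"capture loop {hex(target_pc)}..{hex(branch_pc)}"
    return "training"

for _ in range(4):
    print(retire_branch(0x120, 0x100, taken=True))
```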
Loop Types
[Figure: examples of the three supported loop shapes: simple, complex, and early-exit loops.]
Queue Management
[Figure: ROB snapshots showing the commit, loop start, and loop end pointers as the loop is dispatched and iterates in place.]
LSQ – Conventional Ordering
LSQ – Loop Ordering
In-place Loop Cache Benefit
ARM Cortex-A15 data [Source: NVIDIA]:
- On average, 20% fewer instructions fetched
- Still significant opportunity remaining
Outline
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence
Transparent Control Independence
[Al-Zawawi et al., ISCA 07]
- Control-flow graph convergence: execution reconverges after branches (if-then-else constructs, etc.)
- Can we fetch/execute instructions beyond the convergence point? A limit study showed significant potential for ILP [Lam/Wilson, ISCA 92]
- How do we resolve ambiguous register and memory dependences?
Slides from the Al-Zawawi ISCA presentation follow.
Control independence basics
© 2007 Ahmed S. Al-Zawawi ISCA 34
Four steps for exploiting CI
1. Identify the reconvergence point
2. Remove/insert control-dependent (CD) instructions
3. Identify control-independent data-dependent (CIDD) instructions
4. Repair CIDD instructions: fix data dependences, re-execute CIDD instructions
(A simplified sketch of steps 1 and 3 follows.)
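A simplified sketch of steps 1 and 3, using straight-line instruction lists in place of a real CFG: the reconvergence point is found as the common suffix of the two paths, and CIDD instructions are those reading registers the removed CD region may write. All names and the suffix heuristic are illustrative assumptions, not TCI's actual mechanism.

```python
taken_path     = ["t1", "t2", "x1", "x2", "x3"]
not_taken_path = ["n1", "x1", "x2", "x3"]

def reconvergence_suffix(a, b):
    """Longest common suffix = the control-independent (CI) region."""
    k = 0
    while k < min(len(a), len(b)) and a[-1 - k] == b[-1 - k]:
        k += 1
    return a[len(a) - k:]

writes = {"t1": {"r1"}, "t2": {"r2"}, "n1": {"r1"},
          "x1": {"r4"}, "x2": {"r5"}, "x3": {"r6"}}
reads  = {"x1": {"r1"}, "x2": {"r4"}, "x3": {"r9"}}

ci = reconvergence_suffix(taken_path, not_taken_path)    # ['x1','x2','x3']
cd_writes = set().union(*(writes[i] for i in taken_path if i not in ci),
                        *(writes[i] for i in not_taken_path if i not in ci))
# x1 reads r1, written by CD code -> CIDD. x2 reads only x1's result (r4),
# so a full implementation would also mark it transitively; x3 is CIDI.
cidd = [i for i in ci if reads.get(i, set()) & cd_writes]
print("reconverge at:", ci[0], "| CIDD:", cidd)
```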
TCI misprediction recovery
[Figure: a recovery program is constructed from duplicated CIDD instructions with their checkpointed (CIDI-supplied) source values, plus the correct CD instructions fetched under the branch checkpoint.]
- Checkpoint-based retirement enables aggressive register reclamation (e.g., CPR): completed instructions free their resources
- Leverage the branch checkpoint for the correct CD instructions
- In-order retirement is not possible when instructions execute out of program order: leverage checkpointed source values to mimic the effect of program order
- Exploit coarse-grain checkpoint-based retirement to relax ordering constraints
Transparent Control Independence
- TCI repairs program state, not program order
- The TCI pipeline is recovery-free: transparent recovery by fetching additional instructions with checkpointed source values
- The TCI pipeline is free-flowing: leverage conventional speculation to execute correct and incorrect instructions quickly and efficiently; completed instructions free their resources
TCI microarchitecture
- Add a repair rename map
- Add a selective re-execution buffer (RXB)
Instructions execute and leave
Predict the branch. Instructions execute and leave the pipeline when done.
Construct recovery program
Copy duplicates of the CIDD instructions, with their source values, into the RXB.
Insert correct CD instructions
Load the branch checkpoint into the repair rename map, then fetch the correct CD instructions.
Repair & re-execute CIDD instructions
Inject the duplicate CIDD instructions with their checkpointed source values.
Merge repair & spec. rename maps
Copy the corrected register mappings from the repair map to the speculative map.
Transparent Control Independence
- TCI employs a CFP-like slice buffer to reconstruct state: instructions with ambiguous dependences are buffered and reinserted the same way a forward load-miss slice is reinserted
- The "best" CI proposal to date, but still very complex and expensive, with moderate payback
- The main reason to pursue CI is mispredicted branches, and this is a moving target: branch misprediction rates have dropped significantly even since 2007
Summary
- Memory Data Flow
  - Scalable Load/Store Queues
  - Memory-level parallelism (MLP)
- Register Data Flow
  - Instruction scheduling overview
  - Scheduling atomicity
  - Speculative scheduling
  - Scheduling recovery
  - EOLE: Effective Implementation of Value Prediction
- Instruction Flow
  - Revolver: efficient loop execution
  - Transparent Control Independence