February 18, 2004 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Slide 1 of 23 Understanding Scheduling Replay Schemes Ilhyun Kim Mikko H. Lipasti PHARM Team University of Wisconsin-Madison

February 18, 2004 Slide 2 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Speculation vs. Recovery. All speculative techniques share a few common requirements: some mechanism for generating predictions, microarchitectural support for realizing the benefits of predictions, and recovery for mispredictions. Relatively little focus has been placed on recovery: prediction and speculation techniques have been discussed extensively, while recovery is often described vaguely as refetch, squash, reissue or replay. Recovery for speculative scheduling is called scheduling replay. What are the issues in scheduling replay? What functionality should it provide? What are the potential limitations?

February 18, 2004 Slide 3 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Related Work. Selective re-issue: initially proposed for value prediction and assumed by many data-speculation techniques, though the detailed mechanics were not fully described and/or developed; a generic dependence vector scheme appears in [Sazeides, Ph.D. thesis]. Scheduling replay: the Alpha 21264 uses squashing replay and the Pentium 4 uses selective replay based on a replay queue; replay schemes were evaluated by [Morancho et al.] and scheduling miss prediction was proposed by [Yoaz et al.]. Our work provides a framework for developing and analyzing replay schemes and proposes token-based selective replay.

February 18, 2004 Slide 4 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Outline: Speculative Scheduling & Wavefront Propagation, Parallel Verification, Scheduling Replay Schemes, Token-based Selective Replay, Performance Evaluation, Conclusions.

February 18, 2004 Slide 5 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Speculative Scheduling Overview. In the original Tomasulo's algorithm, scheduling and execution are atomic (Fetch - Decode - Sched/Exe - WB - Commit). When scheduling is separated from execution by dispatch and register-read stages (Fetch - Decode - Sched - Disp - RF - Exe - WB - Commit), a non-speculative scheduler cannot achieve maximum ILP, so speculative scheduling issues instructions speculatively and verifies the scheduling decisions later in the pipeline. Sources of scheduling misses: load instructions (D-cache / DTLB misses, store-to-load aliasing) and performance / complexity optimization techniques.
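To make the scheduling-to-execution gap concrete, here is a minimal timing sketch (my own illustration; the stage count and latencies are assumptions, not the paper's exact numbers). It shows that a load's dependent must be scheduled several cycles before the load's hit/miss outcome is known, which is exactly the window in which scheduling misses occur.

```python
SCHED_TO_EXE = 3    # Sched -> Disp -> RF -> Exe distance in cycles (assumed)
L1_HIT_LATENCY = 2  # cycles the load spends in Exe on a hit (assumed)

def dependent_sched_cycle(load_sched_cycle):
    """Cycle at which a dependent must be scheduled to catch the load's data,
    assuming the load hits in the L1 cache."""
    return load_sched_cycle + L1_HIT_LATENCY

def miss_known_cycle(load_sched_cycle):
    """Cycle at which the cache actually reports hit or miss."""
    return load_sched_cycle + SCHED_TO_EXE + L1_HIT_LATENCY

print(dependent_sched_cycle(0))  # 2: dependent is already scheduled...
print(miss_known_cycle(0))       # 5: ...three cycles before the outcome is known
```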

February 18, 2004 Slide 6 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Speculative Execution Wavefront. The speculative execution wavefront is initiated by the chain of wakeup and select operations that links data dependences in the scheduler, forming a speculative "image" of execution. After a delay, the scheduled image is projected to the EXE stage, initiating the real execution wavefront, which serves to verify the scheduled execution by comparing the scheduled and actual execution latencies. Because verification runs behind the speculative execution wavefront, the current execution verifies scheduling decisions made in the past; by the time a cache miss is detected, the speculative wavefront is already ahead. (Pipeline diagram: Fetch - Dec - Ren - Que - Sched - Disp - RF - Exe - WB - Commit, with dependence linking in the scheduler and data linking at execute.)

February 18, 2004 Slide 7 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Speculative Execution Wavefront (continued). A scoreboard is OK, but not enough: serial verification triggers re-scheduling only of directly dependent instructions, e.g. a scoreboard propagates poison bits along data dependences. This makes it hard to stop an invalid speculative execution wavefront, because verification and the schedule propagate at the same rate; the scheduler doesn't know which instructions depend on the miss and keeps issuing instructions unnecessarily.
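To illustrate why a scoreboard alone cannot stop the wavefront, here is a toy sketch (the data structures and function names are assumptions, not the paper's design) of poison-bit propagation: each dependence level learns that it was mis-scheduled only when it executes, one level per step, so serial verification never catches up with the speculative wavefront.

```python
poison = {}  # physical register -> poison bit set by the scoreboard

def execute(dest_reg, src_regs, is_miss=False):
    """Execute one instruction; its result is poisoned if it is a missing load
    or if any of its sources is already poisoned."""
    poisoned = is_miss or any(poison.get(r, False) for r in src_regs)
    poison[dest_reg] = poisoned
    return poisoned  # True means this instruction must be re-scheduled

# LD p1 (misses), ADD p2 <- p1, OR p3 <- p2: each level is discovered one step later.
print(execute("p1", [], is_miss=True))  # True, known when the load executes
print(execute("p2", ["p1"]))            # True, but only learned when the ADD executes
print(execute("p3", ["p2"]))            # True, one more level later
```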

February 18, 2004 Slide 8 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Invalid Wavefront Propagation. With serial verification, a load miss can propagate through up to 836 instruction levels in parser (157 in gap), which is not bounded by the size of the instruction window (8-wide, 128-entry RUU). Compared to parallel verification, total issue count goes up by 15% in parser, 10% on average across SPEC2K INT, and 42% in the worst case (mcf), with negative impacts on performance and power. A mechanism is needed to stop invalid wavefront propagation.

February 18, 2004 Slide 9 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Parallel Verification. Issued instructions are verified in parallel, so verification catches up with the invalid speculative execution wavefront and terminates it: the scheduler does not trigger any further incorrect issue, and other, independent instructions may be issued instead. The focus of this talk is parallel verification for scheduling replay.

February 18, 2004 Slide 10 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Outline: Speculative Scheduling & Wavefront Propagation, Parallel Verification, Scheduling Replay Schemes, Token-based Selective Replay, Performance Evaluation, Conclusions.

February 18, 2004 Slide 11 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Requirements of Parallel Verification. The propagation of scheduling verification should be FASTER than the propagation of the speculative execution wavefront, so that verification catches up with the invalid speculative wavefront. Verification should also be performed on the transitive closure of dependent instructions, so that no invalid wavefront slips through invalidation / recovery. Ideal scheduling replay: all mis-scheduled dependent instructions are invalidated instantly, and independent instructions are unaffected (selective replay).

February 18, 2004 Slide 12 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Reducing the Name Space for Dependence Tracking. A naive approach is a dependence vector scheme: it works, but the vector size equals the maximum number of loads in the window, full vectors must be propagated to dependent instructions (e.g. at rename time), and it raises scalability issues (e.g. supporting replay at any instruction boundary). The alternative is to approximate or convert the name space used for precise dependence tracking into a smaller set, reducing the number of bits in dependence vectors. When a scheduling miss is detected, each instruction must answer "am I dependent on the miss?"; making this verification faster leads to multi-level dependence tracking.
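As a reference point for the naive scheme mentioned above, here is a sketch (assumed structures, illustration only) of per-register dependence vectors built at rename: one bit per load in the window, OR-merged from the sources, so the transitive closure of a miss is found with a single bit test, at the cost of a vector as wide as the maximum number of loads in flight.

```python
reg_vec = {}       # physical register -> dependence bit-vector (as an int)
next_load_bit = 0  # allocation pointer; vector width == max loads in the window

def rename(dest, srcs, is_load=False):
    """Build an instruction's dependence vector as the OR of its sources' vectors,
    plus its own bit if it is a load."""
    global next_load_bit
    vec = 0
    for s in srcs:
        vec |= reg_vec.get(s, 0)
    if is_load:
        vec |= 1 << next_load_bit
        next_load_bit += 1
    reg_vec[dest] = vec
    return vec

def dependents_of(load_bit, in_flight):
    """in_flight: list of (name, vector); returns the names to invalidate."""
    return [name for name, vec in in_flight if vec & (1 << load_bit)]

v_ld  = rename("p1", [], is_load=True)  # this load owns bit 0
v_add = rename("p2", ["p1"])
v_or  = rename("p3", ["p2"])
v_xor = rename("p4", ["p9"])            # independent instruction
print(dependents_of(0, [("LD", v_ld), ("ADD", v_add), ("OR", v_or), ("XOR", v_xor)]))
# -> ['LD', 'ADD', 'OR']; XOR is untouched, but every load in the window costs a bit.
```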

February 18, 2004 Slide 13 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Non-Selective Replay (aka "squashing" replay). Kill all operands with non-zero-valued timers, assuming that every operand awakened after the mis-scheduled instruction is incorrect; dependence tracking is by wakeup order, which is imprecise. When a cache miss is detected, ALL instructions in the load shadow are invalidated and replayed once the miss resolves (e.g. the entire LD, ADD, OR, AND, BR chain re-enters the scheduler). In the scheduler entry, each source operand carries a tag comparator, a ready bit, and a timer started at wakeup; a single kill wire squashes every entry whose timer has not yet expired.
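A minimal sketch of the squashing behaviour, under the assumption that every issued instruction effectively carries an issue-to-verify timer: when the miss is detected, everything whose timer has not expired, i.e. the entire load shadow, is replayed regardless of dependence.

```python
VERIFY_DELAY = 4  # schedule-to-verify distance in cycles (assumed for illustration)

def squash_on_miss(in_flight, now):
    """in_flight: list of (name, issue_cycle). Returns the names that must replay."""
    return [name for name, issued in in_flight
            if now - issued < VERIFY_DELAY]  # still inside the load shadow

window = [("LD", 0), ("ADD", 1), ("OR", 2), ("AND", 2), ("BR", 3)]
# The miss is detected when the load reaches verification:
print(squash_on_miss(window, now=0 + VERIFY_DELAY))
# -> ['ADD', 'OR', 'AND', 'BR']: every instruction in the shadow replays,
#    even ones with no dependence on the load.
```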

February 18, 2004 Slide 14 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Delayed Selective Replay. All instructions are invalidated conservatively (same as non-selective replay), but each operand samples the completion signal for its producer's issue slot when its timer reaches 0, and direct child instructions are selectively re-validated if no poison bit arrives from the scoreboard, preventing further propagation of the invalidation. Dependence tracking uses wakeup order and issue-slot position, which is still imprecise. (In the scheduler entry, each operand adds a per-issue-slot completion bus sample and a scoreboard poison check on top of the tag comparators and timers; the completion bus needs one wire per issue slot.)
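A rough sketch of the re-validation check; the field names and the exact sampling point are my assumptions. An operand that was invalidated conservatively is brought back only if the load in its recorded issue slot completed and the scoreboard shows no poison bit for the source register.

```python
def revalidate(operand, completion_by_slot, poison):
    """operand: dict with 'slot' (issue slot of the producing load) and 'src'
    (source physical register). Returns True if the operand can be re-validated
    when its timer reaches 0."""
    load_completed = completion_by_slot.get(operand["slot"], False)
    poisoned = poison.get(operand["src"], False)
    return load_completed and not poisoned

completion = {0: True, 1: False}  # the load in issue slot 1 missed
poison = {"p7": True}             # an indirect dependent already flagged by the scoreboard
print(revalidate({"slot": 0, "src": "p5"}, completion, poison))  # True: child of a hit
print(revalidate({"slot": 1, "src": "p6"}, completion, poison))  # False: child of the miss
print(revalidate({"slot": 0, "src": "p7"}, completion, poison))  # False: poisoned indirect dependent
# Independent instructions are still invalidated first and only re-validated later,
# which is what creates the scheduling bubbles mentioned on the limitations slide.
```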

February 18, 2004 Slide 15 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Position-based Selective Replay. This is the ideal selective recovery: dependence tracking is managed in matrix form, with one column per load issue slot (memory port) and one row per pipeline stage, so tracking uses precise two-dimensional position information. Matrices are propagated along with the tag broadcast, bit-merged into consumers at wakeup, and shifted down every cycle in sync with pipeline flow; when a cache miss is detected, instructions whose bits match in the corresponding position are killed over a kill bus (one wire per memory port), while the dependence-info bus carries (memory ports x depth) bits.
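The matrix bookkeeping can be sketched as follows (matrix sizes and stage names are assumptions chosen for illustration): one column per memory port, one row per stage between Sched and Verify, bits shifted down each cycle and OR-merged into consumers at wakeup, so a miss reported on a port kills exactly the instructions whose bottom-row bit for that port is set.

```python
DEPTH = 4  # Sched -> Disp -> RF -> Exe rows (assumed)
PORTS = 2  # memory ports (assumed)

def new_matrix():
    return [[0] * PORTS for _ in range(DEPTH)]

def mark_load(matrix, port):          # a load sets its own position bit at issue
    matrix[0][port] = 1

def merge(dst, src):                  # wakeup: consumer inherits producer's positions
    for r in range(DEPTH):
        for p in range(PORTS):
            dst[r][p] |= src[r][p]

def shift(matrix):                    # advance one pipeline stage per cycle
    matrix.pop()
    matrix.insert(0, [0] * PORTS)

def killed_by_miss(matrix, port):     # miss signalled when the load verifies
    return matrix[DEPTH - 1][port] == 1

ld, add, sub = new_matrix(), new_matrix(), new_matrix()
mark_load(ld, port=0)                 # LD issues on memory port 0
shift(ld); shift(ld)                  # two cycles later its dependent wakes up
merge(add, ld)                        # ADD inherits the LD's position bits
shift(ld); shift(add); shift(sub)     # one more cycle: the LD reaches Verify
print(killed_by_miss(ld, 0), killed_by_miss(add, 0), killed_by_miss(sub, 0))
# -> True True False: only the dependent chain replays; the independent SUB is unaffected.
```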

February 18, 2004 Slide 16 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Outline: Speculative Scheduling & Wavefront Propagation, Parallel Verification, Scheduling Replay Schemes, Token-based Selective Replay, Performance Evaluation, Conclusions.

February 18, 2004 Slide 17 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Limitations of Replay Schemes. Performance scalability: the non-selective scheme replays independent instructions, and delayed selective replay creates bubbles in scheduling. Complexity in position-based replay: the number of extra wires grows rapidly with the machine, as a function of memory ports, issue width and pipeline depth (e.g. from 50 to 196 extra wires when transitioning from a 4-wide to an 8-wide machine). Incompatibility with data-speculation techniques (e.g. value prediction): data-speculation techniques collapse true data dependences, so wakeup order or position no longer correlates with dependences.

February 18, 2004 Slide 18 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Overcoming the Limitations. The source of the limitations is that dependence information propagates as part of the scheduling or execution process, so the fix is to move dependence propagation out of the scheduling logic and track dependences in program order (i.e. in the rename stage). This resembles the dependence vector scheme, which requires a big name space; how can the number of bits be reduced while still providing precise dependence tracking? Token-based selective replay tracks dependences only for the instructions likely to be mis-scheduled: tokens are planted in loads based on scheduling hit/miss prediction, propagated to dependent instructions, and instructions holding a token are recovered selectively. If token planting is incorrect, an expensive backup recovery squashes and re-inserts instructions in program order (analogous to branch misprediction recovery).

February 18, 2004 Slide 19 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Token-based Selective Replay. Pipeline structure: a scheduling miss predictor (indexed by PC) and a token allocator sit alongside Fetch/Decode/Rename; tokens are allocated to high-confidence miss loads, propagated to dependent instructions through the rename table's dependence vectors, and deallocated when the load retires. Token allocation at rename: the source registers' dependence vectors are merged, a newly allocated token ID is ORed in for a token-holding load, and the new dependence vector is written back to the rename table and carried with the conventional instruction / register information into the issue queue. In the scheduler, each entry holds the token dependence vector next to the conventional tag and ready bits, so the kill bus needs only 2 X (# tokens) wires. Recovery is selective replay for instructions carrying the token of a mis-scheduled load (token heads), and non-selective recovery for the others: squash and re-insert instructions in program order.
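A condensed sketch of the token mechanism (class and field names are mine, not the paper's hardware): the predictor decides which loads get one of the few tokens, rename propagates token vectors exactly like dependence vectors but only as wide as the token count, and recovery is selective for token-covered misses while falling back to squash-and-reinsert otherwise.

```python
NUM_TOKENS = 8  # the paper's 4-wide configuration

class TokenReplay:
    def __init__(self):
        self.free = set(range(NUM_TOKENS))
        self.reg_vec = {}  # physical register -> token bit-vector

    def rename(self, dest, srcs, predicted_miss=False):
        """Merge source token vectors; plant a new token on a predicted-miss load."""
        vec = 0
        for s in srcs:
            vec |= self.reg_vec.get(s, 0)
        token = None
        if predicted_miss and self.free:
            token = self.free.pop()
            vec |= 1 << token
        self.reg_vec[dest] = vec
        return vec, token

    def recover(self, token, in_flight):
        """in_flight: list of (name, vector). Returns (names_replayed, selective?)."""
        if token is None:  # the predictor said "hit" but the load missed:
            return [n for n, _ in in_flight], False  # expensive squash & re-insert
        return [n for n, v in in_flight if v & (1 << token)], True

tr = TokenReplay()
v_ld, tok = tr.rename("p1", [], predicted_miss=True)
v_add, _ = tr.rename("p2", ["p1"])
v_xor, _ = tr.rename("p3", ["p9"])  # independent instruction
print(tr.recover(tok, [("LD", v_ld), ("ADD", v_add), ("XOR", v_xor)]))
# -> (['LD', 'ADD'], True): only the token's dependents replay; XOR is untouched.
```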

February 18, 2004 Slide 20 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Machine parameters. Simplescalar-Alpha-based 4- and 8-wide OoO machines: 4-wide with 128 ROB, 64 LSQ, 64 IQ, 2 memory ports; 8-wide with 256 ROB, 128 LSQ, 128 IQ, 4 memory ports. Speculative scheduling with a 6-cycle schedule-to-verify delay; 32K IL1 (2), 32K DL1 (2), 512K L2 (8), memory (100); combined branch prediction, fetch until the first taken branch. Replay configurations: position-based selective replay, and token-based selective replay with 8 tokens (4-wide) or 16 tokens (8-wide), a scheduling miss predictor of 4K PC-indexed direct-mapped 2-bit counters, a 4-cycle penalty for squashing instructions from the issue queue, and re-insertion at the rate of the machine width. Workloads: SPEC2K INT with reduced input sets; reference input sets for crafty, eon and gap; up to 3B instructions.

February 18, 2004 Slide 21 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Scheduling Misses Covered by Tokens. 75-92% of scheduling misses are recovered selectively by tokens; the misses not covered by tokens are recovered non-selectively (re-insert). mcf runs out of tokens due to many concurrent misses. Name space reduction on the 8-wide machine: a naive vector scheme would track 128 loads, versus 16 loads (16 tokens) here. (Chart: % of load scheduling misses per load issue, for the 4-wide / 8-token and 8-wide / 16-token configurations.)

February 18, 2004 Slide 22 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Normalized Issue Count. Selective replay is essential for a lower issue count: non-selective replay shows a significant increase because independent instructions are unnecessarily replayed, and the problem is worse on wider machines. The token scheme performs as well as the ideal (position-based) scheme except for mcf, where scheduling miss coverage is low. (Results are shown for both the 4-wide and 8-wide machines.)

February 18, 2004 Slide 23 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Normalized IPC. Non-selective and delayed schemes do not scale to wider machines, since the scheduling miss penalty grows as the width grows. Token-based selective recovery works better than the non-selective or delayed selective schemes in many cases and has better performance scalability. (Charts compare the non-selective, delayed and token schemes on the 4-wide and 8-wide machines.)

February 18, 2004 Slide 24 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Discussion. Delayed selective recovery is a good design alternative to the ideal scheme on a 4-wide machine, with good tradeoffs among complexity, performance, and issue count. In the token scheme, complexity is a function of the number of tokens (not the machine width or depth): the number of extra wires in the scheduler is 2 X (# tokens), versus {(width) X (depth) + 1} X (# mem ports) for the position-based scheme, giving 32 (token-based) vs. 196 (position-based) on our 8-wide machine, so the token scheme is better suited to wider and deeper machines. The token scheme also supports data-speculation techniques, because it correctly tracks true data dependences in program order; the other schemes cannot recover unless correct dependences are carried through the scheduler.
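Plugging the stated configurations into the two wire-count formulas reproduces the numbers quoted on this slide and on the limitations slide (the depth of 6 is my reading of the 6-cycle schedule-to-verify delay from the machine-parameters slide):

```python
def token_wires(num_tokens):
    # extra scheduler wires in the token scheme
    return 2 * num_tokens

def position_wires(width, depth, mem_ports):
    # extra scheduler wires in the position-based scheme
    return (width * depth + 1) * mem_ports

print(token_wires(16))          # 32: 8-wide token scheme with 16 tokens
print(position_wires(8, 6, 4))  # (8*6 + 1) * 4 = 196: 8-wide position-based
print(token_wires(8))           # 16: 4-wide token scheme with 8 tokens
print(position_wires(4, 6, 2))  # (4*6 + 1) * 2 = 50: matches "50 to 196" from slide 17
```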

February 18, 2004 Slide 25 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Conclusions. Scheduling replay is essential for speculative scheduling: it invalidates and re-schedules incorrectly issued instructions, and it becomes increasingly important as pipelines grow wider and deeper. Speculative wavefront propagation in scheduling replay, incurred by the schedule-to-verify delay, negatively affects issue count (power) and performance, so scheduling replay needs multi-level dependence tracking to avoid unnecessary issue under misses. We examined the issues in efficient dependence tracking across non-selective, delayed selective and position-based selective schemes, and proposed token-based selective replay, which is scalable to wider machines and supports data speculation.

February 18, 2004 Slide 26 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Questions??

February 18, 2004 Slide 27 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Scheduling miss predictor performance at different thresholds. The predictor is PC-indexed, direct-mapped, with 4K entries. The charts show the coverage of scheduling misses (higher is better) and the fraction of loads predicted to be a miss (lower is better).
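A plain sketch of this kind of predictor; the counter update policy and threshold semantics are my assumptions, since the slides only fix the size and organization (4K entries, PC-indexed, direct-mapped, 2-bit counters). The threshold controls how aggressively loads are flagged as likely misses and therefore given tokens.

```python
ENTRIES = 4096

class SchedMissPredictor:
    def __init__(self, threshold=2):
        self.ctr = [0] * ENTRIES  # 2-bit saturating counters, values 0..3
        self.threshold = threshold

    def _index(self, pc):
        return (pc >> 2) % ENTRIES  # direct-mapped; word-aligned PCs assumed

    def predict_miss(self, pc):
        return self.ctr[self._index(pc)] >= self.threshold

    def update(self, pc, missed):
        i = self._index(pc)
        if missed:
            self.ctr[i] = min(3, self.ctr[i] + 1)
        else:
            self.ctr[i] = max(0, self.ctr[i] - 1)

p = SchedMissPredictor()
for _ in range(3):
    p.update(0x400120, missed=True)
print(p.predict_miss(0x400120))  # True: this load would be given a token
print(p.predict_miss(0x400200))  # False: cold entry, predicted to hit
```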

February 18, 2004 Slide 28 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Limitations with data-speculation techniques. The assumptions that enable the name-space conversion (into a smaller set) are data-dependence enforcement and a deterministic schedule-to-verify delay, so that tracking issue / execution status filters out independent instructions. Data speculation breaks those assumptions by collapsing true data dependences and making the schedule-to-verify distance variable, so these replay schemes cannot be directly applied to data-speculation recovery. (Timing diagrams contrast scheduling replay, where instructions already executed when a miss is detected are independent, with data-speculation recovery, where executed instructions may be either dependent or independent.)

February 18, 2004 Slide 29 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Normalized IPC for two additional policies (4-wide and 8-wide). Re-insert: all scheduling misses are recovered by squashing and re-inserting, giving the worst-case performance of token-based replay. Conservative: loads with high mis-scheduling confidence are scheduled based on the L2 latency, with squashing and re-inserting if they are still mis-scheduled; this may unnecessarily delay too many loads. (Charts compare the non-selective, delayed, token, re-insert and conservative schemes.)

February 18, 2004 Slide 30 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Scheduling Replay Models. Replay-queue-based replay (like the Pentium 4): issued instructions move from the issue queue to a replay queue, which circulates them until they hit in the scoreboard; parallel verification for this model is left to future work. Issue-queue-based replay (our assumption): instructions remain in the issue queue and retire from it only once correctly executed, with verification status delivered over the kill bus when a cache miss is detected at the end of the execution pipeline.
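A coarse sketch of the replay-queue model (data structures and the cycle accounting are assumptions): an issued instruction keeps circulating in the replay queue until its source tests valid in the scoreboard, at which point it retires from the issue queue; the issue-queue model assumed in this talk instead keeps instructions in the issue queue until they are verified and re-issues them from there.

```python
from collections import deque

def replay_queue_model(instrs, valid_after):
    """instrs: list of (name, src_reg). valid_after: register -> cycle it becomes
    valid in the scoreboard. Returns {name: cycle it finally left the replay loop}."""
    rq, cycle, done = deque(instrs), 0, {}
    while rq:
        cycle += 1
        name, src = rq.popleft()
        if cycle >= valid_after[src]:
            done[name] = cycle            # scoreboard hit: retire from issue queue
        else:
            rq.append((name, src))        # scoreboard miss: circulate and retry
    return done

# ADD depends on a load whose data only becomes valid at cycle 5 (after a miss resolves);
# OR's source is already valid, so it leaves the loop immediately.
print(replay_queue_model([("ADD", "p1"), ("OR", "p2")], {"p1": 5, "p2": 0}))
# -> {'OR': 2, 'ADD': 5}
```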

February 18, 2004 Slide 31 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Parallel Verification (backup). A load miss can propagate through 836 instruction levels, not bounded by the size of the instruction window (8-wide, 128-entry RUU); total issue count goes up by 15% in parser, 10% on average across SPEC2K INT, and 42% in the worst case (mcf), with negative impacts on performance and power. (Timing diagram: the Sched and Exe stages across cycles n, n+1 and n+2, with a scoreboard / checker and the cache miss signal driving dependence tracking / parallel verification that terminates the speculative execution wavefront.)

February 18, 2004 Slide 32 of 23 Ilhyun Kim & Mikko H. Lipasti -- HPCA10 Position-based Selective Replay (backup). Ideal selective recovery: dependence tracking is managed in matrix form (column: load issue slot, row: pipeline stage) using precise position information. Matrices are merged on the tag / dependence-info broadcast and advance with pipeline flow; an instruction is invalidated if its bits match in the last row when the kill bus broadcasts a detected cache miss. (The animation steps through an integer pipeline and a width-2 memory pipeline across the Sched, Disp, RF, Exe and Verify stages in cycles n and n+1, showing the LD's dependents being killed.)