1 Practical Selective Replay for Reduced-Tag Schedulers Dan Ernst and Todd Austin Advanced Computer Architecture Lab The University of Michigan June 8.

Presentation transcript:

1 Practical Selective Replay for Reduced-Tag Schedulers
Dan Ernst and Todd Austin
Advanced Computer Architecture Lab, The University of Michigan
June 8th, 2003
Workshop on Duplicating, Deconstructing and Debunking Pre-

2 Dynamic Scheduling Complexity
Broadcast-based dynamic scheduler circuits are:
– High complexity
– Power-hungry
– Scale poorly
This logic is also difficult to pipeline
– Palacharla, et al. stated that instruction wakeup and select appears to be an atomic operation
– Work on pipelining this logic has been done, but there’s always a price: more complex structures, IPC penalties
Tag Elimination approach [ISCA ’02]: make the critical path faster by reducing the circuit complexity.

3 Tag Elimination
Specialize the window entries
– Reservation stations have a varying number of tags to compare against
– Instructions are placed into stations that have the required number of in-flight tags
Advantages:
– Fewer comparators on the broadcast bus
– Bus wire length can be shorter
Disadvantage:
– New condition on being placed into the scheduler window: is there an entry of the correct size?
– Makes for a few more front-end stalls than usual
– Slightly less parallelism can be extracted
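The placement condition on this slide can be sketched as a small allocation routine. This is an illustrative model only: the class name `SpecializedWindow`, the method `dispatch`, and the policy of letting an instruction fall back to a larger entry are assumptions, not details taken from the talk. The 2/4/2 split is read here as two-tag/one-tag/zero-tag entry counts, which is also an assumption about the notation on the next slide.

```python
class SpecializedWindow:
    """Toy scheduler window whose entries are grouped by comparator count."""

    def __init__(self, two_tag, one_tag, zero_tag):
        # free[k] = number of free entries that can watch k in-flight tags
        self.free = {2: two_tag, 1: one_tag, 0: zero_tag}

    def dispatch(self, inflight_tags):
        """Claim an entry for an instruction still waiting on `inflight_tags` operands.

        Returns the entry size used, or None when no suitable entry is free
        (the front end would stall).
        """
        # Assumption: an instruction may use a larger entry than it needs,
        # so try the smallest adequate size first.
        for size in (0, 1, 2):
            if size >= inflight_tags and self.free[size] > 0:
                self.free[size] -= 1
                return size
        return None


window = SpecializedWindow(two_tag=2, one_tag=4, zero_tag=2)  # a 2/4/2 split
first = window.dispatch(2)   # uses a two-tag entry
second = window.dispatch(2)  # uses the last two-tag entry
third = window.dispatch(2)   # None: no two-tag entry left, front end stalls
```

The third dispatch shows the disadvantage listed on the slide: the window still has free entries, but none of the correct size.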

4 Tag Elimination Example
[Figure: reservation-station tag comparators attached to the Destination Tag Bus, shown for two window configurations labeled 8/0/0 and 2/4/2]

5 Last-Tag Speculation
If it is known which of an instruction’s operands will be the last to arrive, the first-arriving operand can be ignored.
– This can be predicted with high accuracy
Advantage:
– Reduces the maximum number of tags per instruction to 1
Disadvantage:
– Possibly further IPC loss due to last-tag misprediction penalties
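The slide does not say how the last-arriving operand is predicted. As a minimal sketch, assuming a hypothetical PC-indexed table with one bit per entry (0 = left operand arrives last, 1 = right operand arrives last), the predict/update cycle might look like this. The table size, the indexing function, and all names are illustrative assumptions.

```python
class LastTagPredictor:
    """Hypothetical 1-bit, PC-indexed last-tag predictor."""

    LEFT, RIGHT = 0, 1

    def __init__(self, entries=1024):
        # Every entry starts out predicting that the left operand arrives last.
        self.table = [self.LEFT] * entries

    def _index(self, pc):
        # Assumption: 4-byte instructions, so drop the low two PC bits.
        return (pc >> 2) % len(self.table)

    def predict(self, pc):
        """Return which operand the scheduler should wait on."""
        return self.table[self._index(pc)]

    def update(self, pc, actual_last):
        """Record which operand really arrived last for this instruction."""
        self.table[self._index(pc)] = actual_last


predictor = LastTagPredictor(entries=16)
before = predictor.predict(0x40)             # default guess: LEFT
predictor.update(0x40, LastTagPredictor.RIGHT)
after = predictor.predict(0x40)              # now RIGHT
```

A misprediction means the instruction was woken by the wrong tag, which is the source of the IPC penalty the slide mentions.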

6 Scheduler Replay
Latency prediction:
– Handles non-deterministic latency of memory ops
Selective replay only flushes instructions depending on the mis-speculated load
Kim & Lipasti (ISCA ’03) state that: “…implementing tag elimination on a machine with selective recovery is impractical.”

7 Parent-Child Broadcast Replay
Concept: if a latency is mispredicted, flush the dependent instructions and reschedule them from the window
The scheduler window dependence broadcasts are overloaded to also propagate information about an instruction’s “ancestors”
Tag Elimination is not compatible with this mechanism
– Dependence information cannot all be tracked or reset when not all tags are on the broadcast buses

8 Broadcast-Based Replay Description
[Figure a): a window entry with dep matrix L and dep matrix R next to the tagL and tagR comparators (==); labels: kill bus, wakeup bus (tag, dependence info)]

9 Dependence Propagation
[Figure b): dep matrix L and dep matrix R]
– Matrices are shifted down every cycle
– Kill bus: an instruction is invalidated if the kill bus bits match bits in the bottom row
– Merge the matrices and mark myself before they are sent to children
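The three operations named on this slide (merge and mark, shift down, kill on a bottom-row match) can be sketched with each matrix held as a list of row bitmasks. The layout is an assumption made for illustration: row 0 is the newest row, the last row is the bottom row, and bit i of a row stands for issue slot i. None of the function names come from the talk.

```python
def merge_and_mark(dep_left, dep_right, my_slot):
    """OR the two parents' matrices, then set my own bit in the top row."""
    merged = [left | right for left, right in zip(dep_left, dep_right)]
    merged[0] |= 1 << my_slot
    return merged


def shift_down(matrix):
    """Age the matrix by one cycle; the old bottom row falls off the end."""
    return [0] + matrix[:-1]


def is_killed(matrix, kill_bus):
    """An instruction is invalidated if the kill bus matches its bottom row."""
    return (matrix[-1] & kill_bus) != 0


# A load issues in slot 2 with no ancestors of its own (3-row matrices).
load_matrix = merge_and_mark([0, 0, 0], [0, 0, 0], my_slot=2)
# Two cycles later the load's bit has reached the bottom row.
aged = shift_down(shift_down(load_matrix))
hit = is_killed(aged, kill_bus=0b0100)    # kill targets slot 2: match
miss = is_killed(aged, kill_bus=0b0001)   # kill targets slot 0: no match
```

A child instruction would pass `aged` (or its own merged copy) as one of the parent matrices, which is how ancestry spreads through the window.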

10 U.S. Patent 6,212,626 - Filed Dec (Held by Intel Corp.)

11 Timed Queue Replay
Concept: if a latency is mispredicted, keep the same schedule already computed and insert an extra latency to make it correct
– The scheduler window is never involved in resolving ancestry
A ‘checker’ later in the pipeline verifies the schedule
– Table of register ready bits
Tag elimination is compatible with this mechanism
– Extra dependence tracking is not necessary
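As a rough sketch of the checker idea, the code below passes already-scheduled instructions by a table of register ready bits and recirculates any instruction whose sources are not ready, without consulting the scheduler window. It is a simplification under stated assumptions: the `Inst` record, the `drain` function, the step-based timing, and recirculation by re-appending to the queue are all hypothetical and do not model the real queue timing.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Inst:
    name: str
    dest: str
    srcs: tuple


def drain(schedule, ready, late_arrivals, max_steps=100):
    """Run instructions past the checker, one per step.

    ready:         registers whose ready bit is already set
    late_arrivals: register -> step at which its delayed value lands
    Returns (names in the order they passed the check, number of replays).
    """
    queue = deque(schedule)
    ready = set(ready)
    done, replays = [], 0
    for step in range(max_steps):  # guard against a value that never arrives
        if not queue:
            break
        # Delayed values (e.g. from a load miss) set their ready bits.
        ready |= {reg for reg, when in late_arrivals.items() if when <= step}
        inst = queue.popleft()
        if all(src in ready for src in inst.srcs):
            ready.add(inst.dest)   # "safe": result becomes visible
            done.append(inst.name)
        else:
            queue.append(inst)     # "replay": go around again
            replays += 1
    return done, replays


schedule = [
    Inst("add", "r2", ("r1",)),  # depends on a load that missed
    Inst("sub", "r3", ("r2",)),  # depends on the add
    Inst("or", "r4", ("r5",)),   # independent
]
order, replays = drain(schedule, ready={"r5"}, late_arrivals={"r1": 3})
```

Only the two instructions that depend on the late load go around again; the independent `or` passes the check on its first visit.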

12 Timed Queue Replay Description
[Figure: pipeline blocks labeled Instruction Scheduler, Replay Check, Registers/Execution, Writeback, with signals labeled stop scheduler, replay, safe]

13 Methodology: Architecture
Architectural model
– Derived from SimpleScalar 3.0
– More sophisticated scheduling support: separated ROB and RS, variable-length pipelines
Scheduler replay on memory latency misprediction
– Flush replay
– Broadcast-based selective
– Timed queue selective
– Support for specialized windows and last-tag prediction

14 Last-Tag with Selective Replay 4-wide

15 Last-Tag with Selective Replay 8-wide

16 Reducing Scheduler Pressure
After instructions are issued from the scheduler…
– Parent-Child Broadcast: instructions must remain in the window for D cycles in case a replay is initiated
– Timed Queue: instructions do not need to be kept

17 Benefits of Station Release

18 Comparing Replay Mechanisms
ILP
– Timed queue replay has much less instruction window pressure (WIB based on similar idea – ISCA ’02)
Complexity → Power
– Parent-Child broadcasts have high circuit complexity: W*(W x D) + W extra broadcast bits (roughly)
– 328 vs. 64 in our example
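The rough bit-count formula on this slide is easy to evaluate directly. The transcript does not state the parameters behind the 328 figure; reading W as the issue width and D as the number of cycles an instruction stays replayable, W = 8 and D = 5 is one combination that reproduces it, and that choice is an assumption. The derivation of the 64-bit figure is not given in the transcript and is not modeled here.

```python
def extra_broadcast_bits(width, depth):
    """Evaluate the slide's rough estimate W*(W x D) + W."""
    return width * (width * depth) + width


# Assumed parameters: 8-wide machine, 5-cycle replay window.
bits = extra_broadcast_bits(width=8, depth=5)  # 8 * 40 + 8 = 328
```

Because the W*(W x D) term grows with the square of the width, doubling W roughly quadruples the extra broadcast bits at a fixed D.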

19 Conclusions
Tag elimination and selective replay can work together
– If the replay mechanism is largely external to the scheduler
Timed queue replay mechanisms are more efficient
– Allows the scheduler to remove some dependence information
– Less pressure in the scheduler window
– No extra broadcasts → Lower circuit complexity → Lower power