Out-of-Order Commit Processors

Out-of-Order Commit Processors Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC) HPCA-10, Madrid February 14-17th 2004

Motivation I
[Figure: IPC (y-axis, 0.5 to 4) versus number of in-flight instructions (x-axis, 128 to 4096) for Spec FP 2000. Series: L2 Perfect, 100, 500, 1000. Annotations: 3.5X and 0.30X.]
Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

Motivation II – Resources – ROB
Instructions in flight (ROB=2048, Mem 500 cycles). Often nearly full.
[Figure: number of in-flight instructions (SpecFP) across the distribution marks 10%, 25%, 50%, 75%, 90%; labeled values: 1168, 1382, 1607, 1868, 1955, 2034.]
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Motivation III – Resources – FP Queue
State of FP Queues (ROB=2048, Mem 500 cycles)
[Figure: number of instructions in the FP queue (y-axis, 100 to 600), split into Blocked-Long, Blocked-Short, and Ready, across the distribution of in-flight instructions (x-axis marks 1, 10, 25, 50, 75, 90, 100; labeled values 1168, 1382, 1607, 1868, 1955). Annotations: Long/Short Lat. Inst.; Remove – Reinsert Dependence Chain.]
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Outline
- Motivation
- Out-of-Order Commit
- Multicheckpointing ROB
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusion

Out-of-Order Commit
[Diagram, three animation steps over the instruction stream Ld, I1, I2, Br 1, Ld, I3, I4, St, Br 2, I5, Br 3, I6:
1. The stream is divided by checkpoints (labels: Oldest Checkpoint, Checkpoint, New Checkpoint).
2. Commit Gang: the instructions of the oldest checkpoint (Ld, I1, I2, Br 1) commit together; labels: Store Buffer, To Memory. The next checkpoint becomes the oldest.
3. Miss Branch Prediction, Recover from Checkpoint (labels: Store Buffer, St, Oldest Checkpoint, I3, I4, St, Br 2, Checkpoint, I7, I5, I8, Br 3).]

Out-of-Order Commit II
Checkpoint Table. Each entry has:
- PC of the next instruction
- Instruction Counter: counts the number of instructions still alive
- Map Table: allows the register file to be recovered
- Pointer to the Store Buffer
- Mechanism to recover free registers (Future Free): one bit for each physical register
Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39
Ephemeral Registers: Tech. Rep UPC-DAC-2003.51

Checkpoint Creation
- Save the PC
- Save the Map Table
- Clear the Future Free bits
- Clear the Instruction Counter
- Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer
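The creation steps can be sketched in a few lines of Python. This is only an illustrative model, not the hardware design from the talk: the `Checkpoint` class, the `create_checkpoint` function, and the dictionary-based store buffer (`entries` list plus a `marks` set) are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    """One checkpoint-table entry (hypothetical software model)."""
    next_pc: int                  # PC of the next instruction
    map_table: Dict[str, int]     # snapshot: logical register -> physical register
    store_buffer_ptr: int         # first free store-buffer entry at creation time
    instr_counter: int = 0        # instructions still alive in this checkpoint
    future_free: List[bool] = field(default_factory=list)  # one bit per physical register


def create_checkpoint(pc: int, map_table: Dict[str, int],
                      store_buffer: dict, num_phys_regs: int) -> Checkpoint:
    """Take a checkpoint: save PC and map table, clear counter and bits."""
    ptr = len(store_buffer["entries"])   # assumption: append-only buffer, so the
                                         # first free entry is the current length
    store_buffer["marks"].add(ptr)       # mark that entry in the store buffer
    return Checkpoint(
        next_pc=pc,
        map_table=dict(map_table),       # copy, so later renames do not alter it
        store_buffer_ptr=ptr,
        instr_counter=0,
        future_free=[False] * num_phys_regs,
    )
```

Copying the map table at creation time is what later allows the rename state to be restored without walking a reorder buffer.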

Instruction Decode
- Add 1 to the Instruction Counter of the newest checkpoint
- For an instruction R1 ← R2 op R3: if R1 is mapped to PhyReg_N, set the PhyReg_N bit of the Future Free bit vector, then map R1 to the new physical register
- Associate the instruction with the last created checkpoint
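A minimal sketch of this decode-time bookkeeping, for an instruction that writes a logical destination register, might look as follows. The function name `decode`, the dictionary layout of the checkpoint, and the list-based free list are assumptions made for illustration only.

```python
from typing import Dict, List


def decode(dest: str, map_table: Dict[str, int], free_list: List[int],
           newest_ckpt: dict) -> dict:
    """Rename one instruction that writes logical register `dest`."""
    newest_ckpt["instr_counter"] += 1            # one more live instruction
    old_phys = map_table.get(dest)
    if old_phys is not None:
        # The previous mapping becomes dead once this checkpoint goes away.
        newest_ckpt["future_free"][old_phys] = True
    new_phys = free_list.pop(0)                  # allocate a new physical register
    map_table[dest] = new_phys
    # Associate the instruction with the last created checkpoint.
    return {"dest_phys": new_phys, "checkpoint": newest_ckpt}
```

The returned record keeps a reference to its checkpoint so that writeback can later find the counter to decrement.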

Instruction Writeback
- Decrement the Instruction Counter of the checkpoint associated with the instruction
- If the instruction is a mispredicted branch, recover from the associated checkpoint:
  - Fetch instructions from the saved PC
  - Release all entries in the store buffer starting at the pointed entry
  - Free all registers in the Future Free vector of the entry and of all newer checkpoint entries
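The writeback path can be sketched as below. The sketch covers the counter decrement and the rollback of the fetch PC, map table, and store buffer; register reclamation on recovery is deliberately left out. Everything here is a hypothetical model: the `machine` dictionary, the function names, and two explicit assumptions (newer checkpoints are discarded on recovery, and the recovered checkpoint's counter restarts at zero) are not stated on the slides.

```python
from typing import List


def recover(ckpt: dict, checkpoints: List[dict], machine: dict) -> None:
    """Roll machine state back to `ckpt` (register freeing omitted)."""
    machine["fetch_pc"] = ckpt["next_pc"]                    # refetch from the saved PC
    machine["map_table"] = dict(ckpt["map_table"])           # restore the rename map
    del machine["store_buffer"][ckpt["store_buffer_ptr"]:]   # release younger stores
    pos = next(i for i, c in enumerate(checkpoints) if c is ckpt)
    del checkpoints[pos + 1:]        # assumption: newer checkpoints are discarded
    ckpt["instr_counter"] = 0        # assumption: its whole window is squashed


def writeback(instr: dict, checkpoints: List[dict], machine: dict) -> None:
    """Finish one instruction and trigger recovery on a mispredicted branch."""
    ckpt = instr["checkpoint"]
    ckpt["instr_counter"] -= 1       # one fewer live instruction
    if instr.get("mispredicted_branch", False):
        recover(ckpt, checkpoints, machine)
```

Because recovery restarts from a checkpoint rather than from the branch itself, correct instructions between the checkpoint and the branch are re-executed; that is the cost traded for not keeping per-instruction state.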

Checkpoint Elimination
If a checkpoint's Instruction Counter is 0 and it is the oldest checkpoint, then:
- The checkpoint is removed
- The corresponding mark in the store buffer is cleared
- The registers marked in the Future Free vector are freed
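As a rough illustration of the elimination rule, the following sketch removes the oldest checkpoint only when its counter has reached zero. The names `eliminate_oldest`, `store_buffer_marks`, and `free_list`, and the use of a `deque` ordered oldest-first, are assumptions for the example rather than details from the talk.

```python
from collections import deque
from typing import Deque, List, Set


def eliminate_oldest(checkpoints: Deque[dict], store_buffer_marks: Set[int],
                     free_list: List[int]) -> bool:
    """Remove the oldest checkpoint if all of its instructions have finished."""
    if not checkpoints:
        return False
    oldest = checkpoints[0]
    if oldest["instr_counter"] != 0:
        return False                      # still has live instructions
    checkpoints.popleft()                 # the checkpoint is removed
    store_buffer_marks.discard(oldest["store_buffer_ptr"])  # clear its mark
    for reg, marked in enumerate(oldest["future_free"]):
        if marked:
            free_list.append(reg)         # registers marked Future Free are freed
    return True
```

Note that a younger checkpoint whose counter is already zero still waits until every older checkpoint has been eliminated, which preserves in-order release at checkpoint granularity.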

Outline
- Motivation
- Out-of-Order Commit
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusions

Slow Lane Instruction Queue
[Diagram, three animation steps. Components: Pseudo ROB, Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue. Labels: LD, Data Dependence, instructions a, b, and x. The last step is labeled “Load End” and “Begin reinsert”.]

Slow Lane Instruction Queue II
- Very simple buffer: the Slow Lane Instruction Queue (SLIQ)
- Each load that misses in L2 has a pointer to an entry in the SLIQ
- Pseudo ROB

Slow Lane Instruction Queue III
When an instruction is retired from the Pseudo ROB, its state is checked:
- If the instruction is a load miss, the pointer is written
- If the instruction depends on a long-latency instruction, it is moved to the SLIQ
When a load that missed in L2 finishes its execution:
- The SLIQ is traversed starting from the instruction pointed to by the load, if this point is older than the current traversal position
- The load's dependent instructions are reinserted into the IQ
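The retire and reinsert behaviour can be approximated with a toy model. This is a sketch under several assumptions that are not taken from the slides: instructions are plain dictionaries, every instruction writes a unique register tag, dependence on a load is tracked transitively through those tags, and the condition about the current traversal position is not modelled (each completed load simply walks the queue from its pointer). The class and method names are hypothetical.

```python
from typing import List, Optional, Set


class SlowLaneQueue:
    """Toy SLIQ: a FIFO of instructions waiting on long-latency loads."""

    def __init__(self) -> None:
        self.entries: List[Optional[dict]] = []   # None marks a reinserted slot
        self.slow_regs: Set[str] = set()          # tags produced by slow chains

    def retire_from_pseudo_rob(self, instr: dict) -> None:
        if instr.get("l2_miss", False):
            instr["sliq_ptr"] = len(self.entries)   # write the load's pointer
            self.slow_regs.add(instr["dest"])
        elif any(src in self.slow_regs for src in instr["srcs"]):
            self.entries.append(instr)              # move the dependent to the SLIQ
            if instr.get("dest"):
                self.slow_regs.add(instr["dest"])   # its consumers are slow too

    def load_completed(self, load: dict, iq: List[dict]) -> None:
        ready = {load["dest"]}                      # tags now (or soon) available
        for i in range(load["sliq_ptr"], len(self.entries)):
            instr = self.entries[i]
            if instr is None:
                continue
            if any(src in ready for src in instr["srcs"]):
                iq.append(instr)                    # reinsert into the IQ
                self.entries[i] = None
                if instr.get("dest"):
                    ready.add(instr["dest"])
        self.slow_regs -= ready
```

Starting the walk at the load's pointer skips every instruction that entered the queue before the load retired, none of which can depend on it.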

Performance Evaluation
Processor Configuration (Baseline 4096):
Fetch/Commit width: 4
Branch Predictor: 16K entries Gshare
Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
Data L1: 32 KB, 4-way, 32-byte line, 2 cycles
L2: 512 KB, 4-way, 64-byte line, 10 cycles
Memory Latency: 1000 cycles
Physical Registers: 4096 entries
Load/Store Queue: 4096 entries
Reorder Buffer: 4096 entries
Integer General Units: 4 (lat/rep 1/1)
Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
FP Functional Units: 4 (lat/rep 2/1)
FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)

Performance Evaluation - Some Considerations
We mix both models. The processor takes checkpoints when instructions are retired from the pseudo ROB. Many branches are already resolved by this time, so the probability of rolling back to a checkpoint is reduced. If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.

IPC – Different Configurations

Number of Checkpoints and Performance
Fig. 4. Sensitivity of our Hierarchical Commit mechanism to the number of available checkpoints, with IQ size = 2048 entries and 2048 physical registers.
Baseline: 2048 IQ. SLIQ: 2048 entries and 128 IQ. 2048 physical registers.

In-Flight Instructions

Delay in re-insertion from SLIQ SLIQ: 1024 entries

Towards an Affordable Kilo-Instruction Processor
- Adding Ephemeral Registers to the Out-of-Order Commit processor
- Changing the SLIQ into a list of buckets of instructions
J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR-2003-1035, 2003.

Putting It All Together
Fig. 6. IPC results of the combination of mechanisms (SLIQ, Out-of-Order Commit, Ephemeral Registers) with respect to the number of virtual registers, the memory latency, and the number of physical registers. IQs of 128 entries.
[Figure labels: Physical Registers, Virtual Registers, Memory Latency.]

Conclusion
- To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained.
- The resources must be up-sized.
- The resources are underutilized.
- We present two techniques that reduce the need for resources, and we show their effectiveness:
  - Out-of-Order Commit
  - Slow Lane Instruction Queue

Thank you very much 

State of ST Queues (specInt, ROB=2048)
[Figure: number of instructions in the ST queue (y-axis, 50 to 250), split into Ready, Address Ready, Blocked-Long, and Blocked-Short, across the distribution of in-flight instructions (x-axis marks 1, 10, 25, 50, 75, 90, 100; labeled values 20, 108, 435, 1004, 1361). Annotation: Locality.]

State of Int Queues (specInt, ROB=2048)
[Figure: number of instructions in the Int. queue (y-axis, 50 to 450), split into Blocked-Long, Blocked-Short, and Ready, across the distribution of in-flight instructions (x-axis marks 1, 10, 25, 50, 75, 90, 100; labeled values 20, 108, 435, 1004, 1361). Annotations: Long/Short Lat. Inst.; Remove – Reinsert Dependence Chain.]

State of Registers (Int, ROB=2048)
[Figure: number of Int. registers (y-axis, 100 to 1000), split into Dead, Blocked-Long, Blocked-Short, and Live, versus number of in-flight instructions (SpecInt) at the 10%, 25%, 50%, 75%, 90% marks; labeled values 20, 108, 435, 1004, 1361, 1756. Annotations: Early Release; Virtual Registers.]