Out-of-Order Commit Processors

Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC)
HPCA-10, Madrid, February 14-17th 2004

Motivation I
[Figure: IPC versus number of in-flight instructions (128 to 4096) for Spec FP 2000, with series labeled L2 Perfect, 100, 500 and 1000. Annotations on the plot: 3.5X and 0.30X.]
Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

Motivation II – Resources - ROB
Instructions in-flight (ROB=2048, Mem 500 cycles)
[Figure: distribution of the number of in-flight instructions (SpecFP), with percentile marks at 10%, 25%, 50%, 75% and 90% and labeled values 1168, 1382, 1607, 1868, 1955 and 2034.]
Often nearly full
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Motivation III – Resources – FP Queue
State of FP Queues (ROB=2048, Mem 500 cycles)
[Figure: number of instructions in the FP Queue across the distribution of in-flight instructions (percentiles 1 to 100; labeled values 1168, 1382, 1607, 1868, 1955), broken down into Blocked-Long, Blocked-Short and Ready. Annotations: Long/Short Lat. Inst.; Remove/Reinsert Dependence Chain.]
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

Outline
- Motivation
- Out-of-Order Commit
- Multicheckpointing ROB
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusion

Out-of-Order Commit
[Diagram: instruction stream Ld, I1, I2, Br 1, Ld, I3, I4, St, Br 2, I5, Br 3, I6, with checkpoints marked before the first Ld (oldest checkpoint), before I3, and before I5 (newest checkpoint).]

Out-of-Order Commit
[Diagram: animation of gang commit. The group Ld, I1, I2, Br 1, Ld under the oldest checkpoint commits as a gang, with the Store Buffer draining to memory; the checkpoint before I3 then becomes the oldest checkpoint.]

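Out-of-Order Commit: Branch Misprediction, Recover from Checkpoint
[Diagram: a branch misprediction is detected and the processor recovers from a checkpoint. Labels shown: Store Buffer holding St; oldest checkpoint before I3, I4, St, Br 2; a later checkpoint with I5, Br 3 and the new instructions I7, I8.]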

Out-of-Order Commit II
Checkpoint Table. Each entry has:
- PC of the next instruction
- Instruction Counter: counts the number of instructions still alive
- Map Table: allows the register file to be recovered
- Pointer to the Store Buffer
- Mechanism to recover free registers: Future Free, one bit for each physical register
Large Virtual ROB: Tech. Rep. UPC-DAC
Ephemeral Registers: Tech. Rep. UPC-DAC

Checkpoint Creation
- Save the PC
- Save the Map Table
- Clear the Future Free bits
- Clear the Instruction Counter
- Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer
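The creation steps above can be sketched in a few lines of Python. This is an illustrative model only, not the authors' hardware design: the names `Checkpoint`, `create_checkpoint`, `store_marks` and the constant `NUM_PHYS_REGS` are assumptions, and the store buffer is modeled as a plain list whose first free entry is its current length.

```python
from dataclasses import dataclass, field

NUM_PHYS_REGS = 8  # hypothetical register-file size, for illustration only


@dataclass
class Checkpoint:
    pc: int            # PC of the next instruction
    map_table: dict    # saved logical -> physical register mapping
    store_ptr: int     # first free store-buffer entry at creation time
    instr_count: int = 0  # instructions still alive (starts cleared)
    # Future Free vector: one bit per physical register (starts cleared)
    future_free: list = field(default_factory=lambda: [False] * NUM_PHYS_REGS)


def create_checkpoint(pc, map_table, store_buffer, store_marks):
    """Take a checkpoint following the steps listed on the slide."""
    store_ptr = len(store_buffer)   # first free entry of the store buffer
    store_marks.add(store_ptr)      # mark that entry in the store buffer
    # Copy the map table so later renaming does not alter the saved state.
    return Checkpoint(pc=pc, map_table=dict(map_table), store_ptr=store_ptr)
```

Copying the map table is the one design choice worth noting: the saved mapping has to stay frozen while decode keeps renaming registers.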

Instruction Decode
- Add 1 to the Instruction Counter of the newest checkpoint
- For an instruction R1 ← R2 op R3:
  - If R1 is mapped to PhyReg_N, set the PhyReg_N bit of the Future Free vector
  - Map R1 to the new physical register
- Associate the instruction with the last created checkpoint
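A minimal Python sketch of the decode steps follows. It is an assumption-laden model rather than the authors' implementation: `Checkpoint`, `decode` and `free_list` are hypothetical names, and the Future Free bit vector is represented as a set of physical register numbers.

```python
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    instr_count: int = 0                           # instructions still alive
    future_free: set = field(default_factory=set)  # registers to free later


def decode(dest, map_table, free_list, newest):
    """Rename one instruction `dest <- src1 op src2` at decode time."""
    newest.instr_count += 1             # one more live instruction
    old_phys = map_table.get(dest)
    if old_phys is not None:
        newest.future_free.add(old_phys)  # previous mapping becomes future-free
    new_phys = free_list.pop(0)         # allocate a new physical register
    map_table[dest] = new_phys
    return new_phys, newest             # instruction is tied to this checkpoint
```

For example, decoding an instruction that writes `R1` while `R1` maps to physical register 3 adds 3 to the Future Free set and remaps `R1` to the first register on the free list.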

Instruction Writeback
- Decrement the Instruction Counter of the checkpoint associated with the instruction
- If the instruction is a mispredicted branch, recover from the associated checkpoint:
  - Fetch instructions from the saved PC
  - Release all entries in the store buffer from the pointed entry onward
  - Free all registers in the Future Free vector of the entry and of all newer checkpoint entries

Checkpoint Elimination
If the Instruction Counter is 0 and the checkpoint is the oldest one, then:
- The checkpoint is removed
- The corresponding mark in the store buffer is cleared
- The registers marked in the Future Free vector are freed
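The interplay between writeback and checkpoint elimination can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: `Checkpoint`, `writeback` and `try_eliminate_oldest` are hypothetical names, checkpoints live in a `deque` ordered oldest first, and misprediction recovery is not modeled.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Checkpoint:
    store_ptr: int                                 # marked store-buffer entry
    instr_count: int = 0                           # instructions still alive
    future_free: set = field(default_factory=set)  # registers to free on removal


def writeback(ckpt):
    """An instruction associated with `ckpt` has finished executing."""
    ckpt.instr_count -= 1


def try_eliminate_oldest(checkpoints, store_marks, free_list):
    """Remove the oldest checkpoint if all of its instructions are done."""
    if not checkpoints or checkpoints[0].instr_count != 0:
        return False
    oldest = checkpoints.popleft()              # the checkpoint is removed
    store_marks.discard(oldest.store_ptr)       # clear its store-buffer mark
    free_list.extend(sorted(oldest.future_free))  # free its marked registers
    return True
```

Instructions may finish in any order, but only the oldest checkpoint is ever released, so a newer checkpoint whose counter has already reached 0 waits until every older one is gone.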

Outline
- Motivation
- Out-of-Order Commit
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusions

Slow Lane Instruction Queue
[Diagram, three animation steps: a Pseudo ROB alongside the Load/Store Queue, the Instruction Queue and the Slow Lane Instruction Queue. Instructions a and b, which have a data dependence on a load (LD), are moved to the Slow Lane Instruction Queue; in the last step the load ends and reinsertion into the Instruction Queue begins.]

Slow Lane Instruction Queue II
- Very simple buffer: the Slow Lane Instruction Queue (SLIQ)
- Each load that misses in L2 has a pointer to an entry in the SLIQ
- Pseudo ROB

Slow Lane Instruction Queue III
When an instruction is retired from the Pseudo ROB, its state is examined:
- If the instruction is a load miss, the pointer is written
- If the instruction depends on a long-latency instruction, it is moved to the SLIQ
When a load that missed in L2 finishes its execution:
- The SLIQ is traversed starting from the instruction pointed to by the load, if this point is older than the current traversal position
- The load’s dependent instructions are reinserted into the IQ
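The retire and reinsert steps can be sketched in Python as follows. This is a simplified illustration, not the authors' design: `Instr`, `SlowLane`, `waits_on` and `sliq_ptr` are hypothetical names, dependence is reduced to naming the L2-missing load an instruction waits on, and the check against the current traversal position is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Instr:
    name: str
    is_load_miss: bool = False
    waits_on: Optional[str] = None  # name of the L2-missing load it depends on
    sliq_ptr: Optional[int] = None  # written when a load miss is retired


class SlowLane:
    def __init__(self):
        self.sliq = []  # simple buffer; None marks an entry already reinserted
        self.iq = []    # instruction queue receiving reinserted instructions

    def retire_from_pseudo_rob(self, instr):
        if instr.is_load_miss:
            # The load records where its dependents will start in the SLIQ.
            instr.sliq_ptr = len(self.sliq)
        elif instr.waits_on is not None:
            # Depends on a long-latency instruction: move it to the SLIQ.
            self.sliq.append(instr)
        # Other instructions are not modeled in this sketch.

    def load_finished(self, load):
        # `load` must be a load miss that was already retired (sliq_ptr set).
        # Traverse from the load's pointer and reinsert its dependents.
        for i in range(load.sliq_ptr, len(self.sliq)):
            entry = self.sliq[i]
            if entry is not None and entry.waits_on == load.name:
                self.iq.append(entry)
                self.sliq[i] = None
```

Because the pointer is written when the load retires, the traversal skips every SLIQ entry that was inserted before the load and therefore cannot depend on it.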

Performance Evaluation
Processor Configuration (Baseline 4096):
- Fetch/Commit width: 4
- Branch Predictor: 16K-entry Gshare
- Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
- Data L1: [size missing] KB, 4-way, 32-byte line, 2 cycles
- L2: 512 KB, 4-way, 64-byte line, 10 cycles
- Memory Latency: [value missing] cycles
- Physical Registers: [value missing] entries
- Load/Store Queue: 4096 entries
- Reorder Buffer: [value missing] entries
- Integer General Units: 4 (lat/rep 1/1)
- Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
- FP Functional Units: 4 (lat/rep 2/1)
- FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)

Performance Evaluation - Some Considerations
We combine both models. The processor takes checkpoints when instructions are retired from the Pseudo ROB. Many branches are already resolved by then, so the probability of rolling back to a checkpoint is reduced. If a mispredicted branch is detected while it is still in the Pseudo ROB, a normal rollback mechanism is used.

IPC – Different Configurations

Number of Checkpoints and Performance
Fig. 4. Sensitivity of our Hierarchical Commit mechanism to the number of available checkpoints, with IQ size = 2048 entries and 2048 physical registers.
Baseline: 2048 IQ. SLIQ: 2048 entries and 128 IQ.
[Plot axis label: Physical Registers]

In-Flight Instructions

Delay in re-insertion from SLIQ
SLIQ: 1024 entries

Towards an Affordable Kilo-Instruction Processor
- Adding Ephemeral Registers to the Out-of-Order Commit Processors
- Changing the SLIQ into a list of buckets of instructions
J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR , 2003.

Putting It All Together
Fig. 6. IPC results of the combination of mechanisms (SLIQ, Out-of-Order Commit, Ephemeral Registers) with respect to the number of virtual registers, the memory latency and the number of physical registers. IQs of 128 entries.
[Plot axis labels: Physical Registers, Virtual Registers, Memory Latency]

Conclusion
- To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained
  - The resources must be up-sized
  - The resources are underutilized
- We present two techniques to reduce the need for resources and show their effectiveness:
  - Out-of-Order Commit
  - Slow Lane Instruction Queue

Thank you very much 

State of ST Queues (specInt, ROB=2048)
[Figure: number of instructions in the ST Queue across the distribution of in-flight instructions (percentiles 1 to 100; labeled values 20, 108, 435, 1004, 1361), broken down into Ready, Address Ready, Blocked-Long and Blocked-Short. Annotation: Locality.]

State of Int Queues (specInt, ROB=2048)
[Figure: number of instructions in the Int. Queue across the distribution of in-flight instructions (percentiles 1 to 100; labeled values 20, 108, 435, 1004, 1361), broken down into Blocked-Long, Blocked-Short and Ready. Annotations: Long/Short Lat. Inst.; Remove/Reinsert Dependence Chain.]

State of Registers (Int, ROB=2048)
[Figure: number of Int. registers versus number of in-flight instructions (SpecInt), with percentile marks at 10%, 25%, 50%, 75% and 90% and labeled values 20, 108, 435, 1004, 1361 and 1756, broken down into Dead, Blocked-Long, Blocked-Short and Live. Annotations: Early Release; Virtual Registers.]
