Out-of-Order Commit Processors. Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC). HPCA-10, Madrid, February 14-17.


1 Out-of-Order Commit Processors. Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC). HPCA-10, Madrid, February 14-17, 2004

2 Motivation I
[Figure: IPC (0 to 4) versus in-flight instructions (128 to 4096), Spec FP 2000; legend: L2 Perfect, 100, 500, 1000; annotations: 0.30X, 3.5X]
Cristal et al.: “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

3 Motivation II – Resources – ROB
Often nearly full
[Figure: Instructions in-flight (ROB=2048, Mem 500 cycles); number of in-flight instructions (SpecFP); x-axis labels: 10%, 25%, 50%, 75%, 90%; data labels: 1168, 1382, 1607, 1868, 1955, 2034]
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

4 Motivation III – Resources – FP Queue
[Figure: State of FP Queues (ROB=2048, Mem 500 cycles); distribution of in-flight instructions in the FP queue versus number of instructions; categories: Blocked-Long, Blocked-Short, Ready]
Annotations: Long/Short Lat. Inst.; Remove – Reinsert Dependence Chain
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

5 Outline
- Motivation
- Out-of-Order Commit
- Multicheckpointing ROB
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusion

6 Out-of-Order Commit
[Diagram: a stream of instructions (I1 to I6, loads Ld, a store St, branches Br 1 to Br 3) grouped between the oldest checkpoint and a new checkpoint]

7 Out-of-Order Commit
[Diagram: the same instruction stream with its checkpoints; labels: Store Buffer, Oldest Checkpoint, To Memory, Gang Commit]

8 Out-of-Order Commit
Branch misprediction: recover from checkpoint
[Diagram: instructions St, I4, I3, I5, Br 3, Br 2 after the oldest checkpoint, the store buffer holding St, and new instructions I7, I8]

9 Out-of-Order Commit II
- Checkpoint Table. Each entry has:
  - PC of the next instruction
  - Instruction Counter: counts the number of instructions still alive
  - Map Table: allows the register file to be recovered
  - Pointer to the Store Buffer
  - Mechanism to recover free registers: Future Free, one bit for each physical register
Large Virtual ROB: Tech. Rep. UPC-DAC-2002-39
Ephemeral Registers: Tech. Rep UPC-DAC-2003.51

10 Checkpoint Creation
- Save the PC
- Save the Map Table
- Clear the Future Free bits
- Clear the Instruction Counter
- Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer
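The creation steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' hardware design: the `Checkpoint` class, the `create_checkpoint` helper, and the use of a Python set to hold store buffer marks are all assumed names and representations chosen for readability.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Checkpoint:
    """One checkpoint table entry, with the fields listed on the slides."""
    pc: int                      # PC of the next instruction
    map_table: Dict[str, int]    # snapshot: logical register -> physical register
    store_ptr: int               # first store buffer entry owned by this checkpoint
    instr_count: int = 0         # instructions still alive
    future_free: List[bool] = field(default_factory=list)  # one bit per physical register


def create_checkpoint(pc, map_table, store_buffer, marks, num_phys_regs):
    """Take a checkpoint (hypothetical helper; all names are illustrative)."""
    store_ptr = len(store_buffer)    # index of the first free store buffer entry
    marks.add(store_ptr)             # mark that entry as a checkpoint boundary
    return Checkpoint(
        pc=pc,                                  # save the PC
        map_table=dict(map_table),              # save a copy of the map table
        store_ptr=store_ptr,
        instr_count=0,                          # cleared instruction counter
        future_free=[False] * num_phys_regs,    # cleared Future Free bits
    )
```

Copying the map table matters in this sketch: later renames change the live table, while the snapshot must stay as it was when the checkpoint was taken.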

11 Instruction Decode
- Add 1 to the Instruction Counter of the newest checkpoint
- For an instruction R1 ← R2 op R3:
  - If R1 is mapped to PhyReg_N, set the PhyReg_N bit of the Future Free bit vector
  - Map R1 to the new physical register
- Associate the instruction with the last created checkpoint
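A rough Python sketch of the decode steps above. The dictionary layout, the `free_list`, and the `decode` function are assumptions made for illustration; the slides do not say how a new physical register is chosen, so the sketch simply takes the first free one and assumes one is available.

```python
def decode(dest, map_table, free_list, newest):
    """Rename `dest` and charge the instruction to the newest checkpoint.

    `newest` is a dict with "instr_count" and "future_free" (hypothetical layout).
    """
    newest["instr_count"] += 1                 # one more instruction alive

    old_phys = map_table.get(dest)
    if old_phys is not None:
        # The previous mapping can be released once this checkpoint is removed.
        newest["future_free"][old_phys] = True

    new_phys = free_list.pop(0)                # assumes a free register exists
    map_table[dest] = new_phys                 # map the destination to the new register

    # The instruction remembers the checkpoint it belongs to.
    return {"dest": dest, "phys": new_phys, "ckpt": newest}
```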

12 Instruction Writeback
- Decrement the Instruction Counter of the checkpoint associated with the instruction
- If the instruction is a mispredicted branch, recover from the associated checkpoint:
  - Fetch instructions from the saved PC
  - Release all entries in the store buffer starting at the pointed entry
  - Free all registers in the Future Free vector of the entry and of all newer checkpoint entries
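The writeback and recovery steps above might look roughly like this in Python. The `core` dictionary, the field names, and the `writeback` function are hypothetical. Two points are assumptions rather than statements from the slides: newer checkpoints are discarded on rollback, and the counter of the restored checkpoint restarts at zero. Register reclamation through the Future Free vectors is deliberately left out, because the slides do not give enough detail to model it faithfully.

```python
def writeback(instr, checkpoints, core, mispredicted=False):
    """Complete `instr`; on a mispredicted branch, roll back to its checkpoint.

    `checkpoints` is a list ordered oldest to newest.
    Returns True if a rollback happened.
    """
    ckpt = instr["ckpt"]
    ckpt["instr_count"] -= 1                       # one fewer instruction alive
    if not mispredicted:
        return False

    # Find the checkpoint by identity and drop every newer one (assumption).
    idx = next(i for i, c in enumerate(checkpoints) if c is ckpt)
    del checkpoints[idx + 1:]

    core["fetch_pc"] = ckpt["pc"]                  # fetch from the saved PC
    core["map_table"] = dict(ckpt["map_table"])    # restore the register mapping
    del core["store_buffer"][ckpt["store_ptr"]:]   # release store entries from the pointer on
    ckpt["instr_count"] = 0                        # squashed work no longer counts (assumption)
    # Register reclamation via the Future Free vectors is not modeled here.
    return True
```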

13 Checkpoint Elimination
- If a checkpoint's Instruction Counter is 0 and it is the oldest checkpoint, then the checkpoint is removed:
  - Clear the corresponding mark in the store buffer
  - Free the registers marked in the Future Free vector
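The elimination rule above can be illustrated with a short Python sketch. The `retire_checkpoints` name and the dictionary layout are assumptions. The sketch follows the rule literally and does not model the case of a checkpoint that is still receiving newly decoded instructions.

```python
from collections import deque


def retire_checkpoints(checkpoints, marks, free_list):
    """Remove finished checkpoints from the old end of the table.

    `checkpoints` is a deque ordered oldest to newest. A checkpoint is removed
    only when it is the oldest one and its instruction counter is 0, so a
    finished checkpoint behind an unfinished one has to wait.
    Returns the number of checkpoints removed.
    """
    removed = 0
    while checkpoints and checkpoints[0]["instr_count"] == 0:
        oldest = checkpoints.popleft()
        marks.discard(oldest["store_ptr"])        # clear its mark in the store buffer
        for reg, bit in enumerate(oldest["future_free"]):
            if bit:
                free_list.append(reg)             # free registers marked Future Free
        removed += 1
    return removed
```

Because checkpoints leave strictly in order, instructions inside a checkpoint can finish in any order while the architectural state still advances one checkpoint at a time.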

14 Outline
- Motivation
- Out-of-Order Commit
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusions

15 Slow Lane Instruction Queue
[Diagram: a Pseudo ROB holding a load (Ld), independent instructions (x) and instructions a and b linked to the load by a data dependence; also shown: Load/Store Queue, Instruction Queue, and the Slow Lane Instruction Queue holding LD, a, b]

16 Slow Lane Instruction Queue
[Diagram: same as the previous slide (animation step)]

17 Slow Lane Instruction Queue
[Diagram: same structures; labels: Load End, Begin reinsert]

18 Slow Lane Instruction Queue II
- Very simple buffer: the Slow Lane Instruction Queue (SLIQ)
- Each load that misses in L2 has a pointer to an entry in the SLIQ
- Pseudo ROB

19 Slow Lane Instruction Queue III
- When an instruction is retired from the Pseudo ROB, its state is checked:
  - If the instruction is a load miss, the pointer is written
  - If the instruction depends on a long-latency instruction, it is moved to the SLIQ
- When a load that missed in L2 finishes its execution:
  - The SLIQ is traversed starting at the instruction pointed to by the load, if this point is older than the current traversal position
  - The load's dependent instructions are reinserted into the IQ
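A toy Python model of the SLIQ behaviour described above. Everything here is an illustrative assumption: the `SlowLaneQueue` class, the dictionary fields, and the use of unique physical register tags to track dependences. The model performs one full traversal per completed load and does not reproduce the shared traversal position mentioned on the slide.

```python
class SlowLaneQueue:
    """Toy SLIQ; register names are unique physical tags (an assumption)."""

    def __init__(self):
        self.entries = []       # parked instructions, oldest first; None = reinserted
        self.load_ptr = {}      # load id -> SLIQ index written when the load left the pseudo ROB
        self.slow_tags = set()  # tags whose value waits on an outstanding L2 miss

    def retire_from_pseudo_rob(self, instr):
        """Check an instruction leaving the pseudo ROB. Returns True if it was parked."""
        if instr.get("l2_miss"):
            self.load_ptr[instr["id"]] = len(self.entries)   # the pointer is written
            self.slow_tags.add(instr["dest"])
            return False
        if any(src in self.slow_tags for src in instr["srcs"]):
            self.entries.append(instr)                       # moved to the SLIQ
            self.slow_tags.add(instr["dest"])                # its consumers are slow too
            return True
        return False

    def load_finished(self, load, issue_queue):
        """Walk the SLIQ from the load's pointer and reinsert its dependence chain."""
        ready = {load["dest"]}
        start = self.load_ptr.pop(load["id"])
        for i in range(start, len(self.entries)):
            instr = self.entries[i]
            if instr is not None and any(src in ready for src in instr["srcs"]):
                issue_queue.append(instr)                    # back into the IQ
                ready.add(instr["dest"])
                self.entries[i] = None                       # keep indices stable
        self.slow_tags -= ready
```

Leaving a `None` placeholder instead of deleting an entry keeps the pointers recorded for other outstanding loads valid, which is one simple way to mimic a FIFO with per-load entry pointers.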

20 Performance Evaluation
Processor Configuration (Baseline 4096):
- Fetch/Commit width: 4
- Branch Predictor: 16K entries, Gshare
- Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
- Data L1: 32 KB, 4-way, 32-byte line, 2 cycles
- L2: 512 KB, 4-way, 64-byte line, 10 cycles
- Memory Latency: 1000 cycles
- Physical Registers: 4096 entries
- Load/Store Queue: 4096 entries
- Reorder Buffer: 4096 entries
- Integer General Units: 4 (lat/rep 1/1)
- Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
- FP Functional Units: 4 (lat/rep 2/1)
- FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)

21 Performance Evaluation – Some Considerations
- We combine both models.
- The processor takes checkpoints when instructions are retired from the pseudo ROB.
- Many branches are already resolved at this time, so the probability of rolling back to a checkpoint is reduced.
- If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.

22 IPC – Different Configurations

23 Number of Checkpoints and Performance
Baseline: 2048 IQ. SLIQ 2048 entries and 128 IQ. 2048 Physical Registers

24 In-Flight Instructions

25 Delay in re-insertion from SLIQ
SLIQ: 1024 entries

26 Towards an Affordable Kilo-Instruction Processor
- Adding Ephemeral Registers to the Out-of-Order Commit Processors
- Changing the SLIQ into a list of buckets of instructions
J. Martínez et al. “Ephemeral Registers”, Technical Report CSL-TR-2003-1035, 2003.

27 Putting It All Together
[Figure labels: Physical Registers, Virtual Registers, IQs of 128 entries, Memory Latency]

28 Conclusion
- To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained. The resources must be up-sized.
- The resources are underutilized.
- We present two techniques that reduce the need for resources, and we show their effectiveness:
  - Out-of-Order Commit
  - Slow Lane Instruction Queue

29 Thank you very much

30 Status of the removed instructions

31 Motivation III – Resources – Registers
Example: 2034 instructions DO NOT require 1200 registers
[Figure: State of Registers (ROB=2048, Mem 500 cycles)]

32 Motivation III – Resources – Load Queue
[Figure: State of LD Queues (ROB=2048, Mem 500 cycles); distribution of in-flight instructions in the LD queue versus number of instructions; categories: Dead, Blocked-Long, Blocked-Short, Replayable, Live]
Annotations: Checkpointing; Early Release

33 Motivation III – Resources – Store Queue
[Figure: State of ST Queues (ROB=2048, Mem 500 cycles); distribution of in-flight instructions in the ST queue versus number of instructions; categories: Ready, Address Ready, Blocked-Long, Blocked-Short]
Annotation: Locality

34 State of LD Queues (specInt, ROB=2048)
[Figure: distribution of in-flight instructions in the LD queue versus number of instructions (INT); categories: Dead, Blocked-Long, Blocked-Short, Replayable, Live]
Annotations: Checkpointing; Early Release

35 State of ST Queues (specFP, ROB=2048)
[Figure: distribution of in-flight instructions in the ST queue versus number of instructions (FP); categories: Ready, Address Ready, Blocked-Long, Blocked-Short]
Annotation: Locality

36 State of ST Queues (specInt, ROB=2048)
[Figure: distribution of in-flight instructions in the ST queue versus number of instructions (INT); categories: Ready, Address Ready, Blocked-Long, Blocked-Short]
Annotation: Locality

37 State of FP Queues (specFP, ROB=2048)
[Figure: distribution of in-flight instructions in the FP queue versus number of instructions; categories: Blocked-Long, Blocked-Short, Ready]
Annotations: Long/Short Lat. Inst.; Remove – Reinsert Dependence Chain

38 State of Int Queues (specInt, ROB=2048)
[Figure: distribution of in-flight instructions in the Int. queue versus number of instructions (INT); categories: Blocked-Long, Blocked-Short, Ready]
Annotations: Long/Short Lat. Inst.; Remove – Reinsert Dependence Chain

39 State of Registers (Int, ROB=2048)
[Figure: Int. registers versus number of in-flight instructions (SpecInt); x-axis labels: 10%, 25%, 50%, 75%, 90%; categories: Dead, Blocked-Long, Blocked-Short, Live]
Annotations: Early Release; Virtual Registers

40 Instructions in-flight (Int, ROB=2048)
[Figure: number of in-flight instructions (SpecInt); x-axis labels: 10%, 25%, 50%, 75%, 90%]
Annotation: Branches

