Out-of-Order Commit Processor
Adrian Cristal, Daniel Ortega, Josep Llosa and Mateo Valero
Performance Limiting Factors
- Widening gap between memory and processor performance
- Increasing wire delays
High Memory Latency
Current solutions:
- Multi-level cache hierarchy
- Large number of in-flight instructions, which requires a larger ROB, register file, load/store queue and instruction queue
[Figure: ROB contents, with LD R1, 0(R3) marked as a cache miss, followed by DADDI R2, R4 #2 and further entries]
Motivation
Goal of the paper
- Support a large number of in-flight instructions without enlarging the ROB and the instruction queue
- Techniques: out-of-order commit and Slow Lane Instruction Queuing (SLIQ)
Re-Order Buffer
- Commits in order, to handle precise interrupts
- Controls exactly when stores can write to memory
- Frees physical registers
- Enables the processor to recover from branch mispredictions
- Keeps track of all in-flight instructions
- A large number of in-flight instructions needs a huge ROB structure, which limits cycle time
Checkpointing instead of ROB
Implementation
- CAM (content-addressable memory) register mapping
- Inclusion of a future-free bit, used for freeing physical registers
- Free list, used for choosing a free register
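The three structures listed above can be sketched as follows. This is a minimal illustration, not the authors' design: the class and method names are hypothetical, and the exact moment at which future-free registers are released (here, an explicit call standing in for a checkpoint commit) is an assumption.

```python
class CamRenameMap:
    """Toy CAM-style rename map: one entry per physical register."""

    def __init__(self, num_logical, num_physical):
        assert num_physical >= num_logical
        self.logical_of = [None] * num_physical    # logical register held by each entry
        self.valid = [False] * num_physical        # entry is the current mapping
        self.future_free = [False] * num_physical  # entry can be freed later
        # Assumption: logical register r starts out mapped to physical register r.
        for r in range(num_logical):
            self.logical_of[r] = r
            self.valid[r] = True
        self.free_list = list(range(num_logical, num_physical))

    def lookup(self, logical):
        # A CAM searches all entries at once; a loop stands in for that here.
        for p, is_valid in enumerate(self.valid):
            if is_valid and self.logical_of[p] == logical:
                return p
        raise KeyError(logical)

    def rename_dest(self, logical):
        """Give `logical` a new physical register taken from the free list."""
        old = self.lookup(logical)
        new = self.free_list.pop(0)
        self.valid[old] = False
        self.future_free[old] = True   # old mapping is freed later, not now
        self.logical_of[new] = logical
        self.valid[new] = True
        return new

    def release_future_free(self):
        """Return future-free registers to the free list (assumed commit point)."""
        freed = [p for p, flag in enumerate(self.future_free) if flag]
        for p in freed:
            self.future_free[p] = False
            self.logical_of[p] = None
            self.free_list.append(p)
        return freed
```

With 4 logical and 8 physical registers, renaming logical register 1 takes physical register 4 from the free list and marks physical register 1 as future free.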
Checkpointing
Operation
Checkpoint
- Valid bits
- Future free bits
- Number of (active) instructions in that checkpoint
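A checkpoint record with these three fields might look like the sketch below. The names are hypothetical, and the commit rule is an assumption made for illustration: the oldest checkpoint commits once its instruction counter reaches zero and a younger checkpoint exists, at which point its future-free registers are released.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Checkpoint:
    valid_bits: list   # snapshot of the rename map's valid bits
    future_free: list  # registers to free when this checkpoint commits
    pending: int = 0   # instructions of this checkpoint still in flight


class CheckpointTable:
    def __init__(self):
        self.entries = deque()  # oldest checkpoint at the left

    def take(self, valid_bits):
        cp = Checkpoint(list(valid_bits), [False] * len(valid_bits))
        self.entries.append(cp)
        return cp

    def commit_ready(self):
        """Commit the oldest checkpoints whose instructions have all finished.

        Instructions inside a checkpoint may finish in any order; only the
        counter matters. Returns the physical registers that were freed.
        """
        freed = []
        # Assumption: the youngest checkpoint is never committed, so one
        # recovery point always remains.
        while len(self.entries) > 1 and self.entries[0].pending == 0:
            cp = self.entries.popleft()
            freed.extend(p for p, flag in enumerate(cp.future_free) if flag)
        return freed
```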
Heuristic for taking checkpoints
- At the first branch after 64 instructions
- Every 512 instructions
- After 64 stores
- After flushing the pipeline
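One way to read this heuristic is as a small counter-based policy, sketched below. The class name and the reading of each threshold as "counted since the last checkpoint" are assumptions; the slide gives only the four conditions.

```python
class CheckpointPolicy:
    """Decides, per decoded instruction, whether to take a checkpoint first."""

    def __init__(self, branch_after=64, max_instrs=512, max_stores=64):
        self.branch_after = branch_after
        self.max_instrs = max_instrs
        self.max_stores = max_stores
        self.instrs = 0  # instructions since the last checkpoint
        self.stores = 0  # stores since the last checkpoint

    def observe(self, is_branch=False, is_store=False, after_flush=False):
        """Return True if a checkpoint is taken at this instruction."""
        take = (
            after_flush
            or (is_branch and self.instrs >= self.branch_after)
            or self.instrs >= self.max_instrs
            or self.stores >= self.max_stores
        )
        if take:
            self.instrs = 0
            self.stores = 0
        # The current instruction belongs to the (possibly new) checkpoint.
        self.instrs += 1
        if is_store:
            self.stores += 1
        return take
```

For example, after 64 instructions without a checkpoint, the next branch triggers one; with no branches at all, the 513th instruction does.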
Slow Lane Instruction Queuing
- Identify instructions that will take a long time to execute
- Hold them in a secondary buffer until they are ready
- An alternative paper treats these as critical instructions and puts them in the fast queue
SLIQ
- Pseudo-ROB for finding long-latency instructions
- Slow queue to store the long-latency instructions
- A 32-bit register, one bit for each of the 32 logical registers, to keep track of dependences
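The 32-bit dependence register can be illustrated with a bit mask over logical registers. The sketch below is an assumption about how such tracking could work, not a description of the exact hardware: a long-latency load sets the bit of its destination, any instruction reading a marked register is sent to the slow queue and marks its own destination, and an independent instruction clears the bit of the register it overwrites. The function name and the instruction encoding `(dest, sources)` are hypothetical.

```python
def split_slow_lane(instrs, long_latency_loads):
    """Pick the instructions that go to the slow queue.

    instrs: list of (dest, sources) with logical register numbers 0..31;
            dest may be None for instructions that write no register.
    long_latency_loads: set of indices into instrs known to be slow loads.
    Returns (indices sent to the slow queue, final dependence mask).
    """
    mask = 0  # bit r set: logical register r depends on a slow load
    slow = []
    for idx, (dest, srcs) in enumerate(instrs):
        depends = any((mask >> s) & 1 for s in srcs)
        if idx in long_latency_loads or depends:
            slow.append(idx)
            if dest is not None:
                mask |= 1 << dest      # result is also late
        elif dest is not None:
            mask &= ~(1 << dest)       # register redefined by a fast instruction
    return slow, mask
```

With a slow load writing R1, a later instruction that reads R1 goes to the slow queue, while one that reads R1 only after an independent instruction has overwritten it does not.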
Wakening of instructions in SLIQ
- Every long-latency load is stored in the SLIQ along with its destination register
- Wakening proceeds at a pace of four instructions per cycle
[Figure: SLIQ contents: LD R1, 0(R3); DADDI R2, R4 #2; …; New Load; …]
Baseline Processor Configuration
RESULTS
Effect of delay in re-insertion
- Clearly shows that the program is highly parallel
- What about integer programs?
Number of In-flight instructions
Results
Ephemeral Registers
- Conventional scheme
- Virtual-physical registers
- Early release
- Ephemeral registers
Early Release
- Early release of registers needs a pending counter for each register
- When an instruction is decoded, the pending counter of each of its source registers is incremented; when the instruction is issued, those counters are decremented
- Instructions on a wrong path are nullified but still issued, in order to maintain the pending counters
- Coupled with the renaming logic (CAM map table scheme)
- A register can be freed if it is not referenced in any map table and its pending counter is zero
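The pending-counter rule can be sketched in a few lines. The class name and the use of a per-register count of map-table references are assumptions for illustration; the slide only states the two conditions under which a register may be freed.

```python
class EarlyReleaseTracker:
    def __init__(self, num_physical):
        self.pending = [0] * num_physical   # decoded but not yet issued readers
        self.map_refs = [0] * num_physical  # map tables (current or checkpointed) naming the register

    def on_decode(self, src_regs):
        for p in src_regs:
            self.pending[p] += 1

    def on_issue(self, src_regs):
        # Wrong-path instructions must pass through here too, otherwise
        # their increments would never be undone.
        for p in src_regs:
            assert self.pending[p] > 0
            self.pending[p] -= 1

    def add_map_ref(self, p):
        self.map_refs[p] += 1

    def drop_map_ref(self, p):
        assert self.map_refs[p] > 0
        self.map_refs[p] -= 1

    def can_free(self, p):
        return self.map_refs[p] == 0 and self.pending[p] == 0
```

A register that has left every map table still cannot be freed while a decoded reader is waiting to issue; once that reader issues, `can_free` returns True.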
Virtual Registers
- Decouple renaming from physical register allocation
- Require two map tables: GMT (General Map Table) and PMT (Physical Map Table)
- PMT: a new table that maps virtual registers to physical registers
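The two-table arrangement can be pictured as below. This is a sketch under stated assumptions: the class and method names are hypothetical, virtual tags are modeled as an unbounded counter, and the point at which a physical register is finally allocated is left to the caller because the slide does not specify it.

```python
class VirtualRename:
    def __init__(self, num_physical):
        self.gmt = {}   # GMT: logical register -> virtual tag
        self.pmt = {}   # PMT: virtual tag -> physical register
        self.next_virtual = 0
        self.free_physical = list(range(num_physical))

    def rename_dest(self, logical):
        """Rename at decode: consumes only a virtual tag, no physical register."""
        v = self.next_virtual
        self.next_virtual += 1
        self.gmt[logical] = v
        return v

    def allocate_physical(self, virtual):
        """Bind a physical register later, when the value actually needs storage."""
        p = self.free_physical.pop(0)
        self.pmt[virtual] = p
        return p

    def lookup_source(self, logical):
        """Return (virtual tag, physical register or None if not yet allocated)."""
        v = self.gmt[logical]
        return v, self.pmt.get(v)
```

Renaming a destination leaves the pool of free physical registers untouched; the pool shrinks only when `allocate_physical` is called.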
Putting it together
Analysis
- How efficient are these methods for integer programs, which have
  - very little parallelism
  - very poor branch prediction accuracy
  - a lengthy critical path
- How scalable is the CAM scheme they have used for future processors with hundreds of physical registers running at very high clock speeds?
- What is the impact of these techniques on power?