Out-of-Order Commit Processors Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC) HPCA-10, Madrid February 14-17 th.

Presentation transcript:

Out-of-Order Commit Processors
Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC)
HPCA-10, Madrid, February 14-17, 2004

2 Motivation I
[Figure: IPC as a function of the number of in-flight instructions (SpecFP), including an "L2 Perfect" series; annotations as extracted: "X", "3.5X".]
Cristal et al., “Large Virtual ROBs by Processor Checkpointing”, TR UPC-DAC, July 2002

3 Motivation II – Resources – ROB
[Figure: instructions in flight (ROB=2048, Mem 500 cycles); number of in-flight instructions (SpecFP), with marks at 10%, 25%, 50%, 75% and 90%.]
The ROB is often nearly full.
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

4 Motivation III – Resources – FP Queue
[Figure: state of FP queues (ROB=2048, Mem 500 cycles); distribution of in-flight instructions by number of instructions; categories: Blocked-Long, Blocked-Short, Ready.]
Annotations: Long/Short Lat. Inst.; Remove – Reinsert; Dependence Chain.
A. Cristal et al., “A case for resource-conscious out-of-order processors”, IEEE TCCA CA Letters, Vol. 2, Oct. 2003

5 Outline
- Motivation
- Out-of-Order Commit
- Multicheckpointing ROB
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusion

6 Out-of-Order Commit
[Diagram: instruction stream with checkpoints. Labels: Oldest Checkpoint, Checkpoint, New Checkpoint. Instructions as extracted: I5, Br 3, I6, Br 2, St, I4, I3, Ld, Br 1, I2, I1, Ld.]

7 Out-of-Order Commit
[Diagram: the same instruction stream, with the oldest checkpoint being committed. Labels: Oldest Checkpoint, Checkpoint, New Checkpoint, Store Buffer, To Memory, Gang Commit.]

8 Out-of-Order Commit
[Diagram: branch misprediction and recovery from a checkpoint. Labels: Miss Branch Prediction, Recover from Checkpoint, Oldest Checkpoint, Checkpoint, Store Buffer. Instructions as extracted: St, I4, I3, I5, Br 3, Br 2, St, I7, I8.]

9 Out-of-Order Commit II
- Checkpoint Table. Each entry has:
  - PC of the next instruction
  - Instruction Counter: counts the number of instructions still alive
  - Map Table: allows the register file state to be recovered
  - Pointer to the Store Buffer
- Mechanism to recover free registers: Future Free
  - One bit for each physical register
Large Virtual ROB: Tech. Rep. UPC-DAC
Ephemeral Registers: Tech. Rep UPC-DAC
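The entry layout listed on this slide can be written down as a small record. This is only a sketch: the field names, the use of a dictionary for the map table, and the helper `new_entry` are assumptions made for illustration, not names from the talk.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class CheckpointEntry:
    """One checkpoint-table entry, following the fields listed on the slide."""
    next_pc: int                    # PC of the next instruction
    map_table: Dict[str, int]       # snapshot: logical register -> physical register
    store_buffer_ptr: int           # index of this checkpoint's first store-buffer entry
    instr_counter: int = 0          # instructions of this checkpoint still alive
    future_free: List[bool] = field(default_factory=list)  # one bit per physical register


def new_entry(next_pc: int, map_table: Dict[str, int], store_buffer_ptr: int,
              num_phys_regs: int) -> CheckpointEntry:
    """Hypothetical helper: copy the map table and start with all bits clear."""
    return CheckpointEntry(
        next_pc=next_pc,
        map_table=dict(map_table),
        store_buffer_ptr=store_buffer_ptr,
        future_free=[False] * num_phys_regs,
    )


entry = new_entry(0x400, {"R1": 7, "R2": 3}, store_buffer_ptr=0, num_phys_regs=16)
print(entry.instr_counter, sum(entry.future_free))
```

Copying the map table matters in this model: the live table keeps changing as later instructions are renamed, while the snapshot must stay fixed.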

10 Checkpoint Creation
- Save the PC
- Save the Map Table
- Clear the Future Free bits
- Clear the Instruction Counter
- Get a pointer to the first free entry of the store buffer, and mark this entry in the store buffer.
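The five creation steps might look as follows in a toy software model. The store buffer is assumed here to be an append-only Python list, so its first free entry is simply `len(stores)`, and the marks are kept in a separate set; both choices, and all names, are illustrative assumptions.

```python
from typing import Dict, List, Set


def create_checkpoint(pc: int, map_table: Dict[str, int], stores: List[str],
                      marks: Set[int], num_phys_regs: int) -> dict:
    """Take a checkpoint following the steps on the slide (toy model)."""
    ptr = len(stores)   # first free store-buffer entry in this append-only model
    marks.add(ptr)      # mark that entry as the start of this checkpoint's stores
    return {
        "pc": pc,                                # save the PC
        "map_table": dict(map_table),            # save (copy) the map table
        "future_free": [False] * num_phys_regs,  # clear the Future Free bits
        "instr_counter": 0,                      # clear the instruction counter
        "store_ptr": ptr,                        # pointer into the store buffer
    }


stores = ["st A"]
marks: Set[int] = set()
ckpt = create_checkpoint(0x100, {"R1": 4}, stores, marks, num_phys_regs=8)
print(ckpt["store_ptr"], sorted(marks))
```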

11 Instruction Decode
- Add 1 to the Instruction Counter of the newest checkpoint
- R1 ← R2 op R3
- If R1 is mapped to PhyReg_N:
  - Set the PhyReg_N bit of the Future Free bit vector
  - Map R1 to the new physical register
- Associate the instruction with the last created checkpoint
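A minimal sketch of this decode step, assuming checkpoints are plain dictionaries, the newest checkpoint is the last element of a list, and free physical registers sit in a FIFO. The function name `decode` and the returned record are hypothetical.

```python
from collections import deque
from typing import Deque, Dict, List


def decode(dest: str, map_table: Dict[str, int], free_list: Deque[int],
           checkpoints: List[dict]) -> dict:
    """Rename the destination of 'dest <- src1 op src2' and update the newest checkpoint."""
    ckpt = checkpoints[-1]                # newest (last created) checkpoint
    ckpt["instr_counter"] += 1            # one more live instruction
    old = map_table.get(dest)
    if old is not None:
        ckpt["future_free"][old] = True   # previous mapping becomes freeable later
    new = free_list.popleft()             # assumes a free register is available
    map_table[dest] = new
    return {"dest": dest, "phys": new, "ckpt": ckpt}  # instruction tied to its checkpoint


checkpoints = [{"instr_counter": 0, "future_free": [False] * 8}]
map_table = {"R1": 2}
free_list = deque([5, 6])
instr = decode("R1", map_table, free_list, checkpoints)
print(instr["phys"], map_table["R1"], checkpoints[0]["future_free"][2])
```

The old physical register is only marked, not released, at this point; in this model it is released when the checkpoint is eliminated.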

12 Instruction Writeback
- Decrement the Instruction Counter of the checkpoint associated with the instruction
- If the instruction is a mispredicted branch, recover from the associated checkpoint:
  - Fetch instructions from the saved PC
  - Release all entries in the store buffer from the pointed entry onward
  - Free all registers in the Future Free vector of the entry and of all newer checkpoint entries
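The writeback rule can be sketched as below, under several assumptions: checkpoints are dictionaries kept oldest-first in a list, the store buffer is a Python list, and rolling back discards the newer checkpoints. Register reclamation on rollback is deliberately left out, since the slide gives only a one-line rule for it; how the recovered checkpoint's own counter is re-initialized is also not modeled.

```python
from typing import List, Optional


def writeback(instr: dict, checkpoints: List[dict], stores: List[str],
              mispredicted_branch: bool = False) -> Optional[dict]:
    """Decrement the owning checkpoint's counter; on a misprediction, roll back to it."""
    ckpt = instr["ckpt"]
    ckpt["instr_counter"] -= 1
    if not mispredicted_branch:
        return None
    # Locate the checkpoint by identity and drop every newer one (assumption).
    pos = next(i for i, c in enumerate(checkpoints) if c is ckpt)
    squashed = len(checkpoints) - (pos + 1)
    del checkpoints[pos + 1:]
    # Release all store-buffer entries from the pointed entry onward.
    released = max(0, len(stores) - ckpt["store_ptr"])
    del stores[ckpt["store_ptr"]:]
    return {
        "fetch_pc": ckpt["pc"],                 # restart fetch from the saved PC
        "map_table": dict(ckpt["map_table"]),   # mapping to restore
        "released_stores": released,
        "squashed_checkpoints": squashed,
    }


c0 = {"pc": 0x100, "map_table": {"R1": 2}, "store_ptr": 0, "instr_counter": 1}
c1 = {"pc": 0x140, "map_table": {"R1": 5}, "store_ptr": 1, "instr_counter": 2}
c2 = {"pc": 0x180, "map_table": {"R1": 6}, "store_ptr": 2, "instr_counter": 1}
checkpoints = [c0, c1, c2]
stores = ["st A", "st B", "st C"]
action = writeback({"ckpt": c1}, checkpoints, stores, mispredicted_branch=True)
print(hex(action["fetch_pc"]), stores, len(checkpoints))
```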

13 Checkpoint Elimination
- If the Instruction Counter is 0 and this is the oldest checkpoint, then the checkpoint is removed:
  - Clear the corresponding mark in the store buffer
  - Free the registers marked in the Future Free vector
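Checkpoint elimination could be modeled as follows. The sketch applies the slide's rule repeatedly from the oldest checkpoint, which is an assumption; whether the newest checkpoint may also be removed is not stated on the slide and is not guarded against here. Data layout and names are illustrative.

```python
from typing import List, Set


def retire_checkpoints(checkpoints: List[dict], marks: Set[int],
                       free_list: List[int]) -> int:
    """Remove oldest checkpoints whose counters reached 0; return how many were removed."""
    retired = 0
    while checkpoints and checkpoints[0]["instr_counter"] == 0:
        ckpt = checkpoints.pop(0)                  # oldest checkpoint is at index 0
        marks.discard(ckpt["store_ptr"])           # clear its mark in the store buffer
        for reg, bit in enumerate(ckpt["future_free"]):
            if bit:
                free_list.append(reg)              # release registers marked Future Free
        retired += 1
    return retired


c0 = {"instr_counter": 0, "store_ptr": 0, "future_free": [False, True, False, True]}
c1 = {"instr_counter": 2, "store_ptr": 3, "future_free": [True, False, False, False]}
checkpoints, marks, free_list = [c0, c1], {0, 3}, []
print(retire_checkpoints(checkpoints, marks, free_list), free_list, sorted(marks))
```

A younger checkpoint whose counter is already 0 still waits behind an older one that is not finished, which is what keeps the committed state consistent in this model.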

14 Outline
- Motivation
- Out-of-Order Commit
- Slow Lane Instruction Queue
- Performance Evaluation
- Conclusions

15 Slow Lane Instruction Queue
[Diagram: Pseudo ROB contents "Ld x x x a x x x b x" with a Data Dependence marker; blocks: Load/Store Queue, Instruction Queue, Slow Lane Instruction Queue; entries: LD, a, b.]

16 Slow Lane Instruction Queue
[Diagram: next animation step of the same figure, with the same labels.]

17 Slow Lane Instruction Queue
[Diagram: same figure with the additional labels "Load End" and "Begin reinsert".]

18 Slow Lane Instruction Queue II
- Very simple buffer: the Slow Lane Instruction Queue (SLIQ)
- Each load that misses in L2 has a pointer to an entry in the SLIQ
- Pseudo ROB

19 Slow Lane Instruction Queue III
- When an instruction is retired from the Pseudo ROB, its state is checked:
  - If the instruction is a load miss, the pointer is written
  - If the instruction depends on a long-latency instruction, it is moved to the SLIQ
- When a load that missed in L2 finishes its execution:
  - The SLIQ is traversed from the instruction pointed to by the load, if this point is older than the current traversal position
  - The load's dependent instructions are reinserted into the IQ
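The two SLIQ rules above can be illustrated with a toy model. It assumes instructions are already renamed, so every `dest` is a unique tag, and it leaves holes (`None`) in the SLIQ so that the pointers of other loads stay valid. The check against an already running traversal is not modeled, and an instruction that also depends on another outstanding miss is still reinserted, since the slides do not say how that case is handled. All names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def retire_from_pseudo_rob(instr: dict, sliq: List[Optional[dict]],
                           slow_tags: Set[str], load_ptr: Dict[str, int]) -> str:
    """Classify an instruction leaving the pseudo ROB."""
    if instr.get("l2_miss"):
        load_ptr[instr["name"]] = len(sliq)   # the pointer is written
        slow_tags.add(instr["dest"])
        return "load-miss"
    if slow_tags & set(instr["srcs"]):        # depends on a long-latency instruction
        sliq.append(instr)                    # moved to the SLIQ
        slow_tags.add(instr["dest"])
        return "sliq"
    return "iq"                               # short-latency: stays in the normal IQ


def load_completed(load_name: str, load_dest: str, sliq: List[Optional[dict]],
                   slow_tags: Set[str], load_ptr: Dict[str, int],
                   iq: List[dict]) -> None:
    """Traverse the SLIQ from the load's pointer and reinsert its dependents into the IQ."""
    woken = {load_dest}
    for pos in range(load_ptr.pop(load_name), len(sliq)):
        instr = sliq[pos]
        if instr is not None and woken & set(instr["srcs"]):
            woken.add(instr["dest"])          # its consumers are dependents too
            iq.append(instr)
            sliq[pos] = None                  # leave a hole; indices stay stable
    slow_tags.difference_update(woken)


program = [
    {"name": "Ld", "dest": "p1", "srcs": ["p0"], "l2_miss": True},
    {"name": "x1", "dest": "p2", "srcs": ["p0"]},
    {"name": "a", "dest": "p3", "srcs": ["p1"]},
    {"name": "x2", "dest": "p4", "srcs": ["p2"]},
    {"name": "b", "dest": "p5", "srcs": ["p3", "p4"]},
]
sliq, slow_tags, load_ptr, iq = [], set(), {}, []
for ins in program:
    retire_from_pseudo_rob(ins, sliq, slow_tags, load_ptr)
load_completed("Ld", "p1", sliq, slow_tags, load_ptr, iq)
print([i["name"] for i in iq])
```

In this run only `a` and `b` enter the SLIQ, mirroring the "LD a b" picture on the earlier slides, and both return to the IQ once the load completes.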

20 Performance Evaluation
Processor Configuration (Baseline 4096):
- Fetch/Commit width: 4
- Branch Predictor: 16K entries, Gshare
- Instruction L1: 32 KB, 4-way, 32-byte line, 2 cycles
- Data L1: 32 KB, 4-way, 32-byte line, 2 cycles
- L2 size: 512 KB, 4-way, 64-byte line, 10 cycles
- Memory Latency: 1000 cycles
- Physical Registers: 4096 entries
- Load/Store Queue: 4096 entries
- Reorder Buffer: 4096 entries
- Integer General Units: 4 (lat/rep 1/1)
- Integer Mult/Div Units: 2 (lat/rep 3/1 and 20/20)
- FP Functional Units: 4 (lat/rep 2/1)
- FP Mult/Div/Sqrt Units: 2 (lat/rep 4/1, 12/12, 24/24)

21 Performance Evaluation - Some Considerations
- We mix both models.
- The processor takes checkpoints when instructions are retired from the pseudo ROB.
- Many branches are resolved by this time, so the probability of rolling back to a checkpoint is reduced.
- If a mispredicted branch is detected in the pseudo ROB, a normal rollback mechanism is used.

22 IPC – Different Configurations

23 Number of Checkpoints and Performance
[Figure. Labels as extracted: Baseline: 2048 IQ; SLIQ 2048 entries and 128 IQ; Physical Registers.]

24 In-Flight Instructions

25 Delay in re-insertion from SLIQ
[Figure. SLIQ: 1024 entries.]

26 Towards an Affordable Kilo-Instruction Processor
- Adding Ephemeral Registers to Out-of-Order Commit Processors
- Changing the SLIQ into a list of buckets of instructions
J. Martínez et al., “Ephemeral Registers”, Technical Report CSL-TR , 2003.

27 Putting It All Together
[Figure. Labels: Physical Registers, Virtual Registers, IQs of 128 entries, Memory Latency.]

28 Conclusion
- To tolerate increasing memory latencies in floating-point applications, a large number of in-flight instructions must be maintained, so the resources must be up-sized.
- The resources are underutilized.
- We present two techniques to reduce the need for resources and show their effectiveness:
  - Out-of-Order Commit
  - Slow Lane Instruction Queue

29 Thank you very much

30 Status of the removed instructions

31 Motivation III – Resources – Registers
[Figure: state of registers (ROB=2048, Mem 500 cycles).]
Example: 2034 instructions DO NOT require 1200 registers

32 Motivation III – Resources – Load Queue
[Figure: state of LD queues (ROB=2048, Mem 500 cycles); distribution of in-flight instructions by number of instructions; categories: Dead, Blocked-Long, Blocked-Short, Replayable, Live.]
Annotations: Checkpointing; Early Release.

33 Motivation III – Resources – Store Queue
[Figure: state of ST queues (ROB=2048, Mem 500 cycles); distribution of in-flight instructions by number of instructions; categories: Ready, Address Ready, Blocked-Long, Blocked-Short.]
Annotation: Locality.

34 State of LD Queues (specInt, ROB=2048)
[Figure: LD queue, INT; distribution of in-flight instructions by number of instructions; categories: Dead, Blocked-Long, Blocked-Short, Replayable, Live.]
Annotations: Checkpointing; Early Release.

35 State of ST Queues (specFP, ROB=2048)
[Figure: ST queue, FP; distribution of in-flight instructions by number of instructions; categories: Ready, Address Ready, Blocked-Long, Blocked-Short.]
Annotation: Locality.

36 State of ST Queues (specInt, ROB=2048)
[Figure: ST queue, INT; distribution of in-flight instructions by number of instructions; categories: Ready, Address Ready, Blocked-Long, Blocked-Short.]
Annotation: Locality.

37 State of FP Queues (specFP, ROB=2048)
[Figure: FP queue; distribution of in-flight instructions by number of instructions; categories: Blocked-Long, Blocked-Short, Ready.]
Annotations: Long/Short Lat. Inst.; Remove – Reinsert; Dependence Chain.

38 State of Int Queues (specInt, ROB=2048)
[Figure: Int. queue, INT; distribution of in-flight instructions by number of instructions; categories: Blocked-Long, Blocked-Short, Ready.]
Annotations: Long/Short Lat. Inst.; Remove – Reinsert; Dependence Chain.

39 State of Registers (Int, ROB=2048)
[Figure: Int. registers versus number of in-flight instructions (SpecInt), with marks at 10%, 25%, 50%, 75% and 90%; categories: Dead, Blocked-Long, Blocked-Short, Live.]
Annotations: Early Release; Virtual Registers.

40 Instructions in-flight (Int, ROB=2048)
[Figure: number of in-flight instructions (SpecInt), with marks at 10%, 25%, 50%, 75% and 90%; label: Branches.]