Amir Roth and Gurindar S. Sohi University of Wisconsin-Madison

Register Integration: A Simple and Efficient Implementation of Squash Reuse Amir Roth and Gurindar S. Sohi University of Wisconsin-Madison MICRO-33 Dec. 12, 2000

"Register Integration". Amir Roth, MICRO-33

Parsing the Title
- Squash: mis-speculation? Abort sequentially later work, fix up, resume
  - Problem: re-execute (squashed) mis-speculation-independent work
- Reuse: salvage useful squashed results, don’t re-execute instructions
  - Implementation: reuse by writing saved value into register
  - Determine instruction reusability: value-comparison/invalidation
- Register Integration: “recognize” and “un-squash” results from physical register file
  - Efficient: more natural “fit” for squash reuse
  - Simple: no need to read/write register values

Talk Outline
- Motivation and logical basis
- Working example
- Some implementation details
- Short performance evaluation

Motivation
- Assume Unified Physical Register File (PRF)
  - Logical Register Map (LRM) sequentially “manages” PRF
- Conventional mis-speculation recovery
  - PR values intact
  - LRM restored to prior state, PR’s become “garbage”
- “Conventional” reuse
  - Allocate new PR, write value into it
- Register Integration: why write? Value is already in PR
  - To reuse: allocate PR holding squashed result to new instruction
  - Modify register-renaming to do this

Logical Basis for Integration
- Key: must locate PR holding squashed value
  - Use a second mapping of PRF
- A second LRM? No
  - Implicitly sequential, can’t be “searched” using right criteria
- Integration Table (IT): describe each PR using creating instruction
  - Operation (PC) and input PR’s
  - Valid after squash (valid always)
  - Encodes “reusability criteria”
- Renaming + Integration
  - Rename an instruction, use LRM to find input PR’s
  - Search IT for PR created by same instr. (PC) with same input PR’s
  - Find one? Inputs haven’t changed since squash! Integrate!
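The renaming-plus-integration steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' hardware design: the names `Renamer`, `ITEntry`, `rename`, `squash`, and `demo` are hypothetical, the IT is an unbounded list rather than a PC-indexed table, PRs are never freed, and the starting PR number 48 is arbitrary. The instruction sequence follows the working example (X = 1; Y = 2; if (!X) Y = 3; X++; Y++; X++), assuming the corrected path resumes at A5.

```python
from dataclasses import dataclass


@dataclass
class ITEntry:
    pc: str                 # static instruction that created the PR
    inputs: tuple           # input PR numbers at rename time
    out: int                # output PR
    eligible: bool = False  # E bit: set on squash, cleared on integration


class Renamer:
    """Toy renamer (hypothetical): LRM plus an unbounded Integration Table."""

    def __init__(self, first_free_pr=48):
        self.lrm = {}                 # logical register -> physical register
        self.next_pr = first_free_pr  # assumption: PRs are never recycled
        self.it = []

    def rename(self, pc, dest, srcs):
        inputs = tuple(self.lrm[s] for s in srcs)
        # Integration test: same PC, same input PRs, entry eligible.
        for entry in self.it:
            if entry.eligible and entry.pc == pc and entry.inputs == inputs:
                entry.eligible = False          # IT disable
                self.lrm[dest] = entry.out      # reuse the squashed PR
                return entry.out, True
        out = self.next_pr                      # no match: allocate a new PR
        self.next_pr += 1
        self.it.append(ITEntry(pc, inputs, out))  # IT enter
        self.lrm[dest] = out
        return out, False

    def checkpoint(self):
        return dict(self.lrm)

    def squash(self, checkpoint, squashed_prs):
        self.lrm = dict(checkpoint)   # LRM restored; PR values stay intact
        for entry in self.it:
            if entry.out in squashed_prs:
                entry.eligible = True  # IT enable


def demo():
    r = Renamer()
    r.rename("A1", "X", [])            # X = 1
    r.rename("A2", "Y", [])            # Y = 2
    cp = r.checkpoint()                # branch A3, mis-predicted
    wrong = [r.rename("A4", "Y", [])[0],      # Y = 3
             r.rename("A5", "X", ["X"])[0],   # X++
             r.rename("A6", "Y", ["Y"])[0],   # Y++
             r.rename("A7", "X", ["X"])[0]]   # X++
    r.squash(cp, set(wrong))
    # Corrected path: A5, A6, A7 (A4 is skipped).
    return [r.rename("A5", "X", ["X"]),
            r.rename("A6", "Y", ["Y"]),
            r.rename("A7", "X", ["X"])]


if __name__ == "__main__":
    print(demo())  # [(51, True), (54, False), (53, True)]
```

In this sketch A5 and A7 integrate because their input PRs are unchanged, while A6 does not: its input Y was produced by the squashed Y = 3, so the LRM now maps Y to a different PR and the IT search fails without any value comparison.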

1 Picture == 4KB

[Figure: working example. Columns: instruction PC, dynamic instructions, LRM (X, Y), IT (PC, I1, I2, O, E), comment.]

Events, in order:
- A1 (X = 1): Alloc/IT enter
- A2 (Y = 2): Alloc/IT enter
- A3 (if (!X)): Predict taken/IT enter
- A4 (Y = 3): Alloc/IT enter
- A5 (X++): Alloc/IT enter
- A6 (Y++): Alloc/IT enter
- A7 (X++): Alloc/IT enter
- Squash/IT enable
- A5 (X++): Integrate/IT disable
- A6 (Y++): No/Alloc/IT enter
- A7 (X++): Integrate/IT disable

E = Eligible (can be integrated)
PR cannot simultaneously be mapped by two active instructions

The Tao of Integration
- Definition of “reusable” instruction: inputs unchanged since squash
  - Exactly the information IT encodes
- PR tags naturally track data-dependences (input changes)
  - Instructions integrated iff data-dependences intact
- No need to read/compare values to perform reusability test
- No separate invalidation/dependence-tracking mechanism

What Integration (Reuse) Accomplishes
- Improved performance (first-order effects)
  - Integrated instructions are complete*
  - Collapses data dependences
    - Chains of dependent instr’s can be integrated in a single cycle
  - Integrated mis-predicted branch: recovery begins immediately
- Reduced resource consumption/contention
  - No reservation-stations/scheduling/execution/writeback
  - Faster branch resolution reduces fetch demand

*Choose to integrate only completed instructions
  - Simplifies things, doesn’t reduce benefit

Implementation Details
- Requirements of base microarchitecture
  - Unified PRF
  - Support for load speculation (see why soon)
- Changes/Additions
  - IT
  - Integration circuit (added to renaming, next slide)
  - More PR’s (keep squashed results alive longer)
  - Data-paths to LoadQ, StoreQ (see why soon)
- Non-changes
  - No datapaths to read/write PRF

Integration Circuit

[Figure: integration circuit. Blocks shown: LRM, FreeList, IT. Fields: PC, I1, I2, O, E. Example: dynamic instruction A5.]

Other Implementation Issues
- Superscalar integration? Sure
  - Same parallel prefix formulation as “plain” renaming
  - Check N^2 dependences for N instructions (PR, not LR)
  - N^2 M^2 if IT is M-way set-associative
- Integrating loads
  - PC + PR’s not enough, previous stores are implicit inputs
  - Mis-integration: load integrated despite conflicting store
  - Add address/value fields to IT, save-from/restore-to LoadQ
  - Load speculation mechanism handles conflict after integration
  - “Snoop” IT for conflicts before integration
- More details in paper
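The load mis-integration hazard can be illustrated with a small sketch. It shows one possible reading of "snoop IT for conflicts before integration", assumed here rather than taken from the slides: each load entry carries an address field, and a store to a matching address clears the entry's eligibility so the load is re-executed instead of integrated. The names `LoadITEntry`, `snoop_store`, and `find_integrable` are hypothetical, and accesses are assumed to be same-sized and aligned so that address equality is the whole conflict test.

```python
from dataclasses import dataclass


@dataclass
class LoadITEntry:
    pc: str                # static load instruction
    inputs: tuple          # input PRs (address operands)
    out: int               # PR holding the squashed load result
    addr: int              # assumed extra IT field: effective address
    eligible: bool = True  # squashed and still safe to integrate


def snoop_store(it, store_addr):
    """A store snoops the IT: disable eligible load entries it conflicts with.

    Returns the number of entries disabled.
    """
    disabled = 0
    for entry in it:
        if entry.eligible and entry.addr == store_addr:
            entry.eligible = False
            disabled += 1
    return disabled


def find_integrable(it, pc, inputs):
    """Return the output PR of a matching eligible entry, or None."""
    for entry in it:
        if entry.eligible and entry.pc == pc and entry.inputs == inputs:
            return entry.out
    return None


if __name__ == "__main__":
    it = [LoadITEntry("L1", (48,), 60, addr=0x100),
          LoadITEntry("L2", (48,), 61, addr=0x200)]
    snoop_store(it, 0x100)                   # conflicting store to L1's address
    print(find_integrable(it, "L1", (48,)))  # None: must re-execute
    print(find_integrable(it, "L2", (48,)))  # 61: still integrable
```

The alternative on the slide, letting the load speculation mechanism catch the conflict after integration, would skip the snoop and instead rely on the existing recovery path.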

Performance Evaluation
- SPEC2000 benchmarks, Alpha EV6, -O2 -fast
- SimpleScalar simulator
  - 8-wide superscalar, OoO, speculative, load speculation
  - 256-entry, direct-mapped IT, #PR’s = 64+ROB+256
  - 32KB 2-way I-Cache, 64KB 2-way D-Cache, 1MB 4-way L2
- 2 base pipeline configurations
  - Current-generation:
    - 128 ROB (448 PR’s), 64 LoadQ, 32 StoreQ
    - Pipe: 3 fetch, 2 decode/rename, 2 schedule/reg-read, 3 load
  - Next-generation: (faster clock, 2MB L2)
    - 256 ROB (576 PR’s), 128 LoadQ, 64 StoreQ
    - Pipe: 5 fetch, 3 decode/rename, 4 schedule/reg-read, 4 load
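The PR counts quoted for the two configurations follow directly from the sizing rule #PR’s = 64+ROB+256. A short check, where the function name `num_prs` and the parameter names are hypothetical, and reading the constants as 64 architectural registers and 256 extra PRs matching the IT size is an assumption:

```python
def num_prs(rob_entries, base_regs=64, extra_regs=256):
    """Apply the slide's sizing rule: #PRs = 64 + ROB + 256."""
    return base_regs + rob_entries + extra_regs


if __name__ == "__main__":
    print(num_prs(128))  # 448, current-generation configuration
    print(num_prs(256))  # 576, next-generation configuration
```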

Performance vs. Base Microarchitecture
- Integration more effective as microarchitecture more aggressive
  - More speculative buffering + longer pipe: more instructions completed along mis-speculated paths, so more integrated instructions
  - Deeper pipeline: each integrated instruction saves more work

A Closer Look
- Current generation microarchitecture, every second benchmark

                             Vpr    Mcf    Parser  Perl   Vortex  Twolf
Integrated/committed (%)     15.9   6.1    6.5     4.7    1.6     8.6
Integrated/squashed (%)      46.7   24.0   28.3    22.4   7.3     41.4
Executed instr. saved (%)    15.3   7.0    5.6     4.4    15.1    9.2

Rows with missing cells, values in source order:
Fetched instr. saved (%)     6.6 3.7 1.9 1.1 4.8
Execution time saved (%)     8.1 0.9 3.1

- 4-15% reduction in instructions executed, 1-7% in fetched
- Performance correlated with fetch reduction
  - Integrated instructions still fetched (leave “bubbles”)
- Some other results
  - IT size matters a little, IT associativity less (thankfully)

Summary
- Integration: new implementation of squash reuse
  - Based on data-dependences, not values/invalidations
- Reuse: improves performance, reduces resource contention
- Simple: requires only LRM manipulations, no PR reads/writes
- Efficient: implementation matches definition of reuse