Lecture 9: R10K scheme, Tclk

Slides:

Advertisements

Similar presentations

Final touches on Out-of-Order execution Review Superscalar Looking back Looking forward.

Advertisements

Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.

CS 6290 Instruction Level Parallelism. Instruction Level Parallelism (ILP) Basic idea: Execute several instructions in parallel We already do pipelining…

CS6290 Speculation Recovery. Loose Ends Up to now: –Techniques for handling register dependencies Register renaming for WAR, WAW Tomasulo’s algorithm.

1 Reducing Datapath Energy Through the Isolation of Short-Lived Operands Dmitry Ponomarev, Gurhan Kucuk, Oguz Ergin, Kanad Ghose Department of Computer.

Lecture 7: Register Renaming. 2 A: R1 = R2 + R3 B: R4 = R1 * R R1 R2 R3 R4 Read-After-Write A A B B

EECS 470 ILP and Exceptions Lecture 7 Coverage: Chapter 3.

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

EECS 470 Lecture 8 RS/ROB examples True Physical Registers? Project.

EECS 470 Lecture 6 Branches: Address prediction and recovery (And interrupt recovery too.)

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

Out-of-Order Machine State Instruction Sequence: Inorder State: Look-ahead State: Architectural State: R3  A R7  B R8  C R7  D R4  E R3  F R8  G.

1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)

CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.

EECS 470 Lecture 7 Branches: Address prediction and recovery (And interrupt recovery too.)

Final touches on Out-of-Order execution Review Tclk Superscalar Looking back Looking forward.

Computer Architecture 2011 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

EECS 470 Register Renaming Lecture 8 Coverage: Chapter 3.

Computer Architecture 2011 – out-of-order execution (lec 7) 1 Computer Architecture Out-of-order execution By Dan Tsafrir, 11/4/2011 Presentation based.

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

Trace Caches J. Nelson Amaral. Difficulties to Instruction Fetching Where to fetch the next instruction from? – Use branch prediction Sometimes there.

EECS 470 Superscalar Architectures and the Pentium 4 Lecture 12.

Reducing the Complexity of the Register File in Dynamic Superscalar Processors Rajeev Balasubramonian, Sandhya Dwarkadas, and David H. Albonesi In Proceedings.

1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.

Lecture 16: Basic CPU Design

Computer Architecture 2010 – Out-Of-Order Execution 1 Computer Architecture Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

CS 7810 Lecture 21 Threaded Multiple Path Execution S. Wallace, B. Calder, D. Tullsen Proceedings of ISCA-25 June 1998.

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

Final touches on Out-of-Order execution Review Tclk Superscalar Looking back Looking forward.

Finishing out EECS 470 A few snapshots of the real world.

Out-of-Order Commit Processors Adrián Cristal (UPC), Daniel Ortega (HP Labs), Josep Llosa (UPC) and Mateo Valero (UPC) HPCA-10, Madrid February th.

1 Out-Of-Order Execution (part I) Alexander Titov 14 March 2015.

Precomputation- based Prefetching By James Schatz and Bashar Gharaibeh.

EECS 470 Lecture 7 Branches: Address prediction and recovery (And interrupt recovery too.)

1/25 June 28 th, 2006 BranchTap: Improving Performance With Very Few Checkpoints Through Adaptive Speculation Control BranchTap Improving Performance With.

OOO Pipelines - II Smruti R. Sarangi IIT Delhi 1.

OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.

1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

Out-of-order execution Lihu Rappoport 11/ MAMAS – Computer Architecture Out-Of-Order Execution Dr. Lihu Rappoport.

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011.

Samira Khan University of Virginia Feb 9, 2016 COMPUTER ARCHITECTURE CS 6354 Precise Exception The content and concept of this course are adapted from.

1/25 HIPEAC 2008 TurboROB TurboROB A Low Cost Checkpoint/Restore Accelerator Patrick Akl 1 and Andreas Moshovos AENAO Research Group Department of Electrical.

CS203 – Advanced Computer Architecture ILP and Speculation.

Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.

/ Computer Architecture and Design

Smruti R. Sarangi IIT Delhi

CSE 502: Computer Architecture

CS203 – Advanced Computer Architecture

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Out-of-Order Commit Processors

EECS 470 Final touches on Out-of-Order execution

Power-Aware Operand Delivery

Sequential Execution Semantics

Lecture 6: Advanced Pipelines

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Smruti R. Sarangi IIT Delhi

Out-of-Order Commit Processor

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Lecture 8: Dynamic ILP Topics: out-of-order processors

Adapted from the slides of Prof

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

Out-of-Order Commit Processors

Adapted from the slides of Prof

Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 9/30/2011

Instruction-Level Parallelism (ILP)

Patrick Akl and Andreas Moshovos AENAO Research Group

Spring 2019 Prof. Eric Rotenberg

ECE 721 Modern Superscalar Microarchitecture

Presentation transcript:

Lecture 9: R10K scheme, Tclk EECS 470 Fall 2014

Crazy (still) HW3 is due on Wednesday 2/12 It’s a fair bit of work (3 hours?) You’ll probably have questions! Should be able to do all but #3 now. This lecture will cover #3 Midterm is on Monday 2/17, 6-8pm Room assignments TBA Studying Exam Q&A on Satruday 2/15 (3-5pm --- NOTE THE CHANGE) (Q&A) in class on 2/17 Best way to study is look at old exams (posted on-line!)

Let’s lose the ARF! Why? Other motivations? Currently have two structures that may hold values (ROB and ARF) Need to write back to the ARF after every instruction! Other motivations? ROB currently holds result (which needs to be accessible to all) as well as other data (PC, etc.) which does not. So probably two separate structures anyways Many ROB entry result fields are unused (stores, branches)

Physical Register file Version 1.0 Keep a “Physical register file” If you want to get the ARF back you need to use the RAT. But the RAT has speculative information in it! We need to be able to undo the speculative work! How?

How? Remove Add Actions: The value field of the ROB The whole ARF A “Retirement RAT1” (RRAT) that is updated by retiring instructions the same way the RAT is by issuing instructions Actions: When you finish execution, update the PRF as well as the ROB (ROB just gets “done” message now) When you retire, update the RRAT (Other stuff we need to think about goes here.) 1http://www.ecs.umass.edu/ece/koren/ece568/papers/Pentium4.pdf

This seems sorta okay but… There seem to be some problems When can I free a physical register? If I’m writing to the physical register file at execute doesn’t that mean I committing at that point? How do I squash instructions? How do I recover architected state in the event of an exception?

Example RAT RRAT AR PR 1 2 3 4 10 AR PR 1 2 3 4 10 Dispatch: Assembly 1 2 3 4 10 AR PR 1 2 3 4 10 Dispatch: Assembly R1=R2*R3 R3=R1+R3

Example RAT RRAT AR PR 1 2 3 5 4 10 AR PR 1 2 3 4 10 In-flight 1 2 3 5 4 10 AR PR 1 2 3 4 10 In-flight Assembly R1=R2*R3 R3=R1+R3 Renamed P0=P3*P4 P5=P0+P4

Freedom Freeing the PRF We could do better How long must we keep each PRF entry? Until we are sure no one else will read it before the corresponding AR is again written. Once the instruction overwriting the Arch. Register commits we are certain safe. So free the PR when the instruction which overwrites it commits. In other words: when an instruction commits, it frees the PR pointed to the its back pointer! We could do better Freeing earlier would reduce the number of PRs needed. But unclear how to do given speculation and everything else.

Sidebar One thing that must happen with the PRF as well as the RS is that a “free list” must exist letting the processor know which resources are available. Maintaining these free lists can be a pain!

Resolving Branches RRAT Early resolution? (briefly) On mispredict at head of queue copy retirement RAT into RAT. Early resolution? (briefly) BRAT Keep a RAT copy for each branch in a RS! If mispredict can recover RAT quickly. ROB easy to fix, RS’s a bit harder.

Freedom Freeing the PRF We could do better How long must we keep each PRF entry? Until we are sure no one else will read it before the corresponding AR is again written. Once the instruction overwriting the Arch. Register commits we are certain safe. So free the PR when the instruction which overwrites it commits. In other words: when an instruction commits, Free the thing overwritten in the RRAT. We could do better Freeing earlier would reduce the number of PRs needed. But unclear how to do given speculation and everything else.

Sidebar One thing that must happen with the PRF as well as the RS is that a “free list” must exist letting the processor know which resources are available. Maintaining these free lists can be a pain! Let’s talk a bit about how one would do this.

R10K scheme What are we doing? Removing the ARF Removing the value field of the RoB. Adding a Physical Register File (~sum ARF and RoB) Adding a Retirement RAT (RRAT)

RAT RRAT RoB PRF 1 2 3 4 5 6 7 8 9 AR Target 4 1 2 7 3 AR Target 1 2 3 A: R1=MEM[R2+0] B: R2=R1/R3 C: R3=R2+R0 D: Branch (R1==0) E: R3=R1+R3 F: R3=R3+R0 G: R3=R3+19 H: R1=R7+R6 RRAT AR Target 4 1 2 7 3 AR Target 1 2 3 RoB 1 2 3 4 5 6 7 8 9 44 55 66 11 20 PRF

Alternative option (v0.9?) Use “back pointers” instead of RRAT. Record, in the ROB, which value in the RAT you overwrote. On commit, free that value (it will be the same as the one you would have overwritten in the RAT!) On mispredict, “undo” each step in reverse order (from tail to head). This gives same functionality as RRAT. Slower to handle mispredict that is at the head of the RoB. But could in theory handle mispredict as

Optimizing CPU Performance Golden Rule: tCPU = Ninst*CPI*tCLK Given this, what are our options Reduce the number of instructions executed Reduce the cycles to execute an instruction Reduce the clock period Our first focus: Reducing CPI Approach: Instruction Level Parallelism (ILP)

Tclock

tCLK Recall: tCPU = Ninst*CPI*tCLK What defines tCLK? Critical path latency (= logic + wire latency) Latch latency Clock skew Clock period design margins In current and future generation designs Wire latency becoming dominant latency of critical path Due to growing side-wall capacitance Brings a spatial dimension to architecture optimization E.g., How long are the wires that will connect these two devices?

Determining the Latency of a Wire grows shrinks scale

But reality is worse…. (Fringe) For Intel 0.25u process W~=0.64 T~=0.48 H is around 0.9. www.ee.bgu.ac.il/~Orly_lab/courses/Intro_Course/Slides/Lecture02-2-Wire.ppt

Moral of the “tCLK" story As we shrink wire delay starts to dominate Agarwal et. al. Clock Rate versus IPC: the End of the Road for Conventional Microarchitectures, ISCA 2000 Above paper says this will be the end of Moore’s law. So long buses and large fan-outs get to be difficult. tCLK still limits logic but limits clocks even more. Is this the end of Moore’s law? I seriously doubt it. Cried wolf before. Ballistic conductors (e.g. armchair nanotubes), superconductors, or some other magic look likely to save us soonish. Architectural attacks might also reduce impact.

And reducing the number of instructions executed… Sorry, wrong class. Compilers can help with this (a lot in some cases) So can ISA design, but usually not too much. Making instructions too complex hurts ILP and tCLK So on the whole reducing # of instructions doesn’t look to be viable. So ILP would seem to be “where it’s at”

Optimizing CPU Performance Golden Rule: tCPU = Ninst*CPI*tCLK Given this, what are our options Reduce the number of instructions executed Reduce the cycles to execute an instruction Reduce the clock period Our first focus: Reducing CPI Approach: Instruction Level Parallelism (ILP)

Early Branch Resulution

BRAT Simple idea: When we mispredict we need to recover things to the state when the branch finished issuing. RAT: Just make a copy Free list is non-trivial RS Can we just make a copy? RoB Easy money. Note: the literature usually calls this “map table checkpoints” or some such. I find that unwieldy so BRAT or BMAP will be used here. See “Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors” for a nice overview of options on branch misprediction recovery.

RAT BRAT RoB RS1 RS2 RS3 RS4 AR Target 1 2 3 AR Target 1 2 3 1 2 3 RS1 RS2 RS3 RS4 AR Target 1 2 3 RoB PRF freelist: