Computer Architecture


Computer Architecture
Goal: Build the best possible “processor”
- Here’s a piece of silicon
- Here are some of its properties
- Tell me what to build
1. Understand your building blocks
2. Understand what “best” means
3. Take into account design/production time
A. Moshovos © ECE1773 - Fall ‘07 ECE Toronto

Track Record

Evolution?

Modern Designs

Understanding your Building Blocks

Moore’s Law

Moore’s Law in Practice

The other Moore’s Law

Technology Scaling

Ideal Shrink vs. New Design

Understanding what is Best

Why Study Computer Architecture

Why Study Computer Architecture

Challenges in Computer Architecture

Review of Modern Processor Architectures

Sequential Execution Semantics
- Contract: How the machine appears to behave

Dissecting Instructions
- Data Movement
- Data Manipulation
- Control Flow

An Instruction in a Processor’s Lifetime

Pipelining

Sequential Semantics are Preserved

Superscalar - In-order
- Two or more consecutive instructions in the original program order can execute in parallel
- This is the dynamic execution order
- N-way Superscalar: can issue up to N instructions per cycle (2-way, 3-way, …)
[Pipeline diagram: fetch and decode of ld, add, sub, bne]

Data Dependences

Superscalar vs. Pipelining
sum += a[i--]
loop: ld  r2, 10(r1)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
- Pipelining: [timing diagram over time: fetch and decode of ld, add, sub, bne]
- Superscalar: [timing diagram over time: fetch and decode of ld, add, sub, bne]

Superscalar Performance
- Performance spectrum?
- What if all instructions were dependent?
  - No gain (speedup of 1x over a scalar pipeline): superscalar buys us nothing
- What if all instructions were independent?
  - Speedup = N, where N = superscalarity
- Again, the key is typical program behavior
  - Some parallelism exists
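
The two extremes above can be checked with a small model. The sketch below is illustrative only: the function name, the unit-latency assumption, and the 8-instruction test programs are not from the slides. It counts issue cycles for an in-order machine that issues up to `width` instructions per cycle and stalls on RAW dependences.

```python
def inorder_issue_cycles(producers, width):
    """Cycles needed to issue every instruction, in order, up to `width` per cycle.

    producers[i] is the set of earlier instruction indices whose results
    instruction i reads. Every instruction is assumed to take one cycle.
    """
    issue_cycle = []          # cycle in which each instruction issued
    cycle, used = 0, 0        # current issue cycle and slots already used in it
    for deps in producers:
        earliest = max((issue_cycle[p] + 1 for p in deps), default=0)
        if earliest > cycle:          # RAW stall: wait for the producer
            cycle, used = earliest, 0
        elif used == width:           # issue slots exhausted this cycle
            cycle, used = cycle + 1, 0
        issue_cycle.append(cycle)
        used += 1
    return cycle + 1 if issue_cycle else 0


N = 8
chain = [set()] + [{i - 1} for i in range(1, N)]   # each instruction needs the previous one
independent = [set() for _ in range(N)]            # no dependences at all

for name, prog in (("dependent", chain), ("independent", independent)):
    scalar = inorder_issue_cycles(prog, 1)
    wide = inorder_issue_cycles(prog, 4)
    print(name, "speedup on a 4-way machine:", scalar / wide)
# dependent speedup on a 4-way machine: 1.0
# independent speedup on a 4-way machine: 4.0
```

A fully dependent chain takes the same number of cycles at any width, so the ratio is 1 (zero gain); fully independent code reaches the width N. Real programs fall somewhere in between.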

“Real-Life” Performance
- OLTP = Online Transaction Processing
- SOURCE: Partha Ranganathan, Kourosh Gharachorloo, Sarita Adve, Luiz André Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors,” ASPLOS 98

“Real Life” Performance
- SPEC CPU 2000
- Simplescalar sim: 32K I$ and D$, 8K bpred

Issue Mechanism – A Group of Instructions at Decode
- Each instruction in the group, in program order: tgt src1 src2
- Assume 2 source & 1 target max per instr.
- Comparators for 2-way: 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW)
- Comparators for 4-way:
  - 2nd instr: 3 tgt and 2 src
  - 3rd instr: 6 tgt and 4 src
  - 4th instr: 9 tgt and 6 src
- Simplifications may be possible
- Resource checking not shown
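
The comparator counts follow from a simple rule: each instruction is checked against every earlier instruction in its group. A minimal sketch of that arithmetic (the function name and the parameter defaults are illustrative, chosen to match the slide's 2-source, 1-target assumption):

```python
def comparators(width, srcs=2, tgts=1):
    """Per-instruction comparator counts for dependence checks within one issue group.

    Returns a list of (position, tgt_comparators, src_comparators), where
    position 2 is the second instruction in the group, and so on.
    """
    rows = []
    for position in range(2, width + 1):
        earlier = position - 1
        # The target is compared with each earlier target (WAW)
        # and with each earlier source (WAR).
        tgt_cmp = earlier * tgts * (tgts + srcs)
        # Each source is compared with each earlier target (RAW).
        src_cmp = earlier * srcs * tgts
        rows.append((position, tgt_cmp, src_cmp))
    return rows


for position, tgt_cmp, src_cmp in comparators(4):
    print(f"instr {position}: {tgt_cmp} tgt and {src_cmp} src")
# instr 2: 3 tgt and 2 src
# instr 3: 6 tgt and 4 src
# instr 4: 9 tgt and 6 src
```

Summed over a 4-way group this gives 30 comparators, and the total grows quadratically with issue width.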

Preserving Sequential Semantics
- In principle not much different than pipelining
- Program order is preserved in the pipeline
- Some instructions proceed in parallel, but order is clearly defined
- Defer interrupts to commit stage (i.e., writeback)
  - Flush all subsequent instructions; this may include instructions committing simultaneously
  - Allow all preceding instructions to commit
- Recall comparisons are done in program order
- Must have sufficient time in clock cycle to handle these

Interrupts Example
[Two pipeline diagrams of ld, add, div, bne, bne passing through fetch and decode. First diagram labels: Exception raised, Exception taken. Second diagram labels: Exception raised, Exception raised, Exception taken.]

Case Study: Alpha 21164

21164: Int. Pipe

21164: Memory Pipeline

21164: Floating-Point Pipe

Performance Comparison Source:

CPI Comparison: Ideal 0.25

Compiler Impact Optimized Base Performance

Stall Cycles - 21164 Data Dependences/Data Stalls No instructions

Issue Cycle Distribution - 21064

Issue Cycle Distribution - 21164

Stall Cycles Distribution
- Model: When no instruction is committing
- Does not capture overlapping factors:
  - Stall due to dependence while committing
  - Stall due to cache miss while committing

Replay Traps
- Tried to do something and couldn’t
  - Store and write-buffer is full: can’t complete instruction
  - Load and miss-address-file full
  - Assumed cache hit and it was a miss: dependent instructions executed, must re-execute dependent instructions
- Re-execute the instruction and everything that follows

Replay Traps Explained
Cache hit:
  ld r1      F D E M W
  add _, r1  F D D E M W
Cache miss:
  ld r1      F D E M M W
  add _, r1  F D D D E M W

Optimistic Scheduling
Cache hit:
  ld r1      F D E M W
  add _, r1  F D D E M W
Stage markers in diagram: M E D
- Hit/miss known here
- add should start execution here
- Must decide that add should execute
- Start making scheduling decisions

Optimistic Scheduling #2
Cache hit:
  ld r1      F D E M W
  add _, r1  F D D E M W
Stage markers in diagram: M E D
- Hit/miss known here
- Guess Hit/Miss
- add should start execution here
- Must decide that add should execute
- Start making scheduling decisions

Stall Distribution

21164

Instruction Decode/Issue
- Up to four insts/cycle
- Naturally aligned groups
  - Must start at 16 byte boundary (INT16)
  - Simplifies Fetch path
- Where do instructions come from?
  - I-Cache:
  - CPU needs:

Fetching Four Instructions
- Where do instructions come from?
  - I-Cache:
  - CPU needs:
- Software must guarantee alignment at 16 byte boundaries
  - Lots of NOPs

Instruction Decode/Issue
- Up to four insts/cycle
- Naturally aligned groups
  - Must start at 16 byte boundary (INT16)
  - Simplifies Fetch path (in a second)
- All of group must issue before next group gets in
  - Simplifies Scheduling
  - No need for reshuffling
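
The alignment rule is plain address arithmetic. The sketch below assumes 4-byte instructions (so a 16-byte group holds four) and uses hypothetical helper names; it shows where a PC falls within its group and how many NOPs software would need to pad a branch target up to the next group boundary.

```python
INSTR_BYTES = 4     # assumed fixed instruction size
GROUP_BYTES = 16    # naturally aligned issue group (INT16)


def group_base(pc):
    """Address of the 16-byte aligned group that contains `pc`."""
    return pc & ~(GROUP_BYTES - 1)


def slot_in_group(pc):
    """Position (0..3) of the instruction at `pc` within its group."""
    return (pc & (GROUP_BYTES - 1)) // INSTR_BYTES


def nops_to_align(pc):
    """NOPs needed so that the next instruction starts a new group."""
    return (-pc % GROUP_BYTES) // INSTR_BYTES


target = 0x1008
print(hex(group_base(target)), slot_in_group(target), nops_to_align(target))
# 0x1000 2 2
```

A branch target that lands in slot 2 wastes the first two slots of its group unless the code before it is padded, which is the source of the "lots of NOPs" remark.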

Pipeline Processing Front-End

Integer Add

Floating-Point Add

Load Hit

Load Miss

Store Hit

Sequential Semantics - Review
- Instructions appear as if they executed:
  - In the order they appear in the program (program order)
  - One after the other
- Pipelining
- Superscalar
- Out-of-Order

Out-of-Order Execution
do { sum += a[++m]; i--; } while (i != 0);
loop: add r4, r4, 1
      ld  r2, 10(r4)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
[Timing diagrams comparing superscalar and out-of-order execution of add, ld, add, sub, bne through fetch and decode]

Sequential Semantics?
- Execution does NOT adhere to sequential semantics
  - To be precise: eventually it may
- Simplest solution: define the problem away
  - Not acceptable today: e.g., Virtual Memory
- Three-phase instruction execution
  - In-Progress, Completed and Committed
[Pipeline diagram: fetch, decode, sub, bne, add, ld; labels: inconsistent, consistent]

Back to Sequential Semantics
- Instr. exec. in 3 phases: In-progress, Completed, Committed
  - OOO for In-progress and Completed
  - In-order commits
- Completed - out-of-order: ”Visible only inside”
  - Results visible to subsequent instructions
  - Results not visible to outsiders
  - On interrupts completed results are discarded
- Committed - in-order: ”Visible to all”
  - Results visible to outsiders
  - On interrupt committed results are preserved

How Completes Help w/ Performance
DIV R3, _, _
ADD R1, _, _
ADD _, R1, _
[Timing diagrams over time: in-order completes vs. out-of-order completes with in-order commits]
[Pipeline diagram: fetch, decode, sub, bne, add, ld, complete, followed by in-order commits]

Implementing Completes/Commits
- Key idea: Maintain sufficient state around to be able to roll back when necessary
- Roll-back: Discard (aka Squash) all not committed
- One solution (conceptual):
  - Upon Complete, instruction records previous value of target register
  - Upon Discard, instruction restores target value
  - Upon Commit, nothing to do
- We will return to this shortly
- Focus on scheduling mechanisms
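
The conceptual solution can be sketched as a log of old register values. The class and method names below are hypothetical, and the sketch assumes that writes to the same register complete in program order (WAW ordering is handled elsewhere).

```python
class UndoLog:
    """Register file plus a log of overwritten values for not-yet-committed writes."""

    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []                     # (seq, reg, old_value) in completion order

    def complete(self, seq, reg, value):
        # Upon Complete: record the previous value, then update the register.
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def commit(self, seq):
        # Upon Commit: nothing to restore, the old value is simply forgotten.
        self.log = [entry for entry in self.log if entry[0] != seq]

    def squash(self):
        # Upon Discard: undo every uncommitted write, newest first.
        while self.log:
            _, reg, old = self.log.pop()
            self.regs[reg] = old


state = UndoLog({"r1": 5, "r2": 7})
state.complete(seq=2, reg="r2", value=70)   # younger instruction completes first
state.complete(seq=1, reg="r1", value=50)
state.commit(seq=1)                         # oldest instruction commits
state.squash()                              # interrupt: discard everything else
print(state.regs)
# {'r1': 50, 'r2': 7}
```

The committed write to r1 survives the squash, while the completed but uncommitted write to r2 is rolled back.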

Out-of-Order Execution Overview
- Program Form: static program, dynamic inst. stream (trace), execution window, completed instructions
- Processing Phase: dispatch/dependences, inst. issue, inst. execution, inst. reorder & commit
- Instruction states: In-Progress, Completed, Committed

Out-of-Order Execution: Stages
- Fetch: get instruction from memory
- Decode/Dispatch: what is it? What are the dependences?
- Issue: Go – all dependences satisfied
- Execute: perform operation
- Complete: result available to other insts.
- Commit: result available to outsiders
- We’ll start w/ Decode/Dispatch
- Then we’ll consider Issue

OOO Scheduling
- Instruction @ Decode: Do I have dependences yet to be satisfied?
  - Yes: stall until they are
  - No: clear to issue
- When do I wake up?
  - Producer completes and notifies me
  - If all dependences are satisfied I can proceed
- Dependence: (later instruction, earlier instruction) & type
- We’ll first consider RAW and then move on to WAW and WAR

Stalling @ Decode for RAW
- Are there unsatisfied dependences?
- RAW: have to wait for register value
  - We don’t really care who is producing the value
  - Only whether it is available
- Can use the Register Availability Vector as in pipelining/superscalar
  - Also known as scoreboard
- At Decode:
  - Check all bits for source regs: if any is 0, stall
  - Reset bit corresponding to your target
- At writeback: set the bit
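
A minimal sketch of the Register Availability Vector, with hypothetical class and method names. It handles RAW only, as on the slide; WAW and WAR checks are deliberately left out.

```python
class RegisterAvailabilityVector:
    """One bit per register: 1 = value available, 0 = still being produced."""

    def __init__(self, registers):
        self.avail = {reg: 1 for reg in registers}

    def decode(self, srcs, tgt=None):
        """Try to decode an instruction. Returns False if it must stall."""
        if any(self.avail[reg] == 0 for reg in srcs):
            return False                  # RAW: a source is not available yet
        if tgt is not None:
            self.avail[tgt] = 0           # reset the bit for our target
        return True

    def writeback(self, tgt):
        self.avail[tgt] = 1               # value produced: set the bit


rav = RegisterAvailabilityVector(["r0", "r1", "r2", "r3", "r4"])
print(rav.decode(srcs=["r4"], tgt="r2"))         # ld r2, 10(r4)  -> True
print(rav.decode(srcs=["r3", "r2"], tgt="r3"))   # add r3, r3, r2 -> False, r2 pending
rav.writeback("r2")                              # the load writes back
print(rav.decode(srcs=["r3", "r2"], tgt="r3"))   # now decodes    -> True
```

Note that the sources are checked before the target bit is reset, so an instruction such as `sub r1, r1, 1` does not stall on itself.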

Issuing Instructions: Scheduling
- Determine when an instruction can issue
  - Ignore resources for the time being
  - Stalled because of RAW w/ preceding instruction
- Concept: Producer (write) notifies consumers (read)
- Requirements: Consumers need to be able to identify the producer
  - The register name is one possible link
- Mechanism:
  - Consumer is placed in a reservation station
  - Producer, on complete, broadcasts its identity
  - Waiting instructions observe
  - Update operand availability
  - Issue if all operands are now available

Reservation Station
- State pertaining to an instruction:
  - What registers it reads
  - Whether they are available
  - What is the destination register
  - What state is the instruction in: Waiting, Executing
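
The reservation-station state and the producer broadcast can be sketched as follows. The field names, the `Wait`/`Rdy` status strings, and the `broadcast` helper are illustrative assumptions; the register name serves as the link between producer and consumer.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class ReservationStation:
    op: str
    src_ready: Dict[str, bool]     # source register -> is its value available?
    tgt: Optional[str]             # destination register, None for branches
    status: str = "Wait"

    def __post_init__(self):
        self._update()

    def _update(self):
        # A waiting instruction becomes ready once every operand is available.
        if self.status == "Wait" and all(self.src_ready.values()):
            self.status = "Rdy"

    def observe(self, produced_reg: str):
        # Waiting instructions watch the broadcast and update operand availability.
        if produced_reg in self.src_ready:
            self.src_ready[produced_reg] = True
        self._update()


def broadcast(stations: List[ReservationStation], produced_reg: str):
    """A completing producer announces the register it wrote."""
    for station in stations:
        station.observe(produced_reg)


add = ReservationStation("add", {"r3": True, "r2": False}, tgt="r3")
sub = ReservationStation("sub", {"r1": True}, tgt="r1")
print(add.status, sub.status)      # Wait Rdy
broadcast([add, sub], "r2")        # the load that produces r2 completes
print(add.status)                  # Rdy
```

In hardware the broadcast reaches every station in the same cycle; the loop here only models that behavior sequentially.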

Out-Of-Order Exec. Example
loop: add r4, r4, 4
      ld  r2, 10(r4)      (5 cycles lat)
      add r3, r3, r2
      sub r1, r1, 1
      bne r1, r0, loop
RAV: one availability bit each for r1, r2, r3, r4
Reservation station columns: op, src1, src2, tgt, status (entries are register/availability bit; NA = no operand)

Cycle 0
op   src1  src2  tgt   status
add  r4/1  NA/1  r4/0  Rdy     Ready to be executed

Cycle 1
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Exec    R4 gets produced now; notify those waiting for R4
ld   r4/1  NA/1  r2    Rdy

Cycle 2
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Exec    Result available @ cycle 6
add  r3/1  r2/0  r3    Wait    Wait for r2

Cycle 3
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Exec    Result available @ cycle 6
add  r3/1  r2/0  r3    Wait    Wait for r2
sub  r1/1  NA/1  r1    Rdy     No dependences

Cycle 4
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Exec    Result available @ cycle 6
add  r3/1  r2/0  r3    Wait    Wait for r2
sub  r1/1  NA/1  r1    Exec    r1 produced now; notify consumers
bne  r1/1  r0/1  NA    Rdy     r1 will be available next cycle

Cycle 5
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Exec    Result available @ cycle 6
add  r3/1  r2/0  r3    Wait    Wait for r2
sub  r1/1  NA/1  r1    Compl   Completed
bne  r1/1  r0/1  NA    Exec    Executing

Cycle 6
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Exec    Result available @ cycle 6; notify consumers
add  r3/1  r2/1  r3    Rdy
sub  r1/1  NA/1  r1    Compl   Completed
bne  r1/1  r0/1  NA    Exec    Executing

Cycle 7
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Cmtd    Notify consumers
add  r3/1  r2/1  r3    Exec    Executing
sub  r1/1  NA/1  r1    Compl
bne  r1/1  r0/1  NA    Compl   Completed

Cycle 8
op   src1  src2  tgt   status
add  r4/1  NA/1  r4    Cmtd
ld   r4/1  NA/1  r2    Cmtd
add  r3/1  r2/1  r3    Cmtd
sub  r1/1  NA/1  r1    Cmtd
bne  r1/1  r0/1  NA    Cmtd
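
The execution start cycles in this walkthrough can be reproduced with a small dataflow model. This is a sketch under stated assumptions, not the scheduler from the slides: one instruction is decoded per cycle, an instruction starts executing no earlier than the cycle after it is decoded and the cycle after its producers finish, the load takes 5 cycles (consistent with a result at cycle 6), and resources, WAR, and WAW are ignored. The names `schedule`, `LOOP`, and `LATENCY` are illustrative.

```python
def schedule(program, latency):
    """Return the cycle in which each instruction starts executing.

    program: list of (op, source_registers, target_register_or_None),
    decoded one per cycle in program order starting at cycle 0.
    """
    last_exec_cycle = {}      # register -> final execute cycle of its latest producer
    starts = []
    for decode_cycle, (op, srcs, tgt) in enumerate(program):
        start = decode_cycle + 1
        for reg in srcs:
            if reg in last_exec_cycle:                 # RAW: wait for the producer
                start = max(start, last_exec_cycle[reg] + 1)
        starts.append(start)
        if tgt is not None:
            last_exec_cycle[tgt] = start + latency.get(op, 1) - 1
    return starts


LOOP = [
    ("add", ["r4"], "r4"),          # add r4, r4, 4
    ("ld",  ["r4"], "r2"),          # ld  r2, 10(r4)
    ("add", ["r3", "r2"], "r3"),    # add r3, r3, r2
    ("sub", ["r1"], "r1"),          # sub r1, r1, 1
    ("bne", ["r1", "r0"], None),    # bne r1, r0, loop
]
LATENCY = {"ld": 5}                 # every other op is assumed to take 1 cycle

print(schedule(LOOP, LATENCY))
# [1, 2, 7, 4, 5]
```

The `sub` and `bne` start at cycles 4 and 5, ahead of the older `add r3` that waits until cycle 7 for the load, which is the out-of-order behavior the tables show.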

Window vs. Scheduler
- Window: distance between oldest and youngest instruction that can co-exist inside the CPU
  - Instructions enter at Fetch
  - Exit at Commit
  - Larger window → potential for more ILP
- Scheduler: number of instructions that are waiting to be issued
  - Instructions enter at Decode
  - Leave at writeback/complete
- Window >= Scheduler
  - Can be the same structure
  - In window but not in scheduler → completed instructions

Beyond Simple OoO
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4
- E will wait for B, C and D
  - WAR w/ C and D
  - WAW w/ B
- Can we do better?
[Dependence graph over A, B, C, D, E]

What if we had infinite registers
A: LF   F6, 34(R2)
B: LF   F2, 45(R3)
C: MULF F0, F2, F4
D: SUBF F8, F2, F6
E: ADDF F2, F7, F4  becomes  E: ADDF F9, F7, F4
- No false dependences anymore
- Since we do not reuse a name we can’t have WAW and WAR

State Recovery Example: Register Alias Table
Original Code:
A: add  r1, r2, 100
B: breq r1, E
C: sub  r1, r2, r2
Renamed Code:
A: add  p4, p2, 100
B: breq p4, E
C: sub  p5, p2, p2
RAT:
- Indexed by architectural register (lg(# arch. regs) index bits, # arch. regs entries)
- Each entry holds a physical register
- Entries shown: p1, p2, p3; the first entry changes to p4 after A and to p5 after C

Register Renaming
- Register version
  - Every write creates a new version
  - Uses read the last version
  - Need to keep a version until all uses have read it
- Register renaming: architectural vs. physical registers
  - More phys. than arch.
  - Maintain a map of arch. to phys. regs.
  - Use in-order decoding to properly identify dependences
  - Instructions wait only for input op. availability
  - Only last version is written to reg. file
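
A minimal renaming sketch, assuming an unbounded supply of physical registers (no free list or reclamation) and hypothetical names `rename`, `ARCH_REGS`, and `PROGRAM`. Instructions are renamed in program order: sources read the current map, then the target gets a fresh physical register.

```python
def rename(program, arch_regs):
    """Rename (label, op, target, sources) tuples; returns (renamed program, final map)."""
    rat = {reg: f"p{i}" for i, reg in enumerate(arch_regs)}   # arch -> phys map
    next_free = len(arch_regs)
    renamed = []
    for label, op, tgt, srcs in program:
        phys_srcs = [rat[s] for s in srcs]     # uses read the last version
        phys_tgt = f"p{next_free}"             # every write creates a new version
        next_free += 1
        rat[tgt] = phys_tgt
        renamed.append((label, op, phys_tgt, phys_srcs))
    return renamed, rat


ARCH_REGS = ["F0", "F2", "F4", "F6", "F7", "F8", "R2", "R3"]
PROGRAM = [
    ("A", "LF",   "F6", ["R2"]),
    ("B", "LF",   "F2", ["R3"]),
    ("C", "MULF", "F0", ["F2", "F4"]),
    ("D", "SUBF", "F8", ["F2", "F6"]),
    ("E", "ADDF", "F2", ["F7", "F4"]),
]

renamed, final_rat = rename(PROGRAM, ARCH_REGS)
for label, op, tgt, srcs in renamed:
    print(label, op, tgt, srcs)
# A LF p8 ['p6']
# B LF p9 ['p7']
# C MULF p10 ['p9', 'p2']
# D SUBF p11 ['p9', 'p8']
# E ADDF p12 ['p4', 'p2']
```

After renaming, C and D read p9 (B's version of F2) while E writes p12, so E no longer has a WAR dependence on C and D or a WAW dependence on B. Only the true RAW dependences remain.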

ROB: Slow, Fine-Grain Recovery
- Each Reorder Buffer entry (kept in program order) contains:
  - Architectural destination register
  - Its previous RAT map
- Recovery when the RAT is INVALID:
  1. Misprediction discovered
  2. Locate newest instruction
  3. Undo RAT updates in reverse order
- Too slow: recovery latency proportional to number of instructions to squash
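
The walk-back can be sketched with a list standing in for the reorder buffer. The class and method names are hypothetical; each entry stores the architectural destination and its previous mapping, as the slide describes.

```python
class RenamerWithROB:
    def __init__(self, rat):
        self.rat = dict(rat)      # architectural -> physical register
        self.rob = []             # program order: (arch_dest or None, previous mapping)

    def dispatch(self, arch_dest=None, new_phys=None):
        """Allocate a ROB entry; returns its index. Branches have no destination."""
        previous = self.rat[arch_dest] if arch_dest is not None else None
        self.rob.append((arch_dest, previous))
        if arch_dest is not None:
            self.rat[arch_dest] = new_phys
        return len(self.rob) - 1

    def recover(self, branch_index):
        """Squash every entry younger than the branch; returns how many were undone."""
        undone = 0
        while len(self.rob) > branch_index + 1:
            arch_dest, previous = self.rob.pop()     # newest instruction first
            if arch_dest is not None:
                self.rat[arch_dest] = previous       # undo its RAT update
            undone += 1
        return undone


core = RenamerWithROB({"r1": "p1", "r2": "p2", "r3": "p3"})
core.dispatch("r1", "p4")            # A: add  r1, r2, 100
branch = core.dispatch()             # B: breq r1, E
core.dispatch("r1", "p5")            # C: sub  r1, r2, r2
print(core.recover(branch), core.rat["r1"])
# 1 p4
```

The loop runs once per squashed instruction, which is why recovery latency grows with the number of instructions after the mispredicted branch.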

Global Checkpoints: Fast, Coarse-Grain Recovery
- A checkpoint of the RAT is taken at selected instructions (e.g., branches) in program order
- Misprediction discovered: the current RAT is INVALID
- Branch w/ GC: Recovery is “Instantaneous”
[Diagram: Reorder Buffer in program order with branches (B) and four checkpoints]
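
Checkpoint-based recovery can be sketched by snapshotting the whole map at each branch. The class and method names are hypothetical, and copying a Python dict stands in for the hardware's parallel copy of the RAT.

```python
class CheckpointedRAT:
    def __init__(self, rat):
        self.rat = dict(rat)
        self.checkpoints = []            # (branch_id, RAT snapshot), oldest first

    def rename(self, arch_reg, phys_reg):
        self.rat[arch_reg] = phys_reg

    def checkpoint(self, branch_id):
        """Take a global checkpoint of the RAT when a branch is decoded."""
        self.checkpoints.append((branch_id, dict(self.rat)))

    def recover(self, branch_id):
        """Restore the snapshot for a mispredicted branch in a single step."""
        ids = [b for b, _ in self.checkpoints]
        idx = ids.index(branch_id)
        self.rat = dict(self.checkpoints[idx][1])
        del self.checkpoints[idx:]       # this checkpoint and all younger ones are dead


core = CheckpointedRAT({"r1": "p1", "r2": "p2", "r3": "p3"})
core.rename("r1", "p4")                  # A: add  r1, r2, 100
core.checkpoint("B")                     # B: breq r1, E
core.rename("r1", "p5")                  # C: sub  r1, r2, r2
core.recover("B")
print(core.rat["r1"])
# p4
```

Recovery cost no longer depends on how many instructions follow the branch, but each checkpoint stores a full copy of the map, so only a limited number can be kept.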

MIPS R10000 Pipeline Overview

MIPS R10000 Pipelines

MIPS R10000 Architecture

Register Renaming

Scheduler