Super-scalar.

Super-scalar

Loose Ends Up to now: Still need to address:
Techniques for handling register-related dependencies Register renaming for WAR, WAW Tomasulo’s algorithm for scheduling RAW Still need to address: Control dependencies Memory dependencies Faults

Big Picture Modern CPUs are/use: Superscalar Out-of-Order
Speculative execution OOO Scheduling readiness, arbitration Actual execution (may be speculative) Update ROB/PRF, LB or SB Write results To ARF/Memory Fetch Decode Rename Issue Superscalar fetch Branch prediction Target prediction Superscalar decode (hard for CISC) Still in-order, but also superscalar Send to RS, ROB, LB, SB (AKA “Alloc”) Schedule Exec Writeback Commit Not in program order!

Non-Spec Exec Fetch instructions, even fetch past a branch, but don’t exec right away (assume fetching/retiring 3 inst. per cycle) fetch decode issue exec write commit F D I E W C 1. DIV R1 = R2 / R3 2. ADD R4 = R5 + R6 3. BEQZ R1, foo 4. SUB R5 = R4 – R3 5. MUL R7 = R5 * R2 6. MUL R8 = R7 * R3 Assume execution of : Add/Sub/Beq: 1 cycles Mult: 4 cycles Divide: 10 cycles

Branch Prediction/Speculative Execution
When we hit a branch, guess if it’s T or NT We’ll discuss how next lecture ADD A Guess T Branch LOAD DIV ADD Branch SUB STORE SUB LOAD XOR STORE ADD MUL  B Keep scheduling and executing Instructions as if the branch Didn’t even exist T NT C Q  Sometime later, if we messed up… D R  Just throw it all out … …  And fetch the correct instructions

Again, with Speculative Execution
Assume fetched direction is correct fetch decode issue exec write commit F D I E W C 1. DIV R1 = R2 / R3 2. ADD R4 = R5 + R6 3. BEQZ R1, foo 4. SUB R5 = R4 – R3 5. MUL R7 = R5 * R2 6. MUL R8 = R7 * R3

Branch Misprediction Recovery
ARF state corresponds to state prior to oldest non-committed instruction ARF br As instructions are processed, the RAT corresponds to the register mapping after the most recently renamed instruction RAT ?!? On a branch misprediction, wrong-path instructions are flushed from the machine The RAT is left with an invalid set of mappings corresponding to the wrong- path instruction state

Solution 1: Stall and Drain
Allow all instructions to execute and commit; ARF corresponds to last committed instruction ARF br ARF now corresponds to the state right before the next instruction to be renamed (foo) RAT X Reset RAT so that all mappings refer to the ARF ?!? Pros: Very simple to implement Cons: Performance loss due to stalls Resume renaming the new correct- path instructions from fetch foo Correct path instructions from fetch; can’t rename because RAT is wrong

Solution 2: Checkpointing
At each branch, make a copy of the RAT (register mapping at the time of the branch) ARF br Checkpoint Free Pool br RAT RAT RAT RAT br RAT br On a misprediction: 1. flush wrong-path instructions 2. deallocate RAT checkpoints foo 3. recover RAT from checkpoint 4. resume renaming

Speculating Past Multiple Branches
Branch every 4-6 instructions Pipeline depth typically stages With peak 3-4 instructions per cycle 20 stages * 3 inst / stage * 1 branch / 5 inst Approximately 12 branches in the pipeline when pipe is full Need 12 checkpoints (on average, more for burst) What’s the probability of still being on right path?

Speculative Execution is OK
ROB maintains program order ROB temporarily stores results If we screw something up, only the ROB knows, but no architected state is affected Register rename recovery makes sure we resume with correct register mapping If we screw up, we: Can undo the effects Know how to resume

Non-Spec Memory Instruction
Fetch instructions, even fetch past a branch, but don’t exec right away (assume fetching/retiring 3 inst. per cycle) fetch decode issue exec write commit F D I E W C 1. LOAD R3 = 0(R6) 2. ADD R7 = R3 + R9 3. STORE R4, 0(R7) 4. SUB R1 = R1 – R2 5. LOAD R8 = 0(R1) 6. LOAD R9 = 0(R2) Assume execution of : Add/Sub: 1 cycles Load/Store hit: 2 cycles Load/Store miss: 100 cycles

Executing Memory Instructions
Load R3 = 0[R6] Issue Miss serviced… Cache Miss! Add R7 = R3 + R9 Issue Store R4  0[R7] Issue Sub R1 = R1 – R2 Load R8 = 0[R1] Issue Cache Hit! But there was a later load… If R1 != R7, then Load R8 gets correct value from cache If R1 == R7, then Load R8 should have gotten value from the Store, but it didn’t!

Out-of-Order Load Execution
= store So don’t let loads execute out-of-order! A = load B IPC = 8/7 = 1.14 A B E H C D E C D F F G Ok, maybe not a good idea. No support for OOO load execution can crush your IPC IPC = 8/3 = 2.67 G H

What else could happen, if we execute load before store completes?
Value from cache is stale A B foo forwarding mechanism Some sort of data A foo A B foo D$ D$ B bar No problem. No problem. Uh oh. A C foo B A B foo bar Should have used B’s store value Luckily, this usually can’t even happen Uh oh. Uh oh.

Memory Disambiguation Problem
Why can’t this happen with non-memory insts? Operand specifiers in non-memory insts are absolute “R1” refers to one specific location Operand specifiers in memory insts are ambiguous “R1” refers to a memory location specified by the value of R1. As pointers change, so does this location.

Two Problems Memory disambiguation Store-to-load forwarding problem
Are there any earlier unexecuted stores to the same address as myself? (I’m a load) Binary question: answer is yes or no Store-to-load forwarding problem Which earlier store do I get my value from? (I’m a load) Which later load(s) do I forward my value to? (I’m a store) Non-binary question: answer is one or more instruction identifiers

Tomasulo’s Algorithm: The Picture

Load Store Queue (LSQ) Disambiguation: loads cannot
PC Seq Addr Value Data Cache Oldest L 0xF048 41773 0x3290 42 0x3290 -17 42 S 0xF04C 41774 0x3410 25 0x3300 1 S 0xF054 41775 0x3290 -17 0x3410 25 38 L 0xF060 41776 0x3418 1234 0x3418 1234 L 0xF840 41777 0x3290 -17 L 0xF858 41778 0x3300 1 S 0xF85C 41779 0x3290 L 0xF870 41780 0x3410 25 L 0xF628 41781 0x3290 Youngest L 0xF63C 41782 0x3300 1 Disambiguation: loads cannot execute until all earlier store addresses computed Forwarding: broadcast/search entire LSQ for match

Memory Disambiguation
Can we “undo” stores? Stores cannot be committed to memory until they are marked ready to retire Completed stores are queued and waiting in a store queue or store buffer Disambiguate (and resolve) memory dependency dynamically

Memory Ordering Source: Alpha HRM Quick example for load-load violation X= 5 P0 P1 R1 = X X = 0 R2 = X Under SC, it is not possible to have R1=0 and R2=X, only if load can bypass load. Load X bypassing Load X violates certain memory consistency model (e.g., sequential consistency) Load-load order trap replays

Load Store Queue (LSQ) RS ALLOC ROB Store Queue Load Queue
Age-ordered ALLOC ROB Store Queue Load Queue Split LSQ Memory instructions are allocated into LSQ in program order LSQ manages memory reference ordering Unified LSQ vs. Split LSQ Sandy Bridge: 64 Load buffers, 36 Store buffers

Issuing a Load for Execution
Issued? Issued? age address data age address 1 1 A 1 A Issued to Memory for execution 1 1 B 2 D 1 C FFFF1111 2 C 2 ??? FFFFFF00 Store Queue Load Queue Each load checks against older stores Associative search A performance issue of scalability

Issued? Issued? age address data age address 1 1 A 1 A 1 1 B 2 D 1 Store-to-load forwarding 1 C FFFF1111 2 C 2 ??? FFFFFF00 Store Queue Load Queue Implementation dependent: comprehensive size matching can be prohibitively expensive Simple method: forward when a larger store (word) precedes a smaller load (half)

Issued? Issued? age address data age address 1 1 A 1 A 1 1 B 2 D 1 Speculatively issue for execution 1 C FFFF1111 2 C 1 2 ??? FFFFFF00 3 K Store Queue Load Queue Can speculatively issue loads for shortening latency (Alpha 21264, Pentium 4 (Prescott)) Naively Use Memory Dependency Predictor Store, when address ready, checks newer loads in the Load Queue “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)

Store Checks Pre-Mature Loads
Issued? Issued? age address data age address 1 1 A 1 A 1 1 B 2 D 1 1 1 C FFFF1111 2 C 1 2 K FFFFFF00 1 3 K 3 M 1 Conflict detected! Replay the load 4 P 1 Store Queue Load Queue Store, when address ready, checks newer loads in the Load Queue Associative Search “Replay” needed if speculation turns out to be incorrect (e.g. Alpha’s store-load replay)

Issuing a Store for Execution
Issued to memory Issued? Issued? age address data age address 1 4 A 4 A 1 6 A 0F0F0F0F 5 D 6 C 5 C 6 K Store Queue Load Queue Shown above the basic concept Implementation dependent Not allow store bypassing load, since it has little impact on performance Perform associative search

Issuing a Store for Execution
Issued? Issued? age address data age address 1 4 A 4 A 1 6 A 0F0F0F0F 5 D cannot issue for execution 6 C 5 C 6 K Store Queue Load Queue

Load-Load Ordering Needed for Load-load trap invoked
Issued? age address Needed for Multiprocessor support Maintaining memory consistency model Load-load trap invoked Trap on the later, conflicted instructions Replay 4 A 5 D 1 1 5 C 1 6 A Load-load trap 6 M 1 6 N 1 7 K Load Queue

Commit, Exceptions What happens if a speculatively executed instruction faults? A A B B Branch mispred A, B Commit C C W D D X E Divide by Zero! E Divide by Zero! Outside world sees: A, B, fault! Fault should never be seen! Should have been: A, B, C, D, fault!

Treat Fault Like a Result
Regular register results written to ROB until Instruction is oldest for in-order state update Instruction is known to be non-speculative Do the same with faults! At exec, make note of any faults, but don’t expose to outside world If the instruction gets to commit, then expose the fault

Example A LOAD R1 = 0[R2] (commit) Resolved Miss E W F B
ADD R3 = R1 + R4 LOAD P1 imm R2 X X ADD P2 P1 R4 X X C SUB R1 = R5 – R6 SUB P3 R5 R6 X X DIV P4 R7 P3 X X X D DIV R4 = R7 / R1 LOAD P5 imm R7 X X X Executed, completed, faulted E LOAD R6 = 0[R7] Fault! (3 commits) Divide by zero Fault deferred until architecturally correct point. Other fault “never happened”… Flush rest of ROB, Start fetching Fault handler Now raise fault

Superscalar Commit is like Sampling
Scalar Commit Processor States Superscalar Commit Processor States A A B C D E F A B C Each “state” in the superscalar machine always corresponds to one state of the scalar machine (but not necessarily the other way around), and the ordering of states is preserved A B A B C D E A B C A B C D E F G A B C D A B C D E F G H A B C D E A B C D E F G H

Commit “Algorithm” For i={0..commit_width-1}
Has ROB[oldest+i] finished execution? If not, break Does ROB[oldest+i] have a fault? If so, raise it now (flush pipe, NPC = fault handler) If not, write to architected state (ARF/Memory)

SimpleScalar Model Fetch Dispatch (Issue) Issue (Exec) Writeback
IFQ, I$, bpred Fakes front-end pipeline with delay Fetch Dispatch (Issue) Issue (Exec) Writeback Commit RS, ROB (idep/odep) Decode, map dependencies, allocate Simulator uses an oracle/checker approach. It keeps an “official” version of the state, and then makes sure that the simulated version correctly generates the same results (at commit). readyq, LSQ Arbitrate and schedule WB events odep list Notify dependents of result ROB, LSQ commit, store writeback

Superscalar Commit To sustain > 1 IPC execution, must commit at > 1 IPC as well So long as no one wants to “look”, we can make multiple updates to state each cycle ARF needs multiple write ports Potentially multiple writes to memory

Questions?

Super-scalar.

Similar presentations

Presentation on theme: "Super-scalar."— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Super-scalar.

Similar presentations

Presentation on theme: "Super-scalar."— Presentation transcript:

Similar presentations

About project

Feedback