Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Slides:

Advertisements

Similar presentations

Krste Asanovic Electrical Engineering and Computer Sciences

Advertisements

Lecture 19: Cache Basics Today’s topics: Out-of-order execution

1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

1 Lecture 20: Speculation Papers: Is SC+ILP=RC?, Purdue, ISCA’99 Coherence Decoupling: Making Use of Incoherence, Wisconsin, ASPLOS’04 Selective, Accurate,

1 Lecture: Pipelining Extensions Topics: control hazards, multi-cycle instructions, pipelining equations.

CS 7810 Lecture 2 Complexity-Effective Superscalar Processors S. Palacharla, N.P. Jouppi, J.E. Smith U. Wisconsin, WRL ISCA ’97.

1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

1 Lecture 7: Out-of-Order Processors Today: out-of-order pipeline, memory disambiguation, basic branch prediction (Sections 3.4, 3.5, 3.7)

CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: branch prediction, out-of-order processors (Sections )

1 Lecture 19: Core Design Today: issue queue, ILP, clock speed, ILP innovations.

1 Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections )

March 9, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.

1 Lecture 8: Instruction Fetch, ILP Limits Today: advanced branch prediction, limits of ILP (Sections , )

1 Lecture 18: Pipelining Today’s topics:  Hazards and instruction scheduling  Branch prediction  Out-of-order execution Reminder:  Assignment 7 will.

1 Lecture 10: ILP Innovations Today: handling memory dependences with the LSQ and innovations for each pipeline stage (Section 3.5)

1 Lecture 8: Branch Prediction, Dynamic ILP Topics: static speculation and branch prediction (Sections )

1 Lecture 4: Advanced Pipelines Data hazards, control hazards, multi-cycle in-order pipelines (Appendix A.4-A.10)

1 Lecture 9: Dynamic ILP Topics: out-of-order processors (Sections )

1 Lecture 7: Branch prediction Topics: bimodal, global, local branch prediction (Sections )

OOO execution © Avi Mendelson, 4/ MAMAS – Computer Architecture Lecture 7 – Out Of Order (OOO) Avi Mendelson Some of the slides were taken.

Trace cache and Back-end Oper. CSE 4711 Instruction Fetch Unit Using I-cache I-cache I-TLB Decoder Branch Pred Register renaming Execution units.

CS 152 Computer Architecture and Engineering Lecture 15 - Out-of-Order Memory, Complex Superscalars Review Krste Asanovic Electrical Engineering and Computer.

OOO Pipelines - III Smruti R. Sarangi Computer Science and Engineering, IIT Delhi.

1 Lecture: Out-of-order Processors Topics: branch predictor wrap-up, a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

1 Lecture: Out-of-order Processors Topics: a basic out-of-order processor with issue queue, register renaming, and reorder buffer.

March 1, 2012CS152, Spring 2012 CS 152 Computer Architecture and Engineering Lecture 12 - Advanced Out-of-Order Superscalars Krste Asanovic Electrical.

CSE431 L13 SS Execute & Commit.1Irwin, PSU, 2005 CSE 431 Computer Architecture Fall 2005 Lecture 13: SS Backend (Execute, Writeback & Commit) Mary Jane.

CS203 – Advanced Computer Architecture ILP and Speculation.

1 Lecture 20: OOO, Memory Hierarchy Today’s topics:  Out-of-order execution  Cache basics.

Lecture: Out-of-order Processors

/ Computer Architecture and Design

Smruti R. Sarangi IIT Delhi

Out of Order Processors

Dr. George Michelogiannakis EECS, University of California at Berkeley

Lecture: Out-of-order Processors

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Lecture 6: Advanced Pipelines

Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 10: Out-of-order Processors

Lecture 11: Out-of-order Processors

Lecture: Out-of-order Processors

Lecture 19: Branches, OOO Today’s topics: Instruction scheduling

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Lecture 18: Pipelining Today’s topics:

Smruti R. Sarangi IIT Delhi

Lecture: SMT, Cache Hierarchies

Lecture 11: Memory Data Flow Techniques

Lecture 17: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Lecture 18: Pipelining Today’s topics:

Smruti R. Sarangi Computer Science and Engineering, IIT Delhi

Lecture: Out-of-order Processors

Lecture 8: Dynamic ILP Topics: out-of-order processors

Adapted from the slides of Prof

Krste Asanovic Electrical Engineering and Computer Sciences

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

Lecture 19: Branches, OOO Today’s topics: Instruction scheduling

Lecture 20: OOO, Memory Hierarchy

Lecture: SMT, Cache Hierarchies

Lecture 20: OOO, Memory Hierarchy

Lecture 19: Core Design Today: implementing core structures – rename, issue queue, bypass networks; innovations for high ILP and clock speed.

Adapted from the slides of Prof

Lecture: SMT, Cache Hierarchies

Lecture 10: ILP Innovations

Lecture 9: ILP Innovations

Lecture 9: Dynamic ILP Topics: out-of-order processors

Conceptual execution on a processor which exploits ILP

Lecture 7: Branch Prediction, Dynamic ILP

Presentation transcript:

Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue

The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr fetch Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 Committed Reg Map R1P1 R2P2 Register File P1-P64 R1  R1+R2 R2  R1+R3 BEQZ R2 R3  R1+R2 R1  R3+R2 Decode & Rename P33  P1+P2 P34  P33+P3 BEQZ P34 P35  P33+P34 P36  P35+P34 ALU ALU ALU Speculative Reg Map R1P36 R2P34 Instr Fetch Queue Results written to regfile and tags broadcast to IQ Issue Queue (IQ)

Rename A lr1  lr2 + lr3 pr7  pr2 + pr3 B lr2  lr4 + lr5 C lr6  lr1 + lr3 D lr6  lr1 + lr2 RAR lr3 RAW lr1 WAR lr2 WAW lr6 A ; BC ; D pr7  pr2 + pr3 pr8  pr4 + pr5 pr9  pr7 + pr3 pr10  pr7 + pr8 RAR pr3 RAW pr7 WAR x WAW x AB ; CD

Commit Example Map Old / New lr1 pr1 pr7 lr2 pr2 pr8 lr6 pr6 pr9 Assume a processor with 6 logical regs and 10 physical regs Map Old / New lr1 pr1 pr7 lr2 pr2 pr8 lr6 pr6 pr9 lr6 pr9 pr10 lr3 pr3 pr1 lr4 pr4 pr2 A lr1  lr2 + lr3 B lr2  lr4 + lr5 C lr6  lr1 + lr3 D lr6  lr1 + lr2 E lr3  lr6 + lr2 F lr4  lr3 + lr4 pr7  pr2 + pr3 pr8  pr4 + pr5 pr9  pr7 + pr3 pr10  pr7 + pr8 pr1  pr10 + pr8 pr2  pr1 + pr4

Out-of-Order Loads/Stores Ld R1  [R2] Ld R3  [R4] St R5  [R6] Ld R7  [R8] Ld R9[R10]

Memory Dependence Checking Ld 0x abcdef The issue queue checks for register dependences and executes instructions as soon as registers are ready Loads/stores access memory as well – must check for RAW, WAW, and WAR hazards for memory as well Hence, first check for register dependences to compute effective addresses; then check for memory dependences Ld St Ld Ld 0x abcdef St 0x abcd00 Ld 0x abc000 Ld 0x abcd00

Memory Dependence Checking Load and store addresses are maintained in program order in the Load/Store Queue (LSQ) Loads can issue if they are guaranteed to not have true dependences with earlier stores Stores can issue only if we are ready to modify memory (can not recover if an earlier instr raises an exception) Ld 0x abcdef Ld St Ld Ld 0x abcdef St 0x abcd00 Ld 0x abc000 Ld 0x abcd00

The Alpha 21264 Out-of-Order Implementation Reorder Buffer (ROB) Branch prediction and instr fetch Instr 1 Instr 2 Instr 3 Instr 4 Instr 5 Instr 6 Instr 7 Committed Reg Map R1P1 R2P2 Register File P1-P64 R1  R1+R2 R2  R1+R3 BEQZ R2 R3  R1+R2 R1  R3+R2 LD R4  8[R3] ST R4  8[R1] Decode & Rename P33  P1+P2 P34  P33+P3 BEQZ P34 P35  P33+P34 P36  P35+P34 P37  8[P35] P37  8[P36] ALU ALU ALU Speculative Reg Map R1P36 R2P34 Results written to regfile and tags broadcast to IQ Instr Fetch Queue Issue Queue (IQ) ALU P37  [P35 + 8] P37  [P36 + 8] D-Cache LSQ

Speculative Issue Instr I1 leaves the issue queue at start of cycle 6; the instr then reads operands from the regfile, wires are traversed, instruction executes, result is available at end of cycle 8 If operand availability is broadcast to issue queue in cycle 9, dependent instruction leaves in cycle 10 This causes a 4-cycle gap between successive instrs Hence, if we know that the instruction takes a cycle to execute, the operand is broadcast to the issue queue in cycle 6 and the dependent instr leaves issue queue in cycle 7; the input operand is correctly bypassed at the FU

Load Hit Speculation The previous optimization assumes that we know the exact latency for every operation This is true for all ops except loads (cache hit or miss?) Assume hit and schedule accordingly; on a cache miss, must squash all speculatively issued instructions; an instruction therefore sits in the queue until load hits are determined

Register Rename Logic Map Table Physical Source Regs Physical Dest Logical Source Regs Mux Free Pool Logical Dest Regs Dependence Check Logic Logical Source Reg

Shadow copies (shift register) Map Table – RAM 7-bits 7-bits 7-bits 7-bits 7-bits Phys reg id Num entries = Num logical regs Shadow copies (shift register)

Map Table – CAM 5-bits 1-bit 1-bit Logical reg id v a l i d Num entries = Num phys regs Shadow copies

Wakeup Logic . . … = = tag1 tagIW or or rdyL tagL tagR rdyR rdyL tagL

Selection Logic For multiple FUs, will need sequential selectors Issue window req grant enable anyreq Arbiter cell enable For multiple FUs, will need sequential selectors

Structure Complexities Critical structures: register map tables, issue queue, LSQ, register file, register bypass Cycle time is heavily influenced by: window size (physical register size), issue width (#FUs) Conflict between the desire to increase IPC and clock speed

Title Bullet