Lecture: Out-of-order Processors


Lecture: Out-of-order Processors Topics: more ooo design details, timing, load-store queue

Problem 3
Show the renamed version of the following code: Assume that you have 36 physical registers and 32 architected registers. When does each instr leave the IQ?
    R1 ← R2 + R3
    R1 ← R1 + R5
    BEQZ R1
    R1 ← R4 + R5
    R4 ← R1 + R7
    R1 ← R6 + R8
    R4 ← R3 + R1
    R1 ← R5 + R9

Problem 3
Show the renamed version of the following code: Assume that you have 36 physical registers and 32 architected registers. When does each instr leave the IQ?
    R1 ← R2 + R3        P33 ← P2 + P3
    R1 ← R1 + R5        P34 ← P33 + P5
    BEQZ R1             BEQZ P34
    R1 ← R4 + R5        P35 ← P4 + P5
    R4 ← R1 + R7        P36 ← P35 + P7
    R1 ← R6 + R8        P1  ← P6 + P8
    R4 ← R3 + R1        P33 ← P3 + P1
    R1 ← R5 + R9        P34 ← P5 + P9

Problem 3
Show the renamed version of the following code: Assume that you have 36 physical registers and 32 architected registers. When does each instr leave the IQ?
    R1 ← R2 + R3        P33 ← P2 + P3      cycle i
    R1 ← R1 + R5        P34 ← P33 + P5     i+1
    BEQZ R1             BEQZ P34           i+2
    R1 ← R4 + R5        P35 ← P4 + P5      i
    R4 ← R1 + R7        P36 ← P35 + P7     i+1
    R1 ← R6 + R8        P1  ← P6 + P8      j
    R4 ← R3 + R1        P33 ← P3 + P1      j+1
    R1 ← R5 + R9        P34 ← P5 + P9      j+2
Width is assumed to be 4. j depends on the #stages between issue and commit.
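The rename pattern above can be reproduced with a small free-list model. Below is a minimal Python sketch (not from the slides): the 32 architected registers start out mapped to P1-P32, P33-P36 form the free list, and an overwritten mapping is recycled only after the overwriting instruction commits. It assumes that by the time the free list runs dry, the oldest such instruction has already committed (that is exactly the 'j' delay noted above).

    from collections import deque

    NUM_ARCH, NUM_PHYS = 32, 36
    map_table = {f"R{i}": f"P{i}" for i in range(1, NUM_ARCH + 1)}   # speculative map
    free_list = deque(f"P{i}" for i in range(NUM_ARCH + 1, NUM_PHYS + 1))
    pending_free = deque()   # old mappings; each is freed when its overwriter commits

    def rename(dst, srcs):
        """Rename one instruction: sources read the map table, the destination
        grabs a fresh physical register from the free list."""
        psrcs = [map_table[s] for s in srcs]
        if not free_list:
            # Assumption: the oldest overwriting instruction has already committed,
            # so its previous mapping can be recycled (the 'j' delay on the slide).
            free_list.append(pending_free.popleft())
        pdst = free_list.popleft()
        pending_free.append(map_table[dst])      # this mapping dies at commit time
        map_table[dst] = pdst
        return pdst, psrcs

    code = [("R1", ["R2", "R3"]), ("R1", ["R1", "R5"]), (None, ["R1"]),   # BEQZ R1
            ("R1", ["R4", "R5"]), ("R4", ["R1", "R7"]), ("R1", ["R6", "R8"]),
            ("R4", ["R3", "R1"]), ("R1", ["R5", "R9"])]

    for dst, srcs in code:
        if dst is None:                                   # branch: no destination
            print("BEQZ", srcs[0], "  =>  BEQZ", map_table[srcs[0]])
        else:
            pdst, psrcs = rename(dst, srcs)
            print(dst, "<-", "+".join(srcs), "  =>  ", pdst, "<-", "+".join(psrcs))

Running it prints the same renamed code as the slide: once the four fresh registers are used up, P1, P33, and P34 are reused in that order.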

OOO Example
Assume there are 36 physical registers and 32 logical registers, and width is 4.
Estimate the issue time, completion time, and commit time for the sample code.

Assumptions
 Perfect branch prediction, instruction fetch, caches
 ADD → dep has no stall; LD → dep has one stall
 An instr is placed in the IQ at the end of its 5th stage; an instr takes 5 more stages after leaving the IQ (ld/st instrs take 6 more stages after leaving the IQ)
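Under these assumptions, the earliest cycle a dependent can issue is the producer's issue cycle plus one, plus the one-cycle load-use stall if the producer is a load. A tiny hedged sketch (illustrative, not from the slides):

    EXTRA_STALL = {"ADD": 0, "SUB": 0, "LD": 1, "ST": 0}   # load-use adds one stall

    def earliest_issue(producer_op, producer_issue_cycle):
        """Earliest cycle a dependent can issue, assuming full bypassing."""
        return producer_issue_cycle + 1 + EXTRA_STALL[producer_op]

    # Matches the example on the next slides: the ADD issues in cycle i+1, so its
    # dependent LD can issue in i+2; that LD's dependent can then issue in i+4.
    i = 0
    print(earliest_issue("ADD", i + 1))   # i+2
    print(earliest_issue("LD",  i + 2))   # i+4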

OOO Example
    Original code         Renamed code
    ADD R1, R2, R3
    LD  R2, 8(R1)
    ADD R2, R2, 8
    ST  R1, (R3)
    SUB R1, R1, R5
    LD  R1, 8(R2)
    ADD R1, R1, R2

OOO Example
    Original code         Renamed code
    ADD R1, R2, R3        ADD P33, P2, P3
    LD  R2, 8(R1)         LD  P34, 8(P33)
    ADD R2, R2, 8         ADD P35, P34, 8
    ST  R1, (R3)          ST  P33, (P3)
    SUB R1, R1, R5        SUB P36, P33, P5
    LD  R1, 8(R2)         Must wait
    ADD R1, R1, R2

OOO Example
    Original code         Renamed code          InQ   Iss   Comp   Comm
    ADD R1, R2, R3        ADD P33, P2, P3
    LD  R2, 8(R1)         LD  P34, 8(P33)
    ADD R2, R2, 8         ADD P35, P34, 8
    ST  R1, (R3)          ST  P33, (P3)
    SUB R1, R1, R5        SUB P36, P33, P5
    LD  R1, 8(R2)
    ADD R1, R1, R2

OOO Example
    Original code         Renamed code          InQ   Iss   Comp   Comm
    ADD R1, R2, R3        ADD P33, P2, P3       i     i+1   i+6    i+6
    LD  R2, 8(R1)         LD  P34, 8(P33)       i     i+2   i+8    i+8
    ADD R2, R2, 8         ADD P35, P34, 8       i     i+4   i+9    i+9
    ST  R1, (R3)          ST  P33, (P3)         i     i+2   i+8    i+9
    SUB R1, R1, R5        SUB P36, P33, P5      i+1   i+2   i+7    i+9
    LD  R1, 8(R2)
    ADD R1, R1, R2

OOO Example
    Original code         Renamed code          InQ   Iss   Comp   Comm
    ADD R1, R2, R3        ADD P33, P2, P3       i     i+1   i+6    i+6
    LD  R2, 8(R1)         LD  P34, 8(P33)       i     i+2   i+8    i+8
    ADD R2, R2, 8         ADD P35, P34, 8       i     i+4   i+9    i+9
    ST  R1, (R3)          ST  P33, (P3)         i     i+2   i+8    i+9
    SUB R1, R1, R5        SUB P36, P33, P5      i+1   i+2   i+7    i+9
    LD  R1, 8(R2)         LD  P1, 8(P35)        i+7   i+8   i+14   i+14
    ADD R1, R1, R2        ADD P2, P1, P35       i+9   i+10  i+15   i+15

The Alpha 21264 Out-of-Order Implementation
[Diagram: branch prediction and instruction fetch feed an instruction fetch queue; decode & rename (speculative register map, e.g. R1→P36, R2→P34, and committed register map, e.g. R1→P1, R2→P2) places Instr 1-6 into the reorder buffer (ROB) and the issue queue (IQ); the IQ issues to the ALUs; results are written to the register file (P1-P64) and tags are broadcast to the IQ. Example code R1←R1+R2, R2←R1+R3, BEQZ R2, R3←R1+R2, R1←R3+R2 is renamed to P33←P1+P2, P34←P33+P3, BEQZ P34, P35←P33+P34, P36←P35+P34.]

Additional Details
 When does the decode stage stall? When we run out of registers, ROB entries, or issue queue entries
 Issue width: the number of instructions handled by each stage in a cycle. High issue width → high peak ILP
 Window size: the number of in-flight instructions in the pipeline. Large window size → high ILP
 No more WAR and WAW hazards because of rename registers; we must only worry about RAW hazards
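A hedged sketch of the dispatch-stall check described in the first bullet (the function and its arguments are illustrative, not from any real design):

    def must_stall(free_phys_regs, free_rob_entries, free_iq_entries):
        """True if decode/rename must stall this cycle for the next instruction."""
        return (free_phys_regs == 0       # rename needs a fresh physical register
                or free_rob_entries == 0  # every instruction needs a ROB slot
                or free_iq_entries == 0)  # ...and an issue-queue slot

    print(must_stall(0, 5, 3))   # True: out of physical registers
    print(must_stall(2, 5, 3))   # False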

Branch Mispredict Recovery
 On a branch mispredict, must roll back the processor state: throw away IFQ contents and the ROB/IQ contents after the branch
 The committed map table is correct and need not be fixed
 The speculative map table needs to go back to an earlier state
 To facilitate this rollback, the speculative map table is checkpointed at every branch
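A minimal sketch of this checkpoint-and-restore scheme (the data structures and the sequence-number field are illustrative assumptions, not a specific processor's design):

    def checkpoint(spec_map, checkpoints, branch_seq):
        """Snapshot the speculative map table when a branch is renamed."""
        checkpoints[branch_seq] = dict(spec_map)

    def recover(spec_map, checkpoints, branch_seq, ifq, rob, iq):
        """Roll back to the checkpoint taken at the mispredicted branch."""
        spec_map.clear()
        spec_map.update(checkpoints[branch_seq])  # speculative map rolls back
        ifq.clear()                               # discard wrong-path fetches
        # Squash ROB/IQ entries younger than the branch (entries carry a
        # sequence number); the committed map table is already correct.
        rob[:] = [e for e in rob if e["seq"] <= branch_seq]
        iq[:]  = [e for e in iq  if e["seq"] <= branch_seq]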

Waking Up a Dependent
 In an in-order pipeline, an instruction leaves the decode stage when it is known that its inputs can be correctly received, not when the inputs are computed
 Similarly, an instruction leaves the issue queue before its inputs are known, i.e., wakeup is speculative, based on the expected latency of the producer instruction
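A small sketch of latency-based speculative wakeup (the latency numbers and names are illustrative; a real scheduler would also replay dependents when the prediction is wrong, e.g. on a cache miss):

    import heapq

    EXPECTED_LATENCY = {"ADD": 1, "MUL": 3, "LD": 2}   # illustrative latencies

    wakeup_events = []            # min-heap of (cycle, destination tag)

    def on_issue(op, dest_tag, issue_cycle):
        """When a producer issues, schedule its tag broadcast for the cycle in
        which its result is *expected* to be available."""
        heapq.heappush(wakeup_events, (issue_cycle + EXPECTED_LATENCY[op], dest_tag))

    def wakeups_for(cycle):
        """Tags whose consumers may issue this cycle, assuming the latency
        prediction holds (a load that misses would force a replay instead)."""
        ready = []
        while wakeup_events and wakeup_events[0][0] <= cycle:
            ready.append(heapq.heappop(wakeup_events)[1])
        return ready

    on_issue("ADD", "P33", issue_cycle=1)
    on_issue("LD",  "P34", issue_cycle=2)
    print(wakeups_for(2))   # ['P33'] -> an ADD's consumer can issue back-to-back
    print(wakeups_for(4))   # ['P34'] -> a load's consumer sees one extra stall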

Out-of-Order Loads/Stores
    Ld R1 ← [R2]
    Ld R3 ← [R4]
    St R5 → [R6]
    Ld R7 ← [R8]
    Ld R9 ← [R10]
 What if the issue queue also had load/store instructions? Can we continue executing instructions out-of-order?

Memory Dependence Checking
 The issue queue checks for register dependences and executes instructions as soon as registers are ready
 Loads/stores access memory as well, so we must also check for RAW, WAW, and WAR hazards through memory
 Hence, first check for register dependences to compute effective addresses; then check for memory dependences
    Ld 0xabcdef
    Ld 0xabcdef
    St 0xabcd00
    Ld 0xabc000
    Ld 0xabcd00

Memory Dependence Checking
 Load and store addresses are maintained in program order in the Load/Store Queue (LSQ)
 Loads can issue if they are guaranteed to not have true dependences with earlier stores
 Stores can issue only when we are ready to modify memory (we cannot recover if an earlier instr raises an exception); this happens at commit
    Ld 0xabcdef
    Ld 0xabcdef
    St 0xabcd00
    Ld 0xabc000
    Ld 0xabcd00
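A minimal restatement of these rules in code (a sketch under the assumptions above, not a real LSQ implementation): a load scans older stores from nearest to farthest, stalls on any unknown address, and can take its value from the nearest matching store once that store's data is ready. Stores themselves write memory only at commit.

    def load_may_access(load_addr, older_stores):
        """older_stores: program-ordered (oldest first) list of dicts with an
        'addr' field (None if the address is not yet computed) and a
        'data_ready' flag."""
        for st in reversed(older_stores):        # nearest older store first
            if st["addr"] is None:
                return False                     # unknown address: must wait
            if st["addr"] == load_addr:
                return st["data_ready"]          # forward from the store queue
        return True                              # no conflict: go to the cache

    # Example using the LSQ contents above: the load to 0xabc000 conflicts with
    # no store, but the load to 0xabcd00 must wait for the store's data.
    print(load_may_access(0xABC000, [{"addr": 0xABCD00, "data_ready": False}]))  # True
    print(load_may_access(0xABCD00, [{"addr": 0xABCD00, "data_ready": False}]))  # False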

The Alpha 21264 Out-of-Order Implementation
[Diagram: the same pipeline as before, extended with loads and stores. Branch prediction and instruction fetch feed the instruction fetch queue; decode & rename places Instr 1-7 into the reorder buffer (ROB) and the issue queue (IQ); the IQ issues to the ALUs, the LSQ, and the D-Cache; results are written to the register file (P1-P64) and tags are broadcast to the IQ. Example code R1←R1+R2, R2←R1+R3, BEQZ R2, R3←R1+R2, R1←R3+R2, LD R4←8[R3], ST R4→8[R1] is renamed to P33←P1+P2, P34←P33+P3, BEQZ P34, P35←P33+P34, P36←P35+P34, P37←[P35+8], ST P37→[P36+8]; the last two occupy LSQ entries.]

Problem 2
Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.
(Ad.Op: cycle the address operand is ready; St.Op: cycle the store data is ready; Ad.Val: effective address; Ad.Cal: cycle the address is calculated; Mem.Acc: cycle memory is accessed.)
                      Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
    LD R1  ← [R2]       3              abcd
    LD R3  ← [R4]       6              adde
    ST R5  → [R6]       4       7      abba
    LD R7  ← [R8]       2              abce
    ST R9  → [R10]      8       3      abba
    LD R11 ← [R12]      1              abba

Problem 2
Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.
                      Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
    LD R1  ← [R2]       3              abcd      4        5
    LD R3  ← [R4]       6              adde      7        8
    ST R5  → [R6]       4       7      abba      5        commit
    LD R7  ← [R8]       2              abce      3        6
    ST R9  → [R10]      8       3      abba      9        commit
    LD R11 ← [R12]      1              abba      2        10

Problem 3
Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.
                      Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
    LD R1  ← [R2]       3              abcd
    LD R3  ← [R4]       6              adde
    ST R5  → [R6]       5       7      abba
    LD R7  ← [R8]       2              abce
    ST R9  → [R10]      1       4      abba
    LD R11 ← [R12]      2              abba

Problem 3
Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume no memory dependence prediction.
                      Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
    LD R1  ← [R2]       3              abcd      4        5
    LD R3  ← [R4]       6              adde      7        8
    ST R5  → [R6]       5       7      abba      6        commit
    LD R7  ← [R8]       2              abce      3        7
    ST R9  → [R10]      1       4      abba      2        commit
    LD R11 ← [R12]      2              abba      3        5
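The tables in Problems 2 and 3 follow from three rules: an address is computed one cycle after its operand is ready, a store only touches memory at commit, and a load scans older stores from nearest to farthest, waiting for any unknown address and forwarding from the first address match. A hedged Python sketch of that timing rule (the cycle numbers come from the slides; the code structure is illustrative) reproduces both tables:

    def lsq_timing(entries):
        """entries: program-ordered (kind, addr_operand_ready, store_data_ready,
        address) tuples; returns (Ad.Cal, Mem.Acc) for each entry."""
        results = []
        for i, (kind, ad_op, st_op, addr) in enumerate(entries):
            ad_cal = ad_op + 1                      # address computed a cycle later
            if kind == "ST":
                results.append((ad_cal, "commit"))  # stores write memory at commit
                continue
            t = ad_cal                              # earliest this load could go
            for k in range(i - 1, -1, -1):          # scan nearest older store first
                if entries[k][0] != "ST":
                    continue
                st_ad_cal = entries[k][1] + 1
                t = max(t, st_ad_cal)               # must see this store's address
                if entries[k][3] == addr:           # address match: take its data
                    t = max(t, entries[k][2])
                    break
            results.append((ad_cal, t + 1))
        return results

    problem2 = [("LD", 3, None, "abcd"), ("LD", 6, None, "adde"),
                ("ST", 4, 7,    "abba"), ("LD", 2, None, "abce"),
                ("ST", 8, 3,    "abba"), ("LD", 1, None, "abba")]
    print(lsq_timing(problem2))
    # [(4, 5), (7, 8), (5, 'commit'), (3, 6), (9, 'commit'), (2, 10)]
    # Plugging in Problem 3's operand-ready cycles reproduces its table as well.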

Problem 4
Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction (loads are predicted to not conflict with earlier stores).
                      Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
    LD R1  ← [R2]       3              abcd
    LD R3  ← [R4]       6              adde
    ST R5  → [R6]       4       7      abba
    LD R7  ← [R8]       2              abce
    ST R9  → [R10]      8       3      abba
    LD R11 ← [R12]      1              abba

Problem 4
Consider the following LSQ and when operands are available. Estimate when the address calculation and memory accesses happen for each ld/st. Assume memory dependence prediction.
                      Ad.Op   St.Op   Ad.Val   Ad.Cal   Mem.Acc
    LD R1  ← [R2]       3              abcd      4        5
    LD R3  ← [R4]       6              adde      7        8
    ST R5  → [R6]       4       7      abba      5        commit
    LD R7  ← [R8]       2              abce      3        4
    ST R9  → [R10]      8       3      abba      9        commit
    LD R11 ← [R12]      1              abba      2        3/10
The last load accesses memory speculatively in cycle 3; when the earlier store to the same address computes its address in cycle 9, the load is detected as premature and replays its access in cycle 10.
