ECE 2162 Reorder Buffer.

Slides:



Advertisements
Similar presentations
Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Advertisements

1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.
Lec18.1 Step by step for Dynamic Scheduling by reorder buffer Copyright by John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.
CS6290 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.
COMP25212 Advanced Pipelining Out of Order Processors.
ECE 2162 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
1 COMP 206: Computer Architecture and Implementation Montek Singh Mon., Oct. 14, 2002 Topic: Instruction-Level Parallelism (Multiple-Issue, Speculation)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
1 Zvika Guz Slides modified from Prof. Dave Patterson, Prof. John Kubiatowicz, and Prof. Nancy Warter-Perez Out Of Order Execution.
Computer Architecture Lecture 18 Superscalar Processor and High Performance Computing.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture Instruction Execution: Dynamic Scheduling.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
Anshul Kumar, CSE IITD CSL718 : Superscalar Processors Speculative Execution 2nd Feb, 2006.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
Dataflow Order Execution  Use data copying and/or hardware register renaming to eliminate WAR and WAW ­register name refers to a temporary value produced.
COMP25212 Advanced Pipelining Out of Order Processors.
1 Lecture 10: Memory Dependence Detection and Speculation Memory correctness, dynamic memory disambiguation, speculative disambiguation, Alpha Example.
CS203 – Advanced Computer Architecture ILP and Speculation.
Ch2. Instruction-Level Parallelism & Its Exploitation 2. Dynamic Scheduling ECE562/468 Advanced Computer Architecture Prof. Honggang Wang ECE Department.
IBM System 360. Common architecture for a set of machines
COSC3330 Computer Architecture Lecture 14. Branch Prediction
/ Computer Architecture and Design
Out of Order Processors
Dynamic Scheduling and Speculation
Step by step for Tomasulo Scheme
Tomasulo Loop Example Loop: LD F0 0 R1 MULTD F4 F0 F2 SD F4 0 R1
CS203 – Advanced Computer Architecture
CS5100 Advanced Computer Architecture Hardware-Based Speculation
Microprocessor Microarchitecture Dynamic Pipeline
Lecture 12 Reorder Buffers
Advantages of Dynamic Scheduling
Sequential Execution Semantics
High-level view Out-of-order pipeline
11/14/2018 CPE 631 Lecture 10: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenković, Electrical and Computer.
CMSC 611: Advanced Computer Architecture
A Dynamic Algorithm: Tomasulo’s
Out of Order Processors
CS203 – Advanced Computer Architecture
CS203 – Advanced Computer Architecture
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
Lecture 11: Memory Data Flow Techniques
Out-of-Order Execution Scheduling
Adapted from the slides of Prof
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
How to improve (decrease) CPI
Advanced Computer Architecture
Static vs. dynamic scheduling
Tomasulo Organization
Reduction of Data Hazards Stalls with Dynamic Scheduling
CS5100 Advanced Computer Architecture Dynamic Scheduling
Adapted from the slides of Prof
Midterm 2 review Chapter
September 20, 2000 Prof. John Kubiatowicz
CSL718 : Superscalar Processors
Lecture 7 Dynamic Scheduling
Tomasulo Speculative Example
Conceptual execution on a processor which exploits ILP
Presentation transcript:

ECE 2162 Reorder Buffer

Out-of-Order Execution We’re now executing instructions in data-flow order Great! More performance But outside world can’t know about this Must maintain illusion of sequentiality

Remember the Toll Booth? 5s 30s Hands toll-booth agent a $100 bill; takes a while to count the change One-at-a-time = 45s OOO = 30s With a “4-Issue” Toll Booth L1 L2 L3 L4 OOO = Out of Order We’ll add the equivalent of the “shoulder” to the CPU: the Re-Order Buffer (ROB)

Re-Order Buffer (ROB) Separates architected vs. physical registers Tracks program order of all in-flight insts Enables in-order completion or “commit”

Hardware Organization RAT Architected Register File Instruction Buffers ROB “head” Reservation Stations and ALUs op Qj Qk Vj Vk op Qj Qk Vj Vk op Qj Qk Vj Vk op Qj Qk Vj Vk Add op Qj Qk Vj Vk op Qj Qk Vj Vk type dest value fin Mult

Stall issue if any needed resource not available RAT Architected Register File Read inst from inst buffer Check if resources available: Appropriate RS entry ROB entry Read RAT, read (available) sources, update RAT Write to RS and ROB Instruction Buffers ROB “head” Reservation Stations and ALUs op Qj Qk Vj Vk op Qj Qk Vj Vk op Qj Qk Vj Vk op Qj Qk Vj Vk Add op Qj Qk Vj Vk op Qj Qk Vj Vk type dest value fin Mult Stall issue if any needed resource not available

Exec Same as before Wait for all operands to arrive Compete to use functional unit Execute!

Write Result Broadcast result on CDB (any dependents will grab the value) Write result back to your ROB entry The ARF holds the “official” register state, which we will only update in program order Mark ready/finished bit in ROB (note that this inst has completed execution)

New: Commit When an inst is the oldest in the ROB i.e. ROB-head points to it Write result (if ready/finished bit is set) If register producing instruction: write to architected register file If store: write to memory Advance ROB-head to next instruction This is what the outside world sees And it’s all in-order

Commit Illustrated Make instruction execution “visible” to the outside world “Commit” the changes to the architected state ROB WB result Outside World “sees”: A  ARF B  A executed C  B executed D  C executed E  D executed F  E executed G  H  Instructions execute out of program order, but outside world still “believes” it’s in-order J  K 

Revisiting Register Renaming RAT R1 = R2 + R3 R3 = R5 – R6 R1 = R1 * R7 R1 = R4 + R8 R2 = R9 + R3 R1 R1 ROB3 ROB1 ROB4 ROB R2 R2 R3 ROB2 R3 ROB1 R1 R2+R3 … ROB2 R3 R5-R6 ROB3 R1 ROB1*R7 If we issue R2=R9+R3 to the ROB now, R3 comes from ROB2 ROB4 R1 R4+R8 ROB5 R2 R9+ROB2 R2 R9+R3 ROB6 However, if R3=R5-R6 commits first… ROB7 ROB8 Then update RAT so when we issue R2=R9+R3, it will read source from the ARF.

Example Inst Operands Add: 2 cycles Mult: 10 cycles Divide: 40 cycles R2, R3, R4 MUL R1, R5, R6 ADD R3, R7, R8 Sequentially, this would take: 40+10+2+10+2+2 = 66 cycles (+ other pipeline stages) MUL R1, R1, R3 SUB R4, R1, R5 ADD R1, R4, R2 R1 -23 R2 16 R3 45 R4 5 R5 3 R6 4 R7 1 R8 2

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) DIV ROB1 45 5 I E W C ARF RAT ROB 1 1 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ARF1 ROB1 R2 2 R2 ARF2 ROB1 ROB2 3 R3 ARF3 ROB3 4 R4 ARF4 ROB4 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 1 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) DIV ROB1 45 5 MUL ROB2 3 4 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ARF1 ROB2 ROB1 R2 2 2 R2 ROB1 ROB2 R1 3 R3 ARF3 ROB3 4 R4 ARF4 ROB4 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 2 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) ADD ROB3 1 2 DIV ROB1 45 5 MUL ROB2 3 4 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB2 ROB1 R2 2 2 3 R2 ROB1 ROB2 R1 3 3 R3 ARF3 ROB3 ROB3 R3 4 R4 ARF4 ROB4 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 3 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) ADD ROB3 1 2 DIV ROB1 45 5 MUL ROB2 3 4 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB2 ROB1 R2 2 2 3 R2 ROB1 ROB2 R1 3 3 4 R3 ROB3 ROB3 R3 4 R4 ARF4 ROB4 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 4 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) ADD ROB3 1 2 DIV ROB1 45 5 MUL ROB2 3 4 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB2 ROB1 R2 2 2 3 R2 ROB1 ROB2 R1 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 R4 ARF4 ROB4 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 6 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) DIV ROB1 45 5 MUL ROB2 3 4 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB2 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 R4 ARF4 ROB4 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 13 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) DIV ROB1 45 5 MUL ROB4 12 3 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB2 ROB4 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 R4 ARF4 ROB4 R1 5 R5 ARF5 ROB5 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 14 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) SUB ROB5 ROB4 3 DIV ROB1 45 5 MUL ROB4 12 3 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB4 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 R4 ARF4 ROB5 ROB4 R1 5 15 R5 ARF5 ROB5 R4 6 R6 ARF6 ROB6 R7 ARF7 Cycle: 15 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) SUB ROB5 ROB4 3 DIV ROB1 45 5 ADD ROB6 ROB5 ROB1 MUL ROB4 12 3 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB4 ROB6 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 R4 ROB5 ROB4 R1 5 15 R5 ARF5 ROB5 R4 6 16 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: 16 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) SUB ROB5 ROB4 3 DIV ROB1 45 5 ADD ROB6 ROB5 ROB1 MUL ROB4 12 3 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 R4 ROB5 ROB4 R1 5 15 R5 ARF5 ROB5 R4 6 16 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) SUB ROB5 36 3 DIV ROB1 45 5 ADD ROB6 ROB5 ROB1 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 R5 ARF5 ROB5 R4 6 16 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: 26 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) SUB ROB5 36 3 DIV ROB1 45 5 ADD ROB6 ROB5 ROB1 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 26 R5 ARF5 ROB5 R4 6 16 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: 28 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) DIV ROB1 45 5 ADD ROB6 ROB1 33 I E W C ARF RAT ROB 1 1 2 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 R2 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 26 28 R5 ARF5 ROB5 R4 33 Y 6 16 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) ADD ROB6 33 9 I E W C ARF RAT ROB 1 1 2 42 -23 16 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 R2 9 Y 2 2 3 13 R2 ROB1 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 26 28 R5 ARF5 ROB5 R4 33 Y 6 16 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: 43 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) ADD ROB6 33 9 I E W C ARF RAT ROB 1 1 2 42 43 -23 9 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 2 2 3 13 R2 ARF2 ROB2 R1 12 Y 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 26 28 R5 ARF5 ROB5 R4 33 Y 6 16 43 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: 44 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) ADD ROB6 33 9 I E W C ARF RAT ROB 1 1 2 42 43 12 9 45 5 3 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 2 2 3 13 44 R2 ARF2 ROB2 3 3 4 6 R3 ROB3 ROB3 R3 3 Y 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 26 28 R5 ARF5 ROB5 R4 33 Y 6 16 43 R6 ARF6 ROB6 R1 R7 ARF7 Cycle: 45 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) I E W C ARF RAT ROB 1 1 2 42 43 12 9 3 5 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 2 2 3 13 44 R2 ARF2 ROB2 3 3 4 6 45 R3 ARF3 ROB3 4 14 15 25 R4 ROB5 ROB4 R1 36 Y 5 15 26 28 R5 ARF5 ROB5 R4 33 Y 6 16 43 45 R6 ARF6 ROB6 R1 42 Y R7 ARF7 Cycle: 46 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) I E W C ARF RAT ROB 1 1 2 42 43 36 9 3 5 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 2 2 3 13 44 R2 ARF2 ROB2 3 3 4 6 45 R3 ARF3 ROB3 4 14 15 25 46 R4 ROB5 ROB4 5 15 26 28 R5 ARF5 ROB5 R4 33 Y 6 16 43 45 R6 ARF6 ROB6 R1 42 Y R7 ARF7 Cycle: 47 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) I E W C ARF RAT ROB 1 1 2 42 43 36 9 3 33 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ROB6 ROB1 2 2 3 13 44 R2 ARF2 ROB2 3 3 4 6 45 R3 ARF3 ROB3 4 14 15 25 46 R4 ARF4 ROB4 5 15 26 28 47 R5 ARF5 ROB5 6 16 43 45 R6 ARF6 ROB6 R1 42 Y R7 ARF7 Cycle: 48 R8 ARF8

In Detail RS fields: ROB fields: RS (Adder) RS (Mul/Div) I E W C ARF R2, R3, R4 DIV Operands Inst R1, R5, R6 MUL R3, R7, R8 ADD R1, R1, R3 R4, R1, R5 SUB R1, R4, R2 1 2 3 4 5 6 RS fields: Op Dst-Tag Tag1 Tag2 Val1 Val2 ROB fields: Type Dest Value Finished RS (Adder) RS (Mul/Div) I E W C ARF RAT ROB 1 1 2 42 43 42 9 3 33 4 1 2 R1 R2 R3 R4 R5 R6 R7 R8 R1 ARF1 ROB1 2 2 3 13 44 R2 ARF2 ROB2 3 3 4 6 45 R3 ARF3 ROB3 4 14 15 25 46 R4 ARF4 ROB4 5 15 26 28 47 R5 ARF5 ROB5 6 16 43 45 48 R6 ARF6 ROB6 R7 ARF7 Cycle: 48 R8 ARF8

Timing Example Assume you can bypass and execute in the same cycle Add: 1 cycles Mult: 10 cycles Divide: 40 cycles Assume you can bypass and execute in the same cycle Commit Inst Operands Is Exec Wr Comments DIV R2, R3, R4 MUL R1, R5, R6 ADD R3, R7, R8 MUL R1, R1, R3 SUB R4, R1, R5 ADD R1, R4, R2

Unified Reservation Stations (1) If MULT RS’s are full, and we need to issue another MULT, then we have to stall But there may be other RS’s (e.g., add) that are available Proper number of RS’s per ALU needs to be matched to the program’s inst distribution But different programs have different distributions

Unified Reservation Stations (2) RS (Adder) RS (Mul/Div) Add Mult Add Mult RS (Unified) Can hold 5 adds Or 5 multiplies Or any combination

Unified Reservation Stations (3) Arbitration and execution paths a little more complex (not too bad though) For N functional units, RS needs To support N read ports RS Entries ADD rdy Adder Arbiter Multiplier Arbiter Add MUL x ADD x Mult ADD rdy MUL rdy MUL x ADD x Ready add’s compete for execution unit ADD rdy Ready mul’s do the same Arbitration logic picks insts to execute Insts go and do their thing

Out-of-Order, but not Superscalar As described, this Tomasulo (+ROB) CPU can only maintain sustained throughput of 1 IPC Limitations: Need superscalar fetch, decode, etc. There’s only one CDB, so only one inst per cycle can write-back its result to the ROB Also must commit > 1 IPC

Getting > 1 IPC Must be able to issue > 1 IPC to RS/ROB Must be able to send > 1 IPC to functional units original Tomasulo can do this already (if inst ready and FU available, go and execute!) Must be able to write-back > 1 IPC to ROB (and reservation stations)

Dual-Issue Need to check resource availability for two instructions (RS/ROB entries) Depending on resources, may issue 0,1 or 2 Read RAT/ARF/ROB for operands Renaming is a little trickier (next slide) Update RS/ROB entries (not too hard)

Dual-Rename X RAT All registers renamed! Read current mappings for all operands R1 ARF1 ROB21 ROB17, ARF3 R1 = R2 + R3 R2 ROB17 R4 = R2 – R4 R3 ARF3 ROB22 ROB17, ROB6 R4 ROB6 New destinations are just in the next two ROB entries To be allocated: ROB-tail, ROB-tail + 1 R1 = R2 + R3 R4 = R1 – R4 ARF1 ROB17 ARF3 ROB6 R1 R2 R3 R4 RAT ROB21 ROB17, ARF3 X ROB22 ARF1, ROB6 dst1 From RAT New ROB entry From RAT New ROB entry Need to check for RAW dependencies within your issue group src2L src2R =? =?

Multiple CDB’s It works (we do this today in CPUs) But there’s a cost RS Entries It works (we do this today in CPUs) But there’s a cost each RS entry must compare each source with each CDB more area, logic, power (all  $$$) FU D1, ROB3 ROB4 E1, ROB4 A2, ROB10 FU Arbiters FU A1, ROB7 ROB7 B1, ROB19 FU C1, ROB2 CDB1 CDB2

Committing > 1 IPC Must be able to write-back multiple results from ROB  ARF (or memory for stores) ROB needs extra read ports ARF needs extra write ports

Terminology is Inconsistent, Overloaded Issue, Dispatch, Commit, etc. Text uses terms w.r.t. Tomasulo’s algorithm Other usage is different (many academics) Issue/Alloc/Dispatch Exec/Issue/Dispatch Commit/Complete/Retire/Graduate Blue: Intel’s convention Orange: some academics