CSE 502: Computer Architecture

Slides:

Advertisements

Similar presentations

Spring 2003CSE P5481 Out-of-Order Execution Several implementations out-of-order completion CDC 6600 with scoreboarding IBM 360/91 with Tomasulos algorithm.

Advertisements

CMSC 611: Advanced Computer Architecture Tomasulo Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.

1 Lecture 11: Modern Superscalar Processor Models Generic Superscalar Models, Issue Queue-based Pipeline, Multiple-Issue Design.

Computer Structure 2014 – Out-Of-Order Execution 1 Computer Structure Out-Of-Order Execution Lihu Rappoport and Adi Yoaz.

A scheme to overcome data hazards

CPE 731 Advanced Computer Architecture ILP: Part V – Multiple Issue Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University of.

Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.

1 Lecture: Out-of-order Processors Topics: out-of-order implementations with issue queue, register renaming, and reorder buffer, timing, LSQ.

CS6290 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.

Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.

Computer Organization and Architecture (AT70.01) Comp. Sc. and Inf. Mgmt. Asian Institute of Technology Instructor: Dr. Sumanta Guha Slide Sources: Based.

COMP25212 Advanced Pipelining Out of Order Processors.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.

Lecture 4 Slide 1 EECS 470 Portions © Austin, Brehob, Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Tyson, Vijaykumar, Wenisch EECS 470.

Duke Compsci 220 / ECE 252 Advanced Computer Architecture I

ECE 2162 Tomasulo’s Algorithm. Implementing Dynamic Scheduling Tomasulo’s Algorithm –Used in IBM 360/91 (in the 60s) –Tracks when operands are available.

Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.

Instruction-Level Parallelism (ILP)

Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)

1 Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Review of CS 203A Laxmi Narayan Bhuyan Lecture2.

1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.

1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.

1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.

Instruction-Level Parallelism Dynamic Scheduling

1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.

1 Lecture 7: Speculative Execution and Recovery using Reorder Buffer Branch prediction and speculative execution, precise interrupt, reorder buffer.

© Wen-mei Hwu and S. J. Patel, 2005 ECE 412, University of Illinois Lecture Instruction Execution: Dynamic Scheduling.

Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.

1 Lecture 7: Speculative Execution and Recovery Branch prediction and speculative execution, precise interrupt, reorder buffer.

04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;

Dataflow Order Execution  Use data copying and/or hardware register renaming to eliminate WAR and WAW register name refers to a temporary value produced.

CS203 – Advanced Computer Architecture ILP and Speculation.

IBM System 360. Common architecture for a set of machines

Dynamic Scheduling Why go out of style?

CSL718 : Superscalar Processors

/ Computer Architecture and Design

Tomasulo’s Algorithm Born of necessity

Out of Order Processors

CS203 – Advanced Computer Architecture

Lecture: Out-of-order Processors

Microprocessor Microarchitecture Dynamic Pipeline

Advantages of Dynamic Scheduling

Sequential Execution Semantics

High-level view Out-of-order pipeline

CMSC 611: Advanced Computer Architecture

Lecture 6: Advanced Pipelines

A Dynamic Algorithm: Tomasulo’s

Lecture 16: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Out of Order Processors

Lecture 10: Out-of-order Processors

Lecture 11: Out-of-order Processors

Lecture: Out-of-order Processors

Lecture 18: Core Design Today: basics of implementing a correct ooo core: register renaming, commit, LSQ, issue queue.

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

ECE 2162 Reorder Buffer.

Lecture 11: Memory Data Flow Techniques

Out-of-Order Execution Scheduling

Lecture: Out-of-order Processors

Lecture 8: Dynamic ILP Topics: out-of-order processors

15-740/ Computer Architecture Lecture 5: Precise Exceptions

Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)

Static vs. dynamic scheduling

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution

September 20, 2000 Prof. John Kubiatowicz

High-level view Out-of-order pipeline

Lecture 10: ILP Innovations

Lecture 9: ILP Innovations

Lecture 9: Dynamic ILP Topics: out-of-order processors

Conceptual execution on a processor which exploits ILP

Presentation transcript:

CSE 502: Computer Architecture Out-of-Order Execution and Register Rename

In Search of Parallelism “Trivial” Parallelism is limited What is trivial parallelism? In-order: sequential instructions do not have dependencies In all previous cases, all insns. executed with or after earlier insns. Superscalar execution quickly hits a ceiling due to deps. So what is “non-trivial” parallelism? …

Instruction-Level Parallelism (ILP) ILP is a measure of inter-dependencies between insns. Average ILP = num. instruction / num. cyc required code1: ILP = 1 i.e. must execute serially code2: ILP = 3 i.e. can execute at the same time code1: r1  r2 + 1 r3  r1 / 17 r4  r0 - r3 code2: r1  r2 + 1 r3  r9 / 17 r4  r0 - r10

The Problem with In-Order Pipelines 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 addf f0,f1,f2 F D E+ W mulf f2,f3,f2 d* E* subf f0,f1,f4 p* What’s happening in cycle 4? mulf stalls due to RAW hazard OK, this is a fundamental problem subf stalls due to pipeline hazard Why? subf can’t proceed into D because mulf is there That is the only reason, and it isn’t a fundamental one Why can’t subf go to D in cycle 4 and E+ in cycle 5?

ILP != IPC ILP usually assumes Infinite resources Perfect fetch Unit-latency for all instructions ILP is a property of the program dataflow IPC is the “real” observed metric How many insns. are executed per cycle ILP is an upper-bound on the attainable IPC Specific to a particular program

OoO Execution (1/3) Dynamic scheduling Totally in the hardware Also called Out-of-Order execution (OoO) Fetch many instructions into instruction window Use branch prediction to speculate past branches Rename regs. to avoid false deps. (WAW and WAR) Execute insns. as soon as possible As soon as deps. (regs and memory) are known Today’s machines: 100+ insns. scheduling window

Out-of-Order Execution (2/3) Execute insns. in dataflow order Often similar but not the same as program order Use register renaming removes false deps. Scheduler identifies when to run insns. Wait for all deps. to be satisfied

Out-of-Order Execution (3/3) Fetch Dynamic Instruction Stream Rename Renamed Instruction Stream Schedule Dynamically Scheduled Instructions Static Program Just a pictoral view of the different steps Out-of-order = out of the original sequential order

OoO Example (1/2) A: R1 = R2 + R3 B: R4 = R5 + R6 C: R1 = R1 * R4 D: R7 = LD 0[R1] E: BEQZ R7, +32 F: R4 = R7 - 3 G: R1 = R1 + 1 H: R4  ST 0[R1] J: R1 = R1 – 1 K: R3  ST 0[R1] Cycle 1: A B C 2: D 3: 4: 5: A B C IPC = 10/8 = 1.25 D G E F J E F 6: G H K This example is about IPC, not ILP. H J K 7: 8:

OoO Example (2/2) A: R1 = R2 + R3 B: R4 = R5 + R6 C: R1 = R1 * R4 D: R9 = LD 0[R1] E: BEQZ R7, +32 F: R4 = R7 - 3 G: R1 = R1 + 1 H: R4  ST 0[R9] J: R1 = R9 – 1 K: R3  ST 0[R1] Cycle 1: A B C 2: D 3: 4: 5: E F G A B E C D F G H J H J 6: K K 7: IPC = 10/7 = 1.43

Superscalar != Out-of-Order cache miss 1-wide In-Order A cache miss 2-wide In-Order A 1-wide Out-of-Order A cache miss 2-wide Out-of-Order A: R1 = Load 16[R2] B: R3 = R1 + R4 C: R6 = Load 8[R9] D: R5 = R2 – 4 E: R7 = Load 20[R5] F: R4 = R4 – 1 G: BEQ R4, #0 C D E F G B 5 cycles cache miss C D E B C D E F G 10 cycles B C D E F G 8 cycles B F G 7 cycles Superscalar/In-order is not uncommon; Out-of-order/Single-Issue is not common (but possible). A C D F B E G

Example Pipeline Terminology In-order pipeline F: Fetch D: Decode X: Execute W: Writeback regfile D$ I$ BP

Example Pipeline Diagram Alternative pipeline diagram Down: insns Across: pipeline stages In boxes: cycles Basically: stages  cycles Convenient for out-of-order Insn D X W ldf X(r1),f1 c1 c2 c3 mulf f0,f1,f2 c4+ c7 stf f2,Z(r1) c8 c9 addi r1,4,r1 c10 c11 c12 c13+ c16 c17 c18

Instruction Buffer insn buffer regfile I$ D$ BP Trick: instruction buffer (a.k.a. instruction window) A bunch of registers for holding insns. Split D into two parts Accumulate decoded insns. in buffer in-order Buffer sends insns. down rest of pipeline out-of-order

Dispatch and Issue Dispatch (D): first part of decode insn buffer regfile I$ D$ BP Dispatch (D): first part of decode Allocate slot in insn. buffer (if buffer is not full) In order: blocks younger insns. Issue (S): second part of decode Send insns. from insn. buffer to execution units Out-of-order: doesn’t block younger insns.

Dispatch and Issue with Floating-Point insn buffer regfile I$ D$ BP E* E* E* E + E + E/ F-regfile Number of pipeline stages per FU can vary

Our-of-Order Topics “Scoreboarding” “Tomasulo’s algorithm” First OoO, no register renaming “Tomasulo’s algorithm” OoO with register renaming Handling precise state and speculation P6-style execution (Intel Pentium Pro) R10k-style execution (MIPS R10k) Handling memory dependencies

In-Order Issue, OoO Completion Inst. Stream Execution Begins In-order INT Fadd1 Fmul1 Ld/St Fadd2 Fmul2 Issue stage needs to check: 1. Structural Dependence 2. RAW Hazard 3. WAW Hazard 4. WAR Hazard Fmul3 Out-of-order Completion Issue = send an instruction to execution

Track with Simple Scoreboarding Scoreboard: a bit-array, 1-bit for each GPR If the bit is not set: the register has valid data If the bit is set: the register has stale data i.e., some outstanding instruction is going to change it Issue in Order: RD  Fn (RS, RT) If SB[RS] or SB[RT] is set  RAW, stall If SB[RD] is set  WAW, stall Else, dispatch to FU (Fn) and set SB[RD] Complete out-of-order Update GPR[RD], clear SB[RD] H&P-style notation Finite number of regs. will force WAR and WAW

Review of Register Dependencies Read-After-Write A: R1 = R3 / R4 B: R3 = R2 * R4 Write-After-Read 5 -2 9 3 R1 R2 R3 R4 -6 A B Write-After-Write A: R1 = R2 + R3 B: R1 = R3 * R4 5 -2 9 3 R1 R2 R3 R4 7 27 A B A: R1 = R2 + R3 B: R4 = R1 * R4 A R1 5 7 7 R2 -2 -2 -2 R3 9 9 9 B R4 3 3 21 5 -2 9 3 R1 R2 R3 R4 15 7 B A 5 -2 9 3 R1 R2 R3 R4 -6 A B 5 -2 9 3 R1 R2 R3 R4 27 7 A B This should be review. Each is an example of how you will get the wrong results if you reorder two instructions that have a register dependency.

Eliminating WAR Dependencies WAR dependencies are from reusing registers A: R1 = R3 / R4 B: R3 = R2 * R4 A: R1 = R3 / R4 B: R5 = R2 * R4 X A A 5 -2 9 3 R1 R2 R3 R4 B A 4 R5 -6 R1 5 3 3 R1 5 5 -2 B B R2 -2 -2 -2 R2 -2 -2 -2 R3 9 9 -6 R3 9 -6 -6 R4 3 3 3 R4 3 3 3 example shows how using a different architected register removes WAR Can get correct result just by using different reg.

Eliminating WAW Dependencies WAW dependencies are also from reusing registers A: R1 = R2 + R3 B: R1 = R3 * R4 A: R5 = R2 + R3 B: R1 = R3 * R4 X A B 5 -2 9 3 R1 R2 R3 R4 27 7 A B 5 -2 9 3 R1 R2 R3 R4 27 B A 4 R5 7 R1 5 7 27 R2 -2 -2 -2 R3 9 9 9 R4 3 3 3 example shows similar for WAW Can get correct result just by using different reg.

Register Renaming Register renaming (in hardware) How does it work? “Change” register names to eliminate WAR/WAW hazards Arch. registers (r1,f0…) are names, not storage locations Can have more locations than names Can have multiple active versions of same name How does it work? Map-table: maps names to most recent locations On a write: allocate new location, note in map-table On a read: find location of most recent write via map-table

Register Renaming Anti (WAR) and output (WAW) deps. are false Example Dep. is on name/location, not on data Given infinite registers, WAR/WAW don’t arise Renaming removes WAR/WAW, but leaves RAW intact Example Names: r1,r2,r3 Locations: p1,p2,p3,p4,p5,p6,p7 Original: r1p1, r2p2, r3p3, p4–p7 are “free” MapTable FreeList Original insns. Renamed insns. r1 r2 r3 p1 p2 p3 p4,p5,p6,p7 add r2,r3,r1 add p2,p3,p4 p4 p5,p6,p7 sub r2,r1,r3 sub p2,p4,p5 p5 p6,p7 mul r2,r3,r3 mul p2,p5,p6 p6 p7 div r1,4,r1 div p4,4,p7

Register Renaming Anti (WAR) and output (WAW) deps. are false Example Dep. is on name/location, not on data Given infinite registers, WAR/WAW don’t arise Renaming removes WAR/WAW, but leaves RAW intact Example Names: r1,r2,r3 Locations: p1,p2,p3,p4,p5,p6,p7 Original: r1p1, r2p2, r3p3, p4–p7 are “free” MapTable FreeList Original insns. Renamed insns. r1 r2 r3 p1 p2 p3 p4,p5,p6,p7 add r2,r3,r1 add p2,p3,p4 p4 p5,p6,p7 sub r2,r1,r3 sub p2,p4,p5 p5 p6,p7 mul r2,r3,r3 mul p2,p5,p6 p6 p7 div r1,4,r1 div p4,4,p7

Tomasulo’s Algorithm Reservation Stations (RS): instruction buffer Common data bus (CDB): broadcasts results to RS Register renaming: removes WAR/WAW hazards Bypassing (not shown here to make example simpler)

Tomasulo Data Structures (1/2) Reservation Stations (RS) FU, busy, op, R: destination register name T: destination register tag (RS# of this RS) T1,T2: source register tag (RS# of RS that will output value) V1,V2: source register values Map Table (a.k.a., RAT) T: tag (RS#) that will write this register Common Data Bus (CDB) Broadcasts <RS#, value> of completed insns. Valid tags indicate the RS# that will produce result

Tomasulo Data Structures (2/2) Regfile Map Table T value CDB.T CDB.V Fetched insns R op T T1 T2 V1 V2 == == == == == == == == Reservation Stations T FU

Tomasulo Pipeline New pipeline structure: F, D, S, X, W D (dispatch) Structural hazard ? stall : allocate RS entry S (issue) RAW hazard ? wait (monitor CDB) : go to execute W (writeback) Write register, free RS entry W and RAW-dependent S in same cycle W and structural-dependent D in same cycle

Tomasulo Dispatch (D) Allocate RS entry (structural stall if busy) Regfile Map Table T value CDB.T CDB.V Fetched insns R op T T1 T2 V1 V2 == == == == == == == == Reservation Stations T FU Allocate RS entry (structural stall if busy) Input register ready ? read value into RS : read tag into RS Set register status (i.e., rename) for output register

Tomasulo Issue (S) Wait for RAW hazards Read register values from RS Regfile Map Table T value CDB.T CDB.V Fetched insns R op T T1 T2 V1 V2 == == == == == == == == Reservation Stations T FU Wait for RAW hazards Read register values from RS

Tomasulo Execute (X) Regfile Map Table T value CDB.T CDB.V Fetched insns R op T T1 T2 V1 V2 == == == == == == == == Reservation Stations T FU

Tomasulo Writeback (W) Regfile Map Table T value CDB.T CDB.V Fetched insns R op T T1 T2 V1 V2 == == == == == == == == Reservation Stations T FU Wait for structural (CDB) hazards Output Reg tag still matches? clear, write result to register CDB broadcast to RS: tag match ? clear tag, copy value

Where is the “register rename”? Regfile Map Table T value CDB.T CDB.V Fetched insns R op T T1 T2 V1 V2 == == == == == == == == Reservation Stations T FU Value copies in RS (V1, V2) Insn. stores correct input values in its own RS entry “Free list” is implicit (allocate/deallocate as part of RS)

Tomasulo Data Structures Insn Status Insn D S X W ldf X(r1),f1 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 Map Table Reg T f0 f1 f2 r1 CDB T V Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD 3 ST 4 FP1 5 FP2

Tomasulo: Cycle 1 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 Map Table Reg T f0 f1 RS#2 f2 r1 CDB T V Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD yes ldf f1 - [r1] 3 ST 4 FP1 5 FP2 allocate

Tomasulo: Cycle 2 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 Map Table Reg T f0 f1 RS#2 f2 RS#4 r1 CDB T V Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD yes ldf f1 - [r1] 3 ST 4 FP1 mulf f2 RS#2 [f0] 5 FP2 allocate

Tomasulo: Cycle 3 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 Map Table Reg T f0 f1 RS#2 f2 RS#4 r1 CDB T V Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD yes ldf f1 - [r1] 3 ST stf RS#4 4 FP1 mulf f2 RS#2 [f0] 5 FP2 allocate

Tomasulo: Cycle 4 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 stf f2,Z(r1) addi r1,4,r1 Map Table Reg T f0 f1 RS#2 f2 RS#4 r1 RS#1 CDB T V RS#2 [f1] ldf finished (W) clear f1 RegStatus CDB broadcast Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU yes addi r1 - [r1] 2 LD no 3 ST stf RS#4 4 FP1 mulf f2 RS#2 [f0] CDB.V 5 FP2 allocate free RS#2 ready  grab CDB value

Tomasulo: Cycle 5 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5 stf f2,Z(r1) addi r1,4,r1 Map Table Reg T f0 f1 RS#2 f2 RS#4 r1 RS#1 CDB T V Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU yes addi r1 - [r1] 2 LD ldf f1 RS#1 3 ST stf RS#4 4 FP1 mulf f2 [f0] [f1] 5 FP2 no allocate

Tomasulo: Cycle 6 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5+ stf f2,Z(r1) addi r1,4,r1 c5 c6 Map Table Reg T f0 f1 RS#2 f2 RS#4RS#5 r1 RS#1 CDB T V no stall on WAW: scoreboard overwrites f2 RegStatus anyone who needs old f2 tag has it Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU yes addi r1 - [r1] 2 LD ldf f1 RS#1 3 ST stf RS#4 4 FP1 mulf f2 [f0] [f1] 5 FP2 RS#2 allocate

Tomasulo: Cycle 7 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5+ stf f2,Z(r1) addi r1,4,r1 c5 c6 c7 Map Table Reg T f0 f1 RS#2 f2 RS#5 r1 RS#1 CDB T V RS#1 [r1] no W wait on WAR: scoreboard ensures anyone who needs old r1 has RS copy D stall on store RS: structural (no space) addi finished (W) clear r1 RegStatus CDB broadcast Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD yes ldf f1 - RS#1 CDB.V 3 ST stf RS#4 [r1] 4 FP1 mulf f2 [f0] [f1] 5 FP2 RS#2 RS#1 ready  grab CDB value

Tomasulo: Cycle 8 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5+ c8 stf f2,Z(r1) addi r1,4,r1 c5 c6 c7 Map Table Reg T f0 f1 RS#2 f2 RS#5 r1 CDB T V RS#4 [f2] mulf finished (W), f2 already overwritten by 2nd mulf (RS#5) CDB broadcast Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD yes ldf f1 - [r1] 3 ST stf RS#4 CDB.V 4 FP1 5 FP2 mulf f2 RS#2 [f0] RS#4 ready  grab CDB value

Tomasulo: Cycle 9 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5+ c8 stf f2,Z(r1) c9 addi r1,4,r1 c5 c6 c7 Map Table Reg T f0 f1 RS#2 f2 RS#5 r1 CDB T V RS#2 [f1] 2nd ldf finished (W) clear f1 RegStatus CDB broadcast Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD 3 ST yes stf - [f2] [r1] 4 FP1 5 FP2 mulf f2 RS#2 [f0] CDB.V RS#2 ready  grab CDB value

Tomasulo: Cycle 10 Insn Status Insn D S X W Map Table Reg T CDB T V ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5+ c8 stf f2,Z(r1) c9 c10 addi r1,4,r1 c5 c6 c7 Map Table Reg T f0 f1 f2 RS#5 r1 CDB T V stf finished (W) no output register  no CDB broadcast Reservation Stations T FU busy op R T1 T2 V1 V2 1 ALU no 2 LD 3 ST yes stf - RS#5 [r1] 4 FP1 5 FP2 mulf f2 [f0] [f1] free  allocate

Scoreboard vs. Tomasulo Insn D S X W ldf X(r1),f1 c1 c2 c3 c4 mulf f0,f1,f2 c5+ c8 stf f2,Z(r1) c9 c10 addi r1,4,r1 c5 c6 c7 c11 c12+ c15 c10+ c13 c16 c17 c14 Hazard Scoreboard Tomasulo Insn buffer stall in D FU wait in S RAW WAR wait in W none WAW

Can We Add Superscalar? Dynamic scheduling and multi-issue are orthogonal N: superscalar width (number of parallel operations) W: window size (number of reservation stations) What is needed for an N-by-W Tomasulo? RS: N tag/value write (D), N value read (S), 2N tag cmp (W) Select logic: WN priority encoder (S) MT: 2N read (D), N write (D) RF: 2N read (D), N write (W) CDB: N (W)