Instruction-Level Parallelism Dynamic Scheduling

Slides:



Advertisements
Similar presentations
CMSC 611: Advanced Computer Architecture Tomasulo Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Advertisements

Scoreboarding & Tomasulos Approach Bazat pe slide-urile lui Vincent H. Berk.
Hardware-Based Speculation. Exploiting More ILP Branch prediction reduces stalls but may not be sufficient to generate the desired amount of ILP One way.
Instruction-level Parallelism Compiler Perspectives on Code Movement dependencies are a property of code, whether or not it is a HW hazard depends on.
A scheme to overcome data hazards
Dynamic ILP: Scoreboard Professor Alvin R. Lebeck Computer Science 220 / ECE 252 Fall 2008.
Dyn. Sched. CSE 471 Autumn 0219 Tomasulo’s algorithm “Weaknesses” in scoreboard: –Centralized control –No forwarding (more RAW than needed) Tomasulo’s.
Lecture 6: ILP HW Case Study— CDC 6600 Scoreboard & Tomasulo’s Algorithm Professor Alvin R. Lebeck Computer Science 220 Fall 2001.
COMP25212 Advanced Pipelining Out of Order Processors.
CMSC 611: Advanced Computer Architecture Scoreboard Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Microprocessor Microarchitecture Dependency and OOO Execution Lynn Choi Dept. Of Computer and Electronics Engineering.
Instruction-Level Parallelism (ILP)
Spring 2003CSE P5481 Reorder Buffer Implementation (Pentium Pro) Hardware data structures retirement register file (RRF) (~ IBM 360/91 physical registers)
CPE 731 Advanced Computer Architecture ILP: Part IV – Speculative Execution Dr. Gheith Abandah Adapted from the slides of Prof. David Patterson, University.
Limits on ILP. Achieving Parallelism Techniques – Scoreboarding / Tomasulo’s Algorithm – Pipelining – Speculation – Branch Prediction But how much more.
1 IF IDEX MEM L.D F4,0(R2) MUL.D F0, F4, F6 ADD.D F2, F0, F8 L.D F2, 0(R2) WB IF IDM1 MEM WBM2M3M4M5M6M7 stall.
1 IBM System 360. Common architecture for a set of machines. Robert Tomasulo worked on a high-end machine, the Model 91 (1967), on which they implemented.
COMP381 by M. Hamdi 1 Pipelining (Dynamic Scheduling Through Hardware Schemes)
1 Recap (Scoreboarding). 2 Dynamic Scheduling Dynamic Scheduling by Hardware – – Allow Out-of-order execution, Out-of-order completion – – Even though.
ENGS 116 Lecture 71 Scoreboarding Vincent H. Berk October 8, 2008 Reading for today: A.5 – A.6, article: Smith&Pleszkun FRIDAY: NO CLASS Reading for Monday:
1 COMP 206: Computer Architecture and Implementation Montek Singh Wed, Oct 5, 2005 Topic: Instruction-Level Parallelism (Dynamic Scheduling: Scoreboarding)
EENG449b/Savvides Lec 5.1 1/27/04 January 27, 2004 Prof. Andreas Savvides Spring EENG 449bG/CPSC 439bG Computer.
Nov. 9, Lecture 6: Dynamic Scheduling with Scoreboarding and Tomasulo Algorithm (Section 2.4)
1 Sixth Lecture: Chapter 3: CISC Processors (Tomasulo Scheduling and IBM System 360/91) Please recall:  Multicycle instructions lead to the requirement.
Out-of-order execution: Scoreboarding and Tomasulo Week 2
1 Lecture 5 Overview of Superscalar Techniques CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading: Textbook, Ch. 2.1 “Complexity-Effective.
1 Lecture 6 Tomasulo Algorithm CprE 581 Computer Systems Architecture, Fall 2009 Zhao Zhang Reading:Textbook 2.4, 2.5.
CET 520/ Gannod1 Section A.8 Dynamic Scheduling using a Scoreboard.
Professor Nigel Topham Director, Institute for Computing Systems Architecture School of Informatics Edinburgh University Informatics 3 Computer Architecture.
1 Lecture 5: Dependence Analysis and Superscalar Techniques Overview Instruction dependences, correctness, inst scheduling examples, renaming, speculation,
2/24; 3/1,3/11 (quiz was 2/22, QuizAns 3/8) CSE502-S11, Lec ILP 1 Tomasulo Organization FP adders Add1 Add2 Add3 FP multipliers Mult1 Mult2 From.
1 Images from Patterson-Hennessy Book Machines that introduced pipelining and instruction-level parallelism. Clockwise from top: IBM Stretch, IBM 360/91,
CSC 4250 Computer Architectures September 29, 2006 Appendix A. Pipelining.
Chapter 3 Instruction Level Parallelism Dr. Eng. Amr T. Abdel-Hamid Elect 707 Spring 2011 Computer Applications Text book slides: Computer Architec ture:
Recap Multicycle Operations –MIPS Floating Point Putting It All Together: the MIPS R4000 Pipeline.
04/03/2016 slide 1 Dynamic instruction scheduling Key idea: allow subsequent independent instructions to proceed DIVDF0,F2,F4; takes long time ADDDF10,F0,F8;
CIS 662 – Computer Architecture – Fall Class 11 – 10/12/04 1 Scoreboarding  The following four steps replace ID, EX and WB steps  ID: Issue –
COMP25212 Advanced Pipelining Out of Order Processors.
Sections 3.2 and 3.3 Dynamic Scheduling – Tomasulo’s Algorithm 吳俊興 高雄大學資訊工程學系 October 2004 EEF011 Computer Architecture 計算機結構.
Instruction-Level Parallelism and Its Dynamic Exploitation
IBM System 360. Common architecture for a set of machines
Dynamic Scheduling Why go out of style?
Images from Patterson-Hennessy Book
Out of Order Processors
Step by step for Tomasulo Scheme
CS203 – Advanced Computer Architecture
Microprocessor Microarchitecture Dynamic Pipeline
Chapter 3: ILP and Its Exploitation
Advantages of Dynamic Scheduling
High-level view Out-of-order pipeline
CMSC 611: Advanced Computer Architecture
A Dynamic Algorithm: Tomasulo’s
Out of Order Processors
Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2
CS 704 Advanced Computer Architecture
Adapted from the slides of Prof
Lecture 7: Dynamic Scheduling with Tomasulo Algorithm (Section 2.4)
How to improve (decrease) CPI
Static vs. dynamic scheduling
CSCE430/830 Computer Architecture
Static vs. dynamic scheduling
Tomasulo Organization
Reduction of Data Hazards Stalls with Dynamic Scheduling
Adapted from the slides of Prof
CS152 Computer Architecture and Engineering Lecture 16 Compiler Optimizations (Cont) Dynamic Scheduling with Scoreboards.
Scoreboarding ENGS 116 Lecture 7 Vincent H. Berk October 5, 2005
September 20, 2000 Prof. John Kubiatowicz
High-level view Out-of-order pipeline
Lecture 7 Dynamic Scheduling
Conceptual execution on a processor which exploits ILP
Presentation transcript:

Instruction-Level Parallelism Dynamic Scheduling

Dynamic Scheduling Static Scheduling – Compiler rearranging instructions to reduce stalls Dynamic Scheduling – Hardware rearranges instruction stream to reduce stalls Possible to account for dependencies that are only known at run time - e. g. memory reference Simplifies the compiler Code built for one pipeline in mind could also run efficiently on a different pipeline minimizes the need for speculative slot filling - NP hard Once a common tactic with supercomputers and mainframes but was too expensive for single-chip Now fits on a desktop PC’s

Dynamic Scheduling Dependencies that are close together stall the entire pipeline DIVD F0, F2, F4 ADDD F10, F0, F8 SUBD F12, F8, F14 The ADD needs the DIV to finish, so there is a stall… which also stalls the SUBD Looong stall for DIV But the SUBD is independent so there is no reason why we shouldn’t execute it Or is there? Precise Interrupts - Ignore for now Compiler could rearrange instructions, but so could the hardware

Dynamic Scheduling It would be desirable to decode instructions into the pipe in order but then let them stall individually while waiting for operands before issue to execution units. Dynamic Scheduling – Out of Order Issue / Execution Scoreboarding

Scoreboarding Split ID stage into two: Issue (Decode, check for Structural hazards) Read Operands (Wait until no data hazards, read operands)

Scoreboard Overview Consider another example: Note: DIVD F0, F2, F4 ADDD F10, F0, F8 ; RAW Hazard SUBD F8, F8, F14 ; WAR Hazard Note: We may still have to stall in the Scoreboard SUBD wouldn’t have to wait if we had a different register other than F8 MIPS Pipeline changes again Ignore MEM and IF stages for now to focus on the scoreboard

Scoreboard and the CDC Introduced on CDC 6600 Results 4 Floating Point Units, 5 Memory, 7 Integer Units Load/Store architecture Results 70 to 150% improvement Good since this cost over $1 million Recall this is an old machine with no cache, pipelining, primitive compilers Fastest machine of the time Back today in IBM, Sun machines, similar concepts in Intel chips

MIPS Scoreboard Two MULT units Note: Need a lot more buses

New Pipeline Steps Ignoring IF, and MEM Issue Read Operands IF If functional unit free and no other active instruction has the same destination register, issue the instruction to the functional unit Stall for WAW hazards or structural hazards Nothing out of order yet Read Operands Monitor source operands, if available for an active instruction, then send to the functional unit to execute Instructions get out of order at this step, dynamic RAW resolution Execution Starts when operands ready, notifies scoreboard when done Write Result Stall for WAR hazards before writing result back to register file IF EX WB

Scoreboard Data Structure Instruction Status Indicates which of the four steps the instruction is in (Issue, Read Operands, Execute Complete, Write Result) The capacity of the instruction status is the window size Functional Unit Status Busy : Yes/No indicates if busy or not Op: Operation to perform, e.g. Add or Subtract Fi : Destination Register Fj, Fk : Source Registers Qj, Qk : Functional Units producing source register j, k Rj, Rk : Yes/No indicates if source j or k is ready or not, set to NO after operands are read Register Result Status For each register, indicates the Functional Unit that will write to it Blank means nothing will write to it Note overlap of this info with Q/F fields, but indexed by register, not instr

Example Scoreboard

Scoreboard Example Assume the following EX latencies FP Add = 2 clock cycles FP Mult = 10 clock cycles FP Div = 40 clock cycles Assume entire code fragment (basic block) fits in the window Also assume the code fragment is already loaded Issue Rules Issue maximum 1 instruction per cycle Clearing hazards will depend on the pipeline structure

MIPS Scoreboarding Rules Issue If FU not busy (no structural hazards) and no result pending for destination register (no WAW hazards) then Assign FU entry Set operand source in scoreboard, indicate if ready or not Read Operands If source1 and source2 are both “ready” (no RAW hazards) then Set ready=No for both, read registers, and start EX Execute Notify scoreboard when complete Writeback If none of the functional units needs to use our result register and has yet to read it (no WAR hazards) then Set ready=Yes for any functional unit sources waiting for this FU write the register file, clear the FU status, and clear our destination’s register result pending entry. Stages must stall for hazards

Scoreboard – Cycle 1 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 LD F2, 45(R3) MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F6 R2 Mult1 Mult2 Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Int RRS

Scoreboard – Cycle 2 IS FS Issue 2nd LD? RRS Name Busy Op Fi Fj Fk Qj Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 LD F2, 45(R3) MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F6 R2 Mult1 Mult2 Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Int Issue 2nd LD? RRS

Scoreboard – Cycle 3 IS FS Issue Mult? RRS Name Busy Op Fi Fj Fk Qj Qk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 LD F2, 45(R3) MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F6 R2 No Mult1 Mult2 Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Int Issue Mult? RRS

Scoreboard – Cycle 4 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F6 R2 No Mult1 Mult2 Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU RRS

Scoreboard – Cycle 5 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F2 R3 Mult1 Mult2 Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Int RRS

Scoreboard – Cycle 6 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F2 R3 Mult1 Mult F0 F4 Int No Mult2 Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Int RRS

Scoreboard – Cycle 7 IS FS Read Mult operands? RRS Name Busy Op Fi Fj Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F2 R3 No Mult1 Mult F0 F4 Int Mult2 Add Sub F8 F6 Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Int Add Read Mult operands? RRS

Scoreboard – Cycle 8a IS FS Issue first RRS Name Busy Op Fi Fj Fk Qj Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 8 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F2 R3 No Mult1 Mult F0 F4 Int Mult2 Add Sub F8 F6 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Int Add Div Issue first RRS

Scoreboard – Cycle 8b IS FS Write result RRS Name Busy Op Fi Fj Fk Qj Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Yes LD F2 R3 No Mult1 Mult F0 F4 Mult2 Add Sub F8 F6 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div Write result RRS

Scoreboard – Cycle 9 IS FS Issue ADD? RRS Name Busy Op Fi Fj Fk Qj Qk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add Sub F8 F6 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div Issue ADD? RRS

Scoreboard – Cycle 10 IS FS Exec MULT, ADD RRS Name Busy Op Fi Fj Fk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add Sub F8 F6 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div Exec MULT, ADD RRS

Scoreboard – Cycle 11 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add Sub F8 F6 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 12 IS FS Read opnds DIVD? RRS Name Busy Op Fi Fj Fk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add Sub F8 F6 Divide Div F10 Mul1 FS Read opnds DIVD? F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Div RRS

Scoreboard – Cycle 13 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 14 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 14 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 15 IS FS Start EX on ADD RRS Name Busy Op Fi Fj Fk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 14 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS Start EX on ADD F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 16 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 14 16 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 17 IS FS Write Result Add? RRS Name Busy Op Fi Fj Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 14 16 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS Write Result Add? F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 19 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 19 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 14 16 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Mul Add Div RRS

Scoreboard – Cycle 20 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 19 20 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 ADDD F6, F8, F2 13 14 16 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Yes Mult F0 F2 F4 Mult2 Add F6 F8 Divide Div F10 Mul1 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Add Div RRS

Scoreboard – Cycle 21 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 19 20 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 21 ADDD F6, F8, F2 13 14 16 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Mult2 Add Yes F6 F8 F2 Divide Div F10 F0 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Add Div RRS

Scoreboard – Cycle 22 IS FS Safe to write RRS Name Busy Op Fi Fj Fk Qj Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 19 20 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 21 ADDD F6, F8, F2 13 14 16 22 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Mult2 Add Yes F6 F8 F2 Divide Div F10 F0 FS Safe to write F0 F2 F4 F6 F8 F10 F12 … F30 FU Div RRS

Scoreboard – Cycle 61 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 19 20 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 21 61 ADDD F6, F8, F2 13 14 16 22 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Mult2 Add Divide Yes Div F10 F0 F6 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU Div RRS

Scoreboard – Cycle 62 IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Instr Issue Read Operands Exec Complete Write Result LD F6, 34(R2) 1 2 3 4 LD F2, 45(R3) 5 6 7 8 MULTD F0, F2, F4 9 19 20 SUBD F8, F6, F2 11 12 DIVD, F10, F0, F6 21 61 62 ADDD F6, F8, F2 13 14 16 22 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer No Mult1 Mult2 Add Divide Yes Div F10 F0 F6 FS F0 F2 F4 F6 F8 F10 F12 … F30 FU RRS

IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult Add Divide F0 Instr Issue Read Operands Exec Complete Write Result LD F2, 0(R1) ADDD F6, F2, F4 MULTD F8, F6, F0 LD F10, -8(R1) ADDD, F12, F10, F4 MULTD F14, F12,F0 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Mult Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU RRS

Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Y Load F2 R1 Yes F10 Mult F8 Instr Issue Read Operands Exec Complete Write Result LD F2, 0(R1) 1 2 3 4 ADDD F6, F2, F4 5 7 8 MULTD F8, F6, F0 9 19 20 LD F10, -8(R1) 6 ADDD, F12, F10, F4 10 11 MULTD F14, F12,F0 21 22 32 33 Name Busy Op Fi Fj Fk Qj Qk Rj Rk Integer Y Load F2 R1 Yes F10 Mult F8 F6 F0 Add1 Add F4 Int1 F12 Int2 No Divide F0 F2 F4 F6 F8 F10 F12 F14 F30 FU Mult1 Int2 Add2

IS FS RRS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Int1 Int2 Mult Add Divide Instr Issue Read Operands Exec Complete Write Result DIVD F10, F8, F2 ADDD F4, F10, F2 LD F2, 8(R0) ADDD F6, F0, F12 IS Name Busy Op Fi Fj Fk Qj Qk Rj Rk Int1 Int2 Mult Add Divide FS F0 F2 F4 F6 F8 F10 F12 … F30 FU RRS

Scoreboarding Comments Could improve a bit with forwarding Improve by one cycle for forwarded data Whats Missing Control Hazards, i.e. branches Dynamic Branch Prediction Speculative Execution Limitations Note we had to wait for a register to be read before writing Could be avoided with Register Renaming Amount of ILP limited by RAW hazards Some estimate the amount of parallelism typically ~2 Window Size Number and Type of Functional Units, Lots of Buses Big increase in cost and complexity!

Static Scheduling Revisited Scoreboarding stalled for WAR and WAW hazards Consider the fragment DIVD F1, F2, F3 ; slow SUBD F2, F3, F4 ; must wait for DIV to read, WAR SD 0(R1), F2 There is a WAR antidependence on F2 between the DIV and the SUB. By “renaming” F2 and subsequent uses of it we avoid the hazard DIVD F1, F2, F3 ; slow SUBD F5, F3, F4 ; Free to start anytime SD 0(R1), F5 Reduces stalls for some pipelines, but uses additional registers

Dynamic Scheduling Same idea as before, but in hardware, not software Every time you write to a register number, allocate a new physical register from a pool of virtual registers. Results in multiple copies of each register, one for each new value written to it. Implementation implications: Need more physical registers than register addresses Keep a pool of unassigned physical registers, and allocate one when a register is written All uses to the original register number must be remapped to the new physical register When a register is written, all previous physical registers allocated for it can be freed as soon as all previous instructions that use it have completed Can eliminate stalls for WAR and WAW hazards

Register Renaming Idea Initial mappings: F1  V1 F2  V2 F3  V3 F4  V4 DIVD F1, F2, F3 F1  V5 SUBD F2, F3, F4 F2  V6 ADDD F1, F3, F2 F1  V7 Becomes DIVD V5, V2, V3 SUBD V6, V3, V4 ADDD V7, V3, V6 Note V6 here not V2

Tomasulo’s Algorithm Uses register renaming Idea: When opcode, operands, and functional unit is available, execute the instruction, data-centric Replace the Functional Unit Status table with reservation stations for each functional unit The reservation station stores operand values as they become available, or the name of a reservation station that will provide the result. This means that the original register number is no longer needed, and is now “renamed” as a reservation station Each result is broadcast to any reservation station that needs it Instructions can now be issued even if there are structural hazards or WAR/WAW hazards.

MIPS FP via Tomasulo Tags Need more reservation stations than actual registers Tags

Tomasulo vs. Scoreboarding Tomasulo’s Algorithm Used on IBM 360 attempt to deal with long memory and FP latencies 360 permitted register to memory ALU instructions vs. CDC scoreboard hazard detection and interlocks are distributed Each functional unit has reservation stations to control execution Reservation stations act as interlocks until data available Results passed directly to functional units rather than through the registers (with renamed registers) multicast on CDB (common data bus) Dataflow centric no WAR or WAW hazards exist - rename instead Scoreboard was control centric

Tomasulo - Central Idea To speed up the 360 we can Add more functional units, but they wouldn’t be heavily used Replicating registers is cheaper and increases utilization Helped on 360 which had only 6 FP registers (similar to x86) Outcome each execution unit fed by multiple reservation stations Execution units used pipelining Instruction issue to reservation stations is in-order, but reservation stations would execute upon acquiring all source operands  out of order execution reservation stations contain virtual registers that remove WAW and WAR induced stalls

Dynamic Scheduling Both Scoreboarding and Tomasulo’s algorithm were early approaches, but the ideas are still used today Scoreboarding - CDC 6600 (1964) Tomasulo - IBM 360/ 91 (1967) Both methods are used today UltraSparc Pentium/Itanium More transistors dedicated to ILP in the original Pentium than in the entire 486