Module: Part 2 Dynamic Scheduling in Hardware - Tomasulo’s Algorithm

Module: Part 2 Dynamic Scheduling in Hardware - Tomasulo’s Algorithm Spring 2005 © Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech)

Algorithms for Out-of-order Issue: Scoreboarding; Tomasulo’s Algorithm; Others.

In-Order Issue, Out-of-order Execution, Out-of-order Completion [Pipeline diagram: I-Fetch, Execution Core, Retire]

Dynamic Scheduling: Hardware detects and preserves dependencies (within a limited window of the instruction stream). Hardware checks for resource availability. Independent instructions are issued to the correct functional units.

Advantages: Correctness of execution is guaranteed by hardware. Independent of compiler optimizations. Backward compatibility. With software scheduling, a different machine configuration necessitates recompilation (or at least rescheduling).

IBM 360/91: Introduced in 1966. Introduced many important architectural innovations: pipelining, parallel functional units, out-of-order execution, imprecise interrupts, and load/store buffers. Can execute programs 10 to 100 times faster than its immediate predecessor (IBM 7090). According to Hennessy and Patterson: “Many of the ideas in the 360/91 faded from use for nearly 25 years before being broadly employed in the 1990s.”

IBM 360 Instruction Format: Known as the RX format. All instructions (except loads and stores) have the form SOURCE op SINK → SINK, where SOURCE may be a memory operand or a register, while the SINK must be a register.

Tomasulo’s Algorithm: Credited to R. M. Tomasulo, who presented it in a paper. Implemented for the floating point unit of the IBM 360/91.

IBM 360/91 FPU [Block diagram: FP operations arrive from the instruction unit into the 8-entry FP Ops “Stack”; operands arrive from memory into the 6 load buffers; the FP registers; a decoder; operand busses and an operation bus feeding the reservation stations (3 for the 2-stage FP Add unit, 2 for the 6-stage FP Mul/Div unit); 3 store buffers leading to memory; and the common data bus.]

IBM 360/91 FPU, FP Ops “Stack”: FP operations are sent by the instruction unit to the FPU into a “stack” (IBM terminology; it is actually a queue).

IBM 360/91 FPU, Decoder: Decides whether the operation is an add or a multiply/divide.

IBM 360/91 FPU, FP Registers: 4 floating point registers.

IBM 360/91 FPU, Load Buffer: Buffers for loads. Each load request that goes out to memory gets a buffer allocated.

IBM 360/91 FPU, FP Add (2 stage) and FP Mul/Div (6 stage): The two floating point functional units.

IBM 360/91 FPU, Operand Busses: Supply operands to the reservation stations. Each operand has a tag.

IBM 360/91 FPU, Reservation Stations: Each reservation station holds the two operands of an operation together with their tags and busy bits (a busy bit indicates whether the operand is available).

Tags: Each tag uniquely identifies either one of the 5 reservation stations or one of the 6 load buffers. A tag indicates the “producer” of an operand that is not available from the registers. A zero tag indicates that the operand is immediately available.

Reservation Stations: Each reservation station contains the following fields: the operation to be performed (also known as the CTRL field in IBM terminology); the SOURCE; the tag for the SOURCE, together with its busy bit; the SINK; the tag for the SINK, together with its busy bit.

Data Structures: LD/SD buffers act as reservation stations for the memory units. Instruction execution cannot start until all preceding branches are resolved. Reservation station fields: Op, Qj, Qk, Vj, Vk, A, Busy. Register status fields: value, Qi.
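The field list above can be written down as a pair of record types. This is only a minimal sketch: the class names, the use of `None` in place of the zero tag, and the sample station name `Mult1` are assumptions made for illustration, and `busy` here marks an occupied entry, as in the Op/Qj/Qk/Vj/Vk/A/Busy list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    """One reservation station entry, using the field names from the slide."""
    name: str                   # also used as the tag of the result it produces
    busy: bool = False          # Busy: entry holds an operation
    op: Optional[str] = None    # Op: operation to perform
    vj: Optional[float] = None  # Vj, Vk: operand values, once available
    vk: Optional[float] = None
    qj: Optional[str] = None    # Qj, Qk: tag of the producer still awaited
    qk: Optional[str] = None    # (None stands in for the slides' zero tag)
    a: Optional[int] = None     # A: address information for load/store buffers

    def ready(self) -> bool:
        """True when the entry is occupied and both operands have arrived."""
        return self.busy and self.qj is None and self.qk is None


@dataclass
class RegisterStatus:
    """One register: its value plus Qi, the tag of its pending producer."""
    value: float = 0.0
    qi: Optional[str] = None    # None means the value is current


# A multiply waiting for Load1 to deliver its first operand.
mult1 = ReservationStation("Mult1", busy=True, op="MUL.D", vk=2.0, qj="Load1")
waiting = mult1.ready()         # operand j is still pending
mult1.vj, mult1.qj = 3.0, None  # Load1's value arrives
runnable = mult1.ready()
print(waiting, runnable)
```

Keeping the tag (Q) and the value (V) as separate fields is what lets an entry wait on a producer rather than on a register name.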

IBM 360/91 FPU, Common Data Bus: All operand transport occurs on the common data bus; only one operand may occupy the bus at a time.

Tomasulo’s Algorithm: Decode the operation at the head of the floating point operation stack. Look for an empty reservation station in the functional unit corresponding to the operation; if none exists, instruction issue stalls until one becomes free. Read the source operands from the register file, bringing forward the tags.

Tomasulo’s Algorithm - cont’d: Mark the busy bit of the SINK in the register file, and set its tag to point to the selected reservation station. When the functional unit completes execution, it writes its result, together with the corresponding reservation station number, back to the register file via the common data bus.

Tomasulo’s Algorithm - cont’d: All units listen to the bus; a unit that needs the broadcast value as one of its operands reads it in and clears the corresponding busy bit. When a functional unit is free, it examines its reservation stations, and one whose operands both have their busy bits clear is selected for execution.
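The issue and broadcast steps above can be sketched as two small functions. This is a hedged illustration rather than the 360/91 logic itself: the names `Station`, `Reg`, `issue`, `broadcast` and `ready`, the register values, and the multiply result 6.0 are all assumptions, and `None` plays the role of the zero tag.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Station:
    tag: str
    busy: bool = False
    op: str = ""
    vj: Optional[float] = None
    vk: Optional[float] = None
    qj: Optional[str] = None   # None stands in for the slides' zero tag
    qk: Optional[str] = None


@dataclass
class Reg:
    value: float
    tag: Optional[str] = None  # tag of the station that will write this register


def issue(op: str, dst: str, src1: str, src2: str,
          unit: List[Station], regs: Dict[str, Reg]) -> Optional[str]:
    """Place an operation in a free station of its unit; None means stall."""
    free = next((s for s in unit if not s.busy), None)
    if free is None:
        return None
    r1, r2 = regs[src1], regs[src2]
    free.busy, free.op = True, op
    # Copy the value if it is available, otherwise bring the tag forward.
    free.vj, free.qj = (r1.value, None) if r1.tag is None else (None, r1.tag)
    free.vk, free.qk = (r2.value, None) if r2.tag is None else (None, r2.tag)
    regs[dst].tag = free.tag   # the destination now waits on this station
    return free.tag


def broadcast(tag: str, value: float,
              stations: List[Station], regs: Dict[str, Reg]) -> None:
    """Model one common-data-bus transfer of (tag, value)."""
    for s in stations:
        if s.tag == tag:
            s.busy = False     # the producer's station is released
            continue
        if s.busy and s.qj == tag:
            s.vj, s.qj = value, None
        if s.busy and s.qk == tag:
            s.vk, s.qk = value, None
    for r in regs.values():
        if r.tag == tag:       # only the newest writer's tag matches
            r.value, r.tag = value, None


def ready(unit: List[Station]) -> List[Station]:
    """Stations whose operands are both available."""
    return [s for s in unit if s.busy and s.qj is None and s.qk is None]


regs = {name: Reg(v) for name, v in
        [("F0", 0.0), ("F2", 2.0), ("F4", 3.0), ("F6", 0.0), ("F8", 1.0)]}
mult, add = [Station("Mult1")], [Station("Add1")]
issue("MUL.D", "F0", "F2", "F4", mult, regs)   # F0 is now tagged Mult1
issue("ADD.D", "F6", "F0", "F8", add, regs)    # waits on Mult1 for F0
blocked = ready(add) == []
broadcast("Mult1", 6.0, mult + add, regs)      # multiply result on the bus
```

After the broadcast, `Add1` holds 6.0 in `vj` and becomes ready, and `F0` is updated because its tag still named `Mult1`.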

Flow Dependency: Flow dependency is obeyed. The exclusivity and broadcast nature of the common data bus ensures that once an operand is produced, all operations requiring it are notified. Anti and output dependencies are handled by implicit register renaming.

The Essence of Register Renaming
Original code, with a WAR hazard on F8 (ADD.D reads it, SUB.D writes it) and a WAW hazard on F6 (ADD.D and MUL.D both write it):
    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    S.D   F6, 0(R1)
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8
After renaming with temporaries S and T:
    DIV.D F0, F2, F4
    ADD.D S, F0, F8
    S.D   S, 0(R1)
    SUB.D T, F10, F14
    MUL.D F6, F10, T
Renaming is performed by the hardware using additional storage (the reservation stations). Equivalently, renaming can be performed by the compiler.
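A simple way to see why renaming removes WAR and WAW hazards is to give every register write a fresh name and redirect later reads to the newest name. The sketch below does exactly that; it is an assumption-laden illustration (the `rename` helper, the tuple encoding of instructions, and the names `T0`, `T1`, ... are invented here), and unlike the slide it renames every destination rather than only F6 and F8.

```python
from itertools import count
from typing import Dict, List, Optional, Tuple

# (opcode, destination register or None, source registers)
Instr = Tuple[str, Optional[str], Tuple[str, ...]]


def rename(program: List[Instr]) -> List[Instr]:
    """Give every register write a fresh name; rewrite later reads to match."""
    latest: Dict[str, str] = {}       # architectural name -> newest fresh name
    fresh = count()
    renamed: List[Instr] = []
    for op, dst, srcs in program:
        # Sources are looked up before the destination is renamed.
        new_srcs = tuple(latest.get(s, s) for s in srcs)
        new_dst = None
        if dst is not None:
            new_dst = f"T{next(fresh)}"
            latest[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed


program: List[Instr] = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("S.D",   None, ("F6", "R1")),    # a store writes no register
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed_program = rename(program)
for instr in renamed_program:
    print(instr)
```

In the output, ADD.D still reads the original F8 while SUB.D writes a new name, and ADD.D and MUL.D write different names, so only the true (RAW) dependencies remain.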

The Data Path and Functional Units: Register renaming is a form of “generalized” forwarding. The reservation stations can be viewed as “renaming registers” that are physically distributed among the functional units. Control logic for forwarding results is distributed among the functional units. Serialization of the broadcast of results enables correct operation.

Anti-dependence example: Consider the following anti-dependence: S1: R0 + R1 → R2; S2: R3 + R4 → R0. Suppose both S1 and S2 have been issued and now reside in two distinct reservation stations, say RS1 and RS2. Call the operation producing R0 for S1 S0, and assume that it occupies RS0. There are two possibilities: either S0 has completed, or S0 has not completed execution.

Anti-dependence example - cont’d: If S0 has completed, the value of R0 was read when S1 was issued and now resides in RS1; even if S2 completes and overwrites R0, S1 is unaffected. If S0 has not completed, then when S0 completes its result goes straight to RS1; R0 is not written with the value produced by S0, because the tag of R0 now points to RS2, not RS0.

Anti-dependence example - cont’d: R0 is mapped to two physical registers (one register and one reservation station field), depending on the instruction. Implicit register renaming in hardware overcomes the anti-dependence.
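The case where S0 has not yet completed can be traced in a few lines. This is a sketch under stated assumptions: the function name, the dictionary layout, the operand values, and the choice to let S2 finish before S0 are all hypothetical.

```python
def anti_dependence_demo(s0_result: float = 7.0, s2_result: float = 42.0):
    """S1: R0 + R1 -> R2 (RS1); S2: R3 + R4 -> R0 (RS2); S0 (RS0) produces R0."""
    reg_value = {"R0": 0.0, "R1": 1.0}
    reg_tag = {"R0": "RS0"}                  # S0 has issued but not completed

    # Issue S1: R0 is not available, so RS1 records the producer's tag.
    rs1 = {"qj": reg_tag["R0"], "vj": None, "vk": reg_value["R1"]}
    # Issue S2: R0 now points to RS2; RS1 keeps waiting on RS0.
    reg_tag["R0"] = "RS2"

    def broadcast(tag: str, value: float) -> None:
        if rs1["qj"] == tag:                 # the reservation station captures it
            rs1["vj"], rs1["qj"] = value, None
        if reg_tag.get("R0") == tag:         # the register accepts a matching tag only
            reg_value["R0"] = value
            reg_tag["R0"] = None

    broadcast("RS2", s2_result)              # S2 finishes first
    broadcast("RS0", s0_result)              # S0 finishes later
    return rs1["vj"], reg_value["R0"]


s1_operand, final_r0 = anti_dependence_demo()
print(s1_operand, final_r0)
```

S1 receives S0's value through RS1, while R0 keeps S2's value, even though S2 wrote R0 before S1 obtained its operand.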

Output-dependence example: Consider the following output dependence: S1: R1 + R2 → R0; S2: R3 * R4 → R0. When S1 is issued off the FP op queue, it is assigned a reservation station, say RS1. Suppose S2 is assigned RS2 when it is issued.

Output-dependence example - cont’d: If S1 completes before S2 is issued, the tag of R0 still points to RS1 when S1 completes, so S1’s result is written into R0; when S2 later completes its execution, it overwrites R0. If S2 is issued and assigned RS2 before S1 completes its execution, the tag of R0 will point to RS2, and we have to consider three sub-cases. S1 completes before S2: since the tag of R0 no longer points to RS1, the result of S1 will not overwrite R0. S2 completes before S1: S2’s result is written into R0; when S1 completes, since the tag of R0 no longer points to RS1, the result of S1 will not be entered into R0.

Output-dependence example - cont’d: S1 and S2 complete at the same time: exclusivity of bus ownership ensures that only one of the two sub-cases above will occur. Renaming of R0 thus also prevents violation of the output dependency.
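The sub-cases for the output dependence reduce to one rule: a register accepts a broadcast only if its tag matches. A minimal sketch, assuming both S1 and S2 have been issued before either completes; the `Register` class, the `final_r0` helper and the result values 3.0 and 12.0 are hypothetical.

```python
from typing import List, Optional


class Register:
    """A register with the tag of its most recently issued writer."""

    def __init__(self, value: float = 0.0) -> None:
        self.value = value
        self.tag: Optional[str] = None   # None: no pending writer

    def rename_to(self, tag: str) -> None:
        self.tag = tag                   # issuing a writer retargets the tag

    def on_broadcast(self, tag: str, value: float) -> bool:
        if self.tag != tag:
            return False                 # stale writer: result is ignored
        self.value, self.tag = value, None
        return True


def final_r0(completion_order: List[str]) -> float:
    results = {"RS1": 3.0, "RS2": 12.0}  # assumed results of S1 and S2
    r0 = Register()
    r0.rename_to("RS1")                  # S1 issued
    r0.rename_to("RS2")                  # S2 issued before S1 completes
    for tag in completion_order:         # the bus carries one result at a time
        r0.on_broadcast(tag, results[tag])
    return r0.value


s1_first = final_r0(["RS1", "RS2"])
s2_first = final_r0(["RS2", "RS1"])
print(s1_first, s2_first)
```

Both completion orders leave S2's result in R0, which is the value program order requires.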

Memory Disambiguation: Detection of RAW dependencies through memory. Example: is there a RAW dependency between SD F6, 44(R4) and LD F8, 32(R8)? Loads must be checked against preceding stores (RAW). Stores must be checked against preceding loads and stores (WAR and WAW, respectively). A simple scheme: all effective address calculations are performed in program order. The buffers’ A field stores the effective address. Forwarding can be used directly to/from the load/store buffers.
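Whether the SD/LD pair above has a RAW dependency depends only on the run-time register contents. The sketch below compares effective addresses in the way the A field would allow; the helper names and the register values are assumptions, and access size and partial overlap are ignored for simplicity.

```python
from typing import Dict, List


def effective_address(regs: Dict[str, int], base: str, offset: int) -> int:
    """Effective address = base register contents + displacement."""
    return regs[base] + offset


def load_conflicts(load_addr: int, earlier_store_addrs: List[int]) -> bool:
    """A load must wait if an earlier, still-pending store uses its address."""
    return load_addr in earlier_store_addrs


def demo(r4: int, r8: int) -> bool:
    regs = {"R4": r4, "R8": r8}
    store_a = effective_address(regs, "R4", 44)   # SD F6, 44(R4)
    load_a = effective_address(regs, "R8", 32)    # LD F8, 32(R8)
    return load_conflicts(load_a, [store_a])


same_location = demo(r4=100, r8=112)   # 144 and 144: RAW through memory
different = demo(r4=100, r8=200)       # 144 and 232: independent
print(same_location, different)
```

Computing the addresses in program order, as the slide suggests, guarantees that every earlier store address is known by the time the load is checked.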

Disadvantages: Relies on a global bus, which limits scalability. Modern CPUs have many more registers and buffers, so tag comparison becomes expensive and can impact the critical path of instruction processing.

Example 1: Functional unit latencies are as follows: FPADD = 3 cycles, FPMULT = 5 cycles, Integer/Branch = 1 cycle, LD/SD = 2 cycles. There is one functional unit of each type, each with a single reservation station. Functional units are pipelined. If an operand is written over the CDB in one cycle, dependent operations execute in the next cycle.

Example 1 (cont.) [Table: Code, Issue, Execute, Writeback]
    L.D F2, 0(R1): 1-2, 3
    MUL.D F4, F2, F0: 1, 4-8, 9
    L.D F6, 0(R2): 4, 5-6, 7
    ADD.D F6, F4, F6: 5, 10-12, 13
    S.D F6, 0(R2): 8, 14-15
    DADDUI R1, R1, #8: 10, 11
    DADDIU R2, R2, #-8: 12, 14
    BGT R1, #800: 15, 16
Only one reservation station.
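The rows of Example 1 that list all three values (MUL.D, the second L.D, and ADD.D) follow from the stated latencies and the next-cycle rule. The helper below is a sketch under assumptions that are not stated on the slide: execution starts no earlier than the cycle after issue, write-back takes the cycle after execution ends, and CDB contention and reservation station availability are ignored.

```python
from typing import List, Tuple


def schedule(issue: int, latency: int,
             producer_writebacks: List[int]) -> Tuple[int, int, int]:
    """Return (first execute cycle, last execute cycle, write-back cycle)."""
    # A dependent operation executes the cycle after its operand is on the CDB.
    start = max([issue + 1] + [wb + 1 for wb in producer_writebacks])
    end = start + latency - 1
    return start, end, end + 1


FPMULT, FPADD, LDSD = 5, 3, 2   # latencies from Example 1

mul = schedule(issue=1, latency=FPMULT, producer_writebacks=[3])       # needs F2 from L.D
load2 = schedule(issue=4, latency=LDSD, producer_writebacks=[])        # L.D F6, 0(R2)
add = schedule(issue=5, latency=FPADD, producer_writebacks=[mul[2], load2[2]])
print(mul, load2, add)
```

The three results reproduce the table entries 4-8/9, 5-6/7 and 10-12/13; applying the same call to S.D (issue 8, waiting on the ADD.D write-back at 13) yields the execute range 14-15.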

Example 2: Now consider the status of the reservation stations, load/store buffers, and FP registers. Show the status of the data structures for the following program when the first MUL.D has completed execution but not yet written its result.

Example 2: Note the potential conflict on the CDB in cycle 8.

Overlapping Loop Iterations: Register renaming: multiple iterations use different physical destinations for registers (dynamic loop unrolling). Reservation stations: permit instruction issue to advance past integer control flow operations, and also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard. Alternative perspective: Tomasulo’s algorithm builds the data flow dependency graph on the fly.

Additional Reference: See the Tomasulo example in the lecture notes for CS 252, Department of EECS, University of California, Berkeley, taught by Professor David Patterson and available at http://www.cs.berkeley.edu/~pattrsn/252S01/

Conclusions: Three key elements improve performance: dynamic scheduling, register renaming, and memory disambiguation. What limits instruction concurrency? Hardware resources and control flow.