Module: Part 2 Dynamic Scheduling in Hardware - Tomasulo’s Algorithm

Spring 2005 © Krishna V. Palem, Weng Fai Wong, and Sudhakar Yalamanchili, Georgia Institute of Technology (slides contributed by Prof. Weng Fai Wong were prepared while visiting, and employed by, Georgia Tech)

Algorithms for Out-of-order Issue
- Scoreboarding
- Tomasulo’s Algorithm
- Others

In-Order Issue, Out-of-order Execution, Out-of-order Completion
[Figure: pipeline of I-Fetch, Execution Core, and Retire stages]

Dynamic Scheduling
- Hardware detects and preserves dependencies (within a limited window of the instruction stream)
- Hardware checks for resource availability
- Independent instructions are issued to the correct functional units

Advantages
- Correctness of execution is guaranteed by hardware
- Independent of compiler optimizations
- Backward compatibility
- With software scheduling, a different machine configuration necessitates recompilation (or at least rescheduling)

IBM 360/91
- Introduced in 1966
- Introduced many important architectural innovations: pipelining, parallel functional units, out-of-order execution, imprecise interrupts, load/store buffers
- Can execute programs 10 to 100 times faster than its immediate predecessor (IBM 7090)
- According to Hennessy and Patterson: “Many of the ideas in the 360/91 faded from use for nearly 25 years before being broadly employed in the 1990s.”

IBM 360 Instruction Format
- Known as the RX format
- All instructions (except loads and stores) are of the format
    SOURCE op SINK → SINK
  where SOURCE may be a memory operand or a register, while SINK must be a register

Tomasulo’s Algorithm
- Credited to R.M. Tomasulo, who presented it in a paper
- Implemented for the floating point unit of the IBM 360/91

IBM 360/91 FPU
[Figure: block diagram of the FPU. FP operation “stack” (8 entries, fed from the instruction unit), load buffers (6, fed from memory), FP registers, decoder, operand busses, operation bus, reservation stations (3 for FP Add, 2 for FP Mul/Div), FP Add (2 stage), FP Mul/Div (6 stage), store buffers (3, to memory), common data bus]

IBM 360/91 FPU
- FP Ops “Stack”: FP operations are sent by the instruction unit to the FPU into a “stack” (IBM terminology; it is actually a queue)

IBM 360/91 FPU
- Decoder: decides whether the operation is an add or a multiply/divide

IBM 360/91 FPU
- FP Registers: 4 floating point registers

IBM 360/91 FPU
- Load Buffer: buffers for loads. Each load request that goes out to memory gets a buffer allocated

IBM 360/91 FPU
- FP Add (2 stage) and FP Mul/Div (6 stage): the two floating point functional units

IBM 360/91 FPU
- Operand Busses: supply operands to the reservation stations. Each operand has a tag

IBM 360/91 FPU
- Reservation Stations: each reservation station holds the two operands of an operation together with their tags and busy bits (a busy bit is set while the operand is not yet available)

Tags
- Each tag uniquely identifies either
  - one of the 5 reservation stations, or
  - one of the 6 load buffers
- A tag indicates the “producer” of an operand that is not available from the registers
- A zero tag indicates that the operand is immediately available
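The tag scheme can be sketched in a few lines of Python. The numbering below (0 for "available", 1 to 6 for load buffers, 7 to 11 for reservation stations) and all function names are assumptions made for illustration; the slides do not give the actual encoding used by the 360/91.

```python
# Hypothetical tag numbering (not taken from the slides):
#   0     -> operand is immediately available
#   1-6   -> load buffers
#   7-11  -> reservation stations
NUM_LOAD_BUFFERS = 6
NUM_RESERVATION_STATIONS = 5


def load_buffer_tag(i):
    """Tag for load buffer i (1-based)."""
    assert 1 <= i <= NUM_LOAD_BUFFERS
    return i


def reservation_station_tag(i):
    """Tag for reservation station i (1-based)."""
    assert 1 <= i <= NUM_RESERVATION_STATIONS
    return NUM_LOAD_BUFFERS + i


def describe(tag):
    """Name the producer that a tag refers to."""
    if tag == 0:
        return "value available"
    if tag <= NUM_LOAD_BUFFERS:
        return f"load buffer {tag}"
    return f"reservation station {tag - NUM_LOAD_BUFFERS}"


# 11 producers plus the zero tag give 12 distinct codes, which fit in 4 bits.
TAG_BITS = (NUM_LOAD_BUFFERS + NUM_RESERVATION_STATIONS).bit_length()
```

Under this assumed numbering, a 4-bit tag field is enough to name every possible producer.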

Reservation Stations
Each reservation station contains the following fields:
- the operation to be performed (also known as a CTRL field in IBM terminology)
- the SOURCE
- the tag for the SOURCE, together with the busy bit
- the SINK
- the tag for the SINK, together with the busy bit

Data Structures
- LD/SD buffers act as reservation stations for the memory units
- Instruction execution cannot start until all preceding branches are resolved
- Reservation station fields: Op, Qj, Qk, Vj, Vk, A, Busy
- Register status fields: register value, Qi
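As a rough illustration, the fields named on this slide could be written down as Python dataclasses. The class names, the default values, and the use of 0 as the "no producer" tag are assumptions for this sketch, not part of the original material.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReservationStation:
    busy: bool = False          # Busy: station holds an instruction
    op: Optional[str] = None    # Op: operation to perform
    vj: Optional[float] = None  # Vj: value of the first operand, if known
    vk: Optional[float] = None  # Vk: value of the second operand, if known
    qj: int = 0                 # Qj: tag of the producer of Vj (0 = Vj is valid)
    qk: int = 0                 # Qk: tag of the producer of Vk (0 = Vk is valid)
    a: Optional[int] = None     # A: effective address (load/store buffers only)

    def ready(self):
        """An occupied station may execute once neither operand is awaited."""
        return self.busy and self.qj == 0 and self.qk == 0


@dataclass
class RegisterStatus:
    value: float = 0.0  # current register value
    qi: int = 0         # Qi: tag of the station that will write it (0 = none)
```

A station created with `qk=3` is not ready until the value from producer 3 arrives and `qk` is reset to 0.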

IBM 360/91 FPU
- Common Data Bus: all operand transport occurs on the common data bus; only one operand may occupy the bus at a time

Tomasulo’s Algorithm
- Decode the operation at the head of the floating point operation stack
- Look for an empty reservation station in the functional unit corresponding to the operation. If none exists, instruction issue stalls until one does
- Read the source operands from the register file, bringing forward their tags

Tomasulo’s Algorithm - cont’d
- Mark the busy bit of the SINK in the register file, and set its tag to point to the selected reservation station
- When the functional unit completes its execution, it writes its result and the corresponding reservation station number back to the register file via the common data bus

Tomasulo’s Algorithm - cont’d
- All units listen to the bus; if the broadcast value is one of the operands a unit needs, the unit reads it in and clears the busy bit
- When a functional unit is free, it examines its reservation stations. A station whose operands both have their busy bits clear is selected for execution
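The issue, broadcast, and select steps can be sketched as a toy Python model. This is a simplification under stated assumptions: a single pool of stations rather than one pool per functional unit, no timing, tag 0 meaning "value is current", and hypothetical names (`Tomasulo`, `issue`, `ready`, `complete`) chosen for this sketch.

```python
import operator

OPS = {"ADD": operator.add, "SUB": operator.sub,
       "MUL": operator.mul, "DIV": operator.truediv}


class Tomasulo:
    def __init__(self, regs, num_stations):
        self.value = dict(regs)            # register values
        self.tag = {r: 0 for r in regs}    # 0 = register value is current
        # Station tags are 1..num_stations; None marks a free station.
        self.stations = {t: None for t in range(1, num_stations + 1)}

    def issue(self, op, dst, src1, src2):
        """Claim a free station, copy operand values or tags, retarget dst."""
        free = [t for t, s in self.stations.items() if s is None]
        if not free:
            return None                    # no free station: issue stalls
        t = free[0]
        entry = {"op": op}
        for key, src in (("j", src1), ("k", src2)):
            entry["q" + key] = self.tag[src]
            entry["v" + key] = self.value[src] if self.tag[src] == 0 else None
        self.stations[t] = entry
        self.tag[dst] = t                  # dst will be produced by station t
        return t

    def ready(self):
        """Tags of stations whose operands are both available."""
        return [t for t, s in self.stations.items()
                if s is not None and s["qj"] == 0 and s["qk"] == 0]

    def complete(self, t):
        """Execute station t and broadcast (t, result) on the bus."""
        s = self.stations[t]
        assert s is not None and s["qj"] == 0 and s["qk"] == 0
        result = OPS[s["op"]](s["vj"], s["vk"])
        self.stations[t] = None
        for other in self.stations.values():       # waiting stations capture
            if other is None:
                continue
            for key in "jk":
                if other["q" + key] == t:
                    other["v" + key], other["q" + key] = result, 0
        for r in self.tag:                         # registers still tagged t
            if self.tag[r] == t:
                self.value[r], self.tag[r] = result, 0
        return result
```

For example, after issuing a multiply into F0 followed by an add that reads F0, only the multiply's station is reported by `ready()`; the add becomes ready once `complete` broadcasts the product.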

Flow Dependency
- Flow dependency is obeyed
- The exclusivity and broadcast nature of the common data bus ensure that once an operand is produced, all operations requiring it are notified
- Anti and output dependencies are handled by implicit register renaming

The Essence of Register Renaming
Original code:
    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    S.D   F6, 0(R1)
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8
- WAR: ADD.D reads F8, which SUB.D later writes
- WAW: ADD.D and MUL.D both write F6
After renaming (S and T are new temporary registers):
    DIV.D F0, F2, F4
    ADD.D S, F0, F8
    S.D   S, 0(R1)
    SUB.D T, F10, F14
    MUL.D F6, F10, T
- Renaming is performed by the hardware using additional storage → reservation stations
- Equivalently, renaming can be performed by the compiler
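A small Python sketch can make the effect of renaming on this code sequence concrete. It uses a simpler policy than the slide: every destination gets a fresh name (`P0`, `P1`, ...), whereas the slide renames only F6 and F8. The tuple encoding of instructions and both function names are assumptions for illustration.

```python
def rename(program):
    """Give each destination a fresh name; sources follow the latest mapping."""
    mapping = {}     # architectural register -> most recent fresh name
    renamed = []
    counter = 0
    for op, dst, srcs in program:
        new_srcs = [mapping.get(s, s) for s in srcs]
        new_dst = None
        if dst is not None:
            new_dst = f"P{counter}"
            counter += 1
            mapping[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed


def name_hazards(program):
    """List (kind, earlier index, later index) for every WAR or WAW pair."""
    found = []
    for i, (_, dst_i, srcs_i) in enumerate(program):
        for j in range(i + 1, len(program)):
            dst_j = program[j][1]
            if dst_j is None:
                continue
            if dst_j in srcs_i:
                found.append(("WAR", i, j))
            if dst_j == dst_i:
                found.append(("WAW", i, j))
    return found


# The slide's sequence as (opcode, destination, sources); S.D writes no register.
PROGRAM = [
    ("DIV.D", "F0", ["F2", "F4"]),
    ("ADD.D", "F6", ["F0", "F8"]),
    ("S.D", None, ["F6", "R1"]),
    ("SUB.D", "F8", ["F10", "F14"]),
    ("MUL.D", "F6", ["F10", "F8"]),
]
```

Running `name_hazards(PROGRAM)` reports the WAR on F8 and the WAW on F6 noted above, plus a WAR between S.D and MUL.D on F6; `name_hazards(rename(PROGRAM))` reports none, while the true (RAW) dependences are preserved through the renamed sources.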

The Data Path and Functional Units
- Register renaming: a form of “generalized” forwarding
- The reservation stations can be viewed as “renaming registers” that are physically distributed among the functional units
- Control logic for forwarding results is distributed among the functional units
- Serialization of the broadcast of results enables correct operation

Anti-dependence example
Consider the following anti-dependence:
    S1: R0 + R1 → R2
    S2: R3 + R4 → R0
- Suppose both S1 and S2 are issued and now reside in two distinct reservation stations, say RS1 and RS2
- Let S0 be the operation producing R0 for S1, and assume that it occupies RS0. Two possibilities: either
  - S0 has completed, or
  - S0 has not completed execution

Anti-dependence example - cont’d
- If S0 has completed:
  - the value of R0 was read during the issuing of S1 and now resides in RS1
  - even if S2 completes and overwrites R0, there is no effect on S1
- If S0 has not completed:
  - when S0 completes, its result goes straight to RS1
  - R0 is not written with the value produced by S0, because R0 now points to RS2, not RS0

Anti-dependence example - cont’d
- R0 is mapped to two physical locations (one register and one reservation station field), depending on the instruction
- Implicit register renaming by the hardware overcomes the anti-dependence
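The harder case (S0 still executing when S1 and S2 issue) can be traced with a short Python sketch. The register values, the value 10.0 produced by S0, and the function names are all made up for illustration; `None` stands for "no pending producer".

```python
def broadcast(tag, value, stations, reg_value, reg_tag):
    """Put (tag, value) on the common data bus; every waiting consumer captures it."""
    for station in stations.values():
        for q, v in (("qj", "vj"), ("qk", "vk")):
            if station[q] == tag:
                station[v], station[q] = value, None
    for reg in reg_tag:
        if reg_tag[reg] == tag:
            reg_value[reg], reg_tag[reg] = value, None


def run_anti_dependence():
    # State after S1 and S2 have issued while S0 (in RS0) is still executing.
    reg_value = {"R0": 0.0, "R1": 1.0, "R2": 0.0, "R3": 3.0, "R4": 4.0}
    reg_tag = {"R0": "RS2", "R1": None, "R2": "RS1", "R3": None, "R4": None}
    stations = {
        "RS1": {"qj": "RS0", "vj": None, "qk": None, "vk": 1.0},  # S1: R0 + R1
        "RS2": {"qj": None, "vj": 3.0, "qk": None, "vk": 4.0},    # S2: R3 + R4
    }
    # S2 finishes first and writes R0, because R0's tag names RS2.
    s2 = stations["RS2"]
    broadcast("RS2", s2["vj"] + s2["vk"], stations, reg_value, reg_tag)
    # S0 finishes late; its result reaches RS1 but not the register file.
    broadcast("RS0", 10.0, stations, reg_value, reg_tag)
    # S1 now has both operands and computes with S0's value, not S2's.
    s1 = stations["RS1"]
    broadcast("RS1", s1["vj"] + s1["vk"], stations, reg_value, reg_tag)
    return reg_value
```

In this trace R0 ends up holding S2's result (7.0) while R2 holds 11.0, computed from S0's value, so the anti-dependence is respected even though S2 completed before S1.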

Output-dependence example
Consider the following output dependence:
    S1: R1 + R2 → R0
    S2: R3 * R4 → R0
- When S1 is issued off the FP op queue, it is assigned a reservation station, say RS1
- Suppose S2 is assigned RS2 when issued

Output-dependence example - cont’d
- If S1 completes before S2 is issued:
  - the tag of R0 still points to RS1 when S1 completes, so S1’s result is written into R0
  - when S2 later completes its execution, it overwrites R0
- If S2 is issued and assigned RS2 before S1 completes its execution, the tag of R0 points to RS2. We have to consider three sub-cases:
  - S1 completes before S2: since the tag of R0 no longer points to RS1, the result of S1 does not overwrite R0
  - S2 completes before S1: S2’s result is written into R0; when S1 completes, since the tag of R0 no longer points to RS1, the result of S1 is not entered into R0

Output-dependence example - cont’d
  - S1 and S2 complete at the same time: exclusivity of bus ownership ensures that only one of the two sub-cases above occurs
- Renaming of R0 thus also prevents violation of the output dependency
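The sub-cases where S2 issues before S1 completes can be checked with a few lines of Python. The operand values (1, 2, 3, 4) and the function name `final_r0` are hypothetical; the point is only the tag comparison that guards the register write.

```python
def final_r0(completion_order):
    """Return R0 after S1 (in RS1) and S2 (in RS2) complete in the given order."""
    reg_value = {"R0": 0.0}
    reg_tag = {"R0": None}
    results = {"RS1": 1.0 + 2.0,   # S1: R1 + R2
               "RS2": 3.0 * 4.0}   # S2: R3 * R4
    # Issue in program order: each issue retargets the tag of R0.
    for tag in ("RS1", "RS2"):
        reg_tag["R0"] = tag
    # Each completion broadcasts its tag; R0 accepts only the matching one.
    for tag in completion_order:
        if reg_tag["R0"] == tag:
            reg_value["R0"] = results[tag]
            reg_tag["R0"] = None
    return reg_value["R0"]
```

Both `final_r0(["RS1", "RS2"])` and `final_r0(["RS2", "RS1"])` return S2's result, 12.0, which is the value program order requires.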

Memory Disambiguation
- Detection of RAW dependencies through memory:
    SD F6, 44(R4)
    LD F8, 32(R8)    (RAW dependency?)
- Loads must be checked against preceding stores (RAW)
- Stores must be checked against preceding loads and stores (WAR and WAW)
- A simple scheme: all effective address calculations are performed in program order
- The buffers’ A field stores the effective address
- Forwarding can be used directly to/from the load/store buffers
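The load-versus-earlier-store check can be sketched as an address comparison. The register contents and the function names are assumptions for illustration, and the sketch assumes all accesses have the same width and alignment, so equal effective addresses are the only way two accesses overlap.

```python
def effective_address(offset, base, regs):
    """Effective address of offset(base), as held in a buffer's A field."""
    return regs[base] + offset


def load_conflicts(load, earlier_stores, regs):
    """True if the load reads an address that an earlier pending store writes."""
    addr = effective_address(load[0], load[1], regs)
    return any(effective_address(off, base, regs) == addr
               for off, base in earlier_stores)


# SD F6, 44(R4) followed by LD F8, 32(R8), as (offset, base register) pairs.
STORE = (44, "R4")
LOAD = (32, "R8")
```

With hypothetical contents R4 = 100 and R8 = 112, both accesses resolve to address 144, so the load has a RAW dependence on the store and must wait for it (or take its value by forwarding). With R8 = 200 the addresses differ and the load may proceed.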

Disadvantages
- Relies on a global bus, which limits scalability
- Modern CPUs have many more registers and buffers; tag comparison becomes expensive and can impact the critical path of instruction processing

Example 1
Functional unit latencies are as follows:
- FPADD = 3 cycles
- FPMULT = 5 cycles
- Integer/Branch = 1 cycle
- LD/SD = 2 cycles
Assumptions:
- One functional unit of each type, each with a single reservation station
- Functional units are pipelined
- If an operand is written over the CDB in one cycle, dependent operations execute in the next cycle

Example 1 (cont.)
Code                    Issue   Execute   Writeback
L.D    F2, 0(R1)                1-2       3
MUL.D  F4, F2, F0       1       4-8       9
L.D    F6, 0(R2)        4       5-6       7
ADD.D  F6, F4, F6       5       10-12     13
S.D    F6, 0(R2)        8       14-15
DADDUI R1, R1, #8               10        11
DADDIU R2, R2, #-8      12                14
BGT    R1, #800         15      16
Note: only one reservation station per functional unit.
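The execute and write-back cycles of the floating point and memory rows follow from the stated latencies and the rule that a dependent operation executes in the cycle after its operand appears on the CDB. The Python sketch below encodes just that rule; the function name `schedule` and the unit labels are assumptions, and the model ignores reservation station availability and CDB contention.

```python
LATENCY = {"FPADD": 3, "FPMULT": 5, "INT": 1, "LDSD": 2}


def schedule(issue_cycle, unit, operand_writebacks):
    """Return (first execute cycle, last execute cycle, write-back cycle).

    Execution starts in the cycle after both the issue and the last
    operand write-back; the result is written in the cycle after it ends.
    """
    start = max([issue_cycle] + list(operand_writebacks)) + 1
    end = start + LATENCY[unit] - 1
    return start, end, end + 1
```

For instance, MUL.D issues in cycle 1 but waits for F2, written in cycle 3, so `schedule(1, "FPMULT", [3])` gives execution in cycles 4 to 8 and write-back in cycle 9. ADD.D then waits for F4 (cycle 9) and F6 (cycle 7): `schedule(5, "FPADD", [9, 7])` gives cycles 10 to 12 and write-back in cycle 13.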

Example 2
- Now consider the status of the reservation stations, load/store buffers, and FP registers
- Show the status of the data structures for the following program when the first MUL.D has completed execution but has not yet written its result


Example 2 (cont.)
Note the potential conflict on the CDB in cycle 8.

Overlapping Loop Iterations
- Register renaming
  - Multiple iterations use different physical destinations for registers (dynamic loop unrolling)
- Reservation stations
  - Permit instruction issue to advance past integer control flow operations
  - Also buffer old values of registers, totally avoiding the WAR stall that we saw in the scoreboard
- Alternative perspective: Tomasulo’s algorithm builds the data flow dependency graph on the fly

Additional Reference
See the Tomasulo example in the lecture notes for CS 252, Department of EECS, University of California, Berkeley, taught by Professor David Patterson.

Conclusions
- Three key elements improve performance:
  - Dynamic scheduling
  - Register renaming
  - Memory disambiguation
- What limits instruction concurrency?
  - Hardware resources
  - Control flow
