COMP25212 Advanced Pipelining Out of Order Processors.

2 Classic 5-stage pipeline: Fetch, Decode, Exec, Mem and Write stages, each with its own logic, served by an instruction cache and a data cache. A single execution flow.

3 Modern Pipelines: Fetch and Decode (served by the instruction cache) feed several Functional Units (FU), each ending in its own Write Back: Add (Add1), Load (Ld1, Ld2), Multiply (Mul1, Mul2, Mul3) and Divide (Div1, Div2, Div3). Many execution flows.

4 ARM Pipelines In-order processor Out of order processor

5 Out of Order Execution
The original order of instructions in a program is not preserved
Processors execute instructions as their input data becomes available
Pipeline stalls caused by conflicted instructions are avoided by processing instructions that are able to run immediately
Takes advantage of instruction-level parallelism (ILP)
Instructions per cycle increases

6 Conflicted Instructions
Cache misses: long wait before finishing execution
Structural hazard: the required resources are not available
Data hazard: dependencies between instructions

7 Structural Hazards
Functional Units are typically not pipelined, so only one instruction can use each of them at a time
If all Functional Units suitable for executing an instruction are busy, the instruction cannot be executed
This is known as a structural hazard

8 Data dependencies
True dependency, read-after-write (RAW): r1 <- r2 op r3 followed by r4 <- r1 op r5
Anti-dependency, write-after-read (WAR): r1 <- r2 op r3 followed by r2 <- r4 op r5
Output dependency, write-after-write (WAW): r1 <- r2 op r3 … r1 <- r4 op r5
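The three cases above can be checked mechanically by comparing destination and source registers. The following is a minimal sketch, not taken from the slides: the `Instr` tuple and the `classify` helper are hypothetical names introduced for illustration, and every instruction is assumed to have exactly one destination register.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction record: one destination, some sources."""
    dest: str
    srcs: Tuple[str, ...]


def classify(first: Instr, second: Instr) -> Set[str]:
    """Return the hazards that `second` has on the earlier `first`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")  # second reads what first writes
    if second.dest in first.srcs:
        kinds.add("WAR")  # second overwrites what first reads
    if first.dest == second.dest:
        kinds.add("WAW")  # both write the same register
    return kinds


# The three register patterns from the slide.
true_dep = classify(Instr("r1", ("r2", "r3")), Instr("r4", ("r1", "r5")))
anti_dep = classify(Instr("r1", ("r2", "r3")), Instr("r2", ("r4", "r5")))
output_dep = classify(Instr("r1", ("r2", "r3")), Instr("r1", ("r4", "r5")))
print(true_dep, anti_dep, output_dep)
```

A pair of instructions can carry more than one kind of dependency at once, which is why the helper returns a set.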

9 Dynamic Scheduling
Dynamic pipeline scheduling overcomes the limitations of in-order pipelined execution by allowing out-of-order instruction execution
Key idea: allow instructions behind a stall to proceed => instructions executing in parallel. There are multiple execution units, so use them
    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F12, F8, F14
Even though ADDD stalls (it needs F0 from DIVD), SUBD has no dependencies and can run
  – Enables out-of-order execution => out-of-order completion

Out of Order Execution with Scoreboard

11 Scoreboard
The scoreboard is a centralized hardware mechanism
  – Instructions are executed as soon as their operands are available and there are no hazard conditions
It dynamically constructs the dependency graph in hardware for a window of instructions as they are issued in program order
The scoreboard is a “data structure” that provides the information necessary for all pieces of the processor to work together (in Appendix A.8)
CDC 6600 (1963)

12 The Key Idea of Scoreboards
Out-of-order execution divides the ID stage:
1. Issue: decode instructions, check for structural hazards
2. Read operands: wait until no data hazards, then read operands
The scoreboard allows an instruction to execute whenever 1 & 2 hold, without waiting for prior instructions
We will use in-order issue, out-of-order execution and out-of-order commit (also called completion)

13 Typical Scoreboard Structure

14 Stages of a Scoreboard Pipeline: Issue, then Read Operands, then Execute followed by Write Back in one of the functional units (Integer, FP Multiplication, FP Multiplication, FP Add, FP Division).

15 Stages of a Scoreboard Pipeline
1. Issue: decode instructions & check for structural & WAW hazards (ID). Always done in program order.
If a functional unit for the instruction is free (no structural hazards) and no other active instruction has the same destination register (no WAW), the scoreboard issues the instruction to the functional unit and updates its internal data structure.
If a structural or WAW hazard exists, then the instruction issue stalls, and no further instructions will issue until these hazards are cleared.
2. Read operands: wait until no data hazards, then read operands (RO). Can be done out of program order.
A source operand is available if no earlier issued active instruction is going to write it, or if the register containing the operand is being written by a currently active functional unit (no RAW).
When the source operands are available, the scoreboard tells the functional unit to proceed to read the operands from the registers and begin execution.
The scoreboard resolves RAW hazards dynamically in this step, and instructions may be sent into execution out of order.

16 Stages of a Scoreboard Pipeline
3. Execution: operate on operands (EX)
The functional unit begins execution upon receiving operands. When the result is ready, it notifies the scoreboard that it has completed execution.
4. Write result: finish execution (WB)
Once the functional unit has completed execution, the scoreboard checks for WAR hazards. If there are none, it writes the result. If there is a WAR hazard, it stalls the instruction.
Example:
    DIVD F0, F2, F4
    ADDD F10, F0, F8
    SUBD F8, F8, F14
The scoreboard would stall SUBD completion until ADDD reads its operands

17 Information within the Scoreboard
1. Instruction status: which of the 4 steps the instruction is in
2. Functional unit status: indicates the state of the functional unit (FU). 9 fields for each functional unit:
  – Busy: indicates whether the unit is being used or not
  – Op: operation to perform in the unit (e.g., + or –)
  – Fi: destination register
  – Fj, Fk: source-register numbers
  – Qj, Qk: functional units producing source registers Fj, Fk
  – Rj, Rk: flags indicating when Fj, Fk are ready. Set to No after operands are read.
3. Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register
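The functional unit status and register result status tables, together with the issue, read-operands and write-result checks described on the preceding slides, can be sketched in software as follows. This is an illustrative model under stated assumptions, not the hardware design: the class and method names are hypothetical, timing is ignored, and the demo assumes two adders (`Add1`, `Add2`) so that ADDD and SUBD from the WAR example can both be in flight.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class FUStatus:
    """The 9 functional unit status fields (names follow the slide)."""
    busy: bool = False
    op: Optional[str] = None
    fi: Optional[str] = None  # destination register
    fj: Optional[str] = None  # source registers
    fk: Optional[str] = None
    qj: Optional[str] = None  # FUs producing Fj, Fk (None = no producer)
    qk: Optional[str] = None
    rj: bool = False          # Fj, Fk ready and not yet read
    rk: bool = False


class Scoreboard:
    def __init__(self, fu_names: List[str]) -> None:
        self.fu: Dict[str, FUStatus] = {n: FUStatus() for n in fu_names}
        self.result: Dict[str, str] = {}  # register result status

    def can_issue(self, fu: str, dest: str) -> bool:
        # No structural hazard (FU free) and no WAW (no pending writer).
        return not self.fu[fu].busy and dest not in self.result

    def issue(self, fu: str, op: str, dest: str, s1: str, s2: str) -> bool:
        if not self.can_issue(fu, dest):
            return False
        qj, qk = self.result.get(s1), self.result.get(s2)
        self.fu[fu] = FUStatus(busy=True, op=op, fi=dest, fj=s1, fk=s2,
                               qj=qj, qk=qk, rj=qj is None, rk=qk is None)
        self.result[dest] = fu
        return True

    def can_read_operands(self, fu: str) -> bool:
        s = self.fu[fu]
        return s.busy and s.rj and s.rk  # no RAW outstanding

    def read_operands(self, fu: str) -> None:
        s = self.fu[fu]
        s.rj = s.rk = False  # "Set to No after operands are read"

    def can_write_result(self, fu: str) -> bool:
        # WAR check: no other active FU still has to read our destination.
        fi = self.fu[fu].fi
        for name, o in self.fu.items():
            if name != fu and o.busy and (
                    (o.fj == fi and o.rj) or (o.fk == fi and o.rk)):
                return False
        return True

    def write_result(self, fu: str) -> None:
        for o in self.fu.values():  # wake up instructions waiting on us
            if o.qj == fu:
                o.qj, o.rj = None, True
            if o.qk == fu:
                o.qk, o.rk = None, True
        if self.result.get(self.fu[fu].fi) == fu:
            del self.result[self.fu[fu].fi]
        self.fu[fu] = FUStatus()


# WAR example from the write-result slide.
sb = Scoreboard(["Div", "Add1", "Add2"])
sb.issue("Div", "DIVD", "F0", "F2", "F4")
sb.issue("Add1", "ADDD", "F10", "F0", "F8")
sb.issue("Add2", "SUBD", "F8", "F8", "F14")
sb.read_operands("Add2")
stalled = not sb.can_write_result("Add2")  # ADDD has not read F8 yet
```

In this sketch SUBD can read its operands and execute ahead of ADDD, but its write to F8 is held back until ADDD, which is waiting for DIVD, has read the old F8.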

18 A Scoreboard Example
The following code is run on a scoreboard pipeline:
    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2
with these functional units:
    Functional Unit (FU)      # of FUs   EX cycles
    Integer Mem               1          1
    Floating Point Multiply   2          10
    Floating Point Add        1          2
    Floating Point Divide     1          40
Functional units are not pipelined

19 Dependency Graph For Example Code
Example code:
    1  L.D   F6, 34(R2)
    2  L.D   F2, 45(R3)
    3  MUL.D F0, F2, F4
    4  SUB.D F8, F6, F2
    5  DIV.D F10, F0, F6
    6  ADD.D F6, F8, F2
Real data dependence (RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output dependence (WAW): (1, 6)
Anti-dependence (WAR): (5, 6)
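The dependence pairs can be recomputed from the register names alone. The sketch below is an illustration, not part of the original slides: the tuple encoding of the program and the `dependence_pairs` helper are assumptions made here. Note that a purely register-based check also reports (4, 6) as an anti-dependence, because SUB.D reads F6 before ADD.D overwrites it; the slide lists only (5, 6), presumably because instructions 4 and 6 are already ordered by the RAW dependence on F8.

```python
# Example program as (opcode, destination, sources); numbering starts at 1.
program = [
    ("L.D",   "F6",  ("R2",)),
    ("L.D",   "F2",  ("R3",)),
    ("MUL.D", "F0",  ("F2", "F4")),
    ("SUB.D", "F8",  ("F6", "F2")),
    ("DIV.D", "F10", ("F0", "F6")),
    ("ADD.D", "F6",  ("F8", "F2")),
]


def dependence_pairs(prog):
    """Return sorted (earlier, later) pairs for RAW, WAR and WAW."""
    raw, war, waw = [], [], []
    for j, (_, dest_j, srcs_j) in enumerate(prog):
        for i in range(j):
            _, dest_i, srcs_i = prog[i]
            # Registers written strictly between i and j hide older values.
            between = {d for _, d, _ in prog[i + 1:j]}
            pair = (i + 1, j + 1)
            if dest_i in srcs_j and dest_i not in between:
                raw.append(pair)  # j reads what i wrote
            if dest_j in srcs_i and dest_j not in between:
                war.append(pair)  # j overwrites what i reads
            if dest_i == dest_j and dest_i not in between:
                waw.append(pair)  # j overwrites what i wrote
    return sorted(raw), sorted(war), sorted(waw)


raw, war, waw = dependence_pairs(program)
print("RAW:", raw)
print("WAR:", war)
print("WAW:", waw)
```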

20 Scoreboard Example (figure labels: instruction stream, FU count down, clock cycle counter)
Functional Units: 1 Integer, 2 Multiplication, 1 Addition, 1 Division

21 Scoreboard Example Cycle 1

28 Scoreboard Example Cycle 8a

29 Scoreboard Example Cycle 8b

39 cycles later…

45 Scoreboard Example Cycle 62
In-order issue, out-of-order execution and out-of-order completion.

46 Summary
Scoreboard techniques to deal with hazards:
  – Result forwarding to reduce or eliminate RAW hazards
  – Hazard detection hardware to stall the pipeline during hazards
  – Uses a hardware-based mechanism to rearrange instruction execution order to reduce stalls dynamically at runtime (dynamic scheduling)
    » Better dynamic exploitation of instruction-level parallelism (ILP)

47 Limitations of Scoreboard
The amount of parallelism available among a block of instructions
The number of scoreboard entries determines the window size (typically small, ~5 instructions)
The number and types of functional units (structural hazards increase with out-of-order execution)
The presence of anti-dependences and output dependences leads to WAR and WAW stalls

Out of Order Execution with Tomasulo

49 Tomasulo’s Algorithm
Control logic for out-of-order execution is decentralized
  – Reservation Stations (RS) in the functional units keep instruction information
  – In addition, RS seamlessly rename registers
A Common Data Bus (CDB) broadcasts data and results to the different devices
  – A single instruction can finish each cycle
Distributed control allows for a larger window of instructions
  – Dynamic scheduling

50 Tomasulo’s Algorithm
Structural hazards stall the pipeline
RS track when operands are available and buffer them as soon as they are
  – No need for a register bank (store values or sources)
The impact of RAW dependencies is limited
  – Execute an instruction when its operands are available
WAW and WAR dependencies are avoided
  – Register renaming

51 Register Renaming (Example)
Original code:
    DIV.D F0, F2, F4
    ADD.D F6, F0, F8
    ST.D  F6, 0(R1)
    SUB.D F8, F10, F14
    MUL.D F6, F10, F8
Antidependence: ADD.D reads F8, which SUB.D later writes
Output dependence: ADD.D and MUL.D both write F6
Using temporary registers S, T:
    DIV.D F0, F2, F4
    ADD.D S, F0, F8
    ST.D  S, 0(R1)
    SUB.D T, F10, F14
    MUL.D F6, F10, T
Only the true dependences remain
Eliminates WAR and WAW hazards by renaming all destination registers. Can be done by the compiler
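Renaming every destination register can be sketched as a single pass over the code. This is an illustration under assumptions, not the slide's exact transformation: the tuple encoding, the `rename` helper and the fresh names `P0, P1, ...` are hypothetical, and every destination is renamed, whereas the slide introduces only the two temporaries S and T that are strictly needed.

```python
from itertools import count


def rename(prog):
    """Give each destination a fresh name; later sources follow the mapping."""
    fresh = (f"P{n}" for n in count())  # hypothetical pool of new registers
    mapping = {}                        # architectural name -> current name
    renamed = []
    for op, dest, srcs in prog:
        # Sources are translated first, so an instruction such as
        # SUB F8, F8, F14 still reads the previous value of F8.
        new_srcs = tuple(mapping.get(s, s) for s in srcs)
        new_dest = None
        if dest is not None:            # stores have no destination
            new_dest = next(fresh)
            mapping[dest] = new_dest
        renamed.append((op, new_dest, new_srcs))
    return renamed


# Original code from the slide as (opcode, destination, sources).
original = [
    ("DIV.D", "F0", ("F2", "F4")),
    ("ADD.D", "F6", ("F0", "F8")),
    ("ST.D",  None, ("F6", "R1")),
    ("SUB.D", "F8", ("F10", "F14")),
    ("MUL.D", "F6", ("F10", "F8")),
]
renamed = rename(original)
for instr in renamed:
    print(instr)
```

After renaming, no two instructions write the same register and no instruction writes a register that an earlier one reads, so only the true (RAW) dependences constrain the order.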

52 Tomasulo Organization
The FP Op Queue and the FP Registers feed the Reservation Stations: Add1, Add2, Add3 for the FP adders and Mult1, Mult2 for the FP multipliers
Load Buffers (Load1 to Load6) receive data from memory; Store Buffers send data to memory
Results are broadcast on the Common Data Bus (CDB)
Normal data bus: data + destination
Common data bus: data + source

53 Stages of a Tomasulo Pipeline: Issue, then Execute followed by Write Back in one of the functional units (Integer, FP Multiplication, FP Multiplication, FP Add, FP Division).

54 Three Stages of Tomasulo Algorithm
1. Issue: get instruction from FP Op Queue
If a reservation station is free (no structural hazard), control issues the instruction & sends operands (renames registers).
2. Execute: operate on operands (EX)
When both source operands are ready, execute; if not ready, watch the Common Data Bus for the result
3. Write result: finish execution (WB)
Write on the Common Data Bus to all awaiting units; mark reservation station available
Normal data bus: data + destination (“go to” bus)
Common data bus: data + source (“come from” bus)
  – 64 bits of data + 4 bits of Functional Unit source address
  – Write if matches expected Functional Unit (produces result)
  – Does the broadcast

55 Reservation Station Components
Op: operation to perform in the unit (e.g., + or –)
Vj, Vk: values of the source operands
  – Store buffers have a V field, the result to be stored
Qj, Qk: reservation stations producing the source registers (value to be written)
  – Note: Qj, Qk = 0 => ready
  – Store buffers only have Qi for the RS producing the result
Busy: indicates the reservation station or FU is busy
Register result status: indicates which functional unit will write each register, if one exists. Blank when no pending instruction will write that register.
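The reservation station fields and the CDB broadcast can be modelled in a few lines. The sketch below is a simplified illustration, not the actual hardware: the `Station` and `Tomasulo` names are hypothetical, `None` stands in for the slide's Qj, Qk = 0 (ready) encoding, execution latency is ignored, and the register values in the demo are made up.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Station:
    """Reservation station fields (names follow the slide)."""
    busy: bool = False
    op: Optional[str] = None
    vj: Optional[float] = None  # buffered operand values
    vk: Optional[float] = None
    qj: Optional[str] = None    # producing stations (None = value ready)
    qk: Optional[str] = None

    def ready(self) -> bool:
        return self.busy and self.qj is None and self.qk is None


class Tomasulo:
    def __init__(self, names: List[str], regs: Dict[str, float]) -> None:
        self.rs: Dict[str, Station] = {n: Station() for n in names}
        self.regs = dict(regs)
        self.reg_status: Dict[str, str] = {}  # register -> producing RS

    def issue(self, name: str, op: str, dest: str, s1: str, s2: str) -> bool:
        if self.rs[name].busy:
            return False  # structural hazard: station occupied
        qj, qk = self.reg_status.get(s1), self.reg_status.get(s2)
        # Ready operands are copied now, which removes WAR hazards.
        self.rs[name] = Station(
            busy=True, op=op,
            vj=None if qj else self.regs[s1],
            vk=None if qk else self.regs[s2],
            qj=qj, qk=qk)
        # The newest writer replaces any older one, which removes WAW hazards.
        self.reg_status[dest] = name
        return True

    def write_result(self, name: str, value: float) -> None:
        """Broadcast (source station, value) on the CDB."""
        for st in self.rs.values():
            if st.qj == name:
                st.vj, st.qj = value, None
            if st.qk == name:
                st.vk, st.qk = value, None
        for reg in [r for r, p in self.reg_status.items() if p == name]:
            self.regs[reg] = value
            del self.reg_status[reg]
        self.rs[name] = Station()  # mark the station available


# Made-up register values for illustration only.
cpu = Tomasulo(["Mult1", "Add1", "Add2"],
               {"F0": 0.0, "F2": 8.0, "F4": 2.0, "F6": 0.0,
                "F8": 1.0, "F10": 5.0, "F14": 3.0})
cpu.issue("Mult1", "DIV.D", "F0", "F2", "F4")
cpu.issue("Add1", "ADD.D", "F6", "F0", "F8")     # waits for Mult1
cpu.issue("Add2", "SUB.D", "F8", "F10", "F14")   # ready immediately
cpu.write_result("Add2", 5.0 - 3.0)              # SUB.D completes first
```

Because ADD.D copied the old value of F8 into Vk at issue, SUB.D can write F8 early without disturbing it; ADD.D then only waits for the DIV.D result to appear on the CDB.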

56 A Tomasulo Example
The following code is run on a Tomasulo pipeline:
    L.D   F6, 34(R2)
    L.D   F2, 45(R3)
    MUL.D F0, F2, F4
    SUB.D F8, F6, F2
    DIV.D F10, F0, F6
    ADD.D F6, F8, F2
with these functional units:
    Functional Unit (FU)       # of FUs   EX cycles
    FP Multiply/Division       2          10/40
    FP Addition/Subtraction    3          2
    Mem Load                   3
Functional units not pipelined

57 Dependency Graph
Example code:
    1  L.D   F6, 34(R2)
    2  L.D   F2, 45(R3)
    3  MUL.D F0, F2, F4
    4  SUB.D F8, F6, F2
    5  DIV.D F10, F0, F6
    6  ADD.D F6, F8, F2
Real data dependence (RAW): (1, 4) (1, 5) (2, 3) (2, 4) (2, 6) (3, 5) (4, 6)
Output dependence (WAW): (1, 6)
Anti-dependence (WAR): (5, 6)

58 Tomasulo Example (figure labels: clock cycle counter, FU count down, instruction stream)
3 Load Buffers, 3 FP Adder R.S., 2 FP Mult R.S.

59 Tomasulo Example Cycle 1

39 cycles later…

78 Tomasulo Example Cycle 57
In-order issue, out-of-order execution and out-of-order completion.

79 Tomasulo’s advantages
(1) Distributed hazard detection logic
  – Distributed reservation stations and the CDB
  – If multiple instructions are waiting on a single result, and each instruction already has its other operand, then the instructions can be dispatched simultaneously by broadcasting on the CDB
  – If a centralized register file were used, the units would have to read their results from the registers when register buses are available
(2) Avoids stalling due to WAW or WAR hazards

80 Tomasulo Drawbacks
Complexity of hardware
Performance limited by the Common Data Bus
  – Each CDB must go to all functional units => high capacitance, high wiring density
  – Number of functional units that can complete per cycle limited to one!
    » Multiple CDBs => more FU logic for parallel stores

81 Summary
Reservation stations: implicit register renaming to a larger set of registers + buffering of source operands
  – Prevents registers from being a bottleneck
  – Avoids the WAR and WAW hazards of Scoreboard
Lasting Contributions
  – Dynamic scheduling
  – Register renaming
  – Load/store disambiguation

Summary of Out-of-Order Processors

83 Out of Order Processors
BENEFITS:
Accelerates the execution of programs
More efficient design
  – Increases the utilisation of processor resources
LIMITATIONS:
More complex design
Very expensive in terms of area and power
Non-precise interrupts
  – Interrupting exactly after an instruction might not be possible

84 Scoreboard vs Tomasulo
