Superscalar Processors J. Nelson Amaral
Scalar to Superscalar
Scalar processor: one instruction passes through each pipeline stage in each cycle.
Superscalar processor: multiple instructions are at each pipeline stage in each cycle (wider pipeline).
Superpipelined processor: stages are decomposed into smaller stages → more stages (deeper pipeline).
Baer p. 75
Superscalar
Front end (IF and ID): must fetch and decode multiple instructions per cycle. An m-way superscalar brings (ideally) m instructions per cycle into the pipeline.
Back end (EX, Mem and WB): must execute and write back several instructions per cycle.
Baer p. 75
Superscalar
In-order (or static): instructions leave the front end in program order.
Out-of-order (or dynamic): instructions leave the front end, and execute, in an order different from program order. WB is called the commit stage and must ensure that the program semantics are followed. The design is more complex.
Baer p. 76
Limits to Superscalar Performance
Superscalars rely on exploiting Instruction-Level Parallelism (ILP).
They remove WAR and WAW dependences.
But the amount of ILP is limited by RAW (true) dependences.
Example:
S0: R1 ← R2 + R3
S1: R4 ← R1 + R5
S2: R1 ← R6 + R7
S3: R4 ← R1 + R9
Data Dependence Graph (edges):
S0 → S1: RAW
S0 → S2: WAW
S1 → S2: WAR
S1 → S3: WAW
S2 → S3: RAW
Baer p. 76
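The dependence edges of the S0 to S3 example can be derived mechanically. The following is a minimal sketch, not taken from the slides: the helper names `parse` and `dependences` and the ASCII arrow `<-` are illustrative assumptions. It reports, for each instruction, the RAW edge from the most recent writer of each source, and the WAR and WAW edges on its destination back to the previous writer.

```python
import re

def parse(instr):
    """Split 'R1 <- R2 + R3' into (destination, [sources]). Hypothetical helper."""
    dest, rhs = instr.split("<-")
    return dest.strip(), re.findall(r"R\d+", rhs)

def dependences(program):
    """Return sorted (earlier, later, kind, register) tuples for a straight-line program."""
    parsed = [parse(p) for p in program]
    deps = []
    for j, (dest_j, srcs_j) in enumerate(parsed):
        # RAW: each source depends on its most recent earlier writer.
        for src in dict.fromkeys(srcs_j):
            for i in range(j - 1, -1, -1):
                if parsed[i][0] == src:
                    deps.append((i, j, "RAW", src))
                    break
        # WAR and WAW on the destination, scanning back to the previous writer.
        for i in range(j - 1, -1, -1):
            dest_i, srcs_i = parsed[i]
            if dest_j in srcs_i:
                deps.append((i, j, "WAR", dest_j))
            if dest_i == dest_j:
                deps.append((i, j, "WAW", dest_j))
                break
    return sorted(deps)

program = [
    "R1 <- R2 + R3",  # S0
    "R4 <- R1 + R5",  # S1
    "R1 <- R6 + R7",  # S2
    "R4 <- R1 + R9",  # S3
]
for earlier, later, kind, reg in dependences(program):
    print(f"S{earlier} -> S{later}: {kind} on {reg}")
```

Only the two RAW edges (S0 → S1 and S2 → S3) survive once WAR and WAW dependences are removed, which is why the two pairs could in principle execute in parallel.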
Limits to Superscalar Performance
Complexity of the logic needed to remove dependences.
Designers predicted 8-way and 16-way superscalars.
We have 6-way superscalars, and m is not likely to grow.
Baer p. 76
Limits to Superscalar Performance
Number of Forwarding Paths
[Figures: forwarding paths in a 1-way and in a 2-way pipeline]
An m-way superscalar requires m² forwarding paths.
The paths may become too long for signal propagation within a single clock cycle.
Baer p. 76
Limits to Clock Cycle Reduction
Power dissipation increases with frequency.
Pipeline registers are read and written in every cycle.
The time to access a pipeline register imposes a lower bound on the duration of a pipeline stage.
Baer p. 76
Limits on Pipeline Length
Speculative actions (e.g., branch prediction) are resolved later in a longer pipeline.
Recovery from misspeculation is delayed.
14-stage pipeline: branch misprediction penalty of 10 cycles.
31-stage pipeline: branch misprediction penalty of 20 cycles.
Baer p. 76
Why the Multicore Revolution?
Power dissipation: grows linearly with clock frequency.
- Cannot make single cores faster.
Moore’s Law: the number of transistors on a chip continues to grow exponentially.
- What to do with the extra logic?
Design complexity: extracting more performance from a single core requires extreme design complexity.
- What to do with the extra logic?
Baer p. 77
Speed Demons vs. Brainiacs
DEC Alpha (1994): in-order superscalar.
Pentium III (1999): out-of-order superscalar, with register renaming, a reorder buffer, and reservation stations.
Baer p. 77
Out-of-Order and Memory Hierarchy
Question: Does out-of-order execution help hide memory latencies?
Short answer: No.
Latencies of 100 cycles or more are too long: they fill up all internal queues and stall the pipeline.
Latencies around 100 cycles are too short to justify context switching.
Solution: hardware support for several contexts to enable fast context switching → multithreading.
Baer p. 78
DEC Alpha 21164
4-way in-order RISC
[Figure: 21164 block diagram; recoverable labels: Instruction Buffer, virtually indexed, 32, 32, 64-bit]
Miss Address File: merges outstanding misses to the same L2 line.
Baer p. 79
21164 Instruction Pipeline
Front-end stages:
S0: brings 4 instructions from the I-cache (accesses the I-cache and the I-TLB in parallel).
S1: performs branch prediction and calculates the branch target.
S2 (slotting stage): steers instructions to units; resolves static conflicts.
S3: resolves dynamic conflicts; schedules forwarding and stalls.
Integer pipe 1: shifter and multiplier.
Integer pipe 2: branches.
48-entry I-TLB; 64-entry D-TLB.
Baer pp. 79-80
Example
i1: R1 ← R2 + R3 # Use integer pipeline 1
i2: R4 ← R1 - R5
i3: R7 ← R8 - R9 # Requires an integer pipeline
i4: F0 ← F2 + F4 # Floating-point add
i5 to i12: assume no structural or data hazard for these instructions.
Baer p. 81
Front-end Occupancy
i1: R1 ← R2 + R3
i2: R4 ← R1 - R5
i3: R7 ← R8 - R9
i4: F0 ← F2 + F4
Time t0: S0 = i1 i2 i3 i4
Time t0 + 1: S0 = i5 i6 i7 i8; S1 = i1 i2 i3 i4
Baer p. 82
Front-end Occupancy
Time t0 + 2: S0 = i9 i10 i11 i12; S1 = i5 i6 i7 i8; S2 = i1 i2 i3 i4
Baer p. 82
Front-end Occupancy
Time t0 + 3: S0 = i9 i10 i11 i12; S1 = i5 i6 i7 i8; S2 = i3 i4; S3 = i1 i2
i3 cannot move to S3 because of a resource conflict (there are only two integer pipelines).
i4 does not move to S3, to preserve program order (it is blocked by i3).
Baer p. 82
Front-end Occupancy
Time t0 + 4: S0 = i9 i10 i11 i12; S1 = i5 i6 i7 i8; S2 = i3 i4; S3 = i2; Backend = i1
i2 cannot move to the back end because of its RAW dependency on i1.
Baer p. 82
Front-end Occupancy
Time t0 + 5: S0 = i13 i14 i15 i16; S1 = i9 i10 i11 i12; S2 = i5 i6 i7 i8; S3 = i3 i4; Backend = i1 i2
Baer p. 82
Backend
Begins the L1 D-cache and D-TLB accesses.
Decides hit or miss in the L1 D-cache and D-TLB.
Hit: forwards data (if needed); writes to an integer or FP register.
Miss: starts the access to L2.
Data is available if the access hits in L2.
Baer p. 82
Scoreboard Speculation
Example: a load L and a dependent use U reach S3 at cycle t.
If the load hits in the L1 cache, L is scheduled at t+1 and U at t+3.
The scoreboard assumes a hit when it schedules U; whether the load actually hit or missed is only known later, in the back end.
If it is a miss, any dependent instruction already issued is aborted.
Baer p. 82
Can the Compiler Help Performance? (Example)
i1: R1 ← Mem[R2]
i2: R4 ← R1 + R3
i3: R5 ← R1 + R6
i4: R7 ← R4 + R5
Assume that all four instructions are in the issuing slot (stage S2) at time t.
Compiler Effect
Time t: S0 = i9 i10 i11 i12; S1 = i5 i6 i7 i8; S2 = i1 i2 i3 i4
Time t + 1: instruction i3 cannot advance to S3 because of a structural hazard: the load in i1 uses an integer pipe to compute its address.
Baer p. 82
Compiler Effect
[Table: occupancy of stages S0 to S3 and the back end at times t + 1, t + 2 and t + 3]
i2 cannot advance because of its RAW dependency on i1.
At t+3, the load continues execution in the back end (2-cycle latency).
Baer p. 82
Compiler Effect
[Table: occupancy of stages S0 to S3 and the back end at time t + 4, for instructions i1 to i16]
Baer p. 82
Compiler Effect
[Table: occupancy of stages S0 to S3 and the back end at time t + 5]
i4 cannot advance because of its RAW dependency on i3.
Baer p. 82
Compiler Effect
[Table: occupancy of stages S0 to S3 and the back end at time t + 6]
i4 advances to execution at t+6, and it will be the only integer instruction executing in that cycle.
Baer p. 82
After Compiler Optimization
i1: R1 ← Mem[R2]
i1’: integer nop
i2: R4 ← R1 + R3
i3: R5 ← R1 + R6
i4: R7 ← R4 + R5
Time t: S0 = i8 i9 i10 i11; S1 = i4 i5 i6 i7; S2 = i1 i1’ i2 i3
Time t + 1: two integer instructions advance to S3.
Baer p. 82
After Compiler Optimization
[Table: occupancy of stages S0 to S3 and the back end at time t + 2]
Baer p. 82
After Compiler Optimization
[Table: occupancy of stages S0 to S3 and the back end at times t + 3 and t + 4]
The load in i1 still needs two cycles to execute.
Baer p. 82
After Compiler Optimization
[Table: occupancy of stages S0 to S3 and the back end at time t + 5]
i2 and i3 can advance to the back end together. There is no dependency between them.
Baer p. 82
After Compiler Optimization
[Table: occupancy of stages S0 to S3 and the back end at time t + 6]
i4 still advances to the back end at t+6, but now i5 could advance along with i4.
* The textbook says that i4 would advance to the back end at t+5.
Baer p. 82
Scoreboarding
“Scoreboarding allows instructions to execute out of order when there are sufficient resources and no data dependences.”
John L. Hennessy and David A. Patterson, Computer Architecture: A Quantitative Approach, Third Edition, p. A-69.
Another scoreboarding
Scoreboarding
Thornton's algorithm (scoreboarding), CDC 6600 (1964): a single unit (the scoreboard) monitors the progress of the execution of instructions and the status of all registers.
Tomasulo’s algorithm, IBM 360/91 (1967): reservation stations buffer operands and results. A Common Data Bus (CDB) distributes results directly to functional units.
Some of this material is from Prof. Vojin G. Oklobdzija’s tutorial at ISSCC’97.
Baer p. 81
CDC 6600
[Figure: CDC 6600 functional units, organized in Groups I to IV]
Not shown: the branch unit, which modifies the PC.
Baer p. 86
CDC 6600 Scoreboard Operation: Issue
Free functional unit? No: stall. Yes: check for a WAW hazard.
WAW hazard? Yes: stall. No: issue.
Baer p. 86
CDC 6600 Scoreboard Operation: Dispatch
Mark the execution unit busy.
Operands ready? No: stall. Yes: read the operands.
Baer p. 87
CDC 6600 Scoreboard Operation: Execution
Execution complete? No: stall. Yes: notify the scoreboard that the unit is ready to write its result.
Baer p. 87
CDC 6600 Scoreboard Operation: Write result
WAR hazard? Yes: stall. No: write.
WAR example:
i0: DIV.D F0, F2, F4
i1: ADD.D F10, F0, F8
i2: SUB.D F8, F8, F14
The write of i2 has to stall until i1 has read F8.
Baer p. 87
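The issue and write-result checks can be sketched as two predicates. This is a simplified illustration under stated assumptions, not the CDC 6600 logic itself: the `Instr` record, its `has_read_operands` flag, the unit names `Div`, `Add` and `Sub`, and the function names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Instr:
    name: str
    unit: str                        # functional unit the instruction needs
    dest: str                        # result register
    srcs: tuple                      # source registers
    has_read_operands: bool = False  # set once the operands have been read

def can_issue(instr, busy_units, pending_writes):
    """Issue needs a free functional unit and no WAW hazard on the destination."""
    return instr.unit not in busy_units and instr.dest not in pending_writes

def can_write(instr, in_flight):
    """Stall the write while an earlier instruction has not yet read the old
    value of the destination register (WAR hazard).
    in_flight is assumed to be in program order."""
    for earlier in in_flight:
        if earlier is instr:
            break
        if instr.dest in earlier.srcs and not earlier.has_read_operands:
            return False
    return True

# WAR example from the slide: i2 writes F8, which i1 still has to read.
i0 = Instr("i0", "Div", "F0", ("F2", "F4"), has_read_operands=True)
i1 = Instr("i1", "Add", "F10", ("F0", "F8"))
i2 = Instr("i2", "Sub", "F8", ("F8", "F14"), has_read_operands=True)
in_flight = [i0, i1, i2]

print(can_write(i2, in_flight))   # i1 has not read F8 yet
i1.has_read_operands = True
print(can_write(i2, in_flight))   # i1 has read F8
```

The same `can_issue` test rejects an instruction either when its unit is busy or when an in-flight instruction will write the same register, which are the two stall branches of the issue flowchart.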
Scoreboarding Example
i1: R4 ← R0 * R2 # Uses multiplier 1
i2: R6 ← R4 * R8 # Uses multiplier 2
i3: R8 ← R2 + R12 # Uses Adder
i4: R4 ← R14 + R16 # Uses Adder
Baer p. 88
Instructions in Flight: Cycle 1
Columns: Fi = result register; Fj, Fk = source registers; Qj, Qk = units producing the sources; Rj, Rk = flags.
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     issued      R4  R0   R2   -      -   1   1
Unit busy (U)?: Mult1 = -; Mult2 = -; Adder = -
Register result unit: R4 = Mult1
Baer p. 88
Instructions in Flight: Cycle 2
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     dispatched  R4  R0   R2   -      -   1   1
i2     issued      R6  R4   R8   Mult1  -   -   1
Unit busy (U)?: Mult1 = 1; Mult2 = -; Adder = -
Register result unit: R4 = Mult1; R6 = Mult2
Baer p. 88
Instructions in Flight: Cycle 3
i2 cannot be dispatched because R4 is not available.
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     execute     R4  R0   R2   -      -   1   1
i2     issued      R6  R4   R8   Mult1  -   -   1
i3     issued      R8  R2   R12  -      -   1   1
The i3 values are wrong in Table 3.2 (p. 88) of the textbook.
Unit busy (U)?: Mult1 = 1; Mult2 = -; Adder = -
Register result unit: R4 = Mult1; R6 = Mult2; R8 = Adder
Baer p. 88
Instructions in Flight: Cycle 4
i4 cannot issue: (i) the Adder is busy; and (ii) it has a WAW dependency on i1.
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     execute     R4  R0   R2   -      -   1   1
i2     issued      R6  R4   R8   Mult1  -   -   1
i3     dispatched  R8  R2   R12  -      -   1   1
Unit busy (U)?: Mult1 = 1; Mult2 = -; Adder = 1
Register result unit: R4 = Mult1; R6 = Mult2; R8 = Adder
Baer p. 88
Instructions in Flight: Cycle 5
(No change, except that i3 is now executing.)
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     execute     R4  R0   R2   -      -   1   1
i2     issued      R6  R4   R8   Mult1  -   -   1
i3     execute     R8  R2   R12  -      -   1   1
Unit busy (U)?: Mult1 = 1; Mult2 = -; Adder = 1
Register result unit: R4 = Mult1; R6 = Mult2; R8 = Adder
Baer p. 88
Instructions in Flight: Cycle 6
i3 asks for permission to write. Permission is denied (WAR with i2, which has not yet read R8).
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     execute     R4  R0   R2   -      -   1   1
i2     issued      R6  R4   R8   Mult1  -   -   1
i3     execute     R8  R2   R12  -      -   1   1
Unit busy (U)?: Mult1 = 1; Mult2 = -; Adder = 1
Register result unit: R4 = Mult1; R6 = Mult2; R8 = Adder
Baer p. 88
Instructions in Flight: Cycle 8
i1 asks for permission to write. Permission is granted.
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i1     write       R4  R0   R2   -      -   1   1
i2     issued      R6  R4   R8   Mult1  -   -   1
i3     execute     R8  R2   R12  -      -   1   1
Unit busy (U)?: Mult1 = 1; Mult2 = -; Adder = 1
Register result unit: R4 = Mult1; R6 = Mult2; R8 = Adder
Baer p. 88
Instructions in Flight: Cycle 9
Instr  Status      Fi  Fj   Fk   Qj     Qk  Rj  Rk
i2     dispatched  R6  R4   R8   Mult1  -   -   1
i3     write       R8  R2   R12  -      -   1   1
i4     issued      R4  R14  R16  -      -   1   1
Unit busy (U)?: Mult1 = -; Mult2 = 1; Adder = -
Register result unit: R4 = Adder; R6 = Mult2; R8 = Adder
Baer p. 88
Register Renaming, Reorder Buffer, and Reservation Stations
Difference between in-order and out-of-order execution:
When do instructions leave the front end?
In-order: WAR and WAW dependences prevent dispatch.
Out-of-order: register renaming avoids WAR and WAW dependences.
How are instructions processed in the back end?
Instructions can wait in reservation stations because of RAW dependences or structural hazards.
A reorder buffer imposes commitment in program order.
Baer p. 89
Register Renaming (example)
i1: R1 ← R2/R3 # Takes a long time
i2: R4 ← R1 + R5
i3: R5 ← R6 + R7
i4: R1 ← R8 + R9
The registers that appear in the program are logical (or architectural) registers.
At the last stage of the front end, all registers are mapped to physical registers.
In-order: only i1 issues. The others are blocked behind i2, which has a RAW dependency on i1.
Out-of-order: i3 and i4 can issue and finish execution while i1 executes.
Baer p. 89
Renaming Process
Renaming stage: Ri ← Rj op Rk becomes Ra ← Rb op Rc, where
Rb = Rename(Rj);
Rc = Rename(Rk);
Ra = freelist(first);
Rename(Ri) = freelist(first);
first ← next(first)
Baer p. 90
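The renaming steps can be sketched in a few lines. This is a minimal illustration under stated assumptions: the class name `Renamer`, the 32 logical and 64 physical registers, and the initial identity mapping (logical Ri maps to physical Ri) are not from the slides; the free list starting at R32 is.

```python
from collections import deque

class Renamer:
    """Rename table plus free list (hypothetical sizes: 32 logical, 64 physical)."""

    def __init__(self, num_logical=32, num_physical=64):
        # Assumption: logical Ri initially maps to physical Ri.
        self.rename = {f"R{i}": f"R{i}" for i in range(num_logical)}
        self.freelist = deque(f"R{i}" for i in range(num_logical, num_physical))

    def rename_instr(self, dest, src1, src2):
        """Rename Ri <- Rj op Rk into Ra <- Rb op Rc."""
        # Sources are looked up first, so an instruction that reads and writes
        # the same logical register still sees the old mapping.
        rb = self.rename[src1]
        rc = self.rename[src2]
        # An empty free list would raise IndexError here; a real front end stalls.
        ra = self.freelist.popleft()
        self.rename[dest] = ra
        return ra, rb, rc

renamer = Renamer()
program = [
    ("R1", "R2", "R3"),  # i1: R1 <- R2 / R3
    ("R4", "R1", "R5"),  # i2: R4 <- R1 + R5
    ("R5", "R6", "R7"),  # i3: R5 <- R6 + R7
    ("R1", "R8", "R9"),  # i4: R1 <- R8 + R9
]
for number, instr in enumerate(program, start=1):
    ra, rb, rc = renamer.rename_instr(*instr)
    print(f"i{number}: {ra} <- {rb} op {rc}")
```

After renaming, i3 writes R34 instead of R5 and i4 writes R35 instead of R1, so the WAR dependence of i3 on i2 and the WAW dependence of i4 on i1 disappear; only the RAW dependence of i2 on i1 (through R32) remains.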
Register Renaming (example)
Freelist = {R32, R33, R34, R35, R36, …}
Before renaming → after renaming:
i1: R1 ← R2/R3 → R32 ← R2/R3
i2: R4 ← R1 + R5 → R33 ← R32 + R5
i3: R5 ← R6 + R7 → R34 ← R6 + R7
i4: R1 ← R8 + R9 → R35 ← R8 + R9
Rename table (Ri: Rename(Ri)): R1: R32, then R35; R4: R33; R5: R34
How about i3, can it write into R5 before i1 and i2 complete?
If i1 generates an exception, what will be the value of R5 in the exception state?
i4 will finish execution before i1. Can we allow it to write its result to R1 before i1?
Baer p. 90
Reorder Buffer
Even though we allow out-of-order execution, we require in-order completion.
A reorder buffer (ROB) ensures that the results produced by instructions are committed to the logical registers in program order.
Baer p. 91
Reorder Buffer (cont.)
Each entry in the ROB has the following fields:
flag: has the instruction completed?
value: value computed by the instruction
result register name: logical register
instruction type: arithmetic/load/store/branch/…
Each instruction that has its destination register renamed is entered in the ROB.
Baer p. 91
Reorder Buffer (example)
i1: R1 ← R2/R3 (renamed: R32 ← R2/R3)
i2: R4 ← R1 + R5 (renamed: R33 ← R32 + R5)
i3: R5 ← R6 + R7 (renamed: R34 ← R6 + R7)
i4: R1 ← R8 + R9 (renamed: R35 ← R8 + R9)
Freelist = {R32, R33, R34, R35, R36, …}
Instruction  Flag       Value  Reg. Name  Type
i1 (head)    Not Ready  None   R1         Arit
i2           Not Ready  None   R4         Arit
i3           Not Ready  None   R5         Arit
i4           Not Ready  None   R1         Arit
When an instruction completes, its flag becomes Ready and its value is filled in; the slide shows this for i1, i3 and i4.
Baer p. 92
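In-order commitment through a reorder buffer can be sketched as a queue whose head is the oldest instruction. This is a simplified model under stated assumptions: the class name `ReorderBuffer`, its method names, and the numeric values used in the example are illustrative and not from the slides.

```python
from collections import deque

class ReorderBuffer:
    """Entries are appended in program order and committed only from the head."""

    def __init__(self):
        self.entries = deque()
        self.logical_regs = {}  # committed (architectural) register state

    def allocate(self, name, reg, kind="arith"):
        """Enter an instruction at the tail when its destination is renamed."""
        entry = {"instr": name, "ready": False, "value": None, "reg": reg, "type": kind}
        self.entries.append(entry)
        return entry

    def complete(self, entry, value):
        """Execution finished, possibly out of order: record the result."""
        entry["ready"] = True
        entry["value"] = value

    def commit(self):
        """Retire completed instructions from the head, in program order."""
        committed = []
        while self.entries and self.entries[0]["ready"]:
            entry = self.entries.popleft()
            self.logical_regs[entry["reg"]] = entry["value"]
            committed.append(entry["instr"])
        return committed

rob = ReorderBuffer()
e1 = rob.allocate("i1", "R1")
e2 = rob.allocate("i2", "R4")
e3 = rob.allocate("i3", "R5")
e4 = rob.allocate("i4", "R1")

rob.complete(e3, 30)   # i3 and i4 finish before the long-running i1
rob.complete(e4, 40)
print(rob.commit())    # nothing commits: i1 at the head is not ready
rob.complete(e1, 10)
print(rob.commit())    # i1 commits; i2 now blocks the head
rob.complete(e2, 20)
print(rob.commit())    # i2, i3 and i4 commit in program order
```

Because i3 and i4 cannot update the logical registers until i1 and i2 have committed, an exception raised by i1 still sees the old values of R5 and R1.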
But…
Where do instructions wait before being executed?
How does an instruction know that it is ready to be executed?
Baer p. 93
Reservation Stations
After register renaming, the front end dispatches the instruction to a reservation station.
Reservation stations can:
be grouped into a centralized queue called an instruction window;
be associated with functional units according to the opcode.
Baer p. 93
Reservation Stations (cont.)
Each entry in a reservation station must contain:
the operation to be performed;
the source operands (either the value or the physical name of the register; a flag indicates which one);
the physical name of the result register;
the ROB entry where the result will be stored.
Baer p. 93
Scheduling
Scheduling: selection of which instruction should execute next in a given execution unit.
Possible policies: oldest instruction; critical instruction.
Baer p. 93
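An oldest-first policy can be sketched as picking, among the ready entries waiting for a unit, the one that entered the window first. The function name `select_oldest_ready` and the `seq` and `ready` fields are hypothetical; a critical-instruction policy would need extra information that this sketch does not model.

```python
def select_oldest_ready(window):
    """Return the ready entry with the smallest sequence number, or None.

    Each entry is a dict with a 'seq' field (program-order sequence number,
    smaller means older) and a 'ready' flag; both are illustrative names.
    """
    ready = [entry for entry in window if entry["ready"]]
    if not ready:
        return None
    return min(ready, key=lambda entry: entry["seq"])

window = [
    {"name": "i2", "seq": 2, "ready": False},  # still waiting for an operand
    {"name": "i4", "seq": 4, "ready": True},
    {"name": "i3", "seq": 3, "ready": True},
]
chosen = select_oldest_ready(window)
print(chosen["name"])
```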
Ready Bit
A ready bit is associated with each physical register.
When an instruction that uses a physical register Ri is dispatched:
if Ri is ready, pass the value of Ri to the reservation station and set the flag to true (ready);
if Ri is not ready, pass the name of Ri to the reservation station and set the flag to false (not ready).
When both flags are true, the instruction is ready to be issued.
Baer p. 93
Ready Bit (cont.)
Upon completion, an instruction broadcasts the name and content of its result physical register to all reservation stations (RS).
Each RS that needs it grabs the content and updates its flags.
Baer p. 93
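The dispatch and broadcast steps can be sketched as follows. This is an illustrative model under stated assumptions: the class `ReservationStation`, its `snoop` method, the `broadcast` function, and the dictionaries `regfile` and `ready` (standing in for the physical register file and its ready bits) are hypothetical names, as are the register values.

```python
class ReservationStation:
    """One entry: each operand holds a flag plus either a value or a register name."""

    def __init__(self, op, dest, sources, regfile, ready):
        self.op = op
        self.dest = dest  # physical name of the result register
        # Ready register: capture its value (flag True).
        # Not ready: capture its physical name (flag False).
        self.operands = [
            [True, regfile[r]] if ready[r] else [False, r] for r in sources
        ]

    def is_ready(self):
        """The instruction can be issued once every operand flag is true."""
        return all(flag for flag, _ in self.operands)

    def snoop(self, name, value):
        """Grab a broadcast result if this entry is waiting for that register."""
        for operand in self.operands:
            if not operand[0] and operand[1] == name:
                operand[0], operand[1] = True, value

def broadcast(stations, regfile, ready, name, value):
    """On completion: write the result and wake up the waiting entries."""
    regfile[name] = value
    ready[name] = True
    for station in stations:
        station.snoop(name, value)

# i2: R33 <- R32 + R5, dispatched while R32 (the result of i1) is still pending.
regfile = {"R32": None, "R5": 7}
ready = {"R32": False, "R5": True}
rs = ReservationStation("add", "R33", ["R32", "R5"], regfile, ready)
print(rs.is_ready())   # waiting for R32
broadcast([rs], regfile, ready, "R32", 3)
print(rs.is_ready())   # both operands now hold values
```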