Lecture 2: Pipelining and Superscalar Review

Motivation: increase throughput with little increase in cost (hardware, power, complexity, etc.).
Bandwidth (or throughput) = performance: BW = num. tasks / unit time.
For a system that operates on one task at a time: BW = 1 / latency.
Pipelining can increase BW if there are many repetitions of the same operation/task; the latency per task remains the same or increases.
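To make the bandwidth/latency trade-off concrete, here is a minimal Python sketch with made-up numbers (a 10 ns task split into 5 stages with an assumed 0.5 ns latch overhead); it only illustrates that pipelining raises throughput while per-task latency stays the same or grows slightly.

# Hypothetical numbers: pipelining raises throughput, not latency.
unpipelined_latency = 10e-9          # 10 ns to finish one task
stages = 5
latch_overhead = 0.5e-9              # assumed per-stage register delay

bw_unpipelined = 1 / unpipelined_latency                # 100 M tasks/s
stage_time = unpipelined_latency / stages + latch_overhead
bw_pipelined = 1 / stage_time                           # 400 M tasks/s
latency_pipelined = stages * stage_time                 # 12.5 ns (slightly worse)

print(f"BW unpipelined = {bw_unpipelined/1e6:.0f} M tasks/s")
print(f"BW pipelined   = {bw_pipelined/1e6:.0f} M tasks/s")
print(f"Latency went from {unpipelined_latency*1e9:.1f} ns to {latency_pipelined*1e9:.1f} ns")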

(Figure: pipelining a block of combinational logic. One block of N gate delays gives BW ≈ 1/N; splitting it into two stages of N/2 gate delays gives BW ≈ 2/N; three stages of N/3 gate delays give BW ≈ 3/N.)

Starting from an unpipelined version with propagation delay T and BW = 1/T, a k-stage pipeline divides the logic into stages of delay T/k separated by latches:
Perf_pipe = BW_pipe = 1 / (T/k + S), where k = number of stages and S = latch delay.
(Figure: an unpipelined block of delay T vs. a k-stage pipeline of T/k blocks separated by latches of delay S.)

Starting from an unpipelined version with hardware cost G, a k-stage pipeline adds one latch per stage:
Cost_pipe = G + kL, where k = number of stages and L = latch cost including control.
(Figure: an unpipelined block of cost G vs. a k-stage pipeline with k latches of cost L.)

Cost/Performance:
C/P = (Lk + G) / [1 / (T/k + S)] = (Lk + G)(T/k + S) = LT + GS + LSk + GT/k
Optimal cost/performance: find the k that minimizes C/P:
d(C/P)/dk = LS − GT/k² = 0  ⇒  k_opt = √(GT / (LS))
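A small Python sketch (not from the slides) that evaluates this C/P expression and the closed-form optimum, using the two parameter sets that appear in the next slide's plot:

from math import sqrt

def cost_perf(k, G, L, T, S):
    """C/P = (L*k + G) * (T/k + S), as derived above."""
    return (L * k + G) * (T / k + S)

def k_opt(G, L, T, S):
    """Optimal depth: d(C/P)/dk = L*S - G*T/k**2 = 0  =>  k = sqrt(G*T/(L*S))."""
    return sqrt(G * T / (L * S))

# Parameter sets taken from the plot on the next slide
for (G, L, T, S) in [(175, 41, 400, 22), (175, 21, 400, 11)]:
    k = k_opt(G, L, T, S)
    print(f"G={G} L={L} T={T} S={S}: k_opt ~ {k:.1f}, C/P at optimum ~ {cost_perf(k, G, L, T, S):.0f}")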

(Plot: cost/performance ratio C/P vs. pipeline depth k, for G=175, L=41, T=400, S=22 and for G=175, L=21, T=400, S=11.)

"Hardware cost":
– Transistor/gate count; should include the additional logic to control the pipeline
– Area (related to gate count)
– Power! More gates → more switching; more gates → more leakage
Many metrics to optimize; very difficult to determine what really is "optimal".

Uniform suboperations: the operation to be pipelined can be evenly partitioned into uniform-latency suboperations.
Repetition of identical operations: the same operation is performed repeatedly on a large number of different inputs.
Repetition of independent operations: all repetitions of the same operation are mutually independent, i.e., no data dependences and no resource conflicts.
Good examples: automobile assembly line, floating-point multiplier, instruction pipeline (?)

Uniform suboperations … NOT!
– Balance pipeline stages: stage quantization to yield balanced stages; minimize internal fragmentation (some waiting stages)
Identical operations … NOT!
– Unify instruction types: coalesce instruction types into one "multi-function" pipe; minimize external fragmentation (some idling stages)
Independent operations … NOT!
– Resolve data and resource hazards: inter-instruction dependency detection and resolution; minimize performance loss

The "computation" to be pipelined:
1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Operand(s) Fetch (OF)
4. Instruction Execution (EX)
5. Operand Store (OS), a.k.a. writeback (WB)
6. Update Program Counter (PC)

Based on the obvious subcomputations, a five-stage pipeline: Instruction Fetch (IF), Instruction Decode (ID), Operand Fetch (OF/RF), Instruction Execute (EX), Operand Store (OS/WB).

Stage latencies: T_IF = 6 units, T_ID = 2 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units.
Without pipelining: T_cyc ≥ T_IF + T_ID + T_OF + T_EX + T_OS = 31
Pipelined: T_cyc ≥ max{T_IF, T_ID, T_OF, T_EX, T_OS} = 9
Speedup = 31 / 9
Can we do better in terms of either performance or efficiency?

Two methods for stage quantization:
– Merging multiple subcomputations into one
– Subdividing a subcomputation into multiple smaller ones
Recent/current trends:
– Deeper pipelines (more and more stages), up to the point where the cost function takes over
– Multiple different pipelines/subpipelines
– Pipelining of memory accesses (tricky)

Coarser-grained machine cycle (4 machine cycles per instruction): merge IF and ID, giving T_IF&ID = 8 units, T_OF = 9 units, T_EX = 5 units, T_OS = 9 units.
Finer-grained machine cycle (11 machine cycles per instruction): subdivide the original stages (T_IF/T_ID/T_OF/T_EX/T_OS = 6/2/9/5/9 units) into 11 stages with T_cyc = 3 units.
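A quick Python sketch (latch overhead ignored, as on the slide) comparing the two quantizations: the fine-grained design has a shorter cycle and higher throughput, and here even a slightly lower front-to-back latency, at the cost of more stages and more hardware.

coarse = [8, 9, 5, 9]                 # IF&ID, OF, EX, OS
fine   = [3] * 11                     # 11 stages of 3 units each

for name, stages in [("coarse (4-stage)", coarse), ("fine (11-stage)", fine)]:
    t_cyc = max(stages)               # clock period set by the slowest stage
    latency = t_cyc * len(stages)     # one instruction front-to-back
    print(f"{name}: cycle = {t_cyc} units, latency = {latency} units, "
          f"throughput = 1 instr per {t_cyc} units")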

Logic is needed for each pipeline stage; register file ports are needed to support all (relevant) stages; memory access ports are needed to support all (relevant) stages.
(Figure: the 4-stage and 11-stage pipelines from the previous slide.)

(Figure: two commercial pipelines mapped onto the generic IF/ID/OF/EX/OS phases. MIPS R2000/R3000: IF, RD, ALU, MEM, WB. AMDAHL 470V/7: PC GEN, Cache Read, Decode, Read REG, Add GEN, Cache Read, EX 1, EX 2, Check Result, Write Result.)

Data dependence:
– True dependence (RAW): an instruction must wait for all required input operands
– Anti-dependence (WAR): a later write must not clobber a still-pending earlier read
– Output dependence (WAW): an earlier write must not clobber an already-finished later write
Control dependence (a.k.a. procedural dependence):
– Conditional branches cause uncertainty in instruction sequencing
– Instructions following a conditional branch depend on the execution of the branch instruction
– Instructions following a computed branch depend on the execution of the branch instruction
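As a concrete illustration (a sketch, not from the slides), register dependences between an earlier and a later instruction can be classified mechanically from their destination and source registers:

def classify(earlier, later):
    """Each instruction is modeled as (dest, [sources]); names are illustrative."""
    e_dst, e_src = earlier
    l_dst, l_src = later
    deps = []
    if e_dst in l_src:
        deps.append("RAW (true)")      # later reads what earlier writes
    if l_dst in e_src:
        deps.append("WAR (anti)")      # later overwrites an input of earlier
    if l_dst == e_dst:
        deps.append("WAW (output)")    # both write the same register
    return deps or ["independent"]

i1 = ("r1", ["r2", "r3"])   # r1 <- r2 + r3
i2 = ("r4", ["r1", "r5"])   # r4 <- r1 + r5
i3 = ("r2", ["r6", "r7"])   # r2 <- r6 + r7
print(classify(i1, i2))     # ['RAW (true)']
print(classify(i1, i3))     # ['WAR (anti)']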

# for (; (j<high) && (array[j]<array[low]); ++j);
# $10 = j; $9 = high; $6 = array; $8 = low
      bge   $10, $9, $36
      mul   $15, $10, 4
      addu  $24, $6, $15
      lw    $25, 0($24)
      mul   $13, $8, 4
      addu  $14, $6, $13
      lw    $15, 0($14)
      bge   $25, $15, $36
$35:  addu  $10, $10, 1
      ...
$36:  addu  $11, $11, …

The processor must handle:
– Register data dependencies: RAW, WAW, WAR
– Memory data dependencies: RAW, WAW, WAR
– Control dependencies

Pipeline hazards:
– Potential violations of program dependencies
– Must ensure program dependencies are not violated
Hazard resolution:
– Static method: performed at compile time in software
– Dynamic method: performed at runtime using hardware (stall, flush, or forward)
Pipeline interlock:
– Hardware mechanism for dynamic hazard resolution
– Must detect and enforce dependencies at runtime

(Pipeline diagram: instructions j through j+4 each flowing through IF, ID, RD, ALU, MEM, WB, one stage apart, over cycles t0–t5.)

(Pipeline diagram: the same instructions j through j+4 overlapped in IF, ID, RD, ALU, MEM, WB over cycles t0–t5.)

(Pipeline diagram: instruction j+2 is stalled in RD, j+3 in ID, and j+4 in IF while waiting for an earlier result; the bubble propagates down the pipeline.)

(Pipeline diagram: instructions j through j+4 with forwarding paths drawn back from ALU and MEM to earlier stages; many possible paths. Even with forwarding paths, some cases still require stalling, e.g., a value produced in MEM that is needed by the very next instruction's ALU stage.)

(Figure: forwarding hardware. Each source register (src1, src2) of the instruction being read is compared against the destination registers still in flight in the ALU and MEM stages; on a match, the operand is taken from the forwarding path instead of the register file. A deeper pipeline may require additional forwarding paths.)
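A small Python sketch of the comparator logic in the figure (the structure is hypothetical: each in-flight stage is modeled as a dict with a destination register and a value). The freshest in-flight producer of a source register wins over the register file.

def select_operand(src_reg, regfile, alu_stage, mem_stage):
    """Return the freshest value for src_reg: forward from ALU, then MEM, else regfile."""
    for stage in (alu_stage, mem_stage):          # closest (youngest) producer wins
        if stage is not None and stage["dest"] == src_reg:
            return stage["value"]
    return regfile[src_reg]

regfile = {"r1": 10, "r2": 20}
alu_stage = {"dest": "r1", "value": 99}           # older instruction still in flight writing r1
print(select_operand("r1", regfile, alu_stage, None))   # 99 (forwarded)
print(select_operand("r2", regfile, alu_stage, None))   # 20 (from register file)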

(Pipeline diagram: instructions i through i+4 overlapped in IF, ID, RD, ALU, MEM, WB over cycles t0–t5.)

(Pipeline diagram: instructions i through i+4; the younger instructions are held stalled in IF until an earlier instruction resolves.)

(Pipeline diagram: a resolved branch causes the wrongly fetched instructions i+2 through i+4 to be converted to nops, the speculative state is cleared, and fetch is resteered to new instructions i+2, i+3, i+4.)

A simple pipeline is limited to CPI ≥ 1.0. A "superscalar" machine can achieve CPI ≤ 1.0 (i.e., IPC ≥ 1.0):
– Superscalar means executing more than one scalar instruction in parallel (e.g., add + xor + mul)
– Contrast with vector, which also executes multiple operations in parallel, but they must all be the same operation (e.g., four parallel additions)

Scalar pipeline (baseline):
– Instruction overlap parallelism = D
– Operation latency = 1
– Peak IPC = 1
(Figure: successive instructions vs. time in cycles; D different instructions are overlapped in the D-stage pipeline.)

Superscalar (pipelined) execution:
– Instruction parallelism = D × N
– Operation latency = 1
– Peak IPC = N per cycle
(Figure: N instructions enter the pipeline each cycle; D × N different instructions are overlapped.)

(Figure: the original Pentium pipeline: Prefetch (4 × 32-byte buffers), Decode1 (decodes up to 2 instructions), Decode2 (reads operands, address computation), Execute, Writeback. Execution uses two asymmetric pipes: the u-pipe handles shift, rotate, some FP; the v-pipe handles jmp, jcc, call, fxch; both pipes handle mov, lea, simple ALU ops, push/pop, test/cmp.)

"Pairing rules" (when can/can't two instructions execute at the same time?):
– read/flow dependence:
    mov eax, 8
    mov [ebp], eax
– output dependence:
    mov eax, 8
    mov eax, [ebp]
– partial register stalls:
    mov al, 1
    mov ah, 0
– functional unit rules: some instructions can never be paired: MUL, DIV, PUSHA, MOVS, some FP
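A toy Python model of such a pairing check (greatly simplified: it covers only the flow/output and never-paired rules listed above, and ignores partial-register and FP cases):

NEVER_PAIRED = {"mul", "div", "pusha", "movs"}    # from the slide; FP rules omitted

def can_pair(i1, i2):
    """i = (opcode, dest, sources). Returns False if the two can't issue together."""
    op1, d1, s1 = i1
    op2, d2, s2 = i2
    if op1 in NEVER_PAIRED or op2 in NEVER_PAIRED:
        return False                               # function-unit rules
    if d1 in s2:
        return False                               # read/flow (RAW) dependence
    if d1 == d2:
        return False                               # output (WAW) dependence
    return True

print(can_pair(("mov", "eax", []), ("mov", "[ebp]", ["eax"])))  # False: flow dependence
print(can_pair(("mov", "eax", []), ("mov", "ebx", ["ecx"])))    # True: independent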

The CPI of in-order pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e., when N approaches the average distance between dependent instructions. Forwarding is no longer effective, so the machine must stall more often, and the pipeline may never be full due to the frequency of dependency stalls.

Example: superscalar degree N = 4. Any dependency among these instructions will cause a stall; a dependent instruction must be at least N = 4 instructions away to avoid one. On average, the parent-child separation is only about 5 instructions (Franklin and Sohi '92), so there are many cases where the separation is < 4, and each of these limits parallelism. For the Pentium, superscalar degree N = 2 is reasonable; going much further encounters rapidly diminishing returns.

"Trivial" parallelism is limited.
– What is trivial parallelism? In-order parallelism: sequential instructions that do not have dependencies. In all previous examples, every instruction executed either at the same time as, or after, earlier instructions.
– The previous slides show that this kind of superscalar execution quickly hits a ceiling.
So what is "non-trivial" parallelism? …

Work T1: time to complete a computation on a sequential system.
Critical path T∞: time to complete the same computation on an infinitely parallel system.
Average parallelism: P_avg = T1 / T∞.
For a p-wide system: T_p ≥ max{T1/p, T∞}; if P_avg >> p, then T_p ≈ T1/p.
Example: x = a + b; y = b * 2; z = (x - y) * (x + y)
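Worked through for the expression on the slide (a sketch; the temporaries t1 and t2 are names introduced here for x − y and x + y): the work is 5 operations, the critical path is 3 levels, so P_avg = 5/3 ≈ 1.67.

# Each op is (name: [operand names]); leaves a, b are inputs with depth 0.
ops = {
    "x":  ["a", "b"],         # a + b
    "y":  ["b"],              # b * 2
    "t1": ["x", "y"],         # x - y
    "t2": ["x", "y"],         # x + y
    "z":  ["t1", "t2"],       # t1 * t2
}

def depth(node):
    if node not in ops:                      # input operand
        return 0
    return 1 + max(depth(src) for src in ops[node])

T1 = len(ops)                                # total work: 5 operations
Tinf = max(depth(n) for n in ops)            # critical path: 3 levels
print(T1, Tinf, T1 / Tinf)                   # 5 3 1.666...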

ILP is a measure of the amount of inter-dependencies between instructions.
Average ILP = number of instructions / length of the longest path.
code1 (must execute serially; ILP = 1; T1 = 3, T∞ = 3):
    r1 ← r2 + 1
    r3 ← r1 / 17
    r4 ← r0 - r3
code2 (can execute at the same time; ILP = 3; T1 = 3, T∞ = 1):
    r1 ← r2 + 1
    r3 ← r9 / 17
    r4 ← r0 - r10

Instruction-level parallelism usually assumes infinite resources, perfect fetch, and unit latency for all instructions; ILP is more a property of the program's dataflow. IPC is the "real" observed metric of exactly how many instructions are executed per machine cycle, which includes all of the limitations of a real machine. The ILP of a program is an upper bound on the attainable IPC.

Example: two independent three-instruction chains.
    r1 ← r2 + 1          r11 ← r12 + 1
    r3 ← r1 / 17         r13 ← r19 / 17
    r4 ← r0 - r3         r14 ← r0 - r20
The left chain alone has ILP = 1, the right chain alone has ILP = 3, and the combined code has ILP = 2 (6 instructions / longest path of 3).

A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R7 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R1]
J: R1 = R1 - 1
K: R3 → ST 0[R1]

Issue = send an instruction to execution. The issue stage needs to check:
1. Structural dependence
2. RAW hazard
3. WAW hazard
4. WAR hazard
(Figure: an in-order instruction stream issuing to FUs: INT, Fadd1, Fadd2, Fmul1, Fmul2, Fmul3, Ld/St. Execution begins in order; completion is out of order.)

(Figure: the A–K code from the previous slide scheduled in order on this machine, with its dataflow graph; the 10 instructions take 8 cycles, so IPC = 10/8 = 1.25.)

The same code with some register uses renamed (D now writes R9, which H and J then use):
A: R1 = R2 + R3
B: R4 = R5 + R6
C: R1 = R1 * R4
D: R9 = LD 0[R1]
E: BEQZ R7, +32
F: R4 = R7 - 3
G: R1 = R1 + 1
H: R4 → ST 0[R9]
J: R1 = R9 - 1
K: R3 → ST 0[R1]
(Figure: with these dependences relaxed, the schedule completes in 7 cycles, so IPC = 10/7 = 1.43.)
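The improvement comes from removing false (WAR/WAW) dependences by giving writes distinct destinations. A minimal Python sketch of that idea, i.e., register renaming in a few lines (register and program names are made up for illustration): each architectural write gets a fresh physical register, so a later write can no longer clobber a value still needed by earlier readers.

def rename(instrs, num_arch_regs=8):
    """instrs: list of (dest, [sources]) using architectural names R0..R7."""
    mapping = {f"R{i}": f"P{i}" for i in range(num_arch_regs)}
    next_phys = num_arch_regs
    out = []
    for dest, srcs in instrs:
        srcs = [mapping[s] for s in srcs]          # read current mappings
        mapping[dest] = f"P{next_phys}"            # fresh physical destination
        next_phys += 1
        out.append((mapping[dest], srcs))
    return out

prog = [("R1", ["R2", "R3"]),   # A: R1 = R2 + R3
        ("R4", ["R5", "R6"]),   # B: R4 = R5 + R6
        ("R1", ["R1", "R4"])]   # C: R1 = R1 * R4  (WAW with A removed by renaming)
print(rename(prog))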

Scoreboard: a bit array, 1 bit for each GPR.
– If the bit is not set: the register has valid data
– If the bit is set: the register has stale data, i.e., some outstanding instruction is going to change it
Issue in order: RD ← Fn(RS, RT)
– If SB[RS] or SB[RT] is set → RAW, stall
– If SB[RD] is set → WAW, stall
– Else, dispatch to FU (Fn) and set SB[RD]
Complete out of order:
– Update GPR[RD], clear SB[RD]
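A minimal Python sketch of this scoreboard policy (illustrative only; structural hazards and FU tracking are omitted):

class Scoreboard:
    def __init__(self, num_regs=32):
        self.busy = [False] * num_regs

    def can_issue(self, rd, rs, rt):
        # RAW if a source is still being produced; WAW if the dest is still pending.
        return not (self.busy[rs] or self.busy[rt] or self.busy[rd])

    def issue(self, rd):
        self.busy[rd] = True       # dispatch to the FU, mark dest as pending

    def complete(self, rd):
        self.busy[rd] = False      # writeback: register is valid again

sb = Scoreboard()
sb.issue(1)                        # r1 <- r2 + r3 issues
print(sb.can_issue(4, 1, 5))       # False: RAW on r1, must stall
sb.complete(1)
print(sb.can_issue(4, 1, 5))       # True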

(Figure: the same FU mix (INT, Fadd1, Fadd2, Fmul1, Fmul2, Fmul3, Ld/St) with an extra stage and buffers for dependency resolution in front of the FUs; the instruction stream is fetched in order, executes out of program order, and completes out of order.)

Similar to in-order scoreboarding:
– Need new tables to track the status of individual instructions and functional units
– Still enforce dependencies: stall dispatch on WAW, stall issue on RAW, stall completion on WAR
Limitations of scoreboarding? Hints:
– No structural hazards
– You can always write a RAW-free code sequence: Add R1 = R0 + 1; Add R2 = R0 + 1; Add R3 = R0 + 1; …
– Think about the x86 ISA with only 8 registers
The finite number of registers in any ISA will force register names to be reused at some point → WAR and WAW hazards → stalls.

More out-of-orderness → more ILP exposed, but more hazards. Stalling is a generic technique to ensure sequencing; the RAW stall is a fundamental requirement (?). Compiler analysis and scheduling can help (not covered in this course).

Tomasulo's algorithm (1967) was not the first. Also at IBM, Lynn Conway proposed multi-issue dynamic instruction scheduling (OOO) in February 1966. The ideas got buried due to internal politics, changing project goals, etc., but it is still the first such proposal (as far as I know).

Comparison of Tomasulo (1967) and a modern out-of-order core:

                    Tomasulo                 Modern
Machine width       Peak IPC = 1             Peak IPC > 1
Structural deps     2 FP FUs, single CDB     Many FUs, many forwarding buses
Anti-deps (WAR)     Operand copying          Renamed registers
Output-deps (WAW)   RS tags                  Renamed registers
True deps (RAW)     Tag-based forwarding     Tag-based forwarding
Exceptions          Imprecise                Precise (requires ROB)