IBM’s Experience on Pipelined Processors [Agerwala and Cocke 1987]
Attributes and Assumptions:
- Memory bandwidth: at least one word/cycle, to fetch 1 instruction/cycle from the I-cache
- 40% of instructions are load/store and require access to the D-cache
- Code characteristics (dynamic):
    loads: 25%
    stores: 15%
    ALU/RR: 40%
    branches: 20% (1/3 unconditional, always taken; 1/3 conditional taken; 1/3 conditional not taken)

More Statistics and Assumptions
- Cache performance
    hit ratio of 100% is assumed in the experiments
    cache latency: I-cache = i; D-cache = d; default: i = d = 1 cycle
- Load and branch scheduling
    loads: 25% cannot be scheduled; 75% can be moved back 1 instruction
    branches: unconditional, 100% schedulable; conditional, 50% schedulable

CPI Calculations I
- No cache bypass of reg. file, no scheduling of loads or branches
    Load penalty: 2 cycles (0.25*2 = 0.5)
    Branch penalty: 2 cycles (0.2*0.66*2 = 0.27)
    Total CPI: 1 + 0.5 + 0.27 = 1.77 CPI
- Bypass, no scheduling of loads or branches
    Load penalty: 1 cycle (0.25*1 = 0.25)
    Total CPI: 1 + 0.25 + 0.27 = 1.52 CPI

CPI Calculations II
- Bypass, scheduling of loads and branches
    Load penalty:
        75% can be moved back 1 => no penalty
        remaining 25% => 1 cycle penalty (0.25*0.25*1 = 0.063)
    Branch penalty:
        1/3 uncond., 100% schedulable => 1 cycle (0.2*0.33 = 0.066)
        1/3 cond. not taken, if biased for NT => no penalty
        1/3 cond. taken:
            50% schedulable => 1 cycle (0.2*0.33*0.5 = 0.033)
            50% unschedulable => 2 cycles (0.2*0.33*0.5*2 = 0.066)
    Total CPI: 1 + 0.063 + 0.167 = 1.23 CPI
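The arithmetic on the two CPI slides can be re-derived with a short script. This is a minimal sketch: the function and constant names are illustrative and not from the slides, and it uses exact thirds where the slides round to 0.33 and 0.66, so the results match only after rounding to two decimals.

```python
# Dynamic instruction mix taken from the slides (fractions of all instructions).
LOAD_FRAC = 0.25
BRANCH_FRAC = 0.20
# Branch split from the slides: unconditional, conditional taken, conditional not taken.
UNCOND = COND_TAKEN = COND_NOT_TAKEN = 1 / 3


def cpi_no_bypass():
    """No bypass, no scheduling: 2-cycle load penalty, 2 cycles per taken branch."""
    load = LOAD_FRAC * 2
    branch = BRANCH_FRAC * (UNCOND + COND_TAKEN) * 2
    return 1 + load + branch


def cpi_bypass():
    """Bypass, no scheduling: the load penalty drops to 1 cycle."""
    load = LOAD_FRAC * 1
    branch = BRANCH_FRAC * (UNCOND + COND_TAKEN) * 2
    return 1 + load + branch


def cpi_bypass_scheduled():
    """Bypass plus scheduling of loads and branches, biased for not-taken."""
    load = LOAD_FRAC * 0.25 * 1                 # only the unschedulable 25% of loads pay
    uncond = BRANCH_FRAC * UNCOND * 1           # 100% schedulable, 1 cycle each
    cond_taken = BRANCH_FRAC * COND_TAKEN * (0.5 * 1 + 0.5 * 2)
    return 1 + load + uncond + cond_taken


for name, fn in [("no bypass", cpi_no_bypass),
                 ("bypass", cpi_bypass),
                 ("bypass + scheduling", cpi_bypass_scheduled)]:
    print(f"{name}: CPI = {fn():.2f}")
```

Changing a single assumption (for example the fraction of schedulable loads) and re-running shows how sensitive the total CPI is to each term.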

CPI Calculations III
- Parallel target address generation
    90% of branches can be coded as PC-relative, i.e. the target address can be computed without register access
    A separate adder can compute (PC + offset) in the decode stage
- Branch penalty (conditional / unconditional), by case:
    PC-relative addressing   Schedulable   Branch penalty
    YES (90%)                YES (50%)     0 cycles
    YES (90%)                NO  (50%)     1 cycle
    NO  (10%)                YES (50%)     1 cycle
    NO  (10%)                NO  (50%)     2 cycles
- Total CPI: 1 + 0.063 + 0.087 = 1.15 CPI = 0.87 IPC

Pipeline Depth
[Figure: unpipelined logic of delay T next to a k-stage pipeline whose stages each take T/k + S]

Limitations of Scalar Pipelines
- Upper bound on scalar pipeline throughput: limited by IPC = 1
- Inefficient unification into a single pipeline: long latency for each instruction
- Performance lost due to rigid pipeline: unnecessary stalls

Stalls in an Inorder Scalar Pipeline
Instructions are in order with respect to any one stage, i.e. no dynamic reordering

Architectures for Instruction-Level Parallelism
Scalar pipeline (baseline):
    Instruction Parallelism = D
    Operation Latency = 1
    Peak IPC = 1
[Figure: baseline scalar pipeline of depth D]

Superpipelined Machine
Superpipelined execution:
    IP = D x M
    OL = M minor cycles
    Peak IPC = 1 per minor cycle (M per baseline cycle)
    major cycle = M minor cycles
[Figure: instructions 1-6 flowing through IF, DE, EX, WB, with major and minor cycles marked]

Superscalar Machines
Superscalar (pipelined) execution:
    IP = D x N
    OL = 1 baseline cycle
    Peak IPC = N per baseline cycle
[Figure: N parallel pipelines]

Superscalar and Superpipelined
Superscalar and superpipelined machines of equal degree have roughly the same performance, i.e. if N = M then both have about the same IPC.
- Superscalar parallelism
    Operation latency: 1
    Issuing rate: N
    Superscalar degree (SSD): N (determined by issue rate)
- Superpipeline parallelism
    Operation latency: M
    Issuing rate: 1
    Superpipelined degree (SPD): M (determined by operation latency)

Limitations of Inorder Pipelines
- CPI of inorder pipelines degrades very sharply if the machine parallelism is increased beyond a certain point, i.e. when N x M approaches the average distance between dependent instructions
- Forwarding is no longer effective => must stall more often
- Pipeline may never be full due to frequent dependency stalls!!

What is Parallelism?
- Work T_1: time to complete a computation on a sequential system
- Critical path T_∞: time to complete the same computation on an infinitely-parallel system
- Average parallelism: P_avg = T_1 / T_∞
- For a p-wide system: T_p ≥ max{ T_1/p, T_∞ }; if P_avg >> p, then T_p ≈ T_1/p
Example:
    x = a + b; y = b * 2
    z = (x-y) * (x+y)
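The definitions on this slide can be checked against its small example. The sketch below assumes every operation takes one time unit and that the inputs a and b are available at time 0; the dictionary encoding and the function names are illustrative only.

```python
# Each operation maps to the operations whose results it consumes.
# Encodes: x = a + b; y = b * 2; z = (x - y) * (x + y)
deps = {
    "x": [],
    "y": [],
    "x-y": ["x", "y"],
    "x+y": ["x", "y"],
    "z": ["x-y", "x+y"],
}


def work(deps):
    """T_1: total number of unit-time operations."""
    return len(deps)


def critical_path(deps):
    """T_inf: length of the longest dependence chain."""
    depth = {}

    def finish(op):
        if op not in depth:
            depth[op] = 1 + max((finish(d) for d in deps[op]), default=0)
        return depth[op]

    return max(finish(op) for op in deps)


def average_parallelism(deps):
    """P_avg = T_1 / T_inf."""
    return work(deps) / critical_path(deps)


def time_lower_bound(deps, p):
    """Lower bound on T_p for a p-wide machine: max(T_1 / p, T_inf)."""
    return max(work(deps) / p, critical_path(deps))


print(work(deps), critical_path(deps), average_parallelism(deps))
```

For this example the work is 5 operations and the critical path is 3, so a machine wider than 2 cannot finish any sooner than 3 time units.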

ILP: Instruction-Level Parallelism
- ILP is a measure of the amount of inter-dependencies between instructions
- Average ILP = no. of instructions / no. of cycles required
    code1: ILP = 1, i.e. must execute serially
    code2: ILP = 3, i.e. can execute at the same time
    code1:              code2:
    r1 ← r2 + 1         r1 ← r2 + 1
    r3 ← r1 / 17        r3 ← r9 / 17
    r4 ← r0 - r3        r4 ← r0 - r10
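The two ILP figures above can be reproduced by placing each instruction in the earliest cycle allowed by its true (RAW) dependences. This is a sketch under assumed conditions: unit latency, unlimited functional units, and no limit on how far ahead the machine looks. The tuple encoding `(dest, sources)` is a convenience, not notation from the slides.

```python
def average_ilp(program):
    """Return instructions / cycles when only RAW dependences constrain timing."""
    ready = {}   # register -> cycle in which its most recent value is produced
    last = 0
    for dest, srcs in program:
        # An instruction runs one cycle after its latest-arriving source operand.
        cycle = 1 + max((ready.get(s, 0) for s in srcs), default=0)
        ready[dest] = cycle
        last = max(last, cycle)
    return len(program) / last


# code1: r1 <- r2 + 1; r3 <- r1 / 17; r4 <- r0 - r3
code1 = [("r1", ["r2"]), ("r3", ["r1"]), ("r4", ["r0", "r3"])]
# code2: r1 <- r2 + 1; r3 <- r9 / 17; r4 <- r0 - r10
code2 = [("r1", ["r2"]), ("r3", ["r9"]), ("r4", ["r0", "r10"])]

print(average_ilp(code1), average_ilp(code2))
```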

Inter-instruction Dependences
- Data dependence, Read-after-Write (RAW)
    r3 ← r1 op r2
    r5 ← r3 op r4
- Anti-dependence, Write-after-Read (WAR)
    r3 ← r1 op r2
    r1 ← r4 op r5
- Output dependence, Write-after-Write (WAW)
    r3 ← r1 op r2
    r5 ← r3 op r4
    r3 ← r6 op r7
- Control dependence
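The three register dependences can be detected mechanically from the destination and source registers of two instructions. The helper below is an illustrative sketch (its name and tuple encoding are assumptions); it compares one earlier instruction with one later instruction and ignores anything in between.

```python
def classify(first, second):
    """Return the register dependences of `second` on the earlier `first`.

    Each instruction is a (dest, sources) pair.
    """
    d1, s1 = first
    d2, s2 = second
    kinds = set()
    if d1 in s2:
        kinds.add("RAW")   # second reads what first writes
    if d2 in s1:
        kinds.add("WAR")   # second overwrites what first reads
    if d1 == d2:
        kinds.add("WAW")   # both write the same register
    return kinds


producer = ("r3", ["r1", "r2"])                    # r3 <- r1 op r2
print(classify(producer, ("r5", ["r3", "r4"])))    # the slide's RAW pair
print(classify(producer, ("r1", ["r4", "r5"])))    # the slide's WAR pair
print(classify(producer, ("r3", ["r6", "r7"])))    # the slide's WAW pair
```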

Scope of ILP Analysis
    r1 ← r2 + 1
    r3 ← r1 / 17
    r4 ← r0 - r3
    r11 ← r12 + 1
    r13 ← r19 / 17
    r14 ← r0 - r20
First three instructions alone: ILP = 1; all six together: ILP = 2
Out-of-order execution permits more ILP to be exploited

Purported Limits on ILP
    Weiss and Smith [1984]        1.58
    Sohi and Vajapeyam [1987]     1.81
    Tjaden and Flynn [1970]       1.86
    Tjaden and Flynn [1973]       1.96
    Uht [1986]                    2.00
    Smith et al. [1989]           2.00
    Jouppi and Wall [1988]        2.40
    Johnson [1991]                2.50
    Acosta et al. [1986]          2.79
    Wedig [1982]                  3.00
    Butler et al. [1991]          5.8
    Melvin and Patt [1991]        6
    Wall [1991]                   7
    Kuck et al. [1972]            8
    Riseman and Foster [1972]     51
    Nicolau and Fisher [1984]     90

Flow Path Model of Superscalars
[Figure: FETCH, DECODE, EXECUTE, and COMMIT stages with I-cache, Branch Predictor, Instruction Buffer, Integer / Floating-point / Media / Memory units, Reorder Buffer (ROB), Store Queue, and D-cache; labeled flows: Instruction Flow, Register Data Flow, Memory Data Flow]

Superscalar Pipeline Design
[Figure: pipeline diagram labeled Instruction Flow, Data Flow, and Retire]

Inorder Pipelines
Inorder pipeline: no WAW, no WAR (almost always true)

Out-of-order Pipelining 101
[Figure: IF, ID, RD stages, then diversified EX pipelines (INT; Fadd1, Fadd2; Fmult1, Fmult2, Fmult3; LD/ST), then WB]
Program order:
    Ia: F1 ← F2 x F3
    .....
    Ib: F1 ← F4 + F5
Out-of-order WB:
    Ib: F1 ← “F4 + F5”
    ...
    Ia: F1 ← “F2 x F3”
What is the value of F1? WAW!!!

Superscalar Execution Check List
INSTRUCTION PROCESSING CONSTRAINTS
    Resource Contention (Structural Dependences)
    Code Dependences
        Control Dependences
        Data Dependences
            True Dependences (RAW)
            Storage Conflicts
                Anti-Dependences (WAR)
                Output Dependences (WAW)

In-order Issue into Diversified Pipelines
Inorder inst. stream: RD ← Fn (RS, RT)
    RD: dest. reg.; Fn: func unit; RS, RT: source registers
Issue stage needs to check:
    1. Structural Dependence
    2. RAW Hazard
    3. WAW Hazard
    4. WAR Hazard

Simple Scoreboarding
- Scoreboard: a bit-array, 1 bit for each GPR
    if the bit is not set, the register has valid data
    if the bit is set, the register has stale data, i.e. some outstanding inst is going to change it
- Dispatch in order: RD ← Fn (RS, RT)
    - if SB[RS] or SB[RT] is set => RAW, stall
    - if SB[RD] is set => WAW, stall
    - else dispatch to FU, set SB[RD]
- Complete out-of-order
    - update GPR[RD], clear SB[RD]
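The dispatch and completion rules on this slide amount to a few lines of state. The class below is a minimal sketch, assuming registers are identified by integer index; the class and method names are illustrative, and timing, functional units, and register values are left out.

```python
class SimpleScoreboard:
    """One pending-write bit per general-purpose register."""

    def __init__(self, num_regs):
        # True means some outstanding instruction is going to write this register.
        self.pending = [False] * num_regs

    def try_dispatch(self, rd, srcs):
        """Try to dispatch `rd <- Fn(srcs)`; return 'OK', 'RAW', or 'WAW'."""
        if any(self.pending[r] for r in srcs):
            return "RAW"            # a source is stale: stall
        if self.pending[rd]:
            return "WAW"            # an earlier write to rd is outstanding: stall
        self.pending[rd] = True     # dispatch to the FU and mark rd
        return "OK"

    def complete(self, rd):
        """Out-of-order completion: the result is written, so clear the bit."""
        self.pending[rd] = False


sb = SimpleScoreboard(8)
print(sb.try_dispatch(3, [3, 2]))   # i1: FDIV R3, R3, R2
print(sb.try_dispatch(1, [6]))      # i2: LD   R1, 0(R6)
print(sb.try_dispatch(0, [1, 2]))   # i3: FMUL R0, R1, R2 waits on R1
```

Because a stalled instruction blocks everything behind it, this scheme keeps dispatch strictly in order while letting completions arrive in any order.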

Scoreboarding Example
Assume 1 issue/cycle.
    i1: FDIV R3, R3, R2
    i2: LD   R1, 0(R6)
    i3: FMUL R0, R1, R2
    i4: FDIV R4, R3, R1
    i5: FSUB R5, R0, R3
    i6: FMUL R3, R3, R1
Table columns: FU Status (Int, Fadd, FMult, FDiv, WB) | Scoreboard (R0, R1, R2, R3, R4, R5)
Table rows t0 through t11; only the first two are started (blank cells omitted):
    t0: i1, x
    t1: i2, i1, x, x

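Scoreboarding Example (completed)
    i1: FDIV R3, R3, R2
    i2: LD   R1, 0(R6)
    i3: FMUL R0, R1, R2
    i4: FDIV R4, R3, R1
    i5: FSUB R5, R0, R3
    i6: FMUL R3, R3, R1
Table columns: FU Status (Int, Fadd, FMult, FDiv, WB) | Scoreboard (R0, R1, R2, R3, R4, R5)
Table rows (non-blank cells only; column positions were lost in extraction):
    t0:  i1
    t1:  i2, i1, i2, i1
    t2:  i1, i2, i1
    t3:  i3, i1, i3, i1
    t4:  i3, i1, i3, i1
    t5:  i3, i4, i3, i4
    t6:  i4, i3, i4
    t7:  i5, i4, i5
    t8:  i6, i4, i5, i6, i4, i5
    t9:  i6, i4,i6, i4
    t10: i6
    t11: i6
Can WAW really happen here?
Can we go to multiple issue?
Can we go to out-of-order issue?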

Out-of-Order Issue
[Figure: pipeline with IF, ID, Dispatch, RD, EX, and WB stages; an inorder inst. stream is dispatched to INT, Fadd, Fmult, and LD/ST units]

Scoreboarding for Out-of-Order Issue
- Scoreboard: one entry per GPR (what do we need to record?)
- Dispatch in order: “RD ← Fn (RS, RT)”
    - if FU is busy => structural hazard, stall
    - if SB[RD] is set => WAW, stall
    - if SB[RS] or SB[RT] is set => RAW (what to do??)
- Issue out-of-order: (when?)
- Complete out-of-order
    - update GPR[RD], clear SB[RD] (what about WAR?)

Scoreboard for Out-of-Order Issue [H&P pp242~251]
For “RD ← Fn (RS, RT)”:
Function Unit Status
    Columns: Name, Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk
    Rows: Integer, FAdd, FMult, LD/ST
    Op = Fn; Fi = RD; Fj = RS; Fk = RT; Busy, Rj, and Rk hold Yes/No
    Qj, Qk: which FU is computing the new value if not ready?
Register Results Status (a.k.a. Scoreboard)
    One FU entry per register: R0, R1, R2, R3, R4, R5, R6, ....
    Which FU is going to update the register?

Scoreboard Management: “RD ← Fn (RS, RT)”
Status: Dispatch
    Wait until: not Busy(FU) and not Result(‘RD’)
    Bookkeeping: Busy(FU) ← yes; Op(FU) ← Fn; Fi(FU) ← ’RD’; Fj(FU) ← ’RS’; Fk(FU) ← ’RT’; Qj(FU) ← Result(‘RS’); Qk(FU) ← Result(‘RT’); Rj(FU) ← not Qj(FU); Rk(FU) ← not Qk(FU); Result(’RD’) ← FU
Status: Issue (Read operands)
    Wait until: Rj(FU) and Rk(FU)
    Bookkeeping: Rj(FU) ← no; Rk(FU) ← no; Qj(FU) ← 0; Qk(FU) ← 0
Status: Execution Complete
    Wait until: functional unit done
Status: Write Result
    Wait until: ∀f ((Fj(f) ≠ Fi(FU) or Rj(f) == No) and (Fk(f) ≠ Fi(FU) or Rk(f) == No))
    Bookkeeping: ∀f (if Qj(f) == FU then Rj(f) ← yes); ∀f (if Qk(f) == FU then Rk(f) ← yes); Result(Fi(FU)) ← 0; Busy(FU) ← no
Legend:
    FU: the fxn unit used by the instruction
    Fj(X): content of entry Fj for fxn unit X
    Result(X): register result status entry for register X
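The Write Result condition is the part of this table that guards against WAR hazards: a unit may not overwrite its destination while another unit still has an unread, ready operand in that register. A minimal sketch of just that check follows; the `FUEntry` type and `can_write_result` name are illustrative, only the fields needed for the check are modeled, and the sample values mirror the DIVD and ADDD rows of the example slides later in the deck.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class FUEntry:
    fi: Optional[str] = None   # destination register
    fj: Optional[str] = None   # first source register
    fk: Optional[str] = None   # second source register
    rj: bool = False           # True ("Yes"): Fj is ready and has not been read yet
    rk: bool = False           # True ("Yes"): Fk is ready and has not been read yet


def can_write_result(fu: str, units: Dict[str, FUEntry]) -> bool:
    """For all f: (Fj(f) != Fi(FU) or Rj(f) == No) and (Fk(f) != Fi(FU) or Rk(f) == No)."""
    dest = units[fu].fi
    return all(
        (f.fj != dest or not f.rj) and (f.fk != dest or not f.rk)
        for f in units.values()
    )


units = {
    "Add": FUEntry(fi="F6", fj="F8", fk="F2"),                      # ADDD F6, F8, F2
    "Div": FUEntry(fi="F10", fj="F0", fk="F6", rj=False, rk=True),  # DIVD F10, F0, F6
}
print(can_write_result("Add", units))   # DIVD has not read F6 yet
units["Div"].rk = False                 # DIVD has now read its operands
print(can_write_result("Add", units))
```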

Scoreboarding Example 1/3
Instruction Status
    Instruction          Dispatch  Read Operands  Execution Complete  Write Result
    LD    F6, 43(R2)     X         X              X                   X
    LD    F2, 45(R3)     X         X              X
    MULTD F0, F2, F4     X
    SUBD  F8, F6, F2     X
    DIVD  F10, F0, F6    X
    ADDD  F6, F8, F2
Function Unit Status (columns: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk; blank cells omitted)
    Integer (1):  Yes, LD, F2, R3, No
    Mult1 (10):   Yes, MULTD, F0, F2, F4, Integer, No, Yes
    Mult2 (10):   No
    Add (2):      Yes, SUBD, F8, F6, F2, Integer, Yes, No
    Div (40):     Yes, DIVD, F10, F0, F6, Mult1, No, Yes
Register Results Status (a.k.a. Scoreboard)
    F0: Mult1; F2: Integer; F8: Add; F10: Divide (F4, F6, F12, .... empty)

Scoreboarding Example 2/3
Instruction Status
    Instruction          Dispatch  Read Operands  Execution Complete  Write Result
    LD    F6, 43(R2)     X         X              X                   X
    LD    F2, 45(R3)     X         X              X                   X
    MULTD F0, F2, F4     X         X              X
    SUBD  F8, F6, F2     X         X              X                   X
    DIVD  F10, F0, F6    X
    ADDD  F6, F8, F2     X         X              X
Function Unit Status (columns: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk; blank cells omitted)
    Integer (1):  No
    Mult1 (10):   Yes, MULTD, F0, F2, F4, No
    Mult2 (10):   No
    Add (2):      Yes, ADDD, F6, F8, F2, No
    Div (40):     Yes, DIVD, F10, F0, F6, Mult1, No, Yes
Register Results Status (a.k.a. Scoreboard)
    F0: Mult1; F6: Add; F10: Divide (F2, F4, F8, F12, .... empty)

Scoreboarding Example 3/3
Instruction Status
    Instruction          Dispatch  Read Operands  Execution Complete  Write Result
    LD    F6, 43(R2)     X         X              X                   X
    LD    F2, 45(R3)     X         X              X                   X
    MULTD F0, F2, F4     X         X              X                   X
    SUBD  F8, F6, F2     X         X              X                   X
    DIVD  F10, F0, F6    X         X              X
    ADDD  F6, F8, F2     X         X              X                   X
Function Unit Status (columns: Busy, Op, Fi, Fj, Fk, Qj, Qk, Rj, Rk; blank cells omitted)
    Integer (1):  No
    Mult1 (10):   No
    Mult2 (10):   No
    Add (2):      No
    Div (40):     Yes, DIVD, F10, F0, F6, No
Register Results Status (a.k.a. Scoreboard)
    F10: Divide (all other entries empty)

Limitations of Scoreboarding
- Consider a scoreboard processor with an infinitely wide datapath
- In the best case, how many instructions can be simultaneously outstanding?
- Hints
    no structural hazards
    can always write a RAW-free code sequence: addi r1,r0,1; addi r2,r0,1; addi r3,r0,1; …….
    think about x86 ISA with only 8 registers
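One way to explore the question is to count how many instructions dispatch before the scoreboard's WAW rule forces a stall. The sketch below assumes the dispatch rule from the scoreboarding slides (stall when the destination already has an outstanding writer), a RAW-free stream, no structural hazards, and that no instruction completes during the experiment. The function name is hypothetical.

```python
def max_outstanding(dest_stream):
    """Count dispatched instructions until the first WAW stall.

    `dest_stream` lists the destination register of each instruction in
    program order; sources are ignored because the stream is assumed RAW-free.
    """
    pending = set()
    for rd in dest_stream:
        if rd in pending:
            break              # WAW: this register already has an outstanding writer
        pending.add(rd)
    return len(pending)


# A stream that writes 8 registers round-robin, as on an 8-register ISA.
print(max_outstanding([i % 8 for i in range(100)]))
```

Under these assumptions the count can never exceed the number of distinct destination registers, however wide the datapath is.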

Contribution to Register Recycling
- COMPILER REGISTER ALLOCATION
    CODE GENERATION: single assignment, symbolic reg.
    REG. ALLOCATION: map symbolic reg. to physical reg.; maximize reuse of reg.
- INSTRUCTION LOOPS
    Reuse same set of reg. in each iteration
    Overlapped execution of different iterations
Source loop:
    for (k = 1; k <= 10; k++)
        t += a[i][k] * b[k][j];
Generated code:
    $34: mul  $14, $7, 40
         addu $15, $4, $14
         mul  $24, $9, 4
         addu $25, $15, $24
         lw   $11, 0($25)
         mul  $12, $9, 40
         addu $13, $5, $12
         mul  $14, $8, 4
         addu $15, $13, $14
         lw   $24, 0($15)
         mul  $25, $11, $24
         addu $10, $25
         addu $9, 1
         ble  $9, 10, $34

Resolving False Dependences
- WAR: must prevent (2) from completing before (1) is dispatched
    (1) R4 ← R3 + 1
    (2) R3 ← R5 + 1
- WAW: must prevent (2) from completing before (1) completes
    (1) R3 ← R3 op R5
    (2) R3 ← R5 + 1
- Stalling: delay dispatching (or write back) of the 2nd instruction
- Copy operands: copy not-yet-used operand to prevent it being overwritten (WAR)
- Register renaming: use a different register (WAW & WAR)

Register Renaming
- Anti and output dependencies are false dependencies
        r3 ← r1 op r2
        r5 ← r3 op r4
        r3 ← r6 op r7
- The dependence is on name/location rather than data
- Given an infinite number of registers, anti and output dependencies can always be eliminated
    Original            Renamed
    r1 ← r2 / r3        r1 ← r2 / r3
    r4 ← r1 * r5        r4 ← r1 * r5
    r1 ← r3 + r6        r8 ← r3 + r6
    r3 ← r1 - r4        r9 ← r8 - r4
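The Original-to-Renamed transformation can be sketched with a mapping from each ISA register name to its most recent rename register. This sketch assumes an unbounded supply of rename registers and gives every destination a fresh name (`t0`, `t1`, ...), so the names differ from the slide's `r8` and `r9` while the dependence structure is the same. The function name and tuple encoding are illustrative.

```python
from itertools import count


def rename(program):
    """Rename every destination to a fresh register; sources follow the mapping."""
    mapping = {}                          # ISA register -> current rename register
    fresh = (f"t{i}" for i in count())    # assumed unbounded pool of rename registers
    renamed = []
    for dest, op, srcs in program:
        # Look up sources before rebinding dest, so `r1 <- r1 + 1` reads the old r1.
        new_srcs = [mapping.get(s, s) for s in srcs]
        tag = next(fresh)
        mapping[dest] = tag
        renamed.append((tag, op, new_srcs))
    return renamed


original = [
    ("r1", "/", ["r2", "r3"]),
    ("r4", "*", ["r1", "r5"]),
    ("r1", "+", ["r3", "r6"]),
    ("r3", "-", ["r1", "r4"]),
]
for dest, op, (a, b) in rename(original):
    print(f"{dest} <- {a} {op} {b}")
```

After renaming, every destination is unique and no instruction overwrites a register that an earlier instruction reads, so only the true (RAW) dependences remain.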

Hardware Register Renaming
- Maintain bindings from ISA reg. names to rename registers
- When issuing an instruction that updates ‘RD’:
    allocate an unused rename register TX
    record the binding from ‘RD’ to TX
- When to remove a binding? When to de-allocate a rename register?
[Figure: Rename Table mapping an ISA name (e.g. R12) to a rename register (e.g. T56) in the Rename Register File (t0 ... t63)]
    R1 ← R2 / R3
    R4 ← R1 * R5
    R1 ← R3 + R6
To be continued next lecture!!
