Hakim Weatherspoon CS 3410 Computer Science Cornell University

Hakim Weatherspoon CS 3410 Computer Science Cornell University
Pipelining Hakim Weatherspoon CS 3410 Computer Science Cornell University The slides are the product of many rounds of teaching CS 3410 by Professors Weatherspoon, Bala, Bracy, McKee, and Sirer.

Review: Single Cycle Processor
memory register file inst alu +4 +4 addr =? PC din dout control cmp offset memory new pc target imm extend

Review: Single Cycle Processor
Advantages Single cycle per instruction make logic and clock simple Disadvantages Since instructions take different time to finish, memory and functional unit are not efficiently utilized Cycle time is the longest delay Load instruction Best possible CPI is 1 (actually < 1 w parallelism) However, lower MIPS and longer clock period (lower clock frequency); hence, lower performance

Review: Multi Cycle Processor
Advantages Better MIPS and smaller clock period (higher clock frequency) Hence, better performance than Single Cycle processor Disadvantages Higher CPI than single cycle processor Pipelining: Want better Performance want small CPI (close to 1) with high MIPS and short clock period (high clock frequency)

Improving Performance
Parallelism Pipelining Both!

The Kids Alice Bob They don’t always get along…

The Bicycle

The Materials Saw Drill Glue Paint

The Instructions N pieces, each built following same sequence: Saw
Drill Glue Paint

Design 1: Sequential Schedule
Alice owns the room Bob can enter when Alice is finished Repeat for remaining tasks No possibility for conflicts

Sequential Performance
time … Latency: 4 hours/task Throughput: 1 task/4 hrs Concurrency: 1 Latency: Throughput: Concurrency: Elapsed Time for Alice: 4 Elapsed Time for Bob: 4 Total elapsed time: 4*N Can we do better? Latency = 4 CPI = 4 (here the instruction is the construction of the bike) Throughput = 2 bikes in 8 hrs. So 1 task in 4 hrs. So ¼ throughput Concurrency: 1 (i.e. 1 task at a time) CPI = 4 CPI =

Design 2: Pipelined Design
Partition room into stages of a pipeline Dave Carol Bob Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep

Partition room into stages of a pipeline Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep It still takes all four stages for one job to complete

Partition room into stages of a pipeline Bob Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep It still takes all four stages for one job to complete

Partition room into stages of a pipeline Carol Bob Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep It still takes all four stages for one job to complete

Partition room into stages of a pipeline Dave Carol Bob Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep It still takes all four stages for one job to complete

Partition room into stages of a pipeline Alice Alice Alice Alice One person owns a stage at a time 4 stages 4 people working simultaneously Everyone moves right in lockstep It still takes all four stages for one job to complete

Pipelined Performance
time … Latency (Delay): 4 cycles / task Throughput: 1 task / cycle (huge improvement) after your initial startup of filling the pipeline. Parallelism: 4 concurrent tasks Latency: 4 hrs/task Throughput: 1 task/hr Concurrency: 4 Latency: Throughput: Concurrency: What if some tasks don’t need drill? CPI = 1

Time Now what if drilling is twice as long but the gluing and paint are ½ each. What if drilling takes twice as long, but gluing and paint take ½ as long? What if some tasks don’t need drill? Latency: Throughput: CPI =

Time Done: 4 cycles Done: 6 cycles Done: 8 cycles So total latency is still the same: 4 Throughput: 1 task / 2 cycle, First task out at cycle 4, second at 6, third at 8, fourth at 10. So ½. Not 1! CPI: 2 cycles / task What if drilling takes twice as long, but gluing and paint take ½ as long? What if some tasks don’t need drill? Latency: Throughput: Latency: 4 cycles/task Throughput: 1 task/2 cycles CPI = 2 CPI =

Lessons Principle: Throughput increased by parallel execution
Balanced pipeline very important Else slowest stage dominates performance Pipelining: Identify pipeline stages Isolate stages from each other Resolve pipeline hazards (next lecture)

Single Cycle vs Pipelined Processor

Single Cycle  Pipelining
insn0.fetch, dec, exec insn1.fetch, dec, exec Pipelined insn0.fetch insn0.dec insn0.exec insn1.fetch insn1.dec insn1.exec

Agenda 5-stage Pipeline Implementation Working Example Hazards
Structural Data Hazards Control Hazards

A Processor Review: Single cycle processor memory register file alu
inst alu +4 +4 addr =? PC din dout control cmp offset memory new pc target imm extend

compute jump/branch targets
A Processor Write- Back Instruction Fetch Instruction Decode Execute Memory memory register file inst alu +4 addr PC din dout control memory new pc What does that do to a clock cycle. It is the time for 1 stage. So 5 times faster in this case (ASSUMING all stages are approximately equal sized) Left to right flow except for the write-back phase and the branch targets that can change the PC. Otherwise left to right. compute jump/branch targets imm extend

Pipelined Processor memory register file alu +4 addr PC din dout control memory compute jump/branch targets new pc extend Fetch Decode Execute Memory WB

Pipelined Processor Write- Back Instruction Fetch Instruction Decode Execute Memory memory register file A alu D D B +4 addr PC din dout inst B M control memory compute jump/branch targets new pc extend imm ctrl ctrl ctrl IF/ID ID/EX EX/MEM MEM/WB

Time Graphs Cycle IF ID EX WB IF ID EX WB IF ID EX WB IF ID EX WB IF
1 2 3 4 5 6 7 8 9 Cycle add IF ID EX MEM WB nand IF ID EX MEM WB IF ID EX MEM WB lw IF ID EX MEM WB add sw IF ID EX MEM WB CPI = 1 (inverse of throughput) Latency: 5 cycles Throughput: 1 insn/cycle Concurrency: 5 Latency: Throughput: CPI = 1

Principles of Pipelined Implementation
Break datapath into multiple cycles (here 5) Parallel execution increases throughput Balanced pipeline very important Slowest stage determines clock rate Imbalance kills performance Add pipeline registers (flip-flops) for isolation Each stage begins by reading values from latch Each stage ends by writing values to latch Resolve hazards

Perform Functionality Latch values of interest
Pipeline Stages Stage Perform Functionality Latch values of interest Fetch Use PC to index Program Memory, increment PC Instruction bits (to be decoded) PC + 4 (to compute branch targets) Decode Decode instruction, generate control signals, read register file Control information, Rd index, immediates, offsets, register values (Ra, Rb), PC+4 (to compute branch targets) Execute Perform ALU operation Compute targets (PC+4+offset, etc.) in case this is a branch, decide if branch taken Control information, Rd index, etc. Result of ALU operation, value in case this is a store instruction Memory Perform load/store if needed, address is ALU result Result of load, pass result from execute Writeback Select value, write to register file Anything needed by later pipeline stages Next stage will read this pipeline register

Instruction Fetch (IF)
instruction memory addr mc +4 PC - PC+4 - pc-rel (PC-relative); e.g. BEQ, BNE - pc-abs (PC absolute); e.g. J and JAL . (PC+4) • target • 00 - pc-reg (PC registers); e.g. JR new pc

instruction memory addr mc inst +4 00 = read word Rest of pipeline PC+4 PC pc-reg new pc pc-rel pc-abs pc-sel IF/ID

instruction memory addr mc inst +4 00 = read word Rest of pipeline PC+4 PC pc-reg pc-rel pc-abs PC+4 pc-reg (PC registers: JR) pc-rel (PC-relative: BEQ, BNE) pc-abs (PC absolute: J and JAL) pc-sel IF/ID

Stage 1: Instruction Fetch
Decode register file WE A A Rd D B B Ra Rb inst Stage 1: Instruction Fetch Rest of pipeline imm PC+4 Early decode: decode all instr in ID, pass control signals to later stages Late decode: decode some instr in ID, pass instr so each stage computes its own control signals PC+4 ctrl IF/ID ID/EX

Stage 1: Instruction Fetch
Decode result dest register file WE A A Rd D B B Ra Rb decode inst Stage 1: Instruction Fetch Rest of pipeline imm extend PC+4 Early decode: decode all instr in ID, pass control signals to later stages Late decode: decode some instr in ID, pass instr so each stage computes its own control signals PC+4 ctrl IF/ID ID/EX

Stage 2: Instruction Decode
Execute (EX) A D alu B imm B Stage 2: Instruction Decode Rest of pipeline PC+4 target ctrl ctrl ID/EX EX/MEM

Stage 2: Instruction Decode
Execute (EX) pcsel branch? pcreg A D alu B imm B Stage 2: Instruction Decode Rest of pipeline pcrel + PC+4 target pcabs  ctrl ctrl ID/EX EX/MEM

MEM D D B M memory ctrl ctrl EX/MEM MEM/WB addr Stage 3: Execute
din dout Stage 3: Execute Rest of pipeline memory target mc ctrl ctrl EX/MEM MEM/WB

MEM branch? D D B M memory ctrl ctrl EX/MEM MEM/WB addr
pcsel branch? MEM pcreg D D addr B M din dout Stage 3: Execute Rest of pipeline memory pcrel target mc pcabs ctrl ctrl EX/MEM MEM/WB

WB D M Stage 4: Memory ctrl MEM/WB

WB result D M Stage 4: Memory dest ctrl MEM/WB

Putting it all together!
inst mem Rd Ra Rb D B A inst PC+4 B A Rt D M imm OP Rd addr din dout +4 mem PC Rd IF/ID ID/EX EX/MEM MEM/WB

iClicker Question Consider a non-pipelined processor with clock period C (e.g., 50 ns). If you divide the processor into N stages (e.g., 5) , your new clock period will be: C N less than C/N C/N greater than C/N

Takeaway Pipelining is a powerful technique to mask latencies and increase throughput Logically, instructions execute one at a time Physically, instructions execute in parallel Instruction level parallelism Abstraction promotes decoupling Interface (ISA) vs. implementation (Pipeline)

MIPS is designed for pipelining
Instructions same length 32 bits, easy to fetch and then decode 3 types of instruction formats Easy to route bits between stages Can read a register source before even knowing what the instruction is Memory access through lw and sw only Access memory after ALU

Example: : Sample Code (Simple)
add r3  r1, r2 nand r6  r4, r5 lw r4  20(r2) add r5  r2, r5 sw r7  12(r3) Assume 8-register machine

IF/ID ID/EX EX/MEM MEM/WB + 4 valA instruction valB valB Rd dest dest
target PC+4 PC+4 R0 regA R1 ALU result A L U Inst mem Register file regB R2 valA PC Data mem M U X instruction R3 ALU result mdata R4 R5 valB R6 M U X data R7 imm dest extend valB Bits 11-15 Rd dest dest Bits 16-20 Rt M U X Bits 26-31 op op op IF/ID ID/EX EX/MEM MEM/WB

Example: Start State @ Cycle 0
At time 1, Fetch add r3 r1 r2 Example: Start Cycle 0 M U X 4 + R0 R1 36 A L U add nand lw sw Register file R2 9 PC Data mem M U X nop R3 12 4 R4 18 R5 7 R6 41 M U X data R7 22 dest extend Initial State Bits 11-15 Bits 16-20 M U X Bits 26-31 nop nop nop IF/ID ID/EX EX/MEM MEM/WB Time: 0

Cycle 1: Fetch add IF/ID ID/EX EX/MEM MEM/WB add 3 1 2 + / 4 / 36 8 4
U X 4 + 4 / 4 R0 R1 36 A L U add nand lw sw Register file R2 9 PC M U X / 36 Data mem add R3 12 8 4 R4 18 R5 7 / 9 R6 41 M U X data R7 22 dest extend Fetch: add 3 1 2 Bits 11-15 / 3 Bits 16-20 / 2 M U X Bits 26-31 nop / add nop nop IF/ID ID/EX EX/MEM MEM/WB Time: 1 / 2

Cycle 2: Fetch nand, Decode add
nand add 3 1 2 M U X 4 + / 4 8 4 / 8 R0 R1 36 1 add nand lw sw 36 A L U Register file R2 9 2 36 PC / 18 Data mem M U X nand R3 12 / 45 12 8 R4 18 9 R5 7 9 / 7 R6 41 M U X data R7 22 3 dest extend / 9 Fetch: nand 6 4 5 Bits 11-15 / 6 3 3 / 3 Bits 16-20 / 5 2 M U X Bits 26-31 add / nand nop / add nop IF/ID ID/EX EX/MEM MEM/WB Time: 2 / 3

Cycle 3: Fetch lw, Decode nand, …
lw 4 20(2) nand add 3 1 2 M U X nand ( 18∙7 ) 18 = 7 = -3 = 4 + 4 / 8 12 8 R0 R1 36 4 / 45 36 / 18 A L U add nand lw sw Register file R2 9 5 18 PC Data mem M U X lw (2) R3 12 45 / -3 16 12 R4 18 / 7 9 R5 7 7 R6 41 M U X data R7 22 6 dest extend 9 / 7 Fetch: lw 4 20(2) 0x 0x And = 10 Nand = Bits 11-15 3 6 3 3 / 6 / 3 Bits 16-20 2 5 M U X Bits 26-31 nand add / nand nop / add IF/ID ID/EX EX/MEM MEM/WB Time: 3 / 4

Cycle 4: Fetch add, Decode lw, …
add lw 4 20(2) nand add 3 1 2 M U X 4 + 8 16 12 R0 R1 36 2 45 A L U add nand lw sw 18 Register file R2 9 4 9 PC Data mem M U X add R3 12 -3 20 16 R4 18 45 7 R5 7 18 R6 41 M U X data R7 22 20 dest extend 7 Fetch: add 5 2 5 Bits 11-15 6 6 3 6 3 Bits 16-20 5 4 M U X Bits 26-31 lw nand add IF/ID ID/EX EX/MEM MEM/WB Time: 4

Cycle 4: Fetch add, Decode lw, …
sw 7 12(3) add lw 4 20 (2) nand add M U X 4 + 12 20 16 R0 R1 36 45 2 -3 add nand lw sw 9 A L U Register file R2 9 5 9 PC Data mem M U X sw 7 12(3) R3 45 29 24 20 R4 18 -3 R5 7 7 R6 41 M U X data R7 22 20 5 dest extend 18 Fetch: sw 7 12(3) Bits 11-15 5 4 6 3 4 6 Bits 16-20 4 5 M U X Bits 26-31 add lw nand IF/ID ID/EX EX/MEM MEM/WB Time: 5

Cycle 6: Decode sw, … IF/ID ID/EX EX/MEM MEM/WB
sw 7 12(3) add lw 4 20(2) nand M U X 4 + 16 20 R0 R1 36 -3 3 29 add nand lw sw 9 9 A L U Register file R2 7 45 PC Data mem M U X R3 45 16 99 28 24 R4 18 29 7 R5 7 22 R6 -3 M U X data R7 22 12 dest extend 7 No more instructions Bits 11-15 5 5 4 6 5 4 Bits 16-20 5 7 M U X Bits 26-31 sw add lw IF/ID ID/EX EX/MEM MEM/WB Time: 6

Cycle 7: Execute sw, ... IF/ID ID/EX EX/MEM MEM/WB
nop nop sw 7 12(3) add lw 4 20(2) M U X 4 + 20 R0 R1 36 16 add nand lw sw 9 45 A L U Register file R2 PC Data mem M U X R3 45 99 57 32 28 R4 99 16 R5 7 R6 -3 M U X data R7 22 12 dest extend 22 No more instructions Bits 11-15 7 5 4 7 5 Bits 16-20 7 M U X Bits 26-31 sw add IF/ID ID/EX EX/MEM MEM/WB Time: 7

Cycle 8: Memory sw, ... IF/ID ID/EX EX/MEM MEM/WB
nop nop nop sw 7 12(3) add M U X 4 + R0 R1 36 16 57 A L U add nand ;w sw Register file R2 9 PC Data mem M U X R3 45 36 32 R4 99 57 22 R5 16 R6 -3 M U X data R7 22 22 dest extend No more instructions Bits 11-15 5 7 Bits 16-20 M U X Bits 26-31 sw IF/ID ID/EX EX/MEM MEM/WB Time: 8 Slides thanks to Sally McKee

Cycle 9: Writeback sw, ... IF/ID ID/EX EX/MEM MEM/WB
nop nop nop nop sw 7 12(3) M U X 4 + R0 R1 36 A L U add nand ;w sw Register file R2 9 PC Data mem M U X R3 45 40 36 R4 99 R5 16 R6 -3 M U X data R7 22 dest extend No more instructions Bits 11-15 Bits 16-20 M U X Bits 21-23 IF/ID ID/EX EX/MEM MEM/WB Time: 9

iClicker Question Pipelining is great because:
You can fetch and decode the same instruction at the same time. You can fetch two instructions at the same time. You can fetch one instruction while decoding another. Instructions only need to visit the pipeline stages that they require. C and D

Hazards Structural hazards
Correctness problems associated w/processor design Structural hazards Same resource needed for different purposes at the same time (Possible: ALU, Register File, Memory) Data hazards Instruction output needed before it’s available Control hazards Next instruction PC unknown at time of Fetch Structural: say only 1 memory for instruction read and the MEM stage. That is a structural hazard. We have designed the MIPS ISA and our implementation of separate memory for instruction and data to avoid this problem. MIPS is designed very carefully to not have any structural hazards. Data hazards arise when data needed by one instruction has not yet been computed (because it is further down the pipeline). We need a solution for this. Control hazards arise when we don’t know what the PC will be. This makes it not possible to push the next instruction down the pipeline. Till we know what the issue is.

Dependences and Hazards
Dependence: relationship between two insns Data: two insns use same storage location Control: 1 insn affects whether another executes at all Not a bad thing, programs would be boring otherwise Enforced by making older insn go before younger one Happens naturally in single-/multi-cycle designs But not in a pipeline Hazard: dependence & possibility of wrong insn order Effects of wrong insn order cannot be externally visible Hazards are a bad thing: most solutions either complicate the hardware or reduce performance

“earlier” = started earlier
iClicker Question Data Hazards register file (RF) reads occur in stage 2 (ID) RF writes occur in stage 5 (WB) RF written in ½ half, read in second ½ half of cycle x10: add r3  r1, r2 x14: sub r5  r3, r4 1. Is there a dependence? 2. Is there a hazard? destination reg of earlier instruction == source reg of current stage left “earlier” = started earlier = stage right Yes No Cannot tell with the information given. assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage)

iClicker Question Data Hazards register file (RF) reads occur in stage 2 (ID) RF writes occur in stage 5 (WB) RF written in ½ half, read in second ½ half of cycle x10: add r3  r1, r2 x14: sub r5  r3, r4 1. Is there a dependence? 2. Is there a hazard? destination reg of earlier instruction == source reg of current stage left “earlier” = started earlier = stage right Yes No Cannot tell with the information given. for both assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage)

iClicker Follow-up Which of the following statements is true? A. Whether there is a data dependence between two instructions depends on the machine the program is running on. B. Whether there is a data hazard between two instructions depends on the machine the program is running on. C. Both A & B D. Neither A nor B destination reg of earlier instruction == source reg of current stage left “earlier” = started earlier = stage right assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage)

Where are the Data Hazards?
Clock cycle time IF ID MEM WB add r3, r1, r2 sub r5, r3, r4 IF ID MEM WB lw r6, 4(r3) IF ID MEM WB or r5, r3, r5 IF ID MEM WB sw r6, 12(r3) IF ID MEM WB

iClicker How many data hazards due to r3 only 1 2 3 4 5 add r3, r1, r2
sub r5, r3, r4 lw r6, 4(r3) C or r5, r3, r5 sw r6, 12(r3)

Visualizing Data Hazards (1)
Clock cycle time backwards arrows require time travel IF ID X MEM WB add r3, r1, r2 sub r5, r3, r4 IF ID X MEM WB lw r6, 4(r3) IF ID X MEM WB #1 and #2 #1 and #3 #1 and #4 (r3) Can’t go backward in time. #2 and #4 #3 and #5 or r5, r3, r5 IF ID X MEM WB sw r6, 12(r3) IF ID X MEM WB

Clock cycle time IF ID MEM WB add r3, r1, r2 X sub r5, r3, r4 IF ID X MEM WB lw r6, 4(r3) IF ID X MEM WB #1 and #2 #1 and #3 #1 and #4 (r3) Can’t go backward in time. #2 and #4 #3 and #5 or r5, r3, r5 IF ID X MEM WB sw r6, 12(r3) IF ID X MEM WB

Data Hazards Data Hazards register file reads occur in stage 2 (ID) register file writes occur in stage 5 (WB) next instructions may read values about to be written i.e. add r3, r1, r2 sub r5, r3, r4 How to detect? destination reg of earlier instruction == source reg of current stage left “earlier” = started earlier = stage right assumes (WE=0 implies rD=0) everywhere, similar for rA needed, rB needed (can ensure this in ID stage)

Detecting Data Hazards
inst mem Rd Ra Rb D B A inst PC+4 OP B A Rt D M imm Rd addr din dout +4 mem PC Rd IF/ID.Ra ≠ 0 && (IF/ID.Ra==ID/Ex.Rd IF/ID.Ra==Ex/M.Rd IF/ID.Ra==M/W.Rd) sub r5,r3,r4 add r3, r1, r2 IF/ID ID/EX EX/MEM MEM/WB repeat for Rb

inst mem Rd Ra Rb D B A inst PC+4 OP B A Rt D M imm Rd addr din dout +4 mem detect hazard PC Rd IF/ID ID/EX EX/MEM MEM/WB

Takeaway Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards.

Next Goal What to do if data hazard detected?
Nothing: Modify ISA to match implementation Stall: Pause current and all subsequent instruction Forward/Bypass: Steal correct value from elsewhere in pipeline

iClicker What to do if data hazard detected? Wait/Stall
Reorder in Software (SW) Forward/Bypass All the above None. We will use some other method Nothing: Modify ISA to match implementation Stall: Pause current and all subsequent instruction Forward/Bypass: Steal correct value from elsewhere in pipeline

Possible Responses to Data Hazards
Do Nothing Change the ISA to match implementation “Hey compiler: don’t create code w/data hazards!” (We can do better than this) Stall Pause current and subsequent instructions till safe Forward/bypass Forward data value to where it is needed (Only works if value actually exists already) Nothing: Modify ISA to match implementation Stall: Pause current and all subsequent instruction Forward/Bypass: Steal correct value from elsewhere in pipeline

Stalling How to stall an instruction in ID stage
prevent IF/ID pipeline register update stalls the ID stage instruction convert ID stage instr into nop for later stages innocuous “bubble” passes through pipeline prevent PC update stalls the next (IF stage) instruction prev slide: Q: what happens if branch in EX? A: bad stuff: we miss the branch So: lets try not to stall

add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 inst mem Rd Ra Rb D B A inst PC+4 OP B A Rt D M imm Rd addr din dout +4 mem detect hazard PC Rd If detect hazard MemWr=0 RegWr=0 WE=0 IF/ID ID/EX EX/MEM MEM/WB

Stalling time add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4
Clock cycle time 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 IF ID 11=r1 22=r2 EX D=33 MEM D=33 WB r3=33 ID ?=r3 33=r3 EX MEM M find hazards first, then draw

Stalling IF ID Ex M W IF ID ID ID ID Ex M W IF IF IF IF ID Ex M IF ID
Clock cycle time 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 IF ID 11=r1 22=r2 EX D=33 MEM D=33 WB r3=33 ID ?=r3 33=r3 EX MEM M r3 = 10 IF ID Ex M W r3 = 20 3 Stalls Stall IF ID ID ID ID Ex M W IF IF IF IF ID Ex M find hazards first, then draw IF ID Ex

Stalling nop /stall A A D D inst mem D rD B B data mem rA rB B M
+4 Rd Rd Rd (MemWr=0 RegWr=0) WE PC WE WE nop Op Op Op sub r5,r3,r5 add r3,r1,r2 draw control signals for /stall Q: what happens if branch in EX? A: bad stuff: we miss the branch So: lets try not to stall (WE=0) or r6,r3,r4 /stall NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA==ID/Ex.Rd IF/ID.rA==Ex/M.Rd IF/ID.rA==M/W.Rd)) STALL CONDITION MET

Stalling nop nop /stall A A D D inst mem D rD B B data mem rA rB B M
+4 Rd Rd Rd (MemWr=0 RegWr=0) WE PC WE WE nop (MemWr=0 RegWr=0) Op Op Op nop sub r5,r3,r5 add r3,r1,r2 draw control signals for /stall Q: what happens if branch in EX? A: bad stuff: we miss the branch So: lets try not to stall (WE=0) or r6,r3,r4 /stall NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA==ID/Ex.Rd IF/ID.rA==Ex/M.Rd IF/ID.rA==M/W.Rd)) STALL CONDITION MET

Stalling nop nop nop /stall A A D D inst mem D rD B B data mem rA rB B
+4 Rd Rd Rd (MemWr=0 RegWr=0) WE PC WE WE nop (MemWr=0 RegWr=0) (MemWr=0 RegWr=0) Op Op Op nop nop sub r5,r3,r5 add r3,r1,r2 draw control signals for /stall Q: what happens if branch in EX? A: bad stuff: we miss the branch So: lets try not to stall (WE=0) or r6,r3,r4 /stall NOP = If(IF/ID.rA ≠ 0 && (IF/ID.rA==ID/Ex.Rd IF/ID.rA==Ex/M.Rd IF/ID.rA==M/W.Rd)) STALL CONDITION MET

Stalling IF ID Ex M W IF ID ID ID ID Ex M W IF IF IF IF ID Ex M IF ID
Clock cycle time 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r5 or r6, r3, r4 add r6, r3, r8 IF ID 11=r1 22=r2 EX D=33 MEM D=33 WB r3=33 ID ?=r3 33=r3 EX MEM M r3 = 10 IF ID Ex M W r3 = 20 3 Stalls Stall IF ID ID ID ID Ex M W IF IF IF IF ID Ex M find hazards first, then draw IF ID Ex

Stalling How to stall an instruction in ID stage
prevent IF/ID pipeline register update stalls the ID stage instruction convert ID stage instr into nop for later stages innocuous “bubble” passes through pipeline prevent PC update stalls the next (IF stage) instruction prev slide: Q: what happens if branch in EX? A: bad stuff: we miss the branch So: lets try not to stall

Takeaway Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards. Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to IF/ID registers from changing, and (3) preventing writes to memory and register file. *Bubbles in pipeline significantly decrease performance.

Possible Responses to Data Hazards
Do Nothing Change the ISA to match implementation “Compiler: don’t create code with data hazards!” (Nice try, we can do better than this) Stall Pause current and subsequent instructions till safe Forward/bypass Forward data value to where it is needed (Only works if value actually exists already) Nothing: Modify ISA to match implementation Stall: Pause current and all subsequent instruction Forward/Bypass: Steal correct value from elsewhere in pipeline

Forwarding Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register). Three types of forwarding/bypass Forwarding from Ex/Mem registers to Ex stage (MEx) Forwarding from Mem/WB register to Ex stage (WEx) RegisterFile Bypass

Add the Forwarding Datapath
B A A D D inst mem B data mem B M imm Rd Rd detect hazard Rb forward unit WE WE Ra MC MC draw control signals IF/ID ID/Ex Ex/Mem Mem/WB

Forwarding Datapath D B A A D D inst mem B data mem B M
imm Rd Rd detect hazard Rb forward unit WE WE Ra MC MC draw control signals IF/ID ID/Ex Ex/Mem Mem/WB Three types of forwarding/bypass Forwarding from Ex/Mem registers to Ex stage (MEx) Forwarding from Mem/WB register to Ex stage (W  Ex) RegisterFile Bypass

Forwarding Datapath 1: Ex/MEM  EX
B A inst mem data mem sub r5, r3, r1 add r3, r1, r2 add r3, r1, r2 sub r5, r3, r1 IF ID Ex M W IF ID Ex M W Problem: EX needs ALU result that is in MEM stage Solution: add a bypass from EX/MEM.D to start of EX

Forwarding Datapath 1: Ex/MEM  EX
B A inst mem data mem sub r5, r3, r1 add r3, r1, r2 Detection Logic in Ex Stage: forward = (Ex/M.WE && EX/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd) || (same for Rb)

Forwarding Datapath 2: Mem/WB  EX
inst mem data mem or r6, r3, r4 sub r5, r3, r1 add r3, r1,r2 add r3, r1, r2 sub r5, r3, r1 or r6, r3, r4 IF ID Ex M W IF ID Ex M W IF ID Ex M W Problem: EX needs value being written by WB Solution: Add bypass from WB final value to start of EX

Forwarding Datapath 2: Mem/WB  EX
inst mem data mem or r6, r3, r4 sub r5, r3, r1 add r3, r1,r2 Detection Logic: forward = (M/WB.WE && M/WB.Rd != 0 && ID/Ex.Ra == M/WB.Rd && not (ID/Ex.WE && Ex/M.Rd != 0 && ID/Ex.Ra == Ex/M.Rd) || (same for Rb)

Register File Bypass D B A inst mem data mem add r6, r3, r8 or r6, r3, r4 sub r5, r3, r1 add r3, r1,r2 Problem: Reading a value that is currently being written Solution: just negate register file clock writes happen at end of first half of each clock cycle reads happen during second half of each clock cycle

Register File Bypass IF ID Ex M W IF ID Ex M W IF ID Ex M W IF ID Ex M
inst mem data mem add r6, r3, r8 or r6, r3, r4 sub r5, r3, r1 add r3, r1,r2 add r3, r1, r2 sub r5, r3, r1 or r6, r3, r4 add r6, r3, r8 IF ID Ex M W IF ID Ex M W IF ID Ex M W IF ID Ex M W

Forwarding Example 2 time add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3)
Clock cycle 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3) or r5, r3, r5 sw r6, 12(r3) IF ID 11=r1 22=r2 EX D=33 MEM D=33 WB r3=33 ID 33=r3 EX D=55 MEM r5=55 ID 33=r3 D=? MEM D=66 r6=66 ID 33=r3 55=r5 66=r6 identify hazards then draw; if “or” had used r6, would have needed to stall

Forwarding Example 2 IF ID Ex M W IF ID Ex M W IF ID Ex M W IF ID Ex M
time Clock cycle 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3) or r5, r3, r6 sw r6, 12(r3) IF ID 11=r1 22=r2 EX D=33 MEM D=33 WB r3=33 ID 33=r3 EX D=55 MEM r5=55 ID 33=r3 D=? MEM D=66 r6=66 ID 33=r3 55=r5 66=r6 IF ID Ex M W IF ID Ex M W IF ID Ex M W IF ID Ex M W identify hazards then draw; if “or” had used r6, would have needed to stall IF ID Ex M W

Forwarding Example 2 IF ID Ex M W IF ID Ex M W IF ID Ex M W IF ID Ex M
time Clock cycle backwards arrows require time travel 1 2 3 4 5 6 7 8 add r3, r1, r2 sub r5, r3, r4 lw r6, 4(r3) or r5, r3, r5 sw r6, 12(r3) IF ID 11=r1 22=r2 EX D=33 MEM D=33 WB r3=33 ID 33=r3 EX D=55 MEM r5=55 ID 33=r3 D=? MEM D=66 r6=66 ID 33=r3 55=r5 66=r6 IF ID Ex M W IF ID Ex M W IF ID Ex M W IF ID Ex M W identify hazards then draw; if “or” had used r6, would have needed to stall IF ID Ex M W

Load-Use Hazard Explained
B A inst mem data mem or r6, r3, r4 lw r4, 20(r8) Data dependency after a load instruction: Value not available until after the M stage Next instruction cannot proceed if dependent THE KILLER HAZARD often see lw …; nop; … we will stall for our project 2

Load-Use Stall A D lw r4, 20(r8) inst mem B data mem or r6,r4,r1
Need to go back in time

Load-Use Stall (1) IF ID Ex IF ID A D lw r4, 20(r8) inst mem B
data mem or r6,r4,r1 lw r4, 20(r8) lw r4, 20(r8) or r6, r3, r4 IF ID Ex Need to go back in time IF ID

Load-Use Stall (2) IF ID Ex M W IF ID* ID Ex M W Stall D B A inst mem
data mem or r6,r4,r1 NOP lw r4, 20(r8) lw r4, 20(r8) or r6, r3, r4 IF ID Ex M W requires one bubble Stall IF ID* ID Ex M W

Load-Use Stall (3) IF ID Ex M W IF ID* ID Ex Ex M W Stall D B A
inst mem data mem or r6,r4,r1 NOP lw r4, 20(r8) lw r4, 20(r8) or r6, r3, r4 IF ID Ex M W requires one bubble Stall IF ID* ID Ex Ex M W

Load-Use Detection D B A A D D inst mem B data mem B M
imm Rd Rd Rd detect hazard Rb forward unit WE WE It happens in an earlier stage. You stall since you don’t want to go ahead from the ID stage unless you are ready. Walk through what happens with a Ra MC MC MC IF/ID ID/Ex Ex/Mem Mem/WB Stall = If(ID/Ex.MemRead && IF/ID.Ra == ID/Ex.Rd

Incorrectly Resolving Load-Use Hazards
B A A D D inst mem B data mem B M imm Rd Rd Rd detect hazard Rb forward unit WE WE Ra often see lw …; nop; … we will stall for our project 2 MC MC MC IF/ID ID/Ex Ex/Mem Mem/WB Most frequent 3410 non-solution to load-use hazards Why is this “solution” so so so so so so awful?

iClicker Question Forwarding values directly from Memory to the Execute stage without storing them in a register first: Does not remove the need to stall. Adds one too many possible inputs to the ALU. Will cause the pipeline register to have the wrong value. Halves the frequency of the processor. Both A & D

Resolving Load-Use Hazards
Two MIPS Solutions: MIPS 2000/3000: delay slot ISA says results of loads are not available until one cycle later Assembler inserts nop, or reorders to fill delay slot MIPS 4000 onwards: stall But really, programmer/compiler reorders to avoid stalling in the load delay slot often see lw …; nop; … we will stall for our project 2

Takeaway Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. A pipelined processor needs to detect data hazards. Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to IF/ID registers from changing, and (3) preventing writes to memory and register file. Bubbles (nops) in pipeline significantly decrease performance. Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register). Better performance than stalling.

Quiz Find all hazards, and say how they are resolved: add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) Find all hazards, and say how they are resolved

Quiz Find all hazards, and say how they are resolved: add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) Find all hazards, and say how they are resolved 5 Hazards

Quiz Find all hazards, and say how they are resolved: add r3, r1, r2 nand r5, r3, r4 add r2, r6, r3 lw r6, 24(r3) sw r6, 12(r2) Forwarding from Ex/MID/Ex (MEx) Forwarding from M/WID/Ex (WEx) RegisterFile (RF) Bypass Forwarding from M/WID/Ex (WEx) Find all hazards, and say how they are resolved Stall + Forwarding from M/WID/Ex (WEx) 5 Hazards

Quiz Find all hazards, and say how they are resolved: add r3, r1, r2 sub r3, r2, r1 nand r4, r3, r1 or r0, r3, r4 xor r1, r4, r3 sb r4, 1(r0) 6 hazards be careful of store (no rD) be careful or r0 (never a hazard)

Quiz 2 Find all hazards, and say how they are resolved: add r3, r1, r2 sub r3, r2, r1 nand r4, r3, r1 or r0, r3, r4 xor r1, r4, r3 sb r4, 1(r0) 6 hazards be careful of store (no rD) be careful or r0 (never a hazard) Hours and hours of debugging!

Data Hazard Recap Delay Slot(s) Stall Forward/Bypass Tradeoffs?
Modify ISA to match implementation Stall Pause current and all subsequent instructions Forward/Bypass Try to steal correct value from elsewhere in pipeline Otherwise, fall back to stalling or require a delay slot Tradeoffs? 1: ISA tied to impl; messy asm; impl easy & cheap; perf depends on asm. 2: ISA correctness not tied to impl; clean+slow asm, or messy+fast asm; impl easy and cheap; same perf as 1. 3: ISA perf not tied to impl; clean+fast asm; impl is tricky; best perf

A bit of Context i = 0; do { n += 2; i++; } while(i < max) i  r1
x10 addiu r1, r0, 0 # i=0 x14 Loop: addiu r2, r2, 2 # n += 2 x addiu r1, r1, 1 # i++ x1C blt r1, r3, Loop # i<max? x addiu r1, r0, 7 # i = 7 x subi r2, r2, 1 # n-- A bit of Context i  r1 Assume: n  r2 max  r3

Control Hazards Control Hazards Branch not taken? No Problem!
instructions are fetched in stage 1 (IF) branch and jump decisions occur in stage 3 (EX)  next PC not known until 2 cycles after branch/jump x1C blt r1, r3, Loop x20 addiu r1, r0, 7 x24 subi r2, r2, 1 Branch not taken? No Problem! Branch taken? Just fetched 2 addi’s  Zap & Flush Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad

Zap & Flush IF ID Ex M W IF ID IF IF ID Ex M W If branch Taken Zap
prevent PC update clear IF/ID latch branch continues Zap & Flush inst mem D B A +4 data mem PC branch calc decide branch New PC = 14 If branch Taken Zap IF ID Ex M W branch decision in EX need 2 stalls, and maybe kill 1C blt r1,r3,L 20 addiu r1,r0,7 24 subi r2,r2,1 IF ID NOP NOP NOP IF NOP NOP NOP NOP IF ID Ex M W 14 L:addi r2,r2,2

Zap & Flush IF ID Ex M W IF ID IF IF ID Ex M W If branch Taken Zap
prevent PC update clear IF/ID latch branch continues Zap & Flush inst mem D B A +4 data mem PC branch calc decide branch New PC = 1C If branch Taken Zap IF ID Ex M W branch decision in EX need 2 stalls, and maybe kill 1C blt r1,r3,L 20 addiu r1,r0,7 24 subi r2,r2,1 IF ID NOP NOP NOP IF NOP NOP NOP NOP IF ID Ex M W 14 L:addi r2,r2,2 For every taken branch? OUCH!!!

Reducing the cost of control hazard
Delay Slot You MUST do this MIPS ISA: 1 insn after ctrl insn always executed Whether branch taken or not Resolve Branch at Decode Some groups do this for Project 3, your choice Move branch calc from EX to ID Alternative: just zap 2nd instruction when branch taken Branch Prediction Not in 3410, but every processor worth anything does this (no offense!) Q: what happens with branch/jump in branch delay slot? A: bad stuff, so forbidden Q: why one, not two? A: can move branch calc from EX to ID; will require new bypasses into ID stage; or can just zap the second instruction Q: performance? A: stall is bad

Problem: Zapping 2 insns/branch
inst mem D B A +4 data mem PC branch calc decide branch New PC = 1C If branch Taken Zap 1C blt r1, r3, Loop F D X 20 addiu r1, r0, 7 24 subi r2, r2, 1 Pipeline is humming Zap!

Solution #1: Delay Slot i = 0; do { n += 2; i++; } while(i < max)
x10 addiu r1, r0, 0 # i=0 x14 Loop: addiu r2, r2, 2 # n += 2 x addiu r1, r1, 1 # i++ x1C blt r1, r3, Loop # i<max? x nop x addiu r1, r0, 7 # i = 7 x subi r2, r2, 1 # n++ Solution #1: Delay Slot i  r1 Assume: n  r2 max  r3

Delay Slot in Action Zap! If branch Taken Zap F D X inst mem D B A
+4 data mem PC branch calc decide branch New PC = 1C If branch Taken Zap 1C blt r1, r3, Loop F D X 20 nop 24 addiu r1, r0, 7 Pipeline is humming Zap!

Soln #2: Resolve Branches @ Decode
inst mem D B A +4 data mem PC branch calc decide branch New PC = 1C If branch Taken No Zapping 1C blt r1, r3, Loop F D X 20 nop 14 Loop:addiu r2,r2,2 branch decision in EX need 2 stalls, and maybe kill No Zapping!

Optimization: Fill the Delay Slot
x10 addiu r1, r0, 0 # i=0 x14 Loop: addiu r2, r2, 2 # n += 2 x addiu r1, r1, 1 # i++ x1C blt r1, r3, Loop # i<max? x nop Compiler transforms code x10 addiu r1, r0, 0 # i=0 x14 Loop: addiu r1, r1, 1 # i++ x blt r1, r3, Loop # i<max? x1C addiu r2, r2, 2 # n += 2

Optimization In Action!
inst mem D B A +4 data mem PC branch calc decide branch New PC = 1C 1C blt r1, r3, Loop F D X 20 addi r2,r2,2 14 Loop:addi r1,r1,1 branch decision in EX need 2 stalls, and maybe kill No Nop or Zapping! Note: Insn in delay slot will always be executed whether branch take or not

Branch Prediction Most processor support Speculative Execution
Guess direction of the branch Allow instructions to move through pipeline Zap them later if guess turns out to be wrong A must for long pipelines Q: How to guess? A: constant; hint; branch prediction; random; …

Speculative Execution: Loops
Pipeline so far “Guess” (predict) that the branch will not be taken We can do better! Make prediction based on last branch Predict “take branch” if last branch “taken” Or Predict “do not take branch” if last branch “not taken” Need one bit to keep track of last branch

Speculative Execution: Loops
While (r3 ≠ 0) {…. r3--;} Top: BEQZ r3, End J Top End: Top2: BEQZ r3, End2 End2: What is accuracy of branch predictor? Wrong twice per loop! Once on loop enter and exit We can do better with 2 bits

Speculative Execution: Branch Execution
Branch Not Taken (NT) Predict Taken 2 (PT2) Predict Taken 1 (PT1) Branch Taken (T) Branch Taken (T) Branch Not Taken (NT) Branch Taken (T) Sates is a 2 bit prediction scheme Predict Not Taken 2 (PT2) Predict Not Taken 1 (PT1) Branch Not Taken (NT)

Summary Branch prediction Control hazards Is branch taken or not?
Performance penalty: stall and flush Reduce cost of control hazards Move branch decision from Ex to ID 2 nops to 1 nop Delay slot Compiler puts useful work in delay slot. ISA level. Branch prediction Correct. Great! Wrong. Flush pipeline. Performance penalty

Hazards Summary Data hazards Control hazards Structural hazards
resource contention so far: impossible because of ISA and pipeline design Q: what if we had a “copy 0(r3), 4(r3)” instruction, as the x86 does, or “add r4, r4, 0(r3)”? A: structural hazard

Hazards Summary Data hazards Control hazards Structural hazards
register file reads occur in stage 2 (IF) register file writes occur in stage 5 (WB) next instructions may read values soon to be written Control hazards branch instruction may change the PC in stage 3 (EX) next instructions have already started executing Structural hazards resource contention so far: impossible because of ISA and pipeline design Q: what if we had a “copy 0(r3), 4(r3)” instruction, as the x86 does, or “add r4, r4, 0(r3)”? A: structural hazard

Data Hazard Takeaways Data hazards occur when a operand (register) depends on the result of a previous instruction that may not be computed yet. Pipelined processors need to detect data hazards. Stalling, preventing a dependent instruction from advancing, is one way to resolve data hazards. Stalling introduces NOPs (“bubbles”) into a pipeline. Introduce NOPs by (1) preventing the PC from updating, (2) preventing writes to IF/ID registers from changing, and (3) preventing writes to memory and register file. Nops significantly decrease performance. Forwarding bypasses some pipelined stages forwarding a result to a dependent instruction operand (register). Better performance than stalling.

Control Hazard Takeaways
Control hazards occur because the PC following a control instruction is not known until control instruction is executed. If branch is taken  need to zap instructions. 1 cycle performance penalty. Delay Slots can potentially increase performance due to control hazards. The instruction in the delay slot will always be executed. Requires software (compiler) to make use of delay slot. Put nop in delay slot if not able to put useful instruction in delay slot. We can reduce cost of a control hazard by moving branch decision and calculation from Ex stage to ID stage. With a delay slot, this removes the need to flush instructions on taken branches.

Hakim Weatherspoon CS 3410 Computer Science Cornell University

Similar presentations

Presentation on theme: "Hakim Weatherspoon CS 3410 Computer Science Cornell University"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

Hakim Weatherspoon CS 3410 Computer Science Cornell University

Similar presentations

Presentation on theme: "Hakim Weatherspoon CS 3410 Computer Science Cornell University"— Presentation transcript:

Similar presentations

About project

Feedback