Pipelining (I)

Pipelining Example
• Laundry Example
• Four students have one load of clothes each to wash, dry, fold, and put away
  • Washer takes 30 minutes
  • Dryer takes 30 minutes
  • Folding clothes takes 30 minutes
  • Roommate takes 30 minutes to put clothes away

Sequential Laundry
• Sequential laundry takes 8 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: task order over time for sequential laundry]

Pipelined Laundry: Start Work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: task order over time for pipelined laundry]
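The 8-hour and 3.5-hour figures can be checked with a short calculation. This is a minimal sketch: the function names are illustrative and not from the slides, and it assumes every stage takes exactly 30 minutes and a new load starts as soon as the washer is free.

```python
# Laundry timing sketch; all names here are illustrative assumptions.
STAGE_MINUTES = 30  # washer, dryer, folding, and putting away each take 30 minutes
NUM_STAGES = 4
NUM_LOADS = 4


def sequential_minutes(loads, stages, stage_minutes):
    # Each load passes through every stage before the next load starts.
    return loads * stages * stage_minutes


def pipelined_minutes(loads, stages, stage_minutes):
    # The first load needs `stages` steps; each later load finishes one step
    # after the load in front of it.
    return (stages + loads - 1) * stage_minutes


print(sequential_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES) / 60)  # 8.0
print(pipelined_minutes(NUM_LOADS, NUM_STAGES, STAGE_MINUTES) / 60)   # 3.5
```

Each load still takes 2 hours from washer to closet; only the total time for all four loads shrinks.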

Real-World Pipelines: Car Washes
• Idea
  • Divide the process into independent stages
  • Move objects through the stages in sequence
  • At any given time, multiple objects are being processed
[Figure panels: Sequential, Parallel, Pipelined]

Computational Example
• System
  • Computation requires a total of 300 picoseconds
  • Additional 20 picoseconds to save the result in a register
  • Must have a clock cycle of at least 320 ps
[Figure: combinational logic (300 ps) followed by a register (20 ps), driven by a clock. Delay = 320 ps, Throughput = 3.12 GOPS]

3-Way Pipelined Version
• System
  • Divide the combinational logic into 3 blocks of 100 ps each
  • Can begin a new operation as soon as the previous one passes through stage A
    • Begin a new operation every 120 ps
  • Overall latency increases
    • 360 ps from start to finish
[Figure: comb. logic A (100 ps), register (20 ps), comb. logic B (100 ps), register (20 ps), comb. logic C (100 ps), register (20 ps), driven by a clock. Delay = 360 ps, Throughput = 8.33 GOPS]
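The delay and throughput numbers for the unpipelined and 3-way pipelined systems follow from the stage delays. The sketch below is illustrative: `pipeline_metrics` is a hypothetical helper, and it assumes one operation completes per clock cycle once the pipeline is full.

```python
def pipeline_metrics(stage_logic_ps, register_ps):
    """Return (cycle time in ps, latency in ps, throughput in GOPS).

    stage_logic_ps lists the combinational delay of each stage; a register
    with delay register_ps follows every stage.
    """
    cycle_ps = max(stage_logic_ps) + register_ps  # clock is set by the slowest stage
    latency_ps = cycle_ps * len(stage_logic_ps)   # one operation crosses every stage
    throughput_gops = 1000 / cycle_ps             # one op per cycle; 1000 ps = 1 ns
    return cycle_ps, latency_ps, throughput_gops


unpipelined = pipeline_metrics([300], 20)
three_way = pipeline_metrics([100, 100, 100], 20)
print(unpipelined[:2])  # (320, 320)
print(three_way[:2])    # (120, 360)
print(unpipelined[2], three_way[2])  # roughly 3.1 and 8.3 GOPS
```

Latency grows from 320 ps to 360 ps while throughput rises by a factor of about 2.67, which is less than 3 because each stage pays the 20 ps register delay.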

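Pipeline Diagrams
• Unpipelined
  • Cannot start a new operation until the previous one completes
• 3-Way Pipelined
  • Up to 3 operations in process simultaneously
[Figure: timelines for OP1, OP2, OP3. Unpipelined operations run back to back; pipelined operations overlap as each moves through stages A, B, C]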

Operating a Pipeline
[Figure: timing diagram of OP1, OP2, OP3 passing through stages A, B, C on a time axis marked 0, 120, 240, 360, 480, 640, with snapshots of the three-stage circuit (100 ps of logic plus a 20 ps register per stage) at times 239, 241, 300, and 359]

Limitations: Nonuniform Delays
• Throughput is limited by the slowest stage
• Other stages sit idle for much of the time
• Challenging to partition the system into balanced stages
[Figure: comb. logic A (50 ps), register (20 ps), comb. logic B (150 ps), register (20 ps), comb. logic C (100 ps), register (20 ps), with timelines for OP1, OP2, OP3. Delay = 510 ps, Throughput = 5.88 GOPS]
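To see how much time the faster stages waste, the sketch below computes the clock period and the idle time of each stage per cycle for the 50/150/100 ps split. The `analyze` helper and its return layout are assumptions made for illustration.

```python
REGISTER_PS = 20
STAGE_LOGIC_PS = {"A": 50, "B": 150, "C": 100}  # delays taken from the slide


def analyze(stage_logic_ps, register_ps):
    # The clock must accommodate the slowest stage plus its register.
    cycle_ps = max(stage_logic_ps.values()) + register_ps
    # Idle time per cycle: whatever part of the cycle a stage does not need.
    idle_ps = {
        name: cycle_ps - (logic + register_ps)
        for name, logic in stage_logic_ps.items()
    }
    latency_ps = cycle_ps * len(stage_logic_ps)
    throughput_gops = 1000 / cycle_ps
    return cycle_ps, idle_ps, latency_ps, throughput_gops


cycle_ps, idle_ps, latency_ps, throughput_gops = analyze(STAGE_LOGIC_PS, REGISTER_PS)
print(cycle_ps)    # 170
print(idle_ps)     # {'A': 100, 'B': 0, 'C': 50}
print(latency_ps)  # 510
```

Stage A is idle for 100 of every 170 ps, which is why the throughput (about 5.88 GOPS) falls well short of the balanced 3-way design.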

Limitations: Register Overhead
• As we try to deepen the pipeline, the overhead of loading registers becomes more significant
• Percentage of the clock cycle spent loading the register:
  • 1-stage pipeline: 6.25% (= 20/320)
  • 3-stage pipeline: 16.67% (= 60/360)
  • 6-stage pipeline: 28.57% (= 120/420)
• High speeds of modern processor designs are obtained through very deep pipelining
[Figure: six stages, each with comb. logic (50 ps) followed by a register (20 ps), driven by a clock. Delay = 420 ps, Throughput = 14.29 GOPS]
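The overhead percentages can be reproduced directly. This sketch assumes the 300 ps of logic divides evenly among the stages, as in the slide's 1-, 3-, and 6-stage designs; `overhead_percent` is an illustrative name.

```python
TOTAL_LOGIC_PS = 300  # total combinational delay from the computational example
REGISTER_PS = 20      # delay of each pipeline register


def overhead_percent(stages):
    # Assumes the logic splits evenly, so every stage has the same delay.
    cycle_ps = TOTAL_LOGIC_PS / stages + REGISTER_PS
    return 100 * REGISTER_PS / cycle_ps


for stages in (1, 3, 6):
    print(stages, f"{overhead_percent(stages):.2f}%")
# 1 6.25%
# 3 16.67%
# 6 28.57%
```

The register delay is fixed, so its share of the cycle grows as the logic per stage shrinks.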

Pipelining Lessons
• Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
• Multiple tasks operate simultaneously using different resources
• Potential speedup = number of pipeline stages
• Pipeline rate is limited by the slowest pipeline stage
• Unbalanced lengths of pipe stages reduce speedup
• Time to fill the pipeline and time to drain it reduce speedup
• Stall for dependences

The Five Stages of Load
• I-fetch: Instruction Fetch
  • Fetch the instruction from the Instruction Memory
• Dec/Reg: Instruction Decode and Register Fetch
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• WB: Write the data back to the register file
[Figure: Cycle 1: I-fetch, Cycle 2: Dec/Reg, Cycle 3: Exec, Cycle 4: Mem, Cycle 5: WB]

Pipelining
• Improve performance by increasing instruction throughput

Ideal Speedup
• Q: The ideal speedup is the number of stages in the pipeline. Do we achieve this?
  • Not in practice: imperfect balance and pipeline overhead
  • Ideal case: Time between instructions (pipelined) = Time between instructions (non-pipelined) / Number of pipe stages
  • For 1,000,003 instructions
    • Non-pipelined: 800,002,400 ps
    • Pipelined: 200,001,400 ps
    • Speedup = 800,002,400 / 200,001,400 ≈ 4
• Pipelining improves performance
  • By increasing instruction throughput
  • Not by decreasing the execution time of an individual instruction
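The two totals can be reproduced under assumptions that the slide does not state outright: 800 ps per instruction without pipelining (800,002,400 / 1,000,003), a 200 ps pipelined clock, and the five stages of the load instruction shown earlier. A hedged sketch:

```python
INSTRUCTIONS = 1_000_003
STAGES = 5                # assumption: the five-stage pipeline described for load
NONPIPELINED_PS = 800     # assumption inferred from 800,002,400 / 1,000,003
PIPELINED_CYCLE_PS = 200  # assumption inferred from the pipelined total

# Without pipelining, every instruction takes the full 800 ps.
nonpipelined_total = INSTRUCTIONS * NONPIPELINED_PS

# With pipelining, the first instruction needs STAGES cycles and every later
# instruction completes one cycle after its predecessor.
pipelined_total = (INSTRUCTIONS + STAGES - 1) * PIPELINED_CYCLE_PS

speedup = nonpipelined_total / pipelined_total
print(nonpipelined_total)  # 800002400
print(pipelined_total)     # 200001400
print(round(speedup, 2))   # 4.0
```

Under these assumptions the speedup approaches 800/200 = 4 rather than the stage count of 5, because the clock is set by the slowest stage.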

Instruction Set for Pipelining
• MIPS is made for pipelining!
• All instructions are the same length:
  • Helps the I-fetch and Decode stages
• A few instruction formats and fixed source operand fields
  • Can read operands and decode the opcode at the same time
• Memory operands only appear in loads/stores
  • Calculate the memory address in the Execute stage
• Operands must be aligned in memory
  • One data memory access for a single data transfer instruction

Pipeline Hazards
• Situations in which the next instruction cannot execute in the following clock cycle
• Three Types of Hazards
  • Structural hazards
    • HW cannot support this combination of instructions due to lack of HW capacity (or resource conflict)
  • Data hazards
    • An instruction depends on the result of a prior instruction still in the pipeline
  • Control hazards
    • Pipelining of branches and other instructions that change the PC

Structural Hazard
• Resource shortage
  • Multiple instructions try to use the same resource in the same cycle
• Delay the younger instruction
  • Detect the hazard situation dynamically
  • Stall the younger instruction to allow the older instruction to use the resource
• Eliminate conflicts
  • Reorganize the pipeline to avoid accessing the resource in 2 stages (if possible; may require an ISA change)
  • Provide separate copies of, or ports to, the resource per stage (may be too expensive)
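A memory conflict of this kind can be located with a small check. The sketch below assumes a five-stage pipeline with a single one-port memory shared by instruction fetch and data access, one instruction issued per cycle, and no other stalls; `memory_conflicts` is a hypothetical helper.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def memory_conflicts(uses_data_memory):
    """List (cycle, older, younger) where one memory port is needed twice.

    uses_data_memory[i] is True when instruction i is a load or store.
    Instruction i (0-based) is assumed to be fetched in cycle i + 1.
    """
    conflicts = []
    count = len(uses_data_memory)
    for older in range(count):
        if not uses_data_memory[older]:
            continue
        mem_cycle = older + 1 + STAGES.index("MEM")  # cycle of the data access
        younger = mem_cycle - 1                      # instruction fetched that cycle
        if younger < count:
            conflicts.append((mem_cycle, older, younger))
    return conflicts


# A load followed by four ALU instructions.
print(memory_conflicts([True, False, False, False, False]))  # [(4, 0, 3)]
```

In cycle 4 the load reads data while the fourth instruction is being fetched, so the younger instruction must stall unless the memory is duplicated or dual-ported.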

Structural Hazard Example
• Resource conflict
• Access memory from two instructions at the same cycle
[Figure: five instructions, in instruction order, flowing through the five-stage pipeline (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC9]

Pipeline Stall Due to Structural Hazard
• Insert three bubbles
[Figure: the same pipeline diagram over CC1 to CC9 with bubbles inserted]

Structural Hazard Solution
• Duplicate resource
• Use dual port memory to support two simultaneous accesses
[Figure: the same pipeline diagram over CC1 to CC9 with no stall]

Data Hazard
• Data dependence
  • An instruction depends on the result of a previous one
• Data hazards occur due to data dependences
  • Read after Write (RAW): flow dependence (true dependence)
  • Write after Read (WAR): anti-dependence (false dependence)
  • Write after Write (WAW): output dependence (false dependence)
• Need to maintain the illusion of in-order, sequential execution
  • An instruction is fully executed before any following instruction begins

    add $t1, $t2, $t3
    sub $t4, $t1, $t5    # RAW (Read after Write) on $t1
    and $t5, $t6, $t7    # WAR (Write after Read) on $t5
    xor $t5, $t6, $t7    # WAW (Write after Write) on $t5
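The three dependence types can be found mechanically by comparing destination and source registers. This is an illustrative sketch, not a real assembler: `parse` and `dependences` are hypothetical helpers that only handle three-register instructions of the form `op dest, src1, src2`, and every earlier/later pair is compared, so dependences that skip over an intervening instruction are also reported.

```python
def parse(instruction):
    # Split "add $t1, $t2, $t3" into a destination and a set of sources.
    _, operands = instruction.split(None, 1)
    registers = [r.strip() for r in operands.split(",")]
    return registers[0], set(registers[1:])


def dependences(program):
    """Return (kind, earlier index, later index, register) tuples."""
    parsed = [parse(instruction) for instruction in program]
    found = []
    for later in range(len(parsed)):
        for earlier in range(later):
            dest_e, src_e = parsed[earlier]
            dest_l, src_l = parsed[later]
            if dest_e in src_l:
                found.append(("RAW", earlier, later, dest_e))
            if dest_l in src_e:
                found.append(("WAR", earlier, later, dest_l))
            if dest_e == dest_l:
                found.append(("WAW", earlier, later, dest_e))
    return found


program = [
    "add $t1, $t2, $t3",
    "sub $t4, $t1, $t5",
    "and $t5, $t6, $t7",
    "xor $t5, $t6, $t7",
]
for dependence in dependences(program):
    print(dependence)
# ('RAW', 0, 1, '$t1')
# ('WAR', 1, 2, '$t5')
# ('WAR', 1, 3, '$t5')
# ('WAW', 2, 3, '$t5')
```

The extra WAR between `sub` and `xor` appears because the sketch compares all pairs, not only neighbours.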

Data Hazard Example

    add $t1,$t2,$t3
    sub $t4,$t1,$t3
    and $t5,$t1,$t3
    or  $t6,$t1,$t3
    xor $t7,$t2,$t1
[Figure: these instructions in the five-stage pipeline (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC9. Inset: within one clock cycle the register file is first written ("Store into Reg") and then read ("Read from Reg")]

Pipeline Stall Due to Data Hazard

    add $t1,$t2,$t3
    sub $t4,$t1,$t3
    and $t5,$t1,$t3
    or  $t6,$t1,$t3
    xor $t7,$t2,$t1
[Figure: the same instructions in the five-stage pipeline (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC9, with bubbles inserted until the result of add is available]

Forwarding (1)
• Forwarding = Bypassing = Short-Circuiting
• Observation:
  • We don't need to wait for the instruction to complete.
  • As soon as the ALU computes the sum for the add, we can supply it as an input to the dependent instruction that follows
• No stalling after forwarding

    add $s0,$t0,$t1
    add $t2,$s0,$t3
[Figure: both instructions in program execution order passing through IF, ID, EX, MEM, WB on a time axis from 200 to 1000, with the ALU result of the first forwarded to the EX stage of the second]

Forwarding (2)
• Load-Use Data Hazard
• Stall even with forwarding

    lw  $s0,20($t1)
    sub $t2,$s0,$t3
[Figure: both instructions in program execution order passing through IF, ID, EX, MEM, WB on a time axis from 200 to 1200, with one bubble before the sub]

Another Forwarding Example

    add $t1,$t2,$t3
    sub $t4,$t1,$t3
    lw  $t6,20($t1)
    or  $t7,$t6,$t3
    xor $t5,$t2,$t6
[Figure: these instructions in the five-stage pipeline (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC9, with forwarding paths and one bubble]
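The stall counts in the forwarding slides can be reproduced with a small timing model. It is a sketch under stated assumptions, not the datapath itself: a five-stage pipeline, a register file written and then read within the same cycle, ALU results forwardable at the end of EX, and load results forwardable at the end of MEM. The tuple encoding `(dest, sources, is_load)` and the name `total_stalls` are invented for illustration.

```python
def total_stalls(program, forwarding):
    """Count bubble cycles for a list of (dest, sources, is_load) tuples."""
    producers = {}  # register -> (EX cycle of its latest writer, writer is a load)
    prev_ex = 2     # so the first instruction is in EX in cycle 3 (IF=1, ID=2)
    stalls = 0
    for dest, sources, is_load in program:
        earliest = prev_ex + 1  # EX cycle if nothing has to wait
        ready = earliest
        for src in sources:
            if src not in producers:
                continue
            producer_ex, producer_is_load = producers[src]
            if forwarding:
                gap = 2 if producer_is_load else 1  # load data arrives after MEM
            else:
                gap = 3  # wait until the producer's WB cycle to read in ID
            ready = max(ready, producer_ex + gap)
        stalls += ready - earliest
        prev_ex = ready
        if dest is not None:
            producers[dest] = (ready, is_load)
    return stalls


program = [
    ("$t1", ["$t2", "$t3"], False),  # add $t1,$t2,$t3
    ("$t4", ["$t1", "$t3"], False),  # sub $t4,$t1,$t3
    ("$t6", ["$t1"], True),          # lw  $t6,20($t1)
    ("$t7", ["$t6", "$t3"], False),  # or  $t7,$t6,$t3
    ("$t5", ["$t2", "$t6"], False),  # xor $t5,$t2,$t6
]
print(total_stalls(program, forwarding=True))   # 1
print(total_stalls(program, forwarding=False))  # 4
```

With forwarding, the only bubble left is the load-use case between `lw` and `or`.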

Control Hazard
• Need to change execution flow (change the PC)
  • Branch, jump, call, return, etc.
  • Also called branch hazard
• After instruction fetch at cycle 1, we still need to figure out
  • Whether the current instruction is a branch
  • Condition check (for conditional branches)
  • Target address calculation
• Even if we do all of these within one cycle with enough extra HW, a one-cycle stall is still unavoidable

    add $4, $5, $6
    beq $1, $2, 40
    or  $7, $8, $9
[Figure: the three instructions passing through IF, ID, EXE, MEM, WB with a bubble after beq; time labels 200 ps and 400 ps]

Branch Prediction for Control Hazard
• Simple branch prediction
  • Always predict that branches will be untaken
  • Pipeline stalls only when the prediction is incorrect
[Figure, first diagram: add $4, $5, $6; beq $1, $2, 40; bubble; or $7, $8, $9 passing through IF, ID, EXE, MEM, WB; time labels 200 ps and 400 ps]
[Figure, second diagram: add $4, $5, $6; beq $1, $2, 40; lw $3, 300($0) passing through IF, ID, EXE, MEM, WB with no bubble; time label 200 ps]
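The benefit of predicting branches as untaken can be estimated with an average cycles-per-instruction (CPI) calculation. The branch frequency and taken rate below are hypothetical values chosen only for illustration; they do not come from the slides. The sketch also assumes an ideal CPI of 1 and no other hazards.

```python
def cpi_with_branches(branch_fraction, taken_fraction, penalty_cycles, predict_untaken):
    """Average CPI when branches are the only source of stalls."""
    base_cpi = 1.0
    if predict_untaken:
        # Only mispredicted (taken) branches pay the penalty.
        stall_per_branch = taken_fraction * penalty_cycles
    else:
        # Without prediction, every branch stalls the pipeline.
        stall_per_branch = penalty_cycles
    return base_cpi + branch_fraction * stall_per_branch


# Hypothetical workload: 20% branches, half of them taken, one-cycle penalty.
always_stall = cpi_with_branches(0.2, 0.5, 1, predict_untaken=False)
predicted = cpi_with_branches(0.2, 0.5, 1, predict_untaken=True)
print(always_stall, predicted)  # about 1.2 versus about 1.1
```

A larger penalty, passed as `penalty_cycles`, widens the gap between the two cases.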

Control Hazard Example (original datapath)
• 2 cycles wasted for taken branches (in our multicycle datapath)

    bne $t5,0,0x5F
    add $t4,$t1,$t2
    lw  $t6,20($t4)
    or  $t7,$t2,$t3
[Figure: these instructions in the pipeline (IM, Reg, ALU, DM, Reg) over clock cycles CC1 to CC9 with bubbles after the branch, annotated in order: "Now we know the instruction being executed is a branch", "Now the target address is available", "Branch condition is available"]

Delayed Branch for Control Hazard
• Delayed branch
  • Solution used in MIPS
  • If you run SPIM in bare mode, you must pay attention to delayed branches
• Compiler fills delayed branch slots with instructions from multiple places

    beq $1, $2, L1
    add $4, $5, $6       # delayed branch slot
    or  $7, $8, $9
[Figure: the three instructions passing through IF, ID, EXE, MEM, WB with no bubble; time label 200 ps]

From the same basic block
  Before:
        add $4,$5,$6
        beq $1,$2,L1
        (delay slot)
    L1: or  $7,$8,$9
  After:
        beq $1,$2,L1
        add $4,$5,$6
    L1: or  $7,$8,$9

From the target basic block
  Before:
    L1: sub $7,$8,$9
        add $4,$5,$6
        beq $1,$2,L1
        (delay slot)
  After:
        sub $7,$8,$9
    L1: add $4,$5,$6
        beq $1,$2,L1
        sub $7,$8,$9

From the fall-thru basic block
  Before:
        beq $1,$2,L1
        (delay slot)
        sub $7,$8,$9
    L1:
  After:
        beq $1,$2,L1
        sub $7,$8,$9
    L1:

Instruction Scheduling
• Original schedule requires a one-cycle stall
• After rescheduling (by HW or compiler), the stall is removed

Original schedule:
    lw $t1, 0($t0)
    lw $t2, 4($t0)
    sw $t2, 0($t0)
    sw $t1, 4($t0)

After rescheduling:
    lw $t1, 0($t0)
    lw $t2, 4($t0)
    sw $t1, 4($t0)
    sw $t2, 0($t0)
[Figure: both schedules passing through IF, ID, EX, MEM, WB]
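The effect of rescheduling can be checked by counting load-use pairs in each schedule. This sketch follows the slide's model, in which an instruction that uses a register immediately after the `lw` that loads it costs one stall cycle; `load_use_stalls` is a hypothetical helper that only understands `lw`, `sw`, and register-destination instructions.

```python
import re

REGISTER = re.compile(r"\$\w+")


def load_use_stalls(schedule):
    """Count adjacent pairs where an instruction uses a just-loaded register."""
    stalls = 0
    for prev, curr in zip(schedule, schedule[1:]):
        if not prev.startswith("lw"):
            continue
        loaded = REGISTER.findall(prev)[0]  # destination of the load
        registers = REGISTER.findall(curr)
        # A store reads all of its registers; other instructions write the first.
        sources = registers if curr.startswith("sw") else registers[1:]
        if loaded in sources:
            stalls += 1
    return stalls


original = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "sw $t2, 0($t0)", "sw $t1, 4($t0)"]
rescheduled = ["lw $t1, 0($t0)", "lw $t2, 4($t0)", "sw $t1, 4($t0)", "sw $t2, 0($t0)"]
print(load_use_stalls(original))     # 1
print(load_use_stalls(rescheduled))  # 0
```

Swapping the two stores is safe here because they write different addresses and neither changes a register the other reads.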

Summary
• Pipelining
  • Executes multiple instructions at different steps
  • Improves instruction execution throughput
    • By overlapping different phases of multiple instructions
• Structural, data, and control hazards exist
  • Stalls are needed to make the pipeline work correctly
• Several techniques to reduce stalls
  • Capacity enhancement
  • Two-phase register accesses
  • Forwarding
  • Prediction
  • Delayed branch slot
  • Instruction scheduling
