
1 CMPUT429/CMPE382 Winter 2001, Topic 3 - Pipelining. José Nelson Amaral (adapted from David A. Patterson's CS252 lecture slides at Berkeley)

2 What is Pipelining? Pipelining is a key implementation technique used to build fast processors. It allows the execution of multiple instructions to overlap in time. A pipeline within a processor is similar to a car assembly line: each assembly station is called a pipe stage or a pipe segment. The throughput of an instruction pipeline is a measure of how often an instruction exits the pipeline.

3 Pipelining: It's Natural! Laundry example: Ann, Brian, Cathy, and Dave each have one load of clothes to wash, dry, and fold. The washer takes 30 minutes, the dryer takes 40 minutes, and the "folder" takes 20 minutes.

4 Sequential Laundry. Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take? [Figure: timeline from 6 PM to midnight; the four loads A, B, C, D run back to back, each taking 30 + 40 + 20 minutes.]

5 Pipelined Laundry: start each load as soon as possible. Pipelined laundry takes 3.5 hours for 4 loads. [Figure: timeline from 6 PM to midnight with the wash, dry, and fold stages of the four loads overlapped.] What is preventing them from doing it faster? How could we eliminate this limiting factor?

6 Pipelined Laundry: start each load as soon as possible. [Same figure as the previous slide.] Pipeline throughput: how many loads are completed per unit of time? Pipeline latency: once a load starts, how long does it take to complete?
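A minimal Python sketch, not part of the original slides, that checks the laundry arithmetic and answers the two questions, assuming the 40-minute dryer (the slowest stage) paces the pipelined line:

wash, dry, fold, loads = 30, 40, 20, 4          # minutes per stage, number of loads
sequential = loads * (wash + dry + fold)        # 4 * 90 = 360 min = 6 hours
pipelined = wash + loads * dry + fold           # 30 + 4*40 + 20 = 210 min = 3.5 hours
throughput = loads / pipelined                  # ~0.019 loads per minute overall
first_load_latency = wash + dry + fold          # 90 min; later loads wait for the dryer
print(sequential, pipelined, round(throughput, 3), first_load_latency)

The dryer is what keeps them from going faster: a second (or faster) dryer would shorten the 40-minute bottleneck stage.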

7 [Figure: MIPS datapath diagram showing the PC, memory, instruction register, register file, sign extender, ALU, and the multiplexers that connect them.]

8 5 Steps of MIPS Datapath (Figure 3.1, page 130, CA:AQA 2e). [Figure: the unpipelined datapath divided into five phases: Instruction Fetch, Instruction Decode / Register Fetch, Execute / Address Calculation, Memory Access, and Write Back.]

9 5 Steps of MIPS Datapath (Figure 3.4, page 134, CA:AQA 2e). [Figure: the same five-phase datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB inserted between the stages.] Data stationary control: the control signals are decoded locally for each instruction phase / pipeline stage.

10 Steps to Execute Each Instruction Type

11 Pipeline Stages. We can divide the execution of an instruction into the following stages: IF (Instruction Fetch), ID (Instruction Decode), EX (Execution), MEM (Memory Access), and WB (Write Back).

12 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Consider the pipeline above with the indicated delays. We want to know the pipeline throughput and the pipeline latency. Pipeline throughput: instructions completed per second. Pipeline latency: how long it takes to execute a single instruction in the pipeline.

13 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Pipeline throughput: how often an instruction is completed. Pipeline latency: how long it takes to execute an instruction in the pipeline. Is this right?

14 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. Simply adding the stage latencies to compute the pipeline latency would only work for an isolated instruction. [Timing diagram: L(I1) = 28 ns, L(I2) = 33 ns, L(I3) = 38 ns, L(I4) = 43 ns.] We are in trouble! The latency is not constant. This happens because this is an unbalanced pipeline. The solution is to make every stage the same length as the longest one.
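The growing latencies above can be reproduced with a small Python sketch, not part of the original slides. It assumes each stage starts a new instruction only after finishing the previous one (and, for simplicity, that an instruction waiting for a busy stage does not block the stage it just left; with these delays that simplification does not change the numbers):

delays = {"IF": 5, "ID": 4, "EX": 5, "MEM": 10, "WB": 4}   # ns per stage
free_at = {s: 0 for s in delays}          # time at which each stage becomes free
for i in range(1, 5):                     # instructions I1..I4
    start = t = free_at["IF"]             # enter IF as soon as IF is free
    for stage, d in delays.items():
        t = max(t, free_at[stage]) + d    # wait for the stage, then occupy it
        free_at[stage] = t
    print(f"I{i}: latency = {t - start} ns")
# Prints 28, 33, 38, 43 ns: instructions are fetched every 5 ns, but MEM can
# accept only one every 10 ns, so each instruction waits 5 ns longer than the last.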

15 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. The slowest pipeline stage also limits the latency! [Timing diagram, 0 to 60 ns: with every stage stretched to a 10 ns cycle, L(I1) = L(I2) = L(I3) = L(I4) = 50 ns.]

16 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, and hazards.) How long would it take using the same modules without pipelining?

17 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM = 10 ns, WB = 4 ns. The speedup that we get from the pipeline is worked out in the sketch below. How can we improve this pipeline design? We need to reduce the imbalance to increase the clock speed.
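A minimal Python sketch, not part of the original slides, that works out the two questions from the previous slide and the resulting speedup; it assumes no hazard bubbles and that the unpipelined machine spends the sum of the stage delays (28 ns) on every instruction:

delays = [5, 4, 5, 10, 4]                     # IF, ID, EX, MEM, WB in ns
clock = max(delays)                           # 10 ns cycle, set by the slowest stage
n = 20000
pipelined = (n + len(delays) - 1) * clock     # fill the pipe, then 1 instruction/cycle
unpipelined = n * sum(delays)
print(pipelined, unpipelined)                 # 200040 ns vs 560000 ns
print(round(unpipelined / pipelined, 2))      # speedup of roughly 2.8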

18 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. Splitting MEM into MEM1 and MEM2 gives us one more pipeline stage, but the maximum latency of a single stage is cut in half. The new latency for a single instruction is 6 stages × 5 ns = 30 ns.

19 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. [Timing diagram: instructions I1 through I7 flow through the six-stage pipeline IF, ID, EX, MEM1, MEM2, WB, with one instruction completing every cycle once the pipeline is full.]

20 Pipeline Throughput and Latency. Stage delays: IF = 5 ns, ID = 4 ns, EX = 5 ns, MEM1 = 5 ns, MEM2 = 5 ns, WB = 4 ns. How long does it take to execute 20000 instructions in this pipeline? (Disregard bubbles caused by branches, cache misses, etc., for now.) The speedup that we get from the pipeline is worked out in the sketch below.
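The same sketch adapted to the six-stage pipeline above (again not part of the original slides, same assumptions):

delays = [5, 4, 5, 5, 5, 4]                   # IF, ID, EX, MEM1, MEM2, WB in ns
clock = max(delays)                           # 5 ns cycle
n = 20000
pipelined = (n + len(delays) - 1) * clock     # 100025 ns
unpipelined = n * 28                          # same modules, no pipelining
print(round(unpipelined / pipelined, 2))      # roughly 5.6x over no pipelining
print(round(200040 / pipelined, 2))           # roughly 2.0x over the 5-stage pipeline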

21 Pipeline Throughput and Latency. What have we learned from this example? 1. It is important to balance the delays in the stages of the pipeline. 2. The throughput of a pipeline is 1/max(delay). 3. The latency is N × max(delay), where N is the number of stages in the pipeline.

22 Pipelining Lessons. [Figure: the pipelined laundry timeline again.] Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload. The pipeline rate is limited by the slowest pipeline stage. Multiple tasks operate simultaneously. Potential speedup = number of pipe stages. Unbalanced lengths of pipe stages reduce the speedup. Time to "fill" and "drain" the pipeline reduces the speedup.

23 Computer Pipelines. Processors execute billions of instructions, so throughput is what matters. Desirable instruction-set features: all instructions have the same length, registers are located in the same place in the instruction format, and memory operands appear only in loads or stores.

24 Visualizing Pipelining (Figure 3.3, page 133, CA:AQA 2e). [Figure: instructions in program order down the page versus clock cycles 1 through 7 across the page; each instruction passes through Ifetch, Reg, ALU, DMem, and Reg in successive cycles, staggered one cycle behind the previous instruction.]

25 It's Not That Easy for Computers. Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle. Structural hazards: the hardware cannot support this combination of instructions (a single person to fold and put clothes away). Data hazards: an instruction depends on the result of a prior instruction still in the pipeline (missing sock). Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps).

26 One Memory Port / Structural Hazards (Figure 3.6, page 142, CA:AQA 2e). [Figure: a Load followed by Instr 1 through Instr 4; in cycle 4 the Load's data-memory access and Instr 3's instruction fetch both need the single memory port.]

27 One Memory Port / Structural Hazards (Figure 3.7, page 143, CA:AQA 2e). [Figure: the same sequence with Instr 3 stalled for one cycle; a bubble moves down the pipeline so only one instruction uses the memory port in each cycle.]

28 Data Hazard on R1 (Figure 3.9, page 147, CA:AQA 2e). [Figure: the sequence add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 flowing through the IF, ID/RF, EX, MEM, WB stages; the add does not write r1 until WB, yet the following instructions read r1 in earlier cycles.]

29 Three Generic Data Hazards. Read After Write (RAW): instruction J tries to read an operand before instruction I writes it. Example: I: add r1,r2,r3; J: sub r4,r1,r3. Caused by a "dependence" (in compiler nomenclature); this hazard results from an actual need for communication.

30 Three Generic Data Hazards. Write After Read (WAR): instruction J writes an operand before instruction I reads it. Example: I: sub r4,r1,r3; J: add r1,r2,r3; K: mul r6,r1,r7. Called an "anti-dependence" by compiler writers; it results from reuse of the name "r1".

31 Three Generic Data Hazards. Write After Write (WAW): instruction J writes an operand before instruction I writes it. Example: I: sub r1,r4,r3; J: add r1,r2,r3; K: mul r6,r1,r7. Called an "output dependence" by compiler writers; this also results from reuse of the name "r1".

32 Three Generic Data Hazards. WAW can't happen in the MIPS 5-stage pipeline because all instructions take 5 stages and writes always occur in stage 5.
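As a summary of the three hazard classes, here is a minimal Python sketch, not part of the original slides, that classifies the dependences between an earlier instruction I and a later instruction J from the register sets each one reads and writes (the register names are only illustrative):

def classify(i_reads, i_writes, j_reads, j_writes):
    hazards = []
    if i_writes & j_reads:
        hazards.append("RAW")   # J reads what I writes: true dependence
    if i_reads & j_writes:
        hazards.append("WAR")   # J writes what I reads: anti-dependence
    if i_writes & j_writes:
        hazards.append("WAW")   # J writes what I also writes: output dependence
    return hazards

# I: add r1,r2,r3   J: sub r4,r1,r3  ->  ['RAW'] on r1
print(classify({"r2", "r3"}, {"r1"}, {"r1", "r3"}, {"r4"}))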

33 Forwarding to Avoid Data Hazard. [Figure: the same add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11 sequence, with the add's result forwarded directly to the ALU inputs of the following instructions instead of waiting for the register write.]

34 HW Change for Forwarding (Figure 3.20, page 161, CA:AQA 2e).
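The figure itself is not in the transcript, but the usual forwarding test that such hardware implements can be sketched in Python (an illustration under standard assumptions, not the textbook's exact circuit; the field names are made up): prefer the newest result, taking it from the EX/MEM pipeline register if possible, otherwise from MEM/WB, otherwise from the register file.

def forward_a(ex_mem, mem_wb, id_ex_rs):
    # Decide where the first ALU operand of the instruction in EX comes from.
    if ex_mem["regwrite"] and ex_mem["rd"] != 0 and ex_mem["rd"] == id_ex_rs:
        return "EX/MEM"       # ALU result of the immediately preceding instruction
    if mem_wb["regwrite"] and mem_wb["rd"] != 0 and mem_wb["rd"] == id_ex_rs:
        return "MEM/WB"       # value that is about to be written back
    return "REGFILE"          # no hazard: use the value read in ID

# add r1,r2,r3 followed by sub r4,r1,r3: the sub's rs matches rd in EX/MEM.
print(forward_a({"regwrite": True, "rd": 1}, {"regwrite": False, "rd": 0}, 1))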

35 Data Hazard Even with Forwarding (Figure 3.12, page 153, CA:AQA 2e). [Figure: the sequence lw r1,0(r2); sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9; the loaded value is not available until the end of MEM, too late to forward to the sub's EX stage in the following cycle.]

36 Data Hazard Even with Forwarding (Figure 3.13, page 154, CA:AQA 2e). [Figure: the same sequence with the sub stalled for one cycle so that the loaded value can be forwarded to it.]

37 Software Scheduling to Avoid Load Hazards. Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd

38 Why is the fast code faster?
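Assuming full forwarding and the one-cycle load delay shown on the previous two slides, the slow code has two load-use stalls (ADD right after LW Rc, and SUB right after LW Rf), while the fast code has none. A minimal Python sketch, not part of the original slides, that counts those stalls (each instruction is written as opcode, destination, source registers):

def load_use_stalls(code):
    stalls = 0
    for prev, curr in zip(code, code[1:]):
        if prev[0] == "LW" and prev[1] in curr[2]:   # loaded value used immediately
            stalls += 1
    return stalls

slow = [("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
        ("SW", None, ("Ra",)), ("LW", "Re", ()), ("LW", "Rf", ()),
        ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",))]
fast = [("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
        ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()),
        ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")),
        ("SW", None, ("Rd",))]
print(load_use_stalls(slow), load_use_stalls(fast))   # 2 vs 0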

39 Control Hazard on Branches: Three-Stage Stall. [Figure: the sequence 10: beq r1,r3,36; 14: and r2,r3,r5; 18: or r6,r1,r7; 22: add r8,r1,r9; 36: xor r10,r1,r11; the three instructions after the beq are fetched before the branch outcome is known.]

40 Example: Branch Stall Impact. If 30% of the instructions are branches, a stall of 3 cycles is significant. Two-part solution: determine whether the branch is taken or not sooner, AND compute the taken-branch address earlier. MIPS branches test whether a register = 0 or ≠ 0. MIPS solution: move the zero test to the ID/RF stage and add an adder to calculate the new PC in the ID/RF stage; this gives a 1 clock-cycle penalty for branches instead of 3.
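A quick check of the numbers above, not part of the original slides, assuming a base CPI of 1 and that every branch pays the full penalty:

branch_freq = 0.30
print(1 + branch_freq * 3)   # 1.9: a 3-cycle penalty nearly doubles the CPI
print(1 + branch_freq * 1)   # 1.3 once the zero test and PC adder move to ID/RF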

41 Pipelined MIPS Datapath (Figure 3.22, page 163, CA:AQA 2e). [Figure: the five-stage pipelined datapath with the zero test and a dedicated adder moved into the ID/RF stage, so the branch condition and target are known at the end of ID.] Data stationary control: the control signals are decoded locally for each instruction phase / pipeline stage.

42 Four Branch Hazard Alternatives. #1: Stall until the branch direction is clear. #2: Predict Branch Not Taken: execute the successor instructions in sequence and "squash" them in the pipeline if the branch is actually taken (possible because the pipeline state is updated late); 47% of MIPS branches are not taken on average, and PC+4 is already calculated, so use it to fetch the next instruction. #3: Predict Branch Taken: 53% of MIPS branches are taken on average, but the branch target address has not yet been calculated in MIPS, so MIPS still incurs a 1-cycle branch penalty; other machines know the branch target before the outcome.
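Using the figures quoted on this slide (53% of MIPS branches taken, a 1-cycle penalty once the branch resolves in ID), here is a minimal Python sketch, not part of the original slides, of the average penalty per branch for the first three alternatives:

p_taken = 0.53
stall             = 1.0             # #1: always wait one cycle for the direction
predict_not_taken = p_taken * 1.0   # #2: squash (and pay) only when taken: ~0.53
predict_taken     = 1.0             # #3: MIPS has no early target, so it still
                                    #     pays one cycle whichever way the branch goes
print(stall, predict_not_taken, predict_taken)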

43 Four Branch Hazard Alternatives. #4: Delayed Branch. Define the branch to take place AFTER the following instructions: branch instruction; sequential successor 1; sequential successor 2; ...; sequential successor n; branch target if taken. This is a branch delay of length n. A 1-slot delay allows a proper decision and the branch target address to be available in the 5-stage pipeline. MIPS uses this.

44 Delayed Branch. Where do we get instructions to fill the branch delay slot? From before the branch instruction; from the target address (only valuable when the branch is taken); or from the fall-through path (only valuable when the branch is not taken). Canceling branches allow more slots to be filled: a canceling branch only executes the instruction in the delay slot if the predicted direction is correct.

45 Fig. 3.28

46 Delayed Branch. Compiler effectiveness for a single branch delay slot: the compiler fills about 60% of branch delay slots; about 80% of the instructions executed in branch delay slots do useful computation; so about 50% (60% × 80%) of slots are usefully filled. Delayed-branch downside: it becomes less attractive with 7-8 stage pipelines and machines that issue multiple instructions per clock (superscalar).

