Presentation is loading. Please wait.

Presentation is loading. Please wait.

CPE 631 Review: Pipelining Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,

Similar presentations


Presentation on theme: "CPE 631 Review: Pipelining Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,"— Presentation transcript:

1

2 CPE 631 Review: Pipelining Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic, milenka@ece.uah.edumilenka@ece.uah.edu http://www.ece.uah.edu/~milenka

3  AM LaCASALaCASA 2 Outline Pipelined Execution 5 Steps in MIPS Datapath Pipeline Hazards Structural Data Control

4  AM LaCASALaCASA 3 Laundry Example (by David Patterson) Four loads of clothes: A, B, C, D Task: each one to wash, dry, and fold Resources Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes ABCD

5  AM LaCASALaCASA 4 Sequential Laundry Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take? A B C D 304020304020304020304020 6 PM 789 10 11 Midnight TaskOrderTaskOrder Time

6  AM LaCASALaCASA 5 Pipelined Laundry Pipelined laundry takes 3.5 hours for 4 loads A B C D 6 PM 789 10 11 Midnight TaskOrderTaskOrder Time 3040 20

7  AM LaCASALaCASA 6 Pipelining Lessons Pipelining doesn’t help latency of single task, it helps throughput of entire workload Pipeline rate is limited by slowest pipeline stage Multiple tasks operating simultaneously Potential speedup = Number pipe stages Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” reduce speedup ABCD 6 PM 789 TaskOrderTaskOrder Time 3040 20

8  AM LaCASALaCASA 7 Computer Pipelines Execute billions of instructions, so throughput is what matters What is desirable in instruction sets for pipelining? Variable length instructions vs. all instructions same length? Memory operands part of any operation vs. memory operands only in loads or stores? Register operand many places in instruction format vs. registers located in same place?

9  AM LaCASALaCASA 8 A "Typical" RISC Registers 32 64-bit general-purpose (integer) registers (R0-R31) 32 64-bit floating-point registers (F0-F31) Data types 8-bit bytes, 16-bit half-words, 32-bit words, 64-bit double words for integer data 32-bit single- or 64-bit double-precision numbers Addressing Modes for MIPS Data Transfers Load-store architecture: Immediate, Displacement Memory is byte addressable with a 64-bit address Mode bit to select Big Endian or Little Endian

10  AM LaCASALaCASA 9 MIPS64 Instruction Formats Op 312601516202125 RsRt immediate Op 3126025 Op 312601516202125 RsRt address Rdfunct Register-Register 56 1011 Register-Immediate Jump / Call shamt Op 312601516202125 FmtFtFsfunct 56 1011 Fd Floating-point (FR) Op 312601516202125 FmtFt immediate Floating-point (FI)

11  AM LaCASALaCASA 10 MIPS64 Instructions MIPS Operations (See Appendix B, Figure B.26) Data Transfers (LB, LBU, SB, LH, LHU, SH, LW, LWU, SW, LD, SD, L.S, L.D, S.S, S.D, MFCO, MTCO, MOV.S, MOV.D, MFC1, MTC1) Arithmetic/Logical (DADD, DADDI, DADDU, DADDIU, DSUB, DSUBU, DMUL, DMULU, DDIV, DDIVU, MADD, AND, ANDI, OR, ORI, XOR, XORI, LUI, DSLL, DSRL, DSRA, DSLLV, DSRLV, DSRAV, SLT, SLTI, SLTU, SLTIU) Control (BEQZ, BNEZ, BEQ, BNE, BC1T, BC1F, MOVN, MOVZ, J, JR, JAL, JALR, TRAP, ERET) Floating Point (ADD.D, ADD.S, ADD.PS, SUB.D, SUB.S, SUB.PS, MUL.D, MUL.S, MUL.PS, MADD.D, MADD.S, MADD.PS, DIV.D, DIV.S, DIV.PS, CVT._._, C._.D, C._.S

12  AM LaCASALaCASA 11 5 Steps of Simple RISC Datapath Memory Access Write Back Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc LMDLMD ALU MUX Memory Reg File MUX Data Memory MUX Sign Extend 4 Adder Zero? Next SEQ PC Address Next PC WB Data Inst RD RS1 RS2 Imm

13  AM LaCASALaCASA 12 5 Steps of Simple RISC Datapath (cont’d) Memory Access Write Back Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc ALU Memory Reg File MUX Data Memory MUX Sign Extend Zero? IF/ID ID/EX MEM/WB EX/MEM 4 Adder Next SEQ PC RD WB Data Data stationary control – local decode for each instruction phase / pipeline stage Next PC Address RS1 RS2 Imm MUX

14  AM LaCASALaCASA 13 Visualizing Pipeline Reg ALU DM IM Reg ALU DM IM Reg ALU DM IM Reg I n s t r. O r d e r Time (clock cycles) Reg ALU DM IM Reg CC 2CC 3CC 4CC 6CC 7CC 5 CC 1

15  AM LaCASALaCASA 14 Instruction Flow through Pipeline Reg ALU DM IM Reg CC 2 CC 3 CC 1 Reg ALU DM IM Reg ALU DM IM Reg Add R1,R2,R3 Nop Add R1,R2,R3 Nop Lw R4,0(R2) Add R1,R2,R3 Nop Lw R4,0(R2) Sub R6,R5,R7 Reg ALU DM IM Reg Add R1,R2,R3 Lw R4,0(R2) Sub R6,R5,R7 CC 4 Xor R9,R8,R1 Time (clock cycles)

16  AM LaCASALaCASA 15 Simple RISC Pipeline Definition: IF, ID Stage IF IF/ID.IR  Mem[PC]; if EX/MEM.cond {IF/ID.NPC, PC  EX/MEM.ALUOUT} else {IF/ID.NPC, PC  PC + 4}; Stage ID ID/EX.A  Regs[IF/ID.IR 6…10 ]; ID/EX.B  Regs[IF/ID.IR 11…15 ]; ID/EX.Imm  (IF/ID.IR 16 ) 16 ## IF/ID.IR 16…31 ; ID/EX.NPC  IF/ID.NPC; ID/EX.IR  IF/ID.IR;

17  AM LaCASALaCASA 16 Simple RISC Pipeline Definition: IE ALU EX/MEM.IR  ID/EX.IR; EX/MEM.ALUOUT  ID/EX.A func ID/EX.B; or EX/MEM.ALUOUT  ID/EX.A func ID/EX.Imm; EX/MEM.cond  0; load/store EX/MEM.IR  ID/EX.IR; EX/MEM.B  ID/EX.B; EX/MEM.ALUOUT  ID/EX.A  ID/EX.Imm; EX/MEM.cond  0; branch EX/MEM.Aluout  ID/EX.NPC  (ID/EX.Imm<< 2); EX/MEM.cond  (ID/EX.A func 0);

18  AM LaCASALaCASA 17 Simple RISC Pipeline Def.: MEM, WB Stage MEM ALU MEM/WB.IR  EX/MEM.IR; MEM/WB.ALUOUT  EX/MEM.ALUOUT; load/store MEM/WB.IR  EX/MEM.IR; MEM/WB.LMD  Mem[EX/MEM.ALUOUT] or Mem[EX/MEM.ALUOUT]  EX/MEM.B; Stage WB ALU Regs[MEM/WB.IR 16…20 ]  MEM/WB.ALUOUT; or Regs[MEM/WB.IR 11…15 ]  MEM/WB.ALUOUT; load Regs[MEM/WB.IR 11…15 ]  MEM/WB.LMD;

19  AM LaCASALaCASA 18 Its Not That Easy for Computers Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle Structural hazards: HW cannot support this combination of instructions Data hazards: Instruction depends on result of prior instruction still in the pipeline Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

20  AM LaCASALaCASA 19 One Memory Port/Structural Hazards I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Instr 3 Instr 4 Reg ALU DMem Ifetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMem Ifetch Reg Cycle 1Cycle 2Cycle 3Cycle 4Cycle 6Cycle 7Cycle 5 Reg ALU DMemIfetch Reg

21  AM LaCASALaCASA 20 One Memory Port/Structural Hazards (cont’d) I n s t r. O r d e r Time (clock cycles) Load Instr 1 Instr 2 Stall Instr 3 Reg ALU DMem Ifetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg Cycle 1Cycle 2Cycle 3Cycle 4Cycle 6Cycle 7Cycle 5 Reg ALU DMemIfetch Reg Bubble

22  AM LaCASALaCASA 21 I n s t r. O r d e r dadd r1,r2,r3 dsub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg Data Hazard on R1 Time (clock cycles) IFID/RF EX MEM WB

23  AM LaCASALaCASA 22 Three Generic Data Hazards Read After Write (RAW) InstrJ tries to read operand before InstrI writes it Caused by a “Dependence” (in compiler nomenclature). This hazard results from an actual need for communication. I: add r1,r2,r3 J: sub r4,r1,r3

24  AM LaCASALaCASA 23 Three Generic Data Hazards Write After Read (WAR) InstrJ writes operand before InstrI reads it Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”. Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Reads are always in stage 2, and Writes are always in stage 5 I: sub r4,r1,r3 J: add r1,r2,r3 K: mul r6,r1,r7

25  AM LaCASALaCASA 24 Three Generic Data Hazards Write After Write (WAW) InstrJ writes operand before InstrI writes it. Called an “output dependence” by compiler writers This also results from the reuse of name “r1”. Can’t happen in MIPS 5 stage pipeline because: All instructions take 5 stages, and Writes are always in stage 5 I: sub r1,r4,r3 J: add r1,r2,r3 K: mul r6,r1,r7

26  AM LaCASALaCASA 25 Time (clock cycles) Forwarding to Avoid Data Hazard I n s t r. O r d e r add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11 Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg

27  AM LaCASALaCASA 26 HW Change for Forwarding MEM/WR ID/EX EX/MEM Data Memory ALU mux Registers NextPC Immediate mux

28  AM LaCASALaCASA 27 Forwarding to DM input lw R4,0(R1) sw 12(R1),R4 add R1,R2,R3 I n s t. O r d e r Time (clock cycles) - Forward R1 from EX/MEM.ALUOUT to ALU input (lw) - Forward R1 from MEM/WB.ALUOUT to ALU input (sw) - Forward R4 from MEM/WB.LMD to memory input (memory output to memory input) Reg ALU DM IM Reg ALU DM IM Reg ALU DM IM Reg CC 2CC 3CC 4CC 6CC 7CC 5 CC 1

29  AM LaCASALaCASA 28 Forwarding to DM input (cont’d) sw 0(R4),R1 add R1,R2,R3 I n s t. O r d e r Time (clock cycles) Forward R1 from MEM/WB.ALUOUT to DM input Reg ALU DM IM Reg ALU DM IM Reg CC 2CC 3CC 4CC 6CC 5 CC 1

30  AM LaCASALaCASA 29 Forwarding to Zero beqz R1,50 add R1,R2,R3 InstructionOrderInstructionOrder Time (clock cycles) Forward R1 from EX/MEM.ALUOUT to Zero sub R4,R5,R6 bneq R1,50 add R1,R2,R3 Forward R1 from MEM/WB.ALUOUT to Zero Reg ALU DM IM Reg ALU DM IM Reg CC 2 CC 3 CC 4 CC 6 CC 5 CC 1 Reg ALU DM IM Reg ALU DM IM Reg ALU DM IM Reg Z Z

31  AM LaCASALaCASA 30 Time (clock cycles) I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 or r8,r1,r9 Data Hazard Even with Forwarding Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg

32  AM LaCASALaCASA 31 Data Hazard Even with Forwarding Time (clock cycles) or r8,r1,r9 I n s t r. O r d e r lw r1, 0(r2) sub r4,r1,r6 and r6,r1,r7 Reg ALU DMemIfetch Reg Ifetch ALU DMem Reg Bubble Ifetch ALU DMem Reg Bubble Reg Ifetch ALU DMem Bubble Reg

33  AM LaCASALaCASA 32 Try producing fast code for a = b + c; d = e – f; assuming a, b, c, d,e, and f in memory. Slow code: LW Rb,b LW Rc,c ADD Ra,Rb,Rc SW a,Ra LW Re,e LW Rf,f SUB Rd,Re,Rf SWd,Rd Software Scheduling to Avoid Load Hazards Fast code: LW Rb,b LW Rc,c LW Re,e ADD Ra,Rb,Rc LW Rf,f SW a,Ra SUB Rd,Re,Rf SWd,Rd

34  AM LaCASALaCASA 33 Control Hazard on Branches Three Stage Stall 10: beq r1,r3,36 14: and r2,r3,r5 18: or r6,r1,r7 22: add r8,r1,r9 36: xor r10,r1,r11 Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg ALU DMemIfetch Reg

35  AM LaCASALaCASA 34 Example: Branch Stall Impact If 30% branch, Stall 3 cycles significant Two part solution: Determine branch taken or not sooner, AND Compute taken branch address earlier MIPS branch tests if register = 0 or  0 MIPS Solution: Move Zero test to ID/RF stage Adder to calculate new PC in ID/RF stage 1 clock cycle penalty for branch versus 3

36  AM LaCASALaCASA 35 Adder IF/ID Pipelined Simple RISC Datapath Memory Access Write Back Instruction Fetch Instr. Decode Reg. Fetch Execute Addr. Calc ALU Memory Reg File MUX Data Memory MUX Sign Extend Zero? MEM/WB EX/MEM 4 Adder Next SEQ PC RD WB Data Data stationary control – local decode for each instruction phase / pipeline stage Next PC Address RS1 RS2 Imm MUX ID/EX

37  AM LaCASALaCASA 36 Four Branch Hazard Alternatives #1: Stall until branch direction is clear #2: Predict Branch Not Taken Execute successor instructions in sequence “Squash” instructions in pipeline if branch actually taken Advantage of late pipeline state update 47% MIPS branches not taken on average PC+4 already calculated, so use it to get next instruction

38  AM LaCASALaCASA 37 Branch not Taken Time [clocks] MemIFIDExWB IFID branch (not taken) 5 IFIDExMemWB ExMemWB I i+1 I i+2 MemIFIDExWB Instructions branch (taken) 5 IFidle IFID ExMemWB I i+1 branch target IFID ExMemWB branch target+1 Branch is untaken (determined during ID), we have fetched the fall- through and just continue  no wasted cycles Branch is taken (determined during ID), restart the fetch from at the branch target  one cycle wasted

39  AM LaCASALaCASA 38 Four Branch Hazard Alternatives #3: Predict Branch Taken Treat every branch as taken 53% MIPS branches taken on average But haven’t calculated branch target address in MIPS MIPS still incurs 1 cycle branch penalty Make sense only when branch target is known before branch outcome

40  AM LaCASALaCASA 39 Four Branch Hazard Alternatives #4: Delayed Branch Define branch to take place AFTER a following instruction branch instruction sequential successor1 sequential successor2........ sequential successorn branch target if taken 1 slot delay allows proper decision and branch target address in 5 stage pipeline MIPS uses this Branch delay of length n

41  AM LaCASALaCASA 40 Delayed Branch Where to get instructions to fill branch delay slot? Before branch instruction From the target address: only valuable when branch taken From fall through: only valuable when branch not taken

42  AM LaCASALaCASA 41 Scheduling the branch delay slot: From Before Delay slot is scheduled with an independent instruction from before the branch Best choice, always improves performance ADDR1,R2,R3 if(R2=0) then Becomes if(R2=0) then

43  AM LaCASALaCASA 42 Scheduling the branch delay slot: From Target Delay slot is scheduled from the target of the branch Must be OK to execute that instruction if branch is not taken Usually the target instruction will need to be copied because it can be reached by another path  programs are enlarged Preferred when the branch is taken with high probability SUB R4,R5,R6... ADD R1,R2,R3 if(R1=0) then Becomes... ADD R1,R2,R3 if(R2=0) then

44  AM LaCASALaCASA 43 Scheduling the branch delay slot: From Fall Through Delay slot is scheduled from the taken fall through Must be OK to execute that instruction if branch is taken Improves performance when branch is not taken ADDR1,R2,R3 if(R2=0) then SUB R4,R5,R6 Becomes ADDR1,R2,R3 if(R2=0) then

45  AM LaCASALaCASA 44 Delayed Branch Effectiveness Compiler effectiveness for single branch delay slot: Fills about 60% of branch delay slots About 80% of instructions executed in branch delay slots useful in computation About 50% (60% x 80%) of slots usefully filled Delayed Branch downside: 7-8 stage pipelines, multiple instructions issued per clock (superscalar)

46  AM LaCASALaCASA 45 Example: Branch Stall Impact Assume CPI = 1.0 ignoring branches Assume solution was stalling for 3 cycles If 30% branch, Stall 3 cycles OpFreqCyclesCPI(i)(% Time) Other 70%1.7(37%) Branch30%4 1.2(63%) => new CPI = 1.9, or almost 2 times slower

47  AM LaCASALaCASA 46 Example 2: Speed Up Equation for Pipelining For simple RISC pipeline, CPI = 1:

48  AM LaCASALaCASA 47 Example 3: Evaluating Branch Alternatives (for 1 program) SchedulingBranchCPIspeedup v. scheme penaltystall Stall pipeline31.421.0 Predict taken11.141.26 Predict not taken11.091.29 Delayed branch0.51.071.31 Conditional & Unconditional = 14%, 65% change PC

49  AM LaCASALaCASA 48 Example 4: Dual-port vs. Single-port Machine A: Dual ported memory (“Harvard Architecture”) Machine B: Single ported memory, but its pipelined implementation has a 1.05 times faster clock rate Ideal CPI = 1 for both Loads&Stores are 40% of instructions executed

50  AM LaCASALaCASA 49 Extended Simple RISC Pipeline IF ID EX Int EX FP/I Mult EX FP Add EX FP/I Div Mem WB DLX pipe with three unpipelined, FP functional units In reality, the intermediate results are probably not cycled around the EX unit; instead the EX stages has some number of clock delays larger than 1

51  AM LaCASALaCASA 50 Extended Simple RISC Pipeline (cont’d) Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result Functional unitLatencyInitiation interval Integer ALU01 Data Memory11 FP Add31 FP/Integer Multiply61 FP/Integer Divide2425

52  AM LaCASALaCASA 51 Extended Simple RISC Pipeline (cont’d) IFID Ex M1M2M3M4M5M6M7 A1A2A3.. M WB A4

53  AM LaCASALaCASA 52 Extended Simple RISC Pipeline (cont’d) Multiple outstanding FP operations FP/I Adder and Multiplier are fully pipelined FP/I Divider is not pipelined Pipeline timing for independent operations MUL.DIFIDM1M2M3M4M5M6M7MemWB ADD.DIFIDA1A2A3A4MemWB L.DIFIDExMemWB S.DIFIDExMemWB

54  AM LaCASALaCASA 53 Hazards and Forwarding in Longer Pipes Structural hazard: divide unit is not fully pipelined detect it and stall the instruction Structural hazard: number of register writes can be larger than one due to varying running times WAW hazards are possible Exceptions! instructions can complete in different order than they were issued RAW hazards will be more frequent

55  AM LaCASALaCASA 54 Examples Stalls arising from RAW hazards Three instructions that want to perform a write back to the FP register file simultaneously L.D F4, 0(R2)IFIDEXMemWB MUL.D F0, F4, F6 IFIDstallM1M2M3M4M5M6M7MemWB ADD.D F2, F0, F8 IFstallIDstall A1A2A3A4MemWB S.D 0(R2), F2IFstall IDEXstall Mem MUL.D F0, F4, F6 IFIDM1M2M3M4M5M6M7MemWB...IFIDEXMemWB...IFIDEXMemWB ADD.D F2, F4, F6 IFIDA1A2A3A4MemWB...IFIDEXMemWB...IFIDEXMemWB L.D F2, 0(R2)IFIDEXMemWB

56  AM LaCASALaCASA 55 Solving Register Write Conflicts First approach: track the use of the write port in the ID stage and stall an instruction before it issues use a shift register that indicates when already issued instructions will use the register file if there is a conflict with an already issued instruction, stall the instruction for one clock cycle on each clock cycle the reservation register is shifted one bit Alternative approach: stall a conflicting instruction when it tries to enter MEM or WB stage we can stall either instruction e.g. give priority to the unit with the longest latency Pros: does not require to detect the conflict until the entrance of MEM or WB stage Cons: complicates pipeline control; stalls now can arise from two different places

57  AM LaCASALaCASA 56 WAW Hazards Result of ADD.D is overwritten without any instruction ever using it WAWs occur when useless instruction is executed still, we must detect them and provide correct execution Why? IFIDEXMemWB ADD.D F2, F4, F6IFIDA1A2A3A4MemWB IFIDEXMemWB L.D F2, 0(R2)IFIDEXMemWB BNEZR1, foo DIV.DF0, F2, F4 ; delay slot from fall-through... foo:L.D F0, qrs

58  AM LaCASALaCASA 57 Solving WAW Hazards First approach: delay the issue of load instruction until ADD.D enters MEM Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that ADDD does not write; LD issues right away Detect hazard in ID when LD is issuing stall LD, or make ADDD no-op Luckily this hazard is rare

59  AM LaCASALaCASA 58 Hazard Detection in ID Stage Possible hazards hazards among FP instructions hazards between an FP instruction and an integer instr. FP and integer registers are distinct, except for FP load-stores, and FP-integer moves Assume that pipeline does all hazard detection in ID stage

60  AM LaCASALaCASA 59 Hazard Detection in ID Stage (cont’d) Check for structural hazards wait until the required functional unit is not busy and make sure that the register write port is available Check for RAW data hazards wait until source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result Check for WAW data hazards determine if any instruction in A1,.. A4, M1,.. M7, D has the same register destination as this instruction; if so, stall the issue of the instruction in ID

61  AM LaCASALaCASA 60 Forwarding Logic Check if the destination register in any of EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of a FP instruction If so, the appropriate input multiplexer will have to be enabled so as to choose the forwarded data


Download ppt "CPE 631 Review: Pipelining Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic,"

Similar presentations


Ads by Google