CPE 631 Review: Pipelining

Aleksandar Milenkovic, Electrical and Computer Engineering, University of Alabama in Huntsville

Outline
- Pipelined Execution
- 5 Steps in MIPS Datapath
- Pipeline Hazards
  - Structural
  - Data
  - Control
Here is the outline of today’s lecture. First, we review pipelined execution. Then we walk through the five steps of the MIPS datapath. Finally, we consider the three classes of pipeline hazards: structural, data, and control.

Laundry Example (by David Patterson)
Four loads of clothes: A, B, C, D
Task: each one to wash, dry, and fold
Resources:
- Washer takes 30 minutes
- Dryer takes 40 minutes
- “Folder” takes 20 minutes

Sequential Laundry
[Figure: task order (loads A, B, C, D) against time, 6 PM to midnight; each load takes 30 + 40 + 20 minutes, one load after another.]
Sequential laundry takes 6 hours for 4 loads.
If they learned pipelining, how long would laundry take?

Pipelined Laundry
[Figure: task order (loads A, B, C, D) against time, 6 PM to midnight; the 30-, 40-, and 20-minute stages of successive loads overlap.]
Pipelined laundry takes 3.5 hours for 4 loads.

Pipelining Lessons
- Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
- Pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce speedup
- Time to “fill” the pipeline and time to “drain” it reduce speedup
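The laundry numbers can be checked with a short calculation. The sketch below is illustrative only: the function names are hypothetical, and the pipelined model assumes each machine handles one load at a time and a load moves on as soon as the next machine is free.

```python
def sequential_minutes(stage_minutes, loads):
    """Each load finishes all stages before the next load starts."""
    return loads * sum(stage_minutes)


def pipelined_minutes(stage_minutes, loads):
    """Loads overlap; a load waits only when the next machine is busy."""
    free_at = [0] * len(stage_minutes)  # time at which each machine is next free
    finished = 0
    for _ in range(loads):
        ready = 0  # time at which this load is ready for its next stage
        for stage, duration in enumerate(stage_minutes):
            start = max(ready, free_at[stage])
            ready = start + duration
            free_at[stage] = ready
        finished = ready
    return finished


laundry = [30, 40, 20]  # washer, dryer, folder (minutes)
seq = sequential_minutes(laundry, 4)   # 360 minutes = 6 hours
pipe = pipelined_minutes(laundry, 4)   # 210 minutes = 3.5 hours
print(seq, pipe, round(seq / pipe, 2))
```

The speedup stays well below the 3 pipe stages because the stages are unbalanced (the 40-minute dryer sets the pace) and because of fill and drain time.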

Computer Pipelines
- Execute billions of instructions, so throughput is what matters
- What is desirable in instruction sets for pipelining?
  - Variable-length instructions vs. all instructions the same length?
  - Memory operands part of any operation vs. memory operands only in loads or stores?
  - Register operands in many places in the instruction format vs. registers located in the same place?

A "Typical" RISC 32-bit fixed format instruction (3 formats)
- Memory access only via load/store instructions
- 32 32-bit GPRs (R0 contains zero)
- 3-address, reg-reg arithmetic instructions; registers in the same place
- Single addressing mode for load/store: base + displacement, no indirection
- Simple branch conditions
- Delayed branch
See: SPARC, MIPS, HP PA-RISC, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS
Register-Register: Op [31:26], Rs1 [25:21], Rs2 [20:16], Rd [15:11], Opx [10:0]
Register-Immediate: Op [31:26], Rs1 [25:21], Rd [20:16], immediate [15:0]
Branch: Op [31:26], Rs1 [25:21], Rs2/Opx [20:16], immediate [15:0]
Jump / Call: Op [31:26], target [25:0]
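Because the fields sit at fixed bit positions, decoding is a matter of shifts and masks. The following is a minimal sketch, assuming bit 0 is the least significant bit and using hypothetical helper names; the example word is made up for illustration and is not claimed to be a real opcode encoding.

```python
def field(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit instruction word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_register_register(word):
    """Split a word according to the Register-Register format."""
    return {
        "op": field(word, 31, 26),
        "rs1": field(word, 25, 21),
        "rs2": field(word, 20, 16),
        "rd": field(word, 15, 11),
        "opx": field(word, 10, 0),
    }


def decode_register_immediate(word):
    """Split a word according to the Register-Immediate format."""
    return {
        "op": field(word, 31, 26),
        "rs1": field(word, 25, 21),
        "rd": field(word, 20, 16),
        "imm": field(word, 15, 0),
    }


# Illustrative word: op=0, rs1=2, rs2=3, rd=1, opx=0x20
example = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 0x20
print(decode_register_register(example))
```

Since the register fields are in the same place in every format, the register file can be read in parallel with decoding the opcode.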

5 Steps of MIPS Datapath
[Figure: unpipelined datapath in five steps: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Components shown: Next PC mux, +4 adder (Next SEQ PC), instruction memory, register file (RS1, RS2, RD), sign extend (Imm), ALU with Zero? test, data memory, LMD, WB data mux.]

5 Steps of MIPS Datapath (cont’d)
[Figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between the stages; Next SEQ PC and RD are carried along through the pipeline registers.]
Data stationary control: local decode for each instruction phase / pipeline stage

Visualizing Pipeline
[Figure: instruction order against time in clock cycles (CC 1, CC 2, ...); each instruction passes through IM, Reg, ALU, DM, Reg, shifted one cycle from its predecessor.]
The pipeline can be thought of as a series of datapaths shifted in time. This slide shows the overlap among the parts of the datapath, with clock cycle 5 (CC5) showing the steady-state situation (the pipeline is filled). To check whether this pipeline is correct, we have to determine what happens on every clock cycle of the machine and make sure we do not try to perform two different operations with the same datapath resource on the same clock cycle. Let’s consider this in detail.
First, we use separate instruction (IM) and data (DM) memories, which would typically be implemented with separate instruction and data caches (we will discuss this later in the course). In this way, we eliminate the conflict for a single memory that would otherwise arise between instruction fetch and data memory access.
Second, the register file is used in two stages: for reading in ID and for writing in WB. These uses are distinct, so we show the register file in two places; it means that we need to perform two reads and one write every clock cycle, which is easy to implement (even when the same register is written and read in the same cycle). To support this, we perform the register write in the first half of the clock cycle and the read in the second half.
Third, this figure does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done in the IF stage. However, branch instructions make this difficult; we will discuss this problem later, in the part devoted to control hazards.
To ensure that instructions in different stages of the pipeline do not interfere with one another, we introduce pipeline registers between successive stages of the pipeline, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle.

Instruction Flow through Pipeline
[Figure: contents of the pipeline stages in CC 1 to CC 4 for the sequence below, with Nops ahead of the first instruction; stages drawn as IM, Reg, ALU, DM, Reg.]
  Add R1,R2,R3
  Lw  R4,0(R2)
  Sub R6,R5,R7
  Xor R9,R8,R1
Here we can see an illustration of the instruction flow through the pipeline registers. We will consider the instruction flow for a code sequence of four consecutive, independent instructions: Add, Lw, Sub, and Xor. In clock cycle 1 (CC1) we fetch the first instruction, and at the end of this clock cycle we will have this instruction in the IF/ID pipeline register. We assume that the instructions before it were nops (no operations). In clock cycle 2 (CC2), the Add instruction is in the ID pipe stage (instruction decoding and register read), and we fetch a new instruction (Lw) from the IM. At the end of CC2 we will have the Lw instruction in the IF/ID pipeline register and Add in the ID/EX pipeline register. ...
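The flow described above can be tabulated mechanically. This is a small sketch, assuming one instruction is fetched per clock with no stalls; the function name `occupancy` is hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def occupancy(program, cycle):
    """Map each stage to the instruction occupying it in a 1-based clock cycle.

    Instruction i (0-based) is fetched in cycle i + 1 and advances one
    stage per cycle; stages holding nops are simply omitted.
    """
    result = {}
    for index, instr in enumerate(program):
        position = cycle - 1 - index
        if 0 <= position < len(STAGES):
            result[STAGES[position]] = instr
    return result


program = ["Add R1,R2,R3", "Lw R4,0(R2)", "Sub R6,R5,R7", "Xor R9,R8,R1"]
for cc in range(1, 6):
    print(f"CC{cc}:", occupancy(program, cc))
```

In CC2 this reports Add in ID and Lw in IF, matching the description in the notes.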

DLX Pipeline Definition: IF, ID
Stage IF
  IF/ID.IR ← Mem[PC];
  if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT} else {IF/ID.NPC, PC ← PC + 4};
Stage ID
  ID/EX.A ← Regs[IF/ID.IR[6…10]];
  ID/EX.B ← Regs[IF/ID.IR[11…15]];
  ID/EX.Imm ← (IF/ID.IR[16])^16 ## IF/ID.IR[16…31];
  ID/EX.NPC ← IF/ID.NPC;
  ID/EX.IR ← IF/ID.IR;
Let’s consider the events in every pipe stage of the DLX pipeline. In IF, the instruction is fetched and stored in IF/ID.IR, and the new PC is computed. If no branch is taken, the PC is incremented; the incremented value is also saved (NPC) for later use in computing the branch target address. In ID, we fetch the registers, sign-extend the lower 16 bits of the IR, and pass them along with the IR and NPC.
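The immediate expression in Stage ID replicates the sign bit of the 16-bit immediate sixteen times and concatenates it with the immediate itself. A minimal sketch of that operation on Python integers (the function name is hypothetical, and values are treated as unsigned 32-bit patterns):

```python
def sign_extend_16(imm16):
    """Extend a 16-bit immediate to 32 bits by replicating its sign bit."""
    imm16 &= 0xFFFF                   # keep only the low 16 bits
    sign = (imm16 >> 15) & 1          # most significant bit of the immediate
    upper = 0xFFFF if sign else 0x0000  # sign bit replicated 16 times
    return (upper << 16) | imm16      # concatenate upper and lower halves


print(hex(sign_extend_16(0x0004)))  # positive displacement stays small
print(hex(sign_extend_16(0xFFFC)))  # negative displacement gets ones in the upper half
```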

DLX Pipeline Definition: EX
Stage EX
ALU
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or
  EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  EX/MEM.cond ← 0;
load/store
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.B ← ID/EX.B;
  EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
branch
  EX/MEM.ALUOUT ← ID/EX.NPC + (ID/EX.Imm << 2);
  EX/MEM.cond ← (ID/EX.A func 0);
During EX, we perform an ALU operation or an address calculation. We pass along the IR and the B register (needed if the instruction is a store). We also set the value of cond to 1 if the instruction is a taken branch.

DLX Pipeline Definition: MEM, WB
Stage MEM
ALU
  MEM/WB.IR ← EX/MEM.IR;
  MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
load/store
  MEM/WB.LMD ← Mem[EX/MEM.ALUOUT]; or
  Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
Stage WB
ALU
  Regs[MEM/WB.IR[16…20]] ← MEM/WB.ALUOUT; or
  Regs[MEM/WB.IR[11…15]] ← MEM/WB.ALUOUT;
load
  Regs[MEM/WB.IR[11…15]] ← MEM/WB.LMD;
During the MEM phase, we cycle the memory, make a branch decision and update the PC if needed, and pass along the values needed in the final pipe stage. Finally, during the WB stage, we update the register file from either the ALU output or the loaded value. For simplicity we always pass the entire IR from one stage to the next, though as an instruction proceeds down the pipeline, less and less of the IR is needed.

It’s Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: HW cannot support this combination of instructions
- Data hazards: instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port/Structural Hazards
[Figure: pipeline diagram, cycles 1 to 7, for the instruction order Load, Instr 1, Instr 2, Instr 3, Instr 4; each instruction passes through Ifetch, Reg, ALU, DMem, Reg. With one memory port, the DMem access of Load and the Ifetch of Instr 3 fall in the same cycle.]

One Memory Port/Structural Hazards (cont’d)
[Figure: the same diagram, cycles 1 to 7, with the conflict resolved by a stall: Load, Instr 1, Instr 2, Stall (a row of bubbles), Instr 3; the fetch of Instr 3 is delayed by one cycle.]

Data Hazard on R1
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB; drawn as Ifetch, Reg, ALU, DMem, Reg) for the five instructions above against time in clock cycles.]

Three Generic Data Hazards
Read After Write (RAW): Instr J tries to read an operand before Instr I writes it
    I: add r1,r2,r3
    J: sub r4,r1,r3
- Caused by a “dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

Write After Read (WAR): Instr J writes an operand before Instr I reads it
    I: sub r4,r1,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
- Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Reads are always in stage 2, and
  - Writes are always in stage 5

Write After Write (WAW): Instr J writes an operand before Instr I writes it
    I: sub r1,r4,r3
    J: add r1,r2,r3
    K: mul r6,r1,r7
- Called an “output dependence” by compiler writers. This also results from the reuse of the name “r1”.
- Can’t happen in the MIPS 5-stage pipeline because:
  - All instructions take 5 stages, and
  - Writes are always in stage 5
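The three cases can be told apart just by comparing the destination and source registers of two instructions in program order. The sketch below is illustrative: the function name and the (destination, sources) tuple representation are assumptions, and it reports register-name dependences, which become actual hazards only if the pipeline timing allows the accesses to be reordered.

```python
def classify_hazards(first, second):
    """Classify potential hazards between two instructions in program order.

    Each instruction is a (dest, sources) pair: dest is a register name or
    None, sources is a tuple of register names.
    """
    dest_i, srcs_i = first
    dest_j, srcs_j = second
    hazards = set()
    if dest_i is not None and dest_i in srcs_j:
        hazards.add("RAW")  # J reads what I writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.add("WAR")  # J writes what I reads
    if dest_i is not None and dest_i == dest_j:
        hazards.add("WAW")  # J writes what I writes
    return hazards


add = ("r1", ("r2", "r3"))      # add r1,r2,r3
sub_raw = ("r4", ("r1", "r3"))  # sub r4,r1,r3
sub_waw = ("r1", ("r4", "r3"))  # sub r1,r4,r3

print(classify_hazards(add, sub_raw))   # add then sub
print(classify_hazards(sub_raw, add))   # sub then add
print(classify_hazards(sub_waw, add))   # sub r1 then add r1
```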

Forwarding to Avoid Data Hazard
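  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for the five instructions above against time in clock cycles, with forwarding paths carrying r1 to the later instructions.]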

HW Change for Forwarding
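[Figure: datapath between the ID/EX, EX/MEM, and MEM/WR pipeline registers: multiplexers in front of the ALU inputs select among the register values, NextPC, the Immediate, and results fed back from later pipeline registers; the ALU is followed by the Data Memory and a final mux.]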
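The selection made by the multiplexers in front of the ALU can be sketched in software. This is an illustrative model, not the control logic from the slides: the function name and the dictionary representation of pipeline registers are assumptions, and it ignores the case of a load in EX/MEM, whose data is not yet available.

```python
def select_alu_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for an ALU source register, preferring the newest result.

    ex_mem and mem_wb are dicts with 'dest' (register number or None) and
    'value'. Register 0 always reads as zero, so it is never forwarded.
    """
    for name, latch in (("EX/MEM", ex_mem), ("MEM/WB", mem_wb)):
        if src_reg != 0 and latch["dest"] == src_reg:
            return latch["value"], name
    return regfile_value, "register file"


# add r1,r2,r3 has just left EX; the next instruction reads r1.
ex_mem = {"dest": 1, "value": 42}
mem_wb = {"dest": None, "value": 0}
print(select_alu_operand(1, 7, ex_mem, mem_wb))  # forwarded from EX/MEM
print(select_alu_operand(2, 7, ex_mem, mem_wb))  # no match: register file value
```

Checking EX/MEM before MEM/WB matters when both hold a result for the same register: the younger result is the one the program expects.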

Forwarding to DM input
  add R1,R2,R3
  lw  R4,0(R1)
  sw  12(R1),R4
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the three instructions above, CC 1 to CC 7.]
Forwarding can be generalized to include passing a result directly to the functional unit that requires it. Consider this sequence of instructions. To prevent a stall in this sequence we would need to:
- Forward R1 from EX/MEM.ALUOUT to ALU input (lw)
- Forward R1 from MEM/WB.ALUOUT to ALU input (sw)
- Forward R4 from MEM/WB.LMD to memory input (memory output to memory input)
Because the ALU and DM both accept operands, forwarding paths are needed to their inputs from both the EX/MEM and MEM/WB pipeline registers.

Forwarding to DM input (cont’d)
  add R1,R2,R3
  sw  0(R4),R1
[Figure: pipeline diagram (IM, Reg, ALU, DM, Reg) for the two instructions above, CC 1 to CC 6.]
Consider this sequence of instructions. Here we forward R1 from MEM/WB.ALUOUT to the DM input. Notice that the source is not the same as in the previous example (MEM/WB.LMD).

Forwarding to Zero
In addition to forwarding to the ALU input and the DM input, DLX also needs forwarding to the Zero unit, which tests branch conditions. Let’s consider the following two examples. ...
Forward R1 from EX/MEM.ALUOUT to Zero:
  add  R1,R2,R3
  beqz R1,50
Forward R1 from MEM/WB.ALUOUT to Zero:
  add  R1,R2,R3
  sub  R4,R5,R6
  bnez R1,50
[Figure: pipeline diagrams (IM, Reg, ALU, DM, Reg, with the Zero unit marked Z) for the two sequences above, CC 1 to CC 6.]

Data Hazard Even with Forwarding
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for the four instructions above against time in clock cycles.]
MIPS actually didn’t interlock: MPU without Interlocked Pipelined Stages

Data Hazard Even with Forwarding
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
[Figure: the same sequence with a one-cycle bubble inserted after the lw; sub, and, and or each proceed one cycle later.]

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e – f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
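The benefit of the reordering can be counted. The sketch below is a simplified model, assuming full forwarding so that the only remaining stall is one cycle when an instruction uses the result of the load immediately before it; the function name and the (opcode, dest, sources) tuples are hypothetical.

```python
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by using a loaded value in the very next instruction."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        opcode, dest, _ = previous
        if opcode == "LW" and dest in current[2]:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")), ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("LW", "Rf", ()), ("SW", None, ("Ra",)), ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]
print(count_load_use_stalls(slow), count_load_use_stalls(fast))
```

Under this model the slow code stalls twice (ADD after LW Rc, SUB after LW Rf) and the fast code not at all, because an independent instruction now separates each load from its first use.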

Control Hazard on Branches: Three Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: pipeline diagram (Ifetch, Reg, ALU, DMem, Reg) for the five instructions above.]

Example: Branch Stall Impact
- If 30% of instructions are branches, a 3-cycle stall is significant
- Two-part solution:
  - Determine whether the branch is taken or not sooner, AND
  - Compute the taken-branch address earlier
- MIPS branch tests if register = 0 or ≠ 0
- MIPS solution:
  - Move the Zero test to the ID/RF stage
  - Add an adder to calculate the new PC in the ID/RF stage
  - 1 clock cycle penalty for a branch versus 3

Pipelined MIPS Datapath
[Figure: pipelined datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; the Zero? test and an additional adder for the branch target are placed in the decode stage.]
Data stationary control: local decode for each instruction phase / pipeline stage

Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
- Execute successor instructions in sequence
- “Squash” instructions in the pipeline if the branch is actually taken
- Takes advantage of late pipeline state update
- 47% of MIPS branches are not taken on average
- PC+4 is already calculated, so use it to get the next instruction

Branch not Taken
Branch is untaken (determined during ID): we have fetched the fall-through and just continue, so no cycles are wasted.
  Instructions          Time [clocks]
  branch (not taken)    IF  ID  Ex  Mem WB
  Ii+1                      IF  ID  Ex  Mem WB
  Ii+2                          IF  ID  Ex  Mem WB
Branch is taken (determined during ID): restart the fetch at the branch target, so one cycle is wasted.
  Instructions          Time [clocks]
  branch (taken)        IF  ID  Ex  Mem WB
  Ii+1                      IF  idle
  branch target                 IF  ID  Ex  Mem WB
  branch target+1                   IF  ID  Ex  Mem WB

#3: Predict Branch Taken
- Treat every branch as taken
- 53% of MIPS branches are taken on average
- But MIPS has not yet calculated the branch target address, so it still incurs a 1-cycle branch penalty
- Makes sense only when the branch target is known before the branch outcome

#4: Delayed Branch
- Define the branch to take place AFTER a following instruction:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
  The n sequential successors form a branch delay of length n
- A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline
- MIPS uses this

Delayed Branch
Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From fall through: only valuable when the branch is not taken

Scheduling the branch delay slot: From Before
  ADD R1,R2,R3
  if (R2=0) then
    <Delay Slot>
becomes
  if (R2=0) then
    <ADD R1,R2,R3>
- Delay slot is scheduled with an independent instruction from before the branch
- Best choice; always improves performance

Scheduling the branch delay slot: From Target
  SUB R4,R5,R6
  ...
  ADD R1,R2,R3
  if (R1=0) then
    <Delay Slot>
becomes
  ...
  ADD R1,R2,R3
  if (R1=0) then
    <SUB R4,R5,R6>
- Delay slot is scheduled from the target of the branch
- Must be OK to execute that instruction if the branch is not taken
- Usually the target instruction will need to be copied because it can be reached by another path, so programs are enlarged
- Preferred when the branch is taken with high probability

Scheduling the branch delay slot: From Fall Through
  ADD R1,R2,R3
  if (R2=0) then
    <Delay Slot>
  SUB R4,R5,R6
becomes
  ADD R1,R2,R3
  if (R2=0) then
    <SUB R4,R5,R6>
- Delay slot is scheduled from the not-taken fall through
- Must be OK to execute that instruction if the branch is taken
- Improves performance when the branch is not taken

Delayed Branch Effectiveness
- Compiler effectiveness for a single branch delay slot:
  - Fills about 60% of branch delay slots
  - About 80% of instructions executed in branch delay slots are useful in computation
  - About 50% (60% x 80%) of slots are usefully filled
- Delayed branch downside: with 7-8 stage pipelines and multiple instructions issued per clock (superscalar), the branch delay grows beyond what a single slot can cover

Example: Branch Stall Impact
- Assume CPI = 1.0 ignoring branches
- Assume the solution was stalling for 3 cycles
- If 30% of instructions are branches, each stalling 3 cycles:

  Op      Freq  Cycles  CPI(i)  (% Time)
  Other   70%   1       0.7     (37%)
  Branch  30%   4       1.2     (63%)

=> new CPI = 1.9, or almost 2 times slower
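The arithmetic behind the new CPI is short enough to write out. A sketch with hypothetical function names, using only the numbers stated on the slide:

```python
def effective_cpi(base_cpi, branch_fraction, stall_cycles):
    """CPI when every branch adds a fixed number of stall cycles."""
    return base_cpi + branch_fraction * stall_cycles


def time_share(frequency, cycles, total_cpi):
    """Fraction of execution time spent in one instruction class."""
    return frequency * cycles / total_cpi


cpi = effective_cpi(1.0, 0.30, 3)
print(round(cpi, 2))                       # new CPI
print(round(time_share(0.70, 1, cpi), 2))  # share of time in non-branches
print(round(time_share(0.30, 4, cpi), 2))  # share of time in branches (1 + 3 stall cycles)
```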

Example 2: Speed Up Equation for Pipelining
For simple RISC pipeline, CPI = 1:
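The equation for this slide is not present in the transcript. For reference, the standard textbook form of the pipelining speedup relation is given below; it is an assumption that this is what the slide showed.

```latex
\begin{align*}
\text{Speedup} &= \frac{\text{Ideal CPI} \times \text{Pipeline depth}}
                       {\text{Ideal CPI} + \text{Pipeline stall CPI}}
                  \times \frac{\text{Cycle time}_{\text{unpipelined}}}
                              {\text{Cycle time}_{\text{pipelined}}} \\[4pt]
\text{With Ideal CPI} = 1:\quad
\text{Speedup} &= \frac{\text{Pipeline depth}}
                       {1 + \text{Pipeline stall CPI}}
                  \times \frac{\text{Cycle time}_{\text{unpipelined}}}
                              {\text{Cycle time}_{\text{pipelined}}}
\end{align*}
```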

Example 3: Evaluating Branch Alternatives (for 1 program)
Columns: scheduling scheme, branch penalty, CPI, speedup v. stall
Schemes: stall pipeline, predict taken, predict not taken, delayed branch
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC
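The CPI column can be recomputed from the two percentages on the slide once a penalty is chosen for each scheme. The penalties below are assumptions, not values from this slide: 3 cycles for stalling and 1 cycle for predict-taken follow the earlier slides, predict-not-taken is assumed to pay 1 cycle only on the 65% of branches that change the PC, and 0.5 cycles for the delayed branch is a guess based on roughly half of the delay slots being usefully filled.

```python
BRANCH_FREQUENCY = 0.14  # conditional and unconditional branches
TAKEN_FRACTION = 0.65    # fraction of branches that change the PC


def branch_cpi(penalty, fraction_paying=1.0):
    """CPI with an ideal CPI of 1 plus the average branch penalty."""
    return 1.0 + BRANCH_FREQUENCY * fraction_paying * penalty


# Penalties are assumptions (see the lead-in), not values from the slide.
cpi_by_scheme = {
    "Stall pipeline": branch_cpi(3),
    "Predict taken": branch_cpi(1),
    "Predict not taken": branch_cpi(1, TAKEN_FRACTION),
    "Delayed branch": branch_cpi(0.5),
}

for scheme, cpi in cpi_by_scheme.items():
    speedup_vs_stall = cpi_by_scheme["Stall pipeline"] / cpi
    print(f"{scheme:18s} CPI = {cpi:.3f}  speedup v. stall = {speedup_vs_stall:.2f}")
```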

Example 4: Dual-port vs. Single-port
- Machine A: dual-ported memory (“Harvard Architecture”)
- Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
- Ideal CPI = 1 for both
- Loads and stores are 40% of instructions executed
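A rough comparison of the two machines can be sketched as follows. It assumes the usual relation Speedup = pipeline depth / (1 + stall CPI) × clock ratio, and it assumes that on machine B every load or store costs one stall cycle because it competes with instruction fetch for the single memory port. The pipeline depth of 5 is arbitrary, since it cancels in the ratio; the function name is hypothetical.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    """Speedup over an unpipelined machine for an ideal CPI of 1."""
    return depth / (1.0 + stall_cpi) * clock_ratio


DEPTH = 5  # arbitrary: cancels out when comparing A with B

speedup_a = pipeline_speedup(DEPTH, stall_cpi=0.0)
# Assumption: each load/store (40% of instructions) stalls for 1 cycle on B.
speedup_b = pipeline_speedup(DEPTH, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(round(speedup_a / speedup_b, 2))  # how many times faster A is than B
```

Under these assumptions machine A comes out about 1.33 times faster: the 5% clock advantage of machine B does not make up for stalling on 40% of its instructions.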

Extended MIPS Pipeline
DLX pipe with three unpipelined FP functional units
[Figure: IF and ID feed four parallel execution units (EX Int, EX FP/I Mult, EX FP Add, EX FP/I Div), which feed Mem and WB.]
In reality, the intermediate results are probably not cycled around the EX unit; instead, the EX stage has some number of clock delays larger than 1.

Extended MIPS Pipeline (cont’d)
- Initiation or repeat interval: number of clock cycles that must elapse between issuing two operations of a given type
- Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result

  Functional unit        Latency  Initiation interval
  Integer ALU            0        1
  Data Memory            1        1
  FP Add                 3        1
  FP/Integer Multiply    6        1
  FP/Integer Divide      24       25

Multiple outstanding FP operations
[Figure: pipeline with a multi-stage FP adder: IF ID A1 A2 A3 A4 M WB ...]
- FP/I Adder and Multiplier are fully pipelined
- FP/I Divider is not pipelined
Pipeline timing for independent operations:
  MUL.D  IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
  ADD.D  IF ID A1 A2 A3 A4 Mem WB
  L.D    IF ID Ex Mem WB
  S.D    IF ID Ex Mem WB

Hazards and Forwarding in Longer Pipes
- Structural hazard: the divide unit is not fully pipelined; detect it and stall the instruction
- Structural hazard: the number of register writes in a cycle can be larger than one, because instructions have varying running times
- WAW hazards are possible
- Exceptions! Instructions can complete in a different order than they were issued
- RAW hazards will be more frequent

Examples
Stalls arising from RAW hazards:
  L.D   F4, 0(R2)    IF ID EX Mem WB
  MUL.D F0, F4, F6   stall, then M1 M2 M3 M4 M5 M6 M7
  ADD.D F2, F0, F8   A1 A2 A3 A4 (after waiting for F0)
  S.D   0(R2), F2    (waits for F2)
Three instructions that want to perform a write back to the FP register file simultaneously:
  MUL.D F0, F4, F6   IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
  ...
  ADD.D F2, F4, F6   IF ID A1 A2 A3 A4 Mem WB
  ...
  L.D   F2, 0(R2)    IF ID EX Mem WB

Solving Register Write Conflicts
First approach: track the use of the write port in the ID stage and stall an instruction before it issues
- Use a shift register that indicates when already issued instructions will use the register file
- If there is a conflict with an already issued instruction, stall the instruction for one clock cycle
- On each clock cycle the reservation register is shifted one bit
Alternative approach: stall a conflicting instruction when it tries to enter the MEM or WB stage
- We can stall either instruction, e.g. give priority to the unit with the longest latency
- Pros: the conflict need not be detected until the entrance of the MEM or WB stage
- Cons: complicates pipeline control; stalls can now arise from two different places
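The first approach can be sketched with an integer used as the shift register. The class and method names are hypothetical, and the distances from ID to WB used in the example (9 cycles for MUL.D, 6 for ADD.D) are read off the stage sequences shown earlier for independent operations.

```python
class WritePortReservation:
    """Shift register: bit i set means the write port is taken i cycles from now."""

    def __init__(self):
        self.bits = 0

    def try_issue(self, cycles_until_wb):
        """Reserve the write port, or return False if the instruction must stall."""
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False  # an already issued instruction writes in that cycle
        self.bits |= mask
        return True

    def tick(self):
        """Advance one clock cycle: every reservation moves one bit closer."""
        self.bits >>= 1


port = WritePortReservation()
print(port.try_issue(9))   # MUL.D in ID: WB is 9 cycles away
for _ in range(3):
    port.tick()            # three cycles later ...
print(port.try_issue(6))   # ... ADD.D in ID would write in the same cycle: stall
port.tick()
print(port.try_issue(6))   # one cycle later the slot is free
```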

WAW Hazards
  ADD.D F2, F4, F6   IF ID A1 A2 A3 A4 Mem WB
  L.D   F2, 0(R2)    IF ID EX Mem WB
- The result of ADD.D is overwritten without any instruction ever using it
- WAWs occur when a useless instruction is executed
- Still, we must detect them and provide correct execution. Why?
    BNEZ  R1, foo
    DIV.D F0, F2, F4   ; delay slot from fall-through
    ...
  foo:
    L.D   F0, qrs

Solving WAW Hazards
- First approach: delay the issue of the load instruction until ADD.D enters MEM
- Second approach: stamp out the result of ADD.D by detecting the hazard and changing the control so that ADD.D does not write; L.D issues right away
- Detect the hazard in ID when L.D is issuing: stall L.D, or make ADD.D a no-op
- Luckily this hazard is rare

Hazard Detection in ID Stage
- Possible hazards:
  - Hazards among FP instructions
  - Hazards between an FP instruction and an integer instruction
- FP and integer registers are distinct, except for FP loads/stores and FP-integer moves
- Assume that the pipeline does all hazard detection in the ID stage

Hazard Detection in ID Stage (cont’d)
- Check for structural hazards: wait until the required functional unit is not busy, and make sure that the register write port is available
- Check for RAW data hazards: wait until the source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result
- Check for WAW data hazards: determine if any instruction in A1, ..., A4, M1, ..., M7, D has the same destination register as this instruction; if so, stall the issue of the instruction in ID

Forwarding Logic
- Check if the destination register in any of the EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of an FP instruction
- If so, the appropriate input multiplexer has to be enabled so as to choose the forwarded data
