CS152 Computer Architecture and Engineering Lecture 14 Pipelining Control Continued Introduction to Advanced Pipelining.


Recap: Summary of Pipelining Basics
5 stages:
Fetch: fetch instruction from memory
Decode: get register values and decode control information
Execute: execute arithmetic operations / calculate addresses
Memory: do memory ops (load or store)
Writeback: write results back to registers (i.e., COMMIT)
Pipelines pass control information down the pipe just as data moves down the pipe.
Forwarding/stalls are handled by local control.
Balancing the length of instructions makes pipelining much smoother.
Increasing the length of the pipe increases the impact of hazards; pipelining helps instruction bandwidth, not latency.

Recap: Can pipelining get us into trouble?
Yes: Pipeline Hazards
Structural hazards: attempt to use the same resource two different ways at the same time.
  E.g., a combined washer/dryer would be a structural hazard, or the folder busy doing something else (watching TV).
Data hazards: attempt to use an item before it is ready.
  E.g., one sock of a pair in the dryer and one in the washer; can’t fold until the sock from the washer gets through the dryer.
  An instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: attempt to make a decision before the condition is evaluated.
  E.g., washing football uniforms and needing the proper detergent level; need to see the result after the dryer before starting the next load.
  Branch instructions.
Can always resolve hazards by waiting:
  pipeline control must detect the hazard
  and take action (or delay action) to resolve it.

Pipelining the Load Instruction
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7
1st lw  Ifetch   Reg/Dec  Exec     Mem      Wr
2nd lw           Ifetch   Reg/Dec  Exec     Mem      Wr
3rd lw                    Ifetch   Reg/Dec  Exec     Mem      Wr

The five independent functional units in the pipeline datapath are:
Instruction Memory for the Ifetch stage
Register File’s read ports (busA and busB) for the Reg/Dec stage
ALU for the Exec stage
Data Memory for the Mem stage
Register File’s write port (busW) for the Wr stage

Notice that I have treated the Register File’s read and write ports as separate functional units, because the register file we have allows us to read and write at the same time. Notice also that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is used only ONCE per instruction, we will not have any conflict further down the pipeline (Exec-Ifetch, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later. For now, I want to point out the performance advantage of pipelining. If these 3 load instructions are executed by the multiple-cycle processor, they take 15 cycles. With pipelining, they take only 7 cycles. The 7-cycle total, however, is not the best way to look at the advantage. A better way is to note that one instruction enters the pipeline every cycle, so one instruction comes out of the pipeline (its Wr stage) every cycle. Consequently, the “effective” (or average) number of cycles per instruction is now ONE, even though it takes a total of 5 cycles to complete each instruction.
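The 15-versus-7 cycle count above can be checked with a minimal sketch. The function names are illustrative only, and the model assumes an ideal five-stage pipeline with no stalls:

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def multicycle_cycles(n_instructions, n_stages=len(STAGES)):
    """Multiple-cycle processor: an instruction finishes all of its
    stages before the next instruction starts."""
    return n_instructions * n_stages

def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Ideal pipeline: the first instruction needs n_stages cycles,
    and every later instruction completes one cycle after the previous."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)

# Three back-to-back loads, as in the diagram.
three_loads = (multicycle_cycles(3), pipelined_cycles(3))
```

As the instruction count grows, the pipelined total divided by the instruction count approaches 1, which is the "effective CPI of ONE" argument.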

The Four Stages of R-type
        Cycle 1  Cycle 2  Cycle 3  Cycle 4
R-type  Ifetch   Reg/Dec  Exec     Wr

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory.
Reg/Dec: Registers Fetch and Instruction Decode.
Exec: ALU operates on the two register operands. Update PC.
Wr: Write the ALU output back to the register file.

Well, so far so good. Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory, so it takes only four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to those of the load instruction. They have to be, because at this point we do not yet know that we have an R-type instruction. Instead of calculating the effective address during the Exec stage, the R-type instruction uses the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr stage.

Pipelining the R-type and Load Instruction
        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8
R-type  Ifetch   Reg/Dec  Exec     Wr
R-type           Ifetch   Reg/Dec  Exec     Wr
Load                      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                             Ifetch   Reg/Dec  Exec     Wr
R-type                                      Ifetch   Reg/Dec  Exec     Wr

Oops! We have a problem!
We have a structural hazard: two instructions try to write to the register file at the same time, and there is only one write port.

What happens if we try to pipeline the R-type instructions with the load instructions? Well, we have a problem here! We end up with two instructions trying to write to the register file at the same time. Why do we have this problem? The reason is that there is something I have not yet told you.

Important Observation
Each functional unit can only be used once per instruction.
Each functional unit must be used at the same stage for all instructions:
Load uses the Register File’s write port during its 5th stage
R-type uses the Register File’s write port during its 4th stage

Load     1: Ifetch  2: Reg/Dec  3: Exec  4: Mem  5: Wr
R-type   1: Ifetch  2: Reg/Dec  3: Exec  4: Wr

2 ways to solve this pipeline hazard.

I already told you that in order for the pipeline to work perfectly, each functional unit can be used ONLY once per instruction. What I have not told you is that this is a necessary but NOT sufficient condition for the pipeline to work. The other condition needed to prevent pipeline hiccups is that each functional unit must be used at the same stage for all instructions. For example, the load instruction uses the Register File’s write port during its 5th stage, but the R-type instruction as it stands uses the write port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions.

Solution 1: Insert “Bubble” into the Pipeline
[Pipeline diagram, Cycles 1 to 9: a load followed by R-type instructions, with a pipeline bubble pushing back the instructions behind the load by one cycle]

Insert a “bubble” into the pipeline to prevent 2 writes in the same cycle.
The control logic can be complex.
Lose an instruction fetch and issue opportunity.
No instruction is started in Cycle 6!

The first solution is to insert a “bubble” into the pipeline AFTER the load instruction, pushing back by one cycle every instruction after the load that is already in the pipeline. At the same time, the bubble delays by one cycle the Instruction Fetch of the instruction that is about to enter the pipeline. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) that the load instruction has, we will not have one instruction finishing every cycle (see Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1, because in effect the load instruction has an effective CPI of 2. So this is not that hot an idea. Let’s try something else.

Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle:
Now R-type instructions also use the Reg File’s write port at Stage 5
Mem stage is a NOOP stage: nothing is being done.

R-type   1: Ifetch  2: Reg/Dec  3: Exec  4: Mem  5: Wr

        Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5  Cycle 6  Cycle 7  Cycle 8  Cycle 9
R-type  Ifetch   Reg/Dec  Exec     Mem      Wr
R-type           Ifetch   Reg/Dec  Exec     Mem      Wr
Load                      Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                             Ifetch   Reg/Dec  Exec     Mem      Wr
R-type                                      Ifetch   Reg/Dec  Exec     Mem      Wr

Well, one thing we can do is add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 to finish, our overall performance is actually better off. The reason for this higher performance is that we end up with a more efficient pipeline.
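The write-port conflict, and the way the extra NOOP stage removes it, can be illustrated with a small sketch. The stage lists and the function name are assumptions made for this example; instruction k is assumed to enter Ifetch in cycle k + 1 with no stalls:

```python
from collections import defaultdict

LOAD = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
RTYPE_4 = ["Ifetch", "Reg/Dec", "Exec", "Wr"]          # original R-type
RTYPE_5 = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]   # Mem is a NOOP stage

def write_port_conflicts(program):
    """Return the sorted cycles in which more than one instruction
    occupies the Wr stage (i.e. fights for the single write port)."""
    writers = defaultdict(list)
    for k, stages in enumerate(program):
        wr_cycle = (k + 1) + stages.index("Wr")
        writers[wr_cycle].append(k)
    return sorted(c for c, ks in writers.items() if len(ks) > 1)

# R-type, R-type, Load, R-type, R-type
before = write_port_conflicts([RTYPE_4, RTYPE_4, LOAD, RTYPE_4, RTYPE_4])
after = write_port_conflicts([RTYPE_5, RTYPE_5, LOAD, RTYPE_5, RTYPE_5])
```

With four-stage R-types, the load and the R-type right behind it reach Wr in the same cycle; once every instruction writes in its 5th stage, the list of conflicting cycles is empty.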

Modified Control & Datapath
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
S <- A + B;   S <- A or ZX;   S <- A + SX;   S <- A + SX;   if Cond PC <- PC+SX;
M <- S;   M <- Mem[S];   Mem[S] <- B;   M <- S
R[rd] <- M;   R[rt] <- M;   R[rd] <- M;
[Datapath figure: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Equal, Exec, S, Mem Access, Data Mem, D, M, Reg File]

The Four Stages of Store
        Cycle 1  Cycle 2  Cycle 3  Cycle 4
Store   Ifetch   Reg/Dec  Exec     Mem      Wr

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory.
Reg/Dec: Registers Fetch and Instruction Decode.
Exec: Calculate the memory address.
Mem: Write the data into the Data Memory.

Let’s continue by looking at the store instruction. Once again, the Ifetch and Reg/Dec stages are the same as for all other instructions. The Exec stage of the store instruction calculates the memory address. Once the address is calculated, the store instruction writes the data it read from the register file back in the Reg/Dec stage into the data memory during the Mem stage. Notice that unlike the load instruction, which takes five cycles to accomplish its task, the store instruction takes only four cycles, or four pipe stages. To keep our pipeline diagram more uniform, however, we will keep the Wr stage for the store instruction in the diagram. But keep in mind that as far as the pipelined control and pipelined datapath are concerned, the store instruction requires NOTHING to be done once it finishes its Mem stage.

The Three Stages of Beq
        Cycle 1  Cycle 2  Cycle 3  Cycle 4
Beq     Ifetch   Reg/Dec  Exec     Mem      Wr

Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory.
Reg/Dec: Registers Fetch and Instruction Decode.
Exec: compare the two register operands, select the correct branch target address, and latch it into the PC.

Similar to the store instruction, the branch instruction does not use all five pipe stages. Ifetch and Reg/Dec are the same as for all other instructions, because we do not know what instruction we have at this point; we have not finished decoding the instruction yet. During the Exec stage, the BEQ instruction uses the ALU to compare the two register operands it fetched during the Reg/Dec stage. At the same time, a separate adder is used to calculate the branch target address. If the registers compared during the Exec stage have the same value, the branch is taken. That is, the branch target address calculated earlier is written into the Program Counter. Once again, similar to the store instruction, the BEQ instruction requires NEITHER the pipelined control nor the pipelined datapath to do ANYTHING once it finishes its Mem stage. With all this talk about pipelined datapath and pipelined control, let’s take a look at what the pipelined datapath looks like.

Control Diagram
IR <- Mem[PC]; PC <- PC+4;
A <- R[rs]; B <- R[rt]
S <- A + B;   S <- A or ZX;   S <- A + SX;   S <- A + SX;   if Cond PC <- PC+SX;
M <- S;   M <- Mem[S];   Mem[S] <- B;   M <- S
R[rd] <- S;   R[rt] <- S;   R[rd] <- M;
[Datapath figure: PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Equal, Exec, S, Mem Access, Data Mem, D, M, Reg File]

The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Control, Datapath (together the Processor), Memory, Input, Output.

So where are we in the overall scheme of things? Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath.

Recall: Single cycle control
[Figure: abstract single-cycle datapath. PC and Next Address logic drive the Ideal Instruction Memory (Instruction Address in, Instruction out); the Rd, Rs, Rt fields (5 bits each) select Rw, Ra, Rb in the 32 32-bit Registers; busA and busB (32 bits) feed the ALU; the ALU output is the Data Address for the Ideal Data Memory (Data In, Data Out); Control takes the Instruction and Conditions and produces the Control Signals; all state elements are clocked by Clk]

One thing you may have noticed from our last slide is that almost all instructions, except Jump, require reading some registers, doing some computation, and then doing something else. Therefore our datapath will look something like this. For example, if we have an add instruction (at the output of the Instruction Memory), we will read the registers from the register file (Ra, Rb and then busA and busB), add the two numbers together (ALU), and then write the result back to the register file. On the other hand, if we have a load instruction, we will first use the ALU to calculate the memory address. Once the address is ready, we will use it to access the Data Memory. And once the data is available on the Data Memory’s output bus, we will write the data to the register file. Well, this is simple enough. But if it were this simple, you probably would not need to take this class. So in today’s lecture, I will show you how to turn this abstract datapath into a real datapath by making it slightly (JUST slightly) more complicated so it can do real work for you. But before we do that, let’s do a quick review of the clocking methodology.

Data Stationary Control
The Main Control generates the control signals during Reg/Dec:
Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later
Control signals for Mem (MemWr, Branch) are used 2 cycles later
Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later

[Figure: Main Control in Reg/Dec feeding the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr registers. ExtOp, ALUSrc, ALUOp, and RegDst are dropped after Exec; MemWr and Branch are dropped after Mem; MemtoReg and RegWr travel all the way to Wr]

The main control here is identical to the one in the single-cycle processor. It generates all the control signals necessary for a given instruction during that instruction’s Reg/Dec stage. All these control signals are saved in the ID/Ex pipeline register at the end of the Reg/Dec cycle. The control signals for the Exec stage (ALUSrc, etc.) come from the output of the ID/Ex register. That is, they are delayed ONE cycle from the cycle in which they are generated. The rest of the control signals, which are not used during the Exec stage, are passed down the pipeline and saved in the Ex/Mem register. The control signals for the Mem stage (MemWr, Branch) come from the output of the Ex/Mem register. That is, they are delayed two cycles from the cycle in which they are generated. Finally, the control signals for the Wr stage (MemtoReg and RegWr) come from the output of the Mem/Wr register: they are delayed three cycles from the cycle in which they are generated.
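A minimal sketch of data-stationary control follows: the control word is decoded once in Reg/Dec and then shifted through the ID/Ex, Ex/Mem, and Mem/Wr registers alongside its instruction. The signal values in the table and the `run` helper are assumptions made for illustration, not the lecture’s actual control table:

```python
# Hypothetical control words, grouped by the stage that consumes them.
CONTROL = {
    "lw":  {"ex": {"ALUSrc": 1, "ExtOp": 1},
            "mem": {"MemWr": 0, "Branch": 0},
            "wr": {"MemtoReg": 1, "RegWr": 1}},
    "sw":  {"ex": {"ALUSrc": 1, "ExtOp": 1},
            "mem": {"MemWr": 1, "Branch": 0},
            "wr": {"MemtoReg": 0, "RegWr": 0}},
    "add": {"ex": {"ALUSrc": 0, "ExtOp": 0},
            "mem": {"MemWr": 0, "Branch": 0},
            "wr": {"MemtoReg": 0, "RegWr": 1}},
}

def run(ops):
    """Return, per cycle, the instruction being decoded and the control
    signals seen at the outputs of the three pipeline registers."""
    id_ex = ex_mem = mem_wr = None
    trace = []
    for op in list(ops) + [None] * 3:          # three extra cycles to drain
        trace.append({
            "decoding": op,
            "exec": id_ex["ex"] if id_ex else None,
            "mem": ex_mem["mem"] if ex_mem else None,
            "wr": mem_wr["wr"] if mem_wr else None,
        })
        # Clock edge: every control word moves one register down the pipe.
        mem_wr, ex_mem, id_ex = ex_mem, id_ex, (CONTROL[op] if op else None)
    return trace

trace = run(["lw", "sw"])
```

Reading the trace, the `lw` control word shows up at Exec one cycle after it is decoded, at Mem two cycles after, and at Wr three cycles after, matching the 1/2/3-cycle delays listed above.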

Datapath + Data Stationary Control
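[Figure: pipelined datapath (PC, Next PC, Inst. Mem, IR, Decode, Reg. File, A, B, Exec, S, Mem Access, Data Mem, D, M, Reg File) with a control word carried in each pipeline register: a valid bit v, the rw/rs/rt/im/op/fun fields, and the ex, me (Mem Ctrl), and wb (WB Ctrl) control groups]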

Let’s Try it Out
 10  lw   r1, r2(35)
 14  addI r2, r2, 3
 20  sub  r3, r4, r5
 24  beq  r6, r7, 100
 30  ori  r8, r9, 17
 34  add  r10, r11, r12
100  and  r13, r14, r15
These addresses are octal.
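As a rough aid for following the walk-through, the sketch below lists which instruction sits in which stage during the first few cycles. It assumes purely sequential fetch with no stalls, so it is only meaningful until the `beq` redirects the PC; the names `PROGRAM` and `occupancy` are made up for this example, and the addresses are written as Python octal literals:

```python
# Sequential part of the example program, keyed by octal address.
PROGRAM = [
    (0o10, "lw r1, r2(35)"),
    (0o14, "addI r2, r2, 3"),
    (0o20, "sub r3, r4, r5"),
    (0o24, "beq r6, r7, 100"),
    (0o30, "ori r8, r9, 17"),
    (0o34, "add r10, r11, r12"),
]
STAGES = ["IF", "ID", "EX", "M", "WB"]

def occupancy(cycle, program=PROGRAM):
    """Map stage name -> (address, instruction) for a 1-based cycle,
    assuming one instruction is fetched per cycle with no stalls."""
    result = {}
    for depth, stage in enumerate(STAGES):
        idx = cycle - 1 - depth
        if 0 <= idx < len(program):
            result[stage] = program[idx]
    return result

# Cycle 5: the lw reaches WB while the ori (the delay slot) is fetched.
cycle5 = {stage: oct(addr) for stage, (addr, _) in occupancy(5).items()}
```

Printing `occupancy(n)` for n = 1..5 reproduces the "Fetch ..., Decode ..., Exec ..." headings of the first five steps.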

Start: Fetch 10
IF:  10  lw r1, r2(35)
(all later stages are empty)

Fetch 14, Decode 10
IF:  14  addI r2, r2, 3
ID:  10  lw r1, r2(35)  (reading register 2)

Fetch 20, Decode 14, Exec 10
IF:  20  sub r3, r4, r5
ID:  14  addI r2, r2, 3  (reading register 2)
EX:  10  lw r1  (operands r2 and 35)

Fetch 24, Decode 20, Exec 14, Mem 10
IF:  24  beq r6, r7, 100
ID:  20  sub r3, r4, r5  (reading registers 4 and 5)
EX:  14  addI r2, r2, 3  (operands r2 and 3)
M:   10  lw r1  (address r2+35)

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
IF:  30  ori r8, r9, 17
ID:  24  beq r6, r7, 100  (reading registers 6 and 7)
EX:  20  sub r3  (operands r4 and r5)
M:   14  addI r2  (result r2+3)
WB:  10  lw r1  (data M[r2+35])
Note Delayed Branch: always execute ori after beq

Fetch 34, Dcd 30, Ex 24, Mem 20, WB 14
IF:  34  add r10, r11, r12
ID:  30  ori r8, r9, 17  (reading register 9)
EX:  24  beq  (operands r6 and r7, target 100)
M:   20  sub r3  (result r4-r5)
WB:  14  addI r2  (result r2+3)
Register file so far: r1 = M[r2+35]
Take the branch: r6 - r7 = 0

PC = 100
ID:  34  add r10, r11, r12  (reading registers 11 and 12)
EX:  30  ori r8  (operands r9 and 17)
M:   24  beq
WB:  20  sub r3  (result r4-r5)
Register file so far: r1 = M[r2+35], r2 = r2+3
Do we have a problem here?
Oops, we should have only one delayed instruction.

Fetch 104, Dcd 100, Ex 34, Mem 30, WB 24
ID:  100  and r13, r14, r15  (reading registers 14 and 15)
EX:  34   add r10  (operands r11 and r12)
M:   30   ori r8  (result r9 | 17)
WB:  24   beq
Register file so far: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5
Squash the extra instruction

Fetch 108, Dcd 104, Ex 100, Mem 34, WB 30
EX:  100  and r13  (operands r14 and r15)
M:   34   add r10  (result r11+r12)
WB:  30   ori r8  (result r9 | 17)
Register file so far: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5

Fetch 112, Dcd 108, Ex 104, Mem 100, WB 34
M:   100  and r13  (result r14 & r15)
WB:  34   add r10  (result r11+r12: NO WB, NO overflow)
Register file so far: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5, r8 = r9 | 17
Squash the extra instruction in the branch shadow!

Pipelined Processor
[Figure: pipelined datapath (PC, Next PC, Inst. Mem, IR, Reg. File, A, B, Equal, Exec, S, Mem Access, Data Mem, D, M, Reg File) with a Valid bit and the instruction registers IRex, IRmem, IRwb feeding the Dcd Ctrl, Ex Ctrl, Mem Ctrl, and WB Ctrl blocks; Bubbles flow forward and Stalls flow backward]
Separate control at each stage.
Stalls propagate backwards to freeze previous stages.
Bubbles in the pipeline are introduced by placing “Noops” into the local stage while stalling previous stages.

Pipeline Hazards Again
[Figure: overlapped pipeline diagrams illustrating each hazard type]
Structural Hazard: a Store in its memory stage overlapping another instruction’s fetch
Control Hazard: instructions fetched behind a Jump
RAW (read after write) Data Hazard
WAW (write after write) Data Hazard
WAR (write after read) Data Hazard

Recap: Data Hazards
Avoid some “by design”:
  eliminate WAR by always fetching operands early (DCD) in the pipe
  eliminate WAW by doing all WBs in order (last stage, static)
Detect and resolve the remaining ones:
  stall or forward (if possible)
[Figure: overlapped pipeline diagrams for the RAW, WAW, and WAR data hazards]

Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
A RAW hazard exists on register $\rho$ if $\rho \in \mathrm{Rregs}(i) \cap \mathrm{Wregs}(j)$
  Keep a record of pending writes (for instructions in the pipe) and compare with the operand registers of the current instruction.
  When an instruction issues, reserve its result register as a write reservation.
  When an operation completes, remove its write reservation.
A WAW hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Wregs}(j)$
A WAR hazard exists on register $\rho$ if $\rho \in \mathrm{Wregs}(i) \cap \mathrm{Rregs}(j)$
Window on execution: only pending instructions can cause exceptions.
[Figure: instruction movement through the window: New Inst, Inst I, Inst J]
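The three set conditions can be written out directly. In this sketch the `Instr` record and the `hazards` helper are hypothetical names; `i` is the instruction about to issue and `j` is an earlier instruction still in the pipe:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Instr:
    name: str
    reads: frozenset    # Rregs: registers the instruction reads
    writes: frozenset   # Wregs: registers the instruction writes

def hazards(i, j):
    """Registers on which issuing i behind the pending j creates a hazard."""
    return {
        "RAW": i.reads & j.writes,    # i reads what j has yet to write
        "WAW": i.writes & j.writes,   # both write the same register
        "WAR": i.writes & j.reads,    # i overwrites what j has yet to read
    }

j = Instr("add r1,r2,r3", frozenset({"r2", "r3"}), frozenset({"r1"}))
i = Instr("sub r4,r1,r3", frozenset({"r1", "r3"}), frozenset({"r4"}))
found = hazards(i, j)
```

A non-empty set for any key means that hazard exists on the listed registers; here only the RAW set is non-empty, containing r1.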

Record of Pending Writes In Pipeline Registers
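Current operand registers (rs, rt) are compared against the pending writes (rw) held in the pipeline registers:

hazard <= ((rs == rwex)  & regWex) OR
          ((rs == rwmem) & regWme) OR
          ((rs == rwwb)  & regWwb) OR
          ((rt == rwex)  & regWex) OR
          ((rt == rwmem) & regWme) OR
          ((rt == rwwb)  & regWwb)

[Figure: pipeline (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) whose pipeline registers each carry op and rw, plus rs, rt, im in decode and the A, B, S, D, m data latches]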
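The hazard expression translates almost line for line into code. This is a sketch only: the function name and the `(rw, regW)` pair layout are assumptions, and like the expression above it does not special-case any register number:

```python
def raw_hazard(rs, rt, ex, mem, wb):
    """ex, mem, wb are (rw, regW) pairs taken from the pipeline registers:
    the destination register number and whether that stage will write it."""
    pending = [rw for rw, reg_w in (ex, mem, wb) if reg_w]
    return rs in pending or rt in pending

# rs = 1 matches the destination of the instruction in Exec.
stall_needed = raw_hazard(1, 3, ex=(1, True), mem=(0, False), wb=(0, False))
```

Filtering on `regW` first is what keeps a store or branch (which carries an rw field but never writes) from raising a false hazard.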

Resolve RAW by “forwarding” (or bypassing)
Detect the nearest valid write to the operand register and forward it into the op latches, bypassing the remainder of the pipe.
Increase muxes to add paths from the pipeline registers.
Data Forwarding = Data Bypassing
[Figure: the same pipeline (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) with a Forward mux in front of the A and B latches, fed from the S, D, and m pipeline registers; each pipeline register carries v, op, rw]
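The "nearest valid write" rule amounts to a priority search over the later pipeline stages. The sketch below assumes a hypothetical tuple layout `(stage name, rw, will-write flag, value)` ordered from the stage closest to decode outward:

```python
def forward_select(src, stages):
    """Pick the forwarding source for operand register `src`.
    `stages` is ordered nearest-first, so the first match is the most
    recent pending write. Falls back to the register file."""
    for name, rw, writes, value in stages:
        if writes and rw == src:
            return name, value
    return "regfile", None

in_flight = [
    ("ex",  1, True, 10),   # youngest producer of r1
    ("mem", 1, True, 20),   # older, stale value of r1
    ("wb",  2, True, 30),
]
choice = forward_select(1, in_flight)
```

Taking the first match matters: when two in-flight instructions write the same register, only the younger one holds the value the consumer should see.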

What about memory operations?
If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations!
What about data dependence on loads?
  R1 <- R4 + R5
  R2 <- Mem[ R2 + I ]
  R3 <- R2 + R1
  => “Delayed Loads”
Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage.
Tricky situation:
  R1 <- Mem[ R2 + I ]
  Mem[R3+34] <- R1
Handle with a bypass in the memory stage!
[Figure: pipeline registers carrying op, Rd, Ra, Rb, then A, B, Rd, then D, R, Rd through Mem, then T and Rd to the reg file]
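A small sketch of the load-use check follows, under the assumption of a one-cycle load delay with forwarding everywhere else. The tuple format `(op, dest, sources)` and the function name are invented for this example, and the memory-stage bypass for a load followed by a store of the loaded value is deliberately not modeled:

```python
def count_load_bubbles(program):
    """Count one bubble each time an instruction reads the destination
    of the load immediately before it."""
    bubbles = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "load" and prev_dest in cur_srcs:
            bubbles += 1
    return bubbles

original = [
    ("alu",  "R1", ("R4", "R5")),   # R1 <- R4 + R5
    ("load", "R2", ("R2",)),        # R2 <- Mem[R2 + I]
    ("alu",  "R3", ("R2", "R1")),   # R3 <- R2 + R1
]
# Same instructions with the independent add moved behind the load.
scheduled = [original[1], original[0], original[2]]
```

Moving the independent `R1 <- R4 + R5` into the slot after the load is the kind of reordering a compiler can do to avoid the stall without changing the result.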

Compiler Avoiding Load Stalls:

What about Interrupts, Traps, Faults?
External Interrupts:
  Allow the pipeline to drain, fill with NOPs
  Load PC with the interrupt address
Faults (within instruction, restartable):
  Force a trap instruction into IF
  Disable writes till the trap hits WB
  Must save multiple PCs or PC + state
Recall: Precise Exceptions => the state of the machine is preserved as if the program executed up to the offending instruction
  All previous instructions completed
  Offending instruction and all following instructions act as if they have not even started
  Same system code will work on different implementations

Exception/Interrupts: Implementation questions
5 instructions, executing in 5 different pipeline stages!
Who caused the interrupt?

Stage  Problem interrupts occurring
IF     Page fault on instruction fetch; misaligned memory access; memory-protection violation
ID     Undefined or illegal opcode
EX     Arithmetic exception
MEM    Page fault on data fetch; misaligned memory access; memory-protection violation; memory error

How do we stop the pipeline? How do we restart it?
Do we interrupt immediately or wait?
How do we sort all of this out to maintain preciseness?

Exception Handling
[Figure: pipeline (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) carrying an Excp field in each pipeline register, shown for lw $2,20($5)]
I mem: detect bad instruction address
Decode: detect bad instruction
alu: detect overflow
D mem: detect bad data address
Final stage: allow the exception to take effect

Another look at the exception problem
[Figure: four overlapped instructions (IFetch, Dcd, Exec, Mem, WB) plotted as program flow against time, raising a Data TLB fault, a Bad Inst, an Inst TLB fault, and an Overflow at different times]
Use the pipeline to sort this out!
  Pass exception status along with the instruction.
  Keep track of PCs for every instruction in the pipeline.
  Don’t act on an exception until it reaches the WB stage.
Handle interrupts through a “faulting no-op” in the IF stage.
When an instruction reaches the end of the MEM stage:
  Save PC => EPC, Interrupt vector addr => PC
  Turn all instructions in earlier stages into no-ops!
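The idea of letting the pipeline sort exceptions into program order can be sketched as follows. This is an illustrative model, not the lecture’s hardware: `in_flight` is an assumed list of in-flight instructions, oldest first, each carrying its PC and the exception status recorded when the fault was detected, and `handler_addr` is a hypothetical interrupt vector address:

```python
def resolve_exception(in_flight, handler_addr):
    """Act on the oldest faulting instruction only, regardless of the
    order in which the faults were detected in time."""
    for k, ins in enumerate(in_flight):
        if ins["excp"] is not None:
            return {
                "epc": ins["pc"],            # PC saved for restart
                "cause": ins["excp"],
                "next_pc": handler_addr,     # fetch resumes at the handler
                "squashed": [later["pc"] for later in in_flight[k + 1:]],
            }
    return None

# The younger instruction faulted first in time (at fetch), but the older
# instruction's data fault is the one that must be reported.
pipe = [
    {"pc": 0x100, "excp": "data TLB fault"},
    {"pc": 0x104, "excp": "inst TLB fault"},
    {"pc": 0x108, "excp": None},
]
outcome = resolve_exception(pipe, handler_addr=0x80)
```

Because status is only examined at one point near the end of the pipe, an instruction's fault is seen after every older instruction has had its chance to fault, which is what keeps the exception precise.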

Resolution: Freeze above & Bubble Below
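[Figure: pipeline (IAU, npc, PC, I mem, Regs, alu, D mem, Regs) with the stages above the stall point marked “freeze” and a “bubble” inserted below it; each pipeline register carries n, op, rw]
Flush accomplished by setting the “invalid” bit in the pipeline.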

FYI: MIPS R3000 clocking discipline
2-phase non-overlapping clocks (phi1, phi2)
Pipeline stage is two (level-sensitive) latches
[Figure: phi1 and phi2 waveforms; latches clocked by phi1, phi2, phi1, compared with an edge-triggered register]

MIPS R3000 Instruction Pipeline
Stages: Inst Fetch; Decode / Reg. Read; ALU / E.A; Memory; Write Reg
Operations: TLB, I-Cache; RF; Operation, E.A; TLB, D-Cache; WB
Resource Usage: TLB, I-cache, RF, ALU, TLB, D-Cache, WB
Write in phase 1, read in phase 2 => eliminates bypass from WB

Recall: Data Hazard on r1
[Pipeline diagram: instruction order against time (clock cycles), stages IF, ID/RF, EX, MEM, WB, each instruction passing through Im, Reg, ALU, Dm, Reg]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
With the MIPS R3000 pipeline, there is no need to forward from the WB stage.

MIPS R3000 Multicycle Operations
[Figure: pipeline registers carrying op, Rd, Ra, Rb, with a mul Rd Ra Rb instruction held in one stage, and R, T, Rd to the reg file]
Ex: Multiply, Divide, Cache Miss
Use the control word of the local stage to step through the multicycle operation.
Stall all stages above the multicycle operation in the pipeline.
Drain (bubble) the stages below it.
Alternatively, launch the multiply/divide to an autonomous unit and only stall the pipe on an attempt to get the result before it is ready.
  This means stalling mflo/mfhi in the decode stage if the multiply/divide is still executing.

Is CPI = 1 for our pipeline?
Remember that CPI is an “average # cycles/inst”.
CPI here is 1, since the average throughput is 1 instruction every cycle.
What if there are stalls or multi-cycle execution?
Usually CPI > 1. How close can we get to 1?
[Figure: overlapped IFetch, Dcd, Exec, Mem, WB pipeline diagram]

Recall: Compute CPI?
Start with base CPI
Add stalls
Suppose:
  CPI_base = 1
  freq_branch = 20%, freq_load = 30%
  Branches always cause a 1-cycle stall
  Loads cause a 100-cycle stall 1% of the time
Then: $\mathrm{CPI} = 1 + (1 \times 0.20) + (100 \times 0.30 \times 0.01) = 1.5$
Multicycle? Could treat as: $\mathrm{CPI}_{stall} = (\mathrm{CYCLES} - \mathrm{CPI}_{base}) \times \mathrm{freq}_{inst}$
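The arithmetic above can be reproduced with a few lines. The helper name `cpi` and the `(frequency, penalty)` pair format are choices made for this sketch:

```python
def cpi(base, stalls):
    """Base CPI plus the sum of (event frequency x stall cycles)."""
    return base + sum(freq * penalty for freq, penalty in stalls)

example = cpi(1.0, [
    (0.20, 1),            # branches: 20% of instructions, 1-cycle stall
    (0.30 * 0.01, 100),   # loads: 30% of instructions, 1% stall 100 cycles
])
```

The result is 1.5 up to floating-point rounding. The rare 100-cycle load stall contributes 0.3, more than the 0.2 from the stall taken on every branch.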

Case Study: MIPS R4000 (200 MHz)
8 Stage Pipeline:
IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access.
IS: second half of the access to the instruction cache.
RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection.
EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
DF: data fetch, first half of the access to the data cache.
DS: second half of the access to the data cache.
TC: tag check, determine whether the data cache access hit.
WB: write back for loads and register-register operations.
8 Stages: What is the impact on load delay? Branch delay? Why?

The answer is 3 stages between a branch and the new instruction fetch, and 2 stages between a load and its use (even though, looking at the inserted stages, it might seem to be 3 for the load and 2 for the branch). Reasons: 1) Load: TC just does the tag check, and the data is available after DS; thus the pipeline supplies the data and forwards it, restarting the pipeline on a data cache miss. 2) The EX phase still does the address calculation even though only one phase was added; the presumed reason is that, wanting a fast clock cycle, the designers did not want to stick the RF phase with both reading registers AND testing for zero, so they moved the test back one phase.

Case Study: MIPS R4000
TWO Cycle Load Latency
[Pipeline diagram: successive instructions, each one cycle behind the previous, moving through IF, IS, RF, EX, DF, DS, TC, WB]
THREE Cycle Branch Latency (conditions evaluated during EX phase)
[Pipeline diagram: successive instructions, each one cycle behind the previous, moving through IF, IS, RF, EX, DF, DS, TC, WB]
Delay slot plus two stalls
Branch likely cancels the delay slot if not taken

MIPS R4000 Floating Point
FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW
8 kinds of stages in FP units:

Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers

MIPS FP Pipe Stages

FP Instr        1  2    3    4    5  6  7    8  …
Add, Subtract   U  S+A  A+R  R+S
Multiply        U  E+M  M    M    M  N  N+A  R
Divide          U  A    R    D^28 … D+A, D+R, D+R, D+A, D+R, A, R
Square root     U  E    (A+R)^108 … A, R
Negate          U  S
Absolute value  U  S
FP compare      U  A    R

Stages:
M  First stage of multiplier
N  Second stage of multiplier
R  Rounding stage
S  Operand shift stage
U  Unpack FP numbers
A  Mantissa ADD stage
D  Divide pipeline stage
E  Exception test stage

R4000 Performance
Not ideal CPI of 1:
  Load stalls (1 or 2 clock cycles)
  Branch stalls (2 cycles + unfilled slots)
  FP result stalls: RAW data hazard (latency)
  FP structural stalls: not enough FP hardware (parallelism)
For integer programs, the major delay is branch stalls; for FP programs, it is structural stalls.

Summary
Hazards limit performance:
  Structural: need more HW resources
  Data: need forwarding, compiler scheduling
  Control: early evaluation & PC, delayed branch, prediction
Data hazards must be handled carefully:
  RAW data hazards handled by forwarding
  WAW and WAR hazards don’t exist in the 5-stage pipeline
MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load)
Exceptions in the 5-stage pipeline are recorded when they occur, but acted on only at the WB (end of MEM) stage
  Must flush all previous instructions
More performance from deeper pipelines, parallelism
