CS152 – Computer Architecture and Engineering Lecture 11 –

Dave Patterson (www.cs.berkeley.edu/~patterson)
CS152 – Computer Architecture and Engineering Lecture 11 – Pipeline Control Dave Patterson ( www-inst.eecs.berkeley.edu/~cs152/

Review: Pipelining Reduce CPI by overlapping many instructions
Average throughput of approximately 1 CPI with fast clock Utilize capabilities of the Datapath start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction

Recap: Ideal Pipelining
Assume instructions are completely independent! IF DCD EX MEM WB IF DCD EX MEM WB IF DCD EX MEM WB IF DCD EX MEM WB IF DCD EX MEM WB Maximum Speedup  Number of stages Speedup Time for unpipelined operation Time for longest stage Example: 40ns data path, 5 stages, Longest stage is 10 ns, Speedup 4

FYI: MIPS R3000 clocking discipline
phi1 phi2 2-phase non-overlapping clocks Pipeline stage is two (level sensitive) latches phi1 phi2 phi1 Edge-triggered

MIPS R3000 Instruction Pipeline
Decode Reg. Read Inst Fetch ALU / E.A Memory Write Reg TLB I-Cache RF Operation WB E.A TLB D-Cache TLB I-cache RF ALUALU D-Cache WB Resource Usage Write in phase 1, read in phase 2 => eliminates bypass from WB

Recall: Data Hazard on r1
Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11 With MIPS R3000 pipeline, no need to forward from WB stage

Clarification about clock edges in lab4!
Exec Reg. File Mem Access Data A B S M Reg Equal PC Next PC IR Inst. Mem Valid IRex Dcd Ctrl IRmem Ex Ctrl IRwb Mem Ctrl WB Ctrl D Since Register have edge-triggered write: Must have everything set up at end of memory stage This means that “M” register here is not necessary! Also, Memories will be synchronous Need to setup addresses and values in advance

MIPS R3000 Multicycle Operations
B op Rd Ra Rb mul Rd Ra Rb Rd to reg file R T Use control word of local stage to step through multicycle operation Stall all stages above multicycle operation in the pipeline Drain (bubble) stages below it Alternatively, launch multiply/divide to autonomous unit, only stall pipe if attempt to get result before ready - This means stall mflo/mfhi in decode stage if multiply/divide still executing - Extra credit in Lab 5 does this Ex: Multiply, Divide, Cache Miss

Recall: Single cycle control!
Ideal Instruction Memory Control Signals Instruction Conditions Rd Rs Rt 5 5 5 Instruction Address A Data Address Data Out Clk PC Rw Ra Rb ALU 32 32 32 Ideal Data Memory 32 32-bit Registers Next Address Data In B One thing you may noticed from our last slide is that almost all instructions, except Jump, require reading some registers, do some computation, and then do something else. Therefore our datapath will look something like this. For example, if we have an add instruction (points to the output of Instruction Memory), we will read the registers from the register file (Ra, Rb and then busA and busB). Add the two numbers together (ALU) and then write the result back to the register file. On the other hand, if we have a load instruction, we will first use the ALU to calculate the memory address. Once the address is ready, we will use it to access the Data Memory. And once the data is available on Data Memory’s output bus, we will write the data to the register file. Well, this is simple enough. But if it is this simple, you probably won’t need to take this class. So in today’s lecture, I will show you how to turn this abstract datapath into a real datapath by making it slightly (JUST slightly) more complicated so it can do real work for you. But before we do that, let’s do a quick review of the clocking methodology +3 = 16 (X:56) Clk Clk 32 Datapath

Data Stationary Control
The Main Control generates the control signals during Reg/Dec Control signals for Exec (ExtOp, ALUSrc, ...) are used 1 cycle later Control signals for Mem (MemWr Branch) are used 2 cycles later Control signals for Wr (MemtoReg MemWr) are used 3 cycles later Reg/Dec Exec Mem Wr ExtOp ExtOp ALUSrc ALUSrc ALUOp ALUOp Main Control RegDst RegDst IF/ID Register The main control here is identical to the one in the single cycle processor. It generate all the control signals necessary for a given instruction during that instruction’s Reg/Decode stage. All these control signals will be saved in the ID/Exec pipeline register at the end of the Reg/Decode cycle. The control signals for the Exec stage (ALUSrc, ... etc.) come from the output of the ID/Exec register. That is they are delayed ONE cycle from the cycle they are generated. The rest of the control signals that are not used during the Exec stage is passed down the pipeline and saved in the Exec/Mem register. The control signals for the Mem stage (MemWr, Branch) come from the output of the Exec/Mem register. That is they are delayed two cycles from the cycle they are generated. Finally, the control signals for the Wr stage (MemtoReg & RegWr) come from the output of the Exec/Wr register: they are delayed three cycles from the cycle they are generated. +2 = 45 min. (Y:45) Ex/Mem Register ID/Ex Register Mem/Wr Register MemWr MemWr MemWr Branch Branch Branch MemtoReg MemtoReg MemtoReg MemtoReg RegWr RegWr RegWr RegWr

Datapath + Data Stationary Control
IR v v v fun rw rw rw wb wb wb Inst. Mem Decode rt me me WB Ctrl rs Mem Ctrl ex op im rs rt Reg. File Reg File A M S Exec B Mem Access D Data Mem PC Next PC

Lab 4 Project document Thursday 9 PM paper or email
Administrivia Lab 4 Project document Thursday 9 PM paper or Reading Chapter 6, sections 6.1 to 6.5 Midterm Wed Oct 8 5:30 - 8:30 in 1 LeConte Midterm review Sunday Oct 4, 5 PM in 306 Soda Bring 1 page, handwritten notes, both sides Meet at LaVal’s Northside afterwards for Pizza Office hours Mon 4 – 5:30 Jack, Tue 3:30-5 Kurt, Wed 3 – 4:30 John, Thu 3:30-5 Ben Dave’s office hours Tue 3:30 – 5

Let’s Try it Out 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5
24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 these addresses are octal

Start: Fetch 10 n n n n Inst. Mem Decode WB Ctrl Mem Ctrl PC Next PC
= IR im rs rt Reg. File Reg File A M S Exec B Mem Access D Data Mem IF 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15

Fetch 14, Decode 10 n n n Inst. Mem Decode WB Ctrl Mem Ctrl PC Next PC
lw r1, r2(35) Decode WB Ctrl Mem Ctrl PC Next PC 14 = IR im 2 rt Reg. File Reg File A M S Exec B Mem Access D Data Mem ID 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 IF

Fetch 20, Decode 14, Exec 10 n n Inst. Mem Decode WB Ctrl Mem Ctrl PC
addI r2, r2, 3 Decode WB Ctrl lw r1 Mem Ctrl PC Next PC 20 = IR 2 rt 35 Reg. File Reg File M r2 S Exec B Mem Access D Data Mem EX 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 ID IF

Fetch 24, Decode 20, Exec 14, Mem 10 n Inst. Mem Decode WB Ctrl Mem
sub r3, r4, r5 Decode addI r2, r2, 3 WB Ctrl lw r1 Mem Ctrl PC Next PC 24 = IR 4 5 3 Reg. File Reg File M r2 r2+35 Exec B Mem Access D Data Mem M 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 EX ID IF

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10 Inst. Mem Decode WB Ctrl Mem
beq r6, r7 100 Inst. Mem Decode addI r2 WB Ctrl sub r3 lw r1 Mem Ctrl PC Next PC 30 = IR 6 7 Reg. File Reg File r4 M[r2+35] r2+3 Exec r5 Mem Access D Data Mem ID IF EX M WB 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 Note Delayed Branch: always execute ori after beq

Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14 Inst. Mem Decode WB Ctrl Mem
ori r8, r9 17 Decode addI r2 WB Ctrl sub r3 beq Mem Ctrl PC Next PC 100 = IR 9 xx 100 Reg. File r1=M[r2+35] Reg File r6 r2+3 r4-r5 Exec r7 Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 WB M EX ID IF

Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20 Inst. Mem Decode ? WB Mem
Ctrl Mem Ctrl PC Next PC ___ = IR Reg. File Reg File Exec Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 EX M WB Fill it in yourself! ID

Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24 ? ? Inst. Mem Decode WB Ctrl
PC Next PC ___ = IR ? Reg. File Reg File ? Exec ? Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 WB M Fill it in yourself! EX

Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30 ? ? ? Inst. Mem Decode WB
Ctrl Mem Ctrl PC Next PC ___ = IR ? Reg. File Reg File ? ? Exec Mem Access D Data Mem 10 lw r1, r2(35) 14 addI r2, r2, 3 20 sub r3, r4, r5 24 beq r6, r7, 100 30 ori r8, r9, 17 34 add r10, r11, r12 100 and r13, r14, 15 WB Fill it in yourself! M

Pipelined Processor Bubbles Stalls Separate control at each stage
Exec Reg. File Mem Access Data A B S M Reg Equal PC Next PC IR Inst. Mem Valid IRex Dcd Ctrl IRmem Ex Ctrl IRwb Mem Ctrl WB Ctrl D Stalls Separate control at each stage Stalls propagate backwards to freeze previous stages Bubbles in pipeline introduced by placing “Noops” into local stage, stall previous stages.

Recap: Data Hazards Avoid some “by design”
eliminate WAR by always fetching operands early (DCD) in pipe eleminate WAW by doing all WBs in order (last stage, static) Detect and resolve remaining ones stall or forward (if possible) IF DCD EX Mem WB IF DCD OF Ex Mem RAW Data Hazard WAW Data Hazard IF DCD OF Ex RS WAR Data Hazard IF DCD EX Mem WB

Hazard Detection Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline. A RAW hazard exists on register if Rregs( i ) Wregs( j ) Keep a record of pending writes (for inst's in the pipe) and compare with operand regs of current instruction. When instruction issues, reserve its result register. When on operation completes, remove its write reservation. A WAW hazard exists on register if Wregs( i ) Wregs( j ) A WAR hazard exists on register if Wregs( i ) Rregs( j ) Window on execution: Only pending instructions can cause exceptions Inst J Inst I New Inst Instruction Movement:

Record of Pending Writes In Pipeline Registers
IAU npc Current operand registers Pending writes hazard <= ((rs == rwex) & regWex) OR ((rs == rwmem) & regWme) OR ((rs == rwwb) & regWwb) OR ((rt == rwex) & regWex) OR ((rt == rwmem) & regWme) OR ((rt == rwwb) & regWwb) I mem Regs op rw rs rt PC im n op rw B A alu n op rw S D mem m n op rw Regs

Resolve RAW by “forwarding” (or bypassing)
IAU Detect nearest valid write op operand register and forward into op latches, bypassing remainder of the pipe Increase muxes to add paths from pipeline registers Data Forwarding = Data Bypassing npc I mem Regs op rw rs rt PC Forward mux im n op rw B A alu n op rw S D mem m n op rw Regs

What about memory operations?
If instructions are initiated in order and operations always occur in the same stage, there can be no hazards between memory operations! What does delaying WB on arithmetic operations cost? – cycles ? – hardware ? What about data dependence on loads? R1 <- R4 + R5 R2 <- Mem[ R2 + I ] R3 <- R2 + R1  “Delayed Loads” Can recognize this in decode stage and introduce bubble while stalling fetch stage (hint for lab 5!) Tricky situation: R1 <- Mem[ R2 + I ] Mem[R3+34] <- R1 Handle with bypass in memory stage! op Rd Ra Rb op Rd Ra Rb A B Rd D R Mem Rd T to reg file

Compiler Avoiding Load Stalls:

Question: Critical Path???
Bypass path is invariably trouble Options? Make logic really fast Move forwarding after muxes Problem: screws up branches that require forwarding! Use same tricks as “carry-skip” adder to fix this? This option may just push delay around….! Insert an extra cycle for branches that need forwarding? Or: hit common case of forwarding from EX stage and stall for forward from memory? Regs PC Sel Forward mux Equal im B A alu S D mem m Regs

Is CPI = 1 for our pipeline?
Remember that CPI is an “Average # cycles/inst CPI here is 1, since the average throughput is 1 instruction every cycle. What if there are stalls or multi-cycle execution? Usually CPI > 1. How close can we get to 1?? IFetch Dcd Exec Mem WB

Recall: Compute CPI? Start with Base CPI Add stalls Suppose:
CPIbase=1 Freqbranch=20%, freqload=30% Suppose branches always cause 1 cycle stall Loads cause a 100 cycle stall 1% of time Then: CPI = 1 + (10.20)+(100  0.300.01)=1.5 Multicycle? Could treat as: CPIstall=(CYCLES-CPIbase)  freqinst

Case Study: MIPS R4000 (200 MHz)
8 Stage Pipeline: IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. IS–second half of access to instruction cache. RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection. EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. DF–data fetch, first half of access to data cache. DS–second half of access to data cache. TC–tag check, determine whether the data cache access hit. WB–write back for loads and register-register operations. 8 Stages: What is impact on Load delay? Branch delay? Why? Answer is 3 stages between branch and new instruction fetch and 2 stages between load and use (even though if looked at red insertions that it would be 3 for load and 2 for branch) Reasons: 1) Load: TC just does tag check, data available after DS; thus supply the data & forward it, restarting the pipeline on a data cache miss 2) EX phase does the address calculation even though just added one phase; presumed reason is that since want fast clockc cycle don’t want to sitck RF phase with reading regisers AND testing for zero, so just moved it back on phase

Case Study: MIPS R4000 WB TC DS DF EX RF IS IF TWO Cycle Load Latency
THREE Cycle Branch Latency (conditions evaluated during EX phase) Delay slot plus two stalls Branch likely cancels delay slot if not taken

MIPS R4000 Floating Point FP Adder, FP Multiplier, FP Divider
Last step of FP Multiplier/Divider uses FP Adder HW 8 kinds of stages in FP units: Stage Functional unit Description A FP adder Mantissa ADD stage D FP divider Divide pipeline stage E FP multiplier Exception test stage M FP multiplier First stage of multiplier N FP multiplier Second stage of multiplier R FP adder Rounding stage S FP adder Operand shift stage U Unpack FP numbers

MIPS FP Pipe Stages FP Instr 1 2 3 4 5 6 7 8 …
Add, Subtract U S+A A+R R+S Multiply U E+M M M M N N+A R Divide U A R D28 … D+A D+R, D+R, D+A, D+R, A, R Square root U E (A+R)108 … A R Negate U S Absolute value U S FP compare U A R Stages: M First stage of multiplier N Second stage of multiplier R Rounding stage S Operand shift stage U Unpack FP numbers A Mantissa ADD stage D Divide pipeline stage E Exception test stage

R4000 Performance Not ideal CPI of 1: FP structural stalls: Not enough FP hardware (parallelism) FP result stalls: RAW data hazard (latency) Branch stalls (2 cycles + unfilled slots) Load stalls (1 or 2 clock cycles) Integer programs, major delay is branch stalls FP its structural stalls

Summary What makes it easy Hazards limit performance
all instructions are the same length just a few instruction formats memory operands appear only in loads and stores Hazards limit performance Structural: need more HW resources Data: need forwarding, compiler scheduling Control: early evaluation & PC, delayed branch, prediction Data hazards must be handled carefully: RAW data hazards handled by forwarding WAW and WAR hazards don’t exist in 5-stage pipeline MIPS I instruction set architecture made pipeline visible (delayed branch, delayed load) More performance from deeper pipelines, parallelism

CS152 – Computer Architecture and Engineering Lecture 11 –

Similar presentations

Presentation on theme: "CS152 – Computer Architecture and Engineering Lecture 11 –"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS152 – Computer Architecture and Engineering Lecture 11 –

Similar presentations

Presentation on theme: "CS152 – Computer Architecture and Engineering Lecture 11 –"— Presentation transcript:

Similar presentations

About project

Feedback