CMSC 411 Computer Architecture
Lecture 20: Pipelined Datapath and Control
April 11, CMSC411.htm
Mohamed Younis
Lecture's Overview
Previous lecture: an overview of pipelining
- The pipelining concept is natural: start handling the next instruction while the current one is still in progress
- Pipeline performance: performance improves by increasing instruction throughput; the ideal upper bound on speedup is the number of pipeline stages
- Pipeline hazards: structural, data, and control hazards, and techniques for resolving them
This lecture:
- Designing a pipelined datapath
- Controlling pipeline operations
Multi-stage Instruction Execution
Stages of Instruction Execution
Load: Ifetch | Reg/Dec | Exec | Mem | Wr  (Cycle 1 through Cycle 5)
The load instruction is the longest. Every instruction follows at most these five steps:
- Ifetch: instruction fetch; fetch the instruction from the instruction memory
- Reg/Dec: register fetch and instruction decode
- Exec: calculate the memory address
- Mem: read the data from the data memory
- Wr: write the data back to the register file
Each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.
* Slide is courtesy of Dave Patterson
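As a concrete restatement (my own sketch, not from the slides), the following Python walks a single load through the five stages, one stage per clock cycle. The stage names match the slide; everything else is illustrative.

```python
# Minimal sketch: the five execution stages of a MIPS-style load, one per cycle.
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

STAGE_ACTIONS = {
    "Ifetch":  "fetch the instruction from instruction memory",
    "Reg/Dec": "read the source registers and decode the instruction",
    "Exec":    "compute the effective memory address in the ALU",
    "Mem":     "read the data word from data memory",
    "Wr":      "write the loaded data back to the register file",
}

# One stage per clock cycle: a lone load occupies cycles 1..5.
for cycle, stage in enumerate(STAGES, start=1):
    print(f"cycle {cycle}: {stage:7s} - {STAGE_ACTIONS[stage]}")
```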
Instruction Pipelining
Start handling the next instruction while the current instruction is still in progress. Pipelining is feasible because different hardware units are used at different stages of instruction execution (IFetch, Dec, Exec, Mem, and WB overlap across instructions as program flow advances over time). Pipelining improves performance by increasing instruction throughput.
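To quantify the throughput claim, here is a small Python sketch assuming an ideal five-stage pipeline with no hazards, where a new instruction issues every cycle; the function and variable names are mine, not from the lecture.

```python
def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Total cycles on an ideal pipeline: fill time, then one completion per cycle."""
    return n_stages + (n_instructions - 1)

def speedup(n_instructions: int, n_stages: int) -> float:
    """Speedup over a multi-cycle machine taking n_stages cycles per instruction."""
    return (n_instructions * n_stages) / pipelined_cycles(n_instructions, n_stages)

# The speedup approaches the number of stages (5 here) as the instruction count grows.
for n in (3, 100, 1_000_000):
    print(n, round(speedup(n, 5), 3))
```

With only 3 instructions the speedup is 15/7, roughly 2.1, but it approaches the number of stages as the instruction count grows, which matches the "ideal speedup equals the number of stages" bound from the previous lecture.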
Pipelined Datapath: Data Stationary
Datapath for Pipelined Processor
[Datapath diagram: the instruction memory feeds staged copies of the instruction register (IR, IRex, IRmem, IRwb) with per-stage control blocks (Dcd Ctrl, Ex Ctrl, Mem Ctrl, WB Ctrl); the register file (A, B), the Exec unit (S), the data-memory access stage (M), and the PC / Next PC logic are separated by pipeline registers. This organization is easy to read.]
* Slide is courtesy of Dave Patterson
Control and Datapath: register-transfer view
All instructions: IR <- Mem[PC]; PC <- PC + 4; A <- R[rs]; B <- R[rt]
R-type: S <- A + B;  R[rd] <- S
Load:   S <- A + SX; M <- Mem[S]; R[rd] <- M
Ori:    S <- A or ZX; R[rt] <- S
Store:  Mem[S] <- B
Branch: if Cond, PC <- PC + SX
[Datapath diagram: PC / Next PC, instruction memory, IR, register file (A, B), Exec (S), memory access (M), data memory (D), register write-back.]
* Slide is courtesy of Dave Patterson
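A rough Python rendering (my own sketch, not the lecture's) of the register-transfer steps above, one branch per instruction class. The destination fields follow the RTL exactly as written on the slide (for example, the load writes R[rd]); the dictionary encoding, the function names, and the explicit address calculation for sw are my additions.

```python
def sign_extend(imm16):   # SX: sign-extend a 16-bit immediate
    return imm16 - (1 << 16) if imm16 & 0x8000 else imm16

def zero_extend(imm16):   # ZX: zero-extend a 16-bit immediate
    return imm16 & 0xFFFF

def execute(instr, R, mem, pc):
    """Apply one instruction's register transfers; returns the next PC."""
    A, B = R[instr["rs"]], R[instr["rt"]]        # A <- R[rs]; B <- R[rt]
    next_pc = pc + 4                             # PC <- PC + 4
    op = instr["op"]
    if op == "add":                              # R-type: S <- A + B; R[rd] <- S
        R[instr["rd"]] = A + B
    elif op == "lw":                             # S <- A + SX; M <- Mem[S]; R[rd] <- M
        R[instr["rd"]] = mem[A + sign_extend(instr["imm"])]
    elif op == "ori":                            # S <- A or ZX; R[rt] <- S
        R[instr["rt"]] = A | zero_extend(instr["imm"])
    elif op == "sw":                             # S <- A + SX; Mem[S] <- B
        mem[A + sign_extend(instr["imm"])] = B
    elif op == "beq":                            # if Cond: PC <- PC + SX
        if A == B:
            next_pc = pc + sign_extend(instr["imm"])
    return next_pc

# Example: lw r1, r2(35) with r2 = 100 loads mem[135] into r1.
R = {i: 0 for i in range(32)}
R[2] = 100
mem = {135: 42}
execute({"op": "lw", "rs": 2, "rt": 0, "rd": 1, "imm": 35}, R, mem, pc=12)
print(R[1])   # 42
```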
Pipelining the same Instruction
Clock: Cycle 1 through Cycle 7.
1st lw: Ifetch Reg/Dec Exec Mem Wr
2nd lw:   Ifetch Reg/Dec Exec Mem Wr
3rd lw:     Ifetch Reg/Dec Exec Mem Wr
(An R-type instruction, for comparison, takes only four stages: Ifetch, Reg/Dec, Exec, Wr.)
For the load instruction, the five independent functional units in the pipelined datapath are: (a) the instruction memory for the Ifetch stage, (b) the register file's read ports for the Reg/Dec stage, (c) the ALU for the Exec stage, (d) the data memory for the Mem stage, and (e) the register file's write port for the Wr stage. The register file's read and write ports are treated as separate functional units because the register file allows a read and a write in the same cycle.
As soon as the 1st load finishes its Ifetch stage, it no longer needs the instruction memory, so the 2nd load can start its own Ifetch. Since each functional unit is used only once per instruction, there is no conflict further down the pipeline either (Exec-Ifetch, Mem-Exec, Wr-Mem). If these 3 load instructions were executed on the multi-cycle processor they would take 15 cycles; with pipelining they take only 7. The better way to look at the performance advantage is that one instruction enters the pipeline every cycle, so one instruction leaves the pipeline (Wr stage) every cycle. The effective (average) number of cycles per instruction is therefore one, even though each individual instruction still takes 5 cycles to complete.
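The 15-versus-7 cycle count above can be reproduced with a short sketch (my own illustration, assuming one issue per cycle and no hazards):

```python
STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]

def schedule(n_instructions, stages=STAGES):
    """Issue one instruction per cycle: instruction i (0-based) is in stage s during cycle i + s + 1."""
    return [{i + s + 1: stage for s, stage in enumerate(stages)} for i in range(n_instructions)]

rows = schedule(3)
print("total cycles:", max(max(row) for row in rows))   # 7, versus 3 * 5 = 15 unpipelined
for i, row in enumerate(rows, 1):
    print(f"lw #{i}:", "  ".join(f"C{c}:{stage}" for c, stage in sorted(row.items())))
```

From cycle 5 onward one instruction leaves the Wr stage every cycle, which is why the effective CPI approaches one.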
Pipelining R-type & Load Instructions
Clock: Cycle 1 through Cycle 9.
R-type: Ifetch Reg/Dec Exec Wr
R-type:   Ifetch Reg/Dec Exec Wr
Load:       Ifetch Reg/Dec Exec Mem Wr
R-type:       Ifetch Reg/Dec Exec Wr
R-type:         Ifetch Reg/Dec Exec Wr
Oops! We have a problem. What happens if we try to pipeline R-type instructions together with load instructions? Two instructions end up trying to write to the register file at the same time. This is a pipeline conflict, a structural hazard: two instructions try to write to the register file in the same cycle, but there is only one write port.
* Slide is courtesy of Dave Patterson
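The conflict can be seen by simply counting write cycles. The following sketch is my own illustration, not from the slides; for simplicity it issues the 5-stage load first and the 4-stage R-types after it, but the collision is the same one the diagram shows, between the load and the R-type that follows it.

```python
LOAD   = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]
R_TYPE = ["Ifetch", "Reg/Dec", "Exec", "Wr"]          # one stage shorter: writes a cycle earlier

def write_cycle(issue_cycle, stages):
    """Cycle in which an instruction issued at issue_cycle reaches its Wr stage."""
    return issue_cycle + stages.index("Wr")

program = [LOAD, R_TYPE, R_TYPE, R_TYPE]              # issued back to back in cycles 1, 2, 3, 4
writers = {}
for issue, stages in enumerate(program, start=1):
    writers.setdefault(write_cycle(issue, stages), []).append(issue)

for cycle, who in sorted(writers.items()):
    if len(who) > 1:
        print(f"cycle {cycle}: instructions {who} both need the single register-file write port")
```

The load issued in cycle 1 and the R-type issued in cycle 2 both reach Wr in cycle 5.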
Solution 1: Insert “Bubble” into Pipeline
Clock: Cycle 1 through Cycle 9; the load goes Ifetch, Reg/Dec, Exec, Mem, Wr, and a pipeline bubble is inserted before the R-type instructions that follow.
- Insert a "bubble" into the pipeline to prevent two writes to the register file in the same cycle
- The control logic to accomplish this can be complex
- An instruction-fetch slot and some speedup opportunity are lost: no instruction is started in Cycle 6!
The bubble pushes back, by one cycle, every instruction behind the load that is already in the pipeline, and it delays by one cycle the fetch of the instruction that is about to enter the pipeline. Because of the extra Mem stage the load has, one instruction no longer finishes every cycle; in effect the load instruction has a CPI of 2, so a mix of load and R-type instructions will not average a CPI of 1. Not a great idea, so let's try something else.
* Slide is courtesy of Dave Patterson
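A quick arithmetic sketch (my own, with an assumed 30% load mix) of why Solution 1 hurts: every load consumes an extra issue slot, so the average CPI rises above one.

```python
def average_cpi(n_loads, n_others):
    """Issue slots consumed divided by instructions: each load costs one extra (bubble) slot."""
    return (2 * n_loads + n_others) / (n_loads + n_others)

print(average_cpi(n_loads=30, n_others=70))   # 1.3 for a 30% load mix
```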
Solution 2: Delay Write by One Cycle
Delay the R-type instruction's register write by one cycle:
- Now R-type instructions also use the register file's write port in stage 5
- The Mem stage becomes a NOP stage for R-type instructions: nothing is done there
Stages 1-5 for R-type: Ifetch, Reg/Dec, Exec, Mem (NOP), Wr; over Cycle 1 through Cycle 9, loads and R-types now overlap cleanly.
Adding a NOP stage to the R-type pipeline delays its register-file write by one cycle, so the R-type instruction also uses the register file's write port in its 5th stage, eliminating the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned, and performance-wise we are back to completing one instruction per cycle. Although each individual R-type instruction now takes 5 cycles instead of 4, overall performance is better because the pipeline is more efficient.
* Slide is courtesy of Dave Patterson
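Repeating the write-port count from the structural-hazard sketch above, but with every instruction padded to the same five stages (the R-type's Mem stage is a NOP), shows that no two instructions write in the same cycle. Again, this is my own illustration.

```python
FIVE_STAGES = ["Ifetch", "Reg/Dec", "Exec", "Mem", "Wr"]   # R-type's Mem stage is a NOP
program = [FIVE_STAGES] * 4                                # a load followed by three R-types

write_cycles = [issue + FIVE_STAGES.index("Wr") for issue in range(1, len(program) + 1)]
assert len(set(write_cycles)) == len(write_cycles), "two instructions share the write port"
print(write_cycles)                                        # [5, 6, 7, 8]: one writer per cycle
```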
Modified Control & Datapath
All instructions: IR <- Mem[PC]; PC <- PC + 4; A <- R[rs]; B <- R[rt]
Exec stage:  R-type: S <- A + B;  Ori: S <- A or ZX;  Load/Store: S <- A + SX;  Branch: if Cond, PC <- PC + SX
Mem stage:   R-type: M <- S;  Ori: M <- S;  Load: M <- Mem[S];  Store: Mem[S] <- B
Wr stage:    R-type: R[rd] <- M;  Ori: R[rt] <- M;  Load: R[rd] <- M
[Datapath diagram: PC / Next PC, instruction memory, IR, register file (A, B), Exec (S), memory access (M), data memory (D), register write-back.]
* Slide is courtesy of Dave Patterson
Standardized Five-Stage Instructions
         Cycle 1  Cycle 2  Cycle 3  Cycle 4  Cycle 5
Load:    Ifetch   Reg/Dec  Exec     Mem      Wr
Store:   Ifetch   Reg/Dec  Exec     Mem      Wr
R-type:  Ifetch   Reg/Dec  Exec     Mem      Wr
Beq:     Ifetch   Reg/Dec  Exec     Mem      Wr
Ori:     Ifetch   Reg/Dec  Exec     Mem      Wr
Data Stationary Control
The main control generates all control signals during Reg/Dec:
- Control signals for Exec (ExtOp, ALUSrc, ALUOp, RegDst) are used 1 cycle later
- Control signals for Mem (MemWr, Branch) are used 2 cycles later
- Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
Pipeline registers: IF/ID, ID/Ex, Ex/Mem, Mem/Wr.
The main control here is identical to the one in the single-cycle processor. It generates all the control signals needed by a given instruction during that instruction's Reg/Dec stage, and they are saved in the ID/Ex pipeline register at the end of that cycle. The control signals for the Exec stage (ALUSrc, etc.) come from the output of the ID/Ex register; that is, they are delayed one cycle from the cycle in which they were generated. The control signals that are not used during Exec are passed down the pipeline and saved in the Ex/Mem register; the Mem-stage signals (MemWr, Branch) come from the output of the Ex/Mem register, delayed two cycles. Finally, the Wr-stage signals (MemtoReg and RegWr) come from the output of the Mem/Wr register, delayed three cycles from the cycle in which they were generated.
* Slide is courtesy of Dave Patterson
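A minimal Python model of data-stationary control, assuming only the signal and register names listed above; the main_control values and the opcode strings are invented for illustration. Each instruction's control bundle is produced once in Reg/Dec and then shifted through the ID/Ex, Ex/Mem, and Mem/Wr registers, so each group of signals is consumed 1, 2, or 3 cycles later.

```python
EX_SIGNALS  = ["ExtOp", "ALUSrc", "ALUOp", "RegDst"]   # consumed 1 cycle after Reg/Dec
MEM_SIGNALS = ["MemWr", "Branch"]                      # consumed 2 cycles after Reg/Dec
WB_SIGNALS  = ["MemtoReg", "RegWr"]                    # consumed 3 cycles after Reg/Dec

def main_control(opcode):
    """Stand-in decoder: one bundle of control signals (the values here are invented)."""
    bundle = {name: 0 for name in EX_SIGNALS + MEM_SIGNALS + WB_SIGNALS}
    if opcode == "lw":
        bundle.update(ALUSrc=1, ALUOp="add", MemtoReg=1, RegWr=1)
    return bundle

# Pipeline registers between stages; every cycle each bundle moves one register down the pipe.
pipe = {"ID/Ex": None, "Ex/Mem": None, "Mem/Wr": None}
for cycle, opcode in enumerate(["lw", "add", "sub", None, None], start=1):
    pipe["Mem/Wr"], pipe["Ex/Mem"] = pipe["Ex/Mem"], pipe["ID/Ex"]      # shift older bundles down
    pipe["ID/Ex"] = (opcode, main_control(opcode)) if opcode else None  # decode the newcomer
    holders = {reg: (slot[0] if slot else "-") for reg, slot in pipe.items()}
    print(f"cycle {cycle}: {holders}")
```

Running it shows the lw's bundle sitting in ID/Ex one cycle after decode, in Ex/Mem the next cycle, and in Mem/Wr the cycle after that, exactly the 1/2/3-cycle delays listed above.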
Datapath + Data Stationary Control
[Datapath diagram with data-stationary control: the instruction register's fields (op, fun, rs, rt, im) are decoded once, and the resulting per-stage control bundles (ex, me, wb) travel with the instruction through the pipeline registers together with the register-write destination (rw) and valid bits (v); the datapath itself is the familiar Inst. Mem, Reg. File (A, B), Exec (S), Mem Access (M, Data Mem, D), register write-back, and PC / Next PC.]
* Slide is courtesy of Dave Patterson
An example. Start: Fetch 12
Program:
12  lw   r1, r2(35)
16  addI r2, r2, 3
20  sub  r3, r4, r5
24  beq  r6, r7, 100
28  ori  r8, r9, 17
32  add  r10, r11, r12
100 and  r13, r14, r15
Cycle 1: the PC is 12, so the lw at address 12 is being fetched (IF); the decode, Exec, Mem, and Wr stages still hold NOPs.
An example. Fetch 16, Decode 12
Cycle 2: the PC is 16, so addI r2, r2, 3 is fetched (IF) while lw r1, r2(35) is decoded (ID): source register r2 is read and the immediate field is extracted into the ID/Ex register. The Exec, Mem, and Wr stages still hold NOPs.
An example. Fetch 20, Decode 16, Exec 12
Cycle 3: sub r3, r4, r5 is fetched (IF); addI r2, r2, 3 is decoded (ID), reading r2; the lw is in Exec, where the ALU computes the address r2 + 35.
An example. Fetch 24, Decode 20, Exec 16, Mem 12
Cycle 4: beq r6, r7, 100 is fetched (IF); sub r3, r4, r5 is decoded (ID), reading r4 and r5; the addI is in Exec computing r2 + 3; the lw is in Mem, accessing data memory at address r2 + 35.
An example. Fetch 28, Dcd 24, Ex 20, Mem 16, WB 12
Note: with the delayed branch, the ori that follows the beq is always executed.
Cycle 5: ori r8, r9, 17 is fetched (IF); beq r6, r7, 100 is decoded (ID), reading r6 and r7; the sub is in Exec with operands r4 and r5; the addI is in Mem carrying r2 + 3; the lw is in Wr, writing M[r2+35] back to r1.
An example. Fetch 32, Dcd 28, Ex 24, Mem 20, WB 16
Cycle 6: add r10, r11, r12 is fetched (IF); ori r8, r9, 17 is decoded (ID), reading r9; the beq is in Exec comparing r6 and r7 (with target 100); the sub is in Mem carrying r4 - r5; the addI is in Wr, writing r2 + 3 back to r2 (r1 = M[r2+35] has already been written).
An example. Fetch 100, Dcd 32, Ex 28, Mem 24, WB 20
Cycle 7: the branch is taken, so and r13, r14, r15 at address 100 is fetched (IF); add r10, r11, r12 is decoded (ID), reading r11 and r12; the ori is in Exec computing r9 | 17; the beq is in Mem; the sub is in Wr, writing r4 - r5 back to r3. Completed so far: r1 = M[r2+35], r2 = r2+3.
An example. Fetch 104, Dcd 100, Ex 32, Mem 28, WB 24
Cycle 8: the instruction at 104 is fetched (IF); and r13, r14, r15 is decoded (ID), reading r14 and r15; the add is in Exec with operands r11 and r12; the ori is in Mem carrying r9 | 17; the beq is in Wr (it writes no register). Completed so far: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5.
An example. Fetch 108, Dcd 104, Ex 100, Mem 32, WB 28
Cycle 9: the instruction at 108 is fetched (IF); the instruction at 104 is decoded (ID); the and is in Exec with operands r14 and r15; the add is in Mem carrying r11 + r12; the ori is in Wr, writing r9 | 17 back to r8. Completed so far: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5.
An example. Fetch 112, Dcd 108, Ex 104, Mem 100, WB 32
Cycle 10: the instruction at 112 is fetched (IF); the instruction at 108 is decoded (ID); the instruction at 104 is in Exec; the and is in Mem carrying r14 & r15; the add is in Wr, writing r11 + r12 back to r10. Completed so far: r1 = M[r2+35], r2 = r2+3, r3 = r4-r5, r8 = r9 | 17.
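The whole walkthrough can be compressed into a short sketch (my own) that replays the stage occupancy printed in the slide headings. The fetch order is taken from the slides as given, i.e., the beq redirects fetch to 100 after the ori in the delay slot; no hazard or branch logic is modeled here, only the shifting of addresses down the five stages.

```python
# Replay the per-cycle stage occupancy of the example above.
fetch_order = [12, 16, 20, 24, 28, 32, 100, 104, 108, 112]
pipe = {"IF": None, "ID": None, "EX": None, "MEM": None, "WB": None}

for cycle, addr in enumerate(fetch_order, start=1):
    # Advance every instruction one stage, then fetch the next address.
    pipe["WB"], pipe["MEM"], pipe["EX"], pipe["ID"] = (
        pipe["MEM"], pipe["EX"], pipe["ID"], pipe["IF"])
    pipe["IF"] = addr
    occupied = ", ".join(f"{stage} {a}" for stage, a in pipe.items() if a is not None)
    print(f"cycle {cycle}: {occupied}")

# Cycle 5 prints: IF 28, ID 24, EX 20, MEM 16, WB 12,
# matching "Fetch 28, Dcd 24, Ex 20, Mem 16, WB 12" on the slide above.
```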
Conclusion
Summary:
- Designing a pipelined datapath: standardized multi-stage instruction execution with unique resources per stage
- Pipeline control: simplified stage-based control
- A detailed working example
Next lecture:
- Pipeline hazard detection
- Handling hazards in the pipeline design
(We will take a break and talk about class philosophy.)
Reading assignment: Sections 6.2 and 6.3 in the textbook.