Presentation transcript:

CS152 Computer Architecture and Engineering
Lecture 13: Introduction to Pipelining II: Datapath and Control
October 13, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
Lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
10/13/99 ©UCB Fall 1999

Recap: Sequential Laundry
(Figure: timeline from 6 PM to midnight; loads A, B, C, D each go through wash 30, dry 40, fold 20 minutes, one load after another.)
Sequential laundry takes 6 hours for 4 loads
If they learned pipelining, how long would laundry take?
10/13/99 ©UCB Fall 1999

Recap: Pipelining Lessons
(Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D overlap in the 30/40/20-minute washer, dryer, and folder stages.)
Pipelining doesn't help latency of a single task; it helps throughput of the entire workload
Pipeline rate limited by slowest pipeline stage
Multiple tasks operating simultaneously using different resources
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to "fill" the pipeline and time to "drain" it reduce speedup
Stall for dependences
10/13/99 ©UCB Fall 1999

Recap: Ideal Pipelining
Assume instructions are completely independent!
(Figure: five instructions, each going through IF, DCD, EX, MEM, WB, overlapped one stage apart.)
Speedup ≤ time for unpipelined operation / time for longest stage
Maximum speedup ≤ number of stages
Example: 40 ns datapath, 5 stages, longest stage is 10 ns => speedup = 4
10/13/99 ©UCB Fall 1999
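To make the slide's numbers concrete, here is a minimal Python sketch of the same arithmetic; the 40 ns total and 10 ns longest stage come from the slide, while the particular 5-stage split is an assumed example.

```python
def pipeline_speedup(unpipelined_ns, stage_latencies_ns):
    """Speedup = time for unpipelined operation / time for longest stage."""
    cycle_time = max(stage_latencies_ns)   # the clock is set by the slowest stage
    return unpipelined_ns / cycle_time

stage_latencies = [5, 10, 10, 10, 5]       # assumed split of the 40 ns datapath
print(pipeline_speedup(40, stage_latencies))   # 4.0, not 5x: unbalanced stages reduce speedup
```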

The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
Today's Topics:
  Recap last lecture / finish datapath
  Pipelined Control / do-it-yourself Pipelined Control
  Administrivia
  Hazards / Forwarding
  Exceptions
  Review MIPS R3000 pipeline
So where are we in the overall scheme of things? Well, we just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47)
10/13/99 ©UCB Fall 1999

Can pipelining get us into trouble? Yes: Pipeline Hazards
Structural hazards: attempt to use the same resource two different ways at the same time
  E.g., a combined washer/dryer would be a structural hazard, or the folder is busy doing something else (watching TV)
Data hazards: attempt to use an item before it is ready
  E.g., one sock of a pair is in the dryer and one is in the washer; can't fold until you get the sock from the washer through the dryer
  An instruction depends on the result of a prior instruction still in the pipeline
Control hazards: attempt to make a decision before the condition is evaluated
  E.g., washing football uniforms and need to get the proper detergent level; need to see the result after the dryer before starting the next load
  Branch instructions
Can always resolve hazards by waiting
  Pipeline control must detect the hazard
  Take action (or delay action) to resolve hazards
10/13/99 ©UCB Fall 1999

Single Memory is a Structural Hazard
(Figure: a Load followed by Instr 1-4 overlapped in the pipeline; with a single memory, the Load's Mem stage and a later instruction's fetch need the memory in the same cycle.)
Detection is easy in this case! (right-half highlight means read, left-half write)
10/13/99 ©UCB Fall 1999

Structural Hazards limit performance
Example: if there are 1.3 memory accesses per instruction and only one memory access per cycle, then
  average CPI ≥ 1.3
  otherwise the memory resource is more than 100% utilized
10/13/99 ©UCB Fall 1999

Control Hazard Solution #1: Stall
(Figure: Add, Beq, Load in the pipeline; the instructions after the branch are held back, losing potential issue slots.)
Stall: wait until the decision is clear
Impact: 2 lost cycles (i.e., 3 clock cycles per branch instruction) => slow
Move the decision to the end of decode: save 1 cycle per branch
10/13/99 ©UCB Fall 1999

Control Hazard Solution #2: Predict
(Figure: Add, Beq, Load in the pipeline; the instruction after the branch is fetched speculatively.)
Predict: guess one direction, then back up if wrong
Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ~50% of the time)
Need to "squash" and restart the following instruction if wrong
Produces a CPI on branches of (1 * 0.5 + 2 * 0.5) = 1.5
Total CPI might then be: 1.5 * 0.2 + 1 * 0.8 = 1.1 (20% branches)
More dynamic scheme: history of one branch (~90% right)
10/13/99 ©UCB Fall 1999
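As a quick check of the arithmetic above, here is a small Python sketch; the 50% prediction accuracy, 1-cycle misprediction penalty, and 20% branch frequency are the slide's example numbers, not measurements.

```python
def branch_cpi(p_correct, mispredict_penalty=1):
    """CPI of a branch: 1 cycle if predicted right, 1 + penalty cycles if wrong."""
    return 1 * p_correct + (1 + mispredict_penalty) * (1 - p_correct)

def overall_cpi(branch_fraction, cpi_branch, cpi_other=1.0):
    return cpi_branch * branch_fraction + cpi_other * (1 - branch_fraction)

cpi_br = branch_cpi(p_correct=0.5)     # (1 * 0.5 + 2 * 0.5) = 1.5
print(overall_cpi(0.2, cpi_br))        # 1.5 * 0.2 + 1 * 0.8 = 1.1 with 20% branches
```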

Control Hazard Solution #3: Delayed Branch
(Figure: Add, Beq, Misc, Load in the pipeline; the instruction in the slot after the branch always executes.)
Delayed branch: redefine branch behavior (the branch takes place after the next instruction)
Impact: 0 lost clock cycles per branch instruction if we can find an instruction to put in the "slot" (~50% of the time)
As machines launch more instructions per clock cycle, this becomes less useful
10/13/99 ©UCB Fall 1999

Data Hazard on r1: Read-after-write hazard (RAW)
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
10/13/99 ©UCB Fall 1999

Data Hazard on r1: Read-after-write hazard (RAW)
Dependencies backwards in time are hazards
(Figure: the five instructions above overlapped in the IF / ID-RF / EX / MEM / WB pipeline; the later instructions read r1 before the add has written it back.)
10/13/99 ©UCB Fall 1999

Data Hazard Solution: Forwarding
"Forward" the result from one stage to another
The "or" is OK if we define register read/write timing properly
(Figure: the same five-instruction pipeline diagram, now with forwarding arrows from the add's ALU result to the ALU inputs of the following instructions.)
10/13/99 ©UCB Fall 1999

Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards
Can't solve this one with forwarding alone: must delay/stall the instruction that depends on the load
(Figure: lw r1,0(r2) followed immediately by sub r4,r1,r3; the load data is not available until the end of MEM, after the sub's EX stage needs it.)
10/13/99 ©UCB Fall 1999

Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards
Can't solve with forwarding: must delay/stall the instruction that depends on the load
(Figure: lw r1,0(r2) followed by sub r4,r1,r3 with a one-cycle stall inserted, so the sub's EX stage lines up with the load's MEM result.)
10/13/99 ©UCB Fall 1999
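A minimal sketch, assuming a standard MIPS-style load-use interlock, of the stall condition this slide implies: hold fetch and decode when the instruction in EX is a load whose destination matches a source register of the instruction being decoded. The signal names are illustrative, not taken from the CS152 lab datapath.

```python
def load_use_stall(ex_is_load, ex_dest, id_rs, id_rt):
    """True -> freeze PC and IF/ID for a cycle and send a bubble into EX."""
    return ex_is_load and ex_dest != 0 and ex_dest in (id_rs, id_rt)

# lw r1, 0(r2) is in EX while sub r4, r1, r3 sits in decode -> stall one cycle
print(load_use_stall(True, ex_dest=1, id_rs=1, id_rt=3))   # True
```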

Designing a Pipelined Processor
Go back and examine your datapath and control diagram
Associate resources with states
Ensure that flows do not conflict, or figure out how to resolve conflicts
Assert control in the appropriate stage
10/13/99 ©UCB Fall 1999

Control and Datapath: Split the state diagram into 5 pieces
  Instruction fetch:  IR <- Mem[PC]; PC <- PC + 4
  Decode / reg fetch: A <- R[rs]; B <- R[rt]
  Execute:            S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond, PC <- PC + SX
  Memory:             M <- Mem[S]; Mem[S] <- B
  Write back:         R[rd] <- S; R[rt] <- S; R[rd] <- M
(Figure: datapath blocks Next PC, PC, Inst. Mem, IR, Reg. File, A/B latches, Exec, Equal, S, Mem Access, Data Mem, M/D latches, Reg File write port.)
10/13/99 ©UCB Fall 1999

Pipelined Processor (almost) for slides
(Figure: the same datapath with pipeline registers between stages, including IR, IRex, IRmem, IRwb plus valid bits, and the control split into Dcd Ctrl, Ex Ctrl, Mem Ctrl, and WB Ctrl blocks.)
What happens if we start a new instruction every cycle?
10/13/99 ©UCB Fall 1999

Pipelined Datapath (as in book); hard to read 10/13/99 ©UCB Fall 1999

Administrivia
Get started on LAB 5!
Problem 0 due tonight at 12 midnight via email: evaluate your teammates
Organization of Lab due by Monday at 5pm
Next week: sections in the Cory lab. Run the "mystery program" on Lab 4 to test your design.
Since we are going to be testing things that you may not have tried, think again carefully about testing your design!
Perhaps read the paper by Doug Clark on the handouts page
10/13/99 ©UCB Fall 1999

The Big Picture: Where are We Now?
The Five Classic Components of a Computer: Processor (Control + Datapath), Memory, Input, Output
Today's Topics:
  Recap last lecture
  Pipelined Control / do-it-yourself Pipelined Control
  Administrivia
  Hazards / Forwarding
  Exceptions
  Review MIPS R3000 pipeline
So where are we in the overall scheme of things? Well, we just finished designing the processor's datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47)
10/13/99 ©UCB Fall 1999

Pipelining the Load Instruction
(Figure: three lw instructions entering the Ifetch / Reg/Dec / Exec / Mem / Wr pipeline one cycle apart, over Cycles 1-7.)
The five independent functional units in the pipeline datapath are:
  Instruction Memory for the Ifetch stage
  Register File's read ports (busA and busB) for the Reg/Dec stage
  ALU for the Exec stage
  Data Memory for the Mem stage
  Register File's write port (busW) for the Wr stage
For the load instruction, the five independent functional units in the pipeline datapath are: (a) the Instruction Memory for the Ifetch stage, (b) the Register File's read ports for the Reg/Dec stage, (c) the ALU for the Exec stage, (d) the Data Memory for the Mem stage, and (e) the Register File's write port for the write-back stage. Notice that I have treated the Register File's read and write ports as separate functional units because the register file we have allows us to read and write at the same time. Notice also that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory, so the 2nd load can start using it (2nd Ifetch). Furthermore, since each functional unit is used only ONCE per instruction, we will not have any conflict further down the pipeline (Exec-Ifetch, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later; for now, I want to point out the performance advantage of pipelining. If these 3 load instructions were executed by the multiple-cycle processor, they would take 15 cycles; with pipelining, they take only 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantage of pipelining. A better way to look at it is that one instruction enters the pipeline every cycle, so one instruction comes out of the pipeline (Wr stage) every cycle. Consequently, the "effective" (or average) number of cycles per instruction is now ONE, even though it takes a total of 5 cycles to complete each instruction. +3 = 14 min. (X:54)
10/13/99 ©UCB Fall 1999
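The 15-cycle versus 7-cycle comparison in the notes above follows from a simple count; here is a hedged Python sketch (the 5 cycles per instruction on the multicycle machine is the slide's assumption).

```python
def multicycle_time(n_instructions, cycles_per_instruction=5):
    """Every instruction runs to completion before the next one starts."""
    return n_instructions * cycles_per_instruction

def pipelined_time(n_instructions, n_stages=5):
    """Fill the pipeline once, then finish one instruction per cycle."""
    return n_stages + (n_instructions - 1)

print(multicycle_time(3))   # 15 cycles for the three lw's on the multicycle machine
print(pipelined_time(3))    # 7 cycles on the 5-stage pipeline
```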

The Four Stages of R-type
(Figure: an R-type instruction going through Ifetch, Reg/Dec, Exec, Wr over Cycles 1-4.)
Ifetch: Instruction Fetch: fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: ALU operates on the two register operands; update PC
Wr: write the ALU output back to the register file
Well, so far so good. Let's take a look at the R-type instructions. The R-type instruction does NOT access data memory, so it only takes four clock cycles, or in our new pipeline terminology four stages, to complete. The Ifetch and Reg/Dec stages are identical to the Load instruction's; they have to be, because at this point we do not yet know that we have an R-type instruction. Instead of calculating an effective address during the Exec stage, the R-type instruction uses the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr stage. +1 = 15 min. (55)
10/13/99 ©UCB Fall 1999

Pipelining the R-type and Load Instruction
(Figure: R-type, R-type, Load, R-type, R-type entering the pipeline one cycle apart over Cycles 1-9.)
Oops! We have a problem!
We have a pipeline conflict, or structural hazard: two instructions try to write to the register file at the same time, and there is only one write port!
What happens if we try to pipeline the R-type instructions with the Load instruction? Well, we have a problem here! We end up with two instructions trying to write to the register file at the same time! Why do we have this problem (the write "bubble")? Well, the reason is that there is something I have not yet told you. +1 = 16 min. (X:56)
10/13/99 ©UCB Fall 1999

Important Observation
Each functional unit can only be used once per instruction
Each functional unit must be used at the same stage for all instructions:
  Load uses the Register File's write port during its 5th stage (Ifetch, Reg/Dec, Exec, Mem, Wr)
  R-type uses the Register File's write port during its 4th stage (Ifetch, Reg/Dec, Exec, Wr)
There are 2 ways to solve this pipeline hazard.
I already told you that in order for the pipeline to work perfectly, each functional unit can only be used once per instruction. What I have not told you is that this (1st bullet) is a necessary but NOT sufficient condition for the pipeline to work. The other condition needed to prevent pipeline hiccups is that each functional unit must be used at the same stage for all instructions. For example, the load instruction uses the Register File's write port during its 5th stage, but the R-type instruction as designed so far uses the write port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions. +1 = 17 min. (X:57)
10/13/99 ©UCB Fall 1999

Solution 1: Insert a "Bubble" into the Pipeline
(Figure: after the Load, a bubble is inserted so the following R-type's register write is pushed back one cycle; no instruction is started in Cycle 6.)
Insert a "bubble" into the pipeline to prevent two writes in the same cycle
The control logic can be complex
We lose an instruction fetch and issue opportunity: no instruction is started in Cycle 6!
The first solution is to insert a "bubble" into the pipeline AFTER the load instruction, to push back every instruction after the load that is already in the pipeline by one cycle. At the same time, the bubble delays the instruction fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the "extra" stage (Mem) the Load instruction has, we no longer finish one instruction every cycle (see Cycle 5). Consequently, a mix of load and R-type instructions will NOT have an average CPI of 1 because, in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea. Let's try something else. +2 = 19 min. (X:59)
10/13/99 ©UCB Fall 1999

Solution 2: Delay R-type's Write by One Cycle
Delay the R-type's register write by one cycle:
  Now R-type instructions also use the Register File's write port at Stage 5
  The Mem stage is a NOOP stage for R-type: nothing is done there
(Figure: R-type now goes through Ifetch, Reg/Dec, Exec, Mem, Wr; a mix of R-type and Load instructions pipelines cleanly over Cycles 1-9.)
Well, one thing we can do is add a "NOOP" stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file's write port at its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also get back to having one instruction complete per cycle. This is kind of like promoting socialism: by making each individual R-type instruction take 5 cycles instead of 4 to finish, our overall performance is actually better. The reason for this higher performance is that we end up with a more efficient pipeline. +1 = 20 min. (Y:00)
10/13/99 ©UCB Fall 1999

Modified Control & Datapath
  Instruction fetch:  IR <- Mem[PC]; PC <- PC + 4
  Decode / reg fetch: A <- R[rs]; B <- R[rt]
  Execute:            S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond, PC <- PC + SX
  Memory:             M <- S; M <- Mem[S]; Mem[S] <- B; M <- S
  Write back:         R[rd] <- M; R[rt] <- M; R[rd] <- M
(Figure: the same datapath, with the ALU result S now passed through the M latch in the Mem stage so every instruction writes the register file from the same place in stage 5.)
10/13/99 ©UCB Fall 1999

The Four Stages of Store
(Figure: a Store going through Ifetch, Reg/Dec, Exec, Mem over Cycles 1-4, with an idle Wr slot.)
Ifetch: Instruction Fetch: fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: calculate the memory address
Mem: write the data into the Data Memory
Let's continue our lecture by looking at the store instruction. Once again, the Ifetch and Reg/Dec stages are the same as for all other instructions. The Exec stage of the store instruction calculates the memory address. Once the address is calculated, the store instruction writes the data it read from the register file back in the Reg/Dec stage into the data memory during the Mem stage. Notice that unlike the load instruction, which takes five cycles to accomplish its task, the store instruction takes only four cycles, or four pipe stages. To keep our pipeline diagram looking more uniform, however, we will keep the Wr stage for the store instruction in the diagram; but keep in mind that as far as the pipelined control and pipelined datapath are concerned, the store instruction requires NOTHING to be done once it finishes its Mem stage. +2 = 27 min. (Y:07)
10/13/99 ©UCB Fall 1999

The Three Stages of Beq
(Figure: a Beq going through Ifetch, Reg/Dec, Exec over Cycles 1-4, with idle Mem and Wr slots.)
Ifetch: Instruction Fetch: fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: compare the two register operands, select the correct branch target address, and latch it into the PC
Well, similar to the store instruction, the branch instruction consists of only three working pipe stages. Ifetch and Reg/Dec are the same as for all other instructions, because we do not know what instruction we have at this point; we have not finished decoding the instruction yet. During the Execute stage of the pipeline, the BEQ instruction uses the ALU to compare the two register operands it fetched during the Reg/Dec stage. At the same time, a separate adder is used to calculate the branch target address. If the registers we compared during the Execute stage have the same value, the branch is taken; that is, the branch target address we calculated earlier is written into the Program Counter. Once again, similar to the Store instruction, the BEQ instruction requires NEITHER the pipelined control nor the pipelined datapath to do ANYTHING once it finishes its Exec stage. With all this talk about pipelined datapath and pipelined control, let's take a look at what the pipelined datapath looks like. +2 = 29 min. (Y:09)
10/13/99 ©UCB Fall 1999

Control Diagram
  Instruction fetch:  IR <- Mem[PC]; PC <- PC + 4
  Decode / reg fetch: A <- R[rs]; B <- R[rt]
  Execute:            S <- A + B; S <- A or ZX; S <- A + SX; S <- A + SX; if Cond, PC <- PC + SX
  Memory:             M <- S; M <- Mem[S]; Mem[S] <- B; M <- S
  Write back:         R[rd] <- S; R[rt] <- S; R[rd] <- M
(Figure: the register-transfer operations above laid over the pipelined datapath blocks: Next PC, PC, Inst. Mem, IR, Reg. File, A/B, Exec, Equal, S, Mem Access, Data Mem, M/D, Reg File.)
10/13/99 ©UCB Fall 1999

Data Stationary Control
The Main Control generates the control signals during Reg/Dec
  Control signals for Exec (ExtOp, ALUSrc, ALUOp, RegDst, ...) are used 1 cycle later
  Control signals for Mem (MemWr, Branch) are used 2 cycles later
  Control signals for Wr (MemtoReg, RegWr) are used 3 cycles later
(Figure: the Main Control feeding the IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers, which carry the not-yet-used control bits down the pipe.)
The main control here is identical to the one in the single-cycle processor. It generates all the control signals necessary for a given instruction during that instruction's Reg/Dec stage. All of these control signals are saved in the ID/Ex pipeline register at the end of the Reg/Dec cycle. The control signals for the Exec stage (ALUSrc, etc.) come from the output of the ID/Ex register; that is, they are delayed ONE cycle from the cycle in which they are generated. The control signals not used during the Exec stage are passed down the pipeline and saved in the Ex/Mem register. The control signals for the Mem stage (MemWr, Branch) come from the output of the Ex/Mem register; they are delayed two cycles from the cycle they are generated. Finally, the control signals for the Wr stage (MemtoReg & RegWr) come from the output of the Mem/Wr register: they are delayed three cycles from the cycle they are generated. +2 = 45 min. (Y:45)
10/13/99 ©UCB Fall 1999
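A minimal sketch of the data stationary control idea, assuming an illustrative opcode set and dictionary layout (only the signal names follow the slide; everything else is made up for the example): decode produces every control bit, and the bits simply ride down the pipeline registers until the stage that consumes them.

```python
def decode_control(opcode):
    """Hypothetical main control: every signal is produced in the Reg/Dec stage."""
    is_load = (opcode == "lw")
    return {
        "ALUSrc":   opcode in ("lw", "sw"),        # used in Exec, 1 cycle later
        "ALUOp":    opcode,                        # used in Exec
        "MemWr":    opcode == "sw",                # used in Mem, 2 cycles later
        "Branch":   opcode == "beq",               # used in Mem
        "MemtoReg": is_load,                       # used in Wr, 3 cycles later
        "RegWr":    opcode not in ("sw", "beq"),   # used in Wr
    }

id_ex = ex_mem = mem_wr = None    # pipeline registers carrying the control words

def clock_edge(ctrl_from_decode):
    """Each cycle the control word moves one register to the right with its instruction."""
    global id_ex, ex_mem, mem_wr
    mem_wr, ex_mem, id_ex = ex_mem, id_ex, ctrl_from_decode
```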

Datapath + Data Stationary Control
(Figure: the pipelined datapath with each pipeline register carrying, alongside the data, the instruction fields and remaining control it still needs: valid bit, opcode/function, register specifiers rs/rt/rw, and the ex, mem, and wb control fields.)
10/13/99 ©UCB Fall 1999

Let's Try it Out (these addresses are octal)
  10  lw   r1, r2(35)
  14  addI r2, r2, 3
  20  sub  r3, r4, r5
  24  beq  r6, r7, 100
  30  ori  r8, r9, 17
  34  add  r10, r11, r12
  100 and  r13, r14, 15
10/13/99 ©UCB Fall 1999

Start: Fetch 10
(Figure: datapath snapshot. The instruction at address 10, lw r1, r2(35), is being fetched; all later pipeline stages still hold bubbles (n).)
10/13/99 ©UCB Fall 1999

Fetch 14, Decode 10
(Figure: datapath snapshot. lw r1, r2(35) is in Decode while the instruction at 14 is fetched; the Exec, Mem, and WB stages still hold bubbles.)
10/13/99 ©UCB Fall 1999

Fetch 20, Decode 14, Exec 10
(Figure: datapath snapshot. lw r1 is in Exec forming the address r2 + 35, addI r2, r2, 3 is in Decode, and the instruction at 20 is fetched.)
10/13/99 ©UCB Fall 1999

Fetch 24, Decode 20, Exec 14, Mem 10
(Figure: datapath snapshot. lw r1 is in Mem accessing address r2+35, addI is in Exec, sub r3, r4, r5 is in Decode, and the instruction at 24 is fetched.)
10/13/99 ©UCB Fall 1999

Fetch 30, Dcd 24, Ex 20, Mem 14, WB 10
(Figure: datapath snapshot. lw r1 is writing back M[r2+35], addI is in Mem with r2+3, sub is in Exec on r4 and r5, beq r6, r7, 100 is in Decode, and the instruction at 30 is fetched.)
Note the Delayed Branch: we always execute the ori after the beq
10/13/99 ©UCB Fall 1999

Fetch 100, Dcd 30, Ex 24, Mem 20, WB 14
(Figure: datapath snapshot. r1 = M[r2+35] is now in the register file; addI is in WB with r2+3; sub is in Mem with r4-r5; beq is in Exec comparing r6 and r7; ori r8, r9, 17 is in Decode; and, the branch being taken, address 100 is fetched next.)
10/13/99 ©UCB Fall 1999

Fetch 104, Dcd 100, Ex 30, Mem 24, WB 20
(Figure: datapath snapshot with most values left blank.)
Fill it in yourself!
10/13/99 ©UCB Fall 1999

Fetch 110, Dcd 104, Ex 100, Mem 30, WB 24
(Figure: datapath snapshot with most values left blank.)
Fill it in yourself!
10/13/99 ©UCB Fall 1999

Fetch 114, Dcd 110, Ex 104, Mem 100, WB 30
(Figure: datapath snapshot with most values left blank.)
Fill it in yourself!
10/13/99 ©UCB Fall 1999

Pipeline Hazards Again
(Figure: overlapped instruction timelines illustrating each class of hazard.)
Structural Hazard (e.g., operand fetch and memory-operand fetch contending for the same memory)
Control Hazard (e.g., a Jump)
RAW (read after write) Data Hazard
WAW (write after write) Data Hazard
WAR (write after read) Data Hazard
10/13/99 ©UCB Fall 1999

Data Hazards
Avoid some "by design":
  Eliminate WAR by always fetching operands early (in DCD) in the pipe
  Eliminate WAW by doing all WBs in order (last stage, static)
Detect and resolve the remaining ones:
  Stall or forward (if possible)
(Figure: instruction timelines for the RAW, WAW, and WAR cases.)
10/13/99 ©UCB Fall 1999

Hazard Detection
Suppose instruction i is about to be issued and a predecessor instruction j is in the instruction pipeline.
A RAW hazard exists on a register if Rregs(i) ∩ Wregs(j) ≠ ∅
  Keep a record of pending writes (for instructions in the pipe) and compare it with the operand registers of the current instruction. When an instruction issues, reserve its result register. When an operation completes, remove its write reservation.
A WAW hazard exists on a register if Wregs(i) ∩ Wregs(j) ≠ ∅
A WAR hazard exists on a register if Wregs(i) ∩ Rregs(j) ≠ ∅
10/13/99 ©UCB Fall 1999
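The three conditions above translate directly into set intersections; here is a small Python sketch, with the register sets in the example chosen only for illustration.

```python
def raw_hazard(rregs_i, wregs_j):
    return bool(set(rregs_i) & set(wregs_j))

def waw_hazard(wregs_i, wregs_j):
    return bool(set(wregs_i) & set(wregs_j))

def war_hazard(wregs_i, rregs_j):
    return bool(set(wregs_i) & set(rregs_j))

# add r1,r2,r3 (predecessor j, still in the pipe) then sub r4,r1,r3 (i, about to issue):
print(raw_hazard(rregs_i={1, 3}, wregs_j={1}))   # True: RAW on r1
```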

Record of Pending Writes
(Figure: the pipelined datapath with the current operand registers (rs, rt in decode) compared against the pending destination registers (rw) in the EX, MEM, and WB stages.)
hazard <= ((rs == rwex)  & regWex)  OR
          ((rs == rwmem) & regWme)  OR
          ((rs == rwwb)  & regWwb)  OR
          ((rt == rwex)  & regWex)  OR
          ((rt == rwmem) & regWme)  OR
          ((rt == rwwb)  & regWwb)
10/13/99 ©UCB Fall 1999

Resolve RAW by forwarding
Detect the nearest valid write to an operand register and forward it into the operand latches, bypassing the remainder of the pipe
Increase the muxes to add paths from the pipeline registers
Data Forwarding = Data Bypassing
(Figure: the pipelined datapath with a forward mux in front of the ALU operand latches, fed from the EX, MEM, and WB pipeline registers.)
10/13/99 ©UCB Fall 1999
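A hedged sketch of the forwarding-mux priority described here: take the value from the nearest downstream stage that is going to write the register, otherwise fall back to the register file read. The stage and signal names are illustrative, not the lab datapath's.

```python
def forward_operand(rs, regfile_value,
                    ex_mem_regwrite, ex_mem_rd, ex_mem_result,
                    mem_wb_regwrite, mem_wb_rd, mem_wb_result):
    """Value fed to the ALU for source register rs this cycle."""
    if ex_mem_regwrite and ex_mem_rd != 0 and ex_mem_rd == rs:
        return ex_mem_result      # nearest producer wins (most recent write)
    if mem_wb_regwrite and mem_wb_rd != 0 and mem_wb_rd == rs:
        return mem_wb_result      # older result, not yet in the register file
    return regfile_value          # no pending write: use the register file read
```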

What about memory operations?
If instructions are initiated in order and memory operations always occur in the same stage, there can be no hazards between memory operations!
What does delaying WB on arithmetic operations cost?
  Cycles? Hardware?
What about data dependence on loads?
  R1 <- R4 + R5
  R2 <- Mem[R2 + I]
  R3 <- R2 + R1
  => "Delayed Loads"
  Can recognize this in the decode stage and introduce a bubble while stalling the fetch stage (hint for lab 5!)
Tricky situation:
  R1 <- Mem[R2 + I]
  Mem[R3 + 34] <- R1
  Handle with a bypass in the memory stage, as sketched below!
(Figure: the memory-stage datapath with the load result register T bypassed to the store data input.)
10/13/99 ©UCB Fall 1999
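A minimal sketch of that memory-stage bypass, under the assumption that a load in the write-back stage forwards directly to a store in the memory stage; the names here are mine, not the lab's.

```python
def store_write_data(mem_is_store, mem_store_src,
                     wb_is_load, wb_dest, wb_load_data,
                     value_read_in_decode):
    """Data the store in MEM should actually write this cycle."""
    if mem_is_store and wb_is_load and wb_dest != 0 and wb_dest == mem_store_src:
        return wb_load_data            # bypass in the memory stage: load -> store
    return value_read_in_decode        # otherwise use the value read back in decode

# r1 <- Mem[r2 + I] is in WB while Mem[r3 + 34] <- r1 is in MEM:
print(store_write_data(True, 1, True, 1, wb_load_data=42, value_read_in_decode=0))  # 42
```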

Compiler Avoiding Load Stalls (figure only)
10/13/99 ©UCB Fall 1999

What about Interrupts, Traps, Faults?
External Interrupts:
  Allow the pipeline to drain
  Load the PC with the interrupt address
Faults (within an instruction, restartable):
  Force a trap instruction into IF
  Disable writes until the trap hits WB
  Must save multiple PCs or PC + state
Recall: Precise Exceptions => the state of the machine is preserved as if the program executed up to the offending instruction
  All previous instructions completed
  The offending instruction and all following instructions act as if they have not even started
  The same system code will work on different implementations
10/13/99 ©UCB Fall 1999

Exception Problem
Exceptions/Interrupts: 5 instructions executing in a 5-stage pipeline
  How to stop the pipeline? How to restart? Who caused the interrupt?
Problem interrupts occurring, by stage:
  IF:  page fault on instruction fetch; misaligned memory access; memory-protection violation
  ID:  undefined or illegal opcode
  EX:  arithmetic exception
  MEM: page fault on data fetch; misaligned memory access; memory-protection violation; memory error
What if a Load gets a data page fault while a later Add gets an instruction page fault?
Solution 1: interrupt vector/instruction
Solution 2: interrupt ASAP, restart everything incomplete
10/13/99 ©UCB Fall 1999

Another look at the exception problem
(Figure: program flow over time; a data TLB fault in one instruction's Mem stage, a bad instruction in decode, an instruction TLB fault in fetch, and an overflow in execute can all be outstanding at once, in an order different from program order.)
Use the pipeline to sort this out!
  Pass the exception status along with the instruction
  Keep track of PCs for every instruction in the pipeline
  Don't act on an exception until it reaches the WB stage
  Handle interrupts through a "faulting noop" in the IF stage
When the instruction reaches the WB stage:
  Save PC -> EPC, interrupt vector address -> PC
  Turn all instructions in earlier stages into noops!
10/13/99 ©UCB Fall 1999
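A minimal sketch of the bookkeeping described above, assuming an illustrative PipeEntry record and handler address (neither comes from the slides): each instruction carries its PC and any exception status down the pipe, and nothing is acted on until WB.

```python
from dataclasses import dataclass

EXCEPTION_VECTOR = 0x80000080   # assumed handler address, for illustration only

@dataclass
class PipeEntry:
    pc: int
    exception: str = ""     # e.g. "inst TLB fault", "overflow", "data page fault"
    squashed: bool = False  # True once the instruction has been turned into a noop

def writeback(entry, younger_entries):
    """Act on exceptions only when an instruction reaches WB (precise exceptions)."""
    if entry.squashed or not entry.exception:
        return None                      # normal completion, nothing special to do
    epc = entry.pc                       # save the offending instruction's PC -> EPC
    for e in younger_entries:            # everything behind it acts as if never started
        e.squashed = True
    return epc, EXCEPTION_VECTOR         # handler is fetched from the vector next
```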

Exception Handling
(Figure: the pipelined datapath, here executing lw $2,20($5), with an exception latch per stage:)
  IF:  detect bad instruction address
  ID:  detect bad instruction
  EX:  detect overflow
  MEM: detect bad data address
  WB:  allow the exception to take effect
10/13/99 ©UCB Fall 1999

Resolution: Freeze above & Bubble Below
(Figure: the pipelined datapath with the stages above the faulting one frozen and a bubble inserted below it.)
Flush is accomplished by setting the "invalid" bit in the pipeline
10/13/99 ©UCB Fall 1999

FYI: MIPS R3000 clocking discipline
2-phase non-overlapping clocks (phi1, phi2)
A pipeline stage is two (level-sensitive) latches, clocked on phi1 and phi2, rather than a single edge-triggered register
10/13/99 ©UCB Fall 1999

MIPS R3000 Instruction Pipeline
Stages and resource usage:
  Inst Fetch:        TLB, I-Cache
  Decode / Reg Read: RF
  ALU / E.A.:        ALU operation, E.A. TLB
  Memory:            data TLB, D-Cache
  Write Reg:         WB
Write in phase 1, read in phase 2 => eliminates the bypass from WB
10/13/99 ©UCB Fall 1999

Recall: Data Hazard on r1
(Figure: the add r1,r2,r3 / sub / and / or / xor sequence again in the IF / ID-RF / EX / MEM / WB pipeline.)
With the MIPS R3000 pipeline, there is no need to forward from the WB stage
10/13/99 ©UCB Fall 1999

MIPS R3000 Multicycle Operations
Examples: multiply, divide, cache miss
Stall all stages above the multicycle operation in the pipeline
Drain (bubble) the stages below it
Use the control word of the local stage state to step through the multicycle operation
(Figure: a mul Rd, Ra, Rb occupying the execute resources for several cycles while its result register Rd waits to be written to the register file.)
10/13/99 ©UCB Fall 1999

Issues in Pipelined design (approach, what it does, limitation):
  Pipelining: overlap instructions one stage apart. Limitation: issue rate, FU stalls, FU depth
  Super-pipeline: issue one instruction per (fast) cycle; the ALU takes multiple cycles. Limitation: clock skew, FU stalls, FU depth
  Super-scalar: issue multiple scalar instructions per cycle. Limitation: hazard resolution
  VLIW ("EPIC"): each instruction specifies multiple scalar operations; the compiler determines the parallelism. Limitation: packing
  Vector operations: each instruction specifies a series of identical operations. Limitation: applicability
10/13/99 ©UCB Fall 1999

Summary #1/2: Pipelining
What makes it easy?
  All instructions are the same length
  Just a few instruction formats
  Memory operands appear only in loads and stores
What makes it hard? HAZARDS!
  Structural hazards: suppose we had only one memory
  Control hazards: need to worry about branch instructions
  Data hazards: an instruction depends on a previous instruction
Pipelines pass control information down the pipe just as data moves down the pipe
Forwarding/stalls are handled by local control
Exceptions stop the pipeline
10/13/99 ©UCB Fall 1999

Summary #2/2
Pipelines pass control information down the pipe just as data moves down the pipe
Forwarding/stalls are handled by local control
Exceptions stop the pipeline
The MIPS I instruction set architecture made the pipeline visible (delayed branch, delayed load)
More performance comes from deeper pipelines and parallelism
10/13/99 ©UCB Fall 1999