CS152 – Computer Architecture and Engineering Lecture 10 –

Dave Patterson (www.cs.berkeley.edu/~patterson)
CS152 – Computer Architecture and Engineering Lecture 10 – Introduction to Pipelining Dave Patterson ( www-inst.eecs.berkeley.edu/~cs152/

Review (1 of 4) Disadvantages of the Single Cycle Processor
Long cycle time Cycle time is too long for all instructions except the Load No reuse of hardware Multiple Cycle Processor: Divide the instructions into smaller steps Execute each step (instead of the entire instruction) in one cycle Partition datapath into equal size chunks to minimize cycle time ~10 levels of logic between latches Follow same 5-step method for designing “real” processor

Review (2 of 4) Control is specified by finite state diagram
Specialized state-diagrams easily captured by microsequencer simple increment & “branch” fields datapath control fields Control is more complicated with: complex instruction sets restricted datapaths (see the book) Control design can become Microprogramming

Summary (3 of 4) Microprogramming is a fundamental concept
implement an instruction set by building a very simple processor and interpreting the instructions essential for very complex instructions and when few register transfers are possible Control design reduces to Microprogramming Design of a Microprogramming language Start with list of control signals Group signals together that make sense (vs. random): called “fields” Place fields in some logical order (e.g., ALU operation & ALU operands first and microinstruction sequencing last) To minimize the width, encode operations that will never be used at the same time Create a symbolic legend for the microinstruction format, showing name of field values and how they set the control signals

Review: Overview of Control
Control may be designed using one of several initial representations. The choice of sequence control, and how logic is represented, can then be determined independently; the control can then be implemented with one of several methods using a structured logic technique. Initial Representation Finite State Diagram Microprogram Sequencing Control Explicit Next State Microprogram counter Function + Dispatch ROMs Logic Representation Logic Equations Truth Tables Implementation PLA ROM Technique “hardwired control” “microprogrammed control”

Can we get CPI < 4? Seems to be lots of “idle” hardware
Why not overlap instructions? Pipeline Ideal Memory WrAdr Din RAdr 32 Dout MemWr ALU ALUOp Control IRWr Instruction Reg Reg File Ra Rw busW Rb 5 busA busB RegWr Rs Rt Mux 1 Rd PCWr ALUSelA RegDst PC MemtoReg Extend ExtOp 2 3 4 16 Imm << 2 ALUSelB Zero PCWrCond PCSrc IorD Mem Data Reg ALU Out B A Putting it all together, here it is: the multiple cycle datapath we set out to built. +1 = 47 min. (Y:47)

Pipelining is Natural! Laundry Example
Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D

Sequential Laundry Sequential laundry takes 6 hours for 4 loads
6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 T a s k O r d e A B C D Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP
6 PM 7 8 9 10 11 Midnight Time 30 40 20 T a s k O r d e A B C D Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons Pipelining doesn’t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences 6 PM 7 8 9 Time 30 40 20 T a s k O r d e A B C D

The Five Stages of Load Ifetch: Instruction Fetch
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: Calculate the memory address Mem: Read the data from the Data Memory Wr: Write the data back to the register file As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline. +1 = 8 min. (X:48)

Note: These 5 stages were there all along!
IR <= MEM[PC] PC <= PC + 4 R-type ALUout <= A fun B R[rd] <= ALUout ALUout <= A op ZX R[rt] <= ALUout ORi ALUout <= A + SX R[rt] <= M M <= MEM[ALUout] LW MEM[ALUout] <= B SW 0000 0001 0100 0101 0110 0111 1000 1001 1010 1011 1100 BEQ 0010 If A = B then PC <= ALUout ALUout <= PC +SX Fetch Decode Execute Memory Write-back

Pipelining Improve performance by increasing throughput
Ideal speedup is number of stages in the pipeline. Do we achieve this?

Basic Idea What do we need to add to split the datapath into stages?

Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths

Conventional Pipelined Execution Representation
Time IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB Program Flow IFetch Dcd Exec Mem WB

Dave’s office hours Tue 3:30 – 5 Lab 3 demo Friday, due Monday
Administrivia Office hours in Lab Mon 4 – 5:30 Jack, Tue 3:30-5 Kurt, Wed 3 – 4:30 John, Thu 3:30-5 Ben Dave’s office hours Tue 3:30 – 5 Lab 3 demo Friday, due Monday Reading Chapter 6, sections 6.1 to 6.4 Midterm Wed Oct 8 5:30 - 8:30 in 1 LeConte Midterm review Sunday Oct 4, 5 PM in 306 Soda Bring 1 page, handwritten notes, both sides Meet at LaVal’s Northside afterwards for Pizza

Single Cycle, Multiple Cycle, vs. Pipeline
Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Multiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But may be more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction so the last part of the cycle here is wasted. +2 = 77 min. (X:57) Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr

Suppose we execute 100 instructions Single Cycle Machine
Why Pipeline? Suppose we execute 100 instructions Single Cycle Machine 4.5 ns/cycle x 1 CPI x 100 inst = 450 ns Multicycle Machine 1.0 ns/cycle x 4.1 CPI (due to inst mix) x 100 inst = 410 ns Ideal pipelined machine 1.0 ns/cycle x (1 CPI x 100 inst + 4 cycle fill) = 104 ns

Why Pipeline? Because we can!
Time (clock cycles) ALU Im Reg Dm I n s t r. O r d e Inst 0 ALU Im Reg Dm Inst 1 ALU Im Reg Dm Inst 2 Inst 3 ALU Im Reg Dm Inst 4 ALU Im Reg Dm

Can pipelining get us into trouble?
Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy watching TV control hazards: attempt to make a decision before condition is evaluated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards

Single Memory is a Structural Hazard
Time (clock cycles) ALU I n s t r. O r d e Load Mem Reg Mem Reg ALU Mem Reg Instr 1 ALU Mem Reg Instr 2 ALU Instr 3 Mem Reg Mem Reg ALU Mem Reg Instr 4 Detection is easy in this case! (right half highlight means read, left half write)

Structural Hazards limit performance
Example: if 1.3 memory accesses per instruction and only one memory access per cycle then average CPI  1.3 otherwise resource is more than 100% utilized One Structural Hazard solution: more resources Instruction cache and Data cache

Control Hazard Solution #1: Stall
e Time (clock cycles) Load Beq Add ALU Mem Reg Lost potential Stall: wait until decision is clear Impact: 2 lost cycles (i.e. 3 clock cycles for Beq instruction above) => slow Move decision to end of decode save 1 cycle per branch, may stretch clock cycle

Control Hazard Solution #2: Predict
Time (clock cycles) Add Beq Load ALU Mem Reg Predict: guess one direction then back up if wrong Impact: 0 lost cycles per branch instruction if guess right, 1 if wrong (right ~ 50% of time) Need to “Squash” and restart following instruction if wrong Produce CPI on branch of (1 * * .5) = 1.5 Total CPI might then be: 1.5 * * .8 = 1.1 (20% branch) More dynamic schemes: history of branch behavior (~90-99%)

Control Hazard Solution #3: Delayed Branch
Time (clock cycles) Add Beq Misc ALU Mem Reg Load Delayed Branch: Redefine branch behavior (takes place after next instruction) Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (~50% of time) As launch more instruction per clock cycle, less useful

Data Hazard on r1 add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9
xor r10,r1,r11

Data Hazard on r1: add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9
Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

Data Hazard Solution: add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7
“Forward” result from one stage to another “or” OK if define read/write properly Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards Can’t solve with forwarding: Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU sub r4,r1,r3 Im Reg Dm Reg

Forwarding (or Bypassing): What about Loads
Dependencies backwards in time are hazards Can’t solve with forwarding: Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU Im Reg Dm sub r4,r1,r3 Stall

Control and Datapath: Split state diag into 5 pieces
IR <- Mem[PC]; PC <– PC+4; A <- R[rs]; B<– R[rt] S <– A + B; S <– A or ZX; S <– A + SX; S <– A + SX; If Cond PC < PC+SX; M <– Mem[S] Mem[S] <- B R[rd] <– S; R[rt] <– S; R[rd] <– M; Equal Reg. File Reg File A S Exec PC IR Next PC Inst. Mem B M Mem Access D Data Mem

Pipelined Processor (almost) for slides
Exec Reg. File Mem Access Data A B S M Reg Equal PC Next PC IR Inst. Mem Valid IRex Dcd Ctrl IRmem Ex Ctrl IRwb Mem Ctrl WB Ctrl D What happens if we start a new instruction every cycle?

Pipelined Datapath (as in book); hard to read

Pipelining the Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock Ifetch Reg/Dec Exec Mem Wr 1st lw 2nd lw Ifetch Reg/Dec Exec Mem Wr 3rd lw Ifetch Reg/Dec Exec Mem Wr The five independent functional units in the pipeline datapath are: Instruction Memory for the Ifetch stage Register File’s Read ports (bus A and busB) for the Reg/Dec stage ALU for the Exec stage Data Memory for the Mem stage Register File’s Write port (bus W) for the Wr stage For the load instructions, the five independent functional units in the pipeline datapath are: (a) Instruction Memory for the Ifetch stage. (b) Register File’s Read ports for the Reg/Decode stage. (c) ALU for the Exec stage. (d) Data memory for the Mem stage. (e) And finally Register File’s write port for the Write Back stage. Notice that I have treat Register File’s read and write ports as separate functional units because the register file we have allows us to read and write at the same time. Notice that as soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory. Consequently, the 2nd load can start using the Instruction Memory (2nd Ifetch). Furthermore, since each functional unit is only used ONCE per instruction, we will not have any conflict down the pipeline (Exec-Ifet, Mem-Exec, Wr-Mem) either. I will show you the interaction between instructions in the pipelined datapath later. But for now, I want to point out the performance advantages of pipelining. If these 3 load instructions are to be executed by the multiple cycle processor, it will take 15 cycles. But with pipelining, it only takes 7 cycles. This (7 cycles), however, is not the best way to look at the performance advantages of pipelining. A better way to look at this is that we have one instruction enters the pipeline every cycle so we will have one instruction coming out of the pipeline (Wr stages) every cycle. Consequently, the “effective” (or average) number of cycles per instruction is now ONE even though it takes a total of 5 cycles to complete each instruction. +3 = 14 min. (X:54)

The Four Stages of R-type
Cycle 1 Cycle 2 Cycle 3 Cycle 4 R-type Ifetch Reg/Dec Exec Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: ALU operates on the two register operands Update PC Wr: Write the ALU output back to the register file Well, so far so good. Let’s take a look at the R-type instructions. The R-type instruction does NOT access data memory so it only takes four clock cycles, or in our new pipeline terminology, four stages to complete. The Ifetch and Reg/Dec stages are identical to the Load instructions. Well they have to be because at this point, we do not know we have a R-type instruction yet. Instead of calculating the effective address during the Exec stage, the R-type instruction will use the ALU to operate on the register operands. The result of this ALU operation is written back to the register file during the Wr back stage. +1 = 15 min. (55)

Pipelining the R-type and Load Instruction
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Oops! We have a problem! R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Exec Wr We have pipeline conflict or structural hazard: Two instructions try to write to the register file at the same time! Only one write port What happened if we try to pipeline the R-type instructions with the Load instructions? Well, we have a problem here!!! We end up having two instructions trying to write to the register file at the same time! Why do we have this problem (the write “bubble”)? Well, the reason for this problem is that there is something I have not yet told you. +1 = 16 min. (X:56)

Important Observation
Each functional unit can only be used once per instruction Each functional unit must be used at the same stage for all instructions: Load uses Register File’s Write Port during its 5th stage R-type uses Register File’s Write Port during its 4th stage 2 ways to solve this pipeline hazard. Ifetch Reg/Dec Exec Mem Wr Load 1 2 3 4 5 I already told you that in order for pipeline to work perfectly, each functional unit can ONLY be used once per instruction. What I have not told you is that this (1st bullet) is a necessary but NOT sufficient condition for pipeline to work. The other condition to prevent pipeline hiccup is that each functional unit must be used at the same stage for all instructions. For example here, the load instruction uses the Register File’s Wr port during its 5th stage but the R-type instruction right now will use the Register File’s port during its 4th stage. This (5 versus 4) is what caused our problem. How do we solve it? We have 2 solutions. +1 = 17 min. (X:57) Ifetch Reg/Dec Exec Wr R-type 1 2 3 4

Solution 1: Insert “Bubble” into the Pipeline
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock Ifetch Reg/Dec Exec Wr Load Ifetch Reg/Dec Exec Mem Wr Ifetch Reg/Dec Exec Wr R-type Ifetch Reg/Dec Pipeline Exec Wr R-type R-type Ifetch Bubble Reg/Dec Exec Wr Ifetch Reg/Dec Exec Insert a “bubble” into the pipeline to prevent 2 writes at the same cycle The control logic can be complex. Lose instruction fetch and issue opportunity. No instruction is started in Cycle 6! The first solution is to insert a “bubble” into the pipeline AFTER the load instruction to push back every instruction after the load that are already in the pipeline by one cycle. At the same time, the bubble will delay the Instruction Fetch of the instruction that is about to enter the pipeline by one cycle. Needless to say, the control logic to accomplish this can be complex. Furthermore, this solution also has a negative impact on performance. Notice that due to the “extra” stage (Mem) Load instruction has, we will not have one instruction finishes every cycle (points to Cycle 5). Consequently, a mix of load and R-type instruction will NOT have an average CPI of 1 because in effect, the Load instruction has an effective CPI of 2. So this is not that hot an idea Let’s try something else. +2 = 19 min. (X:59)

Solution 2: Delay R-type’s Write by One Cycle
Delay R-type’s register write by one cycle: Now R-type instructions also use Reg File’s write port at Stage 5 Mem stage is a NOP stage: nothing is being done. 1 2 3 4 5 R-type Ifetch Reg/Dec Exec Mem Wr Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Clock R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr Well one thing we can do is to add a “Nop” stage to the R-type instruction pipeline to delay its register file write by one cycle. Now the R-type instruction ALSO uses the register file’s witer port at its 5th stage so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned. As far as performance is concerned, we also gets back to having one instruction completes per cycle. This is kind of like promoting socialism: by making each individual R-type instruction takes 5 cycles instead of 4 cycles to finish, our overall performance is actually better off. The reason for this higher performance is that we end up having a more efficient pipeline. +1 = 20 min. (Y:00) Load Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr R-type Ifetch Reg/Dec Exec Mem Wr

The Four Stages of Store => 5 stages
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Store Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: Calculate the memory address Mem: Write the data into the Data Memory Let’s continue our lecture by looking at the store instruction. Once again, the Ifetch and Reg/Decode stages are the same as all other instructions. The Exec stage of the store instruction calculates the memory address. Once the address is calculated, the store instruction will write the data it read from the register file back at the Reg/Decode stage into the data memory during the Mem stage. Notice that unlike the load instruction which takes five cycles to accomplish its task, the Store instruction only takes four cycles or four pipe stages. In order to keep our pipeline diagram looks more uniform, however, we will keep the Wr stage for the store instruction in the pipeline diagram. But keep in mind that as far as the pipelined control and pipelined datapath are concerned, the store instruction requires NOTHING to be done once it finishes its Mem stage. +2 = 27 min. (Y:07)

The Three Stages of Beq => 5 stages
Cycle 1 Cycle 2 Cycle 3 Cycle 4 Beq Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Registers Fetch and Instruction Decode Exec: compares the two register operand, select correct branch target address latch into PC Well similar to the store instruction, the branch instruction only consists of four pipe stages. Ifetch and Reg/decode are the same as all other instructions because we do not know what instruction we have at this point. We have not finish decoding the instruction yet. During the Execute stage of the pipeline, the BEQ instruction will use the ALU to compare the two register operands it fetched during the Reg/Dec stage. At the same time, a separate adder is used to calculate the branch target address. If the registers we compared during the Execute stage (point to the last bullet) have the same value, the branch is taken. That is, the branch target address we calculated earlier (last bullet) will be written into the Program Counter. Once again, similar to the Store instruction, the BEQ instruction will require NEITHER the pipelined control nor the pipelined datapath to do ANY thing once it finishes its Mem stage. With all these talk about pipelined datapath and pipelined control, let’s take a look at how the pipelined datapath looks like. +2 = 29 min. (Y:09)

Peer Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock 1st lw Ifetch Reg/Dec Exec Mem1 Wr Mem2 2nd lw Ifetch Reg/Dec Exec Mem1 Wr Mem2 Ifetch Reg/Dec Exec Mem1 Wr Mem2 3rd lw Suppose a big (overlapping) data cache results in a data cache latency of 2 clock cycles and a 6-stage pipeline. What is the impact? 1. Instruction bandwidth is now 5/6-ths of the 5-stage pipeline 2. Instruction bandwidth is now 1/2 of the 5-stage pipeline 3. The branch delay slot is now 2 instructions 4. The load-use hazard can be with 2 instructions following load 5. Both 3 and 4: branch delay and load-use now 2 instructions 6. None of the above

Peer Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock 1st lw Ifetch1 Ifetch2 Reg/Dec Exec Mem Wr Suppose a big (overlapping) I cache results in a I cache latency of 2 clock cycles and a 6-stage pipeline. What is the impact? 1. Instruction bandwidth is now 5/6-ths of the 5-stage pipeline 2. Instruction bandwidth is now 1/2 of the 5-stage pipeline 3. The branch delay slot is now 2 instructions 4. The load-use hazard can be with 2 instructions following load 5. Both 3 and 4: branch delay and load-use now 2 instructions 6. None of the above

Peer Instruction Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Clock 1st add Ifetch Reg/Dec Exec Mem/Wr 2nd lw Ifetch Reg/Dec Exec Mem Wr 3rd add Ifetch Reg/Dec Exec Mem/Wr Suppose we use with a 4 stage pipeline that combines memory access and write back stages for all instructions but load, stalling when there are structural hazards. Impact? 1. The branch delay slot is now 0 instructions 2. Every load stalls since it has a structural hazard 3. Every store stalls since it has a structural hazard 4. Both 2 & 3: loads & stores stall due to structural hazards 5. Every load stalls, but there is no load-use hazard anymore 6. Both 2 & 3, but there is no load-use hazard anymore 7. None of the above

Designing a Pipelined Processor
Go back and examine your datapath and control diagram Associate resources with states Ensure that backwards flows do not conflict, or figure out how to resolve Assert control in appropriate stage

Summary: Pipelining Reduce CPI by overlapping many instructions
Average throughput of approximately 1 CPI with fast clock Utilize capabilities of the Datapath start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction

CS152 – Computer Architecture and Engineering Lecture 10 –

Similar presentations

Presentation on theme: "CS152 – Computer Architecture and Engineering Lecture 10 –"— Presentation transcript:

Similar presentations

About project

Feedback

Log in

Auth with social network:

CS152 – Computer Architecture and Engineering Lecture 10 –

Similar presentations

Presentation on theme: "CS152 – Computer Architecture and Engineering Lecture 10 –"— Presentation transcript:

Similar presentations

About project

Feedback