Presentation is loading. Please wait.

Presentation is loading. Please wait.

Recall: Performance Evaluation

Similar presentations


Presentation on theme: "Recall: Performance Evaluation"— Presentation transcript:

1 CS152 Computer Architecture and Engineering Lecture 13 Introduction to Pipelining

2 Recall: Performance Evaluation
What is the average CPI? state diagram gives CPI for each instruction type workload gives frequency of each type Type CPIi for type Frequency CPIi x freqIi Arith/Logic 4 40% 1.6 Load 5 30% 1.5 Store 4 10% 0.4 branch 3 20% 0.6 Average CPI: 4.1

3 Seems to be lots of “idle” hardware
Can we get CPI < 4.1? Seems to be lots of “idle” hardware Why not overlap instructions??? Ideal Memory WrAdr Din RAdr 32 Dout MemWr ALU ALUOp Control IRWr Instruction Reg Reg File Ra Rw busW Rb 5 busA busB RegWr Rs Rt Mux 1 Rd PCWr ALUSelA RegDst PC MemtoReg Extend ExtOp 2 3 4 16 Imm << 2 ALUSelB Zero PCWrCond PCSrc IorD Mem Data Reg ALU Out B A Putting it all together, here it is: the multiple cycle datapath we set out to built. +1 = 47 min. (Y:47)

4 The Big Picture: Where are We Now?
The Five Classic Components of a Computer Next Topics: Pipelining by Analogy Pipeline hazards Processor Input Control Memory So where are in in the overall scheme of things. Well, we just finished designing the processor’s datapath. Now I am going to show you how to design the control for the datapath. +1 = 7 min. (X:47) Datapath Output

5 Pipelining is Natural! Laundry Example Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, and fold Washer takes 30 minutes Dryer takes 40 minutes “Folder” takes 20 minutes A B C D

6 Sequential laundry takes 6 hours for 4 loads
6 PM 7 8 9 10 11 Midnight Time 30 40 20 30 40 20 30 40 20 30 40 20 T a s k O r d e A B C D Sequential laundry takes 6 hours for 4 loads If they learned pipelining, how long would laundry take?

7 Pipelined Laundry: Start work ASAP
6 PM 7 8 9 10 11 Midnight Time 30 40 20 T a s k O r d e A B C D Pipelined laundry takes 3.5 hours for 4 loads

8 Pipelining Lessons 6 PM 7 8 9 30 40 20 A B C D
Pipelining doesn’t help latency of single task, it helps throughput of entire workload Pipeline rate limited by slowest pipeline stage Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Unbalanced lengths of pipe stages reduces speedup Time to “fill” pipeline and time to “drain” it reduces speedup Stall for Dependences 6 PM 7 8 9 Time 30 40 20 T a s k O r d e A B C D

9 Ifetch: Instruction Fetch
The Five Stages of Load Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Load Ifetch Reg/Dec Exec Mem Wr Ifetch: Instruction Fetch Fetch the instruction from the Instruction Memory Reg/Dec: Register Fetch and Instruction Decode Exec: Calculate the memory address Mem: Read the data from the Data Memory Wr: Write the data back to the register file As shown here, each of these five steps will take one clock cycle to complete. And in pipeline terminology, each step is referred to as one stage of the pipeline. +1 = 8 min. (X:48)

10 Note: These 5 stages were there all along
IR <= MEM[PC] PC <= PC + 4 R-type ALUout <= A fun B R[rd] <= ALUout ALUout <= A op ZX R[rt] <= ALUout ORi ALUout <= A + SX R[rt] <= M M <= MEM[ALUout] LW MEM[ALUout] <= B SW 0000 0001 0100 0101 0110 0111 1000 1001 1010 1011 1100 BEQ 0010 If A = B then PC <= ALUout ALUout <= PC +SX Fetch Decode Execute Memory Write-back

11 Pipelining Improve performance by increasing throughput Ideal speedup is number of stages in the pipeline. Do we achieve this?

12 Basic Idea What do we need to add to split the datapath into stages?

13 Graphically Representing Pipelines
Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths

14 Conventional Pipelined Execution Representation
Time IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB IFetch Dcd Exec Mem WB Program Flow IFetch Dcd Exec Mem WB

15 Single Cycle, Multiple Cycle, vs. Pipeline
Clk Single Cycle Implementation: Load Store Waste Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5 Cycle 6 Cycle 7 Cycle 8 Cycle 9 Cycle 10 Clk Here are the timing diagrams showing the differences between the single cycle, multiple cycle, and pipeline implementations. For example, in the pipeline implementation, we can finish executing the Load, Store, and R-type instruction sequence in seven cycles. In the multiple clock cycle implementation, however, we cannot start executing the store until Cycle 6 because we must wait for the load instruction to complete. Similarly, we cannot start the execution of the R-type instruction until the store instruction has completed its execution in Cycle 9. In the Single Cycle implementation, the cycle time is set to accommodate the longest instruction, the Load instruction. Consequently, the cycle time for the Single Cycle implementation can be five times longer than the multiple cycle implementation. But may be more importantly, since the cycle time has to be long enough for the load instruction, it is too long for the store instruction so the last part of the cycle here is wasted. +2 = 77 min. (X:57) Multiple Cycle Implementation: Load Store R-type Ifetch Reg Exec Mem Wr Ifetch Reg Exec Mem Ifetch Pipeline Implementation: Load Ifetch Reg Exec Mem Wr Store Ifetch Reg Exec Mem Wr R-type Ifetch Reg Exec Mem Wr

16 Suppose we execute 100 instructions Single Cycle Machine
Why Pipeline? Suppose we execute 100 instructions Single Cycle Machine 45 ns/cycle x 1 CPI x 100 inst = 4500 ns Multicycle Machine 10 ns/cycle x 4.6 CPI (due to inst mix) x 100 inst = 4600 ns Ideal pipelined machine 10 ns/cycle x (1 CPI x 100 inst + 4 cycle drain) = 1040 ns

17 Inst 0 Inst 1 Inst 2 Inst 3 Inst 4 Why pipeline (cont.)?
Time (clock cycles) ALU Im Reg Dm I n s t r. O r d e Inst 0 ALU Im Reg Dm Inst 1 ALU Im Reg Dm Inst 2 Inst 3 ALU Im Reg Dm Inst 4 ALU Im Reg Dm

18 Can pipelining get us into trouble?
Yes: Pipeline Hazards structural hazards: attempt to use the same resource two different ways at the same time E.g., combined washer/dryer would be a structural hazard or folder busy doing something else (watching TV) control hazards: attempt to make a decision before condition is evaluated E.g., washing football uniforms and need to get proper detergent level; need to see after dryer before next load in branch instructions data hazards: attempt to use item before it is ready E.g., one sock of pair in dryer and one in washer; can’t fold until get sock from washer through dryer instruction depends on result of prior instruction still in the pipeline Can always resolve hazards by waiting pipeline control must detect the hazard take action (or delay action) to resolve hazards

19 Single Memory is a Structural Hazard
Time (clock cycles) ALU I n s t r. O r d e Load Mem Reg Mem Reg ALU Mem Reg Instr 1 ALU Mem Reg Instr 2 ALU Instr 3 Mem Reg Mem Reg ALU Mem Reg Instr 4 Detection is easy in this case! (right half highlight means read, left half write)

20 Structural Hazards limit performance
Example: if 1.3 memory accesses per instruction and only one memory access per cycle then average CPI  1.3 otherwise resource is more than 100% utilized

21 Control Hazard Solution #1: Stall
e Time (clock cycles) Add Beq Load ALU Mem Reg Lost potential Stall: wait until decision is clear Impact: 2 lost cycles (i.e. 3 clock cycles per branch instruction) => slow Move decision to end of decode save 1 cycle per branch

22 Control Hazard Solution #2: Predict
Time (clock cycles) Add Beq Load ALU Mem Reg Predict: guess one direction then back up if wrong Impact: 0 lost cycles per branch instruction if right, 1 if wrong (right ­ 50% of time) Need to “Squash” and restart following instruction if wrong Produce CPI on branch of (1 * * .5) = 1.5 Total CPI might then be: 1.5 * * .8 = 1.1 (20% branch) More dynamic scheme: history of 1 branch (­ 90%)

23 Control Hazard Solution #3: Delayed Branch
Time (clock cycles) Add Beq Misc ALU Mem Reg Load Delayed Branch: Redefine branch behavior (takes place after next instruction) Impact: 0 clock cycles per branch instruction if can find instruction to put in “slot” (­ 50% of time) As launch more instruction per clock cycle, less useful

24 add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11
Data Hazard on r1 add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11

25 add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11
Data Hazard on r1: Dependencies backwards in time are hazards Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

26 add r1,r2,r3 sub r4,r1,r3 and r6,r1,r7 or r8,r1,r9 xor r10,r1,r11
Data Hazard Solution: “Forward” result from one stage to another “or” OK if define read/write properly Time (clock cycles) IF ID/RF EX MEM WB add r1,r2,r3 Im Reg ALU Reg Dm I n s t r. O r d e ALU sub r4,r1,r3 Im Reg Dm Reg ALU and r6,r1,r7 Im Reg Dm Reg or r8,r1,r9 Im Reg ALU Dm Reg ALU Im Reg Dm xor r10,r1,r11

27 Forwarding (or Bypassing): What about Loads?
Dependencies backwards in time are hazards Can’t solve with forwarding: Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU sub r4,r1,r3 Im Reg Dm Reg

28 Forwarding (or Bypassing): What about Loads
Dependencies backwards in time are hazards Can’t solve with forwarding: Must delay/stall instruction dependent on loads Time (clock cycles) IF ID/RF EX MEM WB lw r1,0(r2) Im Reg ALU Reg Dm ALU Im Reg Dm sub r4,r1,r3 Stall

29 Designing a Pipelined Processor
Go back and examine your datapath and control diagram associated resources with states ensure that flows do not conflict, or figure out how to resolve assert control in appropriate stage

30 Reduce CPI by overlapping many instructions
Summary: Pipelining Reduce CPI by overlapping many instructions Average throughput of approximately 1 CPI with fast clock Utilize capabilities of the Datapath start next instruction while working on the current one limited by length of longest stage (plus fill/flush) detect and resolve hazards What makes it easy all instructions are the same length just a few instruction formats memory operands appear only in loads and stores What makes it hard? structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction


Download ppt "Recall: Performance Evaluation"

Similar presentations


Ads by Google