1 COMP541 Pipelined MIPS Montek Singh Mar 30, 2010

2 Topics  Pipelining: can think of as a way to parallelize, or a way to make better utilization of the hardware. Goal: use all hardware every cycle.  Section 7.5 of text

3 Parallelism  Two types of parallelism: Spatial parallelism: duplicate hardware performs multiple tasks at once. Temporal parallelism: task is broken into multiple stages; also called pipelining; for example, an assembly line.

4 Parallelism Definitions  Some definitions: Token: a group of inputs processed to produce a group of outputs. Latency: time for one token to pass from start to end. Throughput: the number of tokens that can be produced per unit time.  Parallelism increases throughput, often sacrificing latency.

5 Parallelism Example  Ben is baking cookies. It takes 5 minutes to roll the cookies and 15 minutes to bake them. After finishing one batch he immediately starts the next batch. What is the latency and throughput if Ben doesn't use parallelism? Latency = 5 + 15 = 20 minutes = 1/3 hour. Throughput = 1 tray / (1/3 hour) = 3 trays/hour.

6 Parallelism Example  What is the latency and throughput if Ben uses parallelism? Spatial parallelism: Ben asks Alyssa to help, using her own oven. Temporal parallelism: Ben breaks the task into two stages, rolling and baking, and uses two trays. While the first batch is baking he rolls the second batch, and so on.

7 Spatial Parallelism Latency = ? Throughput = ?

8 Spatial Parallelism Latency = 5 + 15 = 20 minutes = 1/3 hour (same) Throughput = 2 trays / (1/3 hour) = 6 trays/hour (doubled)

9 Temporal Parallelism Latency = ? Throughput = ?

10 Temporal Parallelism Latency = 5 + 15 = 20 minutes = 1/3 hour Throughput = 1 tray / (1/4 hour) = 4 trays/hour Using both techniques, the throughput would be 8 trays/hour, as the sketch below also computes.
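A minimal Python sketch of this arithmetic (the variable names are mine, not from the slides):

```python
# Latency and throughput for the cookie example on slides 5-10.
roll, bake = 5, 15                    # minutes per stage

# No parallelism: trays are processed one after another.
latency = roll + bake                 # 20 minutes
serial = 60 / latency                 # 3 trays/hour

# Spatial parallelism: two bakers, two ovens.
spatial = 2 * serial                  # 6 trays/hour (latency unchanged)

# Temporal parallelism (pipelining): a tray finishes every 15 minutes,
# the length of the slowest stage (the oven).
temporal = 60 / max(roll, bake)       # 4 trays/hour (latency unchanged)

# Both techniques combined, as noted on slide 10.
both = 2 * temporal                   # 8 trays/hour
print(latency, serial, spatial, temporal, both)   # 20 3.0 6.0 4.0 8.0
```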

11 Pipelined MIPS  Temporal parallelism  Divide single-cycle processor into 5 stages: Fetch, Decode, Execute, Memory, Writeback  Add pipeline registers between stages (a toy model of instructions flowing through the stages follows below)
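As an illustration (my own, not from the slides), here is a toy Python model in which each instruction advances one stage per cycle, ignoring hazards, so up to five instructions are in flight at once:

```python
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def run(program, cycles):
    pipeline = [None] * 5                   # one slot per pipeline stage
    pending = list(program)
    for cycle in range(1, cycles + 1):
        pipeline.pop()                      # oldest instruction retires
        pipeline.insert(0, pending.pop(0) if pending else None)
        occupied = ", ".join(f"{s}: {i}" for s, i in zip(STAGES, pipeline) if i)
        print(f"cycle {cycle}: {occupied}")

run(["lw", "add", "sub", "and", "or"], cycles=6)
# cycle 1: Fetch: lw
# cycle 2: Fetch: add, Decode: lw
# ... by cycle 5 all five stages are busy
```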

12 Single-Cycle vs. Pipelined Performance

13 Pipelining Abstraction

14 Single-Cycle and Pipelined Datapath

15 Multi-Cycle and Pipelined Datapath

16 Corrected Pipelined Datapath: WriteReg must arrive at the same time as Result

17 Pipelined Control  Same control unit as the single-cycle processor  Control signals are delayed to the proper pipeline stage

18 Pipeline Hazard  Occurs when an instruction depends on results from a previous instruction that hasn't completed.  Types of hazards: Data hazard: register value not written back to the register file yet. Control hazard: next instruction not decided yet (caused by branches).

19 Data Hazard

20 Handling Data Hazards  Static: Insert nops in code at compile time. Rearrange code at compile time.  Dynamic: Forward data at run time. Stall the processor at run time.

21 Compile-Time Hazard Elimination  Insert enough nops for the result to be ready  Or move independent, useful instructions forward

22 Data Forwarding  Also known as bypassing

23 Data Forwarding

24  Forward to Execute stage from either: the Memory stage, or the Writeback stage  Forwarding logic for ForwardAE:

if      ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else                                                           ForwardAE = 00

 Forwarding logic for ForwardBE is the same, but with rsE replaced by rtE

25 Data Forwarding

if      ((rsE != 0) AND (rsE == WriteRegM) AND RegWriteM) then ForwardAE = 10
else if ((rsE != 0) AND (rsE == WriteRegW) AND RegWriteW) then ForwardAE = 01
else                                                           ForwardAE = 00
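The same ForwardAE logic, transcribed into Python so it can be exercised directly; the function and its argument layout are my sketch, while the conditions and encodings follow the slide:

```python
def forward_ae(rsE, WriteRegM, RegWriteM, WriteRegW, RegWriteW):
    """Mux select for ALU input A in the Execute stage:
    '10' = forward from Memory, '01' = forward from Writeback,
    '00' = use the register-file value."""
    if rsE != 0 and rsE == WriteRegM and RegWriteM:
        return "10"   # newest value: ALU result still in the Memory stage
    if rsE != 0 and rsE == WriteRegW and RegWriteW:
        return "01"   # older value: result about to be written back
    return "00"       # no match: register file already holds the value

# Instruction in Memory writes $8; Execute reads $8 -> forward from Memory.
assert forward_ae(8, 8, True, 8, True) == "10"
assert forward_ae(8, 9, True, 8, True) == "01"
assert forward_ae(0, 0, True, 0, True) == "00"   # $0 is never forwarded
# ForwardBE is identical with rtE in place of rsE.
```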

26 Forwarding can fail… lw has a 2-cycle latency! Its data is not available until the end of the Memory stage, too late to forward to the very next instruction's Execute stage.

27 Stalling

28 Stalling Hardware

29  Stalling logic:

lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall

30 Stalling Control

lwstall = ((rsD == rtE) OR (rtD == rtE)) AND MemtoRegE
StallF = StallD = FlushE = lwstall
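A runnable version of the load-use stall check (the function is my sketch; the condition is the slide's):

```python
def lw_stall(rsD, rtD, rtE, MemtoRegE):
    """True when the instruction in Decode reads the destination of a
    load that is still in the Execute stage, forcing a one-cycle bubble."""
    return (rsD == rtE or rtD == rtE) and MemtoRegE

# lw  $2, 0($1)   in Execute: rtE = 2, MemtoRegE = True
# add $3, $2, $4  in Decode:  rsD = 2 -> stall Fetch/Decode, flush Execute
stall = lw_stall(rsD=2, rtD=4, rtE=2, MemtoRegE=True)
StallF = StallD = FlushE = stall
assert stall
```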

31 Control Hazards  beq: the branch is not determined until the fourth stage of the pipeline. Instructions after the branch are fetched before the branch occurs. These instructions must be flushed if the branch is taken.

32 Effect & Solutions  Could stall when branch is decoded. Expensive: 3 cycles lost per branch!  Could predict and flush if wrong  Branch misprediction penalty: Instructions flushed when branch is taken. May be reduced by determining branch earlier.

33 Control Hazards: Flushing

34 Control Hazards: Original Pipeline (for comparison)

35 Control Hazards: Early Branch Resolution Introduces another data hazard in the Decode stage (fixed a few slides ahead)

36 Control Hazards with Early Branch Resolution Penalty now only one lost cycle

37 Aside: Delayed Branch  MIPS always executes the instruction following a branch, so the branch is delayed.  This allows us to avoid killing instructions: compilers move an instruction that has no conflict with the branch into the delay slot.

38 Example  This sequence

add $4, $5, $6
beq $1, $2, 40

 is reordered to this:

beq $1, $2, 40
add $4, $5, $6

39 Handling the New Hazards

40 Control Forwarding and Stalling Hardware  Forwarding logic:

ForwardAD = (rsD != 0) AND (rsD == WriteRegM) AND RegWriteM
ForwardBD = (rtD != 0) AND (rtD == WriteRegM) AND RegWriteM

 Stalling logic:

branchstall = (BranchD AND RegWriteE AND (WriteRegE == rsD OR WriteRegE == rtD))
           OR (BranchD AND MemtoRegM AND (WriteRegM == rsD OR WriteRegM == rtD))

StallF = StallD = FlushE = lwstall OR branchstall
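The same conditions in executable form (the function and its argument names are mine; the logic follows the slide):

```python
def branch_hazard(rsD, rtD, BranchD,
                  WriteRegE, RegWriteE,
                  WriteRegM, RegWriteM, MemtoRegM):
    # Forward a Memory-stage ALU result to the Decode-stage comparator.
    ForwardAD = rsD != 0 and rsD == WriteRegM and RegWriteM
    ForwardBD = rtD != 0 and rtD == WriteRegM and RegWriteM
    # Stall a branch that depends on an ALU result still in Execute,
    # or on a load whose data is still in the Memory stage.
    branchstall = ((BranchD and RegWriteE and
                    (WriteRegE == rsD or WriteRegE == rtD))
                   or
                   (BranchD and MemtoRegM and
                    (WriteRegM == rsD or WriteRegM == rtD)))
    return ForwardAD, ForwardBD, branchstall

# add $7, ... in Execute, immediately followed by beq $7, $0: stall.
fa, fb, stall = branch_hazard(rsD=7, rtD=0, BranchD=True,
                              WriteRegE=7, RegWriteE=True,
                              WriteRegM=0, RegWriteM=False, MemtoRegM=False)
assert stall and not fa and not fb
```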

41 Branch Prediction  Especially important if branch penalty > 1 cycle  Guess whether the branch will be taken: Backward branches are usually taken (loops). Perhaps consider the history of whether the branch was previously taken to improve the guess (a simple history-based predictor is sketched below).  Good prediction reduces the fraction of branches requiring a flush
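One common history-based scheme, going beyond what the slide specifies, is a 2-bit saturating counter per branch: it takes two consecutive wrong outcomes to flip the prediction, so a loop branch is mispredicted only on exit. An illustrative sketch:

```python
class TwoBitPredictor:
    """States 0-1 predict not taken, 2-3 predict taken."""
    def __init__(self):
        self.state = 2                      # start weakly taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        self.state = min(self.state + 1, 3) if taken else max(self.state - 1, 0)

p = TwoBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:          # loop branch: taken 9x, then exit
    if p.predict() != taken:
        mispredicts += 1
    p.update(taken)
print(mispredicts)                          # 1: only the loop exit is wrong
```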

42 Pipelined Performance Example  Ideally CPI = 1, but stalls (caused by loads and branches) raise it  SPECINT2000 benchmark: 25% loads, 10% stores, 11% branches, 2% jumps, 52% R-type  Suppose: 40% of loads are used by the next instruction; 25% of branches are mispredicted; all jumps flush the next instruction  What is the average CPI?

43 Pipelined Performance Example  SPECINT2000 benchmark: 25% loads, 10% stores, 11% branches, 2% jumps, 52% R-type  Suppose: 40% of loads are used by the next instruction; 25% of branches are mispredicted; all jumps flush the next instruction  What is the average CPI? Load/branch CPI = 1 when not stalling, 2 when stalling. Thus:

CPIlw = 1(0.6) + 2(0.4) = 1.4
CPIbeq = 1(0.75) + 2(0.25) = 1.25
Average CPI = (0.25)(1.4) + (0.1)(1) + (0.11)(1.25) + (0.02)(2) + (0.52)(1) = 1.15
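The same computation, spelled out in Python (variable names are mine):

```python
# Per-class CPI: one cycle, plus one stall cycle with the given probability.
cpi_lw  = 1 * 0.60 + 2 * 0.40      # 40% of loads feed the next instruction
cpi_beq = 1 * 0.75 + 2 * 0.25      # 25% of branches mispredicted
cpi_j   = 2                        # jumps always flush the next instruction

mix = {"lw": 0.25, "sw": 0.10, "beq": 0.11, "j": 0.02, "rtype": 0.52}
cpi = {"lw": cpi_lw, "sw": 1.0, "beq": cpi_beq, "j": cpi_j, "rtype": 1.0}

avg_cpi = sum(mix[k] * cpi[k] for k in mix)
print(avg_cpi)                     # 1.1475, i.e. about 1.15
```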

44 Pipelined Performance  Pipelined processor critical path:

Tc = max {
       tpcq + tmem + tsetup,                              (Fetch)
       2(tRFread + tmux + teq + tAND + tmux + tsetup),    (Decode)
       tpcq + tmux + tmux + tALU + tsetup,                (Execute)
       tpcq + tmemwrite + tsetup,                         (Memory)
       2(tpcq + tmux + tRFwrite)                          (Writeback)
     }

45 Pipelined Performance Example

Tc = 2(tRFread + tmux + teq + tAND + tmux + tsetup)
   = 2[150 + 25 + 40 + 15 + 25 + 20] ps
   = 550 ps

46 Pipelined Performance Example  For a program with 100 billion instructions executing on a pipelined MIPS processor: CPI = 1.15, Tc = 550 ps

Execution Time = (# instructions) × CPI × Tc
               = (100 × 10^9)(1.15)(550 × 10^-12)
               = 63 seconds
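As a quick check of the arithmetic, in Python:

```python
n_instr = 100e9            # 100 billion instructions
cpi     = 1.15             # from slide 43
t_c     = 550e-12          # 550 ps cycle time, from slide 45

exec_time = n_instr * cpi * t_c
print(exec_time)           # 63.25 -> about 63 seconds
```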

47 Summary  Pipelining attempts to use the hardware more efficiently  Throughput increases at the cost of latency  Hazards ensue  Modern processors are pipelined

48 Next Time  I/O: Joysticks, Keyboard (and mouse?)

