Lecture 4: Pipeline Complications: Data and Control Hazards

Presentation on theme: "Lecture 4: Pipeline Complications: Data and Control Hazards"— Presentation transcript:

1 Lecture 4: Pipeline Complications: Data and Control Hazards
Professor Alvin R. Lebeck Computer Science 220 Fall 2001

2 Administrative
Homework #1 due Tuesday, September 11
Start reading Chapter 4
Projects
CPS 220

3 Review: A Single Cycle Processor
[Datapath figure: Instruction Fetch Unit, 32 x 32-bit register file (Rs, Rt, Rd), extender, ALU, and data memory, with the Main Control and ALU Control and their signals: RegDst, ALUSrc, ExtOp, MemtoReg, MemWr, RegWr, Branch, Jump, ALUctr.]
Now that we have the Main Control implemented, we have everything we need for the single-cycle processor, and here it is. The Instruction Fetch Unit supplies the instruction. The OP field is fed to the Main Control for decode, and the Func field is fed to the ALU Control for local decoding. The Rt, Rs, Rd, and Imm16 fields of the instruction are fed to the datapath. Based on the OP field, the Main Control sets the control signals RegDst, ALUSrc, and so on, as shown earlier on separate slides. Furthermore, the ALU Control uses the ALUop from the Main Control and the Func field of the instruction to generate the ALUctr signals that tell the ALU what to do: add, subtract, or, and so on. This processor executes each MIPS instruction in the subset in one cycle. There are, however, a couple of subtle differences between this single-cycle processor and a real MIPS processor in terms of instruction execution. CPS 220

4 Review: Pipelining Lessons
Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
Pipeline rate is limited by the slowest pipeline stage
Multiple tasks operate simultaneously
Potential speedup = number of pipe stages
Unbalanced lengths of pipe stages reduce speedup
Time to “fill” the pipeline and time to “drain” it reduce speedup
[Figure: the laundry analogy, tasks A through D overlapped from 6 PM onward with 30/40/20-minute stages.]
CPS 220

5 Review: The Five Stages of a Load
[Diagram: the five cycles of a load, one stage per cycle: Ifetch, Reg/Dec, Exec, Mem, WrB.]
Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
Reg/Dec: Register Fetch and Instruction Decode
Exec: Calculate the memory address
Mem: Read the data from the Data Memory
WrB: Write the data back to the register file
As shown here, each of these five steps takes one clock cycle to complete. In pipeline terminology, each step is referred to as one stage of the pipeline.

6 Review: Pipelining the Load Instruction
[Diagram: three lw instructions entering the pipeline on successive cycles, their stages overlapped across Cycles 1 through 7.]
The five independent pipeline stages are:
Read next instruction: the Ifetch stage
Decode instruction and fetch register values: the Reg/Dec stage
Execute the operation: the Exec stage
Access data memory: the Mem stage
Write data to the destination register: the WrB stage
One instruction enters the pipeline every cycle, and one instruction comes out of the pipeline (completed) every cycle. The “effective” cycles per instruction (CPI) is 1, at roughly 1/5 the cycle time.
For load instructions, the five independent functional units in the pipeline datapath are: (a) the Instruction Memory for the Ifetch stage; (b) the Register File’s read ports for the Reg/Dec stage; (c) the ALU for the Exec stage; (d) the Data Memory for the Mem stage; and (e) the Register File’s write port for the Write Back stage. Notice that the Register File’s read and write ports are treated as separate functional units, because this register file allows a read and a write in the same cycle. As soon as the 1st load finishes its Ifetch stage, it no longer needs the Instruction Memory, so the 2nd load can start using it. Furthermore, since each functional unit is used only ONCE per instruction, there is no conflict further down the pipeline (Exec-Ifetch, Mem-Exec, Wr-Mem) either. I will show the interaction between instructions in the pipelined datapath later; for now I want to point out the performance advantage of pipelining. If these 3 load instructions were executed by the multiple-cycle processor, they would take 15 cycles; with pipelining, they take only 7. The raw count of 7 cycles, however, is not the best way to look at the advantage. A better way is to note that one instruction enters the pipeline every cycle, so one instruction comes out of the pipeline (its WrB stage) every cycle.
Consequently, the “effective” (or average) number of cycles per instruction is now ONE, even though each instruction takes a total of 5 cycles to complete.
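The cycle counts quoted above (15 cycles unpipelined versus 7 pipelined for three loads) follow from simple arithmetic, sketched here in Python. This is my own illustration, not part of the lecture, and it assumes a hazard-free pipeline:

```python
def cycles_unpipelined(n_instrs, stages=5):
    # Each instruction occupies the whole datapath for all of its stages.
    return n_instrs * stages

def cycles_pipelined(n_instrs, stages=5):
    # The first instruction takes `stages` cycles; each later one completes
    # one cycle after its predecessor (assuming no hazards or stalls).
    return stages + (n_instrs - 1)

print(cycles_unpipelined(3))  # 15 cycles for the three loads
print(cycles_pipelined(3))    # 7 cycles
```

As n_instrs grows, the pipelined count approaches n_instrs, which is exactly the "effective CPI of 1" argument on the slide.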

7 Review: Delay R-type’s Write by One Cycle
Delay the R-type’s register write by one cycle: now R-type instructions also use the Reg File’s write port at Stage 5
The Mem stage becomes a NO-OP stage for R-type: nothing is done there
Effective CPI?
[Diagram: R-type instructions with stages Ifetch, Reg/Dec, Exec, Mem (NO-OP), WrB, overlapped with loads across Cycles 1 through 9 with no write-port conflicts.]
One thing we can do is add a NO-OP stage to the R-type instruction pipeline to delay its register-file write by one cycle. Now the R-type instruction ALSO uses the register file’s write port in its 5th stage, so we eliminate the write conflict with the load instruction. This is a much simpler solution as far as the control logic is concerned, and as far as performance is concerned, we are back to completing one instruction per cycle. It may seem counterintuitive: by making each individual R-type instruction take 5 cycles instead of 4 to finish, overall performance is actually better, because we end up with a more efficient pipeline.

8 Review: A Pipelined Datapath
[Datapath figure: IF/ID, ID/Ex, Ex/Mem, and Mem/Wr pipeline registers separating the IF unit, register file, Exec unit, and data memory, with the Ifetch, Reg/Dec, EX, and WrB stage boundaries marked.]
The pipelined datapath consists of combinational logic blocks separated by pipeline registers. If you remove all of these registers (but not the PC), this pipelined datapath reduces to the single-cycle datapath. This should give you extra incentive to do a good job on your single-cycle processor design homework, because you can build your pipelined design on top of your single-cycle design. The registers mark the beginning and end of a pipe stage. In the multiple-clock-cycle lecture, I recommended thinking of a logical clock cycle as beginning slightly after a clock tick and ending right at the next clock tick. For example, the Reg/Dec stage begins slightly after this clock tick, when the output of the IF/ID register has stabilized to its new value, and ends RIGHT at the next clock tick, when the output of the register file is clocked into the ID/Ex register. At the end of the Reg/Dec stage, the value just clocked into the ID/Ex register has NOT yet propagated to the register output; that takes a Clk-to-Q delay. Once the newly clocked value has propagated to the register output, we have reached the beginning of the Exec stage. Notice that the WrB stage of the pipeline has no corresponding datapath of its own, because it is handled by the same part of the datapath that handles the register-read stage. The register file is the only part of the datapath used by more than one pipeline stage, and this is fine because it has independent read and write ports.
More specifically, the Reg/Dec stage of the pipeline uses the register file’s read ports, while the Write Back stage uses its write port.

9 It's Not That Easy for Computers
What could go wrong? Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
Structural hazards: the hardware cannot support this combination of instructions
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
Control hazards: pipelining of branches and other instructions
CPS 220

10 Speed Up Equation for Pipelining
Speedup from pipelining = (Ave Instr Time unpipelined) / (Ave Instr Time pipelined)
  = (CPI_unpipelined × Clock Cycle_unpipelined) / (CPI_pipelined × Clock Cycle_pipelined)
  = (CPI_unpipelined / CPI_pipelined) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
Ideal CPI = CPI_unpipelined / Pipeline depth, so:
Speedup = (Ideal CPI × Pipeline depth / CPI_pipelined) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
CPS 220

11 Speed Up Equation for Pipelining
CPI_pipelined = Ideal CPI + Pipeline stall clock cycles per instruction
Speedup = (Ideal CPI × Pipeline depth) / (Ideal CPI + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
With Ideal CPI = 1:
Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
Here we assume the ideal CPI is 1, as it is for most pipelined machines. If we also assume that pipelining doesn’t change the clock cycle time, we get the same equation as the text. CPS 220

12 Example: Dual-port vs. Single-port
Machine A: dual-ported memory
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate
Ideal CPI = 1 for both; loads are 40% of instructions executed
SpeedUp_A = Pipeline depth / (1 + 0) × (clock_unpipe / clock_pipe) = Pipeline depth
SpeedUp_B = Pipeline depth / (1 + 0.4 × 1) × (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline depth / 1.4) × 1.05 = 0.75 × Pipeline depth
SpeedUp_A / SpeedUp_B = Pipeline depth / (0.75 × Pipeline depth) = 1.33
Machine A is 1.33 times faster
Machine A has no delays: perfect speedup
Machine B stalls every load for one cycle, but its clock is 5% faster than Machine A's
CPS 220
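The arithmetic on this slide can be double-checked with a short Python sketch. The function name and the illustrative depth of 5 are my own choices, not the slide's; the depth cancels out in the final ratio:

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0):
    # Speedup = depth / (1 + stall CPI) * (unpipelined cycle time / pipelined cycle time)
    return depth / (1.0 + stall_cpi) * clock_ratio

depth = 5  # any value works; it cancels in the A/B comparison

# Machine A: dual-ported memory, no structural stalls.
speedup_a = pipeline_speedup(depth, stall_cpi=0.0)

# Machine B: 40% loads each stall 1 cycle, but the clock is 1.05x faster.
speedup_b = pipeline_speedup(depth, stall_cpi=0.4 * 1, clock_ratio=1.05)

print(speedup_a / speedup_b)  # ~1.33: Machine A wins despite B's faster clock
```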

13 Three Generic Data Hazards
Instr_i followed by Instr_j
Read After Write (RAW): Instr_j tries to read an operand before Instr_i writes it
CPS 220

14 Three Generic Data Hazards
Instr_i followed by Instr_j
Write After Read (WAR): Instr_j tries to write an operand before Instr_i reads it
Can’t happen in the DLX 5-stage pipeline because:
All instructions take 5 stages,
Reads are always in stage 2, and
Writes are always in stage 5
CPS 220

15 Three Generic Data Hazards
Instr_i followed by Instr_j
Write After Write (WAW): Instr_j tries to write an operand before Instr_i writes it
Leaves the wrong result (Instr_i's, not Instr_j's)
Can’t happen in the DLX 5-stage pipeline because:
All instructions take 5 stages, and
Writes are always in stage 5
We will see WAR and WAW in later, more complicated pipes
CPS 220

16 Data Hazards
We must deal with instruction dependencies. Example:
sub $2, $1, $3
and $12, $2, $5    # $12 depends on the result in $2,
or  $13, $6, $2    # but $2 is updated 3 clock
add $14, $2, $2    # cycles later.
sw  $15, 100($2)   # We have a problem!!
[Diagram: the five instructions overlapped across Cycles 1 through 8; the and and or read $2 in their Reg/Dec stages before the sub writes it back. This is a data hazard.]
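One way to see which of these instructions are at risk is to scan the sequence for short write-to-read distances. This Python sketch is illustrative only: the tuple encoding and function name are my own, no forwarding is modeled, and it assumes a register-file write is visible to a read three or more instructions later (write in stage 5, read in stage 2):

```python
# Each instruction: (destination, [source registers]), following the slide's example.
program = [
    ("$2",  ["$1", "$3"]),   # sub $2, $1, $3
    ("$12", ["$2", "$5"]),   # and $12, $2, $5
    ("$13", ["$6", "$2"]),   # or  $13, $6, $2
    ("$14", ["$2", "$2"]),   # add $14, $2, $2
    ("$15", ["$2"]),         # sw reads $2 for its address
]

def raw_hazards(prog, write_to_read_distance=3):
    # Without forwarding, a result written in WrB (stage 5) reaches a read in
    # Reg/Dec (stage 2) only if the reader issues >= 3 instructions later.
    hazards = []
    for i, (dest, _) in enumerate(prog):
        for j in range(i + 1, min(i + write_to_read_distance, len(prog))):
            if dest in prog[j][1]:
                hazards.append((i, j))
    return hazards

print(raw_hazards(program))  # [(0, 1), (0, 2)]: the and and or read a stale $2
```

The add, three instructions after the sub, is safe under this model, matching the slide's "updated 3 clock cycles later" comment.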

17 RAW Data Hazard Solution: Register Forwarding
[Datapath figure: forwarding paths feeding results from the later pipeline registers back to the ALU inputs.]

18 RAW Data Hazard for Load
[Diagram: a load occupying Cycles 1 through 5 (Ifetch, Reg/Dec, Exec, Mem, Wr), followed by four dependent instructions, Plus 1 through Plus 4.]
The load is fetched during Cycle 1, but the data is NOT written into the Reg File until the end of Cycle 5
We cannot read this value from the Reg File until Cycle 6
3-instruction delay before the load takes effect
This is a data hazard: register forwarding reduces the load delay to ONE instruction, but it is not possible to eliminate the load data hazard entirely!
Although the load instruction is fetched during Cycle 1, the data is not written into the register file until the end of Cycle 5, so the earliest we can read this value from the register file is Cycle 6. In other words, there is a 3-instruction delay between the load instruction and the first instruction that can use its result. This is referred to as a data hazard in the textbook. We will show that with clever design techniques we can reduce this delay to ONE instruction: if the load is issued in Cycle 1, the instruction right after it (Plus 1) cannot use the result of the load, but the next-next instruction (Plus 2) can.
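The delays quoted above can be captured in a small helper. This Python sketch is my own illustration (the function and its distance convention are assumptions, not the lecture's): without forwarding, a consumer must be at least 4 instructions after the load to avoid stalling; with forwarding out of the memory stage, one stall cycle remains for an immediately following consumer.

```python
def load_use_stalls(distance, forwarding=True):
    # distance: how many instructions after the load the consumer sits (1 = next).
    # Without forwarding, the loaded value is readable only after WrB:
    # a 3-instruction delay. With forwarding, a 1-instruction delay remains.
    delay = 1 if forwarding else 3
    return max(0, delay - (distance - 1))

print(load_use_stalls(1, forwarding=False))  # 3 stalls: consumer right after the load
print(load_use_stalls(1, forwarding=True))   # 1 stall: forwarding cannot hide it all
print(load_use_stalls(2, forwarding=True))   # 0: one independent instruction between
```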

19 Load Data Forwarding This slide shows the datapath for register forwarding of load data. Note that the instruction in the mem stage cannot use the data.

20 Dealing with the Load Data Hazard
There are two ways to deal with the load data hazard:
Insert a NO-OP bubble into the datapath. How?
Use delayed-load semantics (see the next slide)
[Datapath figure: the pipelined datapath with the NO-OP insertion point marked at the ID/Ex register, controlled by stall signals Stall0, Stall1, Stall2.]

21 Delayed Load
Load instructions are defined such that the immediately following instruction will not read the result of the load.
BAD:
  ld  r1, 8(r2)
  sub r3, r1, r3   # reads r1 in the load's delay slot
OK:
  ld  r1, 8(r2)
  add r2, r2, 4    # independent instruction fills the slot

22 Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd
CPS 220
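A quick way to compare the two schedules is to count load-use pairs. This Python sketch is illustrative: the tuple encoding is my own, and it assumes forwarding, so only a use in the instruction immediately after a load stalls, for one cycle:

```python
# (op, dest, sources) triples for the two schedules; memory operands elided.
slow = [("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
        ("SW", None, ["Ra"]), ("LW", "Re", []), ("LW", "Rf", []),
        ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]
fast = [("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
        ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []),
        ("SW", None, ["Ra"]), ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"])]

def load_use_stall_count(prog):
    # With forwarding, only a use RIGHT AFTER a load stalls (1 cycle each).
    stalls = 0
    for i in range(len(prog) - 1):
        op, dest, _ = prog[i]
        if op == "LW" and dest in prog[i + 1][2]:
            stalls += 1
    return stalls

print(load_use_stall_count(slow))  # 2: ADD right after LW Rc, SUB right after LW Rf
print(load_use_stall_count(fast))  # 0: every load is followed by an independent instr
```

The compiler's reordering removes both stall cycles without adding any instructions, which is the whole point of the slide.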

23 Compiler Avoiding Load Stalls
[Chart from the text: the fraction of loads that cause a stall, with and without compiler scheduling.]
CPS 220

24 Review: Data Hazards
RAW: the only one that can occur in the DLX pipeline
WAR
WAW
Data forwarding (register bypassing): send data from one stage to another, bypassing the register file
Still have the load-use delay
CPS 220

25 Pipelining Summary
Just overlap the tasks; this is easy if the tasks are independent
Speedup ~ pipeline depth; if the ideal CPI is 1, then:
Speedup = Pipeline depth / (1 + Pipeline stall CPI) × (Clock Cycle_unpipelined / Clock Cycle_pipelined)
Hazards limit performance on computers:
Structural: need more HW resources
Data: need forwarding, compiler scheduling
Control: discussed today
Branches and other difficulties: what makes branches difficult?
CPS 220

26 Control Hazard on Branches: Three Stage Stall
[Diagram, cycles cc1 through cc9: beq r1, foo issues first; the following instructions add r3, r4, r6; and r3, r2, r4; sub r2, r3, r5; add r3, r2, r5 are held at instruction decode for three cycles until the branch resolves.]
CPS 220

27 Control Hazard
[Diagram: R-type instructions at addresses 16, 20, and 24 in the pipeline; 12: Beq (target is 1000) resolves late, and 1000: Target of Br is not fetched until Cycle 8.]
Although the Beq is fetched during Cycle 4:
The target address is NOT written into the PC until the end of Cycle 7
The branch's target is NOT fetched until Cycle 8
3-instruction delay before the branch takes effect
This is called a control hazard.
Notice that although the branch instruction is fetched during Cycle 4, its target address is NOT written into the Program Counter until the end of Cycle 7. Consequently, the branch-target instruction is not fetched until Cycle 8. In other words, there is a 3-instruction delay between when the branch is issued and when its effect is felt in the program. This is referred to as a branch hazard in the textbook. As we will show, with clever design techniques we can reduce the delay to ONE instruction: if the branch at location 12 is issued in Cycle 4, we execute only one more sequential instruction (location 16) before the branch target is executed.

28 Branch Stall Impact
If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI of 1 + 0.3 × 3 = 1.9!
How can you reduce this delay? Two-part solution:
Determine whether the branch is taken sooner, AND
Compute the taken-branch address earlier
DLX branches test whether a register = 0 or != 0
DLX solution:
Move the zero test to the ID/RF stage
Add an adder to calculate the new PC in the ID/RF stage
1 clock cycle penalty for a branch versus 3
CPS 220
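The CPI arithmetic here is simple enough to script. This Python sketch is my own, not the slide's; it just evaluates base CPI plus branch frequency times penalty:

```python
def cpi_with_branch_stalls(base_cpi, branch_frac, penalty):
    # Effective CPI = base CPI + (fraction of branches) * (stall cycles per branch)
    return base_cpi + branch_frac * penalty

print(cpi_with_branch_stalls(1.0, 0.30, 3))  # 1.9 with the 3-cycle stall
print(cpi_with_branch_stalls(1.0, 0.30, 1))  # 1.3 after moving the test to ID/RF
```

Cutting the penalty from 3 cycles to 1 recovers most of the lost throughput, which is why the DLX designers moved the zero test and the PC adder into the ID/RF stage.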

29 Branch Delays
Example:
      sub $10, $4, $8
      beq $10, $3, go
      add $12, $2, $5
      . . .
  go: lw  $4, 16($12)
[Datapath figure: the IF/ID and ID/EX pipeline registers, showing where the beq's comparison value becomes available.]

30 Branch Hazard
Can we eliminate the effect of this one-cycle branch delay?

31 Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear
#2: Predict branch not taken
  Execute successor instructions in sequence
  “Squash” instructions in the pipeline if the branch is actually taken
  Advantage of late pipeline state update
  47% of DLX branches are not taken on average
  PC+4 is already calculated, so use it to get the next instruction
#3: Predict branch taken
  53% of DLX branches are taken on average
  But DLX hasn’t calculated the branch target address yet, so DLX still incurs a 1-cycle branch penalty
  On other machines the branch target is known before the outcome
CPS 220

32 Four Branch Hazard Alternatives
#4: Delayed branch
Define the branch to take place AFTER a following instruction:
  branch instruction
  sequential successor_1
  sequential successor_2
  ...
  sequential successor_n
  branch target if taken
This is a branch delay of length n. A 1-slot delay allows a proper decision and branch-target address calculation in the 5-stage pipeline; DLX uses this
CPS 220

33 Delayed Branch
Where do we get instructions to fill the branch delay slot?
From before the branch instruction
From the target address: only valuable when the branch is taken
From the fall-through: only valuable when the branch is not taken
Cancelling branches allow more slots to be filled
Compiler effectiveness for a single branch delay slot:
Fills about 60% of branch delay slots
About 80% of instructions executed in branch delay slots are useful in computation
So about 50% (60% × 80%) of slots are usefully filled
CPS 220
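The "about 50%" figure is just the product of the two rates quoted above; a trivial check in Python (my own sketch, not the slide's):

```python
fill_rate = 0.60    # fraction of delay slots the compiler manages to fill
useful_rate = 0.80  # fraction of filled slots doing useful computation

usefully_filled = fill_rate * useful_rate
print(usefully_filled)  # 0.48, i.e. roughly the "about 50%" on the slide
```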

34 Evaluating Branch Alternatives
[Table: branch penalty, CPI, speedup vs. unpipelined, and speedup vs. stall for four scheduling schemes: stall pipeline, predict taken, predict not taken, and delayed branch. The numeric entries were lost in the transcript.]
Assumes branches = 14% of instructions, 65% of them change the PC
CPS 220
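The table's entries follow from CPI = 1 + branch frequency × average branch penalty, where the average penalty can differ for taken and not-taken branches. Here is a hedged Python sketch; the default arguments mirror the slide's 14%/65% assumptions, but the penalties passed in the example are illustrative, not the slide's lost values:

```python
def effective_cpi(branch_frac=0.14, taken_frac=0.65,
                  penalty_taken=1.0, penalty_untaken=1.0):
    # CPI = 1 + branch freq * (weighted average of taken/untaken penalties).
    avg_penalty = taken_frac * penalty_taken + (1 - taken_frac) * penalty_untaken
    return 1.0 + branch_frac * avg_penalty

# Example: a "predict not taken" scheme that pays 1 cycle only on taken branches.
print(effective_cpi(penalty_taken=1, penalty_untaken=0))  # 1 + 0.14 * 0.65 = 1.091
```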

35 Compiler “Static” Prediction of Taken/Untaken Branches
Improves the strategy for placing instructions in the delay slot
Two strategies:
Backward branch: predict taken; forward branch: predict not taken
Profile-based prediction: record branch behavior, then predict branches based on the prior run
[Chart: misprediction rates for “taken backwards, not taken forwards” versus “always taken”.]
CPS 220

36 Evaluating Static Branch Prediction
The misprediction rate ignores the frequency of branches
“Instructions between mispredicted branches” is a better metric
CPS 220

37 Pipelining Complications
Interrupts (exceptions)
5 instructions executing in the 5-stage pipeline:
How to stop the pipeline? How to restart it? Who caused the interrupt?
Stage: problem interrupts occurring there
IF: page fault on instruction fetch; misaligned memory access; memory-protection violation
ID: undefined or illegal opcode
EX: arithmetic interrupt
MEM: page fault on data fetch; misaligned memory access; memory-protection violation
CPS 220

38 Pipelining Complications
Simultaneous exceptions in more than one pipeline stage:
A load with a data page fault in the MEM stage
An add with an instruction page fault in the IF stage
Solution #1: an interrupt status vector per instruction
Defer the check until the last stage; kill the state update if there is an exception
Solution #2: interrupt ASAP
Restart everything that is incomplete
An exception in a branch delay slot means the SW needs two PCs
Another advantage of updating state late in the pipeline!
CPS 220

39 Next Time
More pipeline complications
Longer pipelines (R4000) => better branch prediction, more instruction parallelism?
Todo:
Read Chapters 3 and 4
Homework #1 due
Project selection by September 30
CPS 220

40 Pipeline Complications
Complex addressing modes and instructions
Addressing modes: autoincrement causes a register change during instruction execution
Interrupts? Need to restore register state
Adds WAR and WAW hazards, since writes are no longer in the last stage
Memory-memory move instructions:
Must be able to handle multiple page faults
Long-lived instructions: partial state save on interrupt
Condition codes
CPS 220

41 Pipeline Complications: Floating Point

42 Pipelining Complications
Floating point: long execution time
Also, the FP execution units may be pipelined so they can initiate new instructions without waiting the full latency
FP instruction latencies and initiation rates (MIPS R4000):
  Add, Subtract: latency 4, initiation rate 3
  Multiply: latency 8, initiation rate 4
  Divide, Square root: long latencies (a source of interrupts and of WAW and WAR hazards)
  Negate: latency 2, initiation rate 1
  Absolute value: latency 2, initiation rate 1
  FP compare: latency 3, initiation rate 2
Latency = cycles before a dependent instruction can use the result; initiation rate = cycles before another instruction of the same type can issue
CPS 220

43 Summary of Pipelining Basics
Hazards limit performance Structural: need more HW resources Data: need forwarding, compiler scheduling Control: early evaluation & PC, delayed branch, prediction Increasing length of pipe increases impact of hazards; pipelining helps instruction bandwidth, not latency Compilers reduce cost of data and control hazards Load delay slots Branch delay slots Branch prediction Interrupts, Instruction Set, FP makes pipelining harder Handling context switches. CPS 220

44 Case Study: MIPS R4000 (100 MHz to 200 MHz)
8-stage pipeline:
IF: first half of instruction fetch; PC selection happens here, as well as initiation of the instruction cache access
IS: second half of the instruction cache access
RF: instruction decode and register fetch, hazard checking, and instruction cache hit detection
EX: execution, including effective address calculation, ALU operation, and branch-target computation and condition evaluation
DF: data fetch, first half of the data cache access
DS: second half of the data cache access
TC: tag check, to determine whether the data cache access hit
WB: write back for loads and register-register operations
With 8 stages, what is the impact on the load delay? The branch delay? Why?
Answer: 3 stages between the branch and the new instruction fetch, and 2 stages between a load and its use (even though, looking at the inserted stages alone, it might appear to be 3 for the load and 2 for the branch). Reasons: (1) Load: TC just does the tag check; the data is available after DS, so we supply the data and forward it, restarting the pipeline on a data cache miss. (2) The EX phase still does the address calculation even though a phase was added; presumably, to keep a fast clock cycle, the designers did not want to burden the RF phase with both reading registers AND testing for zero, so the test was moved back one phase. CPS 220

45 Case Study: MIPS R4000
TWO-cycle load latency:
[Diagram: overlapped IF IS RF EX DF DS TC WB stage rows, showing two stall cycles between a load and its first user.]
THREE-cycle branch latency (conditions are evaluated during the EX phase):
[Diagram: overlapped stage rows showing three cycles between a branch and the fetch of its target.]
Delay slot plus two stalls
A branch-likely instruction cancels the delay slot if the branch is not taken
CPS 220

46 MIPS R4000 Floating Point
FP adder, FP multiplier, FP divider
The last step of the FP multiplier/divider uses the FP adder hardware
8 kinds of stages in the FP units:
Stage  Functional unit  Description
A      FP adder         Mantissa ADD stage
D      FP divider       Divide pipeline stage
E      FP multiplier    Exception test stage
M      FP multiplier    First stage of multiplier
N      FP multiplier    Second stage of multiplier
R      FP adder         Rounding stage
S      FP adder         Operand shift stage
U                       Unpack FP numbers
CPS 220

47 MIPS FP Pipe Stages
FP Instr        1   2     3         4     5    6    7    8 …
Add, Subtract   U   S+A   A+R       R+S
Multiply        U   E+M   M         M     M    N    N+A  R
Divide          U   A     R         D^28 …    D+A, D+R, D+R, D+A, D+R, A, R
Square root     U   E     (A+R)^108 …     A    R
Negate          U   S
Absolute value  U   S
FP compare      U   A     R
Stages: A = mantissa ADD, D = divide pipeline stage, E = exception test, M = first multiplier stage, N = second multiplier stage, R = rounding, S = operand shift, U = unpack FP numbers
CPS 220

48 R4000 Performance
The CPI is not the ideal 1, because of:
Load stalls (1 or 2 clock cycles) Branch stalls (2 cycles + unfilled slots) FP result stalls: RAW data hazard (latency) FP structural stalls: Not enough FP hardware (parallelism) CPS 220

49 Next Time
Homework #1 is due
Instruction-Level Parallelism (ILP)
Read Chapter 4 CPS 220

