Presentation is loading. Please wait.

Presentation is loading. Please wait.

1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia.

Similar presentations


Presentation on theme: "1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia."— Presentation transcript:

1 1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia

2 2 Flip Flops פליפ-לופ DQ A flip-flop “samples” right before the edge, and then “holds” value. Sampling circuit Holds value Which circuit contributes t_setup delay? Which circuit contributes t_clk-to-Q delay?

3 3 DQ clk-to-Q ? CLK == 0 Sense D, but Q outputs old value. CLK 0->1 Capture D, pass value to Q CLK setup ?hold ? clk-to-Q setup hold

4 4 Performance Equation Seconds Program Instructions Program = Seconds Cycle Instruction Cycles Goal is to optimize execution time, not individual equation terms. The CPI of the program. Reflects the program’s instruction mix. Machines are optimized with respect to program workloads. Clock period. Optimize jointly with machine CPI.

5 5 השעון Hertz=1/sec מחשב פנטיום במהירות של פירושו שהוא מבצע 8^10 *2 מחזורי שעון בשניה. כל מחזור שעון לוקח 200MHZ 5*10^-9=5nanosecond כמה לוקחת פקודה בימינו?

6 6 datapath מבנה ה - PC instruction memory +4 rt rs rd registers ALU Data memory imm 1. Instruction Fetch 2. Decode/ Register Read 3. Execute4. Memory 5. Write Back

7 7 Gotta Do Laundry °Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away ABCD °Dryer takes 30 minutes °“Folder” takes 30 minutes °“Stasher” takes 30 minutes to put clothes into drawers °Washer takes 30 minutes

8 8 Sequential Laundry Sequential laundry takes 8 hours for 4 loads TaskOrderTaskOrder B C D A 30 Time 30 6 PM 7 8 9 10 11 12 1 2 AM

9 9 Pipelined Laundry Pipelined laundry takes 3.5 hours for 4 loads! TaskOrderTaskOrder B C D A 12 2 AM 6 PM 7 8 9 10 11 1 Time 30

10 10 General Definitions Latency: time to completely execute a certain task –for example, time to read a sector from disk is disk access time or disk latency Throughput: amount of work that can be done over a period of time

11 11 Pipelining Lessons (1/2) Pipelining doesn’t help latency of single task, it helps throughput of entire workload Multiple tasks operating simultaneously using different resources Potential speedup = Number pipe stages Time to “ fill ” pipeline and time to “ drain ” it reduces speedup: 2.3X v. 4X in this example 6 PM 789 Time B C D A 30 TaskOrderTaskOrder

12 12 Pipelining Lessons (2/2) Suppose new Washer takes 20 minutes, new Stasher takes 20 minutes. How much faster is pipeline? Pipeline rate limited by slowest pipeline stage Unbalanced lengths of pipe stages also reduces speedup 6 PM 789 Time B C D A 30 TaskOrderTaskOrder

13 13 Inspiration: Automobile assembly line Assembly line moves on a steady clock. Each station does the same task on each car. Car body shell Car chassis Merge station Bolting station The clock

14 14 Inspiration: Automobile assembly line Simpler station tasks → more cars per hour. Simple tasks take less time, clock is faster.

15 15 Inspiration: Automobile assembly line Line speed limited by slowest task. Most efficient if all tasks take same time to do

16 16 Inspiration: Automobile assembly line Simpler tasks, complex car → long line! These lines go 24 x 7, and rarely shut down. Why?

17 17 Lessons from car assembly lines Faster line movement yields more cars per hour off the line. Faster line movement requires more stages, each doing simpler tasks. To maximize efficiency, all stages should take same amount of time (if not, workers in fast stages are idle) “Filling”, “flushing”, and “stalling” assembly line are all bad news.

18 18 Steps in Executing MIPS 1) IFetch: Fetch Instruction, Increment PC 2) Decode Instruction, Read Registers 3) Execute: Mem-ref:Calculate Address Arith-log: Perform Operation 4) Memory: Load:Read Data from Memory Store:Write Data to Memory 5) Write Back: Write Data to Register

19 19 datapath מבנה ה - PC instruction memory +4 rt rs rd registers ALU Data memory imm 1. Instruction Fetch 2. Decode/ Register Read 3. Execute4. Memory 5. Write Back

20 20 Pipelined Execution Representation Every instruction must take same number of steps, also called pipeline “stages”, so some will go idle sometimes IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB IFtchDcdExecMemWB Time

21 21 Key Analogy: The instruction is the car IR Instruction Fetch IR Pipeline Stage #1Stage #2 Controls hardware in stage 2 Stage #3 Controls hardware in stage 3 Stage #4 Controls hardware in stage 4 Stage #5 Controls hardware in stage 5 “Data-stationary control”

22 22 Representation #1: Timeline IR IF (Fetch)ID (Decode)EX (ALU) IR MEM WB ADD R4,R3,R2 OR R7,R6,R5 SUB R1,R9,R8 XOR R3,R2,R1 AND R6,R5,R4 I1: I2: I3: I4: I5: Sample Program IFID IF EX ID IF MEM EX ID IF WB MEM EX IF ID WB MEM ID EX IF WB EX MEM ID MEM WB EX Pipeline is “full” Good for visualizing pipeline fills. I1: I2: I3: I4: I5: t1t2t3t4t5t6t7t8 Time: Inst I6:

23 23 Pipeline is “full” Good for visualizing pipeline stalls. Representation #2: Resource Usage IR ADD R4,R3,R2 OR R7,R6,R5 SUB R1,R9,R8 XOR R3,R2,R1 AND R6,R5,R4 I1: I2: I3: I4: I5: Sample Program I1I2 I1 I3 I2 I1 I4 I3 I2 I1 I5 I4 I3 I1 I2 IF: ID: EX: MEM: WB: t1t2t3t4t5t6t7t8 Time: Stage IF (Fetch)ID (Decode)EX (ALU)MEM WB I5 I4 I2 I3 I6 I5 I3 I4 I6 I7 I4 I5 I6 I7 I8

24 24 Review: Datapath for MIPS Stage 1 Stage 2Stage 3Stage 4Stage 5 Use datapath figure to represent pipeline IFtchDcdExecMemWB ALU I$ Reg D$Reg PC instruction memory +4 rt rs rd registers ALU Data memory imm 1. Instruction Fetch 2. Decode/ Register Read 3. Execute4. Memory 5. Write Back

25 25 Graphical Pipeline Representation I n s t r. O r d e r Load Add Store Sub Or I$ Time (clock cycles) I$ ALU Reg I$ D$ ALU Reg D$ Reg I$ D$ Reg ALU Reg D$ Reg D$ ALU (In Reg, right half highlight read, left half write) Reg I$

26 26 Example Suppose 2 ns for memory access, 2 ns for ALU operation, and 1 ns for register file read or write Nonpipelined Execution: –lw : IF + Read Reg + ALU + Memory + Write Reg = 2 + 1 + 2 + 2 + 1 = 8 ns –add: IF + Read Reg + ALU + Write Reg = 2 + 1 + 2 + 1 = 6 ns Pipelined Execution: –Max(IF,Read Reg,ALU,Memory,Write Reg) = 2 ns

27 27 PIPELINE הרעיון מאחורי Never waste time !!!

28 28 חלוקה לשלבים

29 29 הוספת הרגיסטרים

30 30 Instruction memory Address 4 32 0 Add Add result Shift left 2 I n s t r u c t i o n IF/IDEX/MEMMEM/WB M u x 0 1 Add PC 0 Write data M u x 1 Registers Read data 1 Read data 2 Read register 1 Read register 2 16 Sign extend Write register Write data Read data 1 ALU result M u x ALU Zero ID/EX Instruction fetch lw A IF/ID

31 31 ID/EX

32 32 EX/MEM

33 33 Instruction memory Address 4 32 0 Add Add result Shift left 2 I n s t r u c t i o n IF/IDEX/MEM M u x 0 1 Ad d PC 0 Write data M u x 1 Registers Read data 1 Read data 2 Read register 1 Read register 2 16 Sign extend Write register Write data Read data Data memory 1 ALU result M u x ALU Zero ID/EXMEM/WB Memory lw Address MEM/WB

34 34

35 35 A correction !!! תיקון Keep the right Rd all the way!

36 36 So here is the updated CPU;

37 37 תצוגה גרפית

38 38 Control

39 39 קווי הבקרה

40 40 Datapath with Control

41 41 דוגמא A demonstration of a sequence of instructions: Lw $10,20($1) Sub $11,$2,$3 And $12,$4,$5 Or $13,$6,$7 Add $14,$8,$9

42 42

43 43 ID: and $12, $4, $5

44 44

45 45

46 46 <4> 000 00 0000 000 00 00 0 0 00 0 0 0 1 0 M u x 0 1 Add PC 0 Write data M u x 1 Control Registers Read data 1 Read data 2 Read register 1 Read register 2 Instruction [20–16] Instruction [15–0] Sign extend Instruction [15–11] MemRead M e m W r i t e M u x M u x ALU Read data WB 14 14 Write register Write data Data memory Address Clock 9

47 47 Pipeline Hazard: Matching socks in later load A depends on D; stall since folder tied up TaskOrderTaskOrder B C D A E F bubble 12 2 AM 6 PM 7 8 9 10 11 1 Time 30

48 48 Problems for Computers Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle –Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away) –Control hazards: Pipelining of branches & other instructions stall the pipeline until the hazard “bubbles” in the pipeline –Data hazards: Instruction depends on result of prior instruction still in the pipeline

49 49 An example for data hazards: sub$2, $1, $3 and $12, $2, $5 or$13, $6, $2 add$14, $2, $2 sw$15, 100($2)

50 50 An example for data hazards: sub$2, $1, $3 and $12, $2, $5 or$13, $6, $2 add$14, $2, $2 sw$15, 100($2) An example for data hazards: Register $2 is updated only at the WB phase, i.e., the 5th clock cycle (actually at the end of the 5th clock cycle). However, we try to use it at the 3rd clock cycle when we read $2 at the decode phase of the and instruction

51 51 Graphic representation of data hazards:

52 52 sub$2, $1, $3 nop and $12, $2, $5 or$13, $6, $2 add$14, $2, $2 sw$15, 100($2) Solving data hazards by adding nops

53 53 Solving data hazards by adding nops Program execution order (in instructions) and $12, $2, $ or $13, $6, $2 add $14, $2, $ sw $15, 100($2 5 2 ) g IMReg IM Reg DMReg IMDMReg IMDMRe Reg Reg Reg DM IMRegDM R e g IMRegDM Reg IMRegDM R e g IMRegDM R e g sub $2, $1, $3 nop

54 54 The internal structure of the Register File 32 Read data 2 write data Read data 1 5 5 5 Rd reg 2 (= Rt) Rd reg 1 (= Rs) RegWrite Wr reg (= Rd) 32 E קוראים משתי היציאות בוזמנית ערכים של שני רגיסטרים שונים כותבים לאחד הרגיסטרים האחרים (בעליית השעון הבאה)

55 55 We could earn 1 ck cycle if GPR is “transparent” Program execution order (in instructions) and $12, $2, $ or $13, $6, $2 add $14, $2, $ sw $15, 100($2 5 2 ) g IMRegDM R e g IMRegDM Reg IMRegDM R e g sub $2, $1, $3 nop We could earn 1 ck cycle if GPR is “transparent”, i.e, we could see the write data to the GPR at the GPR outputs (if the write address equals the read address), i.e., during Ck #5.

56 56 The internal structure of the modified Register File. We ‘bypass” the input data (the write data) to the read data1 output whenever Rs=Rd/Rt (i.e., whenever read reg1=write reg but not zero). We “bypass” the input data (the write data) to the read data2 output whenever Rt=Rd/Rt (i.e., whenever read reg2=write reg, but not zero). 32 Read data 2 write data Read data 1 5 5 5 Rd reg 2 (= Rt) Rd reg 1 (= Rs) RegWrite Wr reg (= Rd) 32 E קוראים משתי היציאות בוזמנית ערכים של שני רגיסטרים שונים כותבים לאחד הרגיסטרים האחרים (בעליית השעון הבאה) 0 32 0 write data 32 write data Wr reg 5 5

57 57 sub$2, $1, $3 nop and $12, $2, $5 or$13, $6, $2 add$14, $2, $2 sw$15, 100($2) After doing that change we only need 2 nops After the change the WB of an early instruction can happen at the same time with the read reg (decode) phase of a newer instruction (3 with two other instructions in between). In case we have a data hazard, we need to add only two nop instructions. Unfortunately, this happens too often. We need a better solution!

58 58 Graphic representation of data hazards:

59 59

60 60 Forwarding – גניבת הערכים

61 61

62 62 Forwarding (done at the execute phase) If ID/EX.Rs=EX/MEM.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the “decode” phase), then we use the “ALUout” of the previous instruction instead of the output of the GPR. If ID/EX.Rs=MEM/WB.Rd, i.e., the Rd of the previous instruction equals the Rs of the current instruction (which is in the “decode” phase), then we use the “ALUout” of the previous instruction instead of the output of the GPR. [ similarly, compare also ID/EX.Rt to MEM/WB.Rd ] Similarly, compare also ID/EX.Rt to EX/MEM.Rd and to MEM/WB.Rd

63 63 Data hazard from previous instruction: ALU Src A: If (ID/EX.Rs = = EX/MEM.Rd) use the “ ALUOut ” instead of Rs I.e., if Rs of the current executing instruction = = Rd of the previous instruction The actual equations are: if ((EX/MEM.RegWrite = = ‘ 1 ” )&& (EX/MEM.Rd <> 0)&& (ID/EX.Rs = = EX/MEM.Rd)) => ForwardA= “ 1,0 ” ALU Src B: If (ID/EX.Rt = = EX/MEM.Rd) use the “ ALUOut ” instead of Rt I.e., if Rt of the current executing instruction = = Rd of the previous instruction The actual equations are: if ((EX/MEM.RegWrite = = ‘ 1 ” )&& (EX/MEM.Rd <> 0)&& (ID/EX.Rt = = EX/MEM.Rd)) => ForwardB= “ 1,0 ”

64 64 Data hazard from 2 instructions back: ALU Src A: If (ID/EX.Rs = = MEM/WB.Rd) use the GPR “ write data ” instead of Rs I.e., if Rs of the current executing instruction = = Rd of 2 instructions ago The actual equations are: if ((MEM/WB.RegWrite = = ‘ 1 ” )&& (MEM/WB.Rd <> 0)&& (ID/EX.Rs = = MEM/WB.Rd)) => ForwardA= “ 0,1 ” ALU Src B: If (ID/EX.Rt = = MEM/WB.Rd) use the GPR “ write data ” instead of Rt I.e., if Rt of the current executing instruction = = Rd of 2 instructions ago The actual equations are: if ((MEM/WB.RegWrite = = ‘ 1 ” )&& (MEM/WB.Rd <> 0)&& (ID/EX.Rt = = MEM/WB.Rd)) => ForwardB= “ 0,1 ” Double hazard: If there is a hazard from previous inst and the instruction before that?We should chhose the data from the previous instruction, it is up to date ( “ newer ” )!

65 65 An example for forwarding דוגמא Sub $2, $1, $3 And $4, $2, $5 needs forwarding from the previous instruction Or$4, $4, $2 needs forwarding from two instructions back Add $9, $4, $2 needs forwarding from 3 instructions back (thru the “ transparent ” GPR) Here we discuss the $2 register only (The first two cases are handled in the execute phase, the last one, in the decode phase).

66 66 An example for forwarding דוגמא Sub $2, $1, $3 And $4, $2, $5 Or$4, $4, $2 needs forwarding from the previous instruction Add $9, $4, $2 needs forwarding from the previous instruction Here we discuss the $4 register and there are two case (the 2nd one in purple)

67 67 Since Rs=2 and Rd of previous inst. was 2, we use ALUout instead of Rs Sub $2, $1, $3 And $4, $2, $5 Or$4, $4, $2 Add $9, $4, $2

68 68 In blue we see forwarding from two instructions back (Mem->Exec.), in red, from previous instruction (WB->Exec.), in purple, from 3 instructions back (WB- >Decode).

69 69 In blue we see forwarding from two instructions back (Mem->Exec.), in red, from previous instruction (WB->Exec.), in purple, from 3 instructions back (WB->Decode).

70 70 The solution does not work for lw - לא תמיד הפתרון עובד (in lw we do not have the data in the pipe!, it comes from the data memory!) If the previous instruction was lw to a register and we try to use the register in the current instruction, we have a problem, since we cannot go back in time! One solution is to avoid such cases by adding a nop (by the Assembler) whenever Rt of the lw is equal to Rs or Rt of the following instruction.

71 71 Another h/w solution is to add Bubbles, i.e., add nop by hardware “nop” We need to hold IF/ID for one ck cycle and insert a “ nop: into ID/EX. This is equal to adding a nop instruction by the Assembler.

72 72 Hazard detection unit We need to hold the IF/ID and PC for one ck cycle and insert a “ nop: into ID/EX. This is equal to adding a nop instruction by the Assembler. If (ID/EX.MemRd)&& ( (ID/EX.Rt= =IF/ID.Rs) || (ID/EX.Rt= =IF/ID.Rt) ) we must “ stall ” the pipeline! This means that prev. inst was lw and it was to the current Rs or Rt. (of course if one of them is not used, don ’ t stall) Holding means ” freeze ” the IF/ID and the PC for 1 clock cycle Hold the IF/ID by not giving a IF/IDWrire signal and do not increment the PC (which already points at the nex instruction) by not giving the PCWrite signal. Inserting a nop is by clearing all control signals. Rt from prev. inst. Rs, Rt of current inst. identifies lw

73 73 An example for lw hazard detection דוגמא lw $2, 20($1) And $4, $2, $5 Or$4, $4, $2 Add $9, $4, $2

74 74

75 75 The lw instruction is in the WB phase. $2 is “ being written ”. We can use $2 in the Execute phase of the and instruction, with the help of forwarding.

76 76

77 77 Just to remind us how branch is handled we show again the Datapath with Control

78 78 Branch Hazards Here we calc.Rs-Rt Here we decide to branch (switching the address to the PC and issuing PCWrite Cond) These 3 instructions should be “killed” before they do harm, I.e., change any register. In cc5 we already use the new PC calculated by the branch. (PC=72)

79 79 Control Hazard: Branching (1/7) Where do we do the compare for the branch? I$ beq Instr 1 Instr 2 Instr 3 Instr 4 ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg ALU I$ Reg D$Reg I n s t r. O r d e r Time (clock cycles)

80 80 Control Hazard: Branching (2/7) We put branch decision-making hardware in ALU stage –therefore two more instructions after the branch will always be fetched, whether or not the branch is taken Desired functionality of a branch –if we do not take the branch, don’t waste any time and continue executing normally –if we take the branch, don’t execute any instructions after the branch, just go to the desired label

81 81 Control Hazard: Branching (3/7) Initial Solution: Stall until decision is made –insert “no-op” instructions: those that accomplish nothing, just take time –Drawback: branches take 3 clock cycles each (assuming comparator is put in ALU stage)

82 82 Control Hazard: Branching (4/7) Optimization #1: –move asynchronous comparator up to Stage 2 –as soon as instruction is decoded (Opcode identifies is as a branch), immediately make a decision and set the value of the PC (if necessary) –Benefit: since branch is complete in Stage 2, only one unnecessary instruction is fetched, so only one no-op is needed –Side Note: This means that branches are idle in Stages 3, 4 and 5.

83 83

84 84 Insert a single no-op (bubble) Control Hazard: Branching (5/7) add beq lw ALU I$ Reg D$Reg ALU I$ Reg D$Reg ALU Reg D$Reg I$ I n s t r. O r d e r Time (clock cycles) bub ble Impact: 2 clock cycles per branch instruction  slow

85 85 Control Hazard: Branching (6/7) Optimization #2: Redefine branches – Old definition: if we take the branch, none of the instructions after the branch get executed by accident – New definition: whether or not we take the branch, the single instruction immediately following the branch gets executed (called the branch-delay slot)

86 86 Control Hazard: Branching (7/7) Notes on Branch-Delay Slot –Worst-Case Scenario: can always put a no- op in the branch-delay slot –Better Case: can find an instruction preceding the branch which can be placed in the branch-delay slot without affecting flow of the program re-ordering instructions is a common method of speeding up programs compiler must be very smart in order to find instructions to do this usually can find such an instruction at least 50% of the time Jumps also have a delay slot…

87 87 Example: Nondelayed vs. Delayed Branch add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit or $8, $9,$10 xor $10, $1,$11 Nondelayed Branch add $1,$2,$3 sub $4, $5,$6 beq $1, $4, Exit or $8, $9,$10 xor $10, $1,$11 Delayed Branch Exit:

88 88 Question (1/2) Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 103 loops, so pipeline full) Loop:lw $t0, 0($s1) addu $t0, $t0, $s2 sw $t0, 0($s1) addiu $s1, $s1, -4 bne $s1, $zero, Loop nop How many pipeline stages (clock cycles) per loop iteration to execute this code? 1 2 3 4 5 6 7 8 9 10

89 89 Answer (1/2) Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards. 10 3 iterations, so pipeline full. Loop:lw$t0, 0($s1) addu$t0, $t0, $s2 sw$t0, 0($s1) addiu$s1, $s1, -4 bne$s1, $zero, Loop nop How many pipeline stages (clock cycles) per loop iteration to execute this code? 1. 2. (data hazard so stall) 3. 4. 5. 6. (delayed branch so exec. nop) 7. 1 2 3 4 5 6 7 8 9 10

90 90 Question (2/2) Assume 1 instr/clock, delayed branch, 5 stage pipeline, forwarding, interlock on unresolved load hazards (after 10 3 loops, so pipeline full). Rewrite this code to reduce pipeline stages (clock cycles) per loop to as few as possible. Loop:lw $t0, 0($s1) addu $t0, $t0, $s2 sw $t0, 0($s1) addiu $s1, $s1, -4 bne $s1, $zero, Loop nop How many pipeline stages (clock cycles) per loop iteration to execute this code? 1 2 3 4 5 6 7 8 9 10

91 91 A (2/2) How long to execute? How many pipeline stages (clock cycles) per loop iteration to execute your revised code? (assume pipeline is full) Rewrite this code to reduce clock cycles per loop to as few as possible: Loop:lw$t0, 0($s1) addiu$s1, $s1, -4 addu$t0, $t0, $s2 bne$s1, $zero, Loop sw$t0, +4($s1) (no hazard since extra cycle) 1. 3. 4. 5. 2. (modified sw to put past addiu) 1 2 3 4 5 6 7 8 9 10

92 92 Peer Instruction A.Thanks to pipelining, I have reduced the time it took me to wash my shirt. B.Longer pipelines are always a win (since less work per stage & a faster clock). C.We can rely on compilers to help us avoid data hazards by reordering instrs. ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT

93 93 Peer Instruction Answer A.Thanks to pipelining, I have reduced the time it took me to wash my shirt. B.Longer pipelines are always a win (since less work per stage & a faster clock). C.We can rely on compilers to help us avoid data hazards by reordering instrs. ABC 1: FFF 2: FFT 3: FTF 4: FTT 5: TFF 6: TFT 7: TTF 8: TTT F A L S E A. Throughput better, not execution time B. “…longer pipelines do usually mean faster clock, but branches cause problems! C. “they happen too often & delay too long.” Forwarding! (e.g, Mem  ALU) F A L S E

94 94 The situation was better if we some how “moved” the branch address calculation one ck earlier. This is easy to do since sign extension and shift are only wires. We just need to move the branch address ALU 1 register to the left. Rverything happens 1 ck earlier and so we’ll have to “kill” only two instructions. Next, we’ll add a fast comparator which will compare Rs and Rt at the same ck cycle of the “decode” phase. (Instead of using the ALU to calc. Rs-Rt, we’ll built a simple and fast xor circuit). This means extra h/w but now we earned one more ck cycle. So, we have to kill only a single instruction. Killing an instruction also called “flushing” the pipeline, is easily done by clreaing the IF/ID register of the instruction following the branch (if the branch is successful)

95 95 Flushing

96 96 An example for flushing דוגמא sub$10, $4, $8 beq $1, $3, 7 and$12, $2, $5 lw $4, 50($7)

97 97 sub$10, $4, $8 beq $1, $3, 7 and$12, $2, $5 lw $4, 50($7)

98 98 Summary of hazards Data hazards: * Forward from previous instruction * Forward from two instructions ago * (Forward thru “ transparent ” GPR = from 3 instructions ago) * If we cannot forward, (after lw) we stall the pipe by inserting a nop and freezing IF/ID and PC for 1 ck cycle Control hazards: * If branch is successful we flush the instruction following the branch (which is at the IF/ID register. We just clear the register) Notes: In the real MIPS CPU, no flush was employed. This give the compiler the opportunity to put useful instructions following the branch. This explains why the simulator always performs the instruction following the branch.this is called a delayed branch. Also, in the real MIPS CPU no lw stall was used. Again this give some freedom to the compiler to choose whether to put a nop following lw or some useful instruction. This is called a delayed load.

99 99 Other slides

100 100 Handling exceptions


Download ppt "1 Pipeline Datapath With some slides from: John Lazzaro and Dan Garcia."

Similar presentations


Ads by Google