Presentation is loading. Please wait.

Presentation is loading. Please wait.

ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher.

Similar presentations


Presentation on theme: "ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher."— Presentation transcript:

1 ELEC / Computer Architecture and Design Spring Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher Professor Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849 Spr 2015, Mar ELEC / Lecture 7

2 Pipelined Datapath (without Jump)
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU opcode Shift left 2 26-31 zero 21-25 Instr mem PC ALU Reg. File Data mem. 16-20 1 mux 0 0 mux 1 Sign ext. 16-20 for I-type lw 11-15 for R-type 1 mux 0 0-15 Spr 2015, Mar ELEC / Lecture 7

3 Mem. and Reg. File Need Controls
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU opcode Shift left 2 RegWrite 26-31 MemWrite MemRead zero 21-25 Instr mem PC ALU 16-20 Reg. File Data mem. 1 mux 0 0 mux 1 Sign ext. 16-20 for I-type lw 11-15 for R-type 1 mux 0 0-15 Spr 2015, Mar ELEC / Lecture 7

4 Multiplexers Need Controls
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU Shift left 2 opcode RegWrite Branch PCSrc 26-31 MemWrite MemRead MemtoReg ALUSrc zero 21-25 Instr mem PC ALU 16-20 Reg. File Data mem. 1 mux 0 0 mux 1 Sign ext. 16-20 for I-type lw 11-15 for R-type 1 mux 0 RegDst 0-15 Spr 2015, Mar ELEC / Lecture 7

5 ALU Needs a Control ALU ALU 1 mux 0 Add opcode Instr mem PC Reg. File
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU Shift left 2 opcode RegWrite Branch PCSrc 26-31 MemWrite MemRead MemtoReg ALUSrc zero 21-25 Instr mem PC ALU 16-20 Reg. File Data mem. 1 mux 0 0 mux 1 ALU cont. Sign ext. 0-5 ALUOp 16-20 for I-type lw 11-15 for R-type 1 mux 0 RegDst 0-15 Spr 2015, Mar ELEC / Lecture 7

6 Compare with Single-Cycle Control
Control signals are the same as those needed for a single-cycle datapath. Control signals are generated using the Opcode in the ID (instruction decode) cycle and then distributed to other cycles. Let us reexamine the implementation of the single-cycle control (slides of Lecture 5). Spr 2015, Mar ELEC / Lecture 7

7 Hardwired CU: Single-Cycle
Implemented by combinational logic. Control logic Datapath 6 funct. code Control signals To ALU 6 opcode 3 ALU control ALUOp 2 Spr 2015, Mar ELEC / Lecture 7

8 ALU ALU Data mem. Single-cycle datapath 0 mux 1 Add 1 mux 0 opcode
Jump 0-25 Shift left 2 0 mux 1 4 Add 1 mux 0 ALU Branch opcode MemtoReg CONTROL 26-31 RegWrite ALUSrc 21-25 zero MemWrite MemRead ALU Instr. mem. PC Reg. File Data mem. 16-20 1 mux 0 0 mux 1 1 mux 0 Single-cycle datapath 11-15 RegDst ALUOp ALU Cont. Sign ext. Shift left 2 0-15 0-5 Spr 2015, Mar ELEC / Lecture 7

9 Single-Cycle Control Logic
Inputs Outputs Instr. type Opcode Instruction bits R 1 lw sw X beq J MemtoReg RegWrite Branch ALUOp1 Jump RegDst ALUSrc MemRead MemWrite ALUOp0 Spr 2015, Mar ELEC / Lecture 7

10 Single-Cycle Control Circuit
Op5 Op4 Op3 Op2 Op1 Op0 lw R sw beq J RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOp0 Jump Spr 2015, Mar ELEC / Lecture 7

11 Funct. Code from IR (bits 0-5)
ALU Control Logic Inputs Outputs to ALU Instr. type From CU Funct. Code from IR (bits 0-5) 3-bit code Opera-tion ALUOp1 ALUOp0 F5 F4 F3 F2 F1 F0 lw, sw X 010 Add B 1 110 Subtract R 000 AND 001 OR 111 slt Spr 2015, Mar ELEC / Lecture 7

12 ALU Control Operation select from control From Control Circuit 3 zero
ALUOp ALUOp0 3 zero ALU result F3 F2 F1 F0 overflow Operation select ALU function 000 AND 001 OR 010 Add 110 Subtract 111 Set on less than ALU control Spr 2015, Mar ELEC / Lecture 7

13 Returning to Pipelined Control
Opcode input to control is supplied by the pipeline register IF/ID in the ID (instruction decode) cycle. Nine control signals are generated in the ID cycle, but none is used. They are saved in the pipeline register ID/EX. ALUSrc, RegDst and ALUOp (2 bits) are used in the EX (execute) cycle. Remaining 5 control signals are saved in the pipeline register EX/MEM. Branch, MemWrite and MemRead are used in the MEM (memory access) cycle. Remaining 2 control signals are saved in the pipeline register MEM/WB. MemtoReg and RegWrite are used in the WB (write back) cycle. Pipelined control is shown without Jump. Spr 2015, Mar ELEC / Lecture 7

14 Placing Control in Pipelined Datapath
IF/ID IF/ID ID/EX ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU opcode Shift left 2 CONTROL 26-31 Branch PCSrc RegWrite MemWrite MemRead MemtoReg ALUSrc zero Instr mem 21-25 PC ALU 16-20 Reg. File Data mem. 1 mux 0 0 mux 1 ALU cont. Sign ext. ALUOp 0-5 16-20 for I-type lw 11-15 for R-type 1 mux 0 RegDst 0-15 Spr 2015, Mar ELEC / Lecture 7

15 Highlighted Pipelined Control
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU opcode Shift left 2 CONTROL 26-31 Branch PCSrc RegWrite MemWrite MemRead MemtoReg ALUSrc zero Instr mem 21-25 PC ALU 16-20 Reg. File Data mem 1 mux 0 0 mux 1 ALU cont. Sign ext. ALUOp 0-5 16-20 for I-type lw 11-15 for R-type 1 mux 0 RegDst 0-15 Spr 2015, Mar ELEC / Lecture 7

16 Single-Cycle Performance
Assume 200 ps for memory access 100 ps for ALU operation 50 ps for register file read or write Cycle time set according to longest instruction: lw ≡ IF + ID/RegRead + ALU + MEM + RegWrite = = 600 ps Av. instruction execution time = clock cycle time Spr 2015, Mar ELEC / Lecture 7

17 Multicycle Performance
Consider SPECINT2000* instruction mix: 25% lw 5 cycles 10% sw 4 cycles 11% branch 3 cycles 2% jump 3 cycles 52% ALU instr. 4 cycles Av. CPI = 0.25× × × × ×4 = 4.12 Clock cycle time determined from longest operation (memory access) = 200 ps Av. instruction execution time = 4.12×200 = 824 ps *Set of benchmark programs used for performance evaluation, to be discussed in a later lecture. Spr 2015, Mar ELEC / Lecture 7

18 Pipeline Performance Neglect initial latency (reasonable for long programs). One instruction completed every clock cycle unless delayed by hazard. Average CPI: lw 2 cycles in 50% cases due to hazard 1.5 cycles sw cycle ALU cycle branch 2 cycles in 25% cases due to hazard cycles jump cycles For SPECINT2000 Av. CPI = 0.25× × × × ×1 = 1.17 Clock cycle time (longest operation: memory access) = 200 ps Av. instruction execution time = 1.17×200 = 234 ps Spr 2015, Mar ELEC / Lecture 7

19 Comparing Alternatives
Type of datapath and control Clock cycle time Average CPI Av. instruction execution time Single-cycle 600 ps 1.00 Multicycle 200 ps 4.12 824 ps Pipelined 1.17 234 ps Spr 2015, Mar ELEC / Lecture 7

20 Other Controls for Pipeline
Forwarding Stall Branch hazard and branch prediction Instruction flush Exceptions Spr 2015, Mar ELEC / Lecture 7

21 Forwarding Consider a data hazard:
sub $2, $1, $3 # computes result in CC3, writes in $2 in CC5 and $12, $2, $5 # reads $2 in CC3, adds in CC4 CC1 CC2 CC CC CC CC CC CC8 MEM:DM CC3: ALU saves new data in EX/MEM, to be written to $2 in CC5 IF: IM ID: REG. FILE READ EX: ALU WB: REG. WRITE sub $2, $1, $3 IF/ID ID/EX EX/MEM MEM/WB MEM:DM IF: IM ID: REG. FILE READ EX: ALU WB: REG. WRITE and $12, $2, $5 IF/ID ID/EX EX/MEM MEM/WB CC3: and reads $2 to ID/EX, but the correct data is in EX/MEM CC4: forwarding allows execution of “and” with correct data Spr 2015, Mar ELEC / Lecture 7

22 Understanding Forwarding
Let’s ask following questions: Q: Why is there a hazard? A: Source register for the present instruction is the same as the destination register of the previous instruction. Q: When is the source register data needed? A: In the execute cycle (CC4). Q: Is source register data available in CC4? A: Yes – use forwarding. No – use stall. Q: Where is the required data in CC4? A: In the pipeline register EX/MEM as ALU output. Spr 2015, Mar ELEC / Lecture 7

23 Forwarding Hardware A forwarding unit is added to execute (ALU) cycle hardware. Functions of forwarding unit: Hazard detection Forward correct data to ALU Inputs to forwarding unit: Source registers of the instruction in EX Destination registers of instructions in DM and WB Outputs of forwarding unit: multiplexer controls to route correct data to the ALU. Spr 2015, Mar ELEC / Lecture 7

24 Recall Register Definitions
R-type instruction (add, sub, and, or, ) opcode Rs Rt Rd shamt funct I-type instruction (beq, lw, sw, addi, ) opcode Rs Rt constant_or_address J-type instruction (j, jal, jr) opcode a___d___d___r___e___s___s where Rs is the first source register Rt is the second source register Rd is the destination register Spr 2015, Mar ELEC / Lecture 7

25 Forwarding Implemented
ID/EX EX/MEM MEM/WB IF/ID Branch addr. PC+4 ALU opcode Shift left 2 26-31 Addr mem 21-25 zero MUX 16-20 Reg. File ALU Data mem. 1 mux 0 0 mux 1 MUX Sign ext. 16-20 11-15 1 mux 0 Rd 21-25 Rs Forwarding unit 16-20 Rt Rd 0-15 Spr 2015, Mar ELEC / Lecture 7

26 Stall Delay next instruction by sending nop through pipeline.
Necessary when hazard not resolved by forwarding. CC1 CC2 CC CC CC CC6 DM CC4: new data in MEM/WB, to be written to $2 IM ID, REG. FILE READ ALU REG. FILE WRITE lw $2, 20($1) IF/ID ID/EX EX/MEM MEM/WB DM IM ID, REG. FILE READ ALU REG. FILE WRITE and $4, $2, $5 IF/ID ID/EX EX/MEM MEM/WB CC4: execution of and is impossible; correct data unavailable until end of CC4 Spr 2015, Mar ELEC / Lecture 7

27 Detecting Hazard Requiring Stall
Consider instruction in IF/ID being decoded: If Previous instruction (lw) activated MemRead, and Instruction being decoded has a source register (Rs or Rt) same as the destination register (Rt for lw) of the previous instruction Then, stall the pipeline: Force all control outputs to 0 Prevent PC from changing Prevent IF/ID from changing Spr 2015, Mar ELEC / Lecture 7

28 Stall Implementation opcode Reg. File ALU 0 mux 1 1 mux 0 Rt Hazard
detection unit MemRead PCWrite IF/IDWrite Rs ID/EX EX/MEM MEM/WB opcode IF/ID Control 26-31 MUX Shift left 2 21-25 zero MUX PC Addr mem 16-20 Reg. File 1 mux 0 ALU Data mem. 0 mux 1 MUX Sign ext. 16-20 11-15 1 mux 0 Rd 21-25 Rs Forwarding unit 16-20 Rt Rd 0-15 Spr 2015, Mar ELEC / Lecture 7

29 Stall Execution with stall and forwarding: CC1 CC2 CC3 CC4 CC5 CC6 CC7
IF: IM ID: REG. FILE READ EX: ALU MEM:DM WB: REG. WRITE IF/ID ID/EX MEM/WB EX/MEM CC4: new data in MEM/WB, to be written to $2 lw $2, 20($1) ID: REG. FILE READ MEM:DM IF: IM EX: ALU WB: REG. WRITE bubble (nop) and $4, $2, $5 IF/ID ID/EX EX/MEM MEM/WB ID: REG. FILE READ MEM:DM State of IF/ID is frozen in CC3 EX: ALU WB: REG. WRITE IF/ID ID/EX EX/MEM MEM/WB next is fetched twice since PC was frozen MEM:DM IF: IM IF: IM ID: REG. FILE READ EX: ALU WB: REG. WRITE next IF/ID IF/ID ID/EX EX/MEM MEM/WB Spr 2015, Mar ELEC / Lecture 7

30 Branch Hazard Consider heuristic – branch not taken.
Continue fetching instructions in sequence following the branch instructions. If branch is taken (indicated by zero output of ALU): Control generates branch signal in ID cycle. branch activates PCSource signal in the MEM cycle to load PC with new branch address. Three instructions in the pipeline must be flushed if branch is taken – can this penalty be reduced? Spr 2015, Mar ELEC / Lecture 7

31 Branch Hazard ALU ALU 1 mux 0 Add opcode Instr mem PC Reg. File
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 ALU opcode Shift left 2 CONTROL 26-31 Branch PCSrc RegWrite MemWrite MemRead MemtoReg beq ALUSrc zero Instr mem 21-25 PC ALU 16-20 Reg. File Data mem. 1 mux 0 0 mux 1 ALU cont. Sign ext. 0-5 ALUOp 16-20 for I-type lw 11-15 for R-type 1 mux 0 RegDst 0-15 Spr 2015, Mar ELEC / Lecture 7

32 Branch Not Taken Branch on condition to Z A B C D Z
cycle b cycle b+1 cycle b+2 cycle b+3 cycle b+4 Branch fetched Branch decoded Branch decision PC keeps D (br. not taken) A fetched A decoded A executed A continues B fetched B decoded B executed C fetched C decoded D fetched Spr 2015, Mar ELEC / Lecture 7

33 Branch Taken Three-cycle penalty Three instructions are
Branch on condition to Z A B C D Z cycle b cycle b+1 cycle b+2 cycle b+3 cycle b+4 Branch fetched Branch decoded Branch decision PC gets Z (br. taken) A fetched A decoded A executed Nop B fetched B decoded Nop C fetched Nop Z fetched Three-cycle penalty Three instructions are flushed if branch is taken Spr 2015, Mar ELEC / Lecture 7

34 Branch Penalty Reduction
IF/ID ID/EX EX/MEM MEM/WB 4 Add 1 mux 0 Add opcode Shift left 2 CONTROL 26-31 Branch PCSrc RegWrite MemWrite MemRead MemtoReg beq ALUSrc zero Instr mem 21-25 PC ALU 16-20 Reg. File Data mem. 1 mux 0 0 mux 1 ALU cont. Sign ext. 0-5 ALUOp 16-20 for I-type lw 11-15 for R-type 1 mux 0 RegDst 0-15 Spr 2015, Mar ELEC / Lecture 7

35 Branch Taken One-cycle penalty One instructions is
Branch to Z A B C D Z cycle b cycle b+1 cycle b+2 cycle b+3 cycle b+4 Branch fetched Branch decision PC gets Z A fetched A flushed Nop Nop Z fetched Z decoded Z executed One-cycle penalty One instructions is flushed if branch is taken Spr 2015, Mar ELEC / Lecture 7

36 Pipeline Flush If branch is taken (as indicated by zero), then control does the following: Change all control signals to 0, similar to the case of stall for data hazard, i.e., insert bubble in the pipeline. Generate a signal IF.Flush that changes the instruction in the pipeline register IF/ID to 0 (nop). Penalty of branch hazard is reduced by Adding branch detection and address generation hardware in the decode cycle – one bubble needed – a next address generation logic in the decode stage writes PC+4, branch address, or jump address into PC. Using branch prediction. Unrolling loops. Spr 2015, Mar ELEC / Lecture 7

37 Branch Prediction Useful for program loops.
A one-bit prediction scheme: a one-bit buffer carries a “history bit” that tells what happened on the last branch instruction History bit = 1, branch was taken History bit = 0, branch was not taken Not taken Predict branch taken 1 Predict branch not taken taken Not taken taken Spr 2015, Mar ELEC / Lecture 7

38 Branch Prediction = Address of Target History
recent branch addresses bit(s) instructions Low-order bits used as index PC+4 Next PC 1 Prediction Logic = PC Spr 2015, Mar ELEC / Lecture 7

39 Branch Prediction for a Loop
Execution of Instruction d a b c d e I = 0 Execu-tion seq. Old hist. bit Next instr. New hist. bit Prediction Pred. I Act. 1 e b Bad 2 Good 3 4 5 6 7 8 9 10 I = I + 1 X = X + R(I) I – 10 = 0? N Y Store X in memory h.bit = 0 branch not taken, h.bit = 1 branch taken. Spr 2015, Mar ELEC / Lecture 7

40 Prediction Accuracy One-bit predictor:
2 errors out of 10 predictions Prediction accuracy = 80% To improve prediction accuracy, use two-bit predictor: A prediction must be wrong twice before it is changed Spr 2015, Mar ELEC / Lecture 7

41 Two-Bit Prediction Buffer
Implemented as a two-bit counter. Can improve correct prediction statistics. Not taken Predict branch taken 11 Predict branch taken 10 taken taken Not taken taken Not taken Predict branch not taken 00 Predict branch not taken 01 Not taken taken Spr 2015, Mar ELEC / Lecture 7

42 Branch Prediction for a Loop
Execution of Instruction 4 1 2 3 4 5 I = 0 Execu-tion seq. Old Pred.Buf Next instr. New pred.Buf Prediction Pred. I Act. 1 10 2 11 Good 3 4 5 6 7 8 9 Bad I = I + 1 X = X + R(I) I – 10 = 0? N Y Store X in memory Spr 2015, Mar ELEC / Lecture 7

43 Exceptions A typical exception occurs when ALU produces an overflow signal. Control asserts following actions on exception: Change the PC address to hex. This is the location of the exception routine. This is done by adding an additional input to the PC input multiplexer. Overflow is detected in the EX cycle. Similar to data hazard and pipeline flush, Set IF/ID to 0 (nop). Generate ID.Flush and EX.Flush signals to set all control signals to 0 in ID/EX and EX/MEM registers. This also prevents the ALU result (presumed contaminated) from being written in the WB cycle. Spr 2015, Mar ELEC / Lecture 7


Download ppt "ELEC 5200-001/6200-001 Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher."

Similar presentations


Ads by Google