1 CDA 5155 Computer Architecture Week 1.5

2 Start with the materials: Conductors and Insulators Conductor: a material that permits electrical current to flow easily. (low resistance to current flow) Lattice of atoms with free electrons Insulator: a material that is a poor conductor of electrical current (High resistance to current flow) Lattice of atoms with strongly held electrons Semi-conductor: a material that can act like a conductor or an insulator depending on conditions. (variable resistance to current flow)

3 Making a semiconductor using silicon [Figure: silicon lattice with its shared electrons (e).] What is a pure silicon lattice? A. Conductor B. Insulator C. Semi-conductor

4 N-type Doping We can increase the conductivity by adding atoms of phosphorus or arsenic to the silicon lattice. Each has one more electron than silicon, and that extra electron is free to wander… This is called n-type doping since we add some free (negatively charged) electrons.

5 Making a semiconductor using silicon [Figure: silicon lattice with one phosphorus (P) atom; its extra electron is easily moved from its position.] What is an n-doped silicon lattice? A. Conductor B. Insulator C. Semi-conductor

6 P-type Doping Interestingly, we can also improve the conductivity by adding atoms of gallium or boron to the silicon lattice. They have fewer electrons (1 fewer) which creates a hole. Holes also conduct current by stealing electrons from their neighbor (thus moving the hole). This is called p-type doping since we have fewer (negatively charged) electrons in the bond holding the atoms together.

7 Making a semiconductor using silicon [Figure: silicon lattice with one gallium (Ga) atom and an empty electron position marked "?".] This atom will accept an electron even though it is one too many, since it fills the eighth electron position in this shell. Again this lets current flow, since the electron must come from somewhere to fill this position.

8 Using doped silicon to make a junction diode A junction diode allows current to flow in one direction and blocks it in the other. [Figure: junction diode connected between GND and Vcc.] Electrons like to move to Vcc. Electrons move from GND to fill holes.

9 Using doped silicon to make a junction diode A junction diode allows current to flow in one direction and blocks it in the other. [Figure: electrons (e) moving through the diode between GND and Vcc.] Current flows.

10 Making a transistor Our first level of abstraction is the transistor. (basically 2 diodes sitting back-to-back) P-type Gate

11 Making a transistor Transistors are electronic switches connecting the source to the drain if the gate is “on”. http://www.intel.com/education/transworks/INDEX.HTM Vcc

12 Review of basic pipelining 5-stage “RISC” load-store architecture (about as simple as things get). Instruction fetch: get instruction from memory/cache. Instruction decode: translate opcode into control signals and read regs. Execute: perform ALU operation. Memory: access memory if load/store. Writeback/retire: update register file.

13 Pipelined implementation Break the execution of the instruction into cycles (5 in this case). Design a separate datapath stage for the execution performed during each cycle. Build pipeline registers to communicate between the stages.

14 Stage 1: Fetch Design a datapath that can fetch an instruction from memory every cycle. Use PC to index memory to read instruction Increment the PC (assume no branches for now) Write everything needed to complete execution to the pipeline register (IF/ID) The next stage will read this pipeline register. Note that pipeline register must be edge triggered

15 [Figure: fetch datapath. Labels: PC, instruction memory, en, +1 adder, MUX, IF/ID pipeline register (instruction bits, PC + 1), rest of pipelined datapath.]

16 Stage 2: Decode Design a datapath that reads the IF/ID pipeline register, decodes instruction and reads register file (specified by regA and regB of instruction bits). Decode is easy, just pass on the opcode and let later stages figure out their own control signals for the instruction. Write everything needed to complete execution to the pipeline register (ID/EX) Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!). Including PC+1 even though decode didn’t use it.

17 [Figure: decode datapath, fed by the stage 1 fetch datapath. Labels: IF/ID pipeline register (instruction bits, PC + 1), register file (regA, regB, destReg, data, en), ID/EX pipeline register (contents of regA, contents of regB, PC + 1, instruction bits), rest of pipelined datapath.]

18 Stage 3: Execute Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register. The inputs are the contents of regA and either the contents of regB or the offset field on the instruction. Also, calculate PC+1+offset in case this is a branch. Write everything needed to complete execution to the pipeline register (EX/Mem) ALU result, contents of regB and PC+1+offset Instruction bits for opcode and destReg specifiers Result from comparison of regA and regB contents

19 [Figure: execute datapath, fed by the stage 2 decode datapath. Labels: ID/EX pipeline register (contents of regA, contents of regB, PC + 1, instruction bits), MUX, ALU, adder producing PC+1+offset, EX/Mem pipeline register (ALU result, contents of regB, PC+1+offset, instruction bits), rest of pipelined datapath.]

20 Stage 4: Memory Operation Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register. ALU result contains address for ld and st instructions. Opcode bits control memory R/W and enable signals. Write everything needed to complete execution to the pipeline register (Mem/WB) ALU result and MemData Instruction bits for opcode and destReg specifiers

21 [Figure: memory datapath, fed by the stage 3 execute datapath. Labels: EX/Mem pipeline register (ALU result, contents of regB, PC+1+offset, instruction bits), data memory (en, R/W), Mem/WB pipeline register (ALU result, memory read data, instruction bits), rest of pipelined datapath.] PC+1+offset goes back to the MUX before the PC in stage 1, along with the MUX control for the PC input.

22 Stage 5: Write back Design a datapath that completes the execution of this instruction, writing to the register file if required. Write MemData to destReg for ld instruction Write ALU result to destReg for add or nand instructions. Opcode bits also control register write enable signal.

23 [Figure: writeback datapath, fed by the stage 4 memory datapath. Labels: Mem/WB pipeline register (ALU result, memory read data, instruction bits), two MUXes, register write enable.] One MUX output goes back to the data input of the register file. The other, which chooses between bits 0-2 and bits 16-18, goes back to the destination register specifier.

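24 [Figure: complete pipelined datapath. Labels: PC, instruction memory, register file, sign extend, ALU, data memory, adders, MUXes, destination select between bits 0-2 and bits 16-18, and the IF/ID, ID/EX, EX/Mem and Mem/WB pipeline registers.]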

25 Sample Test Question (Easy) Which item does not need to be included in the Mem/WB pipeline register for the LC3101 pipelined implementation discussed in class? A. ALU result B. Memory read data C. PC+1+offset D. Destination register specifier E. Instruction opcode Answer: C. PC+1+offset

26 Sample Test Question (Hard?) What items need to be added to one of the pipeline registers (discussed in class) to support the ? A.IF/ID: PC B.ID/EX: PC+offset C.EX/Mem: Contents of regA D.EX/Mem: ALU2 result E.Mem/WB: Contents of regA

27 Things to think about… 1.How would you modify the pipeline datapath if you wanted to double the clock frequency? 2.Would it actually double? 3.How do you determine the frequency?
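One rough way to explore these questions is to assume the clock period equals the slowest stage delay plus a fixed pipeline-register overhead. The sketch below uses that assumption; the stage delays, the overhead, and the function name `clock_frequency_ghz` are all made up for illustration and do not come from the slides.

```python
def clock_frequency_ghz(stage_delays_ns, register_overhead_ns):
    """Frequency when the period is the slowest stage plus latch overhead."""
    period_ns = max(stage_delays_ns) + register_overhead_ns
    return 1.0 / period_ns


# Hypothetical delays (ns) for fetch, decode, execute, memory, writeback.
five_stage = [1.0, 0.8, 1.0, 1.0, 0.6]

# Hypothetical deeper pipeline: every stage split into two equal halves.
ten_stage = [delay / 2 for delay in five_stage for _ in range(2)]

# Assumed cost of one pipeline register (setup plus clock-to-output), in ns.
overhead_ns = 0.1

base = clock_frequency_ghz(five_stage, overhead_ns)
deep = clock_frequency_ghz(ten_stage, overhead_ns)
speedup = deep / base

print(f"5 stages: {base:.3f} GHz, 10 stages: {deep:.3f} GHz, ratio {speedup:.2f}")
```

Under these assumed numbers the ratio stays below 2, because the register overhead does not shrink when a stage is split, and real stages rarely split into equal halves.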

28 Sample Code (Simple) Run the following code on pipelined LC3101:
    add 1 2 3 ; reg 3 = reg 1 + reg 2
    nand 4 5 6 ; reg 6 = ~(reg 4 & reg 5)
    lw 2 4 20 ; reg 4 = Mem[reg 2 + 20]
    add 2 5 5 ; reg 5 = reg 2 + reg 5
    sw 3 7 10 ; Mem[reg 3 + 10] = reg 7
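As a cross-check for the cycle-by-cycle trace that follows, here is a minimal sketch that runs the same five instructions one at a time with no pipeline. The initial register values and the memory word 99 at address 29 are read off the trace slides; the tuple encoding and the `run` helper are illustrative assumptions, not part of the LC3101 tools.

```python
def run(program, regs, mem):
    """Execute LC3101-style instructions sequentially (no pipelining)."""
    for op, a, b, c in program:
        if op == "add":
            regs[c] = regs[a] + regs[b]
        elif op == "nand":
            regs[c] = ~(regs[a] & regs[b])
        elif op == "lw":
            # regB is the destination; the third field is the offset.
            regs[b] = mem.get(regs[a] + c, 0)
        elif op == "sw":
            mem[regs[a] + c] = regs[b]
        else:
            raise ValueError(f"unsupported opcode: {op}")
    return regs, mem


program = [
    ("add", 1, 2, 3),
    ("nand", 4, 5, 6),
    ("lw", 2, 4, 20),
    ("add", 2, 5, 5),
    ("sw", 3, 7, 10),
]
regs = [0, 36, 9, 12, 18, 7, 41, 22]  # R0..R7 at Time 0 in the trace
mem = {29: 99}                        # word the trace shows lw reading

run(program, regs, mem)
print(regs, mem)
```

A correct pipeline has to end in the same architectural state as this sequential run, which is what the later hazard slides are about.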

29 [Figure: pipelined datapath template. IF/ID holds the instruction and PC+1; ID/EX holds op, dest, offset, valA, valB and PC+1; EX/Mem holds op, dest, target, ALU result, valB and eq?; Mem/WB holds op, dest, ALU result and mdata. Bits 22-24 feed the opcode; bits 0-2 and 16-18 feed the destination select.]

30 [Figure: datapath.] Initial State, Time: 0. Every pipeline register holds a noop. Register file: R0=0, R1=36, R2=9, R3=12, R4=18, R5=7, R6=41, R7=22.

31 [Figure: datapath.] Time: 1. Fetch: add 1 2 3.

32 [Figure: datapath.] Time: 2. Fetch: nand 4 5 6. In flight: nand 4 5 6, add 1 2 3 (reads 36 and 9).

33 [Figure: datapath.] Time: 3. Fetch: lw 2 4 20. In flight: lw 2 4 20, nand 4 5 6 (reads 18 and 7), add 1 2 3 (ALU result 45).

34 [Figure: datapath.] Time: 4. Fetch: add 2 5 5. In flight: add 2 5 5, lw 2 4 20, nand 4 5 6 (ALU result -3), add 1 2 3.

35 [Figure: datapath.] Time: 5. Fetch: sw 3 7 10. In flight: sw 3 7 10, add 2 5 5, lw 2 4 20 (address 29), nand 4 5 6, add 1 2 3. R3 now holds 45.

36 [Figure: datapath.] Time: 6. No more instructions. In flight: sw 3 7 10, add 2 5 5 (ALU result 16), lw 2 4 20 (memory data 99), nand 4 5 6. R6 now holds -3.

37 [Figure: datapath.] Time: 7. No more instructions. In flight: sw 3 7 10 (address 55), add 2 5 5, lw 2 4 20. R4 now holds 99.

38 [Figure: datapath.] Time: 8. No more instructions. In flight: sw 3 7 10 (stores 22 at address 55), add 2 5 5. R5 now holds 16.

39 [Figure: datapath.] Time: 9. No more instructions. In flight: sw 3 7 10. Register file: R0=0, R1=36, R2=9, R3=45, R4=99, R5=16, R6=-3, R7=22.

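40 Time graphs
Time:  1          2          3          4          5          6          7          8          9
add    fetch      decode     execute    memory     writeback
nand              fetch      decode     execute    memory     writeback
lw                           fetch      decode     execute    memory     writeback
add                                     fetch      decode     execute    memory     writeback
sw                                                 fetch      decode     execute    memory     writeback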

41 What can go wrong? Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if it is about to be written. Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that? Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?

42 Data Hazards Data hazards What are they? How do you detect them? How do you deal with them?

43 Pipeline function for ADD Fetch: read instruction from memory Decode: read source operands from reg Execute: calculate sum Memory: Pass results to next stage Writeback: write sum into register file

44 Data Hazards add 1 2 3 nand 3 4 5 [Figure: time graph in which add and nand each pass through fetch, decode, execute, memory and writeback, with nand one cycle behind add.] If not careful, nand will read the wrong value of R3.

45 [Figure: pipelined datapath template, identical to slide 29.]

46 [Figure: the same datapath with the bit-field labels (bits 0-2, 16-18, 22-24) removed.]

47 [Figure: the same datapath with the dest fields removed from the pipeline registers and a fwd block added.]

48 Three approaches to handling data hazards Avoid Make sure there are no hazards in the code Detect and Stall If hazards exist, stall the processor until they go away. Detect and Forward If hazards exist, fix up the pipeline to get the correct value (if possible)

49 Handling data hazards I: Avoid all hazards Assume the programmer (or the compiler) knows about the processor implementation. Make sure no hazards exist: put noops between any dependent instructions. add 1 2 3; noop; noop; nand 3 4 5 (add writes R3 in cycle 5; nand reads R3 in cycle 5).

50 Problems with this solution Old programs (legacy code) may not run correctly on new implementations Longer pipelines need more noops Programs get larger as noops are included Especially a problem for machines that try to execute more than one instruction every cycle Intel EPIC: Often 25% - 40% of instructions are noops Program execution is slower –CPI is 1, but some instructions are noops
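The last point can be made concrete with a small calculation. The sketch below assumes every instruction, noop or not, takes one cycle, so the cost per useful instruction grows with the fraction of noops; the function name is hypothetical.

```python
def cycles_per_useful_instruction(noop_fraction, cpi=1.0):
    """Cycles spent per non-noop instruction when a fraction of the code is noops."""
    if not 0 <= noop_fraction < 1:
        raise ValueError("noop_fraction must be in [0, 1)")
    return cpi / (1.0 - noop_fraction)


# The 25% and 40% figures are the range quoted on the slide.
low = cycles_per_useful_instruction(0.25)
high = cycles_per_useful_instruction(0.40)
print(f"{low:.2f} to {high:.2f} cycles per useful instruction")
```

So a nominal CPI of 1 behaves like roughly 1.33 to 1.67 once the noops are discounted.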

51 Handling data hazards II: Detect and stall until ready Detect: Compare regA with previous DestRegs 3 bit operand fields Compare regB with previous DestRegs 3 bit operand fields Stall: Keep current instructions in fetch and decode Pass a noop to execute
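The detection step can be sketched as a set comparison. This is an illustration only: it models instructions as decoded tuples rather than bit fields, covers just the opcodes that appear in the sample programs, and the helper names are hypothetical.

```python
def sources(instr):
    """Registers an instruction reads in decode."""
    op, a, b, _ = instr
    if op in ("add", "nand", "sw", "beq"):
        return {a, b}
    if op == "lw":
        return {a}
    return set()  # noop reads nothing


def destination(instr):
    """Register an instruction writes, or None."""
    op, _, b, c = instr
    if op in ("add", "nand"):
        return c
    if op == "lw":
        return b
    return None  # sw, beq and noop write no register


def hazard_detected(decoding, in_flight):
    """True if the decoding instruction reads a register that an
    earlier, still-unfinished instruction will write."""
    pending = {destination(i) for i in in_flight} - {None}
    return bool(sources(decoding) & pending)


print(hazard_detected(("nand", 3, 4, 5), [("add", 1, 2, 3)]))  # True
```

In hardware each membership test is a 3-bit equality comparator, one per source register per earlier pipeline stage.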

52 Hazard detection [Figure: datapath, first half of cycle 3. IF/ID holds nand 3 4 5; ID/EX holds add with destination register 3.]

53 [Figure: comparators between IF/ID and ID/EX check regA and regB against destination register 3. Hazard detected.]

54 [Figure: bit-level view of the same 3-bit comparison. Hazard detected.]

55 Handling data hazards II: Detect and stall until ready Detect: –Compare regA with previous DestReg 3 bit operand fields –Compare regB with previous DestReg 3 bit operand fields Stall: Keep current instructions in fetch and decode Pass a noop to execute

56 Hazard [Figure: datapath, first half of cycle 3. IF/ID holds nand 3 4 5, ID/EX holds add with destination register 3, and an en signal is shown.]

57 Handling data hazards II: Detect and stall until ready Detect: –Compare regA with previous DestReg 3 bit operand fields –Compare regB with previous DestReg 3 bit operand fields Stall: –Keep current instructions in fetch and decode Pass a noop to execute

58 [Figure: datapath, end of cycle 3. nand 3 4 5 stays in IF/ID, a noop enters ID/EX, and add moves to EX/Mem with ALU result 21.]

59 Hazard [Figure: datapath, first half of cycle 4. nand 3 4 5 is still in IF/ID and add (destination register 3) is in EX/Mem.]

60 [Figure: datapath, end of cycle 4. nand 3 4 5 stays in IF/ID, a second noop enters ID/EX, and add moves to Mem/WB with result 21.]

61 No Hazard [Figure: datapath, first half of cycle 5. add is writing register 3 while nand 3 4 5 reads it.]

62 [Figure: datapath, end of cycle 5. nand is in ID/EX with operand values 21 and 11, and R3 in the register file now holds 21.]

63 No more stalling add 1 2 3 nand 3 4 5 [Figure: time graph. add: fetch, decode, execute, memory, writeback. nand: fetch, decode, decode, decode, execute; the repeated decodes are marked as the hazard.] Assume the register file gives the right value of R3 when it is read and written during the same cycle.
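The number of repeated decode cycles follows from the stage positions. A minimal sketch, assuming the 5-stage pipeline above, detect-and-stall with no forwarding, and a register file that returns the new value when read and written in the same cycle; the function names are hypothetical.

```python
def stall_cycles(distance):
    """Stalls for a consumer that is `distance` instructions after its producer
    (distance 1 means the consumer immediately follows the producer)."""
    if distance < 1:
        raise ValueError("distance must be at least 1")
    # Producer fetched in cycle f writes back in cycle f + 4. The consumer
    # would decode in cycle f + distance + 1, so it waits 3 - distance cycles.
    return max(0, 3 - distance)


def decode_cycles(consumer_fetch_cycle, distance):
    """Cycles the consumer spends in decode, including stalled ones."""
    first = consumer_fetch_cycle + 1
    return list(range(first, first + stall_cycles(distance) + 1))


# add is fetched in cycle 1, nand in cycle 2, right behind it.
print(decode_cycles(2, 1))  # [3, 4, 5]
```

This matches the time graph: nand decodes in cycles 3, 4 and 5 and only then moves on to execute.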

64 Problems with detect and stall CPI increases every time a hazard is detected! Is that necessary? Not always! Re-route the result of the add to the nand nand no longer needs to read R3 from reg file It can get the data later (when it is ready) This lets us complete the decode this cycle –But we need more control to remember that the data that we aren’t getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.

65 Handling data hazards III: Detect and forward Detect: same as detect and stall Except that all 4 hazards are treated differently i.e., you can’t logical-OR the 4 hazard signals Forward: New bypass datapaths route computed data to where it is needed New MUX and control to pick the right data Beware: Stalling may still be required even in the presence of forwarding
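The MUX choice and the remaining stall case can be sketched as follows. This is an assumption-laden illustration: it models only two bypass sources (EX/Mem and Mem/WB), whereas the datapath figures that follow label three hazard paths (H1, H2, H3), and the function names are hypothetical.

```python
def select_operand(src_reg, regfile_value, ex_mem, mem_wb):
    """Pick the value for one ALU input.

    ex_mem and mem_wb are (dest_reg, value) pairs, or None when that stage
    holds nothing that writes a register. The newest producer wins.
    """
    for stage in (ex_mem, mem_wb):
        if stage is not None and stage[0] == src_reg:
            return stage[1]
    return regfile_value


def must_stall(src_regs, instr_in_execute):
    """Load-use case: lw data does not exist until after the memory stage,
    so an instruction right behind the lw cannot be fed by forwarding."""
    op, dest = instr_in_execute
    return op == "lw" and dest in src_regs


# nand 3 4 5 right behind add 1 2 3: stale R3 = 10, forwarded result = 21.
print(select_operand(3, 10, (3, 21), None))  # 21
# sw 6 2 12 right behind lw 3 6 10 still needs a stall.
print(must_stall({6, 2}, ("lw", 6)))         # True
```

Because each source can match a different stage, the hazard signals select MUX inputs individually instead of being OR-ed into a single stall signal.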

66 Sample Code Which hazards do you see? add 1 2 3 nand 3 4 5 add 4 3 7 add 6 3 7 lw 3 6 10 sw 6 2 12

67 Hazard [Figure: datapath with forwarding (fwd), first half of cycle 3. IF/ID holds nand 3 4 5; ID/EX holds add with destination register 3. Register file: R0=0, R1=14, R2=7, R3=10, R4=11, R5=77, R6=1, R7=8.]

68 [Figure: end of cycle 3. IF/ID shows add 4 3 7; nand is in ID/EX with hazard H1 recorded; add is in EX/Mem with ALU result 21.]

69 New Hazard [Figure: first half of cycle 4. The value 21 is forwarded to nand (H1); IF/ID shows add 6 3 7, which also reads register 3.]

70 [Figure: end of cycle 4. IF/ID holds lw 3 6 10; ID/EX holds add with H1 and H2 recorded; EX/Mem holds nand with ALU result -2; Mem/WB holds add with result 21.]

71 No Hazard [Figure: first half of cycle 5, same pipeline contents as slide 70.]

72 [Figure: end of cycle 5. IF/ID holds sw 6 2 12; ID/EX holds lw; EX/Mem holds add with ALU result 22; Mem/WB holds nand with result -2. R3 in the register file now holds 21.]

73 Hazard [Figure: first half of cycle 6. sw 6 2 12 in IF/ID reads register 6, the destination of lw in ID/EX.]

74 [Figure: end of cycle 6. sw 6 2 12 stays in IF/ID, a noop enters ID/EX, lw is in EX/Mem with address 31, and add is in Mem/WB with result 22. R5 now holds -2.]

75 Hazard [Figure: first half of cycle 7. sw 6 2 12 in IF/ID reads register 6, the destination of lw in EX/Mem.]

76 [Figure: end of cycle 7. sw is in ID/EX with H3 recorded, the noop is in EX/Mem, and lw is in Mem/WB with memory data 99. R7 now holds 22.]

77 [Figure: first half of cycle 8. The value 99 is forwarded to sw.]

78 [Figure: end of cycle 8. sw is in EX/Mem, the noop is in Mem/WB, and R6 in the register file now holds 99.]

79 Control hazards How can the pipeline handle branch and jump instructions?

80 Pipeline function for BEQ Fetch: read instruction from memory Decode: read source operands from reg Execute: calculate target address and test for equality Memory: Send target to PC if test is equal Writeback: Nothing left to do

81 Control Hazards beq 1 1 10 sub 3 4 5 [Figure: time graph. beq: fetch, decode, execute, memory, writeback. sub: fetch, decode, execute, starting one cycle behind beq.]

82 Approaches to handling control hazards Avoid Make sure there are no hazards in the code Detect and Stall Delay fetch until branch resolved. Speculate and Squash-if-Wrong Go ahead and fetch more instruction in case it is correct, but stop them if they shouldn’t have been executed

83 Handling control hazards I: Avoid all hazards Don’t have branch instructions! Maybe a little impractical Delay taking branch: dbeq r1 r2 offset Instructions at PC+1, PC+2, etc will execute before deciding whether to fetch from PC+1+offset. (If no useful instructions can be placed after dbeq, noops must be inserted.)

84 Problems with this solution Old programs (legacy code) may not run correctly on new implementations Longer pipelines need more instructions/noops after delayed beq Programs get larger as noops are included Especially a problem for machines that try to execute more than one instruction every cycle Intel EPIC: Often 25% - 40% of instructions are noops Program execution is slower –CPI equals 1, but some instructions are noops

85 Handling control hazards II: Detect and stall Detection: Must wait until decode Compare opcode to beq or jalr Alternately, this is just another control signal Stall: Keep current instructions in fetch Pass noop to decode stage (not execute!)

86 [Figure: pipelined datapath with sign extend and a Control block. Labels include a noop input on a MUX.]

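87 Control Hazards beq 1 1 10 sub 3 4 5 [Figure: time graph. beq: fetch, decode, execute, memory, writeback. The slot behind it shows fetch, fetch, fetch while the branch resolves, followed by a fetch of either sub or the instruction at Target.]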

88 Problems with detect and stall CPI increases every time a branch is detected! Is that necessary? Not always! Only about ½ of the time is the branch taken Let’s assume that it is NOT taken… –In this case, we can ignore the beq (treat it like a noop) –Keep fetching PC + 1 What if we are wrong? –OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don’t perform writeback)

89 Handling control hazards III: Speculate and squash Speculate: assume not equal. Keep fetching from PC+1 until we know that the branch is really taken. Squash: stop bad instructions if taken. Send a noop to Decode, Execute and Memory. Send target address to PC.
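The cost of this scheme can be estimated with a simple CPI model. The sketch below assumes a base CPI of 1 and three squashed instructions per taken branch (the three noops named on the slide); the 20% branch frequency is a made-up number, and the function name is hypothetical.

```python
def cpi_with_squash(branch_fraction, taken_fraction, squashed_per_taken=3):
    """Average CPI when every taken branch wastes `squashed_per_taken` slots."""
    return 1.0 + branch_fraction * taken_fraction * squashed_per_taken


# Hypothetical workload: 20% branches, about half of them taken.
cpi = cpi_with_squash(branch_fraction=0.20, taken_fraction=0.5)
print(f"CPI = {cpi:.2f}")
```

Under these assumptions the CPI is about 1.3, and all of the penalty comes from taken branches.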

90 [Figure: pipelined datapath with sign extend and a Control block driven by the equal signal. Labels: noop, and the instructions beq, sub, add and nand in flight.]

91 Problems with fetching PC+1 CPI increases every time a branch is taken! About ½ of the time Is that necessary? No!, but how can you fetch from the target before you even know the previous instruction is a branch – much less whether it is taken???

92 [Figure: pipelined datapath with sign extend and a Control block. Labels: beq, bpc, target, eq?.]

93 Branch prediction Predict not taken: ~50% accurate; predict backward taken: ~65% accurate; predict same as last time: ~80% accurate; Pentium: ~85% accurate; Pentium Pro: ~92% accurate; best paper designs: ~97% accurate.
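The "predict same as last time" scheme can be sketched as a table holding one remembered outcome per branch. The class name, the default prediction for unseen branches, and the loop-branch trace below are assumptions for illustration; the accuracy it prints describes only this synthetic trace, not the measurements quoted on the slide.

```python
class LastOutcomePredictor:
    """Predict that each branch does what it did the previous time."""

    def __init__(self):
        self.last = {}  # branch PC -> last observed outcome

    def predict(self, pc):
        # Assumption: a branch never seen before is predicted not taken.
        return self.last.get(pc, False)

    def update(self, pc, taken):
        self.last[pc] = taken


def accuracy(predictor, trace):
    """Fraction of (pc, taken) pairs predicted correctly."""
    correct = 0
    for pc, taken in trace:
        if predictor.predict(pc) == taken:
            correct += 1
        predictor.update(pc, taken)
    return correct / len(trace)


# Synthetic loop branch at PC 100: taken 9 times, then not taken, run twice.
trace = [(100, t) for _ in range(2) for t in [True] * 9 + [False]]
result = accuracy(LastOutcomePredictor(), trace)
print(f"{result:.0%}")
```

On this trace the predictor misses once when a loop exits and once more when it is re-entered, which is the usual weakness of a one-outcome history.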

94 Handling control hazards II: Detect and stall Detection: must wait until decode; compare opcode to beq or jalr (alternately, this is just another control signal). Stall: keep current instructions in fetch; pass noop to decode stage (not execute!).

95 [Figure: pipelined datapath with sign extend and a Control block. Labels include a noop input on a MUX, as on slide 86.]

96 Role of the Compiler The primary user of the instruction set. Exceptions (getting less common): some device drivers; specialized library routines; some small embedded systems (synthesized arch). Compilers must: generate a correct translation into machine code. Compilers should: compile quickly; generate fast code. While we are at it: generate reasonable code size; provide good debug support.

97 Structure of Compilers Front-end: translates high-level semantics to some generic intermediate form. The intermediate form does not have any resource constraints, but uses simple instructions. Back-end: translates the intermediate form into assembly/machine code for the target architecture, performing resource allocation and code optimization under resource constraints. Architects are mostly concerned with optimization.

98 Typical optimizations: CSE Common sub-expression elimination. Before: c = array1[d+e] / array2[d+e]; After: i = d+e; c = array1[i] / array2[i]; Purpose: fewer instructions, faster code. Architectural issues: more register pressure.

99 Typical optimization: LICM Loop invariant code motion. for (i=0; i<100; i++) { t = 5; array1[i] = t; } Purpose: remove statements or expressions from loops that need only be executed once (idempotent). Architectural issues: more register pressure.
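The slide shows only the loop before the transformation. Here is a minimal before/after sketch, transliterated to Python so it can be run; the function names are hypothetical, and a real compiler would apply this to its intermediate form instead of to source text.

```python
def before(n=100):
    """Loop as written on the slide: t is reassigned on every iteration."""
    array1 = [0] * n
    for i in range(n):
        t = 5            # loop-invariant: same value every time
        array1[i] = t
    return array1


def after(n=100):
    """Same loop with the invariant assignment hoisted out."""
    array1 = [0] * n
    t = 5                # executed once, then kept in a register
    for i in range(n):
        array1[i] = t
    return array1


print(before() == after())  # True
```

The hoisted `t` stays live across the whole loop, which is where the extra register pressure comes from.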

100 Other transformations Procedure inlining: better inst schedule; greater code size, more register pressure. Loop unrolling: better loop schedule; greater code size, more register pressure. Software pipelining: better loop schedule; greater code size, more register pressure. In general, “global” optimization: faster code; greater code size, more register pressure.

101 Compiled code characteristics Optimized code has different characteristics than unoptimized code. Fewer memory references, but it is generally the “easy ones” that are eliminated. Example: better register allocation retains active data in the register file; these would be cache hits in unoptimized code. Removing redundant memory and ALU operations leaves a higher ratio of branches in the code, so branch prediction becomes more important. Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure.

102 What do compiler writers want in an instruction set architecture? More resources: better optimization tradeoffs. Regularity: same behaviour in all contexts; no special cases (flags set differently for immediates). Orthogonality: data type independent of addressing mode; addressing mode independent of operation performed. Primitives, not solutions: keep instructions simple; it is easier to compose than to fit (ex. MMX operations).

103 What do architects want in an instruction set architecture? Simple instruction decode: tends to increase orthogonality. Small structures: more resource constraints. Small data bus fanout: tends to reduce orthogonality and regularity. Small instructions: make things implicit; non-regular, non-orthogonal, non-primitive.

104 To make faster processors Make the compiler team unhappy: more aggressive optimization over the entire program; more resource constraints, caches, HW schedulers; higher expectations (increase IPC). Make the hardware design team unhappy: tighter design constraints (clock); execute optimized code with more complex execution characteristics; make all stages bottlenecks (Amdahl’s law).

