
CDA 5155 Computer Architecture Week 1.5

Start with the materials: Conductors and Insulators
Conductor: a material that permits electrical current to flow easily (low resistance to current flow). Lattice of atoms with free electrons.
Insulator: a material that is a poor conductor of electrical current (high resistance to current flow). Lattice of atoms with strongly held electrons.
Semiconductor: a material that can act like a conductor or an insulator depending on conditions (variable resistance to current flow).

Making a semiconductor using silicon
[Diagram: pure silicon lattice, each atom sharing its valence electrons with its neighbors]
What is a pure silicon lattice? A. Conductor B. Insulator C. Semiconductor

N-type Doping
We can increase the conductivity by adding atoms of phosphorus or arsenic to the silicon lattice. They have one more electron, which is free to wander. This is called n-type doping since we add some free (negatively charged) electrons.

Making a semiconductor using silicon
[Diagram: silicon lattice with one phosphorus (P) atom; its extra electron is easily moved from its site]
What is an n-doped silicon lattice? A. Conductor B. Insulator C. Semiconductor

P-type Doping
Interestingly, we can also improve the conductivity by adding atoms of gallium or boron to the silicon lattice. They have one fewer electron, which creates a hole. Holes also conduct current by stealing electrons from their neighbors (thus moving the hole). This is called p-type doping since we have fewer (negatively charged) electrons in the bonds holding the atoms together.

Making a semiconductor using silicon
[Diagram: silicon lattice with one gallium (Ga) atom and a hole]
This atom will accept an electron, even though it is one too many, since it fills the eighth electron position in this shell. Again this lets current flow, since the electron must come from somewhere to fill this position.

Using doped silicon to make a junction diode
A junction diode allows current to flow in one direction and blocks it in the other.
[Diagram: p-n junction between GND and Vcc] Electrons like to move toward Vcc; electrons move from GND to fill holes.

Using doped silicon to make a junction diode
A junction diode allows current to flow in one direction and blocks it in the other.
[Diagram: forward-biased junction between Vcc and GND; current flows as electrons cross the junction]

Making a transistor
Our first level of abstraction is the transistor (basically two diodes sitting back-to-back).
[Diagram: transistor cross-section showing the p-type regions and the gate]

Making a transistor
Transistors are electronic switches connecting the source to the drain if the gate is "on".
[Diagram: transistor switching a connection to Vcc]

Review of basic pipelining
5-stage "RISC" load-store architecture: about as simple as things get.
Instruction fetch: get instruction from memory/cache
Instruction decode: translate opcode into control signals and read regs
Execute: perform ALU operation
Memory: access memory if load/store
Writeback/retire: update register file

Pipelined implementation
Break the execution of the instruction into cycles (5 in this case).
Design a separate datapath stage for the execution performed during each cycle.
Build pipeline registers to communicate between the stages.
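
This organization can be sketched in a few lines of Python. The field names, the NOOP encoding, and the instruction format below are illustrative stand-ins, not the actual machine bits; only the stage/register structure is from the slides.

```python
# Sketch: five stages communicating only through pipeline registers.
NOOP = {"op": "noop"}

def make_pipeline():
    # The four registers sitting between the five stages.
    return {"IF/ID": NOOP, "ID/EX": NOOP, "EX/Mem": NOOP, "Mem/WB": NOOP}

def clock_edge(pipe, fetched):
    """One rising edge: every pipeline register latches what the stage
    behind it produced.  Copying back-to-front mimics edge-triggered
    registers: no stage sees a value written in the same cycle."""
    pipe["Mem/WB"] = pipe["EX/Mem"]
    pipe["EX/Mem"] = pipe["ID/EX"]
    pipe["ID/EX"] = pipe["IF/ID"]
    pipe["IF/ID"] = fetched
    return pipe

pipe = make_pipeline()
for inst in [{"op": "add"}, {"op": "nand"}, {"op": "lw"}]:
    clock_edge(pipe, inst)
print(pipe["EX/Mem"]["op"], pipe["ID/EX"]["op"], pipe["IF/ID"]["op"])
# → add nand lw
```

After three clock edges the first instruction fetched has advanced to the EX/Mem register, exactly one stage per cycle.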

Stage 1: Fetch
Design a datapath that can fetch an instruction from memory every cycle.
Use the PC to index memory to read the instruction.
Increment the PC (assume no branches for now).
Write everything needed to complete execution to the pipeline register (IF/ID).
The next stage will read this pipeline register.
Note that the pipeline register must be edge triggered.

[Fetch datapath diagram: the PC indexes instruction memory; an adder computes PC+1, which returns to the PC through a MUX; the instruction bits and PC+1 are written to the IF/ID pipeline register feeding the rest of the pipelined datapath]

Stage 2: Decode
Design a datapath that reads the IF/ID pipeline register, decodes the instruction, and reads the register file (specified by the regA and regB fields of the instruction bits).
Decode is easy: just pass on the opcode and let later stages figure out their own control signals for the instruction.
Write everything needed to complete execution to the pipeline register (ID/EX).
Pass on the offset field and both destination register specifiers (or simply pass on the whole instruction!), including PC+1 even though decode didn't use it.

[Decode datapath diagram: the IF/ID instruction bits drive the register file's regA/regB read ports; the contents of regA and regB, the instruction bits, and PC+1 are written to the ID/EX pipeline register]

Stage 3: Execute
Design a datapath that performs the proper ALU operation for the instruction specified and the values present in the ID/EX pipeline register.
The inputs are the contents of regA and either the contents of regB or the offset field of the instruction.
Also calculate PC+1+offset in case this is a branch.
Write everything needed to complete execution to the pipeline register (EX/Mem):
ALU result, contents of regB, and PC+1+offset
Instruction bits for the opcode and destReg specifiers
Result of the comparison of the regA and regB contents
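
The execute stage's job can be sketched as a single function from the ID/EX contents to the EX/Mem contents. Field names (valA, valB, offset, pc_plus_1) are illustrative stand-ins for the register fields named on the slides:

```python
def execute(id_ex):
    """Execute-stage sketch: pick the second ALU input, do the ALU op,
    compute the branch target PC+1+offset, and compare regA with regB."""
    a = id_ex["valA"]
    # regB contents for register-register ops, the offset otherwise.
    b = id_ex["valB"] if id_ex["op"] in ("add", "nand") else id_ex["offset"]
    result = ~(a & b) if id_ex["op"] == "nand" else a + b
    return {
        "op": id_ex["op"],
        "dest": id_ex["dest"],
        "aluResult": result,
        "valB": id_ex["valB"],                           # still needed by sw
        "target": id_ex["pc_plus_1"] + id_ex["offset"],  # branch target
        "eq": id_ex["valA"] == id_ex["valB"],            # for beq
    }

ex_mem = execute({"op": "add", "dest": 3, "valA": 10, "valB": 20,
                  "offset": 0, "pc_plus_1": 1})
print(ex_mem["aluResult"], ex_mem["target"], ex_mem["eq"])  # → 30 1 False
```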

[Execute datapath diagram: the ALU takes the contents of regA and a MUX selecting between the contents of regB and the offset; an adder computes PC+1+offset; the ALU result, contents of regB, target, and instruction bits are written to the EX/Mem pipeline register]

Stage 4: Memory Operation
Design a datapath that performs the proper memory operation for the instruction specified and the values present in the EX/Mem pipeline register.
The ALU result contains the address for ld and st instructions.
Opcode bits control the memory R/W and enable signals.
Write everything needed to complete execution to the pipeline register (Mem/WB):
ALU result and MemData
Instruction bits for the opcode and destReg specifiers

[Memory datapath diagram: the EX/Mem ALU result addresses data memory, with R/W and enable driven by the instruction bits; the branch target goes back to the MUX before the PC in stage 1 (MUX control for the PC input); the ALU result, memory read data, and instruction bits are written to the Mem/WB pipeline register]

Stage 5: Write back
Design a datapath that completes the execution of this instruction, writing to the register file if required.
Write MemData to destReg for the ld instruction.
Write the ALU result to destReg for the add and nand instructions.
Opcode bits also control the register write enable signal.

[Writeback datapath diagram: a MUX selects between the Mem/WB ALU result and the memory read data; the selected value goes back to the data input of the register file, the destination register specifier (instruction bits 0-2) goes back to the destination port, and the opcode drives the register write enable]

[Complete pipelined datapath: PC, instruction memory, register file, sign extend, ALU, data memory, and the IF/ID, ID/EX, EX/Mem, and Mem/WB pipeline registers, with MUXes for the PC input, the second ALU operand, and the writeback value]

Sample Test Question (Easy)
Which item does not need to be included in the Mem/WB pipeline register for the LC3101 pipelined implementation discussed in class?
A. ALU result
B. Memory read data
C. PC+1+offset
D. Destination register specifier
E. Instruction opcode
Answer: C. PC+1+offset

Sample Test Question (Hard?)
What items need to be added to one of the pipeline registers (discussed in class) to support the ?
A. IF/ID: PC
B. ID/EX: PC+offset
C. EX/Mem: Contents of regA
D. EX/Mem: ALU2 result
E. Mem/WB: Contents of regA

Things to think about…
1. How would you modify the pipeline datapath if you wanted to double the clock frequency?
2. Would it actually double?
3. How do you determine the frequency?

Sample Code (Simple)
Run the following code on the pipelined LC3101:
add 1 2 3   ; reg3 = reg1 + reg2
nand 4 5 6  ; reg6 = reg4 NAND reg5
lw 2 4 20   ; reg4 = Mem[reg2+20]
add 2 5 5   ; reg5 = reg2 + reg5
sw 3 7 10   ; Mem[reg3+10] = reg7
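
As a cross-check before tracing the pipeline, the architectural (one-instruction-at-a-time) effect of this sequence can be computed directly. The starting register values and the empty memory below are made up for illustration; only the data flow comes from the slide.

```python
# Hypothetical starting state: R0..R7 = 0, 10, 20, 0, 7, 5, 0, 99.
reg = dict(enumerate([0, 10, 20, 0, 7, 5, 0, 99]))
mem = {}

reg[3] = reg[1] + reg[2]          # add  1 2 3
reg[6] = ~(reg[4] & reg[5])       # nand 4 5 6
reg[4] = mem.get(reg[2] + 20, 0)  # lw   2 4 20 (memory assumed all zero)
reg[5] = reg[2] + reg[5]          # add  2 5 5
mem[reg[3] + 10] = reg[7]         # sw   3 7 10

print(reg[3], reg[6], reg[5], mem[reg[3] + 10])  # → 30 -6 25 99
```

Whatever pipelining scheme we use must produce exactly this final state.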

[Pipeline diagram with register contents labeled: IF/ID holds the instruction and PC+1; ID/EX holds op, dest, offset, valA, valB, and PC+1; EX/Mem holds op, dest, ALU result, valB, target, and eq?; Mem/WB holds op, dest, ALU result, and mdata; the register file holds R0-R7]

[Pipeline diagram: initial state, all stages holding noops. Time: 0]

[Pipeline diagram: fetch add. Time: 1]

[Pipeline diagram: fetch nand; add in decode. Time: 2]

[Pipeline diagram: fetch lw; nand in decode, add in execute. Time: 3]

[Pipeline diagram: fetch add; lw in decode, nand in execute, add in memory. Time: 4]

[Pipeline diagram: fetch sw; add in decode, lw in execute, nand in memory, add in writeback. Time: 5]

[Pipeline diagram: no more instructions; sw in decode, add in execute, lw in memory, nand in writeback. Time: 6]

[Pipeline diagram: no more instructions; sw in execute, add in memory, lw in writeback. Time: 7]

[Pipeline diagram: no more instructions; sw in memory, add in writeback. Time: 8]

[Pipeline diagram: no more instructions; sw in writeback. Time: 9]

Time graphs

Time:      1      2       3        4        5         6         7         8         9
add      fetch  decode  execute  memory   writeback
nand            fetch   decode   execute  memory    writeback
lw                      fetch    decode   execute   memory    writeback
add                              fetch    decode    execute   memory    writeback
sw                                        fetch     decode    execute   memory    writeback
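
The table above has a simple closed form: with no stalls, instruction i (counting from 0) occupies stage s in cycle i + s + 1. A small sketch that regenerates the table (stage names from the slides, layout choices my own):

```python
STAGES = ["fetch", "decode", "execute", "memory", "writeback"]

def timing_rows(instructions):
    """For each instruction, map cycle number -> stage name, assuming
    one instruction enters fetch per cycle and nothing ever stalls."""
    return [(name, {i + s + 1: STAGES[s] for s in range(len(STAGES))})
            for i, name in enumerate(instructions)]

for name, cells in timing_rows(["add", "nand", "lw", "add", "sw"]):
    row = "".join(f"{cells.get(t, ''):>10}" for t in range(1, 10))
    print(f"{name:>5}{row}")
```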

What can go wrong?
Data hazards: since register reads occur in stage 2 and register writes occur in stage 5, it is possible to read the wrong value if the value is about to be written.
Control hazards: a branch instruction may change the PC, but not until stage 4. What do we fetch before that?
Exceptions: how do you handle exceptions in a pipelined processor with 5 instructions in flight?

Data Hazards
What are they?
How do you detect them?
How do you deal with them?

Pipeline function for ADD
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate sum
Memory: pass results to next stage
Writeback: write sum into register file

Data Hazards
add 1 2 3
nand 3 4 5

Time:     1      2       3        4        5         6
add     fetch  decode  execute  memory   writeback
nand           fetch   decode   execute  memory    writeback

If we are not careful, nand will read the wrong value of R3.

[Pipeline diagram: the full datapath with the pipeline-register contents (op, dest, offset, valA, valB, PC+1, target, ALU result, mdata, eq?) labeled]

[Pipeline diagram: same datapath, highlighting where each operand value lives as the instructions move through the pipeline]

[Pipeline diagram: the datapath with a forwarding (fwd) path added from later stages back toward the ALU input]

Three approaches to handling data hazards
Avoid: make sure there are no hazards in the code.
Detect and Stall: if hazards exist, stall the processor until they go away.
Detect and Forward: if hazards exist, fix up the pipeline to get the correct value (if possible).

Handling data hazards I: Avoid all hazards
Assume the programmer (or the compiler) knows about the processor implementation.
Make sure no hazards exist: put noops between any dependent instructions.

add 1 2 3   ; writes R3 in cycle 5
noop
noop
nand 3 4 5  ; reads R3 in cycle 5
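
A compiler pass doing this can be sketched as below. The padding distance of 2 is an assumption taken from the slide's cycle counts (the producer writes in cycle 5, and with a write-then-read register file the consumer may decode in that same cycle); the tuple encoding of instructions is illustrative.

```python
GAP = 2  # slots that must separate a producer from a dependent consumer

def insert_noops(program):
    """program: list of (dest_reg, source_regs) tuples in issue order.
    Returns the list with ('noop', ()) padding inserted so no instruction
    reads a register written fewer than GAP+1 slots earlier."""
    out = []
    for dest, srcs in program:
        need = 0
        # Look back at the last GAP emitted slots for a producer.
        for back, (d, _) in enumerate(reversed(out), start=1):
            if back <= GAP and d in srcs:
                need = max(need, GAP - back + 1)
        out.extend([("noop", ())] * need)
        out.append((dest, srcs))
    return out

prog = [(3, (1, 2)),   # add  1 2 3
        (5, (3, 4))]   # nand 3 4 5 reads R3 immediately
print([d for d, _ in insert_noops(prog)])  # → [3, 'noop', 'noop', 5]
```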

Problems with this solution
Old programs (legacy code) may not run correctly on new implementations.
Longer pipelines need more noops.
Programs get larger as noops are included.
Especially a problem for machines that try to execute more than one instruction every cycle.
Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI is 1, but some instructions are noops.

Handling data hazards II: Detect and stall until ready
Detect:
Compare regA with the previous DestRegs' 3-bit operand fields.
Compare regB with the previous DestRegs' 3-bit operand fields.
Stall:
Keep current instructions in fetch and decode.
Pass a noop to execute.
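
The detect step is just two equality comparisons per in-flight destination. A minimal sketch, assuming (as in the cycle-by-cycle diagrams that follow) that the instructions whose results are not yet readable are the ones in execute and memory:

```python
def hazard_in_decode(regA, regB, in_flight_dests):
    """Compare the decode instruction's regA/regB fields (3-bit register
    numbers, so just small ints) against the destination registers of
    instructions still in flight.  None marks a slot with no writer
    (a noop, or an instruction like sw that writes no register)."""
    return any(d is not None and d in (regA, regB) for d in in_flight_dests)

# nand 3 4 5 sits in decode while add (dest = 3) is in execute:
print(hazard_in_decode(3, 4, [3, None]))  # → True
```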

[Pipeline diagram, first half of cycle 3: nand in decode compares its regA/regB fields (3) against the destination (3) of the add in execute; hazard detection fires]

[Detection logic diagram: comparators check the regA and regB fields in IF/ID against the destination field in ID/EX; the match on register 3 raises the "hazard detected" signal]



[Pipeline diagram, first half of cycle 3: hazard detected; the fetch and decode enable signals hold their instructions in place while the add proceeds]


[Pipeline diagram, end of cycle 3: a noop has been passed to execute; nand remains in decode while the add moves on]

[Pipeline diagram, first half of cycle 4: the hazard is still detected (add has not yet written R3); fetch and decode remain held]

[Pipeline diagram, end of cycle 4: another noop enters execute; nand is still in decode]

[Pipeline diagram, first half of cycle 5: no hazard; add is writing R3 back, so nand can finally read it]

[Pipeline diagram, end of cycle 5: nand moves into execute with the correct value of R3]

No more stalling

Time:     1      2       3        4        5         6         7        8
add     fetch  decode  execute  memory   writeback
nand           fetch   decode   decode   decode    execute   memory   writeback
                       (hazard: decode repeats while stalled)

Assume the register file gives the right value of R3 when it is read and written during the same cycle.

Problems with detect and stall
CPI increases every time a hazard is detected!
Is that necessary? Not always!
Re-route the result of the add to the nand.
nand no longer needs to read R3 from the reg file; it can get the data later (when it is ready).
This lets us complete the decode this cycle.
But we need more control to remember that the data we aren't getting from the reg file at this time will be found elsewhere in the pipeline at a later cycle.

Handling data hazards III: Detect and forward
Detect: same as detect and stall,
except that all 4 hazards are treated differently (i.e., you can't logical-OR the 4 hazard signals).
Forward:
New bypass datapaths route computed data to where it is needed.
New MUX and control to pick the right data.
Beware: stalling may still be required even in the presence of forwarding.
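
The "pick the right data" MUX control reduces to a priority check: the youngest in-flight producer of the register wins, and the register file is the fallback. A sketch of that selection (the stage names follow the slides; the (dest, value) pair encoding is illustrative):

```python
def forward_value(src_reg, ex_mem, mem_wb, regfile):
    """Return the most recent value of src_reg visible to the ALU.
    ex_mem / mem_wb: (dest_reg, value) pairs for the instructions in
    those pipeline registers, or None if the slot writes no register."""
    if ex_mem is not None and ex_mem[0] == src_reg:
        return ex_mem[1]        # bypass straight from the ALU output
    if mem_wb is not None and mem_wb[0] == src_reg:
        return mem_wb[1]        # bypass from the memory/writeback stage
    return regfile[src_reg]     # no hazard: the register file is current

regs = {3: 111}
# Two older writers of R3 are in flight; the younger one (EX/Mem) wins.
print(forward_value(3, (3, 30), (3, 7), regs))  # → 30
```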

Sample Code
Which hazards do you see?
add
nand
add
add
lw
sw

[Pipeline diagram, first half of cycle 3: hazard detected; the forwarding (fwd) path will supply the value]

[Pipeline diagram, end of cycle 3: hazard state H1 recorded; the pipeline keeps moving]

[Pipeline diagram, first half of cycle 4: new hazard; the forwarding MUX steers the bypassed value into the ALU input]

[Pipeline diagram, end of cycle 4: hazard states H1 and H2 in flight; lw fetched]

[Pipeline diagram, first half of cycle 5: no hazard for the instruction in decode]

[Pipeline diagram, end of cycle 5: sw fetched; lw in decode]

[Pipeline diagram, first half of cycle 6: hazard on the lw result; the enable signals hold fetch and decode, since the loaded value is not yet available to forward]

[Pipeline diagram, end of cycle 6: a noop is inserted; hazard state H2; sw waits in decode]

[Pipeline diagram, first half of cycle 7: hazard resolved; the loaded value can now be forwarded]

[Pipeline diagram, end of cycle 7: sw proceeds; hazard state H3]

[Pipeline diagram, first half of cycle 8: the forwarded value feeds sw]

[Pipeline diagram, end of cycle 8: sw continues down the pipeline]

Control hazards How can the pipeline handle branch and jump instructions?

Pipeline function for BEQ
Fetch: read instruction from memory
Decode: read source operands from reg
Execute: calculate target address and test for equality
Memory: send target to PC if test is equal
Writeback: nothing left to do

Control Hazards
beq
sub

Time:     1      2       3        4        5
beq     fetch  decode  execute  memory   writeback
sub            fetch   decode   execute  ...

The sub is fetched before we know whether the beq is taken.

Approaches to handling control hazards
Avoid: make sure there are no hazards in the code.
Detect and Stall: delay fetch until the branch is resolved.
Speculate and Squash-if-Wrong: go ahead and fetch more instructions in case it is correct, but stop them if they shouldn't have been executed.

Handling control hazards I: Avoid all hazards
Don't have branch instructions! (Maybe a little impractical.)
Delay taking the branch: dbeq r1 r2 offset
Instructions at PC+1, PC+2, etc. will execute before deciding whether to fetch from PC+1+offset.
(If no useful instructions can be placed after dbeq, noops must be inserted.)

Problems with this solution
Old programs (legacy code) may not run correctly on new implementations.
Longer pipelines need more instructions/noops after a delayed beq.
Programs get larger as noops are included.
Especially a problem for machines that try to execute more than one instruction every cycle.
Intel EPIC: often 25% - 40% of instructions are noops.
Program execution is slower: CPI equals 1, but some instructions are noops.

Handling control hazards II: Detect and stall
Detection: must wait until decode.
Compare the opcode to beq or jalr; alternately, this is just another control signal.
Stall:
Keep current instructions in fetch.
Pass a noop to the decode stage (not execute!).

[Pipeline diagram: control logic in decode injects a noop into the pipeline through a MUX while fetch is held]

Control Hazards

Time:     1      2       3        4        5
beq     fetch  decode  execute  memory   writeback
next           fetch   fetch    fetch    fetch

Fetch repeats until the branch resolves; only then do we fetch either the fall-through (sub) or the target.

Problems with detect and stall
CPI increases every time a branch is detected! Is that necessary? Not always!
Only about ½ of the time is the branch taken.
Let's assume that it is NOT taken:
In this case, we can ignore the beq (treat it like a noop) and keep fetching PC+1.
What if we are wrong?
OK, as long as we do not COMPLETE any instructions we mistakenly executed (i.e. don't perform writeback).

Handling control hazards III: Speculate and squash
Speculate: assume not equal.
Keep fetching from PC+1 until we know that the branch is really taken.
Squash: stop bad instructions if taken.
Send a noop to decode, execute, and memory.
Send the target address to the PC.
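
The squash action can be sketched as one function applied when the branch resolves in the memory stage. Register names follow the slides; the dictionary encoding and NOOP marker are illustrative:

```python
NOOP = {"op": "noop"}

def resolve_branch(pipe, taken, target):
    """If the branch now in memory is taken, the three younger
    instructions (already latched into IF/ID, ID/EX, EX/Mem) were
    fetched down the wrong path: overwrite them with noops so they
    never reach writeback, and redirect the PC to the target."""
    if taken:
        pipe["IF/ID"] = NOOP
        pipe["ID/EX"] = NOOP
        pipe["EX/Mem"] = NOOP
        pipe["PC"] = target
    return pipe

pipe = {"PC": 4, "IF/ID": {"op": "add"}, "ID/EX": {"op": "nand"},
        "EX/Mem": {"op": "sub"}}
resolve_branch(pipe, taken=True, target=17)
print(pipe["PC"], pipe["IF/ID"]["op"])  # → 17 noop
```

This is safe precisely because none of the squashed instructions has performed a register write or a store yet.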

[Pipeline diagram: beq resolves as taken in memory; the speculatively fetched sub, add, and nand in the earlier stages are replaced with noops and the PC is redirected to the target]

Problems with fetching PC+1
CPI increases every time a branch is taken (about ½ of the time)!
Is that necessary? No! But how can you fetch from the target before you even know the previous instruction is a branch, much less whether it is taken?

[Pipeline diagram: a branch predictor in fetch supplies a predicted PC (bpc) and target through the PC MUX, checked later against the eq? result]

Branch prediction
Predict not taken: ~50% accurate
Predict backward taken: ~65% accurate
Predict same as last time: ~80% accurate
Pentium: ~85% accurate
Pentium Pro: ~92% accurate
Best paper designs: ~97% accurate
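
The "same as last time" figure is easy to reproduce on a toy workload. The outcome stream below is made up for illustration: a loop branch taken 9 times, then not taken, repeated; the predictor mispredicts only at the two transitions of each period, giving the ~80% number from the slide.

```python
def last_time_accuracy(outcomes, initial=False):
    """'Same as last time' predictor for a single branch: predict the
    previous outcome, then record the actual one."""
    pred, correct = initial, 0
    for taken in outcomes:
        correct += (pred == taken)
        pred = taken
    return correct / len(outcomes)

stream = ([True] * 9 + [False]) * 10   # 10-iteration loop, run 10 times
print(last_time_accuracy(stream))      # → 0.8
```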



Role of the Compiler
The primary user of the instruction set.
Exceptions (getting less common): some device drivers; specialized library routines; some small embedded systems (synthesized arch).
Compilers must: generate a correct translation into machine code.
Compilers should: have fast compile time; generate fast code.
While we are at it: generate reasonable code size; good debug support.

Structure of Compilers
Front-end: translate high-level semantics to some generic intermediate form.
The intermediate form does not have any resource constraints, but uses simple instructions.
Back-end: translates the intermediate form into assembly/machine code for the target architecture.
Resource allocation; code optimization under resource constraints.
Architects are mostly concerned with optimization.

Typical optimizations: CSE
Common sub-expression elimination:
c = array1[d+e] / array2[d+e];
becomes
i = d+e;
c = array1[i] / array2[i];
Purpose: reduce instructions / faster code.
Architectural issues: more register pressure.

Typical optimization: LICM
Loop invariant code motion:
for (i=0; i<100; i++) { t = 5; array1[i] = t; }
becomes
t = 5;
for (i=0; i<100; i++) { array1[i] = t; }
Purpose: remove statements or expressions from loops that need only be executed once (idempotent).
Architectural issues: more register pressure.

Other transformations
Procedure inlining: better inst schedule; greater code size, more register pressure.
Loop unrolling: better loop schedule; greater code size, more register pressure.
Software pipelining: better loop schedule; greater code size, more register pressure.
In general, "global" optimization: faster code; greater code size, more register pressure.

Compiled code characteristics
Optimized code has different characteristics than unoptimized code.
Fewer memory references, but it is generally the "easy ones" that are eliminated.
Example: better register allocation retains active data in the register file; these would be cache hits in unoptimized code.
Removing redundant memory and ALU operations leaves a higher ratio of branches in the code, so branch prediction becomes more important.
Many optimizations provide better instruction scheduling at the cost of an increase in hardware resource pressure.

What do compiler writers want in an instruction set architecture?
More resources: better optimization tradeoffs.
Regularity: same behaviour in all contexts; no special cases (flags set differently for immediates).
Orthogonality: data type independent of addressing mode; addressing mode independent of operation performed.
Primitives, not solutions: keep instructions simple; it is easier to compose than to fit (ex. MMX operations).

What do architects want in an instruction set architecture?
Simple instruction decode: tends to increase orthogonality.
Small structures: more resource constraints.
Small data bus fanout: tends to reduce orthogonality and regularity.
Small instructions: make things implicit; non-regular, non-orthogonal, non-primitive.

To make faster processors
Make the compiler team unhappy:
More aggressive optimization over the entire program.
More resource constraints; caches; HW schedulers.
Higher expectations: increase IPC.
Make the hardware design team unhappy:
Tighter design constraints (clock).
Execute optimized code with more complex execution characteristics.
Make all stages bottlenecks (Amdahl's law).