Dept. of Info. Sci. & Elec. Engg.

Slides:

Advertisements

Similar presentations

Morgan Kaufmann Publishers The Processor

Advertisements

Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.

February 2, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 5 - Pipelining II Krste Asanovic Electrical Engineering and Computer.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Pipelining III Steve Ko Computer Sciences and Engineering University at Buffalo.

CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture MIPS Review & Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo.

CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California.

©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan

L18 – Pipeline Issues 1 Comp 411 – Spring /03/08 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you.

L17 – Pipeline Issues 1 Comp 411 – Fall /1308 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you been.

Pipeline Exceptions & ControlCSCE430/830 Pipeline: Exceptions & Control CSCE430/830 Computer Architecture Lecturer: Prof. Hong Jiang Courtesy of Yifeng.

January 31, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer.

February 2, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 5 - Pipelining II Krste Asanovic Electrical Engineering and Computer.

CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California.

What are Exception and Interrupts? MIPS terminology Exception: any unexpected change in the internal control flow – Invoking an operating system service.

Asanovic/Devadas Spring Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.

CMPE 421 Parallel Computer Architecture

Constructive Computer Architecture Interrupts/Exceptions/Faults Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology.

Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.

CMPE 421 REVIEW: MIDTERM 1. A MODIFIED FIVE-Stage Pipeline PC A Y R MD1 addr inst Inst Memory Imm Ext add rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata.

LECTURE 10 Pipelining: Advanced ILP. EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls,

ECE/CS 552: Pipeline Hazards © Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim.

Computer Architecture and Parallel Computing 并行结构与计算 Lecture 2 - Pipelining Peng Liu Dept. of Info. Sci. & Elec. Engg. Zhejiang University

CS161 – Design and Architecture of Computer Systems

Electrical and Computer Engineering University of Cyprus

Interrupts and exceptions

Exceptions Another form of control hazard Could be caused by…

Computer Organization CS224

Basic Processor Structure/design

CS252 Graduate Computer Architecture Fall 2015 Lecture 4: Pipelining

William Stallings Computer Organization and Architecture 8th Edition

Morgan Kaufmann Publishers

ELEN 468 Advanced Logic Design

Single Clock Datapath With Control

Chapter 3 Top Level View of Computer Function and Interconnection

Pipeline Implementation (4.6)

Morgan Kaufmann Publishers The Processor

Peng Liu Lecture 7 Pipelining Peng Liu

\course\cpeg323-08F\Topic6b-323

CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining

CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining

Pipelining: Advanced ILP

Dr. George Michelogiannakis EECS, University of California at Berkeley

Chapter 4 The Processor Part 3

Review: MIPS Pipeline Data and Control Paths

Morgan Kaufmann Publishers The Processor

Single-cycle datapath, slightly rearranged

The processor: Pipelining and Branching

Computer Organization & Design 计算机组成与设计

Lecture 8: ILP and Speculation Contd. Chapter 2, Sections 2. 6, 2

Lecture 9. MIPS Processor Design – Pipelined Processor Design #2

Pipelining in more detail

Systems Architecture II

\course\cpeg323-05F\Topic6b-323

Data Hazards Data Hazard

Lecture 5. MIPS Processor Design

The Processor Lecture 3.6: Control Hazards

Control unit extension for data hazards

Computer Architecture

CS203 – Advanced Computer Architecture

Pipeline Control unit (highly abstracted)

ECE 445 – Computer Organization

Control Hazards Branches (conditional, unconditional, call-return)

Control unit extension for data hazards

Introduction to Computer Organization and Architecture

Control unit extension for data hazards

Chapter 11 Processor Structure and function

Interrupts and exceptions

Guest Lecturer: Justin Hsia

CMSC 611: Advanced Computer Architecture

ELEC / Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher.

Presentation transcript:

Dept. of Info. Sci. & Elec. Engg. Computer Architecture and Parallel Computing 体系结构与并行计算 Lecture 3 - Pipelining Peng Liu Dept. of Info. Sci. & Elec. Engg. Zhejiang University liupeng@zju.edu.cn May 9, 2011

Iron Law explains architecture design space Microcoding became less attractive as gap between RAM and ROM speeds reduced Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew Iron Law explains architecture design space Trade instructions/program, cycles/instruction, and time/cycle Load-Store RISC ISAs designed for efficient pipelined implementations Very similar to vertical microcode Inspired by earlier Cray machines (more on these later) 2 CS252 S05 2

An Ideal Pipeline All objects go through the same stages 1 2 3 4 All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages These conditions generally hold for industrial assembly lines, but instructions depend on each other! 3 CS252 S05 3

First build MIPS without pipelining with CPI=1 Pipelined MIPS To pipeline MIPS: First build MIPS without pipelining with CPI=1 Next, add pipeline registers to reduce cycle time while maintaining CPI=1 4 CS252 S05 4

Unpipelined Datapath for MIPS PCSrc br RegWrite clk WBSrc MemWrite addr wdata rdata Data Memory we rind jabs pc+4 0x4 Add Add RegDst BSrc ExtSel OpCode z OpSel clk zero? addr inst Inst. Memory PC rd1 GPRs rs1 rs2 ws wd rd2 we Imm Ext ALU Control 31 5 CS252 S05 5

Hardwired Control Table Opcode ExtSel BSrc OpSel MemW RegW WBSrc RegDst PCSrc ALU ALUi ALUiu LW SW BEQZz=0 BEQZz=1 J JAL JR JALR * Reg Func no yes ALU rd pc+4 sExt16 Imm Op pc+4 no yes ALU rt uExt16 pc+4 Imm Op no yes ALU rt sExt16 Imm + no yes Mem rt pc+4 pc+4 sExt16 Imm + yes no * sExt16 * 0? no * br sExt16 * 0? no pc+4 * no * jabs * no yes PC R31 jabs * no * rind * no yes rind PC R31 BSrc = Reg / Imm WBSrc = ALU / Mem / PC RegDst = rt / rd / R31 PCSrc = pc+4 / br / rind / jabs 6 CS252 S05 6

Pipelined Datapath write -back phase fetch execute decode & Reg-fetch IR PC Add we rs1 rs2 rd1 addr we rdata ws addr wd ALU rd2 GPRs rdata Data Memory Inst. Memory Imm Ext wdata write -back phase fetch execute decode & Reg-fetch memory Clock period can be reduced by dividing the execution of an instruction into multiple cycles tC > max {tIM, tRF, tALU, tDM, tRW} ( = tDM probably) However, CPI will increase unless instructions are pipelined 7 CS252 S05 7

“Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle Instructions per program depends on source code, compiler technology, and ISA Cycles per instructions (CPI) depends upon the ISA and the microarchitecture Time per cycle depends upon the microarchitecture and the base technology Microarchitecture CPI cycle time Microcoded >1 short Single-cycle unpipelined 1 long Pipelined 8 CS252 S05 8

CPI Examples Inst 3 7 cycles Inst 1 Inst 2 5 cycles 10 cycles Microcoded machine 3 instructions, 22 cycles, CPI=7.33 Time Unpipelined machine 3 instructions, 3 cycles, CPI=1 Inst 1 Inst 2 Inst 3 Pipelined machine 3 instructions, 3 cycles, CPI=1 Inst 1 Inst 2 Inst 3 9

Technology Assumptions A small amount of very fast memory (caches) backed up by a large, slower memory Fast ALU (at least for integers) Multiported Register files (slower!) Thus, the following timing assumption is reasonable tIM tRF tALU tDM  tRW A 5-stage pipeline will be the focus of our detailed design - some commercial designs have over 30 pipeline stages to do an integer add! 10 CS252 S05 10

5-Stage Pipelined Execution Write -Back (WB) I-Fetch (IF) Execute (EX) Decode, Reg. Fetch (ID) Memory (MA) addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add Inst. rd1 GPRs rs1 rs2 ws wd rd2 IR PC time t0 t1 t2 t3 t4 t5 t6 t7 . . . . instruction1 IF1 ID1 EX1 MA1 WB1 instruction2 IF2 ID2 EX2 MA2 WB2 instruction3 IF3 ID3 EX3 MA3 WB3 instruction4 IF4 ID4 EX4 MA4 WB4 instruction5 IF5 ID5 EX5 MA5 WB5 11 CS252 S05 11

5-Stage Pipelined Execution Resource Usage Diagram Write -Back (WB) I-Fetch (IF) Execute (EX) Decode, Reg. Fetch (ID) Memory (MA) addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add Inst. rd1 GPRs rs1 rs2 ws wd rd2 IR PC time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5 ID I1 I2 I3 I4 I5 EX I1 I2 I3 I4 I5 MA I1 I2 I3 I4 I5 WB I1 I2 I3 I4 I5 Resources 12 CS252 S05 12

Pipelined Execution: ALU Instructions PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data IR 31 Not quite correct! We need an Instruction Reg (IR) for each stage 13 CS252 S05 13

Pipelined MIPS Datapath without jumps F D E M W IR WBSrc MemWrite 31 OpSel 0x4 Add RegDst RegWrite rd1 GPRs rs1 rs2 ws wd rd2 we PC A addr inst Inst Memory wdata addr rdata we IR Y ALU B Data Memory R ExtSel BSrc Imm Ext MD1 MD2 Control Points Need to Be Connected 14 CS252 S05 14

Instructions interact with each other in pipeline An instruction in the pipeline may need a resource being used by another instruction in the pipeline  structural hazard An instruction may depend on something produced by an earlier instruction Dependence may be for a data value  data hazard Dependence may be for the next instruction’s address  control hazard (branches, exceptions) 15 CS252 S05 15

Resolving Structural Hazards Structural hazards occurs when two instruction need same hardware resource at same time Can resolve in hardware by stalling newer instruction till older instruction finished with resource A structural hazard can always be avoided by adding more hardware to design E.g., if two instructions both need a port to memory at same time, could avoid hazard by adding second port to memory Our 5-stage pipe has no structural hazards by design Thanks to MIPS ISA, which was designed for pipelining 16

Data Hazards r1 is stale. Oops! r4  r1 … r1 … ... r1 r0 + 10 IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data ... r1 r0 + 10 r4 r1 + 17 r1 is stale. Oops! 17 CS252 S05 17

Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages  interlocks 18 CS252 S05 18

Feedback to Resolve Hazards FB1 FB2 FB3 FB4 stage 1 2 3 4 Later stages provide dependence information to earlier stages which can stall (or kill) instructions Real designs will seldom provide full feedback nor will they be able to stop on a dime. Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur) 19 CS252 S05 19

Interlocks to resolve Data Hazards Stall Condition IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data nop ... r1 r0 + 10 r4 r1 + 17 20 CS252 S05 20

Stalled Stages and Pipeline Bubbles time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4 (r1) + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 WB3 (I4) IF4 ID4 EX4 MA4 WB4 (I5) IF5 ID5 EX5 MA5 WB5 stalled stages time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I3 I3 I3 I4 I5 ID I1 I2 I2 I2 I2 I3 I4 I5 EX I1 nop nop nop I2 I3 I4 I5 MA I1 nop nop nop I2 I3 I4 I5 WB I1 nop nop nop I2 I3 I4 I5 Resource Usage nop  pipeline bubble 21 CS252 S05 21

Interlock Control Logic Cstall ws rs rt ? stall IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data nop Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. 22 CS252 S05 22

Interlock Control Logic ignoring jumps & branches IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data 31 nop stall Cstall rs rt ? we ws we Cdest re1 re2 Cre Cdest Should we always stall if the rs field matches some rd? not every instruction writes a register we not every instruction reads a register re 23 CS252 S05 23

Source & Destination Registers R-type: op rs rt rd func I-type: op rs rt immediate16 J-type: op immediate26 source(s) destination ALU rd  (rs) func (rt) rs, rt rd ALUi rt  (rs) op imm rs rt LW rt M [(rs) + imm] rs rt SW M [(rs) + imm]  (rt) rs, rt BZ cond (rs) true: PC  (PC) + imm rs false: PC  (PC) + 4 rs J PC  (PC) + imm JAL r31  (PC), PC  (PC) + imm 31 JR PC  (rs) rs JALR r31  (PC), PC  (rs) rs 31 24 CS252 S05 24

Deriving the Stall Signal Cdest ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws  0) JAL, JALR on ... off Cre re1 = Case opcode ALU, ALUi, on off re2 = Case opcode LW, SW, BZ, JR, JALR J, JAL ALU, SW ... Cstall stall = ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW) . re1D + ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW) . re2D This is not the full story ! 25 CS252 S05 25

Hazards due to Loads & Stores IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data nop Stall Condition What if (r1)+7 = (r3)+5 ? ... M[(r1)+7]  (r2) r4  M[(r3)+5] Is there any possible data hazard in this instruction sequence? 26 CS252 S05 26

(r1)+7 = (r3)+5  data hazard Load & Store Hazards ... M[(r1)+7]  (r2) r4  M[(r3)+5] (r1)+7 = (r3)+5  data hazard However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course. 27 CS252 S05 27

Resolving Data Hazards (2) Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage  bypass 28 CS252 S05 28

Bypassing Each stall or kill introduces a bubble in the pipeline time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) r1 r0 + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4 r1 + 17 IF2 ID2 ID2 ID2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 IF3 IF3 ID3 EX3 MA3 (I4) stalled stages IF4 ID4 EX4 (I5) IF5 ID5 Each stall or kill introduces a bubble in the pipeline  CPI > 1 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) r1 r0 + 10 IF1 ID1 EX1 MA1 WB1 (I2) r4  r1 + 17 IF2 ID2 EX2 MA2 WB2 (I3) IF3 ID3 EX3 MA3 WB3 (I4) IF4 ID4 EX4 MA4 WB4 (I5) IF5 ID5 EX5 MA5 WB5 29 CS252 S05 29

Adding a Bypass ... (I1) r1 r0 + 10 (I2) r4 r1 + 17 r4  r1... PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data 31 nop stall D E M W ... (I1) r1 r0 + 10 (I2) r4 r1 + 17 r4  r1... r1 ... ASrc When does this bypass help? r1 M[r0 + 10] r4 r1 + 17 JAL 500 r4 r31 + 17 yes no no 30 CS252 S05 30

The Bypass Signal Deriving it from the Stall Signal stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D +((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) ws = Case opcode ALU rd ALUi, LW rt JAL, JALR R31 we = Case opcode ALU, ALUi, LW (ws  0) JAL, JALR on ... off ASrc = (rsD=wsE).weE.re1D Is this correct? No because only ALU and ALUi instructions can benefit from this bypass We can’t bypass on memory or JAL* instructions. Split weE into two components: we-bypass, we-stall 31 CS252 S05 31

Bypass and Stall Signals Split weE into two components: we-bypass, we-stall we-bypassE = Case opcodeE ALU, ALUi  (ws  0) ... off we-stallE = Case opcodeE LW  (ws  0) JAL, JALR on ... off ASrc = (rsD =wsE).we-bypassE . re1D stall = ((rsD =wsE).we-stallE + (rsD=wsM).weM + (rsD=wsW).weW). re1D +((rtD = wsE).weE + (rtD = wsM).weM + (rtD = wsW).weW). re2D 32 CS252 S05 32

Fully Bypassed Datapath ASrc IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add ALU Imm Ext rd1 GPRs rs1 rs2 ws wd rd2 we wdata rdata Data 31 nop stall D E M W PC for JAL, ... BSrc Is there still a need for the stall signal ? stall = (rsD=wsE). (opcodeE=LWE).(wsE0 ).re1D + (rtD=wsE). (opcodeE=LWE).(wsE0 ).re2D 33 CS252 S05 33

Resolving Data Hazards (3) Strategy 3: Speculate on the dependence. Two cases: Guessed correctly  do nothing Guessed incorrectly  kill and restart …. We’ll later see examples of this approach in more complex processors. 34 CS252 S05 34

What do we need to calculate next PC? Control Hazards What do we need to calculate next PC? For Jumps Opcode, offset and PC For Jump Register Opcode and Register value For Conditional Branches Opcode, PC, Register (for condition), and offset For all other instructions Opcode and PC have to know it’s not one of above! 35 CS252 S05 35

Opcode Decoding Bubble (assuming no branch delay slots for now) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) r1 (r0) + 10 IF1 ID1 EX1 MA1 WB1 (I2) r3 (r2) + 17 IF2 IF2 ID2 EX2 MA2 WB2 (I3) IF3 IF3 ID3 EX3 MA3 WB3 (I4) IF4 IF4 ID4 EX4 MA4 WB4 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 nop I2 nop I3 nop I4 ID I1 nop I2 nop I3 nop I4 EX I1 nop I2 nop I3 nop I4 MA I1 nop I2 nop I3 nop I4 WB I1 nop I2 nop I3 nop I4 Resource Usage nop  pipeline bubble 36 CS252 S05 36

Speculate next address is PC+4 104 IR PC addr inst Inst Memory 0x4 Add nop E M Jump? PCSrc (pc+4 / jabs / rind/ br) stall I1 096 ADD I2 100 J 304 I3 104 ADD I4 304 ADD A jump instruction kills (not stalls) the following instruction kill How? 37 CS252 S05 37

Pipelining Jumps PCSrc (pc+4 / jabs / rind/ br) stall To kill a fetched instruction -- Insert a mux before IR Add E M 0x4 nop IR IR Add Jump? I2 I1 304 nop I2 I1 104 Any interaction between stall and jump? nop IRSrcD addr PC inst IR Inst Memory Kill takes precedence over stall. IRSrcD = Case opcodeD J, JAL nop ... IM I1 096 ADD I2 100 J 304 I3 104 ADD I4 304 ADD kill 38 CS252 S05 38

Jump Pipeline Diagrams time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: J 304 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 nop nop nop nop (I4) 304: ADD IF4 ID4 EX4 MA4 WB4 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5 ID I1 I2 nop I4 I5 EX I1 I2 nop I4 I5 MA I1 I2 nop I4 I5 WB I1 I2 nop I4 I5 Resource Usage nop  pipeline bubble 39 CS252 S05 39

Pipelining Conditional Branches 104 stall IR PC addr inst Inst Memory 0x4 Add nop E M PCSrc (pc+4 / jabs / rind / br) IRSrcD BEQZ? A Y ALU zero? Branch condition is not known until the execute stage what action should be taken in the decode stage ? I1 096 ADD I2 100 BEQZ r1 +200 I3 104 ADD 108 … I4 304 ADD 40 CS252 S05 40

Pipelining Conditional Branches PCSrc (pc+4 / jabs / rind / br) stall BEQZ? ? Add E M 0x4 nop A Y ALU zero? IR IR Add I2 I1 108 I3 nop IRSrcD addr PC inst IR Inst Memory If the branch is taken - kill the two following instructions - the instruction at the decode stage is not valid  stall signal is not valid I1 096 ADD I2 100 BEQZ r1 +200 I3 104 ADD 108 … I4 304 ADD 41 CS252 S05 41

Pipelining Conditional Branches PCSrc (pc+4/jabs/rind/br) stall Add PC BEQZ? E M IRSrcE 0x4 nop A Y ALU zero? IR IR Add Jump? I2 I1 108 I3 IRSrcD addr PC nop inst IR Inst Memory If the branch is taken - kill the two following instructions - the instruction at the decode stage is not valid  stall signal is not valid I1 096 ADD I2 100 BEQZ r1 +200 I3 104 ADD 108 … I4 304 ADD 42 CS252 S05 42

New Stall Signal Don’t stall if the branch is taken. Why? stall = ( ((rsD =wsE).weE + (rsD =wsM).weM + (rsD =wsW).weW).re1D + ((rtD =wsE).weE + (rtD =wsM).weM + (rtD =wsW).weW).re2D ) . !((opcodeE=BEQZ).z + (opcodeE=BNEZ).!z) Don’t stall if the branch is taken. Why? Instruction at the decode stage is invalid 43 CS252 S05 43

Control Equations for PC and IR Muxes PCSrc = Case opcodeE BEQZ.z, BNEZ.!z br ...  Case opcodeD J, JAL  jabs JR, JALR  rind ...  pc+4 Give priority to the older instruction, i.e., execute-stage instruction over decode-stage instruction IRSrcD = Case opcodeE BEQZ.z, BNEZ.!z nop ...  Case opcodeD J, JAL, JR, JALR nop ... IM IRSrcE = Case opcodeE BEQZ.z, BNEZ.!z nop ... stall.nop + !stall.IRD 44 CS252 S05 44

Branch Pipeline Diagrams (resolved in execute stage) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: BEQZ +200 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 ID3 nop nop nop (I4) 108: IF4 nop nop nop nop (I5) 304: ADD IF5 ID5 EX5 MA5 WB5 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5 ID I1 I2 I3 nop I5 EX I1 I2 nop nop I5 MA I1 I2 nop nop I5 WB I1 I2 nop nop I5 Resource Usage nop  pipeline bubble 45 CS252 S05 45

Reducing Branch Penalty (resolve in decode stage) One pipeline bubble can be removed if an extra comparator is used in the Decode stage But might elongate cycle time PCSrc (pc+4 / jabs / rind/ br) Add IR nop E 0x4 Add Zero detect on register file output rd1 GPRs rs1 rs2 ws wd rd2 we nop addr PC inst IR Inst Memory D Pipeline diagram now same as for jumps 46 CS252 S05 46

Branch Delay Slots (expose control hazard to software) Change the ISA semantics so that the instruction that follows a jump or branch is always executed gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted. I1 096 ADD I2 100 BEQZ r1 +200 I3 104 ADD I4 304 ADD Delay slot instruction executed regardless of branch outcome Other techniques include more advanced branch prediction, which can dramatically reduce the branch penalty... to come later 47 CS252 S05 47

Branch Pipeline Diagrams (branch delay slot) time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) 096: ADD IF1 ID1 EX1 MA1 WB1 (I2) 100: BEQZ +200 IF2 ID2 EX2 MA2 WB2 (I3) 104: ADD IF3 ID3 EX3 MA3 WB3 (I4) 304: ADD IF4 ID4 EX4 MA4 WB4 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 ID I1 I2 I3 I4 EX I1 I2 I3 I4 MA I1 I2 I3 I4 WB I1 I2 I3 I4 Resource Usage 48 CS252 S05 48

Why an Instruction may not be dispatched every cycle (CPI>1) Full bypassing may be too expensive to implement typically all frequently used paths are provided some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI Loads have two-cycle latency Instruction after load cannot use load result MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). MIPS:“Microprocessor without Interlocked Pipeline Stages” Removed in MIPS-II (pipeline interlocks added in hardware) Conditional branches may cause bubbles kill following instruction(s) if no delay slots 49 CS252 S05 49

Iron Law with Software-Visible NOPs Time = Instructions Cycles Time Program Program * Instruction * Cycle If software has to insert NOP instructions for hazard avoidance, instructions/program increases average cycles/instruction decreases - doing nothing fast is easy! But performance (time/program) worse or same as if hardware instead uses interlocks to avoid hazard Hardware-generated interlocks (bubbles) don’t change instructions/program, but only add to cycles/instruction Hardware interlocks don’t take space in instruction cache 50

Exceptions: altering the normal flow of control Ii-1 HI1 exception handler program Ii HI2 Ii+1 HIn An exception transfers control to special handler code run in privileged mode. Exceptions are usually unexpected or rare from program’s point of view. 51 CS252 S05 51

Exception: an event that requests the attention of the processor Causes of Exceptions Exception: an event that requests the attention of the processor Asynchronous: an external interrupt input/output device service request timer expiration power disruptions, hardware failure Synchronous: an internal exception (a.k.a. traps) undefined opcode, privileged instruction arithmetic overflow, FPU exception misaligned memory access virtual memory exceptions: page faults, TLB misses, protection violations software exceptions: system calls, e.g., jumps into kernel 52 CS252 S05 52

History of Exception Handling First system with exceptions was Univac-I, 1951 Arithmetic overflow would either 1. trigger the execution a two-instruction fix-up routine at address 0, or 2. at the programmer's option, cause the computer to stop Later Univac 1103, 1955, modified to add external interrupts Used to gather real-time wind tunnel data First system with I/O interrupts was DYSEAC, 1954 Had two program counters, and I/O signal caused switch between two PCs Also, first system with DMA (direct memory access by I/O device) [Courtesy Mark Smotherman] 53 CS252 S05 53

DYSEAC, first mobile computer! Carried in two tractor trailers, 12 tons + 8 tons Built for US Army Signal Corps [Courtesy Mark Smotherman] 54 CS252 S05 54

Asynchronous Interrupts: invoking the interrupt handler An I/O device requests attention by asserting one of the prioritized interrupt request lines When the processor decides to process the interrupt It stops the current program at instruction Ii, completing all the instructions up to Ii-1 (a precise interrupt) It saves the PC of instruction Ii in a special register (EPC) It disables interrupts and transfers control to a designated interrupt handler running in the kernel mode 55 CS252 S05 55

MIPS Interrupt Handler Code Saves EPC before re-enabling interrupts to allow nested interrupts  need an instruction to move EPC into GPRs need a way to mask further interrupts at least until EPC can be saved Needs to read a status register that indicates the cause of the interrupt Uses a special indirect jump instruction RFE (return-from-exception) to resume user code, this: enables interrupts restores the processor to the user mode restores hardware status and control state 56 CS252 S05 56

Synchronous Exception A synchronous exception is caused by a particular instruction In general, the instruction cannot be completed and needs to be restarted after the exception has been handled requires undoing the effect of one or more partially executed instructions In the case of a system call trap, the instruction is considered to have been completed syscall is a special jump instruction involving a change to privileged kernel mode Handler resumes at instruction after system call 57 CS252 S05 57

Exception Handling 5-Stage Pipeline PC Inst. Mem D Decode E M Data Mem W + Illegal Opcode Overflow Data address Exceptions PC address Exception Asynchronous Interrupts How to handle multiple simultaneous exceptions in different pipeline stages? How and where to handle external asynchronous interrupts? 58 CS252 S05 58

Exception Handling 5-Stage Pipeline Commit Point PC D E M W Inst. Mem Decode Data Mem + Kill D Stage Kill F Stage Kill E Stage Select Handler PC Kill Writeback Illegal Opcode Overflow Data address Exceptions PC address Exception Exc D Exc E Exc M Cause PC D PC E PC M EPC Asynchronous Interrupts 59 CS252 S05 59

Exception Handling 5-Stage Pipeline Hold exception flags in pipeline until commit point (M stage) Exceptions in earlier pipe stages override later exceptions for a given instruction Inject external interrupts at commit point (override others) If exception at commit: update Cause and EPC registers, kill all stages, inject handler PC into fetch stage 60 CS252 S05 60

Speculating on Exceptions Prediction mechanism Exceptions are rare, so simply predicting no exceptions is very accurate! Check prediction mechanism Exceptions detected at end of instruction execution pipeline, special hardware for various exception types Recovery mechanism Only write architectural state at commit point, so can throw away partially executed instructions after exception Launch exception handler after flushing pipeline Bypassing allows use of uncommitted instruction results by following instructions 61 CS252 S05 61

Exception Pipeline Diagram time t0 t1 t2 t3 t4 t5 t6 t7 . . . . (I1) 096: ADD IF1 ID1 EX1 MA1 nop overflow! (I2) 100: XOR IF2 ID2 EX2 nop nop (I3) 104: SUB IF3 ID3 nop nop nop (I4) 108: ADD IF4 nop nop nop nop (I5) Exc. Handler code IF5 ID5 EX5 MA5 WB5 time t0 t1 t2 t3 t4 t5 t6 t7 . . . . IF I1 I2 I3 I4 I5 ID I1 I2 I3 nop I5 EX I1 I2 nop nop I5 MA I1 nop nop nop I5 WB nop nop nop nop I5 Resource Usage 62 CS252 S05 62

Acknowledgements UCB material derived from course CS152 Harvard University material derived from course CS246 63

Readings Computer Architecture: A Quantitative Approach, 4th Edition (Oct, 2006) D. A. Patterson and J. L. Hennessy, Computer Organization and Design: The Hardware/Software Interface, 3rd Edition, Revised Printing, Morgan Kaufmann Publishing Co., Menlo Park, CA., June 2007.