CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California.

Slides:



Advertisements
Similar presentations
Laboratory for Computer Science
Advertisements

Morgan Kaufmann Publishers The Processor
1 Pipelining Part 2 CS Data Hazards Data hazards occur when the pipeline changes the order of read/write accesses to operands that differs from.
CMSC 611: Advanced Computer Architecture Pipelining Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted.
Mehmet Can Vuran, Instructor University of Nebraska-Lincoln Acknowledgement: Overheads adapted from those provided by the authors of the textbook.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Complex Pipelining II Steve Ko Computer Sciences and Engineering University at Buffalo.
Lecture Objectives: 1)Define pipelining 2)Calculate the speedup achieved by pipelining for a given number of instructions. 3)Define how pipelining improves.
February 2, 2010CS152, Spring 2010 CS 152 Computer Architecture and Engineering Lecture 5 - Pipelining II Krste Asanovic Electrical Engineering and Computer.
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 6 Pipelining – Part 1 Benjamin Lee Electrical and Computer Engineering Duke University
Multi-cycle Implementations Arvind Computer Science & Artificial Intelligence Lab. Massachusetts Institute of Technology January 13, 2012L6-1
Kevin Walsh CS 3410, Spring 2010 Computer Science Cornell University Pipeline Hazards See: P&H Chapter 4.7.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture Pipelining III Steve Ko Computer Sciences and Engineering University at Buffalo.
CSE 490/590, Spring 2011 CSE 490/590 Computer Architecture MIPS Review & Pipelining I Steve Ko Computer Sciences and Engineering University at Buffalo.
Pipeline Hazards Hakim Weatherspoon CS 3410, Spring 2011 Computer Science Cornell University See P&H Appendix 4.7.
CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California.
Mary Jane Irwin ( ) [Adapted from Computer Organization and Design,
CS 152 Computer Architecture and Engineering Lecture 15 - Advanced Superscalars Krste Asanovic Electrical Engineering and Computer Sciences University.
ENEE350 Ankur Srivastava University of Maryland, College Park Based on Slides from Mary Jane Irwin ( )
1  2004 Morgan Kaufmann Publishers Chapter Six. 2  2004 Morgan Kaufmann Publishers Pipelining The laundry analogy.
©UCB CS 162 Computer Architecture Lecture 3: Pipelining Contd. Instructor: L.N. Bhuyan
L18 – Pipeline Issues 1 Comp 411 – Spring /03/08 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you.
L17 – Pipeline Issues 1 Comp 411 – Fall /1308 CPU Pipelining Issues Finishing up Chapter 6 This pipe stuff makes my head hurt! What have you been.
January 31, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer.
Computer ArchitectureFall 2007 © October 3rd, 2007 Majd F. Sakr CS-447– Computer Architecture.
February 2, 2011CS152, Spring 2011 CS 152 Computer Architecture and Engineering Lecture 5 - Pipelining II Krste Asanovic Electrical Engineering and Computer.
Lec 9: Pipelining Kavita Bala CS 3410, Fall 2008 Computer Science Cornell University.
ENGS 116 Lecture 51 Pipelining and Hazards Vincent H. Berk September 30, 2005 Reading for today: Chapter A.1 – A.3, article: Patterson&Ditzel Reading for.
Prof. John Nestor ECE Department Lafayette College Easton, Pennsylvania ECE Computer Organization Lecture 17 - Pipelined.
Asanovic/Devadas Spring Simple Instruction Pipelining Krste Asanovic Laboratory for Computer Science Massachusetts Institute of Technology.
COSC 3430 L08 Basic MIPS Architecture.1 COSC 3430 Computer Architecture Lecture 08 Processors Single cycle Datapath PH 3: Sections
1 Pipelining Reconsider the data path we just did Each instruction takes from 3 to 5 clock cycles However, there are parts of hardware that are idle many.
ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 7 Pipelining – Part 2 Benjamin Lee Electrical and Computer Engineering Duke University
EEL5708 Lotzi Bölöni EEL 5708 High Performance Computer Architecture Pipelining.
Comp Sci pipelining 1 Ch. 13 Pipelining. Comp Sci pipelining 2 Pipelining.
CSE 340 Computer Architecture Summer 2014 Basic MIPS Pipelining Review.
CS.305 Computer Architecture Enhancing Performance with Pipelining Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from.
CMPE 421 Parallel Computer Architecture
1 Designing a Pipelined Processor In this Chapter, we will study 1. Pipelined datapath 2. Pipelined control 3. Data Hazards 4. Forwarding 5. Branch Hazards.
Electrical and Computer Engineering University of Cyprus LAB 2: MIPS.
EECE 476: Computer Architecture Slide Set #5: Implementing Pipelining Tor Aamodt Slide background: Die photo of the MIPS R2000 (first commercial MIPS microprocessor)
CECS 440 Pipelining.1(c) 2014 – R. W. Allison [slides adapted from D. Patterson slides with additional credits to M.J. Irwin]
CSIE30300 Computer Architecture Unit 04: Basic MIPS Pipelining Hsin-Chou Chi [Adapted from material by and
Oct. 18, 2000Machine Organization1 Machine Organization (CS 570) Lecture 4: Pipelining * Jeremy R. Johnson Wed. Oct. 18, 2000 *This lecture was derived.
Instructor: Senior Lecturer SOE Dan Garcia CS 61C: Great Ideas in Computer Architecture Pipelining Hazards 1.
CMPE 421 REVIEW: MIDTERM 1. A MODIFIED FIVE-Stage Pipeline PC A Y R MD1 addr inst Inst Memory Imm Ext add rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata.
L17 – Pipeline Issues 1 Comp 411 – Fall /23/09 CPU Pipelining Issues Read Chapter This pipe stuff makes my head hurt! What have you been.
Pipelining: Implementation CPSC 252 Computer Organization Ellen Walker, Hiram College.
Computer Architecture and Parallel Computing 并行结构与计算 Lecture 2 - Pipelining Peng Liu Dept. of Info. Sci. & Elec. Engg. Zhejiang University
Electrical and Computer Engineering University of Cyprus
Morgan Kaufmann Publishers
ELEN 468 Advanced Logic Design
Single Clock Datapath With Control
ECS 154B Computer Architecture II Spring 2009
Peng Liu Lecture 7 Pipelining Peng Liu
CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining
CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining
Chapter 4 The Processor Part 3
Review: MIPS Pipeline Data and Control Paths
Dept. of Info. Sci. & Elec. Engg.
Pipelining in more detail
Systems Architecture II
Krste Asanovic Electrical Engineering and Computer Sciences
The Processor Lecture 3.6: Control Hazards
Control unit extension for data hazards
The Processor Lecture 3.5: Data Hazards
CS203 – Advanced Computer Architecture
Control unit extension for data hazards
Introduction to Computer Organization and Architecture
Control unit extension for data hazards
ELEC / Computer Architecture and Design Spring 2015 Pipeline Control and Performance (Chapter 6) Vishwani D. Agrawal James J. Danaher.
Presentation transcript:

CS 152 Computer Architecture and Engineering Lecture 4 - Pipelining Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley

2/5/2008CS152-Spring’08 2 Last time in Lecture 3 Microcoding became less attractive as gap between RAM and ROM speeds reduced Complex instruction sets difficult to pipeline, so difficult to increase performance as gate count grew Iron-law explains architecture design space –Trade instruction/program, cycles/instruction, and time/cycle Load-Store RISC ISAs designed for efficient pipelined implementations –Very similar to vertical microcode –Inspired by earlier Cray machines

2/5/2008CS152-Spring’08 3 “Iron Law” of Processor Performance Time = Instructions Cycles Time Program Program * Instruction * Cycle –Instructions per program depends on source code, compiler technology, and ISA –Cycles per instructions (CPI) depends upon the ISA and the microarchitecture –Time per cycle depends upon the microarchitecture and the base technology MicroarchitectureCPIcycle time Microcoded>1short Single-cycle unpipelined1long Pipelined1short Lecture 3 This lecture

2/5/2008CS152-Spring’08 4 An Ideal Pipeline All objects go through the same stages No sharing of resources between any two stages Propagation delay through all pipeline stages is equal The scheduling of an object entering the pipeline is not affected by the objects in other stages stage 1 stage 2 stage 3 stage 4 These conditions generally hold for industrial assembly lines. But can an instruction pipeline satisfy the last condition?

2/5/2008CS152-Spring’08 5 Pipelined MIPS To pipeline MIPS: First build MIPS without pipelining with CPI=1 Next, add pipeline registers to reduce cycle time while maintaining CPI=1

2/5/2008CS152-Spring’08 6 Pipelined Datapath Clock period can be reduced by dividing the execution of an instruction into multiple cycles t C > max {t IM, t RF, t ALU, t DM, t RW } ( = t DM probably) However, CPI will increase unless instructions are pipelined write -back phase fetch phase execute phase decode & Reg-fetch phase memory phase addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add addr rdata Inst. Memory rd1 GPRs rs1 rs2 ws wd rd2 we IR PC

2/5/2008CS152-Spring’08 7 Technology Assumptions Thus, the following timing assumption is reasonable A small amount of very fast memory (caches) backed up by a large, slower memory Fast ALU (at least for integers) Multiported Register files (slower!) t IM t RF t ALU t DM t RW A 5-stage pipelined Harvard architecture will be the focus of our detailed design

2/5/2008CS152-Spring’ Stage Pipelined Execution time t0t1t2t3t4t5t6t7.... instruction1IF 1 ID 1 EX 1 MA 1 WB 1 instruction2 IF 2 ID 2 EX 2 MA 2 WB 2 instruction3IF 3 ID 3 EX 3 MA 3 WB 3 instruction4 IF 4 ID 4 EX 4 MA 4 WB 4 instruction5 IF 5 ID 5 EX 5 MA 5 WB 5 Write - Back (WB) I-Fetch (IF) Execute (EX) Decode, Reg. Fetch (ID) Memory (MA) addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add addr rdata Inst. Memory rd1 GPRs rs1 rs2 ws wd rd2 we IR PC

2/5/2008CS152-Spring’ Stage Pipelined Execution Resource Usage Diagram time t0t1t2t3t4t5t6t7.... IFI 1 I 2 I 3 I 4 I 5 IDI 1 I 2 I 3 I 4 I 5 EX I 1 I 2 I 3 I 4 I 5 MA I 1 I 2 I 3 I 4 I 5 WB I 1 I 2 I 3 I 4 I 5 Resources Write - Back (WB) I-Fetch (IF) Execute (EX) Decode, Reg. Fetch (ID) Memory (MA) addr wdata rdata Data Memory we ALU Imm Ext 0x4 Add addr rdata Inst. Memory rd1 GPRs rs1 rs2 ws wd rd2 we IR PC

2/5/2008CS152-Spring’08 10 Pipelined Execution: ALU Instructions IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we Not quite correct! We need an Instruction Reg (IR) for each stage

2/5/2008CS152-Spring’08 11 Pipelined MIPS Datapath without jumps IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we Data Memory wdata addr wdata rdata we OpSel ExtSelBSrc WBSrc MemWrite RegDst RegWrite FDEMW Control Points Need to Be Connected

2/5/2008CS152-Spring’08 12 How Instructions can Interact with each other in a pipeline An instruction in the pipeline may need a resource being used by another instruction in the pipeline  structural hazard An instruction may depend on something produced by an earlier instruction –Dependence may be for a data value  data hazard –Dependence may be for the next instruction’s address  control hazard (branches, exceptions)

2/5/2008CS152-Spring’08 13 Data Hazards... r1 r r4 r r1 is stale. Oops! r1 … r4 r1… IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we

2/5/2008CS152-Spring’08 14 Resolving Data Hazards (1) Strategy 1: Wait for the result to be available by freezing earlier pipeline stages  interlocks

2/5/2008CS152-Spring’08 15 Feedback to Resolve Hazards Later stages provide dependence information to earlier stages which can stall (or kill) instructions FB 1 stage 1 stage 2 stage 3 stage 4 FB 2 FB 3 FB 4 Controlling a pipeline in this manner works provided the instruction at stage i+1 can complete without any interference from instructions in stages 1 to i (otherwise deadlocks may occur)

2/5/2008CS152-Spring’08 16 IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we nop Interlocks to resolve Data Hazards... r1 r r4 r Stall Condition

2/5/2008CS152-Spring’08 17 stalled stages time t0t1t2t3t4t5t6t7.... IFI 1 I 2 I 3 I 3 I 3 I 3 I 4 I 5 IDI 1 I 2 I 2 I 2 I 2 I 3 I 4 I 5 EX I 1 nopnopnopI 2 I 3 I 4 I 5 MA I 1 nopnopnopI 2 I 3 I 4 I 5 WB I 1 nopnopnopI 2 I 3 I 4 I 5 Stalled Stages and Pipeline Bubbles time t0t1t2t3t4t5t6t7.... (I 1 ) r1 (r0) + 10IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r4 (r1) + 17IF 2 ID 2 ID 2 ID 2 ID 2 EX 2 MA 2 WB 2 (I 3 )IF 3 IF 3 IF 3 IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) IF 4 ID 4 EX 4 MA 4 WB 4 (I 5 ) IF 5 ID 5 EX 5 MA 5 WB 5 Resource Usage nop  pipeline bubble

2/5/2008CS152-Spring’08 18 IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we nop Interlock Control Logic Compare the source registers of the instruction in the decode stage with the destination register of the uncommitted instructions. stall C stall ws rs rt ?

2/5/2008CS152-Spring’08 19 C dest Interlock Control Logic ignoring jumps & branches Should we always stall if the rs field matches some rd? IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we 31 nop stall C stall ws rs rt ? we re1re2 C re wswe ws C dest we not every instruction writes a register we not every instruction reads a register re

2/5/2008CS152-Spring’08 20 Source & Destination Registers source(s) destination ALUrd (rs) func (rt) rs, rtrd ALUirt (rs) op immrs rt LWrt M [(rs) + imm] rs rt SWM [(rs) + imm] (rt) rs, rt BZcond (rs) true:PC (PC) + immrs false: PC (PC) + 4rs JPC (PC) + imm JALr31 (PC), PC (PC) + imm31 JRPC (rs) rs JALRr31 (PC), PC (rs) rs31 R-type: op rs rt rd func I-type: op rs rt immediate16 J-type: op immediate26

2/5/2008CS152-Spring’08 21 Deriving the Stall Signal C dest ws = Case opcode ALUrd ALUi, LWrt JAL, JALRR31 we = Case opcode ALU, ALUi, LW (ws  0) JAL, JALR on... off C re re1 = Case opcode ALU, ALUi, on off re2 = Case opcode on off LW, SW, BZ, JR, JALR J, JAL ALU, SW... C stall stall = ((rs D =ws E ).we E + (rs D =ws M ).we M + (rs D =ws W ).we W ). re1 D + ((rt D =ws E ).we E + (rt D =ws M ).we M + (rt D =ws W ).we W ). re2 D This is not the full story !

2/5/2008CS152-Spring’08 22 Hazards due to Loads & Stores... M[(r1)+7]  (r2) r4  M[(r3)+5]... IR 31 PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we nop Stall Condition Is there any possible data hazard in this instruction sequence? What if (r1)+7 = (r3)+5 ?

2/5/2008CS152-Spring’08 23 Load & Store Hazards However, the hazard is avoided because our memory system completes writes in one cycle ! Load/Store hazards are sometimes resolved in the pipeline and sometimes in the memory system itself. More on this later in the course.... M[(r1)+7]  (r2) r4  M[(r3)+5]... (r1)+7 = (r3)+5  data hazard

2/5/2008CS152-Spring’08 24 CS152 Administrivia Krste, office hours, Monday 1-3pm, 645 Soda – for alternate time Henry office hours, 511 Soda –9:30-10:30AM Mondays –2:00-3:00PM Fridays First lab and practice problems this evening In-class quiz dates: –Q1: Tuesday February 19 (ISAs, microcode, simple pipelining) –Q2: Tuesday March 4 (memory hierarchies)

2/5/2008CS152-Spring’08 25 Course Schedule Check website for schedule Generally, a lab, practice problem set, plus quiz every two weeks Lab and practice problems out on Tuesday (this evening for first) Lab overview in section next day (Wednesday) Practice problems due following Tuesday in class, solutions available on line Problem review in section next day (Wednesday) Lab due in class Thursday Quiz following Tuesday, with new PS and lab.

2/5/2008CS152-Spring’08 26 Resolving Data Hazards (2) Strategy 2: Route data as soon as possible after it is calculated to the earlier pipeline stage  bypass

2/5/2008CS152-Spring’08 27 Bypassing Each stall or kill introduces a bubble in the pipeline CPI > 1 time t0t1t2t3t4t5t6t7.... (I 1 ) r1 r0 + 10IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r4 r1 + 17IF 2 ID 2 ID 2 ID 2 ID 2 EX 2 MA 2 WB 2 (I 3 )IF 3 IF 3 IF 3 IF 3 ID 3 EX 3 MA 3 (I 4 ) stalled stagesIF 4 ID 4 EX 4 (I 5 ) IF 5 ID 5 timet0t1t2t3t4t5t6t7.... (I 1 ) r1 r0 + 10IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r4  r1 + 17IF 2 ID 2 EX 2 MA 2 WB 2 (I 3 )IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) IF 4 ID 4 EX 4 MA 4 WB 4 (I 5 ) IF 5 ID 5 EX 5 MA 5 WB 5 A new datapath, i.e., a bypass, can get the data from the output of the ALU to its input

2/5/2008CS152-Spring’08 28 Adding a Bypass ASrc... (I 1 )r1 r (I 2 )r4 r r4 r1r1  IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR Imm Ext ALU rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we 31 nop stall D EMW When does this bypass help? r1 M[r0 + 10] r4 r JAL 500 r4 r yesno

2/5/2008CS152-Spring’08 29 The Bypass Signal Deriving it from the Stall Signal ASrc = (rs D =ws E ).we E.re1 D we = Case opcode ALU, ALUi, LW (ws  0) JAL, JALR on... off No because only ALU and ALUi instructions can benefit from this bypass Is this correct? Split we E into two components: we-bypass, we-stall stall = ( ((rs D =ws E ).we E + (rs D =ws M ).we M + (rs D =ws W ).we W ).re1 D +((rt D =ws E ).we E + (rt D =ws M ).we M + (rt D =ws W ).we W ).re2 D ) ws = Case opcode ALUrd ALUi, LWrt JAL, JALRR31

2/5/2008CS152-Spring’08 30 Bypass and Stall Signals we-bypass E = Case opcode E ALU, ALUi(ws  0)... off ASrc = (rs D =ws E ).we-bypass E. re1 D Split we E into two components: we-bypass, we-stall stall = ((rs D =ws E ).we-stall E + (rs D =ws M ).we M + (rs D =ws W ).we W ). re1 D +((rt D = ws E ).we E + (rt D = ws M ).we M + (rt D = ws W ).we W ). re2 D we-stall E = Case opcode E LW (ws  0) JAL, JALRon... off

2/5/2008CS152-Spring’08 31 Fully Bypassed Datapath ASrc IR PC A B Y R MD1 MD2 addr inst Inst Memory 0x4 Add IR ALU Imm Ext rd1 GPRs rs1 rs2 ws wd rd2 we wdata addr wdata rdata Data Memory we 31 nop stall D EMW PC for JAL,... BSrc Is there still a need for the stall signal ? stall = (rs D =ws E ). (opcode E =LW E ).(ws E 0 ).re1 D + (rt D =ws E ). (opcode E =LW E ).(ws E 0 ).re2 D

2/5/2008CS152-Spring’08 32 Resolving Data Hazards (3) Strategy 3: Speculate on the dependence. Two cases: Guessed correctly  do nothing Guessed incorrectly  kill and restart

2/5/2008CS152-Spring’08 33 Instruction to Instruction Dependence What do we need to calculate next PC: –For Jumps » Opcode, offset and PC –For Jump Register »Opcode and Register value –For Conditional Branches »Opcode, PC, Register (for condition), and offset –For all others »Opcode and PC

2/5/2008CS152-Spring’08 34 time t0t1t2t3t4t5t6t7.... (I 1 ) r1 (r0) + 10IF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) r3 (r2) + 17 IF 2 IF 2 ID 2 EX 2 MA 2 WB 2 (I 3 )IF 3 IF 3 ID 3 EX 3 MA 3 WB 3 (I 4 ) IF 4 IF 4 ID 4 EX 4 MA 4 WB 4 time t0t1t2t3t4t5t6t7.... IFI 1 nop I 2 nop I 3 nop I 4 IDI 1 nop I 2 nop I 3 nop I 4 EX I 1 nopI 2 nopI 3 nopI 4 MA I 1 nopI 2 nopI 3 nopI 4 WB I 1 nopI 2 nopI 3 nopI 4 PC Calculation Bubbles Resource Usage nop  pipeline bubble

2/5/2008CS152-Spring’08 35 Speculate next address is PC+4 I 1 096ADD I 2 100J 304 I 3 104ADD I 4 304ADD kill A jump instruction kills (not stalls) the following instruction stall How? I2I2 I1I1 104 IR PC addr inst Inst Memory 0x4 Add nop IR E M Add Jump? PCSrc (pc+4 / jabs / rind/ br)

2/5/2008CS152-Spring’08 36 Pipelining Jumps I 1 096ADD I 2 100J 304 I 3 104ADD I 4 304ADD kill I2I2 I1I1 104 stall IR PC addr inst Inst Memory 0x4 Add nop IR E M Add Jump? PCSrc (pc+4 / jabs / rind/ br) IRSrc D = Case opcode D J, JAL nop... IM To kill a fetched instruction -- Insert a mux before IR Any interaction between stall and jump? nop IRSrc D I2I2 I1I1 304 nop

2/5/2008CS152-Spring’08 37 time t0t1t2t3t4t5t6t7.... IFI 1 I 2 I 3 I 4 I 5 IDI 1 I 2 nop I 4 I 5 EX I 1 I 2 nop I 4 I 5 MA I 1 I 2 nop I 4 I 5 WB I 1 I 2 nop I 4 I 5 Jump Pipeline Diagrams time t0t1t2t3t4t5t6t7.... (I 1 ) 096: ADDIF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) 100: J 304IF 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) 104: ADDIF 3 nop nop nop nop (I 4 ) 304: ADD IF 4 ID 4 EX 4 MA 4 WB 4 Resource Usage nop  pipeline bubble

2/5/2008CS152-Spring’08 38 Pipelining Conditional Branches I 1 096ADD I 2 100BEQZ r I 3 104ADD I 4 304ADD BEQZ? I2I2 I1I1 104 stall IR PC addr inst Inst Memory 0x4 Add nop IR E M Add PCSrc (pc+4 / jabs / rind / br) nop IRSrc D Branch condition is not known until the execute stage what action should be taken in the decode stage ? A Y ALU zero?

2/5/2008CS152-Spring’08 39 Pipelining Conditional Branches I 1 096ADD I 2 100BEQZ r I 3 104ADD I 4 304ADD stall IR PC addr inst Inst Memory 0x4 Add nop IR E M Add PCSrc (pc+4 / jabs / rind / br) nop IRSrc D A Y ALU zero? If the branch is taken - kill the two following instructions - the instruction at the decode stage is not valid  stall signal is not valid I2I2 I1I1 108 I3I3 BEQZ? ?

2/5/2008CS152-Spring’08 40 Pipelining Conditional Branches I 1 096ADD I 2 100BEQZ r1 200 I 3 104ADD I 4 304ADD stall IR PC addr inst Inst Memory 0x4 Add nop IR E M PCSrc (pc+4/jabs/rind/br) nop A Y ALU zero? I2I2 I1I1 108 I3I3 BEQZ? Jump? IRSrc D IRSrc E If the branch is taken - kill the two following instructions - the instruction at the decode stage is not valid  stall signal is not valid Add PC

2/5/2008CS152-Spring’08 41 New Stall Signal stall = ( ((rs D =ws E ).we E + (rs D =ws M ).we M + (rs D =ws W ).we W ).re1 D + ((rt D =ws E ).we E + (rt D =ws M ).we M + (rt D =ws W ).we W ).re2 D ). !((opcode E =BEQZ).z + (opcode E =BNEZ).!z) Don’t stall if the branch is taken. Why? Instruction at the decode stage is invalid

2/5/2008CS152-Spring’08 42 Control Equations for PC and IR Muxes PCSrc = Case opcode E BEQZ.z, BNEZ.!z br...  Case opcode D J, JALjabs JR, JALRrind... pc+4 IRSrc D = Case opcode E BEQZ.z, BNEZ.!z nop...  Case opcode D J, JAL, JR, JALR nop... IM Give priority to the older instruction, i.e., execute stage instruction over decode stage instruction IRSrc E = Case opcode E BEQZ.z, BNEZ.!z nop... stall.nop + !stall.IR D

2/5/2008CS152-Spring’08 43 time t0t1t2t3t4t5t6t7.... IFI 1 I 2 I 3 I 4 I 5 IDI 1 I 2 I 3 nop I 5 EX I 1 I 2 nop nop I 5 MA I 1 I 2 nop nop I 5 WB I 1 I 2 nop nop I 5 Branch Pipeline Diagrams (resolved in execute stage) time t0t1t2t3t4t5t6t7.... (I 1 ) 096: ADDIF 1 ID 1 EX 1 MA 1 WB 1 (I 2 ) 100: BEQZ 200IF 2 ID 2 EX 2 MA 2 WB 2 (I 3 ) 104: ADDIF 3 ID 3 nop nop nop (I 4 ) 108: IF 4 nop nop nop nop (I 5 ) 304: ADD IF 5 ID 5 EX 5 MA 5 WB 5 Resource Usage nop  pipeline bubble

2/5/2008CS152-Spring’08 44 One pipeline bubble can be removed if an extra comparator is used in the Decode stage PC addr inst Inst Memory 0x4 Add IR nop E Add PCSrc (pc+4 / jabs / rind/ br) rd1 GPRs rs1 rs2 ws wd rd2 we nop Zero detect on register file output Pipeline diagram now same as for jumps D Reducing Branch Penalty ( resolve in decode stage)

2/5/2008CS152-Spring’08 45 Branch Delay Slots (expose control hazard to software) Change the ISA semantics so that the instruction that follows a jump or branch is always executed –gives compiler the flexibility to put in a useful instruction where normally a pipeline bubble would have resulted. I 1 096ADD I 2 100BEQZ r1 200 I 3 104ADD I 4 304ADD Delay slot instruction executed regardless of branch outcome Other techniques include branch prediction, which can dramatically reduce the branch penalty... to come later

2/5/2008CS152-Spring’08 46 Why an Instruction may not be dispatched every cycle (CPI>1) Full bypassing may be too expensive to implement –typically all frequently used paths are provided –some infrequently used bypass paths may increase cycle time and counteract the benefit of reducing CPI Loads have two cycle latency –Instruction after load cannot use load result –MIPS-I ISA defined load delay slots, a software-visible pipeline hazard (compiler schedules independent instruction or inserts NOP to avoid hazard). Removed in MIPS-II. Conditional branches may cause bubbles –kill following instruction(s) if no delay slots Machines with software-visible delay slots may execute significant number of NOP instructions inserted by the compiler.

2/5/2008CS152-Spring’08 47 Acknowledgements These slides contain material developed and copyright by: –Arvind (MIT) –Krste Asanovic (MIT/UCB) –Joel Emer (Intel/MIT) –James Hoe (CMU) –John Kubiatowicz (UCB) –David Patterson (UCB) MIT material derived from course UCB material derived from course CS252