Chapter 4 Sections 4.5 – 4.8 Dr. Iyad F. Jafar

Presentation transcript:

MIPS Pipelining Chapter 4 Sections 4.5 – 4.8 Dr. Iyad F. Jafar

Outline: Introduction; Why Pipelining?; MIPS Pipelined Datapath; MIPS Pipelined Control; Pipelining Hazards (Structural Hazards, Data Hazards, Control Hazards); Exceptions and Interrupts; Fallacies and Pitfalls; Reading Assignment

Introduction
Single-cycle datapath: Simple! Hardware replication? Cycle time?
Multi-cycle datapath: More involved. Less HW replication of major units. Better performance if the delay of major functional units is balanced!
Can we do any better? Pipelining!

Introduction Pipelining
In multi-cycle, only one major unit is used in each cycle while the other units are idle! Why not use them to do something else? Basically, start the next instruction before the current one is finished!
[Figure: LW, SW, and R-type instructions, each passing through IFetch, Dec, Exec, Mem, and WB, overlapped in time across Cycle 1 to Cycle 8.]

Introduction Pipelining The time required to execute one instruction (Instruction latency) is not affected! However, the number of instructions finished per unit time (Throughput) is increased Thus, Pipelining improves the throughput not latency! Most modern processors are pipelined! Notes As in multi-cycle, the cycle time is determined by the slowest unit! However, similar to single-cycle, we can get one instruction done every cycle! It is assumed that all instructions take the same number of cycles!

Introduction [Figure: clock timing of lw, sw, and R-type under three implementations. Single Cycle Implementation: one long cycle per instruction, with part of the cycle marked as waste. Multiple Cycle Implementation: lw goes through IFetch, Dec, Exec, Mem, WB, then sw, then R-type, across Cycle 1 to Cycle 10. Pipeline Implementation: lw, sw, and R-type overlap.]

Why Pipelining? For Performance! Once the pipeline is full, one instruction is completed every cycle, so CPI = 1 (similar to single-cycle). [Figure: Inst 1 to Inst 5 in instruction order against time (clock cycles), each using IM, Reg, ALU, DM, Reg; the first cycles are marked as the time to fill the pipeline.]

Why Pipelining? Example 1. Comparing pipelining to single-cycle
Consider a program that consists of a large number of LOAD instructions only, executed on a single-cycle CPU and on a 5-stage pipelined CPU, where the operation time of each major unit (memory, ALU, and register file) is 200 ps in both cases.
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
2) Determine the time required to finish executing the first 3 LOAD instructions.
3) Repeat (1) and (2) if the delay of the register file is 100 ps instead of 200 ps.
Cycle times for the two implementations: CCSC = 200 + 200 + 200 + 200 + 200 = 1000 ps, CCPP = 200 ps

Why Pipelining? Example 1. Comparing pipelining to single-cycle
1) Determine the time required to finish executing 1,000,000 LOAD instructions and compute the speedup of pipelining.
Single-cycle: TimeSC = 1000 ps x 1,000,000 = 1,000,000,000 ps
Pipelining: TimePP = 1000 ps + 200 ps x 999,999 = 200,000,800 ps (after 200 x 5 = 1000 ps the pipeline is full, and we get 1 instruction per cycle afterwards)
Speedup = 1,000,000,000 / 200,000,800 = 4.99998 (very close to the number of stages)

Why Pipelining? Example 1. Comparing pipelining to single-cycle
2) Determine the time required to finish executing the first 3 LOAD instructions and compute the speedup of pipelining.
Single-cycle: TimeSC = 1000 x 3 = 3000 ps
Pipelining: TimePP = 200 x 5 + 200 + 200 = 1400 ps
Speedup = 3000 / 1400 = 2.14 (less than the number of stages)

Why Pipelining? Example 1. Comparing pipelining to single-cycle
3) Repeat (1) and (2) if the delay of the register file is 100 ps.
CCSC = 200 + 100 + 200 + 200 + 100 = 800 ps, CCPP = 200 ps (still set by the slowest stage)
For 1,000,000 instructions: TimeSC = 800 x 1,000,000 = 800,000,000 ps, TimePP = 1000 + 200 x 999,999 = 200,000,800 ps, Speedup = 800,000,000 / 200,000,800 = 3.99998 (<5)
For 3 instructions: TimeSC = 800 x 3 = 2400 ps, TimePP = 1000 + 200 x 2 = 1400 ps, Speedup = 2400 / 1400 = 1.71 (<5)
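The arithmetic of Example 1 can be reproduced with a short script. This is only a sketch under the example's assumptions (loads only, no hazards, pipeline clock set by the slowest stage); the function name exec_time_ps and the two delay lists are illustrative names, not part of the slides.

```python
def exec_time_ps(n_instr, stage_delays_ps, pipelined):
    """Total time in ps to run n_instr back-to-back loads, ignoring hazards."""
    if pipelined:
        cycle = max(stage_delays_ps)                   # clock set by the slowest stage
        n_cycles = len(stage_delays_ps) + n_instr - 1  # N + M - 1 cycles
        return cycle * n_cycles
    # Single-cycle: one long clock cycle per instruction (sum of all unit delays).
    return sum(stage_delays_ps) * n_instr

balanced = [200, 200, 200, 200, 200]   # IF, ID, EX, MEM, WB all take 200 ps
fast_regs = [200, 100, 200, 200, 100]  # register file (ID and WB) takes 100 ps

for name, delays in (("balanced", balanced), ("fast register file", fast_regs)):
    for n in (1_000_000, 3):
        t_sc = exec_time_ps(n, delays, pipelined=False)
        t_pp = exec_time_ps(n, delays, pipelined=True)
        print(f"{name}, {n} loads: SC={t_sc} ps, PP={t_pp} ps, "
              f"speedup={t_sc / t_pp:.5f}")
```

Making the register file faster shortens the single-cycle clock but not the pipelined one, which is why the speedup drops below 5 in part (3).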

Why Pipelining? Example 1. Summary
Ideally, the pipelined version is n times faster than the single-cycle one, where n is the number of pipeline stages. In the 5-stage MIPS, the pipelined version would be 5 times faster.
When the pipeline is full, the throughput is one instruction per cycle.
Many factors affect pipelining performance: time to fill and empty the pipeline; number of instructions to execute; unbalanced delay of pipeline stages; instruction mix; pipeline hazards.
Ideally, the number of cycles required to finish M instructions in an N-stage pipeline is N + M – 1

Pipelined MIPS Datapath What do we need to implement pipelining? We need to consider the following: The execution of instructions is divided into 5 stages (cycles): Instruction fetch (IF) , Instruction decode (ID), Execute (EX), Memory Access (MEM), Write Back (WB) Instruction flow is from left to right except in two cases In the write-back stage where the result is written into the register file in the middle of the datapath Choosing between the incremented PC and the branch address in the MEM stage In pipelining, all units are operating in every cycle; thus we have to duplicate hardware where needed Since the execution is over multiple cycles, we need to add State (Pipeline) registers between stages to preserve intermediate data and control for each instruction. These registers hold the values to be used in later stages as long as they are needed.

Pipelined MIPS Datapath
[Figure: pipelined datapath with stages IF, ID, EX, MEM, WB and the pipeline registers IFetch/Dec, Dec/Exec, Exec/Mem, Mem/WB between them, driven by the system clock.]
Note two exceptions to left-to-right flow: WB, which writes the result back into the register file in the middle of the datapath; and the selection of the next value of the PC, where one input comes from the calculated branch address from the MEM stage.
Only later instructions in the pipeline can be influenced by these two REVERSE data movements. The first one (WB to ID) leads to data hazards. The second one (MEM to IF) leads to control hazards.
All instructions must update some state in the processor (the register file, the memory, or the PC), so a separate pipeline register after the WB stage would be redundant to the state that is updated (not needed).
The PC can be thought of as a pipeline register: the one that feeds the IF stage of the pipeline. Unlike all of the other pipeline registers, the PC is part of the visible architecture state; its content must be saved when an exception occurs (the contents of the other pipe registers are discarded).
Any problem?

Pipelined MIPS Datapath (same datapath and notes as the previous slide) Need to preserve the destination register !

Pipelined MIPS Datapath Example 2. Execution of LW instruction (1) Instruction Fetch: Put PC and the loaded instruction in the IF/ID register

Pipelined MIPS Datapath Example 2. Execution of LW instruction (2) Instruction Decode and Read Registers: Store Reg[rs], Reg[rt], sign extended offset , rd, rt, and the updated PC (why?) in the ID/EX register

MIPS Pipelining Example 2. Execution of LW instruction (3) Execute Or Address Calculation: Store branch address, Reg[rt], result, and zero flag in the EX/MEM register

Pipelined MIPS Datapath Example 2. Execution of LW instruction (4) Memory Access: Store the data from memory into MEM/WB register

Pipelined MIPS Datapath Example 2. Execution of LW instruction (5) Write Back: Copy the data loaded in the MEM/WB register to register file

Pipelined MIPS Datapath Required data fields in the pipeline registers
Data fields are moved from one pipeline register to another every clock cycle until they are no longer needed
Pipeline Register: Data Fields (Register Size)
IF/ID: Instruction and PC (64 bits)
ID/EX: PC, Reg[rs], Reg[rt], sign-extended offset, rt, rd (138 bits)
EX/MEM: Branch address, Zero, ALU result, Reg[rt], Destination register address (rt or rd) (102 bits)
MEM/WB: ALU Result, Data from memory, Destination register address (69 bits)
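The register sizes can be checked by summing the widths of the listed fields. A minimal sketch, assuming 32-bit instructions, PC values, register contents, ALU results, and sign-extended offsets, 5-bit register numbers, and a 1-bit Zero flag; the dictionary and function names are illustrative only.

```python
# Field widths in bits for each pipeline register (data fields only, no control).
PIPELINE_REGISTERS = {
    "IF/ID":  {"Instruction": 32, "PC": 32},
    "ID/EX":  {"PC": 32, "Reg[rs]": 32, "Reg[rt]": 32,
               "SignExtOffset": 32, "rt": 5, "rd": 5},
    "EX/MEM": {"BranchAddress": 32, "Zero": 1, "ALUResult": 32,
               "Reg[rt]": 32, "DestReg": 5},
    "MEM/WB": {"ALUResult": 32, "MemoryData": 32, "DestReg": 5},
}

def register_size(name):
    """Sum of the data-field widths held by one pipeline register."""
    return sum(PIPELINE_REGISTERS[name].values())

for name in PIPELINE_REGISTERS:
    print(f"{name}: {register_size(name)} bits")
```

The control signals discussed next are stored in addition to these data fields, so the real registers are somewhat wider.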

Pipelined MIPS Control All control signals can be determined during Decode stage while they are needed in later stages! Solution! Expand the pipeline registers to store and move the control signals between stages until they are needed

Pipelined MIPS Control Define the control signals and generate them in the decode stage. For the time being, no explicit write signals are required for the pipeline registers since they are updated every cycle

Pipelined MIPS Control Control signals needed in each stage (control signal values are based on instruction type)
Pipeline Stage: Control signals
IF: None
ID: None
EX: RegDst, ALUOp1, ALUOp0, ALUSrc
MEM: Branch, MemRead, MemWrite
WB: MemtoReg, RegWrite
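How the control signals travel with an instruction can be sketched as follows. The signal values per instruction class are an assumption: they follow the usual textbook control table for R-type, lw, sw, and beq (MemtoReg = 1 selects the data read from memory), with None marking a don't-care. The names STAGE_SIGNALS, DECODE, decode, and pass_on are hypothetical.

```python
# Which control signals each stage consumes (EX, MEM, WB groups from the table above).
STAGE_SIGNALS = {
    "EX": ("RegDst", "ALUOp", "ALUSrc"),
    "MEM": ("Branch", "MemRead", "MemWrite"),
    "WB": ("MemtoReg", "RegWrite"),
}

# Assumed signal values per instruction class; None marks a don't-care.
DECODE = {
    "R-type": dict(RegDst=1, ALUOp=0b10, ALUSrc=0, Branch=0,
                   MemRead=0, MemWrite=0, MemtoReg=0, RegWrite=1),
    "lw": dict(RegDst=0, ALUOp=0b00, ALUSrc=1, Branch=0,
               MemRead=1, MemWrite=0, MemtoReg=1, RegWrite=1),
    "sw": dict(RegDst=None, ALUOp=0b00, ALUSrc=1, Branch=0,
               MemRead=0, MemWrite=1, MemtoReg=None, RegWrite=0),
    "beq": dict(RegDst=None, ALUOp=0b01, ALUSrc=0, Branch=1,
                MemRead=0, MemWrite=0, MemtoReg=None, RegWrite=0),
}

def decode(instr_class):
    """Control bundle written into ID/EX: one group of signals per later stage."""
    values = DECODE[instr_class]
    return {stage: {s: values[s] for s in names}
            for stage, names in STAGE_SIGNALS.items()}

def pass_on(bundle, consumed_stage):
    """Drop the group a stage has just used; the rest move to the next register."""
    return {stage: sigs for stage, sigs in bundle.items() if stage != consumed_stage}

id_ex = decode("lw")              # EX, MEM, and WB groups
ex_mem = pass_on(id_ex, "EX")     # MEM and WB groups
mem_wb = pass_on(ex_mem, "MEM")   # WB group only
print(mem_wb)
```

Each pipeline register therefore carries fewer control bits than the one before it.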

MIPS Pipeline Example 3. Given the code segment and the register contents below, show the contents of the data and control fields in the pipeline registers after the sixth instruction has been fetched (i.e. at the beginning of cycle 7)
Register Contents: $1 = 1, $2 = 5, $3 = 3, $4 = -6, $5 = 2, $6 = 7, $11 = 12, $12 = -15, $13 = 10
Address Instruction
0x00000000 lw $10, 20($1)
0x00000004 sub $11,$1,$2
0x00000008 add $12,$3,$4
0x0000000c lw $13, 24($1)
0x00000010 add $3,$2,$1
0x00000014 sub $1,$5,$6

MIPS Pipeline Example 3. Multi-cycle diagram [Figure: the six instructions (lw $10, 20($1); sub $11,$1,$2; add $12,$3,$4; lw $13, 24($1); add $3,$2,$1; sub $1,$5,$6) in instruction order against time, each using IM, Reg, ALU, DM, Reg.]

MIPS Pipeline Example 3. Single-cycle diagram [Figure: the datapath with the instructions occupying its stages: sub $1,$5,$6; add $3,$2,$1; lw $13, 24($1); add $12,$3,$4; sub $11,$1,$2.]

MIPS Pipeline Example 3. At the beginning of cycle 7, the sixth instruction is stored in the IF/ID register, while the data and control for earlier instructions have been pushed to the next pipeline registers and the register file. Thus,
IF/ID register: No control signals are stored. Store the instruction sub $1,$5,$6 and PC+4
IF/ID.Instruction = 0x00A60822
IF/ID.PC = 0x00000018

MIPS Pipeline ID/EX register Example 3. Store the information of add $3,$2,$1 and PC+4
ID/EX.PC = 0x00000014
ID/EX.RegRsContents = 0x00000005
ID/EX.RegRtContents = 0x00000001
ID/EX.RegRt = (00001)2
ID/EX.RegRd = (00011)2
ID/EX.SignExtend = 0x00001820
Control Information: ID/EX.MemToReg = 0, ID/EX.RegWrite = 1, ID/EX.MemRead = 0, ID/EX.MemWrite = 0, ID/EX.Branch = 0, ID/EX.ALUSrc = 0, ID/EX.RegDst = 1, ID/EX.ALUOp = (10)2

MIPS Pipeline EX/MEM register Example 3. Store the information of lw $13,24($1), branch address, and memory address
EX/MEM.BranchAddress = 0x00000070
EX/MEM.ALUOut = 0x00000019
EX/MEM.Zero = 0
EX/MEM.RegDestination = (01101)2
EX/MEM.RegRtContents = 0x0000000A
Control Information: EX/MEM.MemToReg = 1 (a load writes the data read from memory), EX/MEM.RegWrite = 1, EX/MEM.MemRead = 1, EX/MEM.MemWrite = 0, EX/MEM.Branch = 0

MIPS Pipeline MEM/WB register Example 3. Store the information of add $12,$3,$4: the addition result and the data read from memory
MEM/WB.RegDestination = (01100)2
MEM/WB.ALUOut = 0xFFFFFFFD
MEM/WB.MemoryData = XXXX
Control Information: MEM/WB.MemToReg = 0, MEM/WB.RegWrite = 1
For sub $11,$1,$2: it writes (1 - 5) = -4 to $11
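The hexadecimal values in Example 3 can be re-derived from the standard MIPS encodings. This is a sketch only: the helper names are made up for illustration, and it assumes the usual R-type layout (opcode 0, funct 0x20 for add and 0x22 for sub) and that the branch adder always computes PC+4 plus the sign-extended immediate shifted left by 2, even for non-branch instructions.

```python
def encode_r_type(rs, rt, rd, funct, shamt=0, opcode=0):
    """Pack the R-type fields: opcode | rs | rt | rd | shamt | funct."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct

def sign_extend16(value):
    """Interpret the low 16 bits as a signed two's-complement number."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value

def to_u32(value):
    """Show a (possibly negative) Python int as a 32-bit pattern."""
    return value & 0xFFFFFFFF

regs = {1: 1, 2: 5, 3: 3, 4: -6, 5: 2, 6: 7, 11: 12, 12: -15, 13: 10}

# IF/ID holds sub $1,$5,$6 (rs=5, rt=6, rd=1).
if_id_instruction = encode_r_type(rs=5, rt=6, rd=1, funct=0x22)

# ID/EX holds add $3,$2,$1; its "offset" field is just the low 16 instruction bits.
add_word = encode_r_type(rs=2, rt=1, rd=3, funct=0x20)
id_ex_sign_extend = to_u32(sign_extend16(add_word & 0xFFFF))

# EX/MEM holds lw $13,24($1), fetched from address 0x0C.
ex_mem_branch_address = (0x0C + 4) + (sign_extend16(24) << 2)
ex_mem_alu_out = regs[1] + 24

# MEM/WB holds add $12,$3,$4.
mem_wb_alu_out = to_u32(regs[3] + regs[4])

for label, value in [("IF/ID.Instruction", if_id_instruction),
                     ("ID/EX.SignExtend", id_ex_sign_extend),
                     ("EX/MEM.BranchAddress", ex_mem_branch_address),
                     ("EX/MEM.ALUOut", ex_mem_alu_out),
                     ("MEM/WB.ALUOut", mem_wb_alu_out)]:
    print(f"{label} = 0x{value:08X}")
```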

Pipelining Hazards
In general, pipelining is effective! The MIPS ISA makes it even easier:
All instructions are of the same length (32 bits): can fetch the next instruction while the current one is being decoded
Few instruction formats with symmetry across them: can read the register file in the 2nd stage
Memory access is only through the Load and Store instructions: can use the execute stage to compute the address
Each MIPS instruction writes at most one result, in the MEM or WB stage
Is it that easy? Any complications? YES! PIPELINING HAZARDS !

Pipelining Hazards
Hazards: problems that might occur during pipeline operation. Three basic sources:
Structural Hazards: In pipelining, all functional units are used in every cycle. What if two instructions use the same functional unit in the same cycle?
Data Hazards: In pipelining, the execution of instructions is overlapped. What if the operand(s) of some instruction come from an earlier instruction that is still in the pipeline?
Control Hazards: In pipelining, an instruction is fetched every cycle. What if an instruction is a jump, or a branch that evaluates to true? The following instruction(s) in the pipeline might not be correct!
Simple Solution? Wait until the issue is resolved!

Structural Hazards Single Memory! [Figure: lw followed by Inst 1 to Inst 4 in instruction order against time (clock cycles), each using Mem, Reg, ALU, Mem, Reg.] Reading from memory twice in the same cycle! Solution: Use two memories; Data and Instruction!

Structural Hazards Single Register File! One instruction is writing and the other is reading the register file? [Figure: add $1, followed by Inst 1, Inst 2, and add $2,$1, against time (clock cycles), marking the clock edge that controls register writing and the clock edge that controls loading of pipeline state registers.] Solution: Design the register file to write in the first half of the cycle and read in the second half!

Data Hazards [Figure: add $1, followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5, each using IM, Reg, ALU, DM, Reg.] Dependencies backward in time cause hazards. This is called a Read-after-Write (RAW) data hazard; here, a register-use data hazard. Solution?

Data Hazards Simply, wait for the earlier instruction to finish! This is called stalling the pipeline! However, this affects the CPI! [Figure: add $1, followed by two stalls, then sub $4,$1,$5 and and $6,$1,$7.] Do we need two stalls all the time? No: if one other instruction sits between the two, only one stall is needed, and with two or more in between no stall is needed.

Data Hazards [Figure: lw $1,5($s1) followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5, each using IM, Reg, ALU, DM, Reg.] Dependencies backward in time cause hazards. It is a Read-after-Write (RAW) data hazard; here, a load-use data hazard. Solution?

Data Hazards Again, wait for the LW instruction to finish by stalling the pipeline! However, this affects the CPI! [Figure: lw $1, followed by two stalls, then sub $4,$1,$5 and and $6,$1,$7.]

Data Hazards Example 4. How many cycles are actually required to execute the following code? Assume the pipeline is already full.
add $1, $2, $5
add $5, $3, $1
sub $10, $7, $8
sub $5, $6, $7
lw $3, 45($9)
add $3, $3, $8
Ideally, and since the pipeline is full, each instruction requires 1 cycle. Thus, we need 6 cycles (CPI = 6/6 = 1). However:
Register-use data hazard (add $5,$3,$1 needs $1): adds 2 cycles by stalls
Load-use data hazard (add $3,$3,$8 needs $3): adds 2 cycles by stalls
Thus, 10 cycles are needed. CPI = 10/6 = 1.667 ?? Performance ?? Can we do any better?
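The stall count in Example 4 can be reproduced with a small timing model. This is a sketch under stated assumptions, not a full simulator: there is no forwarding, the register file writes in the first half of a cycle and reads in the second half (so a consumer may be in ID during the producer's WB cycle), and each instruction is described only by a hypothetical (destination, sources) tuple.

```python
def count_stalls_no_forwarding(program):
    """Return the stall cycles for a list of (dest, sources) register tuples."""
    if not program:
        return 0
    id_cycle = []  # cycle in which each instruction occupies the ID stage
    for i, (dest, sources) in enumerate(program):
        earliest = id_cycle[-1] + 1 if id_cycle else 0   # in-order, one per cycle
        for j in range(i):
            producer_dest = program[j][0]
            if producer_dest is not None and producer_dest in sources:
                # Producer is in WB three cycles after its ID (ID, EX, MEM, WB).
                earliest = max(earliest, id_cycle[j] + 3)
        id_cycle.append(earliest)
    ideal_last = len(program) - 1
    return id_cycle[-1] - ideal_last

example4 = [
    ("$1", ("$2", "$5")),    # add $1, $2, $5
    ("$5", ("$3", "$1")),    # add $5, $3, $1   (needs $1)
    ("$10", ("$7", "$8")),   # sub $10, $7, $8
    ("$5", ("$6", "$7")),    # sub $5, $6, $7
    ("$3", ("$9",)),         # lw $3, 45($9)
    ("$3", ("$3", "$8")),    # add $3, $3, $8   (needs the loaded $3)
]

stalls = count_stalls_no_forwarding(example4)
cycles = len(example4) + stalls   # pipeline assumed already full
print(stalls, cycles, round(cycles / len(example4), 3))
```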

Data Hazards Fixing Register-use Hazard by Forwarding
Note that data produced by an instruction and needed by a later instruction is pushed through the pipeline registers until it is saved into the register file! Why not read the data from the pipeline registers before it is stored? This is called forwarding!
What is required?
Need to detect the hazard: is any of the source registers of the instruction the same as the destination register of an earlier instruction that is still in the pipeline?
Need to create a path to pass the data between pipeline stages: instead of reading the source registers of the instruction from the register file, read them from the pipeline registers

Data Hazards Fixing Register-use Hazard by Forwarding [Figure: add $1, followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5, with the result forwarded between pipeline stages.] No Stalls!

Data Hazards Forwarding Hardware implementation Note that forwarding could be from EX/MEM or from MEM/WB! Why? A dependency can reach up to two instructions later, and the data of the earlier instruction moves one pipeline register further every cycle. Example: add $1 / sub $4, $1 / sub $5, $1

Data Hazards Forwarding Hardware implementation Inside the forwarding unit
Forwarding from EX/MEM (MEM Stage)
if (EX/MEM.RegWrite and (EX/MEM.RegRd != 0) and (EX/MEM.RegRd = ID/EX.RegRs)) then ForwardA = From EX/MEM
if (EX/MEM.RegWrite and (EX/MEM.RegRd != 0) and (EX/MEM.RegRd = ID/EX.RegRt)) then ForwardB = From EX/MEM
Why check the RegWrite signal? Some instructions (for example a branch or a store) do not write a register, yet the bits in their destination field may happen to match a source register of a following instruction.
Why check the $zero register? To avoid forwarding a non-zero value when $zero is the destination of an instruction and a following instruction uses it as a source. For example:
add $zero,$2,$3
sub $7,$zero,$2
Without the check, the value used for $zero would be the sum of $2 and $3.
Consider this code: we have WAW then RAW! From where should we forward, MEM or WB? The forwarding unit has to give priority to the MEM stage (EX/MEM), which holds the most recent value.
lw $1, 18($15)
add $2,$1,$14
addi $2,$1,20
sub $3,$1,$2
or $4,$3,$1
addi $22,$23,$23
sub $5,$6,$7

Data Hazards Forwarding Hardware implementation Inside the forwarding unit
Forwarding from MEM/WB (WB Stage)
if (MEM/WB.RegWrite and (MEM/WB.RegRd != 0) and (MEM/WB.RegRd = ID/EX.RegRs)) then ForwardA = From MEM/WB
if (MEM/WB.RegWrite and (MEM/WB.RegRd != 0) and (MEM/WB.RegRd = ID/EX.RegRt)) then ForwardB = From MEM/WB
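The two forwarding conditions can be combined into one selection function. This is a sketch, not the hardware itself: the function name, the dictionary layout, and the string labels returned are illustrative assumptions. Checking EX/MEM before MEM/WB is what gives the more recent result priority when both stages write the same register.

```python
def forwarding_unit(id_ex_rs, id_ex_rt, ex_mem, mem_wb):
    """Return (ForwardA, ForwardB) for the instruction currently in EX.

    ex_mem and mem_wb are dicts with keys 'RegWrite' (0/1) and 'RegRd'
    (destination register number held in that pipeline register).
    """
    def select(source_reg):
        # Most recent producer first: the instruction now in the MEM stage.
        if ex_mem["RegWrite"] and ex_mem["RegRd"] != 0 and ex_mem["RegRd"] == source_reg:
            return "EX/MEM"
        # Otherwise the instruction now in the WB stage.
        if mem_wb["RegWrite"] and mem_wb["RegRd"] != 0 and mem_wb["RegRd"] == source_reg:
            return "MEM/WB"
        return "REGFILE"   # no hazard: use the value read in ID

    return select(id_ex_rs), select(id_ex_rt)

# sub $3,$1,$2 in EX, addi $2,$1,20 in MEM, add $2,$1,$14 in WB (WAW then RAW).
forward_a, forward_b = forwarding_unit(
    id_ex_rs=1, id_ex_rt=2,
    ex_mem={"RegWrite": 1, "RegRd": 2},
    mem_wb={"RegWrite": 1, "RegRd": 2},
)
print(forward_a, forward_b)
```

In this case the $2 operand comes from EX/MEM (the addi result), not from the older add result in MEM/WB.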

Data Hazards Can the forwarding hardware be used with the load-use data hazard? [Figure: lw $1,4($2) followed by sub $4,$1,$5, and $6,$1,$7, or $8,$1,$9, and xor $4,$1,$5.] We still need 1 stall for the instruction following the load, because the loaded value is only available at the end of the MEM stage!

Data Hazards How to stall the pipeline?
A stall is required when the instruction in the EX stage is a load and the one in the ID stage depends on the loaded value.
The load instruction moves normally to EX/MEM on the next cycle.
The conflicting instruction (the instruction following the load) should stay in the decode stage. How?
Don’t write the IF/ID register: need an IF/IDWrite signal
Don’t update the PC: need a PCWrite signal
The control signals of the instruction in the decode stage are stored as 0’s in the ID/EX register: need a multiplexor for the control signals. (WHY? All-zero control signals do not change the state of the program, since they write neither the register file nor the memory.)
Controlling the process requires a special unit: the Hazard Detection Unit

Data Hazards Stall Implementation

Data Hazards Stall Implementation Inside the hazard detection unit
if (ID/EX.MemRead and [(ID/EX.RegRt == IF/ID.RegRs) or (ID/EX.RegRt == IF/ID.RegRt)]) then PCWrite = 0, IF/IDWrite = 0, select 0’s as control signals
Here, the condition is met whenever a load instruction is followed by any instruction in which the RS or RT field is the same as the RT field of the load instruction.
Any problem? Do we need to stall in all cases? No: instructions such as j and jal do not read registers, but the bits in their RS and/or RT field positions may happen to match the RT field of the load.
Solution: perform the check later!
if (EX/MEM.MemRead) & (EX/MEM.RegDestination == ID/EX.rs | EX/MEM.RegDestination == ID/EX.rt) & (~ID/EX.jump) then STALL
However, this requires modifying the datapath! We need an ID/EXWrite signal to keep the information of the instruction that follows the load. The mux of the control signals is moved to the EX stage, so that a bubble (all-zero control signals) follows the load. IF/IDWrite and PCWrite are still there to preserve the following instructions.
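The first (ID-stage) version of the check can be sketched as a small function. The function name and the output key ZeroControl (standing for "select 0's as control signals") are hypothetical; the inputs are the field values named in the condition above.

```python
def hazard_detection_unit(id_ex_mem_read, id_ex_rt, if_id_rs, if_id_rt):
    """Load-use check done while the dependent instruction is in ID."""
    stall = bool(id_ex_mem_read) and id_ex_rt in (if_id_rs, if_id_rt)
    return {
        "PCWrite": 0 if stall else 1,      # hold the PC
        "IFIDWrite": 0 if stall else 1,    # hold the instruction in IF/ID
        "ZeroControl": 1 if stall else 0,  # insert a bubble into ID/EX
    }

# lw $1,4($2) in EX, sub $4,$1,$5 in ID: a real load-use hazard.
real_hazard = hazard_detection_unit(1, id_ex_rt=1, if_id_rs=1, if_id_rt=5)

# lw $1,4($2) in EX, a j instruction in ID whose target bits happen to place
# the value 1 in the rs field position: the unit stalls although j reads no register.
jump_word = (0x02 << 26) | (1 << 21)         # opcode j, bits 25..21 equal 00001
rs_bits = (jump_word >> 21) & 0x1F
rt_bits = (jump_word >> 16) & 0x1F
false_hazard = hazard_detection_unit(1, id_ex_rt=1, if_id_rs=rs_bits, if_id_rt=rt_bits)

print(real_hazard, false_hazard)
```

The second call shows the unnecessary stall that motivates either qualifying the check with the instruction type or moving it to a later stage.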

Data Hazards Example 5. Consider the following code segment in C
A = B + E
C = B + F
(1) Generate the MIPS code assuming that variables A, B, C, E, and F are in memory and addressable with offsets 0, 4, 8, 12, and 16 from $t0
(2) Find all the data hazards and determine the number of cycles required to run the code. Assume forwarding is implemented.
(3) Can you reorder the code to reduce the stalls?

Data Hazards Example 5.
lw $t1, 4($t0) # loads B
lw $t2, 12($t0) # loads E
add $t3, $t1, $t2 # A = B + E (load-use data hazard: adds 1 cycle as a stall)
sw $t3, 0($t0) # stores A
lw $t4, 16($t0) # loads F
add $t5, $t1, $t4 # C = B + F (load-use data hazard: adds 1 cycle as a stall)
sw $t5, 8($t0) # stores C
Ideally, each instruction requires 1 cycle after the pipeline is full. Thus, we need (5+7-1) = 11 cycles. CPI = 11/7 = 1.57
With the two stalls, 13 cycles are needed. CPI = 13/7 = 1.86 ?? Performance ??

Data Hazards Example 5. Reducing stalls by instruction reordering
lw $t1, 4($t0) # loads B
lw $t2, 12($t0) # loads E
lw $t4, 16($t0) # loads F (moved up)
add $t3, $t1, $t2 # A = B + E
sw $t3, 0($t0) # stores A
add $t5, $t1, $t4 # C = B + F
sw $t5, 8($t0) # stores C
Moving this instruction fills the first stall and eliminates the second one! Thus, 11 cycles are needed. CPI = 11/7 = 1.57
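The cycle counts of Example 5 can be checked with a simple model of a pipeline with forwarding. It is a sketch under one assumption: the only remaining stall is a single bubble when an instruction immediately follows a load and reads the loaded register. The tuple layout (op, dest, sources) and the function names are illustrative.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, sources); one stall per adjacent load-use pair."""
    stalls = 0
    for previous, current in zip(program, program[1:]):
        prev_op, prev_dest, _ = previous
        _, _, sources = current
        if prev_op == "lw" and prev_dest in sources:
            stalls += 1
    return stalls

def total_cycles(program, n_stages=5):
    """N + M - 1 cycles for an initially empty pipeline, plus the stall cycles."""
    return n_stages + len(program) - 1 + count_load_use_stalls(program)

original = [
    ("lw", "$t1", ("$t0",)),
    ("lw", "$t2", ("$t0",)),
    ("add", "$t3", ("$t1", "$t2")),
    ("sw", None, ("$t3", "$t0")),
    ("lw", "$t4", ("$t0",)),
    ("add", "$t5", ("$t1", "$t4")),
    ("sw", None, ("$t5", "$t0")),
]

# Same instructions with the load of F moved up to third position.
reordered = original[:2] + [original[4]] + original[2:4] + original[5:]

for name, prog in (("original", original), ("reordered", reordered)):
    cycles = total_cycles(prog)
    print(name, count_load_use_stalls(prog), cycles, round(cycles / len(prog), 2))
```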

Data Hazards Example 6. Assume that the pipelined MIPS processor without forwarding is used to run a program with the following instruction mix: 20% loads, 20% stores, and 60% ALU. Compute the average CPI given that
10% of the ALU instructions result in load-use hazards.
15% of the ALU instructions result in register-use (read-after-write) hazards.
Solution: Ideally, the average CPI is 1 for each instruction. With no forwarding, load-use hazards add two cycles and register-use hazards add two cycles, so an ALU instruction with a hazard takes 3 cycles. The remaining 75% of the ALU instructions take 1 cycle.
Average CPI = 0.2 x 1 + 0.2 x 1 + 0.75 x 0.60 x 1 + 0.1 x 0.60 x 3 + 0.15 x 0.60 x 3 = 1.30
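The weighted average of Example 6 can be written out as a short calculation. The list layout and the function name are illustrative; the fractions and cycle counts are the ones given in the example.

```python
def average_cpi(classes):
    """classes: list of (fraction of all instructions, cycles per instruction)."""
    return sum(fraction * cycles for fraction, cycles in classes)

alu = 0.60
classes = [
    (0.20, 1),          # loads
    (0.20, 1),          # stores
    (alu * 0.75, 1),    # ALU instructions with no hazard
    (alu * 0.10, 3),    # ALU instructions with a load-use hazard (1 + 2 stalls)
    (alu * 0.15, 3),    # ALU instructions with a register-use hazard (1 + 2 stalls)
]

print(f"fractions sum to {sum(f for f, _ in classes):.2f}")
print(f"average CPI = {average_cpi(classes):.2f}")
```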

Control Hazards For the pipelined datapath designed so far, the branch address and decision are known by the end of the MEM stage. Instructions following the branch instruction in the pipeline are not correct if the branch evaluates to true! If the branch is true, then these instructions should be removed from the pipeline and execution should continue from the branch address. Otherwise, no action is required! This is a dependency backward in time: a Control Hazard

Control Hazards Solution! Once it is known that the instruction is a branch, stall the pipeline for 3 cycles? Is it actually a stall? Effectively, we have to flush the IF/ID register for 3 cycles instead of stalling, since stalling may result in an error in the program execution when the branch evaluates to true. Flushing requires clearing the IF/ID register and preventing the update of the program counter. Note that flushing the IF/ID register when there is a branch instruction in the ID/EX register requires changing the PC to the address of the instruction that follows the branch, which was flushed! [Figure: Branch followed by Inst1, Inst2, Inst3.]

Control Hazards [Figure: beq followed by three stall slots and then the next instruction (Inst), in instruction order, using IM, Reg, ALU, DM, Reg.] Are these actual stalls? Why not start the execution of the following instructions normally and, if the branch is true, flush these instructions?! If we don’t use stalls and start executing the instructions following the branch, we only lose three cycles if the branch is true! Fetching from instruction memory is either from PC+4 or from the branch address, depending on the branch result

Control Hazards Reducing the Cost of Branch Hazard Note that three cycles are lost if the branch evaluates to true in order to remove the three instructions following the branch instruction! This could affect the performance significantly! Can we reduce this cost? Move the branch address computation to the decode stage Add additional hardware to compare the two registers in the ID stage! Whenever there is a branch instruction in the ID/EX register (ID/EX.branch =1), flush the instruction in the IF/ID register. The branch penalty in this case will be 1 cycle instead of 3 cycles!

Control Hazards Reducing the Cost of Branch Hazard If we don’t flush the instruction, it will be executed later. Note that on flushing, the pc is not updated, so it is still pointing to the instruction that follows the branch.

Control Hazards Reducing the Cost of Branch Hazard Modifying the Hazard Detection Unit IF (ID/EX.Branch) then Flush IF/ID register Note that we lose one cycle whenever a branch instruction is encountered! Can we do any better? [Figure: beq followed by one stall slot and then lw.]

Control Hazards
Reducing the Cost of Branch Hazard
Approach I – Static Branch Prediction
Always predict the branch as Not Taken and start fetching the instruction that follows the branch.
If the branch evaluates to Not Taken, the prediction is correct and no further action is required.
If the branch evaluates to Taken, the prediction is incorrect: remove the fetched instruction and start fetching from the branch address.
In this approach, we lose one cycle only if the prediction is incorrect.
Inside the hazard detection unit:
IF (ID/EX.Branch) and (ID/EX.ZERO) Then Flush IF/ID register
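The flush condition for predict-not-taken can be sketched in a few lines of Python. This is only an illustration under the slide's assumption of a 1-cycle penalty; the function names `flush_if_id` and `cycles_lost` are hypothetical and do not come from the slides.

```python
def flush_if_id(id_ex_branch: bool, id_ex_zero: bool) -> bool:
    """Predict-not-taken: flush IF/ID only when a branch turns out to be taken."""
    return id_ex_branch and id_ex_zero


def cycles_lost(branch_outcomes, penalty=1):
    """Count cycles lost over a list of branch outcomes (True = taken).

    The 1-cycle default penalty is the value assumed on the slide.
    """
    return sum(penalty for taken in branch_outcomes if flush_if_id(True, taken))


# Three branches, two of them taken: two flushes, so two cycles are lost.
print(cycles_lost([True, False, True]))
```

Compared with flushing on every branch, the extra Zero term means correctly predicted (not taken) branches cost nothing.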

Control Hazards
Reducing the Cost of Branch Hazard
Approach II – Dynamic Branch Prediction
The prediction can be Taken or Not Taken.
If the branch is predicted as Not Taken: fetch the next instruction. If the prediction is wrong, flush the instruction. One cycle is lost.
If the branch is predicted as Taken: fetch the instruction from the branch address. If the prediction is wrong, flush and fetch from PC+4.
How do we store the branch prediction? Use a Branch History Table (Branch Prediction Buffer). The table is addressed by the lower bits of the branch instruction address.
If the branch is predicted as Taken, do we need to wait for the branch address to be computed? Use a Branch Target Buffer.
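A minimal sketch of a branch history table indexed by the lower bits of the branch address. The table size, the initial "not taken" contents, and the choice to drop the two byte-offset bits before indexing are assumptions made for illustration, and the class name is hypothetical.

```python
class BranchHistoryTable:
    """One prediction bit per entry, indexed by low-order PC bits."""

    def __init__(self, index_bits=4):
        self.mask = (1 << index_bits) - 1
        # Assumption: every entry starts as "predict not taken" (False).
        self.entries = [False] * (1 << index_bits)

    def _index(self, pc):
        # Assumption: instructions are word aligned, so skip the two low bits.
        return (pc >> 2) & self.mask

    def predict(self, pc):
        return self.entries[self._index(pc)]

    def update(self, pc, taken):
        self.entries[self._index(pc)] = taken


bht = BranchHistoryTable(index_bits=4)
bht.update(0x00400010, True)
print(bht.predict(0x00400010))  # same branch
print(bht.predict(0x00400050))  # different branch that maps to the same entry
```

Because only the low bits are used, two different branches can share an entry, so a prediction may come from an unrelated branch.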

Control Hazards
Approach II – Dynamic Branch Prediction
1-bit Branch Predictor
There are two states (Taken and Not Taken), so one bit is used to store the prediction.
The prediction state changes whenever the prediction is wrong.
Performance issues: consider branches in loops. Example?
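One way to explore the loop question is to simulate a 1-bit predictor on a loop-closing branch. The loop shape used here (taken 9 times, then not taken on exit, with the loop entered three times) is an assumed example, and the function name is hypothetical.

```python
def one_bit_mispredictions(outcomes, initial=True):
    """Return how many outcomes a 1-bit predictor gets wrong (True = taken)."""
    prediction = initial
    wrong = 0
    for taken in outcomes:
        if prediction != taken:
            wrong += 1
        # The single bit simply remembers the most recent outcome.
        prediction = taken
    return wrong


one_pass = [True] * 9 + [False]   # taken 9 times, not taken on loop exit
outcomes = one_pass * 3           # the loop is entered three times
print(one_bit_mispredictions(outcomes), "wrong out of", len(outcomes))
```

After the first pass, the predictor is wrong twice per pass: once on loop exit, and once more on the first iteration of the next pass, because the exit flipped the bit.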

Control Hazards
Approach II – Dynamic Branch Prediction
2-bit Branch Predictor
There are four states, so two bits are used to store the prediction.
The prediction changes only after the prediction has been wrong twice.

Control Hazards
Example 7. Consider a program that has a conditional branch instruction whose actual outcomes during execution are:
T-T-N-T-T-N-T
List the predictions for the following branch prediction schemes and find the prediction accuracy of each.
a) Predict always taken
b) Predict always not taken
c) 1-bit predictor, initialized to predict taken
d) 2-bit predictor, initialized to weakly predict taken

Control Hazards
Example 7. Actual branch actions: T-T-N-T-T-N-T
a) Predict always taken
   Predictions: T-T-T-T-T-T-T
   Accuracy = 5/7 = 71%
b) Predict always not taken
   Predictions: N-N-N-N-N-N-N
   Accuracy = 2/7 = 29%
c) 1-bit predictor, initialized to predict taken
   Predictions: T-T-T-N-T-T-N
   Accuracy = 3/7 = 43%
d) 2-bit predictor, initialized to weakly predict taken
   Predictions: T-T-T-T-T-T-T
   Accuracy = 5/7 = 71%
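The four answers in Example 7 can be checked with a short simulation. This is a sketch only: the 2-bit predictor is modeled as a saturating counter (0 and 1 predict N, 2 and 3 predict T, with 2 meaning weakly taken), which is an assumed encoding, and all function names are hypothetical.

```python
def accuracy(predictions, actual):
    """Return (number of correct predictions, total number of branches)."""
    correct = sum(p == a for p, a in zip(predictions, actual))
    return correct, len(actual)


def predict_static(actual, always):
    return [always] * len(actual)


def predict_1bit(actual, initial="T"):
    state, out = initial, []
    for a in actual:
        out.append(state)
        state = a  # the bit follows the last outcome
    return out


def predict_2bit(actual, counter=2):
    # Assumed encoding: 0 strongly N, 1 weakly N, 2 weakly T, 3 strongly T.
    out = []
    for a in actual:
        out.append("T" if counter >= 2 else "N")
        counter = min(counter + 1, 3) if a == "T" else max(counter - 1, 0)
    return out


actual = "T-T-N-T-T-N-T".split("-")
schemes = [
    ("always taken", predict_static(actual, "T")),
    ("always not taken", predict_static(actual, "N")),
    ("1-bit", predict_1bit(actual)),
    ("2-bit", predict_2bit(actual)),
]
for name, preds in schemes:
    correct, total = accuracy(preds, actual)
    print(f"{name:17s} {'-'.join(preds)}  {correct}/{total}")
```

For this outcome sequence the 2-bit predictor never leaves the "taken" half of its states, which is why it matches the always-taken scheme.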

Pipelining Performance
Example 8. Compare the performance of the single-cycle, multi-cycle, and pipelined implementations of the MIPS processor, given the operation times and instruction mix below. For the pipelined implementation, assume that:
1) The branch decision is made in the MEM cycle, and branches are handled by stalling the pipeline.
2) Half of the load instructions incur a load-use hazard.
3) Forwarding is implemented.
4) The jump instruction is completed in the ID stage.

Instruction type    Percentage (%)
ALU                 52
Load                25
Store               10
Branch              11
Jump                2

Unit                Time (ps)
Memory              200
ALU and adders      100
Register file       50

Pipelining Performance
Example 8.
Clock cycle time
   Single-cycle = 200 + 50 + 100 + 50 + 200 = 600 ps
   Multi-cycle = 200 ps
   Pipeline = 200 ps
CPI
   Single-cycle = 1
   Multi-cycle = 5 x 0.25 + 4 x 0.52 + 4 x 0.10 + 3 x 0.11 + 3 x 0.02 = 4.12
   Pipeline = 0.125 x 2 + 0.125 x 1 + 0.52 x 1 + 0.10 x 1 + 0.11 x 4 + 0.02 x 2 = 1.475
   (Loads with a load-use hazard take 2 cycles and the other loads 1 cycle; a branch takes 1 + 3 stall cycles; a jump takes 1 + 1 stall cycle.)
Execution time per instruction
   Single-cycle = 600 ps
   Multi-cycle = 4.12 x 200 ps = 824 ps
   Pipeline = 1.475 x 200 ps = 295 ps
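The arithmetic of Example 8 can be reproduced with a short script. The per-class cycle counts are the ones used on the slide; the dictionary layout and variable names are only an illustrative choice.

```python
# Instruction mix from Example 8 (fractions of all instructions).
mix = {"alu": 0.52, "load": 0.25, "store": 0.10, "branch": 0.11, "jump": 0.02}

# Multi-cycle: cycles needed by each instruction class.
multi_cycles = {"alu": 4, "load": 5, "store": 4, "branch": 3, "jump": 3}
cpi_multi = sum(mix[k] * multi_cycles[k] for k in mix)

# Pipeline: 1 cycle per instruction plus stall cycles.
load_use_fraction = 0.5          # half of the loads stall for 1 cycle
pipe_cycles = {
    "alu": 1,
    "load": 1 + load_use_fraction * 1,
    "store": 1,
    "branch": 1 + 3,             # decision in MEM: 3 stall cycles
    "jump": 1 + 1,               # completed in ID: 1 stall cycle
}
cpi_pipe = sum(mix[k] * pipe_cycles[k] for k in mix)

clock_ps = 200                   # multi-cycle and pipeline clock period
time_single = 200 + 50 + 100 + 50 + 200
time_multi = cpi_multi * clock_ps
time_pipe = cpi_pipe * clock_ps

print(f"CPI: multi-cycle {cpi_multi:.2f}, pipeline {cpi_pipe:.3f}")
print(f"Time per instruction (ps): {time_single}, {time_multi:.0f}, {time_pipe:.0f}")
```

Changing `load_use_fraction` or the stall counts in `pipe_cycles` shows how sensitive the pipeline CPI is to each hazard assumption.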

Pipelining Performance
Example 9. Redo Example 8, assuming that branch prediction is employed and one quarter of the branch instructions are mispredicted.

Exceptions & Interrupts
Exceptions and interrupts are unexpected events that require a change in the flow of control.
The two terms are often used interchangeably, and their usage depends on the ISA.
   Intel x86 uses the term interrupt only.
   In MIPS:
      Exceptions: any internal unexpected change in the flow (undefined opcode, overflow, system calls).
      Interrupts: the event is external (I/O controller request).
Dealing with them
   is a challenging part of processor design
   affects performance

Exceptions & Interrupts
In MIPS, when an exception is generated, the following sequence of steps is taken:
1) The address of the offending instruction is saved in a special register called the Exception Program Counter (EPC).
2) The cause of the exception is saved in a special register called the Cause Register.
3) Control is transferred to the operating system by loading a special address (0x8000 0180) into the PC.
The code starting at this address determines what actions the operating system takes in response to the exception, based on the value found in the Cause Register.
The operating system may terminate the program or resume execution using the value found in the EPC.
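The three steps above can be sketched as a small state update. The dictionary representation of the processor state, the function name, and the numeric cause code passed in the usage example are assumptions made for illustration; only the handler address comes from the slide.

```python
EXCEPTION_VECTOR = 0x80000180  # handler address given on the slide


def raise_exception(state, offending_pc, cause_code):
    """Return the processor state after an exception is taken.

    `state` is a hypothetical dict with the keys "pc", "epc", and "cause".
    """
    new_state = dict(state)
    new_state["epc"] = offending_pc      # 1) save the offending address
    new_state["cause"] = cause_code      # 2) record the cause
    new_state["pc"] = EXCEPTION_VECTOR   # 3) jump to the OS handler
    return new_state


before = {"pc": 0x00400024, "epc": 0, "cause": 0}
# Assumption: 12 is used here as an illustrative code for arithmetic overflow.
after = raise_exception(before, offending_pc=0x00400020, cause_code=12)
print({name: hex(value) for name, value in after.items()})
```

Keeping the EPC separate from the PC is what lets the operating system choose between terminating the program and resuming it.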

Overflow Exception
Modifications to the Datapath
The exception is raised in the execute stage (the offending instruction is in the ID/EX register).
Modifications (part of the hardware is already available from the logic used to correct branch misprediction):
   Add the Cause and EPC registers.
   Clear the control signals of the offending instruction (add muxes in the EX stage). These are needed instead of an EX/MEM flush because the exception is dealt with within the same cycle.
   Flush the IF/ID and ID/EX registers.
   Expand the MUX at the PC input to include the exception address.

Fallacies
Fallacy 1. Pipelining is easy.
   Not true. Hazards complicate the operation.
Fallacy 2. Pipelining is independent of technology.
   Why didn't we have pipelined processors earlier? Advances in technology allowed more transistors and thus more operations.

Reading Assignment
Read the following from the textbook:
   Section 4.9 – Exceptions
   Section 4.10 – Parallelism and Advanced Instruction-Level Parallelism