Chapter 5 Pipelining and Hazards


Chapter 5 Pipelining and Hazards Advanced Computer Architecture COE 501

Computer Pipelines
Computers execute billions of instructions, so instruction throughput is what matters.
Divide instruction execution into several pipeline stages, for example: IF, ID, EX, MEM, WB.
Different instructions occupy different pipeline stages simultaneously.
The length of the longest pipeline stage determines the cycle time.
DLX features that make it desirable to pipeline:
- all instructions are the same length
- registers are located in the same place in each instruction format
- memory operands appear only in loads and stores
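As a rough illustration of these timing ideas (not from the slides), the sketch below compares unpipelined and pipelined execution time; the stage latencies and instruction count are made-up values.

# Illustrative only: made-up stage latencies (ns) for IF, ID, EX, MEM, WB.
stage_ns = [2.0, 1.5, 2.5, 2.0, 1.0]
n_instructions = 1000

# Unpipelined: each instruction takes the sum of all stage latencies.
unpipelined_time = n_instructions * sum(stage_ns)

# Pipelined: the clock must accommodate the slowest stage,
# and the pipeline needs (stages - 1) extra cycles to fill.
cycle = max(stage_ns)
pipelined_time = (n_instructions + len(stage_ns) - 1) * cycle

print(unpipelined_time, pipelined_time, unpipelined_time / pipelined_time)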

DLX Instruction Formats
Register-Register (R-type), e.g. ADD R1, R2, R3:
bits 31-26: Op | 25-21: rs1 | 20-16: rs2 | 15-11: rd | 10-0: func
Register-Immediate (I-type), e.g. SUB R1, R2, #3:
bits 31-26: Op | 25-21: rs1 | 20-16: rd | 15-0: immediate
Jump / Call (J-type), e.g. JUMP end:
bits 31-26: Op | 25-0: offset added to PC
(used for jump, jump and link, trap, and return from exception)
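A quick sketch (not part of the slides) of decoding an R-type word using the field layout shown above; the example encoding at the bottom is made up for illustration.

# Hypothetical sketch: decode a 32-bit DLX R-type word assuming the fields
# op[31:26], rs1[25:21], rs2[20:16], rd[15:11], func[10:0].
def decode_rtype(word):
    return {
        "op":   (word >> 26) & 0x3F,
        "rs1":  (word >> 21) & 0x1F,
        "rs2":  (word >> 16) & 0x1F,
        "rd":   (word >> 11) & 0x1F,
        "func":  word        & 0x7FF,
    }

# Made-up example encoding with rs1=2, rs2=3, rd=1 and an arbitrary func value.
word = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 0x20
print(decode_rtype(word))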

Multiple-Cycle DLX: Cycles 1 and 2
Most DLX instructions can be implemented in 5 clock cycles.
The first two clock cycles are the same for every instruction.
1. Instruction fetch cycle (IF): load the instruction and update the program counter.
2. Instruction decode / register fetch cycle (ID): fetch the source registers and sign-extend the immediate field.

5 Steps of DLX Datapath (Figure 3.1, page 130): Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back. (Datapath diagram: PC, instruction memory, register file, ALU, data memory, sign extension, and MUXes.)

Multiple-Cycle DLX: Cycle 3
The third cycle is the execution / effective address cycle (EX).
The actions performed in this cycle depend on the type of operation:
Loads and stores: calculate the effective address.
ALU operations: perform the ALU operation.
Branches: compute the branch target and determine whether the branch is taken.

5 Steps of DLX Datapath (Figure 3.1, page 130): the same datapath diagram as above.

Multiple-Cycle DLX: Cycle 4
The fourth cycle is the memory access / branch completion cycle (MEM).
The only DLX instructions active in this cycle are loads, stores, and branches:
Loads: read the data from memory.
Stores: write the data to memory.
Branches: update the PC to the branch target or to the next instruction.
ALU operations: do nothing.

5 Steps of DLX Datapath (Figure 3.1, page 130): the same datapath diagram as above.

Multiple-Cycle DLX: Cycle 5
The fifth cycle is the write-back cycle (WB).
During this cycle, results are written to the register file:
Loads: write the value read from memory into the register file.
ALU operations: write the ALU result into the register file.
Stores and branches: do nothing.

5 Steps of DLX Datapath (Figure 3.1, page 130): the same datapath diagram as above.

CPI for the Multiple-Cycle DLX
The multiple-cycle DLX requires 4 cycles for branches and stores and 5 cycles for all other operations.
Assuming 20% of the instructions are branches or stores, this gives a CPI of 0.8 x 5 + 0.2 x 4 = 4.80.
We could improve the CPI by allowing ALU operations to complete in 4 cycles.
Assuming 40% of the instructions are ALU operations, this would reduce the CPI to 0.4 x 5 + 0.6 x 4 = 4.40.
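The same weighted-average arithmetic, written out as a small sketch using the instruction mix assumed on the slide:

# CPI as a weighted average of per-class cycle counts.
def cpi(mix):
    # mix: list of (fraction_of_instructions, cycles_per_instruction)
    return sum(frac * cycles for frac, cycles in mix)

baseline = cpi([(0.8, 5), (0.2, 4)])   # only branches/stores finish in 4 cycles
improved = cpi([(0.4, 5), (0.6, 4)])   # ALU ops also finish in 4 cycles
print(baseline, improved)              # 4.8, 4.4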

Pipelining DLX
To reduce the CPI, DLX can be implemented using a five-stage pipeline.
In this example, it takes 9 cycles to execute 5 instructions, for a CPI of 1.8.
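Where the 9 cycles come from: with a k-stage pipeline and no stalls, n instructions finish in k + (n - 1) cycles. A tiny sketch:

stages, n = 5, 5
cycles = stages + (n - 1)    # 5 cycles to finish the first instruction,
                             # then one more instruction completes per cycle
print(cycles, cycles / n)    # 9 cycles, CPI = 1.8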

5 Steps of DLX Datapath (Figure 3.4, page 134): Instruction Fetch | Instr. Decode / Reg. Fetch | Execute / Addr. Calc | Memory Access | Write Back, separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers. (Pipelined datapath diagram.)

Visualizing Pipelining (Figure 3.3, page 133): pipeline diagram with time (clock cycles 1-7) across the top and instructions in program order down the side; each instruction flows through Ifetch, Reg, ALU, DMem, and Reg (write back) in successive cycles.

Pipeline Speedup Example
Assume the multiple-cycle DLX has a 10 ns clock cycle, loads take 5 clock cycles and account for 40% of the instructions, and all other instructions take 4 clock cycles. If pipelining the machine adds 1 ns to the clock cycle, how much speedup in instruction execution rate do we get from pipelining?
MC average instruction time = clock cycle x average CPI = 10 ns x (0.6 x 4 + 0.4 x 5) = 44 ns
PL average instruction time = 10 + 1 = 11 ns
Speedup = 44 / 11 = 4
This ignores the time needed to fill and empty the pipeline and the delays due to hazards.
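The same calculation as a sketch, using the numbers from the slide:

clock_ns = 10.0
mc_cpi = 0.6 * 4 + 0.4 * 5        # 4.4 cycles per instruction
mc_time = clock_ns * mc_cpi       # 44 ns per instruction

pl_time = clock_ns + 1.0          # ideal CPI of 1, but a slower 11 ns clock
print(mc_time / pl_time)          # speedup = 4.0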

Pipelining Summary
Pipelining overlaps the execution of multiple instructions.
With an ideal pipeline, the CPI is one, and the speedup is equal to the number of stages in the pipeline.
However, several factors prevent us from achieving the ideal speedup, including:
not being able to divide the pipeline stages evenly,
the time needed to fill and empty the pipeline,
the overhead needed for pipelining, and
structural, data, and control hazards.

It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
Structural hazards: the hardware cannot support this combination of instructions - two instructions need the same resource.
Data hazards: an instruction depends on the result of a prior instruction still in the pipeline.
Control hazards: caused by the pipelining of branches and other instructions that change the PC.
The common solution is to stall the pipeline until the hazard is resolved, inserting one or more "bubbles" into the pipeline.
To do this, hardware or software must detect that a hazard has occurred.

Speedup Equations for Pipelining
For the simple RISC pipeline with ideal CPI = 1:
Speedup = Pipeline depth / (1 + Pipeline stall cycles per instruction) x (Clock cycle unpipelined / Clock cycle pipelined)
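A small sketch of this formula (the numbers in the usage lines are arbitrary):

def pipeline_speedup(depth, stall_cycles_per_instr, clk_unpipe, clk_pipe):
    # Speedup = depth / (1 + stalls per instruction) * (unpipelined clock / pipelined clock)
    return depth / (1.0 + stall_cycles_per_instr) * (clk_unpipe / clk_pipe)

print(pipeline_speedup(5, 0.0, 10.0, 10.0))   # ideal: speedup = pipeline depth = 5
print(pipeline_speedup(5, 0.4, 10.0, 10.0))   # 0.4 stall cycles/instr: speedup ~ 3.6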

Structural Hazards
Structural hazards occur when two or more instructions need the same resource.
Common methods for eliminating structural hazards are:
duplicate the resource,
pipeline the resource, or
reorder the instructions.
It may be too expensive to eliminate a structural hazard, in which case the pipeline should stall.
When the pipeline stalls, no instructions are issued until the hazard has been resolved.
What are some examples of structural hazards?

One Memory Port Structural Hazards (Figure 3.6, page 142): a load followed by Instr 1, Instr 2, Instr 3, and Instr 4 in a pipeline with a single memory port; in cycle 4 the load's data-memory access and Instr 3's instruction fetch both need the memory.

One Memory Port Structural Hazards (Figure 3.7, page 143): the same sequence with the hazard resolved by a stall; a bubble delays Instr 3 by one cycle so that its instruction fetch no longer conflicts with the load's data-memory access.

Example: One or Two Memory Ports?
Machine A: dual-ported memory ("Harvard architecture").
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads are 40% of the instructions executed.
SpeedupA = Pipeline depth / (1 + 0) x (clock_unpipe / clock_pipe) = Pipeline depth
SpeedupB = Pipeline depth / (1 + 0.4 x 1) x (clock_unpipe / (clock_unpipe / 1.05)) = (Pipeline depth / 1.4) x 1.05 = 0.75 x Pipeline depth
SpeedupA / SpeedupB = Pipeline depth / (0.75 x Pipeline depth) = 1.33
Machine A is 1.33 times faster.
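The comparison, worked through in a small self-contained sketch using the figures assumed on the slide:

def speedup(depth, stalls_per_instr, clock_ratio):
    # clock_ratio = unpipelined clock period / pipelined clock period
    return depth / (1.0 + stalls_per_instr) * clock_ratio

depth = 5                          # any depth works; it cancels in the ratio
a = speedup(depth, 0.0, 1.0)       # dual-ported memory: no load stalls
b = speedup(depth, 0.4 * 1, 1.05)  # 40% loads stall 1 cycle, 1.05x faster clock
print(a / b)                       # ~1.33: machine A wins despite the slower clock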

Three Generic Data Hazards
Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.
I: add r1,r2,r3
J: sub r4,r1,r3

RAW Hazards on R1 (Figure 3.9, page 147): add r1,r2,r3 is followed by sub r4,r1,r3, and r6,r1,r7, or r8,r1,r9, and xor r10,r1,r11; the later instructions read r1 before the add has written it back in its WB stage.

Three Generic Data Hazards
Write After Read (WAR): InstrJ writes an operand before InstrI reads it.
Called an "anti-dependence" by compiler writers; it results from reuse of the name "r1".
Can't happen in the DLX 5-stage pipeline because:
all instructions take 5 stages,
reads are always in stage 2, and
writes are always in stage 5.
WAR hazards can happen if instructions execute out of order or access data late.
I: sub r4,r1,r3
J: add r1,r2,r3
K: mul r6,r1,r7

No WAR Hazards on R1: add r4,r1,r3 reads r1 in its ID/RF stage (stage 2), well before the later sub r1,r2,r3 writes r1 in its WB stage (stage 5), so the read always happens before the write.

Three Generic Data Hazards
Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
Called an "output dependence" by compiler writers; it also results from reuse of the name "r1".
Can't happen in the DLX 5-stage pipeline because:
all instructions take 5 stages, and
writes are always in stage 5.
We will see WAR and WAW hazards in later, more complicated pipelines.
I: sub r1,r4,r3
J: add r1,r2,r3
K: mul r6,r1,r7

No WAW Hazards on R1: add r1,r4,r3 and the later sub r1,r2,r3 both write r1, but each write occurs in the WB stage (stage 5), so the writes reach the register file in program order.

Data Forwarding
With data forwarding (also called bypassing or short-circuiting), data is transferred back to earlier pipeline stages before it is written into the register file.
Instr i: add r1,r2,r3 (result ready after the EX stage)
Instr j: sub r4,r1,r5 (result needed in the EX stage)
This either eliminates or reduces the penalty of RAW hazards.
To support data forwarding, additional hardware is required:
multiplexors to allow data to be transferred back, and
control logic for the multiplexors.
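A rough sketch of the forwarding control decision (not the slide's own hardware description): given the destination registers sitting in the EX/MEM and MEM/WB pipeline registers, pick where each ALU source operand should come from. The function and field names are made up for illustration.

# Hypothetical forwarding-mux control: returns which value feeds an ALU input.
# Priority goes to the most recent producer (the instruction in EX/MEM).
def forward_select(src_reg, exmem_rd, exmem_writes, memwb_rd, memwb_writes):
    if src_reg != 0:                          # register 0 is never forwarded
        if exmem_writes and exmem_rd == src_reg:
            return "EX/MEM"                   # bypass the previous instruction's ALU result
        if memwb_writes and memwb_rd == src_reg:
            return "MEM/WB"                   # bypass the value being written back this cycle
    return "REGFILE"                          # no hazard: read the register file normally

# Example: sub r4,r1,r5 right after add r1,r2,r3 -> operand r1 comes from EX/MEM.
print(forward_select(src_reg=1, exmem_rd=1, exmem_writes=True,
                     memwb_rd=0, memwb_writes=False))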

Forwarding to Avoid RAW Hazard (Figure 3.10, page 149): the same sequence as before (add r1,r2,r3; sub r4,r1,r3; and r6,r1,r7; or r8,r1,r9; xor r10,r1,r11), with the add's result forwarded from the pipeline registers to the later instructions so that no stalls are needed.

HW Change for Forwarding (Figure 3.20, page 161): the ID/EX, EX/MEM, and MEM/WB pipeline registers feed back through multiplexors on the ALU inputs, so a result can be bypassed to the ALU before it reaches the register file.

Data Hazard Even with Forwarding (Figure 3.12, page 153): lw r1, 0(r2) is followed by sub r4,r1,r6; and r6,r1,r7; or r8,r1,r9. The load's data is not available until the end of its MEM stage, too late to forward to the sub's EX stage in the next cycle. The original MIPS actually didn't interlock on this case: MIPS stood for Microprocessor without Interlocked Pipeline Stages.

Data Hazard Even with Forwarding (Figure 3.13, page 154): the same sequence handled by a one-cycle stall; a bubble is inserted so that sub r4,r1,r6 (and the instructions behind it) are delayed one cycle, after which the load's value can be forwarded.
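A sketch of the load-use interlock that inserts that bubble: if the instruction currently in EX is a load and its destination matches a source of the instruction being decoded, stall one cycle. The field names are illustrative, not taken from the slides.

def load_use_stall(idex_is_load, idex_rd, ifid_rs1, ifid_rs2):
    # Stall (insert a bubble) when a load's result is needed by the very next
    # instruction: even with forwarding, the loaded value only exists at the end of MEM.
    return idex_is_load and idex_rd != 0 and idex_rd in (ifid_rs1, ifid_rs2)

# Example: lw r1, 0(r2) followed immediately by sub r4,r1,r6 -> stall one cycle.
print(load_use_stall(idex_is_load=True, idex_rd=1, ifid_rs1=1, ifid_rs2=6))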

Software Scheduling to Avoid Load Hazards
Try producing fast code for
a = b + c;
d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
LW Rb,b
LW Rc,c
ADD Ra,Rb,Rc
SW a,Ra
LW Re,e
LW Rf,f
SUB Rd,Re,Rf
SW d,Rd
Fast code:
LW Rb,b
LW Rc,c
LW Re,e
ADD Ra,Rb,Rc
LW Rf,f
SW a,Ra
SUB Rd,Re,Rf
SW d,Rd

Compiler Avoiding Load Stalls Compilers reduce the number of load stalls, but do not completely eliminate them.

DLX Control Hazards
Control hazards, which occur because instructions change the PC, can cause a large performance loss.
A branch is either
Taken: PC <= PC + 4 + Imm
Not taken: PC <= PC + 4
The simplest solution is to stall the pipeline as soon as a branch instruction is detected:
the branch is detected in the ID stage,
we don't know whether the branch is taken until the EX stage,
if the branch is taken, we need to repeat the IF and ID stages, and
the new PC is not written until the end of the MEM stage, after determining whether the branch is taken and what the new PC value is.

5 Steps of DLX Datapath (Figure 3.4, page 134): the same pipelined datapath diagram as shown earlier.

Control Hazard on Branches: Three-Stage Stall
10: beq r1,r3,36
14: and r2,r3,r5
18: or r6,r1,r7
22: add r8,r1,r9
36: xor r10,r1,r11
The three instructions after the beq are fetched before the branch outcome is known, so a taken branch costs three cycles.

Control Hazard on Branches
With our original DLX model, branches have a delay of 3 cycles.
The delay for not-taken branches can be reduced to two cycles, since the fall-through instruction that was already fetched does not need to be fetched again.

Branch Stall Impact
If CPI = 1 and 30% of instructions are branches, a 3-cycle stall gives a new CPI of 1 + 0.3 x 3 = 1.9!
Two-part solution:
determine whether the branch is taken sooner, AND
compute the taken-branch address earlier.
DLX branches test whether a register is = 0 or ≠ 0.
DLX solution:
move the zero test to the ID/RF stage, and
add an adder to calculate the new PC in the ID/RF stage.
This gives a 1 clock cycle penalty for branches instead of 3.
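The branch-penalty arithmetic as a sketch, using the slide's 30% branch frequency:

def cpi_with_branches(base_cpi, branch_freq, branch_penalty):
    return base_cpi + branch_freq * branch_penalty

print(cpi_with_branches(1.0, 0.30, 3))   # 1.9 with a 3-cycle branch stall
print(cpi_with_branches(1.0, 0.30, 1))   # 1.3 once the branch resolves in ID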

Pipelined DLX Datapath (Figure 3.22, page 163): the datapath with the branch zero test and target adder moved into the Instr. Decode / Reg. Fetch stage - the correct 1-cycle-latency branch implementation. Does the earlier branch test affect the clock cycle? (Forwarding logic must be added for it too.)

Branch Behavior in Programs
Based on SPEC benchmarks on DLX:
Branches occur with a frequency of 14% to 16% in integer programs and 3% to 12% in floating-point programs.
About 75% of the branches are forward branches.
60% of forward branches are taken.
80% of backward branches are taken.
67% of all branches are taken.
Why are branches (especially backward branches) more likely to be taken than not taken?

Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear.
#2: Predict branch not taken:
execute successor instructions in sequence,
"squash" the instructions in the pipeline if the branch is actually taken (an advantage of updating pipeline state late),
33% of DLX branches are not taken on average, and
PC+4 is already calculated, so use it to fetch the next instruction.
#3: Predict branch taken:
67% of DLX branches are taken on average,
but the branch target address has not yet been calculated in DLX, so DLX still incurs a 1-cycle branch penalty;
on other machines the branch target may be known before the branch outcome.

Four Branch Hazard Alternatives
#4: Define the branch to take place AFTER the n following instructions (a branch delay of length n):
branch instruction
sequential successor 1
sequential successor 2
........
sequential successor n
branch target if taken
In the 5-stage pipeline, a 1-slot delay (n = 1) allows the branch decision and branch target address to be calculated in time.
DLX uses this approach, with a single branch delay slot.
Superscalar machines with deep pipelines may require additional delay slots to avoid branch penalties.

Delayed Branch
Where can we get instructions to fill the branch delay slot?
From before the branch: always valuable if such an instruction can be found; the branch cannot depend on the rescheduled instruction (RI).
From the target address: only valuable when the branch is taken; it must be OK to execute the RI if the branch is not taken.
From the fall-through path: only valuable when the branch is not taken; it must be OK to execute the RI if the branch is taken.
Example (scheduling ADD R1, R2, R3 into the delay slot of BEQZ R2, target): from before, the ADD that precedes the branch is moved into the delay slot; from the target, the ADD at the branch target is copied into the delay slot; from fall-through, the ADD that follows the branch is moved into the delay slot.

Filling Delay Slots
Compiler effectiveness for a single branch delay slot:
fills about 60% of branch delay slots,
about 80% of the instructions executed in branch delay slots are useful in the computation,
so about 50% (60% x 80%) of slots are usefully filled.
Canceling (nullifying) branches:
include a prediction of whether the branch is taken or not taken,
if the prediction is correct, the instruction in the delay slot is executed,
if the prediction is incorrect, the instruction in the delay slot is squashed.
This allows more slots to be filled from the target address or the fall-through path.

Evaluating Branch Alternatives (assume branch frequency is 14%)
Scheduling scheme     Branch penalty   CPI    Speedup vs. unpipelined   Speedup vs. stall
Slow stall pipeline   3                1.42   3.5                       1.0
Fast stall pipeline   1                1.14   4.4                       1.26
Predict taken         1                1.14   4.4                       1.26
Predict not taken     0.7              1.10   4.5                       1.29
Delayed branch        0.5              1.07   4.7                       1.34
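The CPI column can be reproduced from the 14% branch frequency and the per-scheme penalties; a quick sketch:

branch_freq = 0.14
schemes = {
    "stall pipeline (slow)": 3,
    "stall pipeline (fast)": 1,
    "predict taken":         1,
    "predict not taken":     0.7,
    "delayed branch":        0.5,
}
for name, penalty in schemes.items():
    # Effective CPI = base CPI of 1 plus the average branch stall cycles.
    print(f"{name}: CPI = {1 + branch_freq * penalty:.2f}")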

Compiler "Static" Prediction of Taken/Untaken Branches
Two strategies examined:
Direction-based: predict backward branches taken and forward branches not taken.
Profile-based prediction: record branch behavior in a profiling run, then predict each branch based on that prior run.

Pipelining Complications
Exceptions: events other than branches or jumps that change the normal flow of instruction execution.
With 5 instructions executing in a 5-stage pipeline:
How do we stop the pipeline?
How do we restart the pipeline?
Who caused the interrupt?
Exceptions that can occur in each stage:
IF: page fault on instruction fetch; misaligned memory access; memory-protection violation
ID: undefined or illegal opcode
EX: arithmetic interrupt
MEM: page fault on data fetch; misaligned memory access; memory-protection violation

Pipelining Complications
Simultaneous exceptions in more than one pipeline stage, e.g.:
a load with a data page fault in the MEM stage, and
an add with an instruction page fault in the IF stage.
The add's fault will happen BEFORE the load's fault, even though the load instruction comes first in program order.
Solution #1: keep an interrupt status vector per instruction; defer the check until the last stage and kill the state update if there is an exception.
Solution #2: interrupt as soon as possible and restart everything that is incomplete.
This is another advantage of updating state late in the pipeline!

Pipelining Complications
Our DLX pipeline only writes results at the end of the instruction's execution. Not all processors do this.
Address modes: auto-increment causes a register change during instruction execution.
Interrupts then need to restore the register state.
This adds WAR and WAW hazards, since writes are no longer confined to the last stage.
Memory-memory move instructions: must be able to handle multiple page faults; VAX and x86 store values temporarily in registers.
Condition codes: need to detect the last instruction that changes the condition codes.

Pipelining Complications
Floating-point instructions:
have longer execution times than integer instructions,
may run in a pipelined FP execution unit, so new FP instructions can be initiated without waiting for the full latency, and
add WAR and WAW hazards, since the pipelines are no longer the same length (e.g., square root is roughly 30 times slower than add).
MIPS R4000 FP latencies (cycles before the result can be used) and initiation rates (cycles before another instruction of the same type can issue):
FP instruction    Latency   Initiation rate
Add, subtract     4         3
Multiply          8         4
Divide            36        35
Square root       112       111
Negate            2         1
Absolute value    2         1
FP compare        3         2
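As a rough illustration (not from the slides) of how the latency column translates into stalls: one plausible reading is that if a consumer issues d cycles after an FP producer, about max(0, latency - d) stall cycles are needed. The function and operation names below are made up for the sketch.

# Latency values taken from the R4000 table above (cycles before the result can be used).
fp_latency = {"add": 4, "sub": 4, "mul": 8, "div": 36, "sqrt": 112}

def dependent_stall(op, issue_distance):
    # Approximate stalls when a consumer issues issue_distance cycles after the producer.
    return max(0, fp_latency[op] - issue_distance)

print(dependent_stall("mul", 1))   # a use in the very next cycle waits several cycles
print(dependent_stall("add", 5))   # far enough apart: no stall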

Summary of Pipelining Hazards
Hazards limit performance:
Structural: need more hardware resources.
Data: need forwarding and compiler scheduling.
Control: early branch evaluation and PC update, delayed branches, branch prediction.
Increasing the length of the pipeline increases the impact of hazards; pipelining helps instruction bandwidth, not latency.
Interrupts, the instruction set, and floating point all make pipelining harder.
Compilers reduce the cost of data and control hazards:
code rescheduling,
load delay slots,
branch delay slots, and
static branch prediction.