CPE 631 Review: Pipelining

Presentation transcript:

CPE 631 Review: Pipelining Electrical and Computer Engineering University of Alabama in Huntsville Aleksandar Milenkovic, milenka@ece.uah.edu http://www.ece.uah.edu/~milenka

Outline
- Pipelined Execution
- 5 Steps in MIPS Datapath
- Pipeline Hazards: Structural, Data, Control
Here is the outline of today's lecture. We first review pipelined execution and the five steps of the MIPS datapath, and then examine the three classes of pipeline hazards (structural, data, and control), together with forwarding, stalls, and the branch-handling alternatives.

Laundry Example (by David Patterson)
Four loads of clothes: A, B, C, D
Task: wash, dry, and fold each load
Resources: the washer takes 30 minutes, the dryer takes 40 minutes, the "folder" takes 20 minutes

Sequential Laundry
[Figure: task-order timeline from 6 PM to midnight; each load occupies the washer (30), dryer (40), and folder (20) in sequence]
Sequential laundry takes 6 hours for 4 loads. If they learned pipelining, how long would laundry take?

Pipelined Laundry
[Figure: task-order timeline from 6 PM to midnight; loads A-D overlap in the washer, dryer, and folder]
Pipelined laundry takes 3.5 hours for 4 loads.

Pipelining Lessons
[Figure: pipelined laundry timeline, 6 PM to 9:30 PM, loads A-D overlapping in the 30/40/20-minute stages]
- Pipelining doesn't help the latency of a single task; it helps the throughput of the entire workload
- The pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced lengths of pipe stages reduce the speedup
- Time to "fill" the pipeline and time to "drain" it reduce the speedup
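A quick check of the numbers above (the stage times and totals are from the slides; the arithmetic is added here for clarity):

$$\text{Sequential: } 4 \times (30 + 40 + 20) = 360\ \text{min} = 6\ \text{hours}$$
$$\text{Pipelined: } 30 + 4 \times 40 + 20 = 210\ \text{min} = 3.5\ \text{hours}$$
$$\text{Speedup} = 360 / 210 \approx 1.7$$

This falls short of the ideal speedup of 3 (the number of stages) because the stages are unbalanced (the 40-minute dryer sets the rate) and the pipeline must fill and drain.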

Computer Pipelines
Computers execute billions of instructions, so throughput is what matters.
What is desirable in an instruction set for pipelining?
- Variable-length instructions vs. all instructions the same length?
- Memory operands as part of any operation vs. memory operands only in loads and stores?
- Register operands in many places in the instruction format vs. registers always in the same place?

A "Typical" RISC 32-bit fixed format instruction (3 formats) Memory access only via load/store instructions 32 32-bit GPR (R0 contains zero) 3-address, reg-reg arithmetic instruction; registers in same place Single address mode for load/store: base + displacement no indirection Simple branch conditions Delayed branch see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS Instruction Formats
Register-Register:  Op (31-26) | Rs1 (25-21) | Rs2 (20-16) | Rd (15-11) | Opx (10-0)
Register-Immediate: Op (31-26) | Rs1 (25-21) | Rd (20-16)  | immediate (15-0)
Branch:             Op (31-26) | Rs1 (25-21) | Rs2/Opx (20-16) | immediate (15-0)
Jump / Call:        Op (31-26) | target (25-0)
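As an illustration (added for this review, not part of the original slides), a minimal C sketch that decodes the fields of the register-register format shown above; the field names and widths follow the slide, while the masks, shifts, and sample bit pattern are assumptions based on those bit positions:

```c
#include <stdint.h>
#include <stdio.h>

/* Decode the register-register format from the slide:
   Op[31:26] Rs1[25:21] Rs2[20:16] Rd[15:11] Opx[10:0] */
static void decode_rr(uint32_t instr) {
    unsigned op  = (instr >> 26) & 0x3F;   /* 6-bit opcode            */
    unsigned rs1 = (instr >> 21) & 0x1F;   /* 5-bit source register 1 */
    unsigned rs2 = (instr >> 16) & 0x1F;   /* 5-bit source register 2 */
    unsigned rd  = (instr >> 11) & 0x1F;   /* 5-bit destination       */
    unsigned opx =  instr        & 0x7FF;  /* 11-bit opcode extension */
    printf("op=%u rs1=%u rs2=%u rd=%u opx=%u\n", op, rs1, rs2, rd, opx);
}

int main(void) {
    decode_rr(0x00430820u);  /* sample bit pattern: in real MIPS this encodes add r1,r2,r3 */
    return 0;
}
```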

5 Steps of MIPS Datapath
[Figure: single-cycle datapath organized into five steps (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back), showing the PC and its +4 adder, the instruction memory, the register file (RS1, RS2, RD), the sign-extended immediate, the ALU with its input multiplexers, the data memory (LMD), and the write-back multiplexer]

5 Steps of MIPS Datapath (cont'd)
[Figure: the same datapath divided into stages by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; the register destination (RD) is carried along from stage to stage until write back]
Data stationary control: local decode for each instruction phase / pipeline stage.

Visualizing the Pipeline
[Figure: instructions (in program order) vs. time in clock cycles (CC1, CC2, ...); each instruction occupies IM, Reg, ALU, DM, Reg in successive cycles]
The pipeline can be thought of as a series of datapaths shifted in time. The figure shows the overlap among the parts of the datapath, with clock cycle CC5 showing the steady state (the pipeline is filled). To check that this pipeline is correct, we have to determine what happens on every clock cycle of the machine and make sure we do not try to perform two different operations with the same datapath resource in the same clock cycle. Let us consider this in detail. First, we use separate instruction (IM) and data (DM) memories, which would typically be implemented with separate instruction and data caches (discussed later in the course); this eliminates the conflict for a single memory that would otherwise arise between instruction fetch and data memory access. Second, the register file is used in two stages: for reading in ID and for writing in WB. These uses are distinct, so we show the register file in two places; this means we must perform two reads and one write every clock cycle, which is easy to implement even when the same register is read and written (the write is performed in the first half of the clock cycle and the reads in the second half). Third, the figure does not deal with the PC. To start a new instruction every clock, we must increment and store the PC every clock, and this must be done in the IF stage; branch instructions make this difficult, a problem we return to when discussing control hazards. Finally, to ensure that instructions in different stages of the pipeline do not interfere with one another, we introduce pipeline registers between successive stages, so that at the end of a clock cycle all the results from a given stage are stored into a register that is used as the input to the next stage on the next clock cycle.

Instruction Flow through the Pipeline
[Figure: clock cycles CC1-CC4, showing the contents of the pipeline registers as the instructions move through the IM, Reg, ALU, and DM stages]
Here we illustrate the instruction flow through the pipeline registers for a sequence of four consecutive, independent instructions: Add R1,R2,R3; Lw R4,0(R2); Sub R6,R5,R7; Xor R9,R8,R1. In clock cycle 1 (CC1) we fetch the first instruction, and at the end of the cycle the Add is in the IF/ID pipeline register; we assume the instructions before it were nops (no operations). In clock cycle 2 (CC2), Add is in the ID pipe stage (instruction decode and register read) and we fetch the next instruction, Lw, from the IM. At the end of CC2, Lw is in the IF/ID pipeline register and Add is in ID/EX, and so on for the remaining instructions.

DLX Pipeline Definition: IF, ID
Stage IF:
  IF/ID.IR ← Mem[PC];
  if EX/MEM.cond {IF/ID.NPC, PC ← EX/MEM.ALUOUT}
  else {IF/ID.NPC, PC ← PC + 4};
Stage ID:
  ID/EX.A ← Regs[IF/ID.IR6..10]; ID/EX.B ← Regs[IF/ID.IR11..15];
  ID/EX.Imm ← (IF/ID.IR16)^16 ## IF/ID.IR16..31;
  ID/EX.NPC ← IF/ID.NPC; ID/EX.IR ← IF/ID.IR;
Let us consider the events in every pipe stage of the DLX pipeline. In IF, the instruction is fetched into IF/ID.IR and the new PC is computed; if we do not have a taken branch, the PC is incremented, and it is also stored for later use in computing the branch target address. In ID, we read the registers, sign-extend the lower 16 bits of the IR, and pass them along with the IR and NPC.

DLX Pipeline Definition: EX
ALU instruction:
  EX/MEM.IR ← ID/EX.IR;
  EX/MEM.ALUOUT ← ID/EX.A func ID/EX.B; or
  EX/MEM.ALUOUT ← ID/EX.A func ID/EX.Imm;
  EX/MEM.cond ← 0;
Load/store:
  EX/MEM.IR ← ID/EX.IR; EX/MEM.B ← ID/EX.B;
  EX/MEM.ALUOUT ← ID/EX.A + ID/EX.Imm;
Branch:
  EX/MEM.ALUOUT ← ID/EX.NPC + (ID/EX.Imm << 2);
  EX/MEM.cond ← (ID/EX.A func 0);
During EX, we perform an ALU operation or an address calculation. We pass along the IR, and the B register if the instruction is a store. We also set cond to 1 if the instruction is a taken branch.

DLX Pipeline Definition: MEM, WB
Stage MEM:
  ALU:        MEM/WB.IR ← EX/MEM.IR; MEM/WB.ALUOUT ← EX/MEM.ALUOUT;
  Load/store: MEM/WB.LMD ← Mem[EX/MEM.ALUOUT] or Mem[EX/MEM.ALUOUT] ← EX/MEM.B;
Stage WB:
  ALU:  Regs[MEM/WB.IR16..20] ← MEM/WB.ALUOUT; or Regs[MEM/WB.IR11..15] ← MEM/WB.ALUOUT;
  Load: Regs[MEM/WB.IR11..15] ← MEM/WB.LMD;
During MEM, we cycle the memory, make the branch decision and update the PC if needed, and pass along the values needed in the final pipe stage. During WB, we update the register file from either the ALU output or the loaded value. For simplicity we always pass the entire IR from one stage to the next, although as an instruction proceeds down the pipeline, less and less of the IR is needed.
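To make the register-transfer notation above concrete, here is a minimal C sketch added for this review (not from the original slides): the structs mirror the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers, and the function walks one register-register ALU instruction through the five stages. The field widths and shift amounts follow the MIPS format slide above; everything else (names, memory sizes, the choice of an add) is an assumption.

```c
#include <stdint.h>

#define NREGS    32
#define MEMWORDS 1024

/* Pipeline registers, mirroring IF/ID, ID/EX, EX/MEM, MEM/WB from the slides. */
struct if_id  { uint32_t ir, npc; };
struct id_ex  { uint32_t ir, npc; int32_t a, b, imm; };
struct ex_mem { uint32_t ir; int32_t aluout, b; int cond; };
struct mem_wb { uint32_t ir; int32_t aluout, lmd; };

struct cpu {
    uint32_t pc;
    int32_t  regs[NREGS];
    uint32_t imem[MEMWORDS];
    int32_t  dmem[MEMWORDS];
    struct if_id  if_id;
    struct id_ex  id_ex;
    struct ex_mem ex_mem;
    struct mem_wb mem_wb;
};

/* Transfers for one register-register ALU instruction, written stage by
   stage in program order (a real simulator evaluates all five stages in
   every cycle, each working on a different instruction).
   Note: the DLX notation on the slides numbers bits from the MSB, so
   IR6..10 is the field shifted here by 21, IR11..15 by 16, IR16..20 by 11. */
void alu_instr_through_pipe(struct cpu *c) {
    /* IF: fetch the instruction and increment the PC */
    c->if_id.ir  = c->imem[c->pc / 4];
    c->if_id.npc = c->pc + 4;
    c->pc        = c->pc + 4;

    /* ID: read the two source registers and pass the IR along */
    c->id_ex.a  = c->regs[(c->if_id.ir >> 21) & 0x1F];
    c->id_ex.b  = c->regs[(c->if_id.ir >> 16) & 0x1F];
    c->id_ex.ir = c->if_id.ir;

    /* EX: perform the ALU operation (an add, as an example) */
    c->ex_mem.aluout = c->id_ex.a + c->id_ex.b;
    c->ex_mem.cond   = 0;
    c->ex_mem.ir     = c->id_ex.ir;

    /* MEM: nothing to do for an ALU op; pass the result along */
    c->mem_wb.aluout = c->ex_mem.aluout;
    c->mem_wb.ir     = c->ex_mem.ir;

    /* WB: write the destination register */
    c->regs[(c->mem_wb.ir >> 11) & 0x1F] = c->mem_wb.aluout;
}
```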

It's Not That Easy for Computers
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle.
- Structural hazards: the hardware cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: caused by the delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

One Memory Port / Structural Hazards
[Figure: Load followed by Instr 1-4 over cycles 1-7; in cycle 4 the Load's data-memory access (DMem) and Instr 3's instruction fetch (Ifetch) both need the single memory port]

One Memory Port / Structural Hazards (cont'd)
[Figure: the same sequence, with Instr 3 stalled for one cycle; a bubble occupies the slot so that its instruction fetch no longer conflicts with the Load's data-memory access]

Data Hazard on R1
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: pipeline diagram (IF, ID/RF, EX, MEM, WB) showing when each dependent instruction reads r1 relative to the cycle in which add writes it back]

Three Generic Data Hazards
Read After Write (RAW): InstrJ tries to read an operand before InstrI writes it.
  I: add r1,r2,r3
  J: sub r4,r1,r3
Caused by a "dependence" (in compiler nomenclature). This hazard results from an actual need for communication.

Three Generic Data Hazards
Write After Read (WAR): InstrJ writes an operand before InstrI reads it.
  I: sub r4,r1,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "anti-dependence" by compiler writers; it results from reuse of the name "r1".
Can't happen in the MIPS 5-stage pipeline because:
- all instructions take 5 stages,
- reads are always in stage 2, and
- writes are always in stage 5.

Three Generic Data Hazards
Write After Write (WAW): InstrJ writes an operand before InstrI writes it.
  I: sub r1,r4,r3
  J: add r1,r2,r3
  K: mul r6,r1,r7
Called an "output dependence" by compiler writers; this also results from reuse of the name "r1".
Can't happen in the MIPS 5-stage pipeline because:
- all instructions take 5 stages, and
- writes are always in stage 5.
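As a small illustration (added here, not on the slides), the three hazard classes can be expressed as a check on the source and destination registers of two instructions I and J, with J later in program order; the struct below is a hypothetical simplification with one destination and two sources:

```c
#include <stdio.h>

/* Hypothetical simplified instruction: one destination, two sources.
   A register number of -1 means "field unused". */
struct instr { int dest, src1, src2; };

static int reads(const struct instr *i, int r)  { return r >= 0 && (i->src1 == r || i->src2 == r); }
static int writes(const struct instr *i, int r) { return r >= 0 && i->dest == r; }

/* Classify the hazard between I (earlier) and J (later) for register r. */
static const char *hazard(const struct instr *i, const struct instr *j, int r) {
    if (writes(i, r) && reads(j, r))  return "RAW (true dependence)";
    if (reads(i, r)  && writes(j, r)) return "WAR (anti-dependence)";
    if (writes(i, r) && writes(j, r)) return "WAW (output dependence)";
    return "none";
}

int main(void) {
    struct instr add = { 1, 2, 3 };   /* add r1,r2,r3 */
    struct instr sub = { 4, 1, 3 };   /* sub r4,r1,r3 */
    printf("%s\n", hazard(&add, &sub, 1));   /* prints: RAW (true dependence) */
    return 0;
}
```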

Forwarding to Avoid Data Hazard
  add r1,r2,r3
  sub r4,r1,r3
  and r6,r1,r7
  or  r8,r1,r9
  xor r10,r1,r11
[Figure: the ALU result of add is forwarded from the EX/MEM and MEM/WB pipeline registers directly to the ALU inputs of the following instructions, so no stalls are needed]

HW Change for Forwarding
[Figure: datapath with forwarding paths; multiplexers at the ALU inputs select among the ID/EX register values, the immediate, the EX/MEM result, and the MEM/WB result]
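The multiplexer control implied by this figure can be sketched in C as follows (added for this review; signal names such as regwrite and rd are assumptions, and the comparisons are the standard forwarding conditions for a MIPS-style 5-stage pipeline rather than code from the slides):

```c
/* Forwarding-mux select for one ALU source operand.
   Returns which value the mux should pass to the ALU. */
enum fwd { FWD_NONE = 0,    /* use the value read in ID (ID/EX)        */
           FWD_EXMEM,       /* use the ALU result in EX/MEM            */
           FWD_MEMWB };     /* use the ALU result or load in MEM/WB    */

struct wbinfo { int regwrite; int rd; };  /* does the instr write a register, and which one */

enum fwd forward_select(int src_reg,            /* rs or rt of the instr in EX */
                        struct wbinfo ex_mem,   /* instruction one ahead       */
                        struct wbinfo mem_wb)   /* instruction two ahead       */
{
    /* The most recent producer wins, and r0 is never forwarded. */
    if (ex_mem.regwrite && ex_mem.rd != 0 && ex_mem.rd == src_reg)
        return FWD_EXMEM;
    if (mem_wb.regwrite && mem_wb.rd != 0 && mem_wb.rd == src_reg)
        return FWD_MEMWB;
    return FWD_NONE;
}
```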

Forwarding to the DM Input
  add R1,R2,R3
  lw  R4,0(R1)
  sw  12(R1),R4
Forwarding can be generalized to include passing a result directly to the functional unit that requires it. To prevent a stall in this sequence we need to:
- forward R1 from EX/MEM.ALUOUT to the ALU input (for the lw),
- forward R1 from MEM/WB.ALUOUT to the ALU input (for the sw), and
- forward R4 from MEM/WB.LMD to the data-memory input (memory output to memory input).
Because both the ALU and the DM accept operands, forwarding paths are needed to their inputs from both the EX/MEM and MEM/WB pipeline registers.

Forwarding to the DM Input (cont'd)
  add R1,R2,R3
  sw  0(R4),R1
Here R1 is forwarded from MEM/WB.ALUOUT to the DM input. Notice that the source is not the same as in the previous example (there it was MEM/WB.LMD).

Forwarding to Zero
  add R1,R2,R3
  beqz R1,50      (forward R1 from EX/MEM.ALUOUT to the Zero unit)
and
  add R1,R2,R3
  sub R4,R5,R6
  bnez R1,50      (forward R1 from MEM/WB.ALUOUT to the Zero unit)
In addition to forwarding to the ALU input and the DM input, DLX also needs forwarding to the Zero unit that tests branch conditions, as these two examples show.

Data Hazard Even with Forwarding
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
[Figure: the load's result is not available until the end of its MEM stage, which is too late for the sub's EX stage in the following cycle]
Historical note: the original MIPS actually didn't interlock on this case; the name stood for Microprocessor without Interlocked Pipeline Stages.

Data Hazard Even with Forwarding (cont'd)
  lw  r1, 0(r2)
  sub r4,r1,r6
  and r6,r1,r7
  or  r8,r1,r9
[Figure: the same sequence with a one-cycle bubble inserted so that sub receives r1 forwarded from MEM/WB; and and or are shifted by one cycle accordingly]
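The stall in the figure above is produced by a load-use hazard check in the decode stage. A hedged C sketch of the standard condition (names such as memread, rs, and rt are assumptions; this is the textbook MIPS-style check, not code from the slides):

```c
/* Detect the load-use hazard: the instruction in EX is a load whose
   destination is a source of the instruction currently in ID.
   If this returns 1, the pipeline freezes IF/ID for one cycle and
   inserts a bubble into EX. */
int load_use_stall(int ex_memread,  /* instr in EX is a load        */
                   int ex_rt,       /* its destination register     */
                   int id_rs,       /* sources of the instr in ID   */
                   int id_rt)
{
    return ex_memread && ex_rt != 0 && (ex_rt == id_rs || ex_rt == id_rt);
}
```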

Software Scheduling to Avoid Load Hazards
Try producing fast code for
  a = b + c;
  d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code:
  LW  Rb,b
  LW  Rc,c
  ADD Ra,Rb,Rc
  SW  a,Ra
  LW  Re,e
  LW  Rf,f
  SUB Rd,Re,Rf
  SW  d,Rd
Fast code:
  LW  Rb,b
  LW  Rc,c
  LW  Re,e
  ADD Ra,Rb,Rc
  LW  Rf,f
  SW  a,Ra
  SUB Rd,Re,Rf
  SW  d,Rd

Control Hazard on Branches: Three-Stage Stall
  10: beq r1,r3,36
  14: and r2,r3,r5
  18: or  r6,r1,r7
  22: add r8,r1,r9
  36: xor r10,r1,r11
[Figure: the branch outcome is not resolved until the beq's MEM stage, so the three instructions fetched after it must be stalled or squashed if the branch is taken]

Example: Branch Stall Impact
If 30% of instructions are branches and each branch stalls 3 cycles, the impact is significant.
Two-part solution:
- determine whether the branch is taken or not sooner, AND
- compute the taken-branch target address earlier.
MIPS branches test whether a register is = 0 or ≠ 0.
MIPS solution:
- move the Zero test to the ID/RF stage, and
- add an adder to calculate the new PC in the ID/RF stage.
This gives a 1-clock-cycle branch penalty instead of 3.

Pipelined MIPS Datapath
[Figure: the pipelined datapath revised for early branch resolution: the Zero test and an extra adder that computes the branch target are placed in the ID/RF stage, next to the IF/ID register]
Data stationary control: local decode for each instruction phase / pipeline stage.

Four Branch Hazard Alternatives
#1: Stall until the branch direction is clear.
#2: Predict Branch Not Taken
- Execute successor instructions in sequence.
- "Squash" instructions in the pipeline if the branch is actually taken.
- Takes advantage of the late pipeline state update.
- 47% of MIPS branches are not taken on average.
- PC+4 is already calculated, so use it to fetch the next instruction.

Branch Not Taken
Branch is untaken (determined during ID): we have already fetched the fall-through instruction, so we just continue and no cycles are wasted.
  branch (not taken)   IF ID EX Mem WB
  Ii+1                    IF ID EX Mem WB
  Ii+2                       IF ID EX Mem WB
Branch is taken (determined during ID): restart the fetch at the branch target, so one cycle is wasted.
  branch (taken)       IF ID EX Mem WB
  Ii+1                    IF (idle)
  branch target            IF ID EX Mem WB
  branch target+1             IF ID EX Mem WB

Four Branch Hazard Alternatives
#3: Predict Branch Taken
- Treat every branch as taken.
- 53% of MIPS branches are taken on average.
- But the branch target address is not yet calculated in MIPS, so MIPS still incurs a 1-cycle branch penalty.
- Makes sense only when the branch target is known before the branch outcome.

Four Branch Hazard Alternatives
#4: Delayed Branch
Define the branch to take place AFTER a following instruction:
  branch instruction
  sequential successor 1
  sequential successor 2
  ...
  sequential successor n     (branch delay of length n)
  branch target if taken
A 1-slot delay allows a proper decision and branch target address in the 5-stage pipeline; MIPS uses this.

Delayed Branch
Where do we get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is taken
- From the fall-through: only valuable when the branch is not taken

Scheduling the Branch Delay Slot: From Before
  ADD R1,R2,R3
  if (R2=0) then
    <Delay Slot>
becomes
  if (R2=0) then
    <ADD R1,R2,R3>
The delay slot is scheduled with an independent instruction from before the branch. This is the best choice: it always improves performance.

Scheduling the Branch Delay Slot: From Target
  SUB R4,R5,R6
  ...
  ADD R1,R2,R3
  if (R1=0) then
    <Delay Slot>
becomes
  ...
  ADD R1,R2,R3
  if (R1=0) then
    <SUB R4,R5,R6>
The delay slot is scheduled from the target of the branch. It must be OK to execute that instruction if the branch is not taken. Usually the target instruction will need to be copied because it can be reached by another path, so programs are enlarged. Preferred when the branch is taken with high probability.

Scheduling the Branch Delay Slot: From Fall-Through
  ADD R1,R2,R3
  if (R2=0) then
    <Delay Slot>
  SUB R4,R5,R6
becomes
  ADD R1,R2,R3
  if (R2=0) then
    <SUB R4,R5,R6>
The delay slot is scheduled from the not-taken fall-through path. It must be OK to execute that instruction if the branch is taken. Improves performance when the branch is not taken.

Delayed Branch Effectiveness
Compiler effectiveness for a single branch delay slot:
- fills about 60% of branch delay slots
- about 80% of the instructions executed in branch delay slots are useful in computation
- so about 50% (60% x 80%) of the slots are usefully filled
Downside of the delayed branch: it scales poorly to deeper pipelines of 7-8 stages and to issuing multiple instructions per clock (superscalar), where a single delay slot is no longer enough.

Example: Branch Stall Impact
Assume CPI = 1.0 ignoring branches and that the solution is stalling for 3 cycles. If 30% of instructions are branches:
  Op       Freq   Cycles   CPI(i)   (% Time)
  Other    70%    1        0.7      (37%)
  Branch   30%    4        1.2      (63%)
New CPI = 1.9, or almost 2 times slower.
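Spelling out the arithmetic behind the table (added here; the percentages and cycle counts are from the slide):

$$\mathrm{CPI}_{new} = 0.7 \times 1 + 0.3 \times (1 + 3) = 0.7 + 1.2 = 1.9$$

The "% Time" column is each contribution divided by 1.9: 0.7/1.9 ≈ 37% and 1.2/1.9 ≈ 63%.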

Example 2: Speedup Equation for Pipelining
For a simple RISC pipeline with CPI = 1:
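The equation itself was an image and did not survive the transcript. A reconstruction of the standard form from the textbook this review follows (Hennessy & Patterson, Appendix A); whether the slide showed exactly this form is an assumption:

$$\mathrm{Speedup} = \frac{\mathrm{CPI}_{unpipelined} \times \mathrm{Clock\ cycle}_{unpipelined}}{\mathrm{CPI}_{pipelined} \times \mathrm{Clock\ cycle}_{pipelined}}$$

which, for a simple RISC pipeline with ideal CPI = 1 and a balanced clock, reduces to

$$\mathrm{Speedup} = \frac{\mathrm{Pipeline\ depth}}{1 + \mathrm{Pipeline\ stall\ cycles\ per\ instruction}}$$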

Example 3: Evaluating Branch Alternatives (for one program)
Assumes conditional and unconditional branches are 14% of instructions, and 65% of them change the PC (are taken).
  Scheduling scheme    Branch penalty   CPI    Speedup vs. stall
  Stall pipeline       3                1.42   1.0
  Predict taken        1                1.14   1.26
  Predict not taken    1                1.09   1.29
  Delayed branch       0.5              1.07   1.31
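The CPI column can be reproduced from the 14% / 65% figures with CPI = 1 + (branch frequency) x (effective penalty); the arithmetic below is added for clarity:

$$\text{Stall pipeline: } 1 + 0.14 \times 3 = 1.42$$
$$\text{Predict taken: } 1 + 0.14 \times 1 = 1.14$$
$$\text{Predict not taken: } 1 + 0.14 \times 0.65 \times 1 \approx 1.09$$
$$\text{Delayed branch: } 1 + 0.14 \times 0.5 = 1.07$$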

Example 4: Dual-Port vs. Single-Port Memory
Machine A: dual-ported memory ("Harvard architecture").
Machine B: single-ported memory, but its pipelined implementation has a 1.05 times faster clock rate.
Ideal CPI = 1 for both; loads and stores are 40% of the instructions executed. Which machine is faster, and by how much?
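The slide poses the comparison without the worked answer; a sketch of the standard solution, assuming that on machine B every load or store causes a one-cycle structural stall (it competes with an instruction fetch for the single memory port):

$$\mathrm{CPI}_B = 1 + 0.4 \times 1 = 1.4, \qquad \mathrm{cycle}_B = \mathrm{cycle}_A / 1.05$$
$$\frac{\text{Avg. instruction time}_B}{\text{Avg. instruction time}_A} = \frac{1.4 \times \mathrm{cycle}_A / 1.05}{1.0 \times \mathrm{cycle}_A} \approx 1.33$$

So machine A, with the dual-ported memory, is about 1.33 times faster despite machine B's faster clock.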

Extended MIPS Pipeline
DLX pipe extended with three unpipelined FP functional units alongside the integer unit:
  EX Int (integer ALU)
  EX FP/I Mult (FP/integer multiplier)
  EX FP Add (FP adder)
  EX FP/I Div (FP/integer divider)
[Figure: IF and ID feed the four EX units, which feed MEM and WB; the FP units loop back on themselves for multi-cycle operations]
In reality, the intermediate results are probably not cycled around the EX unit; instead, the EX stage has some number of clock delays larger than 1.

Extended MIPS Pipeline (cont'd)
Initiation (or repeat) interval: the number of clock cycles that must elapse between issuing two operations of a given type.
Latency: the number of intervening clock cycles between an instruction that produces a result and an instruction that uses the result.
  Functional unit       Latency   Initiation interval
  Integer ALU           0         1
  Data memory           1         1
  FP add                3         1
  FP/integer multiply   6         1
  FP/integer divide     24        25

Extended MIPS Pipeline (cont'd)
[Figure: the FP-add pipeline occupies stages IF, ID, A1, A2, A3, A4, MEM, WB]

Extended MIPS Pipeline (cont'd)
Multiple outstanding FP operations: the FP/I adder and multiplier are fully pipelined; the FP/I divider is not pipelined.
Pipeline timing for independent operations:
  MUL.D   IF ID M1 M2 M3 M4 M5 M6 M7 Mem WB
  ADD.D      IF ID A1 A2 A3 A4 Mem WB
  L.D           IF ID EX Mem WB
  S.D              IF ID EX Mem WB

Hazards and Forwarding in Longer Pipes
- Structural hazard: the divide unit is not fully pipelined; detect the conflict and stall the instruction.
- Structural hazard: the number of register writes in a cycle can be larger than one, because instructions have varying running times.
- WAW hazards are possible, since instructions no longer reach WB in order.
- Exceptions: instructions can complete in a different order than they were issued.
- RAW hazards will be more frequent, because operation latencies are longer.

Examples
Stalls arising from RAW hazards:
  L.D   F4, 0(R2)
  MUL.D F0, F4, F6   (stalls in ID one cycle for the loaded F4, then M1-M7)
  ADD.D F2, F0, F8   (stalls until the multiply's result is available, then A1-A4)
  S.D   0(R2), F2    (stalls until the add's result is available)
Three instructions that want to write back to the FP register file simultaneously:
  MUL.D F0, F4, F6   (M1-M7)
  ...
  ADD.D F2, F4, F6   (A1-A4)
  ...
  L.D   F2, 0(R2)    (EX)
Because of the different latencies, instructions issued with the right spacing all reach MEM and WB in the same clock cycle, so the single register write port is oversubscribed.

Solving Register Write Conflicts
First approach: track the use of the write port in the ID stage and stall an instruction before it issues (sketched in C below).
- Use a shift register that indicates when already-issued instructions will use the register file.
- If there is a conflict with an already-issued instruction, stall the instruction for one clock cycle.
- On each clock cycle the reservation register is shifted one bit.
Alternative approach: stall a conflicting instruction when it tries to enter the MEM or WB stage.
- We can stall either instruction, e.g., give priority to the unit with the longest latency.
- Pros: the conflict does not have to be detected until the entrance to the MEM or WB stage.
- Cons: complicates pipeline control; stalls can now arise from two different places.
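A minimal C sketch of the first approach, added for illustration (the slide describes the idea but gives no code): a bit vector reserves the write port for future cycles, is checked at issue time, and is shifted every clock.

```c
#include <stdint.h>

/* Write-port reservation: bit k is set if an already-issued instruction
   will need the register-file write port k cycles from now. */
static uint32_t wb_reserved = 0;

/* Called in ID: try to issue an instruction whose write-back happens
   wb_delay cycles from now (e.g., a larger delay for FP operations).
   Returns 1 if it issues, 0 if it must stall for this cycle. */
int try_issue(int wb_delay, int writes_reg) {
    if (writes_reg && (wb_reserved & (1u << wb_delay)))
        return 0;                       /* write port already claimed: stall */
    if (writes_reg)
        wb_reserved |= 1u << wb_delay;  /* claim the port for that cycle     */
    return 1;
}

/* Called once per clock: every reservation moves one cycle closer. */
void clock_tick(void) {
    wb_reserved >>= 1;
}
```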

WAW Hazards
  ADD.D F2, F4, F6   IF ID A1 A2 A3 A4 Mem WB
  L.D   F2, 0(R2)       IF ID EX Mem WB
In program order the result of the ADD.D is overwritten by the L.D without any instruction ever using it, yet in the pipeline the L.D reaches WB first. WAW hazards therefore occur only when a useless instruction is executed; still, we must detect them and provide correct execution (the L.D's value must be the one left in F2). Why does such code occur at all? For example, because of a branch delay slot filled from the fall-through:
  BNEZ  R1, foo
  DIV.D F0, F2, F4   ; delay slot from the fall-through
  ...
  foo: L.D F0, qrs

Solving WAW Hazards
First approach: delay the issue of the load instruction until the ADD.D enters MEM.
Second approach: stamp out the result of the ADD.D by detecting the hazard and changing the control so that the ADD.D does not write its result; the L.D then issues right away.
Either way, detect the hazard in ID while the L.D is issuing, and then either stall the L.D or turn the ADD.D into a no-op. Luckily this hazard is rare.

Hazard Detection in the ID Stage
Possible hazards:
- hazards among FP instructions, and
- hazards between an FP instruction and an integer instruction.
FP and integer registers are distinct, except for FP load-stores and FP-integer moves.
Assume that the pipeline does all hazard detection in the ID stage.

Hazard Detection in the ID Stage (cont'd)
Check for structural hazards: wait until the required functional unit is not busy, and make sure the register write port will be available when it is needed.
Check for RAW data hazards: wait until the source registers are not listed as pending destinations in a pipeline register that will not be available when this instruction needs the result.
Check for WAW data hazards: determine whether any instruction in A1-A4, M1-M7, or D has the same register destination as this instruction; if so, stall the issue of the instruction in ID.
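A C sketch of these three ID-stage checks (illustrative only; the bookkeeping layout and names such as ready_in and unit_busy are assumptions rather than anything specified on the slides):

```c
#include <stdbool.h>

#define NFP 32   /* number of FP registers */

/* Hypothetical bookkeeping kept by the ID stage:
   - unit_busy:    the functional unit this instruction needs is occupied
   - wb_reserved:  the register write port is already claimed for the cycle
                   in which this instruction would reach WB
   - ready_in[r]:  cycles until the in-flight value of FP register r is
                   available (0 means no write to r is in flight)          */
struct id_state {
    bool unit_busy;
    bool wb_reserved;
    int  ready_in[NFP];
};

/* Return true if the instruction (sources src1/src2, destination dest,
   sources needed `need` cycles from now) may issue; false means stall in ID.
   A register number of -1 means the field is unused. */
bool can_issue(const struct id_state *s,
               int src1, int src2, int dest, int need)
{
    /* Structural hazard: functional unit busy or WB port unavailable */
    if (s->unit_busy || s->wb_reserved)
        return false;

    /* RAW hazard: a source is a pending destination that will not be
       ready by the time this instruction needs it */
    if ((src1 >= 0 && s->ready_in[src1] > need) ||
        (src2 >= 0 && s->ready_in[src2] > need))
        return false;

    /* WAW hazard: an instruction in A1..A4, M1..M7, or D already has
       the same destination register */
    if (dest >= 0 && s->ready_in[dest] > 0)
        return false;

    return true;
}
```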

Forwarding Logic
Check whether the destination register in any of the EX/MEM, A4/MEM, M7/MEM, D/MEM, or MEM/WB pipeline registers is one of the source registers of an FP instruction. If so, enable the appropriate input multiplexer to choose the forwarded data instead of the register-file value.