CS510 Computer Architectures, Lecture 7: Pipeline Hazards

Pipelining Lessons
[Figure: laundry timeline, tasks A to D in task order against time from 6 PM to 9 PM]
- Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
- The pipeline rate is limited by the slowest pipeline stage
- Multiple tasks operate simultaneously
- Potential speedup = number of pipe stages
- Unbalanced pipe stage lengths reduce speedup
- Time to “fill” the pipeline and time to “drain” it reduce speedup
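The lessons above can be checked with a small timing model. This is only a sketch under simple assumptions: every task passes through every stage, the clock is set by the slowest stage, and there are no stalls. The function names are illustrative, not from the lecture.

```python
def unpipelined_time(n_tasks, stage_times):
    """Each task runs all stages back to back before the next task starts."""
    return n_tasks * sum(stage_times)


def pipelined_time(n_tasks, stage_times):
    """The clock is limited by the slowest stage. The first task needs one
    cycle per stage (fill); every later task finishes one cycle after the
    previous one."""
    cycle = max(stage_times)
    return (len(stage_times) + n_tasks - 1) * cycle


def speedup(n_tasks, stage_times):
    return unpipelined_time(n_tasks, stage_times) / pipelined_time(n_tasks, stage_times)


balanced = [1, 1, 1, 1, 1]
unbalanced = [1, 1, 2, 1, 1]

print(speedup(4, balanced))       # few tasks: fill and drain time dominates
print(speedup(1000, balanced))    # many tasks: approaches the number of stages
print(speedup(1000, unbalanced))  # the slow stage limits the pipeline rate
```

With many tasks and balanced stages the speedup approaches the stage count without reaching it, and the unbalanced case stays well below it.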

It’s Not That Easy to Achieve the Promised Performance
Limits to pipelining: hazards prevent the next instruction from executing during its designated clock cycle
- Structural hazards: the hardware cannot support this combination of instructions
- Data hazards: an instruction depends on the result of a prior instruction still in the pipeline
- Control hazards: pipelining of branches and other instructions that change the PC
A common solution is to stall the pipeline until the hazard is resolved, inserting one or more “bubbles” (idle clock cycles) into the pipeline

Structural Hazards: Memory
[Figure: pipeline diagram, instruction order (LOAD, Instr 1 to Instr 4) against time (clock cycles CC1 to CC9), stages Mem, Reg, ALU, Mem, Reg]
Two different instructions operate on memory in the same clock cycle

Structural Hazards with Single-Port Memory
[Figure: pipeline diagram, instruction order (LOAD, Instr 1 to Instr 3) against time (clock cycles CC1 to CC9), with Instr 3 stalled]
3-cycle stall with 1-port memory

Avoiding Structural Hazard with Dual-Port Memory
[Figure: pipeline diagram, instruction order (LOAD, Instr 1 to Instr 5) against time (clock cycles CC1 to CC9), stages IM, Reg, ALU, DM, Reg]
No stall with 2-port memory

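Speed Up Equation for Pipelining
$$\text{Speedup from pipelining} = \frac{\text{Ave Instr Time}_{\text{unpipelined}}}{\text{Ave Instr Time}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}} \times \text{Clock Cycle}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}} \times \text{Clock Cycle}_{\text{pipelined}}} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
The ideal CPI for pipelined machines is almost always 1, where
$$\text{Ideal CPI} = \frac{\text{CPI}_{\text{unpipelined}}}{\text{Pipeline depth}}$$
and the pipeline depth is the number of pipeline stages. Substituting for the unpipelined CPI:
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{CPI}_{\text{pipelined}}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$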

Speed Up Equation for Pipelining
$$\text{CPI}_{\text{pipelined}} = \text{Ideal CPI} + \text{Pipeline stall clock cycles per instr} = 1 + \text{Pipeline stall clock cycles per instr}$$
$$\text{Speedup} = \frac{\text{Ideal CPI} \times \text{Pipeline depth}}{\text{Ideal CPI} + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$

Dual-Port vs Single-Port Memory
- Machine A: 2-port memory (needs no stall for loads); same clock cycle as the unpipelined machine
- Machine B: 1-port memory (needs a 3-cycle stall for each load); clock rate 1.05 times faster than the unpipelined machine
- Ideal CPI = 1 for both
- Loads are 40% of instructions executed
$$\text{SpeedUp}_A = \frac{\text{Pipeline Depth}}{1 + 0} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{pipe}}} = \text{Pipeline Depth}$$
$$\text{SpeedUp}_B = \frac{\text{Pipeline Depth}}{1 + 0.4 \times 3} \times \frac{\text{clock}_{\text{unpipe}}}{\text{clock}_{\text{unpipe}} / 1.05} = \frac{\text{Pipeline Depth}}{2.2} \times 1.05 \approx 0.48 \times \text{Pipeline Depth}$$
$$\frac{\text{SpeedUp}_A}{\text{SpeedUp}_B} = \frac{\text{Pipeline Depth}}{0.48 \times \text{Pipeline Depth}} \approx 2.1$$
Machine A is about 2.1 times faster
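As a quick check, the speedup equation can be evaluated directly from the stated parameters (40% loads, 3 stall cycles per load on Machine B, a 1.05 times faster clock on Machine B). This is a minimal sketch; the function name and the pipeline depth of 5 are assumptions made for illustration, and the A/B ratio does not depend on the depth.

```python
def pipeline_speedup(depth, stall_cpi, clock_ratio=1.0, ideal_cpi=1.0):
    """Speedup over the unpipelined machine.

    clock_ratio is (unpipelined clock cycle) / (pipelined clock cycle).
    """
    return ideal_cpi * depth / (ideal_cpi + stall_cpi) * clock_ratio


depth = 5            # assumed; cancels out of the A/B ratio
load_fraction = 0.4  # loads are 40% of instructions
stall_per_load = 3   # 1-port memory: 3 stall cycles per load

speedup_a = pipeline_speedup(depth, stall_cpi=0.0)
speedup_b = pipeline_speedup(depth, stall_cpi=load_fraction * stall_per_load,
                             clock_ratio=1.05)

print(f"SpeedUp A = {speedup_a:.2f}")
print(f"SpeedUp B = {speedup_b:.2f}")
print(f"A / B     = {speedup_a / speedup_b:.2f}")
```

The stall CPI is the load fraction times the stall per load, so it is added to the ideal CPI of 1 in the denominator.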

Data Hazard on Registers
    ADD R1,R2,R3
    SUB R4,R1,R3
    AND R6,R1,R7
    OR  R8,R1,R9
    XOR R10,R11,R1
[Figure: pipeline diagram of the five instructions against time (clock cycles CC1 to CC9), stages Mem, Reg, ALU, Mem, Reg, marking where R1 is written and where it is read]

Data Hazard on Registers
Registers can be made to read and store in the same cycle: data is stored in the first half of the clock cycle and can be read in the second half of the same clock cycle
[Figure: timing of register Ri within one clock cycle: store into Ri, then read from Ri]

Data Hazard on Registers
    ADD R1,R2,R3
    SUB R4,R1,R3
    AND R6,R1,R7
    OR  R8,R1,R9
    XOR R10,R11,R1
[Figure: pipeline diagram of the five instructions against time (clock cycles CC1 to CC9), marking the write and read of R1]
Needs to stall 2 cycles

Three Generic Data Hazards
Instr i followed by Instr j
Read After Write (RAW): Instr j tries to read an operand before Instr i writes it
    Instr i: LW  R1, 0(R2)
    Instr j: SUB R4, R1, R5

Three Generic Data Hazards
Instr i followed by Instr j
Write After Read (WAR): Instr j tries to write an operand before Instr i reads it
    Instr i: ADD R1, R2, R3
    Instr j: LW  R2, 0(R5)
Can’t happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages,
- Reads are always in stage 2, and
- Writes are always in stage 5

Three Generic Data Hazards
Instr i followed by Instr j
Write After Write (WAW): Instr j tries to write an operand before Instr i writes it
- Leaves the wrong result (that of Instr i, not Instr j)
    Instr i: LW R1, 0(R2)
    Instr j: LW R1, 0(R3)
Can’t happen in the DLX 5-stage pipeline because:
- All instructions take 5 stages, and
- Writes are always in stage 5
We will see WAR and WAW in later, more complicated pipes
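The three definitions reduce to comparing the registers each instruction reads and writes. The sketch below is illustrative only: the `(dest, sources)` tuple encoding and the function name are assumptions, and it reports potential hazards by register name. Whether one actually occurs depends on the pipeline, as noted above for DLX.

```python
def classify_hazards(instr_i, instr_j):
    """Return the potential data hazards when instr_i is followed by instr_j.

    Each instruction is a (dest, sources) pair: dest is the register written
    (or None), sources is the collection of registers read.
    """
    dest_i, srcs_i = instr_i
    dest_j, srcs_j = instr_j
    hazards = []
    if dest_i is not None and dest_i in srcs_j:
        hazards.append("RAW")  # j reads what i writes
    if dest_j is not None and dest_j in srcs_i:
        hazards.append("WAR")  # j writes what i reads
    if dest_i is not None and dest_i == dest_j:
        hazards.append("WAW")  # both write the same register
    return hazards


# The three instruction pairs from the slides.
print(classify_hazards(("R1", ["R2"]), ("R4", ["R1", "R5"])))  # LW R1,0(R2); SUB R4,R1,R5
print(classify_hazards(("R1", ["R2", "R3"]), ("R2", ["R5"])))  # ADD R1,R2,R3; LW R2,0(R5)
print(classify_hazards(("R1", ["R2"]), ("R1", ["R3"])))        # LW R1,0(R2); LW R1,0(R3)
```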

Forwarding to Avoid Data Hazards
    ADD R1,R2,R3
    SUB R4,R1,R3
    AND R6,R1,R7
    OR  R8,R1,R9
    XOR R10,R11,R1
[Figure: pipeline diagram of the five instructions against time (clock cycles CC1 to CC9), stages Mem, Reg, ALU, Mem, Reg]

HW Change for Forwarding
[Figure: datapath with forwarding, showing the MUX, Zero?, ALU, and Data Memory between the D/A, A/M, and M/W buffers]

Load Delay Due to Data Hazard
    LOAD R1,0(R2)
    SUB  R4,R1,R6
    AND  R6,R1,R7
    OR   R8,R1,R9
[Figure: pipeline diagram of the four instructions against time (clock cycles), stages IM, Reg, ALU, DM, Reg]
Load delay = 2 cycles

Load Delay with Forwarding
    LOAD R1,0(R2)
    SUB  R4,R1,R6
    AND  R6,R1,R7
    OR   R8,R1,R9
[Figure: pipeline diagram of the four instructions against time (clock cycles), stages IM, Reg, ALU, DM, Reg]
We need to add hardware, called a pipeline interlock
Load delay with forwarding = 1 cycle

Software Scheduling to Avoid Load Hazards
Try to produce fast code for
    a = b + c;
    d = e - f;
assuming a, b, c, d, e, and f are in memory.
Slow code (with forwarding):
    LW  Rb,b
    LW  Rc,c
    ADD Ra,Rb,Rc    (RAW stall)
    SW  a,Ra
    LW  Re,e
    LW  Rf,f
    SUB Rd,Re,Rf    (RAW stall)
    SW  d,Rd
Fast code:
    LW  Rb,b
    LW  Rc,c
    LW  Re,e
    ADD Ra,Rb,Rc
    LW  Rf,f
    SW  a,Ra
    SUB Rd,Re,Rf
    SW  d,Rd
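The difference between the two schedules can be counted mechanically. This sketch assumes forwarding with a 1-cycle load delay, so the only stall modeled is an instruction that reads a register loaded by the instruction immediately before it. The `(op, dest, sources)` encoding and the function name are hypothetical.

```python
def count_load_stalls(program):
    """Count load-use stalls, assuming forwarding and a 1-cycle load delay.

    program is a list of (op, dest, sources) tuples. A stall is charged when
    an instruction reads the register written by the LW directly before it.
    """
    stalls = 0
    for prev, curr in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, curr_sources = curr
        if prev_op == "LW" and prev_dest in curr_sources:
            stalls += 1
    return stalls


slow = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("ADD", "Ra", ("Rb", "Rc")),
    ("SW", None, ("Ra",)),
    ("LW", "Re", ()), ("LW", "Rf", ()), ("SUB", "Rd", ("Re", "Rf")),
    ("SW", None, ("Rd",)),
]
fast = [
    ("LW", "Rb", ()), ("LW", "Rc", ()), ("LW", "Re", ()),
    ("ADD", "Ra", ("Rb", "Rc")), ("LW", "Rf", ()), ("SW", None, ("Ra",)),
    ("SUB", "Rd", ("Re", "Rf")), ("SW", None, ("Rd",)),
]

print(count_load_stalls(slow))  # ADD and SUB each follow the load they depend on
print(count_load_stalls(fast))  # every load is separated from its first use
```

Both sequences contain the same eight instructions; only the order differs.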

Compiler Avoiding Load Stalls
% of loads stalling the pipeline:
    Benchmark   Scheduled   Unscheduled
    tex         25%         65%
    spice       14%         42%
    gcc         31%         54%

Pipelined DLX Datapath
[Figure: pipelined DLX datapath with IF, ID, EX, Mem, and WB stages separated by the F/D, D/A, A/M, and M/W buffers; components: PC, Instr. Memory, Add, Reg File, Sign Ext, MUX, Zero?, ALU, Data Memory, SMD, LMD; annotations mark the branch address calculation, the condition decision, and the branch decision for the target address]

Control Hazard on Branches: Three Stall Cycles
Program execution order (in instructions):
    40  BEQ R1, R3, …
    44  AND R12, R2, R5
    48  OR  R13, R6, R2
    52  ADD R14, R2, R2
    80  LD  R4, R7, 100
[Figure: pipeline diagram of these instructions against time (clock cycles CC1 to CC9), stages IM, Reg, ALU, DM, Reg, marking the cycle in which the branch target becomes available]
- The instructions following the branch shouldn’t be executed when the branch condition is true!
- Branch delay = 3 cycles

Control Hazard on Branches: Three Stall Cycles
    Branch instruction    IF    ID    EX    MEM   WB
    Branch successor            (3 wasted cycles) IF    ID    EX    MEM
    Branch successor + 1                                IF    ID    EX
    Branch successor + 2                                      IF    ID
3 wasted clock cycles for the TAKEN branch:
- We don’t know yet that the instruction being executed is a branch. Fetch the branch successor.
- Now we know that the instruction being executed is a branch, but we stall until the branch target address is known.
- Now the target address is available.

Branch Stall Impact
If CPI = 1, 30% of instructions are branches, and each branch stalls 3 cycles, then new CPI = 1 + 0.3 x 3 = 1.9
- About half of the ideal speed
Two-part solution:
- Determine whether the branch is TAKEN or NOT TAKEN sooner, AND
- Compute the TAKEN branch address (branch target) earlier
DLX branch tests whether a register is equal to 0 or not equal to 0
DLX solution: get the new PC earlier
- Move the zero test to the ID stage
- Add an adder to calculate the new PC (taken PC) in the ID stage
- 1 clock cycle penalty for a branch, in contrast to 3 cycles
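The CPI figure follows from adding the average branch stall per instruction to the base CPI. A minimal sketch, with a hypothetical function name, comparing the 3-cycle penalty with the 1-cycle penalty of the revised pipeline:

```python
def cpi_with_branch_stalls(base_cpi, branch_fraction, penalty_cycles):
    """Effective CPI when every branch adds penalty_cycles stall cycles."""
    return base_cpi + branch_fraction * penalty_cycles


cpi_3_cycle = cpi_with_branch_stalls(1.0, 0.30, 3)  # original pipeline
cpi_1_cycle = cpi_with_branch_stalls(1.0, 0.30, 1)  # zero test and adder in ID

print(f"CPI with 3-cycle penalty: {cpi_3_cycle:.2f}")
print(f"CPI with 1-cycle penalty: {cpi_1_cycle:.2f}")
print(f"Fraction of ideal speed (3-cycle): {1.0 / cpi_3_cycle:.2f}")
```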

Pipelined DLX Datapath
[Figure: revised pipelined DLX datapath, with hardware added to get the target address earlier and to get the condition earlier]
- The target address is available after ID
- When a branch instruction is in the Execute stage, the next address is available

Branch Behavior in Programs
Conditional branch frequencies
- integer: average … to 16%
- floating point: … to 12%
Forward and backward taken branches
- forward taken: …%
- backward taken: …%
- the average of all conditional branches: …%

Branch Hazard Alternatives
- Stall until branch direction is clear
- Predict branch NOT TAKEN
- Predict branch TAKEN
- Delayed branch

Branch Hazard Alternatives: (1) STALL
Stall until branch direction is clear
3-cycle penalty:
    Branch instruction    IF    ID    EX    MEM   WB
    Branch successor            stall stall stall IF    ID    EX    MEM
    Branch successor + 1                                IF    ID    EX
    Branch successor + 2                                      IF    ID
Revised DLX pipeline (get the branch address at EX), 1-cycle penalty (branch delay slot):
    Branch instruction    IF    ID    EX    MEM   WB
    Branch successor            stall IF    ID    EX    MEM   WB
    Branch successor + 1                    IF    ID    EX    MEM
    Branch successor + 2                          IF    ID

Branch Hazard Alternatives: (2) Predict Branch “NOT TAKEN”
- Execute successor instructions in sequence
- PC+4 is already calculated, so use it to get the next instruction
- Flush instructions in the pipeline if the branch is actually TAKEN
- Advantage of late pipeline state update
- 47% of DLX branches are NOT TAKEN on average
NOT TAKEN branch, no penalty:
    instruction i (branch)  IF    ID    EX    MEM   WB
    instruction i+1               IF    ID    EX    MEM   WB
    instruction i+2                     IF    ID    EX    MEM   WB
TAKEN branch, 1-cycle penalty:
    instruction i (branch)  IF    ID    EX    MEM   WB
    instruction i+1               IF    ID    EX    MEM   WB    (flush this instruction in progress)
    instruction T                       IF    ID    EX    MEM   WB

Branch Hazard Alternatives: (3) Predict Branch “TAKEN”
- 53% of DLX branches are TAKEN on average
- The branch target address is available after ID in DLX, so DLX still incurs a 1-cycle branch penalty for a TAKEN branch
- Other machines: branch target known before outcome
NOT TAKEN branch, 2-cycle penalty in DLX (1 in other machines):
    instruction i (branch)  IF    ID    EX    MEM   WB
    instruction T                 stall IF
    instruction i+1                           IF    ID    EX    MEM   WB
TAKEN branch, 1-cycle penalty in DLX (0 in other machines):
    instruction i (branch)  IF    ID    EX    MEM   WB
    instruction T                 stall IF    ID    EX    MEM   WB
    instruction T+1                           IF    ID    EX    MEM   WB
The TAKEN address is not available during the stall cycle; it is available when instruction T is fetched

Branch Hazard Alternatives: (4) Delayed Branch
Delay the branch so that it takes place AFTER a successor instruction
Delayed branch of length n:
    branch instruction
    sequential successor 1
    sequential successor 2
    ...
    sequential successor n
    branch target if taken
- A 1-slot delayed branch allows the proper decision and branch target address in the 5-stage DLX pipeline with the control hazard improvement

Delayed Branch
Where to get instructions to fill the branch delay slot?
- From before the branch instruction
- From the target address: only valuable when the branch is TAKEN
- From fall through: only valuable when the branch is NOT TAKEN
- Canceling branches allow more slots to be filled
Compiler effectiveness for a single delayed branch slot:
- Fills about 60% of delayed branch slots
- About 80% of instructions executed in delayed branch slots are useful in computation
- About 50% (60% x 80%) of slots are usefully filled

Branch Hazard Alternatives: Delayed Branch
From before:
    ADD R1, R2, R3                      if R2=0 then
    if R2=0 then             becomes        ADD R1, R2, R3   (delay slot)
        (delay slot)
- Always improves performance
- The branch must not depend on the rescheduled instructions
From target:
    SUB R4, R5, R6                      ADD R1, R2, R3
    ADD R1, R2, R3           becomes    if R1=0 then
    if R1=0 then                            SUB R4, R5, R6   (delay slot)
        (delay slot)
- Improves performance when TAKEN (loop)
- It must be alright to execute the rescheduled instructions if NOT TAKEN
- May need to duplicate the instruction if it is the target of another branch instruction
From fall through:
    ADD R1, R2, R3                      ADD R1, R2, R3
    if R1=0 then             becomes    if R1=0 then
        (delay slot)                        SUB R4, R5, R6   (delay slot)
    SUB R4, R5, R6
- Improves performance when NOT TAKEN
- It must be alright to execute the rescheduled instructions if TAKEN

Limitations on Delayed Branch
Difficulty in finding useful instructions to fill the delayed branch slots
Solution: squashing
- The delayed branch is associated with a branch prediction
- Instructions in the predicted path are executed in the delayed branch slot
- If the branch outcome is mispredicted, instructions in the delayed branch slot are squashed (discarded)

Canceling Branch
Used when delayed branch scheduling, i.e., filling the delay slot, cannot be done due to
- Restrictions on scheduling instructions into the delay slots
- Limitations on the ability to predict at compile time whether the branch will be TAKEN or NOT TAKEN
The instruction includes the direction in which the branch was predicted
- When the branch behaves as predicted, the instructions in the delay slot are executed
- When the branch is incorrectly predicted, the instructions in the delay slot are turned into no-ops
A canceling branch allows the delay slot to be filled even if the instruction placed in it does not meet the requirements

Evaluating Branch Alternatives
$$\text{Pipeline speedup} = \frac{\text{Pipeline depth}}{\text{CPI}} = \frac{\text{Pipeline depth}}{1 + \text{Branch frequency} \times \text{Branch penalty}}$$
Conditional and unconditional branches are collectively 14% of instructions; 65% of branches are TAKEN
    Scheduling scheme    Branch penalty   CPI                       Speedup vs unpipelined   Speedup vs stall
    Stall pipeline       3                1 + 0.14 x 3 = 1.42       5/1.42 = 3.52            1.00
    Predict Taken        1                1 + 0.14 x 1 = 1.14       5/1.14 = 4.39            1.25
    Predict Not Taken    0.65             1 + 0.14 x 0.65 = 1.09    5/1.09 = 4.59            1.30
    Delayed branch       0.5              1 + 0.14 x 0.5 = 1.07     5/1.07 = 4.67            1.33
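The comparison can be reproduced from the speedup formula. This sketch uses the slide's parameters (pipeline depth 5, 14% branch frequency) and the per-scheme penalty multipliers shown in the CPI expressions. The dictionary layout and function name are illustrative assumptions.

```python
PIPELINE_DEPTH = 5
BRANCH_FREQUENCY = 0.14

# Average penalty in cycles per branch. For predict-not-taken the 1-cycle
# penalty applies only to the 65% of branches that are taken.
PENALTY = {
    "Stall pipeline": 3.0,
    "Predict taken": 1.0,
    "Predict not taken": 0.65,
    "Delayed branch": 0.5,
}


def evaluate(penalties, depth=PIPELINE_DEPTH, freq=BRANCH_FREQUENCY):
    """Return {scheme: (cpi, speedup_vs_unpipelined, speedup_vs_stall)}."""
    stall_cpi = 1 + freq * penalties["Stall pipeline"]
    results = {}
    for scheme, penalty in penalties.items():
        cpi = 1 + freq * penalty
        results[scheme] = (cpi, depth / cpi, stall_cpi / cpi)
    return results


results = evaluate(PENALTY)
for scheme, (cpi, vs_unpipelined, vs_stall) in results.items():
    print(f"{scheme:18s} CPI={cpi:.3f}  vs unpipelined={vs_unpipelined:.2f}  "
          f"vs stall={vs_stall:.2f}")
```

The script keeps full precision for the CPI, so a result can differ in the last digit from one computed by hand from a CPI already rounded to two decimals.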

Static (Compiler) Prediction of Taken/Untaken Branches: Code Motion
        LW   R1, 0(R2)
        SUB  R1, R1, R3     (depends on LW, needs to stall)
        BEQZ R1, L
        OR   R4, R5, R6
        ADD  R10, R4, R3
    L:  ADD  R7, R8, R9
- NOT TAKEN: if the branch is almost always NOT TAKEN, R4 is not needed on the taken path, and R5 and R6 are not modified in the following instruction(s), then moving OR R4, R5, R6 up can increase speed
- TAKEN: if the branch is almost always TAKEN, R7 is not needed on the fall-through path, and R8 and R9 are not modified on the fall-through path, then moving ADD R7, R8, R9 up can increase speed

Static (Compiler) Prediction of Taken/Untaken Branches
Improves the strategy for placing instructions in the delay slot
Two strategies
- Direction-based prediction: predict backward branches TAKEN, forward branches NOT TAKEN
- Profile-based prediction: record branch behavior and predict each branch based on the prior run(s)
[Figure: two bar charts over the benchmarks alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, tomcatv: frequency of misprediction (0% to 70%) and misprediction rate (0% to 14%); series labels: always taken; taken backwards, not taken forwards]
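Both strategies amount to a fixed, compile-time decision per branch. The sketch below illustrates them under simple assumptions: a branch is "backward" when its target address is lower than its own address, and the profile is just a pair of taken and not-taken counts from an earlier run. The function names and the example addresses are hypothetical.

```python
def predict_taken_direction_based(branch_pc, target_pc):
    """Predict TAKEN for backward branches (typical of loops),
    NOT TAKEN for forward branches."""
    return target_pc < branch_pc


def predict_taken_profile_based(taken_count, not_taken_count):
    """Predict the direction the branch took most often in the profiled run."""
    return taken_count > not_taken_count


# A loop-closing branch at address 120 jumping back to address 80.
print(predict_taken_direction_based(branch_pc=120, target_pc=80))
# A forward branch at address 40 skipping ahead to address 80.
print(predict_taken_direction_based(branch_pc=40, target_pc=80))
# A branch that was taken 3 times and not taken 97 times in the profile.
print(predict_taken_profile_based(taken_count=3, not_taken_count=97))
```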

Evaluating Static Branch Prediction Strategies
- Misprediction rate ignores the frequency of branches
- Instructions between mispredicted branches is a better metric
[Figure: bar chart of instructions per mispredicted branch for alvinn, compress, doduc, espresso, gcc, hydro2d, mdljsp2, ora, swm256, tomcatv, comparing profile-based and direction-based prediction]

Pipelining Summary
- Just overlap tasks; this is easy if the tasks are independent
- Speedup <= pipeline depth; if the ideal CPI is 1, then:
$$\text{Speedup} = \frac{\text{Pipeline Depth}}{1 + \text{Pipeline stall CPI}} \times \frac{\text{Clock Cycle}_{\text{unpipelined}}}{\text{Clock Cycle}_{\text{pipelined}}}$$
- Hazards limit performance on computers:
  - Structural: need more HW resources
  - Data: need forwarding, compiler scheduling
  - Control: dynamic prediction, delayed branch slot, static (compiler) prediction