COMP381 by M. Hamdi 1 Pipelining Control Hazards and Deeper pipelines.

Presentation transcript:

COMP381 by M. Hamdi 1 Pipelining Control Hazards and Deeper pipelines

COMP381 by M. Hamdi 2 Four Branch Hazard Alternatives
#1: Stall until branch direction is clear
#2: Predict Branch Not Taken
–Execute successor instructions in sequence
–“Squash” instructions in pipeline if branch actually taken
–Advantage of late pipeline state update
–47% MIPS branches not taken on average
–PC+4 already calculated, so use it to get next instruction
#3: Predict Branch Taken
–53% MIPS branches taken on average
–But haven’t calculated branch target address in MIPS
MIPS still incurs 1 cycle branch penalty
Other machines: branch target known before outcome

COMP381 by M. Hamdi 3 Four Branch Hazard Alternatives
#4: Delayed Branch (Compiler help)
–Define branch to take place AFTER a following instruction (branch delay of length n):
branch instruction
sequential successor 1
sequential successor 2
...
sequential successor n
branch target if taken
–1 slot delay allows proper decision and branch target address in 5 stage pipeline
–MIPS uses this

COMP381 by M. Hamdi 4 Reduction of Branch Penalties: Delayed Branch
When delayed branch is used, the branch is delayed by n cycles, following this execution pattern:
conditional branch instruction
sequential successor 1
sequential successor 2
...
sequential successor n
branch target if taken
The sequential successor instructions are said to be in the branch delay slots. These instructions are executed whether or not the branch is taken.

COMP381 by M. Hamdi 5 Delayed Branch Example

COMP381 by M. Hamdi 6 Reduction of Branch Penalties: Delayed Branch
In practice, all machines that utilize delayed branching have a single instruction delay slot.
The job of the compiler is to make the successor instructions valid and useful instructions.
–Fills about 60% of branch delay slots
–About 80% of instructions executed in branch delay slots useful in computation
–About 50% (60% x 80%) of slots usefully filled

COMP381 by M. Hamdi 7 Delayed Branch-delay Slot Scheduling Strategies
The branch-delay slot instruction can be chosen from three cases:
A An independent instruction from before the branch: Always improves performance when used. The branch must not depend on the rescheduled instruction.
B An instruction from the target of the branch: Improves performance if the branch is taken and may require instruction duplication. This instruction must be safe to execute if the branch is not taken.
C An instruction from the fall-through instruction stream: Improves performance when the branch is not taken. The instruction must be safe to execute when the branch is taken.

COMP381 by M. Hamdi 8 (A) (B) (C)

COMP381 by M. Hamdi 9 Delayed Branch
Instruction in branch delay slot is always executed.
Compiler (tries to) move a useful instruction into delay slot.
(a) From before the Branch: Always helpful when possible
Original sequence:        After scheduling:
ADD R1, R2, R3
BEQZ R2, L1               BEQZ R2, L1
DELAY SLOT                ADD R1, R2, R3
-                         -
L1:                       L1:
If the ADD instruction were: ADD R2, R1, R3 the move would not be possible

COMP381 by M. Hamdi 10 Delayed Branch
(b) From the Target: Helps when branch is taken. May duplicate instructions
Original sequence:        After scheduling:
ADD R2, R1, R3            ADD R2, R1, R3
BEQZ R2, L1               BEQZ R2, L2
DELAY SLOT                SUB R4, R5, R6
-                         -
L1: SUB R4, R5, R6        L1: SUB R4, R5, R6
L2:                       L2:
Instructions between BEQZ and SUB (in fall through) must not use R4.

COMP381 by M. Hamdi 11 Delayed Branch
(c) From Fall Through: Helps when branch is not taken.
Original sequence:        After scheduling:
ADD R2, R1, R3            ADD R2, R1, R3
BEQZ R2, L1               BEQZ R2, L1
DELAY SLOT                SUB R4, R5, R6
SUB R4, R5, R6            -
-
L1:                       L1:
Instructions at target (L1 and after) must not use R4 till set again.
Cancelling (Nullifying) Branch: Branch instruction indicates direction of prediction. If mispredicted, the instruction in the delay slot is cancelled. Greater flexibility for compiler to schedule instructions.

COMP381 by M. Hamdi 12 Branch-delay Slot: Canceling Branches
In a canceling branch, a static compiler branch direction prediction is included with the branch-delay slot instruction.
When the branch goes as predicted, the instruction in the branch delay slot is executed normally.
When the branch does not go as predicted, the instruction is turned into a no-op.
Canceling branches eliminate the conditions on instruction selection in delay-slot scheduling strategies B and C.
The effectiveness of this method depends on whether we predict the branch correctly.
In practice 50% of time, we have no stalls (nop).

COMP381 by M. Hamdi 13 Performance of Branch Schemes
The effective pipeline speedup with branch penalties (assuming an ideal pipeline CPI of 1):
Pipeline speedup = Pipeline depth / (1 + Pipeline stall cycles from branches)
Pipeline stall cycles from branches = Branch frequency x Branch penalty
Pipeline speedup = Pipeline depth / (1 + Branch frequency x Branch penalty)
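
As a quick check of the speedup formula, here is a minimal Python sketch. The function name and the sample numbers (a 5-stage pipeline, 20% branch frequency, 3-cycle penalty) are illustrative assumptions, not values from the slides.

```python
def pipeline_speedup(depth, branch_frequency, branch_penalty):
    """Speedup over an unpipelined machine, assuming an ideal CPI of 1
    and that branches are the only source of stalls."""
    stall_cycles = branch_frequency * branch_penalty
    return depth / (1 + stall_cycles)


# With no branch stalls the speedup equals the pipeline depth.
print(pipeline_speedup(5, 0.0, 0))

# Hypothetical workload: 20% branches, 3-cycle penalty.
print(round(pipeline_speedup(5, 0.20, 3), 3))
```

The second call shows how quickly branch stalls erode the ideal speedup: 0.6 stall cycles per instruction cuts a depth-5 pipeline to a little over 3x.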

COMP381 by M. Hamdi 14 Evaluating Branch Alternatives
Table columns: Scheduling scheme | Branch penalty | CPI | Speedup v. unpipelined
Table rows: Stall pipeline | Predict taken | Predict not taken | Delayed branch
Conditional & unconditional branches = 14% of instructions; 65% of them change the PC (taken)
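
The way the 14% and 65% figures enter the CPI can be sketched as follows. The branch frequency and taken fraction come from the slide. The per-scheme penalties (3, 1, 1 and 0.5 cycles) and the pipeline depth of 5 are assumptions chosen for illustration, since the table values are not present in this transcript.

```python
BRANCH_FREQ = 0.14     # from the slide: conditional + unconditional branches
TAKEN_FRACTION = 0.65  # from the slide: branches that change the PC
DEPTH = 5              # assumed pipeline depth

# Stall cycles per instruction for each scheme, using assumed penalties.
schemes = {
    "Stall pipeline": BRANCH_FREQ * 3,                      # every branch pays 3
    "Predict taken": BRANCH_FREQ * 1,                       # every branch pays 1
    "Predict not taken": BRANCH_FREQ * TAKEN_FRACTION * 1,  # only taken branches pay
    "Delayed branch": BRANCH_FREQ * 0.5,                    # about half the slots wasted
}


def evaluate(stalls_per_instruction, depth=DEPTH):
    """Return (CPI, speedup over unpipelined) for an ideal CPI of 1."""
    cpi = 1 + stalls_per_instruction
    return cpi, depth / cpi


for name, stalls in schemes.items():
    cpi, speedup = evaluate(stalls)
    print(f"{name:18s} CPI={cpi:.2f} speedup={speedup:.2f}")
```

Only the predict-not-taken row uses the taken fraction, because under the assumed penalties that scheme pays only when the branch actually changes the PC.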

COMP381 by M. Hamdi 15 Delayed Branch
Limitations of delayed branch
–Compiler may not find appropriate instructions to fill delay slots. Then it fills delay slots with no-ops.
–Visible architectural feature, likely to change with new implementations
Pipeline structure is exposed to compiler. Need to know how many delay slots.

COMP381 by M. Hamdi 16 Delayed Branch
Compiler effectiveness for single branch delay slot:
–Fills about 60% of branch delay slots
–About 80% of instructions executed in branch delay slots useful in computation
–About 50% (60% x 80%) of slots usefully filled
Delayed Branch downside: As processors go to deeper pipelines and multiple issue, the branch delay grows and more than one delay slot is needed
–Delayed branching has lost popularity compared to more expensive but more flexible dynamic approaches
–Growth in available transistors has made dynamic approaches relatively cheaper

COMP381 by M. Hamdi 17 Dynamic Branch Prediction
Builds on the premise that history matters
–Observe the behavior of branches in previous instances and try to predict future branch behavior
–Try to predict the outcome of a branch early on in order to avoid stalls
–Branch prediction is critical for multiple issue processors
In an n-issue processor, branches will come n times faster than in a single issue processor

COMP381 by M. Hamdi 18 Basic Branch Predictor
Use a 1-bit branch predictor buffer or branch history table
1 bit of memory stating whether the branch was recently taken or not
Bit entry updated each time the branch instruction is executed
State diagram: State 0 (Predict Not Taken) stays in State 0 on NT and moves to State 1 on T; State 1 (Predict Taken) stays in State 1 on T and moves to State 0 on NT

COMP381 by M. Hamdi 19 1-bit Branch Prediction Buffer
Problem: even simplest branches are mispredicted twice
LD R1, #5
Loop: LD R2, 0(R5)
ADD R2, R2, R4
STORE R2, 0(R5)
ADD R5, R5, #4
SUB R1, R1, #1
BNEZ R1, Loop
First time: prediction = 0 but the branch is taken, so change prediction to 1 (miss)
Time 2, 3, 4: prediction = 1 and the branch is taken
Time 5: prediction = 1 but the branch is not taken, so change prediction to 0 (miss)
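
The two mispredictions can be reproduced with a small Python simulation of a 1-bit predictor. This is a sketch for a single branch with its own history bit; the function name and the boolean encoding of outcomes (True = taken) are assumptions of the example.

```python
def simulate_one_bit(outcomes, initial_state=0):
    """Count mispredictions of a 1-bit predictor for one branch.

    outcomes: sequence of booleans, True meaning the branch was taken.
    State 1 predicts taken, state 0 predicts not taken.
    """
    state = initial_state
    misses = 0
    for taken in outcomes:
        predicted_taken = (state == 1)
        if predicted_taken != taken:
            misses += 1
        # The bit always records the most recent outcome.
        state = 1 if taken else 0
    return misses


# BNEZ in the loop above: taken four times, then falls through.
loop_outcomes = [True, True, True, True, False]
print(simulate_one_bit(loop_outcomes))      # 2

# Running the same loop a second time mispredicts twice again.
print(simulate_one_bit(loop_outcomes * 2))  # 4
```

Because the exit of one run leaves the bit at 0, every later run of the loop starts with a wrong prediction again, which is why the slide calls out two misses per loop execution.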

COMP381 by M. Hamdi 20 Dynamic Branch Prediction Accuracy

COMP381 by M. Hamdi 21 Deeper pipelines

COMP381 by M. Hamdi 22 Superpipelining: MIPS R4000 Integer pipeline
8 Stage Pipeline:
–IF: first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access.
–IS: second half of access to instruction cache.
–RF: instruction decode and register fetch, hazard checking and also instruction cache hit detection.

COMP381 by M. Hamdi 23 Superpipelining: MIPS R4000 Integer pipeline
8 Stage Pipeline:
–EX: execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation.
–DF: data fetch, first half of access to data cache.
–DS: second half of access to data cache.
–TC: tag check, determine whether the data cache access hit.
–WB: write back for loads and register-register operations.
8 Stages: How many stalls occur due to load dependencies and control hazards?

COMP381 by M. Hamdi 24 Stalls in MIPS R4000
[Pipeline diagram: successive instructions flowing through IF IS RF EX DF DS TC WB, each starting one cycle after the previous one]
TWO Cycle Load Latency
[Pipeline diagram: successive instructions flowing through IF IS RF EX DF DS TC WB, each starting one cycle after the previous one]
THREE Cycle Branch Latency (conditions evaluated during EX phase)
–Delay slot plus two stalls
–Branch likely cancels delay slot if not taken

COMP381 by M. Hamdi 25 Floating Point/Multicycle Pipelining in MIPS
Completion of MIPS EX stage floating point arithmetic operations in one or two cycles is impractical since it requires:
–A much longer CPU clock cycle, and/or
–An enormous amount of logic.
Instead, the floating-point pipeline will allow for a longer latency.
Floating-point operations have the same pipeline stages as the integer instructions with the following differences:
–The EX cycle may be repeated as many times as needed.
–There may be multiple floating-point functional units.
–A stall will occur if the instruction to be issued either causes a structural hazard for the functional unit or causes a data hazard.

COMP381 by M. Hamdi 26 Floating Point/Multicycle Pipelining in MIPS
The latency of a functional unit is defined as the number of intervening cycles between an instruction producing the result and the instruction that uses the result (usually equals stall cycles with forwarding used).
The initiation or repeat interval is the number of cycles that must elapse between issuing two operations of a given type.

COMP381 by M. Hamdi 27 Extending The MIPS Pipeline to Handle Floating-Point Operations: Adding Non-Pipelined Floating Point Units (In Appendix A)

COMP381 by M. Hamdi 28 Extending The MIPS Pipeline: Multiple Outstanding Floating Point Operations
[Pipeline figure: IF and ID feed four parallel functional units, which rejoin at MEM and WB]
Integer Unit (EX): Latency = 0, Initiation Interval = 1
FP Adder: Latency = 3, Initiation Interval = 1, Pipelined
Floating Point (FP)/Integer Multiply: Latency = 6, Initiation Interval = 1, Pipelined
FP/Integer Divider: Latency = 24, Initiation Interval = 25, Non-pipelined
Hazards: RAW, WAW possible; WAR not possible; Structural: possible; Control: possible

COMP381 by M. Hamdi 29 Latencies and Initiation Intervals For Functional Units
Functional Unit                           Latency   Initiation Interval
Integer ALU                               0         1
Data Memory (Integer and FP Loads)        1         1
FP add                                    3         1
FP multiply (also integer multiply)       6         1
FP divide (also integer divide)           24        25
Latency usually equals stall cycles when full forwarding is used
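
The table can be turned into a small lookup to estimate stalls. The sketch below is a simplification under stated assumptions: it treats latency as the stall count for a dependent instruction that needs its operand at the start of its own execute stage, and the names `UNITS`, `raw_stalls` and `earliest_issue` are hypothetical.

```python
# (latency, initiation interval) per functional unit, copied from the table.
UNITS = {
    "integer_alu": (0, 1),
    "data_memory": (1, 1),
    "fp_add": (3, 1),
    "fp_multiply": (6, 1),
    "fp_divide": (24, 25),
}


def raw_stalls(producer_unit, instructions_between=0):
    """Stall cycles seen by a consumer of the producer's result when
    `instructions_between` independent instructions separate the two."""
    latency, _ = UNITS[producer_unit]
    return max(0, latency - instructions_between)


def earliest_issue(unit, previous_issue_cycle):
    """Earliest cycle at which another operation can be issued to `unit`
    after one was issued at `previous_issue_cycle`."""
    _, interval = UNITS[unit]
    return previous_issue_cycle + interval


print(raw_stalls("fp_multiply"))        # 6
print(raw_stalls("fp_add", 2))          # 1
print(earliest_issue("fp_divide", 10))  # 35
print(earliest_issue("fp_add", 10))     # 11
```

The divide row shows why the initiation interval matters separately from latency: the divider is not pipelined, so a second divide must wait 25 cycles even if it does not depend on the first.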

COMP381 by M. Hamdi 30 Pipeline Characteristics With FP
Instructions are still processed in-order in IF, ID, EX at the rate of one instruction per cycle.
Longer RAW hazard stalls likely due to long FP latencies.
Structural hazards possible due to varying instruction times and FP latencies:
–An FP unit may not be available (the non-pipelined divider in this case).
–MEM, WB reached by several instructions simultaneously.
WAW hazards can occur since it is possible for instructions to reach WB out-of-order.
WAR hazards impossible, since register reads occur in-order in ID.
Instructions are allowed to complete out-of-order, requiring special measures to enforce precise exceptions.

COMP381 by M. Hamdi 31 FP Operations Pipeline Timing Example
[Timing diagram over CC 1 to CC 11 for MUL.D, L.D, ADD.D and S.D]
MUL.D: IF ID M1 M2 M3 M4 M5 M6 M7 MEM WB
ADD.D: IF ID A1 A2 A3 A4 MEM WB
L.D: IF ID EX MEM WB
S.D: IF ID EX MEM WB
All above instructions are assumed independent

COMP381 by M. Hamdi 32 FP Code RAW Hazard Stalls Example (with full data forwarding in place)
L.D F4, 0(R2)
MUL.D F0, F4, F6
ADD.D F2, F0, F8
S.D F2, 0(R2)
[Timing diagram over CC 1 to CC 18 showing the STALL cycles of each instruction]
Third stall due to structural hazard in MEM stage
6 stall cycles which equals latency of FP multiply functional unit

COMP381 by M. Hamdi 33 Dealing with RAW
Longer latency pipes cause the frequency of RAW stalls to go up.
–More complicated forwarding
–Frequent compiler scheduling
–More advanced techniques to be covered later

COMP381 by M. Hamdi 34 FP Code Structural Hazards Example
[Timing diagram over CC 1 to CC 11 for MULTD F0, F4, F6 (IF ID M1 to M7 MEM WB), LD F2, 0(R2) (IF ID EX MEM WB), ADDD F2, F4, F6 (IF ID A1 to A4 MEM WB) and intervening integer instructions (IF ID EX MEM WB)]

COMP381 by M. Hamdi 35 Dealing with Structural Hazards
Option 1: Track the use of the write port; stall the instruction in ID if there is a collision.
+Maintains the property of stalling instructions only in ID.
–Extra HW (e.g., write conflict logic).
Option 2: Stall a conflicting instruction at entry to MEM.
+Flexible in choosing which instruction to stall (give priority to the longest latency).
–Complicates pipeline control.
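
Option 1 can be sketched as a reservation table for the write port, checked while an instruction sits in ID. This is a simplified model, not the actual hardware: it assumes in-order issue, that an instruction leaving ID in cycle c with k execute stages reaches WB in cycle c + k + 2, and execute-stage counts of 1 (integer and loads), 4 (FP add) and 7 (FP multiply) as suggested by the stage labels in the timing diagrams. The function name `schedule` is hypothetical.

```python
def schedule(ex_stages_per_instr, first_id_cycle=2):
    """Issue instructions in order, stalling in ID on write-port conflicts.

    Returns a list of (cycle_leaving_ID, wb_cycle, stalls) per instruction.
    """
    reserved = set()  # WB cycles already claimed by issued instructions
    result = []
    id_cycle = first_id_cycle
    for ex_stages in ex_stages_per_instr:
        stalls = 0
        # Stay in ID until the resulting WB cycle is free.
        while id_cycle + stalls + ex_stages + 2 in reserved:
            stalls += 1
        leave = id_cycle + stalls
        wb = leave + ex_stages + 2
        reserved.add(wb)
        result.append((leave, wb, stalls))
        # In-order issue: the next instruction enters ID one cycle later.
        id_cycle = leave + 1
    return result


# FP multiply, two integer instructions, then an FP add.
for leave, wb, stalls in schedule([7, 1, 1, 4]):
    print(f"leaves ID in cycle {leave}, WB in cycle {wb}, stalls {stalls}")
```

In this run the FP add would reach WB in cycle 11 together with the multiply, so it is held in ID for one cycle and writes back in cycle 12 instead. The sketch tracks only the register write port, not contention for the MEM stage.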

COMP381 by M. Hamdi 36 Dealing with WAW Hazards
Option 1: Delay LD until ADDD enters MEM
Option 2: Stamp out the result of ADDD.