CSE 586 Computer Architecture


CSE 586 Computer Architecture Jean-Loup Baer http://www.cs.washington.edu/education/courses/586/00sp CSE 586 Spring 00

Highlights from last week. Performance metrics: use (weighted) arithmetic means for execution times; use (weighted) harmonic means for rates. CPU exec. time = Instruction count * (Σ CPIi * fi) * clock cycle time. We’ll talk about “contributions to the CPI” from, e.g., hazards in the pipeline, cache misses, branch mispredictions, etc.
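The execution-time formula above can be checked with a few lines of Python. This is only a sketch: the function names and the instruction mix are hypothetical and do not come from the slides; each class i is given as a (CPIi, fi) pair with the fi summing to 1.

```python
def average_cpi(classes):
    """Weighted CPI: sum of CPI_i * f_i over the instruction classes."""
    return sum(cpi_i * f_i for cpi_i, f_i in classes)


def cpu_exec_time(instruction_count, classes, clock_cycle_time):
    """CPU exec. time = instruction count * average CPI * clock cycle time."""
    return instruction_count * average_cpi(classes) * clock_cycle_time


# Hypothetical mix (not from the slides): (CPI_i, f_i) per instruction class.
mix = [(1.0, 0.5), (2.0, 0.3), (3.0, 0.2)]
cpi = average_cpi(mix)
seconds = cpu_exec_time(1_000_000, mix, 2e-9)  # 1M instructions, 2 ns cycle
print(cpi, seconds)
```

A hazard or a cache miss then shows up as an extra term added to the average CPI, which is what “contribution to the CPI” means in the rest of the deck.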

Highlights from last week (cont’d). ISA (RISC and CISC). RISC, where R stands for: Restricted (relatively small number of opcodes); Regular (all instructions have the same length); and also few instruction formats and addressing modes. RISC and load-store architectures are synonymous. CISC: fewer instructions executed but the CPI is larger; more complex to design. VLIW-EPIC (might talk about it later in the quarter).

Highlights from last week (cont’d). Basic pipelining: 5 stages: IF, ID, EX, MEM, WB. Pipeline registers between stages keep the data/control info needed in subsequent stages. Hazards: Structural (won’t happen in basic pipeline). Data dependencies: most can be removed via forwarding; otherwise stall (insert bubbles). Control.

[Figure: datapath of the 5-stage pipeline (IF, ID/RR, EXE, Mem, WB) with pipeline registers IF/ID, ID/EX, EX/MEM and MEM/WB; instruction memory, registers, ALU and data memory; plus the forwarding unit, the stall unit and the control unit.]

Control unit of simple pipeline. Everything about the instruction is known at the ID stage. If it can pass from the ID to the EXE stage, the instruction is issued. The control unit and forwarding unit take care of RAW dependencies, except for load dependencies, which are taken care of by the control and stall units. To insert a bubble, zero out all control fields in the relevant pipeline registers; also might need to prevent instruction fetch (signal preventing “read instruction memory”). The stall (aka hazard detection) unit can also be used for control hazards.

Branch statistics. Branches occur 25-30% of the time. Unconditional branches: 20% (of branches). Conditional (80%): 66% forward (i.e., slightly over 50% of total branches), evenly split between Taken and Not-Taken; 33% backward, almost all Taken. Probability that a branch is taken: p = 0.2 + 0.8 * (0.66 * 0.5 + 0.33) ≈ 0.7. In addition, call-return are always Taken.

Control hazards (branches). When do you know you have a branch? During the ID cycle. When do you know if the branch is Taken or Not-Taken? During the EXE cycle (e.g., for the MIPS). Easiest solution: stall one cycle after recognizing the branch (haz. det. unit); fetch the instruction following the branch after the EXE cycle. Cost: 2 cycles, or contribution to CPI due to branches = 2 x Branch freq. ≈ 0.5

Better (simple) schemes to handle branches. Comparison could be done during the ID stage: cost 1 cycle only. Needs more extensive forwarding plus fast comparison. Still might have to stall an extra cycle (like for loads). Predictions: static schemes (only software); dynamic schemes (hardware assists).

Simple static predictive schemes. Predict branch Not-Taken (easy but not the best scheme). If the prediction is correct, no problem. If the prediction is incorrect, and this is known during the EX cycle, zero out (flush) the pipeline registers of the (two) already fetched instructions following the branch. With this technique, contribution to CPI due to conditional branches: 0.25 * (0.7 * 2 + 0.3 * 0) = 0.35. The problem is that we are optimizing for the less frequent case! Nonetheless it will be the “default” for dynamic branch prediction since it is so easy to implement.

Static schemes (cont’d). Predict branch Taken: interesting only if the target address can be computed before the decision is known. With this technique, contribution to CPI due to conditional branches: 0.25 * (0.7 * 1 + 0.3 * 2) = 0.325 ≈ 0.33. The 1 is there because you need to compute the branch address.

Static schemes (cont’d). Prediction depends on the direction of the branch: Backward-Taken-Forward-Not-Taken (BTFNT). With the same assumptions as before, contribution to the CPI: 0.25 * (0.33 * 1 + 0.66 * 0.5 * 2 + 0.66 * 0.5 * 0) ≈ 0.25 (the first term corresponds to backward taken and the next two to forward not-taken). Prediction based on opcode: a bit in the opcode for prediction T/NT (can be set after profiling the application).
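The CPI arithmetic on the branch-statistics and static-scheme slides can be re-derived with a short script. It is a sketch that simply plugs in the slides' own numbers (branch frequency 0.25, rounded taken probability 0.7, the penalties quoted for each scheme); the helper name `cpi_contribution` is made up for illustration.

```python
def cpi_contribution(branch_freq, cases):
    """branch_freq * sum(probability * penalty) over the (probability, penalty) cases."""
    return branch_freq * sum(prob * penalty for prob, penalty in cases)


# Probability that a branch is taken, from the branch-statistics slide.
p_taken = 0.2 + 0.8 * (0.66 * 0.5 + 0.33)

BRANCH_FREQ = 0.25

# Always stall: every branch costs 2 cycles.
stall_always = cpi_contribution(BRANCH_FREQ, [(1.0, 2)])
# Predict Not-Taken: taken branches (0.7) cost 2, not-taken (0.3) cost 0.
predict_not_taken = cpi_contribution(BRANCH_FREQ, [(0.7, 2), (0.3, 0)])
# Predict Taken: taken cost 1 (target computation), not-taken cost 2.
predict_taken = cpi_contribution(BRANCH_FREQ, [(0.7, 1), (0.3, 2)])
# BTFNT: backward taken (0.33) cost 1; forward taken cost 2; forward not-taken cost 0.
btfnt = cpi_contribution(BRANCH_FREQ, [(0.33, 1), (0.66 * 0.5, 2), (0.66 * 0.5, 0)])

for name, value in [("p_taken", p_taken), ("stall", stall_always),
                    ("predict NT", predict_not_taken),
                    ("predict T", predict_taken), ("BTFNT", btfnt)]:
    print(f"{name}: {value:.4f}")
```

The unrounded values are 0.728 for p, 0.325 for predict-Taken and 0.2475 for BTFNT, which the slides round to 0.7, 0.33 and 0.25.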

Penalties increase with deeper pipes and multiple issue machines. The 1 or 2 cycle penalty is “optimistic” because: many modern microprocessors have deeper pipes (8 to 20 stages), for example, separate decode and register read stages, or extra decoding stages to see if multiple instructions can be issued; CISC machines have more complex branch instructions. These simple schemes could yield penalties from 2 up to 12 cycles, i.e., from, say, 8 (2 * 4) to 48 (4 * 12) instruction issue slots if several instructions can be issued simultaneously.

Dynamic branch prediction. Execution of a branch requires knowledge of: (1) Whether there is a branch; one can surmise that every instruction is a branch for the purpose of guessing whether it will be taken or not taken (i.e., prediction can be done at the IF stage). (2) Whether the branch is Taken/Not-Taken (hence a branch prediction mechanism). (3) If the branch is taken, what the target address is (can be computed but can also be “precomputed”, i.e., stored in some table). (4) If the branch is taken, what the instruction at the branch target address is (saves the fetch cycle for that instruction).

Basic idea Use a Branch Prediction Table (BPT) How it will be indexed, updated etc. see later Can we make a prediction using BPT on the current branch; the prediction takes place during the IF stage and is acted upon during ID stage (when we know we have a branch) Case 1: Yes and the prediction was correct (known at EX stage) Case 2: Yes and the prediction was incorrect Case 3: No but the default prediction (NT) was correct Case 4: No and the default condition (NT) was incorrect CSE 586 Spring 00

Penalties (Predict/Actual). BPT improves only the T/T case. In what’s below, the number of stall cycles (bubbles) is for a simple pipe; it would be larger for deeper pipes. Case 1: NT/NT, 0 penalty; T/T, need to compute address: 0 or 1 bubble. Case 2: NT/T, 2 or 3 bubbles; T/NT, 2 or 3 bubbles. Case 3: NT/NT, 0 penalty. Case 4: NT/T, 2 or 3 bubbles.

Branch prediction tables and buffers Branch prediction table (BPT or branch history table BHT) How addressed (low-order bits of PC, hashing, cache-like) How much history in the prediction (1-bit, 2-bits, n-bits) Where is it stored (in a separate table, associated with the I-cache) Branch target buffers (BTB) BPT + address of target instruction (+ target instruction -- not implemented in current micros as far as I know--) Correlated branch prediction 2-level prediction Hybrid predictors Choose dynamically the best among 2 predictors CSE 586 Spring 00

Simplest design. BPT addressed by the lower bits of the PC. One-bit prediction: prediction = direction the branch took the last time it was executed. Will mispredict at the first and last iterations of a loop. Known implementation: Alpha 21064. The 1-bit table is associated with an I-cache line, one bit per line (4 instructions).
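The claim that a one-bit predictor misses on the first and last iterations of a loop is easy to see in a small simulation. This is a sketch for a single branch (no table, no aliasing); the function name and the 10-iteration loop are assumptions chosen for illustration.

```python
def one_bit_mispredictions(outcomes, last_outcome=False):
    """Count misses of a 1-bit predictor that predicts the previous outcome.

    outcomes: iterable of booleans (True = taken) for one branch.
    last_outcome: assumed initial content of the prediction bit (not taken).
    """
    misses = 0
    for taken in outcomes:
        if last_outcome != taken:
            misses += 1
        last_outcome = taken  # remember the direction of this execution
    return misses


# Loop-closing branch of a 10-iteration loop: taken 9 times, then falls through.
loop = [True] * 9 + [False]

print(one_bit_mispredictions(loop))      # one execution of the loop
print(one_bit_mispredictions(loop * 2))  # the loop is entered twice
```

Each time the loop is re-entered the bit still holds the not-taken outcome of the previous exit, so the miss on the first iteration is paid again on every entry, not only the first one.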

[Figure: variations on BPT design. Simple indexing: the PC indexes a table of counters. Cache-like: the PC indexes a table of tags and counters.]

Improve prediction accuracy (2-bit saturating counter scheme). Property: takes two wrong predictions before it changes T to NT (and vice-versa). [Figure: four-state diagram with two “predict taken” states and two “predict not taken” states, with transitions labeled taken / not taken; one state is marked “Generally, this is the initial state”.]

Two bit saturating counters. 2-bit scheme used in: Alpha 21164, UltraSparc, Pentium, Power PC 604 and 620 with variations, MIPS R10000, etc. PA-8000 uses a variation: majority of the last 3 outcomes (no need to update; just a shift register). Why not 3-bit (8 states) saturating counters? Performance studies show it’s not worthwhile.
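A 2-bit saturating counter for a single branch can be sketched as follows. The state encoding (0 to 3, predict taken when the counter is 2 or 3) and the initial state are assumptions; the slides do not fix them. Note that the “two wrong predictions” property holds when the counter starts from a saturated state.

```python
def two_bit_mispredictions(outcomes, state=3):
    """Count misses of one 2-bit saturating counter.

    state: 0 = strongly not taken ... 3 = strongly taken (assumed encoding).
    """
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Saturating update: move toward 3 on taken, toward 0 on not taken.
        state = min(3, state + 1) if taken else max(0, state - 1)
    return misses


# Same loop-closing branch as for the 1-bit scheme: 9 taken, then 1 not taken.
loop = [True] * 9 + [False]
print(two_bit_mispredictions(loop * 2))     # loop entered twice
print(two_bit_mispredictions([False] * 3))  # flips to NT only after two misses
```

Compared with the one-bit scheme, only the loop exit is mispredicted: the single not-taken outcome moves the counter from 3 to 2, which still predicts taken when the loop is entered again.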

Where to put the BPT Associated with I-cache lines 1 counter/instruction: Alpha 21164 2 counters/cache line (1 for every 2 instructions) : UltraSparc 1 counter/cache line (AMD K5) Separate table with cache-like tags direct mapped : 512 entries (MIPS R10000), 1K entries (Sparc 64), 2K + BTB (PowerPC 620) 4-way set-associative: 256 entries BTB (Pentium) 4-way set-associative: 512 entries BTB + “2-level”(Pentium Pro) CSE 586 Spring 00

Performance of BPT’s. Prediction accuracy is only one of several metrics. Other metrics: need to take into account branch frequencies; need to take into account penalties for misfetch (correct prediction but time needed to compute the address, e.g., for unconditional branches or T/T if no BTB) and for mispredict (incorrect branch prediction). These penalties might need to be multiplied by the number of instructions that could have been issued.

Prediction accuracy. 2-bit vs. 1-bit: significant gain (approx. 92% vs. 85% for f-p in Spec89 benchmarks, 90% vs. 80% in gcc, but about 88% for both in compress). Table size and organization: the larger the table, the better (in general), but it seems to max out at 512 to 1K entries. Larger associativity also improves accuracy (in general). Why “in general” and not always? Sometimes “aliasing” is beneficial.

Branch Target Buffers BPT: Tag + Prediction BTB: Tag + prediction + next address Now we predict and “precompute” branch outcome and target address during IF Of course more costly Can still be associated with cache line (UltraSparc) Implemented in a straightforward way in Pentium; not so straightforward in Pentium Pro (see later) Decoupling (see later) of BPT and BTB in Power PC and PA-8000 Entries put in BTB only on taken branches (small benefit) CSE 586 Spring 00

BTB layout. [Figure: cache-like table indexed by the (partial) PC; each entry holds a tag, the next PC (target instruction address or I-cache line target address) and a prediction (2-bit counter).] During IF, check if there is a hit in the BTB. If so, the instruction must be a branch and we can get the target address, if predicted taken, during IF. If correct, no bubble.

Another Form of Misprediction in BTB. Correct T prediction but incorrect target address. Can happen for “return” (see later). Can happen for “indirect jumps” (rare but costly).

Decoupled BPT and BTB. For a fixed real estate (i.e., fixed area on the chip): increasing the number of entries implies fewer bits for history, or no field for target instruction, or fewer bits for tags (more aliasing); increasing the number of entries implies better accuracy of prediction. Decoupled design: separate BPT and BTB, of different sizes. Access the BPT; if it predicts taken then go to the BTB (see next slide). Power PC 620: 2K entries BPT + 256 entries BTB. HP PA-8000: 256*3 BPT + 32 (fully-associative) BTB.

Decoupled BTB. [Figure: (1) the PC accesses the BPT (Tag, Hist); (2) if the BPT predicts T, access the BTB (Tag, Next address); (3) if the tag matches, the target address is available.] Note: the BPT does not require a tag, so could be much larger.
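The three-step lookup of the decoupled design can be sketched in Python. Everything concrete here is an assumption made for illustration: the class and method names, 4-byte aligned instructions, direct-mapped tables, the full PC used as tag, and counters initialized to weakly not taken. Only the default sizes (2K-entry BPT, 256-entry BTB) echo the Power PC 620 figures quoted earlier.

```python
class DecoupledPredictor:
    """Untagged BPT of 2-bit counters plus a smaller tagged BTB (illustrative)."""

    def __init__(self, bpt_entries=2048, btb_entries=256):
        self.bpt = [1] * bpt_entries   # 2-bit counters, start weakly not taken
        self.btb = {}                  # index -> (tag, target address)
        self.btb_entries = btb_entries

    def predict(self, pc):
        """Return the predicted target, or None to fall through."""
        counter = self.bpt[(pc >> 2) % len(self.bpt)]      # (1) access BPT
        if counter < 2:
            return None                                    # predict not taken
        entry = self.btb.get((pc >> 2) % self.btb_entries) # (2) access BTB
        if entry is not None and entry[0] == pc:           # (3) tag match
            return entry[1]
        return None  # predicted taken but no target available

    def update(self, pc, taken, target=None):
        i = (pc >> 2) % len(self.bpt)
        self.bpt[i] = min(3, self.bpt[i] + 1) if taken else max(0, self.bpt[i] - 1)
        if taken:  # entries are put in the BTB only on taken branches
            self.btb[(pc >> 2) % self.btb_entries] = (pc, target)


predictor = DecoupledPredictor()
first = predictor.predict(0x400)       # nothing learned yet
predictor.update(0x400, True, 0x800)   # branch at 0x400 jumps to 0x800
second = predictor.predict(0x400)
print(first, second)
```

Because the BPT carries no tag or target, it can have many more entries than the BTB for the same area, which is the point of the note on the slide.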

Correlated or 2-level branch prediction. Outcomes of consecutive branches are not independent. Classical example:
loop ….
    if (x == 2)    /* branch b1 */
        x = 0;
    if (y == 2)    /* branch b2 */
        y = 0;
    if (x != y)    /* branch b3 */
        do this
    else
        do that

What should a good predictor do? In the previous example, if both b1 and b2 are Taken (so x and y are both set to 0), b3 should be Not-Taken. A two-bit counter scheme cannot predict this behavior. It needs the history of previous branches, hence correlated schemes for BPT’s. Requires history of n previous branches (shift register). Use this vector (maybe more than one) to index a Pattern History Table (PHT) (maybe more than one).

General idea: implementation using a global history register and a global PHT. [Figure: the global history register holds the outcomes of the last k branches (t = 1, nt = 0), e.g., t t nt t nt nt, and indexes a PHT with 2^k entries of 2-bit counters.]

Classification of 2-level (correlated) branch predictors. How many global registers and their length: GA: Global (one); PA: one per branch address (Local); SA: group several branch addresses. How many PHT’s: g: Global (one); p: one per branch address; s: group several branch addresses. Previous slide was GAg (6,2): the “6” refers to the length of the global register; the “2” means we are using 2-bit counters.
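A GAg (k,2) predictor, one global history register of k bits indexing one global PHT of 2-bit counters, can be sketched as below. The class name, the initial counter value (weakly not taken) and the test pattern are assumptions; the alternating pattern is just a convenient outcome stream that a single 2-bit counter cannot follow but a history-indexed PHT learns quickly.

```python
class GAg:
    """Global history register of k bits indexing a global PHT (illustrative)."""

    def __init__(self, k):
        self.k = k
        self.history = 0             # last k outcomes, t = 1, nt = 0
        self.pht = [1] * (1 << k)    # 2^k two-bit counters, weakly not taken

    def predict(self):
        return self.pht[self.history] >= 2

    def update(self, taken):
        counter = self.pht[self.history]
        self.pht[self.history] = min(3, counter + 1) if taken else max(0, counter - 1)
        # Shift the new outcome into the history register, keeping k bits.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.k) - 1)


def count_misses(predictor, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


alternating = [True, False] * 50
gag_misses = count_misses(GAg(2), alternating)
print(len(GAg(6).pht), gag_misses)
```

With k = 6 the PHT has 64 counters, matching the GAg (6,2) of the previous slide; on the alternating stream the k = 2 predictor only misses while the counters warm up.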

[Figure: two-level global predictors. GAg (5,2): the global history register GA indexes one global PHT (g). GAp (5,2): the PC selects one PHT per address or set of addresses (p or s), which is indexed by GA.]

[Figure: two-level per-address predictors. The PC selects one of the history (shift) registers, one per address. PAg (4,2): the selected register indexes one global PHT (g). PAp (4,2): the selected register indexes one PHT per address or set of addresses (p or s).]

Gshare: a popular predictor. The global history register and selected bits of the PC are XORed to provide the index into a single PHT. [Figure: global history register XOR PC, indexing the PHT.]
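The gshare indexing can be sketched as follows. Which PC bits are “selected” is an assumption here (the low-order bits above the 2-bit byte offset of a 4-byte instruction), as are the class name, the table size and the initial counter value.

```python
class Gshare:
    """Single PHT indexed by (PC bits XOR global history); illustrative sketch."""

    def __init__(self, index_bits=10):
        self.mask = (1 << index_bits) - 1
        self.history = 0
        self.pht = [1] * (1 << index_bits)   # 2-bit counters, weakly not taken

    def index(self, pc):
        # Assumed choice of PC bits: drop the byte offset, keep index_bits bits.
        return ((pc >> 2) ^ self.history) & self.mask

    def predict(self, pc):
        return self.pht[self.index(pc)] >= 2

    def update(self, pc, taken):
        i = self.index(pc)
        self.pht[i] = min(3, self.pht[i] + 1) if taken else max(0, self.pht[i] - 1)
        self.history = ((self.history << 1) | int(taken)) & self.mask


def gshare_misses(predictor, pc, outcomes):
    misses = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            misses += 1
        predictor.update(pc, taken)
    return misses


g = Gshare(index_bits=4)
idx_empty_history = g.index(0b101100)
g.history = 0b0110
idx_with_history = g.index(0b101100)
print(idx_empty_history, idx_with_history)
print(gshare_misses(Gshare(index_bits=4), 0x40, [True, False] * 20))
```

XORing lets two branches with the same history, or one branch with two different histories, land on different counters without needing one PHT per address.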

Hybrid Predictor (schematic). [Figure: the PC and the global history feed two predictors, P1 and P2, and a selector table (P1c/P2c) that selects which predictor to use. The green, red, and blue arrows might correspond to different indexing functions.]

Evaluation. The more hardware (real estate) the better! GAs: for a given number of “s”, the larger G the better; for a given “G” length, the larger the number of “s” the better. Note that the result of a branch might not be known when the GA (or PA) needs to be used again (because we might issue several instructions per cycle). It must be speculatively updated (and corrected if need be). Ditto for the PHT, but less in a hurry?

Summary: Anatomy of a Branch Predictor. [Figure: program execution feeds event selection (all instructions (BTB) or branch instructions (BPT)), then prediction indexing (PC and/or global history and/or local history), then the prediction mechanism (static (ISA); 1 or 2-bit saturating counters; one level (BPT); two level (History + PHT); decoupled BTB + BPT). Feedback from the branch outcome: recovery?, update prediction mechanism, update history (updates might be speculative).]

Pentium Pro. 512-entry, 4-way set-associative BTB. 4 bits of branch history. GAg (4+x,2)?? Where do the extra bits come from in the PC?

Return jump stack. Indirect jumps are difficult to predict, except returns from procedures (but luckily returns are about 85% of indirect jumps). If returns are entered with their target address in the BTB, most of the time it will be the wrong target, since procedures are called from many different locations. Hence the addition of a small “return stack”; 4 to 8 entries are enough (1 in MIPS R10000, 4 in Alpha 21064, 4 in Sparc64, 12 in Alpha 21164). Checked during IF, in parallel with the BTB.
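A return stack can be sketched as a small bounded stack of return addresses. The class name, the method names and the overflow policy (drop the oldest entry) are assumptions; real implementations differ in how they handle overflow and underflow.

```python
class ReturnStack:
    """Small stack of predicted return addresses (illustrative sketch)."""

    def __init__(self, entries=4):
        self.entries = entries
        self.stack = []

    def on_call(self, return_address):
        if len(self.stack) == self.entries:
            self.stack.pop(0)        # assumed policy: overwrite the oldest entry
        self.stack.append(return_address)

    def on_return(self):
        """Predicted return target, or None if the stack is empty."""
        return self.stack.pop() if self.stack else None


rs = ReturnStack(entries=2)
rs.on_call(0x104)   # call from 0x100, return to 0x104
rs.on_call(0x204)   # nested call
rs.on_call(0x304)   # third nested call overflows the 2-entry stack
predictions = [rs.on_return(), rs.on_return(), rs.on_return()]
print(predictions)
```

The same procedure called from 0x100 and later from 0x200 gets two different predicted return targets, which a BTB entry keyed by the address of the return instruction cannot provide.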

Resume buffer In some “old” machines (e.g., IBM 360/91 circa 1967), branch prediction was implemented by fetching both paths (limited to 1 branch) Similar idea: “resume buffer” in MIPS R10000. If branch predicted taken, it takes one cycle to compute and fetch the target During that cycle save the NT sequential instruction in a buffer (4 entries of 4 instructions each). If mispredict, reload from the “resume buffer” thus saving one cycle CSE 586 Spring 00

A sample of recent pipeline configurations Highly abstracted Single issue with specialized pipelines for various functions An introduction to forthcoming ILP Several instructions can be in different pipes at the same time Multiple issue Superscalar (the order of execution is mandated by the compiler) Out-of-order execution (the hardware dynamically schedules the operations) CSE 586 Spring 00

MIPS R4000 pipelines R4000 8 stage integer pipe (superpipelined, some longer stages such as memory access now take more than one stage). Load delay 2 cycles Branch delay : 1 delay slot + 2 cycles 8 stage f-p pipe. Stages can be used in any order, multiple times Thus potential conflicts between independent instructions (structural hazards) For details see book CSE 586 Spring 00

An illustration of superpipelining. Stages: IF IS RF EX DF DS TC WB. IF: selection of PC, start of I-cache access. IS: complete instruction fetch. RF: decode, register fetch, forwarding detection, etc. EX: execution. DF and DS: data cache access. TC: data cache hit check. WB: write back.

Branch and load delays. Branch delay: branch completion at end of EX, hence a branch delay of 3 cycles (PC selection at IF). Load delay: assuming a cache hit, data is ready at the end of DS; if needed by the next instruction (beginning of EX), two bubbles are needed. Another complexity: conflict in the WB stage. Both the floating-point pipeline and the integer pipeline, which performs the loads, might want to access the WB stage for floating-point registers at the same time (structural hazard). We’ll see how to resolve this soon.

MIPS R10000 pipelines. MIPS R10000: 5 pipelines with common first 2 stages (IF, ID). 2 integer ALU’s with 3 more stages (one ALU used for compares; apparently 3 cycles branch taken penalty but the resume buffer reduces it to 2). 1 load-store with 4 more stages (1 cycle load delay). 2 FP units with 5 more stages (1 for Add, 1 for Mpy and long latency ops such as Div and Sqrt). Can issue 4 instructions at a time (4-way issue). Out-of-order execution with register renaming (see later).

[Figure: MIPS R10000 pipelines. Common IF and ID stages (these two stages are quite complex), followed by: 2 int ALU’s (RF EX WB); 1 load/store (RF Addr Mem WB; load delay 1 cycle); 1 FP add and 1 FP mpy (RF EX1 EX2 EX3 WB). There is also some mechanism, not shown in the picture, associated with WB. Mispredicted taken branch: 2 cycles penalty (resume buffer). Mispredicted not taken branch: 3 cycles penalty.]

DEC (now Compaq) Alpha pipelines Alpha 21064 (2-way superscalar, in order) and Alpha 21164 (4-way) 21064 Ibox (Ifetch and decode: 4 cycles) common to: Ebox (Integer execution unit: 3 stages) Fbox (Floating-point execution unit: 6 stages) Abox (load-store unit: 3 stages) Stalls can occur only in the first 4 stages. CSE 586 Spring 00

[Figure: 4 stages common to all instructions (Fet, Swap, Dec, Iss), feeding the Integer, FP and Load-store pipes. Once past the 4th stage, no stalls. Branch prediction during “swap”. Check structural and data hazards during issue. In blue, a subset of the 38 bypasses (forwarding paths).]

Alpha 21164 4-way instead of 2-way Two integer pipelines (1 of them used also for load-store) Two floating-point pipelines Still 7 stages for integer and memory pipelines but Load delay only 1 cycle instead of 2 in 21064 Mispredict penalty 5 cycles instead of 4 (better branch prediction though) CSE 586 Spring 00

IBM Power PC. 601: 2-way issue; slower but OOO; branch unit, integer/load/store, f-p. 620: 4-way issue; OOO; “traditional” 5 stage pipeline; 2 integer ALU’s + 1 mult/div; 1 load/store unit; 1 FPU; 1 branch unit (sophisticated) and branching based on CC’s; misprediction penalty 2 or 3 cycles. See book (4.8) for more details. We’ll revisit after studying OOO execution.

Intel Pentium 2-way superscalar 2 integer ALU’s of the 5 stage AGI type (not quite) More stages needed for fetch/align and decode (2 1/2 stages) AGI = address generation interlock (cf. Golden and Mudge paper) First 2 stages common to both pipes F-P unit has 8 stages (including the common 2); latency of 3 cycles. Branch penalty. If correct prediction in BTB or branch not taken no delay; otherwise 3 or 4 cycles CSE 586 Spring 00

[Figure: the 486 integer pipe and the Pentium integer pipes side by side. Stages: fetch and align; decode instruction, generate control word; decode instruction, generate mem. address; access data cache or ALU result; write result. In the Pentium the first two stages are shared and the last three are duplicated, one per pipe.]

Pentium Pro OOO issue and completion Separation between Fetch/decode unit Transforms CISC instructions into RISC-like uops Issues instructions in a common instruction window Functional units The various EX stages Retire unit Akin to WB stage but also stores results in correct order CSE 586 Spring 00

[Figure: the Fetch/Decode unit, the Dispatch/Execute unit and the Retire unit, each connected to the instruction pool.] The 3 units of the Pentium Pro are “independent” and communicate through the instruction pool.

Control Hazards (cont’d). Branches (conditional, unconditional, call-return). Interrupt: asynchronous event (e.g., I/O). Occurrence of an interrupt is checked at every cycle. If an interrupt has been raised, don’t fetch the next instruction, flush the pipe, handle the interrupt (see later in the quarter). Exceptions (e.g., arithmetic overflow, page fault, etc.): program and data dependent (repeatable), hence “synchronous”.

Exceptions. Occur “within” an instruction, for example: during IF: page fault; during ID: illegal opcode; during EX: division by 0; during MEM: page fault, protection violation. Handling exceptions: a pipeline is restartable if the exception can be handled and the program restarted w/o affecting execution.

Precise exceptions. If exception at instruction i then: instructions i-1, i-2, etc. complete normally (flush the pipe); instructions i+1, i+2, etc. already in the pipeline will be “no-oped” and will be restarted from scratch after the exception has been handled. Handling precise exceptions, basic idea: force a trap instruction on the next IF; turn off writes for instruction i and all following instructions; when the target of the trap instruction receives control, it saves the PC of the instruction having the exception; after the exception has been handled, a “return from trap” instruction will restore the PC.

Precise exceptions (cont’d). Relatively simple for the integer pipeline. All current machines have precise exceptions for integer and load-store operations. Can lead to loss of performance for pipes with multiple-cycle execution stages (f-p, see later).

Integer pipeline (RISC) precise exceptions. Recall that exceptions can occur in all stages but WB. Exceptions must be treated in instruction order. Example: instruction i starts at time t and has an exception in the MEM stage at time t + 3 (treat it first); instruction i + 1 starts at time t + 1 and has an exception in the IF stage at time t + 1 (occurs earlier but treat it second).

Treating exceptions in order. Use pipeline registers: a status vector of possible exceptions is carried along with the instruction. Once an exception is posted, no writing (no change of state; easy in integer pipeline, just prevent store in memory). When an instruction leaves the MEM stage, check for exception.
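The effect of checking the status vector only when an instruction leaves MEM can be illustrated with the two-instruction example from the previous slide. This is a sketch: the function name and the fault tuples are hypothetical, and the stage offsets assume the basic 5-stage pipe with no stalls.

```python
# Cycle offset of each stage relative to the cycle in which the instruction is fetched.
STAGE_OFFSET = {"IF": 0, "ID": 1, "EX": 2, "MEM": 3}


def raise_and_handle_order(faults):
    """faults: list of (instruction name, fetch time, faulting stage).

    Returns (order in which exceptions are raised,
             order in which they are handled when checked on leaving MEM).
    """
    raised = sorted(faults, key=lambda f: f[1] + STAGE_OFFSET[f[2]])
    handled = sorted(faults, key=lambda f: f[1] + STAGE_OFFSET["MEM"])
    return [f[0] for f in raised], [f[0] for f in handled]


# Instruction i fetched at t = 0 faults in MEM (time 3);
# instruction i+1 fetched at t = 1 faults in IF (time 1).
faults = [("i", 0, "MEM"), ("i+1", 1, "IF")]
raised_order, handled_order = raise_and_handle_order(faults)
print(raised_order, handled_order)
```

The fault of i+1 is raised first in time, but because each instruction carries its status vector to the end of MEM, the fault of i is seen first and the exceptions are treated in instruction order.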

Difficulties in less RISCy environments. Due to the instruction set (“long” instructions): string instructions (but use of general registers to keep state); instructions that change state before the last stage (e.g., autoincrement mode in Vax and update addressing in Power PC) when these changes are needed to complete the instruction (require ability to back up). Condition codes: must remember when last changed. Multiple cycle stages (see later).

Extending simple pipeline to multiple pipes. Single issue: in the ID stage, direct the instruction to one of several EX stages. Common WB stage. EX of various pipelines might take more than one cycle. Latency of an EX unit = number of cycles before its result can be forwarded = number of stages - 1. Not all EX units need be pipelined. If EX is pipelined, a new instruction can be assigned to it every cycle (if no data dependency) or, maybe, only after x cycles, with x depending on the function to be performed.

[Figure: multiple-pipe example. IF and ID feed one of: EX (e.g., integer; latency 0); F-p mul, stages M1 to M7 (latency 7); F-p add, stages A1 to A4 (latency 3); Div (e.g., not pipelined, latency 25); followed by the common Me and WB stages. Annotation: both needed at beginning of cycle and ready at end of cycle.]

Hazards in example multiple cycle pipeline. Structural: Yes. The divide unit is not pipelined: any divides separated by less than 25 cycles will stall the pipe. Several writes might be “ready” at the same time and want to use the WB stage at the same time. RAW: Yes. Essentially handled as in the integer pipe but with a higher frequency of stalls; also more forwarding needed. WAW: Yes (see later). Out of order completion: Yes (see later).

RAW: Example from the book
F4 <- M          IF ID EX Me WB
F0 <- F4 * F6    IF ID st M1 M2 M3 M4 M5 M6 M7 Me WB
F2 <- F0 + F8    IF ID st st st st st A1 A2 A3 A4 Me WB
M <- F2          IF ID EX st st st st st st st Me WB
(st = stall. In the original slide, data dependency hazards are shown in blue and structural hazards in red.)

Conflict in using the WB stage. Several instructions might want to use the WB stage at the same time, e.g., a multd issued at time t and an addd issued at time t + 3. Solution: reserve the WB stage at the ID stage (scheme already used in CRAY-1): keep track of WB stage usage in a shift register; reserve the right slot, and if busy, stall for a cycle and repeat; shift every clock cycle.

Example on how to reserve the WB stage
Time in ID stage   Operation   Shift register
t                  multd       000 000 001
t + 1              int         001 000 010
t + 2              int         011 000 100
t + 3              addd        110 00X 000
Note: multd and addd want WB at time t + 9. addd will be asked to stall one cycle. Instructions complete out of order (e.g., the two int terminate before the multd).
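The reservation table above can be reproduced with a small simulation. This is a sketch: the class name is made up, and the distances from ID to WB (9 cycles for multd, 3 for int, 6 for addd) are read off the table rather than stated on the slide. Bit d from the left means “WB is reserved d cycles from now”, and the register shifts left by one position every cycle.

```python
class WBReservation:
    """Shift register tracking future use of the WB stage (illustrative)."""

    def __init__(self, width=9):
        self.bits = [0] * width   # bits[d - 1] == 1: WB reserved d cycles from now

    def tick(self):
        """Advance one clock cycle: shift left, dropping the leftmost bit."""
        self.bits = self.bits[1:] + [0]

    def try_reserve(self, cycles_until_wb):
        """Reserve the slot if it is free; return False if the instruction must stall."""
        if self.bits[cycles_until_wb - 1]:
            return False
        self.bits[cycles_until_wb - 1] = 1
        return True

    def __str__(self):
        s = "".join(str(b) for b in self.bits)
        return " ".join(s[i:i + 3] for i in range(0, len(s), 3))


# Assumed ID-to-WB distances, inferred from the table above.
WB_DISTANCE = {"multd": 9, "int": 3, "addd": 6}

reg = WBReservation()
log = []
for op in ["multd", "int", "int", "addd"]:   # one instruction in ID per cycle
    granted = reg.try_reserve(WB_DISTANCE[op])
    log.append((op, granted, str(reg)))
    reg.tick()

for entry in log:
    print(entry)

# addd retries one cycle later, after the register has shifted once more.
addd_after_stall = reg.try_reserve(WB_DISTANCE["addd"])
print(addd_after_stall)
```

The fourth row is where the slide shows an X: the slot addd wants is the one multd reserved, so addd is refused and succeeds on the retry one cycle later.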

WAW Hazards. Instruction i writes f-p register Fx at time t. Instruction i + k writes f-p register Fx at time t - m. But no instruction i + 1, i + 2, ..., i + k uses Fx (otherwise there would be a stall). The only requirement is that i + k’s result be stored. Solutions: squash i (difficult to know where it is in the pipe); or, at the ID stage, check that the result register is not a result register in all subsequent stages of other units, and if it is, stall the appropriate number of cycles.

Out-of-order completion. Instruction i finishes at time t. Instruction i + k finishes at time t - m. No hazard etc. (see the previous example of integer instructions completing before multd). What happens if instruction i causes an exception at a time in [t-m, t] and instruction i + k writes in one of its own source operands (i.e., is not restartable)?

Exception handling. Solutions (cf. book for more details): Do nothing (imprecise exceptions; bad with virtual memory). Have a precise (by use of testing instructions) and an imprecise mode; effectively restricts concurrency of f-p operations. Buffer results until previous (in order) instructions have completed; can be costly when there are large differences in latencies, but the same technique is used for OOO execution. Restrict concurrency of f-p operations and, on an exception, “simulate in software” the instructions in between the faulting and the finished one. Flag early those operations that might result in an exception and stall accordingly.