1
CSE 586 Computer Architecture
Jean-Loup Baer CSE 586 Spring 00
2
Highlights from last week
Performance metrics: use (weighted) arithmetic means for execution times; use (weighted) harmonic means for rates. CPU exec. time = Instruction count × Σi (CPIi × fi) × clock cycle time. We’ll talk about “contributions to the CPI” from, e.g., hazards in the pipeline, cache misses, branch mispredictions, etc.
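The CPU-time formula above can be exercised numerically. The instruction mix, per-class CPIs, and clock rate below are invented purely for illustration; only the formula itself comes from the slide.

```python
# Hypothetical instruction mix: class -> (frequency fi, per-class CPIi)
mix = {"alu": (0.50, 1.0), "load": (0.25, 1.5), "branch": (0.25, 2.0)}

instr_count = 1_000_000   # assumed dynamic instruction count
cycle_time = 1e-9         # assumed 1 GHz clock

# Overall CPI is the frequency-weighted sum of per-class CPIs
cpi = sum(f * c for f, c in mix.values())
exec_time = instr_count * cpi * cycle_time

print(cpi)                  # 1.375
print(f"{exec_time:.6f}")   # 0.001375 (seconds)
```

Each class contributes fi × CPIi to the total CPI, which is exactly how per-cause “contributions to the CPI” (hazards, misses, mispredictions) will be accounted for later.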
3
Highlights from last week (c’ed)
ISA (RISC and CISC). RISC, where R stands for: Restricted (relatively small number of opcodes) and Regular (all instructions have the same length); also, few instruction formats and addressing modes. RISC and load-store architectures are synonymous. CISC: fewer instructions executed, but the CPI per instruction is larger; more complex to design. VLIW-EPIC (might talk about it later in the quarter).
4
Highlights from last week (c’ed)
Basic pipelining: 5 stages (IF, ID, EX, MEM, WB), with pipeline registers between stages to keep the data/control info needed in subsequent stages. Hazards: structural (won’t happen in the basic pipeline); data dependencies (most can be removed via forwarding, otherwise stall, i.e., insert bubbles); control.
5
[Datapath figure: the five pipeline stages (IF, ID/RR, EXE, Mem, WB) separated by the IF/ID, ID/EX, EX/MEM, and MEM/WB pipeline registers; instruction memory, register file, ALU, and data memory, plus the forwarding unit, stall unit, and control unit.]
6
Control unit of simple pipeline
Everything about the instruction is known at the ID stage. If it can pass from the ID to the EXE stage, the instruction is issued. The control unit and forwarding unit take care of RAW dependencies, except for load dependencies, which are taken care of by the control and stall units. To insert a bubble, zero out all control fields in the relevant pipeline registers; you might also need to prevent instruction fetch (a signal preventing “read instruction memory”). The stall (aka hazard detection) unit can also be used for control hazards.
7
Branch statistics Branches occur 25-30% of the time
Unconditional branches: 20% (of branches). Conditional: 80%. Of the conditional branches, 66% are forward (i.e., slightly over 50% of total branches), evenly split between Taken and Not-Taken; 33% are backward, almost all Taken. Probability that a conditional branch is taken: p = 0.66 × 0.5 + 0.33 × 1 ≈ 0.7. In addition, call-returns are always Taken.
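The taken-probability arithmetic above is just a weighted average and can be checked directly. The 66/34 split below normalizes the slide's 66%/33% figures so they sum to one; the per-direction taken rates are the slide's assumptions.

```python
# Conditional-branch direction mix (normalized from the slide's 66%/33%)
forward, backward = 0.66, 0.34

# Forward branches are ~50% taken; backward branches are ~all taken
p_taken = forward * 0.5 + backward * 1.0

print(round(p_taken, 2))  # 0.67, i.e. roughly the 0.7 quoted on the slide
```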
8
Control hazards (branches)
When do you know you have a branch? During the ID cycle. When do you know if the branch is Taken or Not-Taken? During the EXE cycle (e.g., for the MIPS). Easiest solution: stall one cycle after recognizing the branch (hazard detection unit), and fetch the instruction following the branch after the EXE cycle. Cost: 2 cycles, so the contribution to CPI due to branches = 2 × branch freq. = 2 × 0.25 = 0.5.
9
Better (simple) schemes to handle branches
Comparison could be done during ID stage: cost 1 cycle only Need more extensive forwarding plus fast comparison Still might have to stall an extra cycle (like for loads) Predictions Static schemes (only software) Dynamic schemes: hardware assists CSE 586 Spring 00
10
Simple static predictive schemes
Predict branch Not-Taken (easy but not the best scheme). If the prediction is correct, no problem. If the prediction is incorrect, and this is known during the EX cycle, zero out (flush) the pipeline registers of the two already-fetched instructions following the branch. With this technique, the contribution to CPI due to conditional branches is 0.25 × (0.7 × 2 + 0.3 × 0) = 0.35. The problem is that we are optimizing for the less frequent case! Nonetheless it will be the “default” for dynamic branch prediction since it is so easy to implement.
11
Static schemes (c’ed) Predict branch Taken
Interesting only if the target address can be computed before the decision is known. With this technique, the contribution to CPI due to conditional branches is 0.25 × (0.7 × 1 + 0.3 × 2) = 0.33. The 1 is there because you need to compute the branch target address.
12
Static schemes (c’ed) Prediction depends on the direction of branch
Backward-Taken-Forward-Not-Taken (BTFNT). With the same assumptions as before, the contribution to CPI is 0.25 × (0.33 × 1 + 0.66 × 0.5 × 2 + 0.66 × 0.5 × 0) ≈ 0.25 (the first term corresponds to backward taken, and the next two to forward branches predicted not-taken). Prediction based on opcode: a bit in the opcode for prediction T/NT (can be set after profiling the application).
13
Penalties increase with deeper pipes and multiple issue machines
The 1 or 2 cycle penalty is “optimistic” because many modern microprocessors have deeper pipes (8 to 20 stages), for example separate decode and register read stages, or extra decoding stages to see if multiple instructions can be issued, and CISC machines have more complex branch instructions. These simple schemes could yield penalties from 2 up to 12 cycles, i.e., from, say, 8 (2 × 4) to 48 (4 × 12) instruction issue slots if several instructions can be issued simultaneously.
14
Dynamic branch prediction
Execution of a branch requires knowledge of: There is a branch but one can surmise that every instruction is a branch for the purpose of guessing whether it will be taken or not taken (i.e., prediction can be done at IF stage) Whether the branch is Taken/Not-Taken (hence a branch prediction mechanism) If the branch is taken what is the target address (can be computed but can also be “precomputed”, i.e., stored in some table) If the branch is taken what is the instruction at the branch target address (saves the fetch cycle for that instruction) CSE 586 Spring 00
15
Basic idea Use a Branch Prediction Table (BPT)
How it will be indexed, updated, etc.: see later. Can we make a prediction using the BPT on the current branch? The prediction takes place during the IF stage and is acted upon during the ID stage (when we know we have a branch). Case 1: yes, and the prediction was correct (known at EX stage). Case 2: yes, and the prediction was incorrect. Case 3: no, but the default prediction (NT) was correct. Case 4: no, and the default prediction (NT) was incorrect.
16
Penalties (Predict/Actual). The BPT improves only the T/T case. In what follows, the number of stall cycles (bubbles) is for a simple pipe; it would be larger for deeper pipes. Case 1: NT/NT, 0 penalty; T/T, need to compute the address: 0 or 1 bubble. Case 2: NT/T, 2 or 3 bubbles; T/NT, 2 or 3 bubbles. Case 3: NT/NT, 0 penalty. Case 4: NT/T, 2 or 3 bubbles.
17
Branch prediction tables and buffers
Branch prediction table (BPT or branch history table BHT) How addressed (low-order bits of PC, hashing, cache-like) How much history in the prediction (1-bit, 2-bits, n-bits) Where is it stored (in a separate table, associated with the I-cache) Branch target buffers (BTB) BPT + address of target instruction (+ target instruction -- not implemented in current micros as far as I know--) Correlated branch prediction 2-level prediction Hybrid predictors Choose dynamically the best among 2 predictors CSE 586 Spring 00
18
Simplest design BPT addressed by lower bits of the PC
One bit prediction Prediction = direction of the last time the branch was executed Will mispredict at first and last iterations of a loop Known implementation Alpha The 1-bit table is associated with an I-cache line, one bit per line (4 instructions) CSE 586 Spring 00
19
Variations on BPT design
[Figure: two BPT organizations, both a table of 2-bit counters: simple indexing by low-order PC bits, versus a cache-like design with tags matched against the PC.]
20
Improve prediction accuracy: the 2-bit saturating counter scheme. Property: it takes two wrong predictions before it changes T to NT (and vice versa). [State diagram: four states, two “predict taken” and two “predict not-taken”; a taken outcome moves toward the strongly-taken end, a not-taken outcome toward the strongly-not-taken end. Generally, a “predict taken” state is the initial state.]
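The state machine above can be sketched as a small class. The initial state (weakly taken) follows the slide's note; the loop example is illustrative.

```python
class TwoBitCounter:
    """Saturating 2-bit counter: states 0-1 predict not-taken, 2-3 taken.
    Two consecutive mispredictions are needed to flip the prediction."""
    def __init__(self, state=2):          # start weakly taken, per the slide
        self.state = state

    def predict(self):
        return self.state >= 2            # True means "predict taken"

    def update(self, taken):
        # Saturate at the strong ends of the four-state diagram
        self.state = min(3, self.state + 1) if taken else max(0, self.state - 1)

# Loop branch: taken four times, then the exit iteration falls through.
c = TwoBitCounter()
mispredicts = 0
for taken in [True, True, True, True, False]:
    if c.predict() != taken:
        mispredicts += 1
    c.update(taken)
print(mispredicts)  # 1: only the loop exit mispredicts
```

Unlike the 1-bit scheme, re-entering the loop right away would still predict taken, so the first iteration of the next run is not a second misprediction.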
21
Two bit saturating counters
2 bits scheme used in: Alpha 21164 UltraSparc Pentium Power PC 604 and 620 with variations MIPS R10000 etc... PA-8000 uses a variation Majority of the last 3 outcomes (no need to update; just a shift register) Why not 3 bit (8 states) saturating counters? Performance studies show it’s not worthwhile CSE 586 Spring 00
22
Where to put the BPT Associated with I-cache lines
1 counter/instruction: Alpha 21164 2 counters/cache line (1 for every 2 instructions) : UltraSparc 1 counter/cache line (AMD K5) Separate table with cache-like tags direct mapped : entries (MIPS R10000), 1K entries (Sparc 64), 2K + BTB (PowerPC 620) 4-way set-associative: 256 entries BTB (Pentium) 4-way set-associative: 512 entries BTB + “2-level”(Pentium Pro) CSE 586 Spring 00
23
Performance of BPTs. Prediction accuracy is only one of several metrics. Other considerations: need to take into account branch frequencies, and need to take into account penalties for misfetch (correct prediction but time needed to compute the address, e.g., for unconditional branches, or T/T if there is no BTB) and mispredict (incorrect branch prediction). These penalties might need to be multiplied by the number of instructions that could have been issued.
24
Prediction accuracy 2-bit vs. 1-bit Table size and organization
Significant gain (approx. 92% vs. 85% for f-p in the Spec89 benchmarks, 90% vs. 80% in gcc, but about 88% for both in compress). Table size and organization: the larger the table, the better (in general), but gains seem to max out at 512 to 1K entries. Larger associativity also improves accuracy (in general). Why “in general” and not always? Sometimes “aliasing” is beneficial.
25
Branch Target Buffers BPT: Tag + Prediction
BTB: Tag + prediction + next address Now we predict and “precompute” branch outcome and target address during IF Of course more costly Can still be associated with cache line (UltraSparc) Implemented in a straightforward way in Pentium; not so straightforward in Pentium Pro (see later) Decoupling (see later) of BPT and BTB in Power PC and PA-8000 Entries put in BTB only on taken branches (small benefit) CSE 586 Spring 00
26
BTB layout. [Figure: each cache-like BTB entry holds a (partial) PC tag, a 2-bit counter for the prediction, and the next PC (the target address, target instruction address, or I-cache line target address).] During IF, check if there is a hit in the BTB. If so, the instruction must be a branch, and we can get the target address (if predicted taken) during IF. If the prediction is correct, no bubble.
27
Another Form of Misprediction in BTB
Correct T prediction but incorrect target address Can happen for “return” (see later) Can happen for “indirect jumps” (rare but costly) CSE 586 Spring 00
28
Decoupled BPT and BTB. For a fixed real estate (i.e., fixed area on the chip): increasing the number of entries implies fewer bits for history, or no field for the target instruction, or fewer bits for tags (more aliasing); but increasing the number of entries also implies better prediction accuracy. Decoupled design: separate, and differently sized, BPT and BTB. BPT: if it predicts taken, then go to the BTB (see next slide). Power PC 620: 2K-entry BPT plus a BTB. HP PA-8000: 256×3 BPT + 32-entry (fully associative) BTB.
29
Decoupled BTB. [Figure: (1) the PC accesses the BPT (tag + history); (2) if it predicts Taken, access the BTB; (3) if the BTB tag matches, we have the target address. Note: the BPT does not require a tag, so it could be much larger.]
30
Correlated or 2-level branch prediction
Outcomes of consecutive branches are not independent. Classical example:

  loop { ...
    if (x == 2)   /* branch b1 */
      x = 0;
    if (y == 2)   /* branch b2 */
      y = 0;
    if (x != y)   /* branch b3 */
      do this;
    else
      do that;
  }
31
What should a good predictor do?
In previous example if both b1 and b2 are Taken, b3 should be Not-Taken A two-bit counter scheme cannot predict this behavior. Needs history of previous branches hence correlated schemes for BPT’s Requires history of n previous branches (shift register) Use of this vector (maybe more than one) to index a Pattern History Table (PHT) (maybe more than one) CSE 586 Spring 00
32
General idea: implementation using a global history register and a global PHT
[Figure: a global history register records the outcomes of the last k branches (t = 1, nt = 0) and indexes a PHT of 2^k entries of 2-bit counters.]
33
Classification of 2-level (correlated) branch predictors
How many global registers and their length: GA: Global (one) PA: One per branch address (Local) SA: Group several branch addresses How many PHT’s: g: Global (one) p : One per branch address s: Group several branch addresses Previous slide was GAg (6,2) The “6” refers to the length of the global register The “2” means we are using 2-bit counters CSE 586 Spring 00
34
Two level Global predictors
[Figure: GAg(5,2) has one global history register (GA) indexing a single global PHT (g); GAp(5,2) has the same global register indexing one PHT per branch address or set of addresses (p or s), selected by the PC.]
35
Two level per-address predictors
[Figure: PAg(4,2) has per-address history (shift) registers, selected by the PC, all indexing one global PHT (g); PAp(4,2) has per-address history registers indexing one PHT per branch address or set of addresses (p or s).]
36
Gshare: a popular predictor
The global history register and selected bits of the PC are XORed to provide the index into a single PHT.
37
Hybrid Predictor (schematic)
[Schematic: the PC and global history feed two predictors P1 and P2, possibly through different indexing functions (the green, red, and blue arrows); a selector table of counters (P1c/P2c) chooses which predictor's output to use.]
38
Evaluation The more hardware (real estate) the better!
GAs: for a given number of sets “s”, the longer the global register “G” the better; for a given “G” length, the larger the number of “s” the better. Note that the result of a branch might not be known when the GA (or PA) register needs to be used again (because we might issue several instructions per cycle); it must be speculatively updated (and corrected if need be). Ditto for the PHT, but less urgently.
39
Summary: Anatomy of a Branch Predictor
[Schematic: program execution generates events; event selection decides which instructions get predicted (all instructions for a BTB, branch instructions only for a BPT); a prediction index is formed from the PC and/or global history and/or local history; the prediction mechanism is static (ISA) or 2-bit saturating counters, organized as one level (BPT), two levels (history + PHT), or a decoupled BTB + BPT; feedback from the branch outcome updates the prediction mechanism and the history (updates might be speculative), with recovery if needed.]
40
Pentium Pro: 512-entry, 4-way set-associative BTB with 4 bits of branch history
GAg (4+x,2) ?? Where do the extra bits come from in PC? CSE 586 Spring 00
41
Return jump stack. Indirect jumps are difficult to predict, except for returns from procedures (luckily, returns are about 85% of indirect jumps). If returns are entered with their target address in the BTB, most of the time it will be the wrong target, because procedures are called from many different locations. Hence the addition of a small “return stack”; 4 to 8 entries are enough (1 in the MIPS R10000, 4 in the Alpha 21064, 4 in the Sparc64, 12 in the Alpha 21164). It is checked during IF, in parallel with the BTB.
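The return stack is just a small LIFO pushed on calls and popped on returns. The depth and the overflow policy (dropping the oldest entry) below are illustrative choices.

```python
class ReturnStack:
    """Small LIFO of predicted return targets, probed during IF."""
    def __init__(self, depth=8):
        self.depth, self.stack = depth, []

    def on_call(self, return_pc):
        if len(self.stack) == self.depth:
            self.stack.pop(0)            # overflow: drop the oldest entry
        self.stack.append(return_pc)

    def on_return(self):
        # Predicted target of the return; None if the stack has underflowed
        return self.stack.pop() if self.stack else None

rs = ReturnStack(4)
rs.on_call(0x100)                # outer call pushes its return address
rs.on_call(0x200)                # nested call pushes on top
print(hex(rs.on_return()))       # 0x200: innermost return predicted first
print(hex(rs.on_return()))       # 0x100
```

Because calls and returns nest, even a 4-entry stack predicts almost all returns correctly, which the BTB alone cannot do.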
42
Resume buffer In some “old” machines (e.g., IBM 360/91 circa 1967), branch prediction was implemented by fetching both paths (limited to 1 branch) Similar idea: “resume buffer” in MIPS R10000. If branch predicted taken, it takes one cycle to compute and fetch the target During that cycle save the NT sequential instruction in a buffer (4 entries of 4 instructions each). If mispredict, reload from the “resume buffer” thus saving one cycle CSE 586 Spring 00
43
A sample of recent pipeline configurations
Highly abstracted Single issue with specialized pipelines for various functions An introduction to forthcoming ILP Several instructions can be in different pipes at the same time Multiple issue Superscalar (the order of execution is mandated by the compiler) Out-of-order execution (the hardware dynamically schedules the operations) CSE 586 Spring 00
44
MIPS R4000 pipelines R4000 8 stage integer pipe (superpipelined, some longer stages such as memory access now take more than one stage). Load delay 2 cycles Branch delay : 1 delay slot + 2 cycles 8 stage f-p pipe. Stages can be used in any order, multiple times Thus potential conflicts between independent instructions (structural hazards) For details see book CSE 586 Spring 00
45
An illustration of superpipelining
The eight stages: IF IS RF EX DF DS TC WB. IF: selection of the PC; start the I-cache access. IS: complete the instruction fetch. RF: decode, register fetch, forwarding detection, etc. EX: execution. DF and DS: data cache access. TC: data cache hit check. WB: write back.
46
Branch and load delays. Branch completion is at the end of EX, hence a branch delay of 3 cycles (PC selection at IF). Load delay: assuming a cache hit, data is ready at the end of DS; if it is needed by the next instruction (at the beginning of EX), two bubbles are required. Another complexity: conflict in the WB stage. Both the floating-point pipeline and the integer pipeline (which performs the loads) might want to access the WB stage for floating-point registers at the same time (a structural hazard). We’ll see how to resolve this soon.
47
MIPS R10000 pipelines: 5 pipelines with a common first 2 stages (IF, ID). 2 integer ALUs with 3 more stages (one ALU used for compares; apparently a 3-cycle taken-branch penalty, but the resume buffer reduces it to 2). 1 load-store unit with 4 more stages (1-cycle load delay). 2 FP units with 5 more stages (1 for Add, 1 for Mpy and long-latency ops such as Div and Sqrt). Can issue 4 instructions at a time (4-way issue). Out-of-order execution with register renaming (see later).
48
Mispredicted taken branch: 2 cycles penalty (resume buffer). Mispredicted not-taken branch: 3 cycles penalty. [Figure: two integer ALU pipes (IF, ID, RF, EX, WB); one load/store pipe (RF, Addr, Mem, WB) with a 1-cycle load delay; one FP add and one FP multiply pipe (RF, EX1-EX3, WB). The first two stages are quite complex, and there is also some mechanism, not shown in the picture, associated with WB.]
49
DEC (now Compaq) Alpha pipelines
Alpha 21064 (2-way superscalar, in order) and Alpha 21164 (4-way). 21064: an Ibox (Ifetch and decode: 4 cycles) common to the Ebox (integer execution unit: 3 stages), the Fbox (floating-point execution unit: 6 stages), and the Abox (load-store unit: 3 stages). Stalls can occur only in the first 4 stages.
50
Branch prediction during “swap”
[Figure: 21064 pipeline. The Fetch, Swap, Decode, and Issue stages are common to all instructions and feed the integer, load-store, and FP pipes; once past the 4th stage, there are no stalls. Branch prediction happens during the “swap” stage; structural and data hazards are checked during issue. In blue, a subset of the 38 bypasses (forwarding paths).]
51
Alpha 21164 4-way instead of 2-way
Two integer pipelines (one of them also used for load-store) and two floating-point pipelines. Still 7 stages for the integer and memory pipelines, but the load delay is only 1 cycle instead of 2 in the 21064. Mispredict penalty is 5 cycles instead of 4 (better branch prediction, though).
52
IBM Power PC 601: 2-way issue; slower, but OOO
Branch unit, integer/load/store unit, and f-p unit. “Traditional” 5-stage pipeline: 2 integer ALUs + 1 mult/div, 1 load/store unit, 1 FPU, and 1 (sophisticated) branch unit, with branching based on condition codes. Misprediction penalty 2 or 3 cycles. See the book (4.8) for more details; we’ll revisit after studying OOO execution.
53
Intel Pentium 2-way superscalar
2 integer ALUs of the 5-stage AGI type (not quite); more stages are needed for fetch/align and decode (2½ stages). AGI = address generation interlock (cf. the Golden and Mudge paper). The first 2 stages are common to both pipes. The F-P unit has 8 stages (including the common 2), with a latency of 3 cycles. Branch penalty: if a correct prediction in the BTB, or the branch is not taken, no delay; otherwise 3 or 4 cycles.
54
[Figure: the 486 integer pipe (Fetch and align; Decode instruction, generate control word; Decode instruction, generate memory address; Access data cache or ALU result; Write result) shown alongside the two Pentium integer pipes, which duplicate the same stages.]
55
Pentium Pro OOO issue and completion Separation between
Fetch/decode unit: transforms CISC instructions into RISC-like uops and issues them into a common instruction window. Functional units: the various EX stages. Retire unit: akin to the WB stage, but it also stores results in correct order.
56
[Figure: the three units of the Pentium Pro (fetch/decode, dispatch/execute, retire) are “independent” and communicate through the instruction pool.]
57
Control Hazards (c’ed)
Branches (conditional, unconditional, call-return). Interrupts: asynchronous events (e.g., I/O); the occurrence of an interrupt is checked at every cycle, and if one has been raised, don’t fetch the next instruction, flush the pipe, and handle the interrupt (see later in the quarter). Exceptions (e.g., arithmetic overflow, page fault, etc.): program- and data-dependent (repeatable), hence “synchronous”.
58
Exceptions Handling exceptions
Occur “within” an instruction, for example: During IF: page fault During ID: illegal opcode During EX: division by 0 During MEM: page fault; protection violation Handling exceptions A pipeline is restartable if the exception can be handled and the program restarted w/o affecting execution CSE 586 Spring 00
59
Precise exceptions If exception at instruction i then
Instructions i-1, i-2, etc. complete normally (flush the pipe). Instructions i+1, i+2, etc. already in the pipeline will be “no-oped” and will be restarted from scratch after the exception has been handled. Handling precise exceptions, basic idea: force a trap instruction on the next IF; turn off writes for instruction i and all following; when the target of the trap instruction receives control, it saves the PC of the instruction having the exception; after the exception has been handled, a “return from trap” instruction will restore the PC.
60
Precise exceptions (cont’d)
Relatively simple for the integer pipeline; all current machines have precise exceptions for integer and load-store operations. Can lead to a loss of performance for pipes with multiple-cycle execution stages (f-p, see later).
61
Integer pipeline (RISC) precise exceptions
Recall that exceptions can occur in all stages but WB, and they must be treated in instruction order. Example: instruction i starts at time t, with an exception in the MEM stage at time t + 3 (treat it first); instruction i + 1 starts at time t + 1, with an exception in the IF stage at time t + 1 (it occurs earlier, but treat it second).
62
Treating exceptions in order
Use the pipeline registers: a status vector of possible exceptions is carried along with the instruction. Once an exception is posted, no writing (no change of state; easy in the integer pipeline: just prevent the store to memory). When an instruction leaves the MEM stage, check for exceptions.
63
Difficulties in less RISCy environments
Due to the instruction set (“long” instructions): string instructions (but use of general registers to keep state); instructions that change state before the last stage (e.g., autoincrement mode in the Vax, and update addressing in the Power PC), where these changes are needed to complete the instruction (requires the ability to back up). Condition codes: must remember when they were last changed. Multiple-cycle stages (see later).
64
Extending simple pipeline to multiple pipes
Single issue: in the ID stage, direct the instruction to one of several EX stages; common WB stage. The EX stage of various pipelines might take more than one cycle. Latency of an EX unit = number of cycles before its result can be forwarded = number of stages - 1. Not all EX units need be pipelined. If an EX unit is pipelined, a new instruction can be assigned to it every cycle (if there is no data dependency), or maybe only after x cycles, with x depending on the function to be performed.
65
[Figure: IF and ID feed several execution pipes sharing Mem and WB: an integer EX (latency 0); an f-p multiply pipe M1-M7 (latency 7); an f-p add pipe A1-A4 (latency 3); and a divide unit (not pipelined, latency 25). Operands are needed at the beginning of a cycle and results are ready at the end of a cycle.]
66
Hazards in example multiple cycle pipeline
Structural: yes. The divide unit is not pipelined, so any two divides separated by less than 25 cycles will stall the pipe; also, several writes might be “ready” at the same time and want to use the WB stage at the same time. RAW: yes; essentially handled as in the integer pipe, but with a higher frequency of stalls, and more forwarding is needed. WAW: yes (see later). Out-of-order completion: yes (see later).
67
RAW: example from the book
[Timing diagram: a load into F4 flows IF ID EX Me WB; a dependent multiply F0 <- F4 * F… stalls in ID, then flows M1-M7 Me WB; a dependent add F2 <- F0 + F… stalls five cycles, then flows A1-A4 Me WB; a store of F… to memory stalls seven cycles behind it. In blue, the data-dependence hazards; in red, a structural hazard.]
68
Conflict in using the WB stage
Several instructions might want to use the WB stage at the same time, e.g., a multd issued at time t and an addd issued at time t + 3. Solution: reserve the WB stage at the ID stage (a scheme already used in the CRAY-1): keep track of WB stage usage in a shift register; reserve the right slot (if it is busy, stall for a cycle and repeat); shift every clock cycle.
69
Example on how to reserve the WB stage
[Table: time in ID stage vs. operation vs. shift register contents. At t, the multd reserves its WB slot; at t + 1 and t + 2, two int instructions reserve theirs; at t + 3, the addd finds its slot already taken (X) in the shift register. Note: the multd and the addd both want WB at time t + 9, so the addd will be asked to stall one cycle. Instructions complete out of order (e.g., the two int instructions terminate before the multd).]
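The reservation scheme above can be sketched as a bit-per-future-cycle shift register. The horizon size and latencies are illustrative; the stall-and-retry policy is the one the slide describes.

```python
class WBReserver:
    """One bit per future cycle; bit set = WB stage already claimed."""
    def __init__(self, horizon=16):
        self.slots = [0] * horizon

    def issue(self, latency):
        """Reserve the WB slot `latency` cycles out; return stalls needed."""
        stalls = 0
        while self.slots[latency + stalls]:
            stalls += 1                  # slot busy: wait a cycle and retry
        self.slots[latency + stalls] = 1
        return stalls

    def tick(self):
        self.slots = self.slots[1:] + [0]   # shift every clock cycle

wb = WBReserver()
print(wb.issue(10))   # 0: multd reserves WB 10 cycles out, no conflict
wb.tick(); wb.tick(); wb.tick()
print(wb.issue(7))    # 1: an addd 3 cycles later wants that same slot
```

This reproduces the slide's scenario: both instructions target the same WB cycle, and the later-issued addd is the one asked to stall.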
70
WAW Hazards Instruction i writes f-p register Fx at time t
Instruction i + k writes f-p register Fx at time t - m, but no instruction i+1, i+2, …, i+k-1 uses Fx (otherwise there would be a stall). The only requirement is that i + k’s result be the one finally stored. Solutions: squash i (difficult to know where it is in the pipe); or, at the ID stage, check that the result register is not a result register in all subsequent stages of the other units, and if it is, stall the appropriate number of cycles.
71
Out-of-order completion
Instruction i finishes at time t; instruction i + k finishes at time t - m; no hazard, etc. (see the previous example of an integer op completing before the multd). What happens if instruction i causes an exception at a time in [t - m, t] and instruction i + k writes into one of its own source operands (i.e., is not restartable)?
72
Exception handling Solutions (cf. book for more details)
Do nothing (imprecise exceptions; bad with virtual memory). Have a precise mode (by use of testing instructions) and an imprecise mode; this effectively restricts the concurrency of f-p operations. Buffer results until previous (in-order) instructions have completed; this can be costly when there are large differences in latencies, but the same technique is used for OOO execution. Restrict the concurrency of f-p operations and, on an exception, “simulate in software” the instructions in between the faulting one and the finished one. Flag early those operations that might result in an exception and stall accordingly.