F00: 1 CS/EE 5810 CS/EE 6810 Exceptions. F00: 2 CS/EE 5810 CS/EE 6810 Exceptions Just when you thought pipelines were not that hard… Pipeline benefit.


1 F00: 1 CS/EE 5810 CS/EE 6810 Exceptions

2 F00: 2 CS/EE 5810 CS/EE 6810 Exceptions Just when you thought pipelines were not that hard… Pipeline benefit = overlapped instructions –Good: Increased throughput –Sort of bad: Hazards – but we’ve seen how to deal with them –Bad: Exceptions Exception Oddities –Multi-stage / multi-cycle instructions –Exceptions can happen anywhere –Instruction order and exception order might be different –Handling exceptions in instruction order is required So what’s the strategy? –Depends on exception type…

3 F00: 3 CS/EE 5810 CS/EE 6810 Exception Types Terminology varies all over the place –I/O device request –Invoking an OS service from a user program –Tracing instruction execution –Breakpoints –Integer or FP arithmetic error such as overflow –Misaligned memory access –Page fault –Memory protection violation –Undefined instruction »Used on old Macs to invoke an OS service… –Hardware malfunction (like parity or ECC error) –Power failure

4 F00: 4 CS/EE 5810 CS/EE 6810 Response Requirements – 5 axes Synchronous vs. Asynchronous –Synchronous caused by a particular instruction –Asynchronous caused by external devices and HW failures User requested vs. Coerced –Requested is predictable and can happen after the instruction User maskable vs. user non-maskable –E.g. arithmetic overflow is maskable on some machines Within vs. Between instructions –Where is the exception located? –Does excepting instruction complete? Resume vs. Terminate –Implications for how much state must be preserved

5 F00: 5 CS/EE 5810 CS/EE 6810 Examples of Exception Types

| Exc. Type | Sync / Async | Req / Coerced | Mask / Non-mask | Within / Between | Resume / Terminate |
|---|---|---|---|---|---|
| I/O Device Req | Async | Coerced | Non-mask | Between | Resume |
| Invoke OS | Sync | Requested | Non-mask | Between | Resume |
| Trace / Breakpoint | Sync | Requested | Maskable | Between | Resume |
| Arith Exc. | Sync | Coerced | Maskable | Within | Resume |
| Page Fault | Sync | Coerced | Non-mask | Within | Resume |
| Misaligned address | Sync | Coerced | Maskable | Within | Resume |
| Mem. Protection | Sync | Coerced | Non-mask | Within | Resume |
| Undefined Inst | Sync | Coerced | Non-mask | Within | Terminate |
| HW Error | Async | Coerced | Non-mask | Within | Terminate |
| Power Failure | Async | Coerced | Non-mask | Within | Terminate |

6 F00: 6 CS/EE 5810 CS/EE 6810 Biggest Problem: Within and Resume For DLX these tend to occur in the EX or MEM stage (i.e. late in the pipe) Pipeline must be shut down safely –PC must be saved so restart point is known –If restart is branch, it will need to be re-executed –Which means condition must not change Steps (in DLX) –Force TRAP instruction in pipe –Kill all following instructions (i.e. prevent state updates) –Let all preceding instructions finish if they can –Save the restart PC value (faulting inst. or faulting inst. + 1) –Let the OS handle the exception »TRAP says where the handler code lives
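
The DLX steps above can be sketched as a small model. This is only an illustration, not the actual hardware behaviour: the names `Instr` and `take_exception` are hypothetical, instructions are assumed to be 4 bytes wide, and the faulting instruction is assumed to be replaced by the TRAP so it makes no state updates either.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    pc: int
    name: str
    killed: bool = False  # True means the instruction may not update state


def take_exception(in_flight, faulting_index, restart_after=False):
    """Model the shutdown steps for a fault raised by in_flight[faulting_index].

    in_flight is ordered oldest first (program order).
    Returns (instructions allowed to finish, saved restart PC).
    """
    faulting = in_flight[faulting_index]
    # Kill the faulting instruction (replaced by TRAP) and everything younger.
    for instr in in_flight[faulting_index:]:
        instr.killed = True
    # All preceding (older) instructions are allowed to finish.
    finished = list(in_flight[:faulting_index])
    # Save the restart PC: the faulting instruction or the one after it
    # (assumption: fixed 4-byte instructions).
    restart_pc = faulting.pc + 4 if restart_after else faulting.pc
    return finished, restart_pc


pipe = [Instr(100, "ADD"), Instr(104, "LW"), Instr(108, "SUB"), Instr(112, "OR")]
finished, restart_pc = take_exception(pipe, faulting_index=1)
print([i.name for i in finished], restart_pc)
```

Whether the restart PC is the faulting instruction or its successor depends on the exception type: a page fault re-executes the faulting instruction, while an OS service request resumes after it.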

7 F00: 7 CS/EE 5810 CS/EE 6810 Making things harder Consider delayed branches A single restart PC isn’t enough Assume we have 2 branch delay slots –Branch is fine, and in this case is “taken” –First delay slot causes a page fault –Second slot is killed –Exception is handled and default restart is first delay slot –Second slot instruction is executed –Then the next instruction following the slot is executed… –OOPS! No branch! »This is a side effect of the effective instruction reordering due to the delayed branch Hence, must save (number of delay slots + 1) PCs
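
A minimal sketch of why one restart PC is not enough with delayed branches. The addresses (branch at 100, delay slots at 104 and 108, branch target at 200), the 4-byte instruction size, and the helper `resume_sequence` are all assumptions made for illustration.

```python
def resume_sequence(saved_pcs, count):
    """Replay the saved PCs in order, then continue sequentially
    from the last saved PC (assumption: 4-byte instructions)."""
    seq = list(saved_pcs)
    pc = seq[-1]
    while len(seq) < count:
        pc += 4
        seq.append(pc)
    return seq[:count]


# Fault in the first delay slot (104) of a taken branch to 200.
single_pc = resume_sequence([104], 4)            # only one PC saved
three_pcs = resume_sequence([104, 108, 200], 4)  # delay slots + 1 PCs saved

print(single_pc)  # falls through past the slots: the branch is lost
print(three_pcs)  # second slot, then the branch target, as intended
```

With a single saved PC the restarted stream runs straight on to 112, which is the "OOPS! No branch!" case; saving three PCs records the taken target explicitly.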

8 F00: 8 CS/EE 5810 CS/EE 6810 Precise Interrupts 1. All instructions before the fault complete 2. All instructions after the fault can be restarted from scratch Note the assumption that the faulting instruction doesn’t change state –In some cases this can be relaxed, while in others it will be a requirement for precise interrupts to work Example: Floating point exceptions –Longer pipeline, may have written result before fault is known –Particularly bad if the destination was also a source… –Hence, must save the original operands as part of the argument stream passed to the exception handler »I.e. enough state to reconstruct things after all possible exceptions

10 F00: 10 CS/EE 5810 CS/EE 6810 Precise and Non-Precise Modes Typical in today’s high-performance processors –E.g. Alpha 21164, MIPS R8000, R10000, Power-3 –Precise mode is as much as 10x slower –Biggest source of the problem is the FPU, and out of order completion –Hence, in precise mode overlap (i.e. pipelining) is constrained –Result is LOTS of bubbles Use precise mode when debugging –Also a requirement in many systems – e.g. IEEE FP standard handlers, virtual memory support, OS interfaces… –Not too difficult for the integer pipe anyway… Use non-precise mode when you think your code works »Done laughing yet?

11 F00: 11 CS/EE 5810 CS/EE 6810 Even for DLX Exceptions can happen anywhere IF –Page fault, misaligned address, memory protection violation ID –Undefined or illegal opcode EX –Arithmetic exception MEM –Page fault, misaligned address, memory protection violation WB –None… So, within any 1 clock cycle, 4 exceptions could occur!

12 F00: 12 CS/EE 5810 CS/EE 6810 Making DLX Precise Exception order may not match pipeline order But, we must take them in pipeline order to be precise –Currently, program order = pipeline order for DLX Committing an instruction –When an instruction guarantees to complete, it commits –In the DLX, this happens at the end of the MEM stage Prior to commit, carry exception state –Exception type, restart PC value pass through pipe –ALL destructive writes (memory or RF) are deferred until commit point Easy for DLX, hard for VAX (surprised?) –VAX HW must save back-out state (I.e. undo autoinc, etc.)
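
A small sketch of the difference between the order in which exceptions are raised and the order in which a precise machine must take them. It assumes an ideal 5-stage pipe with no stalls, where instruction i (counting from 0) is in IF during cycle i + 1; the function names are hypothetical.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def first_exception_in_time(faults):
    """faults maps instruction index -> stage in which it faults.
    Returns the instruction whose fault is raised earliest in time."""
    def raise_cycle(i):
        return i + 1 + STAGES.index(faults[i])
    return min(faults, key=lambda i: (raise_cycle(i), i))


def first_exception_in_program_order(faults):
    """Precise handling: the exception flag rides down the pipe with its
    instruction, and the oldest flagged instruction is taken at commit."""
    return min(faults)


# Instruction 0 page-faults in MEM (cycle 4);
# instruction 1 page-faults in IF (cycle 2).
faults = {0: "MEM", 1: "IF"}
print(first_exception_in_time(faults))           # the younger instruction
print(first_exception_in_program_order(faults))  # the older instruction
```

Because destructive writes are deferred until the commit point, ignoring the early fault from instruction 1 until instruction 0 has been dealt with costs no correctness.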

13 F00: 13 CS/EE 5810 CS/EE 6810 Possible Design Decision Problems Decisions that complicate precise exception handling Early register changes –VAX auto-increment, auto-decrement address modes Iterative instructions –IBM 360: block memory move –How much moved before the fault? Use of registers as working storage –80x86 string instructions Condition codes, many machines use them –Problem if they can be set in multiple stages –If set early, then they must be restored on exception Multi-cycle instructions (all machines)

14 F00: 14 CS/EE 5810 CS/EE 6810 More about Multi-Cycle Instructions Abundant options in the VAX ISA This can be fixed in a good ISA design –After all, the VAX is ancient and we’re all smarter now… –Sure…. Look at some modern ISAs… Common modern multi-cycle causes: –Simple loads and stores that miss in the L1 cache –Reasonable cost FPU latencies are 5x+ of the IU –Co-processor or SFU instructions Result –Stuck with the reality of multi-cycle instructions or stages –Real complication for laminar pipeline design goals, and fast precise exception modes…

15 F00: 15 CS/EE 5810 CS/EE 6810 A Multi-Cycle DLX

16 F00: 16 CS/EE 5810 CS/EE 6810 Latency vs. Repeat Cycle Latency = number of cycles to complete –Defined to be the cycle distance between the instruction producing the value and the instructions that use that result Repeat / Initiation interval –Number of cycles that must elapse between issue of instructions of the same type

| Functional Unit | Latency | Init. Interval |
|---|---|---|
| Integer ALU | 0 | 1 |
| Data Mem (Loads) | 1 | 1 |
| FP Add/Sub | 3 | 1 |
| FP & Int Multiply | 6 | 1 |
| FP & Int. Divide & FP SQRT | 24 | |
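
The two definitions can be turned into a pair of one-line calculations. This is a sketch under the definitions on this slide: `UNITS` holds only the rows whose two values are both legible in the transcript, and the function names are made up for the example.

```python
# (latency, initiation interval) per functional unit, from the slide.
UNITS = {
    "int_alu": (0, 1),
    "load": (1, 1),
    "fp_add": (3, 1),
    "fp_mul": (6, 1),
}


def earliest_dependent_start(unit, producer_start):
    """Earliest cycle a consumer of the result can begin executing.
    Latency counts the cycles in between producer and consumer, so a
    latency of 0 means the consumer can start in the very next cycle."""
    latency, _ = UNITS[unit]
    return producer_start + 1 + latency


def earliest_same_unit_start(unit, previous_start):
    """Earliest cycle another instruction of the same type can start."""
    _, interval = UNITS[unit]
    return previous_start + interval


print(earliest_dependent_start("fp_mul", 5))  # consumer of a multiply started in cycle 5
print(earliest_same_unit_start("fp_mul", 5))  # next independent multiply
```

The gap between the two numbers is the point of the slide: a pipelined multiplier accepts a new operation every cycle, but anything that needs the product still waits out the full latency.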

17 F00: 17 CS/EE 5810 CS/EE 6810 DLX FP Pipe Note: the number of stages is 1 + latency

18 F00: 18 CS/EE 5810 CS/EE 6810 New Hazard and Forwarding Problems Structural Hazards Increase –Unpiped divide causes huge 24 cycle delays –Number of register writes in a cycle goes up »3 FPR writes possible now… WAW hazards now possible since instructions now reach WB out of order WAR hazards are still no problem since read happens early (in the ID stage) Out of order completion complicates exception handling RAW stalls will be more frequent due to longer latency instructions Was it worth it? How would you determine this?

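19 F00: 19 CS/EE 5810 CS/EE 6810 New Structural Hazard Source Scan columns for common resource requirements –At cycle 10, 3 requirements for MEM –At cycle 11, 3 requirements for RF Write

| Instruction | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| MULTD F0, F4, F6 | IF | ID | M1 | M2 | M3 | M4 | M5 | M6 | M7 | MEM | WB |
| … | | IF | ID | EX | MEM | WB | | | | | |
| … | | | IF | ID | EX | MEM | WB | | | | |
| ADDD F2, F4, F6 | | | | IF | ID | A1 | A2 | A3 | A4 | MEM | WB |
| … | | | | | IF | ID | EX | MEM | WB | | |
| … | | | | | | IF | ID | EX | MEM | WB | |
| LD F8, 0(R2) | | | | | | | IF | ID | EX | MEM | WB |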
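
"Scanning the columns" can be mechanised. The sketch below assumes one instruction is fetched per cycle with no stalls, and that MULTD, ADDD and everything else spend 7, 4 and 1 cycles in execute respectively, matching the M1-M7, A1-A4 and EX stages in the table; the helper names are hypothetical.

```python
from collections import Counter

# Number of execute stages per instruction kind (from the table above).
EX_STAGES = {"MULTD": 7, "ADDD": 4, "other": 1}


def mem_and_wb_cycles(issue_cycle, kind):
    """Cycles in which an instruction fetched in issue_cycle uses MEM and WB."""
    ex_start = issue_cycle + 2          # IF, ID, then first execute stage
    mem = ex_start + EX_STAGES[kind]    # MEM follows the last execute stage
    return mem, mem + 1


def resource_demand(kinds):
    """Count how many instructions want MEM and the RF write port per cycle."""
    mem_use, wb_use = Counter(), Counter()
    for i, kind in enumerate(kinds):
        mem, wb = mem_and_wb_cycles(i + 1, kind)
        mem_use[mem] += 1
        wb_use[wb] += 1
    return mem_use, wb_use


# The seven instructions of the table; LD and the "…" rows are 1-cycle EX.
stream = ["MULTD", "other", "other", "ADDD", "other", "other", "other"]
mem_use, wb_use = resource_demand(stream)
print(mem_use[10], wb_use[11])
```

Any cycle whose count exceeds the number of ports on that resource is a structural hazard that the issue logic has to resolve.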

20 F00: 20 CS/EE 5810 CS/EE 6810 Dealing with Structural Hazards Consider a single write-port FPR in the previous example Option 1: –Keep track of issued instructions and when they will write-back to the FPR –Stall instruction in ID if there’s a collision –Just takes a 1-bit, 25-deep shift register for all 3 pipes Option 2: –Stall instructions at MEM entry »May also want to give preference to longest latency »Longest is most likely to cause RAW stalls anyway –Problem is that the control path has to go all the way back to the front of the pipe which is costly!
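
Option 1 can be sketched as a reservation shift register for the single FPR write port. The class name, method names, and the use of a Python integer as the bit vector are illustrative assumptions; only the 1-bit, 25-deep shape comes from the slide.

```python
class WritePortReservation:
    """Bit i set means the write port is claimed i cycles from now."""

    def __init__(self, depth=25):
        self.depth = depth
        self.bits = 0

    def try_issue(self, cycles_until_wb):
        """Called in ID. Returns False if the instruction must stall."""
        if not 0 <= cycles_until_wb < self.depth:
            raise ValueError("write-back distance exceeds register depth")
        mask = 1 << cycles_until_wb
        if self.bits & mask:
            return False        # collision: another instruction writes then
        self.bits |= mask       # reserve the port for that future cycle
        return True

    def tick(self):
        """Advance one clock: every reservation moves one cycle closer."""
        self.bits >>= 1


port = WritePortReservation()
print(port.try_issue(9))   # multiply in ID, writes back 9 cycles later
for _ in range(3):
    port.tick()
print(port.try_issue(6))   # add in ID 3 cycles later, same write-back cycle
port.tick()
print(port.try_issue(6))   # after a one-cycle stall the slot is free
```

The check happens entirely in ID, so the stall decision stays at the front of the pipe, which is the attraction of this option.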

21 F00: 21 CS/EE 5810 CS/EE 6810 Consider RAW Hazard Stalls Long latency pipes cause stall frequency to go up Finally, on cycle 16 SD gets to enter MEM EX doesn’t need to stall since it’s the EFA calc which uses R2 –Note that figure 3.46 in the text is wrong
Stage sequence per instruction over cycles 1-15 (St = stall; a single St entry may cover several consecutive cycles):
LD F4, 0(R2): IF ID EX M WB
MULTD F0, F2, F5: IF ID St M1 M2 M3 M4 M5 M6 M7 M WB
ADDD F2, F0, F8: IF St ID St A1 A2 A3 A4
SD F2, 0(R2): IF St ID EX St

22 F00: 22 CS/EE 5810 CS/EE 6810 New Hazard Sources Dependencies between GPR and FPR –MOVI2FP and MOVFP2I instructions Avoid the new ones with ID stage issue checks –Structural »Repeat interval check »Make sure register write port will be available when needed –RAW »List all pending destination registers »Don’t issue a source from a pending destination until it clears (value is available via forwarding logic) –WAW »Use same list of pending destination registers »Don’t issue a new pending destination which matches an existing one until it has finished WB »Can we do better? (I.e. like forwarding)
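
The RAW and WAW issue checks share one list of pending destination registers, which can be sketched as follows. This is a simplification: it ignores forwarding (a source is treated as unavailable until its producer leaves the pending set) and the structural checks, and `can_issue` is a hypothetical name.

```python
def can_issue(sources, dest, pending):
    """ID-stage check against the set of pending destination registers.

    Returns (ok, reason), where reason names the hazard that blocks issue.
    """
    # RAW: a source is still being produced by an earlier instruction.
    if any(src in pending for src in sources):
        return False, "RAW"
    # WAW: an earlier instruction has yet to write the same destination.
    if dest is not None and dest in pending:
        return False, "WAW"
    return True, None


pending = {"F0"}  # e.g. a multiply writing F0 is still in flight
print(can_issue(["F0", "F8"], "F2", pending))  # reads F0
print(can_issue(["F4"], "F0", pending))        # writes F0
print(can_issue(["F4", "F6"], "F2", pending))  # independent
```

When an instruction issues, its destination is added to `pending`; it is removed once the value is available (or, for the WAW case on the slide, once WB has finished).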

23 F00: 23 CS/EE 5810 CS/EE 6810 Precise Exceptions and Long Pipes Consider: DIVF F0, F2, F4; ADDF F10, F10, F8; SUBF F12, F12, F14 Piece of cake! No dependencies! Unfortunately, wrong… –Both ADDF and SUBF will complete before DIVF –What happens if DIVF causes an exception… Ideas?

24 F00: 24 CS/EE 5810 CS/EE 6810 Four Precision Possibilities Punt on precision –Old supercomputer trick –Not really viable today with IEEE standard exception handling and virtual memory Buffer Something – two variants –Future File »Buffer results at commit – post them in program order –History File »Post ASAP »Buffer original operands and roll-back to proper state if an exception occurs –Forwarding still required »So both get more expensive as pipelines get longer
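
A minimal sketch of the history-file idea: results are posted as soon as they are ready, while the overwritten values are logged so state can be rolled back. The class and method names are hypothetical, `seq` is an assumed program-order sequence number, and WAW hazards are assumed to have been prevented at issue.

```python
class HistoryFile:
    def __init__(self, regs):
        self.regs = dict(regs)
        self.log = []  # (seq, register, old value), in completion order

    def write(self, seq, reg, value):
        """Post a result immediately, remembering what it overwrote."""
        self.log.append((seq, reg, self.regs[reg]))
        self.regs[reg] = value

    def rollback(self, faulting_seq):
        """Undo, newest first, every write made by the faulting
        instruction or by anything after it in program order."""
        remaining = []
        for seq, reg, old in reversed(self.log):
            if seq >= faulting_seq:
                self.regs[reg] = old
            else:
                remaining.append((seq, reg, old))
        remaining.reverse()
        self.log = remaining


# A long divide (seq 0) is followed by an add (seq 1) and a subtract
# (seq 2) that finish first; then the divide raises an exception.
hf = HistoryFile({"F0": 1.0, "F10": 2.0, "F12": 3.0})
hf.write(1, "F10", 9.0)
hf.write(2, "F12", 8.0)
hf.rollback(faulting_seq=0)
print(hf.regs)
```

A future file inverts the arrangement: the buffered copy holds the new results and the architectural registers are only updated in program order.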

25 F00: 25 CS/EE 5810 CS/EE 6810 Possibilities 3 and 4 Go imprecise with SW fixup –Keep enough state around to do fixup –Let SW emulate the instructions that are not yet finished but prior to the excepting instruction –In general this is too hard »If the only uncompleted ones are FP, then it’s more tractable Stall issue until previous instructions commit –Move commit point as far forward as possible in the pipe »Which, realistically, isn’t that far –Used by MIPS R4k and Pentium

26 F00: 26 CS/EE 5810 CS/EE 6810 DLX Pipeline Performance Stalls per FP operation

27 F00: 27 CS/EE 5810 CS/EE 6810 Total Stalls / Instruction

28 F00: 28 CS/EE 5810 CS/EE 6810 DLX FP Performance Averages Basis: SPEC FP Benchmarks Stalls per FP operation = approx. 50% of latency –1.7 cycles for add/subtract/convert (56% of 3 cycle latency) –2.8 for multiply (or 40% of the 7 cycle latency) –14.2 for divide (57% of the 25 cycle latency) Total Stalls –Varies with application –Range from 0.65 (su2cor) to 1.21 (doduc) –Average over SPEC benchmarks is 0.87 / instruction »Note that this is for all instructions, hence CPI = Ideal CPI + 0.87 –Major contributor is RAW result wait
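
The arithmetic behind these averages is simple enough to check directly. The sketch assumes an ideal CPI of 1 for the scalar DLX pipe, which the slide does not state; the function names are made up.

```python
def stall_fraction(stalls_per_op, latency):
    """Average stall cycles expressed as a fraction of the unit's latency."""
    return stalls_per_op / latency


def effective_cpi(stalls_per_instruction, ideal_cpi=1.0):
    """CPI = ideal CPI + average stall cycles per instruction."""
    return ideal_cpi + stalls_per_instruction


for name, stalls, latency in [("add", 1.7, 3), ("multiply", 2.8, 7), ("divide", 14.2, 25)]:
    print(name, round(stall_fraction(stalls, latency), 2))

print(round(effective_cpi(0.87), 2))
```

The add figure comes out at about 0.57, which the slide quotes as 56%; the other two match the quoted 40% and 57%.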

29 F00: 29 CS/EE 5810 CS/EE 6810 ISA Design and Pipeline Complexity Things you can do wrong Variable instruction lengths and CPIs –Imbalance will cause stall frequency to increase –Caches do this, but performance offsets the cost usually Sophisticated address modes –Trashing registers during EFA calculation means state must be saved »I.e. auto-increment or auto-decrement mode Permit self-modifying code (e.g. 80x86) –What if the instruction in the pipe is overwritten? –Then restarting becomes tricky: the old value must be saved –80x86 takes extra undecoded instruction all the way to commit Non-uniform implicitly set condition codes –Newer machines use uniform set stage –Plus explicit bit in the instruction to enable set-cc

