
1 CS717 1 Hardware Fault Tolerance Through Simultaneous Multithreading (part 2) Jonathan Winter

2 CS717 2 3 SMT + Fault Tolerance Papers Eric Rotenberg, "AR-SMT - A Microarchitectural Approach to Fault Tolerance in Microprocessors", Symposium on Fault-Tolerant Computing, 1999. Steven K. Reinhardt and Shubhendu S. Mukherjee, "Transient Fault Detection via Simultaneous Multithreading", ISCA 2000. Shubhendu S. Mukherjee, Michael Kontz and Steven K. Reinhardt, "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002.

3 CS717 3 Outline 1. Background SMT Hardware fault tolerance 2. AR-SMT Basic mechanisms Implementation issues Simulation and Results 3. Transient Fault Detection via SMT Sphere of replication Basic mechanisms Comparison to AR-SMT Simulation and Results 4. Redundant Multithreading Alternatives Realistic processor implementation CRT Simulation and Results 5. Fault Recovery 6. Next Lecture

4 CS717 4 Transient Fault Detection via SMT More detailed analysis of Simultaneous and Redundant Threading (SRT) Introduces Sphere of Replication concept Explores SRT design space Discussion of input replication Architecture for output comparison Performance improving mechanisms More depth in simulation

5 CS717 5 Sphere of Replication Components inside sphere are protected against faults using replication External components must use other means of fault tolerance (parity, ECC, etc.) Inputs to sphere must be duplicated for each of the redundant processes Outputs of the redundant processes are compared to detect faults Simple to understand in lockstepping Larger sphere –more state to replicate –less input replication and output comparison

6 CS717 6 Sphere of Replication (part 2) Size of sphere of replication –Two alternatives – with and without register file –Instruction and data caches kept outside

7 CS717 7 Input Replication Must ensure that both threads receive the same inputs to guarantee they follow the same path Instructions – assume no self-modification Cached load data –Out-of-order execution issue –Multiprocessor cache coherence issues Uncached load data – must synchronize External interrupts –Stall leading thread and deliver interrupt synchronously –Record interrupt delivery point and deliver later

8 CS717 8 Cached Load Data - ALAB Active Load Address Buffer (ALAB) –Delays cache block replacement or invalidation –ALAB is a table with an address tag, a counter, and a pending-invalidate bit –Counter tracks the trailing thread’s outstanding loads –Blocks cannot be replaced or invalidated until the counter is zero –Pending-invalidate bit set on an unevictable block –Leading thread stalls when the ALAB is full –Must detect and resolve deadlocks
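
A minimal sketch in C of the ALAB bookkeeping described on this slide. The entry fields mirror the slide (address tag, counter, pending-invalidate bit), but the function names and the eviction-check interface are illustrative assumptions, not taken from the paper.

```c
/* Hypothetical ALAB bookkeeping sketch; field and function names are
 * illustrative, not from the paper. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t tag;                 /* address tag of the cached block           */
    unsigned outstanding;         /* leading-thread loads not yet repeated by  */
                                  /* the trailing thread                       */
    bool     pending_invalidate;  /* eviction/invalidation deferred until the  */
                                  /* counter reaches zero                      */
    bool     valid;
} alab_entry_t;

/* Leading thread performs a load to this block: bump the counter.
 * (Entry allocation is elsewhere; if no entry is free, the leading thread stalls.) */
void alab_on_leading_load(alab_entry_t *e)  { e->outstanding++; }

/* Trailing thread repeats the load: drop the counter, possibly unblocking a
 * deferred replacement or invalidation. */
void alab_on_trailing_load(alab_entry_t *e) { if (e->outstanding) e->outstanding--; }

/* Cache asks whether this block may be replaced or invalidated right now. */
bool alab_may_evict(alab_entry_t *e) {
    if (!e->valid || e->outstanding == 0) return true;
    e->pending_invalidate = true;   /* remember the request and defer it */
    return false;
}
```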

9 CS717 9 Cached Load Data - LVQ Load Value Queue (LVQ) –Explicit designation of leading and trailing threads –Only the leading thread issues loads and stores –Load addresses and values are forwarded to the trailing thread via the LVQ –Trailing thread executes loads in-order and non-speculatively (why?) –Input replication guaranteed –Simpler design and less pressure on the cache –Earlier fault detection –Constrains scheduling of the trailing thread’s loads
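
A minimal C sketch of the LVQ idea as described on the slide: the leading thread pushes each load’s address and value into a FIFO, and the trailing thread consumes entries in program order instead of accessing the cache. The queue size and all names are illustrative assumptions.

```c
/* Illustrative Load Value Queue sketch; the 64-entry size and names are assumptions. */
#include <stdbool.h>
#include <stdint.h>

#define LVQ_SIZE 64

typedef struct { uint64_t addr, value; } lvq_entry_t;

static lvq_entry_t lvq[LVQ_SIZE];
static unsigned    lvq_head, lvq_tail, lvq_count;

/* Leading thread: forward a load's address and value.
 * Returns false (leading thread stalls) if the queue is full. */
bool lvq_push(uint64_t addr, uint64_t value) {
    if (lvq_count == LVQ_SIZE) return false;
    lvq[lvq_tail].addr  = addr;
    lvq[lvq_tail].value = value;
    lvq_tail = (lvq_tail + 1) % LVQ_SIZE;
    lvq_count++;
    return true;
}

/* Trailing thread: execute the load in order, non-speculatively, using the
 * forwarded value; an address mismatch gives early fault detection. */
bool lvq_pop(uint64_t expected_addr, uint64_t *value_out, bool *fault_out) {
    if (lvq_count == 0) return false;          /* wait for the leading thread */
    lvq_entry_t e = lvq[lvq_head];
    lvq_head = (lvq_head + 1) % LVQ_SIZE;
    lvq_count--;
    *fault_out = (e.addr != expected_addr);    /* mismatch => fault detected  */
    *value_out = e.value;
    return true;
}
```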

10 CS717 10 Output Comparison Store buffer used to verify address and value of stores to be committed –Trailing thread searches for matching entry –Mismatch means fault occurred Cached load values require no checking Uncached load values could have side effects –Issue non-speculatively, so stall leading thread –Assumes uncached loads are always detected Register Check Buffer used to match register writebacks. –3 register files required: future files + architectural file
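
A short C sketch of the store-comparison step described above: the leading thread’s buffered store is checked against the trailing thread’s retiring store before anything leaves the sphere of replication. Names are illustrative assumptions.

```c
/* Illustrative output-comparison check for stores; names are assumptions. */
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint64_t addr, value; } store_entry_t;

/* Called when the trailing thread retires a store and has found the matching
 * leading-thread entry in the store buffer: any mismatch signals a detected
 * fault; a match allows the store to commit to the cache outside the sphere. */
bool check_and_commit_store(const store_entry_t *lead,
                            uint64_t trail_addr, uint64_t trail_value) {
    if (lead->addr != trail_addr || lead->value != trail_value) {
        /* fault detected: raise a machine-check / start recovery */
        return false;
    }
    /* safe to release the store to memory */
    return true;
}
```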

11 CS717 11 Enhancing SRT Performance Slack Fetch –Maintain a constant lag between the two threads’ execution –Leading thread updates branch and data predictors –Leading thread prefetches loads –Traditional SMT ICount fetch policy is modified to maintain the slack Branch Outcome Queue –Delivers branch outcomes directly to the trailing thread –Trailing thread has no control speculation
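
A combined C sketch of slack fetch and the branch outcome queue, following the slide’s description: fetch favors the trailing thread only once the leading thread is far enough ahead, and resolved branch outcomes are forwarded so the trailing thread never speculates on control flow. The counters, the 256-instruction slack, and the 128-entry queue are illustrative assumptions.

```c
/* Illustrative slack-fetch + branch outcome queue (BOQ) sketch;
 * sizes and names are assumptions, not from the paper. */
#include <stdbool.h>
#include <stdint.h>

#define SLACK    256
#define BOQ_SIZE 128

static uint64_t lead_count, trail_count;   /* instructions fetched per thread */

static struct { uint64_t target; bool taken; } boq[BOQ_SIZE];
static unsigned boq_head, boq_tail, boq_entries;

/* Modified ICount-style arbitration: fetch for the trailing thread only after
 * the leading thread has built up the slack, so it has already warmed the
 * branch and data predictors and prefetched the loads. */
bool fetch_trailing_this_cycle(void) {
    return (lead_count - trail_count) >= SLACK;
}

/* Leading thread resolves a branch: forward the outcome to the trailer. */
bool boq_push(uint64_t target, bool taken) {
    if (boq_entries == BOQ_SIZE) return false;      /* queue full: stall       */
    boq[boq_tail].target = target;
    boq[boq_tail].taken  = taken;
    boq_tail = (boq_tail + 1) % BOQ_SIZE;
    boq_entries++;
    return true;
}

/* Trailing thread reaches a branch: consume the outcome instead of predicting,
 * so it performs no control speculation at all. */
bool boq_pop(uint64_t *target, bool *taken) {
    if (boq_entries == 0) return false;             /* wait for leading thread */
    *target = boq[boq_head].target;
    *taken  = boq[boq_head].taken;
    boq_head = (boq_head + 1) % BOQ_SIZE;
    boq_entries--;
    return true;
}
```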

12 CS717 12 AR-SMT versus SRT AR-SMT only has space redundancy in the functional units SRT can potentially have space redundancy across the pipeline AR-SMT is trace-processor-based while SRT is conventional The register file of the R-stream must be protected AR-SMT forwards load data values AR-SMT checks every instruction during fault detection SRT requires no operating system modifications AR-SMT doesn’t support uncached loads and stores or multiprocessor coherence The delay buffer performs the function of the register check buffer and branch outcome queue All of main memory is in the AR-SMT sphere –Better fault coverage but very costly

13 CS717 13 Simulation Environment Modified SimpleScalar “sim-outorder” Long front-end pipeline because of the out-of-order nature and SMT Simple approximation of a trace cache Used 11 SPEC95 benchmarks

14 CS717 14 Results Again, this paper only analyzes the performance impact of fault tolerance Baseline Characterization –ORH-Dual – two pipelines, each with half the resources –SMT-Dual – replicated threads with no detection hardware ORH-Dual and SMT-Dual are 32% slower than SMT-Single

15 CS717 15 Slack Fetch & Branch Outcome Queue 10%, 14%, and 15% (maximum 27%) performance improvements for SF, BOQ, and SF + BOQ, respectively Reduced memory stalls through prefetching Prevents the trailing thread from wasting resources on speculation Performance is better with a slack of 256 instructions than with 32 or 128

16 CS717 16 Input Replication Assumes output comparison is performed by an oracle Almost no performance penalty paid for a 64-entry ALAB or LVQ With a 16-entry ALAB or LVQ, benchmark performance degraded by 8% and 5%, respectively

17 CS717 17 Output Comparison Assumes inputs are replicated by an oracle Leading thread can stall if the store buffer is full A 64-entry store buffer eliminates almost all stalls Register check buffers of 32, 64, and 128 entries degrade performance by 27%, 6%, and 1%, respectively

18 CS717 18 Overall Results Speedup of an SRT processor with a 256-instruction slack fetch, a 128-entry branch outcome queue, a 64-entry store buffer, and a 64-entry load value queue SRT demonstrates a 16% speedup on average (up to 29%) over a lockstepping processor with the “same” hardware

19 CS717 19 Multi-cycle and Permanent Faults Transient faults could potentially persist for multiple cycles and affect both threads Increasing the slack fetch decreases this possibility Spatial redundancy can be increased by partitioning the functional units and forcing the threads to execute on different groups Performance loss for this approach is less than 2%
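
A tiny C sketch of the space-redundancy constraint mentioned on this slide: each thread is statically bound to a disjoint group of functional units, so the two copies of an instruction never execute on the same hardware. The two-group split and the names are illustrative assumptions.

```c
/* Illustrative issue-time constraint for spatial redundancy; assumes two
 * statically partitioned functional-unit groups. */
typedef enum { FU_GROUP_A, FU_GROUP_B } fu_group_t;

/* Leading thread (id 0) issues only to group A, trailing thread (id 1) only to
 * group B, so a persistent fault in one unit can corrupt at most one copy. */
fu_group_t fu_group_for_thread(int thread_id) {
    return (thread_id == 0) ? FU_GROUP_A : FU_GROUP_B;
}
```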

20 CS717 20 Conclusions Sphere of replication helps analysis of input replication and output comparison Keep the register file in the sphere LVQ is superior to ALAB (simpler) Slack fetch and branch outcome queue mechanisms enhance performance The SRT fault tolerance method performs 16% better on average than lockstepping

