Transient Fault Detection via Simultaneous Multithreading Shubhendu S. Mukherjee VSSAD, Alpha Technology Compaq Computer Corporation.

Slide 1: Transient Fault Detection via Simultaneous Multithreading
Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Technology, Compaq Computer Corporation, Shrewsbury, Massachusetts
Steven K. Reinhardt (stever@eecs.umich.edu), Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, Michigan
27th Annual International Symposium on Computer Architecture (ISCA), 2000

Slide 2: Transient Faults
- Faults that persist for a "short" duration
- Cause: cosmic rays (e.g., neutrons)
- Effect: knock off electrons, discharging capacitors
- No practical way to absorb cosmic rays; estimated fault rate of 1 fault per 1,000 computers per year
- The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, and reduced noise margins

Slide 3: Fault Detection in the Compaq Himalaya System
- Replicated microprocessors + cycle-by-cycle lockstepping: both microprocessors execute the same instruction stream (e.g., R1 ← (R2))
- Input replication delivers identical inputs to both; output comparison checks their results
- Outside the processors: memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC

Slide 4: Fault Detection via Simultaneous Multithreading
- Same system picture, but: can redundant threads replace the replicated, cycle-by-cycle lockstepped microprocessors?
- Memory covered by ECC, RAID array covered by parity, ServerNet covered by CRC

Slide 5: Simultaneous Multithreading (SMT)
- Multiple threads (e.g., Thread 1 and Thread 2) share one processor's instruction scheduler and functional units
- Example: Alpha 21464

Slide 6: Simultaneous & Redundantly Threaded Processor (SRT)
SRT = SMT + fault detection
+ Less hardware than replicated microprocessors: SMT needs ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to an existing SMT
+ Better performance than complete replication: better use of resources
+ Lower cost: avoids complete replication, and benefits from the market volume of SMT & SRT

Slide 7: SRT Design Challenges
- Lockstepping doesn't work: SMT may issue the same instruction from redundant threads in different cycles
- Must carefully fetch/schedule instructions from redundant threads (branch mispredictions, cache misses)
- Disclaimer: this talk focuses only on fault detection, not recovery

Slide 8: Contributions & Outline
- Sphere of Replication (SoR)
- Output comparison for SRT
- Input replication for SRT
- Performance optimizations for SRT
- SRT outperforms on-chip replicated microprocessors
- Related work
- Summary

Slide 9: Sphere of Replication (SoR)
- Logical boundary of redundant execution within a system: two execution copies inside the sphere, with input replication and output comparison at its boundary to the rest of the system
- Trade-off between information, time, & space redundancy

Slide 10: Compaq Himalaya Example Spheres of Replication
- Himalaya: the entire microprocessor lies inside the SoR; memory (ECC), RAID array (parity), and ServerNet (CRC) lie outside
- ORH-Dual (On-Chip Replicated Hardware, similar to the IBM G5): two pipelines inside the SoR; the instruction and data caches, covered by ECC, lie outside

Slide 11: Sphere of Replication for SRT
- Thread 0 and Thread 1 (e.g., both executing R1 ← (R2); R3 = R1 + R7; R8 = R7 * 2) share the fetch stage, instruction cache, decode, register rename, register files, RUU, functional units, and data cache
- The SoR excludes the instruction and data caches
- Alternate SoRs are possible (e.g., excluding the register file) but are not covered in this talk

Slide 12: Output Comparison in SRT
- Compare & validate output before sending it outside the SoR

Slide 13: Output Comparison
- Stores from redundant threads: compare & validate at commit time, before the store leaves the store queue for the data cache
- Uncached loads from redundant threads: also require output comparison
- Cached loads from redundant threads: output comparison not required
- Other output comparisons depend on the boundary of the SoR
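The commit-time store check can be sketched in a few lines. This is a hypothetical software model, not the paper's hardware design: the `StoreComparator` class and its method names are invented for illustration. A store from the leading thread waits in a queue until the trailing thread produces its copy; only a matching (address, value) pair is allowed to leave the sphere of replication for the data cache.

```python
from collections import deque

class StoreComparator:
    """Model of SRT commit-time output comparison for stores.
    Leading-thread stores wait until the trailing thread produces the
    corresponding store; address and value must match exactly."""

    def __init__(self):
        self.pending = deque()  # leading-thread stores awaiting a match

    def leading_store(self, addr, value):
        # Leading thread commits a store; hold it for comparison.
        self.pending.append((addr, value))

    def trailing_store(self, addr, value):
        # Trailing thread commits the same (program-order) store.
        lead_addr, lead_value = self.pending.popleft()
        if (lead_addr, lead_value) != (addr, value):
            raise RuntimeError("transient fault detected: store mismatch")
        return (addr, value)  # validated store may exit the SoR

# A matching pair passes; a corrupted value is caught.
cmp = StoreComparator()
cmp.leading_store(0x1000, 42)
assert cmp.trailing_store(0x1000, 42) == (0x1000, 42)
```

A mismatch on either the address or the data signals a fault; what happens next (recovery) is outside the scope of this talk.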

Slide 14: Input Replication in SRT
- Replicate & deliver the same input (coming from outside the SoR) to the redundant execution copies

Slide 15: Input Replication
- Cached load data:
  - Pairing loads from redundant threads: too slow
  - Allowing both loads to probe the cache: false faults with I/O or multiprocessors
- Load Value Queue (LVQ):
  - Pre-designated leading & trailing threads
  - The leading thread probes the cache; the trailing thread reads the load value from the LVQ

Slide 16: Input Replication (contd.)
- Cached load data, alternate solution: the Active Load Address Buffer (ALAB)
- Special cases: cycle- or time-sensitive instructions, external interrupts
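The LVQ idea from Slide 15 can be sketched as follows. This is a minimal software model with invented names (`LoadValueQueue`, `leading_load`, `trailing_load`), not the paper's RTL: only the leading thread probes the data cache, and the trailing thread consumes the same (address, value) pair from the queue, so both threads see identical load inputs even if memory changes between the two loads.

```python
from collections import deque

class LoadValueQueue:
    """Model of SRT input replication for cached loads: the leading
    thread's load value is forwarded to the trailing thread."""

    def __init__(self, cache):
        self.cache = cache   # dict standing in for the data cache
        self.queue = deque()

    def leading_load(self, addr):
        value = self.cache[addr]        # the only cache probe
        self.queue.append((addr, value))
        return value

    def trailing_load(self, addr):
        lead_addr, value = self.queue.popleft()
        if lead_addr != addr:
            raise RuntimeError("transient fault: load address mismatch")
        return value                    # replicated input, no cache probe

# An intervening write (e.g., by I/O or another processor) does not
# cause a false fault, because the trailing thread never re-probes.
lvq = LoadValueQueue(cache={0x2000: 7})
v1 = lvq.leading_load(0x2000)
lvq.cache[0x2000] = 9
v2 = lvq.trailing_load(0x2000)
assert v1 == v2 == 7
```

If both threads had probed the cache directly, the intervening write would have made their load values differ and triggered a spurious fault report.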

Slide 17: Outline
- Sphere of Replication (SoR)
- Output comparison for SRT
- Input replication for SRT
- Performance optimizations for SRT
- SRT outperforms on-chip replicated microprocessors
- Related work
- Summary

Slide 18: Performance Optimizations
- Slack fetch: maintain a constant slack of instructions between the leading and trailing threads
  + The leading thread prefetches cache misses
  + The leading thread prefetches correct branch outcomes
- Branch outcome queue: feed branch outcomes from the leading thread to the trailing thread
- Combine the two optimizations
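The branch outcome queue can be sketched the same way as the LVQ. Again a hypothetical model (the `BranchOutcomeQueue` class and its methods are invented for illustration): the leading thread pushes each resolved branch outcome, and the trailing thread pops it in program order in place of a prediction, so the trailing thread never mispredicts.

```python
from collections import deque

class BranchOutcomeQueue:
    """Model of the SRT branch outcome queue: resolved outcomes flow
    from the leading thread to the trailing thread in program order."""

    def __init__(self):
        self.outcomes = deque()

    def leading_resolve(self, pc, taken, target):
        # Leading thread resolves a branch and forwards the outcome.
        self.outcomes.append((pc, taken, target))

    def trailing_fetch(self, pc):
        # Trailing thread uses the forwarded outcome instead of a
        # prediction, so it never pays a misprediction penalty.
        lead_pc, taken, target = self.outcomes.popleft()
        if lead_pc != pc:
            raise RuntimeError("transient fault: branch PC mismatch")
        return taken, target

boq = BranchOutcomeQueue()
boq.leading_resolve(pc=0x400, taken=True, target=0x480)
assert boq.trailing_fetch(0x400) == (True, 0x480)
```

Slack fetch complements this: keeping the trailing thread a fixed number of instructions behind gives the leading thread time to resolve branches and service cache misses before the trailing thread needs them.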

Slide 19: Baseline Architecture Parameters
- L1 instruction cache: 64 KB, 4-way associative, 32-byte blocks, single-ported
- L1 data cache: 64 KB, 4-way associative, 32-byte blocks, four read/write ports
- Unified L2 cache: 1 MB, 4-way associative, 64-byte blocks
- Branch predictor: hybrid local/global (like the 21264); 13-bit global history register indexing an 8K-entry global PHT and an 8K-entry choice table; 2K 11-bit local history registers indexing a 2K-entry local PHT; 4K-entry BTB; 16-entry RAS (per thread)
- Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
- Function units: 6 integer ALUs, 2 integer multipliers, 4 FP adders, 2 FP multipliers
- Fetch-to-decode latency: 5 cycles; decode-to-execution latency: 10 cycles

Slide 20: Target Architectures
- SRT: SMT + fault detection, with output comparison, input replication (load value queue), and slack fetch + branch outcome queue
- ORH-Dual (On-Chip Replicated Hardware): each pipeline of the dual has half the resources of SRT; the two pipelines share the fetch stage (including the branch predictor)

Slide 21: Performance Model & Benchmarks
- SimpleScalar 3.0, modified to support SMT by Steve Raasch, U. of Michigan; the SMT/SimpleScalar model further modified to support SRT
- Benchmarks: a subset of the SPEC95 suite (11 benchmarks), compiled with gcc 2.6 + full optimization
- For each benchmark, skipped between 300 million and 20 billion instructions, then simulated 200 million instructions

Slide 22: SRT vs. ORH-Dual
- Average improvement = 16%, maximum = 29%

Slide 23: Recent Related Work
- Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998: first to propose the use of SMT for fault detection
- AR-SMT, Rotenberg, FTCS, 1999: forwards values from the leading thread to a checker thread
- DIVA, Austin, MICRO, 1999: converts the checker thread into a simple processor

Slide 24: Improvements over Prior Work
- Sphere of Replication (SoR): e.g., AR-SMT's register file must be augmented with ECC; DIVA must handle uncached loads in a special way
- Output comparison: AR-SMT & DIVA compare all instructions; SRT compares only selected ones, based on the SoR
- Input replication: AR-SMT & DIVA can detect false transient faults; SRT avoids this problem using the LVQ
- Slack fetch

Slide 25: Summary
- Simultaneous & Redundantly Threaded Processor (SRT) = SMT + fault detection
- Sphere of Replication: output comparison of committed store instructions; input replication via the load value queue
- Slack fetch & branch outcome queue
- SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average and up to 29%

