Transient Fault Detection via Simultaneous Multithreading
Shubhendu S. Mukherjee (Shubu.Mukherjee@compaq.com), VSSAD, Alpha Technology, Compaq Computer Corporation, Shrewsbury, Massachusetts
Steven K. Reinhardt (stever@eecs.umich.edu), Electrical Engineering & Computer Science, University of Michigan, Ann Arbor, Michigan
27th Annual International Symposium on Computer Architecture (ISCA), 2000
Slide 2: Transient Faults
- Faults that persist for a "short" duration
- Cause: cosmic rays (e.g., neutrons)
- Effect: knock electrons loose, discharge a capacitor
- Solution? No practical absorbent for cosmic rays
- Estimated fault rate: ~1 fault per 1000 computers per year
- The future is worse: smaller feature sizes, reduced voltages, higher transistor counts, reduced noise margins
Slide 3: Fault Detection in the Compaq Himalaya System
- Replicated microprocessors + cycle-by-cycle lockstepping
- Inputs are replicated to both microprocessors; outputs are compared
- Memory covered by ECC; RAID array covered by parity; ServerNet covered by CRC
[Figure: two lockstepped microprocessors executing the same instruction "R1 (R2)", with input replication and output comparison between them and the rest of the system]
Slide 4: Fault Detection via Simultaneous Multithreading
- Can redundant threads replace replicated, cycle-by-cycle lockstepped microprocessors?
- Memory covered by ECC; RAID array covered by parity; ServerNet covered by CRC
[Figure: the Himalaya diagram with the two microprocessors replaced by two threads, sharing input replication and output comparison]
Slide 5: Simultaneous Multithreading (SMT)
- Multiple threads (Thread1, Thread2) share a single instruction scheduler and a single set of functional units
- Example: Alpha 21464
Slide 6: Simultaneous & Redundantly Threaded Processor (SRT)
SRT = SMT + fault detection
+ Less hardware than replicated microprocessors: SMT needs only ~5% more hardware than a uniprocessor, and SRT adds very little hardware overhead to an existing SMT
+ Better performance than complete replication: better use of resources
+ Lower cost: avoids complete replication, and benefits from the market volume of SMT & SRT
Slide 7: SRT Design Challenges
- Lockstepping doesn't work: SMT may issue the same instruction from the redundant threads in different cycles
- Must carefully fetch/schedule instructions from the redundant threads around branch mispredictions and cache misses
- Disclaimer: this talk focuses only on fault detection, not recovery
Slide 8: Contributions & Outline
- Sphere of Replication (SoR)
- Output comparison for SRT
- Input replication for SRT
- Performance optimizations for SRT
- SRT outperforms on-chip replicated microprocessors
- Related work
- Summary
Slide 9: Sphere of Replication (SoR)
- The logical boundary of redundant execution within a system
- Trade-off between information, time, & space redundancy
[Figure: two execution copies inside the sphere; output comparison and input replication sit on the boundary between the sphere and the rest of the system]
Slide 10: Example Spheres of Replication
- Compaq Himalaya: the SoR encloses two complete microprocessors
- ORH-Dual (On-Chip Replicated Hardware, similar to the IBM G5): the SoR encloses two pipelines on a single chip; the instruction and data caches, covered by ECC, lie outside it
- In both, memory is covered by ECC, the RAID array by parity, and ServerNet by CRC
Slide 11: Sphere of Replication for SRT
- The SoR encloses the SMT pipeline: fetch PC, decode, register rename, integer & FP register files, integer, FP, and load/store units, and the RUU
- Excludes the instruction and data caches
- Alternate SoRs are possible (e.g., excluding the register file); not covered in this talk
[Figure: Thread 0 and Thread 1 flowing through the shared pipeline, executing instructions such as "R1 (R2)", R3 = R1 + R7, R8 = R7 * 2]
Slide 12: Output Comparison in SRT
- Compare & validate output before sending it outside the SoR
[Figure: the SoR diagram, with output comparison on the path from the execution copies to the rest of the system]
Slide 13: Output Comparison
- Stores from the redundant threads: compare & validate at commit time
- Uncached loads from the redundant threads: output comparison required
- Cached loads from the redundant threads: not required
- Other output comparisons depend on the boundary of the SoR
[Figure: a store queue holding store instructions from both threads; matching entries are compared and then sent to the data cache]
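The commit-time store comparison above can be expressed as a small model. This is a minimal Python sketch with hypothetical names (`StoreComparator`, `commit_store`); the real mechanism is a hardware store queue, but the pairing-and-compare logic is the same: a store exits the SoR only after both threads commit a matching copy.

```python
from collections import deque

class StoreComparator:
    """Toy model of SRT output comparison: a committed store leaves the
    Sphere of Replication only after both redundant threads commit a
    matching copy of it."""

    def __init__(self):
        self.queues = (deque(), deque())  # per-thread committed stores, in program order
        self.to_cache = []                # validated stores released to the data cache
        self.fault = False                # set when the two copies disagree

    def commit_store(self, thread, addr, value):
        self.queues[thread].append((addr, value))
        # Pair the oldest unmatched store from each thread and compare.
        while self.queues[0] and self.queues[1]:
            a = self.queues[0].popleft()
            b = self.queues[1].popleft()
            if a == b:
                self.to_cache.append(a)   # validated: allowed to exit the SoR
            else:
                self.fault = True         # mismatch: transient fault detected
```

Note that the leading thread may commit several stores before the trailing thread catches up; the per-thread queues absorb that slack without requiring cycle-by-cycle lockstep.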
Slide 14: Input Replication in SRT
- Replicate & deliver the same input (coming from outside the SoR) to both redundant copies
[Figure: the SoR diagram, with input replication on the path from the rest of the system to the execution copies]
Slide 15: Input Replication
Cached load data:
- Pairing loads from the redundant threads: too slow
- Allowing both loads to probe the cache: false faults with I/O or multiprocessors
Load Value Queue (LVQ):
- Pre-designated leading & trailing threads
- The leading thread's load probes the cache; the trailing thread's load reads the replicated value from the LVQ
[Figure: two copies of an add / load "R1 (R2)" / sub sequence; the leading copy's load probes the cache, the trailing copy's load reads from the LVQ]
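The LVQ idea above can be sketched in a few lines. This is a toy Python model (the class and method names are illustrative, not from the paper); it shows why the trailing thread cannot see a false fault even if memory changes between the two threads' loads.

```python
from collections import deque

class LoadValueQueue:
    """Toy model of the SRT Load Value Queue: only the pre-designated
    leading thread probes the cache; the trailing thread replays the same
    (address, value) pairs in program order, so both threads observe
    identical load inputs."""

    def __init__(self, memory):
        self.memory = memory     # stands in for the data cache
        self.lvq = deque()       # (address, value) pairs awaiting the trailing thread

    def leading_load(self, addr):
        value = self.memory[addr]        # the real cache probe
        self.lvq.append((addr, value))   # replicate the input for the trailing thread
        return value

    def trailing_load(self, addr):
        q_addr, value = self.lvq.popleft()
        assert q_addr == addr            # an address mismatch would signal a fault
        return value
```

The key property: if an I/O device or another processor writes the location between the two loads, the trailing thread still receives the value the leading thread saw, so no false transient fault is flagged.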
Slide 16: Input Replication (contd.)
- Cached load data, alternate solution: Active Load Address Buffer
- Special cases: cycle- or time-sensitive instructions, external interrupts
Slide 17: Outline
- Sphere of Replication (SoR)
- Output comparison for SRT
- Input replication for SRT
- Performance optimizations for SRT
- SRT outperforms on-chip replicated microprocessors
- Related work
- Summary
Slide 18: Performance Optimizations
Slack fetch:
- Maintain a constant slack of instructions between the leading and trailing threads
+ The leading thread prefetches cache misses for the trailing thread
+ The leading thread "prefetches" correct branch outcomes for the trailing thread
Branch Outcome Queue:
- Feed branch outcomes from the leading thread to the trailing thread
Combine the two optimizations
Slide 19: Baseline Architecture Parameters
- L1 instruction cache: 64 KB, 4-way associative, 32-byte blocks, single-ported
- L1 data cache: 64 KB, 4-way associative, 32-byte blocks, four read/write ports
- Unified L2 cache: 1 MB, 4-way associative, 64-byte blocks
- Branch predictor: hybrid local/global (like the 21264); 13-bit global history register indexing an 8K-entry global PHT and an 8K-entry choice table; 2K 11-bit local history registers indexing a 2K-entry local PHT; 4K-entry BTB; 16-entry RAS (per thread)
- Fetch/decode/issue/commit width: 8 instructions/cycle (fetch can span 3 basic blocks)
- Function units: 6 integer ALUs, 2 integer multipliers, 4 FP adders, 2 FP multipliers
- Fetch-to-decode latency: 5 cycles; decode-to-execution latency: 10 cycles
Slide 20: Target Architectures
SRT: SMT + fault detection
- Output comparison
- Input replication (Load Value Queue)
- Slack fetch + Branch Outcome Queue
ORH-Dual: On-Chip Replicated Hardware
- Each pipeline of the dual has half the resources of SRT
- The two pipelines share the fetch stage (including the branch predictor)
Slide 21: Performance Model & Benchmarks
SimpleScalar 3.0:
- Modified to support SMT by Steve Raasch, U. of Michigan
- SMT/SimpleScalar further modified to support SRT
Benchmarks:
- Compiled with gcc 2.6 + full optimization
- Subset of the SPEC95 suite (11 benchmarks)
- Skipped between 300 million and 20 billion instructions, then simulated 200 million instructions per benchmark
Slide 22: SRT vs. ORH-Dual
- Average improvement = 16%, maximum = 29%
[Figure: per-benchmark performance comparison of SRT against ORH-Dual]
Slide 23: Recent Related Work
- Saxena & McCluskey, IEEE Systems, Man, & Cybernetics, 1998: first to propose the use of SMT for fault detection
- AR-SMT, Rotenberg, FTCS, 1999: forwards values from the leading thread to a checker thread
- DIVA, Austin, MICRO, 1999: converts the checker thread into a simple processor
Slide 24: Improvements over Prior Work
- Sphere of Replication (SoR): e.g., the AR-SMT register file must be augmented with ECC; DIVA must handle uncached loads in a special way
- Output comparison: AR-SMT & DIVA compare all instructions; SRT compares only selected ones, based on the SoR
- Input replication: AR-SMT & DIVA can detect false transient faults; SRT avoids this problem using the LVQ
- Slack fetch
Slide 25: Summary
Simultaneous & Redundantly Threaded Processor (SRT) = SMT + fault detection
- Sphere of Replication
- Output comparison of committed store instructions
- Input replication via the Load Value Queue
- Slack fetch & Branch Outcome Queue
SRT outperforms equivalently-sized on-chip replicated hardware by 16% on average & up to 29%