Published by Muriel Gallagher. Modified over 9 years ago.
1
CS 7810 Lecture 25
DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design
T. Austin, Proceedings of MICRO-32, November 1999
2
Redundancy
If a processor's output is error-prone, reliability can be provided with redundancy.
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]
3
Redundancy
[Figure: Input Program, Primary Core, Checker Core, a second Checker Core, Verify & Commit]
One checker can detect errors. For recovery, we may need another checker or some other form of redundancy.
4
Why Redundancy?
Soft errors: a high-energy particle can strike a device and deposit enough charge to flip the stored value.
Sources: cosmic rays, alpha particles.
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]
5
Why Redundancy?
Soft errors: voltage spikes or noise.
Contributors: crosstalk, di/dt, lower voltages.
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]
6
Why Redundancy?
Redundancy allows unverified or aggressively clocked primary cores.
Functionally incorrect core: some corner case slips through.
Electrically incorrect core: high temperature causes a circuit to miss its timing constraint.
[Figure: Input Program, Primary Core, Checker Core, Verify & Commit]
7
DIVA Microarchitecture
[Figure: primary pipeline BPred, I-$, Dec/Ren, IQ, ALU, D-$, with Rename Regs and Arch Regs. Example instruction: LR3 + LR7 -> LR15, with values 4 + 8 = 12]
Storage check: read LR3 and LR7 from Arch Regs and confirm they equal 4 and 8.
ALU check: add 4 + 8 and confirm it equals 12.
If both checks succeed, write 12 into LR15.
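The two checks on this slide can be sketched in a few lines of Python. This is only an illustration under assumptions: the `CommitEntry` record, the `OPS` table, and `check_and_commit` are hypothetical names, not structures from the DIVA paper, and the register file is modeled as a plain dictionary.

```python
from dataclasses import dataclass


@dataclass
class CommitEntry:
    """One instruction as handed to the checker at commit (hypothetical format)."""
    op: str
    src1: str
    src2: str
    dest: str
    src1_val: int   # operand value the primary core used
    src2_val: int   # operand value the primary core used
    result: int     # result the primary core computed


# Assumed tiny operation table, just enough for the slide's example.
OPS = {"add": lambda a, b: a + b, "sub": lambda a, b: a - b}


def check_and_commit(entry, arch_regs):
    """Write the result to arch_regs and return True only if both checks pass."""
    # Storage check: operands used by the primary core must match
    # the architected register file.
    storage_ok = (arch_regs[entry.src1] == entry.src1_val
                  and arch_regs[entry.src2] == entry.src2_val)
    # ALU check: recompute the result from those operands.
    alu_ok = OPS[entry.op](entry.src1_val, entry.src2_val) == entry.result
    if storage_ok and alu_ok:
        arch_regs[entry.dest] = entry.result
        return True
    return False


arch_regs = {"LR3": 4, "LR7": 8, "LR15": 0}
good = CommitEntry("add", "LR3", "LR7", "LR15", 4, 8, 12)
bad = CommitEntry("add", "LR3", "LR7", "LR15", 4, 8, 13)  # faulty ALU result

committed_good = check_and_commit(good, arch_regs)  # passes, LR15 becomes 12
committed_bad = check_and_commit(bad, arch_regs)    # fails the ALU check
```

The faulty entry never reaches architected state, which is the property the checker is meant to guarantee.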
8
Microarchitecture Details
Instructions are fed to the checker in order during commit.
The logic and storage checks detect errors in the ALUs and datapath.
The checker core is a simple in-order pipeline that is easy to design and verify.
An error in an earlier stage (e.g., LR3 instead of LR2) can be detected by also adding a rename/decode stage to the checker.
The in-order checker has no stalls (it needs a bypass for the register file): no data dependences, cache misses, or branch mispredicts.
Contention for the register file and data cache can degrade the primary thread.
9
Recovery
The architected register file and data cache are ECC protected. When an error is detected, it is assumed that the checker and architected state are correct.
The primary core is restarted from the faulting instruction.
A fault in the primary core may result in deadlock, e.g., the instruction that produces R5 is waiting for R5 to be produced (instead of R4).
A timeout in the checker signals an error.
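A minimal sketch of the recovery policy, assuming a cycle-by-cycle event stream from the primary core. The function name `checker_recovery_log`, the event encoding, the fixed 4-byte instruction size, and straight-line code are all assumptions made for illustration. A mismatch restarts the primary core at the faulting instruction; a run of idle cycles (a possible deadlock) triggers the timeout and restarts at the oldest uncommitted instruction.

```python
def checker_recovery_log(events, timeout_cycles, start_pc=0, instr_bytes=4):
    """Return the list of (reason, restart_pc) recovery actions.

    events holds one item per cycle: None if the primary core delivered
    nothing, ("ok", pc) if the instruction at pc passed the checks, or
    ("mismatch", pc) if a check failed.
    """
    next_pc = start_pc  # oldest instruction not yet committed
    idle = 0            # consecutive cycles without a delivered instruction
    log = []
    for event in events:
        if event is None:
            idle += 1
            if idle >= timeout_cycles:
                # Watchdog fires: treat the stall as a fault in the primary core.
                log.append(("timeout", next_pc))
                idle = 0
            continue
        idle = 0
        status, pc = event
        if status == "ok":
            next_pc = pc + instr_bytes  # straight-line code assumed
        else:
            # Architected state is trusted; restart from the faulting instruction.
            log.append(("mismatch", pc))
            next_pc = pc
    return log


events = [("ok", 0), ("mismatch", 4), ("ok", 4), None, None, None]
log = checker_recovery_log(events, timeout_cycles=3)
# log == [("mismatch", 4), ("timeout", 8)]
```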
10
Redundant Multi-Threading
Execute two threads in parallel (CMP or SMT); each thread maintains its own register state.
Threads execute as in a conventional processor, except:
  the trailing thread commits after verifying the result
  the leading thread commits stores to a buffer; these get written to cache/memory only after verification
  load values of the leading thread are sent to the trailing thread, so the trailing thread never accesses the data cache
  branch outcomes are also sent to the trailing thread
[Figure: Leading Thread and Trailing Thread, with arrows labeled "Reg results, load values, branch outcomes" and "Store values"]
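The store buffer and load-value forwarding can be sketched as a toy model. Everything here is an assumption made for illustration: the three-instruction mini-ISA, the names `execute` and `run_rmt`, the `fault_at` bit-flip injection, and the simplification that the two threads run one after the other and only store values are compared (the slide also forwards register results and branch outcomes).

```python
from collections import deque


def execute(program, load_source, fault_at=None):
    """Run program with private registers and return its list of (addr, value) stores.

    load_source(addr) supplies load values. If fault_at is the index of a
    load or add, the low bit of that instruction's result is flipped.
    """
    regs = {}
    stores = []
    for i, instr in enumerate(program):
        op = instr[0]
        if op == "load":
            _, dest, addr = instr
            regs[dest] = load_source(addr)
        elif op == "add":
            _, dest, a, b = instr
            regs[dest] = regs[a] + regs[b]
        elif op == "store":
            _, src, addr = instr
            stores.append((addr, regs[src]))
        if i == fault_at and op != "store":
            regs[instr[1]] ^= 1  # injected single-bit error
    return stores


def run_rmt(program, memory, fault_at=None):
    """Return True and commit stores to memory only if both threads agree."""
    load_value_queue = deque()

    def leading_load(addr):
        value = memory[addr]
        load_value_queue.append(value)  # forward the value to the trailing thread
        return value

    def trailing_load(addr):
        return load_value_queue.popleft()  # never touches memory

    store_buffer = execute(program, leading_load, fault_at)
    trailing_stores = execute(program, trailing_load)
    if store_buffer != trailing_stores:
        return False  # mismatch: buffered stores are not released
    for addr, value in store_buffer:
        memory[addr] = value
    return True


program = [("load", "r1", 0), ("load", "r2", 1),
           ("add", "r3", "r1", "r2"), ("store", "r3", 2)]
clean_mem = {0: 4, 1: 8, 2: 0}
clean_ok = run_rmt(program, clean_mem)                 # stores verified and committed
faulty_mem = {0: 4, 1: 8, 2: 0}
faulty_ok = run_rmt(program, faulty_mem, fault_at=2)   # leading thread's add is corrupted
```

Because stores wait in the buffer until verified, the corrupted run leaves memory untouched.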
11
Fault Model
A single error in either core can be detected.
Since loads are not replicated, the load/store datapath must be ECC protected.
For recovery, a second checker thread is required.
ECC in the checker register file will enable recovery in most cases without a second checker.
12
RMT on SMT/CMP
+ SMT does not require inter-core traffic: values can be read from the shared register file/data cache
- Single-thread performance may be degraded
- Each redundant instruction executes on a high-power pipeline
+ Trailing CMP core can be a simple in-order processor with low power/area overheads
+ Trailing core's frequency can be independently controlled
+ Heterogeneous CMP where cores can be dynamically employed for throughput or reliability
+ Lower probability of errors
13
Parallelization of Trailing Thread
[Figure: Sequential Thread alongside Parallel Thread 1, Parallel Thread 2, Parallel Thread 3, Parallel Thread 4]
Is it more power-efficient to execute the verification thread in parallel?
14
Parallelization of Trailing Thread
[Figure: Sequential Thread alongside Parallel Thread 1, Parallel Thread 2, Parallel Thread 3, Parallel Thread 4]
If the trailing cores are frequency-scaled, dynamic power does not change, but leakage power increases.
If the trailing cores are frequency-and-voltage scaled, dynamic power decreases, and leakage power increases.
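A back-of-the-envelope model shows why both statements hold. It assumes dynamic power proportional to V^2 * f, per-core leakage proportional to V (a deliberately crude assumption), and N trailing cores each running at 1/N of the original frequency. The function name `trailing_power` and the baseline numbers are hypothetical.

```python
def trailing_power(n_cores, voltage_scale, dyn_power=1.0, leak_power=0.2):
    """Return total (dynamic, leakage) power for n_cores trailing cores.

    Assumed model: each core runs at 1/n_cores of the baseline frequency,
    dynamic power scales as V^2 * f, and per-core leakage scales as V.
    dyn_power and leak_power are the single-core baseline values.
    """
    freq_scale = 1.0 / n_cores
    dynamic = n_cores * dyn_power * voltage_scale ** 2 * freq_scale
    leakage = n_cores * leak_power * voltage_scale
    return dynamic, leakage


baseline = trailing_power(1, 1.0)    # one sequential trailing thread
freq_only = trailing_power(4, 1.0)   # four cores, frequency scaling only
dvfs = trailing_power(4, 0.8)        # four cores, frequency and voltage scaling
```

With frequency scaling alone, the factor of N more cores cancels the 1/N frequency, so total dynamic power is unchanged while leakage is paid on N cores. Lowering the voltage as well shrinks the V^2 term, but leakage still exceeds the single-core baseline in this model.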
15
Error Types
16
Acronyms!!
MTTF & MTBF: mean time to/between failures.
Errors are either SDC (silent data corruption) or DUE (detected unrecoverable errors).
Many errors get masked:
  ACE bits: these bits are required for architecturally correct execution
  un-ACE bits: these bits do not affect the final output
AVF: architectural vulnerability factor (the percentage of time/space that a structure holds ACE state).
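As a worked example of the AVF definition, the sketch below computes the fraction of bit-cycles that hold ACE state in a hypothetical 64-bit structure observed over four cycles. The function name `avf` and the occupancy numbers are made up for illustration. The last line applies the usual derating, in which a structure's effective error rate is its raw error rate multiplied by its AVF; the raw rate of 100 is also a made-up value.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Fraction of bit-cycles in a structure that hold ACE state."""
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


# Hypothetical occupancy: number of ACE bits resident in each of 4 cycles.
example_avf = avf([16, 32, 32, 48], total_bits=64)   # 128 / 256 = 0.5

raw_error_rate = 100.0                               # made-up raw rate (e.g., in FIT)
derated_error_rate = raw_error_rate * example_avf    # 50.0
```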
17
Partial Coverage
RMT covers faults in the entire core (almost!).
If that is too expensive, provide error coverage in specific structures to reduce error probabilities.
Are there ways to ensure that an instruction spends less time in architecturally vulnerable structures?