Published by Sophia Marlene Summers. Modified over 8 years ago.
1
University of Michigan Electrical Engineering and Computer Science
Encore: Low-Cost, Fine-Grained Transient Fault Recovery
Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August
University of Michigan
*Currently with Northrop Grumman, Information Systems Sector
2
…many ways to fail
“Failure to prepare is preparing to fail…” - Benjamin Franklin
Failure sources: Negative Bias Temperature Instability, Oxide Breakdown, Electromigration, Packaging Impurities, Cosmic Radiation, PVT Variation [Gupta`09]
NTC Computing [Dreslinski`10]
The distinction between a transient and a permanent fault is becoming blurred
Fault spectrum (diagram labels): Transient (“soft”) Faults; Rare, Continuous, Periodic; Permanent (“hard”) Faults
Many permanent faults, particularly wearout-induced faults, initially manifest as timing errors.
3
The Future of Soft Errors
Past: One failure per MONTH per 100 chips
Present: One failure per DAY per 100 chips
Future: One failure per DAY per chip
Aggressive voltage scaling (near-threshold computing)
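To put the three rates on a common footing, the following minimal sketch converts each one to failures per chip per day and compares them. The helper name `failures_per_chip_per_day` and the 30-day month are assumptions made here for illustration; the slide does not define either.

```python
from fractions import Fraction

DAYS_PER_MONTH = 30  # assumption: the slide does not say how long a "month" is


def failures_per_chip_per_day(failures, days, chips):
    """Normalize 'failures per <days> per <chips>' to a per-chip, per-day rate."""
    return Fraction(failures, days * chips)


past = failures_per_chip_per_day(1, DAYS_PER_MONTH, 100)  # one per month per 100 chips
present = failures_per_chip_per_day(1, 1, 100)            # one per day per 100 chips
future = failures_per_chip_per_day(1, 1, 1)               # one per day per chip

past_to_present = present / past
present_to_future = future / present
past_to_future = future / past
```

Under the 30-day assumption, the per-chip rate grows by a factor of 30 from past to present and by a further factor of 100 from present to future, 3000 in total.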
4
Realizing a Reliability “Pipeline”
Recent interest in low-cost fault detection
  ReStore [DSN`05]
  SWAT [ASPLOS`08]
  Shoestring [ASPLOS`10]
  Not perfect…but very low-cost
Generally involves some form of rollback/re-execution
  1) Identify fault site
  2) Restore processor to pre-fault state, before 1)
  3) Resume execution from 1)
Many low-cost detection techniques rely on hardware speculation support
Commodity systems present both challenges and opportunities
  Challenge: HW speculation support (if it exists) is limited
  Challenge: Cannot afford expensive, heavyweight SW checkpointing
  Opportunity: Typically not running mission-critical applications
    Sacrifice a small degree of reliability
    Exploit (probabilistic) idempotence in program execution
5
The Role of Idempotence
Mathematical Definition: an operation that can be applied multiple times without changing the result
Computer Science Definition: a region of code without any exposed write-after-read (WAR, anti-) dependencies
[Figure: two example code regions, labeled Non-idempotent and Idempotent, built from the statements "… = X", "X++", and "X = …"]
Idempotent code regions can be safely re-executed without additional checkpointing
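The difference between the two kinds of region can be seen in a small Python sketch. The two region functions and the `run_with_reexecution` helper are hypothetical and written for illustration only; they mimic a fault that is detected at the end of a region, followed by re-execution from the region entry with no state restored.

```python
def non_idempotent_region(state):
    """Reads X and then overwrites it: an exposed write-after-read on X."""
    observed = state["X"]        # ... = X
    state["X"] = observed + 1    # X++
    return observed


def idempotent_region(state):
    """Writes X before reading it, so the old value of X is never consumed."""
    state["X"] = 10              # X = ...
    return state["X"] + 1        # ... = X


def run_with_reexecution(region, state):
    """Run the region, then run it again as if a fault forced re-execution."""
    region(state)                # first attempt (result discarded after the "fault")
    return region(state)         # re-execution from the region entry


clean = idempotent_region({"X": 5})
replayed = run_with_reexecution(idempotent_region, {"X": 5})

bad_state = {"X": 5}
bad_result = run_with_reexecution(non_idempotent_region, bad_state)
```

Re-executing the idempotent region gives the same result as a single fault-free run. Re-executing the non-idempotent region observes the already-incremented X, which is why such regions need the old value checkpointed before the store.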
6
Does Idempotence Exist?
Selectively checkpointing a *few* offending stores
7
Challenges to Exploiting Idempotence
Must identify where to resume execution
  1) Control flow
  2) Rollback distance
Statically identifying optimal rollback distance is inherently intractable
  ↑ rollback dist. → ↑ Pr(recoverable)
  ↓ rollback dist. → ↑ Pr(idempotent)
Simplifying engineering solution based on single-entry, multiple-exit (SEME) regions
[Figure: control-flow graph of basic blocks bb 1 to bb 7 and bb’, with an execution path and X markers]
8
Encore Vision
[Figure: processing flow, with stages as labeled in the diagram]
Source Code
Code Partitioning (CFG-based)
Idempotence Analysis (per region): regions are labeled Idempotent or Non-idempotent (example statements: "… = X", "X++")
Instrumentation (per region): Chkpt X and Recovery code are added
Runtime Behavior (post-fault): Fault Detected, Redirect Control, Restore State
9
Identifying Idempotence (High-level)
With respect to a point, p, in the CFG…
  Reachable Stores (RS): A store that may execute after p
  Guarded Addresses (GA): An address that is guaranteed to be overwritten before reaching p
  Exposed Addresses (EA): An address that may be referenced by an unguarded load prior to p
Idempotent iff EA ∩ RS = ∅
[Figure: CFG of basic blocks bb 1 to bb 8, shown before and after analysis]
Additional Details…
  1) Applies to both memory and registers
    Static, conservative alias analysis
  2) Scalable hierarchical analysis
    Handles cyclic code
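The EA ∩ RS test can be illustrated on a much simpler setting than the one the talk addresses. The sketch below assumes a single straight-line region with exactly known addresses, so it needs neither alias analysis nor CFG traversal. The function names and the `("load" | "store", address)` encoding are hypothetical choices made here, not part of Encore.

```python
def analyze_region(ops):
    """Return (exposed, stores, offending) address sets for a straight-line region.

    ops is a list of ("load", addr) or ("store", addr) tuples in program order.
    """
    guarded = set()   # GA: addresses already overwritten inside the region
    exposed = set()   # EA: addresses read by a load that no earlier store guards
    stores = set()    # RS: addresses written somewhere in the region
    for kind, addr in ops:
        if kind == "load":
            if addr not in guarded:
                exposed.add(addr)
        elif kind == "store":
            stores.add(addr)
            guarded.add(addr)
        else:
            raise ValueError(f"unknown operation kind: {kind!r}")
    offending = exposed & stores  # addresses with an exposed write-after-read
    return exposed, stores, offending


def is_idempotent(ops):
    """A region is idempotent iff EA and RS do not intersect."""
    return not analyze_region(ops)[2]


war_region = [("load", "X"), ("store", "X")]      # ... = X ; X++
guarded_region = [("store", "X"), ("load", "X")]  # X = ... ; ... = X
disjoint_region = [("load", "A"), ("store", "B")]
```

In this toy model, the non-empty intersection also names the stores that would need a checkpoint before the region can be re-executed safely.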
10
Code Instrumentation
Live-in Checkpointing (bb 0): Save R1, Save R2, …, Save Rn
“On-demand” Checkpointing: MemCopy B, Save Address[B]
Recovery Code (bb r), Upon Fault Detection: *Restore B, Restore R1, Restore R2, …, Restore Rn
[Figure: CFG of basic blocks bb 1 to bb 8 containing the memory operations 1: Store A, 2: Store B, 3: Store C, 4: Load A, 5: Store C, 6: Load B, 7: Load B, 8: Load C, 9: Store A, 10: Store B, 11: Load C, 12: Store C]
Encore Heuristics
  1) Selectively prune dynamically-dead code
    ↓ offending stores → ↑ Pr(idempotent)
  2) Selectively fuse adjacent regions
    ↑ region size → ↑ Pr(recoverable)
  3) Selectively instrument profitable regions
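The interplay of live-in checkpointing, on-demand checkpointing, and recovery code can be sketched in Python. This is a toy model under stated assumptions, not the instrumentation Encore emits: memory and registers are plain dictionaries, the `RegionCheckpoint` class and the example region are hypothetical, and a raised `RuntimeError` stands in for a fault-detection signal.

```python
class RegionCheckpoint:
    """Per-region checkpoint: live-in registers plus a log of overwritten memory."""

    def __init__(self, registers, live_in):
        # Live-in checkpointing at region entry: Save R1 ... Save Rn
        self.saved_regs = {name: registers[name] for name in live_in}
        self.mem_log = []  # (address, old value) pairs

    def checkpoint_store(self, memory, addr):
        # "On-demand" checkpointing: save the address and copy its old contents
        self.mem_log.append((addr, memory[addr]))

    def recover(self, registers, memory):
        # Recovery code: undo logged stores newest-first, then restore registers
        for addr, old_value in reversed(self.mem_log):
            memory[addr] = old_value
        self.mem_log.clear()
        registers.update(self.saved_regs)


def region(registers, memory, ckpt, inject_fault=False):
    """Hypothetical region computing B = B + R1, which has a WAR on B."""
    value = memory["B"] + registers["R1"]   # load B
    ckpt.checkpoint_store(memory, "B")      # instrumented before the offending store
    memory["B"] = value                     # store B
    registers["R1"] = 0                     # clobbers a live-in register
    if inject_fault:
        raise RuntimeError("fault detected")  # stand-in for a detection signal


def run_with_recovery(registers, memory, live_in, inject_fault):
    ckpt = RegionCheckpoint(registers, live_in)
    try:
        region(registers, memory, ckpt, inject_fault=inject_fault)
    except RuntimeError:
        ckpt.recover(registers, memory)     # restore state
        region(registers, memory, ckpt)     # redirect control to the region entry
    return memory


recovered = run_with_recovery({"R1": 2}, {"B": 5}, ["R1"], inject_fault=True)
fault_free = run_with_recovery({"R1": 2}, {"B": 5}, ["R1"], inject_fault=False)
```

With the checkpoint in place, the run that suffers a fault ends in the same memory state as the fault-free run. Without the restore step, re-execution would add R1 to an already-updated B, and would read a clobbered R1.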
11
Lightweight Checkpointing
[Figure: stack layout, delimited by Frame Pointer and Stack Pointer]
Traditional Call Stack: Input Parameters, Return Address, Local Variables
Encore Extensions: Live-in Registers, then checkpoint entries (addr_0, data_0), (addr_1, data_1), …, (addr_N, data_N)
Checkpoint cost: 1 reg2mem store, 1 mem2mem copy, 1 stack ptr increment
Stack grows dynamically to accommodate checkpoint storage
12
Evaluation Methodology
Program analysis/instrumentation performed in the LLVM compiler
In-order, single-issue, embedded-class processor
Dynamic instruction model based on profiled execution
Reliability coverage
  Analytical model in lieu of traditional fault injection
  Decouples evaluation from microarchitectural details
13
Inherent Idempotence
[Figure legend: 0% (dynamically-dead), <5%, <10%]
76% of application code is naturally idempotent
14
Dynamic Execution Breakdown
Impact of detection latency
  If control has left the region containing the original fault site, re-execution cannot correct the error
91% of execution time is spent within recoverable regions
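The effect of detection latency can be made concrete with a deliberately simple model. The sketch below is an assumption made here for illustration and is not the analytical model used in the talk. It assumes that a fault strikes a uniformly random instruction of a straight-line region, is detected a fixed number of instructions later, and is recoverable only if detection happens before control leaves the region. The function name and the example numbers are hypothetical.

```python
def recoverable_fraction(region_length, detection_latency):
    """Fraction of fault sites in a region that are detected before the region exits.

    A fault at 0-based instruction index i is taken to be recoverable when
    i + detection_latency < region_length.
    """
    if region_length <= 0:
        raise ValueError("region_length must be positive")
    if detection_latency < 0:
        raise ValueError("detection_latency must be non-negative")
    recoverable_sites = max(0, region_length - detection_latency)
    return recoverable_sites / region_length


# Illustrative values only: a 1000-instruction region at two latencies,
# and a region shorter than the detection latency.
long_latency = recoverable_fraction(1000, 100)
short_latency = recoverable_fraction(1000, 10)
too_short_region = recoverable_fraction(50, 100)
```

Even this toy model shows why longer regions and shorter detection latencies both raise the share of recoverable faults.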
15
Full System “Coverage”
[Figure legend, by detection latency: Existing (~100 instrs), Future (~10 instrs), Future (~1000 instrs)]
93% to 99.99% coverage, highly application dependent
16
Overheads
3% to 22% performance degradation
17
Summary
Large portions of applications, across domains, are (probabilistically) idempotent
Encore is a software-only solution that exploits this property to provide low-cost fault recovery
On average, 97% of faults are recoverable with current detection schemes, at a 15% performance penalty
Implementing Encore in a runtime system / virtual machine has the potential to yield even better results
  Larger dynamic traces vs. static intervals
  Dynamic vs. static memory analysis
18
Questions?
http://cccp.eecs.umich.edu