
1 University of Michigan Electrical Engineering and Computer Science
Encore: Low-Cost, Fine-Grained Transient Fault Recovery
Authors: Shuguang Feng*, Shantanu Gupta, Amin Ansari, Scott Mahlke, David August
University of Michigan
*Currently with Northrop Grumman, Information Systems Sector

2 …many ways to fail
"Failure to prepare is preparing to fail…" - Benjamin Franklin
Fault sources shown: Negative Bias Temperature Instability, Oxide Breakdown, Electromigration, Packaging Impurities, Cosmic Radiation, PVT Variation [Gupta`09], NTC Computing [Dreslinski`10]
The distinction between a transient and permanent fault is becoming blurred
Figure: a spectrum running from transient ("soft") faults to permanent ("hard") faults, labeled Rare, Periodic, Continuous
- Many permanent faults, particularly wearout-induced faults, initially manifest as timing errors.

3 The Future of Soft Errors
- Past: one failure per MONTH per 100 chips
- Present: one failure per DAY per 100 chips
- Future: one failure per DAY per chip
- Aggressive voltage scaling (near-threshold computing)

4 Realizing a Reliability "Pipeline"
- Recent interest in low-cost fault detection
  - ReStore [DSN`05]
  - SWAT [ASPLOS`08]
  - Shoestring [ASPLOS`10]
  - Not perfect…but very low-cost
- Generally involves some form of rollback/re-execution
  1) Identify fault site
  2) Restore processor to pre-fault state, before 1)
  3) Resume execution from 1)
- Many low-cost detection techniques rely on hardware speculation support
Commodity systems present both challenges and opportunities
- Challenge: HW speculation support (if it exists) is limited
- Challenge: Cannot afford expensive, heavyweight SW checkpointing
- Opportunity: Typically not running mission-critical applications
  - Sacrifice a small degree of reliability
  - Exploit (probabilistic) idempotence in program execution

5 The Role of Idempotence
- Mathematical definition: an operation that can be applied multiple times without changing the result
- Computer science definition: a region of code without any exposed write-after-read (WAR, anti-) dependencies
Figure: a non-idempotent region (… = X followed by X++) next to an idempotent region (X = … before any use of X)
Idempotent code regions can be safely re-executed without additional checkpointing
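The difference between the two kinds of region can be illustrated with a minimal sketch. The function names, the dictionary standing in for memory, and the variable `A` are assumptions made for illustration; they do not come from the slides. Re-execution is modeled simply as running the region again from its entry.

```python
def non_idempotent_region(mem):
    # Reads X and then overwrites it (X++): an exposed write-after-read on X.
    mem["X"] = mem["X"] + 1
    return mem["X"]


def idempotent_region(mem):
    # Writes X before any read of X, so the region never clobbers its own inputs.
    mem["X"] = mem["A"] * 2
    return mem["X"]


def run_with_reexecution(region, mem, times=2):
    # Model recovery as re-running the region from its entry point.
    result = None
    for _ in range(times):
        result = region(mem)
    return result
```

Running the idempotent region once or twice leaves the same value in `X`, while the non-idempotent region produces a different value on every re-execution unless the old `X` is checkpointed first.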

6 Does Idempotence Exist?
Selectively checkpointing a *few* offending stores

7 Challenges to Exploiting Idempotence
- Must identify where to resume execution
  1) Control flow
  2) Rollback distance
- Statically identifying optimal rollback distance is inherently intractable
  - ↑ rollback dist. → ↑ Pr(recoverable)
  - ↓ rollback dist. → ↑ Pr(idempotent)
- Simplifying engineering solution based on single-entry, multiple-exit (SEME) regions
Figure: an execution path through a CFG (bb 1 to bb 7, bb'), with fault sites marked X

8 Encore Vision
Figure: compilation and recovery flow
1) Source Code
2) Code Partitioning (CFG-based)
3) Idempotence Analysis (per region): regions are classified as Idempotent (X = … before … = X) or Non-idempotent (… = X followed by X++)
4) Instrumentation (per region): Chkpt X and Recovery code are added
5) Runtime Behavior (post-fault): Fault Detected, Redirect Control, Restore State

9 Identifying Idempotence (High-level)
- With respect to a point, p, in the CFG…
  - Reachable Stores (RS): a store that may execute after p
  - Guarded Addresses (GA): an address that is guaranteed to be overwritten before reaching p
  - Exposed Addresses (EA): an address that may be referenced by an unguarded load prior to p
- Idempotent IFF EA ∩ RS = Ø
Additional Details…
1) Applies to both memory and registers
  - Static, conservative alias analysis
2) Scalable hierarchical analysis
  - Handles cyclic code
Figure: example CFG (bb 1 to bb 8)
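The EA ∩ RS = Ø test can be sketched for the simplest possible case, a single straight-line region. This is an illustrative simplification, not the hierarchical CFG analysis the slide describes: it assumes exact addresses (no alias analysis) and no branches or loops. The function name and the instruction encoding are hypothetical.

```python
def offending_addresses(instrs):
    """Return the addresses with an exposed write-after-read dependency.

    instrs is a list of ("load" | "store", address) pairs in program order.
    An empty result means the straight-line region is idempotent.
    """
    written = set()   # addresses stored so far; these guard later loads
    exposed = set()   # addresses read by a load that no earlier store guards (EA)
    for op, addr in instrs:
        if op == "load":
            if addr not in written:
                exposed.add(addr)
        elif op == "store":
            written.add(addr)
        else:
            raise ValueError(f"unknown operation: {op!r}")
    # In straight-line code every store in the region is reachable (RS),
    # and a store to an exposed address must come after the exposing load.
    return exposed & written


def is_idempotent(instrs):
    return not offending_addresses(instrs)
```

For example, `[("load", "X"), ("store", "X")]` reports `X` as offending, while `[("store", "A"), ("load", "A"), ("load", "B"), ("store", "C")]` reports nothing, because the load of `A` is guarded and `B` is never written.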

10 Code Instrumentation
- Live-in Checkpointing (bb 0): Save R1, Save R2, …, Save Rn
- "On-demand" Checkpointing: MemCopy B, Save Address[B]
- Recovery Code (bb r), upon fault detection: *Restore B, Restore R1, Restore R2, …, Restore Rn
Figure: example CFG (bb 1 to bb 8) annotated with memory operations: 1: Store A, 2: Store B, 3: Store C, 4: Load A, 5: Store C, 6: Load B, 7: Load B, 8: Load C, 9: Store A, 10: Store B, 11: Load C, 12: Store C
Encore Heuristics
1) Selectively prune dynamically-dead code
  - ↓ offending stores → ↑ Pr(idempotent)
2) Selectively fuse adjacent regions
  - ↑ region size → ↑ Pr(recoverable)
3) Selectively instrument profitable regions

11 Lightweight Checkpointing
Figure: stack layout
- Traditional Call Stack: Input Parameters, Return Address, Local Variables (delimited by Frame Pointer and Stack Pointer)
- Encore Extensions: Live-in Registers, followed by checkpoint entries (addr_0, data_0), (addr_1, data_1), …, (addr_N, data_N)
Checkpoint cost: 1 reg2mem store, 1 mem2mem copy, 1 stack ptr increment
Stack grows dynamically to accommodate checkpoint storage
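The checkpointing and recovery steps from the two slides above can be sketched roughly as follows. This is a software model under stated assumptions, not Encore's generated code: memory and registers are plain dictionaries, `FaultDetected` stands in for an external detector, and every name (`CheckpointLog`, `region`, `run_with_recovery`) is hypothetical. The log mirrors the (addr, data) entries pushed onto the stack.

```python
class FaultDetected(Exception):
    """Stand-in for a signal from a low-cost fault detector."""


class CheckpointLog:
    def __init__(self):
        self.entries = []  # stack of (address, old_value) pairs

    def checkpointed_store(self, mem, addr, value):
        # "On-demand" checkpoint: save the address and old data, then store.
        self.entries.append((addr, mem[addr]))
        mem[addr] = value

    def restore(self, mem):
        # Undo in reverse order so the oldest saved value is written last.
        while self.entries:
            addr, old_value = self.entries.pop()
            mem[addr] = old_value


def region(mem, regs, log, inject_fault):
    regs["r1"] = mem["B"] + regs["r0"]            # exposed load of B
    log.checkpointed_store(mem, "B", regs["r1"])  # offending store to B
    if inject_fault:
        raise FaultDetected()
    mem["C"] = regs["r1"] * 2  # C is never read in the region: no checkpoint


def run_with_recovery(mem, regs, faults=1):
    live_in = dict(regs)  # live-in register checkpoint at region entry
    log = CheckpointLog()
    while True:
        try:
            region(mem, regs, log, inject_fault=faults > 0)
            return mem
        except FaultDetected:
            faults -= 1
            log.restore(mem)      # restore checkpointed memory
            regs.clear()
            regs.update(live_in)  # restore live-in registers
            # Control is then redirected to the region entry by the loop.
```

Only the store to `B` is logged, which reflects the idea of checkpointing a few offending stores rather than all state. Without the restore step, the re-executed region would read the already-updated `B` and produce a different result.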

12 Evaluation Methodology
- Program analysis/instrumentation performed in the LLVM compiler
- In-order, single-issue, embedded-class processor
- Dynamic instruction model based on profiled execution
- Reliability coverage
  - Analytical model in lieu of traditional fault injection
  - Decouples evaluation from microarchitectural details

13 Inherent Idempotence
Chart legend: 0% (dynamically-dead), <5%, <10%
76% of application code is naturally idempotent

14 Dynamic Execution Breakdown
Impact of detection latency
- If control has left the region containing the original fault site, re-execution cannot correct the error
91% of execution time is spent within recoverable regions
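The effect of detection latency can be illustrated with a toy calculation. This is not the analytical model used in the evaluation; it assumes a fault strikes a uniformly random instruction inside a region of fixed dynamic length and is detected a fixed number of instructions later. The function name and parameters are hypothetical.

```python
def recoverable_fraction(region_len, detection_latency):
    """Fraction of fault positions detected before control leaves the region.

    Assumes a fault at instruction i (0-based, uniformly distributed) is
    detected at instruction i + detection_latency, and is recoverable only
    if that instruction is still inside the region.
    """
    if region_len <= 0:
        raise ValueError("region_len must be positive")
    if detection_latency < 0:
        raise ValueError("detection_latency must be non-negative")
    recoverable_positions = max(0, region_len - detection_latency)
    return recoverable_positions / region_len
```

Under these assumptions a region much longer than the detection latency recovers almost every fault, while a region shorter than the latency recovers none, which is one way to see why larger regions raise Pr(recoverable).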

15 Full System "Coverage"
Chart series: Existing (~100 instrs), Future (~10 instrs), Future (~1000 instrs)
93% to 99.99% coverage, highly application dependent

16 Overheads
3% to 22% performance degradation

17 Summary
- Large portions of applications, across domains, are (probabilistically) idempotent
- Encore is a software-only solution that exploits this property to provide low-cost fault recovery
- 97% of faults on average are recoverable with current detection schemes
  - @ 15% performance penalty
- Implementing Encore in a runtime system / virtual machine has the potential to yield even better results
  - Larger dynamic traces vs. static intervals
  - Dynamic vs. static memory analysis

18 Questions?
http://cccp.eecs.umich.edu

