Published by Wilfred Cook. Modified over 9 years ago.
Priority Research Direction (use one slide for each)

Key challenges (barriers and gaps)
- Cope with a continuous flow of different errors and faults, including soft errors (silent or not)
- Current techniques (checkpoint/restart) will not scale
- Limits of storage and file systems
- Software stack is not fault aware
- Provide verification of large, long time scale simulations

Summary of research direction
- Fault understanding (RAS), modeling, prediction
- Fault isolation/confinement + local management
- Resilience and new technologies (Flash memory, virtualization)
- Resilience and power consumption, heterogeneity
- Resilient storage and file systems
- Extending applicability of checkpoint
- Replication (backup core) vs. rollback recovery
- Fault recovery vs. fault avoidance (migration)
- Transparent vs. application guided
- Resilience and programming/execution models
- Resilient apps and algorithms, possibly with OS support
- Language / compiler support for resilience
- Experimental environment to stress and compare solutions

Potential impact on software component
- Better consistency in error/fault management across software layers [S]
- System + application interactions to manage errors and faults [M]
- Naturally resilient system [M] and application software [L]

Potential impact on usability, capability, and breadth of community
- Resilience is a key issue for the Exascale community
- Enable tightly coupled applications to run longer
- Better resilience will provide better efficiency (full system)

4.x Resilience

Resilience is a critical issue to achieve high application throughput.

[Roadmap figure. Timeline: 2010 to 2019. Net throughput milestones: 10 Peta, 100 Peta, 1 Exa. Time horizons: Long, Medium, Short. Terms: Fault Repair, Fault Avoidance.]

MTBF labels on the figure:
- MTBF=<1h, MTBF=<10m, MTBF=<10h (based on DARPA report)
- MTBF=day, MTBF=10h, MTBF=<1h

Roadmap items
- All software should be fault aware and consistent
- Fault oblivious applications
- Application should be able to dynamically handle errors
- Extend applicability of checkpointing
  -- IO caching (e.g., NAND)
  -- New FT protocols
- System level fault-tolerance
  -- prediction for time optimal checkpointing and migration
  -- isolation and local recovery/management
- Improved hardware and software reliability
  -- better RAS collection and analysis (root cause)
  -- Integration
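
The roadmap item on time optimal checkpointing and the MTBF labels on this slide can be related with a small calculation. The sketch below is not from the slides: it uses Young's first-order approximation for the checkpoint interval, T = sqrt(2 * C * MTBF), and a first-order estimate of the lost time, C/T + T/(2 * MTBF). The checkpoint cost C of 600 seconds is a hypothetical value chosen only for illustration, and the approximation is only meaningful when C is much smaller than the MTBF.

```python
import math


def young_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Young's first-order approximation of the checkpoint interval (seconds)."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)


def wasted_fraction(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """First-order estimate of lost time per unit of work.

    Sum of the checkpoint overhead (C / T) and the expected rework
    after a failure (T / (2 * MTBF)), evaluated at Young's interval.
    A value near or above 1 means the approximation has broken down.
    """
    t = young_interval(checkpoint_cost_s, mtbf_s)
    return checkpoint_cost_s / t + t / (2.0 * mtbf_s)


if __name__ == "__main__":
    checkpoint_cost_s = 600.0  # hypothetical checkpoint cost, not from the slides
    # MTBF values taken from the labels on the slide: day, 10h, 1h, 10m
    for label, mtbf_s in [("day", 86400.0), ("10h", 36000.0),
                          ("1h", 3600.0), ("10m", 600.0)]:
        t = young_interval(checkpoint_cost_s, mtbf_s)
        w = wasted_fraction(checkpoint_cost_s, mtbf_s)
        print(f"MTBF={label}: interval ~ {t:.0f} s, estimated waste ~ {w:.2f}")
```

Under these assumptions the estimated waste grows as the MTBF shrinks, which is one way to read the claim that current checkpoint/restart techniques will not scale unless checkpoint cost is reduced (for example through IO caching) or failures are avoided through prediction and migration.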

4.x Resilience

Technology drivers
- Increase of the number of errors, variety of errors
- Huge increase of components and threads, power management, new hardware (Flash memory, accelerators)
- Increase of the data size, limit of centralized I/O, higher potential bandwidth of local storage

Alternative R&D strategies
- Fault recovery vs. fault avoidance (migration)
- Transparent vs. application directed
- Replication (backup core) vs. rollback recovery (replicate locally and restart globally?)

Recommended research agenda
- Fault understanding (RAS analysis), modeling, prediction [S-M]
- Fault isolation/confinement + local management [M]
- Virtualization [S]
- Extending the applicability of rollback recovery (reducing ckpt size, caching, scalable FT protocols) [S]
- Resilient storage and file systems [S-L]
- Resilience and programming/execution models (MW, Map Reduce, Transactions) [M-L]
- Language / compiler support for resilience [M]
- Resilient apps and algorithms (forward recovery, NFTA, ABFT), possibly with OS support [L]
- Experimental environment to stress envisioned solutions [M]

Crosscutting considerations
- Resilience and power management, performance (fault free situation and when faults occur)
- Scalability, programmability