Exploiting Global View for Resilience (GVR) An Outside-In Approach to Resilience Andrew A. Chien X-stack PI LBNL March 20-22, 2013.


Project Team
University of Chicago: Chien (PI), Dr. Hajime Fujita, Zachary Rubenstein, Prof. Guoming Lu
Argonne: Pavan Balaji (co-PI), James Dinan, Pete Beckman, Kamil Iskra
HP Labs: Robert Schreiber
Application Partnerships
o Future Nuclear Reactor Simulation (Andrew Siegel, CESAR)
o Computational Chemistry (Jeff Hammond, ALCF)
o Rich Computational Frameworks (Mike Heroux, Sandia)
o ... and more!

Outline
Global View Resilience (GVR)
Progress
Next Steps

Global View Resilience: "Just Add Resilience"
Incremental application resilience
o "Outside in", as needed, incremental, ... "end to end"
o Rising resilience challenges; manage them in a flexible, application-driven context
Global-view, data-oriented resilience
o Express globally consistent snapshots
o Express error handling and recovery with a global view
Application-system cross-layer partnership
o Applications: expose algorithm and application-domain knowledge
o System: reify and expose hardware and system errors
Portable, efficient interface for resilience
[Diagram: applications and the system meet at global-view data; resilience is data-oriented.]

Data-oriented Resilience Based on Multi-versions
Parallel applications and global-view data; phases create new logical versions
Frames invariant checks, and more complex checks based on high-level semantics
Frames system, hardware, OS, and runtime errors
Can be implemented efficiently with hardware support
Enables rollback and (sophisticated) roll-forward recovery on a per-global-view-data-item basis
[Diagram callouts: phases create new logical versions; checking gives efficient coverage; app-semantics-based recovery.]
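The multi-version idea above can be sketched in plain C. This is an illustrative mock, not the GVR implementation: the names (vstore_*), the fixed-size version table, and the snapshot-on-demand policy are all assumptions made for the example.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical sketch of a multi-version data item: each call to
 * vstore_version_inc() snapshots the working copy as a read-only
 * version, so recovery can roll back to any retained version. */
#define MAX_VERSIONS 8

typedef struct {
    double *versions[MAX_VERSIONS]; /* retained snapshots by version number */
    double *current;                /* mutable working copy */
    size_t  len;
    int     latest;                 /* most recent snapshot, -1 if none */
} vstore_t;

static vstore_t *vstore_create(size_t len) {
    vstore_t *s = calloc(1, sizeof *s);
    s->current = calloc(len, sizeof(double));
    s->len = len;
    s->latest = -1;
    return s;
}

/* Overwrite the working copy (the "put" side of the global view). */
static void vstore_put(vstore_t *s, const double *buf) {
    memcpy(s->current, buf, s->len * sizeof(double));
}

/* Snapshot the working copy; application phases call this at
 * natural consistency points. Returns the new version number. */
static int vstore_version_inc(vstore_t *s) {
    int v = ++s->latest;
    s->versions[v] = malloc(s->len * sizeof(double));
    memcpy(s->versions[v], s->current, s->len * sizeof(double));
    return v;
}

/* Roll the working copy back to an older version, e.g. after a
 * latent error is detected several phases later. */
static void vstore_restore(vstore_t *s, int v) {
    memcpy(s->current, s->versions[v], s->len * sizeof(double));
}
```

Rollback to any retained version, not just the latest, is what distinguishes this from single-buffer checkpointing and is what makes latent-error recovery possible.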

Cross-layer App-System Error Checking and Handling
Exploit semantics from many layers (application, runtime, OS, architecture)
Manage redundancy, storage, and checks efficiently
o Temporal redundancy: multi-version memory, integrated memory and NVRAM management
o Push checks to the most efficient level (find errors early, contain them, reduce overhead)
o Recover based on semantics from any level (repair more, larger feasible computation, reduce overhead)
Recover effectively from many more errors
[Diagram: Applications / GVR Interface / Runtime / OS / Architecture stack with open reliability; effective resilience, efficient implementation.]

Outline
Global View Resilience (GVR)
Progress
o Design: GVR API and architecture
o Implement: initial prototype
o Modeling: latent errors
Next Steps

GVR Application Interface
Global view creation
o New and federation interfaces
Global view data access
o Data access, consistency
Versioning
o Create persistent copies, restore
Error handling
o Capture and handle system errors and application errors
o Flag application errors
o Recover based on application semantics and versioned state

Application Lifecycle* – Error Handling
[State diagram: Running (general computation with put(), get(), version_inc()) moves to Error Handling Dispatch via raise_error(); recovery moves to Error Recovery (move_to_prev(), move_to_next(), descriptor_clone(), put(), get()); resume() returns to Running.]
*Can be a "partial application" life cycle.

Unified Signalling and Recovery
Unified signalling (hardware, OS, runtime, application)
Application-defined error checking
Application-defined handling
[Diagram: application_check(), runtime_check(), OS_signal(), hardware_error(), and other sources all map to raise_error(gds, error_desc).]

Dispatch and Recovery
Error description and dispatch
Error recovery
Resume application
Customized error handling
o Simple: paired notification and recovery routines
o Enhanced as resilience challenges and recovery capabilities increase
Exploit cross-layer information and semantics
[Diagram: raise_error(gds, e_desc) feeds Dispatch, which selects Correct, Recompute, Reload, Rollback, Approximate, Restart, or Fail, then resume(gds).]
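The "paired notification and recovery routines" pattern above can be sketched as a small dispatch table. Everything here (the error classes, the action names, the handler signature) is hypothetical and for illustration only; it is not the GVR API.

```c
#include <stddef.h>

/* Illustrative dispatch sketch: a raised error is looked up by class
 * and routed to its registered recovery routine; unregistered classes
 * fall through to failure. */
typedef enum { ERR_RESIDUAL, ERR_MEMORY, ERR_UNKNOWN, ERR_NCLASSES } err_class_t;
typedef enum { ACT_RECOMPUTE, ACT_ROLLBACK, ACT_FAIL } action_t;

typedef action_t (*handler_t)(void *err_desc);

static handler_t handlers[ERR_NCLASSES]; /* zero-initialized: no handlers yet */

/* Pair an error class with its recovery routine. */
static void register_handler(err_class_t c, handler_t h) {
    handlers[c] = h;
}

/* Dispatch: run the paired recovery routine, or fail if none exists. */
static action_t dispatch_error(err_class_t c, void *desc) {
    if (handlers[c]) return handlers[c](desc);
    return ACT_FAIL;
}

/* Example application handler: a suspect CG residual is repaired by
 * rolling back to versioned state. */
static action_t residual_handler(void *desc) {
    (void)desc; /* a real handler would inspect the error descriptor */
    return ACT_ROLLBACK;
}
```

The table can start with one or two simple pairings and grow as the slide suggests: new error classes and richer recovery actions are added without touching the dispatch path.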

GVR System Architecture
[Architecture diagram, applications over a client side and a target side:]
o Global View Service: provides the API
o Distributed Metadata Service: provides global-view, distributed metadata, versions, and consistency
o Distributed Storage Service: manages the latest array
o Distributed Recovery Management Service: manages old versions, resilience, and data transformations
o Local Resilient Data Store: local data store and data transformation, backed by DRAM, flash, and block storage

GVR Prototype
It works! Basic implementation
But...
o Simple versioning
o Simple error handling
o Not high performance
o Not highly scalable
However...
o Good enough to enable API and application experiments
o Good enough to enable GVR system implementation research
Demo in the Resilience Technology Marketplace

GVR Applied to miniFE
miniFE: mini-application for unstructured implicit finite element codes
1. Calculate matrix A and vector b
2. Solve the linear system with CG
o Generates a better approximation of x with each iteration
o Additional state preserved in direction vector p and residual vector r
o Each iteration involves parallel DAXPY, matrix-vector product, and dot product
o Computation has parallel for loops and reductions
Simple demonstration of global view, error checking and signalling, and error recovery

GVR-enhanced miniFE Skeleton

Error handler (reloads the saved solution state, then resumes):

    GDS_status_t handle_error(gds, local_buffer) {
        GDS_get(local_buffer, gds);   /* restore versioned state */
        GDS_resume();
    }

Solver loop with error check and periodic save of solution state:

    void cg_solve() {
        for each iteration {
            if ((old_residual - new_residual) / old_residual > TOL) {
                GDS_raise_error(gds_r, r);    /* error check: flag suspect residual */
                recalculate_residual();
            }
            if (iteration % CP_INTERVAL == 0) {
                GDS_put(r, gds_r);            /* save solution state */
            }
            do_calculation();
        }
    }

miniFE Execution (fault injection)
[Figure: miniFE execution trace under fault injection.]

Discussion
Simple example
o Captures critical state vector
o Restores when residual is incorrect
More ambitious use
o Coverage of other structures (A matrix: check, restore)
o Versioning to recover from latent errors
o Selective recovery and rollback
o GVR as primary data store
Next steps: larger application studies, programming-system experiments, etc.

Latent Errors and Multi-version Snapshots
Guoming Lu, Ziming Zheng, and Andrew A. Chien, "When are Multiple Checkpoints Needed?", to appear in Fault Tolerance at Extreme Scale (FTXS), June 2013.

Fail-stop vs. Latent Errors
[Fig. 1.a: fail-stop model (Running, Error, Error Detected, Recovery). Fig. 1.b: latent error model (Running, Error Generation, Latent Error, Error Detection, Error Detected, Recovery).]
Existing resilience systems mostly assume "fail-stop".
"Silent", delayed errors are likely to be a growing problem. Why? An increasing variety of errors, and the rising cost of checking.
o More subtle hardware and software errors (small data perturbation, small data-structure perturbation, minor divergence)
o More expensive checks (scrubbing, cross-structure, cross-node, data-structure symmetry, energy conservation, ...)

Versions Needed for Error Coverage
[Equations lost in transcription.]
As detection latency increases, at expected error rates, many versions are needed to cover errors.
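Since the slide's formula did not survive transcription, here is a hedged back-of-envelope version of the claim, not the paper's exact expression: if snapshots are taken every tau time units and an error can stay latent for up to delta, recovery needs a retained version that predates the error's origin, so roughly ceil(delta / tau) + 1 versions must be kept.

```c
/* Back-of-envelope model (illustrative only): number of versions
 * that must be retained so that some snapshot predates an error
 * whose detection latency is at most delta, given a snapshot
 * interval tau. */
static int versions_needed(double delta, double tau) {
    int k = (int)(delta / tau);
    if ((double)k * tau < delta) k++;   /* ceil(delta / tau) */
    return k + 1;                       /* +1: one version older than the error origin */
}
```

This matches the slide's qualitative point: with latency comparable to the snapshot interval (near fail-stop) two versions suffice, while latency many intervals long forces many retained versions.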

Versions and Error Coverage
[Plot: versions needed vs. error rate, for detection latencies δ = 1, 10, 30.]
If error detection latency is low (approaching "fail-stop"), 1-2 versions are sufficient.
At higher latency, the number of versions needed increases significantly.
Reduced checkpoint overhead increases the need for more versions.

Exascale Scenarios (Latent Errors)
Maximum achievable efficiency for long-running jobs (ρ = 10)
Multi-version is required for usable efficiency at high error rates; many versions are required
Multi-version benefit increases with
o Lower error rates (rework)
o Lower checkpoint cost (coverage)
Table 2: Exascale scenarios: achievable efficiency and least version requirements compared with a single-version scheme (the checkpoint interval is set to K times the optimal interval). δ = 5, R = 5 for the traditional checkpoint system; δ = 1, R = 1 for the optimized (SCR) system; ρ = 10 for both. Columns: λe (per minute); versions (K); efficiency of K-version; efficiency of 1-version.
[Table values garbled in transcription.]

Exascale Efficiency (Latent Errors)
To increase resilience to latent errors, increase the 1-version checkpoint interval beyond "optimal" (here set to 500)
Multi-version enables much higher efficiency
Multi-version is much better, particularly at high error rates
[Plot: system efficiency vs. error rate (errors/minute).]
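For intuition about why efficiency collapses at high error rates, a standard first-order checkpoint-efficiency approximation (in the spirit of Young's model, not the paper's exact analysis) can be written down; all parameter names here are assumptions for the example.

```c
/* First-order single-version checkpoint efficiency model (generic
 * approximation, illustrative only): fraction of time spent on useful
 * work, given checkpoint interval tau, checkpoint cost C, error rate
 * lambda (errors per unit time), and restart cost R. Each error
 * wastes, on average, half an interval of work plus the restart. */
static double efficiency(double tau, double C, double lambda, double R) {
    double overhead = C / (tau + C);                 /* time lost to checkpointing */
    double waste = lambda * (0.5 * (tau + C) + R);   /* expected rework fraction */
    double e = (1.0 - overhead) * (1.0 - waste);
    return e > 0.0 ? e : 0.0;                        /* clamp: model breaks down here */
}
```

The model shows the trade-off the slide exploits: a longer interval shrinks the checkpoint overhead term but inflates the rework term as the error rate grows, which is exactly where multi-version schemes pull ahead.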

Bottom Line
Need to worry about latent errors (detection delay)
Multi-versioning can help, and serves as an insurance policy
Optimized checkpointing increases the need for multi-versioning
Error-detection latency reduction (containment) is a critical research area

GVR Next Steps
Refine the API, based on co-design apps and other experiments (OpenMC, Trilinos, ...)
o Explore how GVR capabilities match common application structures; refine the API and demonstrate potential
Continue implementation, towards a full API and robust functionality
o Explore efficient implementation of redundant, distributed global-view data structures, snapshot consistency, and capture
o Explore efficient multi-version storage techniques: redundancy, compression, and restoration
Work with the OS/runtime community on cross-layer error-handling classification and naming
More multi-version analysis...

GVR X-stack Synergies
Direct application programming interface
o Co-existence with, and even being targeted by, other runtimes
Rich solver-library building block
Programming-system target
[Diagram: applications over GVR; solver libraries (PETSc, Trilinos, ...) over GVR; programming models PM #1, PM #2, PM #3 over GVR.]

Publications
Guoming Lu, Ziming Zheng, and Andrew A. Chien, "When are Multiple Checkpoints Needed?", to appear in Fault Tolerance at Extreme Scale (FTXS), June 2013.
Hajime Fujita, Robert Schreiber, and Andrew A. Chien, "It's Time for New Programming Models for Unreliable Hardware", ACM Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), March 18-20, 2013 (Provocative Ideas session).
The Global View Resilience Application Programming Interface, Version 0.71, February 2013.
Prior relevant work
Sean Hogan, Jeff Hammond, and Andrew A. Chien, "An Evaluation of Difference and Threshold Techniques for Efficient Checkpointing", 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012), at DSN 2012.