Scalable Fault Tolerance Schemes using Adaptive Runtime Support


Scalable Fault Tolerance Schemes using Adaptive Runtime Support
Laxmikant (Sanjay) Kale
http://charm.cs.uiuc.edu
Parallel Programming Laboratory, Department of Computer Science
University of Illinois at Urbana-Champaign

Presentation Outline
- What is object-based decomposition
- Its embodiment in Charm++ and AMPI
- Its general benefits
- Its features that are useful for fault tolerance schemes
- Our fault tolerance work in Charm++ and AMPI:
  - Disk-based checkpoint/restart
  - In-memory double checkpoint/restart
  - Proactive object migration
  - Message-logging

Object-based Over-decomposition
- Programmers decompose the computation into objects: work units, data units, composites
- Decomposition is independent of the number of processors; typically there are many more objects than processors
- An intelligent runtime system assigns objects to processors
- The RTS can change this assignment (mapping) during execution

Object-based Over-decomposition: Charm++
- Multiple "indexed collections" of C++ objects
- Indices can be multi-dimensional and/or sparse
- The programmer expresses communication between objects with no reference to processors
[Figure: user view of object collections vs. the system's implementation on processors]
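
To make the model concrete, here is a minimal Charm++ sketch of an indexed collection (a chare array); the module, class, and method names are illustrative and not taken from the talk. The interface (.ci) file declares the collection, and the C++ code creates and addresses elements purely by index.

```cpp
// ---- hello.ci (Charm++ interface file, shown here as a comment) ----
// mainmodule hello {
//   mainchare Main  { entry Main(CkArgMsg *m); };
//   array [2D] Block { entry Block(); entry void sayHi(); };
// };

// ---- hello.C ----
#include "hello.decl.h"

class Main : public CBase_Main {
public:
  Main(CkArgMsg *m) {
    delete m;
    // Over-decompose: a 16x16 indexed collection of objects, independent of
    // how many processors the job actually runs on.
    CProxy_Block blocks = CProxy_Block::ckNew(16, 16);
    blocks.sayHi();                              // broadcast to every element, by index
    CkStartQD(CkCallback(CkCallback::ckExit));   // exit once all messages are processed
  }
};

class Block : public CBase_Block {
public:
  Block() {}
  Block(CkMigrateMessage *m) {}                  // lets the runtime migrate this object
  void sayHi() {
    // The element is addressed by its (x, y) index; the runtime decides which
    // processor it lives on and may move it during execution.
    CkPrintf("Block (%d,%d) on PE %d\n", thisIndex.x, thisIndex.y, CkMyPe());
  }
};

#include "hello.def.h"
```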

Object-based Over-decomposition: AMPI
- Each MPI process is implemented as a user-level thread
- Threads are light-weight and migratable: <1 microsecond context-switch time, potentially >100K threads per core
- Each thread is embedded in a Charm++ object (chare)
[Figure: MPI processes mapped to virtual processors (user-level migratable threads), which are mapped to real processors]
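
For illustration, an ordinary MPI program needs no source changes to run on AMPI. The sketch below is plain MPI; the build and run lines in the comments assume a typical AMPI installation (the ampicxx wrapper, charmrun, and the +vp flag) and may differ by version.

```cpp
// An ordinary MPI program; under AMPI each "rank" becomes a migratable
// user-level thread rather than an OS process.
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank = 0, size = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  // With virtualization, "size" can far exceed the number of physical cores.
  std::printf("virtual rank %d of %d\n", rank, size);
  MPI_Finalize();
  return 0;
}

// Build with the AMPI compiler wrapper and over-subscribe with virtual ranks:
//   ampicxx hello_ampi.cpp -o hello
//   ./charmrun +p4 ./hello +vp64    # 64 virtual ranks on 4 physical cores
```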

Some Properties of this Approach Relevant to Fault Tolerance
- Object-based virtualization leads to message-driven execution
- Dynamic load balancing by migrating objects
- No dependence on processor count: e.g., a 3D cube of objects can be mapped to a non-cube number of processors
[Figure: per-processor scheduler driven by a message queue]
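
As a rough illustration of message-driven execution, the hypothetical loop below (not the actual Charm++ scheduler) shows the essential idea: a per-processor scheduler pops whatever message is available next and invokes the addressed object's method, so progress is driven by data availability rather than a fixed SPMD order.

```cpp
#include <queue>
#include <functional>

struct Message {
  std::function<void()> deliver;   // bound to a destination object and entry method
};

void schedulerLoop(std::queue<Message> &msgQ) {
  while (!msgQ.empty()) {
    Message m = std::move(msgQ.front());
    msgQ.pop();
    m.deliver();                   // execute the entry method of the addressed object
  }
}
```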

Charm++/AMPI are Well-established Systems
- The Charm++ model has succeeded in CSE/HPC because of resource management, …
- 15% of cycles at NCSA and 20% at PSC were used by Charm++ apps over a one-year period
- So work on fault tolerance for Charm++ and AMPI is directly useful to real apps
- With AMPI, it also applies to MPI applications

Fault Tolerance in Charm++ & AMPI
Four approaches available:
- Disk-based checkpoint/restart
- In-memory double checkpoint/restart
- Proactive object migration
- Message-logging: scalable rollback, parallel restart
Common features:
- Based on dynamic runtime capabilities
- Use of object migration
- Can be used in concert with load-balancing schemes

Disk-Based Checkpoint/Restart
Basic idea:
- Similar to traditional checkpoint/restart; "migration" to disk
Implementation in Charm++/AMPI:
- Blocking coordinated checkpoint: MPI_Checkpoint(DIRNAME) (see the sketch below)
Pros:
- Simple scheme, effective for common cases
- Virtualization enables restart with any number of processors
Cons:
- Checkpointing and data-reload operations may be slow
- Work between the last checkpoint and the failure is lost
- The job needs to be resubmitted and restarted
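
A hedged sketch of how the MPI_Checkpoint(DIRNAME) call named above might appear in an AMPI time-stepping loop; the exact prototype can vary across AMPI versions, and the checkpoint interval and directory name here are illustrative.

```cpp
#include <mpi.h>

void timeStepLoop(int nSteps) {
  for (int step = 0; step < nSteps; ++step) {
    // ... compute and exchange data for this step ...

    if (step % 100 == 0) {
      // All ranks reach this call collectively; the runtime writes every
      // object's state to the directory. A later job can restart from it,
      // possibly on a different number of physical processors.
      MPI_Checkpoint("checkpoint_dir");
    }
  }
}
```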

SyncFT: In-Memory Double Checkpoint/Restart
Basic idea:
- Avoid the overhead of disk access for keeping saved data
- Allow the user to define what makes up the state data (see the sketch below)
Implementation in Charm++/AMPI:
- Coordinated checkpoint
- Each object maintains two checkpoints: in the local processor's memory and in a remote buddy processor's memory
- A dummy process is created to replace the crashed process
- The new process starts recovery on another processor: it uses the buddy's checkpoint to recreate the state of the failed processor, and load balancing is performed after restart
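
A sketch of the two pieces an application typically supplies for this scheme: a PUP routine that defines what makes up an object's state, and a call to CkStartMemCheckpoint to trigger the coordinated in-memory checkpoint. The class, field, and entry-method names are illustrative, not from the talk.

```cpp
// block.C -- the matching block.ci would declare something like:
//   module block { array [2D] Block { entry Block(); }; };
#include <vector>
#include "pup_stl.h"        // PUP support for std::vector
#include "block.decl.h"     // generated by charmc from block.ci

class Block : public CBase_Block {
  std::vector<double> temperature;   // the application state worth saving
  int iteration;
public:
  Block() : iteration(0) {}
  Block(CkMigrateMessage *m) {}      // lets the runtime restore/migrate the object

  // The PUP routine is where the user defines "what makes up the state data":
  // the same code packs the two in-memory checkpoint copies (local + buddy PE)
  // and unpacks them during recovery.
  void pup(PUP::er &p) {
    CBase_Block::pup(p);             // base-class bookkeeping first
    p | temperature;
    p | iteration;
  }
};

#include "block.def.h"

// The checkpoint itself is typically triggered from the main chare between
// iterations, e.g.:
//   CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
//   CkStartMemCheckpoint(cb);   // stores each object's state locally and on its buddy
```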

In-Memory Double Checkpoint/Restart (cont.)
Comparison to disk-based checkpointing: [figure]

In-Memory Double Checkpoint/Restart (cont.)
Recovery performance: molecular dynamics LeanMD code, 92K atoms, P=128
Load balancing (LB) effect after failure: [figure]

In-Memory Double Checkpoint/Restart (cont.)
Application performance: molecular dynamics LeanMD code, 92K atoms, P=128
Checkpointing every 10 timesteps; 10 crashes inserted: [figure]

In-Memory Double Checkpoint/Restart (cont.)
Pros:
- Faster checkpointing than disk-based
- Reading of saved data is also faster: only one processor fetches a checkpoint across the network
Cons:
- Memory overhead may be high
- All processors are rolled back, despite an individual failure
- All the work since the last checkpoint is redone by every processor
Publications:
- Zheng, Huang & Kale: ACM SIGOPS, April 2006
- Zheng, Shi & Kale: IEEE Cluster 2004, Sep. 2004

Proactive Object Migration
Basic idea:
- Use knowledge about impending faults
- Migrate objects away from processors that may fail soon
- Fall back to checkpoint/restart when faults are not predicted
Implementation in Charm++/AMPI:
- Each object has a unique index
- Each object is mapped to a home processor; objects need not reside on the home processor, but the home processor knows how to reach the object (see the sketch below)
- Upon getting a warning, evacuate the processor: reassign the mapping of objects to new home processors and send objects away to their home processors
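
The sketch below is a deliberately simplified, hypothetical illustration of the home-processor idea (not the Charm++ internals): because an object's home is a pure function of its index and the set of live processors, removing a warned processor from that set gives every affected object a new home that any processor can compute independently.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

struct ObjIndex { int x, y; };

// Deterministic home-PE mapping: any processor can compute it locally.
int homePE(const ObjIndex &idx, const std::vector<int> &livePEs) {
  std::size_t h = std::hash<int>{}(idx.x) ^ (std::hash<int>{}(idx.y) << 1);
  return livePEs[h % livePEs.size()];
}

// On a fault warning for PE w: erase w from livePEs, recompute homePE for the
// locally resident objects, and migrate each object whose home has changed.
```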

Proactive Object Migration (cont.)
Evacuation time as a function of the number of processors: 5-point stencil code in Charm++, IA-32 cluster [figure]

Proactive Object Migration (cont.)
Performance of an MPI application: Sweep3d code, 150x150x150 dataset, P=32, 1 warning [figure]

Proactive Object Migration (cont.)
Pros:
- No overhead in the fault-free scenario
- Evacuation time scales well; it depends only on the data and the network
- No need to roll back when a predicted fault happens
Cons:
- Effectiveness depends on the fault-prediction mechanism
- Some faults may happen without advance warning
Publications:
- Chakravorty, Mendes & Kale: HiPC, Dec. 2006
- Chakravorty, Mendes, Kale et al.: ACM SIGOPS, April 2006

Message-Logging
Basic idea:
- Messages are stored by the sender during execution
- Periodic checkpoints are still maintained
- After a crash, reprocess "recent" messages to regain state (see the sketch below)
Implementation in Charm++/AMPI:
- Since the state depends on the order of messages received, the protocol ensures that the new receptions occur in the same order
- Upon failure, rollback is "localized" around the failing point: no need to roll back all the processors!
- With virtualization, the work on one processor is divided across multiple virtual processors; thus, restart can be parallelized
- Virtualization helps the fault-free case as well
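
A conceptual sketch of sender-side message logging under the stated assumptions; it is not the Charm++ protocol itself, and all type and field names are illustrative.

```cpp
// The sender keeps each outgoing message in a log, tagged with the order in
// which the receiver processed it. After the receiver's processor fails, only
// that receiver is rolled back to its checkpoint; senders replay the logged
// messages, and re-delivering them in the recorded order reproduces the lost state.
#include <algorithm>
#include <string>
#include <vector>

struct LoggedMsg {
  int destObj;          // destination object id
  long recvSeq;         // sequence number the receiver assigned on first delivery
  std::string payload;
};

struct SenderLog {
  std::vector<LoggedMsg> log;        // kept in memory, trimmed at each checkpoint

  void record(const LoggedMsg &m) { log.push_back(m); }

  // Messages destObj received after its last checkpoint, in the original order.
  std::vector<LoggedMsg> replayFor(int destObj, long lastCheckpointSeq) const {
    std::vector<LoggedMsg> out;
    for (const LoggedMsg &m : log)
      if (m.destObj == destObj && m.recvSeq > lastCheckpointSeq)
        out.push_back(m);
    std::sort(out.begin(), out.end(),
              [](const LoggedMsg &a, const LoggedMsg &b) { return a.recvSeq < b.recvSeq; });
    return out;
  }
};
```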

Message-Logging (cont.)
Fast restart performance:
- Test: 7-point 3D stencil in MPI, P=32, 2 ≤ VP ≤ 16
- Checkpoint taken every 30 s, failure inserted at t=27 s [figure]

Normal Checkpoint-Restart Method
- Progress is slowed down by failures
- Power consumption is continuous
[Figure: progress and power vs. time under checkpoint/restart with failures]

Our Checkpoint-Restart Method (Message-Logging + Object-based Virtualization)
- Progress is faster in the presence of failures
- Power consumption is lower during recovery
[Figure: progress and power vs. time with message-logging and virtualization]

Message-Logging (cont.)
Fault-free performance:
- OK with large-grained applications, but significant loss with fine-grained ones
- Test: NAS benchmarks, MG/LU
- Versions: AMPI, AMPI+FT, AMPI+FT+multipleVPs [figure]

Message-Logging (cont.)
Protocol optimization:
- Combine protocol messages: reduces overhead and contention
- Test: synthetic compute/communicate benchmark [figure]

Message-Logging (cont.)
Pros:
- No need to roll back non-failing processors
- Restart can be accelerated by spreading the work to be redone
- No need for stable storage
Cons:
- Protocol overhead is present even in the fault-free scenario
- The increase in latency may be an issue for fine-grained applications
Publications:
- Chakravorty: UIUC PhD thesis, Dec. 2007
- Chakravorty & Kale: IPDPS, April 2007
- Chakravorty & Kale: FTPDS workshop at IPDPS, April 2004

Current PPL Research Directions: Message-Logging Scheme
- Decrease latency overhead in the protocol
- Decrease memory overhead for checkpoints
- Stronger coupling to load balancing
- Newer schemes to reduce message-logging overhead
  - Clustering: a set of cores is rolled back to its checkpoint (Greg Bronevetsky's suggestion)
- Other collaboration with Franck Cappello

Some External Gaps
- A scheduler that won't kill a job; broader need: a scheduler that allows flexible bi-directional communication between jobs and the scheduler
- Fault prediction: needed if proactive fault tolerance is to be of use
- Local disks!
- Need to better integrate knowledge from distributed systems: they have sophisticated techniques, but HPC metrics and context are substantially different

Messages
- We have interesting fault tolerance schemes: read about them
- We have an approach to parallel programming that:
  - has benefits in the era of complex machines and sophisticated applications
  - is used by real apps
  - provides beneficial features for FT schemes
  - is available via the web
- So: please think about developing new FT schemes of your own for this model
- More info, papers, software: http://charm.cs.uiuc.edu
- And please pass the word on: we are hiring