Fault Tolerance in Charm++
Gengbin Zheng
10/11/2005
Parallel Programming Lab, University of Illinois at Urbana-Champaign

Motivation
- As machines grow in size, MTBF decreases
- Applications have to tolerate faults
- Applications need fast, low-cost, and scalable fault tolerance support
- Fault-tolerant runtime for:
  - Charm++
  - Adaptive MPI

Outline
- Disk checkpoint/restart
- FTC-Charm++: in-memory checkpoint/restart
- Proactive fault tolerance
- FTL-Charm++: message logging

Disk Checkpoint/Restart

Checkpoint/Restart
- Simplest scheme for application fault tolerance
- A long-running application saves its state to disk periodically, at chosen points in the computation
- Coordinated checkpointing strategy (barrier)
- State information is saved in a directory of your choosing
- Application data is checkpointed by invoking the pup routine of every object
- Restore also uses pup, so no additional application code is needed (pup is all you need); see the sketch below
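For illustration, a minimal sketch of such a pup routine, assuming a hypothetical chare array element Particles with one dynamically allocated array (the class and member names are illustrative, not from the talk):

    // Hypothetical chare array element; names are illustrative.
    class Particles : public CBase_Particles {
      int n;          // number of local particles
      double *x;      // per-particle data

    public:
      Particles() : n(0), x(NULL) {}
      Particles(CkMigrateMessage *m) {}  // constructor used on restore

      // One routine serves checkpoint, restore, and migration; the
      // PUP::er knows whether it is sizing, packing, or unpacking.
      void pup(PUP::er &p) {
        p | n;                    // scalar member
        if (p.isUnpacking())      // allocate before unpacking the array
          x = new double[n];
        PUParray(p, x, n);        // raw array of n doubles
      }
    };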

Checkpointing a Job
- In Charm++, use:
  void CkStartCheckpoint(char* dirname, const CkCallback& cb)
  - Called on one processor; the callback (typically a resume method) is invoked when the checkpoint is complete
- In AMPI, use:
  MPI_Checkpoint(dirname);
  - Collective call; returns when the checkpoint is complete
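A hedged usage sketch (Main, mainProxy, and resumeFromCheckpoint are hypothetical names; the "log" directory matches the restart example on the next slide):

    // Hypothetical call site in the application driver.
    void Main::startCheckpoint() {
      // Resume in Main::resumeFromCheckpoint() once the state of all
      // objects has been written under the "log" directory.
      CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
      CkStartCheckpoint("log", cb);
    }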

Restarting a Job from a Checkpoint
- The charmrun option ++restart is used to restart:
  ./charmrun +p4 ./pgm ++restart log
- The number of processors need not be the same
- Parallel objects are redistributed when needed

FTC-Charm++: In-Memory Checkpoint/Restart

Disk vs. In-Memory Scheme
- Drawbacks of disk checkpointing:
  - Needs user intervention to restart a job
  - Assumes reliable storage (disk)
  - Disk I/O is slow
- In-memory checkpoint/restart scheme:
  - Online version of the previous scheme
  - Low impact on fault-free execution
  - Provides fast and automatic restart
  - Does not rely on extra processors
  - Maintains execution efficiency after restart
  - Does not rely on any fault-free component
  - Does not assume stable storage

Overview
- Coordinated checkpointing scheme
  - Simple, with low overhead on fault-free execution
  - Targets iterative scientific applications
- Double checkpointing
  - Tolerates one failure at a time
- In-memory (diskless) checkpointing
  - Efficient for applications with a small memory footprint
- When there are no extra processors, the program continues to run on the remaining ones
- Load balancing for restart

Checkpoint Protocol
- Similar to the previous scheme: a coordinated checkpointing strategy
- Programmers decide what to checkpoint:
  void CkStartMemCheckpoint(CkCallback &cb)
- Each object packs its data and sends it to two different (buddy) processors; a usage sketch follows
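The call site mirrors the disk-based API; a sketch using the same hypothetical names as before:

    // Hypothetical call site; only the entry point differs from the
    // disk-based version, since checkpoints go to two buddy processors'
    // memories rather than to a directory on disk.
    CkCallback cb(CkIndex_Main::resumeFromCheckpoint(), mainProxy);
    CkStartMemCheckpoint(cb);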

Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint
- Combined with the load balancer to sustain performance

Checkpoint/Restart Protocol
[Diagram: objects A-J are distributed across PE0-PE3; each object's checkpoint 1 and checkpoint 2 are stored on two different buddy processors. When PE1 crashes (one processor lost), the objects are restored from the surviving checkpoints onto PE0, PE2, and PE3.]

Local Disk-Based Protocol
- Double in-memory checkpointing raises a memory concern
  - Pick a checkpointing time at which the global state is small
- Double in-disk checkpointing
  - Makes use of the local disk
  - Also does not rely on any reliable storage
  - Useful for applications with a very large memory footprint

Compiling FTC-Charm++
- Build Charm++ with the "syncft" option:
  ./build charm++ net-linux syncft -O
- The command-line switch +ftc_disk selects disk rather than in-memory checkpointing:
  ./charmrun ./pgm +ftc_disk

Performance Evaluation
- IA-32 Linux cluster at NCSA
  - 512 dual 1 GHz Intel Pentium III processors
  - 1.5 GB RAM per processor
  - Connected by both Myrinet and 100 Mbit Ethernet

Performance Comparisons with Traditional Disk-Based Checkpointing

Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup processors
- With load balancing

Performance Improvement with Load Balancing
LeanMD, Apoa1, 128 processors

Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps

LeanMD with Apoa1 Benchmark
- 90K atoms
- 8498 objects

Proactive Fault Tolerance

Motivation
- Rather than having the runtime react to a failure, proactively migrate off a processor that is about to fail
- Modern hardware supports early fault indication: the SMART protocol, motherboard temperature sensors, Myrinet interface cards
- It is therefore possible to build a mechanism for fault prediction

Requirements
- Response time should be as low as possible
- No new processes should be required
- Collective operations should still work
- Efficiency loss should be proportional to the loss of computing power

System
- The application is warned of an impending fault via a signal
- The processor, memory, and interconnect should continue to work correctly for some time after the warning
- The runtime ensures that the application continues to run on the remaining processors even if one processor crashes

Solution Design
- Migrate Charm++ objects off the warned processor
- Point-to-point message delivery should continue to work
- Collective operations should cope with the possible loss of multiple processors
  - Modify the runtime system's reduction tree to remove the warned processor
- A minimal number of processors should be affected
- The runtime system should remain load balanced after a processor has been evacuated

Proactive FT: Current Status
- Support for multiple faults is ready; support for simultaneous faults is being tested
- Faults are simulated via a signal sent to the process
- The current version is fully integrated into Charm++ and AMPI
- Example: sweep3d (MPI code) on NCSA's Tungsten
  [Charts: original utilization, utilization after the fault, utilization after load balancing.]

How to Use
- Part of the default version of Charm++; no extra compiler flags required
- This code is not executed until a warning occurs
- Any detection system can be plugged in (see the sketch below):
  - Send a signal (USR1) to the process on the compute node, or
  - Call a method (CkDecideEvacPe) to evacuate a processor
- Works with any Charm++ or AMPI program
- For AMPI, must be used with -memory isomalloc
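For example, a detection component might trigger evacuation like this (a sketch under stated assumptions: faultPredicted() is a hypothetical hook for whatever detector is plugged in, and the handler runs on the processor that received the warning):

    // Hypothetical warning handler on the endangered processor.
    void onFaultWarning() {
      if (faultPredicted())   // hypothetical detector query
        CkDecideEvacPe();     // ask the runtime to evacuate this PE's
                              // objects to the remaining processors
    }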

FTL-Charm++: Message Logging

Motivation
- Checkpointing is not fully automatic
- Coordinated checkpointing is expensive
- Checkpoint/rollback doesn't scale:
  - All nodes are rolled back just because one crashed
  - Even nodes independent of the crashed node are restarted

Design
- Message logging
  - Sender-side message logging
- Asynchronous checkpoints
  - Each processor has a buddy processor
  - Stores its checkpoint in the buddy's memory
  - Checkpoints on its own schedule (no barrier)

Message to Remote Chares
- Chare P is the sender; chare Q is the receiver
- P sends Q a ticket request identified by its <sender, SN> (sequence number) pair
- If that <sender, SN> pair has been seen earlier, the corresponding ticket number (TN) is marked as received
- Otherwise Q creates a new TN and stores the <sender, SN, TN> entry; a conceptual sketch of this step follows
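A conceptual sketch of the receiver-side ticketing step, not the actual FTL-Charm++ code; the map-based bookkeeping and all names are illustrative assumptions:

    #include <map>
    #include <utility>

    // Receiver-side ticketing: map each <sender, SN> pair to the ticket
    // number (TN) that fixes the message's execution order on this chare.
    struct Ticketing {
      std::map<std::pair<int,int>, int> seen;  // <sender, SN> -> TN
      int nextTN;

      Ticketing() : nextTN(0) {}

      int requestTicket(int sender, int sn) {
        std::pair<int,int> key(sender, sn);
        std::map<std::pair<int,int>, int>::iterator it = seen.find(key);
        if (it != seen.end())
          return it->second;   // pair seen earlier: reuse the stored TN
        int tn = nextTN++;     // otherwise issue a new TN
        seen[key] = tn;        // and store the <sender, SN, TN> entry
        return tn;
      }
    };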

Status
- Most of Charm++ and AMPI has been ported
- Support for migration has not yet been implemented in the fault-tolerant protocol
- Parallel restart not yet implemented
- Not in the Charm++ main branch

Thank You!
Free source, binaries, manuals, and more information from the Parallel Programming Lab at the University of Illinois.