FTC-Charm++: An In-Memory Checkpoint-Based Fault Tolerant Runtime for Charm++ and MPI
Gengbin Zheng, Lixia Shi, Laxmikant V. Kale
Parallel Programming Lab, University of Illinois at Urbana-Champaign

Motivation
- As machines grow in size, the mean time between failures (MTBF) decreases.
- Applications have to tolerate faults.
- Applications need fast, low-cost, and scalable fault tolerance support.
- Fault tolerant runtime for:
  - Charm++ (parallel C++ language and runtime)
  - Adaptive MPI (AMPI)

Requirements
- Low impact on fault-free execution
- Fast and automatic restart capability
- No reliance on extra (spare) processors
- Maintained execution efficiency after restart
- No reliance on any fault-free component
- No assumption of stable storage

Background
- Checkpoint-based methods
  - Coordinated: blocking [Tamir84], non-blocking [Chandy85]
    - Co-Check, Starfish, Clip: fault tolerant MPI implementations
  - Uncoordinated: suffers from rollback propagation
  - Communication-induced: [Briatico84], does not scale well
- Log-based methods
  - Message logging

Design Overview
- Coordinated checkpointing scheme
  - Simple, with low overhead on fault-free execution
  - Well suited to iterative scientific applications
- Double checkpointing (illustrated in the sketch below)
  - Tolerates one failure at a time
- In-memory (diskless) checkpointing
  - Efficient for applications with a small memory footprint
- No extra processors required
  - The program continues to run on the remaining processors
  - Load balancing restores performance after restart
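The double-checkpointing idea can be illustrated with a buddy mapping: each object's checkpoint is kept on two different processors, so losing any single processor still leaves one intact copy. The mapping below (home PE plus its neighbor modulo the PE count) is only an assumed illustration, not necessarily the exact mapping FTC-Charm++ uses.

    // Illustrative buddy selection for double checkpointing (assumed mapping).
    // CkMyPe()/CkNumPes() are the standard Charm++ queries for the local PE
    // number and the total number of PEs.
    int homePE  = CkMyPe();
    int buddyPE = (CkMyPe() + 1) % CkNumPes();  // any PE other than homePE would do
    // One checkpoint copy stays on homePE, the other is sent to buddyPE, so a
    // single processor failure cannot destroy both copies of any object's state.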

Charm++: Processor Virtualization
(Figure: user view of many virtual objects vs. the system implementation mapping them onto physical processors.)
- Charm++
  - Parallel C++ with data-driven objects (chares)
  - Chares are migratable
  - Asynchronous method invocation
- Adaptive MPI (AMPI)
  - Implemented on Charm++ with migratable user-level threads
  - Multiple virtual processors on each physical processor
(A minimal chare sketch follows.)
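A minimal, self-contained Charm++ sketch of the model described above; the module, chare, and entry-method names are hypothetical, not taken from the slides. Chares are declared in a .ci interface file, and work is driven by asynchronous entry-method invocations:

    // hello.ci (hypothetical interface file)
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      array [1D] Worker {
        entry Worker();
        entry void doStep(int iter);   // asynchronous entry method
      };
    };

    // hello.C
    #include <vector>
    #include "hello.decl.h"

    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        delete m;
        CProxy_Worker workers = CProxy_Worker::ckNew(8);  // create 8 migratable chares
        workers.doStep(0);   // broadcast: asynchronously invoke doStep on every element
      }
    };

    class Worker : public CBase_Worker {
      std::vector<double> data;        // per-object state
    public:
      Worker() : data(1000, 0.0) {}
      Worker(CkMigrateMessage *m) {}   // migration constructor: chares are migratable
      void doStep(int iter) {
        // ... local computation on data ...
        contribute(CkCallback(CkCallback::ckExit));  // exit once every element is done
      }
    };

    #include "hello.def.h"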

Benefits of Virtualization
- Latency tolerance
  - Adaptive overlap of communication and computation
- Supports migration of virtual processors for load balancing
- Checkpoint data = objects (instead of a process image)
  - Checkpointing an object is essentially migrating it to another processor (see the sketch below)
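Since the checkpoint unit is the object rather than the process image, the same serialization hook that supports object migration also produces the checkpoint. A minimal sketch using Charm++'s PUP framework, added to the hypothetical Worker chare from the previous sketch:

    #include <pup_stl.h>   // PUP support for STL containers

    // In class Worker:
    void pup(PUP::er &p) {
      // One routine serves sizing, packing, and unpacking. It is invoked both
      // when the object migrates for load balancing and when its checkpoint is
      // packed and shipped to a buddy processor (or restored from one).
      CBase_Worker::pup(p);   // pup the superclass (array element) state
      p | iter;               // a hypothetical int member
      p | data;               // the std::vector<double> member
    }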

Checkpoint Protocol
- Adopts a coordinated checkpointing strategy
- The Charm++ runtime provides the checkpointing functionality
  - Programmers can decide what to checkpoint
- Each object packs its data and sends it to two different (buddy) processors
- Charm++ runtime data is checkpointed as well
(A usage sketch follows.)
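A minimal sketch of how an iterative application might trigger the protocol, assuming the CkStartMemCheckpoint entry point that Charm++ exposes for double in-memory checkpointing; the surrounding chare and entry names are hypothetical.

    // In a hypothetical Main chare, reached once per iteration after a reduction
    // signals that all Worker elements have finished the current step:
    void Main::stepDone() {
      ++iter;
      if (iter % CHECKPOINT_PERIOD == 0) {
        // Coordinated checkpoint: every object (plus the needed runtime data) is
        // packed via pup and two copies are stored on buddy processors. The
        // callback fires when the checkpoint completes -- and again after a
        // restart rolls the computation back to this checkpoint.
        CkCallback cb(CkIndex_Worker::resumeStep(), workers);  // workers: CProxy_Worker
        CkStartMemCheckpoint(cb);
      } else {
        workers.resumeStep();   // broadcast the next step without checkpointing
      }
    }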

Restart Protocol
- Initiated by the failure of a physical processor
- Every object rolls back to the state preserved in the most recent checkpoint
- Combined with the load balancer to sustain performance on the remaining processors (see the sketch below)
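Because the restored objects of a crashed processor initially land on its buddy, that processor becomes overloaded; migratable objects let the regular Charm++ measurement-based load balancer spread them out again. A minimal sketch of the usual AtSync idiom (class and entry names hypothetical):

    // In the hypothetical Worker chare: enable AtSync load balancing in the
    // constructor, call AtSync() at iteration boundaries, and continue from
    // ResumeFromSync() once the balancer has (possibly) migrated the object.
    Worker::Worker() { usesAtSync = true; }

    void Worker::resumeStep() {
      // ... compute one iteration ...
      AtSync();            // yield to the measurement-based load balancer
    }

    void Worker::ResumeFromSync() {
      // After a restart, this is where objects restored onto the buddy PE
      // get redistributed across the surviving processors.
      thisProxy[thisIndex].resumeStep();
    }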

Checkpoint/Restart Protocol
(Figure: objects A-J distributed across PE0-PE3; each object keeps checkpoint 1 and checkpoint 2 on two buddy processors. When PE1 crashes and one processor is lost, its objects are restored from the surviving checkpoints onto the remaining PE0, PE2, and PE3.)

Local Disk-Based Protocol
- Double in-memory checkpointing raises a memory concern
  - Pick a checkpointing time at which the global state is small
- Double in-disk checkpointing
  - Makes use of each node's local disk
  - Still does not rely on any reliable (central) storage
  - Useful for applications with a very large memory footprint
(A sketch follows.)
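For contrast with the traditional approach compared against later in the talk, Charm++ also offers a directory-based disk checkpoint via CkStartCheckpoint. The double local-disk protocol on this slide instead keeps the two buddy copies on each node's local disk; in current Charm++ this variant is typically selected by a runtime option rather than a different API call (treat that as an assumption). A minimal sketch of the directory-based call (names hypothetical):

    // Traditional disk-based checkpoint, shown for comparison: all object state
    // is pup'ed into files under the given directory (typically on a shared or
    // reliable filesystem), and the callback runs when the dump is complete.
    CkCallback cb(CkIndex_Worker::resumeStep(), workers);
    CkStartCheckpoint("ckpt", cb);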

Performance Evaluation
- IA-32 Linux cluster at NCSA
  - 512 dual 1 GHz Intel Pentium III processors
  - 1.5 GB RAM each
  - Connected by both Myrinet and 100 Mbit Ethernet

Checkpoint Overhead Evaluation
- Jacobi3D (MPI version)
- Up to 128 processors
- Myrinet vs. 100 Mbit Ethernet

Single Checkpoint Overhead
- AMPI Jacobi3D
- Problem size: 200 MB
- 128 processors

Comparison of Program Execution Time
- Jacobi (200 MB data size) on up to 128 processors
- 8 checkpoints in 100 steps

Performance Comparison with Traditional Disk-Based Checkpointing

Recovery Performance
- Molecular dynamics simulation application: LeanMD
- Apoa1 benchmark (92K atoms)
- 128 processors
- Crash simulated by killing processes
- No backup (spare) processors
- With load balancing

Performance Improvement with Load Balancing
- LeanMD, Apoa1, 128 processors

Recovery Performance
- 10 crashes
- 128 processors
- Checkpoint every 10 time steps

LeanMD with Apoa1 Benchmark
- 90K atoms
- 8498 objects

Future Work
- Apply our scheme on extremely large parallel machines
- Reduce the memory usage of the protocol
- Message logging
Paper appeared in IPDPS'04.