C³: A System for Automating Application-level Checkpointing of MPI Programs
Greg Bronevetsky, Daniel Marques, Keshav Pingali, Paul Stodghill
Department of Computer Science, Cornell University

The Problem
Decreasing hardware reliability
– Extremely large systems (millions of parts)
– Large clusters of commodity parts
– Grid computing
Program runtimes greatly exceed the mean time to failure
– ASCI, Blue Gene, PSC, Illinois Rocket Center
⇒ Fault tolerance is necessary for high-performance computing

What kinds of failures?
Fault models:
– Byzantine: a processor can be arbitrarily malicious (e.g., incorrect data, a hacker, etc.)
– Fail-silent: a processor stops sending and responding to messages
– Fail-stop: fail-silent, plus the surviving processors can tell that it failed
Number of component failures: 1, k, or n
In this work: fail-stop faults, n processors
– A necessary first step
– Usually sufficient in practice

What is done by hand?
Application-level (i.e., source-code) checkpointing
– Save key problem state rather than system (core) state
– Used at Sandia, BlueGene, PSC, …
Advantages:
– Minimizes the amount of state saved
  (e.g., in Alegra, application state = 5% of core size)
  Crucial for future large systems (BlueGene: MBs vs. TBs)
– Can be portable across platforms and MPI implementations
Disadvantages:
– Lots of manual work
– Correctly checkpointing programs without barriers requires a coordination protocol
We want to automate this process.

Goals
A tool to convert existing MPI applications into fault-tolerant MPI applications
Requirements:
– Use application-level checkpointing
– Use the native MPI implementation
– Handle the full range of MPI semantics
Desirable features:
– Minimize programmer annotations
– Automatically optimize checkpoint sizes
Necessary technologies:
– Program transformations for application-level checkpointing
– A novel algorithm for distributed application-level checkpointing

Programmer's Perspective
The programmer places calls to potentialCheckpoint()
The precompiler transforms the program to save application state at potentialCheckpoint() calls
The runtime system decides at each potentialCheckpoint() whether or not a checkpoint is taken

    int main() {
        MPI_Init();
        initialization();
    →   potentialCheckpoint();
        while (t < t_max) {
            big_computation();
            ...
    →       potentialCheckpoint();
        }
        MPI_Finalize();
    }
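
The talk does not show how the runtime makes this decision. Below is a minimal sketch of one plausible policy, a fixed wall-clock interval (matching the 30-second interval used in the experiments later in the talk). The names cc_take_checkpoint() and CHECKPOINT_INTERVAL are illustrative assumptions, not part of the C³ API.

    #include <time.h>

    /* Hypothetical runtime hook; the real C3 API is not shown in the talk. */
    #define CHECKPOINT_INTERVAL 30.0   /* seconds */
    extern void cc_take_checkpoint(void);

    void potentialCheckpoint(void) {
        static time_t last = 0;
        time_t now = time(NULL);
        if (last == 0)
            last = now;                 /* first call: start the clock */
        if (difftime(now, last) >= CHECKPOINT_INTERVAL) {
            cc_take_checkpoint();       /* run the protocol and save state */
            last = now;
        }
    }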

C³ Architecture
The precompiler transforms the Original Application into the Application + State-saving code. At run time, the stack is:
– Application + State-saving
– Thin Coordination Layer
– MPI Implementation
– Reliable communication layer, Failure detector

Outline
Introduction
► The paper
  – Precompiler
  – Coordination Layer
  – Performance
Current work
  – Optimizing Checkpoint Size
  – Other Current Work
Related Work
Conclusions

Precompiler
Transformation to save application state: parameters, local variables, program location
– Record local variables and function calls
– Checkpoint: save the record
– Recovery: reconstruct the application state from the record
– Only functions on a path to a potentialCheckpoint() call must be instrumented
Globals
– main() is instrumented to record global locations
Heap
– Custom, checkpointable heap allocator
What about the network "state"?

    int main() {
        if (recovery) { ... goto Lx; ... }
        add_globals(...);
        push_locals(&t, &t_max);
        MPI_Init();
        initialization();
    L1: potentialCheckpoint();
        while (t < t_max) {
            big_computation();
            ...
    L2: potentialCheckpoint();
        }
        MPI_Finalize();
        pop_locals();
    }
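
The transformed code leans on a small state-saving runtime. A minimal sketch of what such a runtime could record is shown below; the single-variable cc_push_local()/cc_pop_locals() interface and the fixed-size table are assumptions for illustration (the real push_locals() registers several variables at once, and the runtime also records function-call labels so that recovery can re-walk the call chain).

    #include <stddef.h>

    /* Illustrative only: a stack of (address, size) pairs for live locals. */
    typedef struct { void *addr; size_t size; } SavedVar;

    static SavedVar var_stack[4096];
    static int      var_top = 0;

    /* Register one local of the current frame so a checkpoint can copy it. */
    void cc_push_local(void *addr, size_t size) {
        var_stack[var_top].addr = addr;
        var_stack[var_top].size = size;
        var_top++;
    }

    /* The frame is returning; its locals leave scope. */
    void cc_pop_locals(int count) {
        var_top -= count;
    }

    /* At a checkpoint, copy every registered variable to stable storage. */
    void cc_save_locals(void (*write)(const void *, size_t)) {
        for (int i = 0; i < var_top; i++)
            write(var_stack[i].addr, var_stack[i].size);
    }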

Can existing distributed-systems solutions be used?
Why not checkpoint at barriers?
– What barriers? MPI is non-FIFO, and messages cross barriers!
Why not use message logging?
– Does not handle n failures
– Constant overhead, even when there are no failures
– Message logs fill memory in minutes (or even seconds)
– Checkpointing is still needed to clear the logs
Why not use Chandy-Lamport (or your other favorite distributed snapshot algorithm)?
– Requires system-level checkpointing for correctness or progress
– Can TBs of data be saved before a component fails?

Distributed Application-level Checkpointing
Potential checkpoint locations are fixed in the program source code
The system may not force checkpoints to happen at any other time

[Figure: timelines for processes P and Q, with the fixed potential checkpoint locations marked on each.]

Distributed Application-level Checkpointing (cont.)
Recovery line
– A set of checkpoints, one per processor
– Represents the global system state on recovery
– When one node fails, everybody rolls back to a recent recovery line
Problems to solve:
– How to select the potentialCheckpoint()'s that form the recovery line?
– What about MPI messages that cross the recovery line?

[Figure: timelines for processes P and Q; the recovery line connects the local checkpoint taken by P with the one taken by Q.]

Distributed Application-level Checkpointing (cont.)
Past and future messages
– Do not require coordination
Late messages (sent before the sender's checkpoint, received after the receiver's)
– Require recording and replaying
Early messages (received before the receiver's checkpoint, sent after the sender's)
– a.k.a. inconsistent messages
– Require suppression
– Require recording non-determinism
Collective communication
– Combinations of message types
Hidden state
– MPI_Request, MPI_Comm
Synchronization semantics
– MPI_Barrier, MPI_Ssend, …
Protocol details are in the paper.

[Figure: timelines for P and Q with their checkpoints, showing one past, one late, one early, and one future message relative to the recovery line.]
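
As a rough illustration of how a coordination layer can classify messages, the sketch below piggybacks the sender's checkpoint epoch on every message: a received epoch smaller than the local one marks a late message, a larger one marks an early message. Epoch piggybacking is a common way to implement such protocols, but whether C³ does exactly this is described only in the paper, and the cc_* names and helpers are invented.

    #include <mpi.h>
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    /* Illustrative sketch: tag each message with the sender's epoch, where
       the epoch is incremented every time this process checkpoints. */
    static int my_epoch = 0;

    typedef struct { int epoch; char payload[1024]; } TaggedMsg;

    void cc_send(const char *buf, int len, int dest, MPI_Comm comm) {
        TaggedMsg m;
        assert(len <= (int)sizeof(m.payload));
        m.epoch = my_epoch;
        memcpy(m.payload, buf, len);
        MPI_Send(&m, (int)(offsetof(TaggedMsg, payload) + len),
                 MPI_BYTE, dest, 0, comm);
    }

    void cc_recv(char *buf, int len, int src, MPI_Comm comm) {
        TaggedMsg m;
        MPI_Status st;
        MPI_Recv(&m, sizeof(m), MPI_BYTE, src, 0, comm, &st);
        if (m.epoch < my_epoch) {
            /* Late message: it crosses the recovery line, so it must be
               logged and replayed from the log after a rollback. */
            /* log_late_message(src, m.payload, len); */
        } else if (m.epoch > my_epoch) {
            /* Early (inconsistent) message: the sender has already
               checkpointed; record it so its effects can be handled
               consistently on recovery. */
            /* record_early_message(src, m.payload, len); */
        }
        memcpy(buf, m.payload, len);
    }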

Performance
Prototype implementation
– Precompiler without optimizations
– Point-to-point protocol; no collectives, no synchronization
Three scientific benchmark codes:
– Dense Conjugate Gradient
– Laplace Solver
– Neuron Simulator
16 processors of the Velocity cluster at CTC
30-second checkpoint interval
– Overheads amplified for better resolution

Overheads
Most tests show overheads of 0.4%–12.1%, despite the 30-second checkpoint interval!

[Figure: bar chart of overheads per benchmark; the numbers above each bar are the total overhead and the size of the application state, respectively.]

Outline
Introduction
The paper
  – Precompiler
  – Coordination Layer
  – Performance
► Current work
  – Optimizing Checkpoint Size
  – Other Current Work
Related Work
Conclusions

Live Variables
Recent work by Jim Ezick
Context-sensitive gen/kill analysis
– Utilizes a new technique for encoding functions
– Works with the full C language
The analysis generates three levels of output, each admitting a different C³ optimization:
– Flow-insensitive / context-insensitive: eliminate push/pop instructions
– Flow-sensitive / context-insensitive: generate an exclusion list
– Flow-sensitive / context-sensitive: generate a DFA to determine liveness
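
A hand-made example (not from the talk) of what liveness buys at a checkpoint: in the fragment below, tmp is dead at the potentialCheckpoint() call because it is never read again, so a liveness-aware precompiler can omit it from the saved state, while a, n, i, and sum must be saved.

    extern void potentialCheckpoint(void);

    double sum_of_squares(const double *a, int n) {
        double sum = 0.0;
        double tmp = a[0] * a[0];     /* used only before the loop */
        sum += tmp;
        for (int i = 1; i < n; i++) {
            potentialCheckpoint();    /* live here: a, n, i, sum; tmp is dead */
            sum += a[i] * a[i];
        }
        return sum;
    }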

Live Variables (cont.)
Effectiveness
– Run on "treecode", a popular Barnes-Hut n-body simulation code written in C
– Given a checkpoint location:
  it finds a live-variable set competitive with the programmer-provided state-saving routine;
  the live-variable set is <50% of the total "in-scope" variables;
  only two of the 27 elements of the live-variable set require a DFA
– It remains to reduce the amount of heap saved

Optimizing Heap Checkpointing
Saving the whole heap
– Saves "dead" values
– Saves data unchanged since the previous checkpoint
Incremental checkpointing: save only changed pages
– Changed and unchanged data on the same page: false sharing
– Still saves "dead" values
– Saving a badly fragmented page set can be slower than saving the whole heap

[Figure: a heap whose pages alternate between changed and unchanged regions; the checkpoint must include every page that contains a change.]
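
Page-granularity incremental checkpointing is typically implemented with virtual-memory protection. The POSIX sketch below is an assumption about the mechanism, not a description of the C³ allocator: after each checkpoint the heap is write-protected, and the fault handler records which pages get dirtied.

    #include <signal.h>
    #include <stdint.h>
    #include <stdlib.h>
    #include <sys/mman.h>

    #define PAGE 4096
    static uint8_t *heap;            /* assumed page-aligned heap region */
    static size_t   heap_pages;
    static uint8_t  dirty[1 << 20];  /* one flag per page */

    static void on_fault(int sig, siginfo_t *si, void *ctx) {
        (void)sig; (void)ctx;
        uint8_t *addr = (uint8_t *)si->si_addr;
        if (addr < heap || addr >= heap + heap_pages * PAGE)
            abort();                 /* a real segfault, not our tracking */
        size_t page = (size_t)(addr - heap) / PAGE;
        dirty[page] = 1;
        /* Unprotect just this page so the faulting write can proceed. */
        mprotect(heap + page * PAGE, PAGE, PROT_READ | PROT_WRITE);
    }

    /* Call after each checkpoint: clear the dirty bits and re-protect the
       heap, so the next checkpoint saves only pages written in between. */
    void start_tracking(void) {
        struct sigaction sa = {0};
        sa.sa_sigaction = on_fault;
        sa.sa_flags = SA_SIGINFO;
        sigaction(SIGSEGV, &sa, NULL);
        for (size_t i = 0; i < heap_pages; i++) dirty[i] = 0;
        mprotect(heap, heap_pages * PAGE, PROT_READ);
    }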

Optimizing Heap Checkpointing (cont.)
Allocation coloring
– Assign each allocated object a color
– No two colors are assigned to the same page
– Checkpoint: save only a subset of the colors
– Similar to (but different from) region analysis
Automatic allocation coloring:
– assign colors to allocation sites
– assign colors to potentialCheckpoint() calls
such that
– the number of colors saved at checkpoints is minimized, and
– the number of pages saved at checkpoints is minimized
Live-variable analysis is necessary for automatic allocation coloring.

[Figure: a colored heap; checkpoints #1 and #2 each save only the pages holding their own colors.]
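
One way to realize the "no two colors on the same page" rule is a separate pool of pages per color. The sketch below (invented names, not the C³ allocator) bump-allocates within per-color pages, so a checkpoint can save exactly the pages of the colors it needs; a real allocator would also keep the full list of pages per color and support freeing.

    #include <stddef.h>
    #include <stdlib.h>

    #define PAGE 4096
    #define MAX_COLORS 16

    /* One bump-allocation pool per color, so a page never mixes colors. */
    static struct { char *page; size_t used; } pool[MAX_COLORS];

    void *cc_malloc(size_t size, int color) {
        if (size > PAGE || color < 0 || color >= MAX_COLORS)
            return NULL;             /* large objects omitted for brevity */
        if (pool[color].page == NULL || pool[color].used + size > PAGE) {
            /* Grab a fresh page dedicated to this color. */
            void *p;
            if (posix_memalign(&p, PAGE, PAGE) != 0)
                return NULL;
            pool[color].page = p;
            pool[color].used = 0;
        }
        void *obj = pool[color].page + pool[color].used;
        pool[color].used += size;
        return obj;
    }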

Other Current Work
Precompiler
– Multiple source files
– Colored heap allocation
– Release by 4Q03
Coordination layer
– Complete reimplementation
– All point-to-point and collective calls, communicators, datatypes, etc.
– Correctness, performance, robustness
– Release by 4Q03
Shared memory
– Model shared-memory objects as "processors" g_i
– Shared-memory reads are messages g_i → p_j; writes are messages p_j → g_i
– How do we obtain a consistent value of g_i?
Grid computing
– Goal: migrate a running application between clusters
– Different numbers of processors: over-decomposition, threaded execution
– Heterogeneity: type-safe languages – Cyclone

Related Work
Fault-tolerant MPI
– FT-MPI, LA-MPI, CoCheck, …
– None allow application-level checkpointing
Precompiler
– Similar to work done with PORCH (MIT)
– PORCH is portable, but not transparent to the programmer
Checkpoint optimization
– CATCH (Illinois): uses runtime learning rather than static analysis
– Beck, Plank, and Kingsley (UTK): memory-exclusion analysis of static data

Conclusions
C³: automatic fault tolerance for MPI codes
– Precompiler
– Communication coordination layer
– Performance results are encouraging
Ties together many areas of compilers and systems:
– Language design
– Interprocedural data-flow analysis
– Region analysis
– Memory allocation
– Message passing, shared memory
– Grid computing