Gengbin Zheng, Xiang Ni, Laxmikant V. Kale, Parallel Programming Lab, University of Illinois at Urbana-Champaign

Motivation
As machines grow in size, the MTBF decreases: Jaguar averaged 2.33 failures per day from 2008 to 2010, so applications have to tolerate faults.
Challenges for exascale:
- Disk-based checkpointing (to reliable NFS storage) is slow
- System-level checkpointing can be expensive
- Scalable checkpoint/restart is a communication-intensive process
- Job schedulers prevent fault tolerance support in the runtime

Motivation (cont.)
Applications on future exascale machines need fast, low-cost, and scalable fault tolerance support.
Previous work: a double in-memory checkpoint/restart scheme, in the production version of Charm++ since 2004.

Double In-Memory Checkpoint/Restart Protocol
[Figure: objects A through J distributed over PE0 to PE3; each object keeps checkpoint 1 on its own processor and checkpoint 2 on a buddy processor. When PE1 crashes (one processor lost), its objects are restored on the surviving processors from the remaining checkpoint copies.]
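From the application's point of view, the protocol is driven from the main iteration loop. Below is a minimal sketch of how a Charm++ program might request a double in-memory checkpoint every so many steps, assuming the CkStartMemCheckpoint API documented in the Charm++ manual; the chare name, entry methods, and checkpoint period are illustrative, and the matching .ci interface file is omitted.

```cpp
#include "charm++.h"   // assumes the headers generated from the (omitted) .ci file

// Hypothetical main chare: runs timesteps and periodically asks the runtime
// to take a double in-memory checkpoint of all chares.
class Main : public CBase_Main {
  int step;
public:
  Main(CkArgMsg *m) : step(0) { delete m; thisProxy.doStep(); }

  void doStep() {
    // ... one application timestep ...
    ++step;
    if (step % 100 == 0) {
      // Resume from checkpointDone() once both in-memory copies are stored.
      CkCallback cb(CkIndex_Main::checkpointDone(), thisProxy);
      CkStartMemCheckpoint(cb);
    } else {
      thisProxy.doStep();
    }
  }

  void checkpointDone() { thisProxy.doStep(); }
};
```

On recovery, the runtime restores the chares from the surviving checkpoint copies rather than re-running their constructors; the sketch only shows the checkpoint-request side.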

Runtime Support for FT
- Automatic checkpointing of threads, including stack and heap (isomalloc)
- User helper functions to pack and unpack data, checkpointing only the live variables (see the sketch below)
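In Charm++ the user helper functions are PUP (pack/unpack) routines. Below is a minimal sketch of a pup() method that checkpoints only the live variables of a hypothetical object; the class and field names are made up.

```cpp
#include "pup.h"

// Hypothetical object whose state is checkpointed via the PUP framework.
class Block {
  int n;            // number of elements in the heap array
  double *data;     // heap data owned by this object
  double tolerance; // scalar live variable
public:
  // Called by the runtime for sizing, packing (checkpoint), and
  // unpacking (restore) alike.
  void pup(PUP::er &p) {
    p | n;
    p | tolerance;
    if (p.isUnpacking()) data = new double[n];  // re-allocate on restore
    p(data, n);                                 // pack/unpack the heap contents
  }
};
```

Threads checkpointed with isomalloc need no such routine, since their stack and heap are captured automatically.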

Local Disk-Based Protocol
Double in-memory checkpointing raises a memory concern; one mitigation is to pick a checkpointing time when the global state is small (MD, N-body, quantum chemistry).
Double in-disk checkpointing:
- Makes use of local disk (or SSD)
- Also does not rely on any reliable central storage
- Useful for applications with a very large memory footprint
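As an illustration only (not the Charm++ implementation), the in-disk variant amounts to writing each serialized checkpoint to node-local storage instead of holding it in memory; the path and naming scheme below are hypothetical.

```cpp
#include <cstdio>
#include <vector>

// Illustrative helper: persist one processor's serialized checkpoint blob to
// a node-local disk or SSD (no shared/reliable file system involved).
bool writeLocalCheckpoint(int pe, int phase, const std::vector<char> &blob) {
  char path[256];
  std::snprintf(path, sizeof(path), "/tmp/ckpt_pe%d_phase%d.dat", pe, phase);
  std::FILE *f = std::fopen(path, "wb");
  if (!f) return false;
  const size_t written = std::fwrite(blob.data(), 1, blob.size(), f);
  std::fclose(f);
  return written == blob.size();
}
```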

Previous Results: Performance Comparison with Traditional Disk-Based Checkpointing

Previous Results: Restart with Load Balancing (LeanMD, Apoa1, 128 processors)

Previous Results: Recovery Performance (128 processors, checkpoint every 10 time steps, with injected crashes)

LeanMD with the Apoa1 benchmark: 90K atoms, 8498 objects

FT on MPI-Based Charm++
- Practical challenge: the job scheduler kills the entire job when a process fails
- MPI-based Charm++ is portable across the major supercomputers
- A fault injection scheme in the MPI machine layer: DieNow() makes an MPI process stop responding
- Fault detection via keep-alive messages (illustrated below)
- Spare processors replace failed ones
- Demonstrated on 64K cores of a BG/P machine
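The keep-alive detection can be pictured with a standalone MPI sketch (this is not the Charm++ machine-layer code): each rank periodically pings the rank that watches it and declares the rank it watches dead if no ping arrives within a timeout. The period, timeout, tag, and ring-shaped watch relation are arbitrary choices.

```cpp
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  const int watched = (rank + 1) % size;         // rank whose health we monitor
  const int watcher = (rank - 1 + size) % size;  // rank that monitors us
  const double PERIOD = 1.0, TIMEOUT = 5.0;      // seconds (arbitrary)
  const int TAG = 99;

  int pingOut = 0, pingIn = 0;
  MPI_Request req;
  MPI_Irecv(&pingIn, 1, MPI_INT, watched, TAG, MPI_COMM_WORLD, &req);

  double lastSend = MPI_Wtime(), lastHeard = MPI_Wtime();
  while (MPI_Wtime() - lastHeard < TIMEOUT) {
    if (MPI_Wtime() - lastSend >= PERIOD) {      // tell our watcher we are alive
      MPI_Send(&pingOut, 1, MPI_INT, watcher, TAG, MPI_COMM_WORLD);
      lastSend = MPI_Wtime();
    }
    int done = 0;
    MPI_Test(&req, &done, MPI_STATUS_IGNORE);    // did the watched rank check in?
    if (done) {
      lastHeard = MPI_Wtime();
      MPI_Irecv(&pingIn, 1, MPI_INT, watched, TAG, MPI_COMM_WORLD, &req);
    }
    // ... normal runtime/application progress would happen here ...
  }
  std::printf("rank %d: no keep-alive from rank %d, declaring it failed\n",
              rank, watched);
  // A real runtime would now activate a spare processor; the sketch just aborts.
  MPI_Abort(MPI_COMM_WORLD, 1);
}
```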

Performance at Large Scale

Optimizations for Scalability
Communication bottlenecks: checkpoint/restart takes O(P) time.
Optimizations:
- Collectives (barriers): replace the O(P) barrier with a tree-based barrier
- Stale message handling: epoch numbers, plus a phase that discards stale messages as quickly as possible (see the sketch below)
- Small messages: streaming optimization
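The epoch-number idea can be pictured as a small filter on the receive path (a sketch under assumed names, not the Charm++ implementation): every message carries the epoch in which it was sent, and after a restart the local epoch is bumped so pre-crash messages are dropped on arrival.

```cpp
#include <cstdint>

// Hypothetical message header carrying the sender's epoch (restart generation).
struct Message {
  uint32_t epoch;
  // ... payload ...
};

// Illustrative filter applied on the receive path.
struct EpochFilter {
  uint32_t currentEpoch = 0;

  void onRestart() { ++currentEpoch; }  // bumped once per recovery

  // Deliver only messages sent in the current epoch; anything older is stale.
  bool accept(const Message &m) const { return m.epoch == currentEpoch; }
};
```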

LeanMD Checkpoint Time Before/After Optimization

Checkpoint Time for Jacobi/AMPI (on Kraken)

LeanMD Restart Time

Conclusions and Future Work
- After the optimizations, in-memory checkpointing is scalable toward exascale
- A short paper has been accepted at the 2nd Workshop on Fault-Tolerance for HPC at Extreme Scale (FTXS 2012)
- Future work: non-blocking checkpointing